Big improvement to Multidict for some languages

March 3, 2014

 by Caoimhín (Skye)


I hit on the “hunspell” improvement to Multidict almost by accident. I

had felt, ever since the days of the POOLS-T project which first developed

Wordlink and Multidict, that the main thing which the facility lacked was

some ability to do “lemmatization” – to change a wordform which you click

on into a dictionary headword for looking up in an online dictionary.

The Greek partners in the POOLS-T project in particular complained that

Wordlink only ever succeeded in finding the occasional Greek word in the

dictionary, and the Swiss-Italian partners had the same complaint to a

lesser extent. The only reason that Wordlink works so successfully for

English texts is that English has very few inflected forms (wordforms such

as “running”, “distancing”, “distanced”) compared to most languages. And

also that English is a big enough, rich enough language that many of the

online dictionaries (with some notable exceptions such as Etymonline) have

inbuilt lemmatization. But although I hoped sometime to try and add this

capability to Multidict for many languages, I thought that it would be a

huge amount of work.

I happened to be talking to Mìcheal Bauer, a linguist who has done so much

for Scottish Gaelic on the Internet. He also speaks Basque and a good

Basque dictionary had “stopped working with Wordlink”. In fact, it had

merely changed its search parameters (for the better) and I soon put

things right, but I mentioned to Mìcheal in passing that none of the

Basque dictionaries worked very well anyway with Wordlink because Basque

is a highly inflected language and the dictionaries do not do

lemmatization. Mìcheal pointed me to the excellent Basque implementation

of hunspell which contains lots of Basque inflexion rules in its “eu.aff”

file. Hunspell was first developed for Hungarian, another highly

inflected language, and instead of just relying on a huge wordlist in a

.dic file like old-fashioned spell-checkers, it can make clever use of

lots of complicated inflexion rules for the language in a .aff file.

(The “.aff” stands for “affix”.)

I started trying to decipher and understand the mathematical-looking

inflexion rules in the eu.aff file, but when I read up more about hunspell

I found I didn’t have to bother! Hunspell does more than just

spellchecking. It does lemmatization too: if you give it a wordform it

will give you back a dictionary headword (or several headwords if there

are several possibilities). I soon put it into service for Basque in

Multidict, and Mìcheal Bauer declared it to be a big improvement on the

whole. I had visions of hunspell giving us lemmatization “for free” for

lots of languages.
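
As a rough illustration of the idea (not hunspell's actual .aff syntax or API), lemmatization from affix rules can be sketched as: strip a known ending, add back any replacement text, and keep the result only if it is a known stem. The rule list, the stem set and the function name `lemmas` are all made up for this example, and the example words are the English wordforms mentioned earlier.

```python
# Toy sketch only: real .aff files use a much richer rule syntax.
# Each hypothetical rule is (ending to strip, text to add back).
SUFFIX_RULES = [
    ("ing", "e"),   # distancing -> distance
    ("ing", ""),    # plain -ing removal
    ("ning", ""),   # running -> run
    ("d", ""),      # distanced -> distance
]

# Stand-in for the stems listed in a .dic file.
STEMS = {"run", "distance"}


def lemmas(wordform):
    """Return every known stem that the rules can derive from wordform."""
    found = []
    if wordform in STEMS:
        found.append(wordform)
    for strip, add in SUFFIX_RULES:
        if wordform.endswith(strip) and len(wordform) > len(strip):
            candidate = wordform[: -len(strip)] + add
            if candidate in STEMS and candidate not in found:
                found.append(candidate)
    return found


print(lemmas("running"))     # ['run']
print(lemmas("distancing"))  # ['distance']
print(lemmas("distanced"))   # ['distance']
```

A wordform that matches no rule and is not itself a stem simply gives back an empty list, which corresponds to the case where there is nothing to suggest.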

My initial excitement soon turned to disappointment, though. I tried

hunspell for Arabic, and although the “lemma” it came up with was

sometimes very good and was found in the dictionary whereas the original

wordform was not, it more often turned out that we would have been better

just sticking to the original wordform. Most other languages were

intermediate: sometimes the original wordform was better, sometimes the

lemma from hunspell was better. What we needed was a mechanism which

would give the user the best of both worlds. That is why I came up with

the brown list of suggestions and the “click again” mechanism which gives

control to the user.
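
The "best of both worlds" idea can be sketched as follows. This is only a guess at the shape of the mechanism, not Multidict's actual code: it assumes the first click looks up the wordform as it stands and that each further click on the same word steps on to the next suggestion, wrapping round at the end. The function names are hypothetical.

```python
def build_suggestions(wordform, lemma_candidates):
    """Clicked wordform first, then each distinct lemma candidate in order."""
    ordered = [wordform]
    for candidate in lemma_candidates:
        if candidate not in ordered:
            ordered.append(candidate)
    return ordered


def lookup_term(suggestions, clicks):
    """Term to look up after `clicks` clicks on the same word (1 = first click).

    Assumption: each further click moves to the next suggestion and wraps round.
    """
    return suggestions[(clicks - 1) % len(suggestions)]


options = build_suggestions("distanced", ["distance", "distanced"])
print(options)                  # ['distanced', 'distance']
print(lookup_term(options, 1))  # distanced
print(lookup_term(options, 2))  # distance
```

Keeping the original wordform at the head of the list means a poor lemma never hides a form which the dictionary would have found anyway.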

There isn’t currently much explanation in Multidict to guide the user on

the new facility, but the problem is that space is at a premium in the

Multidict navigation frame and we don’t want to clutter things up or

confuse new users. I would hope to put an explanation of the facility in

the Help file. And hopefully most users will see how it works simply by


How well the new mechanism works for a language depends on how clever the

hunspell implementation is for that language. For some languages, such as

Basque and also Lithuanian by the looks of things, the .aff file has lots

of inflexion rules built into it and the mechanism works very well. For

others such as German the hunspell implementation still relies on a huge

old-fashioned wordlist in the .dic file and does us little good. New

hunspell implementations are appearing all the time, though, so we can

look out for better ones as they appear and pull them in.

For Scottish Gaelic, hunspell turned out to be of hardly any use, and for

Irish Gaelic not much better. However, Mìcheal Bauer generously gave me a

huge lemmatization table he had built for Scottish Gaelic and I have added

this into the mechanism. Likewise, Kevin Scannell in the US generously

gave me a huge lemmatization table he had built for Irish Gaelic and I have

thrown this in too. I threw in too a huge public domain lemmatization

table for Italian which I found back in the days of the POOLS-T project

and had stored ever since in the hope that it would be useful some day.

So for all these languages the new mechanism is I think working

particularly well. All other languages currently rely only on hunspell.

(Actually, for the aficionados, I have also thrown in a table of Old

Irish irregular verbforms.)

For Scottish and Irish Gaelic, I have moved the old rules for the removal

of initial mutations into the new mechanism. So instead of the old system

which converted “shoilleir” to “soilleir”, “thart” to “tart”, “tsaol” to

“saol”, “bhfuar” to “fuar”, “bhfuilimid” to “fuilimid” willy-nilly and

sometimes made things worse instead of better, the new system gives

control to the user. For Irish Gaelic I notice that even for some

dictionaries such as FGB which now give good lemmatization suggestions,

the new Multidict mechanism is so slick that it is often quicker and

easier just to click again and let Multidict do the work.
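
A minimal sketch of mutation removal as a source of suggestions, covering only the patterns in the examples above. The rule tables and the function name `demutate` are assumptions for illustration, and a real rule set would need to cover more mutations than these.

```python
# Hypothetical rule tables, limited to the examples in the text.
PREFIX_RULES = [("bhf", "f"), ("ts", "s")]  # bhfuar -> fuar, tsaol -> saol
LENITABLE = "bcdfgmpst"                     # consonants that can take a following h


def demutate(word):
    """Return candidate unmutated forms; an empty list if no rule applies."""
    w = word.lower()
    candidates = []
    for prefix, plain in PREFIX_RULES:
        if w.startswith(prefix):
            candidates.append(plain + w[len(prefix):])
    # Try removing a lenition 'h' only when no prefix rule matched,
    # so that "bhfuar" does not also yield "bfuar".
    if not candidates and len(w) > 2 and w[0] in LENITABLE and w[1] == "h":
        candidates.append(w[0] + w[2:])
    return candidates


for form in ["shoilleir", "thart", "tsaol", "bhfuar", "bhfuilimid"]:
    print(form, "->", demutate(form))
```

Because the results are offered as suggestions alongside the original wordform rather than replacing it, a rule that misfires on a particular word does no harm.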

Although it is wonderful what hunspell has given us for free, it is not

perfect for our task. In particular it is not good for lemmatizing common

irregular verbs or irregular noun inflexions. Hunspell’s aims and ours

are different. Hunspell’s aim in its .aff file is to supply the inflexion

rules for regular verbs and regular nouns and thereby save the space which

would otherwise be taken up in the .dic file by hundreds of thousands of

regularly inflected wordforms. It is not bothered about the small amount

of space taken up by irregular wordforms, so it just throws them all into

the .dic file and therefore cannot lemmatize them. That is why it does

not suggest “estar” when we give it the Portuguese verbform “estamos”. We

could solve much of this problem by feeding Multidict a small table of

irregular verbforms for each language – but that is something for another


There are lots of other possibilities. The new mechanism is super

flexible and can handle lemmatization suggestions from algorithmic rules

as well as those from hunspell and from a lemmatization table. As an

experiment, I have thrown in a rule to try removing a final ‘s’ from

English words – so that if you click on “transducers” Multidict will

suggest “transducer” even though hunspell does not know this word. We

could add many such rules for many languages. The beauty of the new

mechanism is that since we are only providing the user with suggestions,

the rules do not have to always be perfect. We could add in a facility to

break words into component words so that if you give it German

“Infobahn” it will suggest “Info” and “Bahn”. Or we could give it a

facility to convert between closely related languages, so that if you give

it Irish Gaelic “scáthach” it can automatically suggest trying “sgàthach”

in Scottish Gaelic dictionaries. Similarly for Danish and Norwegian,

Spanish and Portuguese, Czech and Slovak perhaps.
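
Two of the rules mentioned above, the final ‘s’ rule and compound splitting, could look something like this. It is a sketch under stated assumptions, not the rule actually used in Multidict: the function names are invented, the tiny German wordlist is a stand-in for a real one, and the splitter only tries two-part splits.

```python
def strip_final_s(word):
    """Suggest the word without a final 's' (transducers -> transducer)."""
    if len(word) > 2 and word.endswith("s"):
        return [word[:-1]]
    return []


def split_compound(word, known_words):
    """Suggest two component words if both halves are in known_words."""
    lower = word.lower()
    for i in range(2, len(lower) - 1):
        head, tail = lower[:i], lower[i:]
        if head in known_words and tail in known_words:
            # Capitalised on the assumption that the parts are German nouns.
            return [head.capitalize(), tail.capitalize()]
    return []


def rule_suggestions(word, rules):
    """Run every rule and collect distinct suggestions in order."""
    collected = []
    for rule in rules:
        for candidate in rule(word):
            if candidate != word and candidate not in collected:
                collected.append(candidate)
    return collected


GERMAN_WORDS = {"info", "bahn"}  # stand-in for a real wordlist
rules = [strip_final_s, lambda w: split_compound(w, GERMAN_WORDS)]

print(rule_suggestions("transducers", rules))  # ['transducer']
print(rule_suggestions("Infobahn", rules))     # ['Info', 'Bahn']
```

Each rule is just a function from a word to a list of candidates, so suggestions from hunspell or from a lemmatization table could be plugged into the same list of rules.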

The possibilities the new mechanism offers are numerous and exciting.

They need working through for each individual language. But this will have

to wait for future years and perhaps future projects, because there are

lots of other things we ought to try and do yet in the TOOLS project:

hopefully giving Clilstore the ability to store exercise files itself for

one thing, so as to remove the need to store them separately on Dropbox.

But to my mind the easy opportunity unexpectedly presented by hunspell was

too good to miss.



Love Song

February 14, 2014

For Valentine’s Day – a love song on Clilstore… Multimedia, multilingual love bombs 🙂

Island Voices - Guthan nan Eilean

A modern Gaelic love song, ’S tu mo ghaol, from the Bi Beò archives forms the musical centrepiece for Island Voices’ Valentine’s Day message. You can hear it above without a transcript, or if you would like to read the words while you listen you can try this new Clilstore unit.

For a local visual celebration of the same sentiment we still find it hard to beat this message in Russian, beautifully crafted out of Grimsay scallop shells at Poll nan Crann in Benbecula. The story behind it is quite touching, and you can read all about it in this Facebook thread.

Meanwhile, and for one day only, An Radio, the new bilingual Uist community station, will be running specially themed playlists all day today. Just follow this link and click on the “An Radio Player”. Wherever you are, and whichever languages you like to use, Island…




December 12, 2013

On the 25th of November the TOOLS project was presented to the teaching staff and students of the Department of Linguistics and Literatures. The presentation emphasised the TOOLS consortium, its objectives, its social networks and its results. Newsletters were handed out to participants, who were encouraged to register online in order to receive the newsletter on a regular basis.


The presentation was given on an iPad using the Prezi online system. You can access the full online Prezi presentation at: http://goo.gl/CnT7H4
The results were given particular prominence in this dissemination activity, with a brief overview of Clilstore, Wordlink and Multidict.

The manuals and guides were also presented, showing the Portuguese version of each guidebook. As the eBook had just been made available, the Portuguese version of the eBook was also demonstrated.

Finally, four units were chosen to demonstrate how the Clilstore online system operates: Knee ligament anatomy (English – A2), A warm embrace that saves lives (English – B1), Tu és mais forte (Portuguese – A2) and Montering af et britisk elstik (Danish – A1). The participants, both teachers and students, were very interested in using the units presented, and others available on Clilstore. They highlighted as strong points that it is free, that it is online, that it is easy to work with, and that it links to online dictionaries.


Clilstore and the TOOLS project introduced to ECML network

December 11, 2013

The European Centre for Modern Languages (ECML), a department within the Council’s Directorate General IV (Education, Culture and Heritage, Youth and Sport), is implementing a project, “Language for work”, one objective of which is to create a network of language professionals who work on developing the skills of migrant workers, along with other teachers involved in teaching professional language at different educational levels. The network held a meeting on 5-6 December 2013 in Graz, Austria, at the headquarters of the ECML. The network was established last December and this was its second meeting. Language teachers from the state and private sectors gathered to share their experience and methodologies of teaching professional language and to learn from each other. The agenda allowed time for network members to present the results of their work, including projects, outcomes of research and other activities, or simply to give a practitioner’s view on the issue.

Rasa Zygmantaite, a member of the Tools project team who is also a member of the Language for Work network, participated in the meeting and presented Clilstore and the Tools project. The time allocated for the presentation didn’t allow the teachers from the network to register and immediately create their own units during the workshop. However, all 35 CDs offered to the workshop participants, with audio files on how to create your own unit in English (+ Scottish Gaelic, Irish Gaelic, Lithuanian and Portuguese), were distributed in less than five minutes! The participants showed great interest in the tool, especially its ability to work with languages that do not use the Latin alphabet, i.e. with all UTF-8 characters, which is very important when you work with migrant workers from different parts of the world. The ECML network “Language for work” is finalizing its website, and the participants’ task during one of the workshops was to discuss the functionality of the website in order to make it as attractive and user friendly as possible, so that it is really used by language teaching professionals from the European Union and non-EU countries. Members of the network will be able to share their findings and methodological material in the “Library” section of the website. And not only that: we can link our project to this network to make the findings of the Tools project and other http://www.languages.dk projects easier to reach for the big family of language teachers.
More information: www.ecml.at


Kick-off meeting Pools-3 Brussels

November 14, 2013

The POOLS-3 Transfer of Innovation project has just kicked off, and will surely be looking to try out the Clilstore platform with new languages…


Are you a language teacher interested in using modern technology as a tool for language learning? Does lack of support and knowledge prevent you from making meaningful use of ICT in class? Do you know what CLIL and CALL stand for? Are you familiar with „Task-based learning“?

Nowadays, there is a large number of teachers who want to use innovative methods. However, they lack the information and effective tools which would help them to make use of ICT and digital video. The POOLS project addressed these needs by creating materials and tools for the Computer Assisted Language Learning method.

POOLS-3 is a Transfer of innovation project based on the original POOLS project funded with support from the European Commission. It will adapt and translate the POOLS guides and manuals, produce digital video material for use of ICT in language learning in three new languages. It will also run teacher training courses on innovative…



Tools presented at EfVET conference in Athens

October 29, 2013


On 23-26 October 2013 the Tools project was presented at an EfVET (www.efvet.org) conference in Athens. The theme of the conference, “The College that Works”, attracted more than 250 participants from all over Europe as well as several from non-EU countries such as Turkey, Russia, Azerbaijan, Hong Kong, Japan etc. Antonio Silva Mendes, Director for VET and secondary education at DG EAC, was among the speakers at the conference. As usual, the conference was very keen on sharing the good practice of EU-funded projects, dedicating half a day of the conference agenda to a special session called “round tables”.

Here participants in different EU-funded projects can present the outcomes of their work to other conference participants. Kent Andersen, the Tools project coordinator, presented the project together with two other members of the Tools team in two half-hour sessions during the main part of the conference agenda, and attracted a large audience in both rounds. The project and the tool were further discussed among the project participants and the conference delegates, and the feedback was more than positive. For example, a principal from a school in Germany said he will make sure language teachers use the tool, as the school is very keen on IT and innovation in teaching. Other teachers were prepared to use it not only in their classes but also for their own language learning in their free time. The possibility of linking Multidict with various web pages, including famous news channels, makes it easier and more attractive to teach politics and history, said another teacher, from Denmark.
This year the project could offer the finished web-based tool, Clilstore, which is connected to Multidict and Wordlink. The teachers were eager to try the tool themselves; however, because the internet connection was rather slow, it was not possible to do so directly during the roundtable workshop. The participants took DIY videos which will help them to try the tool in their normal surroundings, and we look forward to new users in Clilstore.


An t-Alltan, Clilstore, agus Guthan nan Eilean

October 25, 2013

Here is Caoimhín Ó Donnaíle’s recent presentation at a national conference for Gaelic teachers – An t-Alltan – speaking about his work for TOOLS on Clilstore.

Island Voices - Guthan nan Eilean

Seo an taisbeanadh a bh’ aig Caoimhín Ó Donnaíle o chionn treiseag aig co-labhairt nàiseanta airson tidsearan. Tha An t-Alltan air a bhith a’ ruith o 2008, ach b’ e seo a’ chiad chothrom aig Caoimhín an obair a tha e a’ dèanamh airson Pròiseact TOOLS air Clilstore, Wordlink, agus Multidict a shealltainn. Cha robh aige ach 20 mionaid, ach rinn e an gnothach glè mhath – le “plug” airson Guthan nan Eilean san òraid aige cuideachd. Taing mhòr, Chaoimhín!



Here is Caoimhín Ó Donnaíle’s recent presentation at a national conference for Gaelic teachers – An t-Alltan – speaking about his work for TOOLS on Clilstore, Wordlink, and Multidict. He only had 20 minutes but still managed a plug for Island Voices….
