Archive for March, 2014



March 24, 2014

It can be said that, by mere accident, Clilstore, along with the project's software and units, has found an entirely new application! They are used at Marijampole College to teach conference interpreting to students of the Applied Foreign Languages educational programme. Marijampole College is an institution of tertiary education offering professional BA degrees in various programmes in the social and applied sciences as well as in several technical fields.


It started with dissemination: since the proposal stage of the project it had been planned that one of the groups with which the Clilstore units would be piloted and then used would be students of the Marijampole College Teacher Training department. This was successfully completed, and the software was so interesting that the Dean offered to try it with the students of business English. The conference interpreting course is rather short, and the students are granted three credits after taking an exam. In fact this was the first course ever, as the programme is rather new and the third-year students are the first ones to graduate from it this year. It was a challenge to prepare something new and catchy for students who were well known among the staff for their lack of motivation and interest in their studies. One of the things that makes them come and pass the course is cumulative scoring, so that even those who have missed several classes are able to take the exam by presenting individual work.

Everybody who is learning or teaching languages knows that this process requires a lot of individual work, i.e. you have to do homework, while students nowadays (things were really different in my day! :)) rely mostly on the Internet rather than on their memory, which makes language learning simply impossible! Thus, if you want to catch their interest, you must do it with something original, something they have never experienced before.

And voilà! Here we have Clilstore, something easy to use and really attractive! The tool is just perfect for teaching conference interpreting, as I discovered after some research on the website. There are quite a few interesting courses to be found on the website, including an international project of Vilnius University together with some other prominent HE institutions of Europe, not to mention the resources of DG Interpretation of the EC, who use video for training new interpreters. However, we are talking about students whose vocation is not necessarily interpreting or translation! They have a more limited vocabulary in store, and their fluency is far from that of high-level professionals or of students aiming to BECOME interpreters! Choosing a complicated video wouldn’t work if the students are not equipped with the appropriate amount and variety of vocabulary; they would simply lose interest. Clilstore, by contrast, gives the students the possibility to work individually at the right pace and then to create their own units as an individual task.

The story has just begun, but I can already see how the courses (of applied English) and the tool of Tools4Clil have fallen in love with each other. This had to happen: an entirely new application of the project tool is something a developer wouldn’t even have dreamed of!


Big improvement to Multidict for some languages

March 3, 2014

 by Caoimhín (Skye)


I hit on the “hunspell” improvement to Multidict almost by accident. I had felt, ever since the days of the POOLS-T project which first developed Wordlink and Multidict, that the main thing which the facility lacked was some ability to do “lemmatization”: to change a wordform which you click on into a dictionary headword for looking up in an online dictionary. The Greek partners in the POOLS-T project in particular complained that Wordlink only ever succeeded in finding the occasional Greek word in the dictionary, and the Swiss-Italian partners had the same complaint to a lesser extent. Wordlink works so successfully for English texts only because English has very few inflected forms (wordforms such as “running”, “distancing”, “distanced”) compared to most languages, and because English is a big enough, rich enough language that many of the online dictionaries (with some notable exceptions such as Etymonline) have inbuilt lemmatization. But although I hoped sometime to try and add this capability to Multidict for many languages, I thought that it would be a huge amount of work.

I happened to be talking to Mìcheal Bauer, a linguist who has done so much for Scottish Gaelic on the Internet. He also speaks Basque, and a good Basque dictionary had “stopped working with Wordlink”. In fact, it had merely changed its search parameters (for the better) and I soon put things right, but I mentioned to Mìcheal in passing that none of the Basque dictionaries worked very well with Wordlink anyway, because Basque is a highly inflected language and the dictionaries do not do lemmatization. Mìcheal pointed me to the excellent Basque implementation of hunspell, which contains lots of Basque inflexion rules in its “eu.aff” file. Hunspell was first developed for Hungarian, another highly inflected language, and instead of just relying on a huge wordlist in a .dic file like old-fashioned spell-checkers, it can make clever use of lots of complicated inflexion rules for the language in a .aff file. (The “.aff” stands for “affix”.)

I started trying to decipher and understand the mathematical-looking inflexion rules in the eu.aff file, but when I read up more about hunspell I found I didn’t have to bother! Hunspell does more than just spellchecking. It does lemmatization too: if you give it a wordform it will give you back a dictionary headword (or several headwords if there are several possibilities). I soon put it into service for Basque in Multidict, and Mìcheal Bauer declared it to be a big improvement on the whole. I had visions of hunspell giving us lemmatization “for free” for lots of languages.
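To make the idea of rule-based lemmatization concrete, here is a toy sketch in Python. It is not hunspell's actual algorithm, and it does not use the real .aff or .dic syntax; the suffix rules, the headword list and the function name are all invented for illustration. It only shows how a few rules plus a list of headwords can turn a wordform into one or more candidate headwords.

```python
# Toy illustration only: NOT hunspell's real algorithm or file format.
# Each hypothetical rule means: strip `suffix`, then append `replacement`.
SUFFIX_RULES = [
    ("ning", ""),   # running    -> run
    ("ing", ""),    # walking    -> walk
    ("ing", "e"),   # distancing -> distance
    ("ed", ""),     # walked     -> walk
    ("d", ""),      # distanced  -> distance
    ("s", ""),      # walks      -> walk
]

# Stand-in for the headword list that a real dictionary file would provide.
HEADWORDS = {"run", "walk", "distance"}


def lemmatize(wordform, rules=SUFFIX_RULES, headwords=HEADWORDS):
    """Return every known headword reachable by undoing one suffix rule."""
    word = wordform.lower()
    found = []
    if word in headwords:
        found.append(word)
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in headwords and candidate not in found:
                found.append(candidate)
    return found
```

Calling `lemmatize("distancing")` undoes the "ing" rule with the "e" replacement and finds "distance" in the headword list; a wordform that matches no rule simply yields an empty list.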

My initial excitement soon turned to disappointment, though. I tried hunspell for Arabic, and although the “lemma” it came up with was sometimes very good and was found in the dictionary whereas the original wordform was not, it more often turned out that we would have been better just sticking to the original wordform. Most other languages were intermediate: sometimes the original wordform was better, sometimes the lemma from hunspell was better. What we needed was a mechanism which would give the user the best of both worlds. That is why I came up with the brown list of suggestions and the “click again” mechanism which gives control to the user.
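The general shape of such a mechanism can be sketched in a few lines of Python. This is only an assumption about how it might be structured, not Multidict's actual code: the names `build_suggestions`, `LookupSession` and `click_again` are hypothetical, and the wrap-around behaviour is a guess. The point is that the original wordform is always tried first, and each further click moves on to the next candidate.

```python
def build_suggestions(wordform, lemmatizers):
    """Original wordform first, then each distinct lemma candidate in order."""
    suggestions = [wordform]
    for lemmatizer in lemmatizers:
        for candidate in lemmatizer(wordform):
            if candidate not in suggestions:
                suggestions.append(candidate)
    return suggestions


class LookupSession:
    """Tracks which suggestion is currently being looked up (hypothetical)."""

    def __init__(self, wordform, lemmatizers):
        self.suggestions = build_suggestions(wordform, lemmatizers)
        self.position = 0

    def current(self):
        return self.suggestions[self.position]

    def click_again(self):
        # Advance to the next suggestion, wrapping round to the original
        # wordform after the last one (an assumption made for this sketch).
        self.position = (self.position + 1) % len(self.suggestions)
        return self.current()
```

Because the lemmatizers are passed in as plain functions, any source of suggestions can be plugged in without changing the rest of the sketch.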

There isn’t currently much explanation in Multidict to guide the user on the new facility, but the problem is that space is at a premium in the Multidict navigation frame and we don’t want to clutter things up or confuse new users. I would hope to put an explanation of the facility in the Help file. And hopefully most users will see how it works simply by


How well the new mechanism works for a language depends on how clever the hunspell implementation is for that language. For some languages, such as Basque and also Lithuanian by the looks of things, the .aff file has lots of inflexion rules built into it and the mechanism works very well. For others such as German the hunspell implementation still relies on a huge old-fashioned wordlist in the .dic file and does us little good. New hunspell implementations are appearing all the time, though, so we can look out for better ones as they appear and pull them in.

For Scottish Gaelic, hunspell turned out to be of hardly any use, and for Irish Gaelic not much better. However, Mìcheal Bauer generously gave me a huge lemmatization table he had built for Scottish Gaelic and I have added this into the mechanism. Likewise, Kevin Scannell in the US generously gave me a huge lemmatization table he had built for Irish Gaelic and I have thrown this in too. I also threw in a huge public domain lemmatization table for Italian which I found back in the days of the POOLS-T project and had stored ever since in the hope that it would be useful some day. So for all these languages the new mechanism is, I think, working particularly well. All other languages currently rely only on hunspell. (Actually, for the aficionados, I have also thrown in a table of Old Irish irregular verbforms.)
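A lemmatization table is conceptually very simple: it maps each wordform to one or more headwords. The Python sketch below assumes a tab-separated text format with one wordform and one headword per line. That format, the function names and the handful of Italian sample entries are assumptions made for illustration; they say nothing about how the tables mentioned above are actually stored.

```python
from collections import defaultdict

# Hypothetical sample data: "wordform<TAB>headword", one pair per line.
SAMPLE_TABLE = """\
siamo\tessere
andiamo\tandare
porta\tporta
porta\tportare
"""


def load_table(text):
    """Parse tab-separated lines into {wordform: [headword, ...]}."""
    table = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        wordform, lemma = line.split("\t")
        if lemma not in table[wordform]:
            table[wordform].append(lemma)
    return dict(table)


def table_lemmas(wordform, table):
    """Look a wordform up in the table; unknown wordforms give an empty list."""
    return table.get(wordform.lower(), [])
```

A wordform such as “porta”, which is both a noun in its own right and a form of the verb “portare”, comes back with both headwords, so the user can choose between them.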

For Scottish and Irish Gaelic, I have moved the old rules for the removal of initial mutations into the new mechanism. So instead of the old system which converted “shoilleir” to “soilleir”, “thart” to “tart”, “tsaol” to “saol”, “bhfuar” to “fuar”, “bhfuilimid” to “fuilimid” willy-nilly and sometimes made things worse instead of better, the new system gives control to the user. For Irish Gaelic I notice that even for some dictionaries such as FGB which now give good lemmatization suggestions, the new Multidict mechanism is so slick that it is often quicker and easier just to click again and let Multidict do the work.
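As a rough illustration, rules of this kind can be written as a short list of prefix substitutions whose result is offered as a suggestion alongside the original wordform instead of replacing it. The Python sketch below covers only a small, simplified subset of the mutation patterns and is not the rule set Multidict uses; the function names are hypothetical.

```python
# Eclipsis and t-prefix patterns (mainly Irish), checked before lenition so
# that "bhf" is not mistaken for lenited "bh". Illustrative subset only.
PREFIX_RULES = [
    ("bhf", "f"),
    ("mb", "b"), ("gc", "c"), ("nd", "d"), ("ng", "g"),
    ("bp", "p"), ("dt", "t"),
    ("ts", "s"),
]

# Consonants that can be followed by a leniting "h".
LENITABLE = "bcdfgmpst"


def unmutated(wordform):
    """Return the wordform with a recognised initial mutation removed, or None."""
    word = wordform.lower()
    for prefix, base in PREFIX_RULES:
        if word.startswith(prefix):
            return base + word[len(prefix):]
    if len(word) > 2 and word[1] == "h" and word[0] in LENITABLE:
        return word[0] + word[2:]
    return None


def mutation_suggestions(wordform):
    """Keep the original wordform first; add the unmutated form if there is one."""
    candidate = unmutated(wordform)
    return [wordform] if candidate is None else [wordform, candidate]
```

Keeping the original first matters for a word like “thart”: the rule proposes “tart”, but the user still gets to look up “thart” itself before clicking again.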

Although it is wonderful what hunspell has given us for free, it is not perfect for our task. In particular it is not good for lemmatizing common irregular verbs or irregular noun inflexions. Hunspell’s aims and ours are different. Hunspell’s aim in its .aff file is to supply the inflexion rules for regular verbs and regular nouns and thereby save the space which would otherwise be taken up in the .dic file by hundreds of thousands of regularly inflected wordforms. It is not bothered about the small amount of space taken up by irregular wordforms, so it just throws them all into the .dic file and therefore cannot lemmatize them. That is why it does not suggest “estar” when we give it the Portuguese verbform “estamos”. We could solve much of this problem by feeding Multidict a small table of irregular verbforms for each language, but that is something for another


There are lots of other possibilities. The new mechanism is very flexible and can handle lemmatization suggestions from algorithmic rules as well as those from hunspell and from a lemmatization table. As an experiment, I have thrown in a rule to try removing a final ‘s’ from English words, so that if you click on “transducers” Multidict will suggest “transducer” even though hunspell does not know this word. We could add many such rules for many languages. The beauty of the new mechanism is that since we are only providing the user with suggestions, the rules do not always have to be perfect. We could add in a facility to break words into component words so that if you give it German “Infobahn” it will suggest “Info” and “Bahn”. Or we could give it a facility to convert between closely related languages, so that if you give it Irish Gaelic “scáthach” it can automatically suggest trying “sgàthach” in Scottish Gaelic dictionaries. Similarly for Danish and Norwegian, Spanish and Portuguese, Czech and Slovak perhaps.
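Two of these ideas, the final ‘s’ rule and the splitting of compounds, are simple enough to sketch in Python. The function names and the `known_words` parameter are hypothetical, and the splitting shown here is a naive two-part split which assumes that a list of known words is available; it is not a description of any existing Multidict facility.

```python
def strip_final_s(wordform):
    """Suggest the wordform without a final 's' (English plural heuristic)."""
    if len(wordform) > 2 and wordform.endswith("s"):
        return [wordform[:-1]]
    return []


def split_compound(wordform, known_words):
    """Suggest two-part splits where both halves are known words.

    `known_words` is assumed to be a set of lowercase words; the parts are
    capitalised on the way out because German nouns are capitalised.
    """
    lower = wordform.lower()
    splits = []
    for i in range(2, len(lower) - 1):
        head, tail = lower[:i], lower[i:]
        if head in known_words and tail in known_words:
            splits.append((head.capitalize(), tail.capitalize()))
    return splits
```

Neither rule needs to be right every time: a wrong suggestion costs the user one extra click at most, since the original wordform is still there to fall back on.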

The possibilities the new mechanism offers are numerous and exciting. They need working through for each individual language. But this will have to wait for future years and perhaps future projects, because there are lots of other things we ought to try and do yet in the TOOLS project: hopefully giving Clilstore the ability to store exercise files itself, for one thing, so as to remove the need to store them separately on Dropbox. But to my mind the easy opportunity unexpectedly presented by hunspell was too good to miss.