Big improvement to Multidict for some languages

March 3, 2014

 by Caoimhín (Skye)

I hit on the “hunspell” improvement to Multidict almost by accident. I

had felt, ever since the days of the POOLS-T project which first developed

Wordlink and Multidict, that the main thing which the facility lacked was

some ability to do “lemmatization” – to change a wordform which you click

on into a dictionary headword for looking up in an online dictionary.

The Greek partners in the POOLS-T project in particular complained that

Wordlink only ever succeeded in finding the occasional Greek word in the

dictionary, and the Swiss-Italian partners had the same complaint to a

lesser extent. The only reason that Wordlink works so successfully for

English texts is that English has very few inflected forms (wordforms such

as “running”, “distancing”, “distanced”) compared to most languages. And

also that English is a big enough, rich enough language that many of the

online dictionaries (with some notable exceptions such as Etymonline) have

inbuilt lemmatization. But although I hoped sometime to try and add this

capability to Multidict for many languages, I thought that it would be a

huge amount of work.

I happened to be talking to Mìcheal Bauer, a linguist who has done so much

for Scottish Gaelic on the Internet. He also speaks Basque and a good

Basque dictionary had “stopped working with Wordlink”. In fact, it had

merely changed its search parameters (for the better) and I soon put

things right, but I mentioned to Mìcheal in passing that none of the

Basque dictionaries worked very well anyway with Wordlink because Basque

is a highly inflected language and the dictionaries do not do

lemmatization. Mìcheal pointed me to the excellent Basque implementation

of hunspell which contains lots of Basque inflexion rules in its “eu.aff”

file. Hunspell was first developed for Hungarian, another highly

inflected language, and instead of just relying on a huge wordlist in a

.dic file like old-fashioned spell-checkers, it can make clever use of

lots of complicated inflexion rules for the language in a .aff file.

(The “.aff” stands for “affix”.)

I started trying to decipher and understand the mathematical-looking

inflexion rules in the eu.aff file, but when I read up more about hunspell

I found I didn’t have to bother! Hunspell does more than just

spellchecking. It does lemmatization too: if you give it a wordform it

will give you back a dictionary headword (or several headwords if there

are several possibilities). I soon put it into service for Basque in

Multidict, and Mìcheal Bauer declared it to be a big improvement on the

whole. I had visions of hunspell giving us lemmatization “for free” for

lots of languages.
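
To make the idea of rule-based lemmatization concrete, here is a minimal Python sketch. It is a toy only: it does not call hunspell and does not read real .aff or .dic files. The names SUFFIX_RULES, HEADWORDS and lemmatize, and the rules themselves, are invented for illustration.

```python
# Toy illustration only: this is NOT hunspell and does not read .aff/.dic files.
# Each hypothetical rule means "remove this suffix, then append this ending".
SUFFIX_RULES = [
    ("ning", ""),   # running    -> run
    ("ing", "e"),   # distancing -> distance
    ("ing", ""),    # walking    -> walk
    ("ed", "e"),    # distanced  -> distance
    ("ed", ""),     # walked     -> walk
]

# Stand-in for the headword list that a .dic file would supply.
HEADWORDS = {"run", "distance", "walk"}


def lemmatize(wordform):
    """Return the headwords reachable from wordform, in rule order."""
    candidates = []
    if wordform in HEADWORDS:
        candidates.append(wordform)
    for suffix, ending in SUFFIX_RULES:
        if wordform.endswith(suffix):
            stem = wordform[: -len(suffix)] + ending
            # Keep a result only if it is a known headword.
            if stem in HEADWORDS and stem not in candidates:
                candidates.append(stem)
    return candidates
```

Calling lemmatize("distancing") applies the second rule and yields the headword "distance". A wordform that no rule can reduce to a known headword yields an empty list, which is the case where the original wordform is all we have.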

My initial excitement soon turned to disappointment, though. I tried

hunspell for Arabic, and although the “lemma” it came up with was

sometimes very good and was found in the dictionary whereas the original

wordform was not, it more often turned out that we would have been better

just sticking to the original wordform. Most other languages were

intermediate: sometimes the original wordform was better, sometimes the

lemma from hunspell was better. What we needed was a mechanism which

would give the user the best of both worlds. That is why I came up with

the brown list of suggestions and the “click again” mechanism which gives

control to the user.
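
The post does not spell out the mechanics, so the following Python sketch is only one plausible reading of a "click again" scheme, not Multidict's actual code. It assumes that the original wordform is tried first, that suggestions follow in order, and that clicking past the last one wraps round to the start. The function names build_candidates and lookup_term are hypothetical.

```python
def build_candidates(wordform, suggestions):
    """Original wordform first, then each new suggestion, without duplicates."""
    candidates = [wordform]
    for suggestion in suggestions:
        if suggestion not in candidates:
            candidates.append(suggestion)
    return candidates


def lookup_term(candidates, click_count):
    """Term to look up on a given click (0 = first click).

    Assumption: after the last candidate we wrap round to the original.
    """
    return candidates[click_count % len(candidates)]


candidates = build_candidates("estamos", ["estar", "estamos"])
first = lookup_term(candidates, 0)    # the wordform as clicked
second = lookup_term(candidates, 1)   # the first suggestion
```

The point of the design is that nothing is lost: the wordform the user clicked is always tried as it stands, and the lemma is only ever an alternative.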

There isn’t currently much explanation in Multidict to guide the user on

the new facility, but the problem is that space is at a premium in the

Multidict navigation frame and we don’t want to clutter things up or

confuse new users. I would hope to put an explanation of the facility in

the Help file. And hopefully most users will see how it works simply by

experimentation.

How well the new mechanism works for a language depends on how clever the

hunspell implementation is for that language. For some languages, such as

Basque and also Lithuanian by the looks of things, the .aff file has lots

of inflexion rules built into it and the mechanism works very well. For

others such as German the hunspell implementation still relies on a huge

old-fashioned wordlist in the .dic file and does us little good. New

hunspell implementations are appearing all the time, though, so we can

look out for better ones as they appear and pull them in.

For Scottish Gaelic, hunspell turned out to be of hardly any use, and for

Irish Gaelic not much better. However, Mìcheal Bauer generously gave me a

huge lemmatization table he had built for Scottish Gaelic and I have added

this into the mechanism. Likewise, Kevin Scannell in the US generously

gave me a huge lemmatization table he had built for Irish Gaelic and I have

thrown this in too. I threw in too a huge public domain lemmatization

table for Italian which I found back in the days of the POOLS-T project

and had stored ever since in the hope that it would be useful some day.

So for all these languages the new mechanism is, I think, working

particularly well. All other languages currently rely only on hunspell.

(Actually, for the aficionados, I have also thrown in a table of Old

Irish irregular verbforms.)

For Scottish and Irish Gaelic, I have moved the old rules for the removal

of initial mutations into the new mechanism. So instead of the old system

which converted “shoilleir” to “soilleir”, “thart” to “tart”, “tsaol” to

“saol”, “bhfuar” to “fuar”, “bhfuilimid” to “fuilimid” willy-nilly and

sometimes made things worse instead of better, the new system gives

control to the user. For Irish Gaelic I notice that even for some

dictionaries such as FGB which now give good lemmatization suggestions,

the new Multidict mechanism is so slick that it is often quicker and

easier just to click again and let Multidict do the work.
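
As a rough illustration of mutation removal offered as a suggestion rather than applied willy-nilly, here is a small Python sketch. The rule set is a simplified assumption of mine, not the rules Multidict uses: it covers lenition (consonant + h), the t prefixed to s, and the eclipsis spellings of Irish orthography. The names ECLIPSIS, LENITABLE and demutation_suggestions are hypothetical.

```python
# Simplified, assumed rule set; real mutation rules have more cases.
ECLIPSIS = [("bhf", "f"), ("mb", "b"), ("gc", "c"), ("nd", "d"),
            ("ng", "g"), ("bp", "p"), ("dt", "t")]
LENITABLE = "bcdfgmpst"  # consonants that can be followed by a leniting h


def demutation_suggestions(word):
    """Suggest the unmutated form, if any; the caller keeps the original too."""
    w = word.lower()
    suggestions = []
    for prefix, base in ECLIPSIS:
        if w.startswith(prefix) and len(w) > len(prefix):
            suggestions.append(base + w[len(prefix):])   # bhfuar -> fuar
            break
    else:
        # Only reached when no eclipsis prefix matched.
        if len(w) > 2 and w[1] == "h" and w[0] in LENITABLE:
            suggestions.append(w[0] + w[2:])             # shoilleir -> soilleir
        elif w.startswith("ts") and len(w) > 2:
            suggestions.append(w[1:])                    # tsaol -> saol
    return suggestions
```

Because the result is only a suggestion, a doubtful case such as “thart” costs nothing: the user sees both “thart” and “tart” and chooses which to look up.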

Although it is wonderful what hunspell has given us for free, it is not

perfect for our task. In particular it is not good for lemmatizing common

irregular verbs or irregular noun inflexions. Hunspell’s aims and ours

are different. Hunspell’s aim in its .aff file is to supply the inflexion

rules for regular verbs and regular nouns and thereby save the space which

would otherwise be taken up in the .dic file by hundreds of thousands of

regularly inflected wordforms. It is not bothered about the small amount

of space taken up by irregular wordforms, so it just throws them all into

the .dic file and therefore cannot lemmatize them. That is why it does

not suggest “estar” when we give it the Portuguese verbform “estamos”. We

could solve much of this problem by feeding Multidict a small table of

irregular verbforms for each language – but that is something for another

project.
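
A table of irregular forms could be as simple as a dictionary lookup whose results are merged with whatever other suggestions are available. The Python sketch below assumes this shape; the table contents, the name IRREGULAR_PT and the function suggest_with_table are illustrative only and are not part of Multidict.

```python
# Tiny hypothetical table of irregular Portuguese verbforms -> headwords.
IRREGULAR_PT = {
    "estamos": ["estar"],
    "estou": ["estar"],
    "somos": ["ser"],
}


def suggest_with_table(wordform, table, other_suggestions=()):
    """Table hits first, then any other suggestions, without duplicates."""
    merged = []
    table_hits = table.get(wordform.lower(), [])
    for candidate in list(table_hits) + list(other_suggestions):
        # Skip the wordform itself: it is offered to the user anyway.
        if candidate != wordform and candidate not in merged:
            merged.append(candidate)
    return merged
```

With this table, suggest_with_table("estamos", IRREGULAR_PT) returns ["estar"], the case that hunspell alone misses.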

There are lots of other possibilities. The new mechanism is super

flexible and can handle lemmatization suggestions from algorithmic rules

as well as those from hunspell and from a lemmatization table. As an

experiment, I have thrown in a rule to try removing a final ‘s’ from

English words – so that if you click on “transducers” Multidict will

suggest “transducer” even though hunspell does not know this word. We

could add many such rules for many languages. The beauty of the new

mechanism is that since we are only providing the user with suggestions,

the rules do not have to always be perfect. We could add in a facility to

break words into component words so that if you give it German

“Infobahn” it will suggest “Info” and “Bahn”. Or we could give it a

facility to convert between closely related languages, so that if you give

it Irish Gaelic “scáthach” it can automatically suggest trying “sgàthach”

in Scottish Gaelic dictionaries. Similarly for Danish and Norwegian,

Spanish and Portuguese, Czech and Slovak perhaps.
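
Two of the algorithmic rules mentioned above are easy to sketch in Python. This is an illustration under stated assumptions, not Multidict's code: the length and double-s guards in strip_final_s are my own additions, and split_compound relies on a caller-supplied word list (here a hypothetical two-word set) and returns only the first two-part split it finds.

```python
def strip_final_s(word):
    """Suggest the word without a final 's' (crude English plural rule)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return [word[:-1]]
    return []


def split_compound(word, known_words):
    """Suggest two component words if both halves are in known_words."""
    lower = word.lower()
    for i in range(2, len(lower) - 1):
        head, tail = lower[:i], lower[i:]
        if head in known_words and tail in known_words:
            # Capitalised on the assumption that these are German nouns.
            return [head.capitalize(), tail.capitalize()]
    return []


plural_hint = strip_final_s("transducers")
compound_hint = split_compound("Infobahn", {"info", "bahn"})
```

Neither rule needs to be reliable. A wrong suggestion simply leads to a failed lookup, and the original wordform is still there for the user to fall back on.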

The possibilities the new mechanism offers are numerous and exciting.

They need working through for each individual language. But this will have

to wait for future years and perhaps future projects, because there are

lots of other things we still ought to try to do in the TOOLS project:

giving Clilstore the ability to store exercise files itself, for one

thing, so as to remove the need to store them separately on Dropbox. But

to my mind the easy opportunity unexpectedly presented by hunspell was too

good to miss.

 

2 comments

  1. The new facilities described here are currently available at http://test.multidict.net/, but I hope to move them very soon, if people are happy with them, to http://multidict.net/


  2. It was good to share the details of the programming work: the adventures, the high and low points, the thinking and maths needed, the excitement, and the hard work to understand WHY a programme works, or doesn't, or behaves strangely?!
    I think you covered, with clear examples, the numerous possibilities, the reasons to be happy that we are not in your place as programmers, and the results that would make the users happier 🙂


