Big improvement to Multidict for some languages
March 3, 2014
by Caoimhín (Skye)
I hit on the “hunspell” improvement to Multidict almost by accident. I
had felt, ever since the days of the POOLS-T project which first developed
Wordlink and Multidict, that the main thing which the facility lacked was
some ability to do “lemmatization” – to change a wordform which you click
on into a dictionary headword for looking up in an online dictionary.
The Greek partners in the POOLS-T project in particular complained that
Wordlink only ever succeeded in finding the occasional Greek word in the
dictionary, and the Swiss-Italian partners had the same complaint to a
lesser extent. The only reason that Wordlink works so successfully for
English texts is that English has very few inflected forms (wordforms such
as “running”, “distancing”, “distanced”) compared to most languages. And
also that English is a big enough, rich enough language that many of the
online dictionaries (with some notable exceptions such as Etymonline) have
inbuilt lemmatization. But although I hoped sometime to try and add this
capability to Multidict for many languages, I thought that it would be a
huge amount of work.
I happened to be talking to Mìcheal Bauer, a linguist who has done so much
for Scottish Gaelic on the Internet. He also speaks Basque and a good
Basque dictionary had “stopped working with Wordlink”. In fact, it had
merely changed its search parameters (for the better) and I soon put
things right, but I mentioned to Mìcheal in passing that none of the
Basque dictionaries worked very well anyway with Wordlink because Basque
is a highly inflected language and the dictionaries do not do
lemmatization. Mìcheal pointed me to the excellent Basque implementation
of hunspell which contains lots of Basque inflexion rules in its “eu.aff”
file. Hunspell was first developed for Hungarian, another highly
inflected language, and instead of just relying on a huge wordlist in a
.dic file like old-fashioned spell-checkers, it can make clever use of
lots of complicated inflexion rules for the language in a .aff file.
(The “.aff” stands for “affix”.)
I started trying to decipher and understand the mathematical-looking
inflexion rules in the eu.aff file, but when I read up more about hunspell
I found I didn’t have to bother! Hunspell does more than just
spellchecking. It does lemmatization too: if you give it a wordform it
will give you back a dictionary headword (or several headwords if there
are several possibilities). I soon put it into service for Basque in
Multidict, and Mìcheal Bauer declared it to be a big improvement on the
whole. I had visions of hunspell giving us lemmatization “for free” for
lots of languages.
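
To make the idea concrete, here is a minimal sketch of how suffix rules plus a list of flagged headwords can be run "backwards" to recover a headword from a wordform. It is only an illustration of the principle: the rule layout (flag, strip, add, condition) is loosely modelled on the suffix rules of an .aff file, but the English rules, the tiny dictionary and the function name `lemmatize` are all made up for the example, and hunspell's real algorithm is far more elaborate.

```python
import re
from typing import NamedTuple


class SuffixRule(NamedTuple):
    flag: str       # rule class that a dictionary entry must carry
    strip: str      # characters removed from the headword before the suffix is added
    add: str        # the suffix itself
    condition: str  # regex that the headword must match for the rule to apply


# Toy English rules (hypothetical data, not taken from any real .aff file).
RULES = [
    SuffixRule("S", "", "s", r"[^sxy]$"),        # cat -> cats
    SuffixRule("S", "y", "ies", r"[^aeiou]y$"),  # pony -> ponies
    SuffixRule("G", "e", "ing", r"e$"),          # distance -> distancing
    SuffixRule("G", "", "ing", r"[^e]$"),        # walk -> walking
]

# Toy .dic content: headword -> flags it accepts (hypothetical data).
DICTIONARY = {"cat": "S", "pony": "S", "distance": "G", "walk": "SG"}


def lemmatize(wordform):
    """Return every headword from which the rules can generate `wordform`."""
    lemmas = []
    if wordform in DICTIONARY:
        lemmas.append(wordform)
    for rule in RULES:
        if not wordform.endswith(rule.add):
            continue
        # Undo the rule: drop the suffix, then restore the stripped characters.
        stem = wordform[: len(wordform) - len(rule.add)] + rule.strip
        if (
            stem in DICTIONARY
            and rule.flag in DICTIONARY[stem]
            and re.search(rule.condition, stem)
            and stem not in lemmas
        ):
            lemmas.append(stem)
    return lemmas


print(lemmatize("distancing"))
print(lemmatize("ponies"))
```

A wordform that matches no rule and is not itself a headword simply gives back an empty list.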
My initial excitement soon turned to disappointment, though. I tried
hunspell for Arabic, and although the “lemma” it came up with was
sometimes very good and was found in the dictionary whereas the original
wordform was not, it more often turned out that we would have been better
just sticking to the original wordform. Most other languages were
intermediate: sometimes the original wordform was better, sometimes the
lemma from hunspell was better. What we needed was a mechanism which
would give the user the best of both worlds. That is why I came up with
the brown list of suggestions and the “click again” mechanism which gives
control to the user.
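
The "best of both worlds" idea can be sketched roughly as follows. This is not Multidict's actual code: the function names are invented, `toy_lemmatizer` is a hypothetical stand-in for hunspell, and I am assuming for the illustration that each repeated click on the same word moves one step down the suggestion list and wraps around at the end.

```python
def build_suggestions(wordform, lemmatizers):
    """Return the clicked wordform followed by each new lemma candidate."""
    suggestions = [wordform]  # the original wordform always comes first
    for lemmatizer in lemmatizers:
        for lemma in lemmatizer(wordform):
            if lemma not in suggestions:  # skip duplicates, keep the order
                suggestions.append(lemma)
    return suggestions


def word_to_look_up(suggestions, click_count):
    """Pick the suggestion for the nth click on the same word (1-based).

    Assumption: each repeated click advances one step and wraps around.
    """
    return suggestions[(click_count - 1) % len(suggestions)]


def toy_lemmatizer(wordform):
    """Hypothetical stand-in for a real lemmatizer such as hunspell."""
    table = {"distancing": ["distance"], "distanced": ["distance"]}
    return table.get(wordform, [])


suggestions = build_suggestions("distancing", [toy_lemmatizer])
print(suggestions)
print(word_to_look_up(suggestions, 2))
```

Because the original wordform stays at the head of the list, a poor lemma can never make the first lookup worse than it was before.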
There isn’t currently much explanation in Multidict to guide the user on
the new facility, but the problem is that space is at a premium in the
Multidict navigation frame and we don’t want to clutter things up or
confuse new users. I would hope to put an explanation of the facility in
the Help file. And hopefully most users will see how it works simply by
experimentation.
How well the new mechanism works for a language depends on how clever the
hunspell implementation is for that language. For some languages, such as
Basque and also Lithuanian by the looks of things, the .aff file has lots
of inflexion rules built into it and the mechanism works very well. For
others such as German the hunspell implementation still relies on a huge
old-fashioned wordlist in the .dic file and does us little good. New
hunspell implementations are appearing all the time, though, so we can
look out for better ones as they appear and pull them in.
For Scottish Gaelic, hunspell turned out to be of hardly any use, and for
Irish Gaelic not much better. However, Mìcheal Bauer generously gave me a
huge lemmatization table he had built for Scottish Gaelic and I have added
this into the mechanism. Likewise, Kevin Scannell in the US generously
gave me a huge lemmatization table he had built for Irish Gaelic and I have
thrown this in too. I also threw in a huge public-domain lemmatization
table for Italian which I found back in the days of the POOLS-T project
and had stored ever since in the hope that it would be useful some day.
So for all these languages the new mechanism is I think working
particularly well. All other languages currently rely only on hunspell.
(Actually, for the aficionados, I have also thrown in a table of Old
Irish irregular verbforms.)
For Scottish and Irish Gaelic, I have moved the old rules for the removal
of initial mutations into the new mechanism. So instead of the old system
which converted “shoilleir” to “soilleir”, “thart” to “tart”, “tsaol” to
“saol”, “bhfuar” to “fuar”, “bhfuilimid” to “fuilimid” willy-nilly and
sometimes made things worse instead of better, the new system gives
control to the user. For Irish Gaelic I notice that even for some
dictionaries such as FGB which now give good lemmatization suggestions,
the new Multidict mechanism is so slick that it is often quicker and
easier just to click again and let Multidict do the work.
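
As a rough sketch of the difference, the mutation rules can be written so that they only add a candidate to the list instead of replacing the word. The rules below cover just the patterns in the examples above (a lenited consonant followed by "h", the "bhf" eclipsis of "f", and "t" prefixed to "s"); other eclipsis patterns are left out, and the function names are invented for the illustration.

```python
LENITABLE = "bcdfgmpst"  # consonants that can take a following "h" when lenited


def demutate(word):
    """Return candidate base forms with an initial mutation removed."""
    lower = word.lower()
    if lower.startswith("bhf"):
        return [word[2:]]            # bhfuar -> fuar
    if lower.startswith("ts"):
        return [word[1:]]            # tsaol -> saol
    if len(word) > 2 and lower[0] in LENITABLE and lower[1] == "h":
        return [word[0] + word[2:]]  # shoilleir -> soilleir
    return []


def suggestions_with_demutation(word):
    """Keep the word as clicked and offer the demutated form as an extra."""
    return [word] + demutate(word)


for w in ["shoilleir", "thart", "tsaol", "bhfuar", "bhfuilimid"]:
    print(suggestions_with_demutation(w))
```

A word like "thart" therefore stays first in the list, with "tart" offered only as an alternative.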
Although it is wonderful what hunspell has given us for free, it is not
perfect for our task. In particular it is not good for lemmatizing common
irregular verbs or irregular noun inflexions. Hunspell’s aims and ours
are different. Hunspell’s aim in its .aff file is to supply the inflexion
rules for regular verbs and regular nouns and thereby save the space which
would otherwise be taken up in the .dic file by hundreds of thousands of
regularly inflected wordforms. It is not bothered about the small amount
of space taken up by irregular wordforms, so it just throws them all into
the .dic file and therefore cannot lemmatize them. That is why it does
not suggest “estar” when we give it the Portuguese verbform “estamos”. We
could solve much of this problem by feeding Multidict a small table of
irregular verbforms for each language – but that is something for another
project.
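
Such a table could be very simple. The sketch below assumes a plain two-column, tab-separated format (wordform, then lemma, one pair per line), which is my own choice for the illustration rather than the format Multidict uses. Apart from "estamos", the sample rows are added only to show that one wordform may map to more than one headword.

```python
from collections import defaultdict

# Illustrative sample: wordform <TAB> lemma, one pair per line.
SAMPLE_TABLE = "estamos\testar\nsomos\tser\nfoi\tser\nfoi\tir\n"


def load_table(text):
    """Parse tab-separated wordform/lemma pairs into a dict of lists."""
    table = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # allow blank lines and comments
            continue
        wordform, lemma = line.split("\t")
        if lemma not in table[wordform]:
            table[wordform].append(lemma)
    return dict(table)


def lookup(table, wordform):
    """Return the lemmas listed for a wordform, or an empty list."""
    return table.get(wordform.lower(), [])


irregulars = load_table(SAMPLE_TABLE)
print(lookup(irregulars, "estamos"))
print(lookup(irregulars, "foi"))
```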
There are lots of other possibilities. The new mechanism is super
flexible and can handle lemmatization suggestions from algorithmic rules
as well as those from hunspell and from a lemmatization table. As an
experiment, I have thrown in a rule to try removing a final ‘s’ from
English words – so that if you click on “transducers” Multidict will
suggest “transducer” even though hunspell does not know this word. We
could add many such rules for many languages. The beauty of the new
mechanism is that since we are only providing the user with suggestions,
the rules do not have to always be perfect. We could add in a facility to
break words into component words so that if you give it German
“Infobahn” it will suggest “Info” and “Bahn”. Or we could give it a
facility to convert between closely related languages, so that if you give
it Irish Gaelic “scáthach” it can automatically suggest trying “sgàthach”
in Scottish Gaelic dictionaries. Similarly for Danish and Norwegian,
Spanish and Portuguese, Czech and Slovak perhaps.
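
Two of these algorithmic rules are easy to sketch. The code below is only an illustration under my own assumptions: the length and "ss" guards on the final-‘s’ rule, the tiny German lexicon and the function names are all invented here, and the rule actually in Multidict may differ.

```python
def strip_final_s(word):
    """Suggest the word without a final 's' (a rough English plural rule)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return [word[:-1]]
    return []


def split_compound(word, lexicon):
    """Suggest every two-part split whose halves are both in the lexicon.

    `lexicon` maps a lower-cased word to the form to display.
    """
    lower = word.lower()
    splits = []
    for i in range(2, len(lower) - 1):  # each half gets at least two letters
        head, tail = lower[:i], lower[i:]
        if head in lexicon and tail in lexicon:
            splits.append((lexicon[head], lexicon[tail]))
    return splits


# Hypothetical mini-lexicon for the German example.
GERMAN_LEXICON = {"info": "Info", "bahn": "Bahn"}

print(strip_final_s("transducers"))
print(split_compound("Infobahn", GERMAN_LEXICON))
```

Since the results are only ever offered as suggestions, a crude rule like the first one does no harm when it misfires.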
The possibilities the new mechanism offers are numerous and exciting.
They need working through for each individual language. But this will have
to wait for future years and perhaps future projects, because there are
lots of other things we ought to try and do yet in the TOOLS project:
hopefully giving Clilstore the ability to store exercise files itself for
one thing, so as to remove the need to store them separately on Dropbox.
But to my mind the easy opportunity unexpectedly presented by hunspell was
too good to miss.
The new facilities described here are currently available at http://test.multidict.net/, but I hope to move them very soon, if people are happy with them, to http://multidict.net/
by caoimhinsmo March 4, 2014 at 12:32 am
It was good to share the details of the programming work, adventures, high and low points, thinking and maths needed, excitement and hard work to understand WHY a programme works or not or behaves strangely?!
by ALB Conseil Paris March 4, 2014 at 11:10 am
I think you covered with clear examples the numerous possibilities to be happy that we are not in your place as programmers and the results that would make happier the users 🙂