Sunday, March 14, 2010

So, what am I working on?

That seems like a logical place to start. For a while now I've been experimenting with Moses, a statistical machine translation (SMT) system. My aim is to train a satisfactory Macedonian-English (and vice versa) MT system. The most notable (and probably the only) SMT system that includes Macedonian is Google Translate, but since they don't publish their corpora or engine, we'll have to manage with something else :) Francis Tyers is also working on something similar, and we'll probably continue working on it together.

So, first things first: what is Statistical Machine Translation? Put simply, SMT works by "learning patterns of translation" from parallel text in the source and target languages (i.e. a parallel corpus), and from those patterns it can generate the most probable translation for a sentence in the source language. It is not an easy task and it takes up lots of computing power and memory, but the results are quite satisfactory when it's done :)
Theoretically, to create a system for translating text from one language to another using SMT, you only need a parallel corpus. Probably the best freely available parallel corpus is JRC-Acquis. It covers 22 languages, all of them official languages of the European Union. Unfortunately, Macedonian is not in that group yet.
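To make "learning patterns of translation" a bit more concrete, here is a minimal sketch of IBM Model 1, the simplest classic word-translation model, trained with expectation-maximisation. It is not what Moses itself does (Moses builds phrase tables on top of GIZA++ word alignments), and it leaves out the NULL word and everything else a real aligner needs. The function names and the three-sentence toy corpus are made up for illustration.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities t(e|f) with IBM Model 1 EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (target_word, source_word) -> probability.
    Simplification: no NULL source word is used.
    """
    target_vocab = {e for _, tgt in pairs for e in tgt}
    # Start from a uniform distribution over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f translating to e
        total = defaultdict(float)  # expected count of f translating to anything
        # E-step: distribute each target word over the source words
        # of the same sentence, in proportion to the current t(e|f).
        for src, tgt in pairs:
            for e in tgt:
                z = sum(t[(e, f)] for f in src)
                for f in src:
                    frac = t[(e, f)] / z
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: re-estimate t(e|f) from the expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return dict(t)


def best_translation(t, source_word):
    """Return the target word with the highest t(e|f) for source_word."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get)


# Made-up toy corpus (Macedonian -> English), for illustration only.
toy_corpus = [
    ("голема куќа".split(), "big house".split()),
    ("голема книга".split(), "big book".split()),
    ("мала книга".split(), "small book".split()),
]

model = train_ibm_model1(toy_corpus)
for word in ["голема", "куќа", "книга", "мала"]:
    print(word, "->", best_translation(model, word))
```

Even with three sentence pairs, the model works out which words belong together purely from co-occurrence: "голема" appears with both "house" and "book", but only "big" appears every time "голема" does.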

The Macedonian-English parallel corpus that I'm using for my experiments is generated from the Southeast European Times (SETimes) online newspaper. It is significantly smaller than JRC-Acquis, but it will be enough for testing purposes. Francis has published the entire corpus on his web site, including alignments between 10 languages. Since some of the Macedonian articles in SETimes are not translated, I stripped them out.
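As a rough illustration of the stripping step, here is one way untranslated articles could be filtered out. This is only a sketch, not the actual tool used here. It assumes that an "untranslated" article is one whose Macedonian side is either empty or still mostly in Latin script, and the names `cyrillic_ratio`, `drop_untranslated` and the `min_ratio` threshold are hypothetical.

```python
def cyrillic_ratio(text):
    """Fraction of alphabetic characters that fall in the Cyrillic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    return cyrillic / len(letters)


def drop_untranslated(pairs, min_ratio=0.5):
    """Keep (macedonian, english) article pairs that look translated.

    Assumption: a pair is kept when the English side is non-empty and
    at least min_ratio of the letters on the Macedonian side are Cyrillic.
    """
    return [
        (mk, en)
        for mk, en in pairs
        if en.strip() and cyrillic_ratio(mk) >= min_ratio
    ]


# Made-up example articles, for illustration only.
articles = [
    ("Ова е вест.", "This is news."),
    ("This was never translated.", "This was never translated."),
    ("", "The Macedonian side is missing."),
]

kept = drop_untranslated(articles)
print(len(kept), "of", len(articles), "pairs kept")
```

A script-based check like this is crude, but it is cheap and needs no language model. Articles with many Latin-script names or quotes might need a lower threshold.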

Moses is the core of an SMT system, but along the way we use many other tools. For now I will just list those tools, and in the following posts I'll go through each of them in detail.
  1. Corpus retrieval: wget :)
  2. Corpus filtering, parsing etc. - my own tools, messy code, don't know if they'll be published.
  3. Part of speech tagging - SVMTool, TreeTagger, Apertium, TnT, MBT etc.
  4. Sentence alignment - I'm using HunAlign, but Vanilla should be ok too.
  5. Word alignment - any GIZA++ derivative like giza-pp.
  6. Language modelling toolkits - I'm using IRSTLM; other popular options are SRILM and RandLM.
  7. SMT decoder - the already mentioned Moses.
  8. Optimisations - various scripts included in Moses.
  9. Scoring - for now, BLEU, NIST.
Since many of the processes, especially word alignment, decoding and optimisation (minimum error rate training), are slow, they are run on a grid cluster that is part of SEE-GRID. Ten-fold cross-validation is an even bigger burden, so running everything at home is not an option :)
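To show why ten-fold validation multiplies the cost, here is a minimal sketch of how a parallel corpus could be split into ten train/test folds. The whole train, tune and score pipeline then has to run once per fold. The function name `k_fold_splits` and the fixed `seed` are assumptions for illustration, and the sketch splits at sentence-pair level, while a real setup might prefer to split by article.

```python
import random


def k_fold_splits(pairs, k=10, seed=0):
    """Yield k (train, test) splits of a list of sentence pairs.

    The pairs are shuffled once with a fixed seed, and every pair
    ends up in the test set of exactly one fold.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    for i in range(k):
        test = shuffled[i::k]
        train = [p for j, p in enumerate(shuffled) if j % k != i]
        yield train, test


# Made-up stand-in for a sentence-aligned corpus.
corpus = [(f"mk sentence {n}", f"en sentence {n}") for n in range(25)]

for fold, (train, test) in enumerate(k_fold_splits(corpus), start=1):
    print(f"fold {fold}: {len(train)} training pairs, {len(test)} test pairs")
```

Each fold trains on about 90% of the data and tests on the remaining 10%, so the slow steps listed above are repeated ten times over.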

That'll be enough for now. Next time I'll cover more on the SETimes corpus and sentence alignment.

4 comments:

Unknown said...

Interesting,
what did you use for sentence segmentation?

Regards,
Martin Saveski

Milosh said...

With MorphAdorner (http://morphadorner.northwestern.edu/morphadorner/sentencesplitter/). It's not ideal, but it got the job done.

Anonymous said...

There is no disputing about tastes.


--------------
Sofia University of St. Kliment Ohridski

Anonymous said...

While there is life there is hope.
