The last post was a month ago... How time flies :)
So, first things first: corpus retrieval. As previously stated, the corpus that I'm using is taken from the
Southeast European Times news site. You can either retrieve it manually (wget or HTTrack is your friend), or just grab the ready-made version
here. Whichever you choose, make sure that you:
- Clean up HTML tags. You can use either regular expressions or a ready-made tool (e.g. redirect the output of w3m).
- Normalize quotes, hyphens, etc. (e.g. „ “ to ", – and -- to -)
- Sentence-split the entire corpus. You can use Europarl's sentence splitter or some other integrated tool like MorphAdorner.
- You'll probably need to lowercase everything, but keep the original as well.
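If you go the regular-expression route, the cleanup steps above can be sketched in a few lines of Python. This is only a rough sketch, not the script I used: the function names are made up for illustration, the closing quote ” is an extra I added to the mapping, and a regex this simple will not cope with script or style blocks, so a real tool like w3m is the safer choice for messy pages.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Quote mapping: „ and “ come from the list above, ” is an assumed extra.
QUOTES = str.maketrans({"„": '"', "“": '"', "”": '"'})


def strip_tags(markup):
    """Drop anything that looks like an HTML tag and decode entities."""
    return html.unescape(TAG_RE.sub(" ", markup))


def normalize(text):
    """Unify quotes and hyphens, then collapse runs of whitespace."""
    text = text.translate(QUOTES)
    text = text.replace("--", "-").replace("–", "-")
    return re.sub(r"\s+", " ", text).strip()


def clean(markup):
    """Return (original, lowercased) so the cased version is kept as well."""
    original = normalize(strip_tags(markup))
    return original, original.lower()
```

Calling clean() on a fragment of a downloaded page gives you both versions at once, so the lowercased text never replaces the original.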
The Macedonian texts were probably decoded with the wrong character encoding, so they contain invalid characters. You'll need to replace:
- Latin o with њ
- Latin f with ѓ
- Latin Y with џ
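The three replacements above fit nicely into a translation table. A minimal sketch, assuming the text is already a Python string and that you only run it on the Macedonian side (the function name is hypothetical):

```python
# Mapping from the list above: stray Latin letter -> intended Cyrillic letter.
FIXES = str.maketrans({"o": "њ", "f": "ѓ", "Y": "џ"})


def fix_macedonian(text):
    """Replace the three stray Latin letters with their Cyrillic counterparts."""
    return text.translate(FIXES)
```

Keep in mind that this replaces every Latin o, f and Y it sees, so any genuine Latin-script words in the Macedonian text (names, abbreviations) will be damaged as well. Cyrillic о is a different code point and is left alone.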
Some texts in Macedonian (and probably other languages as well) are not translated, so you'll have to remove them. To identify such texts, an
n-gram language classifier would do the trick. There is one in MorphAdorner, a separate implementation in
libTextCat, or you can write one yourself.
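If you do want to write one yourself, here is a minimal sketch of a character n-gram classifier using the rank-order ("out-of-place") idea that TextCat-style tools are known for. It is not the MorphAdorner or libTextCat code; the function names, the trigram size, the profile length and the fixed penalty of 300 are all my own assumptions.

```python
from collections import Counter


def profile(text, n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]


def distance(doc_profile, lang_profile, penalty=300):
    """Sum of rank differences; n-grams unknown to the language cost `penalty`."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    return sum(abs(rank - ranks[gram]) if gram in ranks else penalty
               for rank, gram in enumerate(doc_profile))


def classify(text, lang_profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: distance(doc, lang_profiles[lang]))
```

You would build one profile per language from a chunk of known-good text, then drop every "Macedonian" article that classify() labels as English. For Macedonian versus English the script alone gives it away, but the same approach should carry over to language pairs that share an alphabet, given enough training text.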
Next: tokenization. Again, there are ready-made tools for it: MorphAdorner, or tokenizer.perl in Europarl's tools.
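For a quick experiment, a one-line regular expression gets you surprisingly far. This is a naive sketch and not what tokenizer.perl does: it treats every run of letters or digits as a word and every other non-space character as its own token, so abbreviations, hyphenated words and decimal numbers get split.

```python
import re

# A word is a run of letters/digits; anything else that is not whitespace
# becomes a single-character token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)
```

In Python 3 the \w class is Unicode-aware for str patterns, so Cyrillic words come out intact.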
After XML conversion, the corpus looks like the following:
<div type="article" n="3" xml:id="SETmk.3">
<p>
<s xml:id="SETmk.3.1"><w>Македонските</w> <w>партии</w> <w>од</w> <w>опозицијата</w> <w>ја</w> <w>бојкотираа</w> <w>средбата</w> <w>со</w> <w>владата</w> <c>..</c> </s>
<s xml:id="SETmk.3.2"><w>СКОПЈЕ</w> <c>,</c> <w>Македонија</w> <c>-</c> <w>Лидерите</w> <w>на</w> <w>најголемите</w> <w>партии</w> <w>од</w> <w>опозицијата</w> <w>во</w> <w>Македонија</w> <c>,</c> <w>Социјал</w> <w>демократскиот</w> <w>сојуз</w> <w>на</w> <w>Македонија</w> <w>и</w> <w>Демократската</w> <w>унија</w> <w>за</w> <w>интеграција</w> <c>(</c> <w>ДУИ</w> <c>)</c> <c>-</c> <w>Радмила</w> <w>Шекеринска</w> <w>и</w> <w>Али</w> <w>Ахмети</w> <c>-</c> <w>не</w> <w>присуствуваа</w> <w>на</w> <w>работниот</w> <w>појадок</w> <w>со</w> <w>владините</w> <w>лидери</w> <w>и</w> <w>владејачките</w> <w>партии</w> <w>во</w> <w>понеделникот</w> <c>(</c> <w type="dig">25</w> <w>декември</w> <c>)</c> <c>.</c> </s>
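Markup like the above can be generated from a list of tokens with a few lines of Python. A minimal sketch, assuming the tag conventions seen in the excerpt (w for words, c for punctuation, type="dig" for numbers); the function name is made up and this is not the actual conversion script.

```python
from xml.sax.saxutils import escape, quoteattr


def sentence_to_xml(sent_id, tokens):
    """Wrap each token in <w> or <c> and the whole sentence in <s>."""
    parts = []
    for tok in tokens:
        text = escape(tok)  # protect &, < and > inside tokens
        if tok.isdigit():
            parts.append('<w type="dig">' + text + "</w>")
        elif any(ch.isalnum() for ch in tok):
            parts.append("<w>" + text + "</w>")
        else:
            parts.append("<c>" + text + "</c>")
    return "<s xml:id=" + quoteattr(sent_id) + ">" + " ".join(parts) + " </s>"
```

The ID scheme (corpus name, article number, sentence number) is simply passed in as a string, so you can number the sentences any way you like.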
Easy so far :)
Depending on the number of texts removed from the corpus, the total token count per language could vary. Initially, the corpus contains around 5.5M tokens per language. My filtered corpus contains 3.6M tokens in Macedonian, and 3.4M tokens in the English version. Of those, 83% (~3M) are words, and the rest are various punctuation characters.
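To check figures like these on your own filtered corpus, counting words versus punctuation is straightforward once the text is tokenized. A small sketch, assuming that a "word" is any token containing at least one letter or digit (the function name is hypothetical):

```python
def word_share(tokens):
    """Return (word count, punctuation count, fraction of tokens that are words)."""
    words = sum(1 for tok in tokens if any(ch.isalnum() for ch in tok))
    punctuation = len(tokens) - words
    share = words / len(tokens) if tokens else 0.0
    return words, punctuation, share
```

As a sanity check on the numbers above, 0.83 × 3.6M is roughly 3.0M words for Macedonian and 0.83 × 3.4M is roughly 2.8M for English.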
* Edit: added link to libTextCat, as mentioned by Unhammer.