This HOWTO will explain how to train an MBMT system from scratch.
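You need two data sets and a test set: a corpus for the language model (reuters.txt in this HOWTO), an aligned file for the translation system (sv-en.A3), and a test set to translate (tst.txt).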
We'll train the language model (LM) first. This takes three steps: creating a lexicon, windowing the corpus into instances, and building the instance base.
We assume our corpus is called reuters.txt. The following steps create instances, with a left context of three words, which we can train on with Timbl.
wopr -r lexicon -p filename:reuters.txt
This will create a lexicon file and a counts file used for smoothing.
wopr -r window_s -p filename:reuters.txt,ws:3
This will create the instances with a left context of three words. Finally, we create the instance base:
wopr -r make_ibase -p filename:reuters.txt.ws3,timbl:"-a1 +D"
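To make the windowing step more concrete, the following sketch imitates the idea with awk: each word becomes the target of one instance, preceded by its three left-context words. The function name window3, the "_" padding symbol, the space-separated layout, and the per-line reset of the context are assumptions made for illustration; the files written by wopr may differ in detail.

```shell
# Illustration only, not a replacement for "wopr -r window_s".
# Emits one instance per word: three left-context words, then the target.
# Assumptions: "_" pads missing context, and the context restarts on every line.
window3() {
    awk '{
        a = "_"; b = "_"; c = "_"        # left context, reset for each input line
        for (i = 1; i <= NF; i++) {
            print a, b, c, $i            # features a b c, target $i
            a = b; b = c; c = $i         # shift the window one word to the right
        }
    }'
}

echo "the cat sat down" | window3
```

For the four-word example line this prints four instances, the second of which is "_ _ the cat".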
The LM server should be running in the background during the decoder step (mbmt-decode). It is possible to run the MBMT system without a language model. In that case, edit mbmt-decode.c, change #define USEWOPR 1 to #define USEWOPR 0, and recompile everything.
To start the server, do the following:
wopr -r server2 -p ibasefile:reuters.txt.ws3.ibase,ws:3,timbl:"-a1 +D",port:1982,sentence:0,output:0,lexicon:reuters.txt.lex
Note the port parameter. Wopr should run on the same port as specified in mbmt-decode.c:
#define MACHINE "localhost"
#define PORT "1982"
If you want to run Wopr on another machine, the MACHINE setting in mbmt-decode.c should be adjusted as well.
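Because the host and port appear both in the Wopr server command and in mbmt-decode.c, it can help to read them back from the source file before starting the server. The helper below is a minimal sketch; the name get_define is hypothetical, and it assumes the definitions are written on one line each with plain double quotes, as shown above.

```shell
# Hypothetical helper: print the quoted string value of a #define line.
#   $1 = macro name (for example PORT), $2 = path to the C source file
get_define() {
    sed -n 's/^#define[[:space:]]\{1,\}'"$1"'[[:space:]]\{1,\}"\([^"]*\)".*/\1/p' "$2" | head -n 1
}

# Usage (assumes mbmt-decode.c is in the current directory):
#   port=$(get_define PORT mbmt-decode.c)
#   machine=$(get_define MACHINE mbmt-decode.c)
#   echo "mbmt-decode expects Wopr on $machine:$port"
```

The printed port can then be compared with the port: value passed to wopr.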
Now we'll prepare the memory-based translation system. First, we create the instances. We assume the aligned file is called sv-en.A3, and the test set is called tst.txt.
mbmt-create-training sv-en.A3 > sv-en.A3.111.inst
mbmt-create-test tst.txt > tst.txt.111.inst
Training the translation system and translating the test set are done with the following command:
Timbl -f sv-en.A3.111.inst -I sv-en.A3.111.inst.ibase -a1 +vdb+di +D -Beam=1 -t tst.txt.111.inst -o tst.txt.111.inst.out
It is also possible to do this in two separate steps. First the training step:
Timbl -f sv-en.A3.111.inst -I sv-en.A3.111.inst.ibase -a1 +vdb+di +D
And then the testing step:
Timbl -i sv-en.A3.111.inst.ibase -a1 +vdb+di +D -Beam=1 -t tst.txt.111.inst -o tst.txt.111.inst.out
Finally, we'll create sentences in the target language from the Timbl output:
mbmt-decode tst.txt.111.inst.out > tst.txt.out
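The translation commands above all derive their file names from the aligned file and the test set. As a convenience for other file names, the following sketch prints (it does not run) the same commands for a given pair of files. The function name mbmt_commands is hypothetical; the commands and options are copied from this HOWTO, using the combined train-and-test Timbl call.

```shell
# Hypothetical helper: print the MBMT commands from this HOWTO
# for an aligned training file ($1) and a test file ($2).
# Nothing is executed; review the output before running it.
mbmt_commands() {
    train_inst="$1.111.inst"
    test_inst="$2.111.inst"
    printf 'mbmt-create-training %s > %s\n' "$1" "$train_inst"
    printf 'mbmt-create-test %s > %s\n' "$2" "$test_inst"
    printf 'Timbl -f %s -I %s.ibase -a1 +vdb+di +D -Beam=1 -t %s -o %s.out\n' \
        "$train_inst" "$train_inst" "$test_inst" "$test_inst"
    printf 'mbmt-decode %s.out > %s.out\n' "$test_inst" "$2"
}

mbmt_commands sv-en.A3 tst.txt
```

Called with sv-en.A3 and tst.txt, it reproduces the four commands used in this HOWTO. Remember that the Wopr server must be running before the mbmt-decode step, unless USEWOPR is set to 0.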