==== MBMT howto ====

This HOWTO explains how to train an MBMT system from scratch.

=== 1. Requirements ===

Two data sets and a test set are required:

  * A GIZA++-aligned file ("A3" format) to train the translation system.
  * A corpus in the target language for the language model (plain text, one sentence per line).
  * A test set in the source language (one sentence per line).

=== 2. Language model ===

We'll train the language model (LM) first. The following steps will be performed:

  - Create instances
  - Train instance base
  - Start server

We assume our corpus is called ''reuters.txt''. The following steps create the instances (with a left context of three words) on which ''Timbl'' can be trained.

''wopr -r lexicon -p filename:reuters.txt''

This creates a lexicon file and a counts file used for smoothing.

''wopr -r window_s -p filename:reuters.txt,ws:3''

This creates the instances with a left context of three words. Finally, we create the instance base:

''wopr -r make_ibase -p filename:reuters.txt.ws3,timbl:"-a1 +D"''

The LM server should be running in the background during the decoder step (''mbmt-decode''). It is possible to run the MBMT system without a language model. In that case, edit ''mbmt-decode.c'', change ''#define USEWOPR 1'' to ''#define USEWOPR 0'', and recompile everything.

To start the server, do the following:

''wopr -r server2 -p ibasefile:reuters.txt.ws3.ibase,ws:3,timbl:"-a1 +D",port:1982,sentence:0,output:0,lexicon:reuters.txt.lex''

Note the ''port'' parameter. Wopr must listen on the same port as specified in ''mbmt-decode.c'':

  #define MACHINE "localhost"
  #define PORT "1982"

If you want to run Wopr on another machine, the ''MACHINE'' setting in ''mbmt-decode.c'' should be adjusted as well.

=== MBMT ===

Now we'll prepare the memory-based translation system, starting with the instances. We assume the aligned file is called ''sv-en.A3'' and the test set is called ''tst.txt'':

''mbmt-create-training sv-en.A3 > sv-en.A3.111.inst''

''mbmt-create-test tst.txt > tst.txt.111.inst''

=== Training and translating ===

Training the translation system and translating the test set are done with a single command:

''Timbl -f sv-en.A3.111.inst -I sv-en.A3.111.inst.ibase -a1 +vdb+di +D -Beam=1 -t tst.txt.111.inst -o tst.txt.111.inst.out''

It is also possible to do this in two separate steps. First the training step:

''Timbl -f sv-en.A3.111.inst -I sv-en.A3.111.inst.ibase -a1 +vdb+di +D''

And then the testing step:

''Timbl -i sv-en.A3.111.inst.ibase -a1 +vdb+di +D -Beam=1 -t tst.txt.111.inst -o tst.txt.111.inst.out''

=== Decoding ===

Finally, we'll create sentences in the target language from the ''Timbl'' output:

''mbmt-decode tst.txt.111.inst.out > tst.txt.out''
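
The steps in this HOWTO can be chained in one shell script. The following is a minimal sketch, not part of the MBMT distribution. The commands, file names and port are the ones used above; the ''run'' and ''run_to'' helpers, the ''DRY_RUN'' switch and the variable names are assumptions made for illustration. It also assumes that ''wopr'', ''Timbl'' and the ''mbmt-*'' tools are on the ''PATH'', and that the Wopr server has finished loading by the time ''mbmt-decode'' runs.

```shell
#!/bin/sh
# Sketch of the whole HOWTO pipeline. By default (DRY_RUN=1) every command
# is only printed; set DRY_RUN=0 to actually execute the commands.
set -eu

CORPUS="${CORPUS:-reuters.txt}"   # target-language corpus for the LM
ALIGNED="${ALIGNED:-sv-en.A3}"    # GIZA++ aligned training file
TESTSET="${TESTSET:-tst.txt}"     # source-language test set
PORT="${PORT:-1982}"              # must match PORT in mbmt-decode.c
DRY_RUN="${DRY_RUN:-1}"

# run CMD ARGS... : print the command, execute it unless DRY_RUN=1.
# (Hypothetical helper; the printed line omits shell quoting.)
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" != 1 ]; then "$@"; fi
}

# run_to FILE CMD ARGS... : like run, but redirect stdout to FILE.
run_to() {
    out=$1; shift
    printf '+ %s > %s\n' "$*" "$out"
    if [ "$DRY_RUN" != 1 ]; then "$@" > "$out"; fi
}

# Language model: lexicon, windowed instances, instance base.
run wopr -r lexicon -p "filename:${CORPUS}"
run wopr -r window_s -p "filename:${CORPUS},ws:3"
run wopr -r make_ibase -p "filename:${CORPUS}.ws3,timbl:-a1 +D"

# Start the LM server in the background; mbmt-decode connects to it.
# Assumption: it is ready before the decoding step; add a wait if not.
SERVER_ARGS="ibasefile:${CORPUS}.ws3.ibase,ws:3,timbl:-a1 +D,port:${PORT},sentence:0,output:0,lexicon:${CORPUS}.lex"
printf '+ wopr -r server2 -p %s &\n' "$SERVER_ARGS"
if [ "$DRY_RUN" != 1 ]; then
    wopr -r server2 -p "$SERVER_ARGS" &
fi

# MBMT: create training and test instances.
TRAIN="${ALIGNED}.111.inst"
TEST="${TESTSET}.111.inst"
run_to "$TRAIN" mbmt-create-training "$ALIGNED"
run_to "$TEST" mbmt-create-test "$TESTSET"

# Train the translation system and translate the test set in one go.
run Timbl -f "$TRAIN" -I "${TRAIN}.ibase" -a1 +vdb+di +D -Beam=1 \
    -t "$TEST" -o "${TEST}.out"

# Decode the Timbl output into target-language sentences.
run_to "${TESTSET}.out" mbmt-decode "${TEST}.out"
```

Saved under a name of your choice, for example ''mbmt-pipeline.sh'', running ''sh mbmt-pipeline.sh'' prints the command sequence for inspection, and ''DRY_RUN=0 sh mbmt-pipeline.sh'' executes it. The script leaves the Wopr server running afterwards, so it has to be stopped by hand.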