MBMT howto

This HOWTO will explain how to train an MBMT system from scratch.

1. Requirements

Two data sets and a test set:

  • A GIZA++ aligned file (“A3” format) to train the translation system.
  • A corpus in the target language for the language model (plain text, one sentence per line).
  • A test set in the source language (one sentence per line).
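
Before training, it can help to verify that the three inputs are in place. The following is a minimal sketch, not part of MBMT: check_inputs is a hypothetical helper, and the A3 check assumes the usual GIZA++ layout of three lines per sentence pair, the first starting with "# Sentence pair".

```shell
# Sanity-check the three inputs (sketch; check_inputs is a hypothetical helper).
# Usage: check_inputs ALIGNED.A3 CORPUS.txt TEST.txt
check_inputs() {
    a3=$1; corpus=$2; tst=$3
    for f in "$a3" "$corpus" "$tst"; do
        # -s: the file exists and is not empty
        [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
    done
    # Assumption: each sentence pair in the A3 file takes three lines,
    # and the first of the three is a "# Sentence pair ..." header.
    awk 'NR % 3 == 1 && $0 !~ /^# Sentence pair/ { bad = 1 }
         END { exit (bad || NR % 3 != 0) }' "$a3" \
        || { echo "unexpected A3 layout: $a3" >&2; return 1; }
    # One sentence per line, so the line count is the sentence count.
    echo "test sentences: $(wc -l < "$tst" | tr -d ' ')"
}
```

With the file names used later in this HOWTO, the call would be: check_inputs sv-en.A3 reuters.txt tst.txt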

2. Language model

We'll train the language model (LM) first. The following steps will be performed:

  1. Create instances
  2. Train instance base
  3. Start server

We assume our corpus is called reuters.txt. The following steps create instances with a left context of three words, on which we can then train Timbl.

wopr -r lexicon -p filename:reuters.txt

This will create a lexicon file and a counts file used for smoothing.

wopr -r window_s -p filename:reuters.txt,ws:3

This will create the instances with a left context of three words. Finally, we create the instance base:

wopr -r make_ibase -p filename:reuters.txt.ws3,timbl:"-a1 +D"
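
The three Wopr steps can be collected in one small helper. This is only a sketch: lm_train_commands is a hypothetical function that prints the commands shown in this section for a given corpus and window size. It assumes that window_s names its output CORPUS.wsN for ws:N (the text only shows this for N = 3) and that file names contain no spaces.

```shell
# Print the Wopr commands that build the language model (sketch).
# Usage: lm_train_commands CORPUS WINDOW_SIZE
lm_train_commands() {
    corpus=$1; ws=$2
    # Step 1: lexicon and counts files (used for smoothing).
    printf 'wopr -r lexicon -p filename:%s\n' "$corpus"
    # Step 2: instances with a left context of $ws words.
    printf 'wopr -r window_s -p filename:%s,ws:%s\n' "$corpus" "$ws"
    # Step 3: instance base, trained on the windowed file
    # (assumed to be named CORPUS.wsN).
    printf 'wopr -r make_ibase -p filename:%s.ws%s,timbl:"-a1 +D"\n' "$corpus" "$ws"
}
```

To run the steps instead of printing them, the output can be piped to a shell that stops at the first failure: lm_train_commands reuters.txt 3 | sh -e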

The LM server must be running in the background during the decoding step (mbmt-decode). It is also possible to run the MBMT system without a language model. In that case, edit mbmt-decode.c, change #define USEWOPR 1 to #define USEWOPR 0, and recompile everything.
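
The USEWOPR switch can also be flipped from the command line rather than in an editor. The helper below is a hypothetical sketch, not part of MBMT; it assumes the define sits on a line of its own and writes the modified source to standard output, leaving the original file untouched.

```shell
# Print a C source file with USEWOPR set to the given value (sketch).
# Usage: set_usewopr 0 mbmt-decode.c > mbmt-decode.c.new
set_usewopr() {
    value=$1; src=$2
    # Rewrite only the "#define USEWOPR <n>" line; print all other lines as-is.
    awk -v v="$value" '
        $1 == "#define" && $2 == "USEWOPR" { $3 = v }
        { print }
    ' "$src"
}
```

After checking the result, replace mbmt-decode.c with the new file and recompile as described above.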

To start the server, do the following:

wopr -r server2 -p ibasefile:reuters.txt.ws3.ibase,ws:3,timbl:"-a1 +D",port:1982,sentence:0,output:0,lexicon:reuters.txt.lex

Note the port parameter. Wopr must listen on the same port as the one specified in mbmt-decode.c:

#define MACHINE "localhost"

#define PORT "1982"

If you want to run Wopr on another machine, the MACHINE setting in mbmt-decode.c should be adjusted as well.
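
Because the port is set in two places, a mismatch is easy to introduce. The sketch below reads the value back from the source file and compares it with the port intended for Wopr. Both function names are hypothetical, and the defines are assumed to be written as quoted strings on a single line, as shown above.

```shell
# Print the quoted string value of a #define in a C source file (sketch).
# Usage: define_value PORT mbmt-decode.c
define_value() {
    name=$1; src=$2
    sed -n "s/^#define[[:space:]]*$name[[:space:]]*\"\(.*\)\".*/\1/p" "$src"
}

# Succeed only if the given Wopr port matches the decoder's PORT define.
# Usage: check_port 1982 mbmt-decode.c
check_port() {
    want=$1; src=$2
    have=$(define_value PORT "$src")
    [ "$want" = "$have" ] || {
        echo "port mismatch: wopr uses $want, decoder expects $have" >&2
        return 1
    }
}
```

The same define_value helper can be used to look up MACHINE when Wopr runs on another host.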

3. MBMT

Now we'll prepare the memory-based translation system, starting with the instances. We assume the aligned file is called sv-en.A3, and the test set is called tst.txt.

mbmt-create-training sv-en.A3 > sv-en.A3.111.inst

mbmt-create-test tst.txt > tst.txt.111.inst

Training and translating

Training the translation system and translating the test set can be done with a single command:

Timbl -f sv-en.A3.111.inst -I sv-en.A3.111.inst.ibase -a1 +vdb+di +D -Beam=1 -t tst.txt.111.inst -o tst.txt.111.inst.out

It is also possible to do this in two separate steps. First the training step:

Timbl -f sv-en.A3.111.inst -I sv-en.A3.111.inst.ibase -a1 +vdb+di +D

And then the testing step:

Timbl -i sv-en.A3.111.inst.ibase -a1 +vdb+di +D -Beam=1 -t tst.txt.111.inst -o tst.txt.111.inst.out
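
The two-step variant pays off when several test sets are translated with the same instance base, since training then has to run only once. As a sketch, the hypothetical helper below prints the training command only when the instance base file does not exist yet. It assumes the naming used above (TRAIN.ibase, TEST.out) and file names without spaces.

```shell
# Print the Timbl command(s) needed to translate a test instance file (sketch).
# Usage: timbl_commands TRAIN.inst TEST.inst
timbl_commands() {
    train=$1; tst=$2
    ibase=$train.ibase
    opts='-a1 +vdb+di +D'
    # Training step: only needed if the instance base has not been saved yet.
    if [ ! -f "$ibase" ]; then
        printf 'Timbl -f %s -I %s %s\n' "$train" "$ibase" "$opts"
    fi
    # Testing step: load the saved instance base and classify the test instances.
    printf 'Timbl -i %s %s -Beam=1 -t %s -o %s.out\n' "$ibase" "$opts" "$tst" "$tst"
}
```

As with any command printer, the output can be inspected first and then piped to sh -e to execute it.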

Decoding

Finally, we'll create sentences in the target language from the Timbl output:

mbmt-decode tst.txt.111.inst.out > tst.txt.out
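
For reference, the whole MBMT part of the pipeline can be summarised in one place. The function below is a hypothetical sketch that prints the commands from this HOWTO for a given aligned file and test set. It reuses the .111.inst naming from the examples without interpreting it, uses the combined Timbl command, and assumes file names without spaces. Unless USEWOPR is set to 0, the Wopr server must already be running when the last command is executed.

```shell
# Print the MBMT commands from instance creation to decoding (sketch).
# Usage: mbmt_commands ALIGNED.A3 TEST.txt
mbmt_commands() {
    a3=$1; tst=$2
    train_inst=$a3.111.inst
    test_inst=$tst.111.inst
    # Create training and test instances.
    printf 'mbmt-create-training %s > %s\n' "$a3" "$train_inst"
    printf 'mbmt-create-test %s > %s\n' "$tst" "$test_inst"
    # Train the instance base and classify the test instances in one go.
    printf 'Timbl -f %s -I %s.ibase -a1 +vdb+di +D -Beam=1 -t %s -o %s.out\n' \
        "$train_inst" "$train_inst" "$test_inst" "$test_inst"
    # Turn the Timbl output into target-language sentences.
    printf 'mbmt-decode %s.out > %s.out\n' "$test_inst" "$tst"
}
```

Because the printed commands contain redirections, they have to be run through a shell, for example: mbmt_commands sv-en.A3 tst.txt | sh -e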

 
uvt/mbmt_howto.txt · Last modified: 2009/02/08 11:35 by peter
 