Broadening statistical machine translation with comparable corpora and generalized models

author: Chris Quirk, Microsoft Research
recorded by: Center for Language and Speech Processing
published: Feb. 15, 2012,   recorded: November 2008,   views: 2954

Related Open Educational Resources

Related content

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.


As we scale statistical machine translation systems to general domain, we face many challenges. This talk outlines two approaches for building better broad-domain systems.

First, progress in data-driven translation is limited by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting fragments, and demonstrate that these algorithms produce substantial improvements on out-of-domain test data without suffering in-domain degradation.

Second, many modern SMT systems are very heavily lexicalized. While such information excels on in-domain test data, quality falls off as the test data broadens. This next section of the talk describes robust generalized models that leverage lexicalization when available, and back off to linguistic generalizations otherwise. Such an approach results in large improvements over baseline phrasal systems when using broad domain test sets.

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Write your own review or comment:

make sure you have javascript enabled or clear this field: