Notes on Machine Translation

This is my notes on Machine Translation. I have no background in NLP and I am slowly learning Machine Translation by myself. I am trying to put together quality introduction materials from different sources and construct a logical introduction to Machine Translation for people with no NLP background. Only very basic probability background is assumed. References (This post is based on the materials from the following sources)

  1.  (An intro tutorial by Prof. Kevin Knight from USC)
  2. (A set of slides on MT from Prof. Philipp Koehn, University of Edinburgh)

This post is going to start from the basics of MT and gradually go deeper to survey the more advanced topics in Machine Translation. In this post, I will focus on statistical machine translation. (Here we are going to use basic probability) We start with an example to show the goal and approach of machine translation. Given a french sentence, f, we are trying to find the english sentence, e,  that best translate the sentence. That is we are trying to find e that maximizes P(e|f) Since, P(e|f) = P(e)*P(f|e) / P(f) In order to find P(e|f), we need to find P(e) and P(f|e). We can ignore P(f) because it is the same for all P(e) and P(f/e). It is essentially a normalization constant. There are three important components to SMT.

  1. Language Model (models for estimating P(e)
  2. Translation Model (models for estimating P(f|e)
  3. Decoder (maximizes P(e)*P(f|e))

An extra interesting part is Automatic Evaluation using Bleu. I will discuss the evaluation approaches if I have more time left for this post.

1. Language Model

Language Model describes the likelihood that we see the sentence e in an english corpus. We will start with the language model that try to estimate P(e). P(e) is a prior probability that tries to describe how likely the english sentence, “e”, is to appear in the english language. For example, a grammatically correct english sentence should have a higher probability than a grammatically incorrect sentence, P(Ecorrect) > P (Ewrong). In effect, P(e) worries about the order of the sentence. To model P(e), it is common to create a “Language Model” for the language. In this case, we are going to try to create a model for English. A common model is using N-gram model that breaks down a sentence into “components of substrings”. For example, “I like” is a bigram, and “I” is a unigram, or word. Research has shown Tri-gram model has the best performance. We calculate the probability of x, following y,z in the following format P(x|y, z). For example, “I like ” + x , the probability of x being “snake” is P(“snake” | “I”, “Like”). We know that P(x|yz) = p(x,y,z)/p(y,z). As a side note, smoothing is sometimes applied to p(z|xy) in the case that it becomes zero. We don’t want the probability of a certain P(f|e) to be zero. However, we need ways to show that Tri-gram model is the best. To do this, we need to perform a form of model selection. The model should have the maximum P(model | test-data). Using Bayes Rule, P(model | test-data) = P(model) * P(test-data|model)/P(data). Assuming that P(model) is uniform, the best model maximizes p(test-data|model). Tri-gram model should have a higher P(test-data|Tri-gram) than P(test-data|bigram) to show that Tri-gram is better than Bi-gram model.

2. Translation Modeling

Translation model describes the likelihood that the english sentence, e, translates to the French sentence, f. There are several translation models, including

  1. Formal logic model (Sentences gets converted into predicate logic, or perhaps a conjunction of atomic logical assertions).
  2. Syntax based models. (Sentences gets parsed into parse trees. Each node represents a subject/verb, adjective/noun …)
  3. Word based models (IBM model 3)
  4. Phrase based models
  5. Discriminative Training

we are going to focus on the Word-based and Phrase-based models for now.

2.1 Word based models (IBM model 3)

The high level idea is that each word in English is mapped to a certain number of words in French. The mapped words are then scrambled around to form the French sentence. To define this general idea in more precise manner, we introduce the following parameters

  • Fertility: Phi-i is the fertility of the ith word in English, ei. This function outputs the number of french words associated with the ith english word, ei.  The probability for an english word with fertility phi-i  is described by n(phi-i|ei)
  • Choose the number phi-0 of “spurious” French words and mapped to the NULL word that is prefixed to every sentence (NULL I like snake). e0 = NULL. phi-0 = number of spurious words.
  • Let m be the sum of fertilities for all words, including NULL. It also represents the length of the French sentence. length(f)
  • Let l be the length of the english sentence. length(e)
  • For each i = 0, 1, 2, …, l and each k = 1, 2, …, phi-i, choose a French word tau-ik with probability t(tau-ik|ei).
  • For each i = 1, 2, …, l, and each k = 1, 2, …, phi-i, choose target French position pi-ii with probability d(pi-ik|i,l,m). This is called a distortion parameter. 

2.1.2 Word Alignment and Alignment Probabilities in word based models Word alignment aligns a each word in english to 0 or more words in the french sentence. (Given a matching English and French sentence). Word alignment can be used to generate parameters n, t, p, d automatically. However, there are multiple ways to match english words to french words. Given a pair of sentences, multiple alignments can be produced. Assume we know the probability distribution of each alignment, a, for a matching english sentence e, with a french sentence, f. The alignment probabilities is described by P(a | e, f)). P(a|e,f) = P(a,f|e)/P(f|e) = P(a,f|e)/sum-over-a(P(a,f|e)). As a result, the alignment probability problem is reduced to finding a formula to calculate P(a,f|e). formular-for-p(a,f|e) (a picture is shown the complex formula that involves using the parameters we described previously, from the first reference tutorial by Kevin Knight).  2.1.3 Estimating the parameters using EM (Estimation Maximization) Problem Statement: We need the parameters to calculate the alignment probabilities. However, to estimate the parameters, we need to have the alignment probabilities. Solution: Estimation Maximization The high level idea for Estimation Maximization used here can be summarized in the following key points

  1. initialize model parameters (e.g. uniform)
  2. assign probabilities to the missing data (the alignment)
  3. estimate model parameters from completed data (with each alignment)
  4. iterate step 2 and step 3 until it converges.

At the end of the EM algorithm, hidden structure/mapping relationships are revealed by EM. We also have decent parameter estimations at the end of the EM algorithm.  2.1.4 Practical problems with the training using EM reordering and one to many mapping can be impossible to compute. As a result, we have to run on simplified models and then try to transfer parameters from simplified models to more complicated models.

2.2 Phrase based models

Motivation: to include many-to-many mappings, “phrase mappings”. Each phrase is translated and can be reordered. It uses local context in translation. Additionally, longer phrases can be learned with more data. Since we have less information source on the Phrase based models, we are going to focus more on high level ideas rather than detailed parameter description described in section 2.1. 2.2.2 Word Alignment and Alignment Probabilities in word based models We can construct a table of size l x m.  with the first column being the english sentence. Each row represents a word in the english sentence and each column represents a word in French. Then we construct alignment that if it goes in the

  • horizontal direction: than we are mapping one english word to many french words
  • vertical direction: mapping multiple english words to the same french word

This alignment is still a representation for the old word-based alignment. However, we can create square brackets around certain sections of the matrix to isolate out phrases. Slide 103- 108 ( ) shows how word alignments are grouped into longer and longer phrases to generate the final translation. In the end, there is a reordering phase to make sure the phrases are put together in the right order.

2.3 Syntax-based SMT


  • Incorporating more grammar information into the model
    • Reordering: only syntactical sound reordering
    • Function words: prepositions, determiners
    • Conditioning syntactically related words: verb may depend on subject or object
  • Generate outputs that follow grammatical rules

The high level ides is to match french words to english syntax trees. However, the challenge here is that foreign string might not match english syntax tress in the first place.

2.4 Hierarchical Phrase Model

This entry was posted in NLP. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s