Joint Language and Translation Modeling with Recurrent Neural Networks

Michael Auli, Michel Galley, Chris Quirk, Geoffrey Zweig
Microsoft Research
Redmond, WA, USA
{michael.auli,mgalley,chrisq,gzweig}@microsoft.com
Abstract
We present a joint language and translation model based on a recurrent neural network which predicts target words based on an unbounded history of both source and target words. The weaker independence assumptions of this model result in a vastly larger search space compared to related feed-forward-based language or translation models. We tackle this issue with a new lattice rescoring algorithm and demonstrate its effectiveness empirically. Our joint model builds on a well-known recurrent neural network language model (Mikolov, 2012) augmented by a layer of additional inputs from the source language. We show competitive accuracy compared to the traditional channel model features. Our best results improve the output of a system trained on WMT 2012 French-English data by up to 1.5 BLEU, and by 1.1 BLEU on average across several test sets.
1 Introduction
Recently, several feed-forward neural network-based language and translation models have achieved impressive accuracy improvements on statistical machine translation tasks (Allauzen et al., 2011; Le et al., 2012b; Schwenk et al., 2012). In this paper we focus on recurrent neural network architectures, which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011a; Mikolov, 2012), outperforming multi-layer feed-forward based networks in both perplexity and word error rate in speech recognition (Arisoy et al., 2012; Sundermeyer et al., 2013). The major attraction of recurrent architectures is their potential to capture long-span dependencies, since predictions are based on an unbounded history of previous words. This is in contrast to feed-forward networks as well as conventional n-gram models, both of which are limited to fixed-length contexts.
Building on the success of recurrent architectures, we base our joint language and translation model on an extension of the recurrent neural network language model (Mikolov and Zweig, 2012) that introduces a layer of additional inputs (§2).
Most previous work on neural networks for speech recognition or machine translation used a rescoring setup based on n-best lists (Arisoy et al., 2012; Mikolov, 2012) for evaluation, thereby sidestepping the algorithmic and engineering challenges of direct decoder integration.1 Instead, we exploit lattices, which offer a much richer representation of the decoder output, since they compactly encode an exponential number of translation hypotheses in polynomial space. In contrast, n-best lists are typically very redundant, representing only a few combinations of top scoring arcs in the lattice. A major challenge in lattice rescoring with a recurrent neural network model is the effect of the unbounded history on search, since the usual dynamic programming assumptions which are exploited for efficiency no longer hold.
We apply a novel algorithm to the task of rescoring with an unbounded language model and empirically demonstrate its effectiveness (§3). The algorithm proves robust, leading to significant improvements with the recurrent neural network language model over a competitive n-gram baseline across several language pairs. We even observe consistent gains when pairing the model with a large n-gram model trained on up to 575 times more data, demonstrating that the model provides complementary information (§4).
Our joint modeling approach is based on adding a continuous space representation of the foreign sentence as an additional input to the recurrent neural network language model. With this extension, the language model can measure the consistency between the source and target words in a context-sensitive way. The model effectively combines the functionality of both the traditional channel and language model features. We test the power of this new model by using it as the only source of traditional channel information. Overall, we find that the model achieves accuracy competitive with the older channel model features and that it can improve over the gains observed with the recurrent neural network language model (§5).

1 One notable exception is Le et al. (2012a), who rescore reordering lattices with a feed-forward network-based model.
2 Model Structure
We base our model on the recurrent neural network language model of Mikolov et al. (2010), which is factored into an input layer, a hidden layer with recurrent connections, and an output layer (Figure 1). The input layer encodes the target language word at time t as a 1-of-|V| vector $e_t$, where |V| is the size of the vocabulary, and the output layer $y_t$ represents a probability distribution over target words; both are of size |V|. The hidden layer state $h_t$ encodes the history of all words observed in the sequence up to time step t.
This model is extended by an auxiliary input layer $f_t$ which provides complementary information to the input layer (Mikolov and Zweig, 2012). While the auxiliary input layer can be used to feed in arbitrary additional information, we focus on encodings of the foreign sentence (§5). The state of the hidden layer is determined by the input layer, the auxiliary input layer and the hidden layer configuration of the previous time step $h_{t-1}$.
The weights of the connections between the layers are summarized in a number of matrices: U, F and W represent weights from the input layer to the hidden layer, from the auxiliary input layer to the hidden layer, and from the previous hidden layer to the current hidden layer, respectively. Matrix V represents connections between the current hidden layer and the output layer; G represents direct weights between the auxiliary input and output layers.
Figure 1: Structure of the recurrent neural network model, including the auxiliary input layer $f_t$. (Diagram labels: inputs $e_t$, $h_{t-1}$, $f_t$; hidden layer $h_t$; output layer $y_t$; weights U, W, F, V, G, D.)
The hidden and output layers are computed via a series of matrix-vector products and non-linearities:

$$h_t = s(U e_t + W h_{t-1} + F f_t)$$
$$y_t = g(V h_t + G f_t)$$

where

$$s(z) = \frac{1}{1 + \exp\{-z\}}, \qquad g(z_m) = \frac{\exp\{z_m\}}{\sum_k \exp\{z_k\}}$$

are sigmoid and softmax functions, respectively. Additionally, the network is interpolated with a maximum entropy model of sparse n-gram features over input words (Mikolov et al., 2011a).2 The maximum entropy weights are added to the output activations before computing the softmax.
The model is optimized via a maximum likelihood objective function using stochastic gradient descent. Training is based on the back propagation through time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986). After training, the output layer represents posteriors $p(e_{t+1} \mid e_{t-n+1}^{t}, h_t, f_t)$: the probabilities of words in the output vocabulary given the n previous input words $e_{t-n+1}^{t}$, the hidden layer configuration $h_t$, as well as the auxiliary input layer configuration $f_t$.
2 While these features depend on multiple input words, we depicted them for simplicity as a connection between the current input word vector $e_t$ and the output layer (D).
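The two layer equations of this section can be illustrated with a minimal sketch of a single time step. It uses plain Python lists, toy dimensions and made-up weight values, all of which are assumptions for illustration; the maximum entropy features (D) and the class-based output layer are left out.

```python
import math

def sigmoid(z):
    """Element-wise logistic function s(z) = 1 / (1 + exp(-z))."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    """Softmax g(z_m) = exp(z_m) / sum_k exp(z_k), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def vec_add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def rnn_step(e_t, h_prev, f_t, U, W, F, V, G):
    """Return (h_t, y_t) for one time step of the model equations."""
    h_t = sigmoid(vec_add(matvec(U, e_t), matvec(W, h_prev), matvec(F, f_t)))
    y_t = softmax(vec_add(matvec(V, h_t), matvec(G, f_t)))
    return h_t, y_t

# Toy sizes (assumed for illustration): |V| = 3, hidden size 2, auxiliary size 2.
U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]    # input -> hidden
W = [[0.5, -0.3], [0.2, 0.1]]               # previous hidden -> hidden
F = [[0.3, 0.0], [-0.2, 0.6]]               # auxiliary -> hidden
V = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.5]]  # hidden -> output
G = [[0.1, 0.0], [0.0, 0.1], [0.2, -0.2]]   # auxiliary -> output

e_t = [0.0, 1.0, 0.0]   # 1-of-|V| encoding of the current target word
h_prev = [0.0, 0.0]     # hidden state before the first word
f_t = [0.5, -0.5]       # stand-in for a foreign-sentence encoding

h_t, y_t = rnn_step(e_t, h_prev, f_t, U, W, F, V, G)
```

Because `h_t` is fed back in as `h_prev` at the next step, the hidden state is the only thing that has to be carried along to score further words.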

Naïve computation of the probability distribution over the next word is very expensive for large vocabularies. A well established efficiency trick uses word-classing to create a more efficient two-step process (Goodman, 2001; Emami and Jelinek, 2005; Mikolov et al., 2011b) where each word is assigned a unique class. To compute the probability of a word, we first compute the probability of its class, and then multiply it by the probability of the word conditioned on the class:

$$p(e_{t+1} \mid e_{t-n+1}^{t}, h_t, f_t) = p(c_i \mid e_{t-n+1}^{t}, h_t, f_t) \times p(e_{t+1} \mid c_i, e_{t-n+1}^{t}, h_t, f_t)$$

This factorization reduces the complexity of computing the output probabilities from $O(|V|)$ to $O(|C| + \max_i |c_i|)$, where $|C|$ is the number of classes and $|c_i|$ is the number of words in class $c_i$. The best case complexity $O(\sqrt{|V|})$ requires the number of classes and words to be evenly balanced, i.e., each class contains exactly as many words as there are classes.
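A small sketch of the two-step computation may help. The class assignment, the raw scores and the helper name `class_factored_prob` are hypothetical; in the model the scores would come from the output layer activations.

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_factored_prob(word, word_to_class, class_members, class_scores, word_scores):
    """p(word) = p(class of word) * p(word | class).

    Only |C| class scores and the scores of the words in one class are
    normalized, rather than all |V| word scores.
    """
    c = word_to_class[word]
    p_class = softmax(class_scores)[c]
    members = class_members[c]
    within = softmax([word_scores[w] for w in members])
    return p_class * within[members.index(word)]

# Toy vocabulary of four words in two classes (assumed values).
word_to_class = {"the": 0, "a": 0, "cat": 1, "dog": 1}
class_members = {0: ["the", "a"], 1: ["cat", "dog"]}
class_scores = [0.2, -0.1]
word_scores = {"the": 1.0, "a": 0.5, "cat": 0.3, "dog": -0.7}

probs = {
    w: class_factored_prob(w, word_to_class, class_members, class_scores, word_scores)
    for w in word_to_class
}
total = sum(probs.values())
```

The factored probabilities still sum to one over the vocabulary, since each class distributes its own probability mass among its members.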
3 Lattice Rescoring with an Unbounded Language Model
We evaluate our joint language and translation model in a lattice rescoring setup, allowing us to search over a much larger space of translations than would be possible with n-best lists. While very space efficient, lattices also impose restrictions on the context available to features, a particularly challenging setting for our model which depends on the entire prefix of a translation. In the ensuing description we introduce a new algorithm to efficiently tackle this issue.
Phrase-based decoders operate by maintaining a set of states representing competing translations, either partial or complete. Each state is scored by a number of features including the n-gram language model. The independence assumptions of the features determine the amount of context each state needs to maintain in order for it to be possible to assign a score to it. For example, a trigram language model is indifferent to any context other than the two immediately preceding words. Assuming the trigram model dominates the Markov assumptions of all other features, which is typically the case, then we have to maintain at least two words at each state, also known as the n-gram context.
function RESCORELATTICE(k, V, E, s, T)
    Q ← TOPOLOGICALLY-SORT(V)
    for all v in V do                        ⊳ Heaps of split-states
        H_v ← MINHEAP()
    end for
    h_0 ← 0                                  ⊳ Initialize start-state
    H_s.ADD(h_0)
    for all v in Q do                        ⊳ Examine outgoing arcs
        for ⟨v, x⟩ in E do
            for h in H_v do                  ⊳ Extend LM states
                h′ ← SCORERNN(h, phrase(h))
                parent(h′) ← h               ⊳ Backpointers
                if H_x.size() ≥ k ∧          ⊳ Beam width
                   H_x.MIN() < score(h′) then
                    H_x.REMOVEMIN()
                if H_x.size() < k then
                    H_x.ADD(h′)
            end for
        end for
    end for
    I = MAXHEAP()
    for all t in T do                        ⊳ Find best final split-state
        I.MERGE(H_t)
    end for
    return I.MAX()
end function
Figure 2: Push-forward rescoring with a recurrent neural network language model given a beam-width for language model split-states k, decoder states V, edges E, a start state s and final states T.
However, a recurrent neural network language model makes much weaker independence assumptions. In fact, the predictions of such a model depend on all previous words in the sentence, which would imply a potentially very large context. But storing all words is an inefficient solution from a dynamic programming point of view. Fortunately, we do not need to maintain entire translations as context in the states: the recurrent model compactly encodes the entire history of previous words in the hidden layer configuration $h_i$. It is therefore sufficient to add $h_i$ as context, instead of the entire translation. The language model can then simply score any new words based on $h_i$ from the previous state when a new state is created.
A much larger problem is that items that were previously equivalent from a dynamic programming perspective may now be different. Standard phrase-based decoders (Koehn et al., 2007) recombine decoder states with the same context into a single state because they are equivalent with respect to the model features; usually recombination retains only the highest scoring candidate.3 However, if the context is large, then the amount of recombination will decrease significantly, leading to less variety in the decoder beam. This was confirmed in preliminary experiments where we simulated context sizes of up to 100 words but found that accuracy dropped by 0.5 to 1.0 BLEU.
Integrating a long-span language model naïvely requires keeping context equivalent to the entire left prefix of the translation, a setting which would permit very little recombination. Instead of using inefficient long-span contexts, we propose to maintain the usual n-gram context and to keep a fixed number of hidden layer configurations k at each decoder state. This leads to a new split-state dynamic program which splits each decoder state into at most k new items, each with a separate hidden layer configuration representing an unbounded history (Figure 2). This maintains diversity in the explored translation hypothesis space and preserves high-scoring hidden layer configurations.
What is the effect of this strategy? To answer this question we measured translation accuracy for various settings of k on our lattice rescoring setup (see §4 for details). In the same experiment, we compare lattices to n-best lists in terms of accuracy, model score and wall time impact.4 The results (Table 1 and Figure 3) show that reranking accuracy on lattices is not significantly better; however, rescoring lattices with k = 1 is much faster than rescoring n-best lists.
Similar observations have been made in previous work on minimum error-rate training (Macherey et al., 2008). The recurrent language model adds an overhead of about 54% at k = 1 on top of the time to produce the baseline 1-best output, a considerable but not necessarily prohibitive overhead. Larger values of k return higher probability solutions, but there is little impact on accuracy: the BLEU score is nearly identical when retaining up to 100 histories compared to keeping only the highest scoring.
While surprising at first, we believe that this effect is due to the high similarity of the translations represented by the histories in the beam. Each history represents a different translation but all translation hypotheses share the same n-gram context and, more importantly, they are translations of the same foreign words, since they have exactly the same coverage vector. These commonalities are likely to result in similar recurrent histories, which in turn reduces the effect of aggressive pruning.

                     BLEU    oracle   sec/sent
Baseline             28.25   -        0.173
100-best             28.90   37.22    0.470
1000-best            28.99   40.06    3.920
lattice (k = 1)      29.00   43.50    0.093
lattice (k = 10)     29.04   43.50    0.599
lattice (k = 100)    29.03   43.50    4.531

Table 1: Rescoring n-best lists and lattices with various language model beam widths k. Accuracy is based on the news2011 French-English task. Timing results are in addition to the baseline.
Figure 3: BLEU vs. log probabilities of 1-best translations when rescoring n-best lists and lattices (cf. Table 1).

3 Assuming a max-translation decision rule. In a minimum-risk setting, we may assign the sum of the scores of all candidates to the retained item.
4 We measured running times on an HP z800 workstation equipped with 24 GB main memory and two Xeon E5640 CPUs with four cores each, clocked at 2.66 GHz. All experiments were run single-threaded.
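The split-state procedure of Figure 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in the experiments: the lattice encoding, the function names and the toy scorer `toy_score` (which stands in for the recurrent model by penalizing any word already seen in the history) are assumptions, and each item carries its word sequence directly instead of a backpointer.

```python
import heapq
import itertools

def rescore_lattice(k, nodes, edges, start, finals, score_fn, init_hidden):
    """Push-forward rescoring with at most k split-states per lattice node.

    nodes must be listed in topological order; edges maps a node to a list
    of (next_node, phrase) arcs; score_fn(hidden, phrase) returns
    (new_hidden, log_prob).
    """
    tie = itertools.count()           # tie-breaker so heaps never compare hidden states
    heaps = {v: [] for v in nodes}    # one min-heap of split-states per node
    heapq.heappush(heaps[start], (0.0, next(tie), init_hidden, ()))
    for v in nodes:
        for x, phrase in edges.get(v, []):
            for score, _, hidden, words in list(heaps[v]):
                new_hidden, logp = score_fn(hidden, phrase)
                item = (score + logp, next(tie), new_hidden, words + tuple(phrase))
                if len(heaps[x]) < k:
                    heapq.heappush(heaps[x], item)
                elif heaps[x][0][0] < item[0]:
                    # Beam is full: evict the worst split-state, keep the new one.
                    heapq.heapreplace(heaps[x], item)
    candidates = [item for t in finals for item in heaps[t]]
    best = max(candidates, key=lambda item: item[0])
    return best[0], list(best[3])

def toy_score(hidden, phrase):
    """Hypothetical scorer: hidden is the set of all words seen so far."""
    logp = 0.0
    for word in phrase:
        logp += -3.0 if word in hidden else -0.5
        hidden = hidden | {word}
    return hidden, logp

# Toy lattice with two competing paths from node 0 to node 3.
nodes = [0, 1, 2, 3]
edges = {
    0: [(1, ["the", "house"]), (2, ["the", "home"])],
    1: [(3, ["is", "small"])],
    2: [(3, ["is", "the", "home"])],
}
best_score, best_words = rescore_lattice(1, nodes, edges, 0, [3], toy_score, frozenset())
```

With k = 1 each node keeps only its single best history, which corresponds to the most aggressive beam setting evaluated in Table 1.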
4 Language Model Experiments
Recurrent neural network language models have previously only been used in n-best rescoring settings and on small-scale tasks with baseline language models trained on only 17.5m words (Mikolov, 2012). We extend this work by experimenting on lattices using strong baselines with n-gram models trained on over one billion words and by evaluating on a number of language pairs.
4.1 Experimental Setup
Baseline. We experiment with an in-house phrase-based system similar to Moses (Koehn et al., 2003), scoring translations by a set of common features including maximum likelihood estimates of source given target mappings $p_{MLE}(e|f)$ and vice versa $p_{MLE}(f|e)$, as well as lexical weighting estimates $p_{LW}(e|f)$ and $p_{LW}(f|e)$, word and phrase penalties, a linear distortion feature and a lexicalized reordering feature. Log-linear weights are estimated with minimum error rate training (Och, 2003).
Evaluation. We use training and test data from the WMT 2012 campaign and report results on French-English, German-English and English-German. Translation models are estimated on 102m words of parallel data for French-English, and 91m words for German-English and English-German; between 3.5m and 5m words are newswire, depending on the language pair, and the remainder are parliamentary proceedings. The baseline systems use two 5-gram modified Kneser-Ney language models; the first is estimated on the target side of the parallel data, while the second is based on a large newswire corpus released as part of the WMT campaign. For French-English and German-English we use a language model based on 1.15bn words, and for English-German we train a model on 327m words. We evaluate on the newswire test sets from 2010-2011 containing between 2034 and 3003 sentences. Log-linear weights are estimated on the 2009 data set comprising 2525 sentences. We rescore the lattices produced by the baseline systems with an aggressive but effective context beam of k = 1 that did not harm accuracy in preliminary experiments (§3).
Neural Network Language Model.
The vocabularies of the language models consist of the words in the training set after removing singletons. We obtain word-classes using a version of Brown clustering with an additional regularization term to optimize the runtime of the language model (Brown et al., 1992; Zweig and Makarychev, 2013). Direct connections use maximum entropy features over unigrams, bigrams and trigrams (Mikolov et al., 2011a). We use the standard settings for the model with the default learning rate α = 0.1 that decays exponentially if the validation set entropy does not decrease after each epoch. Back propagation through time computes error gradients over the past twenty time steps. Training is stopped after 20 epochs or when the validation entropy does not decrease over two epochs. We experiment with varying training data sizes and randomly draw the data from the same corpora used for the baseline systems. Throughout, we use a hidden layer size of 100 which provided a good trade-off between time and accuracy in initial experiments.
4.2 Results
Training times for neural networks can be a major bottleneck. Recurrent architectures are particularly hard to parallelize due to their inherent dependence on the previous hidden layer configuration. One straightforward way to influence training time is to change the size of the training corpus.
Our results (Table 2, Table 3 and Table 4) show that even small models trained on only two million words significantly improve over the 1-best decoder output (Baseline); this represents only 0.6 percent of the data available to the n-gram model used by the baseline. Models of this size can be trained in only about 3.5 hours. A model trained on 50m words took 63 hours to train. When paired with an n-gram model trained on 25 times more data, accuracy improved by up to 0.7 BLEU on French-English.
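As a quick check on the data-size comparisons, two of the ratios can be worked out from the corpus sizes given in §4.1. This is only arithmetic on the quoted figures; reading the 0.6 percent as relative to the 327m-word English-German n-gram corpus is an assumption, since the text does not name the language pair.

```latex
\[
\frac{2\,\mathrm{m}}{327\,\mathrm{m}} \approx 0.0061 \approx 0.6\%,
\qquad
\frac{1.15\,\mathrm{bn}}{2\,\mathrm{m}} = 575 .
\]
```

The second ratio matches the "575 times more data" quoted for the large n-gram model on French-English and German-English.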
5 Joint Model Experiments
In the next set of experiments, we turn to the joint language and translation model, an extension of the recurrent neural network language model with additional inputs for the foreign sentence. We first introduce two continuous space representations of the foreign sentence (§5.1). Using these representations we evaluate the accuracy of the joint model in the lattice rescoring setup and compare against the traditional translation channel model features (§5.2). Next, we establish an upper bound on accuracy for the joint model via an oracle experiment (§5.3). Inspired by the results of the oracle experiment we train a transform between the source words and the reference representations. This leads to the best results, improving by up to 1.5 BLEU over the 1-best decoder output and adding 0.2 BLEU on average to the gains achieved by the recurrent language model (§5.4).

                 dev    news2010   news2011   newssyscomb2011   Avg(test)
Baseline         26.6   27.6       28.3       27.5              27.8
+RNNLM (2m)      27.5   28.1       28.6       28.1              28.3
+RNNLM (50m)     27.7   28.2       29.0       28.1              28.5

Table 2: French-English results when rescoring with the recurrent neural network language model; the baseline relies on an n-gram model trained on 1.15bn words.

                 dev    news2010   news2011   newssyscomb2011   Avg(test)
Baseline         21.2   20.7       19.2       20.6              20.0
+RNNLM (2m)      21.8   20.9       19.4       20.9              20.3
+RNNLM (50m)     22.1   21.1       19.7       21.0              20.5

Table 3: German-English results when rescoring with the recurrent neural network language model.

                 dev    news2010   news2011   newssyscomb2011   Avg(test)
Baseline         15.2   15.6       14.3       15.7              15.1
+RNNLM (2m)      15.7   15.9       14.6       16.0              15.4
+RNNLM (50m)     15.8   15.9       14.7       16.1              15.5

Table 4: English-German results when rescoring with the recurrent neural network language model; the baseline relies on an n-gram model trained on 327m words.

Setup. Conventional language models can be trained on monolingual or bilingual data; however, the joint model can only be trained on the latter. In order to control for data size effects, we restrict training of all models, including the baseline n-gram model, to the target side of the parallel corpus, about 102m words for French-English. Furthermore, we train recurrent models only on the newswire portion (about 3.5m words for training and 250k words for validation) since initial experiments showed comparable results to using the full parallel corpus, which is available to the baseline. This is reasonable since the test data is newswire. Also, it allows for more rapid experimentation.
5.1 Foreign Sentence Representations
We represent foreign sentences either by latent semantic analysis (LSA; Deerwester et al. 1990) or by word encodings produced as a by-product of training the recurrent neural network language model on the source words.
LSA is widely used for representing words and documents in low-dimensional vector space. The method applies reduced singular value decomposition (SVD) to a matrix M of word counts; in our setting, rows represent sentences and columns represent foreign words. SVD reduces the number of columns while preserving similarity among the rows, effectively mapping from a high-dimensional representation of a sentence, as a set of words, to a low-dimensional set of concepts. The output of SVD is an approximation of M by three matrices: T contains single word representations, R represents full sentences, and S is a diagonal scaling matrix:

$$M \approx T S R^{T}$$

Given vocabulary V and n sentences, we construct M as a matrix of size $|V| \times n$.
The ij-th entry is the number of times word i occurs in sentence j, also known as the term frequency value; the entry is also weighted by the inverse document frequency, the relative importance of word i among all sentences, expressed as the negative logarithm of the fraction of sentences in which word i occurs.
As a second representation we use single word embeddings implicitly learned by the input layer weights U of the recurrent neural network language model (§2), denoted as RNN. Each word is represented by a vector of size $|h_i|$, the number of neurons in the hidden layer; in our experiments, we consider concatenations of individual word vectors to represent foreign word contexts. These encodings have previously been found to capture syntactic and semantic regularities (Mikolov et al., 2013) and are readily available in our experimental framework via training a recurrent neural network language model on the source side of the parallel corpus.
5.2 Results
We first experiment with the two previously introduced representations of the source-side sentence. Table 5 shows the results compared to the 1-best decoder output and an RNN language model (target-only). We first try LSA encodings of the entire foreign sentence as 80 or 240 dimensional vectors (sent-lsa-dim80, sent-lsa-dim240). Next, we experiment with single-word RNN representations of sliding word-windows in the hope of representing relevant context more precisely. Word-windows are constructed relative to the source words aligned to the current target word, and individual word vectors are concatenated into a single vector. We first try contexts which do not include the aligned source words, in the hope of capturing information not already modeled by the channel models, starting with the next five words (ww-rnn-dim50.n5), the five previous and the next five words (ww-rnn-dim50.p5n5) as well as the previous three words (ww-rnn-dim50.p3). Next, we experiment with word-windows of up to five aligned source words (ww-rnn-dim50.c5). Finally, we try contexts based on LSA word vectors (ww-lsa-dim50.n5, ww-lsa-dim50.p3).5
While all models improve over the baseline, none significantly outperforms the recurrent neural network language model in terms of BLEU.
However, the perplexity results suggest that the models utilize the foreign representations since all joint models improve vastly over the target-only language model. The lowest perplexity is achieved by the context covering the aligned source words (ww-rnn-dim50.c5) since the source words are a better predictor of the target words than outside context.

                      dev    news2010   news2011   newssyscomb2010   Avg(test)   PPL
Baseline              24.3   24.4       25.1       24.3              24.7        341
target-only           25.1   25.1       26.4       25.0              25.6        218
sent-lsa-dim80        25.2   25.2       26.3       25.1              25.6        147
sent-lsa-dim240       25.1   25.0       26.2       24.9              25.4        126
ww-rnn-dim50.n5       24.9   25.0       26.3       24.8              25.4        61
ww-rnn-dim50.p5n5     25.0   24.8       26.2       24.7              25.3        59
ww-rnn-dim50.p3       25.1   25.1       26.5       24.9              25.6        143
ww-rnn-dim50.c5       24.8   24.9       26.0       24.8              25.3        16
ww-lsa-dim50.n5       25.0   25.0       26.2       24.8              25.4        76
ww-lsa-dim50.p3       25.1   25.1       26.5       24.9              25.6        151

Table 5: Translation accuracy of the joint model with various encodings of the foreign sentence measured on the French-English task. Perplexity (PPL) is based on news2011.

The experiments so far measured whether the joint model can improve in addition to the four channel model features used by the baseline, that is, the maximum likelihood and lexical translation features in both translation directions. The joint model clearly overlaps with these features, but how well does the recurrent model perform compared against the channel model features? To answer this question, we removed channel model features corresponding to the same translation direction as the joint model, specifically $p_{MLE}(e|f)$ and $p_{LW}(e|f)$, from the lattices and measured the effect of adding the joint models.
The results (Table 6, column −p(e|f)) clearly show that our joint models are competitive with the channel model features by outperforming the original baseline with all channel model features (24.7 BLEU) by 0.2 BLEU (ww-rnn-dim50.n5, ww-rnn-dim50.c5). As a second experiment, we removed all channel model features (column −p(e|f), −p(f|e)), diminishing baseline accuracy to 22.5 BLEU. In this setting, the best joint model is able to make up 1.5 of the 2.2 BLEU lost due to removal of the channel model features, while modeling only a single translation direction. This setup also shows the negligible effect of the target-only language model in the absence of translation scores, whereas the joint models are much more effective since they do model translation. Overall, the best joint models prove very competitive with the traditional channel features.

                        −p(e|f)   −p(e|f), −p(f|e)
Baseline without CM     24.0      22.5
+ target-only           24.5      22.6
+ sent-lsa-dim240       24.9      23.3
+ ww-rnn-dim50.n5       24.9      24.0
+ ww-rnn-dim50.p5n5     24.6      23.7
+ ww-rnn-dim50.p3       24.6      22.3
+ ww-rnn-dim50.c5       24.9      24.0
+ ww-lsa-dim50.n5       24.8      23.9
+ ww-lsa-dim50.p3       23.8      23.2

Table 6: Comparison of the joint model and the channel model features (CM) by removing channel features corresponding to −p(e|f) from the lattices, or both directions −p(e|f), −p(f|e), and replacing them by various joint models. We re-tuned the log-linear weights for different feature-sets. Accuracy is based on the average BLEU over news2010, newssyscomb2010, news2011.

5 We ignore the coverage vector when determining word-windows, which risks including already translated words. Building word-windows based on the coverage vector requires additional state in a rescoring setting meant to be light-weight.

5.3 Oracle Experiment
The previous section examined the effect of a set of basic foreign sentence representations. Although we find some benefit from these representations, the differences are not large. One might naturally ask whether there is greater potential upside from this channel model. Therefore we turn to measuring the upper bound on accuracy for the joint approach as a whole.
Specifically, we would like to find a bound on accuracy given an ideal representation of the source sentence. To answer this question, we conducted an experiment where the joint model has access to an LSA representation of the reference translation.
Table 7 shows that the joint approach has an oracle accuracy of up to 4.3 BLEU above the baseline. This clearly confirms that the joint approach can exploit the additional information to improve BLEU, given a good enough representation of the foreign sentence. In terms of perplexity, we see an improvement of up to 65% over the target-only model. It should be noted that since LSA representations are computed on reference words, perplexity no longer has its standard meaning.

                            BLEU   PPL
Baseline                    25.2   341
target-only                 26.4   218
oracle (sent-lsa-dim40)     27.7   124
oracle (sent-lsa-dim80)     28.5   103
oracle (sent-lsa-dim160)    29.0   86
oracle (sent-lsa-dim240)    29.5   76

Table 7: Oracle accuracy of the joint model when using an LSA encoding of the references, measured on the news2011 French-English task.
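The LSA encodings used here and in §5.1 start from a count matrix weighted by term frequency and inverse document frequency. A minimal sketch of that weighting step follows, using the word-by-sentence indexing of the ij-th entry described in §5.1; the function name and the toy sentences are assumptions, and the reduced SVD that would follow is omitted.

```python
import math
from collections import Counter

def tfidf_matrix(sentences):
    """Return (vocab, M) with M[i][j] = tf(word i, sentence j) * idf(word i).

    idf is the negative logarithm of the fraction of sentences that
    contain word i, as described in the text.
    """
    n = len(sentences)
    counts = [Counter(s) for s in sentences]
    vocab = sorted({w for s in sentences for w in s})
    doc_freq = {w: sum(1 for c in counts if w in c) for w in vocab}
    idf = {w: -math.log(doc_freq[w] / n) for w in vocab}
    M = [[counts[j][w] * idf[w] for j in range(n)] for w in vocab]
    return vocab, M

# Toy "foreign" sentences (made up for illustration).
sentences = [
    ["le", "chat", "noir"],
    ["le", "chien"],
    ["le", "chat", "le"],
]
vocab, M = tfidf_matrix(sentences)
```

A word that occurs in every sentence, such as "le" in this toy corpus, receives an inverse document frequency of zero and so contributes nothing to the representation.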
5.4 Target Language Projections
Our experiments so far showed that joint models based on direct representations of the source words are very competitive with the traditional channel models (§5.2). However, these experiments have not shown any improvements over the normal recurrent neural network language model. The previous section demonstrated that good representations can lead to substantial gains (§5.3).
In order to bridge the gap, we propose to learn a separate transform from the foreign words to an encoding of the reference target words, thus making the source-side representations look more like the target-side encodings used in the oracle experiment. Specifically, we learn a linear transform $d_\theta : x \to r$ mapping directly from a vector encoding of the foreign sentence x to an l-dimensional LSA representation r of the reference sentence. At test and training time we apply $d_\theta$ to the foreign words and use the transformation instead of a direct source-side representation.
The transform models all foreign words in the parallel corpus except singletons, which are collapsed into a unique class, similar to the recurrent neural network language model. We train the transform to minimize the squared error with respect to the reference LSA vector using an SGD online learner:

$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{n} \left( r_i - d_\theta(x_i) \right)^2 \qquad (1)$$

We found a simple constant learning rate, tuned on the validation data, to be as effective as schedules based on constant decay, or reducing the learning rate when the validation error increased. Our feature-set includes unigram and bigram word features. The value of unigram features is simply the unigram count in that sentence; bigram features receive a weight of the bigram count divided by two to help prevent overfitting. Then the vector for each sentence was divided by its L2 norm. Both weighting and normalization led to substantial improvements in test set error. More complex features such as skip-bigrams, trigrams and character n-grams did not yield any significant improvements. Even this representation of sentences is composed of a large number of instances, and so we resorted to feature hashing by computing feature ids as the least significant 20 bits of each feature name. Our best transform achieved a cosine similarity of 0.816 on the training data, 0.757 on the validation data, and 0.749 on news2011.
The results (Table 8) show that the transform improves over the recurrent neural network language model on all test sets and by 0.2 BLEU on average. We verified significance over the target-only model using paired bootstrap resampling (Koehn, 2004) over all test sets (7526 sentences) at the p < 0.001 level. Overall, we improve accuracy by up to 1.5 BLEU and by 1.1 BLEU on average across all test sets over the decoder 1-best with our joint language and translation model.

                   dev    news2010   news2011   newssyscomb2010   Avg(test)   PPL
Baseline           24.3   24.4       25.1       24.3              24.7        341
target-only        25.1   25.1       26.4       25.0              25.6        218
proj-lsa-dim40     25.1   25.3       26.5       25.2              25.8        145
proj-lsa-dim80     25.1   25.3       26.6       25.2              25.8        134

Table 8: Translation accuracy of the joint model with a source-target transform, measured on the French-English task. Perplexity (PPL) is based on news2011; differences to target-only are significant at the p < 0.001 level.
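The feature pipeline and the squared-error objective of Equation (1) can be sketched as follows. Several details are assumptions made for illustration: CRC32 is used as the hash (the text only says that the least significant 20 bits of the feature name are kept), the feature name prefixes, the sparse dictionary layout of θ, the toy sentence and the toy target vector are all made up.

```python
import math
import zlib

NUM_BITS = 20
DIM = 1 << NUM_BITS  # number of hashed feature ids

def feature_id(name):
    """Hash a feature name and keep its least significant 20 bits (CRC32 is an assumption)."""
    return zlib.crc32(name.encode("utf-8")) & (DIM - 1)

def featurize(words):
    """Unigram counts, bigram counts divided by two, then L2 normalization."""
    feats = {}
    for w in words:
        fid = feature_id("u:" + w)
        feats[fid] = feats.get(fid, 0.0) + 1.0
    for a, b in zip(words, words[1:]):
        fid = feature_id("b:" + a + "_" + b)
        feats[fid] = feats.get(fid, 0.0) + 0.5
    norm = math.sqrt(sum(v * v for v in feats.values()))
    return {fid: v / norm for fid, v in feats.items()} if norm > 0 else feats

def predict(theta, feats, dim):
    """Linear transform d_theta(x): sum of the weight rows of the active features."""
    r = [0.0] * dim
    for fid, v in feats.items():
        row = theta.get(fid)
        if row is not None:
            for d in range(dim):
                r[d] += row[d] * v
    return r

def sgd_step(theta, feats, target, lr):
    """One SGD update on the squared error; returns the loss before the update."""
    dim = len(target)
    err = [p - t for p, t in zip(predict(theta, feats, dim), target)]
    for fid, v in feats.items():
        row = theta.setdefault(fid, [0.0] * dim)
        for d in range(dim):
            row[d] -= lr * 2.0 * err[d] * v  # gradient of (r - d_theta(x))^2
    return sum(e * e for e in err)

theta = {}
x = featurize("la maison est petite".split())
target = [1.0, -1.0]  # stand-in for an l-dimensional reference LSA vector
losses = [sgd_step(theta, x, target, lr=0.1) for _ in range(50)]
```

Hashing keeps the parameter table at a fixed size regardless of how many distinct unigrams and bigrams occur, at the cost of occasional collisions between features.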
6 Related Work
Our approach of combining language and translation modeling is very much in line with recent work on n-gram-based translation models (Crego and Yvon, 2010), and more recently continuous space-based translation models (Le et al., 2012a; Gao et al., 2013). The joint model presented in this paper differs in a number of key aspects: we use a recurrent architecture representing an unbounded history of both source and target words, rather than a feed-forward style network. Feed-forward networks and n-gram models have a finite history which makes predictions independent of anything but a small history of words. Furthermore, we only model the target-side which is different to previous work modeling both sides.

We introduced a new algorithm to tackle lattice rescoring with an unbounded model. The automatic speech recognition community has previously addressed this issue by either approximating long-span language models via simpler but more tractable models (Deoras et al., 2011b), or by identifying confusable subsets of the lattice from which n-best lists are constructed and rescored (Deoras et al., 2011a). We extend their work by directly mapping a recurrent neural network model onto the structure of the lattice, rescoring all states instead of focusing only on subsets.
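To make the idea of rescoring every lattice state with an unbounded-history model concrete, the following toy sketch keeps a small beam of histories at each state and extends them along outgoing arcs. It is an illustration under stated assumptions, not the algorithm as implemented in this paper: the arc format, the integer state ids assumed to be in topological order, the function names, and the `toy_step` scorer (a hypothetical stand-in for a recurrent network) are all invented for the example.

```python
import heapq
from collections import defaultdict


def rescore_lattice(arcs, start, final, step, init_history, beam_size=2):
    """Rescore a lattice with a history-dependent model.

    arcs: list of (src, dst, word, arc_score); state ids are assumed to be
          integers numbered in topological order.
    step: function (history, word) -> (new_history, log_prob).
    Returns (best_score, best_word_sequence) at the final state.
    """
    outgoing = defaultdict(list)
    for src, dst, word, arc_score in arcs:
        outgoing[src].append((dst, word, arc_score))

    # beams[state] holds hypotheses (total_score, words, history).
    beams = defaultdict(list)
    beams[start] = [(0.0, (), init_history)]

    for state in sorted(set(outgoing) | {final}):
        # Keep only the highest-scoring histories at this state.
        beams[state] = heapq.nlargest(beam_size, beams[state], key=lambda h: h[0])
        for dst, word, arc_score in outgoing[state]:
            for total, words, history in beams[state]:
                new_history, log_prob = step(history, word)
                beams[dst].append(
                    (total + arc_score + log_prob, words + (word,), new_history)
                )

    best = max(beams[final], key=lambda h: h[0])
    return best[0], list(best[1])


def toy_step(history, word):
    # Hypothetical scorer: the "hidden state" is the set of words seen so far,
    # and repeating a word is penalized. A real model would update an RNN state.
    log_prob = -2.0 if word in history else -0.5
    return history | {word}, log_prob


TOY_ARCS = [
    (0, 1, "a", 0.0),
    (0, 1, "b", -0.1),
    (1, 2, "a", 0.0),
    (1, 2, "c", -0.3),
]
```

On `TOY_ARCS`, a beam of two histories per state recovers the path `b a`, whereas a beam of one commits to `a` at state 1 and ends with `a c`. This shows why a single history per state is not sufficient once the model's score depends on the full prefix.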
7 Conclusion and Future Work
Joint language and translation modeling with recurrent neural networks leads to substantial gains over the 1-best decoder output, raising accuracy by up to 1.5 BLEU and by 1.1 BLEU on average across several test sets. The joint approach also improves over the gains of the recurrent neural network language model, adding 0.2 BLEU on average across several test sets. Our models are competitive with the traditional channel models, outperforming them in a head-to-head comparison.

Furthermore, we tackled the issue of lattice rescoring with an unbounded recurrent model by means of a novel algorithm that keeps a beam of recurrent histories. Finally, we have shown that the recurrent neural network language model can significantly improve over n-gram baselines across a range of language-pairs, even when the baselines were trained on 575 times more data.

In future work we plan to directly learn representations of the source-side during training of the joint model. Thus, the model itself can decide which encoding is best for the task. We also plan to change the cross entropy objective to a BLEU-inspired objective in a discriminative training regime, which we hope will be more effective. We would also like to apply recent advances in tackling the vanishing gradient problem (Pascanu et al., 2013) using a regularization term to maintain the magnitude of the gradients during back propagation through time. Finally, we would like to integrate the recurrent model directly into first-pass decoding, a straightforward extension of lattice rescoring using the algorithm we developed.
Acknowledgments
We would like to thank Anthony Aue, Hany Hassan Awadalla, Jon Clark, Li Deng, Sauleh Eetemadi, Jianfeng Gao, Qin Gao, Xiaodong He, Will Lewis, Arul Menezes, and Kristina Toutanova for helpful discussions related to this work as well as for comments on previous drafts. We would also like to thank the anonymous reviewers for their comments.
References
Alexandre Allauzen, Hélène Bonneau-Maynard, Hai-Son Le, Aurélien Max, Guillaume Wisniewski, François Yvon, Gilles Adda, Josep Maria Crego, Adrien Lardilleux, Thomas Lavergne, and Artem Sokolov. 2011. LIMSI @ WMT11. In Proc. of WMT, pages 309–315, Edinburgh, Scotland, July. Association for Computational Linguistics.
Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. Deep Neural Network Language Models. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 20–28, Stroudsburg, PA, USA. Association for Computational Linguistics.
Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, Dec.
Josep Crego and François Yvon. 2010. Factored bilingual n-gram language models for statistical machine translation. Machine Translation, 24(2):159–175.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.
Anoop Deoras, Tomáš Mikolov, and Kenneth Church. 2011a. A Fast Re-scoring Strategy to Capture Long-Distance Dependencies. In Proc. of EMNLP, pages 1116–1127, Stroudsburg, PA, USA, July. Association for Computational Linguistics.
Anoop Deoras, Tomáš Mikolov, Stefan Kombrink, M. Karafiat, and Sanjeev Khudanpur. 2011b. Variational Approximation of Long-Span Language Models for LVCSR. In Proc. of ICASSP, pages 5532–5535.
Ahmad Emami and Frederick Jelinek. 2005. A Neural Syntactic Language Model. Machine Learning, 60(1-3):195–227, September.
Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2013. Learning Semantic Representations for the Phrase Translation Model. Technical Report MSR-TR-2013-88, Microsoft Research, September.
Joshua Goodman. 2001. Classes for Fast Maximum Entropy Training. In Proc. of ICASSP.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proc. of HLT-NAACL, pages 127–133, Edmonton, Canada, May.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proc. of ACL Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, Jun.
Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proc. of EMNLP, pages 388–395, Barcelona, Spain, Jul.
Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012a. Continuous Space Translation Models with Neural Networks. In Proc. of HLT-NAACL, pages 39–48, Montréal, Canada. Association for Computational Linguistics.
Hai-Son Le, Thomas Lavergne, Alexandre Allauzen, Marianna Apidianaki, Li Gong, Aurélien Max, Artem Sokolov, Guillaume Wisniewski, and François Yvon. 2012b. LIMSI @ WMT12. In Proc. of WMT, pages 330–337, Montréal, Canada, June. Association for Computational Linguistics.
Wolfgang Macherey, Franz Josef Och, Ignacio Thayer, and Jakob Uszkoreit. 2008. Lattice-based Minimum Error Rate Training for Statistical Machine Translation. In Proc. of EMNLP, pages 725–734, Stroudsburg, PA, USA. Association for Computational Linguistics.
Tomáš Mikolov and Geoffrey Zweig. 2012. Context Dependent Recurrent Neural Network Language Model. In Proc. of Spoken Language Technologies (SLT), pages 234–239, Dec.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent Neural Network based Language Model. In Proc. of INTERSPEECH, pages 1045–1048.
Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011a. Strategies for Training Large Scale Neural Network Language Models. In Proc. of ASRU, pages 196–201.
Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011b. Extensions of Recurrent Neural Network Language Model. In Proc. of ICASSP, pages 5528–5531.
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In Proc. of NAACL, pages 746–751, Stroudsburg, PA, USA, June. Association for Computational Linguistics.
Tomáš Mikolov. 2012. Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.
Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL, pages 160–167, Sapporo, Japan, July.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training Recurrent Neural Networks. Proc. of ICML, abs/1211.5063.
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning Internal Representations by Error Propagation. In Symposium on Parallel and Distributed Processing.
Holger Schwenk, Anthony Rousseau, and Mohammed Attik. 2012. Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 11–19. Association for Computational Linguistics.
Martin Sundermeyer, Ilya Oparin, Jean-Luc Gauvain, Ben Freiberg, Ralf Schlüter, and Hermann Ney. 2013. Comparison of Feedforward and Recurrent Neural Network Language Models. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 8430–8434, Vancouver, Canada, May.
Geoff Zweig and Konstantin Makarychev. 2013. Speed Regularization and Optimality in Word Classing. In Proc. of ICASSP.
