Joint Language and Translation Modeling with Recurrent Neural Networks
Michael Auli, Michel Galley, Chris Quirk, Geoffrey Zweig
Microsoft Research
Redmond, WA, USA
{michael.auli,mgalley,chrisq,gzweig}@microsoft.com
Abstract
We present a joint language and transla-
tion model based on a recurrent neural net-
work which predicts target words based on
an unbounded history of both source and tar-
get words. The weaker independence as-
sumptions of this model result in a vastly
larger search space compared to related feed-
forward-based language or translation models.
We tackle this issue with a new lattice rescor-
ing algorithm and demonstrate its effective-
ness empirically. Our joint model builds on a
well known recurrent neural network language
model (Mikolov, 2012) augmented by a layer
of additional inputs from the source language.
We show competitive accuracy compared to
the traditional channel model features. Our
best results improve the output of a system
trained on WMT 2012 French-English data by
up to 1.5 BLEU, and by 1.1 BLEU on average
across several test sets.
1 Introduction
Recently, several feed-forward neural network-
based language and translation models have
achieved impressive accuracy improvements on sta-
tistical machine translation tasks (Allauzen et al.,
2011; Le et al., 2012b; Schwenk et al., 2012). In this
paper we focus on recurrent neural network archi-
tectures, which have recently advanced the state of
the art in language modeling (Mikolov et al., 2010;
Mikolov et al., 2011a; Mikolov, 2012), outperform-
ing multi-layer feed-forward based networks in both
perplexity and word error rate in speech recognition
(Arisoy et al., 2012; Sundermeyer et al., 2013). The
major attraction of recurrent architectures is their
potential to capture long-span dependencies since
predictions are based on an unbounded history of
previous words. This is in contrast to feed-forward
networks as well as conventional n-gram models,
both of which are limited to fixed-length contexts.
Building on the success of recurrent architectures,
we base our joint language and translation model
on an extension of the recurrent neural network lan-
guage model (Mikolov and Zweig, 2012) that intro-
duces a layer of additional inputs (§2).
Most previous work on neural networks for
speech recognition or machine translation used a
rescoring setup based on n-best lists (Arisoy et al.,
2012; Mikolov, 2012) for evaluation, thereby sidestepping
the algorithmic and engineering challenges
of direct decoder-integration.1 Instead, we exploit
lattices, which offer a much richer representation
of the decoder output, since they compactly encode
an exponential number of translation hypotheses in
polynomial space. In contrast, n-best lists are typi-
cally very redundant, representing only a few com-
binations of top scoring arcs in the lattice. A major
challenge in lattice rescoring with a recurrent neural
network model is the effect of the unbounded history
on search since the usual dynamic programming as-
sumptions which are exploited for efficiency do not
hold up anymore. We apply a novel algorithm to the
task of rescoring with an unbounded language model
and empirically demonstrate its effectiveness (§3).
The algorithm proves robust, leading to signif-
icant improvements with the recurrent neural net-
work language model over a competitive n-gram
baseline across several language pairs. We even ob-
serve consistent gains when pairing the model with a
large n-gram model trained on up to 575 times more
1One notable exception is Le et al. (2012a) who rescore reorder-
ing lattices with a feed-forward network-based model.
data, demonstrating that the model provides comple-
mentary information (§4).
Our joint modeling approach is based on adding a
continuous space representation of the foreign sen-
tence as an additional input to the recurrent neu-
ral network language model. With this extension,
the language model can measure the consistency
between the source and target words in a context-
sensitive way. The model effectively combines the
functionality of both the traditional channel and lan-
guage model features. We test the power of this
new model by using it as the only source of tradi-
tional channel information. Overall, we find that the
model achieves accuracy competitive with the older
channel model features and that it can improve over
the gains observed with the recurrent neural network
language model (§5).
2 Model Structure
We base our model on the recurrent neural network
language model of Mikolov et al. (2010) which is
factored into an input layer, a hidden layer with re-
current connections, and an output layer (Figure 1).
The input layer encodes the target language word at
time t as a 1-of-N vector et, where |V | is the size
of the vocabulary, and the output layer yt represents
a probability distribution over target words; both of
size |V |. The hidden layer state ht encodes the his-
tory of all words observed in the sequence up to time
step t. This model is extended by an auxiliary input
layer ft which provides complementary information
to the input layer (Mikolov and Zweig, 2012). While
the auxiliary input layer can be used to feed in arbi-
trary additional information, we focus on encodings
of the foreign sentence (§5).
The state of the hidden layer is determined by the
input layer, the auxiliary input layer and the hidden
layer configuration of the previous time step ht−1.
The weights of the connections between the layers
are summarized in a number of matrices: U, F and
W represent weights from the input layer to the hidden
layer, from the auxiliary input layer to the hidden
layer, and from the previous hidden layer to the
current hidden layer, respectively. Matrix V repre-
sents connections between the current hidden layer
and the output layer; G represents direct weights be-
tween the auxiliary input and output layers.
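[Figure 1 diagram: the input layer et, the previous hidden state ht−1 and the auxiliary input ft feed the hidden layer ht via U, W and F; ht and ft feed the output layer yt via V and G; D connects the input layer directly to the output layer.]
Figure 1: Structure of the recurrent neural network
model, including the auxiliary input layer ft.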
The hidden and output layers are computed via a
series of matrix-vector products and non-linearities:
$$h_t = s(U e_t + W h_{t-1} + F f_t)$$
$$y_t = g(V h_t + G f_t)$$
where
$$s(z) = \frac{1}{1 + \exp\{-z\}}, \qquad g(z_m) = \frac{\exp\{z_m\}}{\sum_k \exp\{z_k\}}$$
are sigmoid and softmax functions, respectively.
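As an illustration of the two equations above, the following is a minimal pure-Python sketch of a single time step. The function names (`rnn_step`, `matvec`, `add`) and the tiny weight matrices are hypothetical and chosen only for this example; the maximum entropy connections and the class-based output layer are left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def add(*vectors):
    return [sum(vals) for vals in zip(*vectors)]

def rnn_step(word_id, h_prev, f_t, U, W, F, V, G):
    """One step: h_t = s(U e_t + W h_{t-1} + F f_t), y_t = g(V h_t + G f_t)."""
    # e_t is a 1-of-N vector, so U e_t is simply column `word_id` of U.
    u_col = [row[word_id] for row in U]
    pre_activation = add(u_col, matvec(W, h_prev), matvec(F, f_t))
    h_t = [sigmoid(z) for z in pre_activation]
    y_t = softmax(add(matvec(V, h_t), matvec(G, f_t)))
    return h_t, y_t

# Toy dimensions (assumed): vocabulary 3, hidden layer 2, auxiliary input 2.
U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]      # hidden x vocabulary
W = [[0.5, 0.0], [-0.3, 0.2]]                 # hidden x hidden
F = [[0.2, 0.1], [0.0, -0.4]]                 # hidden x auxiliary
V = [[0.3, -0.1], [0.0, 0.2], [-0.5, 0.4]]    # vocabulary x hidden
G = [[0.1, 0.0], [0.0, 0.1], [0.2, -0.2]]     # vocabulary x auxiliary

h = [0.0, 0.0]
f = [1.0, 0.5]   # auxiliary input, held fixed here for simplicity
for word_id in [0, 2, 1]:
    h, y = rnn_step(word_id, h, f, U, W, F, V, G)
```

After the loop, `h` summarizes all three input words and `y` is a distribution over the three vocabulary items for the next position.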
Additionally, the network is interpolated with a
maximum entropy model of sparse n-gram features
over input words (Mikolov et al., 2011a).2 The max-
imum entropy weights are added to the output acti-
vations before computing the softmax.
The model is optimized via a maximum likelihood
objective function using stochastic gradient
descent. Training is based on the back propagation
through time algorithm, which unrolls the network
and then computes error gradients over multiple
time steps (Rumelhart et al., 1986). After
training, the output layer represents posteriors
$p(e_{t+1} \mid e_{t-n+1}^{t}, h_t, f_t)$: the probabilities of words in
the output vocabulary given the n previous input
words $e_{t-n+1}^{t}$, the hidden layer configuration $h_t$ as
well as the auxiliary input layer configuration $f_t$.
2While these features depend on multiple input words, we de-
picted them for simplicity as a connection between the current
input word vector et and the output layer (D).
Naive computation of the probability distribution
over the next word is very expensive for large vocabularies.
A well-established efficiency trick uses
word-classing to create a more efficient two-step
process (Goodman, 2001; Emami and Jelinek, 2005;
Mikolov et al., 2011b) where each word is assigned
a unique class. To compute the probability of a
word, we first compute the probability of its class,
and then multiply it by the probability of the word
conditioned on the class:
$$p(e_{t+1} \mid e_{t-n+1}^{t}, h_t, f_t) = p(c_i \mid e_{t-n+1}^{t}, h_t, f_t) \times p(e_{t+1} \mid c_i, e_{t-n+1}^{t}, h_t, f_t)$$
This factorization reduces the complexity of com-
puting the output probabilities from O(|V |) to
O(|C| + maxi |ci|) where |C| is the number of
classes and |ci| is the number of words in class
ci. The best case complexity O(√|V |) requires the
number of classes and words to be evenly balanced,
i.e., each class contains exactly as many words as
there are classes.
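The class-based factorization can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the scores are fixed numbers rather than network activations, and the names `class_factored_prob` and `output_cost` are hypothetical.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def class_factored_prob(word, word_class, class_members, class_scores, word_scores):
    """p(word) = p(class of word) * p(word | class of word)."""
    c = word_class[word]
    p_class = softmax(class_scores)[c]
    members = class_members[c]
    # Normalize only over the words that share the class of `word`.
    within = softmax([word_scores[w] for w in members])
    return p_class * within[members.index(word)]

def output_cost(num_classes, max_class_size):
    """Number of output units evaluated per prediction: |C| + max_i |c_i|."""
    return num_classes + max_class_size

# Toy setup (assumed): four words split evenly over two classes.
class_members = [[0, 1], [2, 3]]
word_class = {0: 0, 1: 0, 2: 1, 3: 1}
class_scores = [0.3, -0.2]
word_scores = [1.0, 0.5, -0.4, 0.2]

probs = [class_factored_prob(w, word_class, class_members, class_scores, word_scores)
         for w in range(4)]
total_probability = sum(probs)

# With |V| = 10,000 words in 100 balanced classes of 100 words each,
# 200 output units are evaluated instead of 10,000.
balanced_cost = output_cost(100, 100)
```

The factored probabilities still sum to one over the vocabulary, and the balanced case attains the 2√|V| unit count that underlies the O(√|V|) best case.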
3 Lattice Rescoring with an Unbounded
Language Model
We evaluate our joint language and translation
model in a lattice rescoring setup, allowing us to
search over a much larger space of translations than
would be possible with n-best lists. While very
space efficient, lattices also impose restrictions on
the context available to features, a particularly chal-
lenging setting for our model which depends on the
entire prefix of a translation. In the ensuing de-
scription we introduce a new algorithm to efficiently
tackle this issue.
Phrase-based decoders operate by maintaining a
set of states representing competing translations, ei-
ther partial or complete. Each state is scored by a
number of features including the n-gram language
model. The independence assumptions of the fea-
tures determine the amount of context each state
needs to maintain in order for it to be possible to
assign a score to it. For example, a trigram language
model is indifferent to any context other than the
two immediately preceding words. Assuming the
trigram model dominates the Markov assumptions
of all other features, which is typically the case, then
we have to maintain at least two words at each state,
also known as the n-gram context.
function RESCORELATTICE(k, V, E, s, T)
    Q ← TOPOLOGICALLY-SORT(V)
    for all v in V do                        ⊳ Heaps of split-states
        Hv ← MINHEAP()
    end for
    h0 ← 0                                   ⊳ Initialize start-state
    Hs.ADD(h0)
    for all v in Q do                        ⊳ Examine outgoing arcs
        for 〈v, x〉 in E do
            for h in Hv do                   ⊳ Extend LM states
                h′ ← SCORERNN(h, phrase(h))
                parent(h′) ← h               ⊳ Backpointers
                if Hx.size() ≥ k ∧           ⊳ Beam width
                        Hx.MIN() < score(h′) then
                    Hx.REMOVEMIN()
                if Hx.size() < k then
                    Hx.ADD(h′)
            end for
        end for
    end for
    I = MAXHEAP()
    for all t in T do                        ⊳ Find best final split-state
        I.MERGE(Ht)
    end for
    return I.MAX()
end function
Figure 2: Push-forward rescoring with a recurrent neural
network language model given a beam-width for language
model split-states k, decoder states V, edges E, a
start state s and final states T.
However, a recurrent neural network language
model makes much weaker independence assump-
tions. In fact, the predictions of such a model depend
on all previous words in the sentence, which would
imply a potentially very large context. But storing
all words is an inefficient solution from a dynamic
programming point of view. Fortunately, we do not
need to maintain entire translations as context in the
states: the recurrent model compactly encodes the
entire history of previous words in the hidden layer
configuration hi. It is therefore sufficient to add hi
as context, instead of the entire translation. The lan-
guage model can then simply score any new words
based on hi from the previous state when a new state
is created.
A much larger problem is that items that were
previously equivalent from a dynamic programming
perspective may now be different. Standard phrase-based
decoders (Koehn et al., 2007) recombine decoder
states with the same context into a single
state because they are equivalent to the model fea-
tures; usually recombination retains only the high-
est scoring candidate.3 However, if the context is
large, then the amount of recombination will decrease
significantly, leading to less variety in the decoder
beam. This was confirmed in preliminary experiments
where we simulated context sizes of up to
100 words and found that accuracy dropped by between
0.5 and 1.0 BLEU.
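The effect of context size on recombination can be illustrated with a small sketch. It is a simplification: a real phrase-based decoder also requires matching coverage vectors before recombining, and the function name `recombine` and the example hypotheses are hypothetical.

```python
def recombine(hypotheses, context_size):
    """Keep the best-scoring hypothesis for each distinct context.

    hypotheses: list of (words, score) pairs; higher scores are better.
    context_size: number of trailing words that make up the context.
    """
    best = {}
    for words, score in hypotheses:
        key = tuple(words[-context_size:]) if context_size > 0 else ()
        if key not in best or score > best[key][1]:
            best[key] = (words, score)
    return list(best.values())

hyps = [
    (("the", "red", "house", "is", "big"), -4.1),
    (("a", "red", "house", "is", "big"), -4.6),
    (("the", "house", "is", "big"), -5.0),
]

# Two words of context (as for a trigram model): all three hypotheses
# end in "is big" and collapse into one item.
survivors_short = recombine(hyps, 2)

# Five words of context: every hypothesis has a distinct context,
# so nothing is recombined.
survivors_long = recombine(hyps, 5)
```

With the short context a single item survives, whereas the long context keeps all three, which is the loss of recombination described above.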
Integrating a long-span language model naively
requires keeping context equivalent to the entire left
prefix of the translation, a setting which would permit
very little recombination. Instead of using inefficient
long-span contexts, we propose to maintain the
usual n-gram context and to keep a fixed number of
hidden layer configurations k at each decoder state.
This leads to a new split-state dynamic program
which splits each decoder state into at most k new
items, each with a separate hidden layer configura-
tion representing an unbounded history (Figure 2).
This maintains diversity in the explored translation
hypothesis space and preserves high-scoring hidden
layer configurations.
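The split-state dynamic program of Figure 2 can be sketched in runnable form as follows. This is an illustrative sketch, not the implementation used in the experiments: `rescore_lattice` and `toy_scorer` are hypothetical names, and the toy scorer merely stands in for the recurrent network by making each score depend on the full history (it penalizes repeated words).

```python
import heapq
from itertools import count

def rescore_lattice(k, nodes, edges, start, finals, score_rnn, initial_hidden):
    """Push-forward rescoring that keeps at most k split-states per node.

    nodes: lattice nodes in topological order.
    edges: dict mapping a node to a list of (next_node, phrase) arcs.
    score_rnn(hidden, phrase) -> (new_hidden, log_prob).
    Returns (best_score, words) or None if no final state is reachable.
    """
    tie = count()  # unique tie-breaker so the heap never compares hidden states
    heaps = {v: [] for v in nodes}
    # Each item: (score, tie, hidden, backpointer, phrase).
    heaps[start].append((0.0, next(tie), initial_hidden, None, ()))
    for v in nodes:
        for x, phrase in edges.get(v, []):
            for item in list(heaps[v]):
                score, _, hidden, _, _ = item
                new_hidden, log_prob = score_rnn(hidden, phrase)
                new_item = (score + log_prob, next(tie), new_hidden, item, phrase)
                if len(heaps[x]) < k:
                    heapq.heappush(heaps[x], new_item)
                elif heaps[x][0][0] < new_item[0]:
                    # Beam is full: replace the lowest-scoring split-state.
                    heapq.heapreplace(heaps[x], new_item)
    candidates = [item for t in finals for item in heaps[t]]
    if not candidates:
        return None
    best = max(candidates, key=lambda it: it[0])
    words, item = [], best
    while item is not None:  # follow backpointers to recover the translation
        words = list(item[4]) + words
        item = item[3]
    return best[0], words

def toy_scorer(seen, phrase):
    """Stand-in for the RNN: the state is the set of all words seen so far."""
    log_prob = 0.0
    for w in phrase:
        log_prob += -2.0 if w in seen else -1.0
        seen = seen | {w}
    return seen, log_prob

nodes = [0, 1, 2, 3]
edges = {
    0: [(1, ("the",)), (1, ("a",))],
    1: [(2, ("house",)), (2, ("the", "home"))],
    2: [(3, ("is", "big"))],
}
best_k1 = rescore_lattice(1, nodes, edges, 0, [3], toy_scorer, frozenset())
best_k2 = rescore_lattice(2, nodes, edges, 0, [3], toy_scorer, frozenset())
```

In this toy lattice both beam widths reach the same best score, mirroring the observation that small values of k lose little.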
What is the effect of this strategy? To answer
this question we measured translation accuracy for
various settings of k on our lattice rescoring setup
(see §4 for details). In the same experiment, we
compare lattices to n-best lists in terms of accuracy,
model score and wall time impact.4 The results (Table 1
and Figure 3) show that reranking accuracy on
lattices is not significantly better than on n-best lists; however,
rescoring lattices with k = 1 is much faster than rescoring n-best
lists. Similar observations have been made in previous
work on minimum error-rate training (Macherey
3Assuming a max-translation decision rule. In a minimum-risk
setting, we may assign the sum of the scores of all candidates
to the retained item.
4We measured running times on an HP z800 workstation
equipped with 24 GB main memory and two Xeon E5640
CPUs with four cores each, clocked at 2.66 GHz. All experi-
ments were run single-threaded.
                   BLEU   oracle  sec/sent
Baseline           28.25  -       0.173
100-best           28.90  37.22   0.470
1000-best          28.99  40.06   3.920
lattice (k = 1)    29.00  43.50   0.093
lattice (k = 10)   29.04  43.50   0.599
lattice (k = 100)  29.03  43.50   4.531
Table 1: Rescoring n-best lists and lattices with various
language model beam widths k. Accuracy is based on
the news2011 French-English task. Timing results are in
addition to the baseline.
Figure 3: BLEU vs. log probabilities of 1-best transla-
tions when rescoring n-best lists and lattices (cf. Table 1).
et al., 2008). The recurrent language model adds an
overhead of about 54% at k = 1 on top of the time
to produce the baseline 1-best output, a consider-
able but not necessarily prohibitive overhead. Larger
values of k return higher probability solutions, but
there is little impact on accuracy: the BLEU score
is nearly identical when retaining up to 100 histories
compared to keeping only the highest scoring.
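For reference, the overhead figure follows directly from the timings in Table 1, as does the speed advantage of lattice rescoring with k = 1 over 1000-best rescoring (the symbol t for seconds per sentence is introduced here only for this calculation):

```latex
\frac{t_{\text{lattice},\,k=1}}{t_{\text{baseline}}} = \frac{0.093}{0.173} \approx 0.54,
\qquad
\frac{t_{1000\text{-best}}}{t_{\text{lattice},\,k=1}} = \frac{3.920}{0.093} \approx 42
```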
While surprising at first, we believe that this ef-
fect is due to the high similarity of the translations
represented by the histories in the beam. Each history
represents a different translation but all translation
hypotheses share the same n-gram context, and,
more importantly, they are translations of the same
foreign words, since they have exactly the same cov-
erage vector. These commonalities are likely to re-
sult in similar recurrent histories, which in turn re-
duces the effect of aggressive pruning.
4 Language Model Experiments
Recurrent neural network language models have
previously only been used in n-best rescoring
settings and on small-scale tasks with baseline
language models trained on only 17.5m words
(Mikolov, 2012). We extend this work by experi-
menting on lattices using strong baselines with n-
gram models trained on over one billion words and
by evaluating on a number of language pairs.
4.1 Experimental Setup
Baseline. We experiment with an in-house phrase-
based system similar to Moses (Koehn et al.,
2003), scoring translations by a set of common fea-
tures including maximum likelihood estimates of
source given target mappings pMLE(e|f) and vice
versa pMLE(f|e), as well as lexical weighting es-
timates pLW (e|f) and pLW (f|e), word and phrase-
penalties, a linear distortion feature and a lexicalized
reordering feature. Log-linear weights are estimated
with minimum error rate training (Och, 2003).
Evaluation.
We use training and test data
from the WMT 2012 campaign and report results
on French-English, German-English and English-
German. Translation models are estimated on 102m
words of parallel data for French-English, 91m
words for German-English and English-German; be-
tween 3.5-5m words are newswire, depending on the
language pair, and the remainder are parliamentary
proceedings. The baseline systems use two 5-gram
modified Kneser-Ney language models; the first is
estimated on the target-side of the parallel data,
while the second is based on a large newswire corpus
released as part of the WMT campaign. For French-
English and German-English we use a language
model based on 1.15bn words, and for English-
German we train a model on 327m words. We eval-
uate on the newswire test sets from 2010-2011 con-
taining between 2034-3003 sentences. Log-linear
weights are estimated on the 2009 data set compris-
ing 2525 sentences. We rescore the lattices produced
by the baseline systems with an aggressive but effec-
tive context beam of k = 1 that did not harm accu-
racy in preliminary experiments (§3).
Neural Network Language Model. The vocab-
ularies of the language models are comprised of
the words in the training set after removing single-
tons. We obtain word-classes using a version of
Brown-Clustering with an additional regularization
term to optimize the runtime of the language model
(Brown et al., 1992; Zweig and Makarychev, 2013).
Direct connections use maximum entropy features
over unigrams, bigrams and trigrams (Mikolov et al.,
2011a). We use the standard settings for the model
with the default learning rate α = 0.1 that decays
exponentially if the validation set entropy does not
decrease after each epoch. Back propagation through
time computes error gradients over the past twenty
time steps. Training is stopped after 20 epochs or
when the validation entropy does not decrease over
two epochs. We experiment with varying training
data sizes and randomly draw the data from the same
corpora used for the baseline systems. Throughout,
we use a hidden layer size of 100 which provided a
good trade-off between time and accuracy in initial
experiments.
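The training schedule described above can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: the decay factor of 0.5, the choice to keep decaying in every epoch once decay has been triggered, and the callback names `run_epoch` and `validation_entropy`.

```python
def train_with_schedule(run_epoch, validation_entropy, lr=0.1, decay=0.5,
                        max_epochs=20, patience=2):
    """Run training epochs with learning-rate decay and early stopping.

    run_epoch(lr): trains for one epoch (hypothetical callback).
    validation_entropy(): returns the current validation entropy (hypothetical).
    Returns a list of (epoch, learning_rate, entropy) tuples.
    """
    best = float("inf")
    stalled = 0        # consecutive epochs without improvement
    decaying = False   # assumed: once triggered, decay continues every epoch
    history = []
    for epoch in range(max_epochs):
        run_epoch(lr)
        entropy = validation_entropy()
        history.append((epoch, lr, entropy))
        if entropy < best:
            best = entropy
            stalled = 0
        else:
            stalled += 1
            decaying = True
        if stalled >= patience:
            break      # no decrease over `patience` epochs: stop training
        if decaying:
            lr *= decay
    return history

# Simulated validation entropies; no real model is trained in this sketch.
entropies = iter([5.0, 4.0, 4.1, 3.9, 3.95, 3.96, 3.0])
history = train_with_schedule(run_epoch=lambda lr: None,
                              validation_entropy=lambda: next(entropies))
```

In the simulated run, decay starts after the third epoch and training stops after the sixth, when the entropy has failed to decrease for two epochs in a row.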
4.2 Results
Training times for neural networks can be a major
bottleneck. Recurrent architectures are particularly
hard to parallelize due to their inherent dependence
on the previous hidden layer configuration. One
straightforward way to influence training time is to
change the size of the training corpus.
Our results (Table 2, Table 3 and Table 4) show
that even small models trained on only two million
words significantly improve over the 1-best decoder
output (Baseline); this represents only 0.6 percent
of the data available to the n-gram model used by
the baseline. Models of this size can be trained in
only about 3.5 hours. A model trained on 50m words
took 63 hours to train. When paired with an n-gram
model trained on 25 times more data, accuracy im-
proved by up to 0.7 BLEU on French-English.
5 Joint Model Experiments
In the next set of experiments, we turn to the joint
language and translation model, an extension of the
recurrent neural network language model with ad-
ditional inputs for the foreign sentence. We first
introduce two continuous space representations of
the foreign sentence (§5.1). Using these represen-
tations we evaluate the accuracy of the joint model
in the lattice rescoring setup and compare against the
traditional translation channel model features (§5.2).
Next, we establish an upper bound on accuracy for
the joint model via an oracle experiment (§5.3). In-
spired by the results of the oracle experiment we
               dev   news2010  news2011  newssyscomb2011  Avg(test)
Baseline       26.6  27.6      28.3      27.5             27.8
+RNNLM (2m)    27.5  28.1      28.6      28.1             28.3
+RNNLM (50m)   27.7  28.2      29.0      28.1             28.5
Table 2: French-English results when rescoring with the recurrent neural network language model; the baseline relies
on an n-gram model trained on 1.15bn words.
               dev   news2010  news2011  newssyscomb2011  Avg(test)
Baseline       21.2  20.7      19.2      20.6             20.0
+RNNLM (2m)    21.8  20.9      19.4      20.9             20.3
+RNNLM (50m)   22.1  21.1      19.7      21.0             20.5
Table 3: German-English results when rescoring with the recurrent neural network language model.
               dev   news2010  news2011  newssyscomb2011  Avg(test)
Baseline       15.2  15.6      14.3      15.7             15.1
+RNNLM (2m)    15.7  15.9      14.6      16.0             15.4
+RNNLM (50m)   15.8  15.9      14.7      16.1             15.5
Table 4: English-German results when rescoring with the recurrent neural network language model; the baseline relies
on an n-gram model trained on 327m words.
train a transform between the source words and the
reference representations. This leads to the best re-
sults improving 1.5 BLEU over the 1-best decoder
output and adding 0.2 BLEU on average to the gains
achieved by the recurrent language model (§5.4).
Setup.
Conventional language models can be
trained on monolingual or bilingual data; however,
the joint model can only be trained on the latter.
In order to control for data size effects, we restrict
training of all models, including the baseline n-gram
model, to the target side of the parallel corpus, about
102m words for French-English. Furthermore we
train recurrent models only on the newswire portion
(about 3.5m words for training and 250k words for
validation) since initial experiments showed compa-
rable results to using the full parallel corpus, avail-
able to the baseline. This is reasonable since the test
data is newswire. Also, it allows for more rapid ex-
perimentation.
5.1 Foreign Sentence Representations
We represent foreign sentences either by latent se-
mantic analysis (LSA; Deerwester et al. 1990) or by
word encodings produced as a by-product of train-
ing the recurrent neural network language model on
the source words.
LSA is widely used for representing words and
documents in low-dimensional vector space. The
method applies reduced singular value decomposi-
tion (SVD) to a matrix M of word counts; in our
setting, rows represent sentences and columns rep-
resent foreign words. SVD reduces the number
of columns while preserving similarity among the
rows, effectively mapping from a high-dimensional
representation of a sentence, as a set of words, to a
low-dimensional set of concepts. The output of SVD
is an approximation of M by three matrices: T con-
tains single word representations, R represents full
sentences, and S is a diagonal scaling matrix:
$$M \approx T S R^{\top}$$
Given vocabulary V and n sentences, we construct
M as a matrix of size |V| × n. The ij-th entry is the
number of times word i occurs in sentence j, also
known as the term frequency value; the entry is also
weighted by the inverse document frequency, the rel-
ative importance of word i among all sentences, ex-
pressed as the negative logarithm of the fraction of
sentences in which word i occurs.
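The construction of the weighted count matrix can be sketched as below. The sketch follows the |V| × n orientation stated above (one row per word, one column per sentence), stops before the SVD step, and uses a hypothetical function name and toy sentences.

```python
import math
from collections import Counter

def tfidf_matrix(sentences):
    """Build the term-frequency matrix weighted by inverse document frequency.

    sentences: list of tokenized sentences (lists of words).
    Returns (vocab, M) where M[i][j] = tf(word i, sentence j) * idf(word i)
    and idf(word i) = -log(fraction of sentences containing word i).
    """
    vocab = sorted({w for s in sentences for w in s})
    n = len(sentences)
    # Count each word at most once per sentence for the document frequency.
    doc_freq = Counter(w for s in sentences for w in set(s))
    idf = {w: -math.log(doc_freq[w] / n) for w in vocab}
    counts = [Counter(s) for s in sentences]
    M = [[counts[j][w] * idf[w] for j in range(n)] for w in vocab]
    return vocab, M

sentences = [
    ["le", "chat", "dort"],
    ["le", "chien", "dort"],
    ["le", "chat", "mange"],
]
vocab, M = tfidf_matrix(sentences)
```

A word that occurs in every sentence, such as "le" here, receives weight zero, while a word confined to a single sentence receives the largest weight.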
As a second representation we use single word
embeddings implicitly learned by the input layer
weights U of the recurrent neural network language
model (§2), denoted as RNN. Each word is repre-
sented by a vector of size |hi|, the number of neu-
rons in the hidden layer; in our experiments, we
consider concatenations of individual word vectors
to represent foreign word contexts. These encodings
have previously been found to capture syntactic and
semantic regularities (Mikolov et al., 2013) and are
readily available in our experimental framework via
training a recurrent neural network language model
on the source-side of the parallel corpus.
5.2 Results
We first experiment with the two previously intro-
duced representations of the source-side sentence.
Table 5 shows the results compared to the 1-best de-
coder output and an RNN language model (target-
only). We first try LSA encodings of the entire
foreign sentence as 80 or 240 dimensional vectors
(sent-lsa-dim80, sent-lsa-dim240). Next, we experi-
ment with single-word RNN representations of slid-
ing word-windows in the hope of representing rel-
evant context more precisely. Word-windows are
constructed relative to the source words aligned to
the current target word, and individual word vec-
tors are concatenated into a single vector. We
first try contexts which do not include the aligned
source words, in the hope of capturing information
not already modeled by the channel models, start-
ing with the next five words (ww-rnn-dim50.n5),
the five previous and the next five words (ww-rnn-
dim50.p5n5) as well as the previous three words
(ww-rnn-dim50.p3). Next, we experiment with
word-windows of up to five aligned source words
(ww-rnn-dim50.c5). Finally, we try contexts based
on LSA word vectors (ww-lsa-dim50.n5, ww-lsa-
dim50.p3).5
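A word-window that excludes the aligned source words might be assembled as in the following sketch. The zero-padding at sentence boundaries and the function name `word_window` are assumptions for illustration; windows over the aligned words themselves (the c5 setting) are not covered.

```python
def word_window(source_vectors, aligned, prev=0, nxt=0):
    """Concatenate vectors of source words around the aligned span.

    source_vectors: one vector (list of floats) per source word.
    aligned: source positions aligned to the current target word.
    prev / nxt: number of words taken before / after the aligned span.
    Positions outside the sentence are padded with zeros (an assumption).
    """
    dim = len(source_vectors[0])
    zero = [0.0] * dim
    first, last = min(aligned), max(aligned)
    positions = (list(range(first - prev, first)) +
                 list(range(last + 1, last + 1 + nxt)))
    window = []
    for p in positions:
        in_range = 0 <= p < len(source_vectors)
        window.extend(source_vectors[p] if in_range else zero)
    return window

# Toy 2-dimensional word vectors for a four-word source sentence.
vectors = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]

# One previous and two next words around the word aligned at position 1.
window_mid = word_window(vectors, aligned=[1], prev=1, nxt=2)

# Next two words after the last source word: only padding remains.
window_end = word_window(vectors, aligned=[3], nxt=2)
```

The resulting vector has a fixed length of (prev + nxt) times the word vector size, which is what allows it to serve as the auxiliary input ft.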
While all models improve over the baseline, none
significantly outperforms the recurrent neural net-
work language model in terms of BLEU. However,
the perplexity results suggest that the models uti-
lize the foreign representations since all joint mod-
els improve vastly over the target-only language
5We ignore the coverage vector when determining word-
windows which risks including already translated words.
Building word-windows based on the coverage vector requires
additional state in a rescoring setting meant to be light-weight.
                        −p(e|f)   −p(e|f), −p(f|e)
Baseline without CM     24.0      22.5
+ target-only           24.5      22.6
+ sent-lsa-dim240       24.9      23.3
+ ww-rnn-dim50.n5       24.9      24.0
+ ww-rnn-dim50.p5n5     24.6      23.7
+ ww-rnn-dim50.p3       24.6      22.3
+ ww-rnn-dim50.c5       24.9      24.0
+ ww-lsa-dim50.n5       24.8      23.9
+ ww-lsa-dim50.p3       23.8      23.2
Table 6: Comparison of the joint model and the channel
model features (CM) by removing channel features
corresponding to −p(e|f) from the lattices, or both
directions −p(e|f), −p(f|e) and replacing them by various
joint models. We re-tuned the log-linear weights for
different feature-sets. Accuracy is based on the average
BLEU over news2010, newssyscomb2010, news2011.
model. The lowest perplexity is achieved by the
context covering the aligned source words (ww-rnn-
dim50.c5) since the source words are a better pre-
dictor of the target words than outside context.
The experiments so far measured if the joint
model can improve in addition to the four channel
model features used by the baseline, that is, the max-
imum likelihood and lexical translation features in
both translation directions. The joint model clearly
overlaps with these features, but how well does
the recurrent model perform compared against the
channel model features? To answer this question,
we removed channel model features corresponding
to the same translation direction as the joint model,
specifically pMLE(e|f) and pLW (e|f), from the lat-
tices and measured the effect of adding the joint
models.
The results (Table 6, column −p(e|f)) clearly
show that our joint models are competitive with the
channel model features by outperforming the orig-
inal baseline with all channel model features (24.7
BLEU) by 0.2 BLEU (ww-rnn-dim50.n5, ww-rnn-
dim50.c5). As a second experiment, we removed all
channel model features (column −p(e|f), −p(f|e)),
diminishing baseline accuracy to 22.5 BLEU. In this
setting, the best joint model is able to make up 1.5
of the 2.2 BLEU lost due to removal of the channel
                     dev   news2010  news2011  newssyscomb2010  Avg(test)  PPL
Baseline             24.3  24.4      25.1      24.3             24.7       341
target-only          25.1  25.1      26.4      25.0             25.6       218
sent-lsa-dim80       25.2  25.2      26.3      25.1             25.6       147
sent-lsa-dim240      25.1  25.0      26.2      24.9             25.4       126
ww-rnn-dim50.n5      24.9  25.0      26.3      24.8             25.4       61
ww-rnn-dim50.p5n5    25.0  24.8      26.2      24.7             25.3       59
ww-rnn-dim50.p3      25.1  25.1      26.5      24.9             25.6       143
ww-rnn-dim50.c5      24.8  24.9      26.0      24.8             25.3       16
ww-lsa-dim50.n5      25.0  25.0      26.2      24.8             25.4       76
ww-lsa-dim50.p3      25.1  25.1      26.5      24.9             25.6       151
Table 5: Translation accuracy of the joint model with various encodings of the foreign sentence measured on the
French-English task. Perplexity (PPL) is based on news2011.
model features, while modeling only a single translation
direction. This setup also shows the negligible
effect of the target-only language model in the absence
of translation scores, whereas the joint models
are much more effective since they do model translation.
Overall, the best joint models prove very competitive
with the traditional channel features.
5.3 Oracle Experiment
The previous section examined the effect of a set
of basic foreign sentence representations. Although
we find some benefit from these representations, the
differences are not large. One might naturally ask
whether there is greater potential upside from this
channel model. Therefore we turn to measuring the
upper bound on accuracy for the joint approach as a
whole.
Specifically, we would like to find a bound on ac-
curacy given an ideal representation of the source
sentence. To answer this question, we conducted an
experiment where the joint model has access to an
LSA representation of the reference translation.
Table 7 shows that the joint approach has an ora-
cle accuracy of up to 4.3 BLEU above the baseline.
This clearly confirms that the joint approach can ex-
ploit the additional information to improve BLEU,
given a good enough representation of the foreign
sentence. In terms of perplexity, we see an improve-
ment of up to 65% over the target-only model. It
should be noted that since LSA representations are
computed on reference words, perplexity no longer
has its standard meaning.
                           BLEU  PPL
Baseline                   25.2  341
target-only                26.4  218
oracle (sent-lsa-dim40)    27.7  124
oracle (sent-lsa-dim80)    28.5  103
oracle (sent-lsa-dim160)   29.0  86
oracle (sent-lsa-dim240)   29.5  76
Table 7: Oracle accuracy of the joint model when using
an LSA encoding of the references, measured on the
news2011 French-English task.
5.4 Target Language Projections
Our experiments so far showed that joint models
based on direct representations of the source words
are very competitive with the traditional channel models
(§5.2). However, these experiments have not
shown any improvements over the normal recurrent
neural network language model. The previous sec-
tion demonstrated that good representations can lead
to substantial gains (§5.3). In order to bridge the gap,
we propose to learn a separate transform from the
foreign words to an encoding of the reference target
words, thus making the source-side representations
look more like the target-side encodings used in the
oracle experiment.
Specifically, we learn a linear transform
dθ : x → r mapping directly from a vector en-
coding of the foreign sentence x to an l-dimensional
LSA representation r of the reference sentence. At
test and training time we apply dθ to the foreign
words and use the transformation instead of a direct
                 dev   news2010  news2011  newssyscomb2010  Avg(test)  PPL
Baseline         24.3  24.4      25.1      24.3             24.7       341
target-only      25.1  25.1      26.4      25.0             25.6       218
proj-lsa-dim40   25.1  25.3      26.5      25.2             25.8       145
proj-lsa-dim80   25.1  25.3      26.6      25.2             25.8       134
Table 8: Translation accuracy of the joint model with a source-target transform, measured on the French-English task.
Perplexity (PPL) is based on news2011; differences to target-only are significant at the p < 0.001 level.
source-side representation.
The transform models all foreign words in the par-
allel corpus except singletons, which are collapsed
into a unique class, similar to the recurrent neural
network language model. We train the transform to
minimize the squared error with respect to the ref-
erence LSA vector using an SGD online learner:
$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{n} \left( r_i - d_{\theta}(x_i) \right)^{2} \qquad (1)$$
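As a rough illustration of Equation (1), a linear transform over sparse features can be fit with online SGD as sketched below. The names `predict` and `sgd_fit`, the learning rate and the toy data are hypothetical, and a constant learning rate is used throughout.

```python
def predict(theta, x, out_dim):
    """Apply the linear transform to a sparse feature vector x (id -> value)."""
    r = [0.0] * out_dim
    for fid, val in x.items():
        row = theta.get(fid)
        if row is not None:
            for j in range(out_dim):
                r[j] += row[j] * val
    return r

def sgd_fit(examples, out_dim, lr=0.1, epochs=100):
    """Minimize the squared error between d_theta(x_i) and r_i with online SGD.

    examples: list of (x, r) pairs; x is a sparse dict, r a list of out_dim floats.
    Returns theta as a dict mapping feature id -> list of out_dim weights.
    """
    theta = {}
    for _ in range(epochs):
        for x, r in examples:
            pred = predict(theta, x, out_dim)
            err = [p - t for p, t in zip(pred, r)]
            for fid, val in x.items():
                row = theta.setdefault(fid, [0.0] * out_dim)
                for j in range(out_dim):
                    # Gradient of the squared error with respect to row[j].
                    row[j] -= lr * 2.0 * err[j] * val
    return theta

# Two toy sentences with disjoint features and 2-dimensional targets.
examples = [
    ({0: 1.0}, [1.0, 0.0]),
    ({1: 1.0}, [0.0, 2.0]),
]
theta = sgd_fit(examples, out_dim=2)
fitted = predict(theta, {1: 1.0}, 2)
```

Storing weights only for features that actually occur keeps the sketch small; a real setting with hashed feature ids could use a dense array instead.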
We found a simple constant learning rate, tuned
on the validation data, to be as effective as schedules
based on constant decay or on reducing the learning
rate when the validation error increased. Our
feature set includes unigram and bigram word features.
The value of a unigram feature is simply its
count in the sentence; a bigram feature receives
a weight of its count divided by two
to help prevent overfitting. The vector for each
sentence is then divided by its L2 norm. Both weighting
and normalization led to substantial improvements
in test set error. More complex features such
as skip-bigrams, trigrams and character n-grams did
not yield any significant improvements. Even this
simple representation of sentences comprises a large
number of distinct features, and so we resorted to feature
hashing, computing feature ids as the least significant
20 bits of each feature name. Our best transform
achieved a cosine similarity of 0.816 on the
training data, 0.757 on the validation data, and 0.749
on news2011.
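The feature extraction described above can be sketched as follows. The weighting (unigram count, bigram count divided by two), the L2 normalization and the 20-bit feature ids follow the text; the use of CRC32 as the hash, the "u:" and "b:" name prefixes and all function names are assumptions made only for illustration.

```python
import math
import zlib
from collections import Counter

HASH_BITS = 20
MASK = (1 << HASH_BITS) - 1  # keeps the least significant 20 bits

def feature_id(name):
    """Map a feature name to a 20-bit id (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(name.encode("utf-8")) & MASK

def sentence_features(tokens):
    """Return a sparse, L2-normalised vector {feature_id: weight} for one sentence."""
    counts = Counter()
    for w in tokens:
        counts["u:" + w] += 1.0            # unigram feature: raw count
    for a, b in zip(tokens, tokens[1:]):
        counts["b:" + a + "_" + b] += 0.5  # bigram feature: count divided by two
    vec = {}
    for name, value in counts.items():
        fid = feature_id(name)
        vec[fid] = vec.get(fid, 0.0) + value  # colliding features share one id
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm > 0.0:
        vec = {k: v / norm for k, v in vec.items()}
    return vec

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Hashing bounds the input dimension of the transform at 2^20 regardless of vocabulary size, at the cost of occasional collisions between unrelated features.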
The results (Table 8) show that the transform im-
proves over the recurrent neural network language
model on all test sets and by 0.2 BLEU on average.
We verified significance over the target-only model
using paired bootstrap resampling (Koehn, 2004)
over all test sets (7526 sentences) at the p < 0.001
level. Overall, we improve accuracy by up to 1.5
BLEU and by 1.1 BLEU on average across all test
sets over the decoder 1-best with our joint language
and translation model.
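For readers unfamiliar with paired bootstrap resampling, a minimal sketch of the general procedure is given below. It is not the evaluation code used for Table 8: the corpus-level metric is a simple matched/total ratio standing in for BLEU, and the function names, sample count and per-sentence statistics are hypothetical.

```python
import random

def corpus_score(stats):
    """Corpus-level score from per-sentence (matched, total) pairs.

    This ratio is only a stand-in for BLEU (an assumption of the sketch).
    """
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total if total else 0.0

def paired_bootstrap(stats_a, stats_b, samples=1000, seed=0):
    """Estimate how often system A fails to beat system B on resampled test sets.

    Both systems are scored on the same resampled sentence indices, which is
    what makes the test paired. Returns an approximate p-value.
    """
    assert len(stats_a) == len(stats_b)
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        score_a = corpus_score([stats_a[i] for i in idx])
        score_b = corpus_score([stats_b[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return 1.0 - wins / samples
```

A returned value below 0.001 would correspond to system A winning on more than 99.9% of the resampled test sets.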
6 Related Work
Our approach of combining language and translation
modeling is very much in line with recent work on
n-gram-based translation models (Crego and Yvon,
2010), and more recently continuous space-based
translation models (Le et al., 2012a; Gao et al.,
2013). The joint model presented in this paper dif-
fers in a number of key aspects: we use a recur-
rent architecture representing an unbounded history
of both source and target words, rather than a feed-
forward style network. Feed-forward networks and
n-gram models have a finite history which makes
predictions independent of anything but a small history
of words. Furthermore, we model only the target side,
which differs from previous work that models
both sides.
We introduced a new algorithm to tackle lattice
rescoring with an unbounded model. The auto-
matic speech recognition community has previously
addressed this issue by either approximating long-
span language models via simpler but more tractable
models (Deoras et al., 2011b), or by identifying con-
fusable subsets of the lattice from which n-best lists
are constructed and rescored (Deoras et al., 2011a).
We extend their work by directly mapping a recur-
rent neural network model onto the structure of the
lattice, rescoring all states instead of focusing only
on subsets.
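The idea of mapping a recurrent model onto the lattice can be illustrated with a simplified sketch in which every lattice state keeps a small beam of recurrent histories. This is one possible reading, not the exact algorithm of this paper: the lattice encoding, the `step` callback, the toy scoring function and the assumption that states are numbered in topological order are all hypothetical.

```python
def rescore_lattice(edges, num_nodes, step, init_hidden, beam=2):
    """Rescore a lattice while keeping up to `beam` recurrent histories per state.

    edges: {node: [(next_node, word, base_score), ...]} with next_node > node,
           i.e. nodes are assumed to be numbered in topological order.
    step:  callback (hidden, word) -> (new_hidden, log_prob) for the recurrent model.
    Returns the best (score, hidden, words) tuple at the final node, or None.
    """
    hyps = [[] for _ in range(num_nodes)]
    hyps[0] = [(0.0, init_hidden, ())]
    for node in range(num_nodes):
        # Prune: keep only the best `beam` histories that reached this state.
        hyps[node] = sorted(hyps[node], key=lambda h: h[0], reverse=True)[:beam]
        for nxt, word, base in edges.get(node, []):
            for score, hidden, words in hyps[node]:
                new_hidden, logp = step(hidden, word)
                hyps[nxt].append((score + base + logp, new_hidden, words + (word,)))
    final = hyps[num_nodes - 1]
    return final[0] if final else None

def toy_step(hidden, word):
    """Toy stand-in for a recurrent model: the hidden state is the previous word,
    and repeating that word is penalised."""
    return word, (-1.0 if word == hidden else 0.0)

# Two-step lattice with two competing words on each step.
lattice = {0: [(1, "a", 0.0), (1, "b", -0.2)],
           1: [(2, "a", 0.0), (2, "b", -0.1)]}
best = rescore_lattice(lattice, num_nodes=3, step=toy_step, init_hidden=None)
```

Here the path "a b" wins because the toy model penalises the repetition in "a a", even though that path has the best lattice scores. Without pruning, the number of distinct histories per state could grow with the number of paths into it, which is why the beam is needed for an unbounded-history model.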
7 Conclusion and Future Work
Joint language and translation modeling with recur-
rent neural networks leads to substantial gains over
the 1-best decoder output, raising accuracy by up
to 1.5 BLEU and by 1.1 BLEU on average across
several test sets. The joint approach also improves
over the gains of the recurrent neural network lan-
guage model, adding 0.2 BLEU on average across
several test sets. Our models are competitive with the
traditional channel models, outperforming them in a
head-to-head comparison.
Furthermore, we tackled the issue of lattice
rescoring with an unbounded recurrent model by
means of a novel algorithm that keeps a beam of re-
current histories. Finally, we have shown that the
recurrent neural network language model can sig-
nificantly improve over n-gram baselines across a
range of language pairs, even when the baselines
were trained on 575 times more data.
In future work we plan to directly learn represen-
tations of the source-side during training of the joint
model. Thus, the model itself can decide which en-
coding is best for the task. We also plan to change
the cross entropy objective to a BLEU-inspired ob-
jective in a discriminative training regime, which we
hope to be more effective. We would also like to ap-
ply recent advances in tackling the vanishing gradi-
ent problem (Pascanu et al., 2013) using a regular-
ization term to maintain the magnitude of the gradi-
ents during back propagation through time. Finally,
we would like to integrate the recurrent model di-
rectly into first-pass decoding, a straightforward ex-
tension of lattice rescoring using the algorithm we
developed.
Acknowledgments
We would like to thank Anthony Aue, Hany Has-
san Awadalla, Jon Clark, Li Deng, Sauleh Eetemadi,
Jianfeng Gao, Qin Gao, Xiaodong He, Will Lewis,
Arul Menezes, and Kristina Toutanova for helpful
discussions related to this work as well as for com-
ments on previous drafts. We would also like to
thank the anonymous reviewers for their comments.
References
Alexandre Allauzen, Hélène Bonneau-Maynard, Hai-Son
Le, Aurélien Max, Guillaume Wisniewski, François
Yvon, Gilles Adda, Josep Maria Crego, Adrien
Lardilleux, Thomas Lavergne, and Artem Sokolov.
2011. LIMSI @ WMT11. In Proc. of WMT, pages
309–315, Edinburgh, Scotland, July. Association for
Computational Linguistics.
Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhu-
vana Ramabhadran. 2012. Deep Neural Network
Language Models. In NAACL-HLT Workshop on the
Future of Language Modeling for HLT, pages 20–28,
Stroudsburg, PA, USA. Association for Computational
Linguistics.
Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vin-
cent J. Della Pietra, and Jenifer C. Lai. 1992. Class-
based n-gram models of natural language. Computa-
tional Linguistics, 18(4):467–479, Dec.
Josep Crego and François Yvon. 2010. Factored bilingual
n-gram language models for statistical machine trans-
lation. Machine Translation, 24(2):159–175.
Scott Deerwester, Susan T. Dumais, George W. Furnas,
Thomas K. Landauer, and Richard Harshman. 1990.
Indexing by Latent Semantic Analysis. Journal of the
American Society for Information Science, 41(6):391–
407.
Anoop Deoras, Tomáš Mikolov, and Kenneth Church.
2011a. A Fast Re-scoring Strategy to Capture Long-
Distance Dependencies. In Proc. of EMNLP, pages
1116–1127, Stroudsburg, PA, USA, July. Association
for Computational Linguistics.
Anoop Deoras, Tomáš Mikolov, Stefan Kombrink,
Martin Karafiát, and Sanjeev Khudanpur. 2011b. Variational
Approximation of Long-Span Language Models
for LVCSR. In Proc. of ICASSP, pages 5532–5535.
Ahmad Emami and Frederick Jelinek. 2005. A Neural
Syntactic Language Model. Machine Learning, 60(1-
3):195–227, September.
Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng.
2013. Learning Semantic Representations for the
Phrase Translation Model. Technical Report MSR-
TR-2013-88, Microsoft Research, September.
Joshua Goodman. 2001. Classes for Fast Maximum En-
tropy Training. In Proc. of ICASSP.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical Phrase-Based Translation. In Proc.
of HLT-NAACL, pages 127–133, Edmonton, Canada,
May.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran, Richard
Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin,
and Evan Herbst. 2007. Moses: Open Source
Toolkit for Statistical Machine Translation. In Proc.
of ACL Demo and Poster Sessions, pages 177–180,
Prague, Czech Republic, Jun.
Philipp Koehn. 2004. Statistical Significance Tests for
Machine Translation Evaluation. In Proc. of EMNLP,
pages 388–395, Barcelona, Spain, Jul.
Hai-Son Le, Alexandre Allauzen, and François Yvon.
2012a. Continuous Space Translation Models with
Neural Networks. In Proc. of HLT-NAACL, pages 39–
48, Montréal, Canada. Association for Computational
Linguistics.
Hai-Son Le, Thomas Lavergne, Alexandre Allauzen,
Marianna Apidianaki, Li Gong, Aurélien Max, Artem
Sokolov, Guillaume Wisniewski, and François Yvon.
2012b. LIMSI @ WMT12. In Proc. of WMT, pages
330–337, Montréal, Canada, June. Association for
Computational Linguistics.
Wolfgang Macherey, Franz Josef Och, Ignacio Thayer,
and Jakob Uszkoreit. 2008. Lattice-based Minimum
Error Rate Training for Statistical Machine Transla-
tion. In Proc. of EMNLP, pages 725–734, Strouds-
burg, PA, USA. Association for Computational Lin-
guistics.
Tomáš Mikolov and Geoffrey Zweig. 2012. Con-
text Dependent Recurrent Neural Network Language
Model. In Proc. of Spoken Language Technologies
(SLT), pages 234–239, Dec.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký,
and Sanjeev Khudanpur. 2010. Recurrent
Neural Network based Language Model. In Proc. of
INTERSPEECH, pages 1045–1048.
Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš
Burget, and Jan Černocký. 2011a. Strategies for
Training Large Scale Neural Network Language Mod-
els. In Proc. of ASRU, pages 196–201.
Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký,
and Sanjeev Khudanpur. 2011b. Extensions
of Recurrent Neural Network Language Model.
In Proc. of ICASSP, pages 5528–5531.
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig.
2013. Linguistic Regularities in Continuous Space
Word Representations. In Proc. of NAACL, pages
746–751, Stroudsburg, PA, USA, June. Association
for Computational Linguistics.
Tomáš Mikolov. 2012. Statistical Language Models
based on Neural Networks. Ph.D. thesis, Brno Uni-
versity of Technology.
Franz Josef Och. 2003. Minimum Error Rate Training
in Statistical Machine Translation. In Proc. of ACL,
pages 160–167, Sapporo, Japan, July.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio.
2013. On the difficulty of training Recurrent Neural
Networks. Proc. of ICML, abs/1211.5063.
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J.
Williams. 1986. Learning Internal Representations
by Error Propagation. In Symposium on Parallel and
Distributed Processing.
Holger Schwenk, Anthony Rousseau, and Mohammed
Attik. 2012. Large, Pruned or Continuous Space Lan-
guage Models on a GPU for Statistical Machine Trans-
lation. In NAACL-HLT Workshop on the Future of
Language Modeling for HLT, pages 11–19. Associa-
tion for Computational Linguistics.
Martin Sundermeyer, Ilya Oparin, Jean-Luc Gauvain,
Ben Freiberg, Ralf Schlüter, and Hermann Ney. 2013.
Comparison of Feedforward and Recurrent Neural
Network Language Models. In IEEE International
Conference on Acoustics, Speech, and Signal Process-
ing, pages 8430–8434, Vancouver, Canada, May.
Geoff Zweig and Konstantin Makarychev. 2013. Speed
Regularization and Optimality in Word Classing. In
Proc. of ICASSP.