THE_URL:https//www.semanta.nl/neural machine translation

So, going beyond the traditional statistical model for machine translation: more recently, people have started to develop a series of machine translation models based on deep learning, which we call neural machine translation. The goal of this approach is to build a single, large neural network that can read in a sentence and then output a translation directly. This is very different from the phrase-based systems, which consist of many different modules.

This neural translation model is usually based on an encoder-decoder framework. The encoder, typically a recurrent neural network, reads and encodes the source sentence into a fixed-length continuous vector. That vector is then fed into the decoder, which is also based on a recurrent network and outputs a variable-length translation from the encoded vector. The encoder and decoder are jointly trained on the training data we showed before, that is, on translation pairs between the two languages, and the whole encoder-decoder is trained to optimize the likelihood of the target sentence given the input sentence. We will show some details of this kind of model next.

Here is a typical example of the encoder-decoder model for translation; we usually call this a sequence-to-sequence model. Suppose the input sentence A B C is a source sentence in another language, say French. On the encoder side, a recurrent network such as an RNN or a long short-term memory network (LSTM) is used to read in the sentence word by word. At the end of the sentence, the model has encoded the whole sentence as a vector, which we call W here. W is then passed as input to the decoder, which is also an RNN. At each step the decoder outputs one English word, like X, then Y, then Z, and at each step the information carried over from the encoded vector is transmitted through the decoder's hidden layer. Meanwhile, the word output by the decoder at the previous step, say X, is fed back as input to the same decoder to generate the next word, Y. This process keeps going until the decoder generates a special end-of-sentence symbol, which finishes the whole translation process. In practice, people have found different ways to read the source sentence; for example, some found that reading the source sentence in reverse order rather than the original order gives even better results. The whole model is trained on parallel text and optimized for the target likelihood using stochastic gradient descent.

Now let's talk about the details of the recurrent network and why encoding a whole sentence can be a problem. In theory, the recurrent network can store in its hidden layer h all the information input to the model so far, because at every time step it inherits information from the previous time step's hidden layer plus the input at the current time step. As we can see here, the recurrent network always takes the hidden state h from the previous time step plus the input at time t. In theory, this means the model can carry over all the information about the whole sentence once we reach the end of the sentence. In practice, however, a standard recurrent neural network cannot capture very long-distance dependencies, because the hidden layer is updated at every time step and may start to forget information that was input at the very beginning of the sentence.
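To make the recurrence concrete, here is a minimal NumPy sketch of the encoder just described: the hidden state is updated from the previous hidden state and the current word, and the final state serves as the sentence vector W. The names, dimensions, and toy vocabulary are illustrative assumptions, not the exact model from the talk.

```python
# Minimal sketch of the vanilla RNN encoder recurrence described above:
# h_t = tanh(W_x x_t + W_h h_{t-1} + b), reading the source sentence word by
# word and keeping only the final hidden state as the sentence vector "W".
# All sizes and the toy vocabulary below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16

vocab = {"A": 0, "B": 1, "C": 2, "<eos>": 3}        # toy source vocabulary
E   = rng.normal(0, 0.1, (len(vocab), embed_dim))   # word embeddings
W_x = rng.normal(0, 0.1, (hidden_dim, embed_dim))   # input-to-hidden weights
W_h = rng.normal(0, 0.1, (hidden_dim, hidden_dim))  # hidden-to-hidden weights
b   = np.zeros(hidden_dim)

def encode(words):
    """Read the sentence one word at a time; return the final hidden state."""
    h = np.zeros(hidden_dim)
    for w in words:
        x = E[vocab[w]]
        h = np.tanh(W_x @ x + W_h @ h + b)  # new h depends on previous h and input x
    return h

# Reading in reverse order (as mentioned above) is just encode(reversed(words)).
sentence_vector = encode(["A", "B", "C", "<eos>"])
print(sentence_vector.shape)                # (16,) -- the encoded vector "W"
```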
To solve this issue, people invented a new kind of recurrent network called long short-term memory, or LSTM for short. This model uses a gating mechanism that helps keep the key information input at an early stage and capture these long dependencies across the whole sentence. Here is an illustration of the memory cell in a long short-term memory network, which was invented almost 20 years ago. There is a special cell in this model, call it Ct, and the value of Ct is decided by multiple gates. One gate, called the forget gate, decides how much of the information in the cell from the previous step should be kept. Another gate, the input gate, decides how much information from the input should be taken into the cell, that is, how much the cell should be updated based on the current observation. A third gate, the output gate, decides how much of this information should be exposed to influence the output at this time step. These three gates work together to decide how the memory is kept and how it is used to generate the output. More recently, people have invented other gating mechanisms that are more efficient; for example, the GRU, or Gated Recurrent Unit, was introduced as a simpler yet effective way of keeping important information in the hidden layers, and it also works very well on this kind of machine translation task.

More recently, people have also been motivated by what is examined in the classical machine translation process and have started to invent attention mechanisms. The idea is that we should encode the input sentence not just as one single vector at the end; rather, we should encode the whole input sentence as a sequence of vectors and keep the information at each time step, that is, at each source word. By doing this we keep more information, and then in the decoding process we can flexibly look back and retrieve the corresponding vector at different locations, which helps us make better decisions as we generate the translation word by word on the output side.

Here is an example of the attention model. Y is the generation part, the decoder on top, and X on the bottom is the source input sentence. As you can see, on the source side each word is encoded by its corresponding hidden layers; the figure uses a bidirectional LSTM, so each word X is encoded by two hidden layers, and we concatenate them together to form the vector for that particular word at that particular position. Now look at the output side, the Y part on top. Every time we try to decide the output at time t, say Yt, we look at all these vectors on the X side and compute which of the X words are more relevant for predicting Yt. We use the decoder's hidden layer at time t so far as a query, or reference, to check against all the X words on the source side and assign different weights to different source words. We call these attention weights, alpha 1 through alpha T, and we then compute a weighted sum of the source-side representation vectors.
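As a concrete illustration of the three gates, here is a minimal NumPy sketch of a single LSTM memory-cell step. The parameter names, dimensions, and dictionary layout are assumptions made for readability, not the exact formulation in the talk's figure.

```python
# Minimal sketch of one LSTM memory-cell step with the forget, input, and
# output gates described above. Weight names and sizes are illustrative
# assumptions; real implementations usually stack these matrices together.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One time step. W, U, b are dicts keyed by 'f', 'i', 'o', 'g' holding the
    parameters for the three gates and the candidate cell update."""
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate: keep how much of c_prev?
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate: take in how much new info?
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate: expose how much of the cell?
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])  # candidate cell update
    c = f * c_prev + i * g                              # new cell memory C_t
    h = o * np.tanh(c)                                  # new hidden state / output
    return h, c

# Toy usage with random parameters.
rng = np.random.default_rng(0)
d_x, d_h = 8, 16
W = {k: rng.normal(0, 0.1, (d_h, d_x)) for k in "fiog"}
U = {k: rng.normal(0, 0.1, (d_h, d_h)) for k in "fiog"}
b = {k: np.zeros(d_h) for k in "fiog"}
h, c = lstm_cell_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, U, b)
print(h.shape, c.shape)   # (16,) (16,)
```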
This weighted sum captures the information from the source side and pays more attention to the source words that are most relevant to the word Yt we are about to generate. By doing this we obtain richer context information from the sentence representation, which helps us generate Yt on the decoder side more effectively. Putting all of this together, we can build a single neural machine translation model that reaches state-of-the-art performance compared with the previous statistical machine translation systems.
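To make the attention step concrete, here is a minimal NumPy sketch: the decoder state at time t acts as a query against the source annotation vectors, producing the attention weights alpha and the context vector as their weighted sum. A simple dot-product score is used here as a simplifying assumption; the attention model described above typically scores relevance with a small feed-forward network over the decoder state and each source annotation.

```python
# Minimal sketch of the attention step described above: score each source
# annotation against the current decoder state, normalize into attention
# weights alpha_1..alpha_T, and take their weighted sum as the context vector.
# The dot-product score and the dimensions are assumptions for illustration.
import numpy as np

def attention(decoder_state, source_annotations):
    """decoder_state: (d,), source_annotations: (T, d) -> (context, alphas)."""
    scores = source_annotations @ decoder_state      # relevance of each source word
    scores -= scores.max()                           # numerical stability for softmax
    alphas = np.exp(scores) / np.exp(scores).sum()   # attention weights, sum to 1
    context = alphas @ source_annotations            # weighted sum of source vectors
    return context, alphas

rng = np.random.default_rng(0)
annotations = rng.normal(size=(5, 16))  # one annotation per source word, e.g. the
                                        # concatenated forward/backward encoder states
query = rng.normal(size=16)             # decoder hidden state at time t
context, alphas = attention(query, annotations)
print(alphas.round(2), context.shape)   # weights over 5 source words; context is (16,)
```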