THE_URL:https//www.semanta.nl/neural machine translation

So, going beyond the traditional statistical model for machine translation: more recently, people have started to develop a series of machine translation models based on deep learning, which we call neural machine translation. The goal of this approach is to build a single, large neural network that can read in a sentence and then output a translation directly. This is very different from the phrase-based systems, which consist of many different modules.

This neural translation model is usually based on an encoder-decoder framework. The encoder, typically a recurrent neural network, reads and encodes the source sentence into a fixed-length continuous vector. That vector is then fed into the decoder, which is also based on a recurrent network and outputs a variable-length translation from the encoded vector. The encoder and decoder are jointly trained on the training data we showed before, that is, on translation pairs between the two languages, and the whole encoder-decoder is trained to optimize the likelihood of the target sentence given the input sentence. We will show some details of this kind of model next.

Here is a typical example of the encoder-decoder model for translation; we usually call this a sequence-to-sequence model. Suppose the input sentence A B C is a source sentence in another language, say French. On the encoder side, a recurrent network such as an RNN or a long short-term memory network (LSTM) is used to read in the sentence word by word. At the end of the sentence, the model has encoded the whole sentence as a vector, which we call W here. W is then passed as input to the decoder, which is also an RNN. At each step the decoder outputs one English word, like X, then Y, then Z, and at each step the information carried over from the encoded vector is transmitted through the decoder's hidden layer. Meanwhile, the word output by the decoder at the previous step, say X, is fed back as input to the same decoder to generate the next word, Y. This process keeps going until the decoder generates a special end-of-sentence symbol, which finishes the whole translation process. In practice, people have found different ways to read the source sentence; for example, some found that reading the source sentence in reverse order rather than the original order gives even better results. The whole model is trained on parallel text and optimized for the target likelihood using stochastic gradient descent.

Now let's talk about the details of the recurrent network and why encoding a whole sentence can be a problem. In theory, the recurrent network can store in its hidden layer h all the information input to the model so far, because at every time step it inherits information from the previous time step's hidden layer plus the input at the current time step. As we can see here, the recurrent network always takes the hidden state h from the previous time step plus the input at time t. In theory, this means the model can carry over all the information about the whole sentence once we reach the end of the sentence. In practice, however, a standard recurrent neural network cannot capture very long-distance dependencies, because the hidden layer is updated at every time step and may start to forget information that was input at the very beginning of the sentence.
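To make the recurrence concrete, here is a minimal NumPy sketch of the encoder just described: the hidden state is updated from the previous hidden state and the current word, and the final state serves as the sentence vector W. The names, dimensions, and toy vocabulary are illustrative assumptions, not the exact model from the talk.

```python
# Minimal sketch of the vanilla RNN encoder recurrence described above:
# h_t = tanh(W_x x_t + W_h h_{t-1} + b), reading the source sentence word by
# word and keeping only the final hidden state as the sentence vector "W".
# All sizes and the toy vocabulary below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16

vocab = {"A": 0, "B": 1, "C": 2, "<eos>": 3}        # toy source vocabulary
E   = rng.normal(0, 0.1, (len(vocab), embed_dim))   # word embeddings
W_x = rng.normal(0, 0.1, (hidden_dim, embed_dim))   # input-to-hidden weights
W_h = rng.normal(0, 0.1, (hidden_dim, hidden_dim))  # hidden-to-hidden weights
b   = np.zeros(hidden_dim)

def encode(words):
    """Read the sentence one word at a time; return the final hidden state."""
    h = np.zeros(hidden_dim)
    for w in words:
        x = E[vocab[w]]
        h = np.tanh(W_x @ x + W_h @ h + b)  # new h depends on previous h and input x
    return h

# Reading in reverse order (as mentioned above) is just encode(reversed(words)).
sentence_vector = encode(["A", "B", "C", "<eos>"])
print(sentence_vector.shape)                # (16,) -- the encoded vector "W"
```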
To solve this issue, people invented a new kind of recurrent network called long short-term memory, or LSTM for short. This model uses a gating mechanism that helps keep the key information input at an early stage and capture these long dependencies across the whole sentence. Here is an illustration of the memory cell in a long short-term memory network, which was invented almost 20 years ago. There is a special cell in this model, call it Ct, and the value of Ct is decided by multiple gates. One gate, called the forget gate, decides how much of the information in the cell from the previous step should be kept. Another gate, the input gate, decides how much information from the input should be taken into the cell, that is, how much the cell should be updated based on the current observation. A third gate, the output gate, decides how much of this information should be exposed to influence the output at this time step. These three gates work together to decide how the memory is kept and how it is used to generate the output. More recently, people have invented other gating mechanisms that are more efficient; for example, the GRU, or Gated Recurrent Unit, was introduced as a simpler yet effective way of keeping important information in the hidden layers, and it also works very well on this kind of machine translation task.

More recently, people have also been motivated by what is examined in the classical machine translation process and have started to invent attention mechanisms. The idea is that we should encode the input sentence not just as one single vector at the end; rather, we should encode the whole input sentence as a sequence of vectors and keep the information at each time step, that is, at each source word. By doing this we keep more information, and then in the decoding process we can flexibly look back and retrieve the corresponding vector at different locations, which helps us make better decisions as we generate the translation word by word on the output side.

Here is an example of the attention model. Y is the generation part, the decoder on top, and X on the bottom is the source input sentence. As you can see, on the source side each word is encoded by its corresponding hidden layers; the figure uses a bidirectional LSTM, so each word X is encoded by two hidden layers, and we concatenate them together to form the vector for that particular word at that particular position. Now look at the output side, the Y part on top. Every time we try to decide the output at time t, say Yt, we look at all these vectors on the X side and compute which of the X words are more relevant for predicting Yt. We use the decoder's hidden layer at time t so far as a query, or reference, to check against all the X words on the source side and assign different weights to different source words. We call these attention weights, alpha 1 through alpha T, and we then compute a weighted sum of the source-side representation vectors.
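As a concrete illustration of the three gates, here is a minimal NumPy sketch of a single LSTM memory-cell step. The parameter names, dimensions, and dictionary layout are assumptions made for readability, not the exact formulation in the talk's figure.

```python
# Minimal sketch of one LSTM memory-cell step with the forget, input, and
# output gates described above. Weight names and sizes are illustrative
# assumptions; real implementations usually stack these matrices together.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One time step. W, U, b are dicts keyed by 'f', 'i', 'o', 'g' holding the
    parameters for the three gates and the candidate cell update."""
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate: keep how much of c_prev?
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate: take in how much new info?
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate: expose how much of the cell?
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])  # candidate cell update
    c = f * c_prev + i * g                              # new cell memory C_t
    h = o * np.tanh(c)                                  # new hidden state / output
    return h, c

# Toy usage with random parameters.
rng = np.random.default_rng(0)
d_x, d_h = 8, 16
W = {k: rng.normal(0, 0.1, (d_h, d_x)) for k in "fiog"}
U = {k: rng.normal(0, 0.1, (d_h, d_h)) for k in "fiog"}
b = {k: np.zeros(d_h) for k in "fiog"}
h, c = lstm_cell_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, U, b)
print(h.shape, c.shape)   # (16,) (16,)
```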
This weighted sum captures the information from the source side and pays more attention to the source words that are most relevant to the word Yt we are about to generate. By doing this we obtain richer context information from the sentence representation, which helps us generate Yt on the decoder side more effectively. Putting all of this together, we can build a single neural machine translation model that reaches state-of-the-art performance compared with the previous statistical machine translation systems.
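To make the attention step concrete, here is a minimal NumPy sketch: the decoder state at time t acts as a query against the source annotation vectors, producing the attention weights alpha and the context vector as their weighted sum. A simple dot-product score is used here as a simplifying assumption; the attention model described above typically scores relevance with a small feed-forward network over the decoder state and each source annotation.

```python
# Minimal sketch of the attention step described above: score each source
# annotation against the current decoder state, normalize into attention
# weights alpha_1..alpha_T, and take their weighted sum as the context vector.
# The dot-product score and the dimensions are assumptions for illustration.
import numpy as np

def attention(decoder_state, source_annotations):
    """decoder_state: (d,), source_annotations: (T, d) -> (context, alphas)."""
    scores = source_annotations @ decoder_state      # relevance of each source word
    scores -= scores.max()                           # numerical stability for softmax
    alphas = np.exp(scores) / np.exp(scores).sum()   # attention weights, sum to 1
    context = alphas @ source_annotations            # weighted sum of source vectors
    return context, alphas

rng = np.random.default_rng(0)
annotations = rng.normal(size=(5, 16))  # one annotation per source word, e.g. the
                                        # concatenated forward/backward encoder states
query = rng.normal(size=16)             # decoder hidden state at time t
context, alphas = attention(query, annotations)
print(alphas.round(2), context.shape)   # weights over 5 source words; context is (16,)
```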