By ReNu Khandelwal
Let’s start with the following questions:
 What problems of artificial neural networks and convolutional neural networks do recurrent neural networks solve?
 Where can I use an RNN?
 What is an RNN and how does it work?
 Vanishing and exploding gradients: the challenges facing RNNs
 How can LSTM and GRU address these challenges?
Suppose we’re writing a message, “Let’s meet for ___”, and need to predict the next word. The next word could be lunch, dinner, breakfast, or coffee. It is easier for us to infer it from context. If we know that we are meeting in the afternoon, and that information stays in our memory, we can easily predict that we will probably meet for lunch.
When we need to process sequence data that spans multiple time steps, we use a recurrent neural network (RNN).
A traditional neural network requires fixed-size input and output vectors.
For example, we use 128 × 128 input images to predict whether an image shows a dog, a cat, or a car. We can’t make predictions on images of variable size.
Now, what if we need to work with sequence data that depends on previous input states, such as messages, or with data where the input, the output, or both are sequences? This is where we use RNNs.
In an RNN, we share the weights and feed the output back into the input in a loop, which helps in processing sequence data.
An RNN uses sequential data to infer who is speaking, what is being said, what the next word might be, and so on.
An RNN is a neural network with a loop that lets it store information. RNNs are called recurrent because they perform the same task on every element of a sequence, and the output for each element depends on the previous elements or states. This is how an RNN persists information and uses context to infer.
RNN stands for recurrent neural network.
Where is RNN used?
The RNN described above can take one or more inputs and produce one or more outputs; that is, both the input and the output can vary in length.
RNN can be used for:
 Image classification
 Image captioning
 Machine translation
 Video classification
 Sentiment analysis
How does RNN work?
First, the notation.
 h is the hidden state
 x is the input
 y is the output
 W is the weight matrix
 t is the time step
When processing sequence data, the RNN takes an input x at time step t. It uses the hidden state from time step t-1 to compute the hidden state h at time step t, applying the tanh activation function. We use tanh or ReLU to introduce a nonlinear relationship into the output at time t.
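The recurrence described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the article names a single weight W, while the sketch splits it into a hypothetical input weight W_xh, a recurrent weight W_hh, and a bias b_h, and the toy numbers are arbitrary. Plain lists are used to keep it dependency-free; a real implementation would normally use an array library.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [math.tanh(a + s + b) for a, s, b in zip(from_input, from_state, b_h)]

def rnn_forward(inputs, h0, W_xh, W_hh, b_h):
    """Apply the same step, with the same weights, to every element in turn."""
    h = h0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Toy sizes (assumed): 2 input features, 3 hidden units, arbitrary weights.
W_xh = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_hh = [[0.2, 0.0, -0.1], [0.4, 0.3, 0.0], [0.0, -0.5, 0.1]]
b_h = [0.0, 0.1, -0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # four time steps

states = rnn_forward(sequence, [0.0, 0.0, 0.0], W_xh, W_hh, b_h)
for t, h_t in enumerate(states, start=1):
    print(t, [round(v, 3) for v in h_t])
```

Because the loop reuses the same weights at every step, the same function handles a sequence of any length.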
Unrolled over four time steps, the RNN becomes a four-layer neural network, with one layer per time step.
Hidden states connect information from the previous state and thus act as memory for RNN. The output of any time step depends on the current input and the previous state.
Unlike other deep neural networks, which use different parameters for each hidden layer, an RNN shares the same weight parameters at every step.
We initialize the weight matrix randomly. During training, we need to find the values of the matrix that give the desired behavior, so we compute a loss function L. The loss L measures the difference between the actual output and the predicted output, and is computed with the cross-entropy function.
In an RNN, the loss function L is the sum of the losses at every time step.
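As a rough illustration of summing the loss over steps, the sketch below computes a cross-entropy loss at each time step and adds them up. The softmax step, the three-word vocabulary, and the score values are assumptions made for the example, not something the article specifies.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

def sequence_loss(scores_per_step, targets):
    """Total loss L: the sum of the per-step cross-entropy losses."""
    return sum(
        cross_entropy(softmax(scores), target)
        for scores, target in zip(scores_per_step, targets)
    )

# Hypothetical scores over a 3-word vocabulary at 2 time steps.
scores_per_step = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2]]
targets = [0, 2]  # index of the correct word at each step
total_loss = sequence_loss(scores_per_step, targets)
print(round(total_loss, 4))
```

The second step has equal scores for all three words, so its contribution is log(3); a confident, correct prediction such as the first step contributes much less.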
To reduce the loss we use back propagation, but unlike a traditional neural network, an RNN shares its weights across all layers, or in other words across all time steps. As a result, the error gradient at each step also depends on the loss at the previous steps.
In the example above, to calculate the gradient at step 4, we need to add the loss of the first three steps to the loss of the fourth step. This is called back propagation through time (BPTT).
We calculate the gradient of the error with respect to the weights so that we can learn the correct weights and obtain the desired output.
Because W is used at every step up to the final output, we back propagate from t = 4 to t = 0. In a traditional neural network we do not share weights, so we do not need to sum gradients. In an RNN we share weights, so we need to sum the gradients of W over all time steps.
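To make the summing of gradients concrete, here is a minimal sketch of back propagation through time for an RNN with a single hidden unit. Several things are assumptions for the sake of a short example: the scalar weights w (recurrent) and u (input), the toy data, and a squared-error loss swapped in for the cross-entropy loss mentioned earlier. The result is checked against a numerical gradient.

```python
import math

def forward(w, u, xs):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), starting from h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, ys):
    """Sum over time steps of 0.5 * (h_t - y_t)^2 (squared error, assumed)."""
    hs = forward(w, u, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def grad_w_bptt(w, u, xs, ys):
    """Gradient of the loss with respect to the shared weight w."""
    hs = forward(w, u, xs)
    grad_w = 0.0
    dh_from_future = 0.0  # gradient flowing back from later time steps
    for t in range(len(xs), 0, -1):  # walk backwards: t = T, ..., 1
        dh = (hs[t] - ys[t - 1]) + dh_from_future
        da = dh * (1.0 - hs[t] ** 2)  # derivative of tanh
        grad_w += da * hs[t - 1]      # sum the contribution of this step
        dh_from_future = da * w       # pass the gradient to step t - 1
    return grad_w

w, u = 0.7, 0.4
xs = [1.0, -0.5, 0.3, 0.8]   # four time steps
ys = [0.5, 0.0, -0.2, 0.4]

analytic = grad_w_bptt(w, u, xs, ys)
eps = 1e-5
numeric = (loss(w + eps, u, xs, ys) - loss(w - eps, u, xs, ys)) / (2 * eps)
print(analytic, numeric)
```

The line that accumulates into grad_w is where the shared weight shows up: every time step adds its own term to the same gradient.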
Calculating the gradient with respect to h at time step t = 0 involves many factors of W, because we have to back propagate through every RNN unit. Even if we set the weight matrix aside and simply multiply by the same scalar value again and again, this becomes a challenge when the number of time steps is large, say 100.
If the largest singular value of W is greater than 1, the gradient can explode, which is called the exploding gradient problem.
If the largest singular value of W is less than 1, the gradient will vanish, which is called the vanishing gradient problem.
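The scalar version of this effect is easy to try. The sketch below multiplies by the same factor 100 times; the factors 0.9 and 1.1 are arbitrary values chosen for illustration.

```python
def repeated_factor(factor, steps):
    """Multiply by the same scalar once per time step."""
    value = 1.0
    for _ in range(steps):
        value *= factor
    return value

shrinking = repeated_factor(0.9, 100)  # factor slightly below 1
growing = repeated_factor(1.1, 100)    # factor slightly above 1
print(f"0.9 multiplied 100 times: {shrinking:.2e}")
print(f"1.1 multiplied 100 times: {growing:.2e}")
```

A factor only slightly below 1 shrinks to roughly the order of 1e-5 after 100 steps, while a factor only slightly above 1 grows to roughly the order of 1e4.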
Weights are shared across all time steps, which causes gradients to explode or vanish.
For the exploding gradient problem, we can use gradient clipping: we set a threshold in advance, and if the gradient exceeds the threshold, we clip it.
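One common way to do this, shown here as a sketch rather than the only option, is to rescale the whole gradient when its norm exceeds the threshold. The function name clip_by_norm and the example values are assumptions for illustration.

```python
import math

def clip_by_norm(gradient, threshold):
    """Rescale the gradient so that its L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)  # small enough: leave it unchanged
    scale = threshold / norm
    return [g * scale for g in gradient]

exploded = [30.0, -40.0]  # L2 norm is 50
clipped = clip_by_norm(exploded, threshold=5.0)
print(clipped)
```

Rescaling keeps the direction of the gradient and only limits its length, so the update still points the same way but takes a smaller step.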
To address the vanishing gradient problem, the commonly used approach is a long short-term memory (LSTM) network or a gated recurrent unit (GRU).
In our message example, to predict the next word we need to go back a few time steps and look at the previous words. The two pieces of relevant information may be separated by a large gap. As the gap widens, it becomes difficult for an RNN to learn to connect them. This is exactly where LSTM is strong.
Long short-term memory network (LSTM)
LSTMs can learn long-term dependencies faster. They can learn to bridge time intervals in excess of 1000 steps. This is achieved by an efficient gradient-based algorithm.
To predict the next word in the message, we can store the context at the beginning of the message so that we have the right context. That’s how our memories work.
Let’s take a closer look at the LSTM architecture and how it works
An LSTM is designed to remember information for a long time, so it needs to know what to remember and what to forget.
LSTM uses four gates, which you can think of as deciding whether we need to remember the previous state. The cell state plays a key role in LSTMs: the gates decide whether information is added to or removed from it.
These gates act like faucets, determining how much information is allowed to pass through.
 The first step in an LSTM is to decide whether to remember or forget the cell state. The forget gate uses the sigmoid activation function, whose output lies between 0 and 1. An output of 1 tells us to keep the value, and an output of 0 tells us to forget it.
 The second step is to decide what new information to store in the cell state. There are two parts: the input gate, which uses the sigmoid function to decide whether to write to the cell state, and a tanh layer, which determines what new information is added.

In the last step, we create the new cell state by combining the outputs of step 1 and step 2. The output of the current time step is then obtained by applying the tanh activation function to the cell state and multiplying the result by the output of the output gate. The tanh activation function gives an output between -1 and +1.

The cell state is the internal memory of the cell. It is the previous cell state multiplied by the forget gate, plus the newly computed candidate state (g) multiplied by the output of the input gate i.
Finally, the output is based on the cell state.
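The three steps can be put together in a small sketch of one LSTM step for a single hidden unit. The parameter layout (an input weight, a recurrent weight, and a bias per gate) and all the numbers are assumptions chosen for illustration; the biases are deliberately set so that the forget gate stays close to 1 and the input gate close to 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a single hidden unit.

    p maps 'f', 'i', 'g', 'o' to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: how much of c_prev to keep
    i = sigmoid(pre("i"))    # input gate: how much new information to write
    g = math.tanh(pre("g"))  # candidate values, between -1 and +1
    o = sigmoid(pre("o"))    # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g      # new cell state
    h_t = o * math.tanh(c_t)      # new hidden state (the output)
    return h_t, c_t

params = {
    "f": (0.0, 0.0, 10.0),   # large bias: forget gate stays near 1
    "i": (0.0, 0.0, -10.0),  # very negative bias: input gate stays near 0
    "g": (0.5, 0.5, 0.0),
    "o": (0.5, 0.5, 0.0),
}

h, c = 0.0, 0.8  # start with something stored in the cell state
for x in [1.0, -1.0, 0.5, 0.0, 2.0]:
    h, c = lstm_step(x, h, c, params)
print(round(c, 3), round(h, 3))
```

With the gates set this way, the cell state stays close to its starting value of 0.8 no matter what inputs arrive, which is the "remember" behavior described above. Learned weights would let the network open and close the gates depending on the input.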
Back propagation from the current cell state to the previous cell state involves only an element-wise multiplication by the forget gate, with no matrix multiplication by W. Using the cell state therefore largely removes the vanishing and exploding gradient problems.
LSTM determines when and how to change its memory at each time step by deciding what to forget, what to remember, and what information to update. This is how LSTMs can store long-term memory.
The following is an example of how LSTM predicts our messages
GRU, a variant of LSTM
GRU uses two gates, a reset gate and an update gate, unlike the three steps in LSTM. GRU has no internal memory (cell state).
The reset gate determines how the new input is combined with the memory of the previous time step.
The update gate determines how much of the previous memory should be retained. It combines the roles of the input gate and the forget gate that we saw in LSTM.
GRU is a simpler variant of LSTM for solving the vanishing gradient problem.
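A sketch of one GRU step for a single hidden unit is shown below. The parameter layout and the numbers are assumptions, and the update gate z is written so that a value near 1 keeps the previous memory, matching the description above; some references use the opposite convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step for a single hidden unit.

    p maps 'r', 'z', 'h' to (input weight, recurrent weight, bias).
    """
    w_x, w_h, b = p["r"]
    r = sigmoid(w_x * x_t + w_h * h_prev + b)  # reset gate
    w_x, w_h, b = p["z"]
    z = sigmoid(w_x * x_t + w_h * h_prev + b)  # update gate
    w_x, w_h, b = p["h"]
    # The reset gate controls how much previous memory enters the candidate.
    candidate = math.tanh(w_x * x_t + w_h * (r * h_prev) + b)
    # The update gate blends the previous memory with the candidate.
    return z * h_prev + (1.0 - z) * candidate

base = {"r": (0.0, 0.0, 0.0), "h": (0.5, 0.5, 0.0)}
keep_params = dict(base, z=(0.0, 0.0, 10.0))        # update gate near 1
overwrite_params = dict(base, z=(0.0, 0.0, -10.0))  # update gate near 0

kept = gru_step(1.0, 0.6, keep_params)
overwritten = gru_step(1.0, 0.6, overwrite_params)
print(round(kept, 3), round(overwritten, 3))
```

With the update gate near 1 the previous memory of 0.6 is retained almost unchanged, and with it near 0 the state is replaced by the candidate. There is no separate cell state: the hidden state itself carries the memory.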
Link to the original text: https://medium.com/datadriveninvestor/recurrentneuralnetworkrnn52dd4f01b7e8