Learning to Learn by Gradient Descent by Gradient Descent
I spent four days in Quebec City. Still catching up after the trip, I prepared for this week's meetup, but in the end I could not attend because of a birthday dinner with my girlfriend. I did study the original paper seriously, though, and the topic involves some interesting ideas, so I want to introduce it here.
Long short term memory (LSTM)
To understand the paper, you first need to understand LSTM. I recommend chapter 10 of the Deep Learning book.
We partly discussed recurrent neural networks (RNNs) when we studied the Hopfield net. These feedback networks have the interesting property that they can remember information. However, after many iterations the remembered signal fades: the activations flatten out until they are lost below numerical precision, a symptom of what is usually called the vanishing gradient problem. Many ideas were considered, and one of the suggested solutions was to use gates that let the network forget in a controlled way. You can adjust how forgetful the machine is.[1]
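The role of the forget gate can be seen in a minimal sketch of one LSTM step. This is plain Python with scalar values, not the Keras or TensorFlow implementation; the names `lstm_step` and `run_memory` and the `forget_bias` argument are assumptions made for illustration. With all weights set to zero and no input, the cell state is simply multiplied by the forget gate at every step, so a bias on that gate decides how quickly the memory fades.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w, forget_bias=0.0):
    """One step of a scalar LSTM cell (illustrative, not a library API).

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("f") + forget_bias)  # forget gate: share of old memory kept
    i = sigmoid(pre("i"))                # input gate: share of new candidate written
    o = sigmoid(pre("o"))                # output gate: share of memory exposed
    c_tilde = math.tanh(pre("g"))        # candidate memory content
    c = f * c_prev + i * c_tilde         # updated cell state
    h = o * math.tanh(c)                 # updated hidden state
    return h, c


def run_memory(forget_bias, steps=10):
    """Start with cell state 1.0, feed zeros, and return what is left."""
    zero = (0.0, 0.0, 0.0)
    w = {"f": zero, "i": zero, "o": zero, "g": zero}
    h, c = 0.0, 1.0
    for _ in range(steps):
        h, c = lstm_step(0.0, h, c, w, forget_bias)
    return c


remembering = run_memory(forget_bias=5.0)   # gate close to 1: memory persists
forgetting = run_memory(forget_bias=-5.0)   # gate close to 0: memory vanishes
```

In this toy setting `remembering` stays close to the initial value after ten steps, while `forgetting` is practically zero.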
With this gating, LSTM has found remarkable applications in deep learning. Next time, I might introduce other applications of LSTM, such as sequence-to-sequence models, generative adversarial nets, and so on.
Learning to learn by gradient descent by gradient descent (L2L) and TensorFlow
The idea of L2L is not so complicated, and the original paper is quite short. Choosing the proper optimizer for a model and finely tuning that optimizer's parameters is not automatic. Sometimes it even feels chaotic to me that there is no definite standard for optimization; there is too much trial and error in computer science. L2L is a method that learns the optimization itself instead of hand-tuning parameters such as learning rates and momentum.[2] An LSTM is used to memorize the state of this optimization. The LSTM has parameters of its own, so we need another optimizer to minimize its loss objective. In the paper, they use the Adam optimizer for this. Therefore, L2L involves two optimizers: Adam and the LSTM optimizer.[3] After being trained with Adam, the LSTM optimizer performs far better than the others.
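The two-optimizer structure can be sketched on a toy problem. This is not the paper's setup: the learned optimizer here is a single number `phi` (a learned step size) standing in for the LSTM, the optimizee is the one-dimensional quadratic $f(\theta) = \frac{1}{2} a \theta^2$, and a finite difference stands in for back-propagation through the unrolled steps. All function names are hypothetical. Adam is written out by hand so that the outer loop is visible.

```python
import math


def meta_loss(phi, a=2.0, theta0=1.0, steps=5):
    """Sum of optimizee losses along the trajectory produced by the
    learned update g_t = -phi * grad (a stand-in for the LSTM output)."""
    theta, total = theta0, 0.0
    for _ in range(steps):
        grad = a * theta              # gradient of 0.5 * a * theta**2
        theta = theta + (-phi * grad) # inner optimizer: the learned rule
        total += 0.5 * a * theta ** 2
    return total


def meta_grad(phi, eps=1e-6):
    """Central finite difference of the meta-loss with respect to phi.
    Assumption: this replaces back-propagation through the unrolled steps."""
    return (meta_loss(phi + eps) - meta_loss(phi - eps)) / (2 * eps)


def train_optimizer_with_adam(phi=0.01, lr=0.05, beta1=0.9, beta2=0.999,
                              eps=1e-8, iters=500):
    """Outer optimizer: standard Adam updates applied to phi."""
    m = v = 0.0
    for t in range(1, iters + 1):
        g = meta_grad(phi)
        m = beta1 * m + (1 - beta1) * g       # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        phi -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return phi


phi = train_optimizer_with_adam()
```

For this quadratic with `a=2.0`, the best step size is 0.5 (it sends $\theta$ to zero in one step), and the outer Adam loop should settle close to that value. The point is only the nesting: Adam adjusts the optimizer, and the optimizer adjusts the optimizee.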
The formula from the paper gives the loss objective and the update rules that the algorithm uses to find the best optimizer.[4] This objective is differentiable. In machine learning the term "differentiable" is used in a slightly different sense: it means that we can use back-propagation. The whole system is differentiable with respect to the optimizer's parameters, which allows us to optimize them in search of a better optimizer.
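For readers who cannot see the captured image, the objective and update rules can be written out as follows. This is my transcription of the paper's formula, so check it against the original; $w_t$ are non-negative weights on each time step, $T$ is the number of unrolled steps, and $\nabla_t = \nabla_\theta f(\theta_t)$.

$$
\mathcal{L}(\phi) = \mathbb{E}_f\left[\sum_{t=1}^{T} w_t\, f(\theta_t)\right],
\qquad
\theta_{t+1} = \theta_t + g_t,
\qquad
\begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix} = m(\nabla_t, h_t, \phi)
$$

Every operation in this recursion is differentiable in $\phi$, which is why $\partial \mathcal{L} / \partial \phi$ can be obtained by back-propagating through the unrolled steps.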
The graph from the paper is the computational graph used for computing the gradient of the optimizer.[4] $m$ is the RNN. At each step this network takes the gradient $\nabla_t$ and its hidden state $h_t$, and outputs the update $g_t$ together with the next state. $f_t$ is the optimizee function evaluated at the parameters $\theta_t$, and $\phi$ denotes the parameters of the optimizer that produces $g_t$.[4]
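The data flow of that graph can be traced in a few lines of plain Python. This is only a sketch: the function `m` below is a hand-written momentum-like rule standing in for the LSTM, `phi` is a hypothetical pair `(decay, step)`, and the optimizee is again an assumed quadratic $f(\theta) = \frac{1}{2} a \theta^2$. What it keeps from the graph is the wiring: gradient and state go into `m`, and the update and next state come out.

```python
def m(grad, h, phi):
    """Hypothetical stand-in for the RNN m: returns (g_t, h_{t+1})."""
    decay, step = phi
    h_next = decay * h + grad   # state accumulates past gradients
    g = -step * h_next          # proposed update for theta
    return g, h_next


def unroll(phi, theta0=1.0, steps=5, a=2.0):
    """Unroll the optimizer for a fixed number of steps and collect losses."""
    theta, h = theta0, 0.0
    losses = []
    for _ in range(steps):
        grad = a * theta                     # nabla_t
        g, h = m(grad, h, phi)               # [g_t, h_{t+1}] = m(nabla_t, h_t, phi)
        theta = theta + g                    # theta_{t+1} = theta_t + g_t
        losses.append(0.5 * a * theta ** 2)  # f evaluated at the new theta
    return theta, sum(losses), losses


final_theta, total_loss, losses = unroll(phi=(0.0, 0.1))
```

With `decay` set to zero the rule reduces to plain gradient descent, so each step here multiplies $\theta$ by 0.8. Training the optimizer means changing `phi` so that `total_loss` goes down.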
Google DeepMind has open-sourced the code for its L2L research. For the simple function-optimizer example, training the model does not take much time.
I used TensorBoard, from TensorFlow, to help understand how L2L works alongside the above figure from the paper. You can look closer by opening the image in a new tab or window. Compared with the paper's figure, this graph also shows where the Adam optimizer works.
To simplify the graph, I reduced the system in several ways. The original paper uses a 2-layer LSTM, but I used a 1-layer LSTM for the TensorBoard graph. The dimension of the target polynomial is 7; in other words, we want the model to find the 7 coefficients of the polynomial. The size of the state is 19, the number of training steps is 5, and the cell is an LSTM. Note that I ran the Adam optimizer twice.
- When I checked the Keras and TensorFlow LSTM classes, it looked to me as if they simply open the forget gate fully and offer no option to adjust it. If you want to adjust it, you need to subclass the LSTM class.
- Some recent popular optimizers, such as RMSprop, do not move the weight "particle" by the raw gradient alone; they use momentum-like running averages of past gradients. See the tutorial by Geoffrey Hinton if you want more detail.
- I doubt that the L2L optimizer is faster than other optimizers overall. Its performance per iteration step is amazing, but you basically need to run two optimizers. I ran TensorFlow on my Mac, so the LSTM optimizer was inefficient and I could not test how effective it really is. Besides, the performance of L2L optimization depends on Adam, too. If you use plain gradient descent to minimize the loss function of the network, the LSTM optimizer performs worse than RMSprop.
- The formula and the graph were captured from the paper.