Judgement Day


It is the first time I did not post for 4 days. I was too busy to prepare for the meetup this week. The day before yesterday meetup topic was the reinforcement learning as I mentioned at previous post. It is not a long research paper, but includes 143 references. Ah, not my favorite. This A Brief Survey of Deep Reinforcement Learning did not explain the detail of what I am interested in.


One of the skills I learned during graduate study is how to find the key references in the most recent research papers. It took a lot of time to learn it but it turns out very helpful when I start the new things.


Anyway, it includes a lot of topics and it would take huge amount of time. I stopped everything except for studying deep reinforcement learning and spent so much time on it. This post is only part of the deep reinforcement learning1, but is going to be so long, too. It also took more than a day to write this post.


Even when I did not yet dive into machine learning, I watched the games of AlphaGo. And I was so shocked about the result. At the day AlphaGo won, a lot of people talked about the day machines will rule out human being. Recently I showed the famous ATARI Breakout video clip to my girlfriend and her reaction was also usual. Scary. Recently robotics using deep reinforcement learning had so much progress and we might see the daily use of neural-networked machine in the real life soon. Yeah, scary.


However, I clearly remember the one win of the human and his smile after his victory2. YES, I am always at human’s side.


png


At that time, I have heard the reinforcement learning is the key of the AlphaGo. And finally I studied it and introduce to you how cool it is. We will make a DQN Python machine learning using openAI gym, and Keras deep learning library.


GYM AI and Environment


Easiest targets among ATARI are classical_controls. Among them, cart pole is very clear problem implemented by Rich Sutton3.


The environment Python code is given. Look into it. Two methods seems crucial.


def __init__(self):
    self.gravity = 9.8
    self.masscart = 1.0
    self.masspole = 0.1
    self.total_mass = (self.masspole + self.masscart)
    self.length = 0.5 # actually half the pole's length
    self.polemass_length = (self.masspole * self.length)
    self.force_mag = 10.0
    self.tau = 0.02  # seconds between state updates

    # Angle at which to fail the episode
    self.theta_threshold_radians = 12 * 2 * math.pi / 360
    self.x_threshold = 2.4

    # Angle limit set to 2 * theta_threshold_radians so failing observation is still within bounds
    high = np.array([
        self.x_threshold * 2,
        np.finfo(np.float32).max,
        self.theta_threshold_radians * 2,
        np.finfo(np.float32).max])

    self.action_space = spaces.Discrete(2)
    self.observation_space = spaces.Box(-high, high)

    self._seed()
    self.viewer = None
    self.state = None

    self.steps_beyond_done = None


Above all, this init code came from the gym AI. Not included in our Python code. Our system is a 2-dimensional world with 1 dimensional gravity.


Masses of cart, pole are given. Length of the pole is given. Updating time is set. Some driven force is set. Threshold angle and position is set. I guess the angle is calculated theoretically, but threshold of position seems just setting. We do not need that kind of threshold at more realistic problem. In other words, the threshold of position is just boundary of the space of our system.


def _step(self, action):
    assert self.action_space.contains(action), "%r (%s) invalid"%(action, type(action))
    state = self.state
    x, x_dot, theta, theta_dot = state
    force = self.force_mag if action==1 else -self.force_mag
    costheta = math.cos(theta)
    sintheta = math.sin(theta)
    temp = (force + self.polemass_length * theta_dot * theta_dot * sintheta) / self.total_mass
    thetaacc = (self.gravity * sintheta - costheta* temp) / (self.length * (4.0/3.0 - self.masspole * costheta * costheta / self.total_mass))
    xacc  = temp - self.polemass_length * thetaacc * costheta / self.total_mass
    x  = x + self.tau * x_dot
    x_dot = x_dot + self.tau * xacc
    theta = theta + self.tau * theta_dot
    theta_dot = theta_dot + self.tau * thetaacc
    self.state = (x,x_dot,theta,theta_dot)
    done =  x < -self.x_threshold \
            or x > self.x_threshold \
            or theta < -self.theta_threshold_radians \
            or theta > self.theta_threshold_radians
    done = bool(done)

    if not done:
        reward = 1.0
    elif self.steps_beyond_done is None:
        # Pole just fell!
        self.steps_beyond_done = 0
        reward = 1.0
    else:
        if self.steps_beyond_done == 0:
            logger.warning("You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.")
        self.steps_beyond_done += 1
        reward = 0.0

    return np.array(self.state), reward, done, {}


Above all, this step code came from the gym AI. Not included in our Python code.

step() method of the gym returns (np.array(self.state), reward, done, {}) after physical mechanics during 0.2 seconds. The third, done is a control factor. And the forth, {} is unknown blank for something?


This code is a rule of steps. From it, we can know the dimensions of the state and the action. Respectively, 4 and 2. It is enough to solve the problem, but I will present more about the system. The cart is moving 1-dimensional perpendicular to the gravity. One end of the pole is fixed on the cart and the other of the pole is free in the air. Then we have 4 parameters to determine the motions. Position and velocity of the cart, and angle and angular velocity of the pole. The steps every 0.2 seconds are determined by the equations of motions in the system. Some constant force will be acted on the left or right side of the cart in order to keep the pole stands on.


Deep Q-learning


Google DeepMind team’s research paper, Playing ATARI with Deep Reinforcement Learning is the perfect tutorial for this problem, deep Q-learning reinforcement learning. Just read this 9 page-research paper. This is very clear research paper and does not require much background knowledge. Just go read it. If I explain, it is almost the copy of the half of the paper. And I will just go directly to solve our model using the algorithm they suggested.


png


This is the algorithm of deep Q-learning.


We want to find sequence of states, s, 4 numbers in the cart pole problem, and actions, a, 1 or 0 in the problem, which means left or right. The goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards. In this game, the reward is turns to survive.


Algorithm and Python Code


Let us attack the algorithm line by line.


First line


We will add the states, previous states, rewards, and actions and also draw the samples from memory. Think it as short term memory. When you drive and turn, you still remember more detail how you turn, but will forget about the detail soon since it is not needed.


Thus, we set the limited capacity of the memory and throw out from the memory when it is full. For this cart-pole problem, if how many memory of the states are needed? Guess. We will see later.


class memory:  
    samples = []

    def __init__(self, capacity):
        self.capacity = capacity

    def add(self, sample):
        self.samples.append(sample)        

        if len(self.samples) > self.capacity:
            self.samples.pop(0)

    def sample(self, n):
        n = min(n, len(self.samples))
        return random.sample(self.samples, n)

N =  1000   # capacity
D = memory(N)


2nd line


OK, the simple part is done. We need to store the initial sample in the empty memory with random things in some range. Q-functions are obtained from the states, so we will instead initialize the states and then we will give the weights from neural networks.


The first line of the M-loop


Now the main dish! for double-loop. The both loops are on by simple iterations. I will call them M-loop and T-loop. We call the environment and reset the environment to initialize the state. You can find the various AI environment here, and I choose ‘Cartpole-v1’. It has max_episode_steps=500, and reward_threshold=475.0.


import gym
env = gym.make('CartPole-v1')


def _reset(self):
    self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
    self.steps_beyond_done = None
    return np.array(self.state)


Above all, this reset code came from the gym AI. Not included in our Python code. The reset method of class CartPoleEnv will give us the initial state. In this code, state is the same as the observation at the algorithm and the ATARI paper. Think it as your first trial. It would not be precise, but approximate using your intuition. Of course our machine is not that smart, so we real human give the range.


Apparently, you can just initialize the state as we have done so far in this blog. However, anyway I do not want to make all the graphics codings irrelevant to machine learning core, and so at last we will use the environment code. Furthermore, it is a good time for beginners to learn class.


In the algorithm, the preprocessing is acted on image frame. Sometimes, we also do it to normalized the states, but our states are already restricted by the threshold and we do not need to preprocess the sequences at all.


The best way to study the objective oriented programming I guarantee is THIS. Study some packages consist of many classes, and practice to write the code by yourself without using the class. In other words, expand the code. And then, close the original object-oriented code, and rewrite the objective oriented code by yourself.


The T-loop includes two different algorithms


Q-learning : $\epsilon$-greedy policy dynamic programming for Q-value function.


Policy is just mapping from the sequences to actions. We use the dynamic programming to find the Q-value function. We use the recursive way to calculate the total reward as a function of an action and a state.


The recursive Q is convergent with the coefficient $\gamma < 1$, but if it unstable for non-linear Q-function, and could take so much time for the learning. Do you want the machine to drive your car if it makes a decision so late? The impromptu reaction is crucial in robotics.


The lesson from the machine : Do not be greedy too much!


I want to emphasize on the power of the mercy. Greed makes you look only your foot. It makes you see the tree and deprive a chance to see the forest. This $\epsilon$-possibility random mercy makes you explore and find the amazing answer. I have seen it from the move of AlphaGo, too. We will not see it from our example because it does not have many degree of freedom. However, if the machine handles with very high dimensional problems, that less greedy algorithm will extend perspective of human being.


Simulated annealing


We also use simulated annealing for the $epsilon$. When we use stochastic property to explore, the simulated annealing is useful to remove the randomness after using it. Without the annealing, it explores over and over even after it finds the best optimized state, and it slows down the performance.


def epsilon(n, Max):
  if n < M/2:
    return 0.01+ 2*(1 - 0.01)/Max * (n - Max/2)
  else :
    return 0.01


Neural network optimization for Q


Thus, we use neural networks to optimize the Q-function. As we have seen from perceptron, neural network is great optimizer, and also working well for non-linear function.


$Q(s, a; \theta_i) \sim Q^* (s, a)$. $Q(s, a; \theta_i)$ is the approximate Q-function by the network. $\theta_i$ are weight for i-th T-iteration. $Q^* (s, a)$ is the optimal Q-value function.


By the way, to reduce the time more, we use minibatch gradient descent on the loss function. Loss function is nothing but the square of the $Q(s, a; \theta_i) - Q^* (s, a)$. Do not forget the $\epsilon$-greedy policy should be considered.


Minibatch gradient descent was already discussed in the blog. It is a small sampled stochastic gradient descent. For every iteration of T-loop, we need to run a learning by neural network. It is not different from linear regression we had with perceptron. I will do this procedure using Keras. It would take too much time to explain how it works, so I will just show you the Python code.


Neural network and Keras


Before the loop code, we need to set the neural network model.


from keras.models import Sequential
from keras.layers import *
from keras.optimizers import *
import numpy as np


model = Sequential()
model.add(Dense(units=100, activation='relu', input_dim= 4))
model.add(Dense(units= 2, activation='linear'))
model.compile(metrics=['accuracy'], loss='mse', optimizer=Adagrad(lr=0.00025))


Finally, a new library. Just go to Keras website and study for few hours. You need to understand how neural network works and the meanings of the detail set up. For example, Sequential is a well-known model of a linear stack of layers. Dense is Just your regular densely-connected neural network layer. units are the number of batches, input_dim is here the dimension of states. Activation is the function you use in neuron. Sigmoid function is the most famous activation function. relu is the rectified linear unit. The last layer should have 2 units because our action is 2-dimensional. Adagrad is one of the gradient descent model. The weak point of Adagrad is that it already shirinks the learning rate. This did not make any problem for this example. I chose it because I thought it would be consistent with our minibatch stochastic approach, and also it is based on gradient descent the algorithm used. mse will give the meas squared error for our minibatch model. It is the same as the loss function suggested by the algorithm.


Double loop


size_of_batch = 100
R = 0
M= 10000
T = 10
gamma = 0.5

for episode in range(M):

    state = env.reset()

    for turn in range(T):

        env.render() # graphical loop is on.

        if random.random() <  epsilon(episode, M):
            action = random.randint(0, 1)
        else:
            action = np.argmax(model.predict(state.reshape(1, 4)).flatten)

        next_state, reward, done, EMPTY = env.step(action) # caution for done


        D.add( (state, action, reward, next_state) )

        minibatch = D.sample(size_of_batch)
        real_size_of_batch = len(minibatch)

        no_state = np.zeros(4)

        observe_state = np.array([ e[0] for e in minibatch ]) #draw he state from minibatch
        observe_next_state = np.array([ (no_state if e[3] is done else e[3]) for o in minibatch ])

        Q = model.predict(observe_state)
        next_Q = model.predict(observe_next_state)


        batch_state = np.zeros((real_size_of_batch, 4))
        batch_action = np.zeros((real_size_of_batch, 2))

        for i in range(real_size_of_batch):
            e = minibatch[i]
            state = e[0]; action = e[1]; reward = e[2]; next_state = e[3]

            yj = Q[i] # from the network. Q(theta)
            if next_state is None:
                yj[action] = reward  # off-policy
            else:
                yj[action] = reward + gamma * np.amax(next_Q[i])

            batch_state[i] = state
            batch_action[i] = yj

        model.model.fit(batch_state ,batch_action, batch_size=100, epochs=1, verbose=0)

        R +=reward

        state = next_state

        if done: # if the pole or cart go over the threshold
            break

gif


I do not explain the detail of the code. I have used the almost the same symbols for the parameters and functions. Will be straightforward for you to understand. If the theory is not clear, then, I say again, go read the original paper.


I myself experimented a lot of things. However, I leave you the fun part. The M-iteration number I chose above is enough big to make very good learning. If you use smaller episodes, you could see it fails in short period. Make a running function dependent on the iteration number, and can test how it is going, too. You can try to record the total reward and change the memory, and so on, in order to see if your intuition is correct. For example, what do you expect the plot of the total reward as a function of the M-iteration? There are ton of tests you can do, have fun!


Pool


In this simple game, we did not have the image input and as the result, we do not have convolution neural network. However, we decrease the size from the emptying memory. Here, the size of the batch is 100. And capacity of the memory in the above example is only 1000! I started from 100000 and decrease the capacity. It means there is the black magic of deep learning emerged again. For example, imagine you are doing the game on your hand by yourself. You need to focus on your action right now rather than remembering what you did 1 minute ago. If the game becomes complicated, you need to remember more, but still you are free to forget the informations you will not use.




  1. There are many other interesting deep reinforcement learning. In particular, the model-free policy search is super interesting. Hope I have a chance to post about it soon. [return]
  2. Of course, I agree with Google DeepMind team that it was not a win of machine but win of human technology.
    [return]
  3. He is one of the authors of the famous reinforcement learning introduction textbook. [return]