5 Nov 2024

Reinforced snake

Creating an environment for training an AI snake. Putting Q-learning into practice.

Something that has piqued my interest recently is reinforcement learning. I believe that to the layman, reinforcement learning (RL) should be easy to understand because of its similarities to biology.

The creature with no senses

To understand reinforcement learning I would like to present a metaphor. Imagine, if you will, a being that has no senses and no control over any actions. It sits inert in its environment, and everything that ever happens to it, be it positive or negative, happens by pure and random chance.

Now imagine giving the creature some sense of its nearest surroundings. The creature might now be able to smell and observe what is in its nearest vicinity.

At some points in time:

  1. What is in that creature's vicinity may find its way into the creature's mouth, and the creature will feel a positive association with that smell.
  2. Other times the creature may feel pain associated with another smell.

Now, let's give the creature the ability to move forwards, backwards and side to side. The creature can now actively explore its surroundings using random actions and its sense of smell, to find what gives it pleasure and avoid what gives it pain.

Over time this creature learns more and more about its environment through exploration and new experiences. Soon the experiences become less novel to the creature, and the creature may start to exploit its knowledge more and more.

This is the way most living beings navigate their environment, and it is what RL is inspired by. Instead of this creature living in a physical world as a physical being, we can simulate this biological system as an abstract mathematical system using computers.

We, as the designers of such a system, are responsible for designing the environment for the creature in the form of possible actions, senses, rewards and punishments. In a sense we are designing the creature itself, and how it should perceive the world around it.

The creature itself is responsible for learning, exploring and exploiting the information that we give it. As in most other machine learning systems, we can employ a host of different statistical methods, including neural networks, to estimate a mathematical link between the sensed environment, actions, rewards and punishments.

Snake

The objective of the game Snake is for the snake to eat as many apples as it can without eating itself or running into the walls. The environment is fairly closed, in the sense that it is predictable, with one important caveat: every time an apple is eaten, a new one appears in a random place in the environment. This means that we cannot simply remember where the apple is, nor the order in which it appears; we have to learn how to find the apple using our senses. Consequently, the path of the snake has to change for every apple, and the placement of the tail will also be influenced by this random factor. All of this randomness has to be learned by the snake.

Sense (observing the state of the world)

When we think about senses, we have to remember that the agent has no inherent ability to filter out noise from the actually valuable information. Therefore, we can't just give the snake an omniscient view of the world, like the coordinates of everything. (Although learning to filter the information is possible with neural networks, that is beyond the scope of this essay.)

If we did this, we would need an incredible amount of experience for the snake to learn exactly what information is important, at exactly the right time. In other words: the snake would have to have seen an unfathomable number of states of the world to gather anything meaningful.

An example of such a ludicrous state would be: "the apple is at position (4, 10), I am at position (50, 20) and have to move x steps in one direction and y steps in the other, and I also have to be careful about my tail, which is at positions x, y, z, a, b, c, etc.". Represented as a vector, the state would look like [4, 10, 50, 20, x, y, z, a, b, c, ...], giving us a combinatorial explosion of possible states.

We rather want the snake to have senses like the following:

  1. The snake needs to be aware of where the apple is relative to itself, i.e. the direction of the apple.
  2. The snake needs to be aware of where its tail is. This is important to avoid eating itself.
  3. The snake needs to be aware of how far it is from a wall.
  4. The snake needs to have information about what direction it is already headed in.

The thought process of the snake would be, for example: "The apple is relatively far away from me, to the northwest; I am not headed that way at the moment. My tail would be in the way if I went directly to the apple. I cannot go left, because the wall is fairly close." As a vector this would look like [far, northwest, heading south, tail in way, wall to left]. Even though this is also a fairly long vector, the number of possible values per element is quite limited. This gives us relatively few dimensions of information to worry about, and we can discretely enumerate the possible combinations of these senses.
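
To make this concrete, below is a minimal sketch of how such a sensed state could be computed in Python. The function name, the coordinate convention (x grows to the right, y grows downward) and the exact buckets are illustrative assumptions on my part, not the only reasonable choices.

def get_state(head, apple, heading, body, width, height):
    """Encode the snake's senses as a small, discrete tuple."""
    hx, hy = head
    ax, ay = apple

    # Where is the apple relative to the head? Only the sign matters, so each
    # component is -1, 0 or 1 instead of an exact coordinate.
    apple_dx = (ax > hx) - (ax < hx)
    apple_dy = (ay > hy) - (ay < hy)

    def blocked(x, y):
        # A cell is dangerous if it lies outside the grid or on the tail.
        return not (0 <= x < width and 0 <= y < height) or (x, y) in body

    # Danger in the three cells the snake could move into next, ordered as
    # (left, right, forward) relative to the current heading.
    neighbours = {"north": [(-1, 0), (1, 0), (0, -1)],
                  "south": [(1, 0), (-1, 0), (0, 1)],
                  "east":  [(0, -1), (0, 1), (1, 0)],
                  "west":  [(0, 1), (0, -1), (-1, 0)]}
    danger = tuple(blocked(hx + dx, hy + dy) for dx, dy in neighbours[heading])

    return (apple_dx, apple_dy, heading, danger)

Every element of the returned tuple can only take a handful of values, so the number of distinct states stays small enough to enumerate in a table.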

This should in fact be sufficient for the snake to navigate its own world through actions.

Actions

Defining the actions that the snake can take is fairly straightforward, as long as we define an action to be a single discrete step: the snake can keep moving forward without changing direction, or it can turn to the left or to the right. A possible encoding is sketched below.
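
Relative turns can be encoded as a small lookup from the current heading to the new one. The names below are illustrative assumptions, chosen to match the state sketch above (same screen-style coordinates, with y growing downward).

# The three discrete actions, in the order that will index the Q-table columns.
ACTIONS = ("left", "right", "forward")

# How each turn changes the heading.
TURN_LEFT = {"north": "west", "west": "south", "south": "east", "east": "north"}
TURN_RIGHT = {old: new for new, old in TURN_LEFT.items()}

def next_heading(heading, action):
    """Apply one discrete action to the current heading."""
    if action == "left":
        return TURN_LEFT[heading]
    if action == "right":
        return TURN_RIGHT[heading]
    return heading  # "forward" keeps the heading unchanged

Now we come to the difficult part of reinforcement learning: the reward structure.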

Rewards

We want the snake not to be afraid to explore, in the hope that it may receive a big reward in the future, but we also do not want the snake to sacrifice itself just to get the apple. We also need to think about how much a future apple is worth compared to the immediate gratification of simply surviving a single discrete step.
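
To make the trade-off concrete, a reward scheme could look something like the following. The exact numbers are assumptions for illustration and usually need tuning; what matters is the ordering: eating an apple is strongly positive, dying is strongly negative, and merely surviving one step is worth very little.

# Illustrative reward values, not tuned settings.
REWARD_APPLE = 10.0    # eating an apple is the actual goal
REWARD_DEATH = -10.0   # hitting a wall or the tail ends the game
REWARD_STEP = -0.1     # tiny cost per move, so the snake does not stall forever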

Q-learning

One of the most basic reinforcement learning algorithms has no neural network associated with it. It models a Markov decision process, where every step depends only on the step before it. This lets us think about actions and senses (the state) in a single table, where the rows are possible states of the world and the columns are the actions. The cells of this table are what we call Q-values.

state / action      left    right   forward
apple north         0.2     0.4     0.3
apple north-east    0.5     0.2     0.8
apple east          0.1     0.02    0.1
...                 ...     ...     ...

(Q-table example; other dimensions of the state are left out for brevity)
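
In code, such a Q-table can be as simple as a dictionary from state tuples to one value per action. Below is a sketch reusing the illustrative names from earlier; using a defaultdict means that states the snake has never seen start out with all-zero Q-values, which is one common choice among several.

from collections import defaultdict

ACTIONS = ("left", "right", "forward")   # same ordering as in the action sketch

# One row per state, one Q-value per action; unseen states default to zeros.
q_table = defaultdict(lambda: [0.0] * len(ACTIONS))

def select_best_action(q_table, state):
    """Return the index of the action with the highest Q-value in this state."""
    values = q_table[state]
    return values.index(max(values))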

Q-values are what we continuously update as the snake explores its environment. When the snake encounters a state of the world, like "apple north" in the table, performs an action, like "right", and receives a reward or punishment, it can update the Q-value in the table for that state-action pair. After a while the snake can take actions based on the row (state) it is in: it simply selects the action with the highest Q-value. Now, we can intuit a simple formula for how the snake should update its Q-table, given the state it was in, the action it took and the reward it received:

$$ Q^{new}(S_t, A_t) \leftarrow Q(S_t, A_t) + R_{t+1} $$

I.e. the new Q-value is the old Q-value plus the reward that the snake received when it performed the action. This may, however, not be optimal. We want the scale of the Q-values to remain relatively stable, and we only want the reward to change the Q-value a little bit. The reason is that we do not know, for example, whether the snake just happened to slide over the apple or whether the action it performed actually led it to the apple. In other words, we need a learning rate, or an $\alpha$ as it is called.

$$ Q^{new}(S_t, A_t) \leftarrow (1-\alpha) Q(S_t, A_t) + \alpha R_{t+1} $$

This is better, but we have no way of knowing whether this action in this state may lead us to a large reward in the future. Actually, we do have information about the future rewards after having taken an action: we can look at the new state and the Q-values related to that state, and simply select the highest Q-value (best action) of that new state. That Q-value will in turn have been calculated from the most valuable Q-value of the state after it, and so on. We can see, then, that each Q-value depends on all future rewards of that state-action pair. But we also need to acknowledge that an immediate reward is more certain, and therefore worth more to us than future rewards. With this acknowledgement we add a so-called discount factor $\gamma$, which reduces all subsequent rewards by a factor of $\gamma^{t}$, where t is how many steps forward in the chain we are looking.

The formula will then look like this:

$$ Q^{new}(S_t, A_t) \leftarrow (1-\alpha) \cdot Q(S_t, A_t) + \alpha \cdot (R_{t+1} + \gamma \cdot \max_{a} Q(S_{t+1}, a)) $$
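
This update translates almost line for line into code. Below is a sketch that continues the dictionary-based Q-table from earlier; the default values for $\alpha$ and $\gamma$ are illustrative, not tuned.

def update_Q(q_table, state, action, next_state, reward, done,
             alpha=0.1, gamma=0.9):
    """One Q-learning update for a single state-action pair.

    `action` is an index into the Q-table row; `done` signals that the game
    ended on this step, in which case there is no future reward to look at.
    """
    best_next = 0.0 if done else max(q_table[next_state])
    old_value = q_table[state][action]
    q_table[state][action] = (1 - alpha) * old_value + alpha * (reward + gamma * best_next)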

Exploration vs exploitation

As described earlier, it is essential that, early in the process, the snake moves around the environment and learns as much as possible, to fill the possible states in the Q-table with information. When the snake is exploring, it prioritises random actions over the action with the highest Q-value. When it is exploiting, it selects the actions that yield the highest Q-values. We need to explore new states, but we should also exploit the best actions; this explore-exploit tradeoff is central to machine learning in general.
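
A common way to balance the two is an epsilon-greedy policy: with probability epsilon the snake picks a random action, otherwise it picks the best action it knows about. A sketch, reusing the names from the earlier snippets:

import random

def choose_action(q_table, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, exploit otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))    # explore: a random action index
    return select_best_action(q_table, state)    # exploit: the best known action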

Algorithm

So how do we translate this into a usable Q-learning algorithm?

Using pseudocode:

table = QTable()                 # the (initially empty) Q-table

epsilon = 1.0                    # start with a 100 % likelihood of exploring

repeat 1000 times:               # play the game 1000 times
    observation = observe_environment()   # observe the starting state of a new game
    playing = true

    while playing:               # play until losing or running out of moves
        previous_observation = observation

        if random_zero_to_one() < epsilon:   # we explore
            action = sample_random_action()
        else:                                # else we exploit
            action = select_best_action(table, observation)

        reward, playing = perform_action(action)
        observation = observe_environment()  # observe the new state after the move

        update_Q(table, previous_observation, observation, action, reward)

    # reduce the randomness a little bit every game
    epsilon = 0.998 * epsilon

I programmed all this in Python and trained the snake for about 7000 games. This is what I got:

Computer playing snake

Conclusion

Although I have presented the basics of Q-learning, we have only just scratched the surface of what is possible. Later I will show how we can incorporate neural methods to estimate the Q-function instead of remembering every state. This also lets us use inputs with more dimensions (for example images), letting the network estimate the best "senses"/features to use.