Reinforcement Learning

Here we program an agent that learns to navigate a map, using value iteration to find the best policy. That is, by repeatedly running the simulation, the agent gradually learns which actions give the highest expected reward from each state.

Preliminaries

Some core concepts of Markov Decision Processes (MDPs):

Let's say we have a state s_i, a reward R_i for that state, and a policy pi that outputs an action pi(s_i) for that state. Write P_{s_i, a}(s_j) for the probability of ending up in state s_j after taking action a in state s_i. There's also a parameter phi called the discount factor. Then the expected (discounted) reward when starting in that state is...

V(s_i) = R_i + phi \sum_{j} P_{s_i, pi(s_i)}(s_j) V(s_j).

Assuming we know R_i and P_{s_i, a} (for all actions a) perfectly, we can solve the resulting system of equations to determine V(s_i) for all i.
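
For example, writing V = (V(s_1), ..., V(s_n))^T, R = (R_1, ..., R_n)^T and (P_pi)_{ij} = P_{s_i, pi(s_i)}(s_j), the system is

V = R + phi P_pi V,    i.e.    V = (I - phi P_pi)^{-1} R,

and I - phi P_pi is always invertible when phi < 1, since P_pi has spectral radius at most 1.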

Our goal is to find an optimal policy pi^*. For this we use an algorithm called value iteration.

  1. Initialise V(s_i) = 0 for all i.
  2. Set V(s_i) := R_i + phi max_{a} \sum_{j} P_{s_i, a}(s_j) V(s_j).
  3. Repeat until convergence.

This is guaranteed to converge to the optimal values when phi < 1, and acting greedily with respect to them gives the optimal policy. If the state transition probabilities and/or rewards are unknown, we can repeatedly run simulations to estimate them and update our policy each time.
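
As a concrete reference, here is a minimal sketch of the update above in Python, assuming the model is stored as dictionaries rewards[s] and transition_probs[(s, a)] (both names are placeholders) mapping each successor state to its probability:

```python
def value_iteration(states, actions, rewards, transition_probs,
                    discount=0.99, tol=1e-6):
    V = {s: 0.0 for s in states}                  # step 1: V(s_i) = 0 for all i
    while True:
        max_change = 0.0
        for s in states:
            # step 2: Bellman update using the best action's expected next value
            best = max(
                sum(p * V[s2] for s2, p in transition_probs[(s, a)].items())
                for a in actions
            )
            new_v = rewards[s] + discount * best
            max_change = max(max_change, abs(new_v - V[s]))
            V[s] = new_v
        if max_change < tol:                      # step 3: repeat until convergence
            return V
```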

Imports
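
The original import cell isn't shown; something like the following would cover the sketches below (matplotlib only if you want to reproduce the plots).

```python
import random
from collections import defaultdict

import numpy as np
import matplotlib.pyplot as plt
```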

The Simulation

Define the "game": states, rewards, and possible actions. Goal is to get to the top right corner. There's also a bad position adjacent to the goal that we should avoid. This is based on an example in Andrew Ng's lectures, which in turn was taken from Norvig's AI book.

And the game logic.
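
A sketch of the dynamics, assuming the usual noisy transitions from that example: the intended move succeeds with probability 0.8, otherwise the agent slips to one of the two perpendicular directions.

```python
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
PERPENDICULAR = {'N': ('E', 'W'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('N', 'S')}

def move(state, direction):
    """Apply a move deterministically; hitting the wall or the edge is a no-op."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt == WALL or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
        return state
    return nxt

def step(state, action):
    """Sample the next state: 80% intended direction, 10% each perpendicular."""
    roll = random.random()
    if roll < 0.8:
        direction = action
    elif roll < 0.9:
        direction = PERPENDICULAR[action][0]
    else:
        direction = PERPENDICULAR[action][1]
    return move(state, direction)
```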

Test it out with a policy that picks a random action each time (i.e. it's not a fixed policy).
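
For instance, building on the sketches above:

```python
def run_episode(policy, start=(0, 0), max_steps=100):
    """Follow `policy` (a function state -> action) until a terminal state."""
    state, trajectory = start, [start]
    for _ in range(max_steps):
        if state in terminal_states:
            break
        state = step(state, policy(state))
        trajectory.append(state)
    return trajectory

random_policy = lambda state: random.choice(actions)
print(run_episode(random_policy))
```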

Visualising the map.
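
The original uses a plot; as a stand-in, a quick text rendering of the same information might look like this ('#' wall, '+' goal, '-' bad state):

```python
def show_map():
    for y in reversed(range(HEIGHT)):
        print(' '.join(
            '#' if (x, y) == WALL
            else '+' if (x, y) == GOAL
            else '-' if (x, y) == BAD
            else '.'
            for x in range(WIDTH)
        ))

show_map()
```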

Value iteration to find the optimal policy

First, supposing we've already calculated the expected value of each state, we can construct the optimal policy by acting greedily with respect to those values. Let's also implement value iteration to actually calculate the expected values.
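
The extraction step might look like the sketch below; the expected values themselves come from the value_iteration sketch given earlier. (Since R_i doesn't depend on the action, the discount factor drops out of the argmax.)

```python
def extract_policy(V, transition_probs):
    """Greedy policy w.r.t. V: pick the action with the best expected next-state value."""
    return {
        s: max(actions,
               key=lambda a: sum(p * V[s2]
                                 for s2, p in transition_probs[(s, a)].items()))
        for s in states
    }
```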

Initialise our expected values and estimates of rewards/transition probs.
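
One possible representation, keeping raw transition counts and normalising them on demand (the names here are my own):

```python
V = {s: 0.0 for s in states}
reward_estimates = {s: 0.0 for s in states}
transition_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s2: count}

def estimated_transition_probs():
    """Normalise the counts into estimates of P_{s,a}(s2)."""
    probs = {}
    for s in states:
        for a in actions:
            if s in terminal_states:
                probs[(s, a)] = {}            # episodes end here: no future value
                continue
            counts = transition_counts[(s, a)]
            total = sum(counts.values())
            if total == 0:
                probs[(s, a)] = {s: 1.0}      # no data yet: pretend the move does nothing
            else:
                probs[(s, a)] = {s2: c / total for s2, c in counts.items()}
    return probs
```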

Now the main loop. In each iteration, run some simulations, then use the refined estimates of rewards / transition probabilities to update the expected values (via value iteration).
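
A sketch of that loop, with a bit of random exploration mixed into the greedy policy so that rarely visited states still get data (the iteration counts and exploration rate are arbitrary choices):

```python
EXPLORATION = 0.2

for iteration in range(50):
    policy = extract_policy(V, estimated_transition_probs())
    for _ in range(20):                         # a handful of episodes per iteration
        state = (0, 0)
        for _ in range(100):
            reward_estimates[state] = rewards[state]   # reward is observed on arrival
            if state in terminal_states:
                break
            if random.random() < EXPLORATION:
                a = random.choice(actions)
            else:
                a = policy[state]
            nxt = step(state, a)
            transition_counts[(state, a)][nxt] += 1
            state = nxt
    # refresh the expected values with the latest model estimates
    V = value_iteration(states, actions, reward_estimates,
                        estimated_transition_probs())
```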

Visualising the policy we've derived.
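
Again as a stand-in for the plot, the policy can be printed as arrows ('*' marks the terminal states):

```python
ARROWS = {'N': '^', 'S': 'v', 'E': '>', 'W': '<'}

def show_policy(policy):
    for y in reversed(range(HEIGHT)):
        print(' '.join(
            '#' if (x, y) == WALL
            else '*' if (x, y) in terminal_states
            else ARROWS[policy[(x, y)]]
            for x in range(WIDTH)
        ))

show_policy(extract_policy(V, estimated_transition_probs()))
```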

Evaluating the policy

As stated previously, for a given policy we can determine its exact expected value by solving a system of linear equations. The i-th state gives the following equation:

V(s_i) = R_i + phi \sum_{j} P_{s_i, pi(s_i)}(s_j) V(s_j),

or, rearranging,

\sum_{j} (phi P_{s_i, pi(s_i)}(s_j) - [i=j]) V(s_j) = -R_i,

where phi is the discount factor, and [i=j] is 1 if i=j and 0 otherwise. For this, let's use the true transition probabilities & rewards rather than the estimates, though it might be interesting to see the difference between the estimated expected values and the true expected values.
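
A sketch of that solve with numpy, building the true model from the dynamics defined earlier (the helper names are mine). Terminal states get an empty transition distribution so that their value is just their reward:

```python
def true_transition_probs():
    """Exact P_{s,a}(s2) implied by the 0.8 / 0.1 / 0.1 dynamics above."""
    probs = {}
    for s in states:
        for a in actions:
            if s in terminal_states:
                probs[(s, a)] = {}
                continue
            dist = defaultdict(float)
            dist[move(s, a)] += 0.8
            dist[move(s, PERPENDICULAR[a][0])] += 0.1
            dist[move(s, PERPENDICULAR[a][1])] += 0.1
            probs[(s, a)] = dict(dist)
    return probs

def evaluate_policy(policy, transition_probs, discount=0.99):
    """Solve sum_j (phi * P_{s_i,pi(s_i)}(s_j) - [i=j]) V(s_j) = -R_i exactly."""
    index = {s: i for i, s in enumerate(states)}
    A = np.zeros((len(states), len(states)))
    b = np.zeros(len(states))
    for s, i in index.items():
        A[i, i] -= 1.0                         # the -[i=j] term
        b[i] = -rewards[s]
        for s2, p in transition_probs[(s, policy[s])].items():
            A[i, index[s2]] += discount * p    # the phi * P term
        # (terminal states have no successors, so their row gives V(s) = R_s)
    V_exact = np.linalg.solve(A, b)
    return {s: V_exact[i] for s, i in index.items()}

V_true = evaluate_policy(extract_policy(V, estimated_transition_probs()),
                         true_transition_probs())
```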

Hmm, this is giving nonsensical values (e.g. expected value of >1 for many states). And I'm not that bothered to debug it.