Solving Grid World problems is a popular topic in the AI community. In this post, I will walk through the Value Iteration method for a Grid World problem.
Let's consider a 10x10 grid containing an agent, a reward cell, some wall cells and some penalty cells. The reward cell, wall cells and penalty cells are placed at random locations. The agent tries to reach the reward along the shortest path while avoiding the wall cells and penalty cells. This problem can be viewed as a sequential decision problem of the kind found in computer games.
We solve it with a two-stage procedure. In the first stage, a value iteration algorithm is applied, which yields the policy for reaching the reward cell. In the second stage, a path planning algorithm selects the shortest path for the agent to the reward cell. This modified value iteration algorithm takes all the wall cells and penalty cells into account and returns the policy. We analyze the policy returned after a certain number of iterations and then apply the path planning algorithm. The policy changes with the number of iterations, and hence so does the path. During the value iteration process, we directly select the optimal action for each cell.
Assumptions:
1) The positions of the walls are not fixed, but they are known to the agent.
2) The agent can move only in the horizontal and vertical directions.
3) The agent can move only one cell per unit time.
4) The positions of the reward cell, penalty cells and wall cells are randomly generated for each new game.
5) The starting point for the agent is fixed at cell (1, 1).
6) The outermost rows and columns (the eleventh row and eleventh column) are treated as walls.
7) The agent moves continuously, without stopping, until it finds the reward.
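As a rough sketch of this setup (the function name make_grid, the cell labels and the counts of penalty and wall cells below are illustrative assumptions, not details from the original implementation), the grid and random cell placement could be generated as follows:

import random

GRID_SIZE = 10          # playable area is 10 x 10
REWARD, PENALTY, WALL, EMPTY = 'R', 'P', 'W', '.'

def make_grid(num_penalties=3, num_walls=5, seed=None):
    rng = random.Random(seed)
    # Index 0..11 so that the outermost rows/columns act as boundary walls.
    grid = [[WALL if r in (0, GRID_SIZE + 1) or c in (0, GRID_SIZE + 1) else EMPTY
             for c in range(GRID_SIZE + 2)]
            for r in range(GRID_SIZE + 2)]
    # Candidate cells exclude the fixed start position (1, 1).
    cells = [(r, c) for r in range(1, GRID_SIZE + 1)
                    for c in range(1, GRID_SIZE + 1) if (r, c) != (1, 1)]
    picks = rng.sample(cells, 1 + num_penalties + num_walls)
    grid[picks[0][0]][picks[0][1]] = REWARD
    for r, c in picks[1:1 + num_penalties]:
        grid[r][c] = PENALTY
    for r, c in picks[1 + num_penalties:]:
        grid[r][c] = WALL
    return grid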
Algorithm:
We can treat the given task as a sequential decision problem. We will model it as a fully observable MDP, since the agent has complete knowledge of the grid.
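For reference, written as the usual tuple (the reward scale and discount factor are not specified in this post, so they are placeholders here): the MDP is (S, A, P, R, \gamma), where S is the set of grid cells, A = {up, down, left, right}, P(s' \mid s, a) is the transition model, R(s) is the reward function (positive at the reward cell, negative at penalty cells), and 0 < \gamma < 1 is the discount factor.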
The modified value iteration is an iterative algorithm based on directly solving the Bellman optimality equation. It starts from an initialization of the utility values, searches over those values, and derives the resulting policy. At each step it recomputes the expected utility of every state from the utilities of the neighboring states, stopping when the utilities calculated in two successive steps are close enough.
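In their standard form, the update and the stopping rule referred to here read (using the symbols from the MDP tuple above):

U_{k+1}(s) = R(s) + \gamma \max_{a \in A} \sum_{s'} P(s' \mid s, a) \, U_k(s')

\max_{s} \left| U_{k+1}(s) - U_k(s) \right| < \epsilon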
Here \epsilon is a predefined threshold value. The smaller the threshold, the higher the precision of the algorithm. Given certain conditions on the environment (in particular a discount factor \gamma < 1), the utility values are guaranteed to converge.
Given a utility matrix, we can calculate a corresponding policy according to the maximum expected utility principle, i.e. choosing an optimal policy as in the equation below.
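In its usual form, this maximum expected utility rule reads:

\pi^*(s) = \arg\max_{a \in A} \sum_{s'} P(s' \mid s, a) \, U(s')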
At each iteration, the algorithm explores the entire set A of actions defined in the MDP to improve the current policy, which may cause unnecessary calculations. The transition model used is fixed, but, as we shall see later, in robotics a model of this type can degrade the results and disagree with MDP theory. Our modified value iteration takes into account multiple penalty cells and multiple wall cells, and it performs well even when several penalty states or walls are present.
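A minimal sketch of such a value iteration loop on this grid is shown below. It reuses the cell labels from the grid sketch above; the reward scheme, discount factor, threshold and the deterministic four-direction transition model (bumping into a wall leaves the agent in place) are my assumptions, not values from the original experiments.

def value_iteration(grid, gamma=0.9, eps=1e-4,
                    reward=1.0, penalty=-1.0, step_cost=-0.04):
    # Assumed reward scheme: +1 at the reward cell, -1 at penalty cells,
    # and a small negative step cost everywhere else; walls are never entered.
    n = len(grid)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def cell_reward(i, j):
        return {REWARD: reward, PENALTY: penalty}.get(grid[i][j], step_cost)

    U = [[0.0] * n for _ in range(n)]
    while True:
        delta = 0.0
        new_U = [row[:] for row in U]
        for i in range(n):
            for j in range(n):
                if grid[i][j] == WALL:
                    continue                      # walls keep utility 0
                if grid[i][j] == REWARD:
                    new_U[i][j] = reward          # terminal state
                    continue
                # Deterministic moves; a blocked move keeps the agent in place.
                best = max(U[i + di][j + dj] if grid[i + di][j + dj] != WALL else U[i][j]
                           for di, dj in moves)
                new_U[i][j] = cell_reward(i, j) + gamma * best
                delta = max(delta, abs(new_U[i][j] - U[i][j]))
        U = new_U
        if delta < eps:
            return U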
Path Planning:
Path planning is used extensively in the field of Artificial Intelligence. Its job here is to find the shortest path for the agent towards the goal state, and various informed and uninformed search strategies can be used for it. The path planning algorithm should return the optimal path from the agent to the goal state.
A grid-based algorithm first overlays a grid on the maze space and identifies each state with a grid point. At each grid point, the robot is allowed to move to adjacent grid points as long as the line between them is not blocked. In our work, path planning is used so that the agent reaches the reward cell in the best possible way.
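One simple way to realize this second stage, given the utility matrix from value iteration, is to walk greedily from the start cell towards the neighboring cell with the highest utility until the reward cell is reached. The sketch below is again illustrative: it reuses make_grid, value_iteration and the cell labels from the earlier snippets, and a search-based planner such as BFS or A* could be substituted for the greedy walk.

def greedy_path(grid, U, start=(1, 1), max_steps=200):
    # Follow the highest-utility non-wall neighbor at each step
    # until the reward cell is reached.
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    pos = start
    path = [pos]
    for _ in range(max_steps):
        if grid[pos[0]][pos[1]] == REWARD:
            return path
        neighbors = [(pos[0] + di, pos[1] + dj) for di, dj in moves
                     if grid[pos[0] + di][pos[1] + dj] != WALL]
        pos = max(neighbors, key=lambda c: U[c[0]][c[1]])
        path.append(pos)
    return path   # reward not reached within max_steps

# Example usage (illustrative):
grid = make_grid(seed=0)
U = value_iteration(grid)
print(greedy_path(grid, U))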
Some sample results are as follows. The value of each cell is shown.
Figure 1: Sample Grid 1
Figure 2: Sample Grid 2
Discussion:
For various maze configurations, the agent is able to reach the positive (reward) state by using the policy produced by the value iteration process. Even when the policy produces different paths on a given maze, the number of steps is exactly the same, because the agent reaches the positive state equally well with a rectangular or a zigzag movement. As the number of iterations increases, the smallest utility value in each maze also increases; going from 20 to 60 iterations, the difference is up to around +0.7.