LowComplexity Modular Policies: Learning to Play PacMan and a New Framework beyond MDPs
Abstract.
In this paper we propose a method that learns to play PacMan. We define a set of highlevel observation and action modules. Actions are temporally extended, and multiple action modules may be in effect concurrently. A decision of the agent is represented as a rulebased policy. For learning, we apply the crossentropy method, a recent global optimization algorithm. The learned policies reached better score than the handcrafted policy, and neared the score of average human players. We argue that learning is successful mainly because (i) the policy space includes the combination of individual actions and thus it is sufficiently rich, (ii) the search is biased towards lowcomplexity policies and low complexity solutions can be found quickly if they exist. Based on these principles, we formulate a new theoretical framework, which can be found in the Appendix as supporting material.
1. Introduction
During the last two decades, reinforcement learning has reached a mature state, and has been laid on solid foundations. We have a large variety of algorithms, including valuefunction based, direct policy search and hybrid methods (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996). The basic properties of many such algorithms are relatively well understood (e.g. conditions for convergence, complexity, effect of various parameters etc.), although it is needless to say that there are still lots of important open questions. There are also plenty of test problems (like various mazenavigation tasks, polebalancing, car on the hill etc.) on which the capabilities of RL algorithms have been demonstrated, and the number of successful largescale RL applications is also growing steadily. However, there is still a sore need for more successful applications to validate the place of RL as a major branch of artificial intelligence.
We think that games (including the diverse set of classical board games, card games, modern computer games etc.) are ideal test environments for reinforcement learning. Games are intended to be interesting and challenging for human intelligence and therefore, they are ideal means to explore what artificial intelligence is still missing. Furthermore, most games fit well into the RL paradigm: they are goaloriented sequential decision problems, where each decision can have longterm effect. In many cases, hidden information, random events, unknown environment, known, or unknown players account for (part of) the difficulty of playing the game. Such circumstances are in the focus of the reinforcement learning idea. They are also attractive for testing new methods: the decision space is huge in most cases, so finding a good strategy is a challenging task.
There is another great advantage of games as test problems: the rules of the games are fixed, so the danger of ‘tailoring the task to the algorithm’ – i.e., to tweak the rules and/or the environment so that they meet the capabilities of the proposed RL algorithm – is reduced, compared, e.g., to various maze navigation tasks.
RL has been tried in many classical games, including checkers (Samuel, 1959), backgammon (Tesauro, 1994), and chess Baxter et al. (2001). On the other hand, modern computer games got into the spotlight only recently, and there are not very many successful attempts to learn them with AI tools. Notable exceptions are, e.g., roleplaying game Baldur’s Gate (Spronck et al., 2003), realtime strategy game Wargus (Ponsen and Spronck, 2004)), and possibly, Tetris (Szita and Lőrincz, 2006). These games are also interesting from the point of view of RL, as they catch different aspects of human intelligence: instead of deep and wide logical deduction chains, most modern computer games need shortterm strategies, but many observations have to be considered in parallel, and both the observation space and the action space can be huge.
In this spirit, we decided to investigate the arcade game PacMan. The game is interesting on its own, as it is largely unsolved, but also imposes several important questions in RL, which we will overview in Section 7. We will show that a hybrid approach is more successful than either tabula rasa learning or a handcoded strategy alone. We will provide handcoded highlevel actions and observations, and the task of RL is to learn how to combine them into a good policy. We will apply rulebased policies because they are easy to interpret, and it is easy to include human domainknowledge. For learning, we will apply the crossentropy method, a recently developed general optimization algorithm.
In the next section we overview the PacMan game and the related literature. We also investigate the emerging questions upon casting this game as a reinforcement learning task. In sections 3 and 4 we give a short description of rulebased policies and the crossentropy optimization method, respectively. In section 5 we describe the details of the learning experiments, and in section 6 we present our results. Finally, in section 7 we summarize and discuss our approach with an emphasis on its implications for other RL problems.
2. PacMan and reinforcement learning
2.1. The PacMan game
The videogame PacMan was first released in 1979, and reached immense success, it is considered to be one of the most popular video games to date (Wikipedia, 2006).
The player maneuvers PacMan in a maze (see Fig. 1), while ‘eating’ the dots in the maze. There are 174 dots, each one is worth 10 points. A level is finished when all the dots are eaten. To make things more difficult, there are also four ghosts in the maze ‘who’ try to catch PacMan, and if they succeed, PacMan loses a ‘life’. Initially, ‘he’ has three lives, and gets an extra life after reaching 10,000 points.
There are four powerup items in the corners of the maze, called power dots (worth 40 points). After PacMan eats a power dot, the ghosts turn blue for a short period, they slow down and try to escape from PacMan. During this time, PacMan is able to eat them, which is worth 200, 400, 800 and 1600 points, consecutively. The point values are reset to 200 each time another power dot is eaten, so it is advantageous for the player to eat all four ghosts per power dot. After being eaten, ghosts are ‘reborn’ in the center of the maze.
Our investigations are restricted to learning an optimal policy for the first level, so the maximum achievable score is .^{1}^{1}1The rules of the original PacMan game are slightly different. The above description applies to the opensource PacMan implementation of Courtillat (2001). The two versions are about equivalent in terms of complexity and entertainment value.
In the original version of PacMan, ghosts move on a complex but deterministic route, so it is possible to learn a deterministic action sequence that does not require any observations. Many such patterns were found by enthusiastic players. In our implementation, ghosts moved randomly in 20% of the time and straight towards PacMan in the remaining 80%, but ghosts may not turn back (in accordance with the original implementation). This way, there is no single optimal action sequence, observations are required for optimal decision making. Similar methods of randomization are implemented in many PacMan’s sequels (e.g., Ms. PacMan).
2.2. Previous work on PacMan
Although the game can be properly formalized as a finite MDP, the resulting model would have about states. The learning task is hard even with approximation techniques, so the only RL approach known to us (Bonet and Stauffer, 1999) restricts observations to a window centered at PacMan. Through a series of increasingly difficult learning tasks, they were able to teach basic pelletcollecting and ghostavoidance behaviors in greatly simplified versions of the game: they used simple mazes containing no power pellet and only one ghost.
There have been several other attempts using genetic algorithms, and the only fullscale PacMan learner that we know uses genetic algorithms with handcrafted features and it applies a neural network position evaluator (Lucas, 2005).
2.3. PacMan as an RL task
PacMan meets all the criteria of a reinforcement learning task. The agent has to make a sequence of decisions that depend on its observations. The environment is stochastic (the path of ghosts is unpredictable). There is also a welldefined reward function (the score for eating things), and actions influence the collected reward in the remote future.
The full description of the state would include (1) whether the dots have been eaten (one bit for each dot and one for each power dot), (2) the position and direction of PacMan, (3) the position and direction of the four ghosts, (4) whether the ghosts are blue (one bit for each ghost), (5) the number of lives left. The resulting state space is enormous, so some kind of function approximation or featureextraction are necessary for RL.
The action space seems less problematic, as there are only four basic actions: go north/south/east/west. However, a typical game consists of multiple hundreds of steps, so the number of possible combinations is still enormous. This indicates the need for temporally extended actions.
We have a moderate amount of domain knowledge about PacMan: on one hand, it is quite easy to define highlevel observations and action modules that are potentially useful. On the other hand, constructing a wellperforming policy seems much more difficult. Therefore, we chose a hybrid approach: we use domain knowledge to preprocess the state information and to define action modules, and combine them into a rulebased policy. However, we use policy search reinforcement learning to learn the proper combination.
3. Rulebased policies
In a basic formulation, a rule is a sentence of the form "if Condition holds, then do Action". A rulebased policy is a set of rules with some mechanism for breaking ties, i.e., to decide which rule is executed, if there are multiple rules with satisfied conditions.
Rulebased policies are humanreadable, it is easy to include domain knowledge, and they are able to represent complex behaviors. For these reasons, they are often used in many areas of artificial intelligence, e.g. (Spronck et al., 2003).
In order to apply rulebased policies to PacMan, we need to specify four things: (1) what are the possible actions (2) what are the possible conditions and how are they constructed from observations, (3) How to make rules form consitions and actions, and (4) how to combine the rules into policies. These will be described in the following sections.
3.1. Action modules
Name  Description 

ToDot  Go towards the nearest dot. 
ToPowerDot  Go towards the nearest power dot. 
FromPowerDot  Go in direction opposite to the nearest power dot. 
ToEdGhost  Go towards the nearest edible (blue) ghost. 
FromGhost  Go in direction opposite to the nearest ghost. 
ToSafeJunction  For all four directions, the "safety" of the nearest junction is estimated in that direction. If PacMan is steps away from the junction and the nearest ghost is steps away, then the safety value of this junction is . A negative value means that PacMan possibly cannot reach that junction. PacMan goes towards the maximally safe junction. 
FromGhostCenter  Go in a direction which maximizes the Euclidean distance from the geometrical center of ghosts. 
KeepDirection  Go further in the current direction (or turn right/left if that is impossible). 
ToLowerGhostDensity  Each ghost defines a density cloud (with radius = 10 and linear decay). PacMan goes in the direction where the cumulative ghost density decreases fastest. 
ToGhostFreeArea  Chooses a location on the board where the minimum ghost distance is largest, and heads towards it on the shortest path. 
We can define a list of potentially useful action modules for PacMan (see Table 1). Some of these are intuitive, while the last five were deduced by playing and analyzing the game.
Note that these modules are not exclusive. For example, while escaping from the ghosts, PacMan may prefer the route where more dots can be eaten, or it may want to head towards a power dot. Without the possibility of such parallel actions, the performance of the PacMan agent may be reduced, and preliminary experiments showed that this is the case, indeed.
We need a mechanism for conflict resolution, because different action modules may suggest different directions. We do this by assigning priorities to the modules. When the agent switches on an action module, he also decides its priority. This is also a decision, and learning this decision is part of the learning task.
We implemented this with the following mechanism: a decision of the agent concerns action modules: the agent can either switch on or, switch off an action module. That is, the agent is able to use any subset of the action modules – at least in principle –, instead of selecting a single one at each time step. Basically, the module(s) with highest priority decide(s) the direction of PacMan. If there are more than one equally ranked directions, or modules with equal priority suggest different directions, then lowerpriority modules are checked. If the direction cannot be decided after checking all switchedon modules, then a random direction is chosen.
3.2. Conditions and Observations
Name  Description 

Constant  Constant 1 value. 
NearestDot  Distance of nearest dot. 
NearestPowerDot  Distance of nearest power dot. 
NearestGhost  Distance of nearest ghost. 
NearestEdGhost  Distance of nearest edible (blue) ghost. 
MaxJunctionSafety  For all four directions, the "safety" of the nearest junction in that direction is estimated, as defined in the description of action "ToSafeJunction". The observation returns the value of the maximally safe junction. 
GhostCenterDist  Euclidean distance from the geometrical center of ghosts. 
DotCenterDist  Euclidean distance from the geometrical center of uneaten dots. 
GhostDensity  Each ghost defines a density cloud (with radius = 10 and linear decay). Returns the value of the cumulative ghost density. 
Similarly to actions, we can easily define a list of observations which are potentially useful for decision making. The observations and their descriptions are summarized in Table 2. Distances denote the "length of the shortest path", unless noted otherwise. Distance to a particular object type is ‘infinite’ if no such object exists at that moment.
Now we have the necessary tools for defining the conditions of a rule. A typical condition is true if its observations are in a given range. We note that the status of each action module is also important for proper decision making. For example, the agent may decide that if a ghost is very close, then it switches off all modules except the escape module. Therefore we allow conditions that check whether an action module is ‘on’ or ‘off’.
For the sake of simplicity, conditions were restricted to have the form "[observation] > [value]", "[observation] < [value]", "[action]+", "[action]", or the conjunction of such terms. For example,
"(NearestDot<5) and (NearestGhost>8) and (FromGhost+)" 
is a valid condition for our rules.
3.3. Constructing rules from conditions and actions
Now, we have conditions and actions. A rule has the form:
"if [Condition] holds, then do [Action]". For
example,
"if (NearestDot<5) and (NearestGhost>8) and (FromGhost+)
then FromGhostCenter+"
is a valid rule. In all of our experiments, we considered only rules with at most three conditions.
3.4. Constructing policies from rules
Decision lists are standard forms of constructing policies from single rules. This is the approach we pursue here, too. Decision lists are simply lists of rules, together with a mechanism that decides the order in which the rules are checked.
We assign priorities to each rule. When the agent has to make a decision, it checks its list of rules, starting with the highest priority ones. If the conditions of a rule are fulfilled, then the corresponding action is executed, and the decisionmaking process halts.
Note that in principle, the priority of a rule can be different from the priority of action modules. However, for the sake of simplicity, we make no distinction: if a rule with priority switches on an action module, then the priority of the action module is also taken as . Intuitively, this makes sense: if an important rule is activated, then its effect should also be important. Naturally, if a rule with priority switches off a module, then it is executed, regardless of the priority of the module.
3.5. An example
Let us consider the example shown in Table 3. This is a rulebased policy for the PacMan agent.
Rule No.  Priority  Rule 

Rule 1  [1]  if (NearestGhost<4) then FromGhost+ 
Rule 2  [1]  if (NearestGhost>7) and (JunctionSafety>4) then FromGhost 
Rule 3  [2]  if (NearestEdGhost>99) then ToEdGhost 
Rule 4  [2]  if (NearestEdGhost<99) then ToEdGhost+ 
Rule 5  [3]  if (Constant>0) then KeepDirection+ 
Rule 6  [3]  if (FromPowerDot) then ToPowerDot+ 
Rule 7  [3]  if (GhostDensity<1.5) and (NearestPowerDot<5) then FromPowerDot+ 
Rule 8  [3]  if (NearestEdGhost>99) then FromPowerDot 
Rule 9  [3]  if (NearestPowerDot>10) then FromPowerDot 
The first two rules manage ghost avoidance: if a ghost is too close, then the agent should flee, and should do so until it gets to a safe distance. Ghost avoidance has priority over any other activities. The next two rules regulate that if there is an edible ghost on the board, then the agent should chase it (the value of NearestEdGhost is infinity () if there are no edible ghosts, but it is on our board, if there are). This activity has also relatively high priority, because eating ghosts is worth lots of points, but it must be done before the blue color of the ghost disappears, so it must be done quickly. The fifth rule says that the agent should not turn back, if all directions are equally good. This rule prevents unnecessary zigzagging (when no dots are eaten), and it is surprisingly effective. The remaining rules tweak the management of power dots. Basically, the agent prefers to eat a power dot. However, if there are blue ghosts on the board, then a power dot resets the score counter to 200, so it is a bad move. Furthermore, if ghost density is low around the agent, then most probably it will be hard to collect all of the ghosts, so it is preferable to wait with eating the power dot.
The mechanism of decision making is depicted in Fig 2. In short, the (hidden) statespace is the world of the PacMan and the Ghosts. The dynamics of this (hidden) statespace determines the vector of observations, which can be checked by the conditions. If the conditions of a rule are satisfied, the corresponding action module is switched on or off. As a consequence, multiple actions may be in effect at once. For example, the decision depicted in Fig. 2 sets two actions to work together.
3.6. Learning rulebased policies by policy search
We will perform policy search RL in the space of rulebased policies. Our algorithm will construct policies according to its parameter set. The policies will be tested in the environment, by using them to control PacMan and measure the collected rewards. The results of these tests are then used to improve the parameter set, and consequently, the policy construction procedure.
4. The crossentropy method
Our goal is to learn a rulebased policy that has the form described in the previous section, by performing policy search in the space of all legal rulebased policies. For this search we apply the crossentropy method, a recently published global optimization algorithm (Rubinstein, 1999). Below we summarize the mechanism of this method briefly.
4.1. The general form of the algorithm
The crossentropy (CE) method is a general algorithm for (approximately) solving global optimization tasks of the form
(1) 
where is a general objective function (e.g., we do not need to assume continuity or differentiability). While most optimization algorithms maintain a single candidate solution in each time step, CE maintains a distribution over possible solutions. From this distribution, solution candidates are drawn at random. This is essentially random guessing, but with a nice trick it is turned into a highly effective optimization method.
Random guessing is an overly simple ‘optimization’ method: we draw many samples from a fixed distribution , then select the best sample as an estimation of the optimum. In the limit case of infinitely many samples, random guessing finds the global optimum. We have two notes here: (i) as it has been shown by Wolpert and Macready (1997), for the most general problems, uniform random guessing is not worse than any other method, (ii) nonetheless, for practical problems, uniform random guessing can be extremely inefficient. Thus, random guessing is safe to start with, but as one proceeds with the collection of experiences, it should be limited as much as possible.
The efficiency of random guessing depends greatly on the distribution from which the samples are drawn. For example, if is sharply peaked around , then a tremendous number of examples are needed to get a good estimate of the global optimum. The case is the opposite, if the distribution is sharply peaked at : very few samples may be sufficient to get a good estimate. Naturally, finding a good distribution is at least as hard as finding .
The idea of CE is that after drawing moderately many samples from distribution , we may not be able to give an acceptable approximation of , but we may still obtain a better sampling distribution. We will pick from a family of parameterized distributions, denoted by , and describe an algorithm that iteratively improves the parameters of the distribution .
For each , the set of highvalued samples,
provides an approximation to the level set
Let be the uniform distribution over the level set . For large values of , this distribution will be peaked around , so it would be suitable for random sampling. There are two problems with that: (i) for large values will contain very few points (possibly none), making accurate approximation impossible, and (ii) the level set is usually not a member of the parameterized distribution family.
CE avoids the first problem by making a compromise in the choice of : it prefers large improvements, so does not set too low, but it does not set too high either in order to keep plenty of samples in . This compromise is achieved as follows: CE chooses a ratio and adjusts to be the set of the best samples. This corresponds to setting , provided that the samples are arranged in decreasing order of their values. The best samples are called the elite samples. In practice, is typically chosen from the range .
The other problem is solved by changing the goal of the approximation: CE chooses the distribution from the distribution family that approximates best the empirical distribution over . The best is found by minimizing the distance of and the uniform distribution over the elite samples. The measure of distance is the crossentropy distance (often called KullbackLeibler divergence). The crossentropy distance of two distributions and is defined as
(2) 
4.2. The crossentropy method for Bernoulli distribution
For many parameterized distribution families, the parameters of the minimum crossentropy member can be computed easily from simple statistics of the elite samples. We provide the formulae for Bernoulli distributions, as these are needed for our purposes. The derivations and a list of other discrete and continuous distributions that have simple update rules can be found in the tutorial of de Boer et al. (2004).
Let the domain of optimization be , and each component be drawn from independent Bernoulli distributions, i.e. . Each distribution is parameterized with an dimensional vector . When using for sampling, component of the sample will be
(3) 
After drawing samples and fixing a threshold value , let denote the set of elite samples, i.e.,
(4) 
With this notation, the distribution with minimum CEdistance from the uniform distribution over the elite set has the following parameters:
(5)  
(6) 
In other words, the parameters of are simply the componentwise empirical probabilities of 1’s in the elite set. For the derivation of this rule, see de Boer et al. (2004).
Changing the distribution parameters from to can be too coarse, so in some cases, applying a stepsize parameter is preferable. The resulting algorithm is summarized in Table 5.
We will also need to optimize functions over with . In the simplest case, distributions over this domain can be parameterized by parameters: with and for each (this is a special case of the multinomial distribution).
The update rule of the parameters is essentially the same as eq. 6 for the Bernoulli case:
(7) 
Note that constraint is satisfied automatically for each .
5. Description of experiments
All of the learning experiments used CE, which means drawing a population of policies from some distribution, evaluating them by playing the game, and updating the distribution parameters.
5.1. Learning a policy from a handcoded rulebase
In the first experiment, we constructed a rulebase by hand. It consisted of rules that were considered potentially useful. The agent had to learn which rules to use, together with the corresponding priorities.
From the rulebase, policies were constructed via the following mechanism: a policy had rule slots. For each , slot was filled with a rule from the rulebase with probability , and left empty with probability . Each slot had a fixed priority from the set . For each element of this set, we had 10 slots.^{2}^{2}2According to our preliminary experiments, the quality of the learned policy did not improve by increasing the priority set or the number of the slots. If it was decided that a slot should be filled, then a particular rule () was selected with probability , where for each slot . As a result, policies could contain up to 30 rules, but possibly much less.
Both the values and the values were learnt simultaneously with the CE method (Table 5), using the update rules (6) and (7), respectively. This gave a total of parameters to optimize (although the effective number of parameters is much less, because the values of unused slots are irrelevant). Initial probabilities were set to and .
In each iteration, a population of policies were drawn according to the actual probabilities. The value of a policy was the average score reached in three consecutive games. Selection ratio and step size were set to and , respectively. Furthermore, in each iteration during learning, we slowly decayed the slot usage probabilities with decay factor . This choice slightly biased the optimization towards shorter policies.
5.2. Automatically constructed rulebase
In this experiment, we applied the same policy selection mechanism as in the previous experiment, but we did not use a handcoded rulebase. At the beginning of learning, rules were drawn randomly for each pair with and . A random rule is a random pair of a randomly drawn condition set and a randomly drawn action. Random condition sets contained 1, 2, or 3 conditions. These rules were not changed during learning, only their corresponding probabilities were optimized.
The following parameter values were used: population size: , number of rule slots: , number of possible rules in each slot: , selection ratio: , stepsize: , decay rate: .
5.3. Baseline experiments
A large amount of domain knowledge was used while constructing the highlevel observations and actions, which is obviously a key factor in reaching good performance. In order to isolate and assess the contribution of learning, we performed two additional experiments with different amounts of domain knowledge and no learning.
In the first nonlearning experiment, we used the rulebase of 40 handcoded rules (identical to the rulebase of the first learning experiment). Ten rules were selected at random, and random priorities were assigned to them. We measured the performance of policies constructed in this way.
In the second nonlearning experiment, we handcoded a full policy (both rules and priorities). The policy is shown in Table 3, and has been constructed by some trialanderror. Naturally, the policy was constructed before knowing the results of the learning experiments.
In the final experiment, five human subjects were asked to play the first level of PacMan and we measured their performances. Each of the subjects has played PacMan and/or similar games before, but none of them was an experienced player.
6. Experimental results
Human experiments were performed on the first level of an opensource PacMan clone of Courtillat (2001). For the other experiments we applied the Delphi reimplementation of the code.
In both learning experiments, 10 parallel learning runs were executed, each one for 300 episodes. This training period was sufficient to tune all probabilities either to 0 or 1, so the learned policy could be determined in all cases. Each obtained policy was tested by using it for 50 consecutive games, giving a total of 500 test games per experiment.
In the nonlearning experiments the agents played 500 test games, too, using random policies and the handcoded policy, respectively. Each human subject played 20 games, giving a total of 100 test games.
Method  Mean  High  # of rules 

Random rulebase + CE  6312  13900  3.9 
Handcoded rulebase + CE  7636  13900  8.0 
Handcoded rulebase + random rules  257  2010  10 
Handcoded policy  5670  10660  9 
Human play  8064  >13900^{3}^{3}3f   
Both the average scores and the high scores are summarized in Table 6. Comparing scores for handcoded domain knowledge with and without learning, we found that the contribution of crossentropy learning is significant. The average number of rules in the learned policies shown in the last column of the table, varies. Policies found by the learning methods performed better than the handcoded policies and they were shorter on the average.
On the other hand, the learned policies are still far from being optimal, and could not reach the level of nonexperienced human players. We investigated how the game is played by various policies in order to identify the possible reasons of superior human performance. It seems that the major flaw of the rulebased policies is that they cannot eat all ghosts when the ghosts turn blue. This is a serious handicap. For example, if the agent can eat only three ghosts after ghosts turn blue, but otherwise plays perfectly, it can only reach points. The task of catching all ghosts in a limited time period can be successful only if all the ghosts are nearby, and this requires strategic planning: power dots should be eaten only after all ghosts have been lured close to it. The set of available high level observations does not enable such planning: the agent cannot observe how scattered the ghosts are or how far the farthest ghost is. This type of information is easily available for human players, who ‘see’ the board and observe the topological structure of the maze.
7. Discussion
7.1. The role of domain knowledge
When demonstrating the abilities of an RL algorithm, it is often required that learning starts from scratch, so that the contribution of learning is clearly measurable. However, the choice of test problem is often misleading: many ‘abstract’ domains contain considerable amount of domain knowledge in an implicit way. As an example, consider gridworld navigation tasks, an often used class of problems for ‘tabula rasa’ learning. In a simple version of the gridworld navigation task, the state is an integer that uniquely identifies the position of the agent, and the atomic actions are moves to grid cells north/south/east/west from the actual cell.
The concepts of north, south, etc. corresponds to very highlevel abstraction, they have has a meaning to humans only, so they are domain knowledge. In fact, they are very similar to the domain knowledge provided by us, the highlevel observations and actions: observations like ‘distance of nearest ghost is ’ or ’PacMan is at position ’ are both highlevel observations. Similarly, action ’go north’ and action ’go towards the nearest power dot’ are essentially of the same level.
The implicit presence of highlevel concepts becomes even more apparent as we move from abstract MDPs to the ‘realworld’. Consider a robotic implementation of the maze task: the observation of the state is not available for the robot. It sees only local features and it may not see all local features at a time. To obtain the exact position, or to move by one unit length in the prescribed direction, the robot has to integrate information from movement sensors, optical/radar sensors etc. Such information fusion, although necessary, but does not belong to reinforcement learning. Thus, in this task, there is a great amount of domain knowledge that needs to be provided before our CE based policy search method could be applied.
Naturally, assessing the effectiveness of a learning algorithm is more difficult for nonabstract tasks, because we have to measure the contribution of human knowledge somehow. Our experiments with random and handpicked policies intend to estimate the contribution of (a varying amount of) human knowledge.
In our opinion, the role of human knowledge is that it selects the set of observations and actions that suit the learning algorithm. Such extra knowledge is typically necessary for most applications. Nonetheless, numerous (moreorless successful) approaches exist for obtaining such domain knowledge automatically. According to one approach, the set of observations is chosen from a rich (and redundant) set of observations by some feature selection method. The crossentropy method seems promising here, too (see Szita, 2006, for an application to feature selection from brain fMRI data at the 2006 Pittsburgh Brain Activity Interpretation Competition). According to a different approach, successful combinations of lower level rules can be joined into higher level concepts/rules. Machine learning has powerful tools here, e.g., arithmetic coding for data compression (Witten et al., 1987). It is applied in many areas, including the writing tool Dasher developed by Ward and MacKay (2002). Such extensions are to be included into the framework of reinforcement learning.
7.2. Lowcomplexity policies
The space of legal policies is huge (potentially infinite), so it is an interesting question how search can be effective in this huge space. Direct search is formidable. We think that an implicit bias towards lowcomplexity policies can be useful. Solutions can be used as building blocks in a continued search of lowcomplexity policies. Lowcomplexity policy here means that even if a policy consists of very many rules, in most cases, only a few of them is applied in the game.^{4}^{4}4Of course, it is possible to construct long policies so that each rule gets applied. However, the chance is tiny that we find long policies by random sampling. Unused rules do not get rewarded (nor do they get punished unless they limit a useful rule), so the effective length of policies is biased towards short policies. This implicit bias is strengthened by an explicit one in our work: the probabilities of application of a rule decay, so indifferent rules get wiped out soon.
The bias towards short policies reduces the effective search space considerably. Further, for many reallife problems, lowcomplexity solutions exist (for an excellent analysis of possible reasons, see Schmidhuber, 1997). Therefore, search is concentrated on a relevant part of the policy space, and pays less attention to more complex (and therefore less likely) policies.
7.3. Summary and Outlook
In this article we proposed a method that learns to play PacMan. We have defined a set of highlevel observation and action modules with the following properties: (i) actions are temporally extended, (ii) actions are not exclusive; actions may work concurrently. Our method can uncover action combinations together with their priorities. Thus, our agent can pursue multiple goals in parallel.
The decision of the agent concerns whether an action module should be turned on (if it is off) or off (if it is on). Further, decisions depend on the current observations and may depend on the state of action modules. The policy of the agent is represented as a list of ifthen rules with priorities. Such policies are easy to interpret and analyze. It is also easy to incorporate additional human knowledge. The crossentropy method is used for learning policies that play well. Learning is biased towards lowcomplexity policies, which is a consequence of both the policy representation and the applied learning method. The learned policies reached better score than the handcoded policy, and neared the score of average human players.
The applied architecture has the potentials to handle large, structured observation and actionspaces, partial observability, temporally extended and concurrent actions. Despite its versatility, policy search can be effective, because it is biased towards lowcomplexity policies. These properties are attractive from the point of view of largescale applications.
Acknowledgments
This material is based upon work supported partially by the European Office of Aerospace Research and Development, Air Force Office of Scientific Research, Air Force Research Laboratory, under Contract No. FA073029. This research has also been supported by an EC FET grant, the ‘New Ties project’ under contract 003752. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the European Office of Aerospace Research and Development, Air Force Office of Scientific Research, Air Force Research Laboratory, the EC, or other members of the EC New Ties project.
Appendix: The lowcomplexity Modular Policy Framework
8. A critique of Markov decision processes
Modelling RL problems as (finite) Markov decision processes (MDPs) has proved very fruitful both in the theoretical grounding and in some practical applications. However, because of the simplifications of the MDP model, such as full observability, the Markov property, finite and unstructured state and action space, equal sized time steps etc., it does not scale well for typical “reallife" applications. Therefore, most of the recent research in RL tries to extend the MDP framework in various directions or tries to find alternative models.
The MDP model is too general in some respects as it has been noted e.g. in Lane and Smart (2005): an RL algorithm is expected to solve any MDPs (at least approximately) in the same manner, and it is well known that this cannot be done faster than polynomial in the number of the states. However, practical problems often have billions of states and polynomial time solutions are intractable. Nonetheless, many of these problems have compact structured descriptions that might enable more specific algorithms. We also note that computational intractability, e.g., the “curse of dimensionality", severely restricts MDPs and its extensions, e.g., partially observable MDPs (POMDPs), predictive state representations, observable operator models, semiMDPs, with a few notable exceptions like factored MDPs.
We collect here several requirements that have to be resolved for largescale, “reallife" RL tasks. We argue that these requirements can be handled in a unified way, provided that the attributes of the agent, such as action, state, and memory, are treated on equal footings, and that the agent is characterized by a (factored) set of modules. Each of these modules may be statelike, actionlike etc., or even the mixture of these. We show that in this formalism, policy is a module to module mapping that makes mathematics simple. This is true even for complex policies involving partial observability, memory management, attention focusing or parallel actions, issues that emerge in many practical problems.
We also show that if the complexity of the policy is low then, in our formalism, the learning task becomes tractable without further compromises. We provide an algorithm that learns lowcomplexity modular policies in the form of decision queues, show that it is convergent, and – in the idealistic limit case – it finds the optimum.
9. Modular representation: An informal description
9.1. An illustrative example
Let us consider driving a car in the city in order to list the challenges of reallife RL agents. When driving towards a destination, the driver has to cross intersections, has to pass other cars, and has to obey the traffic signs and traffic lights. Unfortunately, the driver cannot observe everything about its current situation, e.g. if one looks to the left, she cannot see what is on the right; if she looks at the mileometer, she cannot see what is happening on the road (partial observability). She decides where to look depending on the situation: at the car before her, the traffic signs, the control panel, or something else (attention focusing). She is aided by her shortterm memory: she remembers recent observations. She is engaged in parallel multiple activities: steers and speeds up for an overtake, uses the brake and looks around in a crossing. Such combined actions are typical in driving. The durations of the actions and events may vary, and are not well defined (nonuniform time steps). Also, actions can be continuous, like braking, or discrete, like switching the lights on. Similarly, observations can also be continuous, like the distance from the crossing, or discrete, like the color of traffic light.
Although the policy of the driver is very complex, only a tiny fraction of the possible policies is ever tried. For example, most drivers may never try to find out the immediate reward for looking right, pushing the brake, steering the wheels to the right, getting into a small street, then getting the car straight, to speed up and look back, not to mention other combinations, like looking right and turning the wheels left. Despite the complexity of the policy, it is built up from simple ones by means of simple combination rules.
9.2. Modules
As the above description illustrates, all the attributes of the agent, its observations, actions and memory have much in common: (1) they have factorized structure (2) they can be either continuous or discrete (3) they can be influenced by the policy and (4) the policy can be influenced by them. Furthermore, the distinction between them is blurred, e.g. the action ‘turn back to see what’s behind me’ is an observation, manipulates memory, and focuses attention (what to observe). Therefore it seems reasonable to treat them as different forms of a single concept that we will call modules. We can talk about observationlike, memorylike or actionlike modules, but these concepts are not necessarily exclusive.
Using such a representation, the agent is described by a set of modules. These modules constitute a factored representation, and their domain is arbitrary (continuous or discrete). Naturally, the agent itself can modify only some of its modules, others are influenced by its environment and again, these two sets are not exclusive.
9.2.1. Preserving computational tractability
Because of the factored structure, it is possible that the decisions of the agent depend only on a few modules, and affect only a few other ones. Therefore, we have the opportunity to express simple activities with simple (short) policy descriptions. This enables us to make the policy search tractable by restricting search to simple policies.
We shall define ’simplicity’ rigorously in the next section. Basically, we are looking for policies that are composed of relatively few decisions and these decisions have compact descriptions about their conditions and their effects. Then we can manage the search, which is polynomial in the size of the problem description. This restricted policy space still contains interesting policies: many reallife solutions have simple structures, despite of the size of the state space. Problems with complex (near)optimal solutions are hard for humans, too, and they are outside of our present considerations.
9.3. Advantages of modular representation
Below we summarize the expected advantages of the modular representation. Firstly, we are able to handle partial observability, memory, and in particular, focus of attention. Secondly, because of the unified treatment of various agent attributes, policies assume a simple form despite of their complexities compared to, e.g. memoryless MDP policies. Furthermore, we can handle composite attributes, e.g. a single module may have observationlike and actionlike components. The factored representation enables us to use multiple state variables and/or multiple parallel actions and, in turn, many interesting problems may have compact descriptions. And finally, we can use differential policy representation (the policy prescribes how to change the actual representation), simple policies can have compact descriptions. Thus, we can restrict our searches to the set of simple policies.
9.4. Related literature
Due to the limitations of space, we can mention only the most relevant frameworks and methods.
The general framework for handling partial observability is the POMDP framework Murphy (2005), but recently, other alternatives were also proposed, including predictive state representations Singh et al. (2003) and observable operator models Jaeger (1999). In POMDPs, memory and attention are handled implicitly, but there are also numerous methods that use explicit memory management, e.g. memory bits, finite state machines, variablelength history suffixes, or attention focusing. Another direction that extends MDP is the semiMDP (SMDP) framework Sutton et al. (1999), which enables e.g. the use of parallel, varyinglength actions (although they must be synchronized). SMDP is also used in hierarchical methods Barto and Mahadevan (2003).
These models are all extensions of MDPs, so general solution algorithms for them are computationally at least as intractable as for MDPs or may be even harder. Function approximation (FAPP) and direct policy search (see e.g. Sutton et al. (2000)) are two common and successful techniques for reducing complexity. However, policies learnt by policy search and/or FAPP keep many of the MDP restrictions; they are memoryless, use reactive policies, and can not handle parallel and varyinglength actions. Furthermore, constraints of the parameter space introduce other restrictions that are often nonintuitive.
In our approach, state space representation is similar to factored MDPs (e.g. Guestrin et al. (2003)), but we proceed by policy search instead of learning value functions.
10. Formal description of the lowcomplexity modular policy framework
Often, nonmodifiable components, such as the value of an observation, or the execution of a longer action, have related components that control its usage, e.g. if we can observe that variable, if the action is running or not, or if the relevance of the component is high or low for the agent. Therefore, it seems practical to define modules as pairs, consisting of (i) the output value of the module, and (ii) the extent that the module is used or whether it is used at all. In principle, the range of output values can be from an arbitrary set, but for the sake of simplicity, we restrict it to (subsets of) real numbers. Also, we can restrict modules to on and off states that can be switched, or we can use real numbers to represent their influence, which can be tuned on a continuous scale.
Definition 10.1 (Module).
a pair is called a module, where is the actual output value of the module, and is its influence..
Definition 10.2 (Modular state representation).
For , the ordered set is called an dimensional modular state representation, if is a module for any . The set of modular state representations for a fixed is denoted by .
Let be the set of all mappings. A modular policy that belongs to can be subject to restrictions. For example, we may ensure that the policy is constrained to legal actions and the agent does not execute actions that are unsafe or contradictory, it cannot modify the actual values of the observations etc. Such constraints will be encoded by a problemspecific mapping . is the internal dynamics and maps the current representation and the one proposed by the policy to the realized state representation. We shall also limit the complexity of the policies; subset will denote the set of ‘simple’ policies (see later).
We define the environment of the agent as a general controllable dynamic process that provides observable quantities and rewards. We do not assume anything, e.g. full observability, beyond that.
Definition 10.3 (Environment).
Let , and be arbitrary state, observation and action spaces, respectively. The environment is a tuple , where

is the initial state,

is the transition function of the environment,

is the observation function,

is the reward function.
The agent is determined by its policy, its internal dynamics, and the interfaces that map primitive observations to modules and modules to primitive actions (and may handle conflicting actions).
Definition 10.4 (Modular representation agent).
For a given observation space and action space , a modular representation agent is a tuple , where

is the initial module representation,

is the input interface that tells the effect of observations on the modules of the agent,

is the output interface that translates modules to primitive actions,

is the policy of the agent,

is the internal dynamics of the agent.
With these definitions, we can formally describe the agentenvironment interaction: consider an environment and a modular representation agent . At , the environment is in state and the agent is in state . The interaction is as follows:
(observation)  
(reward)  
(observationandmoduletomodule mapping)  
(decision of the agent)  
(internal dynamics)  
(moduletoaction mapping)  
(environment dynamics) 
The decision task can be formalized by fixing the parameters of the environment and the interface:
Definition 10.5.
A modular sequential decision problem is given by an environment , a set of allowed policies and a family of agents with fixed interface mappings and internal dynamics, and a discount factor .
A solution of this problem is a policy for which the expected discounted cumulative reward,
is maximal, supposed that the system is working according to the equations above.
10.1. Lowcomplexity modular policies
We have to define a restricted policy set . There are different approaches for bounding complexity: one can describe policies with a fixed (small) number of parameters (used e.g. in policy search methods), decision trees, or decision queues. As an example, we shall apply decision queues here, which is a flexible structure and fits nicely into the general optimization algorithm to be utilized.
A decision queue is an unordered list of rules, where every rule assumes the form
where is a Boolean expression depending on the current module representation, is a policy, which is considered atomic, and determines the order of the rules in the queue. The action taken by a decision queue is determined by checking all the rules in the order of their priorities (ties are broken arbitrarily). We choose the first rule with satisfied conditions, and execute its prescribed atomic policy.
To achieve low complexity, both the conditions and the atomic policies are chosen from a finite set with polynomial size in the number of modules, and the number of priorities is also kept low. This ensures that the building blocks have simple (short) encodings. Furthermore, the number of building blocks in a queue will be also limited. Policies of this kind will be called lowcomplexity modular policies (LCM policies).
11. Finding optimal LCM policies
Let be the set of possible rules and be the maximum number of allowed rules in a policy. For all , let be a subset of applicable rules belonging to index and is the subset of applicable priorities. Let be the set of allowed priority queues. To apply the CE method, we define a distribution over : in episode , let us denote the probability that rule will be selected by . If rule is used, we have choices of rules to choose from. The probability of the one is denoted by . We draw from independent Bernoulli distributions with choices. We can directly apply the CE method to them: in each episode, we draw a population of policies according to the current distribution, try them to get their cumulated reward, select the elite, and use Eq. 6 to update the distribution.
We prefer short policies: probabilities are discounted by a factor in each step.
11.1. Convergence
It is known that under mild regularity conditions, the CE method converges with probability 1 Margolin (2004). Furthermore, for a sufficiently large population, the global optimum is found with high probability.
The CE method has become attractive through a large number of experimental evidences that – even with small populations – it finds good local optima of large, hard instances of NPhard problems (cf. references in de Boer et al. (2004)). Also, performance is insensitive to the particular choice of optimization parameters in a broad range, so little finetuning is necessary.
12. Applying the LCMP Framework to PacMan
As we could see, some kind of processing of the input (and possibly output) is necessary to make the learning problem tractable. This is easy; one can implement potentially useful features and primitive actions similar those applied by human players, e.g. ‘the distance of the nearest ghost/pellet/power pellet’, ‘average distance of ghosts/pellets’, ‘length of current corridor’, ‘go towards the nearest pellet/power pellet/edible ghost’, ‘keep direction’, ‘go away from nearest ghost’, etc. The real challenge is how to utilize these modules and how to combine them.
These features are inherently continuous and modular, some of the actions may run sidebyside (e.g. ‘go to nearest pellet’ and ‘keep direction’), others may conflict, and their duration may vary. Any of the observations may prove useful in certain situations, but the agent will never need all of them at once. All of these properties are in concordance with the LCMP framework and this framework can be readily applied if features and primitive actions are all treated as modules. Note that PacMan’s policy may be nonMarkovian, because CE does not exploit the Markov property.
13. Discussion
The LCMP framework provides a general model for formalizing reinforcement learning problems. In this model, the agent’s state representation is a set of parallel modules that can be switched. Modules unify observations, actions and memory in a mathematically simple, general concept. We showed that modular policies satisfy a number of desirable requirements in a natural way. By bounding the complexity of modular policies, the learning problem becomes tractable. To demonstrate this, we described an applicatoin of the framework to PacMan.
We note that our formalism allows one to provide a large amount of prewired knowledge (such as those used in the PacMan experiments). For many reallife problems, such knowledge is easily available, and we believe that it is also necessary for obtaining good performance. The problem, how to emerge highlevel concepts by machine learning is out of the scope of the present study. We also note that we are not aware of any RL method that would be able to handle the large state space, partial observability, parallel and varyinglength activities that are present in the fullscale PacMan game.
Exploration of the potentials of LCM policies is still at an early stage, so there are many open questions. For example, it is unclear how to perform credit assignment, i.e. how to decide the contribution of a given rule to the total performance of the policy. Bucket brigadelike methods applied in evolutionary methods Bull and Kovacs (2005) seem promising here.
References
 Barto and Mahadevan [2003] A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
 Baxter et al. [2001] J. Baxter, A. Tridgell, and L. Weaver. Reinforcement learning and chess. pages 91–116, 2001.
 Bertsekas and Tsitsiklis [1996] D. P. Bertsekas and J. N. Tsitsiklis. NeuroDynamic Programming. Athena Scientific, 1996.
 Bonet and Stauffer [1999] J. S. De Bonet and C. P. Stauffer. Learning to play PacMan using incremental reinforcement learning, 1999. URL http://www.debonet.com/Research/Learning/~PacMan/. [Online; accessed 09 October 2006].
 Bull and Kovacs [2005] L. Bull and T. Kovacs, editors. Foundations of Learning Classifier Systems. Springer, 2005.
 Courtillat [2001] P. Courtillat. NoNSeNS Pacman 1.6 with C sourcecode, 2001. URL http://www.programmersheaven.com/download/17847/download.aspx. [Online; accessed 09 October 2006].
 de Boer et al. [2004] P. de Boer, D. Kroese, S. Mannor, and R. Rubinstein. A tutorial on the crossentropy method. Annals of Operations Research, 134(1):19–67, 2004. URL citeseer.csail.mit.edu/deboer02tutorial.html.
 Guestrin et al. [2003] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution algorithms for factored MDPs. JAIR, 19:399–468, 2003.
 Jaeger [1999] H. Jaeger. Action selection for delayed, stochastic reward. In Proc. 4th Annual Conf. of the German Cognitive Science Society (KogWis99), pages 213–219, 1999.
 Lane and Smart [2005] Terran Lane and Bill Smart. Why (PO)MDPs lose for spatial tasks and what to do about it. In Proc. ICML 2005 Workshop on Rich Repr. for Reinforcement Learning, 2005.
 Lucas [2005] S. M. Lucas. Evolving a neural network location evaluator to play Ms. PacMan. In IEEE Symposium on Computational Intelligence and Games, pages 203–210, 2005.
 Margolin [2004] L. Margolin. On the convergence of the crossentropy method. Annals of Operations Research, 134:201–214, 2004.
 Murphy [2005] Kevin P. Murphy. A survey of POMDP solution techniques, 2005. Retrieved from http://www.cs.ubc.ca/murphyk/papers.html.
 Ponsen and Spronck [2004] M. Ponsen and P. Spronck. Improving adaptive game AI with evolutionary learning. In Computer Games: Artificial Intelligence, Design and Education, 2004. URL http://www.cs.unimaas.nl/p.spronck/Pubs/PonsenCGAIDE.pdf.
 Rubinstein [1999] R. Rubinstein. The crossentropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1:127–190, 1999. URL citeseer.ist.psu.edu/rubinstein99crossentropy.html.
 Samuel [1959] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 6:211–229, 1959.
 Schmidhuber [1997] J. Schmidhuber. A computer scientist’s view of life, the universe, and everything. In C. Freksa, M. Jantzen, and R. Valk, editors, Foundations of Computer Science: Potential  Theory  Cognition, volume 1337, pages 201–208. Lecture Notes in Computer Science, Springer, Berlin, 1997. ISBN 354063746X.
 Singh et al. [2003] Satinder P. Singh, Michael L. Littman, Nicholas K. Jong, David Pardoe, and Peter Stone. Learning predictive state representations. In Proc. 20th ICML, pages 712–719, 2003.
 Spronck et al. [2003] P. Spronck, I. SprinkhuizenKuyper, and E. Postma. Online adaptation of game opponent AI in simulation and in practice. In Q. Mehdi, N. Gough, and S. Natkin, editors, Proceedings of the 4th International Conference on Intelligent Games and Simulation, pages 93–100, 2003.
 Sutton and Barto [1998] R. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
 Sutton et al. [2000] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 13, pages 1057–1063. MIT Press, 2000.
 Sutton et al. [1999] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semiMDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112:181–211, 1999.
 Szita [2006] I. Szita. How to select the 100 voxels that are best for prediction – a simplistic approach. Technical report, Eötvös Loránd University, Hungary, 2006. URL http://www.ebc.pitt.edu/2006/2006results.html.
 Szita and Lőrincz [2006] I. Szita and A. Lőrincz. Learning tetris using the noisy crossentropy method. Neural Computation, 2006. (to appear).
 Tesauro [1994] G. Tesauro. TDGammon, a selfteaching backgammon program, achieves masterlevel play. Neural Computation, 6(2):215–219, 1994.
 Ward and MacKay [2002] D. J. Ward and D. J. C. MacKay. Fast handsfree writing by gaze direction. Nature, 418:838–540, 2002.
 Wikipedia [2006] Wikipedia. PacMan — Wikipedia, the free encyclopedia. Wikipedia, 2006. URL http://en.wikipedia.org/wiki/PacMan. [Online; accessed 09 October 2006].
 Witten et al. [1987] I. A. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30:520–540, 1987.
 Wolpert and Macready [1997] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1:67–82, 1997.