Dr. Dobb's Journal March 1997
You're about to let a web spider loose on the Internet. How do you know it will seek out the information you want, without disrupting the Internet? You're in charge of writing the scheduling algorithm for a bank of elevators. How do you know when to go up and when to go down? In general, how do you make software do the right thing for its users? How do you even know what the right thing is?
Many people see agents and agent-based programming ushering in a new era in computing, particularly on the Internet. The optimists believe that all the protocols for data transfer, encryption, security, and payment will be sorted out in a year or so, after which we can go about writing new and exciting agent-based applications. In this article, we'll use the elevator problem to explain why using agents is not business as usual -- rather, it requires a new way of looking at problems and their solutions. We'll also examine how reinforcement learning can be used to address these problems.
When you hire a human agent to do something for you, you rarely spell out a detailed plan of action. Instead, you define the state of the environment that you want to achieve (for example, you might tell a contractor that you want a new front porch with comfortable seating for under $2000). In more complex and uncertain situations, you specify preferences rather than state outright goals, as when you tell a stockbroker that the more money you make, the better, but count capital gains as, say, 30 percent better than dividend income. Your agent then takes actions on your behalf -- even negotiates with other agents -- to help achieve your preferences.
Since we want software agents to behave the same way, we need both a way to describe our preferences to software agents and a methodology for building agents that best satisfy our preferences. The pleasant surprise is that for many problems, once preferences are known, we're almost done! Given the preferences, a list of possible actions, and enough time to practice taking actions, you can apply the formalism of reinforcement learning to build an agent that acts according to the preferences in a near-optimal way.
Consider the problem of programming a bank of elevators (borrowed from Crites and Barto, "Improving Elevator Performance Using Reinforcement Learning," Advances in Neural Information Processing Systems 8, 1996). Figure 1 shows a ten-floor building with three elevators. How should you write the program to control them?
One way is to think of this as an agent-based program. Once every, say, half second, the program examines the information available to the elevator sensors and decides what actions the elevators should take. The sensor information (or percepts) include the buttons that are lit on each floor and inside each elevator. (More advanced elevators can detect the weight of passengers in each car and the breaking of an electric eye beam at each door; we'll assume these elevators do not.) An empty elevator can either stop, go up, or go down. One constraint, driven by passenger expectation, is that an occupied elevator must continue in its current direction and stop at all floors requested by its passengers before reversing directions. We assume that we don't need to explicitly control the doors: They will always open when the elevator is stopped at a floor with a corresponding lit button and close before the elevator moves.
You could use separate agents for each elevator, but for the moment, assume that there is one central program. On each iteration, the program receives as percepts an "on" or "off" signal for each button, and the current floor and direction of each elevator. The program's output is a list of actions, one for each elevator (for example, "up," "stop," and "down" for the three elevators in Figure 1), that obey the stated constraints.
In most complex problems, the percepts alone are not enough to decide on good actions. For example, it is useful to know how long a passenger has been waiting for an elevator to arrive. Since this is not available from the percepts, the system will need to have some memory. We'll use the term "state" to denote the available information, both percept and memory. (In control theory, "state" implies certain technical constraints, which we ignore for simplicity; see Barto, Bradtke, and Singh's "Learning to Act Using Real-time Dynamic Programming," Artificial Intelligence, 72, 1995).
As Figure 2 illustrates, the program's behavior is a feedback loop. At each time step, the program has access to the current state of the system and has to decide what action to execute. After each action, the system moves to a new state, and the decision process is repeated. Technically, we say that a mapping from the space of states to the space of actions is a policy. Our problem is that of finding an optimal or near-optimal policy.
What is optimal? There are many criteria we could consider, among them:
Typically, there are trade-offs among these; for example, reducing wear and tear may increase wait time. Some criteria are best thought of as costs to be minimized while others are rewards to be maximized. For simplicity, we treat them all as rewards, with costs as negative rewards.
We can then define an optimal policy as one that maximizes the expected sum of the rewards it receives over its lifetime. Here, "expected" means that we have to average over all sources of randomness in the system. In more complex systems, the reward you get often depends on many random factors beyond the agent's control. For example, in the elevator system, the number of passengers and their arrival and destination floors are random.
Assume you want to minimize two quantities: the time spent by passengers waiting for an elevator, and the distance traveled by the elevators (which affects both energy and maintenance costs). To put this into the language of rewards, you need a common scale for the two quantities. For concreteness, let's say that at each time step, you receive a reward of -50 for every floor that has passengers waiting and -1 for every elevator in motion.
In the traditional approach to this problem, the programmer comes up with a set of rules and procedures on when to take what action based on common sense or rough heuristics. There are three drawbacks to this approach. First, it requires a lot of work on the part of the programmer. Second, when the environment changes in some way (perhaps faster elevators are installed), the programmer needs to modify the code. Third, there's no way to guarantee that this will lead to an optimal policy; sometimes, common sense is just wrong. Even if the rules and procedures are quite sensible in general, they might not consider the traffic patterns of a particular building, which means they might be missing an opportunity for fine tuning.
Reinforcement learning (RL) addresses all three drawbacks of the traditional approach. First, RL requires little programmer effort, because most of the work is done by automatic training, not programming. Second, if the environment changes, you can retrain the program without modifying the code. In fact, training occurs continuously as the system runs. Third, RL is mathematically guaranteed to converge to an optimal policy (given certain assumptions described shortly).
What does an agent need to know in order to take optimal actions? Clearly, it would be a big help if the agent could predict the reward function, the reward it would receive from taking a given action in a given state. But that would not be enough. Consider an empty elevator that is approaching the seventh floor on its way up, and assume that there are passengers waiting to go down on the seventh, eighth, and ninth floors. Stopping at the seventh floor would give a larger immediate reward than either continuing on up or reversing direction. But if we did stop at the seventh floor, we would have to take that passenger down and return later for the other passengers. A wiser strategy -- one with a better long-term reward -- would be to bypass the seventh and eighth floors, pick up the passengers on the ninth floor first, then pick up the passengers on the eighth and seventh floors on the way down.
The problem is that the reward function just captures the immediate or short-term consequences of these actions. What we need instead is a function that captures the long-term consequences. Such a function is called a "utility" function.
Intuitively, the utility of taking action a in some state s is the expected immediate reward for that action plus the sum of the long-term rewards over the rest of the agent's lifetime, assuming the agent uses the best policy. The equations in Figures 3(a) and 3(b) formally define the utility function.
If we knew the utility function, then the optimal policy would be to enumerate all possible actions and choose the action with the highest utility. Solving the elevator problem therefore reduces to learning the utility function. Theoretically, it is possible to solve exactly for the utility function. Figure 4 shows how the utility of a state is related to immediate reward and the utility of the resulting state. If there are n actions and m states, this defines a system of nm nonlinear equations in nm unknowns. These equations are called the "Bellman Optimality Equations," and can be solved by standard methods, such as dynamic programming (see Bertsekas, Dynamic Programming and Optimal Control, Volumes 1 and 2). Figure 5 shows the dynamic programming method to solve these equations. In practice, however, this rarely works.
One problem is that the simulation model of the elevator system consists of probability distributions for passenger arrivals and destinations. To compute the expectations in the equation (Figure 5) to do dynamic programming, you must first convert the simulation model into a model of state-transition probabilities. Doing so is computationally expensive.
Another problem is that, in the elevator example, the state-transition probabilities are continuous, so even if you had them, computing the expectations would be very expensive. Finally, the dynamic-programming method requires an iteration over the entire state space of the system. The elevator example has more than 1022 states -- iterating over all of them is impossible. If you did not have a simulation model to begin with, doing dynamic programming would first require estimating a model by experimenting with the real elevator system.
Reinforcement learning is an iterative approach that eliminates these problems to come up with an approximate solution to the utility function. The basic idea is to guess the utility function, and improve the guess based on the elevator system's experience. To gain experience, you need either a simulated elevator system or the ability to select actions and acquire data from the real elevator system.
In either case, the experience that the program acquires is a state-action trajectory with associated rewards (Figure 6). On each step through the trajectory, the agent takes some action and receives some reward. The reward is information the agent can use to update its estimate of the utility function, according to Figure 7. By focusing computation on the states that actually occur in the simulation, RL deals with larger state spaces than dynamic-programming methods. As the program gains more experience, the difference between its estimated utility function and true utility function decreases.
We are finally able to say exactly how RL allows us to write an elevator-control program:
1.Code a simulation of the elevator system. Make some measurements or estimates of passenger arrival rates and other parameters and write a program to simulate the elevator environment.
2.Determine the reward function.
3.Make an initial guess at a utility function. This can be based on simple rules such as "it is better to move up if there is a passenger waiting above." The details are not important because the guess will be improved.
4.Run the simulation, updating the utility function on each step, using the equation shown in Figure 7. Given enough time to explore in the simulated system, our program will converge to a near-optimal policy.
5.The final version of the program can then be used to control the real elevator system. If desired, the program could continually monitor its performance and update its utility function to reflect changes in the real environment.
What type of problems are appropriate for RL? Crites and Barto have recently applied RL to the elevator problem discussed here and shown it produces significantly better performance than the best solutions previously available.
RL has also been successfully applied in many areas, including process control, scheduling, resource allocation, queuing, and adaptive games. For example, we have applied RL to the problem of channel assignment in cellular-telephone systems and shown that it yields better performance than previously available solutions (see Singh and Bertsekas). Zhang and Dietterich have shown similar results in a job-shop scheduling problem, and Tesauro has used RL to develop the world's best computer backgammon player, which is nearly as good as the human champion. Many other applications are being presented at this year's Machine Learning and Neural Information Processing Society (NIPS) conferences.
When is convergence to an optimal policy guaranteed? Technically, RL will converge for all finite stationary Markov environments. What does that mean? First, it may not work if there are an infinite number of states, because it won't be able to explore them all. Second, it may not work if the environment is constantly changing, although in practice, small, slow changes in the environment are accommodated well. Finally, the mathematics will break down if the result of an action depends on states other than the current state. These limitations affect only the guarantee of convergence; there are many instances of RL working well despite violating these conditions. These limitations are actively being addressed in current research.
Is it sensible to treat all preferences as numeric rewards on a single scale? Theoretically, yes. D.W. North has proposed a theorem that says that if you believe four fairly simple axioms about preferences, then you can derive the existence of a real-valued utility function. (The only mildly controversial axiom is substitutability: If you prefer A to B, then you must prefer a coin flip between A and C to a coin flip between B and C.) In practice, it is not so simple. Users often find it hard to articulate their preferences as numbers. (For example, you have to design the controller for a nuclear power plant. How many dollars is a human life worth?)
How should the program store the utility function? The utility function maps state-action pairs to real numbers. If the size of the state-action space is small enough, this function can be stored in the form of a table; otherwise, some form of compact function approximation is used. Statisticians have a variety of representations for this purpose: decision trees, polynomial approximations, neural networks, and the like. Using such a representation not only saves space, it also gives us generalization: The program can take a sensible action from a state it has never seen before by interpolating or extrapolating from known states.
How much information should be kept as state? In some situations, there may be too many sensors, some of them giving redundant information. Too much information needlessly increases the size of the state space and makes learning the utility function slower. Progress is being made in techniques to detect and remove redundancy. In other situations, there may be too few sensors. Too little information can prevent learning from making much progress. Memory can be used to augment the limited information in these situations.
What if you don't have a simulation of the elevator problem available? You could use state trajectories obtained by controlling and experimenting with the real elevator system to train the RL solution. In this case, methods for optimal experiment design (in choosing what actions to take) are important, because real-life elevators run much slower than simulations, and because we don't want to alienate passengers as the system is being trained.
How should agents communicate and interact? We designed the elevator solution as a single agent that runs all three elevators. But consider the problem of controlling cars on a freeway. A central controlling agent would be unmanageable -- we would need to use many interacting agents instead. Game and economic-market theory can be used to extend the RL algorithm to handle these multiagent situations.
So RL says you should always take the action with the highest estimated utility, right? Actually, no. If you know the true utility function, you should always take the action that maximizes it. But the estimated utility may be wrong. Taking suboptimal actions may get you off the beaten track enough to learn something new, thereby updating the estimate, and enabling much better actions in the future. Consequently, there is always a trade-off between exploitation of the best known action and exploration of the consequences of other actions.
The increasing complexity of the problems being tackled with software systems means that in many cases, it is no longer cost-effective (or possible) for the programmer to prescribe the best action in every state of the environment. To meet these new challenges, we need better programming tools and protocols, and we need a new way of thinking about problem solving.
Just as when we hire human agents to solve problems, we want to specify only high-level preferences to our software agents, and have them figure out the best course of action. This is not only more cost effective, but it also leads to better and more robust performance in real-world, changing, and uncertain environments. Reinforcement learning enables software agents to learn near-optimal solutions to complex problems from a specification of high-level preferences.
North, D.W. "A Tutorial Introduction to Decision Theory." IEEE Transactions on Systems, Man and Cybernetics, SSC-4(3), September 1968.
Tesauro, G.J. "Practical Issues in Temporal Difference Learning." Machine Learning, 8(3/4):257-277, May 1992.
Zhang, W. and T.G. Dietterich. "High Performance Job-Shop Scheduling with a Time-Delay td
Network." Advances in Neural Information Processing Systems, 8, Cambridge, MA: MIT Press, 1996.
DDJ