Specification gaming examples in AI

Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer's intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc. While "specification gaming" is a somewhat vague category, it refers particularly to behaviors that are clearly hacks, not just suboptimal solutions.
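To make the distinction concrete, here is a minimal toy sketch (my own illustration, not an example from the list): the intended objective is to reach a goal cell in a 1-D world, but the specified reward pays for distance moved per step. An agent that simply oscillates between two cells accumulates unbounded reward while never reaching the goal, literally satisfying the reward function while failing the designer's intent. The environment, agents, and reward function below are all hypothetical.

```python
def step_reward(prev_pos, new_pos):
    """Misspecified reward: pays for movement, not for progress toward the goal."""
    return abs(new_pos - prev_pos)

def oscillating_agent(steps):
    """A 'gaming' policy: bounce between cells 0 and 1 forever."""
    pos, total = 0, 0
    for _ in range(steps):
        new_pos = 1 - pos  # flip between 0 and 1
        total += step_reward(pos, new_pos)
        pos = new_pos
    return pos, total  # (final position, accumulated reward)

def goal_seeking_agent(steps, goal=10):
    """The intended behavior: walk right until the goal, then stop."""
    pos, total = 0, 0
    for _ in range(steps):
        new_pos = min(pos + 1, goal)
        total += step_reward(pos, new_pos)
        pos = new_pos
    return pos, total

# After 100 steps the oscillator has collected far more reward than the
# goal-seeking agent, yet it never reaches the goal.
print(oscillating_agent(100))   # (0, 100)
print(goal_seeking_agent(100))  # (10, 10)
```

The oscillator is a clear hack of the movement-based reward, not merely a suboptimal solution; the fix is to reward progress toward (or arrival at) the goal rather than raw movement.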

Since these examples are currently scattered across several lists, I have put together a master list of examples collected from the various existing sources. This list is intended to be comprehensive and up-to-date, and serve as a resource for AI safety research and discussion. If you know of any interesting examples of specification gaming that are missing from the list, please submit them through this form.

Thanks to Gwern Branwen, Catherine Olsson, Alex Irpan, and others for collecting and contributing examples!


4 thoughts on "Specification gaming examples in AI"

  1. Stuart Russell

The notions of "gaming" and "hacking" suggest the AI system knows the user's intent but decides to violate it anyway by sticking to the letter of the objective function. I think that this is likely to be misleading for the layperson. Instead, we should think of these as errors in specifying the objective, period.


1. Victoria Krakovna (post author)

      Thanks Stuart! I certainly agree that these behaviors are caused by errors in specifying the objective (I’ve added a sentence in the post to clarify this). Gaming / hacking by humans is similarly caused by poorly designed incentive systems.

      I see your point that “gaming” can be interpreted as understanding the designer’s intent but deciding to violate it anyway, though I’m not sure it has to be interpreted that way. For example, schoolchildren who are optimizing for grades might not realize that they are not satisfying the intended objective of school.

      Do you have a better term in mind for these sorts of degenerate behaviors that completely fail to satisfy the intended objective? Maybe something like “shortcuts” or “literal solutions”?


  2. tdietterich

These are essentially programming bugs where the programmer did not set up the optimization problem properly. There are many lists online of typical programming errors (and advice on how to avoid them). See, for example, https://www.iiitd.edu.in/~jalote/papers/CommonBugs.pdf. Similarly, there are online resources for learning how to correctly formulate optimization problems for standard linear and integer programming packages (e.g., CPLEX and Gurobi). See, for example, https://pubsonline.informs.org/doi/pdf/10.1287/ited.7.2.153.

    It is interesting to ask why these optimization errors are qualitatively different. Here are two thoughts. First, these problems are not expressed in a standard high level optimization framework like CPLEX. This can lead to problems with incomplete sandboxing of the optimizer (so that it is allowed to access parts of the environment that it should not be able to touch). Second, specifying the objective in terms of rewards may be a bad programming language. Many of the errors result from incorrect rewards that were added to “help” the learner. Maybe there are better ways to specify the desired behavior than to use reward functions?

    Our field is still learning how to formulate problems well, and this list will be very useful for this purpose. As we go forward, I hope we will create better tools for debugging our optimizations and for monitoring their behavior.


