I’ve been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it’s fun, standardized, up-to-date, comprehensive, and collaborative.
Some of the appeal is that it’s fun to read about AI cheating at tasks in unexpected ways (I’ve seen a lot of people post on Twitter about their favorite examples from the list). The standardized spreadsheet format seems easier to refer to as well. I think the crowdsourcing aspect is also helpful – this helps keep it current and comprehensive, and people can feel some ownership of the list since can personally contribute to it. My overall takeaway from this is that safety outreach tools are more likely to be impactful if they are fun and easy for people to engage with.
This list had a surprising amount of impact relative to how little work it took me to put it together and maintain it. The hard work of finding and summarizing the examples was done by the people putting together the lists that the master list draws on (Gwern, Lehman, Olsson, Irpan, and others), as well as the people who submit examples through the form. What I do is put them together in a common format and clarify and/or shorten some of the summaries. I also curate the examples to determine whether they fit the definition of specification gaming (as opposed to simply a surprising behavior or solution). Overall, I’ve probably spent around 10 hours so far on creating and maintaining the list, which is not very much. This makes me wonder if there is other low hanging fruit in the safety resources space that we haven’t picked yet.
I have been using it both as an outreach and research tool. On the outreach side, the resource has been helpful for making the argument that safety problems are hard and need general solutions, by making it salient just in how many ways things could go wrong. When presented with an individual example of specification gaming, people often have a default reaction of “well, you can just close the loophole like this”. It’s easier to see that this approach does not scale when presented with 50 examples of gaming behaviors. Any given loophole can seem obvious in hindsight, but 50 loopholes are much less so. I’ve found this useful for communicating a sense of the difficulty and importance of Goodhart’s Law.
On the research side, the examples have been helpful for trying to clarify the distinction between reward gaming and tampering problems. Reward gaming happens when the reward function is designed incorrectly (so the agent is gaming the design specification), while reward tampering happens when the reward function is implemented incorrectly or embedded in the environment (and so can be thought of as gaming the implementation specification). The boat race example is reward gaming, since the score function was defined incorrectly, while the Qbert agent finding a bug that makes the platforms blink and gives the agent millions of points is reward tampering. We don’t currently have any real examples of the agent gaining control of the reward channel (probably because the action spaces of present-day agents are too limited), which seems qualitatively different from the numerous examples of agents exploiting implementation bugs.
I’m curious what people find the list useful for – as a safety outreach tool, a research tool or intuition pump, or something else? I’d also be interested in suggestions for improving the list (formatting, categorizing, etc). Thanks everyone who has contributed to the resource so far!