Last weekend, I attended OpenAI’s self-organizing conference on machine learning (SOCML 2016), meta-organized by Ian Goodfellow (thanks Ian!). It was held at OpenAI’s new office, with several floors of large open spaces. The unconference format was intended to encourage people to present current ideas alongside completed work. The schedule mostly consisted of 2-hour blocks with broad topics like “reinforcement learning” and “generative models”, guided by volunteer moderators. I especially enjoyed the sessions on neuroscience and AI and on transfer learning, which had smaller and more manageable groups than the crowded popular sessions, and diligent moderators who wrote down the important points on the whiteboard. Overall, I had more interesting conversations but also more auditory overload at SOCML than at other conferences.
To my excitement, there was a block for AI safety along with the other topics. The safety session became a broad introductory Q&A, moderated by Nate Soares, Jelena Luketina and me. Some topics that came up: value alignment, interpretability, adversarial examples, weaponization of AI.
One value alignment question was how to incorporate a diverse set of values that represents all of humanity in the AI’s objective function. We pointed out that there are two complementary problems: 1) getting the AI’s values to be in the small part of values-space that’s human-compatible, and 2) averaging over that space in a representative way. People generally focus on the ways in which human values differ from each other, which leads them to underestimate the difficulty of the first problem and overestimate the difficulty of the second. We also agreed on the importance of allowing for moral progress by not locking in the values of AI systems.
Nate mentioned some alternatives to goal-optimizing agents – quantilizers and approval-directed agents. We also discussed the limitations of using blacklisting/whitelisting in the AI’s objective function: blacklisting is vulnerable to unforeseen shortcuts and usually doesn’t work from a security perspective, and whitelisting hampers the system’s ability to come up with creative solutions (e.g. the controversial move 37 by AlphaGo in the second game against Lee Sedol).
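To make the quantilizer idea a bit more concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions, not something presented at the session: the function name quantilize, the parameters base_sampler, utility, q and n_samples, and the Monte Carlo approximation are all hypothetical choices. The idea is that instead of picking the single highest-utility action, the agent samples from some trusted base distribution over actions and then picks at random among the top q fraction of those samples.

```python
import random


def quantilize(base_sampler, utility, q=0.1, n_samples=1000, rng=None):
    """Approximate q-quantilizer (illustrative sketch, names are hypothetical).

    base_sampler: function taking a random.Random and returning one action
                  drawn from a trusted base distribution.
    utility:      function mapping an action to a number.
    q:            fraction of top-ranked samples to choose from (0 < q <= 1).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    # Draw candidate actions from the base distribution.
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    # Rank candidates from highest to lowest utility.
    candidates.sort(key=utility, reverse=True)
    # Keep the top q fraction (at least one candidate).
    top_k = max(1, int(q * n_samples))
    # Choose uniformly among the top candidates rather than taking the argmax.
    return rng.choice(candidates[:top_k])


# Toy usage: actions are numbers in [0, 1] and utility is the number itself.
action = quantilize(lambda r: r.random(), lambda a: a, q=0.1,
                    rng=random.Random(0))
```

With q close to 0 this approaches an ordinary maximizer over the sampled actions, and with q = 1 it just imitates the base distribution, so q controls how hard the agent optimizes.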
Been Kim brought up the recent EU regulation on the right to explanation for algorithmic decisions. This seems easy to game due to lack of good metrics for explanations. One proposed metric was that a human would be able to predict future model outputs from the explanation. This might fail for better-than-human systems by penalizing creative solutions if applied globally, but seems promising as a local heuristic.
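One way to read the proposed metric is as an agreement rate: how often do predictions made from the explanation alone match what the model actually outputs? The sketch below is my own toy formalization, not something defined at the session. The names simulatability, model and surrogate are hypothetical, and surrogate stands in for a human applying the explanation.

```python
def simulatability(model, surrogate, inputs):
    """Fraction of inputs where the explanation-based guess matches the model.

    model:     function mapping an input to the model's decision.
    surrogate: function mapping an input to the decision a person would
               predict using only the explanation (hypothetical stand-in).
    """
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one input")
    matches = sum(1 for x in inputs if model(x) == surrogate(x))
    return matches / len(inputs)


# Toy model: flags an input when its square exceeds 50.
def model(x):
    return x * x > 50


# Toy explanation: "the model flags inputs greater than 7".
def explained_rule(x):
    return x > 7


# Scored over the whole input range, and over non-negative inputs only.
global_score = simulatability(model, explained_rule, range(-10, 11))
local_score = simulatability(model, explained_rule, range(0, 11))
```

In this toy case the explanation predicts the model perfectly on the non-negative inputs but misses the large negative ones, which is a crude picture of an explanation that works on a restricted region of inputs but not everywhere.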
Ian Goodfellow mentioned the difficulties posed by adversarial examples: an imperceptible adversarial perturbation to an image can make a convolutional network misclassify it with very high confidence. There might be some kind of No Free Lunch theorem where making a system more resistant to adversarial examples would trade off with performance on non-adversarial data.
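For readers who haven’t seen how such perturbations are constructed, here is a minimal sketch of the fast gradient sign method applied to a toy logistic regression model. This is an assumption-laden illustration rather than the convolutional network setting discussed in the session: the weights, the input, the dimension and the step size eps are all made up, and the helper names are hypothetical. It shows how a tiny change to every input coordinate can add up to a large change in the output when the input is high-dimensional.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Move each coordinate of x by eps in the direction that increases the loss."""
    p = predict(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Made-up toy setup: 1000 "pixels" with values near 0.5.
dim = 1000
w = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
b = 0.0
x = [0.5 + 0.005 * wi for wi in w]       # confidently classified as class 1

x_adv = fgsm(w, b, x, y=1, eps=0.01)     # each pixel changes by at most 0.01

p_clean = predict(w, b, x)               # high probability for class 1
p_adv = predict(w, b, x_adv)             # low probability for class 1
```

Here no coordinate moves by more than 0.01, yet the model goes from confidently predicting class 1 to confidently predicting class 0, because the small changes all push the weighted sum in the same direction.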
We also talked about dual-use AI technologies, e.g. advances in deep reinforcement learning for robotics that could end up being used for military purposes. It was unclear whether corporations or governments are more trustworthy with using these technologies ethically: corporations have a profit motive, while governments are more likely to weaponize the technology.
More detailed notes by Janos coming soon! For a detailed overview of technical AI safety research areas, I highly recommend reading Concrete Problems in AI Safety.
Cross-posted to the FLI blog.
I really think people should stop talking about approval-directed agents as being an “alternative to goal-directed agents”.
They just have a different goal (approval), and are not doing long-term planning (i.e. by assuming the world is a contextual bandit).
Both of these could plausibly make them safer than other goal-directed agents, but there’s nothing fundamentally different going on, and I think the main thing which would make them safe is having a horizon of 1, which can be applied with any goal (a good thing), but seems likely to substantially hurt performance (a bad thing).
“There might be some kind of No Free Lunch theorem where making a system more resistant to adversarial examples would trade off with performance on non-adversarial data.”
^ is there some theoretical or hand-wavy reason to suspect this?
I think arms-races-to-the-bottom wrt safety are more of a concern for governments, because we lack effective international government / coordination. As Toby Ord has pointed out, in theory corporations can make binding contracts to subvert race-to-the-bottom dynamics and rely on governments to enforce those contracts.
We were discussing alternatives to goal-optimizers, not goal-directed agents in general. Approval-directed agents are goal-directed, but they avoid relying on an explicit specification of that goal and maximizing it directly.
Ian didn’t give a theoretical argument here; it’s just that adversarial examples are easy to generate and hard to defend against. OpenAI’s Clever Hans benchmark gives you a lower bound on how susceptible your algorithm is to adversarial examples, but not an upper bound, which seems much harder to establish.
However, even if an NFL theorem exists, we might still be able to reduce the harm from adversarial examples by shifting the high-error region towards less consequential errors like optical illusions.
There does seem to be more trust and cooperation between companies than there is between governments. Companies also have an incentive to collaborate on safety because accidents can make everyone in the field look bad.
I think those kinds of contracts would be easy to game unless they are very specific and precise. Building a culture of caring about safety within the industry seems more effective and robust than outside regulation.
Sorry for the incredibly late reply…
1. “Approval-directed agents are goal-directed, but they avoid relying on an explicit specification of that goal and maximizing it directly.” <– I don't agree with that characterization; I'd say they have a goal of maximizing approval.
3a. RE accidents making everyone look bad… my model of this is more that accidents make everyone dead, BUT most people don't care that much more about killing everyone than about just themselves dying, so the incentive is not strong enough by default.
3b. The type of contract Toby is talking about is quite radical: use weighted coin-flips to consolidate organizations (thus removing competitive pressures).
BTW, another idea I like for coordination problems is "physical recourse": if you have a governing body which can punish but not prevent misbehaviour, then you give competitors shut-down switches for each other's systems (intended for use only for preventing existential catastrophes), but punish them for using them illegally (e.g. to disrupt a competitor's product launch).