Author Archives: Victoria Krakovna

2017-18 New Year review

2017 progress


FLI / other AI safety:

Rationality / effectiveness:

  • Streamlined self-tracking data analysis and made an iPython notebook for plots. Found that the amount of sleep I get is correlated with tiredness (0.32), but not with mood indicators (anger, anxiety, or distractibility). Anger and anxiety are correlated with each other, though (0.36). Distractibility is correlated with tiredness (0.27) and anticorrelated with anger the next day for some reason (-0.31).
  • Ran house check-in sessions on goals and habits 1-2 times a month, two house sessions on Hamming questions, and check-ins with Janos every 1-2 weeks.
  • Did a sleep CBT program with sleep restriction for 2 months. Comparing the 5 months before the program vs the 5 months after, evening insomnia rate went down from 16% to 8.2% of the time, while morning insomnia rate didn’t change (9%). Average hours of sleep didn’t change either (7 hours), but I went to sleep around 22 minutes earlier on average. This excludes jetlag days (at most 3 days after a flight with at least 3 hours of time difference).
  • Did around 80 exercise classes (starting in March).

Fun stuff:

  • Moved into our new group house (Deep End).
  • Explored the UK (hiking in Wales, Scotland, Lake District).
  • Got back into aerial silks.
  • Got into circling.
  • Got a pixie haircut.
  • Family reunion in France with Russian relatives I hadn’t seen in a decade.
  • Went to Burning Man and learned to read Tarot (as part of our camp theme).
  • Did the Stoic Week.
  • Played a spy scavenger hunt game.



2017 prediction outcomes


  1. Our AI safety team will have at least two papers accepted for publication at a major conference, not counting workshops (70%) – 2 papers (human preferences paper at NIPS and reward corruption paper at IJCAI)
  2. I will write at least 9 blog posts (50%) – 6 posts
  3. I will meditate at least 250 days (45%) – 237 days
  4. I will exercise at least 250 days (55%) – 194 days
  5. I will visit at least 2 new countries (80%) – France, Switzerland
  6. I will attend Burning Man (85%) – yes


  • Everything I predicted with at least 70% confidence came true; everything below that did not.
  • Like last year, my low predictions seem overconfident (though there are too few data points to judge).

2018 goals and predictions


  1. Write at least 2 AI blog posts that are not about conferences (1 last year) (70%)
  2. Avoid processed sugar* at least until end of March (90%)
  3. Do at most 4 non-research talks/panels (7 last year) (50%)
  4. Meditate on at least 250 days (50%)

* not in a super strict way: it’s ok to eat fruit and 90% chocolate and try a really small quantity (< teaspoon) of a dessert.


  1. Our AI safety team will have at least two papers accepted for publication at a major conference, not counting workshops (80%)
  2. I will write at least 6 blog posts (60%)
  3. I will go to at least 100 exercise classes (80 last year) (60%)
  4. 1-2 housemate turnover at the Deep End (3 last year) (70%)
  5. I will visit at least 3 new cities with population over 100,000 (4 last year) (50%)
  6. I will go on at least 2 hikes (4 last year) (90%)

Past new year reviews: 2016-17, 2015-16, 2014-15.


NIPS 2017 Report


This year’s NIPS gave me a general sense that near-term AI safety is now mainstream and long-term safety is slowly going mainstream. On the near-term side, I particularly enjoyed Kate Crawford’s keynote on neglected problems in AI fairness, the ML security workshops, and the Interpretable ML symposium debate that addressed the “do we even need interpretability?” question in a somewhat sloppy but entertaining way. There was a lot of great content on the long-term side, including several oral / spotlight presentations and the Aligned AI workshop.

Value alignment papers

Inverse Reward Design (Hadfield-Menell et al) defines the problem of an RL agent inferring a human’s true reward function based on the proxy reward function designed by the human. This is different from inverse reinforcement learning, where the agent infers the reward function from human behavior. The paper proposes a method for IRD that models uncertainty about the true reward, assuming that the human chose a proxy reward that leads to the correct behavior in the training environment. For example, if a test environment unexpectedly includes lava, the agent assumes that a lava-avoiding reward function is as likely as a lava-indifferent or lava-seeking reward function, since they lead to the same behavior in the training environment. The agent then follows a risk-averse policy with respect to its uncertainty about the reward function.


The paper shows some encouraging results on toy environments for avoiding some types of side effects and reward hacking behavior, though it’s unclear how well they will generalize to more complex settings. For example, the approach to reward hacking relies on noticing disagreements between different sensors / features that agreed in the training environment, which might be much harder to pick up on in a complex environment. The method is also at risk of being overly risk-averse and avoiding anything new, whether it be lava or gold, so it would be great to see some approaches for safe exploration in this setting.
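To make the mechanics concrete, here is a toy numpy sketch of an IRD-style posterior and risk-averse choice, echoing the lava example (the candidate rewards, features, and the omitted normalization over proxy rewards are my simplifications, not the paper's exact formulation):

```python
import numpy as np

# Toy feature vectors phi count [task progress, lava stepped on];
# a reward function is a weight vector w, with reward = phi @ w.
candidate_ws = [np.array([1.0, 0.0]),    # lava-indifferent
                np.array([1.0, -2.0]),   # lava-avoiding
                np.array([1.0, 2.0])]    # lava-seeking

proxy_w = np.array([1.0, 0.0])           # the designer's proxy reward
train_trajs = [np.array([3.0, 0.0])]     # no lava exists in training
test_trajs = [np.array([3.0, 0.0]),      # detour around the lava
              np.array([4.0, 2.0])]      # shortcut through the lava

def best_traj(w, trajs):
    return max(trajs, key=lambda phi: float(phi @ w))

# IRD-style posterior (normalizer over possible proxies omitted here):
# a candidate true reward w is plausible if the behavior that is optimal
# under the proxy also scores well under w in the training environment.
beta = 1.0
phi_star = best_traj(proxy_w, train_trajs)
logits = np.array([beta * float(phi_star @ w) for w in candidate_ws])
posterior = np.exp(logits - logits.max())
posterior /= posterior.sum()
# All candidates rate the training behavior equally (no lava was seen),
# so the posterior stays uniform: the agent remains uncertain about lava.

# Risk-averse planning: choose the test trajectory that maximizes the
# worst-case reward over the candidate reward functions.
def worst_case(phi):
    return min(float(phi @ w) for w in candidate_ws)

safe_choice = max(test_trajs, key=worst_case)
print(safe_choice)  # the detour wins: the shortcut is terrible if lava is bad
```

The same worst-case criterion is also what makes the approach potentially too conservative: an unfamiliar gold tile would be avoided for exactly the same reason as lava.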

Repeated Inverse RL (Amin et al) defines the problem of inferring intrinsic human preferences that incorporate safety criteria and are invariant across many tasks. The reward function for each task is a combination of the task-invariant intrinsic reward (unobserved by the agent) and a task-specific reward (observed by the agent). This multi-task setup helps address the identifiability problem in IRL, where different reward functions could produce the same behavior.

[Figure: repeated inverse RL setup]

The authors propose an algorithm for inferring the intrinsic reward while minimizing the number of mistakes made by the agent. They prove an upper bound on the number of mistakes for the “active learning” case where the agent gets to choose the tasks, and show that a certain number of mistakes is inevitable when the agent cannot choose the tasks (there is no upper bound in that case). Thus, letting the agent choose the tasks that it’s trained on seems like a good idea, though it might also result in a selection of tasks that is less interpretable to humans.

Deep RL from Human Preferences (Christiano et al) uses human feedback to teach deep RL agents about complex objectives that humans can evaluate but might not be able to demonstrate (e.g. a backflip). The human is shown two trajectory snippets of the agent’s behavior and selects which one more closely matches the objective. This method makes very efficient use of limited human feedback, scaling much better than previous methods and enabling the agent to learn much more complex objectives (as shown in MuJoCo and Atari).
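The comparison model at the heart of this approach can be sketched in a few lines of numpy, with a linear reward model and a simulated human standing in for real feedback (the setup is illustrative; the paper uses deep networks over trajectory segments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each trajectory segment is summarized by a feature
# vector; the (unknown) reward the human has in mind is linear in features.
true_w = np.array([1.0, -0.5, 2.0])
w_hat = np.zeros(3)                      # learned reward parameters

def pref_prob(w, phi_a, phi_b):
    # Bradley-Terry-style comparison model: P(segment A preferred over B)
    # based on the difference in predicted rewards.
    ra, rb = phi_a @ w, phi_b @ w
    return 1.0 / (1.0 + np.exp(rb - ra))

lr = 0.5
for _ in range(2000):
    phi_a, phi_b = rng.normal(size=3), rng.normal(size=3)
    # Simulated human: prefers the segment with higher true reward.
    label = 1.0 if phi_a @ true_w > phi_b @ true_w else 0.0
    p = pref_prob(w_hat, phi_a, phi_b)
    # Gradient step on the cross-entropy loss over comparisons.
    w_hat += lr * (label - p) * (phi_a - phi_b)

# The learned reward should rank segments like the true one does
# (it is only identified up to positive scaling).
print(w_hat / np.linalg.norm(w_hat))
```

Each binary comparison carries little information individually, which is why the sample efficiency of the querying scheme matters so much in practice.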


Dynamic Safe Interruptibility for Decentralized Multi-Agent RL (El Mhamdi et al) generalizes the safe interruptibility problem to the multi-agent setting. Non-interruptible dynamics can arise in a group of agents even if each agent individually is indifferent to interruptions. This can happen if Agent B is affected by interruptions of Agent A and is thus incentivized to prevent A from being interrupted (e.g. if the agents are self-driving cars and A is in front of B on the road). The multi-agent definition focuses on preserving the system dynamics in the presence of interruptions, rather than on converging to an optimal policy, which is difficult to guarantee in a multi-agent setting.

Aligned AI workshop

This was a more long-term-focused version of the Reliable ML in the Wild workshop held in previous years. There were many great talks and posters there – my favorite talks were Ian Goodfellow’s “Adversarial Robustness for Aligned AI” and Gillian Hadfield’s “Incomplete Contracting and AI Alignment”.

Ian made the case for ML security being important to long-term AI safety. The effectiveness of adversarial examples is problematic not only from the near-term perspective of current ML systems (such as self-driving cars) being fooled by bad actors. It’s also bad news from the long-term perspective of aligning the values of an advanced agent, which could inadvertently seek out adversarial examples for its reward function due to Goodhart’s law. Relying on the agent’s uncertainty about the environment or human preferences is not sufficient to ensure safety, since adversarial examples can cause the agent to have arbitrarily high confidence in the wrong answer.


Gillian approached AI safety from an economics perspective, drawing parallels between specifying objectives for artificial agents and designing contracts for humans. The same issues that make contracts incomplete (the designer’s inability to consider all relevant contingencies or precisely specify the variables involved, and incentives for the parties to game the system) lead to side effects and reward hacking for artificial agents.


The central question of the talk was how we can use insights from incomplete contracting theory to better understand and systematically solve specification problems in AI safety, which is a really interesting research direction. The objective specification problem seems even harder to me than the incomplete contract problem, since the contract design process relies on some level of shared common sense between the humans involved, which artificial agents do not currently possess.

Interpretability for AI safety

I gave a talk at the Interpretable ML symposium on connections between interpretability and long-term safety, which explored what forms of interpretability could help make progress on safety problems (slides, video). Understanding our systems better can help ensure that safe behavior generalizes to new situations, and it can help identify causes of unsafe behavior when it does occur.

For example, if we want to build an agent that’s indifferent to being switched off, it would be helpful to see whether the agent has representations that correspond to an off-switch, and whether they are used in its decisions. Side effects and safe exploration problems would benefit from identifying representations that correspond to irreversible states (like “broken” or “stuck”). While existing work on examining the representations of neural networks focuses on visualizations, safety-relevant concepts are often difficult to visualize.

Local interpretability techniques that explain specific predictions or decisions are also useful for safety. We could examine whether features that are idiosyncratic to the training environment or indicate proximity to dangerous states influence the agent’s decisions. If the agent can produce a natural language explanation of its actions, how does it explain problematic behavior like reward hacking or going out of its way to disable the off-switch?

There are many ways in which interpretability can be useful for safety. Somewhat less obvious is what safety can do for interpretability: serving as grounding for interpretability questions. As exemplified by the final debate of the symposium, there is an ongoing conversation in the ML community trying to pin down the fuzzy idea of interpretability – what is it, do we even need it, what kind of understanding is useful, etc. I think it’s important to keep in mind that our desire for interpretability is to some extent motivated by our systems being fallible – understanding our AI systems would be less important if they were 100% robust and made no mistakes. From the safety perspective, we can define interpretability as the kind of understanding that helps us ensure the safety of our systems.

For those interested in applying the interpretability hammer to the safety nail, or working on other long-term safety questions, FLI has recently announced a new grant program. Now is a great time for the AI field to think deeply about value alignment. As Pieter Abbeel said at the end of his keynote, “Once you build really good AI contraptions, how do you make sure they align their value system with our value system? Because at some point, they might be smarter than us, and it might be important that they actually care about what we care about.”

(Thanks to Janos Kramar for his feedback on this post, and to everyone at DeepMind who gave feedback on the interpretability talk.)

Tokyo AI & Society Symposium

I just spent a week in Japan to speak at the inaugural symposium on AI & Society – my first conference in Asia. It was inspiring to take part in an increasingly global conversation about AI impacts, and interesting to see how the Japanese AI community thinks about these issues. Overall, Japanese researchers seemed more open to discussing controversial topics like human-level AI and consciousness than their Western counterparts. Most people were more interested in near-term AI ethics concerns but also curious about long-term problems.

The talks were a mix of English and Japanese with translation available over audio (high quality but still hard to follow when the slides are in Japanese). Here are some tidbits from my favorite talks and sessions.

Danit Gal’s talk on China’s AI policy. She outlined China’s new policy report aiming to lead the world in AI by 2030, and discussed various advantages of collaboration over competition. It was encouraging to see that China’s AI goals include “establishing ethical norms, policies and regulations” and “forming robust AI safety and control mechanisms”. Danit called for international coordination to help ensure that everyone is following compatible concepts of safety and ethics.

Next breakthrough in AI panel (Yasuo Kuniyoshi from U Tokyo, Ryota Kanai from Araya and Marek Rosa from GoodAI). When asked about immediate research problems they wanted the field to focus on, the panelists highlighted intrinsic motivation, embodied cognition, and gradual learning. In the longer term, they encouraged researchers to focus on generalizable solutions and to not shy away from philosophical questions (like defining consciousness). I think this mindset is especially helpful for working on long-term AI safety research, and would be happy to see more of this perspective in the field.

Long-term talks and panel (Francesca Rossi from IBM, Hiroshi Nakagawa from U Tokyo and myself). I gave an overview of AI safety research problems in general and recent papers from my team. Hiroshi provocatively argued that a) AI-driven unemployment is inevitable, and b) we need to solve this problem using AI. Francesca talked about trustworthy AI systems and the value alignment problem. In the panel, we discussed whether long-term problems are a distraction from near-term problems (spoiler: no, both are important to work on), to what extent work on safety for current ML systems can carry over to more advanced systems (high-level insights are more likely to carry over than details), and other fun stuff.

Stephen Cave’s diagram of AI ethics issues. Helpfully color-coded by urgency.


Luba Elliott’s talk on AI art. Style transfer has outdone itself with a Google Maps Mona Lisa.


There were two main themes I noticed in the Western presentations. People kept pointing out that AlphaGo is not AGI because it’s not flexible enough to generalize to hexagonal grids and such (this was before AlphaGo Zero came out). Also, the trolley problem was repeatedly brought up as a default ethical question for AI (it would be good to diversify this discussion with some less overused examples).

The conference was very well-organized and a lot of fun. Thanks to the organizers for bringing it together, and to all the great people I got to meet!

We also had a few days of sightseeing around Tokyo, which involved a folk dance festival, an incessantly backflipping aye-aye at the zoo, and beautiful netsuke sculptures at the national museum. I will miss the delicious conveyor belt sushi, the chestnut puree desserts from the convenience store, and the vending machines with hot milk tea at every corner :).

[This post originally appeared on the Deep Safety blog. Thanks to Janos Kramar for his feedback.]

Portfolio approach to AI safety research

Long-term AI safety is an inherently speculative research area, aiming to ensure the safety of advanced future systems despite uncertainty about their design, algorithms, or objectives. It thus seems particularly important to have different research teams tackle the problems from different perspectives and under different assumptions. While some fraction of the research might not end up being useful, a portfolio approach makes it more likely that at least some of us will be right.

In this post, I look at some dimensions along which assumptions differ, and identify some underexplored reasonable assumptions that might be relevant for prioritizing safety research. (In the interest of making this breakdown as comprehensive and useful as possible, please let me know if I got something wrong or missed anything important.)

Assumptions about similarity between current and future AI systems

If a future general AI system has a similar algorithm to a present-day system, then there are likely to be some safety problems in common (though more severe in generally capable systems). Insights and solutions for those problems are likely to transfer to some degree from current systems to future ones. For example, if a general AI system is based on reinforcement learning, we can expect it to game its reward function in even more clever and unexpected ways than present-day reinforcement learning agents do. Those who hold the similarity assumption often expect most of the remaining breakthroughs on the path to general AI to be compositional rather than completely novel, enhancing and combining existing components in novel and better-implemented ways (many current machine learning advances, such as AlphaGo, are examples of this).

Note that assuming similarity between current and future systems is not exactly the same as assuming that studying current systems is relevant to ensuring the safety of future systems, since we might still learn generalizable things by testing safety properties of current systems even if they are different from future systems.

Assuming similarity suggests a focus on empirical research based on testing the safety properties of current systems, while not making this assumption encourages more focus on theoretical research based on deriving safety properties from first principles, or on figuring out what kinds of alternative designs would lead to safe systems. For example, safety researchers in industry tend to assume more similarity between current and future systems than researchers at MIRI.

Here is my tentative impression of where different safety research groups are on this axis. This is a very approximate summary, since views often vary quite a bit within the same research group (e.g. FHI is particularly diverse in this regard).

[Figure: safety research groups placed along the similarity axis]
On the high-similarity side of the axis, we can explore the safety properties of different architectural / algorithmic approaches to AI, e.g. on-policy vs off-policy or model-free vs model-based reinforcement learning algorithms. It might be good to have someone working on safety issues for less commonly used agent algorithms, e.g. evolution strategies.

Assumptions about promising approaches to safety problems

Level of abstraction. What level of abstraction is most appropriate for tackling a particular problem? For example, approaches to the value learning problem range from explicitly specifying ethical constraints to capability amplification and indirect normativity, with cooperative inverse reinforcement learning somewhere in between. These assumptions could be combined by applying different levels of abstraction to different parts of the problem. For example, it might make sense to explicitly specify some human preferences that seem obvious and stable over time (e.g. “breathable air”), and use the more abstract approaches to impart the most controversial, unstable and vague concepts (e.g. “fairness” or “harm”). Overlap between the more and less abstract specifications can create helpful redundancy (e.g. air pollution as a form of harm + a direct specification of breathable air).

For many other safety problems, the abstraction axis is not as widely explored as for value learning. For example, most of the approaches to avoiding negative side effects proposed in Concrete Problems (e.g. impact regularizers and empowerment) are on a medium level of abstraction, while it also seems important to address the problem on a more abstract level by formalizing what we mean by side effects (which would help figure out what we should actually be regularizing, etc). On the other hand, almost all current approaches to wireheading / reward hacking are quite abstract, and the problem would benefit from more empirical work.

Explicit specification vs learning from data. Whether a safety problem is better addressed by directly defining a concept (e.g. the Low Impact AI paper formalizes the impact of an AI system by breaking down the world into ~20 billion variables) or learning the concept from human feedback (e.g. Deep Reinforcement Learning from Human Preferences paper teaches complex objectives to AI systems that are difficult to specify directly, like doing a backflip). I think it’s important to address safety problems from both of these angles, since the direct approach is unlikely to work on its own, but can give some idea of the idealized form of the objective that we are trying to approximate by learning from data.

Modularity of AI design. What level of modularity makes it easier to ensure safety? Designs range from end-to-end systems to ones composed of many separately trained parts that are responsible for specific abilities and tasks. Safety approaches for the modular case can limit the capabilities of individual parts of the system, and use some parts to enforce checks and balances on other parts. MIRI’s foundations approach focuses on a unified agent, while the safety properties on the high-modularity side have mostly been explored by Eric Drexler (more recent work is not public but available upon request). It would be good to see more people work on the high-modularity assumption.


To summarize, here are some relatively neglected assumptions:

  • Medium similarity in algorithms / architectures
  • Less popular agent algorithms
  • Modular general AI systems
  • More / less abstract approaches to different safety problems (more for side effects, less for wireheading, etc)
  • More direct / data-based approaches to different safety problems

From a portfolio approach perspective, a particular research avenue is worthwhile if it helps to cover the space of possible reasonable assumptions. For example, while MIRI’s research is somewhat controversial, it relies on a unique combination of assumptions that other groups are not exploring, and is thus quite useful in terms of covering the space of possible assumptions.

I think the FLI grant program contributed to diversifying the safety research portfolio by encouraging researchers with different backgrounds to enter the field. It would be good for grantmakers in AI safety to continue to optimize for this in the future (e.g. one interesting idea is using a lottery after filtering for quality of proposals).

When working on AI safety, we need to hedge our bets and look out for unknown unknowns – it’s too important to put all the eggs in one basket.

(Cross-posted to the FLI blog and Approximately Correct. Thanks to Janos Kramar, Jan Leike and Shahar Avin for their feedback on this post. Thanks to Jaan Tallinn and others for inspiring discussions.)

Takeaways from self-tracking data

I’ve been collecting data about myself on a daily basis for the past 3 years. Half a year ago, I switched from using 42goals (which I only remembered to fill out once every few days) to a Google form emailed to me daily (which I fill out consistently because I check email often). Now for the moment of truth – a correlation matrix!

The data consists of “mood variables” (anxiety, tiredness, and “zoneout” – how distracted / spacey I’m feeling), “action variables” (exercise and meditation) and “sleep variables” (hours of sleep, sleep start/end time, insomnia). There are 5 binary variables (meditation, exercise, evening/morning insomnia, headache) and the rest are ordinal or continuous. Almost all the variables have 6 months of data, except that I started tracking anxiety 5 months ago and zoneout 2 months ago.

The matrix shows correlations between mood and action variables for day X, sleep variables for the night after day X, and mood variables for day X+1 (marked by ‘next’):

[Figure: correlation heatmap over 2017]
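For anyone curious how such a matrix can be computed, here is a minimal pandas sketch of the lagged correlation setup (the column names are hypothetical stand-ins for the form fields):

```python
import pandas as pd

# df has one row per day, with columns like 'tiredness', 'anxiety',
# 'exercise', 'hours_of_sleep' (from the daily Google form export).
def lagged_corr_matrix(df, mood_cols, action_cols, sleep_cols):
    frame = df[mood_cols + action_cols + sleep_cols].copy()
    # Mood variables for day X+1, aligned against day X's row.
    for col in mood_cols:
        frame[col + '_next'] = df[col].shift(-1)
    # Average sleep over the past 7 days, for the rolling-mean comparison.
    frame['avg_sleep_week'] = df['hours_of_sleep'].rolling(7).mean()
    return frame.corr()
```

The `shift(-1)` trick is what lines up tonight's sleep and today's actions against tomorrow's mood in a single correlation table.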

The most surprising thing about this data is how many things are uncorrelated that I would expect to be correlated:

  • evening insomnia and tiredness the next day (or the same day)
  • anxiety and sleep variables the following night
  • exercise and sleep variables the following night
  • tiredness and hours of sleep the following night
  • average hours of sleep (over the past week) is only weakly correlated with tiredness the next day (-0.15)
  • hours of sleep (average or otherwise) and anxiety or zoneout the next day (so my mood is less affected by sleep than I had expected)
  • action variables and mood variables the next day
  • meditation and feeling zoned out

Some things that were correlated after all:

  • hours of sleep and tiredness the next day (-0.3) – unsurprising but lower than expected
  • tiredness and zoneout (0.33)
  • tiredness and insomnia the following morning (0.29) (weird)
  • anxiety and zoneout were anticorrelated (-0.25) on adjacent days (weird)
  • exercise and anxiety (-0.18)
  • meditation and anxiety (-0.15)
  • meditating and exercising (0.17) – both depend on how agenty / busy I am that day
  • meditation and insomnia (0.24), probably because I usually try to meditate if I’m having insomnia to make it easier to fall asleep
  • headache and evening insomnia (0.14)

Some falsified hypotheses:

  • Exercise and meditation affect mood variables the following day
  • My tiredness level depends on the average amount of sleep the preceding week
  • Anxiety affects sleep the following night
  • Exercise helps me sleep the following night
  • I sleep more when I’m more tired
  • Sleep deprivation affects my mood

The overall conclusion is that my sleep is weird and also matters less than I thought for my well-being (at least in terms of quantity).

Addendum: For those who would like to try this kind of self-tracking, here is a Google Drive folder with the survey form and the iPython notebook. You need to download the spreadsheet of form responses as a CSV file before running the notebook code. You can use the Send button in the form to email it to yourself, and then bounce it back every day using Google Inbox or a similar service.

Highlights from the ICLR conference: food, ships, and ML security

It’s been an eventful few days at ICLR in the coastal town of Toulon in Southern France, after a pleasant train ride from London with a stopover in Paris for some sightseeing. There was more food than is usually provided at conferences, and I ended up almost entirely subsisting on tasty appetizers. The parties were memorable this year, including one in a vineyard and one in a naval museum. The overall theme of the conference setting could be summarized as “finger food and ships”.


There were a lot of interesting papers this year, especially on machine learning security, which will be the focus of this post. (Here is a great overview of the topic.)

On the attack side, adversarial perturbations now work in physical form (if you print out the image and then take a picture) and they can also interfere with image segmentation. This has some disturbing implications for fooling vision systems in self-driving cars, such as impeding them from recognizing pedestrians. Adversarial examples are also effective at sabotaging neural network policies in reinforcement learning at test time.


In more encouraging news, adversarial examples are not entirely transferable between different models. For targeted examples, which aim to be misclassified as a specific class, the target class is not preserved when transferring to a different model. For example, if an image of a school bus is classified as a crocodile by the original model, it has at most 4% probability of being seen as a crocodile by another model. The paper introduces an ensemble method for developing adversarial examples whose targets do transfer, but this seems to only work well if the ensemble includes a model with a similar architecture to the new model.

On the defense side, there were some new methods for detecting adversarial examples. One method augments neural nets with a detector subnetwork, which works quite well and generalizes to new adversaries (if they are similar to or weaker than the adversary used for training). Another approach analyzes adversarial images using PCA, and finds that they are similar to normal images in the first few thousand principal components, but have a lot more variance in later components. Note that the reverse is not the case – adding arbitrary variation in trailing components does not necessarily encourage misclassification.
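The flavor of the PCA check can be illustrated on synthetic data, with low-rank samples standing in for normal images and added noise standing in for adversarial perturbations (a toy stand-in, not the paper's actual image experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# 'Clean' samples lie near a low-dimensional subspace; 'perturbed' samples
# add isotropic noise that shows up mostly in the trailing components.
d, n, k = 50, 500, 5
basis = rng.normal(size=(k, d))
clean = rng.normal(size=(n, k)) @ basis + 0.01 * rng.normal(size=(n, d))
perturbed = clean + 0.3 * rng.normal(size=(n, d))  # stand-in for adversarial noise

# Principal components of the clean data, via SVD of the centered matrix.
mean = clean.mean(axis=0)
_, _, vt = np.linalg.svd(clean - mean, full_matrices=False)

def trailing_variance(x, n_lead=10):
    # Variance of the projections onto the trailing principal components.
    proj = (x - mean) @ vt.T
    return proj[:, n_lead:].var()

print(trailing_variance(clean), trailing_variance(perturbed))
# The perturbed set carries far more variance in the trailing components,
# which is the signal this kind of detection method keys on.
```

As the paper notes, the implication only runs one way: injecting variance into trailing components does not by itself produce misclassification.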

There has also been progress in scaling adversarial training to larger models and data sets; this work also found that higher-capacity models are more resistant to adversarial examples than lower-capacity models. My overall impression is that adversarial attacks are still ahead of adversarial defense, but the defense side is starting to catch up.


(Cross-posted to the FLI blog and Approximately Correct. Thanks to Janos Kramar for his feedback on this post.)

2016-17 New Year review

2016 progress

Research / career:

  • Got a job at DeepMind as a research scientist in AI safety.
  • Presented MiniSPN paper at ICLR workshop.
  • Finished RNN interpretability paper and presented at ICML and NIPS workshops.
  • Attended the Deep Learning Summer School.
  • Finished and defended PhD thesis.
  • Moved to London and started working at DeepMind.


  • Talk and panel (moderator) at Effective Altruism Global X Boston
  • Talk and panel at the Governance of Emerging Technologies conference at ASU
  • Talk and panel at Brain Bar Budapest
  • AI safety session at OpenAI unconference
  • Talk and panel at Effective Altruism Global X Oxford
  • Talk and panel at Cambridge Catastrophic Risk Conference run by CSER

Rationality / effectiveness:

  • Went to a 5-day Zentensive meditation retreat with Janos, in between grad school and moving to London. This was very helpful for practicing connecting with my direct emotional experience, and a good way to reset during a life transition.
  • Stopped using 42goals (too glitchy) and started recording data in a Google form emailed to myself daily. Now I am actually entering accurate data every day instead of doing it retroactively whenever I remember. I tried a number of goal tracking apps, but all of them seemed too inflexible (I was surprised not to find anything that provides correlation charts between different goals, e.g. meditation vs. hours of sleep).

Random cool things:

  • Hiked in the Andes to an altitude of 17,000 feet.
  • Visited the Grand Canyon.
  • New countries visited: UK, Bolivia, Spain.
  • Started a group house in London (moving there in a few weeks).
  • Started contributing to the new blog Approximately Correct on societal impacts of machine learning.


2016 prediction outcomes


  1. Finish PhD thesis (70%) – done
  2. Write at least 12 blog posts (40%) – 9
  3. Meditate at least 200 days (50%) – 245
  4. Exercise at least 200 days (50%) – 282
  5. Do at least 5 pullups in a row (40%) – still only 2-3
  6. Record at least 50 new thoughts (50%) – 29
  7. Stay up past 1:30am at most 20% of the nights (40%) – 26.8%
  8. Do at least 10 pomodoros per week on average (50%) – 13


  1. At least one paper accepted for publication (70%) – two papers accepted to workshops
  2. I will get at least one fellowship (40%)
  3. Insomnia at most 20% of nights (20%) – 18.3%
  4. FLI will co-organize at least 3 AI safety workshops (50%) – AAAI, ICML, NIPS


  • Low predictions (20-40%): 1/5 = 20% (overconfident)
  • Medium predictions (50-70%): 6/7 = 85% (underconfident)
  • It’s interesting that my 40% predictions were all wrong, and my 50% predictions were almost all correct. I seem to be translating system 1 labels of ‘not that likely’ and ‘reasonably likely’ to 40% and 50% respectively, while they should translate to something more like 25% and 70%. After the overconfident predictions last year, I tried to tone down the predictions for this year, but the lower ones didn’t get toned down enough.
  • I seem to be more accurate on predictions than resolutions, probably due to wishful thinking. Experimenting with no resolutions for next year.
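The bucketing behind these calibration numbers is simple enough to sketch in a few lines (the confidence/outcome pairs below are illustrative, not my actual prediction list):

```python
from collections import defaultdict

# Illustrative (stated confidence, came_true) pairs.
predictions = [(0.70, True), (0.50, True), (0.50, True), (0.50, False),
               (0.40, False), (0.40, True), (0.20, False)]

buckets = defaultdict(list)
for conf, outcome in predictions:
    band = 'low (20-40%)' if conf <= 0.40 else 'medium (50-70%)'
    buckets[band].append(outcome)

for band in sorted(buckets):
    outcomes = buckets[band]
    rate = sum(outcomes) / len(outcomes)
    print(f'{band}: {rate:.0%} correct over {len(outcomes)} predictions')
```

Comparing each band's hit rate to its stated confidence range is what reveals the over/underconfidence pattern above.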

2017 predictions

  1. Our AI safety team will have at least two papers accepted for publication at a major conference, not counting workshops (70%).
  2. I will write at least 9 blog posts (50%).
  3. I will meditate at least 250 days (45%).
  4. I will exercise at least 250 days (55%).
  5. I will visit at least 2 new countries (80%).
  6. I will attend Burning Man (85%).