Author Archives: Victoria Krakovna

2019-20 New Year review

This is an annual post reviewing the last year and making resolutions and predictions for next year. This year’s edition features sleep tracking, intermittent fasting, overcommitment busting, and evaluating calibration for all annual predictions since 2014.

2019 review

AI safety research:

AI safety outreach:

  • Co-organized FLI’s Beneficial AGI conference in Puerto Rico, a more long-term focused sequel to the original Puerto Rico conference and the Asilomar conference. This year I was the program chair for the technical safety track of the conference.
  • Co-organized the ICLR AI safety workshop, Safe Machine Learning: Specification, Robustness and Assurance. This was my first time running a paper reviewing process.
  • Gave a talk at the IJCAI AI safety workshop on specification, robustness and assurance problems.
  • Took part in the DeepMind podcast episode on AI safety (“I, robot”).

Work effectiveness:

  • At the beginning of the year, I found myself overcommitted and kind of burned out. My previous efforts to reduce overcommitment had proved insufficient to not feel stressed and overwhelmed most of the time.
  • In February, I made a rule for myself to decline all non-research commitments that don’t seem like exceptional opportunities. The form that I made last year for evaluating commitments (which I have to fill out before accepting anything) has been helpful for enforcing this rule and avoiding impulsive decisions. The number of commitments went down from 24 in 2018 to 10 in 2019. This has been working well in terms of having more time for research and feeling better about life.
  • Organizing a conference and a workshop back to back was a bit much, and I feel done with organizing large events for a while.


  • Stopped using work cycles and pomodoros since I’ve been finding the structure a bit too restrictive. Might resume at some point.
  • Hours of “deep work” per month, as defined in the Deep Work book. This includes things like thinking about research problems, coding, reading and writing papers. It does not include email, organizing, most meetings, coding logistics (e.g. setup or running experiments), etc.


  • For comparison, deep work hours from 2018. My definition of deep work has somewhat broadened over time, but not enough to account for this difference.

Health / self-care:

  • I had 7 colds this past year, which is a lot more than my usual rate of 1-2 per year. The first three were in Jan-Feb, which seemed related to the overcommitment burnout. Hopefully supplementing zinc will help.
  • Averaged 7.2 hours of sleep per night, excluding jetlag (compared to 6.9 hours in 2018).
  • About a 10% rate of insomnia (excluding jetlag), similar to the end of last year.
  • Tried the Oura ring and the Dreem headband for measuring sleep quality. The Oura ring consistently thinks I wake up many times per night (probably because I move around a lot) and predicts less than half an hour each of deep and REM sleep. The Dreem, which actually measures EEG signals, estimates that I get an average of 1.3 hours of deep sleep and 1.8 hours of REM sleep per night, which is more than I expected.
  • Started a relaxed form of intermittent fasting in March (aiming for a 10 hour eating window), mostly for longevity and to improve my circadian rhythm. My average eating window length over the year was 10.5 hours, so I wasn’t very strict about it (mostly just avoiding snacks after dinner). One surprising thing I learned was that I can fall asleep just fine while hungry, and am often less hungry when I wake up. My average hours of sleep went up from 6.96 in the 6 months before starting intermittent fasting to 7.32 in the 6 months after. I went to sleep 44 minutes earlier and woke up 20 minutes earlier on average, though the variance of my bedtime actually went up a bit. Overall it seems plausibly useful and easy enough to continue next year.

Fun stuff:

  • Did a Caucasus hiking trek in Georgia with family, and consumed a lot of wild berries and hazelnuts along the way.


  • Did a road trip in southern Iceland (also with family), saw a ridiculous number of pretty waterfalls, and was in the same room with (artificial) lava.


  • Took an advanced class in aerial silks for the first time (I felt a bit underqualified, but learned a lot of fun moves).
  • Ran a half-marathon along the coast in Devon on hilly terrain in 3 hours and 23 minutes.
  • Made some progress on handstands in yoga class (can hold it away from the wall for a few seconds).
  • Did two circling retreats (relational meditation).
  • Read books: The Divide, 21 Lessons for the 21st Century, The Circadian Code, So Good They Can’t Ignore You, Ending Aging (skimmed).
  • Got into Duolingo (brushed up my Spanish and learned a bit of Mandarin). Currently in a quasi-competition with Janos for studying each other’s languages.

2019 prediction outcomes


  1. Author or coauthor two or more academic papers (50%) – yes (3 papers)
  2. Accept at most 17 non-research commitments (24 last year) (60%) – yes (10 commitments)
  3. Meditate on at least 250 days (60%) – yes (290 days)


  • Relative reachability paper accepted at a major conference, not counting workshops (60%) – no
  • Continue avoiding processed sugar for the next year (85%) – no (still have the intention and mostly follow it, but less strictly / consistently)
  • 1-2 housemate turnover at Deep End (2 last year) (80%) – yes (1 housemate moved in)
  • At least 5 rationality sessions will be hosted at Deep End (80%) – no

Calibration over all annual reviews:

  • Predictions made at 50-70% confidence were well-calibrated, while those at 80-90% were overconfident (66 predictions total)
  • Calibration is generally better in 2017-19 (23 predictions) than in 2014-16 (43 predictions). There were only 3 70% predictions in 2017-19, so the 100% accuracy is noisy.
  • Unsurprisingly, resolutions are more often correct than other predictions (72% vs 56% correct)
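As an aside, the bucketing behind these calibration numbers is straightforward to automate. Here is a generic sketch that groups (confidence, outcome) pairs by stated confidence and compares each bucket's confidence with its observed accuracy; the prediction data in the example is hypothetical, not my actual predictions.

```python
from collections import defaultdict

def calibration_table(predictions):
    """Group (confidence, outcome) pairs by confidence level and report
    (number of predictions, fraction correct) for each level."""
    buckets = defaultdict(list)
    for confidence, came_true in predictions:
        buckets[confidence].append(came_true)
    return {
        conf: (len(outcomes), sum(outcomes) / len(outcomes))
        for conf, outcomes in sorted(buckets.items())
    }

# Hypothetical predictions: (stated confidence, whether it came true).
preds = [(0.6, True), (0.6, False), (0.6, True),
         (0.8, True), (0.8, True), (0.8, False), (0.8, False)]
print(calibration_table(preds))
# The 0.6 bucket comes out roughly calibrated (2/3 correct),
# while the 0.8 bucket is overconfident (1/2 correct).
```

A well-calibrated forecaster's observed accuracy in each bucket should roughly match the bucket's stated confidence, with small buckets (like the three 70% predictions in 2017-19) giving noisy estimates.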


2020 resolutions and predictions


  • Author or coauthor three or more academic papers (3 last year) (70%)
  • At most 12 non-research commitments (10 last year) (80%)
  • Meditate on at least 270 days (290 last year) (80%)
  • Read at least 7 books (5 last year) (70%)
  • At least 700 deep work hours (551 last year) (70%)


  • I will write at least 5 blog posts (60%)
  • Eating window at most 11 hours on at least 240 days (228 last year) (70%)
  • I will visit at least 4 new cities with population over 100,000 (11 last year) (70%)
  • At most 1 housemate turnover at Deep End (70%)
  • I finish a language in Duolingo (60%)

Past new year reviews: 2018-19, 2017-18, 2016-17, 2015-16, 2014-15.


Retrospective on the specification gaming examples list

My post about the specification gaming list was recently nominated for the LessWrong 2018 Review (sort of like a test of time award), which prompted me to write a retrospective (cross-posted here). 

I’ve been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it’s fun, standardized, up-to-date, comprehensive, and collaborative.

Some of the appeal is that it’s fun to read about AI cheating at tasks in unexpected ways (I’ve seen a lot of people post on Twitter about their favorite examples from the list). The standardized spreadsheet format seems easier to refer to as well. I think the crowdsourcing aspect is also helpful – this helps keep it current and comprehensive, and people can feel some ownership of the list since they can personally contribute to it. My overall takeaway from this is that safety outreach tools are more likely to be impactful if they are fun and easy for people to engage with.

This list had a surprising amount of impact relative to how little work it took me to put it together and maintain it. The hard work of finding and summarizing the examples was done by the people putting together the lists that the master list draws on (Gwern, Lehman, Olsson, Irpan, and others), as well as the people who submit examples through the form. What I do is put them together in a common format and clarify and/or shorten some of the summaries. I also curate the examples to determine whether they fit the definition of specification gaming (as opposed to simply a surprising behavior or solution). Overall, I’ve probably spent around 10 hours so far on creating and maintaining the list, which is not very much. This makes me wonder if there is other low hanging fruit in the safety resources space that we haven’t picked yet. 

I have been using it both as an outreach and research tool. On the outreach side, the resource has been helpful for making the argument that safety problems are hard and need general solutions, by making salient just how many ways things can go wrong. When presented with an individual example of specification gaming, people often have a default reaction of “well, you can just close the loophole like this”. It’s easier to see that this approach does not scale when presented with 50 examples of gaming behaviors. Any given loophole can seem obvious in hindsight, but 50 loopholes are much less so. I’ve found this useful for communicating a sense of the difficulty and importance of Goodhart’s Law.

On the research side, the examples have been helpful for trying to clarify the distinction between reward gaming and tampering problems. Reward gaming happens when the reward function is designed incorrectly (so the agent is gaming the design specification), while reward tampering happens when the reward function is implemented incorrectly or embedded in the environment (and so can be thought of as gaming the implementation specification). The boat race example is reward gaming, since the score function was defined incorrectly, while the Qbert agent finding a bug that makes the platforms blink and gives the agent millions of points is reward tampering. We don’t currently have any real examples of the agent gaining control of the reward channel (probably because the action spaces of present-day agents are too limited), which seems qualitatively different from the numerous examples of agents exploiting implementation bugs. 
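The gaming-vs-tampering distinction can be made concrete with a toy sketch. Here the design specification is a reward for reaching a goal state, but the implemented reward contains a bug (a hypothetical "glitch" state, loosely analogous to the Qbert platform-blinking bug) that a reward maximizer exploits; all the state names here are made up for illustration.

```python
# Design specification: reward the agent for reaching the goal state.
def designed_reward(state):
    return 1.0 if state == "goal" else 0.0

# Implemented reward: contains a bug where a state the designer never
# intended to be rewarding yields a huge score. An agent exploiting it
# is gaming the *implementation* specification (reward tampering),
# not the *design* specification (reward gaming).
def implemented_reward(state):
    if state == "glitch":  # implementation bug
        return 1_000_000.0
    return designed_reward(state)

# A reward maximizer picks the buggy state over the intended goal.
states = ["start", "goal", "glitch"]
best = max(states, key=implemented_reward)
print(best)  # the agent heads for the glitch, not the goal
```

In the boat race example, by contrast, the implemented score function faithfully matched its (incorrectly designed) specification, so circling for checkpoint rewards was gaming the design specification itself.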

I’m curious what people find the list useful for – as a safety outreach tool, a research tool or intuition pump, or something else? I’d also be interested in suggestions for improving the list (formatting, categorizing, etc). Thanks to everyone who has contributed to the resource so far!

Classifying specification problems as variants of Goodhart’s Law

(Coauthored with Ramana Kumar and cross-posted from the Alignment Forum.)

There are a few different classifications of safety problems, including the Specification, Robustness and Assurance (SRA) taxonomy and the Goodhart’s Law taxonomy. In SRA, the specification category is about defining the purpose of the system, i.e. specifying its incentives.  Since incentive problems can be seen as manifestations of Goodhart’s Law, we explore how the specification category of the SRA taxonomy maps to the Goodhart taxonomy. The mapping is an attempt to integrate different breakdowns of the safety problem space into a coherent whole. We hope that a consistent classification of current safety problems will help develop solutions that are effective for entire classes of problems, including future problems that have not yet been identified.

The SRA taxonomy defines three different types of specifications of the agent’s objective: ideal (a perfect description of the wishes of the human designer), design (the stated objective of the agent) and revealed (the objective recovered from the agent’s behavior). It then divides specification problems into design problems (e.g. side effects) that correspond to a difference between the ideal and design specifications, and emergent problems (e.g. tampering) that correspond to a difference between the design and revealed specifications.

In the Goodhart taxonomy, there is a variable U* representing the true objective, and a variable U representing the proxy for the objective (e.g. a reward function). The taxonomy identifies four types of Goodhart effects: regressional (maximizing U also selects for the difference between U and U*), extremal (maximizing U takes the agent outside the region where U and U* are correlated), causal (the agent intervenes to maximize U in a way that does not affect U*), and adversarial (the agent has a different goal W and exploits the proxy U to maximize W).

We think there is a correspondence between these taxonomies: design problems are regressional and extremal Goodhart effects, while emergent problems are causal Goodhart effects. The rest of this post will explain and refine this correspondence.
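A regressional Goodhart effect is easy to demonstrate numerically. In this sketch (a generic illustration, not from the post) the proxy U is the true objective U* plus independent noise; selecting the option with the highest proxy value also selects for high noise, so the chosen option's true value falls short of its proxy value on average – the optimizer's curse.

```python
import random

random.seed(0)

def sample_option():
    """One option: true objective U* plus noisy proxy U = U* + noise."""
    u_star = random.gauss(0, 1)       # true objective U*
    u = u_star + random.gauss(0, 1)   # proxy U
    return u_star, u

# Repeatedly pick the best-looking option out of 20 by its proxy value,
# and record how much its proxy overstates its true value.
gaps = []
for _ in range(10_000):
    options = [sample_option() for _ in range(20)]
    best_true, best_proxy = max(options, key=lambda o: o[1])
    gaps.append(best_proxy - best_true)

mean_gap = sum(gaps) / len(gaps)
print(f"average (proxy - true) for the selected option: {mean_gap:.2f}")
# The gap is systematically positive: maximizing U selects for the
# difference between U and U* (regressional Goodhart).
```

Extremal Goodhart would correspond to the selection pushing into a region where U and U* decorrelate entirely, and causal Goodhart to the agent intervening on U directly without affecting U* – neither of which this simple selection model captures.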


Continue reading

ICLR Safe ML workshop report

This year the ICLR conference hosted topic-based workshops for the first time (as opposed to a single track for workshop papers), and I co-organized the Safe ML workshop. One of the main goals was to bring together near and long term safety research communities.


The workshop was structured according to a taxonomy that incorporates both near and long term safety research into three areas – specification, robustness and assurance.

Specification: define the purpose of the system
  • Reward hacking
  • Side effects
  • Preference learning
  • Fairness

Robustness: design system to withstand perturbations
  • Adaptation
  • Verification
  • Worst-case robustness
  • Safe exploration

Assurance: monitor and control system activity
  • Interpretability
  • Monitoring
  • Privacy
  • Interruptibility

We had an invited talk and a contributed talk in each of the three areas.

Continue reading

2018-19 New Year review

2018 progress

Research / AI safety:

Rationality / effectiveness:

  • Attended the CFAR mentoring workshop in Prague, and started running rationality training sessions with Janos at our group house.
  • Started using work cycles – focused work blocks (e.g. pomodoros) with built-in reflection prompts. I think this has increased my productivity and focus to some degree. The prompt “how will I get started?” has been surprisingly helpful given its simplicity.
  • Stopped eating processed sugar for health reasons at the end of 2017 and have been avoiding it ever since.
    • This has been surprisingly easy, especially compared to my earlier attempts to eat less sugar. I think there are two factors behind this: avoiding sugar made everything taste sweeter (so many things that used to taste good now seem inedibly sweet), and the mindset shift from “this is a luxury that I shouldn’t indulge in” to “this is not food”.
    • Unfortunately, I can’t make any conclusions about the effects on my mood variables because of some issues with my data recording process :(.
  • Declining levels of insomnia (excluding jetlag):
    • 22% of nights in the first half of 2017, 16% in the second half of 2017, 16% in the first half of 2018, 10% in the second half of 2018.
    • This is probably an effect of the sleep CBT program I did in 2017, though avoiding sugar might be a factor as well.
  • Made some progress on reducing non-research commitments (talks, reviewing, organizing, etc).
    • Set up some systems for this: a spreadsheet to keep track of requests to do things (with 0-3 ratings for workload and 0-2 ratings for regret) and a form to fill out whenever I’m thinking of accepting a commitment.
    • My overall acceptance rate for commitments has gone down a bit from 29% in 2017 to 24% in 2018. The average regret per commitment went down from 0.66 in 2017 to 0.53 in 2018.
    • However, since the number of requests has gone up, I ended up with more things to do overall: 12 commitments with a total of 23 units of workload in 2017 vs 19 commitments with a total of 33 units of workload in 2018. (1 unit of workload ~ 5 hours)

Continue reading

Discussion on the machine learning approach to AI safety

At this year’s EA Global London conference, Jan Leike and I ran a discussion session on the machine learning approach to AI safety. We explored some of the assumptions and considerations that come up as we reflect on different research agendas. Slides for the discussion can be found here.

The discussion focused on two topics. The first topic examined assumptions made by the ML safety approach as a whole, based on the blog post Conceptual issues in AI safety: the paradigmatic gap. The second topic zoomed into specification problems, which both of us work on, and compared our approaches to these problems.

Continue reading

Measuring and avoiding side effects using relative reachability

A major challenge in AI safety is reliably specifying human preferences to AI systems. An incorrect or incomplete specification of the objective can result in undesirable behavior like specification gaming or causing negative side effects. There are various ways to make the notion of a “side effect” more precise – I think of it as a disruption of the agent’s environment that is unnecessary for achieving its objective. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because the robot could have easily gone around the vase. On the other hand, a cooking robot that’s making an omelette has to break some eggs, so breaking eggs is not a side effect.

side effects robots

(image credits: 1, 2, 3)

How can we measure side effects in a general way that’s not tailored to particular environments or tasks, and incentivize the agent to avoid them? This is the central question of our latest paper.

Continue reading