Author Archives: Victoria Krakovna

Moving on from community living

After 7 years at Deep End (and 4 more years in other group houses before that), Janos and I have moved out to live near a school we like and some lovely parks. The life change is bittersweet – we will miss living with our friends, but also look forward to a logistically simpler life with our kids. Looking back, here are some thoughts on what worked and didn’t work well about living in a group house with kids.

Pros. There were many things that we enjoyed about living at Deep End, and for a long time I couldn’t imagine ever wanting to leave. We had a low-effort social life – it was great to have spontaneous conversations with friends without arranging to meet up. This was especially convenient for us as new parents, when it was harder to make plans and get out of the house, particularly when we were on parental leave. The house community also made a huge difference to our wellbeing during the pandemic, because we had a household bubble that wasn’t just us. 

We did lots of fun things together with our housemates – impromptu activities like yoga / meditation / dancing / watching movies, as well as a regular check-in to keep up on each other’s lives. We were generally more easily exposed to new things – meeting friends of friends, trying new foods or activities that someone in the house liked, etc. Our friends often enjoyed playing with the kids, and it was helpful to have someone entertain them while we left the living room for a few minutes. Our 3-year-old seems more social than most kids of the pandemic generation, which is partly temperament and partly growing up in a group house. 

Cons. The main issue was that the group house location was obviously not chosen with school catchment areas or kid-friendly neighbourhoods in mind. The other downsides of living there with kids were insufficient space, lifestyle differences, and extra logistics (all of which increased when we had a second kid).

Our family was taking up more and more of the common space – the living room doubled as a play room and a nursery, so it was a bit cramped. With 4 of us (plus visiting grandparents) and 4 other housemates, the house was at full capacity (particularly the fridge, which became a realm of mystery and chaos). I am generally sensitive to clutter, and having the house full of our stuff and other people’s stuff was a bit much; dealing with only our own things and mess is more manageable. 

Another factor was a mismatch in lifestyles and timings with our housemates, who tended to have later schedules. They often got home and started socializing or heading out to evening events when we had already finished dinner and it was time to put the kids to bed, which was FOMO-inducing at times. Daniel enjoyed evening gatherings like the house check-in, but often became overstimulated and was difficult to put to bed afterwards. Our bedtime also tended to be when people wanted to watch movies on the projector, and it made me sad to keep asking them not to. 

There were also more logistics involved in running a group house, like managing shared expenses and objects, and coordinating chores and housemate turnover. Even with regular decluttering, there was a lot of stuff at the house that didn’t belong to anyone in particular (e.g. before leaving I cleared the shoe rack of 9 pairs of shoes that turned out to be abandoned by previous occupants of the house). With two kids, we have more of our own logistics to deal with, so reducing other logistics is helpful.

Final thoughts. We are thankful to our housemates, current and former, for all the great times we had over the years and the wonderful community we built together. Visiting the house after moving out, it was nice to see the living room decked out with pretty decorations and potted plants and not overflowing with kid stuff – it reminded me of what the house was like when we first started it. Without the constraints of children living at the house, I hope to see Deep End return to its former self as a social place with more events and gatherings, and we will certainly be back to visit often.

It is a big change to live on our own after all these years. We moved near a few other friends with kids, which will be fun too. We are enjoying our own space right now, though we are not set on living by ourselves indefinitely. We might want to live with others again in the future, but probably with 1-2 close friends rather than in a big group house. 

2023-24 New Year review

This is an annual post reviewing the last year and setting intentions for next year. I look over different life areas (work, health, parenting, effectiveness, travel, etc) and draw conclusions from my life tracking data.

Overall, this year went pretty well (and definitely better than the previous two). Highlights include a second kid, hiking in Newfoundland, some parenting milestones (night potty training and stopping breastfeeding), and learning how iron deficiency can feel like burnout.

  1. 2023 review
    1. Life updates
    2. Work
    3. Health
    4. Parenting 
    5. Effectiveness
    6. Travel
    7. Fun stuff
  2. 2023 prediction outcomes
  3. 2024 goals and predictions

2023 review

Life updates

We received a special gift for New Year’s – Michael (“Misha”) arrived just in time to be born in 2023! Daniel is already getting the hang of rocking his brother and singing him lullabies.

Continue reading

Retrospective on my posts on AI threat models

Last year, a major focus of my research was developing a better understanding of threat models for AI risk. This post looks back at some posts on threat models I (co)wrote in 2022 (based on my reviews of these posts for the LessWrong 2022 review).

I ran a survey on DeepMind alignment team opinions on the list of arguments for AGI ruin. I expect the overall agreement distribution still roughly holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven’t rerun the survey, so I don’t know for sure. Looking back at the “possible implications for our work” section, we are working on basically all of these things. 

Thoughts on some of the cruxes in the post based on developments in 2023:

  • Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it work? – There has been a lot of progress on AGI governance and broad endorsement of the risks this year, so I feel somewhat more optimistic about global cooperation than a year ago.
  • Will we know how capable our models are? – The field has made some progress on designing concrete capability evaluations – how well they measure the properties we are interested in remains to be seen.
  • Will systems acquire the capability to be useful for alignment / cooperation before or after the capability to perform advanced deception? – At least so far, deception and manipulation capabilities seem to be lagging a bit behind usefulness for alignment (e.g. model-written evals / critiques, weak-to-strong generalization), but this could change in the future. 
  • Is consequentialism a powerful attractor? How hard will it be to avoid arbitrarily consequentialist systems? – Current SOTA LLMs seem surprisingly non-consequentialist for their level of capability. I still expect LLMs to be one of the safest paths to AGI in terms of avoiding arbitrarily consequentialist systems. 

In Clarifying AI X-risk, we presented a categorization of threat models and our consensus threat model, which posits some combination of specification gaming and goal misgeneralization leading to misaligned power-seeking, or “SG+GMG→MAPS”. I still endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the “SG + GMG → MAPS” framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk. 

Source: Clarifying AI X-risk (Kenton et al., 2022)

In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness, which I would summarize as “non-deceptiveness is anti-natural / hard to disentangle from general capabilities”. I would probably put this under “SG → MAPS”, assuming an irreducible kind of specification gaming where it’s very difficult (or impossible) to distinguish deceptiveness from non-deceptiveness (including through feedback on the model’s reasoning process). Though it could also be GMG where the “non-deceptiveness” concept is incoherent and thus very difficult to generalize well. 

Refining the Sharp Left Turn was an attempt to understand this threat model better (or at all) and make it a bit more concrete. I still endorse the breakdown of claims in this post.

The post could be improved by explicitly relating the claims to the “consensus” threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of the consensus threat model, which relies on only a subset of the SLT claims: 

  • The consensus threat model relies on Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim 2 (alignment techniques stop working). 
  • That said, some weaker version of Claim 2 (alignment techniques failing to apply to more powerful systems in some way) is probably needed for deceptive alignment to arise, e.g. if our interpretability techniques fail to detect deceptive reasoning. However, I expect that most ways this could happen would not be due to the alignment techniques being fundamentally inadequate for the capability transition to more powerful systems (the strong version of Claim 2 used in SLT).

When discussing AI risks, talk about capabilities, not intelligence

Public discussions about catastrophic risks from general AI systems are often derailed by using the word “intelligence”. People often have different definitions of intelligence, or associate it with concepts like consciousness that are not relevant to AI risks, or dismiss the risks because intelligence is not well-defined. I would advocate for using the term “capabilities” or “competence” instead of “intelligence” when discussing catastrophic risks from AI, because this is what the concerns are really about. For example, instead of “superintelligence” we can refer to “super-competence” or “superhuman capabilities”. 

When we talk about general AI systems posing catastrophic risks, the concern is about losing control of highly capable AI systems. Definitions of general AI that are commonly used by people working to address these risks are about general capabilities of the AI systems: 

  • PASTA definition: “AI systems that can essentially automate all of the human activities needed to speed up scientific and technological advancement”. 
  • Legg-Hutter definition: “An agent’s ability to achieve goals in a wide range of environments”.
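
For reference, the Legg-Hutter definition also has a formal version (in Legg and Hutter’s paper “Universal Intelligence: A Definition of Machine Intelligence”), which scores an agent by its expected performance across computable environments, weighted by their simplicity, roughly:

\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}

where E is the set of computable environments, K(\mu) is the Kolmogorov complexity of environment \mu, and V^{\pi}_{\mu} is the expected reward the agent \pi achieves in \mu. The details of the weighting don’t matter for the risk discussion; the point is that the definition measures performance across environments, i.e. capabilities, rather than consciousness or understanding.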

We expect that AI systems that satisfy these definitions would have general capabilities including long-term planning, modeling the world, scientific research, manipulation, deception, etc. While these capabilities can be attained separately, we expect that their development is correlated, e.g. all of them likely increase with scale. 

There are various issues with the word “intelligence” that make it less suitable than “capabilities” for discussing risks from general AI systems:

  • Anthropomorphism: people often specifically associate “intelligence” with being human, being conscious, being alive, or having human-like emotions (none of which are relevant to or a prerequisite for risks posed by general AI systems). 
  • Associations with harmful beliefs and ideologies.
  • Moving goalposts: impressive achievements in AI are often dismissed as not indicating “true intelligence” or “real understanding” (e.g. see the “stochastic parrots” argument). Catastrophic risk concerns are based on what the AI system can do, not whether it has “real understanding” of language or the world.
  • Stronger associations with less risky capabilities: people are more likely to associate “intelligence” with being really good at math than being really good at politics, while the latter may be more representative of capabilities that make general AI systems pose a risk (e.g. manipulation and deception capabilities that could enable the system to overpower humans).
  • High level of abstraction: “intelligence” can take on the quality of a mythical ideal that can’t be met by an actual AI system, while “competence” is more conducive to being specific about the capability level in question.

It’s worth noting that I am not suggesting that we always avoid the term “intelligence” when discussing advanced AI systems. Those who are trying to build advanced AI systems often want to capture different aspects of intelligence or endow the system with real understanding of the world, and it’s useful to investigate and discuss to what extent an AI system has (or could have) these properties. I am specifically advocating for avoiding the term “intelligence” when discussing catastrophic risks, because AI systems can pose these risks without possessing real understanding or some particular aspects of intelligence. 

The basic argument for catastrophic risk from general AI has two parts: 1) the world is on track to develop generally capable AI systems in the next few decades, and 2) generally capable AI systems are likely to outcompete or overpower humans. Both of these arguments are easier to discuss and operationalize by referring to capabilities rather than intelligence: 

  • For #1, we can see a trend of increasingly general capabilities, e.g. from GPT-2 to GPT-4. Scaling laws for model performance as compute, data and model size increase suggest that this trend is likely to continue (a rough functional form is sketched after this list). Whether this trend reflects an increase in “intelligence” is an interesting question to investigate, but in the context of discussing risks, it can be a distraction from considering the implications of rapidly increasing capabilities of foundation models.
  • For #2, we can expect that more generally capable entities are likely to dominate over less generally capable ones. There are historical precedents for this, e.g. humans causing other species to go extinct. While other animals may be more “intelligent” than humans in various ways, the deciding factor was that humans had more general capabilities, like language and developing technology, which allowed them to control and shape the environment. The best threat models for catastrophic AI risk focus on how the general capabilities of advanced AI systems could allow them to overpower humans. 
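
To make the scaling-law point in #1 a bit more concrete, empirical scaling laws (e.g. Kaplan et al., 2020; Hoffmann et al., 2022) find that language model loss decreases smoothly as a power law in model size and data, roughly of the form:

L(N, D) \approx E + A / N^{\alpha} + B / D^{\beta}

where N is the number of parameters, D is the amount of training data, and E, A, B, \alpha and \beta are empirically fitted constants whose exact values depend on the setup. The smoothness and persistence of these trends is what grounds the expectation that capabilities will keep increasing with scale, without needing to settle whether the models are becoming more “intelligent”.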

As the capabilities of AI systems continue to advance, it’s important to be able to clearly consider their implications and possible risks. “Intelligence” is an ambiguous term with unhelpful connotations that often seems to derail these discussions. Next time you find yourself in a conversation about risks from general AI where people are talking past each other, consider replacing the word “intelligent” with “capable” – in my experience, this can make the discussion more clear, specific and productive.

(Thanks to Janos Kramar for helpful feedback on this post.)

Near-term motivation for AI alignment

AI alignment work is usually considered “longtermist”, i.e. aimed at preserving humanity’s long-term potential. This was the primary motivation for this work when the alignment field got started around 20 years ago, when general AI seemed far away or impossible to most people in AI. However, given the current rate of progress towards advanced AI capabilities, there is an increasingly relevant near-term motivation to think about alignment, even if you mostly or only care about people alive today. This is most of my personal motivation for working on alignment.

I would not be surprised if general AI is reached in the next few decades, similarly to the latest AI expert survey's median of 2059 for human-level AI (as estimated by authors at top ML conferences) and the Metaculus median of 2039. The Precipice gives a 10% probability of human extinction this century due to AI, i.e. within the lifetime of children alive today (and I would expect most of this probability to be concentrated in the next few decades, i.e. within our lifetimes). I used to refer to AI alignment work as “long-term AI safety”, but this term seems misleading now, since alignment would be more accurately described as “medium-term safety”. 

While AI alignment has historically been associated with longtermism, there is a downside to relying on longtermist arguments for alignment concerns. Sometimes people seem to conclude that they don’t need to worry about alignment if they don’t care much about the long-term future. For example, one commonly cited argument for trying to reduce existential risk from AI is that “even if it’s unlikely and far away, it’s so important that we should worry about it anyway”. People understandably interpret this as Pascal’s mugging and bounce off. This kind of argument for alignment concerns is not very relevant these days, because existential risk from AI is not that unlikely (10% this century is actually a lot, and may be a conservative estimate) and general AI is not that far away (an average of 36 years in the AI expert survey). 

Similarly, when considering specific paths to catastrophic risk from AI, a typical longtermist scenario involves an advanced AI system inventing molecular nanotechnology, which understandably sounds implausible to most people. I think a more likely path to catastrophic risk would involve general AI precipitating other catastrophic risks like pandemics (e.g. by doing biotechnology research) or taking over the global economy. If you’d like to learn about the most pertinent arguments for alignment concerns and plausible paths for AI to gain an advantage over humanity, check out Holden Karnofsky’s Most Important Century blog post series. 

In terms of my own motivation, honestly I don’t care that much about humanity colonizing the stars, reducing astronomical waste, or ensuring that large numbers of future people exist. These outcomes would be very cool but optional in my view. Of course I would like humanity to have a good long-term future, but I mostly care about people alive today. My main motivation for working on alignment is that I would like my loved ones and everyone else on the planet to have a future. 

Sometimes people worry about a tradeoff between alignment concerns and other aspects of AI safety, such as ethics and fairness, but I still think this tradeoff is pretty weak. There are also many common interests between alignment and ethics that would be great for these communities to coordinate on. This includes developing industry-wide safety standards and AI governance mechanisms, setting up model evaluations for safety, and slow and cautious deployment of advanced AI systems. Ultimately all these safety problems need to be solved to ensure that general AI systems have a positive impact on the world. I think the distribution of effort between AI capabilities and safety will need to shift more towards safety as more advanced AI systems are developed. 

In conclusion, you don’t have to be a longtermist to care about AI alignment. I think the possible impacts on people alive today are significant enough to think about this problem, and the next decade is going to be a critical time for steering advanced AI technology towards safety. If you’d like to contribute to alignment research, here is a list of research agendas in this space and a good course to get up to speed on the fundamentals of AI alignment (more resources here).

2022-23 New Year review

This is an annual post reviewing the last year and setting goals for next year. Overall, this was a reasonably good year with some challenges (the invasion of Ukraine and being sick a lot). Some highlights in this review are improving digital habits, reviewing sleep data from the Oura ring since 2019 and calibration of predictions since 2014, an updated set of Lights habits, the unreasonable effectiveness of nasal spray against colds, and of course baby pictures.

  1. 2022 review
    1. Life updates
    2. Work
    3. Health
    4. Parenting
    5. Effectiveness
    6. Travel
    7. Fun stuff
  2. 2022 prediction outcomes
  3. 2023 goals and predictions

2022 review

Life updates

I am very grateful that my immediate family is in the West, and my relatives both in Ukraine and Russia managed to stay safe and avoid being drawn into the war on either side. In retrospect, it was probably good that my dad died in late 2021 and not a few months later when Kyiv was under attack, so we didn’t have to figure out how to get a bedridden cancer patient out of a war zone. It was quite surreal that the city that I had visited just a few months back was now under fire, and the people I had met there were now in danger. The whole thing was pretty disorienting and made it hard to focus on work for a while. I eventually mostly stopped checking the news and got back to normal life with some background guilt about not keeping up with what’s going on in the homeland.

Work

My work focused on threat models and inner alignment this year:

Continue reading

Refining the Sharp Left Turn threat model

(Coauthored with others on the alignment team and cross-posted from the alignment forum: part 1, part 2)

A sharp left turn (SLT) is a possible rapid increase in AI system capabilities (such as planning and world modeling) that could result in alignment methods no longer working. This post aims to make the sharp left turn scenario more concrete. We will discuss our understanding of the claims made in this threat model, propose some mechanisms for how a sharp left turn could occur, and consider how alignment techniques could manage a sharp left turn or fail to do so.

Claims of the threat model

What are the main claims of the “sharp left turn” threat model?

Claim 1. Capabilities will generalize far (i.e., to many domains)

There is an AI system that:

  • Performs well: it can accomplish impressive feats, or achieve high scores on valuable metrics.
  • Generalizes, i.e., performs well in new domains, which were not optimized for during training, with no domain-specific tuning.

Generalization is a key component of this threat model because we’re not going to directly train an AI system for the task of disempowering humanity, so for the system to be good at this task, the capabilities it develops during training need to be more broadly applicable. 

Some optional sub-claims can be made that increase the risk level of the threat model:

Claim 1a [Optional]: Capabilities (in different “domains”) will all generalize at the same time

Claim 1b [Optional]: Capabilities will generalize far in a discrete phase transition (rather than continuously) 

Claim 2. Alignment techniques that worked previously will fail during this transition

  • Qualitatively different alignment techniques are needed. The mechanisms by which the existing techniques work apply to earlier versions of the AI technology, but not to the new version, because the new version gets its capabilities through something new, or jumps to a qualitatively higher capability level (even if through “scaling” the same mechanisms).

Claim 3: Humans can’t intervene to prevent or align this transition 

  • Path 1: humans don’t notice because it’s too fast (or they aren’t paying attention)
  • Path 2: humans notice but are unable to make alignment progress in time
  • Some combination of these paths, as long as the end result is that the system ends up insufficiently aligned
Continue reading

Paradigms of AI alignment: components and enablers

(This post is based on an overview talk I gave at UCL EA and Oxford AI society (recording here). Cross-posted to the Alignment Forum. Thanks to Janos Kramar for detailed feedback on this post and to Rohin Shah for feedback on the talk.)

This is my high-level view of the AI alignment research landscape and the ingredients needed for aligning advanced AI. I would divide alignment research into work on alignment components, focusing on different elements of an aligned system, and alignment enablers, which are research directions that make it easier to get the alignment components right.

You can read in more detail about work going on in these areas in my list of AI safety resources.

Continue reading

2021-22 New Year review

This was a rough year that sometimes felt like a trial by fire – sick relatives, caring for a baby, and the pandemic making these things more difficult to deal with. My father was diagnosed with cancer and passed away later in the year, and my sister had a sudden serious health issue but is thankfully recovering. One theme for the year was that work is a break from parenting, parenting is a break from work, and both of those things are a break from loved ones being unwell.

I found it hard to cope with all the uncertainty and stress, and this was probably my worst year in terms of mental health. There were some bright spots as well – watching my son learn many new skills, and lots of time with family and in nature. Overall, I look forward to a better year ahead purely based on regression to the mean. 

  1. 2021 review
    1. Life updates
    2. Work
    3. Effectiveness
    4. Health 
    5. Travel 
    6. Fun stuff
  2. 2021 prediction outcomes
  3. 2022 goals and predictions

2021 review

Life updates

My father, Anatolij Krakovny, was diagnosed with late-stage lung cancer in January with a grim prognosis of a few months to a year of life. This came out of nowhere, because he had always been healthy and didn’t have any obvious risk factors. We researched alternative treatments to the standard chemotherapy and arranged additional tests for him but didn’t find anything promising. 

We went to Ukraine to visit him in February and he was happy to meet his grandson. We were worried about the covid risks of traveling with little Daniel but concluded that they were low enough, and thankfully we were allowed to leave the UK though international travel was not generally permitted. 

My dad seemed to be in remission in the summer, and we considered visiting him in June, but he told us not to come because of the covid situation in Ukraine. Unfortunately we listened to him and didn’t go (this would have been a good opportunity to spend time with him while he was still doing well).

We spent most of the summer in Canada, with grandparents taking care of Daniel. This was a relaxing time with family and nature, until my sister had a sudden life-threatening health problem and was in and out of hospital with a lot of uncertainty around recovery. This also came out of the blue with no obvious risk factors present. She is feeling better now and doctors expect a full recovery, which we are very grateful for.

In November, my dad had a sudden relapse, and we went to Ukraine again. Once there we realized that the public health system wasn’t taking good care of him (they were mostly swamped with covid) and we had to find a private hospital to take him in. He was already in pretty bad shape and died two weeks later, but I’m glad we managed to see him and help him in some way.

Continue reading

Reflections on the first year of parenting

The first year after having a baby went by really fast – happy birthday Daniel! This post is a reflection on our experience and what we learned in the first year.

Grandparents. We were very fortunate to get a lot of help from Daniel’s grandparents. My mom stayed with us when he was 1 week – 3 months old, and Janos’s dad was around when he was 4-6 months old (they made it to the UK from Canada despite the pandemic). We also spent the summer in Canada with the grandparents taking care of the baby while we worked remotely.

We learned a lot about baby care from them, including nursery rhymes in our respective languages and a cool trick for dealing with the baby spitting up on himself without changing his outfit (you can put a dry cloth under the wet part of the outfit). I think our first year as parents would have been much harder without them.

Continue reading