Near-term motivation for AGI alignment

AGI alignment work is usually considered “longtermist”, i.e. motivated by preserving humanity’s long-term potential. This was the primary motivation for alignment work when the field got started around 20 years ago, back when AGI seemed far away or impossible to most people in AI. However, given the current rate of progress towards general AI capabilities, there is an increasingly relevant near-term motivation to think about alignment, even if you mostly or only care about people alive today. This is most of my personal motivation for working on alignment.

I would not be surprised if AGI is reached in the next few decades, in line with the latest AI expert survey’s median of 2059 for human-level AI (as estimated by authors at top ML conferences) and the Metaculus median of 2039. The Precipice gives a 10% probability of human extinction this century due to AI, i.e. within the lifetime of children alive today (and I would expect most of this probability to be concentrated in the next few decades, i.e. within our lifetimes). I used to refer to AGI alignment work as “long-term AI safety”, but this term seems misleading now, since alignment would be more accurately described as “medium-term safety”.

While AGI alignment has historically been associated with longtermism, there is a downside to relying on longtermist arguments for alignment concerns. Sometimes people seem to conclude that they don’t need to worry about alignment if they don’t care much about the long-term future. For example, one commonly cited argument for trying to reduce existential risk from AI is that “even if it’s unlikely and far away, it’s so important that we should worry about it anyway”. People understandably interpret this as a Pascal’s mugging and bounce off. This kind of argument for alignment concerns is not very relevant these days, because existential risk from AI is not that unlikely (10% this century is actually a lot, and may be a conservative estimate) and AGI is not that far away (an average of 36 years in the AI expert survey).

Similarly, when considering specific paths to catastrophic risk from AGI, a typical longtermist scenario involves AGI inventing molecular nanotechnology, which understandably sounds implausible to most people. I think a more likely path to catastrophic risk would involve AGI precipitating other catastrophic risks like pandemics (e.g. by doing biotechnology research) or taking over the global economy. If you’d like to learn about the most pertinent arguments for alignment concerns and plausible paths for AI to gain an advantage over humanity, check out Holden Karnofsky’s Most Important Century blog post series. 

In terms of my own motivation, honestly I don’t care that much about humanity colonizing the stars, reducing astronomical waste, or large numbers of future people existing. These outcomes would be very cool but optional in my view. Of course I would like humanity to have a good long-term future, but I mostly care about people alive today. My main motivation for working on alignment is that I would like my loved ones and everyone else on the planet to have a future.

Sometimes people worry about a tradeoff between alignment concerns and other aspects of AI safety, such as ethics / fairness, but I think this tradeoff is pretty weak. There are also many common interests between the alignment and ethics communities that would be great to coordinate on, including developing industry-wide safety standards and AI governance mechanisms, setting up model evaluations for safety, and deploying advanced AI systems slowly and cautiously. Ultimately all these safety problems need to be solved to ensure that AGI systems have a positive impact on the world. I think the distribution of effort between AI capabilities and safety will need to shift more towards safety as more advanced AI systems are developed.

In conclusion, you don’t have to be a longtermist to care about AGI alignment. I think the possible impacts on people alive today are significant enough to think about this problem, and the next decade is going to be a critical time for steering advanced AI technology towards safety. If you’d like to contribute, here is a list of research agendas in this space, and a good course to get up to speed on the fundamentals of AGI alignment.

2022-23 New Year review

This is an annual post reviewing the last year and setting goals for next year. Overall, this was a reasonably good year with some challenges (the invasion of Ukraine and being sick a lot). Some highlights in this review are improved digital habits, a review of sleep data from the Oura ring since 2019 and of prediction calibration since 2014, an updated set of Lights habits, the unreasonable effectiveness of nasal spray against colds, and of course baby pictures.

2022 review

Life updates

I am very grateful that my immediate family is in the West, and my relatives both in Ukraine and Russia managed to stay safe and avoid being drawn into the war on either side. In retrospect, it was probably good that my dad died in late 2021 and not a few months later when Kyiv was under attack, so we didn’t have to figure out how to get a bedridden cancer patient out of a war zone. It was quite surreal that the city that I had visited just a few months back was now under fire, and the people I had met there were now in danger. The whole thing was pretty disorienting and made it hard to focus on work for a while. I eventually mostly stopped checking the news and got back to normal life with some background guilt about not keeping up with what’s going on in the homeland.

AI alignment

My work focused on threat models and inner alignment this year:

Continue reading

Refining the Sharp Left Turn threat model

(Coauthored with others on the alignment team and cross-posted from the alignment forum: part 1, part 2)

A sharp left turn (SLT) is a possible rapid increase in AI system capabilities (such as planning and world modeling) that could result in alignment methods no longer working. This post aims to make the sharp left turn scenario more concrete. We will discuss our understanding of the claims made in this threat model, propose some mechanisms for how a sharp left turn could occur, and examine how alignment techniques could manage a sharp left turn or fail to do so.

Image credit: Adobe

Claims of the threat model

What are the main claims of the “sharp left turn” threat model?

Claim 1. Capabilities will generalize far (i.e., to many domains)

There is an AI system that:

  • Performs well: it can accomplish impressive feats, or achieve high scores on valuable metrics.
  • Generalizes, i.e., performs well in new domains that were not optimized for during training, with no domain-specific tuning.

Generalization is a key component of this threat model because we’re not going to directly train an AI system for the task of disempowering humanity, so for the system to be good at this task, the capabilities it develops during training need to be more broadly applicable. 

Some optional sub-claims can be made that increase the risk level of the threat model:

Claim 1a [Optional]: Capabilities (in different “domains”) will all generalize at the same time

Claim 1b [Optional]: Capabilities will generalize far in a discrete phase transition (rather than continuously) 

Claim 2. Alignment techniques that worked previously will fail during this transition

  • Qualitatively different alignment techniques are needed. The mechanisms that made earlier alignment techniques work apply to earlier versions of the AI technology, but not to the new version, because the new version gains its capabilities through something new or jumps to a qualitatively higher capability level (even if only by “scaling” the same mechanisms).

Claim 3: Humans can’t intervene to prevent or align this transition 

  • Path 1: humans don’t notice because it’s too fast (or they aren’t paying attention)
  • Path 2: humans notice but are unable to make alignment progress in time
  • Some combination of these paths, as long as the end result is insufficient alignment

Continue reading

Paradigms of AI alignment: components and enablers

(This post is based on an overview talk I gave at UCL EA and Oxford AI society (recording here). Cross-posted to the Alignment Forum. Thanks to Janos Kramar for detailed feedback on this post and to Rohin Shah for feedback on the talk.)

This is my high-level view of the AI alignment research landscape and the ingredients needed for aligning advanced AI. I would divide alignment research into work on alignment components, focusing on different elements of an aligned system, and alignment enablers, which are research directions that make it easier to get the alignment components right.

You can read in more detail about work going on in these areas in my list of AI safety resources.

Continue reading

2021-22 New Year review

This was a rough year that sometimes felt like a trial by fire – sick relatives, caring for a baby, and the pandemic making these things more difficult to deal with. My father was diagnosed with cancer and passed away later in the year, and my sister had a sudden serious health issue but is thankfully recovering. One theme for the year was that work is a break from parenting, parenting is a break from work, and both of those things are a break from loved ones being unwell. I found it hard to cope with all the uncertainty and stress, and this was probably my worst year in terms of mental health. There were some bright spots as well – watching my son learn many new skills, and lots of time with family and in nature. Overall, I look forward to a better year ahead purely based on regression to the mean. 

Continue reading

Reflections on the first year of parenting

The first year after having a baby went by really fast – happy birthday Daniel! This post is a reflection on our experience and what we learned in the first year.

Grandparents. We were very fortunate to get a lot of help from Daniel’s grandparents. My mom stayed with us when he was 1 week – 3 months old, and Janos’s dad was around when he was 4-6 months old (they made it to the UK from Canada despite the pandemic). We also spent the summer in Canada with the grandparents taking care of the baby while we worked remotely.

We learned a lot about baby care from them, including nursery rhymes in our respective languages and a cool trick for dealing with the baby spitting up on himself without changing his outfit (you can put a dry cloth under the wet part of the outfit). I think our first year as parents would have been much harder without them.

Continue reading

2020-21 New Year review

This is an annual post reviewing the last year and making resolutions and predictions for next year. 2020 brought a combination of challenges from living in a pandemic and becoming a parent. Other highlights include not getting sick, getting a broader perspective on my life through decluttering, and going back to Ukraine for the first time. (This post was written in bits and pieces over the past two months.)

2020 review

Life updates:

Janos and I had a son, Daniel, on Nov 11. He arrived almost 3 weeks later than expected (apparently he was waiting to be born on my late grandfather’s birthday), and has been a great source of cuddles, sound effects and fragmented sleep ever since.

(Photos: 1 week old and 6 weeks old)

Some work things also went well this year – I had a paper accepted at NeurIPS, and was promoted to senior research scientist. Also, I did not get covid, and survived half a year of working from home (much credit goes to the great company of my housemates). Overall, a lot of things to be grateful for.

Continue reading

Tradeoff between desirable properties for baseline choices in impact measures

(Cross-posted to the Alignment Forum. Summarized in Alignment Newsletter #108. Thanks to Carroll Wainwright, Stuart Armstrong, Rohin Shah and Alex Turner for helpful feedback on this post.)

Impact measures are auxiliary rewards for low impact on the agent’s environment, used to address the problems of side effects and instrumental convergence. A key component of an impact measure is a choice of baseline state: a reference point relative to which impact is measured. Commonly used baselines are the starting state, the initial inaction baseline (the counterfactual where the agent does nothing since the start of the episode) and the stepwise inaction baseline (the counterfactual where the agent does nothing instead of its last action). The stepwise inaction baseline is currently considered the best choice because it does not create the following bad incentives for the agent: interference with environment processes or offsetting its own actions towards the objective. This post will discuss a fundamental problem with the stepwise inaction baseline that stems from a tradeoff between different desirable properties for baseline choices, and some possible alternatives for resolving this tradeoff.

One clearly desirable property for a baseline choice is to effectively penalize high-impact effects, including delayed effects. It is well-known that the simplest form of the stepwise inaction baseline does not effectively capture delayed effects. For example, if the agent drops a vase from a high-rise building, then by the time the vase reaches the ground and breaks, the broken vase will be the default outcome. Thus, in order to penalize delayed effects, the stepwise inaction baseline is usually used in conjunction with inaction rollouts, which predict future outcomes of the inaction policy. Inaction rollouts from the current state and the stepwise baseline state are compared to identify delayed effects of the agent’s actions. In the above example, the current state contains a vase in the air, so in the inaction rollout from the current state the vase will eventually reach the ground and break, while in the inaction rollout from the stepwise baseline state the vase remains intact. 
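
To make this concrete, here is a minimal Python sketch (my own illustration, not code from the post) of how an impact penalty with a stepwise inaction baseline and inaction rollouts could be computed. The environment interface, the NOOP action, and the deviation measure are all assumed placeholders rather than any specific library’s API.

```python
import copy

NOOP = 0            # assumed "do nothing" action
ROLLOUT_STEPS = 10  # horizon for inaction rollouts


def inaction_rollout(env, steps=ROLLOUT_STEPS):
    """Roll out the no-op policy from a copy of `env` and return the final state."""
    sim = copy.deepcopy(env)            # assumes the environment can be copied
    state = sim.current_state()         # assumed environment method
    for _ in range(steps):
        state, _, done, _ = sim.step(NOOP)  # assumed gym-style step interface
        if done:
            break
    return state


def stepwise_impact_penalty(env_before, action, deviation):
    """Impact penalty for taking `action` in the state captured by `env_before`.

    Compares inaction rollouts from (a) the state reached by the action and
    (b) the stepwise baseline state reached by doing nothing instead, so that
    delayed effects (like the falling vase above) are penalized rather than
    becoming part of the default outcome.
    """
    # Branch 1: the agent takes the action, then the no-op policy is rolled out.
    env_acted = copy.deepcopy(env_before)
    env_acted.step(action)
    future_after_action = inaction_rollout(env_acted)

    # Branch 2: the stepwise inaction baseline - the agent does nothing instead.
    env_baseline = copy.deepcopy(env_before)
    env_baseline.step(NOOP)
    future_baseline = inaction_rollout(env_baseline)

    # `deviation` is a placeholder for a measure of difference between states,
    # e.g. relative reachability or attainable utility.
    return deviation(future_after_action, future_baseline)
```

In the vase example, the rollout after dropping the vase ends with a broken vase while the baseline rollout does not, so the penalty is nonzero even though the vase has not yet broken at the time of the action.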

Continue reading

Possible takeaways from the coronavirus pandemic for slow AI takeoff

(Cross-posted to LessWrong. Summarized in Alignment Newsletter #104. Thanks to Janos Kramar for helpful feedback on this post.)

As the covid-19 pandemic unfolds, we can draw lessons from it for managing future global risks, such as other pandemics, climate change, and risks from advanced AI. In this post, I will focus on possible implications for AI risk. For a broader treatment of this question, I recommend FLI’s covid-19 page that includes expert interviews on the implications of the pandemic for other types of risks. 

A key element in AI risk scenarios is the speed of takeoff – whether advanced AI is developed gradually or suddenly. Paul Christiano’s post on takeoff speeds defines slow takeoff in terms of the economic impact of AI as follows: “There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles.” It argues that slow AI takeoff is more likely than fast takeoff, but is not necessarily easier to manage, since it poses different challenges, such as large-scale coordination. This post expands on this point by examining some parallels between the coronavirus pandemic and a slow takeoff scenario. The upsides of slow takeoff include the ability to learn from experience, act on warning signs, and reach a timely consensus that there is a serious problem. I would argue that the covid-19 pandemic had these properties, but most of the world’s institutions did not take advantage of them. This suggests that, unless our institutions improve, we should not expect the slow AI takeoff scenario to have a good default outcome. 
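
As a quick back-of-the-envelope illustration (my own, not from Christiano’s post), the growth rates implied by these doubling times can be computed directly; the ~3% figure for current world output growth is an approximate round number used for comparison.

```python
import math


def annual_growth_for_doubling(years):
    """Annual growth rate that doubles output in `years` years."""
    return 2 ** (1 / years) - 1


# Slow takeoff threshold: a complete 4-year doubling of world output.
print(f"4-year doubling: ~{annual_growth_for_doubling(4):.1%} annual growth")
# Fast takeoff threshold: a 1-year doubling.
print(f"1-year doubling: ~{annual_growth_for_doubling(1):.0%} annual growth")
# For comparison, ~3% annual growth (roughly today's world economy, an assumed
# round number) corresponds to a doubling time of about 23 years.
print(f"3% growth: doubling time ~{math.log(2) / math.log(1.03):.0f} years")
```

Even the “slow” scenario implies growth rates far above the recent historical norm, which is part of why it is not necessarily easier to manage.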

Continue reading

2019-20 New Year review

This is an annual post reviewing the last year and making resolutions and predictions for next year. This year’s edition features sleep tracking, intermittent fasting, overcommitment busting, and evaluating calibration for all annual predictions since 2014.

2019 review

AI safety research:

AI safety outreach:

  • Co-organized FLI’s Beneficial AGI conference in Puerto Rico, a more long-term focused sequel to the original Puerto Rico conference and the Asilomar conference. This year I was the program chair for the technical safety track of the conference.
  • Co-organized the ICLR AI safety workshop, Safe Machine Learning: Specification, Robustness and Assurance. This was my first time running a paper reviewing process.
  • Gave a talk at the IJCAI AI safety workshop on specification, robustness and assurance problems.
  • Took part in the DeepMind podcast episode on AI safety (“I, robot”).

Continue reading