Research

Papers

Evaluating Frontier Models for Dangerous Capabilities. Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Toby Shevlane, et al. arXiv, 2024.

Limitations of Agents Simulated by Predictive Models. Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna (MATS project). arXiv, 2024.

Quantifying stability of non-power-seeking in artificial agents. Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna (MATS project). arXiv, 2024.

Power-seeking can be probable and predictive for trained agents. Victoria Krakovna and Janos Kramar. arXiv, 2023. (blog post)

Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals. Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton. arXiv, 2022.

Avoiding Side Effects By Considering Future Tasks. Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg. Neural Information Processing Systems, 2020. (arXiv, code, AN summary)

Avoiding Tampering Incentives in Deep RL via Decoupled Approval. Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg. arXiv, 2020. (blog post, AN summary)

REALab: An Embedded Perspective on Tampering. Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg. arXiv, 2020. (blog post)

Modeling AGI Safety Frameworks with Causal Influence Diagrams. Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg. IJCAI AI Safety workshop, 2019. (AN summary)

Penalizing Side Effects Using Stepwise Relative Reachability. Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, Shane Legg. IJCAI AI Safety workshop, 2019 (version 2), 2018 (version 1). (arXiv, version 2 blog post, version 1 blog post, code, AN summary of version 1)

AI Safety Gridworlds. Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg. arXiv, 2017. (arXiv, blog post, code)

Reinforcement Learning with a Corrupted Reward Channel. Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg. IJCAI AI and Autonomy track, 2017. (arXiv, demo, code)

Building Interpretable Models: From Bayesian Networks to Neural Networks. Victoria Krakovna (PhD thesis), 2016.

Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models. Victoria Krakovna and Finale Doshi-Velez. ICML Workshop on Human Interpretability, 2016 (arXiv); NeurIPS Workshop on Interpretable Machine Learning for Complex Systems, 2016 (arXiv, poster).

Interpretable Selection and Visualization of Features and Interactions Using Bayesian Forests. Victoria Krakovna, Chenguang Dai, Jun S. Liu. Statistics and Its Interface, Volume 11 Number 3, 2018. (arXiv (older version), R package, code)

A Minimalistic Approach to Sum-Product Network Learning for Real Applications. Victoria Krakovna, Moshe Looks. International Conference for Learning Representations (ICLR) workshop track, May 2016. (arXiv, OpenReview, poster)

A generalized-zero-preserving method for compact encoding of concept lattices. Matthew Skala, Victoria Krakovna, Janos Kramar, Gerald Penn. Association for Computational Linguistics (ACL), 2010.

Blog posts

How undesired goals can arise with correct rewards. Rohin Shah, Victoria Krakovna, Vikrant Varma, Zachary Kenton. DeepMind Blog, 2022 (more detailed post at DeepMind Safety Research Blog).

ELK contest submission: route understanding through the human ontology. Victoria Krakovna, Ramana Kumar, Vikrant Varma. Alignment Forum, 2022. Received a prize in the category “Train a reporter that is useful to an auxiliary AI”.

Optimization concepts in the Game of Life. Victoria Krakovna and Ramana Kumar. Alignment Forum, 2021.

Specification gaming: the flip side of AI ingenuity. Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, Shane Legg. DeepMind Blog, 2020 (cross-posted to DeepMind Safety Research Blog, Alignment Forum). (AN summary)

Designing agent incentives to avoid side effects. Victoria Krakovna, Ramana Kumar, Laurent Orseau, Alexander Turner. DeepMind Safety Research Blog, 2019. (AN summary)

Specifying AI safety problems in simple environments. Jan Leike, Victoria Krakovna, Laurent Orseau. DeepMind Blog, 2017.

Talks

Avoiding side effects by considering future tasks. CHAI seminar, 2021.

AXRP podcast episode 7: side effects, 2021. (AN summary)

Impact measures and side effects. Aligned AI Summer School, Prague, 2018.

Measuring and avoiding side effects using relative reachability. Goal Specifications for RL (GoalsRL) workshop, ICML, 2018.

Reinforcement Learning with a Corrupted Reward Channel (video). Workshop on Reliable AI, 2017.