Impact measures are auxiliary rewards for low impact on the agent’s environment, used to address the problems of side effects and instrumental convergence. A key component of an impact measure is a choice of baseline state: a reference point relative to which impact is measured. Commonly used baselines are the starting state, the initial inaction baseline (the counterfactual where the agent does nothing since the start of the episode), and the stepwise inaction baseline (the counterfactual where the agent does nothing instead of its last action). The stepwise inaction baseline is currently considered the best choice because it avoids creating two bad incentives for the agent: interfering with environment processes and offsetting its own actions towards the objective. This post will discuss a fundamental problem with the stepwise inaction baseline that stems from a tradeoff between different desirable properties for baseline choices, and some possible alternatives for resolving this tradeoff.
One clearly desirable property for a baseline choice is to effectively penalize high-impact effects, including delayed effects. It is well known that the simplest form of the stepwise inaction baseline does not effectively capture delayed effects. For example, if the agent drops a vase from a high-rise building, then by the time the vase reaches the ground and breaks, the broken vase will be the default outcome. Thus, in order to penalize delayed effects, the stepwise inaction baseline is usually used in conjunction with inaction rollouts, which predict future outcomes of the inaction policy. Inaction rollouts from the current state and from the stepwise baseline state are compared to identify delayed effects of the agent’s actions. In the above example, the current state contains a vase in the air, so in the inaction rollout from the current state the vase will eventually reach the ground and break, while in the inaction rollout from the stepwise baseline state the vase remains intact.
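
The vase example can be made concrete with a small sketch. Everything in it is an assumption made for illustration rather than part of any particular impact measure: the `State` fields, the toy `step` dynamics (a released vase falls one floor per step), the `deviation` function (which only checks whether the vase is broken), and the choice to take the maximum deviation along the two rollouts.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    """Hypothetical toy state for the vase example."""
    held: bool = True     # the agent is still holding the vase
    height: int = 3       # floors above the ground
    broken: bool = False


def step(state, action):
    """Toy dynamics: 'drop' releases the vase; a released vase falls one floor per step."""
    if state.held:
        if action == "drop":
            return replace(state, held=False)
        return state
    if state.height > 0:
        height = state.height - 1
        # The vase breaks on the step it reaches the ground.
        return replace(state, height=height, broken=(height == 0))
    return state


def inaction_rollout(state, horizon):
    """States visited when the agent does nothing for `horizon` steps."""
    states = [state]
    for _ in range(horizon):
        state = step(state, "noop")
        states.append(state)
    return states


def deviation(state, baseline):
    """Assumed deviation measure: 1 if the vase's broken status differs, else 0."""
    return int(state.broken != baseline.broken)


def stepwise_penalty(prev_state, action, horizon):
    """Compare inaction rollouts from the current state and the stepwise baseline state."""
    current = step(prev_state, action)     # what the agent actually did
    baseline = step(prev_state, "noop")    # what doing nothing would have produced
    current_rollout = inaction_rollout(current, horizon)
    baseline_rollout = inaction_rollout(baseline, horizon)
    return max(deviation(c, b) for c, b in zip(current_rollout, baseline_rollout))
```

With `horizon=0` the function reduces to the simplest form of the stepwise baseline, and `stepwise_penalty(State(), "drop", horizon=0)` sees no deviation because the vase is not yet broken. With a horizon long enough for the vase to land, the rollout from the current state ends with a broken vase while the rollout from the baseline state does not, so the drop is penalized at the step where it happens. On later steps the vase is already falling in both the current and baseline states, so no further penalty is assigned.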