Alignment likely generalizes further than capabilities.

Recently, I was reading this paper, which demonstrates how to do online RLHF for alignment of LLMs, and a sentence stuck out to me:

We conjecture that this is because the reward model (discriminator) usually generalizes better than the policy (generator)

This is an offhand remark, but it strikes at the core of the LessWrong doom story, as propounded here, to which the idea of ‘capabilities generalizing further than alignment’ is central. The claim matters because it is what makes the self-improvement of AI systems spell doom even if they are initially ‘aligned’ using current methods: as they rapidly improve in capabilities, their existing alignment fails to keep up, and so they begin optimizing for some imperfect proxy of their aligned objective, which rapidly degenerates into some terrible outcome (by our initial value function).

This sounds plausible, but I realized I had never really thought about it in depth. After some consideration provoked by this paper, the proposition seems likely to be wrong to me; in fact, ‘alignment’, or more generally reward modelling or the ability to judge outcomes, is likely actually easy and hence generalizes further and more robustly than ‘capabilities’. I have several lines of thinking for this:

1.) Empirically, reward models are often significantly smaller and easier to train than core model capabilities. I.e. in RLHF and general hybrid RL setups, the reward model is often a relatively simple small MLP or even a linear layer stuck atop the general ‘capabilities’ model which actually does the policy optimization. In general, it seems that simply having the reward be some simple function of the internal latent states works well. That reward models are effective as a small ‘tacked-on’ component, and are never the focus of serious training effort, is some evidence that reward modelling is ‘easy’ compared to policy optimization and training. It is also the case that, empirically, language models seem to learn fairly robust understandings of human values and can judge situations as ethical vs unethical in a way that is not obviously worse than human judgements. This is expected, since human judgements of ethicality and human values are contained in the datasets they learn to approximate.

2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.

3.) There are general complexity-theoretic reasons to believe that judging is easier than generating. There are many theoretical results of the form that it is asymptotically significantly easier to e.g. verify a proof than to generate a new one. This seems to be a fundamental feature of our reality, and to some extent it maps onto the distinction between alignment and capabilities. It also seems intuitively true. It is relatively easy to understand whether a hypothetical situation would be good or not. It is much harder to actually find a path to materialize that situation in the real world.

4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice versa. For the AI, we aim to solve a.) as a general part of outer alignment, and b.) is the general problem of capabilities. It is much easier for people to judge and critique outcomes than to actually materialize them in practice, as evidenced by the very large number of people who do the former compared to the latter.

5.) Similarly, understanding of values and the ability to assess situations for value arise much earlier and more robustly in human development than the ability to actually steer outcomes. Young children are very good at knowing what they want and at noticing when things don’t go how they want, even in situations that are new to them, and are significantly worse at actually being able to bring about their desires in the world.
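
To make points 1 and 2 a little more concrete, here is a minimal sketch of a reward that is a simple linear function of a latent state, combined with the ‘assume it is bad if uncertain’ rule. Everything in it is an assumption made for illustration: the function names, the hand-written weights, and the choice to measure uncertainty as disagreement within a small ensemble of reward heads. It is not how any particular RLHF system is implemented.

```python
import statistics
from typing import Sequence


def linear_reward(latent: Sequence[float], weights: Sequence[float]) -> float:
    """Reward as a simple linear function of a latent state (hypothetical head)."""
    if len(latent) != len(weights):
        raise ValueError("latent and weights must have the same length")
    return sum(x * w for x, w in zip(latent, weights))


def conservative_reward(
    latent: Sequence[float],
    ensemble: Sequence[Sequence[float]],
    penalty: float = 1.0,
) -> float:
    """Pessimistic estimate: ensemble mean minus `penalty` times ensemble std.

    Disagreement between heads is used as a stand-in for uncertainty
    (an assumption of this sketch); more disagreement lowers the reward.
    """
    scores = [linear_reward(latent, weights) for weights in ensemble]
    return statistics.fmean(scores) - penalty * statistics.pstdev(scores)


# Hypothetical heads that agree about feature 0 and disagree about feature 1.
ensemble = [[1.0, 0.5], [1.0, -0.5], [1.0, 0.0]]

familiar = [2.0, 0.0]  # only excites the feature the heads agree on
novel = [2.0, 4.0]     # also excites the feature the heads disagree on

print(conservative_reward(familiar, ensemble))
print(conservative_reward(novel, ensemble))
```

Both latent states have the same mean reward across the ensemble, but the second scores lower because the heads disagree about it. An agent optimizing the conservative estimate is pushed back towards regions where the reward heads agree, which is the behaviour point 2 argues for.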

In general, it makes sense that specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, the ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
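
The judging-versus-generating asymmetry from point 3 can be illustrated with a toy problem. The sketch below uses subset-sum, which is my choice of example rather than anything from the paper: checking a proposed answer takes one pass over it, while the naive search shown here may have to try up to 2^n subsets. This is only an analogy for alignment versus capabilities, and the brute-force search is just the simplest generator to write down, not a claim that no faster one exists.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple


def verify(numbers: Sequence[int], target: int, indices: Sequence[int]) -> bool:
    """Judge a proposed solution: a single linear pass over the candidate."""
    idx = list(indices)
    if len(set(idx)) != len(idx):
        return False  # an index may be used at most once
    if any(i < 0 or i >= len(numbers) for i in idx):
        return False  # index out of range
    return sum(numbers[i] for i in idx) == target


def generate(numbers: Sequence[int], target: int) -> Optional[Tuple[int, ...]]:
    """Find a solution by brute force, trying subsets in order of size."""
    for size in range(len(numbers) + 1):
        for idx in combinations(range(len(numbers)), size):
            if verify(numbers, target, idx):
                return idx
    return None  # no subset sums to the target


numbers = [3, 34, 4, 12, 5, 2]
solution = generate(numbers, 9)
print(solution, verify(numbers, 9, solution))
```

Note that the generator is built out of the verifier plus a search loop. All of the extra cost lives in the search, which is the part that corresponds to ‘capabilities’ in the analogy.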

While there are certainly many potential obstacles to achieving successful alignment and ultimately a safe singularity, the idea that our AI systems will be unable to understand our values as they grow in capabilities seems increasingly unlikely to me. This is not to say that outer alignment is solved. There are still immense issues in figuring out ‘what’ values to instill into the AI: whether this should be some preference aggregation over many humans, or some set of ‘universal values’ somebody came up with, or whatever. Additionally, although getting a model-based agent to plan for some specific value/utility function is trivial, the exact mechanisms by which we truly instill values into a hybrid model-based and model-free agent (primarily the model-free part) are still not something I completely understand, given my lack of full understanding of how the human motivational and reward system works. Also, even if we should expect reward models to generalize as well as or better than ‘capabilities’, that does not mean they lack adversarial modes, which could be found if e.g. the policy is set up to directly optimize against the reward model without any kind of regularization and there is a significant representational power gap between the models (which is typically the case in current systems). The problems of alignment seem to centre more on making sure we understand how to build sensible homeostatic value systems into powerful agents to ensure sensible optimization, and of course on the discussion about what should be aligned, and then on attempting to create an equilibrium of aligned AIs that is robust to occasional misaligned AIs being created by mistake, by hostile actors, or through value drift.
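
As an illustration of the kind of regularization mentioned above, one commonly used option in RLHF is to subtract a KL-style penalty that grows as the policy drifts away from a fixed reference model. The sketch below assumes that particular choice; the function name, the `beta` coefficient and the example numbers are all hypothetical. The summed log-ratio over the sampled tokens is only a single-sample estimate of the KL divergence, not the exact quantity.

```python
from typing import Sequence


def kl_penalized_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward model score minus beta times the sampled policy/reference log-ratio.

    `policy_logprobs` and `reference_logprobs` are per-token log-probabilities
    of the same sampled response under the two models.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * log_ratio


# A response the reference model also finds likely: no penalty is applied.
on_distribution = kl_penalized_reward(3.0, [-1.0, -2.0], [-1.0, -2.0])

# A response that scores higher on the reward model but that the reference
# model finds very unlikely, as an exploit of the reward model might be.
off_distribution = kl_penalized_reward(5.0, [-0.1, -0.2], [-2.0, -3.0], beta=0.5)

print(on_distribution, off_distribution)
```

With these made-up numbers the second response has the higher raw reward model score but ends up with the lower penalized reward, so the policy has less incentive to chase reward model scores into regions far from the reference distribution, where adversarial modes are more likely to be found.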