Recent advances show the ML field moving beyond pretrained amortized models and supervised learning into the realm of online reinforcement learning, and hence towards the creation of hybrid direct and amortized optimizing agents. While purely amortized pretrained models have generally proven an easy case for alignment, and we have developed at least moderately robust alignment techniques for them, this paradigm shift brings new possible dangers. Looking even further ahead, as we move towards agents capable of continual online learning and ultimately recursive self-improvement (RSI), the potential for misalignment or destabilization of previously aligned agents grows, and we will very likely need new and improved techniques to reliably and robustly control and align such minds.

In this post, I want to present a high level argument that the move to continual learning agents and ultimately RSI requires us to shift our thinking about alignment techniques from a static frame – e.g. alignment via some fixed RLHF approach – to a dynamic frame of feedback control. Namely, alignment methods that ensure stability during online learning or RSI will require constant dynamic and adaptive adjustments, rather than simply an extremely good static alignment initialization (although a good initialization will of course be very helpful). Additionally, the existing field of control theory handles exactly these kinds of problems and has built a large set of theoretical tools around the design and verification of controllers, tools which I believe likely hold important insights for alignment.

Moreover, I think that either a lack of consideration of feedback control for alignment, or an implicit assumption that it is impossible during continual learning or RSI, has led to some blindspots and potentially unjustified pessimism in alignment theory. Consider, for instance, the very strong focus in classic alignment theory on pre-loading the AGI with exactly the right values before takeoff, the resulting and seemingly intractable problem of specifying those initial values with such precision that the system still hits the tiny aligned ‘value target’ after RSI, and the accompanying concerns about value stability during RSI. These problems are certainly valid and appear (and are) intractable without a notion of feedback control.

To make this point more obvious, let us consider an analogy to a world which is trying to figure out spaceflight and space navigation without a notion of feedback control. Specifically, in this world NASA, SpaceX, etc. are building very large rockets, and their plan to get them to their destination – e.g. Mars – is simply to launch them from Earth in exactly the right direction, so that their natural trajectory of flight, once the initial rocket booster has finished, will lead the rocket to its destination.[1]

While the initial physics calculations looked good, and the equations of motion could be directly solved to yield theoretical final angles of launch, in practice many problems cropped up immediately. For instance, during the rocket launch itself there was considerable turbulence caused by the rapid transit of the rocket through the atmosphere, and moreover there were small but important fluctuations in the thrust coming from the booster rocket. The effect of these disturbances was to shift the final angle of the rocket upon leaving the atmosphere slightly away from its theoretical value, which, given the distance then to be traversed between Earth and Mars, led to catastrophic divergences of millions of kilometers between projected and actual trajectories at the destination.

Based on these results, and on theoretical considerations from chaos theory, the unknown turbulence of the upper atmosphere, and possible stray magnetic fields in space itself, many physicists argued that this trajectory alignment problem was unsolvable: computing the exact angle at which to launch the rocket so as to hit Mars was impossible. Given such compounding uncertainties in the trajectory, there was no way you could leave Earth and hit Mars except by accident, and given that the volume of Mars is minuscule compared to the volume of not-Mars, any such attempt would mean almost certain death for the astronauts aboard. Some scientists went even further and argued that this fundamental difficulty made targeted spaceflight impossible altogether. They presented attempted proofs showing that any uncertainty in calculation or in initial conditions in the upper atmosphere must propagate exponentially through all dynamics calculations, making any attempt to hit any destination impossible, meaning that humanity can never leave its home planet and must eventually die when the sun burns out.

Of course, in reality, spaceflight is totally possible, and the ability to navigate a spacecraft to a specific destination is relatively trivial compared to the myriad other problems involved in space missions. But how do we do this? It is not by having the most exactingly precise calculations of rocket launch angle. NASA and SpaceX do not agonize over billionths of a degree when launching their rockets. What do we do differently? We use feedback control to steer our rockets. Specifically, we equip the rockets with sensors and actuators that enable them to sense deviations from their ideal course and correct them in an online and adaptable fashion, crucially without requiring advance knowledge of the particular deviations that will occur.

Although it sounds absurd, it’s important to realize that the logic above is not actually wrong. Attempting to hit Mars by launching a rocket at juuuust the right angle is indeed a fundamentally doomed endeavour, for exactly these reasons. The key is that this hypothetical world is missing the very important concept of feedback control. To make spaceflight actually work, we don’t just shoot up a rocket and then let it go on its way; we actively provide continual feedback through thrusters, used intelligently and adaptively, to control its course and correct any deviations that occur due to variable or unknown conditions, modelling errors, hardware issues, etc. By using feedback control, we quite easily and straightforwardly beat the curse of chaos theory – where fluctuations grow exponentially in uncontrolled systems – by deliberately designing systems that dampen and ultimately eliminate such fluctuations. This is in fact how almost any temporally extended goal-driven process works, if it succeeds. You do not drive to work by calculating a set of precise muscle movements (turns, accelerations, brakes, etc.) and their precise timings ahead of time, then sitting in your car, closing your eyes, and simply executing those movements. If you did, you would almost certainly crash very quickly. Instead, you open your eyes, look at your surroundings, and make minor adjustments to your course to overcome the many miscellaneous challenges that occur on every single drive and which are not at all predictable a priori. Open-loop planning is generally impossible over many steps because of inherent uncertainties and stochasticities in the world and the exponential compounding of errors. All realistic policies are closed-loop and adaptive.
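To make the open-loop/closed-loop contrast concrete, here is a minimal toy sketch (my own illustration, not drawn from any real guidance code): a slightly unstable scalar system in which disturbances compound exponentially under a fixed open-loop plan, but stay bounded under even a crude proportional feedback correction.

```python
# Toy sketch: open-loop vs. closed-loop control of a slightly unstable scalar system
#   x_{t+1} = a * x_t + u_t + noise,  with a > 1,
# so that uncorrected disturbances compound exponentially over time.
import numpy as np

rng = np.random.default_rng(0)
a, target, steps, noise_std = 1.05, 0.0, 200, 0.01

def simulate(feedback_gain):
    """feedback_gain = 0 is a pure open-loop 'launch and pray' plan; > 0 is closed-loop."""
    x, xs = 0.0, []
    for _ in range(steps):
        u = -feedback_gain * (x - target)           # correction based on the observed state
        x = a * x + u + rng.normal(0.0, noise_std)  # dynamics plus unmodelled disturbance
        xs.append(x)
    return np.array(xs)

open_loop = simulate(0.0)     # errors compound: the trajectory drifts far off course
closed_loop = simulate(0.5)   # the same disturbances are continually damped back toward the target

print(f"final open-loop deviation:   {abs(open_loop[-1]):.3f}")
print(f"final closed-loop deviation: {abs(closed_loop[-1]):.3f}")
```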

While this rocket launch example is obviously silly, it feels like a lot of the alignment theory around RSI takes a similar approach. We implicitly assume that feedback control during RSI is impossible, or do not even consider it at all. Because of this, it becomes overridingly important to get a complete value specification perfectly encoded ‘at launch time’ and then hope that this specification remains perfectly intact during RSI, so as to produce an aligned superintelligence at the end. However, because the relatively unknown process of RSI can cause all kinds of value fluctuations, and because the tiny final ‘value target’ of an aligned AI is dwarfed by the extremely large volume of misaligned AIs, alignment then appears intractably difficult and this approach appears almost certainly doomed. And it is doomed! Just as doomed as trying to hit Mars by launching a rocket at just the right angle. But this is irrelevant, because such problems might become very tractable indeed by using feedback control.

Feedback control has a number of extremely nice properties.

1.) Massively reduces sensitivity to initial conditions. Uncontrolled systems (especially complex nonlinear ones) are almost always extremely sensitive to initial conditions; this is chaos theory. Controlled systems, however, are generally quite robust, and often extremely so. SpaceX do not really care exactly which angle they launch the rocket at – errors at the beginning can almost always be corrected later. Of course, some errors are truly egregious (such as launching your rocket pointing straight down at the ground) and cause disaster before the control system can correct them. This is obviously useful in an alignment context, where it may be possible to have a sensible, but not perfect, initial value set at the start of RSI, and then apply feedback to keep the value set within some stable and aligned region. I think it is fairly clear that there exist decently well-aligned models today (such as Claude) which perhaps start out within this initial set.

2.) Robustness to modelling error. Control methods do not require extremely detailed and correct models of the phenomenon being controlled. In theory, having a perfectly accurate dynamics model is best and gives you the optimal controls. In practice, however, relatively simple approximations such as PID control or linearizing the dynamics work extremely well in many cases. Such approximations are especially good when you start out near the optimum, since then fluctuations can be linearized, meaning they can essentially be modelled as lying within a convex basin and optimized against using standard LQR methods. This is obviously a very useful property when we have relatively little idea what the actual ‘dynamics’ of RSI will look like in practice. What we learn from control theory is that a full understanding of the dynamics largely does not matter, as long as we can keep the system near a fixed point where decent linear approximations hold. The beauty of control theory is that it acts in a self-fulfilling loop: control keeps the system stable, and a stable system has sane, linearizable dynamics which can be successfully controlled against perturbation. If we start in a sensible place then, almost by induction, we can remain in a sensible place.

3.) Controllers are often much simpler than the systems they control. Extremely simple controllers such as PID controllers can often successfully control complex nonlinear dynamical processes that would require significantly more computing power to simulate than to run the PID controller itself. See here for a fun example of a simple PID controller controlling a nonlinear cartpole problem. Another clear example is that the extremely simple controllers in thermostats successfully control highly nonlinear temperature dissipation dynamics in all sorts of settings, using the simplest feedback loops imaginable. More speculatively, we also see this to some extent in biology and neuroscience, where the behaviour of complex animals (including us!) is controlled pretty successfully (although not entirely fool-proofly) by extremely simple feedback controllers such as those involved in hunger and thirst responses. This has obvious relevance to AI alignment, where we will almost certainly be designing relatively ‘dumb’ systems (which may be the SOTA AI systems of the near future) to try to control superintelligent ones. This suggests that doing so is far from impossible in practice.
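To make the thermostat point concrete, here is a minimal sketch of a PID loop holding a temperature setpoint against nonlinear heat-loss dynamics. The dynamics model, gains, and constants are all invented for illustration; the point is only that a few lines of controller can stabilize a system they do not model in any detail.

```python
# Minimal PID sketch in the spirit of the thermostat example: the 'room' has nonlinear
# heat-loss dynamics that the controller knows nothing about. All constants are made up.
setpoint, dt = 21.0, 1.0           # target temperature and timestep (arbitrary units)
kp, ki, kd = 0.5, 0.01, 0.1        # hand-tuned PID gains

def room(temp, heat_in, outside=5.0):
    """Nonlinear heat loss: a convective term plus a quadratic term."""
    loss = 0.1 * (temp - outside) + 0.002 * (temp - outside) ** 2
    return temp + dt * (heat_in - loss)

temp, integral, prev_error = 12.0, 0.0, setpoint - 12.0
for _ in range(500):
    error = setpoint - temp
    integral += error * dt
    derivative = (error - prev_error) / dt
    heat_in = max(0.0, kp * error + ki * integral + kd * derivative)  # the heater cannot cool
    temp = room(temp, heat_in)
    prev_error = error

print(f"temperature after 500 steps: {temp:.2f} (setpoint {setpoint})")
```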

Thinking about the RSI problem from a control theory perspective brings up a number of fairly obvious yet unanswered questions. Namely:

1.) What is our observation space? What are our sensors, in alignment terms? Obviously feedback control is much easier when we can see what we are doing. Any feedback control method during RSI needs a regularly and rapidly run set of evaluations to detect any misalignment within the model. Ideally these tests would cover most of the dimensions along which misalignment can occur, and could be run quickly and reliably. Unlike in the physical world, observability is theoretically almost total in the AGI case, since humanity is likely to control the entire physical and computational substrate of the AGI, as well as have complete access to its weights and activations and the code being run. In practice, full observability will likely require significant advances in interpretability.

2.) What is our actuation space? How can we adjust the model to move it back towards an aligned state? Neural networks actually offer a wealth of possible ‘actuators’, such as directly doing backprop on the model, activation steering vectors, model editing and concept removal, various interpretability-based interventions, etc. The control theory concepts of observability and controllability are useful, if fairly straightforward, here (a minimal sketch of the standard rank conditions follows this list). To fully control the system we need to ensure that our actuators span the possible orthogonal dimensions of variation.

3.) The importance of defining a target and deviations from the target. The cornerstone of all control theory is the idea of having a set-point and designing a controller to reduce the deviation between the state and the set-point. We need to be able to mathematically define and computationally implement these concepts in the alignment case. We also require at least a crude understanding of the dynamics and of how our actuators affect the system state. Even PID control assumes that the relationship between the actuator and the state is positive and monotonic: if you have a dial labelled ‘go left’ and you turn it, all else equal you should go left, and the more you turn up the dial the more you should go left. The actual specifics of the relationship beyond this can be arbitrary and nonlinear, but if you turn the ‘go left’ dial and you go right, then you are obviously doomed from the beginning. We need actuators that are at least this reliable, though this is very far from a full understanding of the relevant dynamics.

4.) The strongly negative effects of delayed responses, and mitigations for them. A key concept in control theory, for obvious reasons, is that of delay. There is always a delay between a perturbation arising, it being detected by the sensors, and the controller outputting a response. Long delays between system and controller can make a controllable system uncontrollable (the first sketch below gives a toy example). Control theory has a toolbox of techniques for handling delay, since real physical systems always have some amount of delay, and often a considerable amount. One very basic approach, implicit in PID control, is derivative control: by changing the control signal based on the derivative of the fluctuations, we can smooth out shocks temporally and somewhat ‘anticipate’ and mitigate the effects of delay. The integral term in PID, meanwhile, counteracts systematic modelling error which would otherwise lead to persistent deviations from the setpoint. Unlike the physical world, which acts on its own timescale regardless of your controller, it is theoretically much easier to remove delays from an AGI RSI process, simply by pausing the AGI for as long as we need after each round of RSI. If the AGI is even ‘mostly aligned’ it should ‘want’ to be paused, or at least not act strongly adversarially against such a process. Even if the AGI cannot be paused for whatever reason, if RSI occurs over human-scale timescales such as weeks to months, which seems likely if it involves extensive finetuning or retraining of models, then it should be straightforward for humans to be involved directly in the control loop, and for sophisticated evaluations to be run at every ‘tick’ of RSI.
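To illustrate point (4) with a toy example (all numbers invented): below, the same proportional controller that easily stabilizes a slightly unstable scalar system when its corrections land immediately becomes unstable once those corrections arrive several steps late, because it keeps over-correcting against stale information.

```python
# Toy illustration of how delay degrades feedback control: an unstable scalar system
#   x_{t+1} = a * x_t + u_{t-delay},  with u_t = -gain * x_t.
# With no delay the loop is easily stable; with delayed corrections it oscillates and diverges.

def simulate(delay, gain=0.8, a=1.05, steps=200, x0=1.0):
    x, queue = x0, [0.0] * delay     # queue of in-flight (not yet applied) control actions
    for _ in range(steps):
        queue.append(-gain * x)      # control computed from the current state...
        x = a * x + queue.pop(0)     # ...but only applied `delay` steps later
    return abs(x)

for delay in [0, 2, 5, 10]:
    print(f"delay = {delay:2d}:  |x| after 200 steps = {simulate(delay):.3g}")
```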
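Relatedly, to make the observability and controllability concepts from point (2) precise in the standard linear-systems setting: the actuators (columns of B), propagated through the dynamics, must span every direction of state variation, and the sensors (rows of C) must be able to distinguish them. A minimal sketch of the usual Kalman rank conditions (standard control theory, nothing alignment-specific):

```python
# Kalman rank conditions for a linear system x_{t+1} = A x_t + B u_t,  y_t = C x_t.
# Controllable iff [B, AB, A^2 B, ...] has full rank; observable iff the analogous
# stack of C, CA, CA^2, ... does.
import numpy as np

def controllability_matrix(A, B):
    n = A.shape[0]
    return np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])

def observability_matrix(A, C):
    n = A.shape[0]
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # discrete-time double integrator
B = np.array([[0.0], [1.0]])             # we can only push on the velocity...
C = np.array([[1.0, 0.0]])               # ...and only measure the position

n = A.shape[0]
print("controllable:", np.linalg.matrix_rank(controllability_matrix(A, B)) == n)
print("observable:  ", np.linalg.matrix_rank(observability_matrix(A, C)) == n)
```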

In general, it seems clear to me that if we are going to succeed at aligning a model through an RSI process then it will have to be via some feedback control approach. Zero-shotting it with a super precise initial condition is almost certainly doomed given the enormous uncertainties at play in the dynamics of RSI; here I agree with the classic AI alignment arguments. At the same time, directly stopping or pausing AI progress looks very unlikely, if not impossible, given the enormous investments into and high rate of diffusion of AI technologies at present. And while it is still unclear whether rapid, self-contained RSI of the kind originally envisioned is possible, or whether takeoff will be slower and more widely distributed, it is very important to plan for and try to solve both cases.

If we do have a single model/AI system undergoing RSI, the feedback control approach broadly suggests the following directions, which we should study and investigate.

1.) Start out with a well aligned and corrigible model. The goal will be to maintain these properties during the RSI process, and hopefully the controller will be helped directly, or at least not adversarially optimized against, by the AI as long as it stays reasonably well aligned. A well-aligned AI wants to maintain its alignment.

2.) Ensure we have sufficient observability into the system (the AGI). This should include a variety of different ‘sensors’ such as interpretability methods, evaluations, red-teaming, behavioural simulation tests, etc. The goal is to cover as many independent dimensions of possible value perturbation as possible, and to allow us to quantitatively measure the degree of misalignment and divergence from the ‘aligned region’. Ideally these tests would also be quick and efficient to run compared to the timescale of the RSI dynamics itself. If RSI involves techniques that are standard today, such as finetuning or training new models, then this is likely; but if RSI corresponds to much more rapid algorithmic changes then we would need to either slow down the RSI process itself or reduce the latency of the sensors.

3.) Ensure we have reliable and decently well understood ‘actuators’ which cover the majority of dimensions of fluctuation. We need methods that, given a misaligned model, can with decent reliability guide it back to being aligned. These can be as simple as existing finetuning/RLHF/RL methods or could be more exotic. Similarly, we need to ensure that these methods can operate on a timescale equal to or faster than the RSI process itself. In the ideal case we can iteratively pause the RSI process at each step, check for misalignment, and then apply mitigations regularly (a schematic sketch of such a loop follows this list).

4.) Try to gain an empirical and theoretical grasp on at least the rough contours of the dynamics and value/alignment perturbations likely to occur during RSI.
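Pulling these directions together, here is a deliberately schematic toy sketch of what the iterative ‘pause, measure, correct’ loop might look like. Nothing in it is a real method or API: the ‘value system’ is just a scalar deviation from an aligned setpoint that drifts at each RSI tick and is pulled back by corrective steps, and all constants are invented. The point is only the shape of the loop, and the contrast with letting the same drift accumulate uncorrected.

```python
# Schematic toy model of a 'pause, measure, correct' control loop around RSI.
# All functions and constants are hypothetical placeholders for illustration only.
import numpy as np

rng = np.random.default_rng(0)
THRESHOLD = 0.1          # hypothetical: max tolerable deviation from the aligned setpoint
DRIFT_STD = 0.2          # hypothetical: how much each RSI tick perturbs the values
CORRECTION_GAIN = 0.7    # hypothetical: how effective one corrective step is
MAX_CORRECTIONS = 20     # hypothetical: halt and escalate if corrections stop converging

def rsi_tick(deviation):
    """One paused round of self-improvement: capabilities grow, values drift a little."""
    return deviation + rng.normal(0.0, DRIFT_STD)

def corrective_step(deviation):
    """One actuation (e.g. corrective finetuning): imperfectly pulls back toward alignment."""
    return (1.0 - CORRECTION_GAIN) * deviation

deviation = 0.0
for tick in range(100):
    deviation = rsi_tick(deviation)      # self-improve, then pause and measure
    corrections = 0
    while abs(deviation) > THRESHOLD:    # actuate until back inside the aligned region
        deviation = corrective_step(deviation)
        corrections += 1
        if corrections >= MAX_CORRECTIONS:
            raise RuntimeError("corrections not converging; halt RSI and escalate to humans")

uncontrolled = abs(rng.normal(0.0, DRIFT_STD, size=100).sum())   # same drift, no feedback
print(f"controlled deviation after 100 ticks:   {abs(deviation):.3f}")
print(f"uncontrolled deviation after 100 ticks: {uncontrolled:.3f}")
```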

All of these approaches should ideally be tested first on less powerful systems, as well as in toy models of the RSI process. In general, I feel the alignment community would be well served by thinking about methods to directly monitor, measure, and control an AI’s value system and degree of alignment, both at present and in an iterative fashion during an RSI process. I also feel that control theory is an understudied area with potentially many interesting insights for alignment theorists and practitioners, and is worthy of further study. At the very least I recommend this lecture series as an engaging and accessible overview of the basics of the field.

[1] I have played enough KSP to know that this is not how you should actually send a rocket to Mars, but bear with me for now.