Current decision theory and almost all AI alignment work assume that we will build AGIs with some fixed utility function that they will optimize forever. This naturally runs the risk of extreme Goodharting: if we do not get exactly the ‘correct’ utility function, then the slight differences between our utility function and that of the AGI will be magnified under sufficient optimization pressure and may ‘come apart’, leading to bad outcomes.
It is interesting at this point to note that this model of an optimizer maximizing a fixed utility function is highly alien to humans and our experience as agentic optimizers. Almost all of our goals are not fixed but can change rapidly, both as we learn new information and as a result of our physiological state 1. Indeed, I would argue that most of the relative ‘safety’ of humans comes from this property. A human who is 100% fanatical about some strange function X would be, and historically has proven to be, quite dangerous.
Ignoring the more complex case of us updating our beliefs and utility functions given new experiences, even the simplest human drives do not follow the fixed utility function pattern. Instead, our most powerful drives, such as those for food, water, sex, and shelter, are strongly homeostatic. If we are hungry, we want to eat food and will optimize towards this objective. However, our desire for food is not fixed, and usually the more food we eat, the less we want more. If you have eaten enough you will typically become full, and getting even more food will become aversive and negatively rewarded. Almost all standard human drives follow this homeostatic negative feedback pattern. What this pattern does is keep humans from optimizing their drives too heavily and maintain them within some healthy distribution. For instance, if evolution had simply hardcoded humans with a drive to always eat, then the moment anybody had a surplus of food they would just keep eating until they bloated themselves and died. Similar situations exist for almost all other basic human drives, such as drinking and sex, where optimizing them ‘too well’ eventually results in negative physiological consequences.
This seems super obvious. Evolving agents that just stuffed themselves until they died would be really stupid. However, this is what current RL agents, and our hypothesized AGIs, would do given a fixed reward function that assigned positive reward to eating food. If we look at this from an alignment perspective, then evolution has already had to solve a miniature version of the problem: naively applying optimization pressure to a fixed proxy reward which is usually good (eating food) diverges from the true objective (physiological health) when it goes too far off distribution (large food surplus), and results in a bad outcome (potential death) 2. To solve this, evolution has built a mechanism that wraps a dynamic reward function in a negative feedback loop, such that success at optimizing the reward function leads to the downweighting of future rewards of that type, and eventually to negative rewards if the optimizer has pushed the reward function sufficiently off distribution. To put it another way, homeostatic drives essentially provide a standard way to implement satisficing behaviour in an optimizer, with a very elegant negative feedback loop design.
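As a minimal sketch of the difference, consider a greedy agent that eats whenever eating is rewarded. The names and constants below (`drive_reduction_reward`, `setpoint`, `meal`, `decay`) are hypothetical and chosen purely for illustration, and rewarding the reduction in squared distance from a setpoint is just one simple way to write such a feedback loop, not a claim about how evolution implements it.

```python
def drive_reduction_reward(level, amount, setpoint=10.0):
    """Reward for consuming `amount` when the internal state is `level`.

    The reward is the reduction in squared distance from the setpoint, so it
    shrinks as the agent fills up and turns negative once it would overshoot.
    """
    before = (setpoint - level) ** 2
    after = (setpoint - (level + amount)) ** 2
    return before - after


def simulate(homeostatic, steps=50, meal=2.0, decay=0.5):
    """Run a greedy agent with unlimited food; return its internal level over time."""
    level = 0.0
    history = []
    for _ in range(steps):
        # Fixed reward: eating is always worth +1. Homeostatic: depends on `level`.
        r_eat = drive_reduction_reward(level, meal) if homeostatic else 1.0
        amount = meal if r_eat > 0 else 0.0  # greedy: eat only if it is rewarded
        level = max(0.0, level + amount - decay)  # digestion lowers the level
        history.append(level)
    return history


fixed = simulate(homeostatic=False)
homeo = simulate(homeostatic=True)
print("fixed reward, final level:      ", fixed[-1])
print("homeostatic reward, final level:", homeo[-1])
```

Under these assumptions the fixed-reward agent's internal level grows without bound for as long as food is available, while the homeostatic agent settles into a narrow band just below the setpoint.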
This kind of dynamic reward with a negative feedback loop is obviously super useful for controlling homeostatic systems, which just require satisficing within some desired range. It could also potentially be generalized into a widely applicable method for ensuring that an AGI optimizing some reward function nevertheless remains on distribution: simply wrap the accomplishment of the ‘base reward function’ in a similar negative feedback loop, such that the better the AGI does, the less utility or reward it gets from continuing to optimize it. If the AGI then has multiple competing goals, eventually the marginal benefit of pursuing that reward function will diminish relative to some other goal, and even with large amounts of optimization pressure the system will not optimize one goal vastly off distribution. This essentially seems to be a general feature of how human goals, even beyond basic physiological needs, are set up: we have a logarithmic hedonic treadmill where the more we accomplish of something, the less valuable it becomes to us. Of course, this approach requires a bunch of hyperparameter tuning to make sure that the thresholds and slope of the treadmill do actually constrain the AGI’s behaviour in the correct way in practice, and that the penalties kick in before major off-distribution behaviour occurs.
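To make the competing-goals argument concrete, here is a toy sketch in which an agent spends a fixed budget of optimization steps, always on whichever goal currently offers the highest marginal reward. The goal weights, the budget, and the choice of `log1p` as the ‘treadmill’ are all assumptions made for illustration.

```python
import math


def allocate(weights, budget, utility):
    """Spend `budget` unit steps, each on the goal with the highest marginal gain."""
    progress = [0] * len(weights)
    for _ in range(budget):
        gains = [w * (utility(p + 1) - utility(p)) for w, p in zip(weights, progress)]
        best = gains.index(max(gains))  # ties go to the earliest goal
        progress[best] += 1
    return progress


def linear(p):
    """Fixed reward: every extra unit of progress is worth the same."""
    return float(p)


def log_treadmill(p):
    """Diminishing reward: each extra unit is worth less than the last."""
    return math.log1p(p)


weights = [1.0, 0.9, 0.8]  # three goals, the first only slightly preferred
linear_alloc = allocate(weights, 100, linear)
log_alloc = allocate(weights, 100, log_treadmill)
print("linear reward:     ", linear_alloc)
print("logarithmic reward:", log_alloc)
```

With linear rewards the slightly preferred goal absorbs the entire budget, whereas with logarithmic rewards the effort is spread across all three goals roughly in proportion to their weights. How well this carries over to a real system depends on exactly the hyperparameter tuning mentioned above.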
To implement this in practice, an additional question is how to actually build agents that optimize for dynamically changing reward functions. Almost all current RL methods and decision theory assume a fixed reward function 3. For model-based planners this is trivial, since you can just swap out the reward function dynamically. For agents with amortized components, such as learnt policies or value functions, this is more difficult. RL approaches such as successor representations and reward bases can allow for somewhat flexible value function amortization. For amortized policies, less theoretical work has been done that I am aware of, although there is a relevant strand of work in meta-learning which tries to learn policies that can be flexibly conditioned on a number of different ‘tasks’, i.e. reward functions. For alignment, it is also necessary to figure out correct and robust ways to build and tune the necessary feedback loops into the reward function, which must include some method of estimating whether the state has gone too far ‘off distribution’.
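The model-based case can be sketched in a few lines. The planner below searches over action sequences using a known transition model and takes the reward function as an argument, so changing the agent's drives requires no relearning. The five-cell world, the food and water locations, and the `make_reward` helper are hypothetical and exist only for this example.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # move left, stay, move right
N_STATES = 5          # cells 0..4 on a line
FOOD, WATER = 0, 4    # hypothetical resource locations


def model(state, action):
    """Known deterministic transition model, clamped to the ends of the line."""
    return min(N_STATES - 1, max(0, state + action))


def plan(state, reward_fn, horizon=4):
    """Return the first action of the best action sequence under `reward_fn`."""
    best_return, best_action = float("-inf"), 0
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = model(s, a)
            total += reward_fn(s)
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action


def make_reward(hunger, thirst):
    """Build a reward function from the current internal drives."""
    def reward_fn(s):
        return hunger * (s == FOOD) + thirst * (s == WATER)
    return reward_fn


# Same planner, same model, same starting cell; only the drives differ.
action_when_hungry = plan(2, make_reward(hunger=1.0, thirst=0.2))
action_when_thirsty = plan(2, make_reward(hunger=0.2, thirst=1.0))
print(action_when_hungry, action_when_thirsty)
```

The planner heads towards food when hunger dominates and towards water when thirst dominates. An amortized policy or value function trained under one setting of the drives would not adapt this way for free, which is the difficulty described above.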
-
It is of course possible to rewrite homeostatic rewards, which depend on some kind of internal state or previous reward history, by ‘decompressing’ the reward function into a static one which depends not just on the state but also on the internal variable: \(r(x,u)\) instead of \(r(x)\), where \(u\) is the internal state. The cross-product reward of state times internal state would then be a fixed function to optimize. Of course, the internal state is typically some function of previous states, \(u = u(x_{t-1:0})\), and hence what this amounts to is having a reward function over all possible trajectories of states. The issue with this in practice is its computational intractability, compared to just handling changing reward functions and designing flexible policies and value functions that can account for them. In general, however, the safety features do not come from ‘static’ vs ‘dynamic’ reward functions but rather from the negative feedback loop implied by the homeostatic mechanism. The more a reward is achieved, the less rewarding it is, and hence other goals are prioritized instead. This is true even in the fully static optimization over histories. ↩
-
It is also easy to think of examples where evolution hasn’t completely solved this. In the case of the physiological need for food, the obvious counterexample is current obesity rates. These are likely due partly to standard imperfections in the mechanism given an effectively unlimited food surplus, and partly to food types having gone off-distribution: calorie-rich food is much more widely available than in the past and also tastes much better, which effectively shifts the reward function up and leaves evolution’s hyperparameters for the negative feedback loop poorly fit. On the other hand, evolution has generally done extremely well with a bunch of physiological drives, so well that we don’t even notice them. For instance, there is no such thing as ‘thirst obesity’, where people drink so much water that they physiologically harm themselves. ↩
-
And they do this for good theoretical reasons. Homeostatic preferences are temporally inconsistent and thus allow agents to be Dutch-booked across time. At a very high level, this is basically what a lot of trading strategies do: you buy things from people when they don’t think they are valuable, and then sell the things back to them once they have become more valuable again. Being temporally Dutch-booked isn’t that bad, however, because the intrinsic time delays limit how rapidly the agent can go round the loop and lose money. ↩