Minor terminological nitpick.
[Read More]
Against ubiquitous alignment taxes
It is often argued that any alignment technique that works primarily by constraining an AI system's capabilities to lie within some bounds cannot work, because it imposes too high an ‘alignment tax’ on the ML system. The argument is that people will either refuse to apply any method...
[Read More]
Fingerprinting LLMs with their unconditioned distribution
When playing around with the OpenAI playground models, I noticed that something very interesting occurs if we study the unconditioned distribution of the models. LLMs are generative models that try to learn the full joint distribution of tokens across text data from the internet and are trained with an autoregressive objective...
[Read More]
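As a concrete illustration of what the "unconditioned distribution" means here, the sketch below is my own (not from the post): it uses Hugging Face transformers with GPT-2 as a stand-in for the playground models, feeds the model only its BOS token, and reads off the next-token probabilities it assigns before any prompt conditioning. The highest-probability unconditioned continuations can then act as a crude fingerprint of the model.

```python
# Minimal sketch: inspect a causal LM's unconditioned next-token distribution.
# GPT-2 is an assumed stand-in; any causal LM with a BOS token works similarly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Condition on nothing but the BOS token.
input_ids = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # logits for the very first token
probs = torch.softmax(logits, dim=-1)

# The top unconditioned continuations serve as a crude model fingerprint.
top = torch.topk(probs, k=10)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)])!r:>15}  {p.item():.4f}")
```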
Validator models: a simple approach to detecting and counteracting Goodharting
A naive approach to aligning an AGI, and the one currently used in SOTA approaches such as RLHF, is to learn a reward model which hopefully encapsulates many features of the ‘human values’ we wish to align an AGI to, and then train an actor model (the AGI) to output...
[Read More]
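To make the setup the excerpt describes concrete, here is a toy sketch (my own assumptions throughout, not code from the post): a stand-in reward model scores the actor's outputs, and the actor is trained purely to maximize that learned reward. This is exactly the step that is vulnerable to Goodharting when the reward model is imperfect, and the step a separate validator model would audit.

```python
# Toy sketch of the reward-model / actor training loop described above.
# RewardModel and Actor are hypothetical stand-ins, not the post's architecture.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Stand-in reward model: maps an action embedding to a scalar reward."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, action_emb: torch.Tensor) -> torch.Tensor:
        return self.score(action_emb).squeeze(-1)

class Actor(nn.Module):
    """Stand-in actor: maps a state embedding to an action embedding."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.policy = nn.Linear(dim, dim)

    def forward(self, state_emb: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.policy(state_emb))

reward_model = RewardModel()  # assumed already fit to human preference data
for p in reward_model.parameters():
    p.requires_grad_(False)   # the reward model is frozen during actor training

actor = Actor()
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Train the actor solely to maximize the learned reward -- the configuration
# that invites Goodharting whenever the reward model misses parts of 'human values'.
for _ in range(100):
    state = torch.randn(64, 32)
    action = actor(state)
    loss = -reward_model(action).mean()  # maximize predicted reward
    opt.zero_grad()
    loss.backward()
    opt.step()
```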
The solution to alignment is many not one
The goal of this post is to argue against a common implicit assumption I see people making: that there is, and must be, a single solution to alignment such that once we have this solution, alignment is 100% solved, and that until we have such a solution, we are...
[Read More]