A naive approach to aligning an AGI, and the one currently used in SOTA approaches such as RLHF, is to learn a reward model which hopefully encapsulates many features of the ‘human values’ we wish to align an AGI to, and then train an actor model (the AGI) to output...
[Read More]
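The excerpt above describes the standard two-stage recipe: fit a reward model as a proxy for human preferences, then train an actor against it. As a rough illustration only, here is a minimal toy sketch of that loop in numpy; the reward model is reduced to a fixed vector of learned scores and the actor to a softmax policy updated with REINFORCE, all of which are illustrative assumptions rather than anything taken from the post itself.

```python
# Toy sketch of the two-stage setup: a learned reward model standing in for
# 'human values', and an actor trained to maximize that proxy's score.
# All names and numbers here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 5

# Stage 1 (stand-in): a "reward model" fit from human-preference data.
# Here it is simply a fixed vector of learned scores, one per discrete action.
reward_model = rng.normal(size=NUM_ACTIONS)

# Stage 2: train the actor (a softmax policy over actions) with REINFORCE
# to maximize the reward model's score, not the true human values themselves.
logits = np.zeros(NUM_ACTIONS)
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    action = rng.choice(NUM_ACTIONS, p=probs)
    reward = reward_model[action]          # proxy reward, not ground truth
    grad_logp = -probs
    grad_logp[action] += 1.0               # gradient of log pi(action) w.r.t. logits
    logits += lr * reward * grad_logp      # REINFORCE update

print("reward-model scores:", np.round(reward_model, 2))
print("actor's final policy:", np.round(softmax(logits), 2))
# The actor concentrates probability on whichever action the reward model rates
# highest, which is exactly where reward-model misspecification bites.
```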
The solution to alignment is many not one
The goal of this post is to argue against a common implicit assumption I see people making – that there is, and must be, one single solution to alignment, such that once we have this solution alignment is 100% solved, and while we don’t have such a solution, we are...
[Read More]
Boxing might work but we won't use it
A quick update on my thinking.
[Read More]
Intellectual progress in 2022
2022 has been an interesting year. Perhaps the biggest change is that I left academia and started getting serious about AI safety. I am now head of research at Conjecture, a London-based startup with the mission of solving alignment. We are serious about this and we are giving it our...
[Read More]
Integer tokenization is insane
After spending a lot of time with language models, I have come to the conclusion that tokenization in general is insane, and it is a miracle that language models learn anything at all. To drill down into one specific example of silliness that has been bothering me recently, let’s look...
[Read More]
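To give a flavour of the kind of silliness the excerpt gestures at, here is a small sketch that prints how a standard BPE tokenizer chunks a handful of integers; the `tiktoken` library and the GPT-2 vocabulary are my assumptions for illustration, not necessarily what the post uses.

```python
# Sketch: inspect how plain integers get split into tokens under a standard
# BPE vocabulary (GPT-2 via tiktoken, chosen here purely for illustration).
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for n in [1, 42, 100, 1000, 1234, 12345, 999999]:
    ids = enc.encode(str(n))
    pieces = [enc.decode([i]) for i in ids]
    print(f"{n:>7} -> {len(ids)} token(s): {pieces}")

# Numbers of similar magnitude can be chunked into different numbers of tokens
# with arbitrary split points, so the model never sees digits in a uniform way.
```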