The goal of this post is to argue against a common implicit assumption I see people making – that there is, and must be, a single solution to alignment, such that once we have this solution alignment is 100% solved, and while we don’t have such a solution, we are...
[Read More]
Boxing might work but we won't use it
A quick update on my thinking.
[Read More]
Intellectual progress in 2022
2022 has been an interesting year. Perhaps the biggest change is that I left academia and started getting serious about AI safety. I am now head of research at Conjecture, a London-based startup with the mission of solving alignment. We are serious about this and we are giving it our...
[Read More]
Integer tokenization is insane
After spending a lot of time with language models, I have come to the conclusion that tokenization in general is insane, and it is a miracle that language models learn anything at all. To drill down into one specific example of silliness that has been bothering me recently, let’s look...
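As a quick illustration of the kind of silliness the post drills into, here is a minimal sketch, assuming the Hugging Face transformers GPT-2 tokenizer (an assumption; the post’s own example may concern a different model): nearby integers get split into arbitrary, inconsistent chunks.

```python
# Minimal sketch (assumption: the Hugging Face `transformers`
# GPT-2 tokenizer; pip install transformers).
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Print how each integer is split into BPE tokens.
for n in [100, 1000, 1001, 2023, 2249, 2250]:
    print(n, "->", tokenizer.tokenize(str(n)))

# The exact splits depend on the learned BPE merges, but adjacent
# integers are routinely chunked differently, so the tokens carry
# no consistent place-value structure for the model to exploit.
```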
[Read More]
Gradient hacking is extremely difficult
Epistemic Status: This originally started out as a comment on this post but expanded enough to become its own post. My view has been formed by spending a reasonable amount of time trying and failing to construct toy gradient hackers by hand, but this could just reflect me being insufficiently creative...
[Read More]