A quick update on my thinking.
Intellectual progress in 2022
2022 has been an interesting year. Perhaps the biggest change is that I left academia and started getting serious about AI safety. I am now head of research at Conjecture, a London-based startup with the mission of solving alignment. We are serious about this and we are giving it our...
Integer tokenization is insane
After spending a lot of time with language models, I have come to the conclusion that tokenization in general is insane and it is a miracle that language models learn anything at all. To drill down into one specific example of silliness which has been bothering me recently, let’s look...
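To make the flavor of the problem concrete, here is a minimal sketch (mine, not from the post) that runs a few integers through the GPT-2 BPE tokenizer via the Hugging Face transformers library. The exact splits depend on the learned vocabulary, but nearby integers routinely break into different numbers of pieces:

```python
# Minimal sketch (not from the original post): inspect how GPT-2's BPE
# tokenizer splits integers. The exact splits depend on the learned
# vocabulary, but adjacent numbers often tokenize very differently.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

for n in [100, 999, 1000, 2023, 12345]:
    ids = tok.encode(str(n))
    pieces = [tok.decode([i]) for i in ids]
    print(f"{n:>6} -> {len(ids)} token(s): {pieces}")
```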
Gradient hacking is extremely difficult
Epistemic Status: This started out as a comment on this post but expanded enough to become its own post. My view has been formed by spending a reasonable amount of time trying and failing to construct toy gradient hackers by hand, but this could just reflect me being insufficiently creative...
Creating worlds where iterative alignment succeeds
A major theorized difficulty of the alignment problem is its zero-shot nature. The idea is that any AGI system we build will rapidly be able to outcompete its creators (us) in accumulating power, and hence if it is not aligned right from the beginning then we won’t be able to...