There are many facets to the alignment problem, but one is to view it as a computer security problem. We want to design a secure system in which to test our AGIs to ensure they are aligned, one which they cannot ‘break out of’. Having such a secure AGI box is necessary to have any...
[Read More]
Probabilities multiply in our favour for AGI containment
This is a short post for a short point. One thing I just realized, which should have been obvious, is that for prosaic AGI containment mechanisms like various boxing variants, simulation, airgapping, adding regularizers like low impact, automatic interpretability checking for safe vs unsafe thoughts, constraining the training data, automatic booby-traps...
[Read More]
Alignment needs empirical evidence
There has recently been a lot of discussion on LessWrong about whether alignment is a uniquely hard problem because of the intrinsic lack of empirical evidence. Once we have an AGI, it seems unlikely we could safely experiment on it for a long time (potentially decades) until we crack alignment....
[Read More]
Empathy as a natural consequence of learnt reward models
Empathy, the ability to feel another’s pain or to ‘put yourself in their shoes’, is often considered to be a fundamental human cognitive ability, and one that undergirds our social abilities and moral intuitions. As so much of humans’ success at becoming the dominant species comes down to our...
[Read More]
The ultimate limits to alignment determine the shape of the long-term future
The alignment problem is not new. We have been grappling with the fundamental core of alignment – making an agent optimize for the beliefs and values of another – for the entirety of human history. Any time anybody tries to get multiple people to work together in a coherent way...
[Read More]