A quick update on my thinking.
For a while I have been more bullish than the LW norm on techniques like boxing to contain an AGI. The goals would be to prevent rapid capability gains from RSI and to let us iteratively test various alignment procedures in simulation, or at least under some degree of powerful monitoring and oversight. I feel that such techniques could give us a pretty high probability of survival, because many redundancies and independent lines of safety can be built in, and because the information asymmetry between the AGI and us is huge: even superintelligence does not grant omniscience. This means that it is extremely hard for the AGI to anticipate and perfectly defeat all of our countermeasures against escape, and the less it knows about our defences, the better.
I no longer feel that techniques like this are particularly feasible in practice. This is not due to any technical argument, but simply because I do not believe that people will actually use boxing well, or even attempt to use it. One would hope that, as a sensible species, if we develop extremely powerful AGI technologies with capabilities we are nowhere close to understanding, we would not rush to immediately deploy them with direct access to the internet, and ideally would not train them on a corpus full of ML papers and alignment posts. However, clearly that ship has sailed. OpenAI/Microsoft have already deployed what appears to be GPT-4 with direct access to the internet into Bing. Libraries like LangChain make adding internet access and tool use to LLMs easy and accessible to anybody with an open-source language model, which is everyone who can afford a few GPUs. Race dynamics are extremely well established, and the open-source community currently lags SOTA by a year or two at most. Clearly, by the time we build AGIs, the default will be immediate deployment to the internet at large, probably even during training. Most likely we will allow the AGI to make arbitrary requests during training, ostensibly so that it can gather information and clarify confusions. In practice, this will just give it a massive attack channel to use for whatever purpose. Fun times lie ahead.