Last year, I wrote a quick post investigating the ‘unconditioned’ distribution of LLMs in the OpenAI API, where the ‘unconditioned distribution’ is simply the distribution of LLM outputs following the empty string, or beginning-of-sequence token. My intuition here was that this gives some idea of what the...
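For concreteness, here is a minimal sketch of such a probe, assuming the legacy OpenAI completions endpoint; the model name is illustrative rather than necessarily the one used in the post:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Sample the 'unconditioned' distribution by completing the empty string.
# (Per OpenAI's docs, an unspecified prompt makes the model generate as if
# from the start of a new document, i.e. after the <|endoftext|> boundary.)
resp = client.completions.create(
    model="davinci-002",  # illustrative completions-capable model
    prompt="",
    max_tokens=50,
    temperature=1.0,
    n=5,
)
for choice in resp.choices:
    print(repr(choice.text))
```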
[Read More]
Capital Ownership Will Not Prevent Human Disempowerment
When discussing the future of AI, I semi-often hear an argument along the lines that, in a slow-takeoff world, despite AIs automating increasingly more of the economy, humanity will remain in the driving seat because of its ownership of capital. This posits a world where humanity effectively becomes a...
[Read More]
Integer tokenization is now much less insane
Just over a year ago, I wrote about how integer tokenization using the GPT2 and GPT3 tokenizers was insane. This was because it failed to create a coherent number representation in token space, since large numbers of integers were each assigned their own single unique token, and even multi-token integers were...
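As a quick illustration of the original failure mode, here is a minimal sketch using the `tiktoken` library to inspect how the GPT2 encoding splits integers (the probe values are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the GPT2/GPT3 byte-pair encoding

# Some integers get a single dedicated token, while nearby values split
# into arbitrary multi-token chunks.
for n in [7, 42, 420, 1000, 2023, 123456]:
    ids = enc.encode(str(n))
    pieces = [enc.decode([i]) for i in ids]
    print(f"{n:>7} -> {ids} {pieces}")
```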
[Read More]
Alignment In The Age Of Synthetic Data
Synthetic data is a new frontier in AI training. Phi3, Llama3, and other recent models have demonstrated that large amounts of well-tailored synthetic data can significantly improve the performance of small models, bringing them closer to the frontier by cheaply and implicitly distilling from larger, more powerful models....
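To make the distillation framing concrete, here is a minimal sketch of the loop; the model names and prompts are illustrative assumptions, not the recipe behind any of the models above:

```python
from transformers import pipeline

# A larger 'teacher' model generates synthetic training text...
teacher = pipeline("text-generation", model="gpt2-large")
prompts = [
    "Explain photosynthesis to a ten-year-old:",
    "Summarise why the sky is blue:",
]

synthetic_corpus = []
for p in prompts:
    out = teacher(p, max_new_tokens=100, do_sample=True)[0]["generated_text"]
    synthetic_corpus.append(out)

# ...which is then used as ordinary fine-tuning data for a smaller student
# model, implicitly and cheaply transferring some of the teacher's ability.
```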
[Read More]
Does scaffolding help humans?
Epistemic status: Shower thoughts
[Read More]