The Unconditioned Distribution of Current Open LLMs

Last year, I wrote a quick post investigating the ‘unconditioned’ distribution of LLMs in the OpenAI API, where the ‘unconditioned distribution’ is simply the distribution of LLM outputs following the empty string – or beginning-of-sequence token. My intuition here was that this gives some idea of what the... [Read More]
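
For open-weight models, sampling this distribution is straightforward: condition on nothing but the beginning-of-sequence token and sample. The sketch below assumes GPT-2 via Hugging Face transformers with illustrative sampling settings; it is not the exact setup from the post.

```python
# Minimal sketch (not the post's exact setup): sample from an open model's
# "unconditioned" distribution by conditioning only on the BOS token.
# Model choice (gpt2) and sampling settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# GPT-2's BOS and EOS share the same token id; it is the only context we give.
bos = torch.tensor([[tokenizer.bos_token_id]])

samples = model.generate(
    bos,
    do_sample=True,              # sample rather than greedy decode
    temperature=1.0,             # leave the model's distribution unmodified
    max_new_tokens=40,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
for s in samples:
    print(repr(tokenizer.decode(s, skip_special_tokens=True)))
```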

Capital Ownership Will Not Prevent Human Disempowerment

When discussing the future of AI, I semi-often hear an argument to the effect that, in a slow takeoff world, despite AIs automating increasingly more of the economy, humanity will remain in the driving seat because of its ownership of capital. This argument posits a world in which humanity effectively becomes a... [Read More]

Integer tokenization is now much less insane

Just over a year ago, I wrote about how integer tokenization using the GPT3 and GPT3.5 tokenizer was insane, in that it failed to create a coherent number representation in token space, such that many integers were each assigned a single unique token, and even multi-token integers were not... [Read More]
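
This behaviour is easy to inspect directly. As a rough sketch (my own choice of encodings for illustration, assuming a recent tiktoken, not necessarily the exact comparison from the post), you can check how many tokens a given integer splits into across successive OpenAI tokenizers:

```python
# Rough sketch: inspect how integers split into tokens across successive
# OpenAI tokenizers via tiktoken. The encodings chosen here are my assumption.
import tiktoken

encodings = {
    "r50k_base (GPT-3 era)": tiktoken.get_encoding("r50k_base"),
    "cl100k_base (GPT-3.5/4 era)": tiktoken.get_encoding("cl100k_base"),
    "o200k_base (GPT-4o era)": tiktoken.get_encoding("o200k_base"),
}

for n in [7, 42, 520, 1000, 12345, 999999]:
    s = str(n)
    parts = ", ".join(
        f"{name}: {len(enc.encode(s))} token(s)" for name, enc in encodings.items()
    )
    print(f"{s:>7} -> {parts}")
```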

Alignment In The Age Of Synthetic Data

Synthetic data is a new frontier in AI training. Phi3, Llama3, and other recent models have demonstrated that large amounts of well-tailored synthetic data can significantly improve the performance of small models, bringing them closer to the frontier by, in effect, cheaply distilling from larger, more powerful models.... [Read More]