Just over a year ago, I wrote about how integer tokenization in the GPT2 and GPT3 tokenizer was insane. This was because it failed to create a coherent number representation in token space: a large number of integers were each assigned their own unique token, and even multi-token integers were not split in a consistent way. For instance, the number 2249 is tokenized as ‘2’ and ‘249’ (1-3), the number 2250 is tokenized as ‘22’ and ‘50’ (2-2), and the number 2251 is tokenized as ‘225’ and ‘1’ (3-1). This method of representing numbers makes it very hard for models to learn arithmetic since, instead of using the consistent decimal algorithms that computers and humans use, the model has to learn special-cased algorithms in an inconsistent token space, including memorizing the outcomes of calculations on the uniquely tokenized integers.
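To see this concretely, here is a minimal sketch using the tiktoken library (assuming it is installed) that reproduces these splits with the GPT2 encoding:

```python
import tiktoken

# Load the GPT2 byte-pair encoding.
enc = tiktoken.get_encoding("gpt2")

for n in ["2249", "2250", "2251"]:
    ids = enc.encode(n)
    # Decode each token id separately to see how the number was split.
    print(n, "->", [enc.decode([i]) for i in ids])

# Prints the inconsistent splits described above:
# 2249 -> ['2', '249']
# 2250 -> ['22', '50']
# 2251 -> ['225', '1']
```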
When this post was written, pretty much the only available models were GPT2 and GPT3. Now, just over a year later, there has been an absolute explosion of new models with new tokenizers, so I thought I would go back and look at what has changed. Largely, it seems that the problem has been solved. Newer tokenizers such as those of Mistral, Llama, Gemma, and GPT4 have a consistent integer-token relationship. This change has probably helped significantly improve the mathematical skills of models, beyond other improvements in scale, architecture, data, etc., and it means that using newer tokenizers likely provides significant benefits in mathematical ability compared to older ones such as GPT-NeoX’s.
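Checking this for yourself is straightforward with the Hugging Face transformers library; a sketch along these lines works, where the checkpoint name is just one illustrative choice (some checkpoints require accepting a license on the Hub first):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any of the model families mentioned above
# could be substituted here.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

for n in ["2249", "2250", "2251"]:
    ids = tok.encode(n, add_special_tokens=False)
    print(n, "->", tok.convert_ids_to_tokens(ids))

# With a digit-level tokenizer, each number should split into one token
# per digit (possibly preceded by a word-boundary marker such as '▁').
```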
Having experimented with the tokenizers of the main open models (and GPT4), the current situation is that there are essentially two strategies now being employed. The first, used by Llama, Mistral, Gemma, Deepseek, and Yi, is to do the obvious thing and match the tokenization to our decimal system: each digit receives its own unique token, and multi-digit integers are split into multiple tokens. I.e. the number 2164 = [2,1,6,4]. This lets models perform arithmetic by learning the decimal algorithms that we understand well.

Interestingly, GPT4 and Llama3 take a different approach. They instead assign the numbers 0->999 unique tokens and then split longer integers into chunks of up to three digits, working from the left. I.e. 2164 -> [216,4], 21645 -> [216,45], and 216456 -> [216,456]. It is unclear why they have chosen this strategy since, while it is consistent, it seems much harder for the model to learn: in effect, it has to learn a three-digit decimal representation with 1000 primitives instead of 10. Nevertheless, it is likely that the model can learn such algorithms, especially larger models with the requisite capacity for this level of memorization. Because it operates on chunks of three decimal digits at a time rather than individual digits, this representation may improve model capabilities on arithmetic with many digits if the model’s capabilities are limited more by the serial depth of the algorithm than by memorization capacity. I.e., addition on a 9-digit number requires nine (or ten with carries) sequential steps in a decimal representation but only 3-4 in this tri-decimal representation. Given that transformers excel at memorization but often struggle with simulating deep serial algorithms, this trade-off may be a good one to make.
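For illustration, here is a toy sketch of the two splitting schemes (not any model’s actual tokenizer code, just the left-to-right chunking rule described above):

```python
def digit_tokens(n: str) -> list[str]:
    # Llama/Mistral/Gemma-style: one token per decimal digit.
    return list(n)

def chunk3_tokens(n: str) -> list[str]:
    # GPT4/Llama3-style: chunks of up to three digits, taken from the
    # left, so only the final chunk can be shorter than three digits.
    return [n[i:i + 3] for i in range(0, len(n), 3)]

for n in ["2164", "21645", "216456"]:
    print(n, "->", digit_tokens(n), "vs", chunk3_tokens(n))

# 2164   -> ['2', '1', '6', '4']            vs ['216', '4']
# 21645  -> ['2', '1', '6', '4', '5']       vs ['216', '45']
# 216456 -> ['2', '1', '6', '4', '5', '6']  vs ['216', '456']
```

The serial-depth saving falls out of this directly: adding two 9-digit numbers token by token takes nine digit-level additions plus carries in the first scheme, but only three chunk-level additions plus carries between chunks in the second.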