Epistemic status: Pretty uncertain. This is a model I have been using to think about neural networks for a while; it has some support but is not completely rigorous.

I hear a lot of people talk about scaling laws as if they are a property of specific models or architectures. In this post, I want to push back on this and present my perspective: scaling laws are primarily a property of the data (given a sufficiently universal architecture).

Although the title of this post is perhaps too poetic, the core point is very simple:

The scaling laws (i.e. the power-law scaling of loss with compute, model size, and dataset size) are primarily a property of our datasets rather than of our models.

Certainly, this is a strong statement that needs some nuance. Firstly, although I believe the general shape of the scaling laws is primarily determined by the dataset, the specific scaling coefficients are a function of both the dataset and the model type. That is why, for instance, we can observe ‘superior’ scaling laws for specific architectures (such as MoEs and hybrid SSMs) over base transformers. On the other hand, weaker models such as LSTMs and plain MLPs scale more poorly, and if they are not expressive enough to capture the core properties of the dataset, they may end up scaling worse than a power law. We also observe this with datasets: some datasets are generically harder to predict than others and thus lead to higher loss for a fixed model. For instance, code often has a significantly more regular structure than natural language and hence less inherent entropy, leading to a lower loss for datasets with high proportions of code, all else being equal.

My argument is that once we obtain sufficiently expressive (universal?) architectures, they will all show power-law-like scaling on the large-scale naturalistic datasets of text, images, video, and so on that we train networks on today.[1]

This is because the fundamental structure of these datasets is almost fractal-like, with nearly infinite levels of detail to model: the majority of the variance can be explained by a few key causal factors, but as we push towards greater and greater levels of explanation, there is an increasingly large amount of detail left to be modelled. Mathematically, this ends up looking like a power-law decay in the spectrum of the covariance matrix of the dataset. This can be observed clearly if you take subsets of realistic datasets and compute the spectrum of their empirical covariance. It is also easy to synthetically create datasets with other covariance structures and obtain very different ‘scaling laws’. For instance, on random Gaussian data, models converge at a square-root rate, as implied by the central limit theorem.
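
As a rough illustration, here is a minimal sketch of this claim (assuming NumPy and scikit-learn; `load_digits` is used purely as a convenient stand-in for ‘a realistic dataset’). It compares the covariance spectrum of a small natural image dataset with that of iid Gaussian noise of the same shape: the natural data has a strongly skewed, rapidly decaying spectrum, while the Gaussian control is comparatively flat.

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data.astype(np.float64)           # (1797, 64) flattened 8x8 digit images
X -= X.mean(axis=0)                                  # centre the data
G = np.random.default_rng(0).normal(size=X.shape)    # iid Gaussian control, same shape

def covariance_spectrum(A):
    """Eigenvalues of the empirical covariance matrix, largest first."""
    cov = A.T @ A / A.shape[0]
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

for name, A in [("digits", X), ("gaussian", G)]:
    lam = covariance_spectrum(A)
    lam = lam[lam > 1e-8 * lam[0]]                   # drop numerically-zero directions
    k = np.arange(1, len(lam) + 1)
    # A power law lam_k ~ k^{-alpha} appears as a straight line in log-log space.
    slope = np.polyfit(np.log(k), np.log(lam), 1)[0]
    print(f"{name:9s}  top/median eigenvalue ratio: {lam[0] / np.median(lam):7.1f}"
          f"   log-log slope: {slope:6.2f}")
```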

What neural networks are then doing, on this view, is progressively learning the eigenvalues and eigenvectors of the covariance matrix of the dataset as they are trained, with the speed of learning proportional to the eigenvalue of the feature being learnt. This means that features with much larger eigenvalues, which explain a much larger proportion of the dataset variance, are learnt first. It can be shown mathematically that linear neural networks learn datasets in exactly this manner. My hypothesis is that standard nonlinear neural networks broadly learn in an almost identical way: they progressively learn the spectrum of the distribution, limited by the dataset size and their fundamental capacity.
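
To spell out the linear case referenced above (this is the standard gradient-flow analysis of a single linear map with squared loss, written here in my own notation):

```latex
% Linear model y = Wx with squared loss; \Sigma is the input covariance.
L(W) = \tfrac{1}{2}\,\mathbb{E}_x \,\bigl\lVert (W^{*} - W)\,x \bigr\rVert^{2},
\qquad
\Sigma = \mathbb{E}[x x^{\top}] = \textstyle\sum_i \lambda_i u_i u_i^{\top},
\qquad
\dot{W} = -\nabla_W L = (W^{*} - W)\,\Sigma .

% Writing the error along eigenvector u_i as e_i(t) = (W^{*} - W(t))\,u_i :
\dot{e}_i = -\lambda_i\, e_i
\;\Longrightarrow\;
e_i(t) = e_i(0)\, e^{-\lambda_i t},
\qquad
L(t) = \tfrac{1}{2}\,\textstyle\sum_i \lambda_i \,\lVert e_i(t) \rVert^{2} .
```

Each mode is learned at a rate set by its eigenvalue, and its contribution to the loss is weighted by that same eigenvalue; this second fact is the assumption I lean on further below when relating eigenvalues to loss drops. (The deep linear case is more involved but, as I understand it, gives the same ordering of modes by eigenvalue.)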

This view gives us a neat way to explain what the scaling laws are and why they have the form that they do. Let’s imagine for the moment that there is a ‘one true dataset’ of all natural language, whose covariance matrix has a specific power-law spectrum, and that our existing datasets are all subsets of this true dataset.

Let’s take a model with a given number of parameters. During training, the network learns all the eigenvectors of the dataset in parallel, with a speed proportional to the eigenvalue, as in a linear network. The speed of learning also depends monotonically on the size of the network: for larger networks, all eigenvectors of the dataset are learned faster. For a given model and dataset size, only a certain number of eigenvectors can therefore be learned in the ‘time available’, which corresponds to the total number of tokens used for training. As we increase the model size, we can learn more eigenvectors and hence progress further down the power-law slope of the dataset covariance spectrum, leading to a power-law relationship between loss and model size.

Conversely, if we train for longer (i.e. on a larger dataset, since we are always assuming a single epoch), then we can learn more eigenvectors at a fixed parameter count, also leading to a power-law relationship between loss and dataset size.
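
Here is a toy numerical version of this picture, under assumptions that are mine and purely illustrative (the exponent, the capacity rule, and the learning-rate constant are all made up): the spectrum decays as a power law, each mode is learned at a rate proportional to its eigenvalue, a model with P parameters can hold at most c·P modes, and the loss is the total variance not yet captured. Sweeping model size and token count then produces approximately straight lines on a log-log plot.

```python
# A toy numerical version of the argument above. All constants are illustrative
# assumptions, not fits to real experiments: eigenvalues decay as lam_k ~ k^{-alpha};
# mode k is learned at a rate proportional to lam_k; a model with P parameters can
# represent at most c*P modes; the loss is the total variance not yet learned.
import numpy as np

alpha = 1.5                       # assumed spectral decay exponent
K = 1_000_000                     # total number of modes in the "true" dataset
eta = 1e-3                        # learning-rate-like constant
lam = np.arange(1, K + 1, dtype=np.float64) ** (-alpha)

def toy_loss(P, T, c=1.0):
    """Residual loss of a model with P 'parameters' trained on T 'tokens'."""
    cap = min(int(c * P), K)                                # modes the model can hold at all
    unlearned = lam[:cap] * np.exp(-eta * lam[:cap] * T)    # modes held but not yet learned
    return unlearned.sum() + lam[cap:].sum()                # plus modes beyond capacity

# Sweep model size at a generous fixed token budget: loss falls roughly as
# P^{-(alpha - 1)} in this toy model, i.e. a straight line on a log-log plot.
for P in [100, 1_000, 10_000]:
    print(f"P={P:>6d}  loss={toy_loss(P, T=1e9):.4f}")

# Sweep token count at a generous fixed model size: loss falls roughly as
# T^{-(alpha - 1) / alpha} in this toy model.
for T in [1e5, 1e6, 1e7, 1e8]:
    print(f"T={T:.0e}  loss={toy_loss(K, T):.4f}")
```

Nothing in this sketch knows anything about architecture; the power laws come from the assumed spectrum combined with two linear resources, capacity and training time, which is essentially the claim of this post.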

We assume here that the eigenvalue of each feature of the covariance matrix is proportional to the loss drop that learning this feature gives the model. This makes sense, since the largest-eigenvalue features are the most important, and successfully learning them therefore leads to significantly greater loss drops. On this view, the power-law shape of the scaling laws also implies that the gain in the number of additional features learnt by scaling the model size is linear: a 10x bigger model can learn 10x more underlying features; it is just that these features contribute progressively less to the loss, since the more important features were learnt first.
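
Putting the two assumptions together (power-law eigenvalues, and loss drop proportional to eigenvalue) gives a back-of-the-envelope account of where the exponent comes from: if the model has learnt the top N features, the remaining loss is roughly the sum of the unlearnt eigenvalues,

```latex
\lambda_k \propto k^{-\alpha} \;\; (\alpha > 1)
\quad\Longrightarrow\quad
L(N) \,\approx\, \sum_{k > N} \lambda_k
\,\approx\, \int_{N}^{\infty} k^{-\alpha}\, dk
\,=\, \frac{N^{1-\alpha}}{\alpha - 1}
\,\propto\, N^{-(\alpha - 1)} .
```

So if the number of learnable features grows linearly with parameters (or tokens), the loss falls as a power law whose exponent is set by the spectral decay of the dataset rather than by anything architecture-specific, and a 10x bigger model learns 10x more features while only dividing the loss by a factor of 10^(α−1).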

Perhaps a more intuitive way to think about this is in terms of ‘zooming in’ to a fractal. As we progress down the power-law slope, we zoom in on increasingly small and detailed features. We can think of a neural network as being like a telescope observing a region of space. Having a larger model is like having a larger telescope: it allows us to ‘zoom in’ and resolve finer detail. Similarly, training for longer is like having a longer ‘exposure time’, which also lets us resolve finer detail for a fixed lens size. Due to the natural structure of the dataset, these linear effects get transmuted into power-law decreases in the loss. Assuming that the dataset is full-rank, the spectrum only runs out at the number of data points, and hence the power-law ‘scaling law’ should theoretically continue for as long as the neural network remains underparametrized.

Of course, the scaling laws themselves simply relate the training loss to the network and dataset scale. The relationship between the loss and performance on downstream tasks is not guaranteed to be linear or power-law, and indeed we see that downstream performance remains somewhat hard to predict with increasing scale, although it is mostly monotonic: decreasing loss generally helps downstream performance, but by how much is hard to predict. Indeed, the performance gains of the last two years have primarily come from improved datasets (i.e. matching the training data much more closely to the desired behaviours of the model) rather than from pure pretraining scaling on arbitrary web text.

There also remains, of course, the question of why the covariance structure of large natural datasets should be power-law at all. This is a very general and interesting question, since power laws are observed in a large number of seemingly disparate large-scale phenomena. There is likely a deep explanation for this which I am very interested in, but I don’t have much to say about it here yet. There has certainly been interesting work deriving general maximum-entropy conditions under which power laws arise, as well as work untangling the relationship of power-law dynamics with chaos theory and general supersymmetric stochastic dynamics.

  1. On a fixed dataset, the power law will of course end at the point of overparametrization, once the model has memorised all the structure in the dataset. If we imagine generating an infinite dataset from the same generative process, the power law should in theory last forever.