One of the things that make people nervous, awestruck, or both about the development and release of recent AI models is the prospect of them developing “emergent abilities”.
The terminology here can be complicated. Different people mean different things by “emergent abilities”. Here, in the context of large language models (LLMs), we’re talking about the sense that newer and larger models suddenly and surprisingly developed “overnight” an ability to do something that previous versions could not, even though the new model wasn’t trained on that task. In other words, they develop abilities that you did not ask them to develop and could not have predicted they would by extrapolating from what previous LLMs could do.
For instance, it’s been said that at a certain size LLMs suddenly learned how to do arithmetic, even without having been trained to do so. Wei et al.’s paper “Emergent Abilities of Large Language Models” talks about the performance of these models in this sort of way. Here they report on tests of whether the models can do arithmetic operations such as adding and subtracting 3-digit numbers or multiplying 2-digit numbers:
GPT-3 and LaMDA have close-to-zero performance for several orders of magnitude of training compute before performance jumps to sharply above random at 2 · 10^22 training FLOPs (13B parameters) for GPT-3, and 10^23 training FLOPs (68B parameters) for LaMDA. Similar emergent behavior also occurs at around the same model scale for other tasks…
A paper from Bubeck et al. goes further:
We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
This worries certain people because – to be extra-dramatic about it – if these abilities are truly unpredictable then what if the next version of GPT suddenly develops the ability to do something that in the end just happens to destroy humanity? We’d never even see it coming.
Schaeffer et al. are much more skeptical of these claims in their paper “Are Emergent Abilities of Large Language Models a Mirage?”.
They consider that a lot of what is reported as wholly surprising emergent abilities is really down to how the researchers chose to measure the emerging ability; in particular, the use of measures that are categorical rather than continuous. It’s researcher degrees of freedom in action.
From Wikipedia:
Researcher degrees of freedom is a concept referring to the inherent flexibility involved in the process of designing and conducting a scientific experiment, and in analyzing its results. The term reflects the fact that researchers can choose between multiple ways of collecting and analyzing data, and these decisions can be made either arbitrarily or because they, unlike other possible choices, produce a positive and statistically significant result.
If the idea of binary or categorical measures leading to false emergence conclusions isn’t intuitive at this point then we can consider a simpler, non-AI, example. Let’s say we’re aliens looking at mere humans and we’re curious to answer the question “Are humans good at running?”.
We then have to pick exactly what we mean by “good at running”. We could choose to measure it in terms of a continuous measure like “what’s the fastest time they take to cover the distance of a mile?”. Or we could choose a categorical definition such as “Someone can run if they can cover a mile in 4 minutes or less”.
In the first case, if we borrow Wikipedia’s chart showing the fastest recorded miles run by men over time, we can see that, since records began in the year 1850, we’ve made slow and steady improvements in a smoothish fashion. Each new record was no doubt justifiably celebrated. But the new records set in, let’s say, the 1950s aren’t really “surprising” given the trend that had been occurring beforehand.

But if we used the second definition, where running well means “can they cover a mile in 4 minutes or less?”, then we’d see that until Roger Bannister came in at 3 minutes 59 seconds in 1954 we’d have been claiming that humans can’t really run. It’s 0s all the way.
But then something magical and inexplicable must have happened in 1954 – perhaps the release of Elvis’ first single? The publication of Tolkien’s Fellowship of the Ring? – because now humans can run! The ability to run well has emerged “unpredictably”.
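If it helps to see that mechanism laid out, here’s a minimal Python sketch of it. The numbers are made up for illustration rather than being the real record progression, but the shape of the thing is the same:

```python
# Toy illustration (made-up numbers, not the real record progression):
# a smoothly improving continuous measure looks "emergent" once you
# collapse it into a pass/fail threshold.

record_progression = [
    (1900, 255), (1910, 252), (1920, 250), (1930, 247),
    (1940, 246), (1945, 241), (1954, 239), (1960, 235),
]  # (year, fastest recorded mile in seconds)

FOUR_MINUTES = 240  # the arbitrary "can humans run?" cutpoint, in seconds

for year, seconds in record_progression:
    can_run = int(seconds <= FOUR_MINUTES)  # categorical: 1 = "humans can run"
    print(f"{year}: fastest mile {seconds}s -> 'humans can run' = {can_run}")

# The seconds column drifts down steadily; the 0/1 column sits at 0
# for decades and then flips to 1 in a single step.
```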
As it happens, we’d also be saying that no women can run, as the current world record for women running a mile is around 4 minutes 12 seconds, which speaks to a whole different set of problems in addition to that of arbitrary thresholds.
My example is obviously a bit silly, but Schaeffer et al. suggest that this is exactly the kind of thing that is happening in the “AI has emergent properties” research. The emergent abilities documented are the result of certain decisions around measurement being made over others.
As they say:
…one can choose a metric which leads to the inference of an emergent ability or another metric which does not.
And the former is what is oftentimes being done.
…emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance.
They report that most of the claimed emergent abilities on a standard set of tasks are measured in terms of either the choice the system makes out of a set of multiple choice options, or a text response which is compared to the text representing the true answer. In both cases the system is scored as simply right or wrong. There’s no representation of “almost got the answer correct” versus “had absolutely no clue as to the answer”.
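To make that concrete, here’s a quick sketch with hypothetical model outputs, using a plain character-level edit distance as a stand-in for the token edit distance the paper uses. It shows how much information the right/wrong scoring throws away:

```python
# A sketch (hypothetical model outputs) of why all-or-nothing scoring
# discards information: both wrong answers score 0 under exact match,
# even though one is far closer to the truth than the other.

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

true_answer = "2378"
outputs = ["2378", "2478", "9999"]  # exactly right, nearly right, nowhere close

for out in outputs:
    exact = int(out == true_answer)          # categorical: right or wrong
    dist = edit_distance(out, true_answer)   # graded: how far off was it?
    print(f"model said {out}: exact match = {exact}, edit distance = {dist}")
```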
The researchers explain this both theoretically and with real-world examples. Perhaps the most striking one deals with GPT-3’s apparently emerging-out-of-nowhere ability to do math.
The paper from Schaeffer et al. looks at two tasks of this nature: multiplying two 2-digit numbers together, and adding two 4-digit numbers together. Smaller GPT-3 systems couldn’t do this, which, whilst it seems a little amusing at first glance, is entirely fair because they were never explicitly programmed to do math or trained in the theory of math. Humans also can’t do these tasks until they are trained to do so.
But whilst GPT-3 was never explicitly trained to complete these tasks, as the models got bigger and bigger they suddenly started to be able to provide the correct answer to those sorts of questions anyway. Here’s a couple of charts from the paper’s figure 3 that show that. Check out the darker lines, which are the higher digit-length numbers. The left one is performance on the multiplication task, the right one performance on the addition task. The x axis is a measure of model size, and the y axis is the commonly used measure of accuracy – is the answer (exactly) correct or wrong?

At first the model essentially can’t ever answer these questions correctly. But just before it hits a size of 10^10 parameters, it seems to suddenly develop the ability to answer a class of questions it was never explicitly trained on.
But the emergent nature of this is a byproduct of measurement choices. The researchers repeat the exercise, except this time measuring the output in terms of a continuous measure, token edit distance. Here are the same charts but with token edit distance as the y axis.

Instead of inability changing unpredictably to ability at some model size, we see a relatively smooth increase in performance as model size goes up, right from the start. There’s no midpoint of model size where we’d be massively shocked at its score on this task.
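A toy version of the argument – my own simplification, not the paper’s code, with an assumed answer length and some made-up per-token probabilities – shows why a binary exact-match metric can look like a cliff even when the underlying capability is improving smoothly:

```python
# Toy model: suppose the chance of getting each individual token right rises
# smoothly with model scale. Exact-match accuracy on a multi-token answer
# then looks like a cliff, while expected token edit distance falls smoothly.

ANSWER_LENGTH = 8  # tokens in the target answer (assumed for illustration)

# Smoothly increasing per-token correctness at a few hypothetical scales.
per_token_p = [0.20, 0.40, 0.60, 0.80, 0.90, 0.95, 0.99]

for p in per_token_p:
    exact_match = p ** ANSWER_LENGTH                  # every token must be right
    expected_edit_distance = ANSWER_LENGTH * (1 - p)  # roughly: tokens still wrong
    print(f"per-token p = {p:.2f}: "
          f"exact match = {exact_match:.4f}, "
          f"expected edit distance = {expected_edit_distance:.2f}")

# Exact match stays near zero until p is very high, then shoots up;
# expected edit distance just declines steadily the whole way.
```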
In their words:
…the source of emergent abilities is the researcher’s choice of metric, not changes in the model family’s outputs.
They do rightly caveat later that, just because they don’t think any of the emergent abilities demonstrated so far are in fact truly emergent, this doesn’t prove that it’s actually impossible for a large language model to display emergent abilities. There’s no theory that says it’s impossible; rather, there are alternative explanations of what has been observed so far that suggest they haven’t done so yet.
It should also be said that this type of problematic conclusion, derived on the basis of inappropriate dichotomisation, is neither rare nor restricted to the field of artificial intelligence. It’s something of a cardinal sin of analysis that one can see all over the place.
Statistical methods involving measuring or grouping in terms of discrete categories can certainly make for easier-to-do and easier-to-interpret analysis. But they also generally make for worse, less reliable, analysis. At best you’re throwing away information that you should ideally leverage – it is interesting and useful to know that GPT-3 gets closer to the appropriate answer to a math question as it increases in size even if it hasn’t gotten it quite right yet. At worst you produce very misleading results, deliberately or otherwise.
In “The cost of dichotomising continuous variables”, Altman and Royston write about this phenomenon occurring in the world of medical research. The context that the grouping or dichotomisation is happening in is important:
In clinical practice it is helpful to label individuals as having or not having an attribute, such as being “hypertensive” or “obese” or having “high cholesterol,” depending on the value of a continuous variable.
Categorisation of continuous variables is also common in clinical research, but here such simplicity is gained at some cost. Though grouping may help data presentation, notably in tables, categorisation is unnecessary for statistical analysis and it has some serious drawbacks.
These problems include reducing your statistical power, increasing the risk of a false positive, underestimating the variation between groups in whatever you’re studying, concealing any relationship that isn’t linear in nature, and leaving in place certain types of confounding.
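The power loss in particular is easy to demonstrate with a quick simulation. This is just an illustrative sketch with an assumed effect size and sample size, not anything from the Altman and Royston paper:

```python
# Compare analysing a continuous predictor directly vs splitting it at the
# median and comparing "high" against "low". Same data, same true effect.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIMS, N, TRUE_SLOPE = 2000, 50, 0.4  # assumed, purely for illustration

continuous_hits = dichotomised_hits = 0
for _ in range(N_SIMS):
    x = rng.normal(size=N)
    y = TRUE_SLOPE * x + rng.normal(size=N)   # a real, modest linear effect

    # Analysis 1: keep x continuous (correlation test).
    if stats.pearsonr(x, y)[1] < 0.05:
        continuous_hits += 1

    # Analysis 2: median-split x into "low" and "high", then t-test.
    high = x > np.median(x)
    if stats.ttest_ind(y[high], y[~high])[1] < 0.05:
        dichotomised_hits += 1

print(f"power, continuous analysis:   {continuous_hits / N_SIMS:.2f}")
print(f"power, median-split analysis: {dichotomised_hits / N_SIMS:.2f}")
# The median-split version detects the same underlying effect less often.
```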
If for some reason you have no option but to split your continuous variable up, or no access to the actual variable itself, then there are ways to do it that are better than other ways. As Altman and Royston write, a key issue is where you make the split.
Commonly people split groups into two: high and low. “High” is taken to mean the 50% of the sample with the highest values of whatever’s being measured, and “low” is the 50% with the lowest. But this is almost always a bad idea, for several reasons.
For one, it means the split will be in a different place for every sample in every study so the results can’t be properly compared or combined.
And there’s often no special reason to imagine the median is a structurally significant point for whatever mechanism it is that you are seeking to measure.
There is, however, no good reason in general to suppose that there is an underlying dichotomy, and if one exists there is no reason why it should be at the median
Sometimes there are specific recognised and accepted cutpoints people use, an example in the Altman paper being Body Mass Index. The WHO defines “overweight” as being a BMI of at least 25 kg/m² (although it must be said that BMI as a measure, and that threshold in particular, are very much open to debate these days, depending on what you are trying to use it for!).
In general you should provide an explicit justification for the threshold you chose. Sometimes it can’t be helped, and sometimes, rarely, perhaps there is some benefit in cutting up your sample. But you should justify why you did what you did, and why the trade-off was worth it.
This is of course a good rule in general as well as a good rule for this specific type of analytical decision. There could be – and maybe are – a whole raft of papers akin to the famous call to “Justify your alpha” that deal with the various types of researcher degrees of freedom out there.
In general using more categories is better than using fewer – really this often means you’re in effect constructing an ordinal variable – even if it makes analysis more complex.
Finally, what you must never ever do, unless you are an actual supervillain, is to conduct the same analysis repeatedly, shifting the point at which you cut your continuous variable into categories each time until you get a result that you particularly like.
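If you want to see just how villainous that is, here’s one last small simulation – entirely made-up data with no real effect in it at all – showing how often scanning the cutpoints will hand you a “significant” result anyway:

```python
# The supervillain move: try every cutpoint, keep the best p-value, and watch
# "significant" findings appear far more often than the nominal 5%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N_SIMS, N = 2000, 100
false_positives = 0

for _ in range(N_SIMS):
    x = rng.normal(size=N)
    y = rng.normal(size=N)  # x and y are genuinely unrelated

    # Scan candidate cutpoints and keep whichever gives the smallest p-value.
    cutpoints = np.percentile(x, [20, 30, 40, 50, 60, 70, 80])
    best_p = min(stats.ttest_ind(y[x > c], y[x <= c])[1] for c in cutpoints)
    if best_p < 0.05:
        false_positives += 1

print(f"'significant' results with no real effect: {false_positives / N_SIMS:.2f}")
# Well above the 0.05 you would expect from a single, pre-specified analysis.
```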