Re:Better Results
I know this is an alien concept to most people here, but it would be nice if people would actually, you know, read the papers first? I know nobody does this, but could people at least try?
First off, this isn't peer reviewed. So it's not "actual, careful research"; it's "not yet analyzed to determine whether it's decent research".
Secondly, despite what they call it, they're not dealing with LLMs at all. They're dealing with Transformers, but their setup has nothing to do with "language", unless you think language is repeated mathematical transforms on random letters.
It also has nothing to do with "large". Their model that most of the paper is based on is minuscule, with 4 layers, 32 hidden dimensions, and 4 attention heads. A typical large frontier LLM has maybe 128 layers, >10k hidden dimensions, and upwards of 100 or so attention heads.
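For a sense of scale, here's my own back-of-envelope arithmetic using the standard rough estimate of ~12 x layers x hidden_dim^2 parameters for a Transformer (ignoring embeddings; the 12288 frontier hidden size is a GPT-3-class stand-in, my assumption, not from the paper):

    # Rough Transformer parameter-count estimate: ~12 * layers * hidden_dim^2
    # (ignores embeddings; my own ballpark, not from the paper)
    def approx_params(layers, d_model):
        return 12 * layers * d_model ** 2

    toy = approx_params(4, 32)            # ~49k parameters
    frontier = approx_params(128, 12288)  # ~2.3e11, assuming GPT-3-class dims
    print(toy, frontier, frontier // toy) # ratio on the order of millions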
So right off the bat, this has nothing to do with "large language models". It is a test on a toy version of the underlying tech.
Let us continue: "During the inference time, we set the temperature to 1e-5." This is a bizarrely low temperature for an LLM. Might as well set it to zero. I wonder if they have a justification for this? I don't see it in the paper. Temperatures this low tend to show no creativity and get stuck in loops, at least with "normal" LLMs.
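For anyone unclear on why 1e-5 is effectively greedy decoding: temperature divides the logits before the softmax, so as T approaches zero the distribution collapses onto the argmax token. A minimal illustration (mine, not from the paper):

    import numpy as np

    def token_probs(logits, T):
        # Softmax with temperature: divide logits by T first.
        z = np.asarray(logits, dtype=float) / T
        z -= z.max()                # for numerical stability
        p = np.exp(z)
        return p / p.sum()

    logits = [2.0, 1.0, 0.5]
    print(token_probs(logits, 1.0))   # ~[0.63, 0.23, 0.14] - some spread
    print(token_probs(logits, 1e-5))  # [1.0, 0.0, 0.0] - effectively argmax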
They train it with 456976 samples, which is... not a lot. Memorization is learned quickly in LLMs, while generalization is learned very slowly (see e.g. papers on "grokking").
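Worth noticing, by the way (my observation, not something I see spelled out in the paper): 456976 is exactly 26^4, which suggests they're enumerating every possible 4-letter string over the alphabet.

    >>> 26 ** 4    # every possible 4-letter string over A-Z
    456976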
Now here's what they're actually doing. They have two types of symbol transformations: rotations (for example, ROT("APPLE", 1) = "BQQMF") and cyclic shifts (for example, CYC("APPLE", 1) = "EAPPL").
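In code, the two operations are trivial; here's my own Python rendering of their definitions, checked against the examples above:

    def rot(s, k):
        # ROT: shift each letter k places through the alphabet (Caesar-style)
        return "".join(chr((ord(c) - ord("A") + k) % 26 + ord("A")) for c in s)

    def cyc(s, k):
        # CYC: rotate the string itself, moving the last k characters to the front
        k %= len(s)
        return s[-k:] + s[:-k] if k else s

    assert rot("APPLE", 1) == "BQQMF"
    assert cyc("APPLE", 1) == "EAPPL"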
For the in-domain tests, they'll, say, train on ROT and test with ROT. It scores 100% on these. It scores near-zero on the other three setups (sketched in code after the list):
Composition (CMP): They train on a mix of two-step tasks: ROT followed by ROT, ROT followed by CYC, and CYC followed by ROT. They then test with CYC followed by CYC. The belief is that the model should have figured out what CYC does on its own and therefore be able to apply it twice.
Partial Out-of-Distribution (POOD): They train simply on ROT followed by ROT. They then task it to perform ROT followed by CYC. To repeat: it was never trained to do CYC.
Out-of-Distribution (OOD): They train simply on ROT followed by ROT. They then task it to do CYC followed by CYC. Once again, it was never trained to do CYC.
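Using the rot/cyc helpers above, the three regimes boil down to the following train/test splits. The regime names are theirs; the exact data-generation details are my guess at their setup:

    import random

    def two_step(s, f, a, g, b):
        # Apply f with parameter a, then g with parameter b.
        return g(f(s, a), b)

    regimes = {
        "CMP":  {"train": [(rot, rot), (rot, cyc), (cyc, rot)], "test": [(cyc, cyc)]},
        "POOD": {"train": [(rot, rot)],                         "test": [(rot, cyc)]},
        "OOD":  {"train": [(rot, rot)],                         "test": [(cyc, cyc)]},
    }

    # Example: one OOD test item, unlike anything in the OOD training split.
    f, g = regimes["OOD"]["test"][0]
    s, a, b = "APPLE", random.randrange(26), random.randrange(5)
    print(s, f.__name__, a, g.__name__, b, "->", two_step(s, f, a, g, b))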
The latter two seem like grossly unfair tests. Basically, they want this tiny toy model with a "brain" smaller than a dust mite's to zero-shot an example it's had no training on just by seeing one example in its prompt. That's just not going to happen, and it's stupid to think it's going to happen.
Re their CMP example: the easiest way for the (minuscule) model to learn it isn't to deduce what ROT and CYC mean individually; it's to learn what ROT-ROT does, what ROT-CYC does, and what CYC-ROT does. It doesn't have the "brainpower" to "mull over" these problems, nor was it trained to (nor does it have any preexisting knowledge of what a "rotation" or a "cycle" is); it's just learning: problem type 1 takes two parameters, and I apply an offset based on their sum. Problem type 2... etc.
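And you can see why that shortcut works: two ROTs literally are a single ROT by the summed offsets, so the model never needs to represent ROT as a reusable operation at all. A quick exhaustive check with the helpers above:

    # Composing two alphabet rotations is one rotation by the summed offsets,
    # so "ROT followed by ROT" can be memorized as a single offset-addition rule.
    s = "APPLE"
    for a in range(26):
        for b in range(26):
            assert rot(rot(s, a), b) == rot(s, (a + b) % 26)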
The paper draws far too strong conclusions from its premise. The authors make zero attempt to insert probes into their model to see what it's actually doing (à la Anthropic's interpretability work). And it's a Karen's Rule violation (making strong assertions about model performance vs. humans without actually running any human controls).
The ability to zero-shot is not some innate behavior; it is a learned behavior. Actual LLMs can readily zero-shot these problems. By contrast, a human baby who has never been exposed to concepts like cyclic or rotational transformation of symbols could not. One of the classic hard problems is how to communicate with an alien intellect: if we got a message from aliens, how could we understand it? If we wanted to send one to them, how could we get them to understand it? Zero-shotting communicative intent requires a common frame of reference to build off of.