There are several drafts in my WordPress backend I eventually didn’t bother finishing, partly because I was busy with other things, partly because I feel I’ve wasted enough time and energy writing about AI/LLM/GPT baseline topics here and in my essay on Medium already. But also because I was getting tired of the usual suspects, from antidemocracy-bankrolling techno-authoritarian brunchlord billionaires, to hype-cycle-chasing journos who gullibly savor every serial fabulist’s sidefart like a transubstantiated host, to singularity acolytes on Xittler who have less understanding of the underlying technologies and attached academic subjects than my late Pseudosasa japonica yet corkscrew themselves into grandiose future assertions while flying high from snorting up innuendos of imminent ASI like so many lines of coke.
Thus, I focused on more practical things instead. To start with, I dug into the topic of copyright, with which I’m not yet done. I’m a copyright minimalist and zealous supporter of Creative Commons, Open Science, Library Genesis, Internet Archive, and so on, but you don’t have to be a copyright maximalist to find the wholesale looting of our collective art, science, and culture by that new strain of plantation owners bred from capitalism’s worst actors truly revolting. However, the arguments against the latter have to be very precise against the background of the former, and I’m still at it.
The other thing I did was to focus on aspects (technological, philosophical, social, ethical) that are interesting and a bit more rewarding to think about in the context of large language models. For this roundup, I picked three aspects from the technical side.
The Scaling Process
The first aspect is the scaling process, as observed so far. Making large language models larger leads not only to quantitative improvement but also to qualitative improvement in the form of new abilities like arithmetic or summarization. These abilities are often interpreted as evidence for “emergence,” a difficult (and loaded) term with different definitions in different fields. This recommended article from AssemblyAI from about a year ago is a good introduction, and it also explains why we cannot simply slap the label “emergence” on this scaling process and call it a day. Well, you can define emergent behavior simply as “an ability that is not present in smaller models but present in larger models,” but I believe that’s not only profoundly misguided but actually a cheap parlor trick. While we can’t predict which abilities will pop up at which scale, there’s scant reason to believe that any of them did, do, or will indicate the actual “regime change” that emergence in the stricter sense requires, i.e., a fundamental change of the rules that govern a system. But the fact that, by all sane accounts, no “emergence magic” is involved doesn’t make the scaling process any less intriguing.
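To make the “no regime change required” point a bit more concrete, here is a toy sketch of how an ability can look absent in small models and present in large ones while the underlying quantity changes smoothly. Everything in it is an assumption for illustration: the logistic curve, its midpoint at 10^9 parameters, the independence of tokens, and the function names are all made up, and none of it is measured data.

```python
import math


def per_token_accuracy(log10_params: float) -> float:
    """Hypothetical smooth curve: accuracy rises gradually with model scale.

    The logistic shape and its midpoint (10^9 parameters) are invented
    numbers for illustration, not measurements of any real model.
    """
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))


def exact_match(log10_params: float, answer_length: int) -> float:
    """Chance of getting every token of an answer right.

    Assumes (unrealistically) that tokens are correct independently.
    """
    return per_token_accuracy(log10_params) ** answer_length


if __name__ == "__main__":
    for log10_params in range(7, 13):
        p = per_token_accuracy(log10_params)
        em = exact_match(log10_params, answer_length=10)
        print(f"10^{log10_params} params: per-token {p:.3f}, exact-match {em:.3f}")
```

Under these assumptions, the per-token curve climbs gently across the whole range, while the all-or-nothing score sits near zero for the small models and only becomes visible for the large ones. The rules of the system are the same at every scale; only the yardstick makes it look like a jump.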
Interpolation and Extrapolation
About a year ago, there was this fantastic article “ChatGPT Is a Blurry JPEG of the Web” by Ted Chiang in The New Yorker. The article makes several important points, but it might be a mistake to think that large language models merely “interpolate” their training data, comparable to how decompression algorithms interpolate pixels to recreate images or videos from lossy compression. Data points in large language models, particularly words, are represented as vectors in a high-dimensional vector space, which are then processed and manipulated in the network’s modules and submodules. And here’s the gist: for very high-dimensional vector spaces like the ones large language models operate in, there’s evidence that interpolation and extrapolation become equivalent, so that data processing in large language models is much more complex and much more interesting than mere interpolation.
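A small experiment hints at why “interpolation” is a slippery notion in high dimensions. It assumes one common reading of the term, namely that a new point is interpolated if it lies inside the convex hull of the training points; that definition, the uniform random data, and the function names are assumptions of this sketch, not something the article states. Instead of computing the hull, the code checks the axis-aligned bounding box, which contains the hull, so any point outside the box is certainly outside the hull.

```python
import random


def bounding_box(samples):
    """Per-dimension minimum and maximum of a list of equal-length vectors."""
    lows = [min(column) for column in zip(*samples)]
    highs = [max(column) for column in zip(*samples)]
    return lows, highs


def inside_box(point, lows, highs):
    """True if every coordinate of `point` lies within the box."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, lows, highs))


def fraction_inside(dim, n_train=100, n_test=200, seed=0):
    """Share of fresh uniform points that fall inside the training bounding box."""
    rng = random.Random(seed)
    train = [[rng.random() for _ in range(dim)] for _ in range(n_train)]
    lows, highs = bounding_box(train)
    hits = 0
    for _ in range(n_test):
        point = [rng.random() for _ in range(dim)]
        hits += inside_box(point, lows, highs)
    return hits / n_test


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(f"dim={dim}: {fraction_inside(dim):.3f} of new points inside the box")
```

With independent coordinates, the chance that a new point stays within the box shrinks roughly like ((n-1)/(n+1))^d for n training points in d dimensions, so in a thousand dimensions practically every new point lies outside the box and therefore outside the hull. In that sense nearly everything a model sees in such a space counts as “extrapolation,” which is why the two terms stop being a useful contrast.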
Fractal Boundaries
Finally, a recent paper by former Google Brain and DeepMind researcher Sohl-Dickstein investigates the similarities between fractal generation and neural network training. There’s evidence that the boundary between hyperparameters that lead to successful and unsuccessful training (neural networks) behaves like the boundary between points where function iteration (fractal generation) converges (remains bounded) and points where it diverges (goes to infinity). While the low-dimensional functions iterated in fractal generation differ from the complex functions that act in a high-dimensional space in neural network training, these similarities nevertheless might explain the chaotic behavior of hyperparameters in what Sohl-Dickstein calls “meta-loss landscapes,” the space that meta-learning algorithms try to optimize. In a nutshell: meta-loss is at its minimum at the “fractal edge,” the boundary between convergence and divergence, which is exactly the region where balancing is most difficult. Lots of stuff, all very preliminary, but highly captivating.
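The shared structure of the two procedures can be sketched in a few lines: both repeat a function and ask whether the result stays bounded. The code below is a toy under stated assumptions, not the paper’s setup. The helper names are hypothetical, the fractal side uses the familiar Mandelbrot map z → z² + c, and the training side is plain gradient descent on a one-dimensional quadratic, chosen only because its update is easy to write down.

```python
def stays_bounded(step, start, max_iters=200, bound=1e6):
    """Iterate `step` from `start`; report whether |value| stays below `bound`."""
    value = start
    for _ in range(max_iters):
        value = step(value)
        if abs(value) > bound:
            return False  # treated as divergence
    return True  # treated as convergence (remained bounded)


def mandelbrot_bounded(c):
    """Fractal generation: iterate z -> z*z + c starting from zero."""
    return stays_bounded(lambda z: z * z + c, 0j)


def gradient_descent_bounded(learning_rate, curvature=1.0):
    """Toy training run: gradient descent on f(x) = curvature / 2 * x**2.

    The gradient is curvature * x, so each update is one function iteration
    whose behavior depends on the hyperparameter `learning_rate`.
    """
    return stays_bounded(lambda x: x - learning_rate * curvature * x, 1.0)


if __name__ == "__main__":
    for c in (0, -1, 1, 0.5 + 0.5j):
        print(f"c={c}: bounded={mandelbrot_bounded(c)}")
    for lr in (0.5, 1.5, 2.5):
        print(f"learning_rate={lr}: bounded={gradient_descent_bounded(lr)}")
```

For this quadratic the boundary is a single threshold (the iteration stays bounded only for learning rates up to 2 / curvature), so there is nothing fractal about it. The fractal structure described in the paper concerns actual network training with several interacting hyperparameters; the sketch only shows that “does training succeed?” and “is this point in the set?” are the same kind of question.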