Zones of LLM predictability

As you may know, large language models (LLMs) are smack dab in the middle of my tangle of interests at present, so you can bet I spend a lot of time talking with friends and colleagues about them. One lens that has led to fruitful conversations is the predictability of their output.

In this lens, we look at the LLM’s output as something we can predict based on the input – and at the reaction we might have to the outcomes. If we imagine a spectrum where the results are entirely unpredictable at one extreme and can be predicted with utter certainty at the other, we have a space to play in.

For a simple example, let’s suppose we’re asking two different LLMs to complete the sentence “roses are red, violets are …”. If one LLM just returns a bunch of random characters, while the other consistently and persistently says “blue”, we kind of know where we’d place these models on the spectrum. The random-character one goes closer to the unpredictable extreme, and the insistent blue one goes closer to the perfectly predictable end.
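One toy way to make this spectrum concrete (a sketch for illustration only, not how real models are scored) is to treat predictability as the entropy of a next-token distribution. The distributions below are made up to match the two hypothetical models in the example:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-token distributions for the prompt
# "roses are red, violets are ..."
insistent_model = {"blue": 0.98, "purple": 0.02}
random_model = {ch: 1 / 26 for ch in "abcdefghijklmnopqrstuvwxyz"}

print(entropy(insistent_model))  # near 0 bits: highly predictable
print(entropy(random_model))     # log2(26), about 4.7 bits: white noise
```

Low entropy lands a model near the predictable end of the spectrum; maximum entropy is the chaotic bookend.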

For ease of navigating our newly created space, let’s break it down into four zones: chaotic, weird, prosaic, and mechanistic. 

🌫️ Chaotic

In the chaotic zone dwell the LLMs that basically produce white noise. They aren’t really models so much as random character sequence generators. By the way, I asked Midjourney to illustrate white noise, and it gave me this image:

(It’s beautiful, Midge, but not what I asked for)

This zone is only here to bookend the extreme end of the spectrum. Suffice it to say that we humans tend to use white noise only as a means to an end, mostly judging it useless on its own.

🐲 Weird

The adjacent zone is where the model outputs something weird and bizarre, yet strangely recognizable and sometimes even almost right. Remember the whole “hands” thing in the early days of generative imagery? That’s what I am talking about.

(“A normal human hand with five fingers” – whoopsie!)

This zone is where LLMs are at their creative best. Sure, they can’t count fingers, and yes, some – many! – outcomes are creepy and disturbing, but they also produce predictions that sit just outside the norms while retaining enough structure to keep them out of the chaotic zone. And that stirs creativity and inspiration in those who observe these outcomes. This is the zone where a model is more of a muse – odd and mysterious, and not very serious. Yet, paired with the creative mind of a human, it can help produce astounding things.

📈 Prosaic

The prosaic zone is where an LLM produces mostly the results we expect. It might add a bit of flourish in bursts of creativity and insert an occasional (very safe) dad joke, but for the most part, this is the zone that I also sometimes call the “LLM application zone”. If you’ve ever spent time getting your retrieval-augmented generation pipeline to give accurate responses, or to return only code that can actually run – you’ve lived in this zone.

(“a happy software engineer working, stock photo” – oh yes, please! More cliche!)

My own explorations are mostly in this zone. The asymptotes I outlined earlier this year are still in place, and holding. If anything, time has shown that these asymptotes are firmer than I initially expected.
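That “only return code that can actually run” chore could be sketched as a minimal gate like the one below. This is my own illustrative sketch, not any particular product’s pipeline, and it only checks that hypothetical model output parses as Python – real applications go much further (sandboxed execution, tests, linting):

```python
import ast

def looks_runnable(candidate: str) -> bool:
    """A minimal gate for LLM-generated Python: does it at least parse?

    Parsing is a weak proxy for "can actually run" -- it catches syntax
    errors but not runtime failures.
    """
    try:
        ast.parse(candidate)
        return True
    except SyntaxError:
        return False

# Hypothetical model outputs
print(looks_runnable("def add(a, b):\n    return a + b"))  # True
print(looks_runnable("def add(a, b) return a + b"))        # False
```

Keeping model output inside such gates is exactly the work of pulling it toward the prosaic zone.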

⚙️ Mechanistic

Another bookend of the spectrum is the mechanistic zone. At this point, the LLM’s output is so constrained and deterministic that we start to wonder whether using an LLM is even necessary: we might be better off just writing “old school” software that does the job.

The mechanistic zone is roughly the failure case for the current “AI” excitement. Should the next AI winter come, we’ll likely see most of the use cases shift toward this zone: the LLM either constrained, significantly scaled down in size, or entirely ripped out, replaced with code.
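To make the “ripped out, replaced with code” endpoint concrete: once the rhyme completion from earlier becomes fully deterministic, it degenerates into a lookup table. A toy sketch, using the post’s own example:

```python
# "Old school" replacement for an LLM whose output has become fully
# deterministic: the completion is just a lookup table entry.
COMPLETIONS = {
    "roses are red, violets are": "blue",
}

def complete(prompt: str) -> str:
    """Return the canned completion, or an empty string if unknown."""
    return COMPLETIONS.get(prompt.strip().lower(), "")

print(complete("Roses are red, violets are"))  # blue
```

No model weights, no sampling – which is precisely why, in this zone, the LLM stops earning its keep.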

💬 A conversation guide

Now that we have the zones marked in the space, we can have conversations about them. Here are some interesting starter questions that generated insights for me and my colleagues:

  • How wide (or narrow) is each zone? For example, I know a few skeptics who don’t even believe the Prosaic zone exists. For them, its width is zero.
  • How much value will be generated in each band? For instance, the Prosaic zone is where most of the current attention seems to be. Questions like “Can we make LLMs useful at an industrial scale? How much value can LLMs produce?” seem to be on everyone’s mind.
  • What will the value generated in each band look like? What type of value comes out of the Weird zone? What about the Prosaic zone?
  • What kind of advancements – technological or societal – would it take to change the proportions of the zones?

For more adventurous travelers, here are more questions that push the boundaries of the lens:

  • What does “predictable” even mean? If I know English but don’t have the cultural background to recognize the “Roses are Red” ditty, I might find “blue” a perplexing completion. Violets are kind of purplish, actually.
  • What do judgments about predictability of the LLM output tell us about the observer? What can we tell about their expectations, their sense of self, and how they relate to an LLM?
  • What is it that LLMs capture that makes their output predictable? What’s the nature of that information and what might we discern about it?

As you can tell, I am pretty intrigued by the new questions that large language models surface for us. If you’re interested in this subject as well, I hope this lens will be useful to you.
