Four layers

There seems to be some layering rhythm to how software capabilities are harnessed to become applications. Every new technology tends to grow these four layers: Capabilities, Interfaces, Frameworks, and Applications.

There does not seem to be a way of skipping or short-cutting around this process. The four layers grow with or without us. We either develop these layers ourselves or they appear without our help. Understanding this rhythm and the cadence of layer emergence could be the difference between flopping around in bewilderment and growing a durable successful business. Here are the beats.

⚡ Capabilities

Every new technological capability usually spends a bit of time in a purgatory of sorts, waiting for its power to become accessible. It needs to traverse the crevasse of understanding: move from being grokable by only a handful of those who built it to some larger audience. Many technologies dwell in this space for a while, trapped in the minds of inventors or in the hallways of laboratories. I might be stating the obvious here: the process of inventing something is not enough for it to be adopted.

I will use large language models as the first example in this story, but if you look closely, most technological advances follow this same rhythm. The transformer paper and the general capability to build large models have been around for a while, but until last year, they were mostly confined to the few folks who needed to understand the subject matter deeply.

🔌 Interfaces

The breakthrough usually comes in the form of a new layer that emerges on top of the capability: the Interfaces layer. This is typically what we see as the beginning of the technology adoption growth curve. The Interfaces layer can be literally the API for the technology or any other form of simplifying contract that enables more people to start using the technology.

The Interfaces layer serves as the democratizer of the Capabilities layer: what was previously only accessible to the select few – be that due to the complex nature of the technology, capital investment costs, or some other barrier – is now accessible to a much larger group. This new audience is likely still fractionally small compared to all the world’s population, but it must be numerous enough for the tinkering dynamic to emerge.

This tinkering dynamic is key to the success of the technology. Tinkerers aren’t super-familiar with how the technology works. They don’t have any of the deep knowledge or awareness of its limits. This gives them a tremendous advantage over the inventors of the technology – they aren’t trapped by preconceived notions of what this technology is about. Tinkerers tinker. Operating at the Interfaces layer, they just try to apply the tech in this way and that and see what happens.

Many research and development organizations make a crucial mistake by presuming that tinkering is something that a small group of experts can do. This usually backfires, because for this phase of the process to play out successfully, we need two ingredients: 1) folks who have their own ideas about what they might do with the capabilities and 2) a large enough quantity of these folks to actually start introducing surprising new possibilities.

Because of this element of surprise, tinkering is a fundamentally unpredictable activity. This is why R&D teams tend to not engage in it. Especially in cases when the societal impact of technology is unclear, there could be a lot of downside hidden in this step.

In the case of large language models, OpenAI and StabilityAI were the ones who decided that this risk was worth it. By providing a simple API to its models, OpenAI significantly lowered the barrier to accessing LLM capabilities. Similarly, by making their Stable Diffusion model easily accessible, StabilityAI ushered in a new era of tinkering with multimodal models. They were the first to offer an Interfaces layer for large language models.
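To make the scale of that shift concrete, here is roughly what tinkering at this Interfaces layer looks like in practice – a minimal sketch using the openai Python package as it existed around this time; the model name and prompt are placeholders of my own choosing:

import openai

openai.api_key = "YOUR_API_KEY"  # a hypothetical key

# A handful of lines stand in for what used to require a research lab.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest three names for a hiking club."}],
)
print(response["choices"][0]["message"]["content"])

The point is not this particular snippet, but its size: the entire Capabilities layer is elided behind a single call.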

Because it’s born to facilitate the tinkering dynamic, the Interfaces layer tends to be opinionated in a very specific way. It is concerned with reducing barriers to entry. Just like any layer, it does so by eliding details: to simplify, some knobs and sliders are no longer accessible to the consumer of the interface.

If the usage of the Interfaces layer starts to grow, this indicates that the underlying Capabilities layer appears to have some inherent value, and there is a desire to capture as much of this value as possible.

🔋 Frameworks

This is the point at which a new layer begins to show up. This third layer, the Frameworks, focuses on utility. This layer asks: how might we utilize the underlying Interfaces layer in more effective ways, and make it even more accessible to an even broader audience?

Utility might mean different things in different situations: in some, the value of rapidly producing something that works is the most important thing. In others, it is the overall performance or reliability that matters most. Most often, it’s some mix of both and other factors.

Whatever it is, the search for maximizing utility results in development of frameworks, libraries, tools, and services that consume the Interfaces layer. Because there are many definitions of utility and many possible ways to achieve it, the Frameworks layer tends to be the most opinionated of the stack.

In my experience, the diversity of opinion introduced in the Frameworks layer depends on two factors: the inherent value of the capability and the opinion of the Interfaces layer itself.

The first factor is fairly straightforward: the more valuable the capability, the more likely a wealth of opinions will grow in the Frameworks layer.

The second factor is more nuanced. When the Interfaces layer is introduced, its authors build it by applying their own mental model of how the capability will be used via their interface. Since there aren’t actual users of the layer yet, it is at best a bunch of guesses. Then, the process of tinkering puts these guesses to the test. Surprising new uses are discovered, and the broadly adopted mental models of the consumers of the interface usually differ from the original guesses.

This difference becomes the opinion of the Interfaces layer. The larger this difference, the more effort the Frameworks layer will have to put into compensating for this difference – and thus, create more opportunities for even more diversity of opinion.

An illustration of how this plays out is the abundance of Web frameworks. Since the Web browser started out as a document-viewing application, it still has all of those early guesses firmly entrenched. Indeed, the main API for web development is called the Document Object Model. We have all moved on from this notion, asking our browsers to help us conduct business, entertain us, and work with us in many more ways than the original designers of this API envisioned. Hence, the never-ending stream of new Web frameworks, each trying yet another way to close this mental model gap.

It is also important to call out a paradox that develops as a result of the interaction between the Frameworks and the Interfaces layer. The Frameworks layer appears to simultaneously apply two conflicting pressures to the Interfaces layer below: to change and to stay the same.

On one hand, it is very tempting for the Interfaces layer maintainers to change it, now that their initial guesses have been tested. And indeed, when talking to Frameworks layer developers, the Interfaces layer maintainers will often hear requests for change.

At the same time, changing means breaking the existing contract, which creates all kinds of trouble for the Frameworks layer – these kinds of changes are usually a lot of work (see my deprecation two-step write-up from a while back) and take a long time.

The typical state for a viable Interfaces layer is that it is mired in a slow slog of constant change whose pace feels glacial from the outside. Once the Frameworks layer emerges, the Interfaces layer becomes increasingly challenging to evolve. For those of you currently in this slog, wear it as a badge of honor: it is a strong signal that you’ve made something incredibly successful.

The Frameworks layer becomes the de-facto place where the best practices and patterns of applying the capability are developed and stored. This is why it is only after a decent Frameworks layer appears that we start seeing robust Applications actually utilizing the capability at the bottom of the stack.

📱 Applications

The Applications layer tops our four-stack of layers. This layer is where the technology finally faces its users – the consumers of the technological capability. These consumers might be the end users who aren’t at all technology-savvy, or they could be just another group of developers who are relieved to not have to think about how our particular bit of technology works on the inside.

The pressure toward maximizing utility develops at this layer. Consumer-grade software is serious business, and it often takes all available capacity to just stay in the game. While introducing new capabilities could be an appealing method to expand this business, at this layer, we seek the most efficient way possible to do so. The whole reason the Frameworks layer exists is to unlock this efficiency – and to further scale the availability of the technology.

This highlights another common pitfall of research organizations: trying to ram a brand new capability right into an application, without thinking about the Interfaces and Frameworks layers between them. This usually looks like a collaboration between the team that builds at the Capabilities layer and the team that builds at the Applications layer. It is usually a sordid mess. Even if the collaboration nominally succeeds, neither participant is happy in the end. The Capabilities layer folks feel like they’ve got the most narrow and unimaginative implementation of their big idea. The Application folks are upset because now they have a weird one-off turd in their codebase.

👏 All together now

Getting technology adopted requires cultivating all four layers. To connect Capabilities to Applications, we first need the Interfaces layer that undergoes a significant amount of tinkering, with a non-trivial amount of use case exploration that helps map out the space of problems that the new technology can actually solve. Then, we need the Frameworks layer to capture and embody the practices and patterns that trace the shortest paths across the explored space.

This is exactly what is playing out with the large language models. While ChatGPT is getting all the attention, the actual interesting work is happening at the Frameworks layer that sits on top of the large language model Interfaces layer: the OpenAI, Anthropic, and PaLM APIs.

The all-too-common trough of disillusionment that many technological innovations encounter can be described as the period of time between the Capability layer becoming available and the Interfaces and Frameworks layers filling in to support the Applications layer. 

For instance, if you want to make better guesses about the future of the most recent AI spring, pay attention to what happens with projects like LangChain, AutoGPT, and other tinkering adventures – they are the ones accumulating the recipes and practices that will form the foundation of the Frameworks layer. They will be the ones defining the shape of the Applications layer.

Here’s the advice I would give to any team developing a nascent technology:

  • Once the Capabilities layer exists, immediately focus on developing the Interfaces layer. For example, if you have a cool new way to connect devices wirelessly, offer an API for it.
  • Do make sure that your Interfaces layer encourages tinkering. Make the API as simple as possible, but still powerful enough to be interesting. Invest into capping the downside (misuse, abuse, etc.). For example, start with an invitation-only or rate-limited API.
  • Avoid the comforting idea that just playing with the Interfaces layer within your team or organization constitutes tinkering. Seek out a diverse group of tinkerers. Example: opt for a public preview program rather than an internal-only hackathon.
  • Prepare for the long slog of evolving the Interfaces layer. Consider maintaining the Interfaces layer as a permanent investment. Grow expertise on how to maintain the layer effectively.
  • Once the Interfaces layer usage starts growing, watch for the emergence of the Frameworks layer. Seed it with your own patterns and frameworks, but expect them not to take root. There will be other great tool or library ideas that you didn’t come up with. Give them all love.
  • Do invest in growing a healthy Frameworks layer. If possible, assume the role of the Frameworks layer facilitator and patron. Garden this layer and support those who are doing truly interesting things. Weed out grift and adversarial players. At the very least, be very familiar with the Frameworks landscape. As I mentioned before, this layer defines the shape of Applications to come.
  • Do build applications that utilize the technology, but only to learn more about the Frameworks layer. Use these insights to guide changes in the Interfaces and Capabilities layers.
  • Be patient. The key to finding valuable opportunities is in being present when these opportunities come up – and being the most prepared to pursue these opportunities.

If you orient your work around these four layers, you might find that the rhythm of the pattern begins to work for you, rather than against you.

Fix-forward and rollback commit stances

Now that I play a bit more with open source projects, more observations crystallize into framings, and more previous experiences start making sense. I guess that’s the benefit of having done this technology thing for a long time – I get to compost all of my learnings and share the yummy framings that grow on top of them.

One such framing is the distinction between two stances that projects have in regard to bad code committed into the repository: the fix-forward stance and the rollback stance.

“Bad code” in this scenario is usually code that breaks something. It could be that the software we’re writing becomes non-functional. It could be as subtle as a single unit test beginning to fail. Of course, we try to ensure that there are strong measures to prevent bad code from ever sneaking into the repository. However, no matter how much continuous integration infrastructure we surround ourselves with, bad code still occasionally makes it through.

When the project has a fix-forward stance and bad code is found, we keep moving forward, fixing it with further commits.

In the rollback stance, we identify and immediately revert the offending commit, removing the breakage.

⏩ The fix-forward stance

The fix-forward stance tends to work well in smaller projects, where there is a high degree of trust and collaboration between the members of the project. The breakage is treated as a “fire”, and everyone just piles on to try and repair the code base.

One way to think of the fix-forward stance is that it places the responsibility of fixing the bad code on the collective shoulders of the project members.

One of my favorite memories from working on the WebKit project was the “hyatt landed” moments, when one of the founding members of the project would land a massive chunk of code that introduced a cool new feature or capability. This chunk usually broke a bunch of things, and members of the project would jump on putting out the fires, letting the new code finish cooking in the repository.

The obvious drawback of the fix-forward stance is that it can be rather randomizing. Fixing bad code and firefighting can be exhilarating the first few times, but it grows increasingly frustrating, especially as the project grows in size and membership.

Another drawback of fixing forward is that it’s very common for more bad code to be introduced while fighting the fire, resulting in a katamari ball of bugs and a prolonged process of deconstructing this ball and weeding out all the bugs.

🔁 The rollback stance

This is where the rollback stance becomes more appealing. In this stance, the onus of responsibility for the breakage is on the individual contributor. If my code is deemed to be the culprit, it is simply ejected from the repository, and it’s on me to figure out the source of the brokenness.

In projects with the rollback stance, there is often a regular duty of “sheriffing”, where an engineer or two are deputized to keep an eye on the build to spot effects of bad commits, hunt them down, and roll them back. The sheriff usually has the power to “close the tree”, so that no new code is allowed to land until the problematic commit is reverted.

It is not fun to get a ping from a sheriff, letting me know that my patch was found to be the latest suspect in the crimes against the repository. There’s usually a brief investigation, with the author pleading their innocence, and a quick action of removing the commit from the tree.

The key advantage of the rollback stance is that it’s trustless in its nature, and so it scales rather well to large teams with diverse degrees of engagement. It doesn’t matter if I am a veteran who wrote most of the code in the project or someone who is making their first commit in hobby time – everyone is treated in the same way.

However, there are also several drawbacks. First, it could take a while for complicated changes to land. I’ve seen my colleagues orchestrate intricate multi-part maneuvers to ensure that all dependencies are properly adjusted and do not introduce breakages in the process. 

There is also a somewhat unfortunate downside of the trustless environment: because it is on me to figure out the problem, it can be rather isolating. What could have been a brief firefighting swarm in a fix-forward project can turn into a long lonely slog of puzzling over the elusive bug. This tends to particularly affect less experienced and introverted engineers, who may spend weeks or even months trying to land a single patch, becoming more and more dejected with each rollback.

Similarly, it takes nuance and personal awareness to be an effective sheriff. A sheriff must constantly balance between quick action and proper diagnosis. Quite often, perfectly innocent code gets rolled back while the problematic bits remain – or the sheriff loses large chunks of time trying to diagnose the problem too deeply, thus holding up the entire project. While working on Chromium, I’ve seen folks who are genuinely good at this job – and folks who I would rather not have sheriffing at all.

Because it is trustless, a rollback stance can easily lead to confrontational and zero-sum dynamics. Be very careful here and cultivate the spirit of collaboration and sense of community, lest we end up with a project where everyone is out for themselves.

📚 Lessons learned

Which stance should you pick for your next project? I would say it really depends on the culture you’d like to hold up as the ideal for the project.

If this is a small tight-knit group of folks who already work together well, the fix-forward stance is pretty effective. Think startups, skunkworks, or prototyping shops that want to stay small and nimble.

If you’d like your project to grow and accept many contributors, a rollback stance is likely the right candidate – as long as it is combined with strong community-building effort.

What about mixing the two? My intuition is that a combination of both stances can work within the same project. For example, some of the more stable, broader bits of the project could adopt the rollback stance, and the more experimental parts could be in a fix-forward stance. As long as there is a clean dependency separation between them, this setup might work.

One thing to avoid is inconsistent application of the stance. For example, if for our project, we decide that seasoned contributors could be allowed to use the fix-forward stance, and the newcomers would be treated with rollbacks, we’ll have a full mess on our hands. Be consistent and be clear about the stance of your project – and stick with it.

Schemish

I’ve been spending most of my hobby time playing with reasoning boxes and it’s been crazy fun. I need to write properly about my explorations later, but so far, I’ve convinced an LLM to brainstorm ideas and then critique itself in a diverge-converge exercise, evaluate the Cynefin space of a situation, and even get some Humean reasoning going – albeit with mixed results on the last one.

One pattern that I’ve leaned on consistently is asking an LLM to produce output in JSON. There are several advantages to that. First, I get output that I can then process outside of an LLM. Second, JSON surprisingly acts as a useful set of reasoning rails for the LLM itself. It kind of makes sense – since LLMs are predictive devices, laying out a structure of prediction helps organize and guide the prediction itself. In other words, JSON is not just useful for processing the output. It also helps with framing the process of text completion.

Of course, now that we’re asking an LLM to return JSON, the question arises: how do we inform it of the structure that this JSON needs to take?

The first obvious answer is that we already have a way to describe the structure of JSON output. It’s called JSON Schema and it’s in broad use across the industry. Turns out, modern LLMs will happily consume JSON Schema. All you have to do is, somewhere in your prompt, tell them something along the lines of (you can see an example here):

Respond in valid JSON that matches the following JSON schema:
<insert your JSON schema here>

The output will (well, with 95% probability, as it usually goes with LLMs) dutifully follow the specified schema.

Unfortunately for us, JSON Schema is a bit chunky. Designed for precision and processing, it’s verbose and can quickly blow through our token budget if we’re not careful. It is also not exactly a human-readable format, making prompt hacking a bit of a chore.
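To give a sense of that chunkiness, here is roughly what a JSON Schema fragment for just two fields of the structure shown below might look like – this exact schema is my own illustration, not something pulled from a real project:

{
  "type": "object",
  "properties": {
    "context": {
      "type": "string",
      "description": "A few sentences describing the context in which the question was asked"
    },
    "subjects": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Name of a subject mentioned or implied in the question" },
          "assumption": { "type": "string", "description": "An assumption made when identifying this subject" }
        },
        "required": ["name", "assumption"]
      }
    }
  },
  "required": ["context", "subjects"]
}

That is a lot of tokens spent on scaffolding rather than meaning.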

On the other hand, describing JSON in English is also a bit too weird: “the output must be valid JSON containing an object with foo and bar as strings, blah blah blah” is something that LLMs will tolerate, but who has the time for typing it all in prose?

So I ended up using this weird pidgin that I named Schemish. It’s not really a JSON Schema, but not plain English either. Instead, it’s a mock of JSON output, with hints for the output provided as values. Kind of like this:

Respond in valid JSON of the following structure:
{
  "context": "A few sentences describing the context in which the question was asked",
  "subjects": [
    {
      "name": "Name of a subject mentioned or implied in the question",
      "assumption": "An assumption that you made when identifying this subject",
      "critique": "How might this assumption be wrong?",
      "question": "A question that could be asked to test the assumption"
    }
  ],
  "objects": [
    {
      "name": "Name of an object mentioned or implied in the question",
      "assumption": "An assumption that you made when discerning this object",
      "critique": "How might this assumption be wrong?",
      "question": "A question that could be asked to test the assumption"
    }
  ]
}

As you can see, I am specifying the structure of the output, but instead of relying on JSON schema syntax, I am simply mocking it out – just as I would describe JSON output in a sketch for another person.

If we want to go even more compact, we could go for YAML:

Respond in valid YAML of the following structure:
---
context: "A few sentences describing the context in which the question was asked",
  subjects:
    - name: "Name of an subject mentioned or implied in the question"
      assumption: "An assumption that you made when identifying this subject"
      critique: "How might this assumption be wrong?"
      question": "A question that could be asked to test the assumption"
  objects: 
    - name: "Name of an object mentioned or implied in the question"
      assumption: "An assumption that you made when discerning this object"
      critique: "How might this assumption be wrong?"
      question": "A question that could be asked to test the assumption"

Now, for an even cooler trick. You might have noticed that the structure has somewhat repeating elements in it. JSON Schema has references and other tools to deal with redundant declarations. Turns out, Schemish can do the same thing. Just use what you would usually do when sketching out a structure for yourself or your colleagues – leave a hand-wavy comment:

Respond in valid YAML of the following structure:
---
context: "A few sentences describing the context in which the question was asked",
  subjects:
    - name: "Name of an subject mentioned or implied in the question"
      assumption: "An assumption that you made when identifying this subject"
      critique: "How might this assumption be wrong?"
      question": "A question that could be asked to test the assumption"
  objects: 
    # same structure as `subjects`

What’s neat about Schemish is that it’s clearly derivable from a JSON Schema. Though I haven’t actually written the code, it’s fairly obvious that JSON Schema description fields as well as type and optionality can be expressed as English (or some shorthand of it) in the value of the Schemish structure.
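If I were to sketch that translator, it might look something like this – a rough, untested Python sketch, with the function name and the handling of nesting being my own guesses:

# Rough sketch of deriving Schemish from a JSON Schema. Descriptions (plus
# type and optionality hints) become the English values of the mock.
def to_schemish(schema):
    kind = schema.get("type")
    if kind == "object":
        return {
            name: to_schemish(prop)
            for name, prop in schema.get("properties", {}).items()
        }
    if kind == "array":
        # A single mocked item stands in for "a list of these".
        return [to_schemish(schema.get("items", {}))]
    # Leaf value: the description is the hint; fall back to the bare type.
    return schema.get("description", f"A {kind} value")

Feeding the result through json.dumps (or a YAML serializer) would produce the mock that goes into the prompt.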

This means that I could potentially connect a piece of machinery that speaks JSON Schema to an LLM via Schemish, with a Schemish translator shepherding the schema from precise, mechanical code-land to the squishy, funky LLM-land – and then picking up the output produced by the LLM and ushering it back to code-land by validating it with JSON Schema.
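The return trip is already well supported by existing tooling. A sketch, assuming the model’s reply lives in a variable called llm_output and the original schema in schema:

import json
import jsonschema

# Usher the LLM output back to code-land: parse it, then check it against
# the original JSON Schema. Either step may raise if the model strayed.
data = json.loads(llm_output)
jsonschema.validate(data, schema)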

This Schemish idea seems fairly useful and I’ve seen similar techniques pop up in the AI tooling sphere. If you like it, please do something with it. I’ll probably write a simple wrapper for Schemish, too.

Stages of project forking

I’ve been thinking recently about how to design an open source project, and realized that there’s a really neat framing hiding in my memory. So I dug it out. If you are trying to make sense of the concepts of forking source code, see if this framing works for you.

It is fairly common that we have a chunk of someone else’s open source code that we would like to use. Or maybe we are trying to best prepare for someone else to use our open source code. In either case, we typically want to understand how this might happen.

The framing that I have here is a three-stage progression. It really ought to have a catchy name. The three stages are: dependency, soft fork, and hard fork. In my experience, a lot of open source code tends to go through this progression, sometimes more than once.

Depending on a particular situation, we might not be starting at the beginning of the sequence. As I will illustrate later, a project might not even move through this sequence linearly. This framing is an oversimplification for the sake of clarity. I hope you get the general gist, and then will be able to apply it flexibly.

🔗 Dependency stage

When we see a useful bit of source code, we start at the “dependency” stage. We want to consume this code, so we include it into our build process or import it directly into our project as-is. Using someone else’s code as a dependency has benefits and drawbacks. 

The benefits of dependencies are that we don’t have to write or maintain this code. It serves as a layer of abstraction on top of which we build our thing, no longer needing to worry about the details hidden in this layer of abstraction.

The drawbacks come out of the failure of the assumptions made in the previous paragraph. Depending on the layer gaps this dependency contains or the sort of opinion it imposes, we might find that the code doesn’t quite do what we want.

At this point, we have two choices. The first choice is to continue treating this code as a dependency and try to fill in the gaps or shift the opinion by contributing back to the project from which this code originates. However, at this point, we are now participating in two projects: ours and the dependency’s. Depending on how different the projects’ organization and cultures are, we may start incurring an extra cost that we didn’t expect before: the cost of navigating across project boundaries. 

If these costs start presenting an obstacle for us, the second choice starts looking mighty appealing. This second choice moves us to the next stage in our progression: the soft fork.

🍝 Soft fork stage

When soft-forking, we create a fork of the open source code, but commit to regularly pulling commits from the original. This is very common in situations where we ourselves do not have enough resources (expertise, bandwidth, etc.) and the original code is being actively improved. Why not get the best of both worlds? Get the latest improvements from the original while making our own improvements in our copy.

In practice, we end up underestimating the costs of maintaining the fork. Especially when the original project is moving quickly, the divergence of thinking across the two pools of people who are working on same-but-different-but-same code starts to rapidly unravel our plans for soft fork harmony. We end up caught in the trap of having to accommodate both the thinking of our team and the team that’s working on the original code – and that is frankly a recipe for madness. Maintenance costs start growing non-linearly as our assumptions that it will “just be a simple merge” begin exploding, the time bombs that they are.

Because of that, the “soft fork” stage is rarely a stable resting place in our progression. To abate the growing discontent, we are once again faced with two choices: go back to the “dependency” stage or proceed forward to the next stage in our little progression. Both are expensive, which makes the soft fork a nasty kind of trap. 

Going back to the “dependency” stage means investing extra energy into upstreaming all of the accumulated code and insights. Many of them will be incompatible with what the original code maintainers think and like. Prepare for grueling debates and frustrations. Bring lots of patience and funding.

🔱 Hard fork stage

Moving forward to the “hard fork” stage means going our own way – and losing the benefit of the expertise and investment that made the soft fork so appealing in the first place. If we are a lean team that thought it would do a cool trick of drafting behind a larger team with our soft fork, this would be an impossible proposition.

Hard-forking is rarely beneficial in the short term. For a little while, we will be just learning how to run this thing by ourselves, and it will definitely feel like a dent in velocity. However, if we are persistent, a hard-forked project eventually regains its groove. Skill gaps are filled, duality of opinions is reduced, and the people around the project form unique culture and identity.

The key challenge of hard forks is that of utility. In the end, if the hard fork is not that different from the original, a natural question emerges: why do we need both? Depending on the kind of environment these projects inhabit, this question could be easily answered — or not.

📖 Story time

To give you an example of this progression, here’s an abbreviated (and likely embellished by yours truly) story of the relationship between Chromium and WebKit projects.

The Chromium project worked in secret for about two years, quickly going from WebKit being a dependency to a soft fork, with a semi-regular merge process. The shift from dependency to soft fork was pretty much necessary, given that the Chromium folks wanted to embed a different JavaScript engine than WebKit. This engine would end up being named “V8”.

In the last year or so prior to release, the team decided to temporarily shift to a hard fork stage. When I joined the team one month before public release, returning to the soft fork stage was my first big project. Thrilled to be working on a browser project, I remember happily reporting that I was down to just 400 linker errors. When my colleagues, wide-eyed, turned to stare at me, I would add that last week it was over 3000.

Once the first merge was successful, my colleague Ojan strongly advocated for a daily merge. I couldn’t understand why this was so important back then, but this particular learning opportunity presented itself nearly immediately. There was a strongly super-linear relationship between the difficulty of the merge and the number of commits in it. If the merge contained just a handful of commits, it wasn’t that big of a deal. However, if the number exceeded a few dozen, we would be in deep trouble – making sense of the changes and how they intersected with the changes we’ve made spiraled out of control.

Simultaneously, we committed to “unforking” – that is, to moving all the way back to the “dependency” stage, where Chromium consumed WebKit as a pure dependency. This was a wild time. We were doing three things at once: performing continuous merge with the tip-of-tree WebKit, shuttling our Chromium diffs over to WebKit, and building a browser. I still think of those times fondly. It was such a fun puzzle.

Over a year later – and that tells you about the sheer amount of work that was necessary to make all this happen – we were unforked. We moved all the way back to the first stage of the progression. At that point, the WebKit project’s code was just one of the dependencies of Chromium. We focused on making more and more contributions upstream, and the team that was working on WebKit directly grew.

Ultimately, as you may know, we forked again, creating Blink. This particular move was a hard fork, skipping a stage in the sequence. This wasn’t an easy decision, but having explored and understood the soft fork, we knew that it wasn’t what we were looking for.

✨ What I learned

With this framing, I accumulated a few insights. Here they are. Your mileage may vary.

When consuming open source projects:

  • Be aware that soft forks are always a lot more expensive than they look.
  • There will be many different ideas that make soft forks look appealing, including neat techniques of carrying patches and clever tooling to seamlessly apply them. These don’t reduce the costs. They just hide them.
  • When stuck with soft-forking, put all your energy into reducing the delay between merges. The beer game dynamic will show up otherwise.

In our own open source project:

  • Work hard to reduce the need to be soft-forked. Invest extra time to make configuration more flexible. Accommodate diverse needs. We are better off having these folks work in the main tree rather than wasting energy on a soft fork.
  • A culture of inviting contributions and respecting the insights of others is paramount: when signing up to run an open source project, we accept a future where the project will morph and change from our original intention. Lean into that, rather than trying to prevent it.

The waterline

This one is a very short framing, but hopefully, it’s still useful. Whenever we operate in environments that look like pace layers (and when don’t they?), there emerges a question of relative velocity. How fast am I moving relative to the pace layer where I want to play?

If the environment that surrounds us is moving slower than our team’s velocity, we will feel overly constrained and frustrated. Doing anything appears to be laden with unnecessary friction. On the other hand, if the environment is moving faster than us, we’ll feel like a tractor on an autobahn: everyone is zooming past us. Whenever something like this happens, this means that our relative velocity doesn’t match the velocity of the pace layer in which we’ve chosen to play.

This mismatch will feel uncomfortable, and there is a force hiding behind this discomfort. This force will constantly nudge us toward the appropriate layer.

If, after working in a startup where writing hacky code and getting it out to customers quickly was existential, I join a well-structured team with robust engineering practices and reliability guarantees, I will feel the force that constantly pushes me out of this team – the “serious business” processes will aggravate the crap out of me.

Conversely, if I grew up in a large company where landing one CL a week was somewhat of an achievement, joining a small team of rapid prototypers may leave me in bewilderment: how dare these people throw code at the repo without any code reviews?! And why does it feel like everyone is zooming past me?

Similarly, every technology organization has a waterline. Everything above this waterline will move at a faster pace than the organization. Everything below will be slower.

If this organization decides to enter the game at a faster pace layer, they will see themselves being constantly lapped: new, better products will emerge faster than the team can keep up with. We would still be in the planning stages of a new feature while another player at this layer has already shipped several iterations of it.

Similarly, the organization will struggle applying itself effectively at the lower pace layers. We would ship a bunch of cool things, and see almost no uptick or change in adoption or usage patterns. Worse yet, our customers will complain of churn and instability, choosing more stable and consistent peers over us.

No matter how much we want to play above or below our waterline, we’ll keep battling the force that nudges us back toward it. We might speak passionately about needing to move faster and innovate, and be obsessed with speed as a concept – but if our team is designed to move at the deliberate pace of a tractor, all it will be is just talk.

When an organization is aware of its waterline, it can be very effective. Choosing to play only at the pace layer where the difference in relative velocity is minimal can be a source of a significant advantage in itself: the organization can apply itself fully to solving the problem at hand, rather than battling the invisible force that nudges it out of the pace layer it doesn’t belong to.

Limits of applying AI

I’ve been thinking about the boundaries of what’s possible when applying large language models (LLMs) to various domains. Here’s a framing that might not be fully cooked, but probably worth sharing. Help me make it better.

To set things up, I will use the limits lens from the problem understanding framework. Roughly, I identify three key limits that bound what’s possible for anyone who is put in front of a problem: capacity, or the actual ability to find effective solutions (one or more) to the problem; time, the velocity at which effective solutions are discovered; and finally, attachment, or the resistance to incorporate interesting information into their understanding of the problem to find solutions.

These three limits govern what we humans can do. Let’s see how these limits could be applied to LLMs.

Let me start with capacity. The fact that this limit applies to LLMs is fairly self-evident, and can be illustrated easily with the progress we have made in the last year. What was impossible just 18 months ago is now a widely accepted capability. The progression to larger models and larger context windows feels like a constant drumbeat – this is the capacity limit being pushed further and further back. When will we run into the wall of the invisible asymptote? When will we be able to confidently say: “Aha! Here’s proof that LLMs can’t actually do that”? It’s not easy to tell. However, just like with all things of this world, this limit is still present with LLMs. We just haven’t quite found it yet.

Another interesting trick that LLMs have compared to humans is the ability to clone. A bot, an agent, or a network of reasoning boxes is easily reproducible. One does not need to spend a lifetime raising an LLM from infancy to adulthood. Once a pattern is established, clones of this pattern are easy to produce. This is a significant source of capacity. Being more numerous is easy, which means that brute force can compensate for smarts in a pinch.

The final component of the limit of capacity is the computing power. The amount of energy it takes to make the LLM produce a potential answer to a given problem seems like an important factor. Again, we seem to be marching along the Moore’s Law curve here, and I expect each new breakthrough in atoms and bits to significantly push the capacity limit out.

Speaking of the future, let’s talk about the limit of time. I sprinkled references to time into my examination of the limit of capacity. They seem closely interrelated. LLMs becoming more capable and efficient means that they will also be able to solve problems more quickly. At this moment in time, I am contemplating networks of reasoning boxes, which implies an era when invoking LLMs will take single milliseconds. This seems to fall naturally out of the limit of capacity advancing. There will likely be an asymptote we’ll hit with how fast an LLM can go, but we’re definitely not there yet.

The limit of attachment is the one that’s been most curious for me. As far as I can tell, LLMs don’t seem to have it at all. While people could get bored or tired, or have anxiety about doing things one way or another, express strong convictions and make rash decisions… LLMs don’t seem to have any of that. LLMs don’t have the “ooh, I wonder what <character> will do next on <TV show> episode” mindworm. They don’t need to rush home to put their kid to bed. They have no desire to spend the day just hanging out with friends. LLMs are unattached. There is no inner shame that they have to protect at all cost. There are no values that they passionately embrace. The limit of attachment for LLMs appears to be a bottomless abyss – or a blue sky, if you’re into more positive metaphors.

Will the limit of attachment manifest itself for LLMs? Probably. There’s something very significant about this lack of the limit that indicates the vastness of the space we’re in.

So, what does this mean in practice? 

As I was saying to my colleagues, we’re at the 8-track tape stage of the whole LLM story. What we think is cool and amazing now will be viewed as silly and as quaint as the 8-track tapes just a few months in the future. Prepare for more tectonic shifts as the previously understood lines of the capacity limit are redrawn. Avoid early firm bets on the final shape of things in this space.

Also, we’re likely very close to the moment when we will have a decent representation of an “information worker”, probably as some sort of reasoning network. This worker will never get tired, will never complain about the problem being too pedestrian for their talents, will never slack off or quiet-quit. This worker will continuously improve and get better at the tasks that they are given, performing them more efficiently each time.

Looking at the LLM bounding box – or more accurately, the lack thereof – my intuition is that we’re going to see terraforming of entire industries, especially where information plumbing is a significant cost of doing business. I have no idea how they will change and what they will look like after the dust settles, but it’s very likely that this change will happen.

More importantly, as this change grinds into our established understanding of how things are, we are likely to grapple with it as a society. It is very possible that it is our own limits of attachment that we will try to impose on the LLMs, no matter how fruitless this effort will end up being in the long term.

Concept miner

Here’s a concrete example of a reasoning box of the kind I was talking about last week. It’s not super-flashy or cool – certainly not one of those viral demos – but it’s a useful example of recursively ground-truthing a reasoning box onto itself. The source code is here if you want to run it yourself.

The initial idea I wanted to play with was concept extraction: taking a few text passages and then turning them into a graph of interesting concepts that are present in the text. The vertices of the graph would be the concepts and the edges would represent the logical connections between them.

I started with a fairly simple prompt:

Analyze the text and identify all core concepts defined within the text and connections between them.

Represent the mental concepts and connections between them as JSON in the following format:

{
  "concept name": {
    "definition": "brief definition of the concept",
    "connections": [ a list of concept names that connect with this concept ]
  }
}

TEXT TO ANALYZE:
${input}

This is mostly a typical reasoning box – take in some framing, context, and a problem statement (“identify all core concepts”), and produce a structured output that reflects the reasoning. In this particular case, I am not asking for the chain of reasoning, but rather for a network of reasoning.

The initial output was nice, but clearly incomplete. So I thought – hey, what if I feed the output back into the LLM, but with a different prompt. In this prompt, I would ask it to refine the list of concepts:

Analyze the text and identify all core concepts defined within the text and connections between them.

Represent the mental concepts and connections between them as JSON in the following format:

{
  "concept name": {
    "definition": "brief definition of the concept",
    "connections": [ a list of concept names that connect with this concept ]
  }
}

TEXT TO ANALYZE:
${input}

RESPONSE:
${concepts}

Identify all additional concepts from the provided text that are not yet in the JSON response and incorporate them into the JSON response. Add only concepts that are directly mentioned in the text. Remove concepts that were not mentioned in the text.

Reply with the updated JSON response.

RESPONSE:

Notice what is happening here. I am not only asking the reasoning box to identify the concepts. I am also providing the outcome of its previous reasoning and asking to assess the quality of this reasoning.

Turns out, this is enough to spur a ground truth response in the reasoning box: when I run it recursively, the list of concepts grows and concept definitions get refined, while connections shift around to better represent the graph. I might start with five or six concepts in the first run, and then expand into a dozen or more. Each successive run improves the state of the concept graph.
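Here is the shape of that recursive loop, as a rough Python sketch. The names call_llm, extraction_prompt, and refinement_prompt are placeholders for the completion API and the two prompts shown above; the stopping condition is my own:

import json

def mine_concepts(text, max_passes=5):
    # First pass: extract an initial concept graph from the text.
    concepts = json.loads(call_llm(extraction_prompt(text)))
    for _ in range(max_passes):
        # Feed the previous response back in and ask for a refinement.
        refined = json.loads(call_llm(refinement_prompt(text, concepts)))
        if refined == concepts:
            break  # the graph has settled down; no more changes
        concepts = refined
    return concepts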

This is somewhat different from the common agent pattern in reasoning boxes, where the outcomes of the agent’s actions serve as the ground truth. Instead, the ground truth is static – it’s the original text passages, and it is the previous response that is reasoned about. Think of it as the reasoning box making guesses against some ground truth that needs to be puzzled out and then repeatedly evaluating these guesses. Each new guess is based on the previous guess – it’s path-dependent reasoning.

Eventually, when the graph settles down and no more changes are introduced, the reasoning box produces a fairly well-reasoned representation of a text passage as a graph of concepts.

We could maybe incorporate it into our writing process, and use it to see if the concepts in our writing connect in the way we desire, and if the definitions of these concepts are discernible. Because the reasoning box has no additional context, what we’ll see in the concept graph can serve as a good way to gauge if our writing will make sense to others.

We could maybe create multiple graphs for multiple passages and see if similar concepts emerge – and how they might connect. Or maybe use it to spot text written in a way that is not coherent and the concepts are either not well-formed or too thinly connected.

Or we could just marvel at the fact that just a few lines of code and two prompts give us something that was largely inaccessible just a year ago. Dandelions FTW.

Separating cargo from the train

I’ve been puzzling over a problem that many engineering teams face and came up with this metaphor. It’s situated in the general space of attachment and could probably apply to things other than engineering teams. 

Here’s the setup. Imagine that we’re leading a team whose objective is to rapidly explore some newly opened space. Everyone gets their little area of the space, and armed with enthusiasm and skill, the teams venture off into the unknown. A few months later, a weird thing happens: now we have N teams that build technology or products in basically the same space where they started.

Instead of exploring, the teams just settled into the first interesting thing they found. Exploration collapsed into optimizing for the newly found value niches.

This might not necessarily be a bad thing. If the new space is ripe with opportunities or the team is incredibly lucky, they might have struck gold on the first try.  Except my experience tells me that most of the time, the full value of the niche is grossly overestimated, and the teams end up organizing themselves into settlers of a tiny “meh” value space.

The events that follow are fairly predictable. There is a struggle between us and the individual team leads to “align”, where the word “align” really stands for “what the heck are y’all doing?! we were supposed to be exploring!!” from us and “stop distracting us with your silly ideas! we have customers to serve and things to ship!” from the sub-teams. The team becomes stuck.

I have seen various ways in which the resolution plays out. There’s one with the uneasy compromise, where the “exploration team” kayfabe is played out at the higher levels (mostly in slide decks), and the sub-teams are just left to do their thing. There’s one with the leader making a “quake”: a swift reorg that leaves the sub-team leads without a path forward. There’s one where a new stealth sub-team is started to actually explore (you can guess what happens next).

The lens that really helps here is “something will get optimized”. When we have engineers, we have people whose literal job description includes organizing code into something that lasts. Like a car with unbalanced wheels, by default, engineers will veer toward elephant-land. Given no other optimization criteria, what will get optimized is the quality of the code base and the robustness of the technical solution that it offers.

The problem is, when exploring, we don’t need any of that. We need messy, crappy code that somewhat works to get a good sense of whether there’s a there there. And then we need to throw that code out or leave it as-is and move on to the next part of the space.

This is not at all natural and intuitive for engineering teams. There are no tests! Not even a code review process! This dependency was made by a random person in Nebraska! Madness!

By the way, the opposite of this phenomenon is also true. If our engineering team does not have this tendency toward building code that lasts, we probably don’t have an engineering team. We might have some coders or programmers, but no engineering.

To shift an engineering team to be more amenable to exploration, we need to shift the target of the optimization.

That’s where the cargo-and-train metaphor comes in. Let’s pretend that an engineering team is a train that delivers cargo: the thing that it makes. The thing about cargo is that once it is delivered, it leaves the train and the train gets new cargo. The train is permanent. The cargo is transient.

To make our train go efficiently, we optimize for moving cargo as quickly as possible, and we optimize for keeping the train in its best condition. Figuring out which part of our work we optimize to keep and which one we optimize to move is what it’s all about.

If we follow this metaphor, there are two questions that an engineering team needs to ask itself: “What is our cargo?” and “What is our train?” We need to consciously separate our cargo from our train.

Which part of our business do we optimize to let go of as efficiently as possible, and which part of it do we keep and grow?

For a typical engineering team, the cargo is the software release and the code base is the train.

Each release is a snapshot of the code base in a certain state. Once that release cut is made, we mentally let go of it and start on the next release. Releasing well means being able to make a release cut like a Swiss train: always on time, with no hiccups.

The codebase is the train, since this is where releases come from. The codebase is the place where the product grows and matures. Our codebase is what we keep and improve and strive to make better with time. Terms like technical debt, which we engineers invented, reflect our anxiety about succeeding at this process.

When the engineering team is asked to explore a new space, the answers to the two questions are likely different.

It might very well be that the code we write is cargo. It’s just something we do as a byproduct of our exploration. We write a ton of prototypes, throw them in the wild, and see which ones stick.

What is the train then? My intuition is that it’s knowledge. After all, the whole point of exploration is mapping the unknown. If our continuous delivery of cargo – writing of prototypes – doesn’t light up more and more territory, we’re doing something wrong.

So when an engineering team is asked to explore a new space, we need to contemplate the cargo-and-train questions carefully and decide on our answers to them.

Then, we need to invest into making sure that everyone on the team optimizes for the right thing: the thing that we want to be our cargo is optimized to be delivered and let go of quickly, and the thing we want to be our train is carefully and lovingly grown and enriched with each delivery.

This includes everything from the mission and vision, where the cargo-and-train questions are clearly answered, to the culture, incentives, and structure of the team. Remember – most of the default engineering practices and processes were designed for default engineering teams. Which means that if we’re setting out to explore, they will be working against us.

Reasoning boxes

This story begins with the introduction of metacognition to large language models (LLMs). In the LLM days of yore (like a few months ago), we just saw them as things we could ask questions and get answers back. It was exciting. People wrote think pieces about the future of AI and all that jazz.

But then a few extra-curious folks (this is the paper that opened my eyes) realized that you could do something slightly different: instead of asking for an answer, we could ask for the reasoning that might lead to the answer.

Instead of “where do I buy comfortable shoes my size?”, we could inquire: “hey, I am going to give you a question, but don’t answer it. Instead, tell me how you would reason about arriving at the answer. Oh, and give me the list of steps that would lead to finding the answer. Here’s the question: where do I buy comfortable shoes my size?”

Do you sense the shift? It’s like an instant leveling up, the reshaping of the landscape. Instead of remaining hidden in the nethers of the model, the reasoning about the question is now out in the open. We can look at this reasoning and do what we would do with any reasoning that’s legible to us: examine it for inconsistencies and decide for ourselves if this reasoning and the steps supplied will indeed lead us toward the answer. Such legibility of reasoning is a powerful thing.

With reasoning becoming observable, we iterate to constrain and shape it. We could tell the LLM to only use specific actions of our choice as steps in the reasoning. We could also specify particular means of reasoning to use, like taking multiple perspectives or providing a collection of lenses to rely on.

To kick it up another notch, we could ask an LLM to reason about its own reasoning. We could ask it: “Alright, you came up with these steps to answer this question. What do you think? Will these work? What’s missing?” As long as we ask it to provide the reasoning back, we are still in metacognitive territory.

We could also give it the outcomes of some of the actions it suggested as part of the original reasoning and ask it to reason about these outcomes. We could specify that we tried one of the steps and it didn’t work. Or maybe that it worked, but made it impossible for us to go to the next step – and ask it to reason about that.

From the question-answering box, we’ve upleveled to the reasoning box.

All reasoning boxes I’ve noticed appear to have this common structure. A reasoning box has three inputs: context, problem, and framing. The output is the actual reasoning. 

The context is the important information that we believe the box needs to have to reason. It could be the list of the tools we would like it to use for reasoning, the log of prior attempts at reasoning (aka memory), information produced by these previous attempts at reasoning, or any other significant stuff that helps the reasoning process.

The problem is the actual question or statement that we would like our box to reason about. It could be something like the shoe-shopper question above, or anything else we would want to reason about, from code to philosophical dilemmas.

The final input is the framing. The reasoning box needs rails on which to reason, and the framing provides these rails. This is currently the domain of prompt engineering, where we discern resonant cues in the massive epistemological tangle that is an LLM – cues that give the reasoning box the perspective we’re looking for. It usually goes like “You are a friendly bot that …” or “Your task is to…”. Framing is sort of like a mind-seed for the reasoning box, defining the kind of reasoning output it will provide.

Given that most of the time we would want to examine the reasoning in some organized way, the framing usually also constrains the output to be easily parsed, be it a simple list, CSV, or JSON.
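Put together, a reasoning box is small enough to sketch as a single function. This is just a sketch under the assumptions above: ask_llm() is the same helper as before, and the JSON-array constraint rides along inside the prompt so the output stays easy to parse.

```python
import json

def reasoning_box(context: str, problem: str, framing: str) -> list[str]:
    """Three inputs in (context, problem, framing); a list of reasoning steps out."""
    prompt = (
        f"{framing}\n\n"
        f"Context:\n{context}\n\n"
        f"Problem:\n{problem}\n\n"
        "Respond with a JSON array of strings, one per reasoning step, "
        "and nothing else."
    )
    return json.loads(ask_llm(prompt))

# The shoe-shopper question, with a planner framing.
steps = reasoning_box(
    context="Available actions: search_web, check_store_inventory, ask_user.",
    problem="Where do I buy comfortable shoes my size?",
    framing="You are a careful planner. Do not answer the question; "
            "produce the steps that would lead to an answer.",
)
```

In practice, models sometimes wrap the JSON in extra prose, so a sturdier version would parse more defensively. The shape of the box is the point here, not the plumbing.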

A reasoning box is certainly a neat device. But by itself, it’s just a fun little project. What makes reasoning boxes useful is connecting them to ground truth. Once we connect a reasoning box to a ground truth, we get the real sparkles. Ground truth gives us a way to build a feedback loop.

What is this ground truth? Well, it’s anything that can inform the reasoning box about the outcomes of its reasoning. For example, in our shoe example, a ground truth could be us informing the box of the successes or failures of actions the reasoning box supplied as part of its reasoning.

If we look at it as a device, a ground truth takes one input and produces one output. The input is the reasoning and the output is the outcomes of applying this reasoning. I am very careful not to call ground truth “the ground truth”, because what truths are significant may vary depending on the kinds of reasoning we seek.
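That one-in, one-out shape is easy to write down. A sketch of the device as I’ve been describing it; both names are mine, and the humblest possible implementation is just a person trying each step and reporting back.

```python
from typing import Protocol

class GroundTruth(Protocol):
    """One input, one output: reasoning steps in, observed outcomes out."""
    def apply(self, reasoning: list[str]) -> list[str]: ...

class HumanInTheLoop:
    """The humblest ground truthing device: a person tries each step and reports back."""
    def apply(self, reasoning: list[str]) -> list[str]:
        return [input(f"What happened when you tried '{step}'? ") for step in reasoning]
```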

For example, and as I implied earlier, a reasoning box itself is a perfectly acceptable ground truthing device. In other words, we could connect two reasoning boxes together, feeding one’s output into the other’s context – and see what happens. That’s the basic structure behind AutoGPT.

Connecting a reasoning box to a real-life ground truth is what most AI agents are. They are reasoning boxes whose reasoning is used by a ground truthing device to take actions, like searching the web or querying data sources, with the outcomes of these actions fed back into the reasoning boxes. The ground truth connection is what gives reasoning boxes agency.
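Wiring the two devices together gives us the skeleton of such an agent. A sketch that reuses reasoning_box() and the GroundTruth shape from the earlier snippets; a real agent would add stopping conditions, memory management, and a lot more care.

```python
def agent_loop(problem: str, framing: str, ground_truth: GroundTruth,
               max_turns: int = 5) -> str:
    """Alternate between reasoning and ground truthing, folding outcomes back in."""
    context = ""
    for _ in range(max_turns):
        steps = reasoning_box(context, problem, framing)   # reason
        outcomes = ground_truth.apply(steps)               # act and observe
        context += "\n".join(                              # feed outcomes back as context
            f"Tried: {step} -> {outcome}"
            for step, outcome in zip(steps, outcomes)
        ) + "\n"
    return context  # the accumulated, ground-truthed reasoning log
```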

And I wonder if there’s more to this story?

My intuition is that the reasoning box and the ground truthing device are the two kinds of building blocks we need to build what I call “socratic machines”: networks of reasoning boxes and ground truthing devices that are capable of independently producing self-consistent reasoning. That is, we can now build machines that can observe things around them, hypothesize, and despite the hallucinations they may occasionally incur, arrive at well-reasoned conclusions about them.

The quality of these conclusions will depend very much on the type of ground truthing these machines have and the kind of framing they are equipped with. My guess is that socratic machines might even be able to detect ground truthing inconsistencies by reasoning about them, kind of like how our own minds are able to create the illusion of clear vision despite only receiving a bunch of semi-random blobs that our visual organs supply. And similarly, they might be able to discern, repair and enrich insufficient framings, similar to how our minds undergo vertical development.

This all sounds outlandish even to me, and I can already spot some asymptotes that this whole mess may bump into. However, it is already pretty clear that we are moving past the age of chatbots and into the age of reasoning boxes. Who knows, maybe the age of socratic machines comes next?

Innovation frontier

So we decided to innovate. Great! Where do we begin? How do we structure our innovation portfolio? There are so many possibilities! AI is definitely hot right now. But so are advances in green technology – maybe that’s our ticket? I heard there’s stuff happening with biotech, too. And I bet there are some face-melting breakthroughs in metallurgy…

With so much happening everywhere all at once, it can be challenging to orient ourselves and innovate intentionally – or at least with enough intention to convince ourselves that we’re not placing random bets. A better question might be: which spaces do we not invest in when innovating?

Here’s a super-simple framing that I’ve found useful in choosing the space to innovate in. It looks like a three-step process.

First, we need to know what our embodied strategy is. We need to understand what our capabilities are and where they will be taking us by default.

This is important, because some innovation may just happen as a result of us letting our embodied strategy play out. If we are an organization whose embodied strategy is strongly oriented toward writing efficient C++ code, then we are very likely to keep seeing amazing bits of innovation pop out in that particular space. We will likely lead some neat C++ standards initiatives and invent new cool ways to squeeze a few more drops of performance out of the code we write.

As I mentioned before, embodied strategy is usually not the same as stated strategy. I know very few teams that are brutally honest with themselves about what they are about. There’s usually plenty of daylight between where the organization says it’s going and where it’s actually going. The challenge of the first step is to pierce the veil of the stated strategy.

As you may remember from my previous essays, this understanding will also include knowing our strategy aperture. How broad is our organization’s cone of embodied strategy?

At the end of the first step, we already have some insight into the question above. Spaces well outside of our cone of embodied strategy are not reachable for us. They are the first to go into the discard pile. If we are an organization whose strengths are firmly in software engineering, attempting to innovate in hardware is mostly throwing money away – unless, of course, we first grow our hardware engineering competency.

The second step is to understand our innovation frontier. The innovation frontier is a thin layer around our cone of embodied strategy. Innovation ideas at the outer edge of this frontier are the ones we’ve just discarded as unreachable. Ideas at the inner edge of the frontier are obviously going to happen anyway: they are part of the team’s embodied strategy.

It is the ideas within this frontier that are worth paying closer attention to. They are the “likely-to-miss” opportunities. Because they are still on the fringe of the embodied strategy, the organization is capable of realizing them, but is unlikely to do so – they are on the fringe, after all.

It is these opportunities that sting the most when missed. They are the ones that were clearly within reach, but were ignored because of the pressing fires and general everyday minutiae of running the core business. They are the ones that will disrupt business as usual, because – when they are big enough – they will definitely reshape the future opportunities for the organization.

The innovation frontier is likely razor-thin for well-optimized and specialized organizations. The narrower our strategy aperture, the less likely we are to shift a bit and explore curious objects just outside of our main field of view.

In such cases, the best thing the leader of the organization can do is to invest seriously in expanding its innovation frontier. Intentionally create spaces where thinking can happen at a slower pace, where wilder ideas can be prototyped and shared in a more dandelion environment. Be intentional about keeping the scope roughly within the innovation frontier, but add some fuzziness and slack to where these boundaries are.

The third step is to rearrange the old 70/20/10 formula and balance our innovation portfolio according to what we’ve learned above:

  • Put 70% into the ideas within the innovation frontier and the efforts to expand our innovation frontier.
  • Put 20% into the ideas that are within the strategy aperture.
  • Just in case we’re wrong about our understanding of our embodied strategy, put 10% into the ideas that are at the outer edge of the innovation frontier.

And who knows – maybe my law of tightening strategy aperture will be proven wrong? Perhaps if an organization is intentional enough about expanding its innovation frontier, it can regain its ability to see and realize opportunities that would previously have been unattainable.

Wait, did we just forgo the whole notion of timelines in our innovation portfolio calculations? It’s still there, since the cone of embodied strategy does extend in time. It’s just not as significant as it was in the old formula. Why? That’s a whole different story – and luckily, my friend Alex wrote it down just a few days ago.