Why AI orchestration

Why do I find the problem of AI patterns, and more generally AI orchestration, so interesting that I literally started building a framework for it? Why do we even need graphs and chains in this whole AI thing? My colleagues with a traditional software engineering background have been asking me this question a lot lately.

Put very briefly, at the height of the current AI spring that we’re experiencing, orchestration is a crucial tool for getting AI applications to the shipping point.

To elaborate, imagine that an idea for a software application takes a journey from inception to full realization through these two gates.

First, it needs to pass the “hey… this might just work” gate. Let’s call this gate the “Once” gate, since it’s exactly how many times we need to see our prototype work to get through it.

Then, it needs to pass through the “okay, this works reasonably consistently” gate. We’ll call it the “Mostly” gate to reflect the confidence we have in the prototype’s ability to work. It might be missing some features, lack polish, and underwhelm in performance benchmarks, but it is something we can give to a small group of trusted users to play with and not be completely embarrassed.

Beyond these two gates, there’s some shipping point, where the prototype – now a fully-fledged user experience – passes our bar for shipping quality and we finally release it to our users.

A mistake that many traditional software developers, their managers, and sponsors/investors make is that, when looking at AI-based applications, they presume the typical cadence of passing through these gates.

Let’s first sketch out this traditional software development cadence as a sequence below.

The “Once” gate plays a significant role, since it requires finding and coding up the first realization of the idea. In traditional software development, passing this gate means that there exists a kernel of a shipping product, albeit still in dire need of growing and nurturing.

The trip to the “Mostly” gate represents this process of maturing the prototype. It is typically less about ideation and more about converging on a robust implementation of the idea. There may be some circuitous detours that await us, but more often than not, it’s about climbing the hill.

In traditional software development, this part of the journey is a matter of technical excellence and resilience. It requires discipline and often a certain kind of organizing skill. On more than one occasion, I’ve seen brilliant program managers brought in, who then help the team march toward their target with proper processes, burndown lists, and schedules. We grit our teeth and persevere, and are eventually rewarded with software that passes the shipping bar.

There’s still a lot of work to be done past that gate, like polish and further optimization. This is important work, but I will elide it from this story for brevity.

In AI applications, or at least in my and my friends’ and colleagues’ experience with them, this story looks startlingly different. And it definitely doesn’t fit into a neat sequential framing.

Passing the “Once” gate is often a matter of an evening project. Our colleagues wake up to a screencast of a thing that shouldn’t be possible, but somehow is. Everyone is thrilled and excited. Their traditional software developer instincts kick in: a joyful “let’s wrap this up and ship it!” is heard through the halls of the office.

Unfortunately, when we try to deviate even a little from the steps in the original screencast, we get perplexing and unsatisfying results. Uh oh.

We try boxing the squishy, weird nature of large language models into production software constraints. We spend a lot of time playing with prompts, chaining them, tuning models, quantizing, chunking, augmenting – it all starts to feel like alchemy at some point. Spells, chants, and incantations. Maaaybe – maybe – we get to coax a model to do what we want more frequently.

One of my colleagues calls it the “70% problem” – no matter how much we try, we can’t seem to get past our application producing consistent results more than 70% of the time. Even by generous software quality standards, that’s not “Mostly”.

Getting to that next gate bears little resemblance to the maturation process of traditional software development. Instead, it looks a lot more like looping over and over back to “Once”, where we rework the original idea entirely and change nearly everything.

When working with AI applications, this capacity to rearrange everything and stay loose about the details of the thing we build – this design flexibility – is what dramatically increases our chances of crossing the “Mostly” gate.

Teams that hinge their success on adhering to the demo they sold to pass through the “Once” gate are much more likely to never see the next gate. Teams that decide that they can just lay down some code and improve iteratively – as traditional software engineering practices would suggest – are the ones who will likely work themselves into a gnarly spaghetti corner. At least today, for many use cases – no matter how exciting and tantalizing – the “70% problem” remains an impassable barrier. We are much better off relying on an orchestration framework to give us the space to change our approach and keep experimenting.

This is a temporary state and it is not a novel phenomenon in technological innovation. Every new cycle of innovation goes through this. Every hype cycle eventually leads to the plateau of productivity, where traditional software development rules.

However, we are not at that plateau yet. My intuition is that we’re still climbing the slope toward the peak of inflated expectations. In such an environment, most of us will run into the “70% problem” barrier head-first. So, if you’re planning to build with large language models, be prepared to change everything many times over. Choose a robust orchestration framework to make that possible.

Makers and Magicians

I want to finally connect two threads of the story I’ve been slowly building across several posts. I’ve talked about the rise of makers. I’ve talked about the magicians. It’s time to bring them together and see how they relate to each other.

First, let’s paint the picture a little bit and set up the narrative.

The environment is ripe for disruption: there’s a new software capability and a nascent interface for it, and there’s a whole lot of commotion going on at all four layers of the stack. Everyone is seeing the potential, and is striving to glimpse the true shape of the opportunity, the one that brings the elusive product-market fit into clarity.

As I asserted before, there’s a brief moment when this opportunity is up for grabs, and the ground is more level than it’s ever been. Larger companies, despite having more resources, struggle to search for the coveted shape quickly due to the law of tightening aperture. Smaller startups and hobbyists can move a lot faster – albeit with high ergodic costs – and are able to cover more ground en masse. Add the combinatorial power of social networks and cozywebs, and it is significantly more likely that one of them will strike gold first.

For any larger player with strategic foresight, the name of the game is to “be there when it happens”. It might be tempting to try and out-innovate the smaller players, but more often than not, that proves to be hubris.

Instead of trying to be the lucky person in the room, it is more effective to be the room that has the most exceptionally lucky person in it – and boost their luck as much as possible.

When the disruption does finally occur and the hockey stick of growth streaks upward, such a stance reduces the chances of counter-positioning and improves the larger player’s ability to quickly learn from said lucky person.

Put simply, during such times of rapid innovation, the task of attracting “exceptionally lucky people” to their developer ecosystems becomes dramatically more important for larger companies.

If the story is indeed playing out this way, then the notion of magicians is useful for identifying those “exceptionally lucky people” – because luck compounds for those who explore the space in the way that magicians do.

But where do makers fit in? A good way to think of it is as two overlapping circles: developers and makers.

We’ll define the first circle as people who develop software, whether professionally or as a hobby. Developers, by definition, use developer surfaces: APIs, libraries, tools, docs, and all those bits and bobs that go into making software.

The second circle is broader, because it includes folks who both develop and interact with software in a way that creates something they care about. Makers and developers obviously overlap. And since “maker” is a mindset, the boundary between makers and developers is porous: I could be a developer during the day and a maker at night. At the same time, not all developers are makers. Sometimes, it’s really just a job.

Makers who aren’t developers tend to gravitate toward becoming developers over time. My intuition is that the more engaged they become with the project, the more they find the need to make software, rather than just use it. However, the boundary that separates them from developers acts as a skill barrier. Becoming a developer can be a rather tough challenge, given the complexity of modern software.

Within these two circles, early adopters make up a small contingent that is weighted a bit toward makers. Based on how I defined maker traits earlier, it seems logical that early adopters will be primarily populated by them.

A tiny slice of the early adopter bubble on the diagram is magicians. They are more likely to be in the developer circle than not, since they typically have more expertise and skill to do their magic. However, there are likely some magicians hiding among non-developer makers, prevented by the learning curve barrier from letting their magic shine.

I hope this diagram acts as a navigational aid for you in your search for “exceptionally lucky” people – and I hope you make a room for them that feels inviting and fun to inhabit.

Zones of LLM predictability

As you may know, large language models (LLMs) are smack dab in the middle of my tangle of interests presently, so you can bet I spend a lot of time talking with my friends and colleagues about them. One lens that seems to have resulted in fruitful conversations is the one related to predictability of output.

In this lens, we look at the LLM’s output as something that we can predict based on the input – and at the reaction we might have to the outcomes. If we imagine a spectrum where the results are entirely unpredictable at one extreme, and can be predicted with utter certainty at the other – then we have a space to play in.

For a simple example, let’s suppose we’re asking two different LLMs to complete the sentence “roses are red, violets are …”. If one LLM just returns a bunch of random characters, while the other consistently and persistently says “blue”, we kind of know where we’d place these models on the spectrum. The random character one goes closer to an unpredictable extreme and the insistent blue one goes closer to the perfectly predictable end.

For ease of navigating our newly created space, let’s break it down into four zones: chaotic, weird, prosaic, and mechanistic. 

🌫️ Chaotic

In the chaotic zone dwell the LLMs that basically produce white noise. They aren’t really models, but random character sequence generators. By the way, I asked Midjourney to illustrate white noise, and it gave me this visage:

(It’s beautiful, Midge, but not what I asked for)

This zone is only here to bookend the very extreme of the spectrum. Suffice it to say that we humans tend to only use white noise as a means to an end, mostly judging it as useless on its own.

🐲 Weird

The adjacent zone is where the model outputs something that is weird and bizarre, yet strangely recognizable and sometimes even almost right. Remember the whole “hands” thing in the early generative imagery journey? That’s what I am talking about.

(“A normal human hand with five fingers” – whoopsie!)

This zone is where LLMs are at their creative best. Sure, they can’t count fingers, and yes, some – many! – outcomes are creepy and disturbing, but they also produce predictions that are just outside of the norms, while still retaining some traits that keep them outside of the chaotic zone. And that stirs creativity and inspiration in those who observe these outcomes. This is the zone where a model is more of a muse – odd and mysterious, and not very serious. Yet, when paired with the creative mind of a human, it can help produce astounding things.

📈 Prosaic

The prosaic zone is where an LLM produces mostly the results we expect. It might add a bit of flourish in bursts of creativity and insert an occasional (very safe) dad joke, but for the most part, that’s the zone that I also sometimes call the “LLM application zone”. If you’ve ever spent time getting your retrieval-augmented generation to give accurate responses, or to only return code results that can actually run – you’ve lived in this zone.

(“a happy software engineer working, stock photo” – oh yes, please! More cliche!)

My own explorations are mostly in this zone. The asymptotes I outlined earlier this year are still in place, and holding. If anything, time has shown that these asymptotes are firmer than I initially expected.

⚙️ Mechanistic

Another bookend of the spectrum is the mechanistic zone. At this point, LLM output is so constrained and deterministic that we become uncertain if using an LLM is even necessary: we might be better off just writing “old school” software that does the job.

The mechanistic zone is roughly the failure case for the current “AI” excitement. Should the next AI winter come, we’ll likely see most of the use cases shift toward this zone: the LLM either constrained, significantly scaled down in size, or entirely ripped out, replaced with code.

💬 A conversation guide

Now that we have the zones marked in the space, we can have conversations about them. Here are some interesting starter questions that generated insights for me and my colleagues:

  • How wide (or narrow) is each zone? For example, I know a few skeptics who don’t even believe that the Prosaic zone exists. For them, its width is zero.
  • How much value will be generated in each band? For instance, the Prosaic zone is where most of the current attention seems to be. Questions like “Can we make LLMs useful at an industrial scale? How much value can LLMs produce?” seem to be on everyone’s mind.
  • How will the value generated look for each band? What type of value comes out of the Weird zone? What about the Prosaic zone?
  • What kind of advancements – technological or societal – would it take to change the proportions of the zones?

For more adventurous travelers, here are more questions that push the boundaries of the lens:

  • What does “predictable” even mean? If I know English, but don’t have the cultural background to recognize the “Roses are Red” ditty, I might find the “blue” perplexing as a completion. Violets are kind of purplish, actually.
  • What do judgments about predictability of the LLM output tell us about the observer? What can we tell about their expectations, their sense of self, and how they relate to an LLM?
  • What is it that LLMs capture that makes their output predictable? What’s the nature of that information and what might we discern about it?

As you can tell, I am pretty intrigued by the new questions that large language models surface to us. If you’re interested in this subject as well, I hope this lens will be useful to you.

Doers, Thinkers, and Magicians

I’ve been reflecting on my experiences of working with developers and developer ecosystems, and I realized that there’s a really interesting twist on the typical “early adopter” story that’s been hiding in the back of my mind.

Let’s suppose that you and I are spinning up a new developer experience project. We have a fledgling developer surface that we’re rapidly shaping and growing, and trying to get it right by making contact with its intended audience.

The very first of these developers are commonly called early adopters, a term that originated in Everett Rogers’ book Diffusion of Innovations. It is my experience that these early adopters can be further broken into three subgroups: doers, thinkers, and magicians – and the presence of all three is required for the developer surface to successfully navigate toward broad adoption and bring forth our hopes and dreams for it.

More than that, the mix of these subgroups heavily influences the arc that the project will follow, the pattern into which the developer ecosystem around the developer surface will settle – should it succeed.

To explore this notion, let’s zoom in on each subgroup.

💪 Doers

The doer early adopters are typically the most populous sub-group. They are very easy to identify: they do stuff with our developer surface, making things with it, poking at it here and there.

Doers bring energy and create the sense of a bustling community emerging around technology or products powered by technology. They are eager, excited, typically with some tinkering time to spare. They are boisterous, peppering technology or product builders with questions and suggestions. Most of their questions and feedback are of a very practical nature: they just want to make our thing do their bidding.

Doers often don’t have enough technical skills to just start doing what they want – not just because our developer surface is new, but because there might be gaps in their understanding of the surrounding technologies. As such, they need a patient and consistent investment in hand-holding, be that tutorials, hackathons, or individual support.

If our project has doers in the early adopter mix, we have a key ingredient. We have the potential energy that can be transferred into forward progress. This subgroup of early adopters provides valuable insights on the usability of the technology or product, and their contagious enthusiasm attracts new customers.

If our project doesn’t have doers, we might as well not have a project. The absence of doers in the early adopter mix is a warning sign that we might have come up with something that is deeply uninteresting, incomprehensible, or otherwise impossible to access by doers.

🧠 Thinkers

The thinker early adopters usually come in much smaller numbers than doers. In some ways, thinkers can be seen as a subset of doers, with a key distinction: they actually spend time imagining and exploring the possibilities of the technology they are studying. They might be playing with the developer surface themselves, but they could also just be observing doers and identifying interesting potentialities in the churning soup of ideation that the doers produce.

One of my first encounters with thinkers was back in the early 2000s, when blogs.msdn.com was introduced as part of Microsoft Developer Network. I was a fairly new doer-inclined developer myself, and I was fascinated by the blog posts from Dare Obasanjo and Nikhil Kothari on the then-nascent .NET framework. They moved from the pragmatic “here’s how you do <blah>” to open-ended cross-blog conversations about second-order effects and implications of the technology they were using, and they introduced completely new ideas about how it might be used. For me, whole new frontiers opened up and connections were made between concepts that I viewed as entirely unrelated – all the while making me ever more energized about the technology.

This is the role of the thinkers: they hold our developer surface in their hands lightly, turning this way and that, and applying intellect and curiosity to consider its potential.

When our project has thinker early adopters, we have acquired a source of more durable energy. While doers do introduce the initial energy, their explorations, being mostly pragmatic and practical, often peter out and lose steam without the influx of new ideas. Thinkers are the ones who introduce these new ideas, and reinvigorate the excitement and enthusiasm.

Not having thinkers as early adopters means that the project is in danger of getting stuck in a premature local maximum. When the doers uncover all the obvious use cases, these might not be the ones that propel our developer surface toward our intended destination – and we’ll have to contend with being stuck in what we view as “mediocre success”, or just break down our camp and admit defeat. It is the thinkers who help doers move beyond the initial local maxima into adjacent areas that are more likely to hold the value we’re looking for.

✨ Magicians

The final ingredient in the early adopter mix is the magicians. The magician early adopters are even rarer. Even having one is an incredible stroke of fortune, and something that we are obliged to cherish.

The magicians are both doers and thinkers, but they have this weird knack for building amazing things that blow your mind. I wish I knew how that works. In the past, I attempted to grow magicians out of thinkers and/or doers, but there doesn’t seem to be a path from here to there.

The magicians are usually experienced and seasoned developers. They grasp the idea behind our developer surface in seconds, and intuitively see the landscape of opportunities. Then, they reach for the simplest path toward the opportunity that appears most valuable – and for some unfathomable reason – they are usually right. They connect bits and pieces of our stuff into something that suddenly looks solid and – this is a common effect – blindingly obvious. “What the hell?! How?! … Oh… Why didn’t I think of this before?!” is a common reaction to a magician’s artifact.

When I worked on the Chrome Web Platform team, I was very lucky to have a handful of these magicians around me. For some reason, the Chrome team’s DevRel contingent was rife with them. In more recent memory, Simon Willison’s work on llm has the same magician quality.

The presence of magicians significantly strengthens our project’s chances of broad adoption. Like thinkers, the early adopter magicians uplevel the current understanding of what’s possible – but they do it in an explosive, revolutionary way. “Now <bar> is possible, here’s some code” – and everyone freaks out, dropping all the previous work, being able to not just imagine the potential of the next frontier, but actually try it. This explosive amount of energy that the magicians inject into a project can catapult it way beyond our initial intentions. We just need to be patient, hang on for dear life while the starship of a new idea streaks forward, and be ready to explore the crazy new planet it will land on.

Not every developer surface gets magician early adopters. A project can still be moderately successful even in a dearth of magicians. A very common side effect of that is that the developer community grows large and appears vibrant, but the outcomes it produces tend to be on the lower side of our expectations. Low-magician developer ecosystems tend to have very thin long tails, with only a few well-settled participants forming a narrow head.

🧪 The right mix

A reasonable question might be: what is the right mix for our project? Disappointingly, there is no satisfying answer. Developer-oriented projects, at least in my experience, all tend to follow roughly the same shape of proportions: lots of doers, a few thinkers, a couple – if any – of magicians. It is usually the presence of magicians and thinkers that significantly improves the chances of our project going somewhere good. So, if I could offer any advice to budding developer experience makers, it would be this: seek out the thinkers and the magicians. They are the key to passing through the early adoption stage.

Placing and wiring nodes in Breadboard

This one is also a bit on the more technical side. It’s also reflective of where most of my thinking is these days. If you enjoy geeking out on syntaxes and grammars of opinionated Javascript APIs, this will be a fun adventure – and an invitation.

In this essay, I’ll describe the general approach I took in designing the Breadboard library API and the reasoning behind it. All of this is still in flux, just barely making contact with reality.

One of the key things I wanted to accomplish with this project is the ability to express graphs in code. To make this work, I really wanted the syntax to feel light and easy, and take as few characters as possible, while still being easy to grasp. I also wanted the API to feel playful and not too stuffy.

There are four key beats to the overall story of working with the API:

1️⃣ Creating a board and adding kits to it
2️⃣ Placing nodes on the board
3️⃣ Wiring nodes
4️⃣ Running and debugging the board.

Throughout the development cycle, makers will likely spend most of their time in steps 2️⃣ and 3️⃣, and then lean on step 4️⃣ to make the board act according to their intention. To get there with minimal suffering, it seemed important to ensure that placing nodes and wiring them results in code that is still readable and understandable when running the board and debugging it.

This turned out to be a formidable challenge. Unlike trees, directed graphs – and particularly directed graphs with cycles – aren’t as easy for us humans to comprehend. This appears to be particularly true when graphs are described in the sequential medium of code. 

I myself ended up quickly reaching for a way to visualize the boards I was writing. I suspect that most API consumers will want that, too – at least at the beginning. As I started developing more knack for writing graphs in code, I became less reliant on visualizations.

To represent graphs visually, I chose Mermaid, a diagramming and charting library. The choice was easy, because it’s a library that is built into GitHub Markdown, enabling easy documentation of graphs. I am sure there are better ways to represent graphs visually, but I followed my own “one miracle at a time” principle and went with a tool that’s already widely available.
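To give a feel for this workflow, here is a minimal sketch of producing such a diagram. It leans on placing and wiring nodes, which are covered in the next two sections, and it assumes that a Board instance can serialize itself to Mermaid via a `mermaid()` method – treat that method name as an assumption and check the current Breadboard docs.

import { Board } from "@google-labs/breadboard";

const board = new Board();
// Place an input and an output node and connect them with a single wire
// (placing and wiring are explained in the sections below).
board.input().wire("say->hear", board.output());

// Assumption: `mermaid()` returns Mermaid diagram source for this board,
// ready to paste into a Mermaid code block in GitHub Markdown.
console.log(board.mermaid());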

🎛️ Placing nodes on the board

The syntax for placing nodes on the board is largely inspired by D3: the act of placement is a function call. As an example, every Board instance has a node type called `input`. Placing an `input` node on the board is a matter of calling the `input()` function on that instance:

import { Board } from "@google-labs/breadboard";

// create new Board instance
const board = new Board();
// place a node of type `input` on the board.
board.input();

After this call, the board contains an input node.

You can get a reference to it:

const input = board.input();

And then use that reference elsewhere in your code. You can place multiple inputs on the board:

const input1 = board.input();
const input2 = board.input();

Similarly, when adding a new kit to the board, each kit instance has a set of functions that can be called to place nodes of various types on the board to which the kit was added:

import { Starter } from "@google-labs/llm-starter";

// Add new kit to the existing board
const kit = board.addKit(Starter);

// place the `generateText` node on the board.
// for more information about this node type, see:
// https://github.com/google/labs-prototypes/tree/main/seeds/llm-starter#the-generatetext-node
kit.generateText();

Hopefully, this approach will be fairly familiar and uncontroversial to folks who use JS libraries in their work. Now, onto the more hairy (wire-ey?) bits.

🧵 Wiring nodes

To wire nodes, I went with a somewhat unconventional approach. I struggled with a few ideas here, and ended up with a syntax that definitely looks weird, at least at first.

Here’s a brief outline of the crux of the problem. In Breadboard, a wire connects two nodes. Every node has inputs and outputs. For example, the `generateText` node that calls the PaLM API `generateText` method accepts several input properties, like the API key and the text of the prompt, and produces outputs, like the generated text.

So, to make a connection between two nodes meaningful, we need to somehow capture four parameters:

➡️ The tail, or the node from which the wire originates.
⬅️ The head, or the node toward which the wire is directed.
🗣️ The from property, or the output of the tail node from which the wire connects.
👂 The to property, or the input of the head node to which the wire connects.

To make this more concrete, let’s code up a very simple board:

import { Board } from "@google-labs/breadboard";

// create a new board
const board = new Board();
// place input node on the board
const tail = board.input();
// place output node on the board
const head = board.output();

Suppose that next, we would like to connect the property named “say” in `tail` to the property named “hear” in `head`. To do this, I went with the following syntax:

// Wires `tail` node’s output named `say` to `head` node’s input named `hear`.
tail.wire("say->hear", head);

Note that the actual wire is expressed as a string of text. This is a bit unorthodox, but it provides a nice symmetry: the code literally looks like the diagram above. First, there’s the outgoing node, then the wire, and finally the incoming node.
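To make the example concrete end to end, here is a minimal sketch of running the tiny board we just wired. It assumes the Board instance exposes an async `runOnce()` method that takes the input values and resolves with the output values – treat the method name and shape as assumptions and check the current Breadboard docs rather than this sketch.

// Hedged sketch: `runOnce()` and its input/output shape are assumptions.
// The `input` node receives `say`; the `output` node surfaces `hear`.
const result = await board.runOnce({ say: "Hello, Breadboard!" });
console.log(result.hear); // expected to echo the input back through the wire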

This syntax also easily affords fluent interface programming, where I can keep wiring nodes in the same long statement. For example, here’s how the LLM-powered calculator pattern from the post about AI patterns looks when written with the Breadboard library:

math.input({ $id: "math-question" }).wire(
  "text->question",
  kit
    .promptTemplate(
      "Translate the math problem below into a JavaScript function named" +
      "`compute` that can be executed to provide the answer to the" +
      "problem\nMath Problem: {{question}}\nSolution:",
      { $id: "math-function" }
    )
    .wire(
      "prompt->text",
      kit
        .generateText({ $id: "math-function-completion" })
        .wire(
          "completion->code",
          kit
            .runJavascript("compute->", { $id: "compute" })
            .wire("result->text", math.output({ $id: "print" }))
        )
        .wire("<-PALM_KEY", kit.secrets(["PALM_KEY"]))
    )
);

Based on early feedback, there’s barely a middle ground of reactions to this choice of syntax. People either love it and find it super-cute and descriptive (“See?! It literally looks like a graph!”) or they hate it and never want to use it again (“What are all these strings? And why is that arrow pointing backward?!”). Maybe such a contrast of opinions is a good thing?

However, aside from differences in taste, the biggest downside of this approach is that the wire is expressed as a string: there are plenty of opportunities to make mistakes between these double-quotes. Especially in the strongly-typed land of TypeScript, this feels like a loss of fidelity – a black hole in the otherwise tight system. I have already found myself frustrated by a simple misspelling in the wire string, and it seems like a real problem.

I played briefly with TypeScript template literal types, and even built a prototype that can show syntax errors when the nodes are miswired. However, I keep wondering – maybe there’s an even better way to do that?
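To give a flavor of what such a prototype can check, here is a hypothetical sketch – not the actual Breadboard code – of how TypeScript template literal types can both constrain and pull apart a wire string at compile time, so that a malformed wire surfaces as a type error:

// Hypothetical illustration only; the names below are not part of the Breadboard API.
// A wire spec must contain the "->" separator.
type WireSpec = `${string}->${string}`;

// Extract the two halves of the spec at the type level.
type ParsedWire<S extends string> =
  S extends `${infer From}->${infer To}` ? { from: From; to: To } : never;

type Example = ParsedWire<"say->hear">; // { from: "say"; to: "hear" }

// A toy `wire` signature that rejects strings without a separator.
declare function wire(spec: WireSpec, head: unknown): void;

wire("say->hear", {}); // OK
// wire("sayhear", {}); // Type error: not assignable to `${string}->${string}`

Template literal types get us part of the way; making the property names themselves type-checked against a node’s actual inputs and outputs is the harder, more interesting problem.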

So here’s an invitation: if coming up with a well-crafted TypeScript/Javascript API is something that you’re excited about, please come join our little Discord and help us Breadboard folks find an even better way to capture graphs in code. We would love your help and appreciate your wisdom.

Traits of a maker mindset

The whole concept of makers has been on my and my colleagues’ minds a lot, especially with makers’ current rise in prominence alongside generative AI.

A fun question popped out in one of the conversations: “How might one tell a maker from a non-maker?” The idea of a “non-maker” immediately felt a bit ridiculous. Since we’ve already established that “maker” is a mindset, it’s pretty clear that shifting out of that mindset is going to land us in “non-maker” territory. But when we are in the maker mindset, what are the traits and characteristics that might stand out?

After talking to a few folks about this, a rough list started to emerge. It’s still not quite right, but I thought I would share it early for your perusal.

So far, I have four traits that seem to resonate whenever I talk about a “maker mindset” to others. These traits are option-seeking, craving creative friction, zagging, and optimism.

🔢 Option-seeking

When in the maker mindset, we tend to seek more options. If offered a single way to solve a problem, no matter how simple and elegant, a maker in me will perceive it with skepticism. There’s something about optionality and preserving the agency to choose these options that is highly important to a maker mindset. The more knobs, the merrier. The more choices, the more exciting. This is plainly in conflict with the mainstream theories of user experience for common consumers. When I am in a non-maker mindset, I want a simple single solution that gets the job done. When wearing a maker hat, I will steer away from it.

This might be one of the reasons why open source projects and modular solutions are attractive to makers. Being able to pick and choose whichever pieces I want, combine them in whatever way I want, and retain the option to change my mind is a big part of the whole maker experience.

💪 Craving creative friction

Very related, a maker mindset frowns on well-solved problems. We will rarely find makers tinkering with obvious or fully understood problem spaces. For makers, things have to be difficult and challenging to be attractive. Too much polish is a bit of a letdown – it means that someone already solved all the fun problems. The maker mindset cherishes creative friction – the presence of a challenge in the process of making is what stirs creative juices.

This is why makers don’t mind messing with stuff that isn’t yet fully baked. One of my colleagues put it as “makers are in it for the problems, not solutions.” This is a bit too blunt, but I can’t disagree with the sentiment. For makers, it’s about the journey, though the tantalizing promise of a destination definitely helps.

🔀 Zagging

Makers love to zag when everyone zigs. When in the maker mindset, we tend to look for opportunities that are odd-shaped compared to what everyone else is seeking.

Makers rejoice when it looks like they’re doing something weird. Being outside of the curve means that there’s a chance we’re ahead of it. Makers wholeheartedly take this chance. Even if zagging doesn’t pay off, the thrill of exploring the wilderness is a powerful force that animates makers.

When everyone is making an AI-powered chatbot, makers are playing with meta-reasoning and autonomous agents. When everyone finally catches on to agents, makers move on to something else.

☀️ Optimism

This one I am least certain about. It definitely rings true, but I don’t know if the word “optimism” captures the gist. When in the maker mindset, we are driven by a belief that our actions will lead to some outsized outcomes. Somehow, somewhere, we will hit that exponential curve, and things will truly get out of control. There’s a sense of “it’s definitely not working now, but just wait and see” that is like fresh air for makers.

Many makers are techno-optimists, who – often implicitly – believe that technology will solve all problems and do more good than evil over the long run. After all, making something often means creating new technology – be it physical, organizational, or social. And definitely, most makers believe that making something is better than not making it. Making is art, and all makers’ art has purpose, animated by an often completely unfounded confidence in better outcomes.

📐 Designing for makers

Despite this list being so unkempt, we can start gleaning some interesting insights about designing user experiences that attract makers.

Makers flip the script on the conventional wisdom of delivering polished, simple experiences to users. Steve Krug’s “Don’t make me think” turns into “Give me an interesting puzzle!” and sometimes into “Ooh, this mess of a product looks perfect for my project”. For makers, rough edges signal exciting possibilities. I am still learning what this all means, but it’s starting to feel like product design for makers is dramatically different from design for users not in the maker mindset.

If such a difference does indeed exist, it’s interesting to consider how a product might be perceived by the same person, but from different mindsets. And perhaps even more granularly: some tools I want to have lots of knobs and options and rough/unexplored edges, and some of them I just want to work, even when I am in the maker mindset.

It seems overwhelming – and likely foolhardy – to establish a precise taxonomy here. The only recipe I know is to have the maker’s intuition. To design for makers, one has to have accumulated a lot of experience of being a maker. There doesn’t seem to be any way around that.

The engine and the car

The whole large language model space is brand new, and there are lots of folks trying to make sense of it. If you’re one of those folks, here’s an analogy that might come in handy.

Any gasoline-powered car has an engine. This engine is typically something we refer to as a “V8” or “an inline 4” or sometimes even a “Wankel Rotary Engine”. Engines are super-cool. There are many engine geeks out there – so many that they warrant a video game written for them.

However, engines aren’t cars. Cars are much more than their engines. Though engines are definitely at the heart of every car, cars have many additional systems around them: fuel, electrical, steering, etc. Not to mention safety features to protect the passengers and the driver, and a whole set of comforts that we enjoy in a modern car. Pressing a button to roll down a window is not something that is done by the engine, but it’s definitely part of the whole car experience.

When we talk about this generation of AI systems, we typically talk about large language models (LLMs). In this analogy, LLMs are like engines. They are amazing! They are able to generate text by making inferences from the massive parametric memory accrued through training over a massive corpus of information.

However, they aren’t cars. One of the most common mistakes that I see being made is confusing engines (LLMs) with cars (LLM-based products). This is so common that even people who work on those products sometimes miss the distinction.

When I talk to the users of the PaLM API, I see this confusion show up frequently in this manner: developers want to reproduce results from LLM-based products like Bard or ChatGPT. When they try to get the same results from the API, they are disappointed that they don’t match. Factuality is lacking, the API can’t go to the internet and fetch an article, etc.

In doing so, they confuse the engine with the car: the API, which offers access to the model, is not the same as the products built with it. With an LLM API, we have a big-block V8. To make it go down the road, we still need to build the car around it.

To build on this analogy, we live in the early age of cars: the engines still figure prominently in the appearance and daily experience of a vehicle. We still have to turn the crank to start the car, oil the engine frequently, and be savvy enough to fix minor problems that will definitely arise.

As our cars become more refined, the engines get relegated into a well-insulated compartment. Users of cars rarely see them or operate on them directly.

This is already happening with LLM-based products. Very few current offerings that you might encounter in public use are LLMs that are directly exposed to the user.

So, when you use a chat-based system, please be aware that this is a car, not the engine. It’s a tangle of various AI patterns that are carefully orchestrated to work as one coherent product. There is likely a reasoning pattern at the front, which relies on an LLM to understand the question and find the right tool to answer it. There is likely a growing collection of such tools – each an AI pattern in itself. There are likely some bits for making sure the results are factual, grounded in sources, and safe.

As the LLM products become more refined, the actual value niches for LLMs become more and more recognizable. Instead of thinking of one large LLM that does everything, we might be seeing specialization: LLMs that are purpose-designed for reasoning, narration, classification, code completion, etc. Each might not be super-interesting in itself, but might make a lot of sense in the overall car of an LLM-based product.

Perhaps unsurprisingly, the next generation of cars might not even have the same kind of engine. While the window control buttons and the steering systems remain the same, the lofty gasoline engines are being replaced with electric motors that fit into a fraction of space. The car experience remains more or less the same (aside from the annoying/exhilarating engine noise), but the source of locomotion changes entirely.

It is possible that something like this will happen with LLMs and LLM-based products as well. The new open space that was created by LLMs will be reshaped – perhaps multiple times! – as we discover how the actual products are used. 

Steady winds, doldrums, and hurricanes

It just so happened that this year, many of my friends and colleagues ended up looking for new opportunities, and in our conversations, I ended up shaping this metaphor. Like most metaphors, it’s not perfect, but hopefully, it will stir some new insights for you.

We kept trying to describe the energy within organizations and the animating forces that move them. These forces can make our lives inside these organizations a delight – or a complete and utter misery. It seemed like a good idea to understand how these forces might influence us and find ways to sense these forces early. Preferably, even before committing to join a new team.

The idea of presenting these forces as winds seemed rather generative. If we look at the innovation S-curve, we can spot three different kinds of wind: steady winds, doldrums, and hurricanes. They don’t exactly match the stages I outlined back in the original article. Instead, these winds follow the angle of the S-curve slope.

⛵The steady winds

Steady winds are consistent. We can feel them going in one direction and they change infrequently. Apparently sailors love them, because they provide a predictable way to navigate. Even if it’s not a tailwind, a steady wind can be harnessed through tacking.

Similarly, organizations that are in the upslope of their development tend to have a relatively consistent animating force that feels like a steady wind. Usually, there’s some big idea, some intention, and a group of highly-motivated individuals who set the direction of this wind.

We can feel it as soon as we step into an organization. It usually appears as the ambition of the charismatic leader/founder, who knows exactly what they want and is doing everything they can to make it possible. More rarely, it might also appear as a set of ideals that depict some future state of the world – and this team has the fire (and funding) to bring it forth.

Steady winds aren’t always great. Sometimes, a steady wind’s direction is simply incompatible with where we want to go. It might trigger aversion in us, or be in discord with our own principles. The leader might be charismatic, yet have character quirks we deem appalling. The big idea might indeed be big, but no matter how much we try to suspend disbelief, we keep finding it laughable.

At the same time, steady winds bring clarity. They give us a very good idea of what this team is about and where they are going. These folks are going someplace. It’s on us to choose to go there with them.

When considering a new team and sensing a steady wind that moves it, ask yourself: is this wind aligned with what I myself want to do? Does it stir fire in my belly? At the very least, can I tack into this wind in a way that moves me where I want to go? And of course: am I at the place where I want to go on an adventure?

Because joining steady-wind teams definitely brings adventure. It might be glorious and awesome, or it might be like the Donner party, with all the fixin’s of freezing to death, scurvy, and/or dysentery. Only time will tell.

If the wind is favorable and adventure is what you seek, such a team might be a good fit.

⛳ The doldrums

Prior to the invention of motors, doldrums were a terrifying thing for sailors. Doldrums meant that, to go anywhere, we had to break out our oars and turn our own sweat into motion. There was no wind to help us along.

Organizations tend to experience doldrums at the top of the S-curve. Once the niche is fully explored and the product or service is optimized to fit it exactly, it is really not clear where to go next. All successful products end up experiencing this. We can see this as fewer interesting changes in them, and a deluge of incremental improvements that may sound exciting, but don’t actually add up to anything like the stuff the organizations used to produce at the upslope.

To get anything done in this organization requires some form of consensus. There are usually processes. Approvals. Reviews. Endless, exhausting discussions. When in doldrums, there’s a prevailing sense of powerlessness, often accompanied by a weird combination of comfort and toil. Everything is hard, but at least it’s exactly the same as yesterday.

Leaders who used to produce the steady wind at the upslope typically leave when they encounter the doldrums. We won/lost. Why stay? Instead, they are replaced by sailors. These leaders concentrate more on preserving what was accumulated so far. Risk is frowned upon. 

It’s not like nothing gets done in organizations stuck in doldrums. There’s always activity, and an appearance of movement. To create this appearance, there’s a syndrome of chronic bigness: every new initiative is bigger than the previous one, ever more bombastically described and painted in brighter colors. Underneath is the same dull surface of still water.

Doldrums aren’t necessarily a red flag for joining. If what you’re looking for is the steady stillness of boring, yet never-ending work, that might just be the place. Large bureaucracies like government agencies and corporate giants have large organizational swaths that live in the doldrums – and necessarily so. Not everything needs to be an adventure. Sometimes, the slow and steady beat of the oars is the only thing that keeps the grand ship inching forward.

However, if you’re seeking something to fill your sails, please keep walking. Committing to a doldrums team will suck the soul out of you and is not worth it.

🌀 The hurricane

The final part of our story is hurricanes. Sailors caught in storms just hang on for their lives, trying to survive and keep the ship afloat.

Similarly, organizations can find themselves in turbulent waters. This typically happens on the downslope of the innovation S-curve, when the quiet ride through the doldrums is eventually replaced by contact with reality.

In the hurricane, there’s lots of wind. It’s blowing in all directions. To continue our metaphor, the wind is the animating force that is usually created by the organization’s leaders and their intentions. In the hurricane, this intention is chaotic and unpredictable. And it’s usually reactive, spurred by some external threat.

The downslope of the S-curve isn’t fun. The collective anxiety of leaders who got used to the doldrums creates a vicious cycle, exacerbating the situation further. The overall direction is unclear, but not for the lack of effort. There’s lots of movement, and lots of force, all going in circles.

On very, very rare occasions, a new leader emerges and manages to set the steady wind, bringing the team out of chaos. I have seen it happen, but haven’t experienced it myself. 

Unless you’re a total glutton for punishment or have a severe savior complex itch, it is difficult to recommend joining an organization in the hurricane. The trouble is, it’s often hard to tell. It is in nobody’s interest to reveal the true state of disorder to the candidates. So the hurricane-embattled team might appear as either doldrums or steady winds, depending on who you ask.

One of my colleagues recommended this approach: find someone on the inside. Someone who might still be there or left recently. Ask them candidly: “is this a sh*t show?” Watch their reaction and prod a bit. Look for stories that sound like aimless grasping for straws and high anxiety among the team’s leaders. Those are the telltale signs of the hurricane.

Diving into unpredictability

My previous essay on the topic of unpredictability generated a few insightful comments from my colleagues and friends. One of them led to this vignette.

It is very tempting to imagine that some people are just generally less susceptible to the discomfort of unpredictability than others. It might even feel like coming up with a way to gauge one’s ability to thrive in unpredictable situations would be a useful tool.

My intuition is that this stance needs a bit more nuance. As humans, we all abhor unpredictability. We rarely actually “thrive” in it, at least over the long run. The metaphor that comes to mind is diving.

Some people are great divers. They can spend a significant amount of time under water. They can go deep and explore the parts of the seabed inaccessible to anyone else. At the same time, nobody would claim that great divers can actually live in the depths of the sea. We all need to come up for air.

In this metaphor, unpredictability is water. If we stay in it for too long, we drown. I see the desire for predictability – or homeostasis – as a gravity-like force that animates all of us. It isn’t something we can completely detach from – though stoics and buddhists try. Just like air that we need to breathe, predictability is something that is essential for nourishing our minds. Our minds are predictive systems. Unpredictability is anti-mind.

Great divers – those who can endure unpredictability better than others – are those who invest generously into techniques and strategies that enable them to stay in the deep longer and even enjoy it. However, prolonged exposure to it will still take its toll, and the need to come up for air will always win over.

Diving into unpredictability is hard work. Just like with any good diver, if they are making it look effortless, we can bet that a lot of effort was put in before. And just like with any good diver, the “true pirates” who appear to thrive in unpredictability are nearly always those with decades of practice, with all the blood, sweat, tears, and scars such a practice entails. One of the foundational elements of this practice is finding a way back to the fresh air of a predictable environment.

Sometimes you gotta hit the wall

I’ve probably written about this a few years back, but I still find this mantra useful and worth repeating. It applies to the situations where we’re stuck but we don’t know that we’re stuck – not yet.

When we’re in this state, we have a sense that we’re still moving forward, and we’re making all the right moves. We get upset when our friends or colleagues cautiously share with us that we might be spinning our wheels. Yeah, there’s some loss of traction, but if we just keep going, we will figure this thing out. Just one more push.

Particularly for technologists and other admirers of modernist thinking, the likelihood of becoming stuck in this way somewhere along our careers is pretty high. The idea that if we know what we’re doing and we’re doing everything right, then things should work out according to our plans – it’s just so damn seductive.

We can last quite a bit of time in this purgatory of delusion. There are just so many options to choose from. It’s the environment around us that is all wrong. Someone is actively conspiring against us. There are some indicators that show clearly that we’re still moving forth as planned. The more clever and quick-thinking we are, the more likely we are to come up with a story that keeps us stuck.

Inevitably, there’s a moment when it all comes apart. We finally hit the wall. We’re in shock, feeling injured by the cruel reality and betrayed by it. But – it is only when we hit that wall that we get the chance for self-reflection. There’s an opportunity, when the shell of self-delusion is cracked, to actually gain some clarity. We might remember our colleagues’ gentle hints and worried faces, the early signs of stuckness we’ve chosen to ignore, and the now-obviously illusory stories we’ve told ourselves.

Should we experience it, this moment is a significant milestone. It allows us to create a little space between reality and the stories we tell ourselves. It allows us to hold our stories as objects instead of being subject to them. Experienced once, it’s a perspective that can be temporarily lost, but never fully forgotten. Next time the allure of modernism tempts us, we might still feel the pull – but think twice about answering the call. Once we’ve hit that wall, we’ve learned that “knowing what we’re doing” and “doing everything right” are just stories we tell ourselves, and they have little to nothing to do with reality.

The somewhat sad part is that this lesson cannot be taught. No amount of explanation or teaching will bring one closer to the precious insight without the painful experiential part. This particular bit of wisdom can only be gained by face-planting into the unyielding, uncaring reality at full speed. Sometimes you just gotta hit the wall.