The gifts of failure

I may have written about this before. It’s a trick that I’ve learned intuitively over the years, and I’ve found it tremendously helpful. It tends to work best in situations where the outcome is uncertain.

When adventuring forth into the unknown, we usually focus on the thing that we want to achieve. Even if the chances of it happening are slim, we set our eyes on the prize and commit to persevere through whatever challenges are thrown at us.

It is just before this point that I usually pause and spend a bit of time contemplating the gifts of failure. The question I ask myself is simple: “What good can come out of failing in this venture?” There are usually plenty of platitudes about learning from failure and all that wonderful stuff that is talked about incessantly in books about innovation. This is not what I am looking for.

When I ask that question, I am looking for tangible benefits that might arise as a result of the failure. Once we’ve reached the threshold where we definitely know we’ve been defeated – by circumstances, a worthy foe, or our own limitations – what are we left with?

Put differently, when all that’s left is to pick up the broken pieces of our failed enterprise, how quickly can we repurpose those pieces to start anew?

If we don’t contemplate this question earnestly, the answer is usually “not much”. In such cases, the collapse is dramatic and total. Very little of the effort put into the project is recoverable. If anything good comes of it, it’s sheer luck.

The stepping stones tactic is one way through which we can maximize this luck. In Why Greatness Cannot Be Planned, the authors argued that small increments just outside of the adjacent possible – the stepping stones! – are what ultimately feed innovation and big discoveries.

By reflecting on the gifts of failure, we can sketch our path as a series of stepping stones that we create for ourselves. What are the stepping stones we will need to produce anyway as we walk toward our vision? Which ones can take on a life of their own and serve us even after our initial project has failed? Through this lens, the product of our endeavor will no longer look like a monolithic thing. Instead, it will be a modular construct that can flexibly rearrange itself into something completely different, becoming highly resilient.

As another tactic more useful for smaller projects, I sometimes maintain a small portfolio of secondary benefits that could be promoted as primary. These benefits must be somewhat independent of each other, so that if the initial attempts prove fruitless, there’s another path forward that opens up by refocusing on another potentiality.

Both of these tactics arise from reflecting on the original question. When we accept that in highly uncertain, volatile environments, any exploration is much more likely to fail than to succeed, we stop seeing the target of our adventure as the mythical “end all be all”, and focus on accumulating opportunities and potentialities along the way.

Gravity and water

I have been lucky enough to be part of the process of shaping several organizations, yet only uncovered this nugget of insight very recently. Very briefly, it helps to discern the forces of the broader environment that influence us without our awareness.

It is very rare that the leaders of a newly formed organization want it to behave exactly the same as the surrounding environment. If that is the case, we have a distinction without a difference, a label rather than an organization. Teams and organizations are created to do something (even if slightly) different from the rest of the environment. At the same time,  they are still in that environment and are influenced by the forces of this environment.

Suppose I want to create a team that innovates daringly while situated within an organization that is designed for keeping an important chunk of infrastructure running. I can speak all I want about inspiring creativity and fearless questing for ideas to my team – but if I don’t account for the fact that most processes around me are designed and incentives are structured for a “keeping” mission, none of that may matter. Implacably, my attempts to encourage innovation will come out shaped like infrastructure improvement projects. It will feel like an invisible force being exerted on the newly born team, veering it off our intended course.

More often than not – especially in larger organizations – such forces are invisible until we start deviating from their direction. These forces are meant to organize, to bring various people’s intentions into alignment in service of the mission. Only when we try to do something different do we start experiencing friction.

Two questions might help you start making these more visible.

The first question is “what is our gravity?” This will take a little bit of self-reflection. We rarely ask ourselves why our feet stick magically to the ground. Similarly, organizations have gravity-like forces that everyone is used to. They tend to hide in structural and cultural bits of the org. If you’d like, use my fourplex of constraints to investigate. Which of these do we need to counteract when deviating from the default organization configuration? OKRs are probably my favorite example here. No matter how much we try to overcome it, the language of OKRs inexorably drives us toward short-term thinking. If we try to build an organization that thinks in a longer timeframe than the rest of its OKR-driven environment, we will have a difficult time placing ourselves within it. “Wait, these people don’t have OKRs? Do they even do anything?”

The second question is “what is our water?”, echoing the parable presented by David Foster Wallace. What are the things that we take for granted to such a degree that we don’t even notice them anymore? How do these things influence how we think and act? Where are they out of alignment with our intentions? If our team has a strong engineering culture, we will naturally have biases that will filter out less “engineer-ey” ideas – and that might be alright, unless these ideas are essential for making progress toward our intended mission.

The answers to these two questions are nearly always tentative and wrong, at least initially. It is important to keep examining them as time goes on.

A while back, one of my colleagues was reflecting on their first year on a team we both worked on at the time: “At first, I thought the hardest part would be to get creativity and inspiration going. Surely, the problem was that we weren’t thinking cleverly enough. Oh boy, was I wrong. It’s exactly the opposite! Everyone here is incredibly smart and is bursting with creative energy. Now I believe that where we struggle the most is in applying all that energy in a way that leads to shared learning that persists.” My colleague’s guess was even more refined and nuanced the next year.

Understanding what stands between us and where we want to go is key to getting there. If we don’t stop to examine the forces that hold us in place, the best we can hope for is an external shock that jolts us out of our trap.

Models and structured reasoning

One of my colleagues quipped recently that instruction tuning might not be about teaching large language models (LLMs) what to do, but rather about reminding LLMs of the wealth of wisdom they already have at their foundation. Here’s a rather unkempt vignette on what wisdom I would like an LLM to recall.

I would like to help a model remember how to solve problems that are a bit more complex, and require some rigor in reasoning. We are well underway in our first generation of LLM reasoning explorations, and I am looking a bit behind the current horizon.

Here is my hypothesis: an LLM that is capable of reasoning about a broad variety of problems would be biased to structure its reasoning in a certain way. That is, when we ask it to “think step by step”, it will rely on some predefined structure of these steps while performing a text completion.

If this hypothesis is true, reasoning structures will be at the core of dramatically improving the reasoning ability of the models. Put differently, I anticipate that the next cycle of model evolution will be about exploring the space of reasoning structures and finding more and more capable local maxima of such structures.

The future of instruction tuning might not be about the quantity or even quality of the data, but rather about the rigor of the overarching reasoning structure in the data sets. 

To make my “reasoning structure” concept a bit clearer, I tried to conjure one up relying on the problem understanding framework that I created a while back. This is just one rough sketch, but it should give you some idea of what the heck I am talking about.

🎱 Guessing the problem class

To quickly recap, I posit in my framework that there are three classes of problems: solved, solvable, and unsolvable.

LLMs are already able to address solved problems with moderate success. The implicit code execution shows a straightforward way to approach them: once the problem is deemed to be something that has already been solved, we ask an LLM to write the minimal bit of code to produce the solution. Crunching numbers, reversing words, making charts, and various similar kinds of problems fall in this class.

LLMs are making decent progress on tackling solvable problems. Solvable problems have no well-established solution, but we know what the solution looks like – also known as “the ground truth”. The presence of this ground truth is what enables reasoning boxes, and their iterative approach to finding the solution.

The yet untapped potential of LLMs is in making sense of unsolvable problems. Unsolvable problems are highly adaptive, and often appear to be solving us rather than the other way around. Things like relationships, organizations, politics, and ecosystems all fall into this class of problems – and usually, we don’t end up solving them. Instead, we navigate through them as best we can. A good way to picture an unsolvable problem is thinking of an open ocean. We can’t solve the high seas, but we can navigate them.

The key to navigating an unsolvable problem is in investing a bit of time into sensing the various forces inside of the problem that make it impossible to solve once and for all. I call this process “framing the problem”. A successful framing produces one or more solvable problems that, when solved, help us navigate the unsolvable problem.

Let’s use this simple taxonomy as the foundation of a reasoning structure.

 

The structure will likely begin with making a guess at the class of the problem: are we looking at a solved, solvable, or unsolvable problem? I experimented a bit with asking LLMs about Cynefin, and was pleasantly surprised by the quality of these guesses. Perhaps this is the foundational wisdom of the model that my colleague was talking about? 

Based on the outcome of the initial guess, the structure will guide the reasoning to branch and engage with the problem differently based on its class.
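To make that first step a bit more tangible, here is a rough sketch of what a classification prompt might look like. Everything in it is a made-up illustration – the prompt wording and the helper functions are mine, not an established technique:

```typescript
// A hypothetical sketch of the first step of the reasoning structure:
// asking the model to guess the class of a problem before engaging with it.
// The prompt wording and the ProblemClass type are illustrative assumptions.
type ProblemClass = "solved" | "solvable" | "unsolvable";

function classificationPrompt(problem: string): string {
  return [
    "Before attempting an answer, classify the following problem.",
    "A solved problem has a well-known solution that can be retrieved or computed.",
    "A solvable problem has no established solution, but a solution can be verified against ground truth.",
    "An unsolvable problem is adaptive and can only be navigated, not solved once and for all.",
    "",
    `Problem: ${problem}`,
    "",
    "Reply with exactly one word: solved, solvable, or unsolvable.",
  ].join("\n");
}

// Branch the rest of the reasoning based on the (hypothetical) model reply.
function parseProblemClass(reply: string): ProblemClass | undefined {
  const word = reply.trim().toLowerCase();
  return word === "solved" || word === "solvable" || word === "unsolvable"
    ? (word as ProblemClass)
    : undefined;
}

console.log(classificationPrompt("Should I stay in college or start my own AI company?"));
console.log(parseProblemClass("Unsolvable")); // "unsolvable"
```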

🔨 Handling solved problems

For solved problems, an LLM would attempt to just answer them, or write a bit of code as I mentioned above – just as most question-answering LLM-based applications do today. Like so:

Question: What is the square root of 38495?
Response: The problem of determining a square root of a number is a solved problem. 

Here is a program that returns a square root of 38495:
```javascript
console.log(Math.sqrt(38495));
```
 The answer is the output of the program.

For this problem class, this is not so much solving the problem as retrieving the solution from the depths of the model’s wisdom.

🧩 Actually solving problems

If the problem is solvable, the reasoning structure will assume an iterative stance with two interlinked lines of reasoning.

The first line of reasoning tries to answer the question of “Has the problem been solved?”. This is the actual process of solving the problem. 

The second line of reasoning tries to answer the question “Do I know everything I need to know to verify that the problem has been solved?”. This is the process of establishing the problem context, which determines the constraints that could be used to verify that the solution is correct. This context serves as the ground truth for the problem.

Establishing problem context is a problem in itself, and the reasoning structure will branch further into various techniques that could be effective. For example, the LLM might decide to write a suite of tests that, when run, will prove or disprove that the answer is correct. It could also opt to ask the user to provide feedback on the answer and learn from that feedback. 

It might even choose an “inner monologue” technique, where the ground truth is present only in a Socratic sense, with the LLM constantly questioning its own assumptions and making progress by spotting inconsistencies in them. As long as the assumptions are visible, there is an opportunity to question and correct them.
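Circling back to the test-suite option for a moment, here is what a hypothetical ground-truth suite for a to-do list app might look like, using Node’s built-in test runner. The TodoList shape and the placeholder factory are my own assumptions, standing in for whatever code the model actually produces:

```typescript
// A hypothetical ground-truth suite for the to-do list problem.
// The TodoList interface and createTodoList factory are placeholders; the tests
// encode the problem context that a solution must satisfy.
import { test } from "node:test";
import assert from "node:assert/strict";

interface TodoList {
  items(): string[];
  add(item: string): void;
  remove(item: string): void;
}

// Placeholder implementation so the sketch runs; the model's answer would replace it.
function createTodoList(): TodoList {
  const items: string[] = [];
  return {
    items: () => [...items],
    add: (item) => void items.push(item),
    remove: (item) => {
      const i = items.indexOf(item);
      if (i >= 0) items.splice(i, 1);
    },
  };
}

test("supports adding items", () => {
  const list = createTodoList();
  list.add("buy milk");
  assert.deepEqual(list.items(), ["buy milk"]);
});

test("supports removing items", () => {
  const list = createTodoList();
  list.add("buy milk");
  list.remove("buy milk");
  assert.deepEqual(list.items(), []);
});
```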

Whatever the choice, the reasoning structure will need to presume that both solving and context-finding processes are iterative. The LLM will no longer attempt to solve the problem in one completion, but rather focus on making progress toward the solution in multiple iterations and prefer frequent contact with the ground truth. 

This stance is different from the typical “question-answer” one of the solved problem class. I’ve been able to get the current generation of models to do so with mixed results, and it feels like an important capability to develop, perhaps through instruction tuning.
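To sketch how those two interlinked lines of reasoning might fit together, here is a deliberately naive loop written with stub functions. The names and shapes are entirely my own assumptions – placeholders for whatever mechanism actually drives the completions:

```typescript
// A very rough, hypothetical sketch of the outer loop for solvable problems.
// The stub functions below stand in for LLM completions guided by the reasoning structure.
interface ProblemContext {
  requirements: string[];
  isComplete: boolean; // "Do I know everything I need to verify a solution?"
}

interface Attempt {
  solution: string;
  satisfiesContext: boolean; // "Has the problem been solved?"
}

// Placeholder stubs: in reality, these would each be completions (or user interactions).
async function refineContext(problem: string, context: ProblemContext): Promise<ProblemContext> {
  return {
    requirements: [...context.requirements, `a requirement gleaned from: ${problem}`],
    isComplete: true,
  };
}
async function proposeSolution(problem: string, context: ProblemContext): Promise<string> {
  return `a candidate solution for: ${problem}`;
}
async function checkAgainstContext(solution: string, context: ProblemContext): Promise<boolean> {
  return solution.length > 0;
}

async function solveIteratively(problem: string, maxIterations = 10): Promise<Attempt | undefined> {
  let context: ProblemContext = { requirements: [], isComplete: false };
  for (let i = 0; i < maxIterations; i++) {
    // Line of reasoning #2: keep establishing the problem context (the ground truth).
    context = await refineContext(problem, context);
    if (!context.isComplete) continue;
    // Line of reasoning #1: attempt a solution and verify it against the context.
    const candidate = await proposeSolution(problem, context);
    if (await checkAgainstContext(candidate, context)) {
      return { solution: candidate, satisfiesContext: true };
    }
  }
  return undefined; // frequent contact with the ground truth, but no luck yet
}

solveIteratively("Write a to-do list app").then((attempt) => console.log(attempt));
```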

The model will also need to be capable of keeping track of where in the reasoning structure it is. Given that solving a problem might take multiple completions, there needs to be something that helps the model orient in relation to the overall progress of solving the problem.

Question: Write a to-do list app
Response: Writing a to-do list app is a solvable problem, because it is very clear when the app is built and whether it allows viewing the to-do list, as well as adding and removing items from the list. 

What information am I missing to solve this problem?

I need to ask the following questions: 
- Other than showing a to-do list, adding, and removing items from the to-do list, what other features are important to you? 
- Should this app support multiple lists or is just one list sufficient?
- On what device will you use this app?

Then, a few interactions later:

Question: Write a to-do list app
Context: This is a solvable problem, and we are currently in the process of determining requirements. Previous interactions established the following requirements: 
- app will support only one to-do list
- app will run on an iPhone
- app will allow viewing one to-do list
- it should be possible to add and remove items on the list
- it should be possible to sort items on the list. 
Response: <completion continues here>
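Purely as an illustration, the Context block above could also be carried around as structured data rather than prose. Here is one hypothetical shape for it – nothing about this shape is established, it simply mirrors the example:

```typescript
// A hypothetical, structured form of the "Context" block from the example above.
// The field names are assumptions; they simply mirror the prose version.
interface RetainedProblemContext {
  problem: string;
  problemClass: "solved" | "solvable" | "unsolvable";
  phase: "determining requirements" | "solving" | "verifying";
  establishedRequirements: string[];
}

const todoAppContext: RetainedProblemContext = {
  problem: "Write a to-do list app",
  problemClass: "solvable",
  phase: "determining requirements",
  establishedRequirements: [
    "app will support only one to-do list",
    "app will run on an iPhone",
    "app will allow viewing one to-do list",
    "it should be possible to add and remove items on the list",
    "it should be possible to sort items on the list",
  ],
};

console.log(JSON.stringify(todoAppContext, null, 2));
```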

Today, this tracking is primarily done by recording the history of the problem-solving interactions, like a chat transcript. Most chat models will accept the history of the previous messages and are tuned to treat it as guidance for the text completion. I can’t wait for new, more effective methods of retaining problem context to be discovered.

For instance, a more sophisticated reasoning structure might guide a model to detect that the problem could be split into multiple sub-problems that could be solved more easily, in parallel or in sequence. In such cases, the problem context would map into multiple sub-problems and eventually reduce back to one, once the sub-problems are solved. For example:

Question: Write an app that allows me to create a to-do list
Response: Writing a to-do list app is a solvable problem, because it can be tested whether the app:
- successfully starts
- allows viewing the to-do list
- supports adding and removing items from the list.

To simplify the problem, I will break it up into two sub-problems:
- the backend that provides the to-do list API for storing and retrieving the list
- the frontend that consumes this API.

The context for the backend sub-problem will be:
- write a backend for a to-do list app
<detailed backend spec goes here>
- write an API for the backend that supports three methods: retrieve the list, add a new item, and remove an existing item.
<detailed spec of the API goes here>

The context for the frontend sub-problem will be:
- write a Web app frontend for a to-do list app
- the frontend will support viewing the to-do list on mobile devices, and specifically on Safari for iPhone
<detailed frontend spec goes here>
- the frontend will integrate with the following API
<detailed spec of the API goes here> 

🌀 Framing unsolvable problems

When the LLM guesses the problem as unsolvable, the reasoning structure guides the process toward discerning the tangle of forces that creates unsolvability. This part of the process is less about finding solutions and more about building a reasonable dynamic model of the problem.

I am not yet sure how such a model could be represented efficiently as text, but it will likely start with a list of statements that form into a causal graph. For example, here’s how a model may respond to a complex question:

Question: Should I stay in college or start my own AI company?
Response: Making a decision on whether to stay in college or start an AI company appears to be an unsolvable problem. There does not seem to be one right answer, and it is hard to check the correctness of the answer.

To better understand the forces at play,  I will ask the following clarifying questions:
- What is difficult about making this decision?
- Why does it have to be an either-or decision?
- What holds you back from starting a company?
- What is so attractive about starting a company?
- Why is leaving college necessary?
- Why is staying in college important for you?

After a few rounds of questions and answers, the graph will begin to take shape, and the LLM might identify a few leverage points: solvable problems within the larger unsolvable problem.

Question: Should I stay in college or start my own AI company?
Context: This is a complex problem and we are currently in the process of building a model of the problem. Here are the edges of the model graph:
- AI is very exciting, and there are a lot of cool opportunities 
- If I wait, I will miss these opportunities
- To graduate successfully, I need to wait
- To build a company, I need a lot of time
- To build a successful foundation for my life, I need to excel at school
- To excel at school, I need a lot of time
- Time is an extremely limited resource
<graph edges continue>

Response: Based on the graph I have so far, I believe that time might be a leverage point in this problem. I will recommend framing the original problem as that of optimizing for the most effective use of time and creating space for both college and company in the schedule.

Suggested problem statement: Optimize my schedule to enable both staying in college and starting my own AI company.

Once the framing emerges, the reasoning structure will follow the recipe for the solvable problem that popped out. While the original problem remains unsolved, there are now concrete steps that the user can take to navigate it.

The dynamic model that is created by the process of navigating an unsolvable problem is a valuable artifact. In fact, it probably needs to be retained. Depending on the success or failure of the framing that was produced, it will need to be updated and refined.
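One hypothetical way to retain it is as a small causal graph built out of the statements from the example above. The shape below is my own rough assumption, not a settled format:

```typescript
// A hypothetical representation of the dynamic model as a causal graph.
// Each edge is a statement linking a cause to an effect; the shape is illustrative only.
interface CausalEdge {
  cause: string;
  effect: string;
}

const collegeVsCompany: CausalEdge[] = [
  { cause: "waiting", effect: "missing exciting AI opportunities" },
  { cause: "graduating successfully", effect: "building a successful foundation for my life" },
  { cause: "a lot of time", effect: "building a company" },
  { cause: "a lot of time", effect: "excelling at school" },
];

// A crude way to spot leverage points: causes that feed many effects.
function leveragePoints(edges: CausalEdge[]): string[] {
  const counts = new Map<string, number>();
  for (const edge of edges) {
    counts.set(edge.cause, (counts.get(edge.cause) ?? 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, count]) => count > 1)
    .map(([cause]) => cause);
}

console.log(leveragePoints(collegeVsCompany)); // ["a lot of time"]
```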

🩹 Detecting problem classification mistakes

Because the first step of the reasoning structure is just a guess, the structure will need to provide guidance to detect the mistake and shift the problem into a different class.

Some of these mistakes aren’t as bad as others. For example, mistaking a solved problem for a solvable one, or a solvable problem for an unsolvable one, will result in an inefficient use of resources, but will still produce a satisfying answer.

There is another category of mistakes where the solution is produced, but isn’t valid, like when a solvable problem was diagnosed as solved, and the attempt to solve the problem results in the wrong answer. I wonder if some sort of “inner critic” reasoning might be necessary even for solved problems to mitigate that.

The mistake that is a bit more troublesome is when an unsolvable problem is guessed to be solvable. It will look like an inability to pin down constraints, creating a sense of divergence rather than convergence toward the objective. Detecting this might be challenging, because the reasoning will continue finding new solvable sub-problems, and appear to make the infinite-regress sort of progress. Here, the reasoning structure will need some sort of trip wire: a way to ask the model to lift its pencil and say “buddy, you’re spiraling out of control, this might be an unsolvable problem”.

😎 Mission impossible

Part of this essay was an exercise in trying to write a sketch of a reasoning structure in its entirety. Now, you and I could look at it and wonder whether or not something like a reasoning structure might even be feasible to imbue into a model.

It definitely feels a bit daunting. However, if the concept seems intriguing to you, please give it a try. Maybe hack together something very simple to start small, and expand from there. Who knows, maybe through this inquiry, the full depth of wisdom of a large language model will be revealed to us.

Development environment for rapid experimentation

I’ve been playing with various software development configurations that might enable rapid experimentation, and landed on this particular one. I am pretty sure there are even more effective ways, and I can’t wait to learn from you about them. This is what I have so far.

A quick disclaimer: this is not yet another “how to set up a repo” tutorial. It’s mostly a capture of my learnings. I will refer to a couple of such tutorials, though.

To set things up: I was looking for a way to enable a small-ish team to develop rapid prototypes. That is, write something, see if it does what we need, test the waters, learn like crazy, and break no sweat if it doesn’t.

🧫 Ecosystem

The first question on my mind was that of the developer ecosystem. To unlock fertile learning through testing the waters, prototypes need to ship. They do not have to ship as polished products with firm SLAs, but they do need to reach some users who would be willing to mess with the prototypes, react to them, and provide feedback. To maximize the chances of serendipitous feedback, we must play in the most populous ecosystems of folks who like to mess with unpolished stuff.  When choosing a place to learn, pick the rowdiest bazaar.

This requirement narrowed down the possible environments quite a bit. Looking at Stack Overflow survey results, two ecosystems stood out as by far the most eligible for the title: Javascript developers and Python developers. They form the fat head of the developer environment power curve. These are the places to play.

I then spent a bunch of time messing with both environments, and ended up deciding on the Javascript ecosystem. There were several reasons for that, not all of them entirely objective. Roughly, it all came down to two factors:

  • Javascript runs both in the browser and on the server, and the surprising amount of code and infrastructure that can be shared between the two means less jumping through hoops to make things go;
  • The overall state of the scaffolding and tooling in the Javascript ecosystem seems to be a touch less messy than that of Python, with Python still overcoming some of the legacy warts around package publishing, environment isolation, transition to python3, and addition of types. At least for me, I found that I end up fighting Python more often than fighting Javascript.

🧰 Toolchain

After picking the environment, I wasted a bunch of time resisting TypeScript. As a Javascript old-timer and a known build step grump, I really didn’t want to like it. But after getting over my hang-ups, I must admit: TypeScript is basically the best thing that could ever happen to unlock rapid prototyping. As long as I know where the layer gaps are (hint: the missing runtime type support), it’s close to the perfect tool for the job. Especially with the way it is integrated into VSCode, TypeScript hovers at just the right altitude to help me write the code quickly and have high confidence in this code working on the first run.
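That layer gap deserves a tiny illustration. The snippet below is a trivial example of my own: the type annotation is checked in the editor and at compile time, but it evaporates at runtime, so anything that crosses a boundary still needs a manual check.

```typescript
// TypeScript types are erased at runtime: the annotation below catches mistakes in the
// editor and at compile time, but offers no protection for data arriving at runtime.
interface Settings {
  theme: "light" | "dark";
}

function applySettings(settings: Settings): string {
  return `theme is ${settings.theme}`;
}

// Data crossing a boundary (e.g. JSON from the network) must still be checked by hand.
const incoming: unknown = JSON.parse('{"theme": "dark"}');
if (typeof incoming === "object" && incoming !== null && "theme" in incoming) {
  console.log(applySettings(incoming as Settings)); // the cast is on us, not on the compiler
}
```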

Which brings me to the next increment in my journey. If we choose TypeScript, we must go with VSCode as the development surface. I am sure there are other cool editors out there (I hear you, vim/emacs fans!), but if we’re looking for something that fits TypeScript like a glove, there is simply no substitute. Combined with eslint and prettier, the VSCode support for TypeScript makes development an enjoyable experience.

So… Node, Web, TypeScript, VSCode. These are the choices that came out of my exploration. I briefly played with the various Node package managers, and concluded that npm is likely the thing to stick with. I love the idea behind pnpm, and yarn is super fun, but at least for me, I decided to go with what comes in the box with Node. Deno is cool, too – but as a newcomer, it simply doesn’t meet the “rowdiest bazaar” bar.

The choices made so far define the basic shape of the prototypes we will develop and the sketch of the development flow. The prototypes will ship either as Web apps or as libraries/tools. Every prototype will start as an npm package. It might have server-only code, client-only code, or a mix of both. Prototypes that look like tools and libraries will be published on npm.
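To make the shape of a freshly started prototype a bit more concrete, here is a minimal, hypothetical package.json – the name and scripts are placeholders, not prescriptions:

```json
{
  "name": "my-first-prototype",
  "version": "0.0.1",
  "type": "module",
  "main": "dist/index.js",
  "types": "dist/index.d.ts",
  "scripts": {
    "build": "tsc",
    "test": "node --test"
  }
}
```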

#️⃣ Runtime versions and settings

I invested a bit of time deciding on versions and settings of TypeScript and Node. One key guiding principle I chose was “as close to the metal as possible”. The TypeScript compiler is quite versatile and can output to a variety of targets to satisfy the needs of even the most bizarre deployments. Given that we’re prototyping and writing new code, we don’t need to concern ourselves with the full breadth of deployment possibilities – and we certainly can be choosy about the version of the browser in which we expect to present our experiments.

With this leeway and the recognition that TypeScript is mostly an implementation of ECMAScript (the standard behind Javascript) plus type annotations, we can configure the TypeScript compiler to mostly remove type annotations.

For Node, I chose to go with v18.16, primarily because this is the version that introduced the real fetch implementation, which matches what modern Web browsers ship.

So, if we have Node 18 and the TypeScript config below, we should minimize the amount of new code introduced by the TypeScript compiler and maximize the client/server code compatibility.

{
    "lib": ["ES2022", "DOM"],
    "module": "NodeNext",
    "target": "ES2022"
}

As an aside, there was a fun rabbit hole of a layer gap into which I fell while exploring this space. Turns out, Node TypeScript type annotations don’t have the declarations for the fetch implementation.  So I ended up doing this funky thing with adding the “DOM” library to the TypeScript config. This worked better than I expected. As long as we remember that a) TypeScript types are not seen by the actual Javascript runtime and b) most of the actual DOM objects aren’t available in Node, one can get away with a lot of fun hacks. For example, we can run unit tests for client-side code on the server!
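As a hypothetical illustration of that last point, a bit of “client” code that only touches fetch can be exercised with Node’s built-in test runner: the “DOM” lib supplies the types, and Node 18 supplies the runtime.

```typescript
// Client-oriented code that only relies on fetch (typed via the "DOM" lib)
// runs unchanged on Node 18+, which ships a compatible fetch implementation.
import { test } from "node:test";
import assert from "node:assert/strict";

// A tiny helper destined for the browser; the URL it fetches is up to the caller.
async function fetchTitle(url: string): Promise<string | undefined> {
  const response = await fetch(url);
  const text = await response.text();
  return /<title>(.*?)<\/title>/i.exec(text)?.[1];
}

test("fetchTitle extracts the document title", async () => {
  // example.com is used here as a stand-in; point this at your own fixture server.
  const title = await fetchTitle("https://example.com/");
  assert.ok(title && title.length > 0);
});
```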

🏠 Repository configuration and layout

With versions and runtime configs squared away, I proceeded to fiddle with configuring the repository itself. I first started with the “let a thousand tiny repos bloom” idea, but then quickly shifted toward the Node monorepo. This choice might seem weird given the whole rapid prototyping emphasis. The big realization for me was that we want to encourage our prototypes to mingle: we want them to easily reuse each other’s bits. It is out of those dependencies that interesting insights emerge. We might spot a library or a tool in a chunk of code that every other prototype seems to rely on. We might recognize patterns that change how we think about the boundaries around prototypes and would need space to reshape them. With every prototype siloed in its own tiny repo, the friction of dependency tracking will simply prevent that.

There are multitudes of ways in which one could bring up a TypeScript monorepo. I really liked this guide, or this setup that relies exclusively on the TypeScript compiler to track dependencies. Ultimately, I realized that I prefer to use separate build tools that track the dependency build graph, and invoke the compiler to do their bidding. This is the setup that Vercel’s Turborepo folks advocate, and this is the one I ended up choosing.
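For reference, a minimal turbo.json in that spirit might look roughly like the sketch below. This is based on my reading of the Turborepo 1.x docs, so treat it as a starting point rather than a recommendation:

```json
{
  "$schema": "https://turbo.build/schema.json",
  "pipeline": {
    "build": {
      "dependsOn": ["^build"],
      "outputs": ["dist/**"]
    },
    "test": {
      "dependsOn": ["build"]
    }
  }
}
```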

Any Node monorepo will loosely have this format: there will be a bunch of config files and other goop in the root of the repository, and then there will be a directory or two (usually called “packages” or “apps”) that contains directories for the individual packages.

My intuition is that to facilitate rapid prototyping, we need a convention that reflects the state of any package in the monorepo. For example, we could have two package-holding directories, one for “seeds” and one for “core”. In the “seeds” directory, we place packages that are early prototypes that we’re just playing around with. Once a package acquires dependents and becomes useful for other prototypes, we graduate it to the “core” directory.
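With npm workspaces, that convention can be captured in the root package.json with something as small as this (a hypothetical sketch; the root name is a placeholder):

```json
{
  "name": "prototypes-root",
  "private": true,
  "workspaces": [
    "seeds/*",
    "core/*"
  ]
}
```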

Another useful convention when working with Node monorepos is that the npm package names are all scoped under the same npm organization and the name of that organization matches the name of the repo.

So for example, if our monorepo is named “awesome-crew-prototypes”, all packages are published under the “@awesome-crew-prototypes” npm organization. For example, a prototype for a library that does URL parsing will be published as “@awesome-crew-prototypes/url-parser”. This way, the fact that the “url-parser” is part of the “awesome-crew-prototypes” monorepo is reflected in its name.

🚀 Team practices

As the final challenge, I worked out the best practices for the team that might be working in this repository. This section is the least well-formed, since typically, the practices emerge organically from collaborating together and depend quite a bit on the mix of the people on the team.

Having said that, the following rules of thumb felt right as the foundation for the practices:

  • Have a fix-forward mindset – everyone pitches in to keep things running.
  • Mingle – seek to reuse other packages that we build, but don’t panic if that doesn’t work out. Think of reuse as a very, very early indicator of a package’s usefulness.
  • Keep the rewrite count high – don’t sweat facing the possibility of rewriting the code we’re writing multiple times.
  • Duct tape and popsicle sticks  – since we’re likely going to rewrite it, what lands does not need to be perfect or even all that great, as long as it gets the job done.
  • Ship many small things – rather than aiming for a definite product with a “wow” release moment, look to ship tiny tools and libraries that are actually helpful.

Armed with all of these, a team that is eager to experiment should be able to run forward quickly and explore the problem space that they’ve chosen for themselves, and have fun along the way. Who knows, maybe I’ll actually set up one of these myself one day. And if I do, I’ll let you know how it goes.

I also quickly put together a template for the environment that I described in this post. It probably has bugs, but should give you a more concrete idea of the actual setup.