Four layers

There seems to be some layering rhythm to how software capabilities are harnessed to become applications. Every new technology tends to grow these four layers: Capabilities, Interfaces, Frameworks, and Applications.

There does not seem to be a way of skipping or short-cutting around this process. The four layers grow with or without us. We either develop these layers ourselves or they appear without our help. Understanding this rhythm and the cadence of layer emergence could be the difference between flopping around in bewilderment and growing a durable successful business. Here are the beats.

⚡ Capabilities

Every new technological capability usually spends a bit of time in a purgatory of sorts, waiting for its power to become accessible. It needs to traverse the crevasse of understanding: move from being grokable by only a handful of those who built it to some larger audience. Many technologies dwell in this space for a while, trapped in the minds of inventors or in the hallways of laboratories. I might be stating the obvious here: the process of inventing something is not enough for it to be adopted.

I will use large language models as the first example in this story, but if you look closely, most technological advances follow this same rhythm. The transformer paper and the general capability to build large models have been around for a while, but until last year, they were mostly confined to the few folks who needed to understand the subject matter deeply.

🔌 Interfaces

The breakthrough usually comes in the form of a new layer that emerges on top of the capability: the Interfaces layer. This is typically what we see as the beginning of the technology adoption growth curve. The Interfaces layer can be literally the API for the technology or any other form of simplifying contract that enables more people to start using the technology.

The Interfaces layer serves as the democratizer of the Capabilities layer: what was previously only accessible to the select few – be that due to the complex nature of the technology, capital investment costs, or some other barrier – is now accessible to a much larger group. This new audience is likely still fractionally small compared to all the world’s population, but it must be numerous enough for the tinkering dynamic to emerge.

This tinkering dynamic is key to the success of the technology. Tinkerers aren’t super-familiar with how the technology works. They don’t have any of the deep knowledge or awareness of its limits. This gives them a tremendous advantage over the inventors of the technology – they aren’t trapped by preconceived notions of what this technology is about. Tinkerers tinker. Operating at the Interfaces layer, they just try to apply the tech in this way and that and see what happens.

Many research and development organizations make a crucial mistake by presuming that tinkering is something that a small group of experts can do. This usually backfires, because for this phase of the process to play out successfully, we need two ingredients: 1) folks who have their own ideas about what they might do with the capabilities and 2) a large enough quantity of these folks to actually start introducing surprising new possibilities.

Because of this element of surprise, tinkering is a fundamentally unpredictable activity. This is why R&D teams tend to not engage in it. Especially in cases when the societal impact of technology is unclear, there could be a lot of downside hidden in this step.

In the case of large language models, OpenAI and StabilityAI were the ones who decided that this risk was worth it. By providing a simple API to its models, OpenAI significantly lowered the barrier to accessing LLM capabilities. Similarly, by making their Stable Diffusion model easily accessible, StabilityAI ushered in a new era of tinkering with multimodal models. They were the first to offer an Interfaces layer for large language models.
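To make the scale of that shift concrete, here is roughly what tinkering at this Interfaces layer looks like in practice – a minimal sketch using the openai Python package as it existed around this time; the model name and prompt are placeholders of my own choosing:

import openai

openai.api_key = "YOUR_API_KEY"  # a hypothetical key

# A handful of lines stand in for what used to require a research lab.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest three names for a hiking club."}],
)
print(response["choices"][0]["message"]["content"])

The point is not this particular snippet, but its size: the entire Capabilities layer is elided behind a single call.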

Because it’s born to facilitate the tinkering dynamic, the Interfaces layer tends to be opinionated in a very specific way. It is concerned with reducing barriers to entry. Just like any layer, it does so by eliding details: to simplify, some knobs and sliders are no longer accessible to the consumer of the interface.

If the usage of the Interfaces layer starts to grow, this indicates that the underlying Capabilities layer appears to have some inherent value, and there is a desire to capture as much of this value as possible.

🔋 Frameworks

This is the point at which a new layer begins to show up. This third layer, the Frameworks, focuses on utility. This layer asks: how might we utilize the underlying Interfaces layer in more effective ways, and make it even more accessible to an even broader audience?

Utility might mean different things in different situations: in some, the value of rapidly producing something that works is the most important thing. In others, it is the overall performance or reliability that matters most. Most often, it’s some mix of both and other factors.

Whatever it is, the search for maximizing utility results in development of frameworks, libraries, tools, and services that consume the Interfaces layer. Because there are many definitions of utility and many possible ways to achieve it, the Frameworks layer tends to be the most opinionated of the stack.

In my experience, the diversity of opinion introduced in the Frameworks layer depends on two factors: the inherent value of the capability and the opinion of the Interfaces layer itself.

The first factor is fairly straightforward: the more valuable the capability, the more likely a wealth of opinions will grow in the Frameworks layer.

The second factor is more nuanced. When the Interfaces layer is introduced, its authors build it by applying their own mental model of how the capability will be used via their interface. Since there aren’t actual users of the layer yet, it is at best a bunch of guesses. Then, the process of tinkering puts these guesses to the test. Surprising new uses are discovered, and the broadly adopted mental models of the consumers of the interface usually differ from the original guesses.

This difference becomes the opinion of the Interfaces layer. The larger this difference, the more effort the Frameworks layer will have to put into compensating for this difference – and thus, create more opportunities for even more diversity of opinion.

An illustration of how this plays out is the abundance of Web frameworks. Since the Web browser started out as a document-viewing application, it still has all of those early guesses firmly entrenched. Indeed, the main API for web development is called the Document Object Model. We have all moved on from this notion, asking our browsers to help us conduct business, entertain us, and work with us in many more ways than the original designers of this API envisioned. Hence, the never-ending stream of new Web frameworks, each trying yet another way to close this mental model gap.

It is also important to call out a paradox that develops as a result of the interaction between the Frameworks and the Interfaces layer. The Frameworks layer appears to simultaneously apply two conflicting pressures to the Interfaces layer below: to change and to stay the same.

On one hand, it is very tempting for the Interfaces layer maintainers to change it, now that their initial guesses have been tested. And indeed, when talking to Frameworks layer developers, the Interfaces layer maintainers will often hear requests for change.

At the same time, changing means breaking the existing contract, which creates all kinds of trouble for the Frameworks layer – these kinds of changes are usually a lot of work (see my deprecation two-step write-up from a while back) and take a long time.

The typical state for a viable Interfaces layer is that it is mired in a slow slog of constant change whose pace feels glacial from the outside. Once the Frameworks layer emerges, the Interfaces layer becomes increasingly challenging to evolve. For those of you currently in this slog, wear it as a badge of honor: it is a strong signal that you’ve made something incredibly successful.

The Frameworks layer becomes the de-facto place where the best practices and patterns of applying the capability are developed and stored. This is why it is only after a decent Frameworks layer appears that we start seeing robust Applications actually utilizing the capability at the bottom of the stack.

📱 Applications

The Applications layer tops our four-stack of layers. This layer is where the technology finally faces its users – the consumers of the technological capability. These consumers might be the end users who aren’t at all technology-savvy, or they could be just another group of developers who are relieved to not have to think about how our particular bit of technology works on the inside.

The pressure toward maximizing utility develops at this layer. Consumer-grade software is serious business, and it often takes all available capacity to just stay in the game. While introducing new capabilities could be an appealing method to expand this business, at this layer, we seek the most efficient way possible to do so. The whole reason the Frameworks layer exists is to unlock this efficiency – and to further scale the availability of the technology.

This highlights another common pitfall of research organizations: trying to ram a brand new capability right into an application, without thinking about the Interfaces and Frameworks layers between them. This usually looks like a collaboration between the team that builds at the Capabilities layer and the team that builds at the Applications layer. It is usually a sordid mess. Even if the collaboration nominally succeeds, neither participant is happy in the end. The Capabilities layer folks feel like they’ve got the most narrow and unimaginative implementation of their big idea. The Application folks are upset because now they have a weird one-off turd in their codebase.

👏 All together now

Getting technology adopted requires cultivating all four layers. To connect Capabilities to Applications, we first need the Interfaces layer that undergoes a significant amount of tinkering, with a non-trivial amount of use case exploration that helps map out the space of problems that the new technology can actually solve. Then, we need the Frameworks layer to capture and embody the practices and patterns that trace the shortest paths across the explored space.

This is exactly what is playing out with the large language models. While ChatGPT is getting all the attention, the actual interesting work is happening at the Frameworks layer that sits on top of the large language model Interfaces layer: the OpenAI, Anthropic, and PaLM APIs.

The all-too-common trough of disillusionment that many technological innovations encounter can be described as the period of time between the Capability layer becoming available and the Interfaces and Frameworks layers filling in to support the Applications layer. 

For instance, if you want to make better guesses about the future of the most recent AI spring, pay attention to what happens with projects like LangChain, AutoGPT, and other tinkering adventures – they are the ones accumulating the recipes and practices that will form the foundation of the Frameworks layer. They will be the ones defining the shape of the Applications layer.

Here’s the advice I would give to any team developing a nascent technology:

  • Once the Capabilities layer exists, immediately focus on developing the Interfaces layer. For example, if you have a cool new way to connect devices wirelessly, offer an API for it.
  • Do make sure that your Interfaces layer encourages tinkering. Make the API as simple as possible, but still powerful enough to be interesting. Invest into capping the downside (misuse, abuse, etc.). For example, start with an invitation-only or rate-limited API.
  • Avoid the comforting idea that just playing with the Interfaces layer within your team or organization constitutes tinkering. Seek out a diverse group of tinkerers. Example: opt for a public preview program rather than an internal-only hackathon.
  • Prepare for the long slog of evolving the Interfaces layer. Consider maintaining the Interfaces layer as a permanent investment. Grow expertise on how to maintain the layer effectively.
  • Once the Interfaces layer usage starts growing, watch for the emergence of the Frameworks layer. Seed it with your own patterns and frameworks, but expect them not to take root. There will be other great tool or library ideas that you didn’t come up with. Give them all love.
  • Do invest in growing a healthy Frameworks layer. If possible, assume the role of the Frameworks layer facilitator and patron. Garden this layer and support those who are doing truly interesting things. Weed out grift and adversarial players. At the very least, be very familiar with the Frameworks landscape. As I mentioned before, this layer defines the shape of Applications to come.
  • Do build applications that utilize the technology, but only to learn more about the Frameworks layer. Use these insights to guide changes in the Interfaces and Capabilities layers.
  • Be patient. The key to finding valuable opportunities is in being present when these opportunities come up – and being the most prepared to pursue these opportunities.

If you orient your work around these four layers, you might find that the rhythm of the pattern begins to work for you, rather than against you.

Fix-forward and rollback commit stances

Now that I play a bit more with open source projects, more observations crystallize into framings, and more previous experiences start making sense. I guess that’s the benefit of having done this technology thing for a long time – I get to compost all of my learnings and share the yummy framings that grow on top of them.

One such framing is the distinction between two stances that projects have in regard to bad code committed into the repository: the fix-forward stance and the rollback stance.

“Bad code” in this scenario is usually code that breaks something. It could be that the software we’re writing becomes non-functional. It could be as subtle as a single unit test beginning to fail. Of course, we try to ensure that there are strong measures to prevent bad code from ever sneaking into the repository. However, no matter how much continuous integration infrastructure we surround ourselves with, bad code still occasionally makes it through.

When the project has a fix-forward stance and bad code is found, we keep moving forward, fixing it with further commits.

In the rollback stance, we identify and immediately revert the offending commit, removing the breakage.

⏩ The fix-forward stance

The fix-forward stance tends to work well in smaller projects, where there is a high degree of trust and collaboration between the members of the project. The breakage is treated as a “fire”, and everyone just piles on to try and repair the code base.

One way to think of the fix-forward stance is that it places the responsibility of fixing the bad code on the collective shoulders of the project members.

One of my favorite memories from working on the WebKit project was the “hyatt landed” moments, when one of the founding members of the project would land a massive chunk of code that introduced a cool new feature or capability. This chunk usually broke a bunch of things, and members of the project would jump on putting out the fires, letting the new code finish cooking in the repository.

The obvious drawback of the fix-forward stance is that it can be rather randomizing. Fixing bad code and firefighting can be exhilarating the first few times, but it grows increasingly frustrating, especially as the project grows in size and membership.

Another drawback of fixing forward is that it’s very common for more bad code to be introduced while fighting the fire, resulting in a katamari ball of bugs and a prolonged process of deconstructing this ball and weeding out all the bugs.

🔁 The rollback stance

This is where the rollback stance becomes more appealing. In this stance, the onus of responsibility for the breakage is on the individual contributor. If my code is deemed to be the culprit, it is simply ejected from the repository, and it’s on me to figure out the source of the brokenness.

In projects with the rollback stance, there is often a regular duty of “sheriffing”, where an engineer or two are deputized to keep an eye on the build to spot effects of bad commits, hunt them down, and roll them back. The sheriff usually has the power to “close the tree”, so that no new code is allowed to land until the problematic commit is reverted.

It is not fun to get a ping from a sheriff, letting me know that my patch was found to be the latest suspect in the crimes against the repository. There’s usually a brief investigation, with the author pleading their innocence, and a quick action of removing the commit from the tree.

The key advantage of the rollback stance is that it’s trustless in its nature, and so it scales rather well to large teams with diverse degrees of engagement. It doesn’t matter if I am a veteran who wrote most of the code in the project or someone who is making their first commit in hobby time – everyone is treated in the same way.

However, there are also several drawbacks. First, it could take a while for complicated changes to land. I’ve seen my colleagues orchestrate intricate multi-part maneuvers to ensure that all dependencies are properly adjusted and do not introduce breakages in the process. 

There is also a somewhat unfortunate downside of the trustless environment: because it is on me to figure out the problem, it can be rather isolating. What could have been a brief firefighting swarm in a fix-forward project can turn into a long lonely slog of puzzling over the elusive bug. This tends to particularly affect less experienced and introverted engineers, who may spend weeks or even months trying to land a single patch, becoming more and more dejected with each rollback.

Similarly, it takes nuance and personal awareness to be an effective sheriff. A sheriff must constantly balance between quick action and proper diagnosis. Quite often, perfectly innocent code gets rolled back while the problematic bits remain – or the sheriff loses large chunks of time trying to diagnose the problem too deeply, thus holding up the entire project. While working on Chromium, I’ve seen folks who are genuinely good at this job – and folks who I would rather not have sheriffing at all.

Because it is trustless, a rollback stance can easily lead to confrontational and zero-sum dynamics. Be very careful here and cultivate the spirit of collaboration and sense of community, lest we end up with a project where everyone is out for themselves.

📚 Lessons learned

Which stance should you pick for your next project? I would say it really depends on the culture you’d like to hold up as the ideal for the project.

If this is a small tight-knit group of folks who already work together well, the fix-forward stance is pretty effective. Think startups, skunkworks, or prototyping shops that want to stay small and nimble.

If you’d like your project to grow and accept many contributors, a rollback stance is likely the right candidate – as long as it is combined with strong community-building effort.

What about mixing the two? My intuition is that a combination of both stances can work within the same project. For example, some of the more stable, broader bits of the project could adopt the rollback stance, and the more experimental parts could be in a fix-forward stance. As long as there is a clean dependency separation between them, this setup might work.

One thing to avoid is inconsistent application of the stance. For example, if for our project, we decide that seasoned contributors could be allowed to use the fix-forward stance, and the newcomers would be treated with rollbacks, we’ll have a full mess on our hands. Be consistent and be clear about the stance of your project – and stick with it.

Schemish

I’ve been spending most of my hobby time playing with reasoning boxes and it’s been crazy fun. I need to write properly about my explorations later, but so far, I’ve convinced an LLM to brainstorm ideas and then critique itself in a diverge-converge exercise, evaluate the Cynefin space of a situation, and even get some Humean reasoning going – albeit with mixed results on the last one.

One pattern that I’ve leaned on consistently is asking an LLM to produce output in JSON. There are several advantages to that. First, I get output that I can then process outside of an LLM. Second, JSON surprisingly acts as a useful set of reasoning rails for the LLM itself. It kind of makes sense – since LLMs are predictive devices, laying out a structure of prediction helps organize and guide the prediction itself. In other words, JSON is not just useful for processing the output. It also helps with framing the process of text completion.

Of course, now that we’re asking an LLM to return JSON, the question arises: how do we inform it of the structure that this JSON needs to take?

The first obvious answer is that we already have a way to describe the structure of JSON output. It’s called JSON Schema and it’s in broad use across the industry. Turns out, modern LLMs will happily consume JSON Schema. All you have to do is, somewhere in your prompt, tell them something along the lines of (you can see an example here):

Respond in valid JSON that matches the following JSON schema:
<insert your JSON schema here>

The output will (well, with 95% probability, as it usually goes with LLMs) dutifully follow the specified schema.

Unfortunately for us, JSON Schema is a bit chunky. Designed for precision and processing, it’s verbose and can quickly blow through our token budget if we’re not careful. It is also not exactly a human-readable format, making prompt hacking a bit of a chore.
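To give a sense of that chunkiness, here is roughly what a JSON Schema fragment for just two fields of the structure shown below might look like – this exact schema is my own illustration, not something pulled from a real project:

{
  "type": "object",
  "properties": {
    "context": {
      "type": "string",
      "description": "A few sentences describing the context in which the question was asked"
    },
    "subjects": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Name of a subject mentioned or implied in the question" },
          "assumption": { "type": "string", "description": "An assumption made when identifying this subject" }
        },
        "required": ["name", "assumption"]
      }
    }
  },
  "required": ["context", "subjects"]
}

That is a lot of tokens spent on scaffolding rather than meaning.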

On the other hand, describing JSON in English is also a bit too weird: “the output must be valid JSON containing an object with foo and bar as strings, blah blah blah” is something that LLMs will tolerate, but who has the time for typing it all in prose?

So I ended up using this weird pidgin that I named Schemish. It’s not really a JSON Schema, but not plain English either. Instead, it’s a mock of JSON output, with hints for the output provided as values. Kind of like this:

Respond in valid JSON of the following structure:
{
  "context": "A few sentences describing the context in which the question was asked",
  "subjects": [
    {
      "name": "Name of a subject mentioned or implied in the question",
      "assumption": "An assumption that you made when identifying this subject",
      "critique": "How might this assumption be wrong?",
      "question": "A question that could be asked to test the assumption"
    }
  ],
  "objects": [
    {
      "name": "Name of an object mentioned or implied in the question",
      "assumption": "An assumption that you made when discerning this object",
      "critique": "How might this assumption be wrong?",
      "question": "A question that could be asked to test the assumption"
    }
  ]
}

As you can see, I am specifying the structure of the output, but instead of relying on JSON schema syntax, I am simply mocking it out – just as I would describe JSON output in a sketch for another person.

If we want to go even more compact, we could go for YAML:

Respond in valid YAML of the following structure:
---
context: "A few sentences describing the context in which the question was asked",
  subjects:
    - name: "Name of an subject mentioned or implied in the question"
      assumption: "An assumption that you made when identifying this subject"
      critique: "How might this assumption be wrong?"
      question": "A question that could be asked to test the assumption"
  objects: 
    - name: "Name of an object mentioned or implied in the question"
      assumption: "An assumption that you made when discerning this object"
      critique: "How might this assumption be wrong?"
      question": "A question that could be asked to test the assumption"

Now, for an even cooler trick. You might have noticed that the structure has somewhat repeating elements in it. JSON Schema has references and other tools to deal with redundant declarations. Turns out, Schemish can do the same thing. Just use what you would usually do when sketching out a structure for yourself or your colleagues – leave a hand-wavy comment:

Respond in valid YAML of the following structure:
---
context: "A few sentences describing the context in which the question was asked",
  subjects:
    - name: "Name of an subject mentioned or implied in the question"
      assumption: "An assumption that you made when identifying this subject"
      critique: "How might this assumption be wrong?"
      question": "A question that could be asked to test the assumption"
  objects: 
    # same structure as `subjects`

What’s neat about Schemish is that it’s clearly derivable from a JSON Schema. Though I haven’t actually written the code, it’s fairly obvious that JSON Schema description fields as well as type and optionality can be expressed as English (or some shorthand of it) in the value of the Schemish structure.
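If I were to sketch that translator, it might look something like this – a rough, untested Python sketch, with the function name and the handling of nesting being my own guesses:

# Rough sketch of deriving Schemish from a JSON Schema. Descriptions (plus
# type and optionality hints) become the English values of the mock.
def to_schemish(schema):
    kind = schema.get("type")
    if kind == "object":
        return {
            name: to_schemish(prop)
            for name, prop in schema.get("properties", {}).items()
        }
    if kind == "array":
        # A single mocked item stands in for "a list of these".
        return [to_schemish(schema.get("items", {}))]
    # Leaf value: the description is the hint; fall back to the bare type.
    return schema.get("description", f"A {kind} value")

Feeding the result through json.dumps (or a YAML serializer) would produce the mock that goes into the prompt.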

This means that I could potentially connect a piece of machinery that speaks JSON Schema to an LLM via Schemish, with a Schemish translator shepherding the schema from precise, mechanical code-land to the squishy, funky LLM-land – and then picking up the output produced by the LLM and ushering it back to code-land by validating it with JSON Schema.
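The return trip is already well supported by existing tooling. A sketch, assuming the model’s reply lives in a variable called llm_output and the original schema in schema:

import json
import jsonschema

# Usher the LLM output back to code-land: parse it, then check it against
# the original JSON Schema. Either step may raise if the model strayed.
data = json.loads(llm_output)
jsonschema.validate(data, schema)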

This Schemish idea seems fairly useful and I’ve seen similar techniques pop up in the AI tooling sphere. If you like it, please do something with it. I’ll probably write a simple wrapper for Schemish, too.

Stages of project forking

I’ve been thinking recently about how to design an open source project, and realized that there’s a really neat framing hiding in my memory. So I dug it out. If you are trying to make sense of the concepts of forking source code, see if this framing works for you.

It is fairly common that we have a chunk of someone else’s open source code that we would like to use. Or maybe we are trying to best prepare for someone else to use our open source code. In either case, we typically want to understand how this might happen.

The framing that I have here is a three-stage progression. It really ought to have a catchy name. The three stages are: dependency, soft fork, and hard fork. In my experience, a lot of open source code tends to go through this progression, sometimes more than once.

Depending on a particular situation, we might not be starting at the beginning of the sequence. As I will illustrate later, a project might not even move through this sequence linearly. This framing is an oversimplification for the sake of clarity. I hope you get the general gist, and then will be able to apply it flexibly.

🔗 Dependency stage

When we see a useful bit of source code, we start at the “dependency” stage. We want to consume this code, so we include it into our build process or import it directly into our project as-is. Using someone else’s code as a dependency has benefits and drawbacks. 

The benefits of dependencies are that we don’t have to write or maintain this code. It serves as a layer of abstraction on top of which we build our thing, no longer needing to worry about the details hidden in this layer of abstraction.

The drawbacks come out of the failure of the assumptions made in the previous paragraph. Depending on the layer gaps this dependency contains or the sort of opinion it imposes, we might find that the code doesn’t quite do what we want.

At this point, we have two choices. The first choice is to continue treating this code as a dependency and try to fill in the gaps or shift the opinion by contributing back to the project from which this code originates. However, at this point, we are now participating in two projects: ours and the dependency’s. Depending on how different the projects’ organization and cultures are, we may start incurring an extra cost that we didn’t expect before: the cost of navigating across project boundaries. 

If these costs start presenting an obstacle for us, the second choice starts looking mighty appealing. This second choice moves us to the next stage in our progression: the soft fork.

🍝 Soft fork stage

When soft-forking, we create a fork of the open source code, but commit to regularly pulling commits from the original. This is very common in situations where we ourselves do not have enough resources (expertise, bandwidth, etc.) and the original code is being actively improved. Why not get the best of both worlds? Get the latest improvements from the original while making our own improvements in our copy.

In practice, we end up underestimating the costs of maintaining the fork. Especially when the original project is moving quickly, the divergence of thinking across the two pools of people who are working on same-but-different-but-same code starts to rapidly unravel our plans for soft fork harmony. We end up caught in the trap of having to accommodate both the thinking of our team and the team that’s working on the original code – and that is frankly a recipe for madness. Maintenance costs start growing non-linearly as our assumptions that it will “just be a simple merge” begin exploding, the time bombs that they are.

Because of that, the “soft fork” stage is rarely a stable resting place in our progression. To abate the growing discontent, we are once again faced with two choices: go back to the “dependency” stage or proceed forward to the next stage in our little progression. Both are expensive, which makes the soft fork a nasty kind of trap. 

Going back to the “dependency” stage means investing extra energy into upstreaming all of the accumulated code and insights. Many of them will be incompatible with what the original code maintainers think and like. Prepare for grueling debates and frustrations. Bring lots of patience and funding.

🔱 Hard fork stage

Moving forward to the “hard fork” stage means going our own way – and losing the benefit of the expertise and investment that made the soft fork so appealing in the first place. If we are a lean team that thought it would do a cool trick of drafting behind a larger team with our soft fork, this would be an impossible proposition.

Hard-forking is rarely beneficial in the short term. For a little while, we will be just learning how to run this thing by ourselves, and it will definitely feel like a dent in velocity. However, if we are persistent, a hard-forked project eventually regains its groove. Skill gaps are filled, duality of opinions is reduced, and the people around the project form unique culture and identity.

The key challenge of hard forks is that of utility. In the end, if the hard fork is not that different from the original, a natural question emerges: why do we need both? Depending on the kind of environment these projects inhabit, this question could be easily answered — or not.

📖 Story time

To give you an example of this progression, here’s an abbreviated (and likely embellished by yours truly) story of the relationship between Chromium and WebKit projects.

The Chromium project worked in secret for about two years, quickly going from WebKit being a dependency to a soft fork, with a semi-regular merge process. The shift from dependency to soft fork was pretty much necessary, given that the Chromium folks wanted to embed a different JavaScript engine than WebKit. This engine would end up being named “V8”.

In the last year or so prior to release, the team decided to temporarily shift to a hard fork stage. When I joined the team one month before public release, returning to the soft fork stage was my first big project. Thrilled to be working on a browser project, I remember happily reporting that I was down to just 400 linker errors. When my colleagues, wide-eyed, turned to stare at me, I would add that last week it was over 3000.

Once the first merge was successful, my colleague Ojan strongly advocated for a daily merge. I couldn’t understand why this was so important back then, but this particular learning opportunity presented itself nearly immediately. There was a strongly super-linear relationship between the difficulty of the merge and the number of commits in it. If the merge contained just a handful of commits, it wasn’t that big of a deal. However, if the number exceeded a few dozen, we would be in deep trouble – making sense of the changes and how they intersected with the changes we’ve made spiraled out of control.

Simultaneously, we committed to “unforking” – that is, to moving all the way back to the “dependency” stage, where Chromium consumed WebKit as a pure dependency. This was a wild time. We were doing three things at once: performing continuous merge with the tip-of-tree WebKit, shuttling our Chromium diffs over to WebKit, and building a browser. I still think of those times fondly. It was such a fun puzzle.

Over a year later – and that tells you about the sheer amount of work that was necessary to make all this happen – we were unforked. We moved all the way back to the first stage of the progression. At that point, the WebKit project’s code was just one of the dependencies of Chromium. We focused on making more and more contributions upstream, and the team that was working on WebKit directly grew.

Ultimately, as you may know, we forked again, creating Blink. This particular move was a hard fork, skipping a stage in the sequence. This wasn’t an easy decision, but having explored and understood the soft fork, we knew that it wasn’t what we were looking for.

✨ What I learned

With this framing, I accumulated a few insights. Here they are. Your mileage may vary.

When consuming open source projects:

  • Be aware that soft forks are always a lot more expensive than they look.
  • There will be many different ideas that make soft forks look appealing, including neat techniques of carrying patches and clever tooling to seamlessly apply them. These don’t reduce the costs. They just hide them.
  • When stuck with soft-forking, put all your energy into reducing the delay between merges. The beer game dynamic will show up otherwise.

In our own open source project:

  • Work hard to reduce the need to be soft-forked. Invest extra time to make configuration more flexible. Accommodate diverse needs. We are better off having these folks work in the main tree rather than wasting energy on a soft fork.
  • A culture of inviting contributions and respecting the insights of others is paramount: when signing up to run an open source project, we accept a future where the project will morph and change from our original intention. Lean into that, rather than trying to prevent it.

The waterline

This one is a very short framing, but hopefully, it’s still useful. Whenever we operate in environments that look like pace layers (and when don’t they?), there emerges a question of relative velocity. How fast am I moving relative to the pace layer where I want to play?

If the environment that surrounds us is moving slower than our team’s velocity, we will feel overly constrained and frustrated. Doing anything appears to be laden with unnecessary friction. On the other hand, if the environment is moving faster than us, we’ll feel like a tractor on an autobahn: everyone is zooming past us. Whenever something like this happens, this means that our relative velocity doesn’t match the velocity of the pace layer in which we’ve chosen to play.

This mismatch will feel uncomfortable, and there is a force hiding behind this discomfort. This force will constantly nudge us toward the appropriate layer.

If, after working in a startup where writing hacky code and getting it out to customers quickly was existential, I join a well-structured team with robust engineering practices and reliability guarantees, I will feel the force that constantly pushes me out of this team – the “serious business” processes will aggravate the crap out of me.

Conversely, if I grew up in a large company where landing one CL a week was somewhat of an achievement, joining a small team of rapid prototypers may leave me in bewilderment: how dare these people throw code at the repo without any code reviews?! And why does it feel like everyone is zooming past me?

Similarly, every technology organization has a waterline. Everything above this waterline will move at a faster pace than the organization. Everything below will be slower.

If this organization decides to enter the game at a faster pace layer, they will see themselves being constantly lapped: new, better products will emerge faster than the team can keep up with. We would still be in the planning stages of a new feature while another player at this layer has already shipped several iterations of it.

Similarly, the organization will struggle applying itself effectively at the lower pace layers. We would ship a bunch of cool things, and see almost no uptick or change in adoption or usage patterns. Worse yet, our customers will complain of churn and instability, choosing more stable and consistent peers over us.

No matter how much we want to play above or below our waterline, we’ll keep battling the force that nudges us back toward it. We might speak passionately about needing to move faster and innovate, and be obsessed with speed as a concept – but if our team is designed to move at the deliberate pace of a tractor, all it will be is just talk.

When an organization is aware of its waterline, it can be very effective. Choosing to play only at the pace layer where the difference in relative velocity is minimal can be a source of a significant advantage in itself: the organization can apply itself fully to solving the problem at hand, rather than battling the invisible force that nudges it out of the pace layer it doesn’t belong to.

Limits of applying AI

I’ve been thinking about the boundaries of what’s possible when applying large language models (LLMs) to various domains. Here’s a framing that might not be fully cooked, but probably worth sharing. Help me make it better.

To set things up, I will use the limits lens from the problem understanding framework. Roughly, I identify three key limits that bound what’s possible for anyone who is put in front of a problem: capacity, or the actual ability to find effective solutions (one or more) to the problem; time, the velocity at which effective solutions are discovered; and finally, attachment, or the resistance to incorporate interesting information into their understanding of the problem to find solutions.

These three limits govern what we humans can do. Let’s see how these limits could be applied to LLMs.

Let me start with capacity. The fact that this limit applies to LLMs is fairly self-evident, and can be illustrated easily with the progress we have made in the last year. What was impossible just 18 months ago is now a widely accepted capability. The progression to larger models and larger context windows feels like a constant drumbeat – this is the capacity limit being pushed further and further back. When will we run into the wall of the invisible asymptote? When will we be able to confidently say: “Aha! Here’s proof that LLMs can’t actually do that”? It’s not easy to tell. However, just like with all things of this world, this limit is still present with LLMs. We just haven’t quite found it yet.

Another interesting trick that LLMs have compared to humans is the ability to clone. A bot, an agent, or a network of reasoning boxes is easily reproducible. One does not need to spend a lifetime raising an LLM from infancy to adulthood. Once a pattern is established, clones of this pattern are easy to produce. This is a significant source of capacity. Being more numerous is easy, which means that brute force can compensate for smarts in a pinch.

The final component of the limit of capacity is the computing power. The amount of energy it takes to make the LLM produce a potential answer to a given problem seems like an important factor. Again, we seem to be marching along the Moore’s Law curve here, and I expect each new breakthrough in atoms and bits to significantly push the capacity limit out.

Speaking of the future, let’s talk about the limit of time. I sprinkled references to time into my examination of the limit of capacity. They seem closely interrelated. LLMs becoming more capable and efficient means that they will also be able to solve problems more quickly. At this moment in time, I am contemplating networks of reasoning boxes, which implies an era when invoking LLMs will take single milliseconds. This seems to fall naturally out of the limit of capacity advancing. There will likely be an asymptote we’ll hit with how fast an LLM can go, but we’re definitely not there yet.

The limit of attachment is the one that’s been most curious for me. As far as I can tell, LLMs don’t seem to have it at all. While people could get bored or tired, or have anxiety about doing things one way or another, express strong convictions and make rash decisions… LLMs don’t seem to have any of that. LLMs don’t have the “ooh, I wonder what <character> will do next on <TV show> episode” mindworm. They don’t need to rush home to put their kid to bed. They have no desire to spend the day just hanging out with friends. LLMs are unattached. There is no inner shame that they have to protect at all cost. There are no values that they passionately embrace. The limit of attachment for LLMs appears to be a bottomless abyss – or a blue sky, if you’re into more positive metaphors.

Will the limit of attachment manifest itself for LLMs? Probably. There’s something very significant about this lack of the limit that indicates the vastness of the space we’re in.

So, what does this mean in practice? 

As I was saying to my colleagues, we’re at the 8-track tape stage of the whole LLM story. What we think is cool and amazing now will be viewed as silly and as quaint as the 8-track tapes just a few months in the future. Prepare for more tectonic shifts as the previously understood lines of the capacity limit are redrawn. Avoid early firm bets on the final shape of things in this space.

Also, we’re likely very close to the moment when we will have a decent representation of an “information worker”, probably as some sort of reasoning network. This worker will never get tired, will never complain about the problem being too pedestrian for their talents, will never slack off or quiet-quit. This worker will continuously improve and get better at the tasks that they are given, performing them more efficiently each time.

Looking at the LLM bounding box – or more accurately, the lack thereof – my intuition is that we’re going to see terraforming of entire industries, especially where information plumbing is a significant cost of doing business. I have no idea how they will change and what they will look like after the dust settles, but it’s very likely that this change will happen.

More importantly, as this change grinds into our established understanding of how things are, we are likely to grapple with it as a society. It is very possible that it is our own limits of attachment that we will try to impose on the LLMs, no matter how fruitless this effort will end up being in the long term.

Concept miner

Here’s a concrete example of a reasoning box of the kind I was talking about last week. It’s not super-flashy or cool – certainly not one of those viral demos – but it’s a useful example of recursively ground-truthing a reasoning box onto itself. The source code is here if you want to run it yourself.

The initial idea I wanted to play with was concept extraction: taking a few text passages and then turning them into a graph of interesting concepts that are present in the text. The vertices of the graph would be the concepts and the edges would represent the logical connections between them.

I started with a fairly simple prompt:

Analyze the text and identify all core concepts defined within the text and connections between them.

Represent the mental concepts and connections between them as JSON in the following format:

{
  "concept name": {
    "definition": "brief definition of the concept",
    "connections": [ a list of concept names that connect with this concept ]
  }
}

TEXT TO ANALYZE:
${input}

This is mostly a typical reasoning box – take in some framing, context, and a problem statement (“identify all core concepts”), and produce a structured output that reflects the reasoning. In this particular case, I am not asking for the chain of reasoning, but rather for a network of reasoning.

The initial output was nice, but clearly incomplete. So I thought – hey, what if I feed the output back into the LLM, but with a different prompt. In this prompt, I would ask it to refine the list of concepts:

Analyze the text and identify all core concepts defined within the text and connections between them.

Represent the mental concepts and connections between them as JSON in the following format:

{
  "concept name": {
    "definition": "brief definition of the concept",
    "connections": [ a list of concept names that connect with this concept ]
  }
}

TEXT TO ANALYZE:
${input}

RESPONSE:
${concepts}

Identify all additional concepts from the provided text that are not yet in the JSON response and incorporate them into the JSON response. Add only concepts that are directly mentioned in the text. Remove concepts that were not mentioned in the text.

Reply with the updated JSON response.

RESPONSE:

Notice what is happening here. I am not only asking the reasoning box to identify the concepts. I am also providing the outcome of its previous reasoning and asking to assess the quality of this reasoning.

Turns out, this is enough to spur a ground truth response in the reasoning box: when I run it recursively, the list of concepts grows and concept definitions get refined, while connections shift around to better represent the graph. I might start with five or six concepts in the first run, and then expand into a dozen or more. Each successive run improves the state of the concept graph.
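Here is the shape of that recursive loop, as a rough Python sketch. The names call_llm, extraction_prompt, and refinement_prompt are placeholders for the completion API and the two prompts shown above; the stopping condition is my own:

import json

def mine_concepts(text, max_passes=5):
    # First pass: extract an initial concept graph from the text.
    concepts = json.loads(call_llm(extraction_prompt(text)))
    for _ in range(max_passes):
        # Feed the previous response back in and ask for a refinement.
        refined = json.loads(call_llm(refinement_prompt(text, concepts)))
        if refined == concepts:
            break  # the graph has settled down; no more changes
        concepts = refined
    return concepts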

This is somewhat different from the common agent pattern in reasoning boxes, where the outcomes of the agent’s actions serve as the ground truth. Instead, the ground truth is static – it’s the original text passages, and it is the previous response that is reasoned about. Think of it as the reasoning box making guesses against some ground truth that needs to be puzzled out and then repeatedly evaluating these guesses. Each new guess is based on the previous guess – it’s path-dependent reasoning.

Eventually, when the graph settles down and no more changes are introduced, the reasoning box produces a fairly well-reasoned representation of a text passage as a graph of concepts.

We could maybe incorporate it into our writing process, and use it to see if the concepts in our writing connect in the way we desire, and if the definitions of these concepts are discernible. Because the reasoning box has no additional context, what we’ll see in the concept graph can serve as a good way to gauge if our writing will make sense to others.

We could maybe create multiple graphs for multiple passages and see if similar concepts emerge – and how they might connect. Or maybe use it to spot text written in a way that is not coherent and the concepts are either not well-formed or too thinly connected.

Or we could just marvel at the fact that just a few lines of code and two prompts give us something that was largely inaccessible just a year ago. Dandelions FTW.

Separating cargo from the train

I’ve been puzzling over a problem that many engineering teams face and came up with this metaphor. It’s situated in the general space of attachment and could probably apply to things other than engineering teams. 

Here’s the setup. Imagine that we’re leading a team whose objective is to rapidly explore some newly opened space. Everyone gets their little area of the space, and armed with enthusiasm and skill, the teams venture off into the unknown. A few months later, a weird thing happens: now we have N teams that build technology or products in basically the same space where they started.

Instead of exploring, the teams just settled into the first interesting thing they found. Exploration collapsed into optimizing for the newly found value niches.

This might not necessarily be a bad thing. If the new space is ripe with opportunities or the team is incredibly lucky, they might have struck gold on the first try.  Except my experience tells me that most of the time, the full value of the niche is grossly overestimated, and the teams end up organizing themselves into settlers of a tiny “meh” value space.

The events that follow are fairly predictable. There is a struggle between us and the individual team leads to “align”, where the word “align” really stands for “what the heck are y’all doing?! we were supposed to be exploring!!” from us and “stop distracting us with your silly ideas! we have customers to serve and things to ship!” from the sub-teams. The team becomes stuck.

I have seen various ways in which the resolution plays out. There’s one with the uneasy compromise, where the “exploration team” kayfabe is played out at the higher levels (mostly in slide decks), and the sub-teams are just left to do their thing. There’s one with the leader making a “quake”: a swift reorg that leaves the sub-team leads without a path forward. There’s one where a new stealth sub-team is started to actually explore (you can guess what happens next).

The lens that really helps here is “something will get optimized”. When we have engineers, we have people whose literal job description includes organizing code into something that lasts. Like a car with unbalanced wheels, by default, engineers will veer toward elephant-land. Given no other optimization criteria, what will get optimized is the quality of the code base and the robustness of the technical solution that it offers.

The problem is, when exploring, we don’t need any of that. We need messy, crappy code that somewhat works to get a good sense of whether there’s a there there. And then we need to throw that code out or leave it as-is and move on to the next part of the space.

This is not at all natural and intuitive for engineering teams. There are no tests! Not even a code review process! This dependency was made by a random person in Nebraska! Madness!

By the way, the opposite of this phenomenon is also true. If our engineering team does not have this tendency toward building code that lasts, we probably don’t have an engineering team. We might have some coders or programmers, but no engineering.

To shift an engineering team to be more amenable to exploration, we need to shift the target of the optimization.

That’s where the cargo-and-train metaphor comes in. Let’s pretend that an engineering team is a train that delivers cargo: the thing that it makes. The thing about cargo is that once it is delivered, it leaves the train and the train gets new cargo. The train is permanent. The cargo is transient.

To make our train go efficiently, we optimize for moving cargo as quickly as possible, and we optimize for keeping the train in its best condition. Figuring out which part of our work we optimize to keep and which one we optimize to move is what it’s all about.

If we follow this metaphor, there are two questions that an engineering team needs to ask itself: “What is our cargo?” and “What is our train?” We need to consciously separate our cargo from our train.

Which part of our business do we optimize to let go of as efficiently as possible, and which part of it do we keep and grow?

For a typical engineering team, the cargo is the software release and the code base is the train.

Each release is a snapshot of the code base in a certain state. Once that release cut is made, we mentally let go of it and start on the next release. Releasing well means being able to make a release cut like a Swiss train: always on time, with no hiccups.

The codebase is the train, since this is where releases come from. The codebase is the place where the product grows and matures. Our codebase is what we keep and improve and strive to make better with time. Terms like technical debt, which we engineers invented, reflect our anxiety about succeeding at this process.

When the engineering team is asked to explore a new space, the answers to the two questions are likely different.

It might very well be that the code we write is cargo. It’s just something we do as a byproduct of our exploration. We write a ton of prototypes, throw them in the wild, and see which ones stick.

What is the train then? My intuition is that it’s knowledge. After all, the whole point of exploration is mapping the unknown. If our continuous delivery of cargo – writing of prototypes – doesn’t light up more and more territory, we’re doing something wrong.

So when an engineering team is asked to explore a new space, we need to contemplate the cargo-and-train questions carefully and decide on our answers to them.

Then, we need to invest into making sure that everyone on the team optimizes for the right thing: the thing that we want to be our cargo is optimized to be delivered and let go of quickly, and the thing we want to be our train is carefully and lovingly grown and enriched with each delivery.

This includes everything from the mission and vision, where the cargo-and-train questions are clearly answered, to the culture, incentives, and structure of the team. Remember – most of the default engineering practices and processes were designed for default engineering teams. Which means that if we’re setting out to explore, they will be working against us.

Reasoning boxes

This story begins with the introduction of metacognition to large language models (LLMs). In the LLM days of yore (like a few months ago), we just saw them as things we could ask questions and get answers back. It was exciting. People wrote think pieces about the future of AI and all that jazz.

But then a few extra-curious folks (this is the paper that opened my eyes) realized that you could do something slightly different: instead of asking for an answer, we could ask for the reasoning that might lead to the answer.

Instead of “where do I buy comfortable shoes my size?”, we could inquire: “hey, I am going to give you a question, but don’t answer it. Instead, tell me how you would reason about arriving at the answer. Oh, and give me the list of steps that would lead to finding the answer. Here’s the question: where do I buy comfortable shoes my size?”

Do you sense the shift? It’s like an instant leveling up, the reshaping of the landscape. Instead of remaining hidden in the nethers of the model, the reasoning about the question is now out in the open. We can look at this reasoning and do what we would do with any reasoning that’s legible to us: examine it for inconsistencies and decide for ourselves if this reasoning and the steps supplied will indeed lead us toward the answer. Such legibility of reasoning is a powerful thing.

With reasoning becoming observable, we iterate to constrain and shape it. We could tell the LLM to only use specific actions of our choice as steps in the reasoning. We could also specify particular means of reasoning to use, like taking multiple perspectives or providing a collection of lenses to rely on.

To kick it up another notch, we could ask an LLM to reason about its own reasoning. We could ask it: “Alright, you came up with these steps to answer this question. What do you think? Will these work? What’s missing?” As long as we ask it to provide the reasoning back, we are still in metacognitive territory.

We could also give it the outcomes of some of the actions it suggested as part of the original reasoning and ask it to reason about these outcomes. We could specify that we tried one of the steps and it didn’t work. Or maybe that it worked, but made it impossible for us to go to the next step – and ask it to reason about that.

From the question-answering box, we’ve upleveled to the reasoning box.

All reasoning boxes I’ve noticed appear to have this common structure. A reasoning box has three inputs: context, problem, and framing. The output is the actual reasoning. 

The context is the important information that we believe the box needs to have to reason. It could be the list of the tools we would like it to use for reasoning, the log of prior attempts at reasoning (aka memory), information produced by these previous attempts at reasoning, or any other significant stuff that helps the reasoning process.

The problem is the actual question or statement that we would like our box to reason about. It could be something like the shoe-shopper question above, or anything else we would want to reason about, from code to philosophical dilemmas.

The final input is the framing. The reasoning box needs rails on which to reason, and the framing provides these rails. This is currently the domain of prompt engineering, where we discern resonant cues in the massive epistemological tangle that is an LLM – cues that give the reasoning box the perspective we’re looking for. It usually goes like “You are a friendly bot that …” or “Your task is to…”. Framing is sort of like a mind-seed for the reasoning box, defining the kind of reasoning output it will provide.

Given that most of the time we would want to examine the reasoning in some organized way, the framing usually also constrains the output to be easily parsed, be it a simple list, CSV, or JSON.
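Put together, a reasoning box is small enough to sketch as a single function. This is just a sketch under the assumptions above: ask_llm() is the same helper as before, and the JSON-array constraint rides along inside the prompt so the output stays easy to parse.

```python
import json

def reasoning_box(context: str, problem: str, framing: str) -> list[str]:
    """Three inputs in (context, problem, framing); a list of reasoning steps out."""
    prompt = (
        f"{framing}\n\n"
        f"Context:\n{context}\n\n"
        f"Problem:\n{problem}\n\n"
        "Respond with a JSON array of strings, one per reasoning step, "
        "and nothing else."
    )
    return json.loads(ask_llm(prompt))

# The shoe-shopper question, with a planner framing.
steps = reasoning_box(
    context="Available actions: search_web, check_store_inventory, ask_user.",
    problem="Where do I buy comfortable shoes my size?",
    framing="You are a careful planner. Do not answer the question; "
            "produce the steps that would lead to an answer.",
)
```

In practice, models sometimes wrap the JSON in extra prose, so a sturdier version would parse more defensively. The shape of the box is the point here, not the plumbing.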

A reasoning box is certainly a neat device. But by itself, it’s just a fun little project. What makes reasoning boxes useful is connecting them to ground truth. Once we connect a reasoning box to a ground truth, we get the real sparkles. Ground truth gives us a way to build a feedback loop.

What is this ground truth? Well, it’s anything that can inform the reasoning box about the outcomes of its reasoning. For example, in our shoe example, a ground truth could be us informing the box of the successes or failures of actions the reasoning box supplied as part of its reasoning.

If we look at it as a device, a ground truth takes one input and produces one output. The input is the reasoning and the output is the outcomes of applying this reasoning. I am very careful not to call ground truth “the ground truth”, because what truths are significant may vary depending on the kinds of reasoning we seek.
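That one-in, one-out shape is easy to write down. A sketch of the device as I’ve been describing it; both names are mine, and the humblest possible implementation is just a person trying each step and reporting back.

```python
from typing import Protocol

class GroundTruth(Protocol):
    """One input, one output: reasoning steps in, observed outcomes out."""
    def apply(self, reasoning: list[str]) -> list[str]: ...

class HumanInTheLoop:
    """The humblest ground truthing device: a person tries each step and reports back."""
    def apply(self, reasoning: list[str]) -> list[str]:
        return [input(f"What happened when you tried '{step}'? ") for step in reasoning]
```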

For example, and as I implied earlier, a reasoning box itself is a perfectly acceptable ground truthing device. In other words, we could connect two reasoning boxes together, feeding one’s output into the other’s context – and see what happens. That’s the basic structure behind AutoGPT.

Connecting a reasoning box to a real-life ground truth is what most AI agents are. They are reasoning boxes whose reasoning is used by a ground truthing device to take actions, like searching the web or querying data sources, with the outcomes of these actions fed back into the reasoning boxes. The ground truth connection is what gives reasoning boxes agency.
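Wiring the two devices together gives us the skeleton of such an agent. A sketch that reuses reasoning_box() and the GroundTruth shape from the earlier snippets; a real agent would add stopping conditions, memory management, and a lot more care.

```python
def agent_loop(problem: str, framing: str, ground_truth: GroundTruth,
               max_turns: int = 5) -> str:
    """Alternate between reasoning and ground truthing, folding outcomes back in."""
    context = ""
    for _ in range(max_turns):
        steps = reasoning_box(context, problem, framing)   # reason
        outcomes = ground_truth.apply(steps)               # act and observe
        context += "\n".join(                              # feed outcomes back as context
            f"Tried: {step} -> {outcome}"
            for step, outcome in zip(steps, outcomes)
        ) + "\n"
    return context  # the accumulated, ground-truthed reasoning log
```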

And I wonder if there’s more to this story?

My intuition is that the reasoning box and the ground truthing device are the two kinds of building blocks we need to build what I call “socratic machines”: networks of reasoning boxes and ground truthing devices that are capable of independently producing self-consistent reasoning. That is, we can now build machines that can observe things around them, hypothesize, and despite the hallucinations they may occasionally incur, arrive at well-reasoned conclusions about them.

The quality of these conclusions will depend very much on the type of ground truthing these machines have and the kind of framing they are equipped with. My guess is that socratic machines might even be able to detect ground truthing inconsistencies by reasoning about them, kind of like how our own minds are able to create the illusion of clear vision despite only receiving a bunch of semi-random blobs that our visual organs supply. And similarly, they might be able to discern, repair and enrich insufficient framings, similar to how our minds undergo vertical development.

This all sounds outlandish even to me, and I can already spot some asymptotes that this whole mess may bump into. However, it is already pretty clear that we are moving past the age of chatbots and into the age of reasoning boxes. Who knows, maybe the age of socratic machines comes next?

Innovation frontier

So we decided to innovate. Great! Where do we begin? How do we structure our innovation portfolio? There are so many possibilities! AI is definitely hot right now. But so are advances in green technology – maybe that’s our ticket? I heard there’s stuff happening with biotech, too. And I bet there are some face-melting breakthroughs in metallurgy…

With so much happening everywhere all at once, it can be challenging to orient ourselves and innovate intentionally – or at least with enough intention to convince ourselves that we’re not placing random bets. A better question might be: which spaces do we not invest in when innovating?

Here’s a super-simple framing that I’ve found useful in choosing the space to innovate in. It looks like a three-step process.

First, we need to know what our embodied strategy is. We need to understand what our capabilities are and where they will be taking us by default.

This is important, because some innovation may just happen as a result of us letting our embodied strategy play out. If we are an organization whose embodied strategy is strongly oriented toward writing efficient C++ code, then we are very likely to keep seeing amazing bits of innovation pop out in that particular space. We will likely lead some neat C++ standards initiatives and invent new cool ways to squeeze a few more drops of performance out of the code we write.

As I mentioned before, embodied strategy is usually not the same as stated strategy. I know very few teams that are brutally honest with themselves about what they are about. There’s usually plenty of daylight between where the organization says it’s going and where it’s actually going. The challenge of the first step is to pierce the veil of the stated strategy.

As you may remember from my previous essays, this understanding will also include knowing our strategy aperture. How broad is our organization’s cone of embodied strategy?

At the end of the first step, we already have some insight into the question above. Spaces well outside of our cone of embodied strategy are not reachable for us. They are the first to go into the discard pile. If we are an organization whose strengths are firmly in software engineering, attempting to innovate in hardware is mostly throwing money away – unless, of course, we first grow our hardware engineering competency.

The second step is to understand our innovation frontier. The innovation frontier is a thin layer around our cone of embodied strategy. Innovation ideas at the outer edge of this frontier are the ones we’ve just discarded as unreachable. Ideas at the inner edge of the frontier are obviously going to happen anyway: they are part of the team’s embodied strategy.

It is the ideas within this frontier that are worth paying closer attention to. They are the “likely-to-miss” opportunities. Because they are still on the fringe of the embodied strategy, the organization is capable of realizing them, but is unlikely to do so – they are on the fringe, after all.

It is these opportunities that sting the most when missed. They are the ones that were clearly within reach, but were ignored because of the pressing fires and general everyday minutiae of running the core business. They are the ones that will disrupt business as usual, because – when they are big enough – they will definitely reshape the future opportunities for the organization.

The innovation frontier is likely razor-thin for well-optimized and specialized organizations. The narrower our strategy aperture, the less likely we are to shift a bit and explore curious objects just outside of our main field of view.

In such cases, the best thing the leader of the organization can do is to invest seriously in expanding its innovation frontier. Intentionally create spaces where thinking can happen at a slower pace, where wilder ideas can be prototyped and shared in a more dandelion environment. Be intentional about keeping the scope roughly within the innovation frontier, but add some fuzziness and slack to where these boundaries are.

The third step is to rearrange the old 70/20/10 formula and balance our innovation portfolio according to what we’ve learned above:

  • Put 70% into the ideas within the innovation frontier and the efforts to expand our innovation frontier.
  • Put 20% into the ideas that are within the strategy aperture.
  • Just in case we’re wrong about our understanding of our embodied strategy, put 10% into the ideas that are at the outer edge of the innovation frontier.

And who knows – maybe my law of tightening strategy aperture will be proven wrong? Perhaps if an organization is intentional enough about expanding its innovation frontier, it can regain its ability to see and realize opportunities that would previously have been unattainable.

Wait, did we just forgo the whole notion of timelines in our innovation portfolio calculations? It’s still there, since the cone of embodied strategy does extend in time. It’s just not as significant as it was in the old formula. Why? That’s a whole different story – and luckily, my friend Alex wrote it down just a few days ago.