the evolutionary leaps of ai engineering

from stateless tools to taste convergence — the phase changes that redefine what building with ai means.

ai engineering doesn't improve linearly. it has phase changes — moments where something shifts and the entire relationship between human and machine becomes a different kind of thing. not faster, not better. different.

i've been through two of these leaps. i think i can see a third.

leap 1: world models

the default mode of working with ai is stateless. open a chat, dump context, get output, close the chat. next time, start over. the ai doesn't remember what was decided, what was tried, what failed.

this works for one-off questions. it collapses for anything that compounds. building software — or any complex system — is fundamentally about accumulating decisions. each decision constrains future decisions. stateless ai can't accumulate anything.

the first leap was giving the ai structured context about the domain. not a prompt, not a doc dump — a world model. a documented representation of what exists, how things relate, what's mutable, what's invariant. the ai starts every session with institutional memory.
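as a sketch of what "a documented representation of what exists" might look like in practice, here's a minimal world model in python. the names (`Entity`, `WorldModel`, `session_context`) are mine, not a format from the post; the point is the shape: entities, relations, a mutable/invariant flag, and accumulated decisions rendered into context at session start.

```python
from dataclasses import dataclass, field

# hypothetical sketch -- illustrative names, not an established format.

@dataclass
class Entity:
    name: str
    description: str
    mutable: bool  # can this be changed, or is it an invariant?
    relations: dict[str, str] = field(default_factory=dict)  # relation -> entity

@dataclass
class WorldModel:
    entities: dict[str, Entity] = field(default_factory=dict)
    decisions: list[str] = field(default_factory=list)  # accumulated decisions

    def record_decision(self, decision: str) -> None:
        self.decisions.append(decision)

    def session_context(self) -> str:
        """render the model as the text the ai reads at session start."""
        lines = []
        for e in self.entities.values():
            kind = "mutable" if e.mutable else "invariant"
            lines.append(f"{e.name} ({kind}): {e.description}")
            for rel, target in e.relations.items():
                lines.append(f"  - {rel} {target}")
        lines += [f"decision: {d}" for d in self.decisions]
        return "\n".join(lines)
```

the design choice that matters is `record_decision`: the model isn't a static doc dump, it's an append-only log of constraints that every future session inherits.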

this changed the kind of work ai could do. before world models: generate code from a description. after world models: build on yesterday's decisions without the human reconstructing the context. the ai went from tool to something with continuity.

the question world models answered: what does the system know?

leap 2: holdout scenarios

world models gave the ai memory. but memory doesn't mean the output is right. the ai could build confidently on prior decisions and still produce something that doesn't work.

the second leap was structural separation between building and evaluating. the agent that writes code never sees the criteria it's evaluated against. scenarios describe expected behavior. the evaluation runner checks the output. the builder only sees behavioral failure descriptions — never the test logic.

this is the same principle as train/test separation in machine learning, applied to software engineering. the system can verify its own work without gaming itself. when scenarios pass, the behavior is correct. when they fail, the builder gets a description of what went wrong — like a bug report from a user, not a failing test with line numbers.
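a minimal sketch of that separation, with a toy scenario format i'm making up for illustration: the runner sees inputs, expected outputs, and a plain-language behavior description, but hands the builder only the descriptions of what failed, never the expected values.

```python
# hypothetical sketch of a holdout evaluation runner. the scenario tuple
# format and function names are illustrative, not from the post.

def run_holdout(scenarios, system_under_test):
    """run each scenario; return only behavioral failure descriptions.

    each scenario is (input, expected_output, behavior_description).
    the builder never sees expected_output -- only the descriptions."""
    failures = []
    for inp, expected, behavior in scenarios:
        if system_under_test(inp) != expected:
            # report like a bug report from a user, not a failing test
            failures.append(f"for input {inp!r}: {behavior}")
    return failures

# the builder's output: a slug function with a bug
def slugify(title):
    return title.lower()  # forgot to replace spaces with hyphens

scenarios = [
    ("Hello World", "hello-world", "the url slug should join words with hyphens"),
    ("AI", "ai", "the slug should be lowercase"),
]

report = run_holdout(scenarios, slugify)
```

the builder receives "the url slug should join words with hyphens" and has to reason about the behavior, because the expected string `"hello-world"` never crosses the boundary.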

this changed the trust model. before holdout scenarios: the human reviews every output. after: the system validates itself, and the human reviews what passed. the surface area requiring human attention shrinks by an order of magnitude.

the question holdout scenarios answered: how do we trust the output?

leap 3: taste convergence

this is the one i haven't solved yet. it's a hypothesis, not a proven pattern. but i think it's the next phase change.

world models solved what the ai knows. scenarios solved what the ai should do. what's left is what the ai should want — which outputs are good enough, which topics are worth pursuing, which angles make something worth reading versus forgettable.

right now, taste is the thing that can't be delegated. you can give the ai all the context and all the rules and it still produces output that's correct but not yours. you spend your review time not fixing errors but adjusting tone, cutting the boring parts, adding the edge. that's taste. and it's the current bottleneck.

the argument against taste convergence is that taste shifts. what i find interesting today isn't what i'll find interesting in six months. if the ai learns a snapshot of my preferences, it's always learning the last version — chasing a moving target.

but here's the thing: on a long enough timeline, even the shifts become predictable. taste isn't random. it has structure. preferences evolve in directions shaped by values, exposure to new ideas, current projects, life stage. an ai that understands the trajectory of your taste — not just the snapshot — could anticipate where you're heading.

that's the difference between a recommendation algorithm and a creative partner. one learns what you watched. the other learns why you liked it and predicts what you'll like next — including things you haven't encountered yet.

the question taste convergence would answer: what should the system want?

the pattern

each leap changes the kind of relationship between human and ai:

  1. stateless → world models: from tool to something with continuity
  2. manual review → holdout scenarios: from supervised worker to self-verifying system
  3. static preferences → taste convergence: from executor to collaborator

each one reduces the surface area where human intervention is required. not by removing the human, but by the system getting better at predicting what the human would do. the human's role shifts from doing to steering to taste-making.

where i am

i'm solidly in leap 2. the systems i build validate themselves against holdout scenarios. the world models compound across sessions. the methodology is proven across multiple projects and domains.

leap 3 is the hypothesis i'm testing now. this blog is part of the experiment — ai drafts content, i refine it, the system learns from the delta between what it produced and what i approved. every edit is a taste signal. every approval reinforces. every rejection corrects.
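one way the "every edit is a taste signal" step could be sketched, using python's difflib. the signal format here is my assumption, not the actual pipeline: lines the human cut become negative signals, lines the human added become positive ones.

```python
import difflib

# hypothetical sketch: extracting taste signals from the draft -> approved delta.

def taste_signals(draft_lines, approved_lines):
    """diff the ai draft against the human-approved version.

    cut lines are negative signals (what the human found boring or off);
    added lines are positive signals (the edge the human put in)."""
    signals = []
    matcher = difflib.SequenceMatcher(a=draft_lines, b=approved_lines)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            signals += [("cut", line) for line in draft_lines[a1:a2]]
        if op in ("insert", "replace"):
            signals += [("added", line) for line in approved_lines[b1:b2]]
    return signals
```

whatever consumes these signals is the hard part; the delta itself is cheap to compute, which is what makes the blog a viable experiment.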

if it works, the cadence accelerates without the quality dropping. if it doesn't, taste remains the permanent human layer — which is fine. that's a meaningful role.

either way, the question is worth asking. the leaps don't announce themselves. you only recognize them after the shift has already happened.