From Process Certainty to Result Certainty: A Different Kind of Security in the AI Era
💡 Sharing this article by Yage (grapeot). He was among the first to truly master AI tools, and one of the earliest people I started following—about two years ago now. He consistently produces high-quality blog posts, and every single one is worth reading.
Even in 2026, turning AI from a demo into a product isn’t easy. Take Chinese-to-English translation—everyone thinks LLMs solved this long ago, just call an API, right? But when we recently tried to add auto-translation sync to the Superlinear Academy community, we discovered the development experience was terrible.
The core issue is that AI output has a lot of uncertainty. A post might be too long and the AI gets lazy—translating properly at first, then summarizing toward the end. Or it short-circuits, starting in English but randomly inserting Chinese characters in the middle. Or it makes small format changes, like dropping bold text. Or it might time out, outputting half the result before hanging.
To overcome these uncertainties, we had to add lots of detailed handling in our code. For long posts, we’d split them into segments, call the API separately for each, then stitch them back together (Wide Research). But this creates another problem—terminology might not be consistent between segments. So we had to design additional workflows to ensure the same Chinese term doesn’t get translated into two different English words between the first and second segments. We also added checks—if the output still contains Chinese characters, translate again. To handle timeouts while avoiding redundant translation, we implemented checkpoint recovery, only translating the failed portion and inserting it back.
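To make the scale of that orchestration concrete, here is a minimal Python sketch of the cleanup loop described above (with `call_translation_api` as a hypothetical stand-in for the real API call, and the segmentation and retry logic deliberately simplified):

```python
import re

# Matches CJK Unified Ideographs, a rough proxy for "residual Chinese".
CJK = re.compile(r"[\u4e00-\u9fff]")

def contains_chinese(text: str) -> bool:
    """Detect residual Chinese characters in supposedly-English output."""
    return bool(CJK.search(text))

def split_into_segments(post: str, max_chars: int = 2000) -> list[str]:
    """Split a long post on paragraph boundaries so the model doesn't get lazy."""
    segments, current = [], ""
    for para in post.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            segments.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        segments.append(current)
    return segments

def translate_post(post: str, call_translation_api, max_retries: int = 3) -> str:
    """Translate segment by segment, retrying any segment that still
    contains Chinese, then stitch the pieces back together."""
    results = []
    for segment in split_into_segments(post):
        output = call_translation_api(segment)
        retries = 0
        while contains_chinese(output) and retries < max_retries:
            output = call_translation_api(output)  # re-translate the bad output
            retries += 1
        results.append(output)
    return "\n\n".join(results)
```

And this sketch still omits the terminology-consistency workflow and checkpoint recovery; the real orchestration code was considerably longer.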
This approach did significantly improve success rates, ensuring even long community posts could be translated properly. But the whole thing felt exhausting. We spent 90% of our time not on making translations better, but on workflow and orchestration to clean up after the AI. And because unexpected issues kept popping up, we stopped fixing problems that only appeared once or twice—it felt endless. No productivity gains at all. Might as well use old machine translation APIs.
Then we tried a completely different approach, and the problem was solved. But before explaining what we did, let me share our deeper thinking about the root cause.
The Four-Layer Structure of Agent Calls
As mentioned, calling an AI API isn’t just fire-and-forget. It requires many supporting pieces. From an integration perspective, these can be divided into four layers:
Model Layer: Are we using Claude or GPT? Opus or Haiku? What Reasoning Effort?
Protocol Layer: Chat Completions API or Responses API? MCP or RESTful API? How to handle Rate Limits? Enable JSON Mode? When we say “calling an API,” we usually mean the protocol layer.
Runtime Layer: How to manage state? How to invoke tools? How to give file contents to AI? How to control permissions? How much concurrency? This isn’t traditional API development, but it’s unavoidable if you want AI to run stably in production.
Contract Layer: What standard defines success? What checks run after we get AI results? How to set Guardrails? When to introduce human intervention? How to ensure compliance? This layer determines whether we can trust AI output and actually use it in production.
When discussing AI product development, most talk about the protocol layer. But in actual development, the runtime layer takes the most time. Unlike traditional APIs, LLMs introduce too much uncertainty that the runtime layer must absorb and handle. The problem is, the runtime layer has nothing to do with business logic. Whether doing translation, code generation, or customer service bots, we all handle laziness, stitching, context management, and concurrency control. Every team reinvents the wheel. So a natural thought: can we outsource the runtime layer?
It’s not that simple. Different models have different failure patterns. Some follow instructions well but get lazy on long texts; others are creative but terrible at format control. We clean up differently for different models, especially for long-tail failure patterns. So the runtime layer is often highly model-customized, hard to reuse, let alone outsource.
But recently something changed this situation: Claude Code itself isn’t open source, but more model providers are actively supporting compatibility. Kimi, DeepSeek, GLM all provide official interfaces—just change a few environment variables and Claude Code can use these models in the backend. This is interesting. It means Claude Code has transcended being just a tool and become something reusable.
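As a rough sketch, the switch typically comes down to setting a few ANTHROPIC_* environment variables before launching Claude Code. The variable names below follow Claude Code's documented overrides, but the endpoint URL and model name are placeholders, not official values; check each provider's documentation for the real ones:

```shell
# Point Claude Code at an alternate Anthropic-compatible backend.
# The URL and model name here are illustrative placeholders.
export ANTHROPIC_BASE_URL="https://api.example-provider.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-provider-api-key"
export ANTHROPIC_MODEL="provider-model-name"

claude  # Claude Code now talks to the alternate backend
```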
More importantly, when model providers claim Claude Code compatibility, what they’re actually doing is: adapting their model’s failure patterns to Claude Code’s expected behavior. In other words, the cleanup doesn’t disappear, but who does it changes—from us developers to model providers. To enter this ecosystem, they must ensure their models perform stably in Claude Code’s runtime. (This discussion applies to similar tools like Codex/Cursor Agent too, since their command-line interfaces are very similar and easy to adapt.)
In other words, Claude Code/Codex/Cursor Agent is becoming a reusable Agentic Runtime.
This solves the long-tail problem mentioned earlier. Those scattered edge cases that one team can’t fix, the whole ecosystem can. Every model provider wanting Claude Code compatibility, including Anthropic itself, is filling holes for us. So a new approach: for translation, we can completely shift from “call API then clean up ourselves” to “just hand it to Claude Code.” By freeloading off the entire ecosystem’s adaptation work, we’re effectively reusing its runtime layer—moving from reinventing wheels to standing on a converging standard.
In Practice: Handing Translation to Claude Code
This is why we decided to take a different path: instead of continuing to handle uncertainty in the runtime layer ourselves, stand directly on this converging standard. So we tried giving the community translation task to Claude Code. The most immediate feeling: most problems we’d spent lots of time on just disappeared.
First, the laziness problem. Before with API calls, we had to do segmentation, stitching, validation ourselves. But Claude Code works differently—its basic unit of operation is files. Files are stateful, existing on disk, serializable, persistent. So we can have Claude translate chapter by chapter, writing back to the file after each chapter, without needing an orchestration layer outside to track progress.
Same for checkpoint recovery. Before when API calls timed out, we had to record checkpoints, translate only failed parts, stitch back. Now we don’t. If translation dies halfway, the file is there, already-translated parts won’t be lost. Restart and tell Claude Code to continue—it reads the file, sees where translation stopped, continues from there.
Terminology consistency used to require specially designed workflows, using master-detail or progressive patterns to carry the first segment’s terminology into later ones. Now Claude Code reads the whole file before each modification—it naturally sees prior context. So terminology consistency is solved with a simple prompt: first read the whole file, see what terminology was used before, translate lines XX to XX.
Chinese characters in output—before we needed detection, judgment, retry. Now just tell Claude in the prompt: after translation, check from start to end, ensure no residual Chinese characters. Going further, since Claude Code can call Python, we can have it write a simple script to verify the final file format meets requirements. It writes check logic, runs it, fixes issues itself.
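For illustration, the kind of throwaway validation script the agent can write and run itself might look like this (a simplified sketch that checks only for residual Chinese characters):

```python
import re
import sys

# Matches CJK Unified Ideographs, a rough proxy for residual Chinese.
CJK = re.compile(r"[\u4e00-\u9fff]")

def find_residual_chinese(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that still contain Chinese."""
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if CJK.search(line)]

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        bad = find_residual_chinese(f.read())
    for n, line in bad:
        print(f"line {n}: {line}")
    sys.exit(1 if bad else 0)  # nonzero exit lets the agent see the failure
```

The nonzero exit code is the important part: it turns “looks done” into a signal the agent can observe and act on.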
The common thread: problems that used to require workflow-level solutions can now be stated clearly in natural language prompts, letting the agent handle them reliably. We can finally focus on making translations better, not preventing the system from doing something stupid.
Agentic Loop and Evaluation-First Mindset
These changes let us finally focus on translation quality itself. But after finishing I got curious: why can Claude Code do this? Does a different way of calling APIs really make such a big difference?
We said earlier an important reason is reusing the ecosystem’s adaptation work. But that’s just a higher-level, surface reason. From the four-layer structure perspective, the direct reason Claude Code works is: it lets AI observe the results of its own actions.
This sounds obvious, but it’s the essential difference between agentic AI and traditional API calls. With APIs, AI only sees the prompt fed to it, spits out a result, then it’s over. If the result has problems—JSON format wrong, missing fields, second half lazy—AI doesn’t know. You discover the problem, you decide whether to retry, you write the fix logic. This is why we think AI is dumb and need to clean up after it.
But Claude Code is different. After modifying a file, it can call Python to run a JSON parser, see an error saying line 9527 has a syntax error. This error feeds back to it, so it knows what to fix. Fix it, run again, passes, continue. This execute → observe → correct loop is the agentic loop.
This is also why the file format is so important. Files are state carriers; visible state makes the closed loop possible. We changed translation from calling an API once for results to having an agent operate files in a working directory—this effectively gives AI a pair of eyes. It can see what it did last step, see validation script output, decide next steps based on this information. This is the capability the runtime layer brings.
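Stripped to its skeleton, that closed loop is very small. In this sketch, `act` and `check` are hypothetical stand-ins: `act` performs one step (such as asking the model to fix the file) and `check` observes the result (such as running a validator), returning an error message or None:

```python
def agentic_loop(act, check, max_iterations: int = 5) -> bool:
    """Run the execute -> observe -> correct loop until the check
    passes or we give up."""
    feedback = None
    for _ in range(max_iterations):
        act(feedback)       # execute, optionally informed by the last error
        feedback = check()  # observe: None means the result is correct
        if feedback is None:
            return True     # the loop converged on a passing result
    return False            # budget exhausted without passing
```

The point is not this particular function; it is that tools like Claude Code run this loop for you, with files as the observable state.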
But an agentic loop running doesn’t mean it runs correctly. Observing results is one thing; knowing what result counts as “correct” is another. That’s what the contract layer answers. Back to translation. Even with Claude Code, it’s not like we switched tools and things magically worked.
If you just say “translate this file to English,” Claude translates, but results still have segments with Chinese characters. Same problem as API calls before, just much easier to fix now: add to the prompt—after translation, run a Python script to check for residual Chinese characters, fix if found. Claude Code reliably writes a simple regex check, runs it, finds problems and goes back to fix, runs again until passing.
But this reveals a more important issue: previous errors weren’t because Claude was dumb, but because it didn’t know what “done translating” meant. To it, running one Chinese-to-English pass over each chapter meant the task was complete. But to us, “done translating” also includes correct format, no residual Chinese, consistent terminology—implicit expectations. These expectations were in our heads, invisible to Claude. Once we explicitly write out these expectations and tell it how to verify them, it can judge whether it’s done.
I like this analogy: imagine giving a task to an intern with amnesia. This intern has no context, doesn’t know what you discussed before, doesn’t know your implicit expectations, can only see this one instruction. You need to write acceptance criteria detailed enough that: based only on this information, they can judge if they’re done. If they think they’re not done, they know what’s missing. My experience: write to this detail level, and you can basically expect Claude Code/Codex to reliably complete tasks. If it can’t, don’t rush to blame AI—first check if the standard wasn’t clear.
So now we can clarify the relationship between these two layers. The runtime layer gives the agent observation ability—it can see what it did, what the result was. The contract layer tells it what counts as success—it can judge if it’s done. Both are essential: observation without standards means the agent spins aimlessly, giving beautiful results that may not meet requirements; standards without observation means the agent stops after one try, success pure luck. Agentic loop plus evaluation-first forms a complete closed loop.
From Process Certainty to Result Certainty
Once this closed loop is established, it subtly changes where our trust in AI comes from. Behind it are actually two different kinds of certainty.
Traditional programmers’ security comes from process certainty. Every line of code I write is under my control, every branch, every boundary condition I’ve considered. The program’s behavior is what I designed—as long as it follows this logic, it will definitely produce conforming results. This certainty is tangible, and this ability to translate results into program behavior is a fundamental skill we’ve trained for years.
But the agentic loop and evaluation-first mindset we just saw is another kind of certainty. We don’t specify every step, but specify what the destination looks like and how to verify arrival. The process is uncertain—Claude might translate then check, or check while translating, might use regex or other methods—but the result is certain: as long as acceptance criteria are right, the final product is right. This is result certainty.
Behind these two certainties are two different cost structures. Process certainty economics: code execution costs almost nothing, but writing code costs expensive labor. So we carefully design logic, pursue reuse, avoid repetition, spreading labor costs across every execution. Result certainty economics is reversed: intelligence is getting cheaper, costs of having AI repeatedly try, check, correct are rapidly falling. We can lavish tokens for certainty—not by writing more defensive code, but letting AI use its reasoning ability to counter uncertainty.
This is the same logic I discussed in Disposable Software and Compressed Reality. That article said when writing code costs approach zero, disposable software becomes optimal strategy. The change here is broader: not just code, but all reasoning and intelligence is getting cheaper. Translation isn’t writing code, but it’s still produced by burning tokens. When this cost is low enough, we can have AI do checks on the spot, write validation scripts on the spot, loop repeatedly until correct—without pre-writing all possible situations as rules in code.
Beyond cost structure changes, there’s also a ceiling difference. Process certainty’s ceiling is our imagination and energy—situations we can think of, logic we can write, that’s the system’s boundary. Result certainty’s ceiling is higher: we don’t need to enumerate all possible paths, just clearly define what’s correct, and the agent will find its way to that state.
But we’re not used to result certainty and often feel uneasy. Because a core skill we pride ourselves on in our careers is exactly translating results into process: boss says they want a system handling 100k concurrent connections, we design an architecture to guarantee that result; PM says user uploads can’t exceed 10MB, we write validation logic to block oversized requests. So when we start using AI, this habit naturally continues—we instinctively want rules to govern AI behavior: output must be JSON format, every field must exist, handle this situation this way, that situation that way.
But this path has a ceiling. AI isn’t a deterministic system; constraining it with process means you’re using lots of rules to control its uncertainty. More rules, more holes to patch, eventually spending more effort on defense than on solving problems. This is exactly the predicament we faced doing translation with APIs at first.
But what if we accept some compromise? If we’re willing to accept process uncertainty and instead constrain AI behavior by specifying results, things change. We no longer say “you must use this method to handle this situation,” but “the final product must satisfy these conditions, figure out how yourself.” This way, AI’s flexibility isn’t a risk we need to control, but a resource for completing tasks.
Of course, this path wasn’t easy to walk before. If you want AI to observe results itself, judge right/wrong itself, decide next steps itself, you had to build an agentic loop yourself. And agent wrapping is harder than it looks: you handle tool call formats, parse AI output, manage context windows, adapt to different model characteristics. After all this, you find yourself doing another form of trading process for certainty. (And introducing Agentic frameworks often brings larger technical debt.)
But now you don’t have to. Claude Code, Codex, Cursor Agent—these tools have done the dirty work of the runtime layer. Agentic loop is ready-made, file system is ready-made, tool call encapsulation is ready-made. What you need to do is think clearly about what result you want, how to verify it, then tell it in natural language.
So I have a suggestion: try embracing process uncertainty. Don’t reflexively specify every AI behavior step, but directly describe your expectations for final results, codify them as verifiable standards. Let Claude Code-type tools handle runtime layer matters; you focus on the contract layer: define what’s correct, define how to verify.
This is a different way of working, and a different source of security.
Conclusion
Of course, this way of working has boundaries.
First is the nature of tasks themselves. Result certainty works on the premise that you can clearly define what “correct” means. Translation suits this because acceptance criteria can be formalized: correct format, no residual Chinese, consistent terminology—these can all be written as scripts for the agent to run. But some tasks’ “correct” is hard to define, or the defined standard itself is ambiguous—though then again, using rules to constrain process would only be harder. At least evaluation-first gives a clear failure signal.
Second is safety. With APIs, AI has no control over your system. It can only receive prompts, return text, that’s it. But Claude Code-type tools are different. They can read/write files, execute Python, run bash commands. This is why they’re powerful, and also why they’re dangerous. This problem deserves serious attention. Our approach is to tighten permissions in configuration: use --allowedTools to limit which tools can be called, and narrow the executable scope to specific scripts. Going further, combine with popular lightweight sandbox solutions, so even if the agent messes up it only disrupts files inside the sandbox, not the host system.
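As a rough sketch, that restriction looks something like the invocation below. The flag name follows Claude Code's CLI, but the prompt and the specific tool patterns are illustrative; check your version's documentation for the exact syntax:

```shell
# Let the agent read and edit files, but only run one specific script.
# Prompt and tool patterns here are placeholders, not our exact setup.
claude -p "Translate post.md to English per the acceptance criteria" \
  --allowedTools "Read" "Edit" "Bash(python3 check_chinese.py *)"
```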
There are indeed many pitfalls here. How to design permission models, how to configure sandboxes, how to roll back when something goes wrong—these are open questions without standard answers. But I’m optimistic about this direction. Security issues are engineering issues, and engineering issues are solvable. These risks won’t make this path unworkable.
Back to the opening question: do we make AI part of the system, calling it with programs, making a translation product with AI features? Or make AI the system’s core, having it call programs, making an AI Agent that completes translation tasks?
We tried both paths and found that the latter had a surprisingly higher success rate and stability. This might be because the latter lets us reuse the ecosystem’s adaptation work, because the agentic loop lets AI self-correct, because evaluation-first lets us constrain AI behavior with results rather than process. These factors stack together to form a different way of working.
It requires us to give up some things: the sense of control over process, certainty of every behavioral step, and the instinct we trained for years to translate results into flows. But it also gives us things: higher ceiling, less grunt work, and a new, result-based security.
How far can this pattern extend? I’m not sure, but at least for translation, it completely changed our development experience. We’ve organized this practice into a how-to guide—you can send it to your AI and try it right now.
Original Author: Yage (grapeot)
Original Link: https://yage.ai/result-certainty.html
If you found this helpful, consider buying me a coffee to support more content like this.