AI Native Dev: Drew Knox Talk Transcript — Context as Code
Speaker: Drew Knox — Head of Product & Design at Tessl (former Research Scientist leading language modeling teams at Grammarly)
Event: AI Native Dev
Duration: 29:36
Source: YouTube
Analysis: Deep Analysis & Editorial Commentary
Introduction: Drew Knox’s Background
Time: 00:00:01 - 00:01:33
We have effectively had general-purpose agentic development machines for 50 years. We just called them software engineers. You never just wrote somebody a Slack message and expected them to go build an entire system and just make all the best decisions, right?
Well, nice to meet everybody. My name is Drew Knox. I’m the head of product and design here at Tessl. I’m going to talk today about using skills in a more professional, rigorous software engineering mindset.
So before I get into this, why should you trust me? Well, one — maybe don’t. Maybe be skeptical. But in my past life, before leading product and design here at Tessl, I was a research scientist leading the language modeling teams at Grammarly and at a startup called Cantina that sadly has not found success yet — an AI-first social network. I’ve done a lot of work on developer tools, and I do a lot of moonlighting actually writing code, probably none of it as good as what the actual people on Tessl’s teams write. So I’ve thought a lot about this, I’ve done a lot of work on this. I’d like to share some insights, would love questions, would love to hear your experience and what’s worked for you. I’ll try to save lots of time for questions at the end. But without further ado — you want to work on skills, and maybe more broadly, you want to work on context for your agents.
The Era of Context Engineering
Time: 00:01:33 - 00:04:17
I’m sure folks have heard about context engineering. It feels like every year we’re told that this is the year of something. I’ve heard people say that this is the year of context engineering. Maybe it is, maybe it isn’t.
As you start to work on this, you’ll probably go through the same stages of denial, acceptance, et cetera — from “this is amazing, I’m getting good results” to suddenly “God, how does any of this work? Is any of this impactful? I thought I was an engineer and now I feel like I’m an artist or a librarian. How do I turn this thing — agents, context engineering — back into the kind of reliable, predictable engineering that I know and love?”
So how do we go about doing that? I think the first thing to realize is that the reason we’re all doing context engineering now is because we’ve all effectively become tech leads instead of ICs. The job in some sense is no longer writing good code. It’s ensuring that good code can be written — which is things that tech leads know and love, or hate, already: maintaining good standards, making good decisions, documenting it, providing the context to the rest of your team, setting a good quality bar for other engineers to contribute. We’re doing that. We’re just doing that for agents now.
And so what that means is that context is in some sense our new code.
Some people might hate that. Please take it with a grain of salt — it’s a metaphor. If context is our new code, though, there’s things that we expect of our code. We want a way to know: Are my programs correct? Are they performant? How do I reuse programs? How do I automate repetitive tasks that are annoying?
We’ve come to expect a lot of answers here for actual code — unit tests, integration tests, analytics and observability — all these things that give us really good insight into how our programs function. And the core thing that I want to argue today is that all of these have an analog in the world of context engineering. And if you are diligent about finding a toolset that does this, you can reclaim a lot of that predictability, a lot of that rigor that you’ve come to expect with code. I’m going to show you Tessl just to illustrate the concepts, but you don’t have to use Tessl to do any of this. These are general concepts and patterns. So how can we take all these concepts and apply them to context? That’s the TLDR.
Three Challenges and the SDLC Analogy
Time: 00:04:18 - 00:08:47
Before we get started, there are three challenges that make it not a direct comparison.
First, LLMs are non-deterministic. You can’t just run them once and say “oh, it worked” or “oh, it didn’t work, so I now know my context is good.” If you tell an agent to do a thing, sometimes it will, sometimes it won’t. I’m sure you’ve felt this pain many times.
Second, a lot of times when you create context, there’s not one right or wrong answer. If you write a style guide or documentation for a library, how do you determine that an agent’s solution did it correctly? You can’t just write a unit test and say “ah, we’re done, it worked.” So grading output can be a little challenging.
And finally, there is this new problem that your programs are now actually things that describe other things. So you have things to keep in sync. You might update your API and need to update documentation to match it, or change a company flow in one place and make sure it gets distributed throughout your organization.
So this is a quick overview. I’m going to dive into each of these. I actually do think there is a direct analogy for all of the tools that you’ve come to expect in the software development life cycle. I’ll quickly run through them:
- Static analysis is going to look like LLM-as-judge — the same idea of a fixed set of best practices, rules, validation, compiling, that you should be able to run against your context. To give an example, we recently saw a customer using Tessl who had added an @ sign into one of their files and didn’t realize that was suddenly triggering the import mechanisms for most agents’ MD files, breaking a whole host of their context without even realizing. Seems silly, but static validation is still important.
- Unit tests are going to look probably the most different. Instead of defining a unit test that runs, you’re going to want to think through scenarios that stress-test the agent, run them many times in parallel, and take statistical averages. You want to see: when I add context, does it actually improve the average performance?
- Integration tests — same thing, but testing lots of context at once, designing scenarios that map to using different kinds of context together.
- Analytics — how can you start actually measuring agent sessions in the wild to see what’s happening? Do we have missing context? Are things being used correctly?
- Automation and build scripts — how do you make it so that your context is not this static thing that grows out of date and dies, but as you update things you’re getting follow-up PRs that auto-update your context?
- Package manager reuse — this has in the last two or three weeks sort of blown up everywhere. Things like Skills.sh, Tessl’s context registry. The idea of reusable units of context has come onto the scene.
Static Analysis: Format Validation and Best Practices
Time: 00:08:48 - 00:10:55
OK, let’s review formatting and best practices. I’m going to use Tessl as an example here, but I’ll try to explain all of this in a way where you could build it all yourself if you wanted. There’s other tools that do a lot of this — not as well as Tessl though, obviously.
If you look at the Skills standard, first of all there’s a bunch of static formatting you can check. They have a reference CLI implementation that will verify your skill compiles. I think everybody who’s writing skills should have that in CI/CD, checking that all of your skills are kept up to date. Anytime a skill file changes, you should be rerunning validation. You would be stunned how many people have none of their context loading and don’t even realize it. That’s a big one.
But also, if you look at Anthropic, they have a best practices guide — basically a list of rules. Tessl will tell you if your things compile. We also take Anthropic’s best practices and run that through LLM-as-judge. There’s a bit more you can do to tune the prompt for better results, but honestly just putting a prompt with Anthropic’s best practices in it is a great starting point. You get information on how specific your context is, whether it has a good concrete case for when it should be used. I’m sure folks have heard about skills and how they don’t activate very often — there are concrete things you can do without even running the skill to know how likely it is to trigger.
These things are cheap, they’re quick, you can put them in CI/CD, and it’s a surprisingly large lift to actually making your context useful. I recommend this as just table stakes. Everybody should have this, just like everybody should have a formatter and a linter. Bonus points: you can feed the output of this back into an agent and ask it to fix it. Pretty nice quick loop.
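As an editorial sketch of the kind of cheap static checks described above (not the official skills validator, and with purely illustrative rules), a tiny linter for skill files might look like this:

```python
import re

def lint_skill(text: str) -> list[str]:
    """Run a few cheap static checks on a SKILL.md-style file.

    These checks are illustrative, not the official skills validator;
    real tooling such as the reference CLI should be the source of truth.
    """
    problems = []

    # 1. Frontmatter: skills are expected to start with a YAML block
    #    carrying at least a name and a description.
    if not text.startswith("---"):
        problems.append("missing YAML frontmatter")
    else:
        header = text.split("---", 2)[1]
        for field in ("name:", "description:"):
            if field not in header:
                problems.append(f"frontmatter missing '{field.rstrip(':')}'")

    # 2. Stray @-references: in some agents' markdown files, `@path`
    #    triggers file imports, which can silently break context loading.
    for match in re.finditer(r"(?<!\S)@[\w./-]+", text):
        problems.append(f"possible unintended import: {match.group(0)}")

    # 3. Vague trigger: a skill with no explicit "when to use" hint
    #    rarely activates. (Hypothetical heuristic, tune for your agents.)
    if "use this skill when" not in text.lower():
        problems.append("no explicit 'use this skill when ...' trigger phrase")

    return problems
```

Checks like these are fast enough to run on every commit in CI/CD, and the output is plain text you can feed straight back to an agent to fix.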
Evals: Is My Context Actually Helping?
Time: 00:10:57 - 00:17:38
OK, now slightly more complicated. A slightly more net-new concept. How do you write evals for your context?
Depending on whether you’re coming from more of a software background or more of an ML/deep learning background, this might either be obvious or not so obvious. The thing you’re trying to answer is: Is my context actually helping? And how well is the agent doing at the task that I’m trying to achieve?
If I use this as an example — we have some library that we want the agent to use, and we can see how it performs without any context. It’s not good at using the list function; maybe it implements it itself or uses a different library. It’s also bad at async handling, but it’s pretty good at correct stream combination and at doing zip files.
You want to understand this so that you can then understand where you need to apply context to fix the problem. There’s a couple things you might get from a view like this:
- You might have written a bunch of context only to realize the agent did fine without it — why are you wasting tokens on it?
- You might actually write something and realize it made performance worse because something’s gone out of date or it’s just added tokens for no reason.
- In an ideal world, you see: “Ah, it works better with it and I’ve only applied tokens where it matters.”
All you have to do to get this set up is write some prompts — realistic tasks that you want the agent to do that require usage of the context you’ve created — and then write a scoring rubric for what a good solution to that problem looks like.
The reason I say write a scoring rubric and not “write a bunch of unit tests” is twofold. First, unit tests are really obnoxious to write and they take a long time, and you will quickly find that you just don’t do it if you have to create example projects and test suites for every single piece of context. More importantly, agents do unspeakable things to get unit tests to pass. Functional correctness is not the only thing that you’re measuring, especially for context. A lot of times you want to know: Was idiomatic code written? Did it use the library I actually wanted it to use instead of implementing its own solution? There’s really no way to measure this with unit tests. It’s much better to do more agentic review or LLM-as-judge.
What you want to do is define — we put them in markdown files. You want to have a prompt that runs through “build this thing, here are the requirements.” It should require using the context, or at least should require doing what the context says, because you actually want to measure it with and without context to see if the agent is just smart enough to do it on its own. Then importantly, you want to define some kind of grading rubric. You want to be pretty specific so that you get reliable results from an LLM — things like “the solution should use this exact API call somewhere in the method” or “it should initialize this before it initializes that.” Very granular things that can be checked at the end.
An important thing to note is that once you have these in place — this can take a bit of upfront work, it’s like the new source files that you have to care about as an agentic developer — but say you get about five of these per piece of context, that’s what we’ve found is a pretty reliable measure. Once you have some of these, then you can reap the benefits forever. Just like unit tests — every time you make a change, you rerun these, you see if it helped or hurt.
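To make the run-many-times-and-average idea concrete, here is a minimal editorial sketch in Python. The `run_agent` and `judge` callables are placeholders for your own wiring (an agent invocation and an LLM-as-judge call), not any particular tool's API:

```python
import statistics

def run_eval(task_prompt, rubric, run_agent, judge, trials=5, context=None):
    """Score an agentic task several times and average the results,
    since a single run of a non-deterministic agent tells you little.

    `run_agent(prompt, context=...)` and `judge(output, rubric)` are
    placeholders for your own agent invocation and LLM-as-judge call.
    """
    scores = [
        judge(run_agent(task_prompt, context=context), rubric)
        for _ in range(trials)
    ]
    return statistics.mean(scores)

def context_lift(task_prompt, rubric, run_agent, judge, context, trials=5):
    """Compare the average score with vs. without the context under test.
    Positive lift means the context is earning its tokens; negative
    lift means it is hurting and is a candidate for deletion."""
    with_ctx = run_eval(task_prompt, rubric, run_agent, judge, trials, context)
    without = run_eval(task_prompt, rubric, run_agent, judge, trials, None)
    return with_ctx - without
```

In practice you would point this at your markdown task and rubric files and run the trials in parallel; the key design choice is that the unit of measurement is the average lift, not any single pass/fail.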
One thing that’s different is that oftentimes you’ll rerun these without changing the context, because there is something else that’s changing: the agent and the model. What we have found is that oftentimes you can start stripping out your context as agents get better. We had style guides for Python. Claude Opus 4.6 writes pretty damn good Python. It doesn’t need a style guide anymore. Your evals can tell you that and help you delete context that you no longer need. Save money, don’t pay the tokens.
Every once in a while there will be a regression. There was a recent Gemini that was kind of a smartass and thought it didn’t need to use tools and read context. And then we realized, oh, we’ve had a regression — we need to go beef up how much we tell the agent to use the context.
Repo evals — I talked about integration tests. It’s basically the same thing, but you don’t want to just test your context in isolation. You also want to measure realistic scenarios in your full coding environment with all your context installed. I was just watching a talk earlier today that described the “dumb zone” — where you’ve gotten too much context in your context window because of tools, because of context, because of all these things, and the agent is just persistently bad.
So you want to have a few coding scenarios — five for your repo is a fine place to start — that represent an average development task, with a rubric to grade the output. Run it every once in a while. See if your tech debt has gotten to a point where agents don’t understand how to work in your code. Have you installed too much context? Too many tools?
One thing we found that works pretty well is scan your previous commits and turn some of those into tasks. You can even, on a regular cadence, pick five random commits over the last month and refresh your eval suite. For folks in the ML world, you have things like input drift where you want to update your tasks every once in a while. Don’t worry about it if that seems like too much effort — just start with something and you can improve it over time. Same idea: task scenarios, grading rubrics, run them every once in a while, make sure you haven’t degraded things.
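The commit-mining idea can be sketched roughly like this; in a real pipeline you would pull the subjects via something like `git log --since="1 month ago" --pretty=%s`, and the filtering heuristics here are only illustrative:

```python
import random

def tasks_from_commits(commit_subjects, n=5, seed=None):
    """Turn past commit messages into candidate eval tasks.

    `commit_subjects` is a list of commit subject lines (how you fetch
    them is up to your pipeline). Merge commits and trivial version
    bumps are filtered out with a crude, illustrative heuristic.
    """
    rng = random.Random(seed)
    candidates = [
        s for s in commit_subjects
        if not s.startswith(("Merge", "Bump", "Revert"))
    ]
    picked = rng.sample(candidates, min(n, len(candidates)))
    # Phrase each historical commit as a forward-looking task prompt.
    return [f"Implement the following change in this repo: {s}" for s in picked]
```

Rerunning this on a regular cadence gives you a cheap refresh of the eval suite, which is one way to keep ahead of the input drift mentioned above.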
Observability: Mining Agent Logs
Time: 00:17:40 - 00:20:36
This one I think is pretty cool, but also kind of scary — you want something like analytics and observability. You’ve written this context, you’ve validated the change before pushing it out to the repo for everyone. We do that in software, but then we also still have crash logs, we have metrics, we have usability funnels. This actually does exist for agents — just a lot of people aren’t paying attention to it.
All of the agents store all of their chat logs in files in accessible places. You can write your own scripts if you’d like. Tessl has the capability to gather these — opt-in, of course, because obviously it’s very sensitive information. You can review those transcripts to see things like: Were tools called? How often was this piece of context used? How often does this pattern actually manifest in the code? How often does it import a library right in the middle of a function?
There’s a lot of rich information here that you could just write a quick script for, ask everyone on your team to run it once, aggregate a bunch of logs, and review common problems that you might want to make new context for. A great one is anytime the agent apologizes — just look for the word “sorry,” look for “you’re absolutely right.” All of these things are good signals. Like, “oh, maybe we should write something to fix that.” There’s a wealth of information and I guarantee you’ve got three or four months of Cursor logs sitting on all your devs’ machines that you could mine for “what should we be doing differently?”
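The apology-mining trick is simple enough to sketch directly; the marker list here is illustrative, and the input is just session-id-to-transcript text however your script gathers it:

```python
# Apology phrases as a cheap proxy for "the agent went wrong and got
# corrected": each hit marks a place where new context might help.
APOLOGY_MARKERS = (
    "sorry",
    "you're absolutely right",
    "my apologies",
    "i apologize",
)

def flag_sessions(transcripts):
    """Scan agent chat transcripts for apology phrases.

    `transcripts` maps a session id to the full transcript text;
    the marker list above is illustrative, so tune it for your agents.
    """
    hits = {}
    for session_id, text in transcripts.items():
        lower = text.lower()
        found = [m for m in APOLOGY_MARKERS if m in lower]
        if found:
            hits[session_id] = found
    return hits
```

Run something like this over aggregated logs from the whole team and the sessions it flags become a prioritized worklist of context to write.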
How do you keep your context up to date? You can do something pretty simple here — set up something in your CI/CD. There are all kinds of agentic code review tools, Claude Code among them. But I think a general thing to set up is: anytime a PR comes up, have something scan that PR and ask, “Is there any markdown file here that should be updated?” It’s not that hard. It really works better than you’d think. Because PRs tend to be so focused, agents are pretty good at finding where they should update. If your PRs are too big — maybe it’s a good sign to make your PRs smaller again.
Tessl can automate a lot of this. “Oh, you added a new case to your logging levels here — update your documentation as well.”
This one is probably the most important because as your context gets out of date, it just destroys agent performance. So if you’re going to write context, you have to have a solution for keeping it up to date. Agents are pretty good at doing this, so you don’t have to do it by hand. Don’t do it by hand, because you won’t do it.
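One editorial sketch of the PR-scan step: given the files a PR touches and some index of which source paths each doc mentions, flag the docs that probably need a follow-up update. Building the index (by grepping the docs, or by asking an agent) is left to your pipeline:

```python
def stale_docs(changed_paths, doc_index):
    """Given the files touched by a PR and an index of which source
    paths each markdown doc references, return docs that may need a
    follow-up update.

    `doc_index` maps doc path -> set of source paths it documents;
    how you build it (grep, or an agent pass) is up to your pipeline.
    """
    changed = set(changed_paths)
    return sorted(
        doc for doc, referenced in doc_index.items()
        if changed & referenced
    )
```

The output is just a list of doc paths, which is exactly the shape you want for opening an automated follow-up PR that asks an agent to refresh each one.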
Package Managers: Reusing Context
Time: 00:20:37 - 00:22:36
Last thing: package managers. You need a package manager if you want to reuse context — code review skill, documentation on how to use React, best practices, et cetera. I won’t belabor this point. There’s lots of good options out there. Skills.sh is probably the most popular, though it pains me to say that. Tessl has a package manager as well. It’s not the most popular. I think it’s the best. I won’t pitch you on why it’s the best, but it’s the best.
Two things that are different that you should think about when figuring out how to use context:
First, unlike other package managers, a lot of the context you’re going to install is going to be describing other package managers. I have an example here where I have documentation on a library that’s part of PyPI, and it describes a particular package and a particular version. It’s a weird concept. So you want to think about: what is your strategy for matching? If you have documentation on a library, how do you make sure that as you update your library, you keep documentation keyed to the same version? You don’t want to say “I’m using Context7 on the latest version of React” when you’re actually pinned to React 17 for some reason.
Second, think about how you keep your context in sync with dependencies, in sync with tools or APIs that you’re using. Because it’s a new source of drift that you might have to care about.
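A minimal sketch of catching that drift, assuming you can read installed versions from a lockfile and documented versions from your context packages (both represented here as plain name-to-version dicts):

```python
def pin_mismatches(installed, doc_pins):
    """Compare installed dependency versions against the versions your
    context docs claim to describe.

    Both inputs map package name -> version string; a real setup would
    read `installed` from a lockfile and `doc_pins` from metadata in
    the context packages. Only major-version drift is flagged here.
    """
    report = []
    for pkg, doc_version in doc_pins.items():
        actual = installed.get(pkg)
        if actual is None:
            report.append(f"{pkg}: documented but not installed")
        elif actual.split(".")[0] != doc_version.split(".")[0]:
            # Major-version drift is the loudest signal that the docs
            # no longer match reality.
            report.append(f"{pkg}: docs cover {doc_version}, installed {actual}")
    return report
```

A check like this belongs in the same CI/CD job as the static skill validation: cheap, fast, and it catches the React-17-vs-latest problem before an agent trusts stale docs.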
That’s it. That’s my walkthrough. A lot of this is not necessarily hard to do — it’s just fiddly to keep updated and keep pace with the rate of agent change. Happy to answer questions now or afterwards.
Q&A: The Future of Context
Time: 00:23:11 - 00:24:54
Audience: So what do you see as the end state — in 12 months or even 6 months? Claude 4.6 is really good, Codex 5.3 — and when Codex 6 comes out, Claude 5, Gemini 4… Do we need a lot of the scaffolding or does it go away?
Drew Knox: Fantastic question. First, it’s going to split a lot by whether you’re a greenfield or a brownfield. If you’ve built an app from the ground up for agents, it’s going to be a lot easier than if you’re doing an enterprise Java app.
I think the number of things you need context for will go down. The Python style guide example — all the rage six months ago, nobody needs it now. But describing your custom internal logging solution — you’re always going to have to document that because an agent doesn’t have access to it, it’s not in its training weights. There’s some amount of knowledge that will always need to be told to the agent.
My expectation is that eventually you won’t be proactively jamming almost any context into an agent’s window. You’ll have some kind of signposting, like progressive disclosure — the agent will get to look at it if it deems it necessary, like a normal developer. And then a lot of your usage of context will be applied at review time. You will create a review agent that looks for things like “did it break our style guide? Did it reimplement something?” It’ll be there for control, not to educate the agent up front.
I think evals are going to play a big part in helping you navigate that change — knowing when it’s time to move things out of the context window into a review, or just delete it.
Q&A: Eval Scoring in Practice
Time: 00:24:54 - 00:26:24
Audience: I wanted to ask about evals. You had max scores of 50 and 30. In my experience, non-binary scoring doesn’t really work. Could you tell us how it works, and for which agents it works?
Drew Knox: I think that’s right. Binary is pretty much the only thing that matters — we give more granularity in Tessl if you want it. But if you look at it, agents pretty much always score zero or the max score. So I would say yes, you could get away with 0 or 1 and it’d be about the same.
Audience: So I’m an AI engineer. I want to build solutions really fast. Would you recommend just using Opus 4.6 to get out an eval set very quickly and then just use that as a baseline — which is not perfect but just have that as a starting point? Or would you recommend doing a full thorough analysis first?
Drew Knox: Personally, I’m busy, I have a lot to do — just start with the best agent. What I’d say more is: once you have some really repetitive tasks, it can be worth it to say “OK, what is the cheapest I can get away with?” A lot of times context will help you use smaller models to do that. But for day to day, your general driver — just always crank it to the max, unless you have some reason you can’t.
Q&A: The Role of the Technical Architect
Time: 00:26:24 - 00:29:22
Audience: My question is more on behalf of non-technical people, or people who are not too technical. What barometer can we use to identify the point in time where we don’t really need much insight into what the agents or the LLMs do? Like, spec-driven design — acting as a product manager, writing a PRD, but without placing too much importance on the technical side. Are we going to be there soon?
Drew Knox: I’m going to throw out maybe a spicy take, which is: definitely we’re not there, and I don’t think we’ll ever be there.
What I mean by that is — we have effectively had general-purpose agentic development machines for 50 years. We just called them software engineers. And in that case, you never just wrote somebody a Slack message and expected them to go build an entire system completely unsupervised and just make all the best decisions.
Another way of putting this — my wife, who is a very senior staff engineer at Meta, says: “If you cloned me, I would still code review my code. I would never let anyone submit things without looking at them.”
So personally, I think there will always be a place for a technical architect, a steward, somebody who’s guiding the quality of the code base. I think what that role is will change over time.
Right now it’s a lot of in-the-weeds, very specific decisions. It’s a lot of reviewing code, mentoring and coaching people up, and you tend to have one PM to five to ten engineers. I imagine we’ll get to a place where you invert that ratio — you have one technical steward whose job is to think about the overall system design, to be constantly reviewing agent code, to be reviewing things that people are building and understanding “oh, this is a consistent failure point; if we abstract this part out, if we build a component that agents can use, they’ll more reliably get better one-shot success.” And then you have five to ten more product, design, product-engineering people who are out exploring the frontier of your product space, with this one technical steward helping them land their code and keep things maintainable and improving over time.
When will we get to that point? That part I’m less certain of. It could be in two weeks, it could be in two years. I think it’s probably in the order of single-digit years. Certainly, I wouldn’t be surprised if a completely AI-native greenfield project starting within the next year could work in that model. But certainly for brownfield, I think it’ll be harder.