I built an AI skill for searching flights. It taught me more about context engineering than any client project.

Last week I asked AI to find me flights to Salt Lake City. It came back and told me only one airline — WestJet — served the return leg. No other options existed. I knew that was wrong. United flies that route through Denver. American connects through Dallas. Delta goes via Detroit. But until the system got it wrong, I'd never written any of that down as a rule.

That correction — "check United's hub connections before declaring a route unserved" — took ten seconds to say. It contained knowledge I'd built over twenty years of flying. And it was completely invisible to me until the system failed.

I wasn't working on anything related to my CAS practice. I was building a personal flight search skill — describing my Air Canada loyalty, my preference for direct flights, my home airport, my fare class rules. The kind of mundane, repeatable task that nobody thinks of as "AI work." But the process of building that skill taught me something about context engineering that I'd been circling around for months without fully grasping.

The first draft is never the skill

When I sat down to describe my flight preferences, the obvious stuff came quickly. Home airport: YYZ. Preferred airline: Air Canada. Fare class: Standard or Flex, never Basic. Direct flights preferred. Sort by price. That's the top-down context engineering — the deliberate documentation of rules you already know you follow.
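Written out, that first-draft layer is just a handful of declarative rules. Here's a minimal sketch in Python — the field names are my own invention for illustration, not any real skill-file format:

```python
# Top-down rules: the preferences I could articulate before the system ever ran.
# Structure and names are illustrative, not a real skill-file schema.
FLIGHT_PREFERENCES = {
    "home_airport": "YYZ",
    "preferred_airline": "Air Canada",
    "fare_classes": ["Standard", "Flex"],  # never Basic
    "prefer_direct": True,
    "sort_by": "price",
}
```

Everything in that dictionary was available on demand — I could have recited it to anyone who asked. The interesting rules are the ones that weren't.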

The skill worked. Technically. It searched for flights and returned results. But it made decisions I wouldn't have made, missed options I would have found, and presented information in ways that didn't match how I actually think about travel.

The real skill emerged from the corrections. "Return tickets aren't always cheaper than two one-ways within North America." "I won't take red-eye flights on domestic routes." "Don't depart before 9 AM unless there's a hard arrival deadline." "If the search tool only shows one airline on a route, don't trust it — it probably has gaps in its inventory." Each of these was a rule I follow every time I search for flights. I'd never articulated any of them because I'd never needed to. They lived in muscle memory, not in documentation.
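Once articulated, rules like these are concrete enough to express as logic. A sketch of two of them — the field names are hypothetical, chosen for readability rather than taken from any real tool:

```python
from datetime import time

# Correction-derived rules: the ones that only surfaced when the system got them wrong.
# Field names ("domestic", "red_eye", "departs") are assumptions for this sketch.
def passes_correction_rules(flight, hard_arrival_deadline=False):
    """Filter out flights I'd never book, per rules I'd never written down."""
    # No red-eye flights on domestic routes.
    if flight["domestic"] and flight["red_eye"]:
        return False
    # Don't depart before 9 AM unless there's a hard arrival deadline.
    if flight["departs"] < time(9, 0) and not hard_arrival_deadline:
        return False
    return True
```

The point isn't the code — it's that each ten-second correction compresses into a rule this small, and the rule holds for every future search.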

That's the encoding gap — the distance between what you know and what your system knows. And it only closes when the system gets it wrong.

The first roadblock isn't the last word

Here's a detail that matters. Claude by itself couldn't search for flights at all. It had no access to flight data. That could have been the end of the experiment. Instead, I found two flight search integrations — MCPs, the connectors that give AI access to external tools and data — and installed both in about thirty seconds each. The first one returned prices but not airline names. Useless for someone who needs to know which flights are Air Canada. The second one returned airline names but missed entire carriers on some routes. Neither tool alone gave complete results. Combined, they covered each other's blind spots.
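The combination is mechanical once you see it: run both tools and let each fill in the fields the other leaves blank. A toy sketch — both search functions are hypothetical stand-ins for the real MCP connectors, and the matching key is an assumption:

```python
# Two imperfect tools, merged so each covers the other's blind spots.
def search_tool_a(route):
    # Stand-in for the first MCP: returns prices but no airline names.
    return [{"route": route, "price": 189, "airline": None}]

def search_tool_b(route):
    # Stand-in for the second MCP: returns airline names but misses some carriers.
    return [{"route": route, "price": 189, "airline": "United"}]

def combined_search(route):
    """Merge both result sets, keeping whichever copy has a field filled in."""
    by_key = {}
    for result in search_tool_a(route) + search_tool_b(route):
        key = (result["route"], result["price"])
        merged = dict(by_key.get(key, {}))
        for field, value in result.items():
            if value is not None:  # never let a blank overwrite a known value
                merged[field] = value
        by_key[key] = merged
    return list(by_key.values())
```

Two incomplete inventories become one usable answer — which is exactly the "assemble the capability" move, just made explicit.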

This is the part most people miss. The first attempt at any AI workflow will hit a wall. The tool can't access the data. The output is incomplete. The format is wrong. That's not failure — that's the starting point. The practitioners who get results are the ones who treat the first roadblock as information, not as a stop sign. What's missing? What else could fill the gap? What happens if I combine two imperfect tools instead of waiting for one perfect one?

In your practice, the same pattern applies. Your AI can't log into Xero and reconcile directly. But it can work with exported data, cross-reference multiple sources, and flag exceptions you'd have missed scanning manually. The capability isn't always built in. Sometimes you have to assemble it — and the assembly takes minutes, not months.

Corrections are knowledge extraction, not error fixing

Here's the part that changed how I think about this. Every correction I made contained a rule. Not a preference — a rule. "United flies SLC to Toronto via Denver" is a routing rule. "$78 more for Star Alliance is worth it; $322 isn't" is a decision threshold. "Price alone isn't enough to break a preference — timing and Aeroplan earning potential matter more" is a priority hierarchy.
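A decision threshold like that reduces to a single comparison once extracted. In this sketch the $100 cap is my illustrative guess at a number between the two data points in the correction, not a figure from the actual skill:

```python
# A decision threshold extracted from a correction:
# "$78 more for Star Alliance is worth it; $322 isn't."
STAR_ALLIANCE_PREMIUM_CAP = 100  # assumption: the real cap sits somewhere between 78 and 322

def worth_paying_for_alliance(cheapest_price, alliance_price):
    """Is the Star Alliance option worth its premium over the cheapest fare?"""
    return (alliance_price - cheapest_price) <= STAR_ALLIANCE_PREMIUM_CAP
```

The threshold existed in my head long before it existed as a number — the correction is what forced it into a form a system can apply.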

These rules lived in my head. The system failing made me say them out loud. Capturing them made them permanent. The next time I search for flights, the skill won't make those mistakes again. It won't declare a route unserved without checking both tools. It won't compare the wrong price points. It won't ignore indirect United connections through Denver.

That's the compounding flywheel. Each correction makes the system permanently better — not just for that search, but for every search that follows. And the corrections are free. They take seconds. The knowledge they contain took years to build.

Now think about your practice. How many times a day does someone on your team correct an AI output without capturing the correction? "That report summary put the revenue discussion under expenses — this client's consulting income always needs its own section because it's seasonal." "That's the wrong contact for this entity — the wife handles the books, not the husband." "We don't bill that client monthly, we bill quarterly on a different cycle." Every correction is a rule. Every uncaptured rule is knowledge that evaporates when the chat window closes.

This works for any repeatable task — not just accounting

I'm making this point with flights deliberately. If context engineering only worked for month-end close or bank reconciliations, it would be a narrow technique worth learning for those specific workflows. But the same process — describe, test, correct, capture — works for booking flights, for client communications, for advisory prep, for any task with embedded preferences and judgment calls. That's what makes it a general-purpose capability.

The mechanism is always the same. You start by writing down the obvious rules — the ones you can articulate before the system runs. You test it against a real scenario. The system fails in ways that surface deeper knowledge — rules you didn't know you were following. You capture the corrections. The system improves permanently.

The flight skill went through three iterations before the live test. Each iteration caught something the previous one missed. But the live test — the real scenario with actual constraints — exposed failures that no amount of test prompts would have surfaced. The system needed to encounter "arrive by 5pm in Salt Lake City on a Tuesday in May" to reveal that arrival deadlines conflict with departure preferences, that search tools have airline coverage blind spots, and that the $189 United option through Denver is the one to compare — not the $433 one through the same hub on a different flight.

Your client work is the same. The test scenarios are useful. The real engagements are where the encoding gap closes.

The correction that disappears is the most expensive one

After the skill was built, we ran a retrospective — three questions, thirty seconds. What went wrong. What correction did I make. Have I captured it. That reflection surfaced two more rules the live test hadn't made explicit, and both went into the final version. I've written before about why telling AI it's wrong is the most valuable thing you'll do all week. This is the practical mechanism: corrections contain professional knowledge. If you don't capture them, they vanish into a chat log nobody will ever search. That's the real cost of the encoding gap. Not the time spent correcting. The knowledge lost by not capturing.

Start with something mundane

Here's what I'd suggest. Don't start with your most complex client workflow. Start with something mundane. Something you do often enough to notice the patterns but small enough that getting it wrong doesn't matter. Search for flights. Draft weekly status emails. Prepare meeting agendas. Process expense reports.

Build a skill for it. Run it. Correct it. Capture the corrections. Do this three times and you'll understand context engineering better than any article — including this one — can teach you. You'll feel the encoding gap close. You'll notice the moment when a correction forces you to articulate something you've done instinctively for years.

Then take that process to your client work. Because the encoding gap is the same whether you're routing flights through Denver or categorizing transactions through a chart of accounts. The knowledge is in your head. The system needs it in writing. And it only comes out when the system gets it wrong.

The question isn't whether your team has the knowledge. They do. The question is whether you're capturing it — or letting it evaporate, one correction at a time.

This same encoding process is what separates an Excel diagnostic from an interactive scoring tool. And it's what lets you build purpose-built agents that replace your paid software subscriptions. The pattern is the same whether you're routing flights or managing clients.
