If your AI vendor's model degraded on Tuesday, would your firm know by Friday? Most CAS practices can't answer that. Most haven't asked.
This week the question stopped being theoretical. Anthropic — the vendor most CAS firms now run on — published a post-mortem on output quality confirming degraded behavior across Claude Cowork, Claude Code, and the Claude Agent SDK between March 4 and April 16. Six weeks. Three overlapping problems: a default reasoning level dropped without notice, a caching bug that wiped session state every turn, and a system prompt that capped responses mid-sentence. Users noticed and reported. Internal monitoring did not. By the time the vendor confirmed the problems, AMD's AI director had already called the product "dumber, lazier" in public.
In Monday's roundup I framed this as part of a bigger pattern — single-vendor strategy isn't a safe harbor anymore. This piece is the operational answer. Three tests, one afternoon to set up.
If your firm builds anything important on top of that vendor's model — and most CAS firms now do — that's the news that matters. Not the post-mortem itself. The fact that the customers caught it first.
This isn't an Anthropic problem. It's the new operating environment.
Every major AI vendor will have a week like this. Models degrade silently. Plans get rewritten. Features ship the same day quality breaks. The vendor's PR will always tell you everything is fine. Your firm has built workflows, deliverables, and client expectations on a category of tool that changes underneath you, and you have no instrument that would tell you when it does.
You need three.
Your model changed on Tuesday and your firm doesn't know it
Pick one workflow your firm runs repeatedly with AI. Close commentary. Meeting prep. MD&A first draft. Client comms. Then, one time, ask your AI to write a scoring rubric for what good output looks like: clarity, accuracy of figures, tone match, completeness. Save the rubric. Save a known-good output the partner has approved.
Then once a week, run the workflow against a stable input and have a different LLM score the new output against the rubric. The cross-LLM step is the trust mechanism. If you score the model with itself and the model has degraded, the scoring will degrade with it. Self-checking is self-defeating.
The point isn't to chase exact scores. Outputs vary run to run, and small variations are normal. The point is to catch a sizable drop. When the score moves from a steady 8 to a steady 6, something changed in the model before the vendor announced it.
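If you want to see what the weekly check looks like in practice, here is a minimal sketch, assuming a short Python script, the Anthropic and OpenAI SDKs as stand-ins for whichever two vendors your firm actually uses, and placeholder model names, file paths, and thresholds you would swap for your own:

```python
import json
import re
from datetime import date

from anthropic import Anthropic  # vendor under test (assumption: Anthropic Python SDK)
from openai import OpenAI        # cross-LLM scorer (assumption: OpenAI Python SDK)

RUBRIC = open("rubric.md").read()              # partner-approved rubric, saved once
STABLE_INPUT = open("stable_input.md").read()  # the same input every week
BASELINE_SCORE = 8                             # score the known-good output earned
ALERT_DROP = 2                                 # flag anything 2+ points below baseline


def run_workflow(prompt: str, model: str = "claude-production-model") -> str:
    """Run the production workflow once on the model under test."""
    msg = Anthropic().messages.create(
        model=model,  # placeholder name: pin this to your approved production model
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


def score_with_other_llm(output: str) -> int:
    """Have a different vendor's model score the output against the saved rubric."""
    resp = OpenAI().chat.completions.create(
        model="scoring-model",  # placeholder: any model from a *different* vendor
        messages=[{
            "role": "user",
            "content": (
                "Score this output 1-10 against the rubric.\n\n"
                f"RUBRIC:\n{RUBRIC}\n\nOUTPUT:\n{output}\n\n"
                "Reply with the number only."
            ),
        }],
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


if __name__ == "__main__":
    score = score_with_other_llm(run_workflow(STABLE_INPUT))
    print(json.dumps({"date": str(date.today()), "score": score}))
    if score <= BASELINE_SCORE - ALERT_DROP:
        print("DRIFT ALERT: score dropped vs. baseline; review before the next client deliverable.")
```

Append each week's JSON line to a log and the trend is your drift chart. The absolute number matters less than a sustained drop.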
Cost: an afternoon to set up. Five minutes a week to run.
The same instrument tests the next vendor
The rubric you built for the drift test does double duty. New models will land monthly now. Sometimes from your own vendor. Sometimes from competitors. When OpenAI shipped GPT-5.5 this week, the firms that already had a rubric ran it through and had a comparative answer in an afternoon. The firms that didn't will wait for marketing claims to settle, ask their network for opinions, and rebuild their evaluation from scratch each time.
That's the difference between vendor optionality as a capability and vendor optionality as a wish.
Most CAS firms today have the wish but not the capability, and that gap isn't a failure of judgment. It's the absence of the right tool. Build the rubric once and you can evaluate any new model the day it ships. Skip it and you're locked into whatever vendor you committed to last year, regardless of what the market does next.
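Because the scorer doesn't care which vendor produced the output, the drift script above doubles as the new-model evaluation. A sketch, reusing the same functions, with both model names as placeholders:

```python
# Head-to-head: score the incumbent and a candidate against the same saved rubric.
# Reuses run_workflow() and score_with_other_llm() from the drift-test sketch;
# keep the scoring model from a vendor other than the one being evaluated.
for model_name in ["claude-production-model", "candidate-new-model"]:
    output = run_workflow(STABLE_INPUT, model=model_name)
    print(f"{model_name}: {score_with_other_llm(output)}/10")
```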
The partner signs the file. The model signs nothing.
Pick the most recent client deliverable AI touched. Three questions, in order. Whose name is on the file? Which model produced the AI portion? Was that the same model the partner approved when the workflow was set up?
In most firms today, you can answer the first question and not the other two. That's the gap. The fix isn't elaborate. It's a one-line note in the workpaper or workflow log: "AI assistance: Cowork on Opus 4.7, prompt v3, run April 25, 2026." That version trail is the difference between defensible AI use and "we used AI somewhere."
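If the workflow already runs through a script, that line can write itself. A trivial sketch, with every value a placeholder for whatever your stack actually records:

```python
from datetime import date


def version_trail_line(tool: str, model: str, prompt_version: str) -> str:
    """One-line audit note for the workpaper or workflow log."""
    return (f"AI assistance: {tool} on {model}, "
            f"prompt {prompt_version}, run {date.today():%B %d, %Y}")


# Example with placeholder values; the date is whatever day the script runs.
print(version_trail_line("Cowork", "Opus 4.7", "v3"))
```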
When the model changes, your sign-off doesn't. The accountability stays with the partner. If you can't reconstruct what the model was when the work was done, you've taken on a risk you didn't price into the engagement.
The week-one move
This isn't a six-figure governance program. None of these tests cost more than an afternoon to set up. The drift test runs in five minutes a week, the rubric doubles as your evaluation tool for any vendor that ships next, and the version trail is one line in a workpaper. The firms that build this in April spend the next year refining it. The firms that don't will spend the next year hoping the vendor doesn't have another bad week — and a vendor that just confirmed six weeks of silent degradation isn't the one to extend that hope to.
I've put the prompts I use for all three tests into a free Vendor Test Pack — the rubric-generation prompt, the cross-LLM scoring prompt, the version-trail line template, and a worked example for one CAS workflow. Take what's useful and adapt the rest to your practice. Grab it at theaiaccountant.ai/vendor-test-pack.

