Reading the auto-code rate: what's actually a good number, by client mix

The most-watched number on a typical auto-coding dashboard is the auto-code rate — the percentage of transactions coded without requiring a human touch. Practices celebrate when it ticks past 90%. They worry when it dips below 80. The headline number is satisfying to chase, but it can mislead in both directions if you read it without context.

Here's what the auto-code rate actually means, what the right number is for your particular client mix, and the three sub-metrics that matter more than the headline.

What the number actually measures

A modern auto-coding stack runs several layers in order — typically deterministic rules, a per-client machine-learning model, a large-language-model fallback, and manual review. The auto-code rate counts a transaction as "auto-coded" if one of the automated layers committed a code with confidence above the practice's threshold (commonly around 0.80–0.85). Manual codings — the ones the bookkeeper had to touch — don't count toward the rate.

This means the headline rate measures one thing: how much of the work the platform did. It doesn't measure how well it did the work. A practice with a 92% auto-code rate whose audit findings include 4% miscoded transactions is in worse shape than a practice with a 78% rate and zero miscoded transactions.

What a "good" rate looks like, by client mix

The mix of clients matters more than anything else. Here's the rough shape:

Mature retail / hospitality (high transaction volume, narrow vendor set)

Expected auto-code rate: 88–95%. Most transactions are repeat-vendor purchases that the rule engine handles cleanly. Anything below 88% probably means the rule library isn't being maintained. Anything above 95% might mean the auto-commit threshold is set too aggressively and the platform is committing low-confidence codings that need a review pass.

Mixed practice (some retail, some services, some sole traders)

Expected auto-code rate: 78–88%. The lower end is normal because sole-trader clients have fewer transactions per vendor (the rule engine has less to learn from) and the per-client ML takes longer to settle. Anything below 78% is worth investigating; the most common cause is rule rot — rules that were saved a year ago but no longer match because the vendor renamed.

Heavy services (consulting, design, construction)

Expected auto-code rate: 65–78%. Services businesses have project-based vendors that show up once and disappear, fewer recurring patterns, and more genuine ambiguity per transaction. A 70% rate here is excellent. Chasing the rate higher with aggressive rules usually backfires by miscoding the long tail.

New clients (first 90 days)

Expected auto-code rate: 40–65% in month one, 65–80% by month three. The per-client ML hasn't seen enough data yet and the rule library is still being built. If month-one is below 40%, the practice usually skipped the starter rule-pack import; if month-three is below 65%, the rules being saved aren't sticking (most often because they're too narrow — "AGL ENERGY MARCH PAYMENT" instead of "AGL").

The three sub-metrics that matter more

The auto-code rate is the headline; these are the metrics I actually look at when I'm reviewing a client's coding health.

1. Layer mix

A good dashboard will show you what percentage of auto-codings came from each layer — rules / ML / LLM. The healthiest practices we observe have a mix roughly in the order of two-thirds rules, one-fifth ML, one-eighth LLM, single-digit manual.

If rules are below half, the rule library is under-built — too much work is falling through to the more expensive layers (ML and LLM consume compute and time; rules are essentially free at inference). The fix is to mine the ML and LLM codings for patterns that recur and promote them to rules. Any auto-coding platform worth its salt offers a "rule suggestions" view; if your tool doesn't, you can build the equivalent by reviewing the top recurring non-rule codings each week.

If LLM is above 20%, the rule + ML layers aren't catching enough. This usually means rule rot (saved rules that don't match anymore) or a recent influx of new vendors that haven't had time to train the per-client ML.

2. Manual-touch percentage

This is the inverse of the auto-code rate — the percentage of transactions the human had to handle. The headline rate hides whether the manual touches were necessary (genuine ambiguity that needed human judgement) or avoidable (the model should have caught it).

A healthy practice has 3–8% manual touches. The lower end signals over-aggressive auto-commit; the upper end signals under-built rules. If manual is above 12%, there's almost always a rule library improvement waiting.

3. Per-client confidence distribution

One of the most useful diagnostic charts is a per-client confidence distribution — for each client, what does the spread of auto-commit confidences look like? A healthy distribution is heavily weighted above 0.90 with a long thin tail below.

If a particular client's distribution is flat across 0.70 to 0.95, the per-client ML hasn't learned that client well. Usually because the vendor mix is too varied or the historical coding was inconsistent (different bookkeepers coding the same vendor different ways, which trains the model on noise).

The fix for these clients is human: pick the 10 most common vendors, lock in a clean coding decision for each, and let the ML retrain. The next quarter's distribution sharpens dramatically.

The two failure modes

When the auto-code rate is "too high":

The auto-commit threshold is set too low (below 0.80), and the platform is committing codings that genuinely needed review
The rules library has overly broad pattern matches (e.g. matching on PAY and catching every payment-related descriptor)
Manual touches are too few — the bookkeeper has lost the habit of reviewing the queue and is rubber-stamping

When the rate is "too low":

Rule library hasn't been refreshed in months — vendors have changed names, new vendors have appeared
The auto-commit threshold is set too high (above 0.85) and confident codings are being held in the queue for no reason
Per-client ML hasn't had time to settle (this is normal for first-90-day clients)

What I tell new practice owners

Don't chase the headline number. Set a quarterly habit: pull the layer mix, pull the manual-touch rate, pick the bottom three clients by auto-code rate, spend 30 minutes looking at why they're low. That review takes an hour, fixes the actual problem (usually 3–5 rule additions), and improves the headline rate by 4–8% over the next month without any model changes.

The platform is doing the work either way — the auto-code rate just tells you how visible that work is. The metrics that matter are the ones that tell you whether the work is correct.