AI bookkeeping: how machine learning works in transaction coding

AI bank reconciliation software in Australia is becoming a standard part of practice workflows, but most explanations of how it works stay at the headline level: "the AI learns your coding decisions." That is true, but it skips over the mechanics that actually determine whether the tool will perform well for your practice — and how quickly.

This post is for bookkeepers and CA practice owners who want to understand what is happening under the hood: what the model is trained on, what signals it uses, how confidence scores work, and what a realistic improvement curve looks like over the first 6–12 months of use.

What ML-powered transaction coding actually is

Machine learning transaction coding is a classification task. Given a bank transaction — a description string, an amount, a date — the model predicts which account code and GST treatment a bookkeeper would assign to it.

The model is not rules-based. Rules say: "if the description contains X, code to Y." An ML model says: "based on every coding decision this practice has made for transactions that look like this, the most likely code is Y, and I am 91% confident." That distinction matters. Rules require manual setup and break when a vendor changes its bank description format. An ML model infers patterns and handles variation automatically.
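
To make the contrast concrete, here is a minimal Python sketch. It is not Reconlink's implementation: `rule_code`, `FrequencyCoder`, and the account code are hypothetical, and a simple token vote count stands in for a real classifier (its share is not a calibrated confidence score).

```python
import re
from collections import Counter, defaultdict
from typing import Optional, Tuple


def rule_code(description: str) -> Optional[str]:
    """Rules-based coding: a hand-written exact substring match."""
    if "STRIPE* ACME" in description.upper():
        return "4-1000 Sales"  # hypothetical account code
    return None


class FrequencyCoder:
    """Toy learned coder: each token votes for the accounts it was coded to."""

    def __init__(self) -> None:
        self.votes = defaultdict(Counter)  # token -> Counter of account codes

    @staticmethod
    def tokens(description: str) -> list:
        return re.findall(r"[a-z0-9]+", description.lower())

    def learn(self, description: str, account: str) -> None:
        for token in self.tokens(description):
            self.votes[token][account] += 1

    def predict(self, description: str) -> Optional[Tuple[str, float]]:
        tally = Counter()
        for token in self.tokens(description):
            if token in self.votes:
                tally.update(self.votes[token])
        if not tally:
            return None  # no token seen before
        account, count = tally.most_common(1)[0]
        return account, count / sum(tally.values())  # share of votes, not a probability


coder = FrequencyCoder()
for _ in range(3):
    coder.learn("STRIPE* ACME PTY LTD", "4-1000 Sales")

new_format = "ACME PTY LTD VIA STRIPE"  # vendor changed its bank description
print(rule_code(new_format))      # the rule no longer matches
print(coder.predict(new_format))  # the learned tokens still point to the same account
```

The rule depends on one exact string, while the learned coder only needs enough familiar tokens to survive the format change.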

What makes Reconlink's approach different from a generic AI tool is per-client model isolation. Each client in your practice gets its own model, trained only on that client's data. The model for a hospitality client learns from hospitality vendor patterns: Uber Eats supplier invoices, cleaning contractors, POS merchant settlements. The model for a construction client learns from materials suppliers, subcontractor ABN payments, and equipment hire transactions. The two models never share data or parameters. This is not just a privacy design decision. It is also why each model becomes accurate for its own client instead of averaging across irrelevant patterns from unrelated businesses.
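
The isolation idea can be sketched as a registry with one model object per client. This is an illustration only: `ClientModel`, the client IDs, and the account codes are hypothetical, and a vendor lookup stands in for the real model.

```python
from collections import Counter, defaultdict


class ClientModel:
    """Toy per-client model: remembers how this client's vendors were coded."""

    def __init__(self) -> None:
        self.history = defaultdict(Counter)  # vendor -> Counter of account codes

    def learn(self, vendor: str, account: str) -> None:
        self.history[vendor][account] += 1

    def predict(self, vendor: str):
        seen = self.history.get(vendor)
        return seen.most_common(1)[0][0] if seen else None


# One model per client ID. Nothing is shared between entries.
models = defaultdict(ClientModel)

models["cafe-01"].learn("UBER EATS", "4-1200 Delivery Sales")
models["builder-07"].learn("UBER EATS", "6-3100 Staff Meals")

print(models["cafe-01"].predict("UBER EATS"))     # 4-1200 Delivery Sales
print(models["builder-07"].predict("UBER EATS"))  # 6-3100 Staff Meals
```

The same vendor string resolves to a different account for each client, which a single pooled model would have to average over.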

What signals the model uses

When a new transaction arrives, the model evaluates several features:

1. Transaction description text. The description string is tokenised, that is, broken into component words and substrings. "STRIPE* ACME PTY LTD" becomes tokens including "stripe", "acme", "pty", "ltd". The model has learned, from this client's history, that transactions with "stripe" in the description are typically coded to a particular income or COGS account. Tokenisation means partial matches still register: a new merchant the model has not seen before may share tokens with known vendors and inherit partial signal.

2. Amount (magnitude and structure). The dollar amount carries two types of signal. Magnitude matters: a $4.50 transaction at a café is more likely to be a meal allowance than a $4,500 one. Structure also matters: round numbers (exactly $1,000.00, exactly $500.00) are statistically more likely to be transfers, drawings, or scheduled payments rather than purchase invoices. The model learns these patterns from the client's history.

3. Day of week and day of month. Payroll runs on the same day each fortnight. Rent debits on the first of the month. Subscription billings cluster on the date the software was first purchased. Temporal features let the model distinguish between a $2,200 transaction that arrived on the 15th (likely payroll) and the same amount arriving on a random Wednesday (more likely a supplier payment).

4. Merchant category code (where available). Bank statements from some institutions include an MCC, a standardised code for the type of merchant. A transaction tagged MCC 5812 (eating places and restaurants) is handled differently from one tagged MCC 7389 (business services not elsewhere classified). Not all statements carry MCCs, and their quality varies by institution, so the model uses them where present and falls back on text and amount features where they are absent.
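
The four signals above can be sketched as a feature-extraction step. This is a minimal illustration under stated assumptions: the `Features` fields, the log-scaled amount, and the "multiple of $100" test for round numbers are choices made for the example, not Reconlink's actual feature set.

```python
import math
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Features:
    tokens: List[str]      # words and numbers from the description
    log_amount: float      # magnitude on a log10 scale
    is_round: bool         # assumption: "round" means a whole multiple of $100
    day_of_week: int       # 0 = Monday ... 6 = Sunday
    day_of_month: int
    mcc: Optional[str]     # None when the statement carries no MCC


def extract_features(description: str, amount: float, txn_date: date,
                     mcc: Optional[str] = None) -> Features:
    cents = round(abs(amount) * 100)  # work in cents to avoid float remainders
    return Features(
        tokens=re.findall(r"[a-z0-9]+", description.lower()),
        log_amount=math.log10(abs(amount)) if amount else 0.0,
        is_round=cents > 0 and cents % 10_000 == 0,
        day_of_week=txn_date.weekday(),
        day_of_month=txn_date.day,
        mcc=mcc,
    )


features = extract_features("STRIPE* ACME PTY LTD", 1000.00, date(2024, 7, 15))
print(features.tokens)    # ['stripe', 'acme', 'pty', 'ltd']
print(features.is_round)  # True
```

Keeping `mcc` optional mirrors the fallback described above: when it is `None`, only the text, amount, and date features carry signal.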

What the model does not use: any data from other clients at your practice, any data from other practices using Reconlink, or any generic training dataset. The model starts from the client's own coding history and improves from there.

How confidence scores work

Every prediction the model makes comes with a probability between 0 and 1. A score of 0.91 means that in the training data, transactions with a similar feature profile were coded to the predicted account 91% of the time. The score is calibrated — it reflects real historical accuracy, not just relative certainty.

This calibration is important. A model that is confident but wrong (high score, incorrect code) is more dangerous than one that is uncertain and routes to human review. Reconlink's model training includes calibration steps to keep scores aligned with observed accuracy.
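
One common way to check calibration is to group past predictions by score and compare the average score in each group with the share that turned out correct. The sketch below assumes equal-width bins and a hypothetical `calibration_table` helper. It shows the check itself, not Reconlink's calibration procedure.

```python
from typing import List, Tuple


def calibration_table(scores: List[float], correct: List[bool],
                      n_bins: int = 5) -> List[Tuple[float, float, int]]:
    """Return (mean score, observed accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for score, ok in zip(scores, correct):
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[index].append((score, ok))

    table = []
    for members in bins:
        if members:
            mean_score = sum(s for s, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            table.append((round(mean_score, 3), round(accuracy, 3), len(members)))
    return table


# Illustrative data: ten predictions scored 0.9, nine of them correct.
print(calibration_table([0.9] * 10, [True] * 9 + [False]))  # [(0.9, 0.9, 10)]
```

In a well-calibrated model the first two numbers in each row stay close. A row such as (0.9, 0.6, 40) would be the "confident but wrong" case described above.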

Practices configure an auto-commit threshold: the score at or above which the model's suggestion is accepted without bookkeeper review. The typical range is 0.82–0.88. Setting the threshold at 0.85 means transactions the model is 85% or more confident about will be coded automatically; those below that level go to the review queue.

Choosing the right threshold involves a trade-off:

A lower threshold (e.g. 0.80) auto-codes more transactions but accepts more errors.
A higher threshold (e.g. 0.90) sends more to the queue but produces cleaner auto-codes.

Most practices start at 0.85 and adjust after reviewing a month of auto-coded transactions. If the auto-coded batch has a low correction rate, they lower the threshold to push more volume through automatically. If corrections are frequent, they raise it.
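
The monthly review described above amounts to measuring two numbers at a candidate threshold: how much volume is auto-coded, and how much of that volume needed correction. A minimal sketch follows, with a hypothetical `threshold_report` function and made-up sample data.

```python
from typing import List, Tuple


def threshold_report(predictions: List[Tuple[float, bool]],
                     threshold: float) -> Tuple[float, float]:
    """predictions holds (confidence, was_correct) pairs.

    Returns (auto-code rate, correction rate among auto-coded transactions).
    """
    auto = [ok for confidence, ok in predictions if confidence >= threshold]
    auto_rate = len(auto) / len(predictions) if predictions else 0.0
    correction_rate = auto.count(False) / len(auto) if auto else 0.0
    return auto_rate, correction_rate


# Illustrative month of predictions (not real data).
sample = [(0.95, True), (0.92, True), (0.88, True), (0.86, True),
          (0.83, False), (0.81, True), (0.70, False), (0.60, True)]

for threshold in (0.80, 0.85, 0.90):
    print(threshold, threshold_report(sample, threshold))
```

On this sample, 0.80 auto-codes six of eight transactions with one error, 0.85 auto-codes four with none, and 0.90 auto-codes only two, which is the trade-off in miniature.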

Training data requirements and the improvement curve

The model needs historical codings to train on. More history means better initial accuracy. Here is what to expect at different stages:

Month 1–2 (new client, no import): The model has limited data. Auto-code rate will be low, perhaps 30–40% of transactions, mostly the ones covered by rules. The model is learning, and every coding decision a bookkeeper makes becomes a training example.

Month 3–4: After the first BAS cycle, the model has seen a representative sample of the client's vendor mix. Auto-code rate typically climbs to 60–75%. Common vendors are now well-represented in the training data.

Month 6–12: With six or more months of history, the model has seen most recurring vendors through multiple occurrences. Auto-code rate in this range typically reaches 80–90% for clients with relatively stable transaction patterns. The model has also learned seasonal patterns such as EOFY expenses, Christmas trading, and quarterly insurance payments.

Warm-starting with historical data: If you are migrating a client from another system, importing 12 months of historical codings accelerates this curve significantly. Instead of starting from scratch, the model begins with a full year of vendor-to-account-code mapping. For practices moving from Xero or MYOB, a historical import means the model behaves more like a 6-month-old model from day one rather than a brand-new one.

The correction loop

When a bookkeeper overrides an ML suggestion, that correction does not disappear. It feeds back into the training data as a labelled example: "for this transaction profile, the correct code is X, not Y."

The model retrains periodically (not in real time, but frequently enough that corrections accumulate). Over multiple retraining cycles, a vendor the model kept getting wrong accumulates corrected examples. The model responds in two ways: it lowers the confidence score it assigns to similar transactions (routing more to human review) and, as correct examples accumulate, it recalibrates to the right code.

A vendor that was incorrectly coded three times in a row and then correctly coded eight times will eventually stabilise. The confidence score on that vendor's transactions rises, and auto-coding resumes. This self-correcting behaviour is why active use improves accuracy — the model is not static.
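
A rough feel for how quickly a vendor stabilises comes from a naive frequency estimate. Assume, purely for illustration, that the model's confidence for a vendor equals the share of its training examples carrying the right code, and that the three wrong codings stay in the history. The function names below are hypothetical, and a real model may weight recent examples differently.

```python
def share_of_correct(wrong_examples: int, correct_examples: int) -> float:
    """Naive confidence: fraction of this vendor's examples with the right code."""
    total = wrong_examples + correct_examples
    return correct_examples / total if total else 0.0


def corrections_needed(wrong_examples: int, threshold: float) -> int:
    """Smallest number of correct examples that lifts the share to the threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = 1
    while share_of_correct(wrong_examples, n) < threshold:
        n += 1
    return n


print(round(share_of_correct(3, 8), 3))  # 0.727
print(corrections_needed(3, 0.85))       # 17
```

Under this simplified estimate, eight correct codings lift the share to about 0.73, and the vendor crosses a 0.85 threshold after seventeen. The exact numbers depend on the real model, but the direction is the same: each correct example moves the vendor closer to auto-coding.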

The practical implication for practices is that the first two months require more manual review than later months. Treating early corrections as model investment — not as evidence that AI does not work — leads to a better outcome by month six.

Where ML ends and the LLM begins

ML handles the pattern-recognition task well for established vendors and recurring transactions. It does not handle novelty well.

When a transaction arrives from a vendor the model has never seen, with a description that shares few tokens with known vendors, and an amount that does not clearly signal a transaction type — the confidence score drops below threshold. The transaction does not auto-code. Instead, it routes to the LLM fallback layer.

The LLM reads the full transaction description, considers the client's chart of accounts, and reasons about the most likely coding. Unlike the ML model, the LLM can handle genuinely ambiguous transactions — a payment that could plausibly be a subcontractor invoice, an asset purchase, or a prepaid expense, depending on context it can partially infer from the description.

The LLM output is a suggestion, not an auto-commit. These transactions always land in the human review queue with the LLM's reasoning visible. A bookkeeper confirms, overrides, or splits. The confirmed decision then becomes a training example for the ML model — so if the same vendor recurs, the ML model eventually handles it directly without routing to the LLM.
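
The routing described in this section can be sketched as a single decision function. The names (`Decision`, `route`, `llm_suggest`) are hypothetical, the LLM is represented by a plain callable rather than any real API, and treating a rule match as an auto-commit is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Decision:
    account: Optional[str]
    source: str        # "rule", "ml", "llm" or "manual"
    auto_commit: bool  # False means the transaction goes to the review queue


def route(rule_hit: Optional[str],
          ml_prediction: Optional[Tuple[str, float]],
          threshold: float,
          llm_suggest: Callable[[], Optional[str]]) -> Decision:
    if rule_hit is not None:
        return Decision(rule_hit, "rule", True)  # assumption: rules auto-commit
    if ml_prediction is not None:
        account, confidence = ml_prediction
        if confidence >= threshold:
            return Decision(account, "ml", True)
    suggestion = llm_suggest()  # fallback for novel or low-confidence cases
    if suggestion is not None:
        return Decision(suggestion, "llm", False)  # a suggestion only, never auto-committed
    return Decision(None, "manual", False)


print(route(None, ("6-2000 Subcontractors", 0.62), 0.85,
            lambda: "1-5000 Plant and Equipment"))
```

Whatever the bookkeeper confirms for an "llm" or "manual" decision would then be stored as a labelled example for the next retraining cycle.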

The three-layer system is covered in detail here if you want to understand how rules, ML, and LLM interact at a system level.

What this means for Australian practices

The practical outcome of per-client ML is that the model becomes measurably better over a 6–12 month period. This is not a marketing claim — it is observable in the auto-code rate metric, which Reconlink tracks per client per month.

A well-functioning implementation shows quarter-over-quarter improvement in auto-code rate until it plateaus (usually around 85–92% for typical SME clients). If the rate is flat or declining, it usually indicates one of three things: a change in the client's vendor mix, a threshold set too conservatively, or corrections not being applied consistently.

Tracking auto-code rate monthly gives practices an objective signal about whether the ML is calibrating correctly — and whether the practice is investing in the correction loop that makes the model improve.
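
If you export transaction data and want to reproduce the metric yourself, auto-code rate is simply auto-coded transactions divided by total transactions for each client and month. The sketch below assumes a hypothetical export shape of (client ID, month, auto-coded flag) and made-up sample rows.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def auto_code_rate(transactions: Iterable[Tuple[str, str, bool]]
                   ) -> Dict[Tuple[str, str], float]:
    """transactions holds (client_id, 'YYYY-MM', was_auto_coded) rows."""
    counts = defaultdict(lambda: [0, 0])  # (client, month) -> [auto-coded, total]
    for client_id, month, auto_coded in transactions:
        bucket = counts[(client_id, month)]
        bucket[0] += int(auto_coded)
        bucket[1] += 1
    return {key: auto / total for key, (auto, total) in counts.items()}


# Illustrative rows, not real data.
rows = [
    ("cafe-01", "2024-07", True), ("cafe-01", "2024-07", False),
    ("cafe-01", "2024-08", True), ("cafe-01", "2024-08", True),
    ("cafe-01", "2024-08", True), ("cafe-01", "2024-08", False),
]
print(auto_code_rate(rows))
# {('cafe-01', '2024-07'): 0.5, ('cafe-01', '2024-08'): 0.75}
```

Comparing the monthly values for one client shows whether the rate is still climbing, has plateaued, or is declining.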

For practices concerned about what happens when the model does get something wrong, this post covers the error-handling mechanics and audit trail in detail.

FAQ

Does Reconlink's ML use data from other practices? No. Each client is modelled in isolation. No coding data from one client is used to train the model for another client, whether that client belongs to the same practice or a different one. The only training data for a client's model is that client's own history.

How long before the ML model is useful on a new client? It varies. If you import 12 months of historical codings during onboarding, the model is useful from the first reconciliation cycle. Starting from scratch, expect meaningful auto-coding after 2–3 months of regular use. The first BAS cycle will always have a lower auto-code rate than subsequent ones.

What happens to a transaction the model has never seen before? If confidence is below threshold, the transaction routes to the human review queue — either flagged for manual coding or sent to the LLM fallback layer for an assisted suggestion. It is never auto-coded below threshold, regardless of what the model thinks.

Can we adjust the auto-commit threshold per client? Yes. Practices set a default threshold that applies across clients, and can override it per client. A client with complex or high-risk coding requirements (trust accounts, complex GST treatment, etc.) can be set to a higher threshold so more transactions land in the review queue.
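
Conceptually, this is a default value with a per-client override. The sketch below uses hypothetical names and values to show the lookup. It does not reflect Reconlink's actual settings format.

```python
DEFAULT_THRESHOLD = 0.85  # practice-wide default (illustrative value)

# Hypothetical per-client overrides for higher-risk clients.
CLIENT_THRESHOLDS = {
    "trust-account-client": 0.92,
}


def threshold_for(client_id: str) -> float:
    """Use the client's override when one exists, otherwise the practice default."""
    return CLIENT_THRESHOLDS.get(client_id, DEFAULT_THRESHOLD)


print(threshold_for("trust-account-client"))  # 0.92
print(threshold_for("cafe-01"))               # 0.85
```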

What GST treatment does the model learn? The model learns GST codes as part of the coding decision — in the same way it learns account codes. A transaction coded to 6-1050 Electricity with 11 (GST on expenses) trains the model to suggest both the account and the GST treatment together. Practices that apply consistent GST coding get consistent ML output; practices with inconsistent historical coding get lower confidence scores until the correct pattern is established.

The bottom line

ML-powered transaction coding in AI bank reconciliation software for Australian practices is not magic — it is a classification model trained on your clients' own data, improving as it accumulates correct examples. The mechanism is learnable, the performance is measurable, and the improvement curve is predictable if you understand what drives it.

If you want to see how this works in a live practice environment, explore Reconlink's features or book a demo to walk through the ML coding layer with your own client data.
