Sanctions screening is one of the oldest unsolved problems in financial compliance. Lists are messy, names are translated badly, and the cost of a missed hit is a regulator letter that ruins everyone’s quarter. The cost of a false positive, meanwhile, is an analyst burning their afternoon on a customer who shares two letters with someone on a UN list. Both errors are expensive. Most teams pick a side and live with the consequences.
We think LLMs change the cost curve. Not because the model magically gets it right, but because retrieval-augmented generation lets us turn a fuzzy-string problem into a structured-reasoning problem. Here’s how we’ve been building it.
The pipeline
At the top of the stack we run an embedding-based retriever over the consolidated sanctions corpus (OFAC SDN, UN, EU, FINTRAC). We index not only names but also aliases, transliterations, dates of birth, nationalities, addresses, and known associates. For each new transaction or onboarding event we generate a query embedding and pull the top-K candidates in under 30ms with Vertex AI Matching Engine.
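To make the shape of that step concrete, here is a minimal sketch of the retrieval call, assuming the google-cloud-aiplatform SDK. The endpoint name, deployed index ID, and embedding model version are placeholders, and the query-building helper is ours rather than anything the SDK provides.

```python
# Sketch of the retrieval step. Assumes the google-cloud-aiplatform SDK;
# endpoint name, deployed index ID, and embedding model version are placeholders.
from vertexai.language_models import TextEmbeddingModel
from google.cloud import aiplatform

embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/PROJECT/locations/REGION/indexEndpoints/ENDPOINT_ID"
)

def screen_candidates(party: dict, k: int = 15) -> list:
    # Flatten the structured onboarding/transaction fields into one query string.
    query_text = " | ".join(
        str(party.get(field, ""))
        for field in ("name", "aliases", "dob", "nationality", "address")
    )
    query_vector = embedding_model.get_embeddings([query_text])[0].values

    # Pull the top-K nearest sanctions-list entries; K stays permissive (see below).
    neighbors = index_endpoint.find_neighbors(
        deployed_index_id="sanctions_corpus_v1",
        queries=[query_vector],
        num_neighbors=k,
    )
    # Each neighbour carries the list entry's stable ID, which we resolve back
    # to the source record (OFAC SDN, UN, EU, ...) before the reasoning step.
    return neighbors[0]
```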
That candidate set is then handed to Gemini Pro for structured reasoning. We don’t ask “is this a hit?”; we ask the model to produce a verdict object that conforms to a JSON schema we control: match strength, contributing fields, contradicting fields, residual risk, and a recommended action with citations. The schema is enforced with function-calling so we never have to parse free-form text.
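A sketch of what that enforcement looks like with the vertexai SDK is below. The field names mirror the schema described above; the function name and model ID are placeholders, not our production values.

```python
# Sketch of enforcing the verdict schema via function calling.
# Assumes the vertexai SDK; function name and model ID are placeholders.
from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Tool

record_verdict = FunctionDeclaration(
    name="record_screening_verdict",
    description="Record a structured sanctions-screening verdict.",
    parameters={
        "type": "object",
        "properties": {
            "match_strength": {"type": "string",
                               "enum": ["none", "weak", "possible", "strong"]},
            "contributing_fields": {"type": "array", "items": {"type": "string"}},
            "contradicting_fields": {"type": "array", "items": {"type": "string"}},
            "residual_risk": {"type": "string"},
            "recommended_action": {"type": "string",
                                   "enum": ["clear", "review", "escalate"]},
            "citations": {"type": "array", "items": {"type": "string"},
                          "description": "URIs of the list entries relied on."},
        },
        "required": ["match_strength", "recommended_action", "citations"],
    },
)

model = GenerativeModel(
    "gemini-1.5-pro",
    tools=[Tool(function_declarations=[record_verdict])],
)

def reason_over_candidates(party: dict, candidates: list) -> dict:
    prompt = (
        "Compare the party to each candidate list entry. Call "
        "record_screening_verdict exactly once with your verdict.\n\n"
        f"Party: {party}\nCandidates: {candidates}"
    )
    response = model.generate_content(prompt)
    # The function-call arguments are the verdict object; no free-form parsing.
    return dict(response.candidates[0].content.parts[0].function_call.args)
```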
Why retrieval matters
A bare LLM hallucinates fluently. Give it a name and ask it to reason, and it’ll happily confabulate a biography. Anchoring the reasoning in retrieved candidates—each with a stable URI back to the source list—makes the model’s output checkable. Every line in the verdict can be tied back to a specific list entry, and the audit trail in BigQuery preserves both the retrieval and the reasoning step.
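The audit write itself is unremarkable, which is the point. A sketch, assuming the google-cloud-bigquery client; the dataset, table, and row layout here are illustrative rather than our exact schema.

```python
# Sketch of persisting both pipeline steps to an audit table.
# Assumes google-cloud-bigquery; table name and row layout are illustrative.
import json
import uuid
from datetime import datetime, timezone
from google.cloud import bigquery

bq = bigquery.Client()

def write_audit_record(party_id: str, candidates: list, verdict: dict) -> None:
    row = {
        "event_id": str(uuid.uuid4()),
        "party_id": party_id,
        "screened_at": datetime.now(timezone.utc).isoformat(),
        # The retrieval step: every candidate with its stable list-entry URI.
        "candidates": json.dumps(candidates),
        # The reasoning step: the structured verdict, citations included.
        "verdict": json.dumps(verdict),
    }
    errors = bq.insert_rows_json("compliance.screening_audit", [row])
    if errors:
        raise RuntimeError(f"audit write failed: {errors}")
```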
We also keep retrieval intentionally permissive. We’d rather pull 15 weak candidates and let the model triage than rely on a brittle first-pass cutoff. Vertex AI’s Matching Engine is fast enough that the extra recall doesn’t bite us, and the LLM’s contradiction signal turns out to be far more reliable than a hand-tuned threshold.
Evals, not vibes
Every change to a prompt, a retriever weight, or a model version goes through an offline eval before it touches production. We maintain a few thousand human-labelled cases drawn from real-world ambiguity: the same name on different lists, near-misses on date of birth, spelling variants between Cyrillic and Latin scripts. The suite runs on Vertex AI Eval and emits a scorecard with precision, recall, and per-category drift.
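The scorecard arithmetic is simple enough to show in full. A sketch of the computation over the labelled cases, independent of the eval harness; the case shape and category names are illustrative.

```python
# Sketch of the scorecard computation over labelled eval cases.
# A "case" is (category, human_label, model_label); labels are True for "hit".
from collections import defaultdict

def scorecard(cases: list) -> dict:
    per_category = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for category, truth, predicted in cases:
        counts = per_category[category]
        if truth and predicted:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1

    def with_metrics(c):
        precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else None
        recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else None
        return {"precision": precision, "recall": recall, **c}

    return {category: with_metrics(c) for category, c in per_category.items()}

def recall_drift(current: dict, baseline: dict) -> dict:
    # Per-category change in recall versus the last accepted scorecard:
    # the number we watch is missed hits, not overall accuracy.
    return {
        cat: (current[cat]["recall"] or 0.0)
             - (baseline.get(cat, {}).get("recall") or 0.0)
        for cat in current
    }
```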
The scorecard is the artefact we trust. Anyone on the team can look at a proposed change and see what it costs in false negatives versus what it gains in analyst time. Without that artefact, “the new prompt feels better” quietly becomes a regulatory finding two months later.
The case-builder agent
The reasoning step is where most teams stop. We’ve found that the bigger lift comes from what happens next. When the LLM emits a verdict that warrants human review, we hand the candidate set, verdict, and source documents to a case-builder agent. It assembles a case file: an annotated narrative, a checklist of additional evidence the analyst should pull, and a suggested decision with reasoning.
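For a sense of what the agent actually hands over, here is a sketch of the case file as a data structure. The fields follow the description above; the names are ours and purely illustrative.

```python
# Sketch of the case file handed to the analyst. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CaseFile:
    party_id: str
    verdict: dict                  # structured verdict from the reasoning step
    candidates: list               # retrieved list entries with source URIs
    narrative: str                 # annotated narrative from the case-builder agent
    evidence_checklist: list       # additional evidence the analyst should pull
    suggested_decision: str        # e.g. "clear" or "escalate", with reasoning attached
    analyst_edits: list = field(default_factory=list)
    final_disposition: Optional[str] = None
```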
The analyst opens the file and either signs it off or edits it. Either way the trace is preserved: the original retrieval, the reasoning, the case, the analyst’s edits, and the final disposition. That trace is what regulators ask for, and what makes the system defensible six months later when nobody remembers the call.
What we got wrong
Our first version put the LLM in charge of retrieval too. It would decide, in natural language, which fields to query. The result was hard to debug and impossible to evaluate: small changes in phrasing produced wildly different recall. Splitting retrieval (deterministic embedding search) from reasoning (LLM with structured output) was the single biggest improvement we made.
Second mistake: chasing precision in the LLM step instead of recall in retrieval. Once we accepted that the cheap step is retrieval and the expensive step is human time, we stopped trying to make the LLM never say “maybe”. False-positive triage is what the case agent is for.
What’s next
We’re experimenting with longer-context Gemini for cases that depend on policy text—loading the entire AML manual into the prompt and asking the model to reconcile a transaction against specific clauses. Early results are encouraging on policies under 50K tokens; we’ll write up the approach when the eval is solid.
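The experiment itself is little more than prompt assembly at this stage. A minimal sketch, assuming the vertexai SDK; the model ID, manual path, and prompt wording are placeholders, not the version we’ll eventually write up.

```python
# Minimal sketch of the long-context experiment: the whole policy manual in the prompt.
# Assumes the vertexai SDK; model ID and manual path are placeholders.
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-1.5-pro")

def reconcile_against_policy(transaction: dict, manual_path: str = "aml_manual.md") -> str:
    with open(manual_path, encoding="utf-8") as f:
        manual = f.read()  # policies under ~50K tokens for now
    prompt = (
        "Below is our AML policy manual, followed by a transaction. "
        "Identify the specific clauses that apply and state whether the "
        "transaction is consistent with each, citing clause numbers.\n\n"
        f"--- MANUAL ---\n{manual}\n\n--- TRANSACTION ---\n{transaction}"
    )
    return model.generate_content(prompt).text
```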