
Reducing False Positives in AML Screening: How AI and ML Actually Help

Industry research consistently puts the false-positive rate of legacy AML screening systems at over 90%. The cost is enormous — analyst time spent investigating noise is analyst time not spent investigating real financial crime. This guide explains what causes false positives, how modern systems reduce them, and how to deploy AI and machine learning without compromising regulatory defensibility.

Published: May 2026 | Category: RegTech | Read time: ~9 minutes

Quick Answer

False positives are AML alerts that, on investigation, do not correspond to suspicious activity. They occur in every screening system because the matching algorithms are tuned to over-detect rather than under-detect: missing a genuine match has regulatory consequences that raising a false positive does not. Reducing false positives without sacrificing true-positive detection requires a layered approach: scenario tuning based on the firm's actual data, secondary-identifier matching that uses additional data points beyond name, contextual scoring that incorporates customer risk and history, and supervised machine-learning models trained on the firm's historical disposition data. Properly implemented, these techniques typically cut false-positive volume by 60–85% while maintaining or improving true-positive detection rates, which translates directly into more analyst time spent on genuine investigations.

Every compliance team has felt this. The alert queue grows faster than the analyst headcount. Senior analysts spend their days clearing alerts that any first-look review confirms as false positives. Genuinely suspicious activity gets less attention than it should because the queue keeps moving. Some firms hire offshore alert-clearance teams running into hundreds of analysts to keep the queue manageable. This is not how regulators want AML to work.

The good news is that false positives are now a tractable problem. The technology to reduce them dramatically — without reducing true-positive detection — is mature, deployed in production at major banks, and increasingly the default in modern compliance platforms. The challenge is not whether the techniques work; the challenge is operationalising them in a way that holds up to regulatory scrutiny.

Why False Positives Happen

False positives arise from three structural features of AML screening, each of which operates correctly in isolation but produces noise in combination.

The three structural drivers of false positives:

Fuzzy matching: sanctions and PEP screening must match names despite spelling variations, transliteration differences, and aliases. The algorithm necessarily generates non-exact matches that humans must review. Raise the match threshold (that is, require closer matches) and false positives drop, but true positives are also missed.
Common names: many sanctioned individuals have names that are common in their countries of origin. Hundreds of thousands of customers can share a name with a sanctioned individual, and every one of them will match unless additional data differentiates them.
Conservative scenario design: transaction monitoring scenarios are typically tuned to over-detect because the cost of missing a genuine red flag is regulatory enforcement, while the cost of over-alerting is internal operational pain. The incentives push toward higher false-positive rates.

These drivers are not bugs. A system tuned to produce zero false positives would also miss true positives. The objective is not zero false positives but the right balance, and that balance is set by data, not by intuition.
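The threshold trade-off described above can be made concrete with a small sketch. It uses Python's standard-library `difflib.SequenceMatcher` as a stand-in for a production name-matching algorithm (real screening engines use more specialised phonetic and transliteration-aware matchers), and the listed name, customer names, and ground-truth flags are all invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical list entry and customer names, invented for illustration.
LISTED_NAME = "Mohammed Al-Rashid"
CUSTOMERS = [
    ("Mohamed Al Rashid", True),     # same person, transliteration variant
    ("Mohammed Alrashid", True),     # same person, spacing variant
    ("Mohammed Al-Rashidi", False),  # different person, near-identical name
    ("Mahmoud Al-Rashad", False),    # different person
    ("Maria Alvarez", False),        # unrelated customer
]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two case-folded names."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def screen(threshold: float) -> dict:
    """Count alert outcomes when every name scoring >= threshold alerts."""
    counts = {"true_positives": 0, "false_positives": 0, "missed": 0}
    for name, is_listed_person in CUSTOMERS:
        alerted = similarity(name, LISTED_NAME) >= threshold
        if alerted and is_listed_person:
            counts["true_positives"] += 1
        elif alerted:
            counts["false_positives"] += 1
        elif is_listed_person:
            counts["missed"] += 1
    return counts


if __name__ == "__main__":
    # Sweep from permissive to strict and watch both alert types shrink.
    for threshold in (0.60, 0.75, 0.90, 0.97):
        print(f"threshold={threshold:.2f} -> {screen(threshold)}")
```

Sweeping the threshold upward can only remove alerts, never add them, so any gain in false-positive reduction has to be weighed against the genuine matches that fall out at the same time.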

Layer 1: Scenario and Threshold Tuning

The first layer of false-positive reduction is also the most under-used: tuning scenario parameters and matching thresholds against the firm's actual data. Most legacy systems are configured at deployment with vendor-default thresholds and never tuned again. This is problematic from a supervisory standpoint and operationally wasteful.

Tuning starts with disposition data — the historical record of how alerts were resolved. For each scenario, the firm calculates the precision (proportion of alerts confirmed as true positives), recall (estimated proportion of true positives actually detected), and the threshold sensitivity curve. From this, threshold values can be set to match the firm's risk appetite, supported by data rather than guesswork.
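As a minimal sketch of that calculation, the code below computes precision and recall at a candidate threshold and sweeps several thresholds to produce a sensitivity curve. The `Disposition` record, its field names, and the sample data are assumptions for illustration. Note that recall can only be estimated against known true positives, so the history needs to include sampled items scored below the current production threshold (often called below-the-line testing).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disposition:
    """One historical item: its score and the analyst's final outcome."""
    score: float      # scenario or match score recorded at alert time
    confirmed: bool   # True if the analyst confirmed a true positive


def precision_recall(history, threshold):
    """Return (precision, recall) if only scores >= threshold had alerted."""
    alerted = [d for d in history if d.score >= threshold]
    true_hits = sum(d.confirmed for d in alerted)
    total_true = sum(d.confirmed for d in history)
    precision = true_hits / len(alerted) if alerted else 0.0
    recall = true_hits / total_true if total_true else 0.0
    return precision, recall


def sensitivity_curve(history, thresholds):
    """Return (threshold, precision, recall) rows for each candidate."""
    return [(t, *precision_recall(history, t)) for t in thresholds]


# Invented sample data; real tuning would use the firm's disposition records.
HISTORY = [
    Disposition(0.95, True), Disposition(0.90, False),
    Disposition(0.80, True), Disposition(0.70, False),
    Disposition(0.60, False), Disposition(0.55, True),
    Disposition(0.40, False), Disposition(0.30, False),
]

if __name__ == "__main__":
    for t, p, r in sensitivity_curve(HISTORY, [0.50, 0.75, 0.92]):
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Reading the rows side by side shows what each threshold costs: the firm picks the point on the curve that matches its documented risk appetite and records the data behind the choice.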

Scenario tuning is now an explicit regulatory expectation in most major jurisdictions. MAS Information Papers, FCA Dear CEO letters, and FinCEN guidance all reference scenario tuning as a normal feature of an effective programme. A programme that has run for years without tuning is presumptively under-supervised.

Layer 2: Secondary-Identifier Matching

Most legacy systems match primarily on name. A customer named Mohamed Ali matches every sanctioned Mohamed Ali on the SDN list — and there are many. But the customer's date of birth, nationality, gender, and address are typically known at onboarding. Using those secondary identifiers to differentiate between identically-named individuals dramatically reduces false positives.

Effective secondary-identifier matching requires: structured data capture at onboarding (so the secondary identifiers are usable), structured data on the sanctions list side (most major lists provide this — OFAC SDN includes DOB, nationality, address where available), and matching logic that knows how to weight secondary identifiers correctly (a DOB match is strongly differentiating; a nationality match alone is less so).
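One way to picture the weighting logic is the sketch below, which adjusts a name-match score up or down depending on whether secondary identifiers agree. The `Party` structure, the additive scheme, and the weight values are assumptions chosen only to reflect the point that a DOB comparison differentiates more strongly than a nationality comparison; a real programme would calibrate these against its own data and decide by policy whether a mismatch suppresses or merely downgrades an alert.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Party:
    """A customer or list entry with optional secondary identifiers."""
    name: str
    dob: Optional[date] = None
    nationality: Optional[str] = None


# Hypothetical weights: DOB differentiates strongly, nationality weakly.
WEIGHTS = {"dob": 0.6, "nationality": 0.2}


def secondary_adjustment(customer: Party, listed: Party) -> float:
    """Positive when identifiers agree, negative when they conflict."""
    adjustment = 0.0
    for field, weight in WEIGHTS.items():
        c_value, l_value = getattr(customer, field), getattr(listed, field)
        if c_value is None or l_value is None:
            continue  # missing data neither supports nor contradicts
        adjustment += weight if c_value == l_value else -weight
    return adjustment


def adjusted_score(name_score: float, customer: Party, listed: Party) -> float:
    """Combine the name score with the adjustment, clamped to [0, 1]."""
    raw = name_score + secondary_adjustment(customer, listed)
    return max(0.0, min(1.0, raw))


if __name__ == "__main__":
    customer = Party("Mohamed Ali", date(1985, 3, 2), "EG")
    listed = Party("Mohamed Ali", date(1961, 7, 19), "SY")
    print(adjusted_score(0.9, customer, listed))
```

Treating missing identifiers as neutral rather than as a mismatch is a deliberate choice here: an incomplete list entry should leave the alert for human review instead of silently clearing it.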

Properly deployed, secondary-identifier filtering eliminates the bulk of trivial false positives — common-name matches against sanctioned individuals with completely different DOB and nationality. Analyst time previously spent on these matches is freed for genuine investigation.

Layer 3: Contextual Risk Scoring

Two alerts are not the same alert. An alert on a low-risk customer with no prior alerts in two years is fundamentally different from an alert on a high-risk customer with three prior cleared alerts in the last quarter. Modern alert workflows route and prioritise alerts based on contextual risk, ensuring high-risk alerts get attention first and low-risk alerts can be cleared faster.

Contextual scoring incorporates the customer's risk rating, the customer's alert history, the alert's position in the customer's transaction pattern, and any known relationships to other alerted customers. The output is not a binary alert/no-alert decision but a graduated priority score that drives workflow.

Contextual scoring does not eliminate alerts — it sequences them. The analyst looks at the highest-risk alerts first, can rapidly dispose of low-risk alerts with documented rationale, and produces a measurable improvement in case-handling efficiency. The audit trail is preserved at every step.
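A graduated priority score of this kind might look like the following sketch. The point values, caps, and field names are hypothetical, picked only to show the four inputs named above being combined into a 0–100 score that orders the queue without dropping anything from it.

```python
from dataclasses import dataclass

# Hypothetical point values; an unknown rating raises KeyError on purpose.
RISK_POINTS = {"low": 0, "medium": 20, "high": 40}


@dataclass(frozen=True)
class AlertContext:
    alert_id: str
    customer_risk: str              # "low", "medium" or "high"
    prior_alerts_12m: int           # alerts on this customer in the last year
    deviation_from_baseline: float  # 0..1, how unusual for this customer
    linked_alerted_customers: int   # known relationships to alerted parties


def priority_score(ctx: AlertContext) -> int:
    """Return a 0-100 priority; higher means review sooner."""
    deviation = min(max(ctx.deviation_from_baseline, 0.0), 1.0)
    score = RISK_POINTS[ctx.customer_risk]               # up to 40
    score += min(ctx.prior_alerts_12m, 5) * 4            # up to 20
    score += int(deviation * 25)                         # up to 25
    score += min(ctx.linked_alerted_customers, 3) * 5    # up to 15
    return score


def build_queue(alerts):
    """Order alerts by priority; every alert stays in the queue."""
    return sorted(alerts, key=priority_score, reverse=True)


if __name__ == "__main__":
    queue = build_queue([
        AlertContext("A-1", "low", 0, 0.0, 0),
        AlertContext("A-2", "high", 3, 1.0, 1),
        AlertContext("A-3", "medium", 1, 0.5, 0),
    ])
    for alert in queue:
        print(alert.alert_id, priority_score(alert))
```

Because `build_queue` only sorts, the output always contains exactly the alerts that went in, which is the property that keeps sequencing distinct from suppression.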

Defensibility Matters

The reason regulators have been historically cautious about machine learning in AML is concern about defensibility — can the firm explain to the regulator why the model classified this alert as low priority? Modern ML approaches address this through model explainability (SHAP values, feature attribution) and through hybrid architectures where ML augments rather than replaces deterministic rules. The deterministic layer keeps every alert visible; the ML layer prioritises and accelerates.
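To give a feel for what feature attribution produces, here is a minimal sketch for the simplest case, a linear (logistic) scoring model. Each feature's contribution to the log-odds is its weight times its distance from a baseline value; for a linear model with features treated as independent, these contributions coincide with SHAP values. The coefficients, baseline values, and feature names below are invented, and a production deployment with a non-linear model would rely on a dedicated explainability library instead.

```python
# Hypothetical coefficients of an already-trained logistic model.
WEIGHTS = {
    "name_similarity": 3.0,
    "dob_mismatch": -2.5,
    "customer_high_risk": 1.2,
}
BIAS = -1.0

# Hypothetical baseline: average feature values over the training alerts.
BASELINE = {
    "name_similarity": 0.8,
    "dob_mismatch": 0.5,
    "customer_high_risk": 0.1,
}


def log_odds(features: dict) -> float:
    """Model output in log-odds space for one alert."""
    return BIAS + sum(WEIGHTS[k] * features[k] for k in WEIGHTS)


def attributions(features: dict) -> dict:
    """Per-feature contribution relative to the baseline alert."""
    return {k: WEIGHTS[k] * (features[k] - BASELINE[k]) for k in WEIGHTS}


def explain(features: dict) -> list:
    """Contributions sorted by absolute size, largest first."""
    return sorted(attributions(features).items(),
                  key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    alert = {"name_similarity": 0.95, "dob_mismatch": 1.0,
             "customer_high_risk": 0.0}
    for feature, contribution in explain(alert):
        print(f"{feature}: {contribution:+.2f}")
```

The contributions always sum to the difference between the alert's score and the baseline score, so the explanation accounts for the whole prioritisation decision rather than a selected part of it.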

Layer 4: Supervised Machine Learning

The most powerful false-positive reduction technique is supervised machine learning trained on the firm's own historical disposition data. The training signal is straightforward: given the features of a historical alert, did the analyst classify it as a false positive or a true positive?

A well-trained supervised model can classify new alerts with accuracy that materially exceeds rule-based scoring. In production deployments at major banks, ML-based alert prioritisation has reduced manual review effort by 60–85% while maintaining or improving the rate at which genuine SARs are filed. The improvement comes from the model learning patterns in the data that no human-designed rule could practically capture.

Supervised ML in AML works best when three conditions are met:

Sufficient training data: at minimum tens of thousands of historical disposition records, ideally hundreds of thousands.
Quality disposition data: analyst classifications must be consistent and accurately recorded; sloppy historical disposition labels produce unreliable models.
Continuous retraining: typologies evolve and customer behaviour evolves, so models must be retrained on a defined cadence (typically quarterly).
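The training signal described in this section can be sketched with a small logistic-regression classifier written in plain Python. Everything here is an assumption for illustration: the two features, the synthetic history, and the hyperparameters. The 200-row dataset is far below the volumes recommended above and exists only to show the mechanics; a real deployment would use an established ML library and the firm's own disposition records.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_history(n: int = 200, seed: int = 42):
    """Invent alert features [name_similarity, dob_match] and labels."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        is_true_positive = i % 10 == 0  # one in ten alerts is genuine
        if is_true_positive:
            rows.append([rng.uniform(0.85, 1.0), 1.0])
        else:
            rows.append([rng.uniform(0.70, 0.95), 0.0])
        labels.append(1.0 if is_true_positive else 0.0)
    return rows, labels


def train_logistic(rows, labels, epochs: int = 300, lr: float = 0.5):
    """Fit weights and bias by full-batch gradient descent on log loss."""
    w, b, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def true_positive_probability(w, b, features) -> float:
    """Model's estimate that an alert is a true positive."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, features)))


# Train once on the synthetic history (labels stand in for analyst outcomes).
ROWS, LABELS = make_synthetic_history()
W, B = train_logistic(ROWS, LABELS)

if __name__ == "__main__":
    for features in ([0.9, 1.0], [0.9, 0.0]):
        print(features, round(true_positive_probability(W, B, features), 3))
```

Two alerts with the same name similarity receive different scores once the model has learned that a DOB match separates genuine hits from noise in the historical labels, which is also why inconsistent labels degrade the result so directly.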

Building the Programme to Hold Up

False-positive reduction is operationally valuable only if it holds up to regulatory inspection. A programme that reduces alerts but cannot defend the reduction is worse than the legacy programme it replaced.

The four practices that make an FP-reduction programme defensible:

Document every tuning decision: every threshold change, every model retrain, every scenario adjustment must have a written rationale tied to data.
Run shadow or champion-challenger tests: the new configuration (the challenger) runs in parallel with the current one (the champion) before going live, so the actual impact on true-positive detection is measurable.
Independent validation: the model and tuning must be validated by a function independent of the team that built it. Internal model validation or model-risk-management is the standard.
Continuous monitoring: the model's performance is tracked over time so drift is detected and addressed.
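The shadow-run practice in the list above can be sketched as a simple comparison over a parallel-run period. The `ShadowRecord` structure and its fields are hypothetical; the idea is only that each item records whether the current configuration and the candidate configuration would have alerted, together with the analyst's outcome, so that both the volume reduction and any lost true positives are measured before go-live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRecord:
    alert_id: str
    champion_alerted: bool      # current production configuration
    challenger_alerted: bool    # candidate configuration, run in shadow
    confirmed_suspicious: bool  # analyst outcome for the item


def compare(records) -> dict:
    """Summarise challenger impact relative to the champion."""
    champion = [r for r in records if r.champion_alerted]
    challenger = [r for r in records if r.challenger_alerted]
    lost = [r.alert_id for r in champion
            if r.confirmed_suspicious and not r.challenger_alerted]
    reduction = 1 - len(challenger) / len(champion) if champion else 0.0
    return {
        "champion_alerts": len(champion),
        "challenger_alerts": len(challenger),
        "volume_reduction": reduction,
        "true_positives_lost": lost,  # must be reviewed before go-live
    }


if __name__ == "__main__":
    # Invented parallel-run data: ten champion alerts, two confirmed.
    records = [
        ShadowRecord(f"A-{i}", True, i < 4, i in (0, 1)) for i in range(10)
    ]
    print(compare(records))
```

Keeping the list of lost alert IDs, rather than only a count, gives the validation function something concrete to inspect and gives the tuning decision a documented evidence trail.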

Modern compliance platforms support this lifecycle natively — tuning workflows, A/B testing, model performance tracking, and explainability all built into the case-management workflow.

Cut Alert Volume Without Cutting Detection

One Constellation's screening and monitoring platform combines tuned scenarios, secondary-identifier matching, contextual scoring, and explainable supervised ML — letting your team focus on the alerts that actually matter.
