  • methodology
  • anomaly-detection
  • anti-pattern

How AI agents misdiagnose CPA spikes

Why naive threshold alerts on ad cost-per-acquisition fail at the point they are most needed — and the three mechanisms mureo uses to fire only on signal.

CPA (cost per acquisition) spikes are the single most-asked-about alert in paid search operations, and the one where most AI agents fail hardest. The failure is not that the model is weak. The failure is that almost every implementation — in marketing SaaS, in one-off ChatGPT prompts, in auto-generated dashboards — treats the problem as “CPA went up, fire an alert.” That heuristic is wrong at exactly the moments a marketer most needs it to be right.

This post walks through why, and how mureo’s anomaly detector is built to refuse false positives instead of emitting them.

The naive approach

Ask any LLM to “watch my Google Ads CPA and alert me if something’s off” and you get some version of:

if todays_cpa > target_cpa * 1.5:
    alert("CPA spike detected")

It is the simplest thing that could possibly work, and it has been shipping in ad-ops tooling for fifteen years. On a single campaign in steady state, it can even be useful. Applied naively across an account by an AI agent, it becomes a noise generator.

Three failure modes that make the naive approach wrong

1. Sample size is almost never enough

A campaign that converts 40 times on Monday and 5 times on Tuesday has not “spiked” — Tuesday is a different sample size. The apparent CPA on a 5-conversion day is so volatile that a single atypical lead can inflate it by 30%. On a 2-conversion day, a single outlier can double it.
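The arithmetic, with a made-up spend figure (the effect depends only on the conversion count):

```python
# Invented spend; only the denominator matters for the volatility argument.
spend = 200.0

# On a 5-conversion day, one conversion slipping to tomorrow moves CPA
# from 200/5 = 40.00 to 200/4 = 50.00 — a 25% jump from a single lead.
cpa_5 = spend / 5
cpa_4 = spend / 4

# On a 2-conversion day, the same slip doubles CPA (200/2 -> 200/1).
cpa_2 = spend / 2
cpa_1 = spend / 1

print(cpa_4 / cpa_5 - 1, cpa_1 / cpa_2)  # 0.25 2.0
```

No threshold multiplier survives that kind of denominator noise, which is why the fix is a gate, not a smarter multiplier.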

The naive alert fires. The marketer looks. There is nothing to fix. Five false positives later, the marketer stops reading the alerts, and the system that was supposed to be the safety net is now background noise.

2. Baseline is usually the wrong comparison

“Compare today’s CPA to target CPA” is intuitive but statistically naive. Target CPA is an aspiration, not a baseline. The relevant comparison is:

What is this campaign’s typical CPA, on days similar to today, given what we actually know about it?

Which means you want a median-based baseline over a recent window of same-shape days, not a threshold someone typed into a briefing deck eighteen months ago. Median (not mean) because one bad day should not distort the reference.
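A toy window (invented CPA values) shows why the mean is the wrong aggregator here:

```python
import statistics

# Seven days of CPA for one campaign, with a single blow-up day.
window = [42.0, 39.5, 41.0, 40.5, 118.0, 38.0, 43.0]

mean_baseline = statistics.mean(window)      # dragged up by the one bad day
median_baseline = statistics.median(window)  # unaffected by it

print(round(mean_baseline, 1), median_baseline)  # 51.7 41.0
```

A mean baseline of ~52 would quietly absorb the bad day into the reference, making the next genuine spike look smaller than it is; the median holds at 41.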

3. Severity is not binary

The naive alert either fires or does not. In reality, CPA at 1.4× baseline and CPA at 2.5× baseline require very different human responses. Collapsing them into one boolean wastes the agent’s most useful channel — priority — on a decision the system could have made for the operator.

mureo’s design

The anomaly detector in mureo/analysis/anomaly_detector.py is deliberately small. It does three things (A–C below), and it is explicit about what it refuses to do (D):

A. Median baseline from the action log

Every mureo workflow records a CampaignSnapshot to the append-only action_log. The detector builds a baseline by taking the median CPA (or CTR, or spend) over a configurable window of recent snapshots for the same campaign. Median is chosen over mean because it is robust to the single-day outliers that a marketer should not need to hand-filter.
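A minimal sketch of the baseline computation over a flat in-memory log. The CampaignSnapshot and action_log names come from the post; the field layout and function signature here are illustrative, not mureo's actual API:

```python
from dataclasses import dataclass
from statistics import median

@dataclass(frozen=True)
class CampaignSnapshot:
    # Illustrative fields — only the concepts come from the post.
    campaign_id: str
    cpa: float

def cpa_baseline(action_log, campaign_id, window=14):
    """Median CPA over the last `window` snapshots for one campaign.

    `window=14` is an invented default; the post only says "configurable".
    """
    recent = [s.cpa for s in action_log if s.campaign_id == campaign_id][-window:]
    return median(recent) if recent else None

log = [CampaignSnapshot("c1", v) for v in (42.0, 39.5, 118.0, 41.0, 40.5)]
print(cpa_baseline(log, "c1"))  # 41.0 — the 118.0 outlier does not move it
```

Because the log is append-only, the baseline is reproducible: re-running the detector against the same log window always yields the same reference.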

B. Sample-size gates

Below a statistical threshold, the detector does not alert at all. The numbers come from the mureo-learning skill’s sample-size rules, which codify what is and is not a trustworthy signal:

Metric      Minimum sample per day   Rationale
CPA spike   30 conversions           Below this, a single atypical lead moves CPA too much to call it a “spike”
CTR drop    1000 impressions         Below this, the delivery mix is too noisy to trust the CTR number

The gates mark the point below which a “bad day” is day-to-day noise rather than a shift worth acting on. Below the gate, the detector returns nothing. The metric is surfaced for monitoring in the /daily-check report, not for action.
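As a guard, using the defaults from the table above (the function and dict names are mine, not mureo's):

```python
# Defaults from the post; the live values belong to anomaly_detector.py.
MIN_CONVERSIONS = 30    # CPA spike gate
MIN_IMPRESSIONS = 1000  # CTR drop gate

def passes_gate(metric, sample_size):
    """Return False (no alert possible) below the trust threshold."""
    gates = {"cpa": MIN_CONVERSIONS, "ctr": MIN_IMPRESSIONS}
    return sample_size >= gates[metric]

print(passes_gate("cpa", 5))   # False — monitoring only, never an alert
print(passes_gate("cpa", 42))  # True — eligible for severity scoring
```

The important design point is the order of operations: the gate runs before any baseline comparison, so a wild-looking ratio on a tiny sample is never even computed into an alert.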

C. Severity tiers tied to effect size

When the gate is cleared, the detector emits one of two tiers:

Tier       CPA condition     CTR condition
HIGH       ≥ 1.5× baseline   ≤ 0.5× baseline
CRITICAL   ≥ 2.0× baseline   ≤ 0.3× baseline

Two tiers, not five. Five tiers would imply a precision the detector does not have. The two tiers map to different operator actions:

  • HIGH — investigate before the next daily check; likely a structural cause (bid change, new competitor, landing page break).
  • CRITICAL — pause-worthy without explanation; budget is actively burning against something that stopped working.
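The tier mapping, using the post's default thresholds (the function name and return shape are illustrative):

```python
def severity(value, baseline, metric="cpa"):
    """Map effect size to one of two tiers, or None below the HIGH bar.

    Thresholds are the post's defaults: 1.5x/2.0x for CPA,
    0.5x/0.3x for CTR.
    """
    ratio = value / baseline
    if metric == "cpa":
        if ratio >= 2.0:
            return "CRITICAL"
        if ratio >= 1.5:
            return "HIGH"
    elif metric == "ctr":
        if ratio <= 0.3:
            return "CRITICAL"
        if ratio <= 0.5:
            return "HIGH"
    return None  # cleared the sample gate, but effect size is unremarkable

print(severity(63.0, 42.0))  # 1.5x baseline -> HIGH
print(severity(90.0, 42.0))  # ~2.14x baseline -> CRITICAL
```

Note that the comparison inverts by metric: CPA alerts on ratios above baseline, CTR on ratios below it, which is why the two columns in the table point in opposite directions.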

D. What the detector refuses to do

Three things the naive version ships that mureo’s does not:

  • It does not alert on zero-conversion days unless the campaign previously had non-zero conversions and spent money today. Zero conversions on a paused campaign is the correct state.
  • It does not alert on brand campaigns unless the baseline includes brand. A non-brand baseline applied to a brand campaign will always look like a “CPA spike.” The detector inherits the brand flag from the campaign snapshot rather than guessing.
  • It does not infer root cause. The anomaly is a shape, not an explanation. Root cause belongs in the /rescue workflow, which consults the diagnostic knowledge base with the anomaly as input. Mixing detection and explanation is how SaaS dashboards end up telling you CPA rose “due to increased competition” on days when the actual cause was a bidding strategy that flipped into learning mode.
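The zero-conversion refusal can be sketched as a pre-check; the dict-based snapshot shape here is an assumption, and the brand-flag refusal would follow the same pattern (a field carried on the snapshot, never inferred):

```python
def should_consider(snapshot_today, history):
    """Pre-check mirroring the post's first refusal (field names illustrative).

    Zero conversions only matters if the campaign previously converted
    AND is still spending money today; otherwise zero is the correct state.
    """
    if snapshot_today["conversions"] == 0:
        ever_converted = any(s["conversions"] > 0 for s in history)
        return ever_converted and snapshot_today["spend"] > 0
    return True

# Paused campaign: zero conversions, zero spend -> no alert.
print(should_consider({"conversions": 0, "spend": 0.0},
                      [{"conversions": 12, "spend": 300.0}]))  # False

# Spending but suddenly converting zero -> worth the detector's attention.
print(should_consider({"conversions": 0, "spend": 50.0},
                      [{"conversions": 12, "spend": 300.0}]))  # True
```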

When you should still override the agent

The detector is tuned for the median account. It is wrong — and you should override it — in at least these cases:

  1. Known promotional pulse. If you are running a 48-hour flash sale and CPA doubles on hour 2, that is the promotion working (high CPC auction, high volume), not a spike. Tell mureo with /learn; future runs will factor the pulse in.

  2. Attribution lag. Some ad types — view-through, app-install, offline conversion imports — report conversions 1-7 days late. Same-day CPA will show as “spiked” because the numerator is real but the denominator is partial. The detector does not currently correct for this; a wrapper that suppresses alerts within the lookback window is on the roadmap.

  3. Sample-gate boundary. If you have a CPA metric that genuinely matters at 20 conversions/day (niche B2B, high LTV), the 30 threshold is too loose. Operator override: pass a smaller min_conversions to the tool invocation. The default is the default, not the ceiling.
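A hedged sketch of what such an override might look like; the real invocation is whatever anomaly_detector.py exposes, and only the min_conversions-as-override idea comes from the post:

```python
DEFAULT_MIN_CONVERSIONS = 30  # the post's default gate

def detect_cpa_spike(conversions_today, cpa_today, baseline,
                     min_conversions=DEFAULT_MIN_CONVERSIONS):
    """Hypothetical wrapper: same gate and tiers, operator-tunable threshold."""
    if conversions_today < min_conversions:
        return None  # below the gate: monitor only, never alert
    ratio = cpa_today / baseline
    if ratio >= 2.0:
        return "CRITICAL"
    if ratio >= 1.5:
        return "HIGH"
    return None

# Niche B2B account where 22 conversions/day is a genuinely large sample:
print(detect_cpa_spike(22, 70.0, 40.0))                      # None at the default gate
print(detect_cpa_spike(22, 70.0, 40.0, min_conversions=20))  # HIGH
```

The override lowers the gate, not the severity thresholds: the operator is asserting that their sample is trustworthy earlier, not that smaller deviations deserve louder alerts.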

Bottom line

The job of an anomaly detector on a money-touching account is not to notice that a number went up. It is to emit an alert rarely enough that, when it does fire, it is worth acting on.

mureo’s detector is not clever. It refuses to fire below sample-size gates; it uses a median rather than a mean; it picks two severity tiers instead of five; it lets humans override when local context demands it. Every one of those choices trades “ability to look impressive on a slide” for “being trusted at 3 AM.”

If that trade is wrong for your account, mureo is wrong for your account. If it is right, the code is at mureo/analysis/anomaly_detector.py.


This article is part of the mureo methodology series. The source numbers cited (1.5×/2.0× CPA, 30 conversions, 1000 impressions) are the current defaults in anomaly_detector.py as of mureo 0.5.0; they are versioned with the OSS release and may be retuned as the diagnostic knowledge base grows.