Most "AI visibility" metrics being sold in 2026 are vanity scores. Single numbers with no clear methodology, no comparison set, no instrumented attribution to revenue. This piece walks through the actual measurement stack we use to prove (or disprove) GEO impact at every stage, with the source-of-truth ordering when sources disagree.
If you're commissioning GEO work and the agency can't describe their measurement at this level of specificity, you're funding hope, not work.
The three levels of proof
GEO measurement has three layers, each with a different lag and a different evidentiary weight. Strong results show movement at all three.
Level 1: Visibility (leading indicator)
What it measures: citation rate across a fixed prompt set.
How:
- Lock a prompt set of 30–50 buyer-intent queries in the client's language and market. These don't change during the engagement.
- Run each prompt 3× per engine (LLM responses are stochastic. Single runs lie).
- Run across all major engines: ChatGPT, Claude, Gemini, Perplexity at minimum.
- Capture: was the brand cited? Position in citation list? Which competitors appeared? Which sources did the LLM reference?
- Re-run weekly during the sprint, then monthly during retainer.
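A minimal sketch of how a weekly run can be collapsed into a per-engine citation rate. The `query_engine` callable is a placeholder for whatever client or harness you use (an assumption, not a real API); it's assumed to return the response text plus the list of cited source URLs.

```python
from collections import defaultdict

ENGINES = ["chatgpt", "claude", "gemini", "perplexity"]
RUNS_PER_PROMPT = 3  # single runs are noise; aggregate over repeats

def citation_rate(prompts, brand, query_engine):
    """Fraction of runs, per engine, in which `brand` is named.

    `query_engine(engine, prompt)` is a placeholder: assume it returns
    (response_text, cited_source_urls) for one stochastic run.
    """
    hits = defaultdict(int)
    for engine in ENGINES:
        for prompt in prompts:
            for _ in range(RUNS_PER_PROMPT):
                text, sources = query_engine(engine, prompt)
                if brand.lower() in text.lower():
                    hits[engine] += 1
    total = len(prompts) * RUNS_PER_PROMPT or 1
    return {engine: hits[engine] / total for engine in ENGINES}
```

Run the same function for each competitor brand over the same frozen prompt set so every report shows your rate next to theirs.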
Why it's essential: this is the first metric to move. Levels 2 and 3 lag by 30–90 days. Without Level 1, you can't prove anything inside a 30-day sprint window.
Why it's insufficient alone: visibility != revenue. A skeptical CEO will ask "so what?" You need Levels 2 and 3 to answer.
Level 2: LLM referral traffic (mid-term proof)
What it measures: sessions, signups, and revenue from users who arrived via an LLM.
How:
- In GA4, create an audience:
  session_source matches regex chatgpt|chat\.openai|perplexity|gemini\.google|claude\.ai|copilot\.microsoft|you\.com|phind
- Establish a 30-day baseline before the sprint starts. Annotate the metrics in your reporting tool.
- Track weekly during the sprint.
- Compare 30-day pre-sprint baseline vs 30-day post-sprint window.
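If you want to sanity-check the audience definition outside GA4, a sketch like the following classifies exported session sources against the same pattern. Column layout of the export is an assumption, not a GA4 API.

```python
import re

# Same pattern as the GA4 audience definition above.
LLM_REFERRER = re.compile(
    r"chatgpt|chat\.openai|perplexity|gemini\.google|claude\.ai"
    r"|copilot\.microsoft|you\.com|phind",
    re.IGNORECASE,
)

def is_llm_referral(session_source: str) -> bool:
    """True if the session source matches a known LLM referrer."""
    return bool(LLM_REFERRER.search(session_source or ""))

def llm_sessions(rows) -> int:
    """Count LLM-referred sessions from a list of (session_source,
    session_count) pairs. Treat the result as a floor: copy-pasted URLs
    and stripped referrers land in Direct instead."""
    return sum(count for source, count in rows if is_llm_referral(source))
```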
Important caveats:
- GA4 underreports LLM referrals by 30–60%. Reasons: many users copy-paste your URL into a new tab (shows as direct), in-app browsers strip referrers, Perplexity sometimes drops referrer headers.
- This means "LLM referral sessions" in GA4 is a floor, not a ceiling. Real traffic is higher than the segment shows.
Level 3: Branded search + direct lift (gold standard)
What it measures: when LLMs recommend you by name, users go look you up. Branded organic search volume and direct traffic must rise.
How:
- In Google Search Console: filter Performance report for queries containing your brand. 30-day baseline, then 30-day post-sprint window.
- In GA4: track direct traffic (channel = Direct). Same comparison.
- Cross-reference with paid brand search (Google Ads brand campaigns). If branded organic rises while paid brand stays flat, you've isolated the GEO effect.
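A sketch of the lift calculation from a GSC Performance export, assuming a CSV with `query` and `clicks` columns; the file names and brand string are hypothetical.

```python
import csv

def branded_clicks(path: str, brand: str) -> int:
    """Sum clicks for queries containing the brand name in a GSC
    Performance export (assumes 'query' and 'clicks' columns)."""
    with open(path, newline="") as f:
        return sum(
            int(row["clicks"])
            for row in csv.DictReader(f)
            if brand.lower() in row["query"].lower()
        )

baseline = branded_clicks("gsc_baseline_30d.csv", "acme")  # hypothetical files
post = branded_clicks("gsc_post_sprint_30d.csv", "acme")
lift = (post - baseline) / baseline if baseline else float("inf")
print(f"Branded search lift: {lift:+.1%}")
# If paid brand spend stayed flat across the same windows, this lift is
# attributable to recommendation behavior rather than ad-driven recall.
```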
Why this is the gold standard: these signals are nearly impossible to fake. If LLMs are actually recommending you more, branded search has to rise. If it doesn't, the GEO work isn't moving recommendation behavior in any real sense.
The fourth signal: self-reported discovery
Independent of the three measurement layers, add a single signup-form field:
How did you hear about us? [Google] [ChatGPT, Perplexity, or AI assistant] [Social media] [Friend / referral] [Other]
This catches what GA4 misses (the direct-traffic AI referrals) and gives you hard percentage data. Within 60 days of a successful sprint, expect 10–25% of new signups to self-report "AI" as discovery source. That's the strongest single signal that GEO is working.
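Counting the survey answers is trivial, but it belongs in the same report pipeline as everything else; a sketch, assuming each signup record carries the raw option string from the form field.

```python
from collections import Counter

AI_OPTION = "ChatGPT, Perplexity, or AI assistant"  # the form option text

def discovery_breakdown(answers):
    """Percentage breakdown of 'How did you hear about us?' answers.
    `answers` is an iterable of the selected option strings per signup."""
    counts = Counter(answers)
    total = sum(counts.values()) or 1
    return {answer: count / total for answer, count in counts.most_common()}

# discovery_breakdown(answers).get(AI_OPTION, 0) landing at 0.10-0.25
# within 60 days of the sprint is the target signal.
```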
Source-of-truth ordering when sources disagree
Inevitably the numbers don't match. GA4 says 200 LLM-referral sessions. The brand-search uplift suggests 500+. Self-reported survey says 12% of new signups (which would imply ~800 LLM-influenced signups). Which is right?
Order of precedence we use:
1. Self-reported attribution. Users who say they heard via AI are the most reliable signal of AI influence. They explicitly remember the touchpoint.
2. Branded search + direct lift correlation. Movement here is structurally caused by recommendation behavior.
3. LLM referral session count in GA4. Useful but known to be undercounted.
4. Visibility rate on prompt set. Leading indicator only. Strongest correlation to lagging metrics but not a substitute for them.
When all four are aligned and moving together, the GEO work is real. When they diverge sharply, investigate before claiming wins.
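One way to keep the precedence honest in reporting is to encode it and flag divergence instead of silently picking the most flattering number; a sketch with assumed field names, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class GeoSignals:
    self_reported_ai_pct: float    # share of new signups naming AI
    branded_lift_pct: float        # branded search + direct lift vs baseline
    llm_sessions_delta: int        # GA4 segment delta (known undercount)
    visibility_delta_pct: float    # prompt-set citation-rate delta

# Highest-trust signal first; lead the report with whatever sits higher
# in this list when the numbers disagree.
PRECEDENCE = [
    "self_reported_ai_pct",
    "branded_lift_pct",
    "llm_sessions_delta",
    "visibility_delta_pct",
]

def aligned(signals: GeoSignals) -> bool:
    """All four signals moving in the same direction = a claimable result;
    anything else means investigate before claiming a win."""
    values = [getattr(signals, name) for name in PRECEDENCE]
    return all(v > 0 for v in values) or all(v < 0 for v in values)
```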
What you must instrument before the sprint starts
Non-negotiable setup: if you skip any of these, you can't prove anything at Day 30, and that is the single biggest mistake new GEO engagements make.
- ✅ Fixed prompt set documented and frozen
- ✅ Baseline visibility audit completed and archived (screenshots of cited responses)
- ✅ GA4 "AI Search Referrals" audience created with all LLM referrers
- ✅ GA4 baseline metrics snapshot (sessions, signups, revenue from that segment, 30 days)
- ✅ GSC branded search baseline (30 days rolling, exported as CSV)
- ✅ Direct traffic baseline (30 days)
- ✅ Signup form survey field deployed with the "AI assistant" option
- ✅ Customer-facing teams briefed to ask "what prompted you to look us up?"
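The checklist is easier to enforce if the whole baseline is archived as one artifact before Day 1. A sketch of what that record might contain; field names are illustrative, not a required schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class PreSprintBaseline:
    frozen_prompts: list[str]               # the locked 30-50 prompt set
    visibility_by_engine: dict[str, float]  # baseline citation rates
    ga4_llm_sessions_30d: int               # "AI Search Referrals" segment
    ga4_llm_signups_30d: int
    gsc_branded_clicks_30d: int             # from the exported GSC CSV
    direct_sessions_30d: int
    snapshot_date: date = field(default_factory=date.today)
```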
What gets sold as measurement and isn't
Push back hard on these in any agency pitch:
- "AI Optimization Score". A single proprietary number. No standardized methodology, can't be cross-checked, vendor-locked.
- "GPT-4 ranking". LLMs don't have stable rankings.
- "Prompt rankings tracked over time" without stochasticity controls. A single run is noise. Need 3+ runs per prompt to draw any conclusion.
- "AI Overview impressions" from GSC. This measures Google AI Overviews only, not LLM citations. Mostly noise for GEO purposes.
- "Schema markup deployed". That's an activity, not a result. Should produce measurable citation lift.
- "Content optimized for AI". Meaningless without prompt-level evidence.
- Visibility % without competitor comparison. Useless in isolation. Always show your delta + competitor delta.
- Single-engine measurement. "We improved ChatGPT visibility" misses 3 of the 4 major engines. Always report across all four.
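For the competitor-delta point in particular, the comparison is simple enough to keep in the same pipeline; a sketch that takes the citation-rate maps from two runs of the same frozen prompt set (inputs are the dictionaries produced by the Level 1 sketch, one entry per brand).

```python
def visibility_deltas(before, after):
    """Citation-rate change per brand between two runs of the same frozen
    prompt set. `before` and `after` map brand name -> citation rate."""
    brands = set(before) | set(after)
    return {b: after.get(b, 0.0) - before.get(b, 0.0) for b in brands}

# A +15pt gain for your brand means little if a competitor gained +25pt
# on the same prompts; report both columns side by side.
```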
The honest measurement narrative for a client report
At Day 30 / Day 60 / Day 90 the report should follow this shape:
- Visibility delta. Prompt-level citation rate before/after, per engine, with screenshots.
- Sample cited responses. 5–10 verbatim quotes where your brand is named. Visceral proof.
- LLM referral traffic. GA4 segment chart.
- Branded search + direct lift. GSC + GA4 correlation chart.
- Self-reported discovery. Survey response breakdown.
- Competitor displacement. List of prompts where you now win that a competitor won at baseline.
Six sections. All aligned. All measurable. All verifiable. That's what makes GEO a defensible service category and not the next round of agency snake oil.