LLM Observability Maturity Scorecard

Diagnose where your team stands across 7 dimensions of LLM observability. Get a composite score, gap analysis, and concrete next steps.

The 5 Maturity Levels

Progression from no observability to closed-loop automated intelligence.

0[Dark]

Absence. No LLM-specific observability exists. The system is a black box. Failures are discovered by end users.

1[Theater]

Presence without utility. Instrumentation exists but does not inform decisions. Data is collected but not consumed. There is no operational difference between having these logs and not having them.

What changes at this boundary

The minimum for L1: at least one LLM endpoint has some form of logging or tracing. Even console.log counts. The bar is existence, not quality.

2[Measured]

Operational awareness. Infrastructure metrics are systematically collected and reviewed. The team can answer: "How fast? How often? How much?" But NOT: "How good?"

What changes at this boundary

This is the hardest boundary and where most organizations stall. L1 has data nobody uses. L2 has data that is reviewed on a defined cadence (even weekly), with defined operational metrics, by defined people. The diagnostic test: can you name the person who reviewed LLM metrics in the last 7 days? If no → L1.

3[Governed]

Quality awareness + ownership. Semantic quality is measured. Someone owns the outcomes. SLOs exist. The team can answer: "How good? Who's responsible? What's acceptable?"

What changes at this boundary

L2 measures infrastructure (latency, errors, tokens). L3 measures the quality of the output itself — relevance, correctness, safety, grounding. This is the level transition that is uniquely LLM-specific. In traditional observability, L2 is often sufficient. For LLMs, L2 is dangerously incomplete.

4[Predictive]

Proactive + closed-loop. Anomalies detected before user impact. Observability data drives automated or semi-automated decisions about models, prompts, and routing — and the governance loop is closed: scope and behavioral-drift violations are enforced in real time, not discovered in the post-mortem.

What changes at this boundary

L3 has humans reviewing quality data and making decisions. L4 has systems that detect anomalies and either trigger automated responses or surface actionable alerts before users notice degradation. Closed-loop maturity includes governance enforcement, not just model/prompt/routing automation.
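
One way to picture the L3 to L4 shift is a check that runs on every new quality score instead of waiting for a human review. The sketch below is an illustration under stated assumptions, not a prescribed detector: it flags a score that falls well below a rolling baseline. The function name `is_quality_anomaly`, the 0 to 1 score scale, and the threshold of 3 standard deviations are all hypothetical choices.

```python
from statistics import mean, stdev


def is_quality_anomaly(baseline: list[float], new_score: float,
                       z_threshold: float = 3.0) -> bool:
    """Return True when new_score sits far below the recent baseline.

    baseline: recent per-request quality scores (hypothetical 0..1 scale).
    z_threshold: how many standard deviations below the mean counts as
    anomalous; 3.0 is an illustrative default, not a recommendation.
    """
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return new_score < mu  # flat baseline: any drop is a deviation
    return (mu - new_score) / sigma > z_threshold


recent_scores = [0.90, 0.92, 0.91, 0.89, 0.93]  # made-up baseline
alert_on_drop = is_quality_anomaly(recent_scores, 0.60)
alert_on_normal = is_quality_anomaly(recent_scores, 0.90)
```

In an L4 setup the `True` branch would feed an alert or an automated response (rollback, reroute) rather than a dashboard; what that response is depends on the system.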

The 7 Assessment Dimensions

Each dimension is independently scoreable. Being strong in one does not imply strength in another.

01 / 07

Trace Coverage

What fraction of your LLM interactions are captured with enough structure to be queryable?

+ Measures

The breadth and structural quality of your instrumentation. This is the foundational layer — without data, nothing else in this scorecard matters.

− Does NOT measure

Whether anyone looks at the traces (that's Feedback Loop). Whether the traces include quality metrics (that's Quality Signals). This dimension is purely about capture.
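
As a rough illustration of "enough structure to be queryable", the sketch below builds one JSON record per LLM call and computes a coverage fraction. The field names, the endpoint path, and the model name are assumptions made for the example; the scorecard does not prescribe a schema.

```python
import json
import time
import uuid


def build_trace_record(endpoint: str, model: str, prompt: str,
                       response: str, latency_ms: float,
                       input_tokens: int, output_tokens: int) -> dict:
    """Build one structured record per LLM call (illustrative field names)."""
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "endpoint": endpoint,
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }


def trace_coverage(captured_calls: int, total_calls: int) -> float:
    """Fraction of LLM calls that produced a structured record."""
    if total_calls == 0:
        return 0.0
    return captured_calls / total_calls


record = build_trace_record(
    endpoint="/support/answer",   # hypothetical endpoint
    model="example-model",        # hypothetical model name
    prompt="What is the refund window?",
    response="30 days from purchase.",
    latency_ms=412.5,
    input_tokens=18,
    output_tokens=6,
)
line = json.dumps(record)  # one JSON object per line is easy to index and filter
coverage = trace_coverage(captured_calls=750, total_calls=1000)
```

A free-text log line would count toward the L1 bar; a record like this is closer to what lets you filter by endpoint, model, or latency later.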

02 / 07

Quality Signals

Do you measure the quality of what your LLMs produce — not just whether they responded?

+ Measures

The breadth and sophistication of your quality measurement system. How many types of quality signals do you collect, and how systematically?

− Does NOT measure

Hallucination detection specifically (that's Hallucination Awareness). Cost (that's Cost Visibility). Whether anyone acts on quality data (that's Feedback Loop).

※ Why this dimension is separate

Quality Signals is about the breadth of your quality measurement. Hallucination Awareness is about the depth of your capability on the single most dangerous LLM failure mode. You can be L3 on Quality Signals (you measure relevance, completeness, tone) and L1 on Hallucination Awareness (you don't detect fabrications). These are independent axes.

03 / 07

Hallucination Awareness

How do you detect, classify, and respond to your models fabricating information?

+ Measures

Your specific capability to handle the failure mode that is unique to LLMs and does not exist in traditional software. A service can be slow, broken, or expensive — only an LLM can confidently lie to your users with a smile.

− Does NOT measure

Broad quality measurement (relevance, completeness, tone) — that's Quality Signals. This dimension is specifically about fabrication detection.

※ Why this dimension is separate

You can have broad quality measurement (relevance, completeness, tone) and still be completely blind to hallucinations. Conversely, you can have excellent hallucination detection and no broader quality framework. These capabilities are independently valuable and independently assessable. A concrete example: an org can be L3 on Quality Signals — measuring relevance, tone, and completeness with automated evaluators — and simultaneously L0 on Hallucination Awareness because they have zero grounding verification in their RAG pipeline. This is not an edge case. It is the typical state of teams that adopted LLM-as-judge evals through Arize or Langfuse but never built source-claim verification. A single LLM-as-judge evaluator does not cover both dimensions.
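
To make "source-claim verification" concrete, here is a deliberately naive sketch: it splits an answer into sentences and flags any sentence whose content words barely overlap with the retrieved sources. This lexical check is an assumption chosen for brevity, not the method the scorecard has in mind, and real grounding checks typically need entailment or citation matching. The names `unsupported_claims`, `STOPWORDS`, and the 0.6 threshold are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "in",
             "on", "to", "and", "or", "for", "with", "that", "this", "it"}


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def unsupported_claims(answer: str, sources: list[str],
                       min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences with weak lexical support in every source."""
    source_words = [content_words(s) for s in sources]
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        # Best overlap ratio against any single source passage.
        best = max((len(words & sw) / len(words) for sw in source_words),
                   default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


sources = ["The refund window is 30 days from the date of purchase."]
answer = "The refund window is 30 days. Refunds are paid in gift cards only."
flagged = unsupported_claims(answer, sources)
```

Here the second sentence is flagged because nothing in the source mentions gift cards. A relevance or tone evaluator could still score this answer highly, which is the independence the paragraph above describes.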

04 / 07

Cost Visibility

Do you understand what your LLM usage costs — at the granularity that enables decisions?

+ Measures

Your ability to attribute, forecast, and optimize LLM spending. This is the FinOps dimension. LLM-specific factors: token economics are asymmetric (input vs. output pricing); model selection is a cost lever (same task can differ 10x); context window utilization affects spend; caching requires instrumentation to measure impact; provider pricing tiers are context-dependent.
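
The asymmetric token pricing and the model-selection lever can be worked through with a few lines of arithmetic. The prices below are invented for illustration and do not correspond to any provider's price list; `Pricing` and `request_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per one million tokens, in dollars (made-up numbers below)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request when input and output tokens are priced differently."""
    return (input_tokens * pricing.input_per_million
            + output_tokens * pricing.output_per_million) / 1_000_000


large_model = Pricing(input_per_million=10.0, output_per_million=30.0)  # hypothetical
small_model = Pricing(input_per_million=1.0, output_per_million=3.0)    # hypothetical

# Same task on both models: 2,000 input tokens, 500 output tokens.
large_cost = request_cost(large_model, 2_000, 500)
small_cost = request_cost(small_model, 2_000, 500)
ratio = large_cost / small_cost
```

With these assumed prices the same request costs ten times more on the larger model, and output tokens cost three times as much as input tokens. Attributing that per feature or per customer requires the token counts to be captured on every trace.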

− Does NOT measure

Quality of outputs (that's Quality Signals). Infrastructure performance (that's Trace Coverage). This dimension is specifically about cost attribution and optimization.

05 / 07

Incident Ownership

When your model degrades or fails — is there a defined process, or is it chaos?

+ Measures

Organizational readiness to respond to LLM-specific incidents. This is deliberately a process/people dimension, not a technology dimension. LLM-specific incident types: quality degradation (model outputs get worse without any infrastructure signal); prompt regression (a prompt change degrades quality in untested cases); provider degradation (upstream model provider degrades quality); data contamination (retrieval pipeline surfaces incorrect data); cost explosion (code change causes a 10x cost spike).

− Does NOT measure

The quality measurement system itself (that's Quality Signals). Technical tracing capability (that's Trace Coverage). This dimension is about the organizational response, not the detection tooling.

※ Why this dimension is separate

Observability without incident response is voyeurism. You can have the most sophisticated dashboards in the world — if nobody knows whose job it is to look at them when things go wrong, you have observability theater. This dimension is the bridge between "we can see the problem" and "we can fix the problem."

06 / 07

Feedback Loop

Does your observability data actually change what you do — or is it a dashboard nobody looks at?

+ Measures

The degree to which observability output is connected to engineering and product decisions about models, prompts, and system design. This is the "closed loop" dimension — where observability becomes engineering.

− Does NOT measure

The quality of your evaluation (that's Quality Signals). The coverage of your tracing (that's Trace Coverage). This dimension measures whether the data you collect has a path back to the system that produced it.

※ Why this dimension is separate

Most organizations stop at "collect data" and "build dashboard." The leap to "data drives decisions" and then "data drives automated actions" is the maturity gap that separates orgs that improve from orgs that accumulate technical debt with better visibility. This is the capstone dimension: it only works when the dimensions before it function. It is intentionally ordered after them to reflect that dependency, not to suggest it matters less.

07 / 07

Governance & Auditability

When an auditor, insurer, or court asks what your LLM system did, under whose authority, and whether the record was altered — can you answer with evidence, or only with reconstruction?

+ Measures

Your ability to produce a tamper-evident, attributable, lifetime-scoped record of LLM decisions and the authority/policy context around them. This is the compliance and non-repudiation dimension.

− Does NOT measure

Whether anyone *responds* when things break (that's Incident Ownership). Whether output quality is *measured* (that's Quality Signals). This dimension is specifically about durable, provable, authority-attributed records.

※ Why this dimension is separate

You can be L4 on Trace Coverage and L0 here. Traces sitting in a mutable SIEM, unsigned, uncorrelated, and rotated out in 30 days are excellent for debugging and worthless as evidence. Trace Coverage is capture *for engineering*. Governance & Auditability is capture *for accountability* — different consumer, different durability requirement, different threat model. The EU AI Act's Article 12 asks for the second one and does not accept the first as a substitute.
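
One common way to make a record tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below assumes SHA-256 and an in-memory list purely for illustration; the scorecard does not specify a mechanism, and the names `append_entry`, `verify_chain`, and `GENESIS` are hypothetical. It does not cover signing, retention, or storage outside the blast radius.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], payload: dict) -> None:
    """Append a record that commits to everything recorded before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, payload),
    })


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_entry(audit_log, {"request": "q1", "response": "a1", "actor": "agent-7"})
append_entry(audit_log, {"request": "q2", "response": "a2", "actor": "agent-7"})
intact = verify_chain(audit_log)

audit_log[0]["payload"]["response"] = "edited later"  # simulate tampering
still_intact = verify_chain(audit_log)
```

A chain like this only detects edits if the latest hash is anchored somewhere the editor cannot reach (a signature, a separate system, a write-once store); otherwise the whole chain can be recomputed after the change.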

The 4 Disqualifiers

A mean hides landmines. Disqualifiers are blocking yes/no questions that sit outside the composite. A blocking answer to any of them caps your overall maturity at L1 (Theater), regardless of how high your dimension scores push the average. They catch the presence of a fatal practice that per-dimension scoring misses by design, never the absence of a good one. The set is small and stable.

DQ #01 — mutable audit log
Can a single person or agent permanently delete the only copy of your audit/incident records?
+ Safe
No — audit/incident records have an append-only or out-of-blast-radius copy
⚠ Blocking
Yes — a single actor can destroy the only copy
Why fatal: A backup that shares a deletion path with primary data is not a recovery layer. The only-copy log is one mistake or one compromised credential from being unrecoverable.
DQ #02 — blanket-scope production token
Does any production LLM agent hold a credential scoped for destructive operations it doesn't need for its current task?
+ Safe
No — agent credentials are scoped to current-task operations only
⚠ Blocking
Yes — at least one agent holds destructive scope it doesn't need
Why fatal: Governance is theater if the token can delete the database. Capability declaration must hold at credential level, not just at policy level.
DQ #03 — unguarded prod deploy
In the last 90 days, did a prompt or model reach production with no quality gate of any kind?
+ Safe
No — every prompt/model change passed at least one quality gate
⚠ Blocking
Yes — at least one prompt or model shipped without any quality check
Why fatal: Quality signals you don't defend are decoration. A single unguarded deploy means the gate is optional, which means it isn't a gate.
DQ #04 — no evidentiary record
If a regulator asked tomorrow for the exact request/response of one specific LLM decision from six months ago — signed and unaltered — could you produce it?
+ Safe
We could produce it
⚠ Blocking
We could not produce it
Why fatal: The Governance & Auditability capability must hold end-to-end, over the lifetime, not just on paper. This catches the gap between "we have a tamper-evident log" and "we can pull evidence for any historical decision."

How Scoring Works

Score yourself 0–4 on each of the 7 dimensions. The composite score is the arithmetic mean of all 7 dimension scores.

Overall maturity level = floor of the composite score. A composite of 2.8 → Level 2 (Measured).

Disqualifier cap — if you answer the blocking option on any of the 4 disqualifiers, your overall level is capped at L1 (Theater). The raw composite is still shown, with the triggering disqualifier(s) named, so the path back is visible.
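
The composite, floor, and disqualifier cap described above can be sketched in a few lines. The function name `overall_level`, the dictionary layout, and the example scores are assumptions for illustration; only the rules themselves (mean of 7 scores, floor, cap at L1) come from the text.

```python
import math

LEVEL_NAMES = ["Dark", "Theater", "Measured", "Governed", "Predictive"]


def overall_level(scores: dict[str, int], triggered_disqualifiers: list[str]) -> dict:
    """Apply the mean, floor, and disqualifier cap to 7 dimension scores."""
    if len(scores) != 7 or any(not 0 <= s <= 4 for s in scores.values()):
        raise ValueError("expected 7 dimension scores, each between 0 and 4")
    composite = sum(scores.values()) / len(scores)  # arithmetic mean
    level = math.floor(composite)
    if triggered_disqualifiers:
        level = min(level, 1)  # any blocking answer caps the level at L1
    return {
        "composite": composite,  # raw composite is reported even when capped
        "level": level,
        "level_name": LEVEL_NAMES[level],
        "capped_by": list(triggered_disqualifiers),
    }


example_scores = {  # made-up self-assessment
    "Trace Coverage": 3,
    "Quality Signals": 3,
    "Hallucination Awareness": 2,
    "Cost Visibility": 3,
    "Incident Ownership": 3,
    "Feedback Loop": 3,
    "Governance & Auditability": 3,
}
uncapped = overall_level(example_scores, [])
capped = overall_level(example_scores, ["DQ #01"])
```

With these scores the composite is 20/7, a little under 2.86, so the uncapped result is Level 2 (Measured); a single triggered disqualifier drops it to Level 1 (Theater) while leaving the composite unchanged.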

Uneven profile warning fires when any dimension scores 0, or when a dimension is 2+ levels below your composite. A high composite with a critical gap dimension is still a broken system — the warning ensures you see it clearly.

Gap actions are sorted lowest-scoring dimensions first. Fix the foundation before decorating.
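
The uneven profile warning and the gap ordering follow directly from the scores. The sketch below reads "2+ levels below your composite" as a gap of at least 2.0 between the composite and the dimension score, which is an interpretation; the function names and example scores are hypothetical.

```python
def uneven_profile(scores: dict[str, int]) -> list[str]:
    """Dimensions that trigger the warning: score 0, or 2+ below the composite."""
    composite = sum(scores.values()) / len(scores)
    return [name for name, score in scores.items()
            if score == 0 or composite - score >= 2]


def gap_actions(scores: dict[str, int]) -> list[str]:
    """Dimension names ordered lowest score first (ties keep input order)."""
    return sorted(scores, key=lambda name: scores[name])


profile = {  # made-up self-assessment with one weak dimension
    "Trace Coverage": 4,
    "Quality Signals": 4,
    "Hallucination Awareness": 1,
    "Cost Visibility": 3,
    "Incident Ownership": 3,
    "Feedback Loop": 4,
    "Governance & Auditability": 4,
}
warnings = uneven_profile(profile)
ordered = gap_actions(profile)
```

Here the composite is 23/7, roughly 3.29, so the overall level is 3, yet Hallucination Awareness sits more than 2 below it and is both flagged and placed first in the action list.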