
Three Ways Your API Lies: Lessons from GitHub's Rough Week

Between April 9 and April 13, GitHub published four incident reports. Three of them are the same problem in different disguises — and that problem is probably in your stack too.

observability · incidents · governance · post-mortem · ai-ops · architecture

You open the agent dashboard. It shows no sessions running. You assume your agents are idle. They are not. They are running. You just can't see them.

That happened to Copilot customers starting on April 9. The Mission Control UI stopped listing third-party agent sessions — specifically Claude and Codex Cloud Agent integrations. GitHub's postmortem is precise: "API returned successful responses with incomplete results", average error rate 0%, maximum error rate 0%. No 5xx. No timeout. No alert. Fourteen hours and twenty-five minutes of invisibility, and every HTTP probe in the building said everything was fine.

Two other incidents the same week rhyme with it. An octodns failure deleted a production Pages DNS record after an upstream data source intermittently returned nothing for it — 17.5 million failed requests in 97 minutes, peak error rate 12.77%. And a Copilot coding agent cascade where a caching bug held rate-limited state longer than the actual rate-limit window, producing four separate outage waves and roughly 22,700 delayed or failed workflow creations across a ten-hour window.

Three different incidents. One question underneath each: what does a signal mean when it arrives incomplete, stale, or missing? Your pipeline is answering that question every day, whether you wrote the answer down or not. Here are three ways it might be answering wrong.

Lie #1: the incomplete response that claims to be complete

Mission Control's dashboard returned HTTP 200 with a list. The list was short a category of items. The consumer — the customer's UI code, and every monitoring system watching it — had no way to tell the difference between "there are no third-party sessions" and "there are third-party sessions and we filtered them out by mistake". From outside, the responses were syntactically perfect.

That's the failure mode to watch for in your own APIs: a response schema that conflates empty and incomplete. It is not only a schema design issue. It is a monitoring issue, because once the schema has lost the distinction, nothing downstream can recover it. Your SLO built on HTTP status codes measures conformance to the protocol, not fidelity to reality.

Two things help:

  1. When a response was shaped by a filter, a join, or a projection, mark that in the payload. A result_scope field, a list of applied filters, or a sources_queried array turns silent omission into explicit, auditable metadata. It also gives your consumer a place to assert expectations.
  2. Monitor shape as well as status. If agent_sessions.length has a multi-day floor of three and suddenly reads zero for a class of customer, that is a detectable anomaly. Dashboards counting 200s will not flag it. Dashboards counting expected content shape will.
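Both ideas can be sketched together. This is a minimal illustration, not GitHub's actual API; the field names (`sources_queried`, `sources_expected`, `applied_filters`) and the session-list shape are assumptions invented for the example:

```python
from dataclasses import dataclass, field

# Hypothetical response payload: the data plus metadata describing how it
# was produced. Field names are illustrative, not from any real API.
@dataclass
class SessionListResponse:
    sessions: list[dict]
    sources_queried: list[str]   # backends that actually contributed results
    sources_expected: list[str]  # backends the consumer should see
    applied_filters: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # An empty list is only trustworthy if every expected
        # source actually reported in.
        return set(self.sources_expected) <= set(self.sources_queried)

# Consumer side: an HTTP 200 with sessions == [] is not enough on its own.
resp = SessionListResponse(
    sessions=[],
    sources_queried=["first_party"],  # third_party silently dropped
    sources_expected=["first_party", "third_party"],
)

if not resp.is_complete():
    # "No sessions" and "sessions we couldn't see" are now distinct states.
    missing = set(resp.sources_expected) - set(resp.sources_queried)
    print(f"ALERT: incomplete result, missing sources: {sorted(missing)}")
```

With the metadata in place, the shape monitor from point 2 becomes a one-line assertion on `sources_queried` instead of a statistical guess about list lengths.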

If you run anything that other teams build dashboards or governance workflows on top of — an LLM retrieval service, an audit log reader, an agent inventory — this is the lesson. Your 200 is load-bearing. Make it carry weight.

Lie #2: the stale decision broadcast as current

The Copilot cache trapped a rate-limited verdict from upstream and kept serving it after upstream had recovered. Four outage waves instead of one recovery. The cache wasn't wrong about the fact — at one point, upstream really had said denied. It was wrong about when that was true.

Negative cache entries — denials, failures, rate-limit hits, timeouts — tend to get the same TTL treatment as positive ones. They shouldn't. A cached success is a snapshot of a known-good moment; a cached denial is a veto, and vetoes should expire fast and loudly.

What to check in your own stack:

  1. Every cache of a negative verdict — rate limits, auth failures, circuit-breaker open states — should carry a validity window shorter than the upstream's own recovery time, and consumers should know the age of what they're reading.
  2. If the cache is authoritative for a block/allow decision, it needs an independent health check against upstream. Not a re-poll of the cache. A probe to the source of truth.
  3. When a cache layer can extend the blast radius of a transient upstream failure — in this case, turning a few minutes of rate-limiting into ten hours of intermittent degradation — treat that amplification as a first-class concern in your architecture review, not a post-hoc footnote.
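The first check, in particular, is easy to sketch: give negative verdicts their own short TTL and always return entry age alongside the value. This is a toy illustration, not the Copilot cache; the class name, TTL values, and injectable clock are all assumptions made for the example:

```python
import time

# Illustrative TTLs: a cached success may live minutes; a cached veto
# must expire well before the upstream's own recovery time.
POSITIVE_TTL = 300.0
NEGATIVE_TTL = 5.0

class VerdictCache:
    """Caches allow/deny verdicts with asymmetric TTLs and visible age."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (verdict, stored_at)

    def put(self, key, allowed: bool):
        self._entries[key] = (allowed, self._clock())

    def get(self, key):
        """Return (verdict, age_seconds), or None if missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        verdict, stored_at = entry
        age = self._clock() - stored_at
        ttl = POSITIVE_TTL if verdict else NEGATIVE_TTL
        if age > ttl:
            # Expired: the caller must re-probe the source of truth,
            # not re-read the stale veto.
            del self._entries[key]
            return None
        return verdict, age

# A denial stops being served seconds after it was cached:
t = [0.0]
cache = VerdictCache(clock=lambda: t[0])
cache.put("user:42", False)            # upstream said rate-limited
t[0] = 3.0
print(cache.get("user:42"))            # still vetoed, and the age is visible
t[0] = 10.0
print(cache.get("user:42"))            # veto expired: go ask upstream again
```

Because the consumer receives the age with every read, check 1 (freshness) and the "know what you're reading" half of the contract come for free; check 2 still needs its own probe to upstream.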

The question to walk into your next design review with: for every cached value in this system, how fresh does its consumer need it to be, and does the consumer actually know how old it is?

Lie #3: the automation that reads silence as command

octodns polled an upstream data source for a production DNS record. The source came back empty for that record — transiently. octodns interpreted the absence as "this record should no longer exist" and deleted it. Pages went down.

The bug is obvious in hindsight. The design question is harder: why could a tool that polls an upstream source delete, unattended, records that other systems had created? The planned remediation — preventing octodns from deleting records owned by other systems — is an admission that the tool's authority was never properly scoped.

If you have infrastructure-as-code reconcilers, AI agents with tool access, or any system that takes autonomous action on diffs between "desired state" and "observed state", this applies to you. Three checks:

  1. Asymmetric blast radius. Creation is usually recoverable; deletion often isn't. A reconciler that will happily delete should require stronger evidence than one that will happily create. "Missing from one poll" is weak evidence. "Missing from N consecutive polls across M minutes, confirmed by a secondary source" is stronger. Most reconcilers do not draw this line because nobody asked them to.
  2. Ownership boundaries. Does your automation only act on resources it created or explicitly owns? Or can it reach across and modify resources registered by humans, other services, or other tooling? If the latter, every one of those resources is at the mercy of a polling bug.
  3. Kill-switch proximity. How fast can you halt this reconciler? If the answer is "we'd need to push a config change and wait for a deploy", that's too slow. Feature flags or circuit breakers that stop writes in seconds, not minutes, are the difference between a one-record incident and a 17-million-request one.
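All three checks compose into one gate in front of the delete path. The sketch below is hypothetical — the class, the streak threshold, and the resource names are invented for illustration, and a real reconciler would persist the streak counters and wire the kill switch to a feature flag:

```python
# "Missing once" is weak evidence; require a streak before deletion.
MISSING_POLLS_REQUIRED = 3

class DeletionGuard:
    """Gates a reconciler's delete path behind the three checks above."""

    def __init__(self, owned_resources: set[str]):
        self.owned = owned_resources   # ownership boundary: only our resources
        self.missing_streak: dict[str, int] = {}
        self.writes_enabled = True     # kill switch: flip to halt deletes now

    def observe_poll(self, resource: str, present: bool):
        # One successful sighting resets the evidence for deletion.
        if present:
            self.missing_streak.pop(resource, None)
        else:
            self.missing_streak[resource] = self.missing_streak.get(resource, 0) + 1

    def may_delete(self, resource: str) -> bool:
        return (
            self.writes_enabled                     # kill-switch proximity
            and resource in self.owned              # ownership boundary
            and self.missing_streak.get(resource, 0) >= MISSING_POLLS_REQUIRED
        )                                           # asymmetric blast radius

guard = DeletionGuard(owned_resources={"pages-dns-record"})
guard.observe_poll("pages-dns-record", present=False)
print(guard.may_delete("pages-dns-record"))  # one empty poll: no delete
```

A single transient empty poll — the octodns trigger — never authorizes a delete here; three consecutive misses do, and flipping `writes_enabled` stops all deletions without a deploy.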

Agents with tool access are this class of system, just with less deterministic triggers. If octodns reading an ambiguous poll and deleting a DNS record feels reckless, an LLM agent reading an ambiguous instruction and calling a deletion tool should feel worse.

What to take into Monday

The fourth GitHub incident that week — the April 13 Copilot latency event — was a pure capacity failure, and the fix (more compute) was proportional to the cause. That is the shape most incidents have. The three in this post are the other shape: the system did what it was designed to do, and what it was designed to do turned out not to be what anyone wanted.

The shared ingredient is a missing honesty contract. Responses that don't declare their completeness. Caches that don't declare their freshness. Automation that doesn't declare the limits of its authority. Each gap is small in isolation. Stacked on top of each other, they are the reason your green dashboard and your customer's complaint ticket can both be accurate at the same time.

Three questions worth adding to this week's architecture review:

  • Where in your stack does a 200 hide an incomplete result?
  • Where in your stack does a cached decision outlive the condition that produced it?
  • Where in your stack does an automated writer have authority over things it did not create?

Answer those three honestly and next week's postmortems are less likely to be yours.