Building Ominvo

Day 65: My health dashboard hid its own outage for four days

June 29, 2026 · 6 min read

Yesterday, while dogfooding the new IP Abuse admin tab, I noticed something weird in a different tab. The admin Health tab showed all four services as Operational — Database, Payments, Email Delivery, AI Engine, each with a green dot and a response time. "All Systems Operational" said the banner at the top.

But the response times looked suspicious. 1033ms for the database. 212ms for Stripe. The numbers were specific, but they didn't update when I refreshed. The "Last checked" timestamp said 8:43 pm, which I assumed meant just now.

I almost moved on. Half a day later — after Day 64's main work shipped — I dug back in.

The dashboard had been lying to me for four days.

What I expected vs what was true

I expected a standard observability loop: UptimeRobot pings /api/health every 5 minutes; the route checks Supabase, Stripe, Resend, and Anthropic in parallel and logs the results to service_health_log; the admin tab reads from that table.

What was actually happening: UptimeRobot was pinging /api/health every 5 minutes. The endpoint was returning 200. The admin tab was reading from service_health_log. But the table hadn't received a new row since June 24 at 08:36 UTC. The most recent four rows — one per service — were sitting there from four days ago. Every time the admin tab loaded, it pulled those same four rows. The "Operational" status, the response times, the green banner — all from data that was four days old.

I ran a quick query in Supabase to confirm.
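The query itself is not shown here. As a rough sketch of the same check in application code, the function below takes rows already fetched from service_health_log and reports the newest row per service and its age in hours. The row shape and the column names (service, status, checked_at) are assumptions for illustration, not the real schema.

```typescript
// Hypothetical row shape; the real service_health_log columns may differ.
interface HealthLogRow {
  service: string;
  status: string;
  checked_at: string; // ISO 8601 timestamp
}

interface LatestEntry {
  row: HealthLogRow;
  ageHours: number;
}

const MS_PER_HOUR = 60 * 60 * 1000;

// Return the newest row for each service, with its age relative to `now`.
function latestPerService(rows: HealthLogRow[], now: Date): Map<string, LatestEntry> {
  const latest = new Map<string, LatestEntry>();
  for (const row of rows) {
    const current = latest.get(row.service);
    const rowTime = Date.parse(row.checked_at);
    if (!current || rowTime > Date.parse(current.row.checked_at)) {
      latest.set(row.service, { row, ageHours: (now.getTime() - rowTime) / MS_PER_HOUR });
    }
  }
  return latest;
}
```

If every service comes back with an age measured in days rather than minutes, the log has stopped receiving writes, whatever the cards say.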

The root cause

Day 60 introduced a full rewrite of /api/health as part of the auto-rollback system. The new version checked Supabase and Stripe, approximated Anthropic by checking if the env var was set, and returned a JSON summary to UptimeRobot.

What it did not do: write anything to service_health_log.

The insert was dropped entirely: not commented out, not guarded behind a flag, just gone. The route continued to return 200 on every ping. UptimeRobot saw no errors. The GitHub Actions rollback workflow saw no errors. The only component in a position to notice was the admin tab, which saw zero new rows, but it fell back to its defaults: no new logs means 100% uptime (no failures detected) and status "ok" (taken from the latest entry, which was four days old). The tab showed green because its defensive defaults were too defensive.
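To make the failure mode concrete, here is a minimal sketch of an optimistic fallback next to a stricter alternative. The function names, the Summary shape, and the "unknown" status are hypothetical. The strict variant is one possible design, not something this post describes shipping.

```typescript
type CheckStatus = "ok" | "down";

interface Summary {
  status: CheckStatus | "unknown";
  uptimePercent: number | null;
}

// Percentage of "ok" entries; callers must pass a non-empty list.
function uptime(statuses: CheckStatus[]): number {
  const okCount = statuses.filter((s) => s === "ok").length;
  return (okCount / statuses.length) * 100;
}

// Optimistic fallback: an empty window reads as "no failures detected".
// `windowStatuses` is ordered newest first.
function summarizeOptimistic(windowStatuses: CheckStatus[]): Summary {
  if (windowStatuses.length === 0) return { status: "ok", uptimePercent: 100 };
  return { status: windowStatuses[0], uptimePercent: uptime(windowStatuses) };
}

// Stricter alternative: no data means "unknown", never "ok".
function summarizeStrict(windowStatuses: CheckStatus[]): Summary {
  if (windowStatuses.length === 0) return { status: "unknown", uptimePercent: null };
  return { status: windowStatuses[0], uptimePercent: uptime(windowStatuses) };
}
```

With an empty window, the optimistic version reports a healthy service while the strict one admits it has nothing to report.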

This kind of bug is invisible to tests that only assert on the response, invisible to CI, and invisible to any external monitor that only checks for 200 vs non-200. The only way to catch it is to look at the data and notice that nothing is changing.

The fix

Restored /api/health to what it should have been since Day 60:

Four services checked: Supabase (live query on businesses), Stripe (balance.retrieve()), Resend (domains.list()), Anthropic (models.list({ limit: 1 })). The old env-var-only Anthropic check is gone — it was measuring the wrong thing.
Response times captured: wall-clock milliseconds from before the call to after, per service.
Error messages captured: if a check throws, the error message goes into the log row.
Status is ok or down: no degraded for now — the checks either pass or they don't.
All four results inserted to service_health_log via the service-role client after every check, whether any check fails or not.

The insert is fire-and-forget wrapped in try/catch. If the log write fails, the route logs the error but still returns the real service status to UptimeRobot. The health check endpoint must not fail to report because its own logging is broken — that would be a second-order problem hiding the first.
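The shape of that route can be sketched as follows. This is not the actual /api/health code. The checks and the log writer are injected as plain functions, so no SDK calls appear, and runCheck, runHealthChecks, and the CheckResult fields are hypothetical names. The sketch awaits the log write inside try/catch, which is one reading of "fire-and-forget wrapped in try/catch".

```typescript
type CheckStatus = "ok" | "down";

interface CheckResult {
  service: string;
  status: CheckStatus;
  responseTimeMs: number;
  error: string | null;
}

// A check resolves on success and throws on failure.
type Check = () => Promise<unknown>;

// Time one check with wall-clock milliseconds; a thrown error becomes "down".
async function runCheck(service: string, check: Check): Promise<CheckResult> {
  const startedAt = Date.now();
  try {
    await check();
    return { service, status: "ok", responseTimeMs: Date.now() - startedAt, error: null };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { service, status: "down", responseTimeMs: Date.now() - startedAt, error: message };
  }
}

// Run all checks in parallel, then attempt the log write. A failed write is
// logged to the console but never changes what the caller gets back.
async function runHealthChecks(
  checks: { service: string; check: Check }[],
  writeLog: (rows: CheckResult[]) => Promise<void>,
): Promise<CheckResult[]> {
  const results = await Promise.all(checks.map((c) => runCheck(c.service, c.check)));
  try {
    await writeLog(results);
  } catch (err) {
    console.error("service_health_log insert failed:", err);
  }
  return results;
}
```

In a real route, each injected check would wrap one of the calls listed above, and writeLog would wrap the service-role insert. Because runCheck never rejects, Promise.all always yields one result per service, so every service gets a row whether or not its check failed.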

The stale-data banner

The real design failure here wasn't the missing insert — it was that the admin tab had no way to tell the operator it was showing old data. If the most recent row in service_health_log is four days old, and the tab shows "All Systems Operational" with no caveat, that's a trust problem. The next time it shows green, you don't know if it's genuinely green or if logging broke again.

So I added a stale-data warning banner that mounts at the top of the Health tab:

Red banner — if service_health_log has no data at all: "No health check data available. The /api/health endpoint may not be writing to service_health_log. Cards below show fallback values, not real status."
Amber banner — if the most recent row is more than 15 minutes old: "Health checks have not run in X minutes. Last check: [timestamp]. Cards below may show stale data."
Nothing — if everything is fresh.

The 15-minute threshold is three missed pings. One or two missed pings can happen. Three in a row means something is structurally wrong.
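The banner rules reduce to a small pure function. This is a sketch under assumptions: the staleBanner name and the StaleBanner shape are hypothetical, and the ISO timestamp format is a placeholder for however the tab actually renders "[timestamp]".

```typescript
type BannerLevel = "none" | "amber" | "red";

interface StaleBanner {
  level: BannerLevel;
  message: string | null;
}

// 15 minutes = three missed pings at a 5-minute interval.
const STALE_AFTER_MS = 15 * 60 * 1000;

// `latestCheckedAt` is the newest row's timestamp, or null if the table is empty.
function staleBanner(latestCheckedAt: Date | null, now: Date): StaleBanner {
  if (latestCheckedAt === null) {
    return {
      level: "red",
      message:
        "No health check data available. The /api/health endpoint may not be " +
        "writing to service_health_log. Cards below show fallback values, not real status.",
    };
  }
  const ageMs = now.getTime() - latestCheckedAt.getTime();
  if (ageMs > STALE_AFTER_MS) {
    const minutes = Math.floor(ageMs / 60000);
    return {
      level: "amber",
      message:
        `Health checks have not run in ${minutes} minutes. ` +
        `Last check: ${latestCheckedAt.toISOString()}. Cards below may show stale data.`,
    };
  }
  return { level: "none", message: null };
}
```

Comparing raw milliseconds rather than rounded minutes keeps "more than 15 minutes" exact: a row that is exactly 15 minutes old shows no banner, and one a second older shows amber.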

This banner should have been there from Day 33 when the health check system was first built. Observability infrastructure that doesn't surface its own staleness is the same as not having it.

What I should have done differently

The root cause is obvious in retrospect: I rewrote a route that had a side effect (DB write) and didn't preserve the side effect. But the deeper failure is that nothing caught it for four days.

Tests don't catch dropped side effects unless you test for them explicitly. TypeScript doesn't know that service_health_log should be getting new rows every five minutes. The route compiles clean. The endpoint returns 200. Everything looks fine everywhere except in the actual data.

The rule I'm adopting going forward: when rewriting a route, list its side effects explicitly before touching anything. The health check route had exactly one side effect — the DB write. I should have written that down before changing the route and verified it survived. I didn't.
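One way to make that side-effect list executable is a check that fails when the write disappears, even if the response is still 200. The sketch below assumes a handler that receives its insert function as a parameter. HealthHandler, checkSideEffects, and the service keys are all hypothetical names for illustration.

```typescript
interface LogRow {
  service: string;
  status: "ok" | "down";
}

type InsertRows = (rows: LogRow[]) => Promise<void>;
type HealthHandler = (insertRows: InsertRows) => Promise<{ httpStatus: number }>;

// Hypothetical service keys; the real log may label services differently.
const EXPECTED_SERVICES = ["supabase", "stripe", "resend", "anthropic"];

// Call the handler with a recording insert and return every violated
// expectation. An empty list means it responded 200 AND wrote one row
// per expected service.
async function checkSideEffects(handler: HealthHandler): Promise<string[]> {
  const inserted: LogRow[] = [];
  const response = await handler(async (rows) => {
    inserted.push(...rows);
  });

  const problems: string[] = [];
  if (response.httpStatus !== 200) {
    problems.push(`expected 200, got ${response.httpStatus}`);
  }
  for (const service of EXPECTED_SERVICES) {
    if (!inserted.some((row) => row.service === service)) {
      problems.push(`no log row written for ${service}`);
    }
  }
  return problems;
}
```

A rewrite that returns 200 but never calls the insert produces four problems here, which is exactly the case a response-only test would pass.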

The stale-data banner is the permanent fix. Even if this happens again, the admin tab will tell me within 15 minutes instead of four days.

What's next

Day 66 is the changelog and infrastructure tidy-up before the Stripe live-mode cutover around August 5. The PRE_LAUNCH flag flips July 23 when GBP API approval lands, and the actual launch is August 10.

Forty-four days.

Written by

The founder of Ominvo

Building review management for single-location small businesses.
