Day 62: What your users see when Anthropic goes down

June 26, 2026 · 5 min read

Yesterday was about the infrastructure layer — auto-rollback, health checks, the machinery that kicks in when our code breaks so the system heals itself without a 2am page. That's the part users never see, and shouldn't have to.

Today was the human layer: what does a salon owner actually see on their screen when Anthropic has an outage at 9pm on a Saturday? Before today the answer was a broken page, a silent spinner, or a raw error message that means nothing to someone who's never heard of a rate limit. None of those are acceptable.

Three external APIs, three failure modes

Ominvo depends on three external services we don't control — Anthropic for AI replies, Stripe for payments, Supabase for data. All three have had outages this year. All three fail differently: Anthropic throws rate limit errors and 5xx responses; Stripe throws its own SDK exceptions with varying codes; Supabase can go slow or return connection errors. Without graceful degradation, our users see identical symptoms for all three: a button that does nothing, or a page that looks broken. That's a problem because users blame the product in front of them, not a third-party infrastructure company they've never heard of.

Why generic error messages are worse than no message

"Something went wrong" is barely better than a crash. A skeptical SMB owner — someone who decided to trust us with their Google reviews — doesn't know if it's their wifi, their browser, our product, or some API they've never heard of. Worse: if every failure looks the same, they can't tell a 30-second blip from a real outage. They'll assume it's broken, close the tab, and maybe not come back. At least a crash is dramatic. A silent spinner is a slow confidence drain.

The right pattern is structured error responses — distinct codes per failure mode, so the UI can show honest, specific messaging instead of a generic shrug.

Structured error responses across draft-reply, checkout, and billing portal

Three API routes now return a consistent JSON shape on failure:

{ "error": "ai_unavailable", "message": "...", "retryable": true }

/api/draft-reply catches Anthropic SDK errors. RateLimitError returns 429 with ai_rate_limited. Everything else returns 503 with ai_unavailable.
/api/create-checkout-session and /api/create-portal-session wrap their Stripe calls. Stripe failures return 503 with payment_unavailable.
Every error includes a boolean retryable flag, set to true for the transient failures above, so the UI can decide whether to show a retry button or a permanent message.
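
For a rough idea of how that mapping can look in code, here is a minimal TypeScript sketch. It is a sketch under assumptions, not the production handler: the type names, the failure, draftReplyFailure and stripeFailure helpers, and the AI message wording are all made up for illustration. It also checks for an HTTP status of 429 as a stand-in for the SDK's RateLimitError, so the snippet runs without any dependencies.

```typescript
// Failure shape shared by the three routes (field names from the JSON above).
type ErrorCode = "ai_rate_limited" | "ai_unavailable" | "payment_unavailable";

interface ApiFailure {
  status: number;
  body: { error: ErrorCode; message: string; retryable: boolean };
}

// Small builder so every failure path produces the same shape.
function failure(status: number, error: ErrorCode, message: string): ApiFailure {
  return { status, body: { error, message, retryable: true } };
}

// Assumption: the thrown error carries a numeric `status`. A 429 stands in
// for the SDK's RateLimitError so this runs without installing the SDK.
function isRateLimited(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 429
  );
}

// /api/draft-reply: rate limits get 429, everything else gets 503.
// Message wording here is placeholder copy.
function draftReplyFailure(err: unknown): ApiFailure {
  if (isRateLimited(err)) {
    return failure(429, "ai_rate_limited", "AI replies are busy right now. Try again in a moment.");
  }
  return failure(503, "ai_unavailable", "AI replies are temporarily unavailable.");
}

// Checkout and portal routes: any Stripe failure gets 503.
function stripeFailure(): ApiFailure {
  return failure(503, "payment_unavailable", "Payment processing is temporarily unavailable.");
}
```

A route handler would call one of these inside its catch block and return body as JSON with the matching status, leaving the success path untouched.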

The success response shapes were not touched — only the failure paths changed. The full ship log is on the /changelog if you want the before/after detail.

Why an inline banner beats a toast or modal

Toasts disappear before users read them — especially older users, or anyone who looked away the moment something happened. Modals are for decisions, not status information. An inline amber banner stays visible until the user dismisses it or the action succeeds. It's also non-blocking: the user can still navigate, still try again, still see the rest of the page.

The banner sits directly below the button that triggered the action, so the cause-and-effect is obvious. No hunting around the screen for what went wrong. On the /dashboard, it appears below the AI reply draft button. On the /pricing page, it appears below the checkout CTA. Same component, different context, same honest message.
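
The banner's lifecycle is small enough to write down as a state machine. The sketch below is illustrative only: BannerState, BannerEvent and bannerReducer are hypothetical names, and the real component may be wired differently. The point it shows is that there is no timer event, so the banner only goes away on a dismiss or a success.

```typescript
// The banner is either hidden or showing one message below the button.
type BannerState =
  | { visible: false }
  | { visible: true; message: string; retryable: boolean };

// Only three things can happen to it. There is deliberately no timeout.
type BannerEvent =
  | { type: "action_failed"; message: string; retryable: boolean }
  | { type: "action_succeeded" }
  | { type: "dismissed" };

function bannerReducer(state: BannerState, event: BannerEvent): BannerState {
  switch (event.type) {
    case "action_failed":
      // Show (or replace) the message from the structured error response.
      return { visible: true, message: event.message, retryable: event.retryable };
    case "action_succeeded":
    case "dismissed":
      // The only two ways the banner goes away.
      return { visible: false };
    default:
      return state;
  }
}
```

Because the same reducer can back the /dashboard and /pricing placements, the two pages only differ in which button the banner is rendered under.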

The bug we found while shipping graceful degradation

After shipping the main work, we clicked "Manage Billing" on a GigaChad test account. Nothing happened. The button got stuck on "Redirecting..." with no message, no error, no way back.

First diagnosis: button state wasn't resetting on API failure. Quick fix — the button now resets if the response is non-OK or the URL is missing. Tested again. Still broken — but now the broken state was the new graceful degradation banner saying "Payment processing is temporarily unavailable."
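
That first fix boils down to one decision after the fetch returns. Here is a hedged sketch of it, assuming the portal route responds with a url field on success and an error field on failure. Those field names, PortalOutcome, and resolvePortalResponse are assumptions for illustration, not the actual client code.

```typescript
// What the "Manage Billing" button should do once the response arrives.
type PortalOutcome =
  | { kind: "redirect"; url: string }
  | { kind: "reset"; error: string | null };

// `ok` mirrors fetch's Response.ok; `body` is the parsed JSON, shape unknown.
function resolvePortalResponse(ok: boolean, body: unknown): PortalOutcome {
  const record: Record<string, unknown> =
    typeof body === "object" && body !== null ? (body as Record<string, unknown>) : {};

  const url = record["url"];
  if (ok && typeof url === "string" && url.length > 0) {
    return { kind: "redirect", url };
  }

  // Non-OK response or missing URL: reset the button instead of leaving it
  // stuck on "Redirecting...", and pass along the error code if there is one.
  const error = record["error"];
  return { kind: "reset", error: typeof error === "string" ? error : null };
}
```

The caller would redirect on "redirect" and, on "reset", restore the button label and hand the error code to the banner.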

Except Stripe wasn't down. We checked.

Real diagnosis via Supabase MCP: the test GigaChad account was tier-upgraded via admin override, so it had no stripe_customer_id. The portal route was correctly returning a 400, but our graceful degradation shim was bucketing that 400 into the same 503 message as a Stripe outage. So we were showing a Stripe outage message when the real problem was that the account had never been through a Stripe checkout at all.

The fix: distinct error codes. no_billing_account (400) is now separate from payment_unavailable (503). The first says: "No billing account on file — this usually means your account was upgraded manually. Contact support to set up billing." The second says: "Payment processing is temporarily unavailable." Same banner component, different message, different truth.
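
In code, the split might look something like the sketch below. Treat it as an approximation: classifyPortalFailure and the types are hypothetical names, the no_billing_account message is abbreviated from the banner copy above, and marking it retryable: false is an assumption on the grounds that retrying cannot create a billing account.

```typescript
type BillingErrorCode = "no_billing_account" | "payment_unavailable";

interface BillingFailure {
  status: number;
  body: { error: BillingErrorCode; message: string; retryable: boolean };
}

const BILLING_MESSAGES: Record<BillingErrorCode, string> = {
  // Abbreviated from the banner copy quoted above.
  no_billing_account: "No billing account on file. Contact support to set up billing.",
  payment_unavailable: "Payment processing is temporarily unavailable.",
};

// Returns the failure to send back, or null when the portal session is fine.
function classifyPortalFailure(
  stripeCustomerId: string | null,
  stripeCallFailed: boolean,
): BillingFailure | null {
  if (stripeCustomerId === null || stripeCustomerId === "") {
    // Our data problem, not Stripe's: the account never went through checkout.
    // Assumption: retrying cannot fix this, so retryable is false.
    return {
      status: 400,
      body: {
        error: "no_billing_account",
        message: BILLING_MESSAGES.no_billing_account,
        retryable: false,
      },
    };
  }
  if (stripeCallFailed) {
    // A genuine Stripe failure: transient, so the UI can offer a retry.
    return {
      status: 503,
      body: {
        error: "payment_unavailable",
        message: BILLING_MESSAGES.payment_unavailable,
        retryable: true,
      },
    };
  }
  return null;
}
```

Checking for the missing customer ID before the Stripe call is what keeps the 400 from being swept into the outage bucket.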

This is the lesson: graceful degradation isn't just about catching errors. It's about telling the truth about which error.

What we'd do differently

The dogfood test is non-negotiable — and Day 62 proved it. We shipped graceful degradation, then used the product as a user, and immediately found the bug. If we hadn't tested on our own account, the first paying customer who got admin-upgraded would have seen a misleading "payment unavailable" message and emailed support assuming Stripe was down. The right sequence on Day 62 was ship, dogfood, fix, then write about it — not the other way around. Discovering a bug in your own reliability work, while testing your own reliability work, is exactly how it's supposed to go.

Tomorrow returns to feature work.

Tagged

#engineering #reliability #error-handling #user-experience #saas

Written by

The founder of Ominvo

Building review management for single-location small businesses.
