May 15, 2026 · 5 min read

The 14-Day Soak: What We Monitored, What We Ignored

A "boring" cutover doesn't finish at cutover — it finishes at day 14. Here are the four alerts that page, the noise we deliberately ignore, and the cyclic patterns we wait for.

The stake: a "boring" cutover doesn't actually finish at cutover. The minute the new stack is serving real traffic, you start a soak window — a deliberate period where the old stack stays alive as a rollback path and you watch for failures that would never have surfaced in a four-minute smoke test.

Two weeks is the number we use. Long enough to catch the monthly billing run, the weekly cleanup job, the once-a-Sunday usage pattern. Short enough that "we're still soaking" doesn't become a permanent state.

This post is what we watch during that window — and just as important, what we don't page on. Half of operational maturity is knowing what signals are real and what signals are noise.

Status: this post was started two days into our current soak (2026-05-15). Final retrospective numbers land when the window closes 2026-05-27.

The four real signals

We have alerts on a lot of things. Only four of them are allowed to wake someone at 3am.

1. Sustained 5xx rate above 0.5% for 5 minutes on the API. Cold-start blips push the instantaneous 5xx rate above 0.5% routinely — a single failed request in the first second after a scale-up event is a 100% rate on a one-request denominator. The page condition is the 5-minute moving average; it and the pool check in item 2 are sketched after the list. We picked 0.5% because real customer impact only becomes visible to users somewhere above 1%; below 0.5% you're chasing tail-of-distribution noise that customers don't notice.

2. Database connection pool saturation above 70% for 5 minutes. Cloud SQL's connection limit on the tier we run is a hard ceiling. Hitting it means new requests start failing with "too many connections," surfaced as 503s. 70% gives roughly two minutes of headroom before the failures actually start. Alert on the pool, not on the failures it produces — the pool metric tells you why the failures are about to happen and gives you time to scale.

3. Twilio webhook 4xx rate above 1% over 15 minutes. Twilio retries 4xx responses on a backoff schedule. A spike that doesn't recover within 15 minutes means signature validation is failing systematically — either the secret rotated and we missed a place to update it, or the URL is mismatched, or the bodies are being mangled (see the Cloudflare-proxy post). 15 minutes absorbs Twilio's normal redelivery jitter; sustained beyond that, it's a real problem. The validation call itself is also sketched after the list.

4. Migration job failures. Every deploy runs a Cloud Run Job for migrations before the new revision takes traffic. A failure means the deploy aborts and traffic stays on the previous revision. The page exists so the on-call knows a deploy didn't go through — load-bearing because nobody else is going to check.

That's the list. Four alerts. Three on the API, one on the deploy pipeline.
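
For concreteness, here is the shape of the first two page conditions. This is a minimal sketch, assuming you can pull per-minute error/request counts and pool-utilization samples from your metrics backend; the names are invented for illustration, not our production config.

```python
from collections import deque

class MovingRateAlert:
    """Page on the moving average over a fixed window, not on the
    instantaneous rate -- one failed request on a one-request
    denominator is a 100% rate that means nothing."""

    def __init__(self, threshold: float, window_minutes: int = 5):
        self.threshold = threshold
        # one (errors, total) pair per minute; old minutes fall off
        self.window = deque(maxlen=window_minutes)

    def observe(self, errors: int, total: int) -> bool:
        """Feed one minute of counts; True means page."""
        self.window.append((errors, total))
        if len(self.window) < self.window.maxlen:
            return False  # not enough history to judge yet
        errs = sum(e for e, _ in self.window)
        reqs = sum(t for _, t in self.window)
        return reqs > 0 and errs / reqs > self.threshold


# Alert 1: sustained 5xx above 0.5% over 5 minutes.
api_5xx = MovingRateAlert(threshold=0.005)

# Alert 2: pool saturation above 70% for 5 consecutive minutes --
# same window, but every sample has to be over the line, because
# "sustained above" is a different condition than "averages above".
def pool_saturated(samples: list[float], threshold: float = 0.70) -> bool:
    return len(samples) >= 5 and all(s > threshold for s in samples[-5:])
```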

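The validation call in alert 3 is the twilio helper library's RequestValidator; everything around it in this sketch (the env-var name, the function shape) is our framing, not Twilio's API.

```python
import os
from twilio.request_validator import RequestValidator

validator = RequestValidator(os.environ["TWILIO_AUTH_TOKEN"])

def twilio_request_is_valid(url: str, params: dict, signature: str) -> bool:
    """Recompute Twilio's HMAC over the public URL plus the POST form
    params and compare it to the X-Twilio-Signature header.

    This fails systematically, not transiently, when the auth token
    was rotated in one place but not the other, when the configured
    URL doesn't match the one Twilio actually hit, or when a proxy
    re-encodes the body in flight -- exactly the failure modes
    alert 3 exists to catch. Return a 403 on False; that 403 is
    what the 4xx-rate alert counts.
    """
    return validator.validate(url, params, signature)
```
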
The signals we deliberately ignore

These all fire and we route them to a "look at it in business hours" channel, not to anyone's phone.

  • Cold-start p95 latency. Cloud Run's autoscaler spins up new instances under load; first requests take 2–5 seconds. Not actionable — the response to "Cloud Run cold-started a new instance" is "good, that's what it's supposed to do." We watch median and p90 instead and let p95 be noisy.
  • Cloud SQL replica lag. Tens of milliseconds normally, occasional second-or-two spikes during heavy writes. We don't have user-facing read-after-write inconsistency, so it's only relevant if it sustains for minutes.
  • 404s on /wp-admin, /.env, /phpmyadmin. Internet-wide scanners. Cloud Armor ranks them; we look at aggregates weekly.
  • Login attempt rate limits. The per-IP login throttle (30/min) trips every few hours on what looks like an aggressive automated tester or a confused customer; the shape of the throttle is sketched below. Real credential stuffing would show up as a sustained burst.
  • Inbound mail processor backpressure. Single instance by design. Queue depth is a dashboard metric, not a pager — unless it's been backed up for more than an hour.
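
The throttle itself is nothing exotic. A minimal sketch of the shape, assuming an in-process sliding window; a real deployment would keep this state in Redis or at the load balancer, and every name here is illustrative.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_ATTEMPTS = 30  # the 30/min per-IP line from the bullet above

_attempts: dict[str, deque] = defaultdict(deque)

def allow_login_attempt(ip: str) -> bool:
    """Sliding-window throttle: at most MAX_ATTEMPTS per
    WINDOW_SECONDS per source IP. A single trip is a business-hours
    log line; the page-worthy signal would be many IPs tripping
    this in a sustained burst."""
    now = time.time()
    q = _attempts[ip]
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()  # drop attempts that aged out of the window
    if len(q) >= MAX_ATTEMPTS:
        return False  # throttled: reject the attempt, log, move on
    q.append(now)
    return True
```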

What changes between cutover day and day 14

The soak is not a fixed window of constant attention. The intensity tapers.

  • Days 1–2. Primary on-call in front of the dashboard during business hours. Smoke tests on a 15-minute cadence (a runnable sketch follows this list). Every alert is a real-person event. We're looking for cutover-specific failures — missed columns, wrong credentials, mispointed webhooks.
  • Days 3–7. Alert intensity drops. Cutover-specific failures, if any, have surfaced. We're now looking for operational drift — does the application behave the same way on day 5 as it did on day 1? The classic failure is a slow memory leak that takes three days to manifest as elevated p99.
  • Days 8–14. Now we're looking for cyclic patterns. Monthly billing runs in this window. The weekly cleanup job runs twice. A once-a-Sunday usage spike happens twice. Anything that fails tells us we missed something in the migration.
  • Day 14. Soak window closes. If no critical alerts fired and the cyclic jobs all succeeded, we proceed to decommission. Anything ambiguous, we extend by a week.
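
The smoke cadence from days 1–2 is deliberately dumb: a script on a scheduler. A minimal sketch, with hypothetical endpoints standing in for the handful of paths that would break first if the cutover missed something.

```python
import sys
import urllib.request

# Hypothetical endpoints -- substitute your own cutover-critical paths.
CHECKS = [
    ("api health", "https://api.example.com/healthz"),
    ("login page", "https://app.example.com/login"),
    ("webhook ping", "https://api.example.com/webhooks/twilio/ping"),
]

def smoke() -> int:
    """Hit each endpoint; return the number of failures."""
    failures = 0
    for name, url in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                ok = resp.status == 200
        except Exception:
            ok = False  # connection errors and non-2xx both count as FAIL
        print(f"{'PASS' if ok else 'FAIL'}  {name}  {url}")
        failures += 0 if ok else 1
    return failures

if __name__ == "__main__":
    # Run from cron or Cloud Scheduler every 15 minutes during days
    # 1-2; a nonzero exit is a real-person event in that window.
    sys.exit(smoke())
```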

What we learned this time around

Update pending — soak window closes 2026-05-27. Two days in, the only signals so far are the expected ones: cold-start spikes, scanner traffic, and two Cloud Armor rules that false-positived in preview mode on staging and got tuned out (good — that's why preview mode exists). No data integrity issues. No signature failures. No 5xx spikes. The boring outcome is the right outcome.

What this costs you if you skip it

You'll cut over, declare victory at the end of the day, and find the broken monthly billing job three weeks later when a customer asks why their invoice is wrong. The soak is the difference between "we shipped" and "we shipped and we know it works."

What I tell teams about their own soak

  1. Pick four alerts that page. Not forty. Every additional pager makes the four real ones quieter. The hard part of alert design is saying no to "what if we also paged on…" — the answer is almost always "don't."
  2. Keep the old stack alive for the entire soak window. Rollback is the cheapest insurance. Decommission is a separate, deliberate step, not a thing that "happens automatically once cutover succeeds."
  3. Plan to update the runbook based on what the soak surfaces. The point of the soak is to learn what your application does in production that it didn't do in staging. Every surprise is a runbook addition.

Run the audit → /audit-checklist — find the failure modes before your soak does.

Next in the series: The 1-Hour Audit: How I Walk a Contractor-Built Stack on Day One — the hour I sell to every founder who calls about an inherited application.
