Migrating Production Data: A 4-Minute Cutover That Took 14 Days to Plan
The cutover took four minutes. The plan that made four minutes safe took fourteen days. Here is the runbook, the rollback path, and what I would not skip again.
The cutover took under four minutes for the data move and twenty for everything else. The plan took fourteen days.
Fourteen days that included two dry runs against a real production snapshot, four code blockers found and fixed, three architectural decisions reversed, and one moment where I wrote "if the sample-decrypt fails, abort the cutover" into the runbook, because we'd discovered that a single wrong character in a secret would have shown ciphertext to every user. The point of the fourteen days was to make the four minutes boring.
The setup
The application was a contractor-built CRM running on Cloud Run, fronted by a Hostinger VPS the contractor still owned, backed by a Cloud SQL instance on a public IP. Lead data, recordings, and inbound mail all flowed through the same single-instance Node API. Roughly 150 MiB of database, six Twilio phone numbers, ten thousand leads, twenty users.
The brief: move all of it to in-house infrastructure inside the same GCP project, side-by-side with the contractor's stack. Cut over via blue-green. Maintain user sessions if possible. Don't let the contractor know we're cutting over until after it's done. Two weeks.
Days 1–11: the planning
Most of the work happened before anyone touched a database. The output of the planning phase was:
- A Terraform module set that stood up: a private-IP Cloud SQL instance, a serverless VPC connector, two Cloud Run services with internal-load-balancer-only ingress, a managed Google Load Balancer with a managed cert, Cloud Armor on the LB, and a Secret Manager populated by an out-of-band script (sketched after this list).
- A migration script that ran as a Cloud Run Job (not on first request — see post #4 for why that pattern is a liability).
- A cutover runbook with twelve numbered steps, four of which had a hard time budget and a defined abort condition.
- Two dry runs.
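As a concrete sketch, the out-of-band secret script from the first bullet can be as small as the loop below. The project and secret names are placeholders, not the real ones, and read -s keeps each value out of shell history and off disk.

```bash
#!/usr/bin/env bash
# Out-of-band Secret Manager population (sketch; names are placeholders).
set -euo pipefail

PROJECT="my-gcp-project"   # placeholder

for SECRET in ENCRYPTION_KEY TWILIO_AUTH_TOKEN SMTP_PASSWORD; do
  # Create the secret container if it doesn't exist yet; ignore if it does.
  gcloud secrets create "$SECRET" --project="$PROJECT" \
    --replication-policy="automatic" 2>/dev/null || true

  # Read the value silently so it never lands in history or a file.
  read -r -s -p "Value for $SECRET: " VALUE; echo
  printf '%s' "$VALUE" | gcloud secrets versions add "$SECRET" \
    --project="$PROJECT" --data-file=-
done
```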
The dry runs are the part teams skip. They are also the part that catches everything that would have caused the real cutover to fail.
Dry run #1: four blockers in 4 minutes 56 seconds
We exported the live contractor production database into a GCS bucket with gcloud sql export sql --offload. The --offload flag runs the export from a temporary read replica, so live traffic sees zero impact during the export. The export ran in 2:57. We imported it into a fresh staging Cloud SQL instance in 0:56. Total data move time: 3:53.
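The shape of those two commands, with placeholder instance, bucket, and database names:

```bash
DUMP="gs://migration-dumps/optileads-$(date +%Y%m%d-%H%M).sql.gz"

# Export via a temporary read replica (--offload), so the primary keeps
# serving traffic undisturbed.
gcloud sql export sql contractor-prod "$DUMP" --database=Optileads --offload

# Import into the fresh staging instance.
gcloud sql import sql staging-sql "$DUMP" --database=Optileads
```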
Then everything else broke.
Blocker 1: the --database flag on gcloud sql import sql is ignored when the dump contains USE Optileads. The data always landed in whatever database the dump's USE statement named, regardless of the flag. Workable for the real cutover, but it changed the import pre-flight: we now had to drop and recreate the target database before each import, not just truncate the tables.
Blocker 2: migration number collision. Our codebase forked from the contractor's at migration 045, but the contractor kept shipping past us, and our own work added migrations numbered 046 and 047. Against the imported production schema, the migration runner treated our 046 and 047 as pending and tried to run them on top of the contractor's colliding 046 and 047, which had already been applied. We renumbered ours to 055 and 056. Full story in post #7.
Blocker 3: sync({alter:false}) is not a no-op. The startup sync tried to shrink a column that the contractor had widened in production, and threw Data too long for column 'referenceId' at row 150. Full write-up in post #4.
Blocker 4: collation mismatch. Source database used utf8mb4_0900_ai_ci. Target's default was utf8mb4_unicode_ci. Fix: drop and recreate the destination database with the correct collation before each import.
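The pre-flight that addresses blockers 1 and 4 in one stroke is roughly two gcloud calls; instance and database names are placeholders:

```bash
# Drop and recreate the target database so it is empty AND carries the
# source's collation before every import.
gcloud sql databases delete Optileads --instance=staging-sql --quiet
gcloud sql databases create Optileads --instance=staging-sql \
  --charset=utf8mb4 --collation=utf8mb4_0900_ai_ci
```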
We landed four fixes in one commit, deployed them, and ran the dry run again.
Dry run #2: the encryption key gotcha
The second dry run got further. The data imported. The migrations ran. The application started. Then a tester opened the leads page and saw a thousand rows of {"v":1,"iv":"...","ct":"..."} instead of email addresses.
The application encrypts PII at the application layer using a secret stored in Cloud Run env vars on the contractor's stack. The new stack had a placeholder value in Secret Manager — fresh per the original plan to rotate everything at cutover. Without the matching encryption key, every encrypted column rendered as ciphertext, and the dedup migration we'd written was about to silently hash ciphertext blobs.
This is the post #6 story. The TL;DR is that we reversed the decision to rotate the encryption key at cutover. We inherited the contractor's key, accepted the standing exposure for a few weeks, and added a Track-2 work item to re-encrypt every column with a fresh key once the cutover stabilized.
We also added a single line to migration 055 that decrypts the first encrypted row using the strict decrypt and throws a clear, actionable error if the key is wrong. That line is the reason the real cutover did not blow up two minutes after the import.
The real cutover: under 30 minutes, end to end
Step 1 — Pre-flight verification (1 min). Confirmed prod state, captured schema_migrations row count from the source, listed destination service accounts.
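In spirit, the pre-flight was read-only checks like the following; instance, project, and credential names are placeholders, and the watermark query assumes the source's public IP is reachable:

```bash
# Read-only pre-flight; nothing here mutates state.
gcloud sql instances describe contractor-prod --format='value(state)'

# Source migration watermark. MYSQL_PWD keeps the password off the
# command line that `ps` would show.
MYSQL_PWD="$SRC_DB_PASS" mysql --host="$SRC_DB_IP" --user="$SRC_DB_USER" \
  Optileads -e 'SELECT COUNT(*) FROM schema_migrations;'

gcloud iam service-accounts list --project=my-gcp-project
```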
Step 2 — Set the encryption key (~5 seconds). A single pipeline read the source's ENCRYPTION_KEY from its Cloud Run env and piped the value directly into gcloud secrets versions add --data-file=- on the destination. Never echoed, never written to a file.
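Roughly, assuming the key sits as a plain env var on the source service (service and secret names are placeholders):

```bash
# Copy ENCRYPTION_KEY from the source Cloud Run service straight into the
# destination secret. The value is never echoed and never touches disk.
gcloud run services describe contractor-api \
  --region=us-central1 --format=json \
| jq -r '.spec.template.spec.containers[0].env[]
         | select(.name=="ENCRYPTION_KEY") | .value' \
| tr -d '\n' \
| gcloud secrets versions add ENCRYPTION_KEY --data-file=-
```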
Step 3 — Drop and recreate the destination database with the correct collation (~5 seconds).
Step 4 — Export from live production with --offload (2:57). Zero impact on the contractor's live users. 155 MiB of dump in a private GCS bucket.
Step 5 — Import into the new production database (54 seconds).
Step 6 — Grant the application database user, then run migrations (~3 minutes). Migration 055 backfilled 1020 leads with dedup hashes. Migration 056 added the timezone column. The contractor's 046–054 matched their recorded names and were skipped. The sample-decrypt assertion passed silently.
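The grant itself was a one-line SQL statement over an admin connection. The migration run is the Cloud Run Job pattern from the planning list, and executing it synchronously is one command (job name and region are placeholders):

```bash
# Run the migration job to completion; --wait surfaces a non-zero exit
# if any migration, including the sample-decrypt assertion, fails.
gcloud run jobs execute migrate-db --region=us-central1 --wait
```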
Step 7 — Force an API revision restart to pick up the new encryption-key version (~30 seconds).
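One way to force that revision is bumping a throwaway env var; the service name and variable are placeholders, and this assumes the secret is referenced at its "latest" version:

```bash
# A secret env var pinned to "latest" is resolved when an instance starts,
# so running instances keep the old value. A new revision replaces them all.
gcloud run services update inhouse-api --region=us-central1 \
  --update-env-vars=RESTARTED_AT="$(date +%s)"
```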
Step 8 — Smoke test (~5 seconds). curl /health returned 200. A lead's email rendered as a real email, not ciphertext.
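Sketched, with placeholder URL, route, and response shape:

```bash
# Health check, then one decrypted field through the real API.
curl -fsS https://app.example.com/health
curl -fsS -H "Authorization: Bearer $TOKEN" \
  "https://app.example.com/api/leads?limit=1" | jq -r '.[0].email'
```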
Step 9 — Verify the admin user (~2 minutes). Planned as a seeding step, but our admin user pre-existed in the imported data, and login worked immediately because bcrypt hashes import unchanged.
Step 10 — Populate Twilio, SMTP, IMAP secrets (~30 seconds). A single script read nine credentials from the source env and piped each into its destination Secret Manager secret. No values printed.
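The same env-to-secret pipe as step 2, looped. The nine names below are placeholders for ours:

```bash
SRC_ENV=$(gcloud run services describe contractor-api \
  --region=us-central1 --format=json)

for NAME in TWILIO_ACCOUNT_SID TWILIO_AUTH_TOKEN SMTP_HOST SMTP_USER \
            SMTP_PASSWORD IMAP_HOST IMAP_USER IMAP_PASSWORD IMAP_PORT; do
  # Extract one env value and pipe it straight into Secret Manager.
  printf '%s' "$SRC_ENV" \
  | jq -r --arg n "$NAME" \
      '.spec.template.spec.containers[0].env[] | select(.name==$n) | .value' \
  | tr -d '\n' \
  | gcloud secrets versions add "$NAME" --data-file=-
done
```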
Step 11 — Repoint six Twilio numbers via REST API (~5 seconds). Not the Console. A loop over the IncomingPhoneNumbers.json endpoint. Full story in post #9.
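A sketch of that loop; the SIDs, webhook paths, and base URL are placeholders:

```bash
ACCOUNT_SID="ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
AUTH_TOKEN="..."                  # read from Secret Manager in the real script
NEW_BASE="https://api.example.com"

for PN_SID in PN1111 PN2222 PN3333 PN4444 PN5555 PN6666; do
  # Update one number's voice and SMS webhooks; print status code + SID.
  curl -s -X POST \
    "https://api.twilio.com/2010-04-01/Accounts/$ACCOUNT_SID/IncomingPhoneNumbers/$PN_SID.json" \
    -u "$ACCOUNT_SID:$AUTH_TOKEN" \
    --data-urlencode "VoiceUrl=$NEW_BASE/twilio/voice" \
    --data-urlencode "SmsUrl=$NEW_BASE/twilio/sms" \
    -o /dev/null -w "%{http_code} $PN_SID\n"
done
```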
Step 12 — Force another API revision restart to pick up the Twilio secrets (~30 seconds).
Total elapsed: under 30 minutes. The data move portion was about four minutes.
What the four minutes hides
The four-minute number is the one that goes on a slide. What it actually represents:
- fourteen days of planning,
- two dry runs that surfaced five blockers between them,
- four code fixes that landed in advance, not during,
- a runbook whose riskiest steps carried hard time budgets and explicit abort conditions,
- a one-line sample-decrypt assertion in a migration that prevented a silent corruption,
- a decision made under pressure during dry run #2 to inherit a credential we'd planned to rotate.
Boring cutovers are expensive to produce. They are also the only kind worth selling to a customer who has been told three times that "we're modernizing the platform."
Next in the series: the encryption-key gotcha — the one that, on a slightly less paranoid team, would have shown a thousand rows of ciphertext to every user the moment we cut over.
Run the audit on your own stack
A 30-question self-audit. P0/P1/P2 severity. Takes about an hour.
Open the checklist →