May 10, 2026 · 5 min read

The Encryption Key Gotcha That Would Have Shown Ciphertext to Every User

A single misconfigured encryption key would have shown ciphertext to every user the moment we cut over. Here is how I caught it, and the test that catches it next time.


Mid-afternoon, dry run number two. Application booted. Migrations ran. Login screen loaded. I signed in as a test user, clicked through to the leads page, and saw — in every row, in every column that was supposed to hold a name or an email — a wall of base64.

The data wasn't gone. The application was simply decrypting with the wrong key. Had the cutover not been aborted, that screen would have been the customer's first impression of "the new platform."

What the application was doing

The contractor's application encrypts PII at the application layer. Email, phone, first name, last name — anything that could be a customer identifier — gets pushed through an AES-GCM encrypt function on write and a matching decrypt on read. The encryption key is a literal string, stored in a Cloud Run environment variable, named ENCRYPTION_KEY.
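
For concreteness, here is a minimal sketch of what that encrypt/decrypt pair plausibly looks like, assuming Node's built-in crypto module. The {v, iv, ct} blob shape matches the ciphertext shown later in this post; the key encoding and function names are my assumptions, not the contractor's actual code.

import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Assumption: a base64-encoded 32-byte key for AES-256-GCM. The post only
// says the key is a literal string in ENCRYPTION_KEY.
const KEY = Buffer.from(process.env.ENCRYPTION_KEY ?? "", "base64");

function encrypt(plaintext: string): string {
  const iv = randomBytes(12); // fresh random IV on every write
  const cipher = createCipheriv("aes-256-gcm", KEY, iv);
  // Arguments evaluate left to right, so getAuthTag() runs after final().
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final(), cipher.getAuthTag()]);
  return JSON.stringify({ v: 1, iv: iv.toString("base64"), ct: ct.toString("base64") });
}

function decrypt(blob: string): string {
  const { iv, ct } = JSON.parse(blob);
  const raw = Buffer.from(ct, "base64");
  const decipher = createDecipheriv("aes-256-gcm", KEY, Buffer.from(iv, "base64"));
  decipher.setAuthTag(raw.subarray(raw.length - 16)); // GCM auth tag: last 16 bytes
  // final() throws if the key or auth tag does not match the data.
  return Buffer.concat([decipher.update(raw.subarray(0, raw.length - 16)), decipher.final()]).toString("utf8");
}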

The key has been the same value for the entire history of the application. Every encrypted row in production was encrypted with that single key.

On the new stack, we'd taken the design decision early on (DD-002 in the architecture log) to rotate every credential at cutover. Fresh JWT secret, fresh database password, fresh webhook keys, fresh everything. Including the encryption key.

The Secret Manager entry on the new stack, optilead-prod-encryption-key, had been populated with a freshly-generated value during the Terraform run. The application would happily start with it. The application would happily encrypt new writes with it. The application would also try to decrypt existing data with it and produce gibberish, because the data was encrypted under a different key.

Without a re-encryption pass over every column first, the freshly-rotated key was guaranteed to corrupt the application's view of every row in the database.

How the corruption presents

The first failure mode is the one I caught: ciphertext rendering in the UI. The decrypt function was wrapped in a safeDecrypt helper that returned the ciphertext untouched if decryption failed. From the application's perspective, every encrypted column "decrypted" successfully — to the raw JSON ciphertext blob itself. The UI dutifully rendered the blob as a string.
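
A sketch of that behavior, reusing the decrypt from the sketch above. The helper's name comes from the application; the body is my assumption:

function safeDecrypt(blob: string): string {
  try {
    return decrypt(blob);
  } catch {
    return blob; // wrong key: the caller silently gets raw ciphertext back
  }
}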

That's the loud failure. There's a quiet one underneath.

The same application also has a deduplication system that hashes the decrypted email address — that is, the string you get after running the decrypt function — to produce an emailHash column that uniqueness constraints can target. We had a migration scheduled to run on cutover that would backfill emailHash for every existing row. That migration would have hashed the ciphertext blob ({"v":1,"iv":"...","ct":"..."}) instead of the underlying email, because safeDecrypt returns the input on failure. And because every blob carries a random IV, even two rows holding the same email would have hashed to different values. The dedup logic would have silently treated every lead as unique, including duplicates.
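
Here is the shape of that failure as a sketch. The emailHash column and safeDecrypt come from the application; the choice of SHA-256 and the normalization are assumptions:

import { createHash } from "node:crypto";

// Assumed hash for the dedup uniqueness constraint.
const emailHash = (s: string): string =>
  createHash("sha256").update(s.trim().toLowerCase()).digest("hex");

// Intended: duplicates collide, because the same plaintext hashes the same.
// Under the wrong key, safeDecrypt hands back the ciphertext blob, and the
// random IV inside it makes every row's hash unique.
function backfillHash(encryptedEmail: string): string {
  return emailHash(safeDecrypt(encryptedEmail));
}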

The loud failure shows up in two seconds. The quiet failure shows up the first time someone tries to import a lead that was already in the database, and discovers that the deduplication that's been running for a year is not, in fact, running.

The decision we made

The original plan was to rotate every credential at cutover. The dry-run findings forced us to revisit that plan for one specific credential: the encryption key.

The relevant trade-off is what rotating this credential breaks versus what inheriting it keeps alive:

  • Rotating breaks the data. A new key plus old ciphertext equals garbage. The fix is a re-encryption pass over every column (sketched after this list), which requires either downtime or row-level locking, takes 10–30 minutes for a database with a million encrypted rows, and is operationally non-trivial to do during a cutover window.
  • Inheriting keeps the data working. The new stack uses the contractor's existing key value. Every existing row decrypts. Every new write encrypts with the same key.

The cost of inheriting is that the leaked key — known to anyone who'd ever cloned the contractor's repo — remained the live encryption key on the new stack for as long as it took to do the re-encryption properly. Two to four weeks.

We took the cost. The reasoning, written into the design decision log: until contractor IAM is revoked, rotating our key doesn't reduce attack surface because the same data is plaintext-accessible on old prod.

Which is to say: the same data the leaked key protects on the new stack is sitting unencrypted in the contractor's running application until we decommission it. Rotating the key on the new stack first costs us a cutover failure and reduces the actual risk by approximately zero.

How we set the inherited key without echoing it

The naive approach — read the contractor's env, copy the value, paste it into the destination's Secret Manager — leaks the value into shell history, into your terminal scrollback, into the macOS clipboard history, and onto whatever Slack thread you forwarded it to "for the runbook."

The right approach is to pipe the value via stdin and never let it touch a string variable you can inspect:

# Read the contractor's live service config, extract ENCRYPTION_KEY, and
# write it straight into the new stack's Secret Manager, pipe to pipe.
gcloud run services describe optileads-api \
  --region=us-central1 --format=json \
| python3 -c "
import json, sys
svc = json.load(sys.stdin)
for c in svc['spec']['template']['spec']['containers']:
    for e in c.get('env', []):
        if e['name'] == 'ENCRYPTION_KEY':
            sys.stdout.write(e['value'])
            break
" \
| gcloud secrets versions add optilead-prod-encryption-key --data-file=-

The value flows through a pipe. It never lands in a variable, never echoes, never ends up in ~/.bash_history.

The single-line check that saved the cutover

The other half of the fix was a defensive assertion in migration 055 — the dedup-backfill migration that would otherwise have silently hashed ciphertext.

const { Op } = require('sequelize');  // the snippet assumes Sequelize's Op
// Sample one encrypted row; prove the configured key can decrypt it.
const firstRow = await Lead.findOne({ where: { email: { [Op.ne]: null } } });
const decrypted = strictDecrypt(firstRow.email);  // throws on failure

strictDecrypt, unlike safeDecrypt, throws if decryption fails. If the destination's encryption key doesn't match the data, the migration throws within milliseconds and the deploy aborts.
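
The entire difference between the two helpers is whether the failure is swallowed. Against the same assumed decrypt as earlier:

function strictDecrypt(blob: string): string {
  return decrypt(blob); // a key or auth-tag mismatch throws immediately
}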

What this turns into, in practice, is a tripwire:

Error: encryption-key mismatch — sample decrypt of lead id=1 failed.
Set optilead-prod-encryption-key to source value before re-running.

Loud, actionable, recoverable.

What I tell teams looking at their own cutover

  1. Find every application-layer secret that the data depends on, before you plan the rotation. Encryption keys, signing keys, anything where rotating the credential invalidates existing artifacts.
  2. Write a tripwire. The single-line strictDecrypt assertion is what would have caught the mistake if I hadn't caught it manually. The tripwire's cost is one line.
  3. Pipe inherited secrets via stdin, not via clipboard. The shell-history leak is the kind of thing your SOC 2 auditor doesn't catch and your incident response team does, six months later.

Next in the series: renumbering migrations so your fork doesn't collide with the contractor's — the quieter version of the same problem.



Run the audit on your own stack

A 30-question self-audit. P0/P1/P2 severity. Takes about an hour.

Open the checklist →