Building a CI/CD Safety Net for a Financial Platform

Context

I’m the sole frontend engineer on a Next.js 15 financial platform (BRS Workspace) that handles settlement processing, bonus adjustments, and payment approvals for PM International Korea. Real money flows through this app.

When I inherited the codebase, the build had two scary flags turned on:

// next.config.js: the state of things when I started
module.exports = {
  typescript: { ignoreBuildErrors: true },  // TypeScript errors? Ship it anyway.
  eslint: { ignoreDuringBuilds: true },     // ESLint errors? Also ship it.
};

And CI? Every PR check ran with continue-on-error: true, so a failing check could never block a merge. The CI/CD pipeline had zero test steps. Code flowed from my machine to production AKS pods with no automated verification.

For a financial app. Yeah.

What I Built

A 4-stage safety net across 6 epics, 31 stories, 21 days:

Pre-commit → PR Checks → CI/CD Pipeline → Post-deploy Health

The numbers:

610 files changed
+39,337 / -19,921 lines
14 PRs merged
313 TypeScript errors fixed
0 production incidents

The Parts I’m Most Proud Of

Fixing 313 Type Errors in One Shot (Epic 5)

This was the mountain I’d been climbing toward since day one. The entire project existed so I could eventually flip ignoreBuildErrors off. But you can’t just remove it — 313 errors scream at you.

I broke it into surgical sub-stories: MUI palette augmentation, Shadcn cleanup, React 19 JSX namespace migration, and then the big one: 313 fixes replacing unknown with proper types across 86 files. Each sub-story was its own PR so reviewers (and future-me) could follow the logic.
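To give a feel for what an unknown-to-proper-type fix looks like, here is a minimal sketch. SettlementRow, isSettlementRow, and totalAmount are made-up names for illustration, not the app's real types; the pattern is a runtime type guard that lets the compiler narrow unknown instead of silencing it with a cast.

```typescript
// Hypothetical shape for illustration; the real app's types are not shown here.
interface SettlementRow {
  id: string;
  amount: number;
}

// Type guard: checks the runtime shape so the compiler can narrow `unknown`.
function isSettlementRow(value: unknown): value is SettlementRow {
  if (typeof value !== "object" || value === null) return false;
  const row = value as Record<string, unknown>;
  return typeof row.id === "string" && typeof row.amount === "number";
}

// Before the fix, `data` was `unknown`, so reading `data.amount` was a compile error.
// After the fix, the guard filters out malformed entries and narrows the rest.
function totalAmount(data: unknown): number {
  if (!Array.isArray(data)) return 0;
  let sum = 0;
  for (const entry of data) {
    if (isSettlementRow(entry)) sum += entry.amount;
  }
  return sum;
}
```

Whether a malformed row should be skipped or rejected outright is a design choice. For money-handling code, throwing on a bad row is probably the safer default than summing around it.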

The moment npm run build passed clean with both flags removed? Genuinely one of the best feelings I’ve had as a developer.

The E2E Smoke Test Architecture (Epic 3)

Building the Playwright infrastructure from scratch was deeply satisfying. Keycloak admin API for test user tokens, dual-mode global setup (CI vs local), standalone Next.js server in GitHub Actions — every piece had to fit together perfectly.

The menu-route-validation.spec.ts test is my favorite: it intercepts the real backend menu API response and verifies every menuViewPath resolves to an actual frontend route. It catches the exact class of bug where backend and frontend path conventions diverge — which had already caused a production 404 that nobody noticed.
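The core check can be sketched as a pure function, separate from Playwright. Everything here except menuViewPath is an assumption: the MenuItem shape, the nested children field, and the helper names are hypothetical, and the route list is assumed to use Next.js-style patterns such as /settlements/[id]. Catch-all segments and route groups are not handled.

```typescript
// Hypothetical menu item shape; only `menuViewPath` comes from the real API.
interface MenuItem {
  menuViewPath: string;
  children?: MenuItem[];
}

// Flatten a (possibly nested) menu tree into a flat list of paths.
function collectPaths(items: MenuItem[]): string[] {
  const paths: string[] = [];
  for (const item of items) {
    paths.push(item.menuViewPath);
    if (item.children) paths.push(...collectPaths(item.children));
  }
  return paths;
}

// Turn a route pattern like "/settlements/[id]" into a RegExp.
// A "[param]" segment matches exactly one path segment; other segments
// are escaped and matched literally.
function routeToRegExp(route: string): RegExp {
  const source = route
    .split("/")
    .map((segment) =>
      /^\[[^\]]+\]$/.test(segment)
        ? "[^/]+"
        : segment.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"),
    )
    .join("/");
  return new RegExp("^" + source + "/?$");
}

// Return every menu path that no known frontend route can serve.
function findUnresolvedPaths(menu: MenuItem[], routes: string[]): string[] {
  const matchers = routes.map(routeToRegExp);
  return collectPaths(menu).filter(
    (path) => !matchers.some((matcher) => matcher.test(path)),
  );
}
```

In a Playwright spec, the menu could come from the intercepted response (for example via page.waitForResponse and response.json()), and the test would assert that findUnresolvedPaths returns an empty array. Keeping the comparison pure means it can be unit-tested without a browser.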

Knowing When NOT to Build (Story 4-2)

I’m proud of the ArgoCD PostSync hook decision. The story asked: “Should we add deployment-time alerting?” I investigated, found zero notification infrastructure in the cluster (no ArgoCD Notifications controller, no Slack channel, no Teams webhook), and closed the story as unnecessary.

It’s easy to build things. It’s harder to say “this adds complexity for zero benefit right now” and close the ticket.

Where I Struggled

Visual Regression Was Painful

Getting Playwright screenshot comparisons working in CI took way more iterations than I expected. The standalone Next.js server needed specific env vars (HOSTNAME, PORT), Keycloak auth vars had to be passed to the build step AND the test step separately, and Linux renders fonts differently than macOS — so baselines had to be generated in CI, not locally.

I ended up writing an auto-update workflow that regenerates baselines when visual tests fail. Not glamorous, but it works.

The Silent Skip That Haunted Me

menu-route-validation.spec.ts was silently skipping in CI for weeks. The test caught errors gracefully (as designed), but “gracefully” meant calling test.skip(), and a skipped test doesn’t fail the run, so the job stayed green in GitHub Actions. The root cause? NEXT_PUBLIC_API_URL_AUTHZ wasn’t in the Playwright runner’s env block.

This one hurt because the test was specifically designed to catch a bug class we’d already been bitten by. It took 3 PRs to fix properly: env var injection, switching from direct fetch() to intercepting the app’s own auth flow, and handling hydration timing.

Lesson learned: a test that silently skips is worse than no test at all.
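One way to make that failure loud is a tiny guard that throws when a required variable is missing, rather than skipping. This is a sketch, not the exact fix from those 3 PRs: requireEnv is a hypothetical helper, and the env object is passed in explicitly so the function stays easy to test.

```typescript
// Hypothetical helper: throw instead of skipping, so a missing variable
// turns the CI job red rather than leaving it quietly green.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(
      "Missing required env var " + name + "; refusing to skip silently.",
    );
  }
  return value;
}

// Intended use at the top of a spec file (assumes a Node environment):
//   const authzUrl = requireEnv("NEXT_PUBLIC_API_URL_AUTHZ", process.env);
```

Calling this before any test body runs means a misconfigured runner fails fast, with the variable name in the error message.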

Being the Only Person

There’s no sugarcoating this. Being the sole engineer on a financial platform means every decision is yours — architecture, implementation, review, deployment. There’s no one to catch your blind spots.

I leaned heavily on AI-assisted development (Claude Code with the BMAD workflow framework) to simulate the multi-person process: separate planning, implementation, and code review phases with fresh context each time. It’s not the same as a real team, but it forced me to articulate decisions instead of just making them.

Key Takeaways

Build the pipeline before you need it. Every hour spent on CI/CD infrastructure saved me from shipping broken builds to production.
Phase your work ruthlessly. I could’ve tried to fix everything at once. Instead, Epic 0 was just measurement. Epic 1 was just making checks blocking. By the time I hit Epic 5, I had the safety net to make big changes confidently.
Tests that skip silently are technical debt. If a test can’t run, it should fail loudly, not pretend everything’s fine.
Solo doesn’t mean sloppy. One person can build production-grade infrastructure if they’re disciplined about process.
Know when to close a ticket. Not every planned feature deserves to exist. Sometimes the right answer is “not now.”

What’s Next

The BRS-TEST-AUTO project is done. All 6 epics complete. The build is clean, CI is enforced, and there’s a real safety net.

The codebase still has 5 tech debt items flagged during code review (race conditions in dialog handlers, z.any() in a Zod schema, duplicated permission paths). Those are for another day.

For now, I’m going to enjoy the clean build. npm run build passes. No flags. No bypasses. Just working code.

Update: I’ve since published a detailed, step-by-step playbook on how I implemented the testing strategy discussed here. Read it here: A Playbook for Retrofitting Test Automation into a Legacy Next.js Codebase.

Update 2: I’ve since optimized the pipeline’s Docker build from 5-8 minutes to under 2 minutes. The root cause was a shared cache tag across environments. Read it here: From 8 Minutes to 90 Seconds.

Built with Next.js 15, React 19, Playwright, GitHub Actions, and an unreasonable amount of YAML.