From 8 Minutes to 90 Seconds — Docker Build Pipeline Optimization on GitHub Actions

In a previous post, I built the CI/CD safety net for BRS Workspace — a Next.js financial platform deployed on Azure AKS. That gave me confidence nothing ships without passing tests.

But confidence has a cost: the pipeline took 7-10 minutes per push. The Docker build job alone consumed 5-8 minutes, more than the tests, linting, and deployment steps combined. For a solo developer pushing 3-5 times a day, that’s roughly 20-50 minutes of waiting (3-5 pushes at 7-10 minutes each). Every day.

This post is about finding the root cause and fixing it with four small YAML changes.

The Mystery: Why Was Cache Hit Rate So Inconsistent?

The pipeline already had caching. Dual-layer, in fact:

cache-from: |
    type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache
    type=gha,scope=build-${{ github.ref_name }}
cache-to: |
    type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache,mode=max
    type=gha,mode=max,scope=build-${{ github.ref_name }}

Two cache backends. mode=max storing all intermediate layers. This should be fast. And sometimes it was — I had one run that completed the build step in 42 seconds. Perfect cache hit. But the next run on the same branch would take 6 minutes. Then 7 minutes. Then 42 seconds again.

The inconsistency was the clue.

Root Cause: One Cache Tag, Four Environments

Look at the registry cache line:

type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache

One tag: buildcache. Shared across every branch that triggers CI: brs-dev, brs-prd, qa, demo.

Now look at the build args:

build-args: |
    NEXT_PUBLIC_API_URL=${{ secrets[format('{0}_API_URL', env_prefix)] }}
    NEXT_PUBLIC_KEYCLOAK_URL=${{ secrets[format('{0}_KEYCLOAK_URL', env_prefix)] }}
    # ... 14 NEXT_PUBLIC_* variables total

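Each environment has different secrets. DEV_API_URL points to dev-api.pmi-korea.com, PROD_API_URL points to api.pmi-korea.com. These values are baked into the Docker image at build time because Next.js inlines NEXT_PUBLIC_* variables during compilation. They also feed the cache key: when a build arg's value changes, the cache is invalidated from the first build step that uses it.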
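A minimal sketch of how such a per-environment lookup can resolve. ENV_PREFIX and the branch-to-prefix mapping below are illustrative assumptions, not the real workflow; only the secret-name pattern and the branch names come from the snippets above.

```yaml
# Illustrative fragment: ENV_PREFIX and its mapping are assumptions.
on:
  push:
    branches: [brs-dev, brs-prd, qa, demo]

env:
  # Map the branch to a secret-name prefix: brs-prd -> PROD, anything else -> DEV.
  ENV_PREFIX: ${{ github.ref_name == 'brs-prd' && 'PROD' || 'DEV' }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v7
        with:
          context: .
          # On brs-dev this reads secrets.DEV_API_URL, on brs-prd secrets.PROD_API_URL,
          # so the same Dockerfile step sees a different arg value per branch.
          build-args: |
            NEXT_PUBLIC_API_URL=${{ secrets[format('{0}_API_URL', env.ENV_PREFIX)] }}
```

Same Dockerfile, same instructions, different arg values: that is enough for two branches to produce builder-stage layers that cannot reuse each other's cache.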

Here’s what was happening:

1. Push to brs-dev: builds with DEV secrets, writes cache layers to buildcache.
2. Push to brs-prd: pulls cache from buildcache, but the build args differ (PROD secrets), so the builder stage misses. Full rebuild, which overwrites buildcache with PROD layers.
3. Push to brs-dev again: pulls cache from buildcache, but it now contains PROD layers. Cache miss again.

The 42-second runs? Those were consecutive pushes to the same branch before any other branch had a chance to overwrite the cache.

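The GHA cache line was already branch-scoped (scope=build-${{ github.ref_name }}), but the GHA cache has a 10 GB per-repository limit and evicts older entries once that limit is exceeded, so it could not be relied on to hold every branch's layers. The registry cache has no comparable cap and persists until it is overwritten, and it was the one that wasn’t scoped.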

The Fix: Four Changes, One File

All changes went into .github/workflows/ci-cd.yml. No Dockerfile modifications, no application code changes.

1. Branch-Scoped Registry Cache

# Before
type=registry,ref=...pmi-client:buildcache

# After
type=registry,ref=...pmi-client:buildcache-${{ github.ref_name }}

Now brs-dev writes to buildcache-brs-dev, brs-prd writes to buildcache-brs-prd. No more cross-contamination.
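One caveat worth sketching: github.ref_name can contain characters such as "/" (for example feature/login) that are not valid in a Docker tag. The four branches here are safe, but a hedged variant that sanitizes the name first might look like this. The step id, the output name, and the sanitizing step itself are hypothetical additions, not part of the original change.

```yaml
# Sketch only: the "cache" step and its "tag" output are hypothetical.
- name: Derive a registry-safe cache tag
  id: cache
  shell: bash
  env:
    REF_NAME: ${{ github.ref_name }}
  run: |
    # Replace anything outside the tag-safe set with "-" (feature/login -> feature-login).
    safe_ref="$(printf '%s' "$REF_NAME" | tr -c 'A-Za-z0-9_.-' '-')"
    echo "tag=buildcache-${safe_ref}" >> "$GITHUB_OUTPUT"

- name: Build image
  uses: docker/build-push-action@v7
  with:
    context: .
    # Both backends are now scoped per branch.
    cache-from: |
      type=registry,ref=pmicr.azurecr.io/pmi-client:${{ steps.cache.outputs.tag }}
      type=gha,scope=build-${{ github.ref_name }}
    cache-to: |
      type=registry,ref=pmicr.azurecr.io/pmi-client:${{ steps.cache.outputs.tag }},mode=max
      type=gha,mode=max,scope=build-${{ github.ref_name }}
```

For brs-dev the derived tag is still buildcache-brs-dev, so this behaves the same as the one-line change for the existing branches.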

2. Remove `BUILDKIT_INLINE_CACHE=1`

This build arg embeds cache metadata directly into the production image. It was the “easy” caching approach before the dual-backend strategy was in place, and the inline exporter only supports min mode, so it could never cache more than the explicit mode=max backends already do. With type=registry and type=gha configured, it was pure dead weight, adding 5-10% bloat to every image pushed to ACR and pulled by every pod in the cluster.
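A sketch of what the removal can look like in the build step. The step name, the context, and the single NEXT_PUBLIC_API_URL line are placeholders standing in for the real configuration; the only point is the build arg that is no longer there.

```yaml
# Sketch only: step name and the abbreviated build-args list are assumptions.
- name: Build image
  uses: docker/build-push-action@v7
  with:
    context: .
    # Removed from build-args: BUILDKIT_INLINE_CACHE=1
    # The explicit cache-from/cache-to backends below already carry the cache,
    # so the image itself no longer needs embedded cache metadata.
    build-args: |
      NEXT_PUBLIC_API_URL=${{ secrets.DEV_API_URL }}
    cache-from: type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache-${{ github.ref_name }}
    cache-to: type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache-${{ github.ref_name }},mode=max
```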

3. Explicit `platforms: linux/amd64`

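We only deploy to AMD64 nodes. Without this, the target platform is whatever the builder defaults to. On GitHub's standard x64 runners that is already linux/amd64, so this line does not speed anything up today. It guards against an accidental multi-platform or QEMU-emulated ARM64 build, which could easily double build time, if the runner or Buildx setup ever changes. One line, zero ambiguity.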
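In the workflow this is a single input on the build step. A sketch, with the step name and context assumed:

```yaml
- name: Build image
  uses: docker/build-push-action@v7
  with:
    context: .
    # Pin the target so the build never fans out to other architectures.
    platforms: linux/amd64
```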

4. Build/Push Separation

# Before: atomic operation
- name: Build and push image
  uses: docker/build-push-action@v7
  with:
      push: true

# After: two distinct steps
- name: Build image
  uses: docker/build-push-action@v7
  with:
      push: false
      load: true

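- name: Push image to ACR
  run: |
      # TAGS is expected to hold one image tag per line.
      echo "$TAGS" | while IFS= read -r tag; do
        # Use an if-block rather than `[ ... ] &&` so a trailing blank line
        # does not leave the loop (and the step) with a non-zero exit status.
        if [ -n "$tag" ]; then docker push "$tag"; fi
      done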

This doesn’t make the build faster today. It creates a seam — a point between build and push where I can later insert an image-level smoke test or a manual approval gate. It also separates build time from push time in the logs, so I can see exactly where time goes.
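To make the seam concrete, here is a hedged sketch of the full sequence with a placeholder check between build and push. The docker/metadata-action step, the step ids, the type=sha tag rule, and the inspect-only "smoke test" are all assumptions for illustration; a prior registry login step is assumed as well.

```yaml
# Sketch only: step ids, tag rule, and the smoke test are hypothetical.
- name: Compute image tags
  id: meta
  uses: docker/metadata-action@v5
  with:
    images: pmicr.azurecr.io/pmi-client
    tags: type=sha

- name: Build image
  uses: docker/build-push-action@v7
  with:
    context: .
    push: false
    load: true   # load the result into the runner's local Docker daemon
    tags: ${{ steps.meta.outputs.tags }}

- name: Smoke test
  env:
    TAGS: ${{ steps.meta.outputs.tags }}
  run: |
    # Placeholder check: confirm the first tag exists locally.
    # A real test would start the container and probe it.
    first_tag="$(printf '%s\n' "$TAGS" | head -n 1)"
    docker image inspect "$first_tag" > /dev/null

- name: Push image to ACR
  env:
    TAGS: ${{ steps.meta.outputs.tags }}
  run: |
    printf '%s\n' "$TAGS" | while IFS= read -r tag; do
      if [ -n "$tag" ]; then docker push "$tag"; fi
    done
```

If the smoke test step fails, the push step never runs, which is the whole value of splitting the two.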

Results

First CI run after merging:

Job	Before	After
Setup	5s	9s
Test	1m22s	1m13s
Build	5m33s	1m33s
GitOps	4s	4s
Total	~7 min	~3 min

That first run was a cache miss (populating the new buildcache-brs-dev tag for the first time). Subsequent pushes to the same branch hit full cache — the build step drops to under 30 seconds.

The 72% reduction on the build step (5m33s to 1m33s, or 333 s to 93 s) came almost entirely from change #1 (branch-scoped cache). The other three are hygiene and future-proofing.

What I Learned

Cache strategies are only as good as their key design. Having “caching enabled” means nothing if the key collides across contexts. This is the same lesson as React Query’s queryKey — if two queries share a key but fetch different data, you get stale results. Same principle, different domain.

Redundant optimizations compound into bloat. BUILDKIT_INLINE_CACHE=1 was probably added during initial setup, before the dual-backend cache was configured. Nobody removed it because it wasn’t obviously hurting anything. But “not obviously hurting” and “actively helping” are not the same thing.

Observability enables optimization. I couldn’t have diagnosed the cache overwrite pattern without comparing build logs across branches and correlating timing with which branch pushed last. The total diff: +17 lines, -5 lines. The fix was trivial once the root cause was visible.

This was one piece of a broader CI/CD optimization effort that also included workflow restructuring (~120-170 min/month saved), bundle optimization (optimizePackageImports for MUI, explicit sharp dependency), and fixing a signin prerender error that added 30+ seconds of noise to build logs.

Combined, these changes took the pipeline from “sometimes 10 minutes, sometimes 7, unpredictably” to “consistently under 3 minutes.” The build is no longer the bottleneck. I push, I wait, and by the time I’ve read the diff one more time, the deploy is done.

This is a sequel to How I Built a Full CI/CD Safety Net. That post covered building the pipeline. This one covers making it fast.
