From 8 Minutes to 90 Seconds: Optimizing a Docker Build Pipeline on GitHub Actions
In a previous post, I built the CI/CD safety net for BRS Workspace — a Next.js financial platform deployed on Azure AKS. That gave me confidence nothing ships without passing tests.
But confidence has a cost: the pipeline took 7-10 minutes per push. The Docker build job alone consumed 5-8 minutes, more than the tests, linting, and deployment steps combined. For a solo developer pushing 3-5 times a day, that adds up to roughly 30-40 minutes of waiting. Every day.
This post is about finding the root cause and fixing it with four small YAML changes.
The Mystery: Why Was Cache Hit Rate So Inconsistent?
The pipeline already had caching. Dual-layer, in fact:
    cache-from: |
      type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache
      type=gha,scope=build-${{ github.ref_name }}
    cache-to: |
      type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache,mode=max
      type=gha,mode=max,scope=build-${{ github.ref_name }}
Two cache backends. mode=max storing all intermediate layers. This should be fast. And sometimes it was — I had one run that completed the build step in 42 seconds. Perfect cache hit. But the next run on the same branch would take 6 minutes. Then 7 minutes. Then 42 seconds again.
The inconsistency was the clue.
Root Cause: One Cache Tag, Four Environments
Look at the registry cache line:
    type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache
One tag: buildcache. Shared across every branch that triggers CI: brs-dev, brs-prd, qa, demo.
Now look at the build args:
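    # env_prefix is shorthand here for however the workflow resolves the
    # environment prefix (DEV, PROD, ...); a real expression needs a context
    # such as env.* or needs.*.outputs.*
    build-args: |
      NEXT_PUBLIC_API_URL=${{ secrets[format('{0}_API_URL', env_prefix)] }}
      NEXT_PUBLIC_KEYCLOAK_URL=${{ secrets[format('{0}_KEYCLOAK_URL', env_prefix)] }}
      # ... 14 NEXT_PUBLIC_* variables total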
Each environment has different secrets. DEV_API_URL points to dev-api.pmi-korea.com, PROD_API_URL points to api.pmi-korea.com. These values are baked into the Docker image at build time because Next.js inlines NEXT_PUBLIC_* variables during compilation.
Here’s what was happening:
- Push to brs-dev: builds with DEV secrets and writes cache layers to buildcache.
- Push to brs-prd: pulls cache from buildcache, but the build args differ (PROD secrets), so the builder stage misses the cache, does a full rebuild, and overwrites buildcache with PROD layers.
- Push to brs-dev again: pulls cache from buildcache, but it now contains PROD layers, so the cache misses again.
The 42-second runs? Those were consecutive pushes to the same branch before any other branch had a chance to overwrite the cache.
The GHA cache line was already branch-scoped (scope=build-${{ github.ref_name }}), but the GHA cache has a 10 GB per-repository limit and evicts entries aggressively. The registry cache has no comparable cap and persists between runs, and it was the one that wasn’t scoped.
The Fix: Four Changes, One File
All changes went into .github/workflows/ci-cd.yml. No Dockerfile modifications, no application code changes.
1. Branch-Scoped Registry Cache
    # Before
    type=registry,ref=...pmi-client:buildcache
    # After
    type=registry,ref=...pmi-client:buildcache-${{ github.ref_name }}
Now brs-dev writes to buildcache-brs-dev, brs-prd writes to buildcache-brs-prd. No more cross-contamination.
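Put together, the cache configuration after this change might look like the sketch below. It only applies the branch suffix to the two registry lines from the original block. It assumes branch names contain nothing but valid Docker tag characters, which holds for brs-dev, brs-prd, qa, and demo; a name with a slash, such as feature/x, would need sanitizing first.

```yaml
# Sketch: both cache backends are now keyed by branch
cache-from: |
  type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache-${{ github.ref_name }}
  type=gha,scope=build-${{ github.ref_name }}
cache-to: |
  type=registry,ref=pmicr.azurecr.io/pmi-client:buildcache-${{ github.ref_name }},mode=max
  type=gha,mode=max,scope=build-${{ github.ref_name }}
```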
2. Remove BUILDKIT_INLINE_CACHE=1
This build arg embeds cache metadata directly into the production image (BuildKit writes it into the image config). It was the “easy” caching approach before the dual-backend strategy was in place. With explicit type=registry and type=gha backends already configured, it was pure dead weight: extra metadata attached to every image pushed to ACR and pulled by every pod in the cluster, with nothing in the pipeline reading it.
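As a rough sketch of the removal, assuming the flag sat in the same build-args input as the NEXT_PUBLIC_* values (the exact placement in the real workflow is not shown here):

```yaml
# Before (assumed placement): the flag was one more build-args entry
#   build-args: |
#     BUILDKIT_INLINE_CACHE=1
#     NEXT_PUBLIC_API_URL=...
#
# After: the entry is gone; caching is handled only by cache-from / cache-to
build-args: |
  NEXT_PUBLIC_API_URL=${{ secrets[format('{0}_API_URL', env_prefix)] }}
  NEXT_PUBLIC_KEYCLOAK_URL=${{ secrets[format('{0}_KEYCLOAK_URL', env_prefix)] }}
```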
3. Explicit platforms: linux/amd64
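We only deploy to AMD64 nodes. Without this line the target platform is implicit: BuildKit builds for whatever the builder's native platform happens to be. Pinning it guards against an accidental QEMU-emulated ARM64 or multi-platform build if the runner or builder setup ever changes, which would roughly double build time. One line, zero ambiguity.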
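In the build step, this is a single extra input. The fragment below is a minimal sketch; the context value is an assumption, not taken from the real workflow.

```yaml
- name: Build image
  uses: docker/build-push-action@v7
  with:
    context: .               # assumption: Dockerfile lives at the repository root
    platforms: linux/amd64   # build only for the architecture the cluster nodes run
```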
4. Build/Push Separation
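    # Before: atomic operation
    - name: Build and push image
      uses: docker/build-push-action@v7
      with:
        push: true

    # After: two distinct steps
    - name: Build image
      uses: docker/build-push-action@v7
      with:
        push: false
        load: true   # export the built image to the runner's local Docker image store
    - name: Push image to ACR
      run: |
        # TAGS holds one image tag per line
        echo "$TAGS" | while IFS= read -r tag; do
          # if/fi instead of `[ ... ] && ...`, so a trailing blank line
          # cannot leave the loop (and the step) with a non-zero status
          if [ -n "$tag" ]; then docker push "$tag"; fi
        done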
This doesn’t make the build faster today. It creates a seam — a point between build and push where I can later insert an image-level smoke test or a manual approval gate. It also separates build time from push time in the logs, so I can see exactly where time goes.
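To make the seam concrete, here is one way a smoke test could slot in between the two steps. This is a sketch under assumptions, not the pipeline's actual configuration: the metadata step with id meta, the Smoke test image step, and the presence of a node binary in the image are all hypothetical.

```yaml
- name: Build image
  uses: docker/build-push-action@v7
  with:
    push: false
    load: true
    tags: ${{ steps.meta.outputs.tags }}   # assumption: tags come from a docker/metadata-action step with id "meta"

# Hypothetical step occupying the seam between build and push
- name: Smoke test image
  env:
    TAGS: ${{ steps.meta.outputs.tags }}
  run: |
    # Take the first tag and check that the image can start a Node.js process
    first_tag="$(printf '%s\n' "$TAGS" | head -n 1)"
    docker run --rm --entrypoint node "$first_tag" --version

- name: Push image to ACR
  env:
    TAGS: ${{ steps.meta.outputs.tags }}
  run: |
    echo "$TAGS" | while IFS= read -r tag; do
      if [ -n "$tag" ]; then docker push "$tag"; fi
    done
```

If the smoke test exits non-zero, the job stops before anything reaches ACR, which is the point of separating the steps.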
Results
First CI run after merging:
| Job | Before | After |
|---|---|---|
| Setup | 5s | 9s |
| Test | 1m22s | 1m13s |
| Build | 5m33s | 1m33s |
| GitOps | 4s | 4s |
| Total | ~7 min | ~3 min |
That first run was a cache miss (populating the new buildcache-brs-dev tag for the first time). Subsequent pushes to the same branch hit full cache — the build step drops to under 30 seconds.
The 72% reduction on the build step came almost entirely from change #1 (branch-scoped cache). The other three are hygiene and future-proofing.
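The 72% figure follows from the Build row of the table, with both timings converted to seconds (5m33s = 333 s, 1m33s = 93 s):

```latex
\[
\frac{333\,\mathrm{s} - 93\,\mathrm{s}}{333\,\mathrm{s}} = \frac{240}{333} \approx 0.72
\]
```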
What I Learned
Cache strategies are only as good as their key design. Having “caching enabled” means nothing if the key collides across contexts. This is the same lesson as React Query’s queryKey — if two queries share a key but fetch different data, you get stale results. Same principle, different domain.
Redundant optimizations compound into bloat. BUILDKIT_INLINE_CACHE=1 was probably added during initial setup, before the dual-backend cache was configured. Nobody removed it because it wasn’t obviously hurting anything. But “not obviously hurting” and “actively helping” are not the same thing.
Observability enables optimization. I couldn’t have diagnosed the cache overwrite pattern without comparing build logs across branches and correlating timing with which branch pushed last. The total diff: +17 lines, -5 lines. The fix was trivial once the root cause was visible.
This was one piece of a broader CI/CD optimization effort that also included workflow restructuring (~120-170 min/month saved), bundle optimization (optimizePackageImports for MUI, explicit sharp dependency), and fixing a signin prerender error that added 30+ seconds of noise to build logs.
Combined, these changes took the pipeline from “sometimes 10 minutes, sometimes 7, unpredictably” to “consistently under 3 minutes.” The build is no longer the bottleneck. I push, I wait, and by the time I’ve read the diff one more time, the deploy is done.
This is a sequel to How I Built a Full CI/CD Safety Net. That post covered building the pipeline. This one covers making it fast.