Skip to main content

Cloud Run Deployment Timeout Fix

Problem

GitHub Actions workflow “Deploy FastAPI Backend to Cloud Run” fails at “Wait for deployment to be ready” with:
  • Deployment did not become ready within 600 seconds
  • Revision is created but never reaches Ready status

Root Cause Analysis

The workflow now includes auto-diagnostics that will identify the exact failure reason on timeout:
  1. Revision Conditions - Shows why the revision isn’t Ready
  2. Container Config - Shows image, env vars, ports, concurrency settings
  3. Recent Logs - Last 200 lines of logs from the failing revision

Common Issues (from diagnostics)

  1. PORT binding - Container must listen on 0.0.0.0:$PORT (usually 8080)
  2. Missing env vars - JWT_SECRET_KEY, GCP_PROJECT_ID, CORS_ORIGINS, etc.
  3. Import errors - ModuleNotFoundError in logs
  4. Missing files - FileNotFoundError (auth_users.yaml, etc.)
  5. Permission errors - BigQuery/secret access (sa-worker service account)
  6. Startup timeout - App takes > 600s to start (unlikely)

Changes Made

1. Enhanced Workflow Diagnostics (.github/workflows/deploy_cloudrun.yml)

Added comprehensive auto-diagnostics on deployment timeout:
  • Gets latest created revision (may differ from latest ready)
  • Shows revision conditions (why it’s not Ready)
  • Shows container config (image, env vars, ports)
  • Shows last 200 log lines
  • Lists common issues to check

2. Diagnostic Script (scripts/diagnose_cloudrun_deployment.sh)

Standalone script to diagnose deployment failures:
./scripts/diagnose_cloudrun_deployment.sh [service_name] [region] [project]

3. Production Test Script (scripts/test_prod_save_endpoint.ps1)

PowerShell script to test the save endpoint after deployment:
.\scripts\test_prod_save_endpoint.ps1 -BusinessId "cf57d5b867062353" -Token $token

Verification Steps

Step 1: Check Dockerfile Configuration

The Dockerfile is correctly configured:
  • ✅ Uses --host 0.0.0.0 --port ${PORT:-8080}
  • ✅ Sets ENV PORT=8080 as default
  • ✅ Uses exec form for proper signal handling

Step 2: Trigger Deployment

The workflow auto-triggers on push to main when api/** files change, or can be manually triggered.

Step 3: Check Diagnostics on Failure

If deployment times out, the workflow will automatically:
  1. Show revision conditions
  2. Show container config
  3. Show recent logs (200 lines)
  4. List common issues

Step 4: Verify Deployment Success

After successful deployment:
  1. Check revision is Ready
  2. Verify deployed SHA matches commit (>= 7eebf15)
  3. Test save endpoint:
    .\scripts\test_prod_save_endpoint.ps1 -BusinessId "cf57d5b867062353"
    

Expected Outcomes

After fix:
  • ✅ Cloud Run revision becomes Ready
  • ✅ Deployed SHA >= 7eebf15
  • ✅ Two successful prod saves (one via script, one via UI)
  • ✅ CI logs print revision failure reason automatically if it happens again

Next Steps

  1. Commit these changes to main
  2. Trigger deployment (auto or manual)
  3. If timeout occurs, check the auto-diagnostics output
  4. Apply fix based on diagnostic output
  5. Redeploy and verify save endpoint works

Files Changed

  • .github/workflows/deploy_cloudrun.yml - Enhanced diagnostics
  • scripts/diagnose_cloudrun_deployment.sh - New diagnostic script
  • scripts/test_prod_save_endpoint.ps1 - New test script
  • docs/CLOUDRUN_DEPLOY_TIMEOUT_FIX.md - This document