
Root Cause Analysis - Readiness Not Seeing Saved Config

Summary

  • Investigates why readiness stayed 0/113 ready after a successful save call.
  • Identifies backend service mismatch/split-brain as the leading hypothesis.
  • Correlates network timing, endpoint paths, and repeated readiness outcomes.
  • Enumerates alternate hypotheses (org scope, effective date, write/query lag).
  • Provides targeted next-step checks across headers, logs, and direct BigQuery queries.

When to use this

  • When save operations succeed but readiness does not reflect configuration.
  • During cross-service or environment routing investigations.
  • When validating org scope, period scope, and data visibility assumptions.
  • As a template for structured production forensic triage.

What you’ll walk away with

  • A ranked set of root-cause hypotheses with supporting evidence.
  • A practical investigation sequence to confirm backend/data-path issues.
  • A concise action plan to move from symptom to verified fix.

Test Date: 2026-02-06 23:13:28
Business ID: a7ac092572c6fa5c
Period Label: 2025-12-01

🔍 Critical Finding

Backend Service Mismatch Detected: The console shows requests going to:
  • payroll-pipeline-cbs-api-evndxpcirq-uc.a.run.app
But the resolver was previously fixed to use:
  • payroll-backend-prod-238826317621.us-central1.run.app
The service names differ, so requests may be reaching a different service than intended.
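
The host comparison above can be made mechanical so it is easy to repeat for every captured request. This is a minimal sketch: the two hostnames and the save URL come from this report, while the helper names (host_of, hosts_match) are hypothetical and not part of any existing tooling.

```python
from urllib.parse import urlsplit

# URL observed in the browser console for the save call (from this report).
OBSERVED_URL = (
    "https://payroll-pipeline-cbs-api-evndxpcirq-uc.a.run.app"
    "/api/v1/admin/onboarding/businesses/a7ac092572c6fa5c/save"
)
# Host the resolver was previously fixed to use (from this report).
EXPECTED_HOST = "payroll-backend-prod-238826317621.us-central1.run.app"


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL (empty string if absent)."""
    return (urlsplit(url).hostname or "").lower()


def hosts_match(url: str, expected_host: str) -> bool:
    """True when the request URL targets the expected backend host."""
    return host_of(url) == expected_host.lower()


if __name__ == "__main__":
    print("observed host:", host_of(OBSERVED_URL))
    print("matches resolver target:", hosts_match(OBSERVED_URL, EXPECTED_HOST))
```

A False result only confirms that the hostnames differ. Whether the two hosts front the same deployment still has to be checked on the backend side.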

📊 Network Analysis

Save Request

  • URL: PUT https://payroll-pipeline-cbs-api-evndxpcirq-uc.a.run.app/api/v1/admin/onboarding/businesses/a7ac092572c6fa5c/save
  • Status: 200 OK
  • Delay: 14 seconds (unusually long)

Readiness Requests (3 attempts)

  • URL: GET https://payroll-pipeline-cbs-api-evndxpcirq-uc.a.run.app/api/v1/intake/ingestion-wizard/readiness?batch_id=0489e906-d84a-429a-a9b1-384505e99b05&period_label=2025-12-01
  • Status: All 200 OK
  • Result: 0/113 ready (all 3 attempts)

🎯 Root Cause Hypothesis

Hypothesis 1: Backend Service Split-Brain (MOST LIKELY)

Issue:
  • Save writes through the payroll-pipeline-cbs-api service
  • Readiness queries through the same payroll-pipeline-cbs-api service
  • The two paths may still use different BigQuery datasets or be served by different backend instances
Evidence:
  • Both requests go to the same service URL (good)
  • Readiness still shows 0/113 ready after save returns 200 OK
  • The 14-second save delay suggests the backend is slow or is using a different data source

Hypothesis 2: Org ID Scope Mismatch

Issue:
  • Save writes with org_id=NULL (platform-scope)
  • Readiness queries with org_id=cbs-main (from x-org-id header)
  • Readiness filter (aa.org_id = @org_id OR aa.org_id IS NULL) should still match rows written with org_id=NULL
  • If save instead writes a different, non-NULL org_id, readiness will not see the row
Check:
  • Verify save request includes x-org-id: cbs-main header
  • Verify save writes with correct org_id scope
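
To make the expected behavior of the org filter concrete, here is a small sketch that mirrors (aa.org_id = @org_id OR aa.org_id IS NULL) in plain Python. The function name and the sample org values other than cbs-main are hypothetical. The empty-string case is an assumption about one way a "NULL" scope could be written incorrectly, not something observed in this investigation.

```python
from typing import Optional


def visible_to_readiness(row_org_id: Optional[str], request_org_id: str) -> bool:
    """Mirror of the readiness filter: (aa.org_id = @org_id OR aa.org_id IS NULL)."""
    return row_org_id is None or row_org_id == request_org_id


if __name__ == "__main__":
    request_org = "cbs-main"  # value expected in the x-org-id header
    samples = {
        "platform scope (NULL)": None,
        "same org": "cbs-main",
        "other org (hypothetical)": "other-org",
        "empty string instead of NULL (hypothetical)": "",
    }
    for label, row_org in samples.items():
        print(f"{label}: visible={visible_to_readiness(row_org, request_org)}")
```

If the saved row falls into either of the last two cases, the filter excludes it even though the save returned 200 OK.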

Hypothesis 3: Effective Date Mismatch

Issue:
  • Save writes with effective_start_date: 2025-12-01
  • Readiness queries with as_of_date: 2025-12-01 (from period_label)
  • Readiness filter: effective_start_date <= @as_of_date
  • 2025-12-01 <= 2025-12-01 should be TRUE
  • But if save writes the date in a different format or timezone, the comparison might not match
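
One way a timezone could break the comparison, sketched under assumptions: if the save path converted a local timestamp to UTC before truncating to a date, a late-evening save would store the next calendar day. The fixed UTC-6 offset, the 20:00 save time, and the helper names are illustrative only. Nothing in the captured traffic shows that the backend does this.

```python
from datetime import date, datetime, timedelta, timezone

AS_OF_DATE = date(2025, 12, 1)  # derived from period_label=2025-12-01


def is_effective(effective_start: date, as_of: date) -> bool:
    """Mirror of the readiness filter: effective_start_date <= @as_of_date."""
    return effective_start <= as_of


def stored_date_via_utc(local_dt: datetime) -> date:
    """Hypothetical save path: convert to UTC first, then keep only the date."""
    return local_dt.astimezone(timezone.utc).date()


if __name__ == "__main__":
    # Plain date: matches, as the report expects.
    print(is_effective(date(2025, 12, 1), AS_OF_DATE))

    # Assumed UTC-6 offset; 20:00 local on 2025-12-01 is 02:00 UTC on 2025-12-02.
    central = timezone(timedelta(hours=-6))
    evening_save = datetime(2025, 12, 1, 20, 0, tzinfo=central)
    shifted = stored_date_via_utc(evening_save)
    print(shifted, is_effective(shifted, AS_OF_DATE))
```

Comparing the stored effective_start_date value in BigQuery against the request body would confirm or rule this out.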

Hypothesis 4: BigQuery Write → Query Lag

Issue:
  • Save writes to BigQuery
  • Readiness queries BigQuery immediately
  • The write may take longer than 11 seconds to become visible to queries
Evidence:
  • 3 retries over 11 seconds
  • Still shows 0/113 ready
  • Suggests the write was not queryable even after the delay
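
To separate a genuine lag from a persistent miss, the retry window can be widened and every attempt recorded. The sketch below is not the existing retry logic. The function name, the 5.5-second spacing (chosen so that 3 attempts span about 11 seconds), and the injected fetch callable are all assumptions.

```python
import time
from typing import Callable, List, Tuple

Readiness = Tuple[int, int]  # (ready_count, total_count), e.g. (0, 113)


def poll_readiness(
    fetch: Callable[[], Readiness],
    attempts: int = 3,
    delay_s: float = 5.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Readiness]:
    """Call fetch up to `attempts` times, stopping early once anything is ready.

    Returns every observed result so the timeline can be inspected afterwards.
    """
    results: List[Readiness] = []
    for attempt in range(attempts):
        result = fetch()
        results.append(result)
        if result[0] > 0:
            break  # config became visible; no need to keep polling
        if attempt < attempts - 1:
            sleep(delay_s)  # wait only between attempts, not after the last one
    return results


if __name__ == "__main__":
    # Stand-in for the real readiness call; always reports 0/113 ready.
    observed = poll_readiness(lambda: (0, 113), sleep=lambda s: None)
    print(observed)
```

If readiness still reports 0/113 after a much longer window (for example several minutes), lag becomes an unlikely explanation and the other hypotheses gain weight.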

🔬 Investigation Steps

Step 1: Verify Save Request Details

Check PUT request:
  • Headers: x-org-id: cbs-main (should be present)
  • Body: period_label: 2025-12-01
  • Body: effective_start_date: 2025-12-01
  • Body: mode: AGENT_PEPM
  • Body: agent_pepm_lines: [...]

Step 2: Verify Readiness Query Details

Check GET request:
  • Query params: period_label=2025-12-01 ✅ (confirmed)
  • Query params: batch_id=0489e906-d84a-429a-a9b1-384505e99b05 ✅ (confirmed)
  • Headers: x-org-id: cbs-main (need to verify)
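
Steps 1 and 2 both come down to comparing the x-org-id header on the two captured requests. A minimal sketch, assuming the headers have been copied out of the browser network panel into plain dictionaries. The helper names and the sample header dictionaries are hypothetical, not captured data.

```python
from typing import Dict, Mapping, Optional


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Case-insensitive lookup, since HTTP header names are case-insensitive."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def org_scope_report(
    save_headers: Mapping[str, str],
    readiness_headers: Mapping[str, str],
    expected_org: str = "cbs-main",
) -> Dict[str, object]:
    """Summarize whether both requests carry the expected x-org-id."""
    save_org = get_header(save_headers, "x-org-id")
    readiness_org = get_header(readiness_headers, "x-org-id")
    return {
        "save_org": save_org,
        "readiness_org": readiness_org,
        "both_present": save_org is not None and readiness_org is not None,
        "consistent": save_org == readiness_org == expected_org,
    }


if __name__ == "__main__":
    # Hypothetical capture: save carries the header, readiness does not.
    print(org_scope_report({"X-Org-Id": "cbs-main"}, {"Accept": "application/json"}))
```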

Step 3: Check Backend Logs

Verify:
  • The save write to BigQuery succeeded
  • The readiness query includes business a7ac092572c6fa5c
  • The readiness query uses the correct as_of_date and org_id filters

Step 4: Direct BigQuery Query

Run query:
SELECT 
  business_id,
  agent_entity_id,
  org_id,
  effective_start_date,
  effective_end_date
FROM `analytics.config_business_agent_assignment`
WHERE business_id = 'a7ac092572c6fa5c'
  AND tenant_id = 'creative_benefit_strategies'
  AND effective_start_date <= '2025-12-01'
  AND (effective_end_date IS NULL OR effective_end_date >= '2025-12-01')
Expected: Should see rows written by save
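
The same query can be run from a script with named parameters, which avoids hand-editing literals between runs. This is a sketch under several assumptions: the google-cloud-bigquery package is installed, default credentials are available, the analytics dataset resolves in the client's default project, and effective_start_date / effective_end_date are DATE columns. The function name fetch_assignments is hypothetical.

```python
import datetime

QUERY = """
SELECT
  business_id,
  agent_entity_id,
  org_id,
  effective_start_date,
  effective_end_date
FROM `analytics.config_business_agent_assignment`
WHERE business_id = @business_id
  AND tenant_id = @tenant_id
  AND effective_start_date <= @as_of_date
  AND (effective_end_date IS NULL OR effective_end_date >= @as_of_date)
"""

# (name, BigQuery type, value); values are the ones used in this report.
PARAMS = [
    ("business_id", "STRING", "a7ac092572c6fa5c"),
    ("tenant_id", "STRING", "creative_benefit_strategies"),
    ("as_of_date", "DATE", datetime.date(2025, 12, 1)),
]


def fetch_assignments(client=None):
    """Run QUERY with named parameters and return the rows as dictionaries."""
    # Imported lazily so the module loads without google-cloud-bigquery installed.
    from google.cloud import bigquery

    client = client or bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(name, type_, value)
            for name, type_, value in PARAMS
        ]
    )
    rows = client.query(QUERY, job_config=job_config).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    for row in fetch_assignments():
        print(row)
```

An empty result points at the write path (nothing landed in this table). A non-empty result whose org_id or dates look unexpected points back at Hypotheses 2 and 3.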

🎯 Most Likely Root Cause

Backend readiness query is not seeing the saved config, due to one of:
  1. Org ID scope mismatch:
    • Save writes with org_id=NULL (platform-scope)
    • Readiness queries with org_id=cbs-main filter
    • Filter should match (OR org_id IS NULL), but might not be working
  2. Backend write → query lag:
    • BigQuery write takes >11 seconds to be queryable
    • Unlikely but possible
  3. Different backend instances:
    • Save and readiness hit different backend instances
    • Different BigQuery connections or datasets

✅ Next Actions

  1. Check save request headers/body (verify org_id and period_label)
  2. Check readiness request headers (verify x-org-id matches)
  3. Check backend logs for save write and readiness query
  4. Run direct BigQuery query to verify config was written
  5. Verify backend service (is payroll-pipeline-cbs-api the correct service?)

Status: ⚠️ RETRY LOGIC WORKS, BUT BACKEND READINESS NOT SEEING SAVED CONFIG - NEED BACKEND INVESTIGATION