AI Chatbot Production Incident - December 24, 2025

Date: December 24, 2025
Status: ✅ Fixed (routing + RBAC hardening complete)
Remaining Issue: 403-for-self bug (documented for future debugging)

Executive Summary

Production incident involving AI chatbot RBAC bypass and Cloud Run 404 errors. Root causes identified and fixed:
  1. Routing Issue: Frontend calling wrong endpoint path, backend missing legacy alias mount
  2. RBAC Bypass: Multiple intent handlers querying data without authorization checks
  3. Name-based Data Leak: BigQuery queries using unsafe name filtering instead of agent_id
All critical fixes deployed. One non-critical bug remains (403 for self-queries) and is documented for future debugging.

System Architecture

  • Frontend: Next.js on Vercel (payroll-pipeline-cbs.vercel.app)
  • Backend: FastAPI on Cloud Run (payroll-backend-prod service)
  • Canonical AI Endpoint: /api/v1/ai/query-public
  • Legacy Compat Endpoint: /ai/query-public (temporary, for stale bundles)
  • RBAC Model: agent/CEO = self + downline, admin = tenant-wide
  • Data Layer: BigQuery + Vertex AI (commission math/datasets unchanged)

Incident Timeline

Initial Symptoms

  • Agent “Kenny Young” receiving answers about other agents (e.g., “Tommy Dang”) - RBAC violation
  • Network tab showed chatbot calling Vercel host instead of Cloud Run
  • Cloud Run endpoint /api/v1/ai/query-public returning 404
  • Backend logs showed router not mounted despite main.py claiming it was

Root Causes Identified

  1. Frontend Routing: aiClient.ts hardcoded to wrong Cloud Run service (fastapi-backend instead of payroll-backend-prod)
  2. Backend 404: Router only mounted at /api/v1/ai, but stale frontend bundles calling /ai/query-public
  3. RBAC Bypass: Demo mode falling back to CEO role, intent handlers skipping authorization
  4. Name-based Leak: BigQuery queries using agent_name filtering with LIKE/partial matching

Fixes Implemented

1. Routing / 404 Fixes

Problem: Frontend requests were hitting /ai/query-public (missing the /api/v1 prefix), and the old backend revision did not mount the legacy alias.
Solution: Backend router mounted at BOTH:
  • /api/v1/ai (canonical)
  • /ai (legacy compat)
Code Changes:
  • api/main.py lines 274-280: Dual router mount using same router instance
  • Same router instance ensures identical RBAC/auth behavior
  • Legacy alias temporary, will be removed after stale bundles expire
Files Modified:
  • api/main.py: Added legacy alias mount with enhanced startup logging
Verification:
  • Both endpoints return 401 (not 404) - endpoint exists, auth required
  • Router mount logs present in Cloud Run startup logs

2. Frontend URL Normalization + Client-Only Safety

Problem:
  • aiClient.ts hardcoded to wrong Cloud Run service
  • Module-init env validation causing SSR/build fragility
  • No guardrails for origin-only URL format
Solution:
  • Added normalizeOrigin() function to extract origin from URL
  • Added buildAiQueryPublicUrl() to consistently construct canonical endpoint
  • Moved env validation from module-init to call-time (prevents SSR/build crashes)
  • Added "use client" directive to AISidebar.tsx for clarity
Code Changes:
  • dashboard/src/lib/aiClient.ts:
    • normalizeOrigin(): Strips path components, returns origin-only
    • buildAiQueryPublicUrl(): Always constructs /api/v1/ai/query-public
    • getAiOriginOrThrow(): Call-time validation (not module-init)
    • Guardrail warning if NEXT_PUBLIC_AI_URL contains /api/ paths
  • dashboard/src/components/AISidebar.tsx: Added "use client" directive
  • dashboard/env.local.template: Documented NEXT_PUBLIC_AI_URL with correct format
Environment Variable:
  • NEXT_PUBLIC_AI_URL: Must be origin-only (e.g., https://payroll-backend-prod-evndxpcirq-uc.a.run.app)
  • Production value: https://payroll-backend-prod-evndxpcirq-uc.a.run.app
  • Never include /api/v1 or any path components
Verification:
  • Browser console shows correct AI_BASE_URL log
  • Network tab confirms requests to canonical path
  • Build succeeds even if env var temporarily unset (call-time validation)
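
The normalization described above can be sketched as follows. The real helpers live in dashboard/src/lib/aiClient.ts (TypeScript); this is a Python rendering of the same logic for illustration only, and the snake_case names, the env dict parameter, and the exception types are assumptions rather than the actual code.

```python
from urllib.parse import urlsplit

# Canonical path from the System Architecture section.
AI_QUERY_PUBLIC_PATH = "/api/v1/ai/query-public"


def normalize_origin(raw_url: str) -> str:
    """Return scheme://host[:port] only, dropping any path, query, or fragment."""
    parts = urlsplit(raw_url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Expected an absolute http(s) URL, got: {raw_url!r}")
    return f"{parts.scheme}://{parts.netloc}"


def build_ai_query_public_url(raw_url: str) -> str:
    """Always build the canonical endpoint, whatever path the input carried."""
    return normalize_origin(raw_url) + AI_QUERY_PUBLIC_PATH


def get_ai_origin_or_throw(env: dict) -> str:
    """Validate at call time, so merely importing this module never raises."""
    raw = env.get("NEXT_PUBLIC_AI_URL")
    if not raw:
        raise RuntimeError("NEXT_PUBLIC_AI_URL is not set")
    return normalize_origin(raw)
```

With this shape, a value that mistakenly includes /api/v1/ai still resolves to the canonical URL, and a missing variable only fails when a request is actually made.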

3. RBAC Leak Fixes (Critical)

3.1 Demo Mode Bypass

Problem:
  • When no valid JWT token present, endpoint fell back to demo mode with role: "ceo"
  • CEO role allows tenant-wide access, bypassing all RBAC checks
  • Even when a Bearer token was present but invalid, the endpoint decoded it without verifying the signature
Solution:
  • Rule 1: If Bearer token present but invalid → return 401 (no fallback)
  • Rule 2: Demo mode only allowed when NO Authorization header is present
  • Rule 3: Demo mode disabled in production (requires valid Bearer token)
  • Rule 4: Non-prod demo mode uses restricted viewer role (not CEO)
Code Changes:
  • api/routes/ai_query.py lines 3310-3399:
    • Strict Bearer token validation (401 on invalid token)
    • Production requires authentication (no demo fallback)
    • Non-prod demo uses viewer role (restricted scope)
Security Impact:
  • Before: Demo mode → tenant-wide access
  • After: Demo mode → restricted viewer (non-prod only), prod requires auth
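
The four rules above can be sketched as a single decision function. This is a minimal illustration, not the code in api/routes/ai_query.py: Principal, Unauthorized, resolve_principal, and the verify_token callback are hypothetical names, and Unauthorized stands in for an HTTP 401 response.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class Unauthorized(Exception):
    """Stands in for an HTTP 401 response (hypothetical)."""


@dataclass(frozen=True)
class Principal:
    role: str
    agent_id: Optional[str] = None
    tenant_id: Optional[str] = None


def resolve_principal(
    authorization: Optional[str],
    is_production: bool,
    verify_token: Callable[[str], Optional[Principal]],
) -> Principal:
    """Apply the demo-mode rules; verify_token returns None for invalid tokens."""
    if authorization is not None:
        # Rule 1: a header that is present must carry a valid Bearer token.
        scheme, _, token = authorization.partition(" ")
        principal = verify_token(token) if scheme.lower() == "bearer" and token else None
        if principal is None:
            raise Unauthorized("invalid or malformed Authorization header")
        return principal
    # Rule 2: demo mode is only considered when no Authorization header was sent.
    if is_production:
        # Rule 3: production never falls back to demo mode.
        raise Unauthorized("authentication required")
    # Rule 4: non-prod demo gets a restricted viewer role, never CEO.
    return Principal(role="viewer")
```

The key property is that an invalid token can never reach the demo branch: the only path to the viewer role is a request with no Authorization header in a non-production environment.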

3.2 AGENT_COMMISSION Intent Bypass

Problem:
  • AGENT_COMMISSION intent handler extracted agent_name from question
  • Called fetch_agent_data() directly without RBAC enforcement
  • Allowed any authenticated user to query any agent’s commission data
Solution:
  • Added RBAC enforcement before any data fetch
  • Extracts requested_target (agent name) from question
  • Calls authorize_target_agent_id() to check authorization (downline-aware for agents)
  • Only queries data for authorized_agent_id
  • Returns 403 if user is not authorized
Code Changes:
  • api/routes/ai_query.py AGENT_COMMISSION handler: Added authorize_target_agent_id() call before data fetch
Security Impact:
  • Before: Any agent queryable by any authenticated user
  • After: Only self + downline queryable (for agent role)
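
A minimal sketch of the self + downline check follows. The real signature of authorize_target_agent_id() is not shown in this report, so the parameters here are assumptions: downline is a hypothetical mapping from an agent_id to the set of all agent_ids beneath it, and Forbidden stands in for an HTTP 403 response.

```python
from typing import Dict, Set


class Forbidden(Exception):
    """Stands in for an HTTP 403 response (hypothetical)."""


def authorize_target_agent_id(
    role: str,
    user_agent_id: str,
    target_agent_id: str,
    downline: Dict[str, Set[str]],
) -> str:
    """Return target_agent_id if the caller may read it, else raise Forbidden."""
    if role == "admin":
        # Admin scope is tenant-wide.
        return target_agent_id
    if role in ("agent", "ceo"):
        # Agent/CEO scope is self plus downline.
        allowed = {user_agent_id} | downline.get(user_agent_id, set())
        if target_agent_id in allowed:
            return target_agent_id
    # Unknown roles and out-of-scope targets are denied (fail closed).
    raise Forbidden(f"not authorized for agent {target_agent_id}")
```

The handler would then query data only for the returned ID, so a name extracted from the question never reaches the data layer without passing this check.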

3.3 CHURN Intent Bypass

Problem:
  • CHURN intent handler querying data without RBAC checks
  • Similar to AGENT_COMMISSION bypass
Solution:
  • Added explicit authorize_target_agent_id() call before data fetching
  • Same RBAC enforcement pattern as AGENT_COMMISSION
Code Changes:
  • api/routes/ai_query.py CHURN handler: Added authorize_target_agent_id() call before data fetch

3.4 fetch_business_data() Guard

Problem:
  • fetch_business_data() passing authorized_agent_id=None for agent/CEO roles
  • Falling back to less secure agent_name filtering
Solution:
  • Enforced authorized_agent_id requirement for agent/CEO roles
  • Changed status code from 500 to 403 (correct classification)
  • Fail-closed: raises 403 if agent/CEO role missing authorized_agent_id
Code Changes:
  • api/routes/ai_query.py fetch_business_data(): Added guard requiring authorized_agent_id for agent/CEO roles
  • Status code changed to HTTP_403_FORBIDDEN (was 500)
Security Impact:
  • Before: Silent fallback to name-based filtering
  • After: Fail-closed with 403 (no data leak)
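
The fail-closed guard can be sketched as below. The helper name require_authorized_agent_id and the Forbidden exception are hypothetical; in the actual fix the check sits inside fetch_business_data() and raises HTTP_403_FORBIDDEN.

```python
from typing import Optional


class Forbidden(Exception):
    """Stands in for an HTTP 403 response (hypothetical)."""


def require_authorized_agent_id(role: str, authorized_agent_id: Optional[str]) -> Optional[str]:
    """Refuse to continue if an agent/CEO request has no authorized agent_id."""
    if role in ("agent", "ceo") and not authorized_agent_id:
        # Deny rather than fall back to name-based filtering.
        raise Forbidden("authorized_agent_id is required for agent/CEO roles")
    # Other roles (e.g. admin) may legitimately pass None for tenant-wide queries.
    return authorized_agent_id
```

Raising 403 instead of 500 classifies the failure as an authorization problem rather than a server fault, which matches how clients and log alerts should treat it.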

4. Name-based Leak Hardening in BigQuery Querying

Problem:
  • get_agent_report_from_view() using agent_name filtering with LIKE/partial matching
  • Risk of name collisions, partial matches, row multiplication
  • No deduplication guard for dim_agent_hierarchy rows
Solution:
  • Added dedupe CTE (dah_one) with QUALIFY ROW_NUMBER() to ensure single row per (tenant_id, agent_id)
  • Removed LIKE partial matching (exact equality only)
  • Added NULL guards (COALESCE, IS NOT NULL) for correctness
  • Join surface constrained to single authorized agent row
Code Changes:
  • api/bigquery/queries.py get_agent_report_from_view():
    • CTE dah_one with QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id, agent_id ORDER BY created_at DESC) = 1
    • Exact equality for agent_name comparisons (no LIKE)
    • NULL guards for agent_name fields
    • Fallback agent_name filter (admin-only) uses exact equality
Security Impact:
  • Before: LIKE partial matching, no dedupe guard, row multiplication risk
  • After: Exact equality, dedupe CTE, single row per agent_id, NULL guards
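
The dedupe pattern can be sketched as a parameterized query builder. This is an illustration under assumptions, not the contents of api/bigquery/queries.py: agent_report_view is a hypothetical view name, the selected columns are guesses, and wiring the parameters into the BigQuery client is left out. Only dah_one, dim_agent_hierarchy, and the QUALIFY expression come from the fix description.

```python
from typing import Dict, Tuple

AGENT_REPORT_SQL = """
WITH dah_one AS (
  -- Keep only the newest hierarchy row per (tenant_id, agent_id).
  SELECT tenant_id, agent_id, agent_name
  FROM dim_agent_hierarchy
  WHERE tenant_id = @tenant_id
    AND agent_id = @agent_id
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY tenant_id, agent_id ORDER BY created_at DESC
  ) = 1
)
SELECT r.*
FROM agent_report_view AS r  -- hypothetical view name
JOIN dah_one AS d
  ON r.tenant_id = d.tenant_id
 AND r.agent_id = d.agent_id
"""


def build_agent_report_query(tenant_id: str, agent_id: str) -> Tuple[str, Dict[str, str]]:
    """Return the SQL text and named parameters; both identifiers are mandatory."""
    if not tenant_id or not agent_id:
        raise ValueError("tenant_id and agent_id are both required")
    return AGENT_REPORT_SQL, {"tenant_id": tenant_id, "agent_id": agent_id}
```

Because dah_one yields at most one row for the authorized agent, the join cannot multiply report rows, and no agent name takes part in the filter.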

Key Commits (Grouped by Category)

Routing/Alias

  • Dual router mount: /api/v1/ai (canonical) + /ai (legacy)
  • Enhanced startup logging for route verification
  • Legacy alias removal plan documented

Frontend Normalization

  • aiClient.ts: URL normalization, call-time env validation
  • AISidebar.tsx: Added "use client" directive
  • env.local.template: Documented NEXT_PUBLIC_AI_URL format

RBAC Fixes

  • Demo mode bypass: Disabled in prod, restricted viewer in non-prod
  • AGENT_COMMISSION intent: Added authorize_target_agent_id() enforcement
  • CHURN intent: Added authorize_target_agent_id() enforcement
  • fetch_business_data(): Added guard requiring authorized_agent_id for agent/CEO

BigQuery Hardening

  • Dedupe CTE with QUALIFY ROW_NUMBER()
  • Removed LIKE partial matching (exact equality only)
  • Added NULL guards (COALESCE, IS NOT NULL)
  • Admin-only fallback path documented

Deployment Discovery

Critical Finding: Multiple deploy paths exist, but only ONE targets the canonical prod service.
  • Canonical PROD backend service: payroll-backend-prod
  • Wrong/legacy deploy path: GitHub Actions / Cloud Build config that deploys to payroll-pipeline-cbs or fastapi-backend (WRONG service for the prod URL)
  • Correct deploy path: Manual deploy command targeting payroll-backend-prod (the Cloud Run URL used by dashboards)
Action Required: Always verify which Cloud Run service the dashboard points to (NEXT_PUBLIC_AI_URL origin) before deploying.

Current Bug (To Debug Later)

Symptom

  • Agent “Kenny” gets 403 even when asking about his own data
  • Example: “how much did kenny young make this month” returns 403 Forbidden
  • Request hits canonical endpoint: POST https://payroll-backend-prod-evndxpcirq-uc.a.run.app/api/v1/ai/query-public

Likely Causes

  • A) agent_id enrichment missing/None: authorize_target_agent_id() denies or cannot resolve self
  • B) name→id resolution mismatch: resolve_agent_name_to_id() returns different agent_id than current_user.agent_id
  • C) tenant_id mismatch: JWT tenant_id doesn’t match request context
  • D) requested_target extraction: Returns different name than user’s agent_name
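
One way to narrow these candidates down is to log the relevant values for a failing request and compare them. The helper below is a hypothetical sketch, not part of the codebase; its parameter names mirror the fields mentioned above, and the name normalization (case-insensitive, whitespace-collapsed) is an assumption about how names should compare.

```python
from typing import List, Optional


def _norm(name: Optional[str]) -> str:
    """Collapse whitespace and ignore case when comparing names."""
    return " ".join((name or "").split()).casefold()


def diagnose_self_403(
    user_agent_id: Optional[str],
    user_agent_name: Optional[str],
    user_tenant_id: Optional[str],
    request_tenant_id: Optional[str],
    requested_target: Optional[str],
    resolved_agent_id: Optional[str],
) -> List[str]:
    """Return one finding per likely cause (A-D) that the inputs are consistent with."""
    findings = []
    if not user_agent_id:
        findings.append("A: current user has no agent_id (enrichment missing)")
    if user_agent_id and resolved_agent_id != user_agent_id:
        findings.append("B: name-to-id resolution differs from current_user.agent_id")
    if user_tenant_id != request_tenant_id:
        findings.append("C: JWT tenant_id does not match request context")
    if _norm(requested_target) != _norm(user_agent_name):
        findings.append("D: requested_target differs from the user's agent_name")
    return findings
```

An empty result for a request that still returns 403 would suggest the cause lies outside these four candidates.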

Debug Plan (Future)

See docs/AI_CHATBOT_DEBUG_NEXT.md for detailed debugging steps and log markers.

Verification Checklist

✅ Completed

  • Both endpoints (/api/v1/ai/query-public and /ai/query-public) return 401 without credentials rather than 404, confirming the routes exist
  • Router mount logs present in Cloud Run startup logs
  • Frontend calls canonical path (verified in Network tab)
  • RBAC enforcement active (demo mode disabled in prod, intent handlers enforce authorization)

⏳ Manual Verification Required

  • Auth gating: No token → 401, Invalid token → 401
  • RBAC matrix: Agent self/downline → 200, Agent non-downline → 403, Admin → 200
  • Row multiplication: Same query returns stable row counts

Related Documentation

  • docs/AI_CHATBOT_SECURITY.md - Security features and API contract
  • docs/RBAC_FIX_SUMMARY.md - Detailed RBAC fix summary
  • docs/DEPLOYMENT_RUNBOOK.md - Deployment procedures
  • docs/DASHBOARD_SMOKE_TEST.md - Smoke test checklist
  • docs/AI_CHATBOT_DEBUG_NEXT.md - Debug guide for 403-for-self bug
  • docs/DEPLOYMENT_VERIFICATION.md - Post-deployment verification steps

Lessons Learned

  1. Always verify deploy target: Check which Cloud Run service the dashboard points to before deploying
  2. Fail-closed RBAC: When in doubt, deny access (403) rather than allow (data leak)
  3. ID-based filtering: Always prefer agent_id over agent_name for data queries
  4. Call-time validation: Move env validation off module-init to prevent SSR/build crashes
  5. Legacy compatibility: Temporary aliases help with stale bundle transitions, but document removal plan

Next Steps

  1. ✅ Backend deployed with all fixes
  2. ⏳ Complete manual RBAC verification (auth gating, RBAC matrix, row multiplication)
  3. ⏳ Debug 403-for-self bug (see docs/AI_CHATBOT_DEBUG_NEXT.md)
  4. 📋 Plan cleanup: Remove legacy /ai mount after stale bundles expire (monitor usage first)