AI Chatbot Production Incident - December 24, 2025
Date: December 24, 2025
Status: ✅ Fixed (routing + RBAC hardening complete)
Remaining Issue: 403-for-self bug (documented for future debugging)
Executive Summary
Production incident involving AI chatbot RBAC bypass and Cloud Run 404 errors. Root causes identified and fixed:
- Routing Issue: Frontend calling wrong endpoint path, backend missing legacy alias mount
- RBAC Bypass: Multiple intent handlers querying data without authorization checks
- Name-based Data Leak: BigQuery queries using unsafe name filtering instead of agent_id
System Architecture
- Frontend: Next.js on Vercel (payroll-pipeline-cbs.vercel.app)
- Backend: FastAPI on Cloud Run (payroll-backend-prod service)
- Canonical AI Endpoint: /api/v1/ai/query-public
- Legacy Compat Endpoint: /ai/query-public (temporary, for stale bundles)
- RBAC Model: agent/CEO = self + downline, admin = tenant-wide
- Data Layer: BigQuery + Vertex AI (commission math/datasets unchanged)
Incident Timeline
Initial Symptoms
- Agent “Kenny Young” receiving answers about other agents (e.g., “Tommy Dang”) - RBAC violation
- Network tab showed chatbot calling Vercel host instead of Cloud Run
- Cloud Run endpoint /api/v1/ai/query-public returning 404
- Backend logs showed router not mounted despite main.py claiming it was
Root Causes Identified
- Frontend Routing: aiClient.ts hardcoded to wrong Cloud Run service (fastapi-backend instead of payroll-backend-prod)
- Backend 404: Router only mounted at /api/v1/ai, but stale frontend bundles calling /ai/query-public
- RBAC Bypass: Demo mode falling back to CEO role, intent handlers skipping authorization
- Name-based Leak: BigQuery queries using agent_name filtering with LIKE/partial matching
Fixes Implemented
1. Routing / 404 Fixes
Problem: Frontend requests were hitting /ai/query-public (missing the /api/v1 prefix), and the old backend revision did not mount the legacy alias.
Solution: Backend router mounted at BOTH:
- /api/v1/ai (canonical)
- /ai (legacy compat)
Implementation:
- api/main.py lines 274-280: Dual router mount using same router instance
- Same router instance ensures identical RBAC/auth behavior
- Legacy alias temporary, will be removed after stale bundles expire
Files changed:
- api/main.py: Added legacy alias mount with enhanced startup logging
Verification:
- Both endpoints return 401 (not 404) - endpoint exists, auth required
- Router mount logs present in Cloud Run startup logs
2. Frontend URL Normalization + Client-Only Safety
Problem:
- aiClient.ts hardcoded to wrong Cloud Run service
- Module-init env validation causing SSR/build fragility
- No guardrails for origin-only URL format
Solution:
- Added normalizeOrigin() function to extract origin from URL
- Added buildAiQueryPublicUrl() to consistently construct canonical endpoint
- Moved env validation from module-init to call-time (prevents SSR/build crashes)
- Added "use client" directive to AISidebar.tsx for clarity
Files changed:
- dashboard/src/lib/aiClient.ts:
  - normalizeOrigin(): Strips path components, returns origin-only
  - buildAiQueryPublicUrl(): Always constructs /api/v1/ai/query-public
  - getAiOriginOrThrow(): Call-time validation (not module-init)
  - Guardrail warning if NEXT_PUBLIC_AI_URL contains /api/ paths
- dashboard/src/components/AISidebar.tsx: Added "use client" directive
- dashboard/env.local.template: Documented NEXT_PUBLIC_AI_URL with correct format
Environment variable:
- NEXT_PUBLIC_AI_URL: Must be origin-only (e.g., https://payroll-backend-prod-evndxpcirq-uc.a.run.app)
- Production value: https://payroll-backend-prod-evndxpcirq-uc.a.run.app
- Never include /api/v1 or any path components
Verification:
- Browser console shows correct AI_BASE_URL log
- Network tab confirms requests to canonical path
- Build succeeds even if env var temporarily unset (call-time validation)
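The real helpers live in TypeScript (aiClient.ts) and are not reproduced here. The sketch below restates the same normalization and call-time validation idea in Python, to match the backend language used elsewhere in this report. The function names are Python-style stand-ins for normalizeOrigin(), buildAiQueryPublicUrl(), and getAiOriginOrThrow(), and the error types are assumptions.

```python
import os
from urllib.parse import urlsplit

AI_QUERY_PUBLIC_PATH = "/api/v1/ai/query-public"  # canonical path from this report


def normalize_origin(raw_url: str) -> str:
    """Return scheme://host[:port] only, dropping any path, query, or fragment."""
    parts = urlsplit(raw_url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {raw_url!r}")
    return f"{parts.scheme}://{parts.netloc}"


def build_ai_query_public_url(raw_url: str) -> str:
    """Always build the canonical endpoint, whatever path the input carried."""
    return normalize_origin(raw_url) + AI_QUERY_PUBLIC_PATH


def get_ai_origin_or_raise() -> str:
    """Read the env var when called, so importing this module never fails."""
    raw = os.environ.get("NEXT_PUBLIC_AI_URL", "")
    if not raw:
        raise RuntimeError("NEXT_PUBLIC_AI_URL is not set")
    return normalize_origin(raw)
```

Because the env var is read inside get_ai_origin_or_raise() rather than at import time, a build with the variable temporarily unset only fails if the function is actually called.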
3. RBAC Leak Fixes (Critical)
3.1 Demo Mode Bypass
Problem:
- When no valid JWT token present, endpoint fell back to demo mode with role: "ceo"
- CEO role allows tenant-wide access, bypassing all RBAC checks
- Even when Bearer token present but invalid, it would decode without verification
Solution:
- Rule 1: If Bearer token present but invalid → return 401 (no fallback)
- Rule 2: Demo mode only allowed when NO Authorization header is present
- Rule 3: Demo mode disabled in production (requires valid Bearer token)
- Rule 4: Non-prod demo mode uses restricted viewer role (not CEO)
Files changed:
- api/routes/ai_query.py lines 3310-3399:
  - Strict Bearer token validation (401 on invalid token)
  - Production requires authentication (no demo fallback)
  - Non-prod demo uses viewer role (restricted scope)
Impact:
- Before: Demo mode → tenant-wide access
- After: Demo mode → restricted viewer (non-prod only), prod requires auth
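The four rules above can be sketched as a single decision function. This is an illustration only, not the code in api/routes/ai_query.py: AuthError, Principal, resolve_principal(), and the verify_token callback are hypothetical names, and real token verification (signature, expiry) is assumed to happen inside the callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class AuthError(Exception):
    """Carries the HTTP status the endpoint would return (hypothetical type)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code


@dataclass(frozen=True)
class Principal:
    role: str
    agent_id: Optional[str] = None


def resolve_principal(
    authorization: Optional[str],
    is_production: bool,
    verify_token: Callable[[str], Optional[Principal]],
) -> Principal:
    """Apply Rules 1-4; verify_token returns None for any invalid token."""
    if authorization is not None:
        scheme, _, token = authorization.partition(" ")
        is_bearer = scheme.lower() == "bearer" and bool(token)
        principal = verify_token(token) if is_bearer else None
        if principal is None:
            # Rule 1: a header that is present but invalid never falls back to demo.
            raise AuthError(401, "invalid or unverifiable token")
        return principal
    if is_production:
        # Rule 3: production requires a valid Bearer token.
        raise AuthError(401, "authentication required")
    # Rules 2 and 4: demo only without a header, and only as a restricted viewer.
    return Principal(role="viewer")
```

The header check comes first on purpose: once any Authorization header is present, the only outcomes are a verified principal or a 401, so a malformed token can no longer reach the demo path.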
3.2 AGENT_COMMISSION Intent Bypass
Problem:
- AGENT_COMMISSION intent handler extracted agent_name from question
- Called fetch_agent_data() directly without RBAC enforcement
- Allowed any authenticated user to query any agent’s commission data
Solution:
- Added RBAC enforcement before any data fetch
- Extracts requested_target (agent name) from question
- Calls authorize_target_agent_id() to check authorization (downline-aware for agents)
- Only queries data for authorized_agent_id
- Returns 403 if user is not authorized
Files changed:
- api/routes/ai_query.py AGENT_COMMISSION handler: Added authorize_target_agent_id() call before data fetch
Impact:
- Before: Any agent queryable by any authenticated user
- After: Only self + downline queryable (for agent role)
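A minimal sketch of a downline-aware check, following the RBAC model stated in System Architecture (agent/CEO = self + downline, admin = tenant-wide). The signature, the Forbidden exception, and the in-memory children map are assumptions for illustration; the real authorize_target_agent_id() resolves names and hierarchy from the data layer and may differ.

```python
from typing import Dict, Optional, Set


class Forbidden(Exception):
    """Stand-in for an HTTP 403 response (hypothetical type)."""

    status_code = 403


def collect_downline(root_id: str, children: Dict[str, Set[str]]) -> Set[str]:
    """All agent_ids below root_id, walking the hierarchy iteratively."""
    seen: Set[str] = set()
    stack = [root_id]
    while stack:
        for child in children.get(stack.pop(), set()):
            if child not in seen:  # also protects against cycles in bad data
                seen.add(child)
                stack.append(child)
    return seen


def authorize_target_agent_id(
    role: str,
    user_agent_id: Optional[str],
    target_agent_id: Optional[str],
    children: Dict[str, Set[str]],
) -> str:
    """Return the agent_id the caller may query, or raise Forbidden (fail-closed)."""
    if target_agent_id is None:
        raise Forbidden("target could not be resolved")
    if role == "admin":
        return target_agent_id
    if role in ("agent", "ceo") and user_agent_id is not None:
        allowed = {user_agent_id} | collect_downline(user_agent_id, children)
        if target_agent_id in allowed:
            return target_agent_id
    raise Forbidden("target is outside the caller's scope")
```

The handler then passes only the returned authorized_agent_id to the data fetch, so an unauthorized name extracted from the question never reaches a query.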
3.3 CHURN Intent Bypass
Problem:
- CHURN intent handler querying data without RBAC checks
- Similar to AGENT_COMMISSION bypass
Solution:
- Added explicit authorize_target_agent_id() call before data fetching
- Same RBAC enforcement pattern as AGENT_COMMISSION
Files changed:
- api/routes/ai_query.py CHURN handler: Added authorize_target_agent_id() call before data fetch
3.4 fetch_business_data() Guard
Problem:
- fetch_business_data() passing authorized_agent_id=None for agent/CEO roles
- Falling back to less secure agent_name filtering
Solution:
- Enforced authorized_agent_id requirement for agent/CEO roles
- Changed status code from 500 to 403 (correct classification)
- Fail-closed: raises 403 if agent/CEO role missing authorized_agent_id
Files changed:
- api/routes/ai_query.py fetch_business_data(): Added guard requiring authorized_agent_id for agent/CEO roles
- Status code changed to HTTP_403_FORBIDDEN (was 500)
Impact:
- Before: Silent fallback to name-based filtering
- After: Fail-closed with 403 (no data leak)
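The fail-closed guard can be sketched in a few lines. The helper name require_authorized_agent_id(), the SCOPED_ROLES constant, and the Forbidden exception are hypothetical; the actual guard sits inside fetch_business_data() and raises an HTTP 403.

```python
from typing import Optional

# Roles whose queries must be scoped to one authorized agent_id (per this report).
SCOPED_ROLES = frozenset({"agent", "ceo"})


class Forbidden(Exception):
    """Stand-in for an HTTP 403 response (hypothetical type)."""

    status_code = 403


def require_authorized_agent_id(
    role: str, authorized_agent_id: Optional[str]
) -> Optional[str]:
    """Refuse to continue when a scoped role arrives without an agent_id.

    Raising here replaces the old behaviour of silently falling back to
    agent_name filtering.
    """
    if role in SCOPED_ROLES and not authorized_agent_id:
        raise Forbidden(f"role {role!r} requires authorized_agent_id")
    return authorized_agent_id
```

Treating a missing ID as 403 rather than 500 classifies it as an authorization failure, and it guarantees the name-based path is unreachable for agent/CEO callers.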
4. Name-based Leak Hardening in BigQuery Querying
Problem:
- get_agent_report_from_view() using agent_name filtering with LIKE/partial matching
- Risk of name collisions, partial matches, row multiplication
- No deduplication guard for dim_agent_hierarchy rows
Solution:
- Added dedupe CTE (dah_one) with QUALIFY ROW_NUMBER() to ensure single row per (tenant_id, agent_id)
- Removed LIKE partial matching (exact equality only)
- Added NULL guards (COALESCE, IS NOT NULL) for correctness
- Join surface constrained to single authorized agent row
Files changed:
- api/bigquery/queries.py get_agent_report_from_view():
  - CTE dah_one with QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id, agent_id ORDER BY created_at DESC) = 1
  - Exact equality for agent_name comparisons (no LIKE)
  - NULL guards for agent_name fields
  - Fallback agent_name filter (admin-only) uses exact equality
Impact:
- Before: LIKE partial matching, no dedupe guard, row multiplication risk
- After: Exact equality, dedupe CTE, single row per agent_id, NULL guards
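To make the dedupe semantics concrete, the sketch below emulates in plain Python what the dah_one CTE does in SQL: keep one row per (tenant_id, agent_id), preferring the newest created_at. It is an illustration of the technique, not the query itself; dedupe_latest() and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_latest(rows: Iterable[dict]) -> List[dict]:
    """Keep the newest row per (tenant_id, agent_id).

    Mirrors QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id, agent_id
    ORDER BY created_at DESC) = 1. On equal created_at values the first
    row seen wins here; in SQL the choice among ties is not guaranteed.
    """
    latest: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        key = (row["tenant_id"], row["agent_id"])
        if key not in latest or row["created_at"] > latest[key]["created_at"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical hierarchy rows: agent a1 appears twice, which would double
# every joined fact row if the table were joined without deduplication.
hierarchy = [
    {"tenant_id": "t1", "agent_id": "a1", "created_at": "2025-01-01", "upline": "old"},
    {"tenant_id": "t1", "agent_id": "a1", "created_at": "2025-06-01", "upline": "new"},
    {"tenant_id": "t1", "agent_id": "a2", "created_at": "2025-03-01", "upline": "new"},
]
deduped = dedupe_latest(hierarchy)
```

With one row per agent on the hierarchy side, a join on (tenant_id, agent_id) cannot multiply fact rows, which is the property the "stable row counts" check in the verification checklist is meant to confirm.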
Key Commits (Grouped by Category)
Routing/Alias
- Dual router mount: /api/v1/ai (canonical) + /ai (legacy)
- Enhanced startup logging for route verification
- Legacy alias removal plan documented
Frontend Normalization
- aiClient.ts: URL normalization, call-time env validation
- AISidebar.tsx: Added "use client" directive
- env.local.template: Documented NEXT_PUBLIC_AI_URL format
RBAC Fixes
- Demo mode bypass: Disabled in prod, restricted viewer in non-prod
- AGENT_COMMISSION intent: Added authorize_target_agent_id() enforcement
- CHURN intent: Added authorize_target_agent_id() enforcement
- fetch_business_data(): Added guard requiring authorized_agent_id for agent/CEO
BigQuery Hardening
- Dedupe CTE with QUALIFY ROW_NUMBER()
- Removed LIKE partial matching (exact equality only)
- Added NULL guards (COALESCE, IS NOT NULL)
- Admin-only fallback path documented
Deployment Discovery
Critical Finding: Multiple deploy paths exist, but only ONE targets the canonical prod service.
- Canonical PROD backend service: payroll-backend-prod
- Wrong/legacy deploy path: GitHub Actions / Cloud Build config that deploys to payroll-pipeline-cbs or fastapi-backend (WRONG service for the prod URL)
- Correct deploy path: Manual deploy command targeting payroll-backend-prod (the Cloud Run URL used by dashboards)
Always verify that the deploy target matches the service the dashboard points to (the NEXT_PUBLIC_AI_URL origin) before deploying.
Current Bug (To Debug Later)
Symptom
- Agent “Kenny” gets 403 even when asking about his own data
- Example: “how much did kenny young make this month” returns 403 Forbidden
- Request hits canonical endpoint: POST https://payroll-backend-prod-evndxpcirq-uc.a.run.app/api/v1/ai/query-public
Likely Causes
- A) agent_id enrichment missing/None: authorize_target_agent_id() denies or cannot resolve self
- B) name→id resolution mismatch: resolve_agent_name_to_id() returns different agent_id than current_user.agent_id
- C) tenant_id mismatch: JWT tenant_id doesn’t match request context
- D) requested_target extraction: Returns different name than user’s agent_name
Debug Plan (Future)
See docs/AI_CHATBOT_DEBUG_NEXT.md for detailed debugging steps and log markers.
Verification Checklist
✅ Completed
- Both endpoints return a non-404 response (/api/v1/ai/query-public and /ai/query-public)
- Router mount logs present in Cloud Run startup logs
- Frontend calls canonical path (verified in Network tab)
- RBAC enforcement active (demo mode disabled in prod, intent handlers enforce authorization)
⏳ Manual Verification Required
- Auth gating: No token → 401, Invalid token → 401
- RBAC matrix: Agent self/downline → 200, Agent non-downline → 403, Admin → 200
- Row multiplication: Same query returns stable row counts
Related Documentation
- docs/AI_CHATBOT_SECURITY.md - Security features and API contract
- docs/RBAC_FIX_SUMMARY.md - Detailed RBAC fix summary
- docs/DEPLOYMENT_RUNBOOK.md - Deployment procedures
- docs/DASHBOARD_SMOKE_TEST.md - Smoke test checklist
- docs/AI_CHATBOT_DEBUG_NEXT.md - Debug guide for 403-for-self bug
- docs/DEPLOYMENT_VERIFICATION.md - Post-deployment verification steps
Lessons Learned
- Always verify deploy target: Check which Cloud Run service dashboard points to before deploying
- Fail-closed RBAC: When in doubt, deny access (403) rather than allow (data leak)
- ID-based filtering: Always prefer agent_id over agent_name for data queries
- Call-time validation: Move env validation off module-init to prevent SSR/build crashes
- Legacy compatibility: Temporary aliases help with stale bundle transitions, but document removal plan
Next Steps
- ✅ Backend deployed with all fixes
- ⏳ Complete manual RBAC verification (auth gating, RBAC matrix, row multiplication)
- ⏳ Debug 403-for-self bug (see docs/AI_CHATBOT_DEBUG_NEXT.md)
- 📋 Plan cleanup: Remove legacy /ai mount after stale bundles expire (monitor usage first)