AI Chatbot Production Incident - December 24, 2025

Date: December 24, 2025
Status: ✅ Fixed (routing + RBAC hardening complete)
Remaining Issue: 403-for-self bug (documented for future debugging)

Executive Summary

Production incident in which the AI chatbot bypassed RBAC and the Cloud Run AI endpoint returned 404. Three root causes were identified and fixed:

Routing Issue: Frontend calling the wrong Cloud Run service and endpoint path; backend missing the legacy alias mount
RBAC Bypass: Multiple intent handlers querying data without authorization checks
Name-based Data Leak: BigQuery queries using unsafe name filtering instead of agent_id

All critical fixes deployed. One non-critical bug remains (403 for self-queries) and is documented for future debugging.

System Architecture

Frontend: Next.js on Vercel (payroll-pipeline-cbs.vercel.app)
Backend: FastAPI on Cloud Run (payroll-backend-prod service)
Canonical AI Endpoint: /api/v1/ai/query-public
Legacy Compat Endpoint: /ai/query-public (temporary, for stale bundles)
RBAC Model: agent/CEO = self + downline, admin = tenant-wide
Data Layer: BigQuery + Vertex AI (commission math/datasets unchanged)

Incident Timeline

Initial Symptoms

Agent “Kenny Young” receiving answers about other agents (e.g., “Tommy Dang”) - RBAC violation
Network tab showed chatbot calling Vercel host instead of Cloud Run
Cloud Run endpoint /api/v1/ai/query-public returning 404
Backend logs showed router not mounted despite main.py claiming it was

Root Causes Identified

Frontend Routing: aiClient.ts hardcoded to wrong Cloud Run service (fastapi-backend instead of payroll-backend-prod)
Backend 404: Router only mounted at /api/v1/ai, but stale frontend bundles calling /ai/query-public
RBAC Bypass: Demo mode falling back to CEO role, intent handlers skipping authorization
Name-based Leak: BigQuery queries using agent_name filtering with LIKE/partial matching

Fixes Implemented

1. Routing / 404 Fixes

Problem:

Frontend requests hitting /ai/query-public (missing the /api/v1 prefix)
Older backend revision did not mount the legacy alias, so those requests returned 404

Solution:

Backend router mounted at both prefixes:

/api/v1/ai (canonical)
/ai (legacy compat)

Code Changes:

api/main.py lines 274-280: Dual router mount using same router instance
Same router instance ensures identical RBAC/auth behavior
Legacy alias temporary, will be removed after stale bundles expire

Files Modified:

api/main.py: Added legacy alias mount with enhanced startup logging

Verification:

Both endpoints return 401 (not 404) - endpoint exists, auth required
Router mount logs present in Cloud Run startup logs

2. Frontend URL Normalization + Client-Only Safety

Problem:

aiClient.ts hardcoded to wrong Cloud Run service
Module-init env validation causing SSR/build fragility
No guardrails for origin-only URL format

Solution:

Added normalizeOrigin() function to extract origin from URL
Added buildAiQueryPublicUrl() to consistently construct canonical endpoint
Moved env validation from module-init to call-time (prevents SSR/build crashes)
Added "use client" directive to AISidebar.tsx for clarity

Code Changes:

dashboard/src/lib/aiClient.ts:
- normalizeOrigin(): Strips path components, returns origin-only
- buildAiQueryPublicUrl(): Always constructs /api/v1/ai/query-public
- getAiOriginOrThrow(): Call-time validation (not module-init)
- Guardrail warning if NEXT_PUBLIC_AI_URL contains /api/ paths
dashboard/src/components/AISidebar.tsx: Added "use client" directive
dashboard/env.local.template: Documented NEXT_PUBLIC_AI_URL with correct format

Environment Variable:

NEXT_PUBLIC_AI_URL: Must be origin-only (e.g., https://payroll-backend-prod-evndxpcirq-uc.a.run.app)
Production value: https://payroll-backend-prod-evndxpcirq-uc.a.run.app
Never include /api/v1 or any path components

Verification:

Browser console shows correct AI_BASE_URL log
Network tab confirms requests to canonical path
Build succeeds even if env var temporarily unset (call-time validation)
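
The normalization rule can be illustrated as follows. The shipped code is TypeScript in aiClient.ts; this is only a Python rendering of the same idea, kept in Python to match the backend examples in this report. The function names mirror normalizeOrigin() and buildAiQueryPublicUrl(), but the bodies are assumptions, not the actual implementation.

```python
from urllib.parse import urlsplit

CANONICAL_AI_PATH = "/api/v1/ai/query-public"


def normalize_origin(raw_url: str) -> str:
    """Return scheme://host[:port] only, dropping any path, query, or fragment."""
    parts = urlsplit(raw_url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {raw_url!r}")
    return f"{parts.scheme}://{parts.netloc}"


def build_ai_query_public_url(raw_url: str) -> str:
    """Append the canonical path to the normalized origin.

    Validation happens here, at call time, rather than at import time.
    """
    return normalize_origin(raw_url) + CANONICAL_AI_PATH
```

With this shape, a misconfigured value such as an origin followed by /api/v1 still resolves to the canonical endpoint rather than to a doubled path.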

3. RBAC Leak Fixes (Critical)

3.1 Demo Mode Bypass

Problem:

When no valid JWT token present, endpoint fell back to demo mode with role: "ceo"
CEO role allows tenant-wide access, bypassing all RBAC checks
Even when a Bearer token was present but invalid, the endpoint decoded it without verification instead of rejecting it

Solution:

Rule 1: If Bearer token present but invalid → return 401 (no fallback)
Rule 2: Demo mode only allowed when NO Authorization header is present
Rule 3: Demo mode disabled in production (requires valid Bearer token)
Rule 4: Non-prod demo mode uses restricted viewer role (not CEO)

Code Changes:

api/routes/ai_query.py lines 3310-3399:
- Strict Bearer token validation (401 on invalid token)
- Production requires authentication (no demo fallback)
- Non-prod demo uses viewer role (restricted scope)

Security Impact:

Before: Demo mode → tenant-wide access
After: Demo mode → restricted viewer (non-prod only), prod requires auth
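
The four rules can be sketched as a single decision function. This is an illustration under stated assumptions, not the code in api/routes/ai_query.py: UserContext, Unauthorized, resolve_user, and the verify_token callback are hypothetical names, and token verification itself is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class Unauthorized(Exception):
    """Stands in for an HTTP 401 response."""


@dataclass(frozen=True)
class UserContext:
    role: str
    agent_id: Optional[str]


def resolve_user(
    authorization: Optional[str],
    is_production: bool,
    verify_token: Callable[[str], Optional[UserContext]],
) -> UserContext:
    if authorization is not None:
        # Rule 1: a header was sent, so it must carry a valid Bearer token.
        scheme, _, token = authorization.partition(" ")
        user = verify_token(token) if scheme.lower() == "bearer" and token else None
        if user is None:
            raise Unauthorized("invalid or malformed Authorization header")
        return user
    # Rule 2: demo mode is only considered when no header was sent at all.
    if is_production:
        # Rule 3: production never falls back to demo mode.
        raise Unauthorized("authentication required")
    # Rule 4: non-prod demo mode gets a restricted viewer, never CEO.
    return UserContext(role="viewer", agent_id=None)
```

The key property is that an invalid token can never reach the demo branch: the presence of the header alone commits the request to the authenticated path.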

3.2 AGENT_COMMISSION Intent Bypass

Problem:

AGENT_COMMISSION intent handler extracted agent_name from question
Called fetch_agent_data() directly without RBAC enforcement
Allowed any authenticated user to query any agent’s commission data

Solution:

Added RBAC enforcement before any data fetch
Extracts requested_target (agent name) from question
Calls authorize_target_agent_id() to check authorization (downline-aware for agents)
Only queries data for authorized_agent_id
Returns 403 if user is not authorized

Code Changes:

api/routes/ai_query.py AGENT_COMMISSION handler: Added authorize_target_agent_id() call before data fetch

Security Impact:

Before: Any agent queryable by any authenticated user
After: Only self + downline queryable (for agent role)
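
A minimal sketch of a downline-aware check in the spirit of authorize_target_agent_id(). The signature, the Forbidden exception, and the downline mapping (each agent_id mapped to the full set of agent_ids beneath it) are assumptions made for illustration; the real function may resolve downlines differently.

```python
from typing import Dict, Optional, Set


class Forbidden(Exception):
    """Stands in for an HTTP 403 response."""


def authorize_target_agent_id(
    role: str,
    user_agent_id: Optional[str],
    target_agent_id: Optional[str],
    downline: Dict[str, Set[str]],
) -> str:
    """Return the agent_id the caller may query, or raise Forbidden."""
    # No explicit target means the user is asking about themselves.
    target = target_agent_id or user_agent_id
    if target is None:
        raise Forbidden("no resolvable target agent")
    if role == "admin":
        return target  # tenant-wide access
    if role in ("agent", "ceo") and user_agent_id is not None:
        if target == user_agent_id or target in downline.get(user_agent_id, set()):
            return target  # self or downline
    # Fail closed for every other combination, including unknown roles.
    raise Forbidden(f"role {role!r} may not query agent {target!r}")
```

The handler would then fetch data only for the returned id, never for the name extracted from the question.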

3.3 CHURN Intent Bypass

Problem:

CHURN intent handler querying data without RBAC checks
Similar to AGENT_COMMISSION bypass

Solution:

Added explicit authorize_target_agent_id() call before data fetching
Same RBAC enforcement pattern as AGENT_COMMISSION

Code Changes:

api/routes/ai_query.py CHURN handler: Added authorize_target_agent_id() call before data fetch

3.4 fetch_business_data() Guard

Problem:

fetch_business_data() passing authorized_agent_id=None for agent/CEO roles
Falling back to less secure agent_name filtering

Solution:

Enforced authorized_agent_id requirement for agent/CEO roles
Changed status code from 500 to 403 (correct classification)
Fail-closed: raises 403 if agent/CEO role missing authorized_agent_id

Code Changes:

api/routes/ai_query.py fetch_business_data(): Added guard requiring authorized_agent_id for agent/CEO roles
Status code changed to HTTP_403_FORBIDDEN (was 500)

Security Impact:

Before: Silent fallback to name-based filtering
After: Fail-closed with 403 (no data leak)
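
The fail-closed guard might look like the sketch below. The helper name require_authorized_agent_id, the SCOPED_ROLES set, and the Forbidden exception are hypothetical; in the FastAPI handler the equivalent would raise an HTTP 403.

```python
from typing import Optional


class Forbidden(Exception):
    """Stands in for an HTTP 403 response."""


# Roles whose queries must always be scoped to a specific agent_id.
SCOPED_ROLES = frozenset({"agent", "ceo"})


def require_authorized_agent_id(
    role: str, authorized_agent_id: Optional[str]
) -> Optional[str]:
    """Refuse to continue when a scoped role arrives without an agent_id."""
    if role in SCOPED_ROLES and not authorized_agent_id:
        # Deny rather than fall back to name-based filtering.
        raise Forbidden("authorized_agent_id is required for agent/CEO roles")
    return authorized_agent_id
```

Calling this at the top of the data-fetch function means a missing id surfaces as a denial instead of a wider query.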

4. Name-based Leak Hardening in BigQuery Querying

Problem:

get_agent_report_from_view() using agent_name filtering with LIKE/partial matching
Risk of name collisions, partial matches, row multiplication
No deduplication guard for dim_agent_hierarchy rows

Solution:

Added dedupe CTE (dah_one) with QUALIFY ROW_NUMBER() to ensure single row per (tenant_id, agent_id)
Removed LIKE partial matching (exact equality only)
Added NULL guards (COALESCE, IS NOT NULL) for correctness
Join surface constrained to single authorized agent row

Code Changes:

api/bigquery/queries.py get_agent_report_from_view():
- CTE dah_one with QUALIFY ROW_NUMBER() OVER (PARTITION BY tenant_id, agent_id ORDER BY created_at DESC) = 1
- Exact equality for agent_name comparisons (no LIKE)
- NULL guards for agent_name fields
- Fallback agent_name filter (admin-only) uses exact equality

Security Impact:

Before: LIKE partial matching, no dedupe guard, row multiplication risk
After: Exact equality, dedupe CTE, single row per agent_id, NULL guards
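
The query shape described above can be sketched as a parameterized SQL string. The table and view paths, the column list, and the join on agent_name are assumptions for illustration; only the dah_one CTE name and the QUALIFY clause come from this report.

```python
# Hypothetical table/view paths; the named parameters (@tenant_id,
# @authorized_agent_id) would be bound by the BigQuery client at call time.
AGENT_REPORT_SQL = """
WITH dah_one AS (
  -- Exactly one hierarchy row per (tenant_id, agent_id): the newest one.
  SELECT tenant_id, agent_id, agent_name
  FROM `project.dataset.dim_agent_hierarchy`
  WHERE tenant_id = @tenant_id
    AND agent_id = @authorized_agent_id
    AND agent_name IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY tenant_id, agent_id ORDER BY created_at DESC
  ) = 1
)
SELECT r.*
FROM `project.dataset.agent_report_view` AS r
JOIN dah_one AS d
  ON r.tenant_id = d.tenant_id
 AND r.agent_name = d.agent_name  -- exact equality, no partial matching
"""


def build_agent_report_params(tenant_id: str, authorized_agent_id: str) -> dict:
    """Build the named-parameter values, refusing empty inputs."""
    if not tenant_id or not authorized_agent_id:
        raise ValueError("tenant_id and authorized_agent_id are both required")
    return {"tenant_id": tenant_id, "authorized_agent_id": authorized_agent_id}
```

Because the CTE yields at most one row, the join cannot multiply report rows even if dim_agent_hierarchy contains several versions of the same agent.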

Key Commits (Grouped by Category)

Routing/Alias

Dual router mount: /api/v1/ai (canonical) + /ai (legacy)
Enhanced startup logging for route verification
Legacy alias removal plan documented

Frontend Normalization

aiClient.ts: URL normalization, call-time env validation
AISidebar.tsx: Added "use client" directive
env.local.template: Documented NEXT_PUBLIC_AI_URL format

RBAC Fixes

Demo mode bypass: Disabled in prod, restricted viewer in non-prod
AGENT_COMMISSION intent: Added authorize_target_agent_id() enforcement
CHURN intent: Added authorize_target_agent_id() enforcement
fetch_business_data(): Added guard requiring authorized_agent_id for agent/CEO

BigQuery Hardening

Dedupe CTE with QUALIFY ROW_NUMBER()
Removed LIKE partial matching (exact equality only)
Added NULL guards (COALESCE, IS NOT NULL)
Admin-only fallback path documented

Deployment Discovery

Critical Finding: Multiple deploy paths exist, but only ONE targets the canonical prod service.

Canonical PROD backend service: payroll-backend-prod
Wrong/legacy deploy path: GitHub Actions / Cloud Build config that deploys to payroll-pipeline-cbs or fastapi-backend (WRONG service for the prod URL)
Correct deploy path: Manual deploy command targeting payroll-backend-prod (the Cloud Run URL used by dashboards)

Action Required: Always verify which Cloud Run service the dashboard points to (NEXT_PUBLIC_AI_URL origin) before deploying.

Current Bug (To Debug Later)

Symptom

Agent “Kenny” gets 403 even when asking about his own data
Example: “how much did kenny young make this month” returns 403 Forbidden
Request hits canonical endpoint: POST https://payroll-backend-prod-evndxpcirq-uc.a.run.app/api/v1/ai/query-public

Likely Causes

A) agent_id enrichment missing/None: authorize_target_agent_id() denies or cannot resolve self
B) name→id resolution mismatch: resolve_agent_name_to_id() returns different agent_id than current_user.agent_id
C) tenant_id mismatch: JWT tenant_id doesn’t match request context
D) requested_target extraction: Returns different name than user’s agent_name

Debug Plan (Future)

See docs/AI_CHATBOT_DEBUG_NEXT.md for detailed debugging steps and log markers.

Verification Checklist

✅ Completed

Both endpoints return 401 rather than 404 without credentials (/api/v1/ai/query-public and /ai/query-public)
Router mount logs present in Cloud Run startup logs
Frontend calls canonical path (verified in Network tab)
RBAC enforcement active (demo mode disabled in prod, intent handlers enforce authorization)

⏳ Manual Verification Required

Auth gating: No token → 401, Invalid token → 401
RBAC matrix: Agent self/downline → 200, Agent non-downline → 403, Admin → 200
Row multiplication: Same query returns stable row counts

Related Documentation

docs/AI_CHATBOT_SECURITY.md - Security features and API contract
docs/RBAC_FIX_SUMMARY.md - Detailed RBAC fix summary
docs/DEPLOYMENT_RUNBOOK.md - Deployment procedures
docs/DASHBOARD_SMOKE_TEST.md - Smoke test checklist
docs/AI_CHATBOT_DEBUG_NEXT.md - Debug guide for 403-for-self bug
docs/DEPLOYMENT_VERIFICATION.md - Post-deployment verification steps

Lessons Learned

Always verify deploy target: Check which Cloud Run service dashboard points to before deploying
Fail-closed RBAC: When in doubt, deny access (403) rather than allow (data leak)
ID-based filtering: Always prefer agent_id over agent_name for data queries
Call-time validation: Move env validation off module-init to prevent SSR/build crashes
Legacy compatibility: Temporary aliases help with stale bundle transitions, but document removal plan

Next Steps

✅ Backend deployed with all fixes
⏳ Complete manual RBAC verification (auth gating, RBAC matrix, row multiplication)
⏳ Debug 403-for-self bug (see docs/AI_CHATBOT_DEBUG_NEXT.md)
📋 Plan cleanup: Remove legacy /ai mount after stale bundles expire (monitor usage first)
