P0 Discovery Blocker Fixes (v2) - Implementation Summary
Objective
Fix P0 Business Discovery so every payroll business from an upload is materialized into:dim_business_mapping(seen dates)config_business_onboarding(registry) BEFORE readiness runs, ensuring “Not ready for ingest” is complete.
Blockers Fixed
BLOCKER 1 — Schema Alignment ✅
Problem: Initial implementation attempted to writefirst_seen_date/last_seen_date into config_business_onboarding, which doesn’t have these columns.
Solution: Implemented discovery as TWO DISTINCT UPSERTS:
- A)
dim_business_mapping: Upserts withbusiness_id,tenant_id,normalized_name,business_label,first_seen_date,last_seen_date(DATE, month-start normalized) - B)
config_business_onboarding: Upserts registry rows ONLY (no date fields)
api/bigquery/business_onboarding_queries.py-discover_businesses_from_upload()function
BLOCKER 2 — Date Cast Safety ✅
Problem:transaction_events_raw.period_label is STRING, not DATE. Plain CAST can crash; SAFE_CAST alone can silently null.
Solution:
- Use
SAFE_CAST(TRIM(period_label) AS DATE)with explicit invalid count tracking - Log warning if
invalid_period_count > 0 - Include
invalid_period_countin response and debug endpoint - Exclude invalid periods from aggregation (WHERE
period_date IS NOT NULL)
api/bigquery/business_onboarding_queries.py- Discovery query SQLapi/routes/intake_processor.py- Logging for invalid periods
BLOCKER 3 — Tenant + Org Isolation ✅
Problem: Discovery queries did not verify batch ownership, posing cross-tenant enumeration risk. Solution:- Verify batch ownership BEFORE discovery via
get_ingestion_batch(batch_id, tenant_id) - Return 404 if batch not found (no existence leak)
- JOIN
ingestion_batchesin discovery query whenbatch_idis provided - All queries scoped by
tenant_idANDbatch_id(when available)
api/bigquery/business_onboarding_queries.py- Batch ownership verification + JOIN clauseapi/routes/business_onboarding.py- Debug endpoint batch ownership check
BLOCKER 4 — Debug Endpoint Hardening ✅
Problem: Debug endpoint lacked batch ownership verification. Solution:- Verify batch ownership BEFORE discovery (404 if not found)
- Auth required (admin/CEO only via
require_admin_or_ceo) - Tenant scoped (via
get_tenant_id) - Response includes
invalid_period_count
api/routes/business_onboarding.py-debug_discover_businesses_endpoint()
SEALED BUSINESS_ID RULE ✅
Decision: Keep existing sealed pattern (MD5(normalized_name) only).
Rationale:
- Tenant isolation enforced by query keys (
tenant_id + normalized_name), not hash mutation - Changing hash inputs would break existing data integrity
- No schema migrations required
- None (pattern preserved)
Tests Added
File:api/tests/test_business_discovery_p0.py
- Normalization determinism: Same input → same normalized output
- Business_id determinism: Sealed MD5 pattern (no tenant_id/org_id in hash)
- Discovery idempotency: Same batch twice = no duplicate rows
- Invalid period detection:
invalid_period_count > 0surfaced - Debug endpoint security: 401 unauth; 404 wrong-tenant batch
Response Structure Changes
Before:Deployment + Validation Runbook
After deploy:- Upload Dec payroll → capture
batch_id - Call debug endpoint:
- Verify missing businesses appear (AIC, Motor Medics, Auto Intensive Care of Savannah)
- Verify
invalid_period_countshown
- Confirm BigQuery:
dim_business_mappinghas thosebusiness_idswithlast_seen_datein Decconfig_business_onboardingcontains registry rows for those ids
- Refresh Preflight/readiness:
- Those businesses now appear as NOT READY with reasons
- Repeat refresh:
- No duplicates (idempotency verified)
Files Modified
-
api/bigquery/business_onboarding_queries.py- Complete rewrite of
discover_businesses_from_upload() - Two upserts (dim_business_mapping + config_business_onboarding)
- SAFE_CAST with invalid count tracking
- Batch ownership verification
- JOIN with ingestion_batches
- Complete rewrite of
-
api/routes/business_onboarding.py- Debug endpoint batch ownership verification
- 404 handling for cross-tenant batches
-
api/routes/intake_processor.py- Updated logging to use new response structure
- Warning log for invalid periods
-
api/tests/test_business_discovery_p0.py(NEW)- Comprehensive test suite for all blockers
Security Guarantees
- ✅ Tenant isolation enforced (batch ownership verification)
- ✅ No cross-tenant enumeration (404 for unowned batches)
- ✅ Admin/CEO only debug endpoint
- ✅ No silent failures (invalid periods logged + surfaced)
Performance Guarantees
- ✅ Discovery filters by
batch_id(no full-table scans) - ✅ DISTINCT applied after tenant + batch filters
- ✅ Idempotent MERGE operations (no duplicate rows)
Audit Trail
- All discovery operations logged with tenant_id, batch_id, counts
- Invalid periods explicitly counted and logged
- Batch ownership verification logged (warnings for failures)