Skip to main content

P0 Discovery Blocker Fixes (v2) - Implementation Summary

Objective

Fix P0 Business Discovery so every payroll business from an upload is materialized into:
  • dim_business_mapping (seen dates)
  • config_business_onboarding (registry) BEFORE readiness runs, ensuring “Not ready for ingest” is complete.

Blockers Fixed

BLOCKER 1 — Schema Alignment ✅

Problem: Initial implementation attempted to write first_seen_date/last_seen_date into config_business_onboarding, which doesn’t have these columns. Solution: Implemented discovery as TWO DISTINCT UPSERTS:
  • A) dim_business_mapping: Upserts with business_id, tenant_id, normalized_name, business_label, first_seen_date, last_seen_date (DATE, month-start normalized)
  • B) config_business_onboarding: Upserts registry rows ONLY (no date fields)
Files Changed:
  • api/bigquery/business_onboarding_queries.py - discover_businesses_from_upload() function

BLOCKER 2 — Date Cast Safety ✅

Problem: transaction_events_raw.period_label is STRING, not DATE. Plain CAST can crash; SAFE_CAST alone can silently null. Solution:
  • Use SAFE_CAST(TRIM(period_label) AS DATE) with explicit invalid count tracking
  • Log warning if invalid_period_count > 0
  • Include invalid_period_count in response and debug endpoint
  • Exclude invalid periods from aggregation (WHERE period_date IS NOT NULL)
SQL Pattern:
WITH scoped AS (
  SELECT
    TRIM(ter.business_name_raw) AS raw_business_label,
    TRIM(ter.period_label) AS raw_period_label,
    SAFE_CAST(TRIM(ter.period_label) AS DATE) AS period_date
  FROM transaction_events_raw ter
  ...
),
period_stats AS (
  SELECT
    COUNTIF(raw_period_label IS NOT NULL AND period_date IS NULL) AS invalid_period_count
  FROM scoped
)
Files Changed:
  • api/bigquery/business_onboarding_queries.py - Discovery query SQL
  • api/routes/intake_processor.py - Logging for invalid periods

BLOCKER 3 — Tenant + Org Isolation ✅

Problem: Discovery queries did not verify batch ownership, posing cross-tenant enumeration risk. Solution:
  • Verify batch ownership BEFORE discovery via get_ingestion_batch(batch_id, tenant_id)
  • Return 404 if batch not found (no existence leak)
  • JOIN ingestion_batches in discovery query when batch_id is provided
  • All queries scoped by tenant_id AND batch_id (when available)
Files Changed:
  • api/bigquery/business_onboarding_queries.py - Batch ownership verification + JOIN clause
  • api/routes/business_onboarding.py - Debug endpoint batch ownership check

BLOCKER 4 — Debug Endpoint Hardening ✅

Problem: Debug endpoint lacked batch ownership verification. Solution:
  • Verify batch ownership BEFORE discovery (404 if not found)
  • Auth required (admin/CEO only via require_admin_or_ceo)
  • Tenant scoped (via get_tenant_id)
  • Response includes invalid_period_count
Files Changed:
  • api/routes/business_onboarding.py - debug_discover_businesses_endpoint()

SEALED BUSINESS_ID RULE ✅

Decision: Keep existing sealed pattern (MD5(normalized_name) only). Rationale:
  • Tenant isolation enforced by query keys (tenant_id + normalized_name), not hash mutation
  • Changing hash inputs would break existing data integrity
  • No schema migrations required
Files Changed:
  • None (pattern preserved)

Tests Added

File: api/tests/test_business_discovery_p0.py
  1. Normalization determinism: Same input → same normalized output
  2. Business_id determinism: Sealed MD5 pattern (no tenant_id/org_id in hash)
  3. Discovery idempotency: Same batch twice = no duplicate rows
  4. Invalid period detection: invalid_period_count > 0 surfaced
  5. Debug endpoint security: 401 unauth; 404 wrong-tenant batch

Response Structure Changes

Before:
{
    "success": bool,
    "discovered_count": int,
    "upserted_count": int,  # Only config_business_onboarding
    "businesses": [...]
}
After:
{
    "success": bool,
    "discovered_count": int,
    "dim_upserted_count": int,  # dim_business_mapping upserts
    "onboarding_upserted_count": int,  # config_business_onboarding upserts
    "invalid_period_count": int,  # NEW: Invalid period_label strings
    "businesses": [...]
}

Deployment + Validation Runbook

After deploy:
  1. Upload Dec payroll → capture batch_id
  2. Call debug endpoint:
    GET /api/v1/admin/onboarding/debug/discover/{batch_id}
    
    • Verify missing businesses appear (AIC, Motor Medics, Auto Intensive Care of Savannah)
    • Verify invalid_period_count shown
  3. Confirm BigQuery:
    • dim_business_mapping has those business_ids with last_seen_date in Dec
    • config_business_onboarding contains registry rows for those ids
  4. Refresh Preflight/readiness:
    • Those businesses now appear as NOT READY with reasons
  5. Repeat refresh:
    • No duplicates (idempotency verified)

Files Modified

  1. api/bigquery/business_onboarding_queries.py
    • Complete rewrite of discover_businesses_from_upload()
    • Two upserts (dim_business_mapping + config_business_onboarding)
    • SAFE_CAST with invalid count tracking
    • Batch ownership verification
    • JOIN with ingestion_batches
  2. api/routes/business_onboarding.py
    • Debug endpoint batch ownership verification
    • 404 handling for cross-tenant batches
  3. api/routes/intake_processor.py
    • Updated logging to use new response structure
    • Warning log for invalid periods
  4. api/tests/test_business_discovery_p0.py (NEW)
    • Comprehensive test suite for all blockers

Security Guarantees

  • ✅ Tenant isolation enforced (batch ownership verification)
  • ✅ No cross-tenant enumeration (404 for unowned batches)
  • ✅ Admin/CEO only debug endpoint
  • ✅ No silent failures (invalid periods logged + surfaced)

Performance Guarantees

  • ✅ Discovery filters by batch_id (no full-table scans)
  • ✅ DISTINCT applied after tenant + batch filters
  • ✅ Idempotent MERGE operations (no duplicate rows)

Audit Trail

  • All discovery operations logged with tenant_id, batch_id, counts
  • Invalid periods explicitly counted and logged
  • Batch ownership verification logged (warnings for failures)