Business Identity Resolution - Design Proposal
Purpose
Make business identity resolution explicit, auditable, and deterministic in Repo B without changing financial logic. This replaces the hidden dependency on business_id_mapping.csv that exists in Repo A.
Current State Analysis
business_id_mapping.csv Summary
Location: data/reference/business_id_mapping.csv
Statistics:
- Total rows: 185
- Unique business_ids: 185
- Unique normalized_names: 185
- Unique business_names: 185
- Businesses with >1 alias: 0 (1:1 mapping currently)
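The counts above could be reproduced with a short script along these lines. This is a sketch only: it assumes the CSV has a header row using the column names business_id, business_name, and normalized_name, and the function name summarize_mapping is hypothetical. The sample rows are made up and stand in for the real file.

```python
import csv
import io
from collections import defaultdict


def summarize_mapping(csv_file):
    """Count rows, distinct values, and business_ids with more than one alias."""
    rows = list(csv.DictReader(csv_file))
    names_per_id = defaultdict(set)
    for row in rows:
        names_per_id[row["business_id"]].add(row["normalized_name"])
    return {
        "total_rows": len(rows),
        "unique_business_ids": len(names_per_id),
        "unique_normalized_names": len({r["normalized_name"] for r in rows}),
        "unique_business_names": len({r["business_name"] for r in rows}),
        "ids_with_multiple_aliases": sum(
            1 for names in names_per_id.values() if len(names) > 1
        ),
    }


# Tiny in-memory sample (made-up rows, not real data).
sample = io.StringIO(
    "business_id,business_name,normalized_name\n"
    "aaaaaaaaaaaaaaaa,Abc Corp,ABC CORP\n"
    "bbbbbbbbbbbbbbbb,Xyz Inc,XYZ INC\n"
)
stats = summarize_mapping(sample)
```

Against the real file, the same function would be called with open("data/reference/business_id_mapping.csv", newline="") in place of the in-memory sample.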
Schema:
- business_id (16-char hex): Deterministic hash-based ID
- business_name (original): Raw business name from source
- normalized_name (UPPERCASE): Normalized for matching
- is_active (boolean): Active status flag
- first_seen_date (YYYY-MM): First appearance
- last_seen_date (YYYY-MM): Last appearance
- created_at (ISO timestamp): Record creation
- updated_at (ISO timestamp): Last update
Observations:
- Static mapping: Currently 1:1 (no aliases), but schema supports multiple aliases per business_id
- Time-evolving: first_seen_date and last_seen_date show businesses appear/disappear over time
- September anomaly: 4TK Holdings LLC dba Elite Commercial Cleaning exists in mapping (business_id: 985db0ed14b0b5f1, first_seen: 2025-09, last_seen: 2025-09)
September 2025 Failure Correlation
Failing Business: 4tk holdings LLC DBA elite commercial cleaning
Mapping Status:
- ✅ Exists in mapping: business_id = 985db0ed14b0b5f1
- ✅ Normalized name: 4TK HOLDINGS LLC DBA ELITE COMMERCIAL CLEANING
- ✅ First/Last seen: 2025-09 (new business)
- ✅ Agent assignment: TOMMY DANG, PEPM $3.00
- Net payout: … (… $1.00 credit each)
- Agent commission calculated: … (… $3.00 PEPM × 12 pay_periods / 12)
- Owner residual: -$24.00 (violates invariant: agent_total_sum ≤ net_payout)
Repo A may have:
- Merged this business with another (if aliases existed), or
- Dropped it silently (if validation failed), or
- Applied different normalization that changed the business name matching
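The invariant violated in the September case (agent_total_sum ≤ net_payout) can be written as a small check. This is a sketch, not Repo B's actual validation code: the function names are hypothetical, and the 12.00 and 36.00 amounts are illustrative assumptions chosen only so that the residual comes out at -24.00.

```python
from decimal import Decimal


def owner_residual(net_payout: Decimal, agent_total_sum: Decimal) -> Decimal:
    """Return what is left for the owner after agent commissions."""
    return net_payout - agent_total_sum


def violates_invariant(net_payout: Decimal, agent_total_sum: Decimal) -> bool:
    """True when commissions exceed the payout (agent_total_sum > net_payout)."""
    return agent_total_sum > net_payout


# Illustrative amounts only (assumed, not taken from the payroll data).
net_payout = Decimal("12.00")
agent_total_sum = Decimal("36.00")

residual = owner_residual(net_payout, agent_total_sum)
is_violation = violates_invariant(net_payout, agent_total_sum)
```

A negative residual and a violated invariant are the same condition seen from two sides, which is why the failure is independent of how the business ID was resolved.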
Proposed Solution: Business Identity Resolver
Design Principles
- Deterministic: Same input → same output (no randomness)
- Auditable: All resolution decisions logged with provenance
- Idempotent: Re-running resolution produces identical results
- Explicit: No hidden mappings or magic strings
- Safe for no-code: Can be invoked via API/UI without manual CSV edits
Inputs
Raw Business Names (from Stage 0 canonical input):
- business_label_normalized (string): Normalized business name from raw payroll file
- source_file (string): Provenance (which file this came from)
- period_label (date): When this business was seen
Reference Data:
- Historical business_id_mapping.csv (for backward compatibility)
- Active business registry (future: BigQuery table)
Outputs
Business Identity Object:
Resolution Strategy
Phase 1: Hash-Based (Deterministic, No Mapping Required)
Algorithm:
- Normalize input: UPPERCASE(TRIM(business_label_normalized))
- Generate hash: SHA256(normalized_name)[:16] → hex string
- Return business_id = hash
- ✅ No mapping file required
- ✅ Deterministic (same name → same ID)
- ✅ Handles new businesses automatically
- ✅ No manual maintenance
- ❌ Cannot merge aliases (e.g., “ABC Corp” vs “ABC Corporation”)
- ❌ Cannot handle name changes over time
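The Phase 1 algorithm can be sketched in a few lines of Python. The function names are hypothetical. Two details are assumptions, since the algorithm does not specify them: the name is encoded as UTF-8 before hashing, and the [:16] slice is taken from the hex digest (which yields a 16-character ID).

```python
import hashlib


def normalize_name(business_label: str) -> str:
    """UPPERCASE(TRIM(...)) as described in the algorithm."""
    return business_label.strip().upper()


def hash_business_id(business_label: str) -> str:
    """Return the first 16 hex characters of SHA-256 over the normalized name."""
    normalized = normalize_name(business_label)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:16]


# Case and surrounding whitespace do not affect the ID ...
id_a = hash_business_id("  abc corp ")
id_b = hash_business_id("ABC CORP")
# ... but a different spelling is treated as a different business.
id_c = hash_business_id("ABC CORPORATION")
```

Here id_a and id_b are equal, while id_c differs, which illustrates the alias limitation listed above. Truncating to 16 hex characters keeps 64 bits of the digest.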
Phase 2: Alias Resolution (Future Enhancement)
Algorithm:
- Try hash-based resolution (Phase 1)
- If business_id_mapping.csv exists, check for aliases:
  - Lookup normalized_name → business_id
  - If found, return mapped business_id + aliases
- If not found, create new hash-based ID and optionally write to mapping
- ✅ Supports alias merging (backward compatible with Repo A)
- ✅ Handles name variations
- ❌ Requires mapping file maintenance
- ❌ Manual alias definition needed
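A possible shape for the Phase 2 lookup is sketched below. It assumes the mapping CSV exposes normalized_name and business_id columns and that aliases are simply all names sharing one business_id. The function names, the returned dict keys, and the sample rows are hypothetical. For brevity the sketch computes the hash only as a fallback and leaves out the optional write-back to the mapping.

```python
import csv
import hashlib
import io


def hash_business_id(normalized_name: str) -> str:
    """Phase 1 fallback: first 16 hex characters of SHA-256."""
    return hashlib.sha256(normalized_name.encode("utf-8")).hexdigest()[:16]


def load_alias_map(csv_file) -> dict:
    """Map each normalized_name in the mapping file to its business_id."""
    return {
        row["normalized_name"].strip().upper(): row["business_id"]
        for row in csv.DictReader(csv_file)
    }


def resolve_with_aliases(business_label: str, alias_map: dict) -> dict:
    """Prefer a mapped ID; otherwise fall back to the hash-based ID."""
    normalized = business_label.strip().upper()
    mapped_id = alias_map.get(normalized)
    if mapped_id is None:
        return {
            "business_id": hash_business_id(normalized),
            "aliases": [normalized],
            "method": "hash",
        }
    # Every name that maps to the same ID counts as an alias.
    aliases = sorted(n for n, bid in alias_map.items() if bid == mapped_id)
    return {"business_id": mapped_id, "aliases": aliases, "method": "mapping"}


# Made-up mapping rows: two spellings share one business_id.
mapping_csv = io.StringIO(
    "business_id,normalized_name\n"
    "0123456789abcdef,ABC CORP\n"
    "0123456789abcdef,ABC CORPORATION\n"
)
alias_map = load_alias_map(mapping_csv)
known = resolve_with_aliases("abc corporation", alias_map)
unknown = resolve_with_aliases("New Business LLC", alias_map)
```

Recording the method ("mapping" or "hash") alongside the ID keeps each resolution decision auditable.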
Integration with Stage 0
Location: repo_b/stage0/business_resolver.py (new module)
API:
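A minimal sketch of what the resolver's entry point could look like, using the three inputs listed under Inputs and Phase 1 hashing. The class name BusinessIdentity, the function name resolve_business, all field names, and the example file name are assumptions for illustration, not a settled interface.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class BusinessIdentity:
    """Hypothetical output record; field names are assumptions."""
    business_id: str
    normalized_name: str
    resolution_method: str           # "hash" in Phase 1
    source_file: str                 # provenance: which file the name came from
    period_label: date               # when this business was seen
    aliases: Tuple[str, ...] = ()    # empty until Phase 2 alias support


def resolve_business(
    business_label_normalized: str, source_file: str, period_label: date
) -> BusinessIdentity:
    """Resolve one business name to an identity record (Phase 1 only)."""
    normalized = business_label_normalized.strip().upper()
    business_id = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return BusinessIdentity(
        business_id=business_id,
        normalized_name=normalized,
        resolution_method="hash",
        source_file=source_file,
        period_label=period_label,
    )


# "payroll_2025_09.csv" is a made-up file name used only for this example.
identity = resolve_business(
    "4tk holdings LLC DBA elite commercial cleaning",
    "payroll_2025_09.csv",
    date(2025, 9, 1),
)
```

The record is frozen so that a resolution result cannot be modified after the fact, and the function reads no external state, so repeated calls with the same arguments return equal records.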
No-Code Ingestion Safety
Requirements:
- No manual CSV edits: Business IDs generated automatically
- API-friendly: Can be called via REST endpoint
- Idempotent: Same input → same output (safe to retry)
- Audit trail: All resolutions logged with provenance
Backward Compatibility
Migration Strategy:
- Phase 1 (Current): Use hash-based resolution (no mapping required)
- Phase 2 (Future): Optionally load business_id_mapping.csv for alias support
- Phase 3 (Future): Migrate mapping to BigQuery table for real-time updates
Compatibility Notes:
- If Repo A used business_id_mapping.csv for aliases, Phase 2 will support the same CSV format
- Hash-based IDs will differ from Repo A’s IDs (by design, as a clean break)
- Financial logic unchanged (business_id is metadata only)
Implementation Plan (NO CODE YET)
Step 1: Design Approval
- ✅ Analyze business_id_mapping.csv structure
- ✅ Identify September failure root cause (data quality, not mapping)
- ✅ Draft resolver design (this document)
Step 2: Create Resolver Module (Future)
- Create repo_b/stage0/business_resolver.py
- Implement hash-based resolution (Phase 1)
- Add unit tests for deterministic behavior
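The determinism tests in Step 2 might look like the following unittest sketch. The hash_business_id function here is a stand-in for the future resolver, repeated so that the example runs on its own. The test names and the run_tests helper are hypothetical.

```python
import hashlib
import io
import unittest


def hash_business_id(business_label: str) -> str:
    """Stand-in for the Phase 1 resolver (assumed implementation)."""
    normalized = business_label.strip().upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


class HashResolverDeterminismTest(unittest.TestCase):
    def test_same_input_same_id(self):
        self.assertEqual(hash_business_id("ABC Corp"), hash_business_id("ABC Corp"))

    def test_case_and_whitespace_are_ignored(self):
        self.assertEqual(hash_business_id("  abc corp  "), hash_business_id("ABC CORP"))

    def test_id_is_16_hex_characters(self):
        self.assertRegex(hash_business_id("ABC Corp"), r"^[0-9a-f]{16}$")

    def test_different_names_get_different_ids(self):
        self.assertNotEqual(
            hash_business_id("ABC Corp"), hash_business_id("ABC Corporation")
        )


def run_tests() -> unittest.TestResult:
    """Run the suite quietly and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        HashResolverDeterminismTest
    )
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


result = run_tests()
```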
Step 3: Integrate with Stage 0 (Future)
- Modify normalize_to_canonical() to call resolver
- Attach business_id to canonical records
- Add provenance metadata
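One way Step 3 could work is sketched below. The signature of normalize_to_canonical() is not shown in this document, so the sketch operates on a canonical record represented as a plain dict. The helper name attach_business_identity, the business_id_provenance key, and the sample record are all hypothetical.

```python
import hashlib


def attach_business_identity(record: dict) -> dict:
    """Return a copy of a canonical record with business_id and provenance added."""
    normalized = record["business_label_normalized"].strip().upper()
    business_id = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    enriched = dict(record)  # copy so the input record is left untouched
    enriched["business_id"] = business_id
    enriched["business_id_provenance"] = {
        "method": "hash",
        "normalized_name": normalized,
        "source_file": record.get("source_file"),
        "period_label": record.get("period_label"),
    }
    return enriched


# Made-up canonical record; the financial field is carried through unchanged.
record = {
    "business_label_normalized": "Abc Corp",
    "source_file": "example.csv",
    "period_label": "2025-09",
    "net_payout": "12.00",
}
enriched = attach_business_identity(record)
```

The sketch deliberately adds no timestamp to the provenance, so re-running it on the same record yields an identical result.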
Step 4: Alias Support (Future, Optional)
- Load business_id_mapping.csv for backward compatibility
- Implement alias lookup logic
- Support manual alias definitions
Step 5: BigQuery Integration (Future, Optional)
- Create payroll_raw.dim_business_mapping table
- Migrate CSV mappings to BigQuery
- Enable real-time alias updates via API
Red Flags & Alignment Check
✅ Resolved
- Hidden dependency identified: business_id_mapping.csv usage documented
- September failure explained: Data quality issue (agent commission > net payout), not mapping
- Design explicit: Hash-based resolution is deterministic and auditable
- No financial logic changes: Business ID is metadata only
⚠️ Open Questions
- Alias merging: Should Repo B support alias merging like Repo A? (Phase 2 decision)
- ID migration: Should Repo B use same business_ids as Repo A? (Hash-based will differ)
- Manual overrides: How to handle manual business ID assignments? (Future enhancement)
🔒 Constraints Maintained
- ✅ No changes to Stage 1–4 financial logic
- ✅ No changes to dashboard queries
- ✅ No changes to BigQuery schemas (yet)
- ✅ Backward compatible with existing canonical JSON format
Next Steps
- Review this design for alignment with requirements
- Approve Phase 1 (hash-based resolution) for implementation
- Defer Phase 2 (alias support) until Repo A migration complete
- Document September data quality issue separately (not a resolver problem)
Last Updated: 2025-12-19
Status: Design Proposal (No Implementation)
Author: Cursor AI Assistant