Skip to main content

Business Identity Resolution - Design Proposal

Purpose

Make business identity resolution explicit, auditable, and deterministic in Repo B without changing financial logic. This replaces the hidden dependency on business_id_mapping.csv that exists in Repo A.

Current State Analysis

business_id_mapping.csv Summary

Location: data/reference/business_id_mapping.csv Statistics:
  • Total rows: 185
  • Unique business_ids: 185
  • Unique normalized_names: 185
  • Unique business_names: 185
  • Businesses with >1 alias: 0 (1:1 mapping currently)
Schema:
  • business_id (16-char hex): Deterministic hash-based ID
  • business_name (original): Raw business name from source
  • normalized_name (UPPERCASE): Normalized for matching
  • is_active (boolean): Active status flag
  • first_seen_date (YYYY-MM): First appearance
  • last_seen_date (YYYY-MM): Last appearance
  • created_at (ISO timestamp): Record creation
  • updated_at (ISO timestamp): Last update
Key Findings:
  1. Static mapping: Currently 1:1 (no aliases), but schema supports multiple aliases per business_id
  2. Time-evolving: first_seen_date and last_seen_date show businesses appear/disappear over time
  3. September anomaly: 4TK Holdings LLC dba Elite Commercial Cleaning exists in mapping (business_id: 985db0ed14b0b5f1, first_seen: 2025-09, last_seen: 2025-09)

September 2025 Failure Correlation

Failing Business: 4tk holdings LLC DBA elite commercial cleaning Mapping Status:
  • Exists in mapping: business_id = 985db0ed14b0b5f1
  • Normalized name: 4TK HOLDINGS LLC DBA ELITE COMMERCIAL CLEANING
  • First/Last seen: 2025-09 (new business)
  • Agent assignment: TOMMY DANG, PEPM $3.00
Failure Root Cause:
  • Net payout: 12.00(12employees×12.00 (12 employees × 1.00 credit each)
  • Agent commission calculated: 36.00(12employees×36.00 (12 employees × 3.00 PEPM × 12 pay_periods / 12)
  • Owner residual: -$24.00 (violates invariant: agent_total_sum ≤ net_payout)
Analysis: This is NOT a mapping issue—the business identity is correctly resolved. The failure is a data quality issue: agent commission calculation assumes monthly pay_periods=12, but the actual net payout ($12.00) suggests a different pay structure or data error in the source file. Repo A Behavior (Inferred): Repo A likely either:
  1. Merged this business with another (if aliases existed), or
  2. Dropped it silently (if validation failed), or
  3. Applied different normalization that changed the business name matching

Proposed Solution: Business Identity Resolver

Design Principles

  1. Deterministic: Same input → same output (no randomness)
  2. Auditable: All resolution decisions logged with provenance
  3. Idempotent: Re-running resolution produces identical results
  4. Explicit: No hidden mappings or magic strings
  5. Safe for no-code: Can be invoked via API/UI without manual CSV edits

Inputs

Raw Business Names (from Stage 0 canonical input):
  • business_label_normalized (string): Normalized business name from raw payroll file
  • source_file (string): Provenance (which file this came from)
  • period_label (date): When this business was seen
Reference Data (optional, for alias resolution):
  • Historical business_id_mapping.csv (for backward compatibility)
  • Active business registry (future: BigQuery table)

Outputs

Business Identity Object:
{
    "business_id": "985db0ed14b0b5f1",  # 16-char hex (deterministic hash)
    "business_name_canonical": "4TK HOLDINGS LLC DBA ELITE COMMERCIAL CLEANING",  # Normalized
    "business_name_source": "4tk holdings  LLC  DBA elite commercial cleaning",  # Original
    "resolution_method": "hash_based",  # or "alias_match", "manual_override"
    "resolution_confidence": 1.0,  # 0.0-1.0
    "aliases": [],  # List of known aliases (if any)
    "first_seen_date": date(2025, 9, 1),
    "last_seen_date": date(2025, 9, 1),
    "provenance": {
        "source_file": "Sept 2025 Combined Payroll.csv",
        "resolved_at": datetime(...),
        "resolver_version": "1.0.0"
    }
}

Resolution Strategy

Phase 1: Hash-Based (Deterministic, No Mapping Required)

Algorithm:
  1. Normalize input: UPPERCASE(TRIM(business_label_normalized))
  2. Generate hash: SHA256(normalized_name)[:16] → hex string
  3. Return business_id = hash
Advantages:
  • ✅ No mapping file required
  • ✅ Deterministic (same name → same ID)
  • ✅ Handles new businesses automatically
  • ✅ No manual maintenance
Limitations:
  • ❌ Cannot merge aliases (e.g., “ABC Corp” vs “ABC Corporation”)
  • ❌ Cannot handle name changes over time

Phase 2: Alias Resolution (Future Enhancement)

Algorithm:
  1. Try hash-based resolution (Phase 1)
  2. If business_id_mapping.csv exists, check for aliases:
    • Lookup normalized_namebusiness_id
    • If found, return mapped business_id + aliases
  3. If not found, create new hash-based ID and optionally write to mapping
Advantages:
  • ✅ Supports alias merging (backward compatible with Repo A)
  • ✅ Handles name variations
Limitations:
  • ❌ Requires mapping file maintenance
  • ❌ Manual alias definition needed

Integration with Stage 0

Location: repo_b/stage0/business_resolver.py (new module) API:
def resolve_business_identity(
    business_label_normalized: str,
    source_file: str,
    period_label: date,
    mapping_file: Optional[Path] = None
) -> Dict[str, Any]:
    """
    Resolve business identity from normalized name.
    
    Returns business_id and metadata for audit trail.
    """
    # Phase 1: Hash-based (always deterministic)
    normalized = business_label_normalized.upper().strip()
    business_id = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    
    # Phase 2: Alias lookup (if mapping provided)
    if mapping_file and mapping_file.exists():
        # Check for aliases, merge if found
        pass
    
    return {
        "business_id": business_id,
        "business_name_canonical": normalized,
        "business_name_source": business_label_normalized,
        "resolution_method": "hash_based",
        # ... other fields
    }
Stage 0 Integration Point:
# In repo_b/stage0/raw_intake.py -> normalize_to_canonical()
for record in records:
    # ... existing normalization ...
    
    # NEW: Resolve business identity
    business_identity = resolve_business_identity(
        business_label_normalized=record["business_label_normalized"],
        source_file=source_filename,
        period_label=period_label
    )
    
    # Attach business_id to canonical record
    record["business_id"] = business_identity["business_id"]
    record["business_resolution_metadata"] = business_identity  # For audit

No-Code Ingestion Safety

Requirements:
  1. No manual CSV edits: Business IDs generated automatically
  2. API-friendly: Can be called via REST endpoint
  3. Idempotent: Same input → same output (safe to retry)
  4. Audit trail: All resolutions logged with provenance
Example API Usage:
POST /api/v1/business/resolve
{
    "business_name": "4tk holdings  LLC  DBA elite commercial cleaning",
    "source_file": "upload_2025-09-15.csv",
    "period_label": "2025-09-01"
}

Response:
{
    "business_id": "985db0ed14b0b5f1",
    "business_name_canonical": "4TK HOLDINGS LLC DBA ELITE COMMERCIAL CLEANING",
    "resolution_method": "hash_based",
    "confidence": 1.0
}

Backward Compatibility

Migration Strategy:
  1. Phase 1 (Current): Use hash-based resolution (no mapping required)
  2. Phase 2 (Future): Optionally load business_id_mapping.csv for alias support
  3. Phase 3 (Future): Migrate mapping to BigQuery table for real-time updates
Repo A Parity:
  • If Repo A used business_id_mapping.csv for aliases, Phase 2 will support the same CSV format
  • Hash-based IDs will differ from Repo A’s IDs (by design—clean break)
  • Financial logic unchanged (business_id is metadata only)

Implementation Plan (NO CODE YET)

Step 1: Design Approval

  • ✅ Analyze business_id_mapping.csv structure
  • ✅ Identify September failure root cause (data quality, not mapping)
  • ✅ Draft resolver design (this document)

Step 2: Create Resolver Module (Future)

  • Create repo_b/stage0/business_resolver.py
  • Implement hash-based resolution (Phase 1)
  • Add unit tests for deterministic behavior

Step 3: Integrate with Stage 0 (Future)

  • Modify normalize_to_canonical() to call resolver
  • Attach business_id to canonical records
  • Add provenance metadata

Step 4: Alias Support (Future, Optional)

  • Load business_id_mapping.csv for backward compatibility
  • Implement alias lookup logic
  • Support manual alias definitions

Step 5: BigQuery Integration (Future, Optional)

  • Create payroll_raw.dim_business_mapping table
  • Migrate CSV mappings to BigQuery
  • Enable real-time alias updates via API

Red Flags & Alignment Check

✅ Resolved

  1. Hidden dependency identified: business_id_mapping.csv usage documented
  2. September failure explained: Data quality issue (agent commission > net payout), not mapping
  3. Design explicit: Hash-based resolution is deterministic and auditable
  4. No financial logic changes: Business ID is metadata only

⚠️ Open Questions

  1. Alias merging: Should Repo B support alias merging like Repo A? (Phase 2 decision)
  2. ID migration: Should Repo B use same business_ids as Repo A? (Hash-based will differ)
  3. Manual overrides: How to handle manual business ID assignments? (Future enhancement)

🔒 Constraints Maintained

  • ✅ No changes to Stage 1–4 financial logic
  • ✅ No changes to dashboard queries
  • ✅ No changes to BigQuery schemas (yet)
  • ✅ Backward compatible with existing canonical JSON format

Next Steps

  1. Review this design for alignment with requirements
  2. Approve Phase 1 (hash-based resolution) for implementation
  3. Defer Phase 2 (alias support) until Repo A migration complete
  4. Document September data quality issue separately (not a resolver problem)

Last Updated: 2025-12-19 Status: Design Proposal (No Implementation) Author: Cursor AI Assistant