Business Identity Resolution - Design Proposal

Purpose

Make business identity resolution explicit, auditable, and deterministic in Repo B without changing financial logic. This replaces the hidden dependency on business_id_mapping.csv that exists in Repo A.

Current State Analysis

business_id_mapping.csv Summary

Location: data/reference/business_id_mapping.csv Statistics:

Total rows: 185
Unique business_ids: 185
Unique normalized_names: 185
Unique business_names: 185
Businesses with >1 alias: 0 (1:1 mapping currently)

Schema:

business_id (16-char hex): Deterministic hash-based ID
business_name (original): Raw business name from source
normalized_name (UPPERCASE): Normalized for matching
is_active (boolean): Active status flag
first_seen_date (YYYY-MM): First appearance
last_seen_date (YYYY-MM): Last appearance
created_at (ISO timestamp): Record creation
updated_at (ISO timestamp): Last update

Key Findings:

Static mapping: Currently 1:1 (no aliases), but schema supports multiple aliases per business_id
Time-evolving: first_seen_date and last_seen_date show businesses appear/disappear over time
September anomaly: 4TK Holdings LLC dba Elite Commercial Cleaning exists in mapping (business_id: 985db0ed14b0b5f1, first_seen: 2025-09, last_seen: 2025-09)

September 2025 Failure Correlation

Failing Business: 4tk holdings LLC DBA elite commercial cleaning Mapping Status:

✅ Exists in mapping: business_id = 985db0ed14b0b5f1
✅ Normalized name: 4TK HOLDINGS LLC DBA ELITE COMMERCIAL CLEANING
✅ First/Last seen: 2025-09 (new business)
✅ Agent assignment: TOMMY DANG, PEPM $3.00

Failure Root Cause:

Net payout: $12.00 (12 employees ×$ 1.00 credit each)
Agent commission calculated: $36.00 (12 employees ×$ 3.00 PEPM × 12 pay_periods / 12)
Owner residual: -$24.00 (violates invariant: agent_total_sum ≤ net_payout)

Analysis: This is NOT a mapping issue—the business identity is correctly resolved. The failure is a data quality issue: agent commission calculation assumes monthly pay_periods=12, but the actual net payout ($12.00) suggests a different pay structure or data error in the source file. Repo A Behavior (Inferred): Repo A likely either:

Merged this business with another (if aliases existed), or
Dropped it silently (if validation failed), or
Applied different normalization that changed the business name matching

Proposed Solution: Business Identity Resolver

Design Principles

Deterministic: Same input → same output (no randomness)
Auditable: All resolution decisions logged with provenance
Idempotent: Re-running resolution produces identical results
Explicit: No hidden mappings or magic strings
Safe for no-code: Can be invoked via API/UI without manual CSV edits

Inputs

Raw Business Names (from Stage 0 canonical input):

business_label_normalized (string): Normalized business name from raw payroll file
source_file (string): Provenance (which file this came from)
period_label (date): When this business was seen

Reference Data (optional, for alias resolution):

Historical business_id_mapping.csv (for backward compatibility)
Active business registry (future: BigQuery table)

Outputs

Business Identity Object:

{
    "business_id": "985db0ed14b0b5f1",  # 16-char hex (deterministic hash)
    "business_name_canonical": "4TK HOLDINGS LLC DBA ELITE COMMERCIAL CLEANING",  # Normalized
    "business_name_source": "4tk holdings  LLC  DBA elite commercial cleaning",  # Original
    "resolution_method": "hash_based",  # or "alias_match", "manual_override"
    "resolution_confidence": 1.0,  # 0.0-1.0
    "aliases": [],  # List of known aliases (if any)
    "first_seen_date": date(2025, 9, 1),
    "last_seen_date": date(2025, 9, 1),
    "provenance": {
        "source_file": "Sept 2025 Combined Payroll.csv",
        "resolved_at": datetime(...),
        "resolver_version": "1.0.0"
    }
}

Resolution Strategy

Phase 1: Hash-Based (Deterministic, No Mapping Required)

Algorithm:

Normalize input: UPPERCASE(TRIM(business_label_normalized))
Generate hash: SHA256(normalized_name)[:16] → hex string
Return business_id = hash

Advantages:

✅ No mapping file required
✅ Deterministic (same name → same ID)
✅ Handles new businesses automatically
✅ No manual maintenance

Limitations:

❌ Cannot merge aliases (e.g., “ABC Corp” vs “ABC Corporation”)
❌ Cannot handle name changes over time

Phase 2: Alias Resolution (Future Enhancement)

Algorithm:

Try hash-based resolution (Phase 1)
If business_id_mapping.csv exists, check for aliases:
- Lookup normalized_name → business_id
- If found, return mapped business_id + aliases
If not found, create new hash-based ID and optionally write to mapping

Advantages:

✅ Supports alias merging (backward compatible with Repo A)
✅ Handles name variations

Limitations:

❌ Requires mapping file maintenance
❌ Manual alias definition needed

Integration with Stage 0

Location: repo_b/stage0/business_resolver.py (new module) API:

def resolve_business_identity(
    business_label_normalized: str,
    source_file: str,
    period_label: date,
    mapping_file: Optional[Path] = None
) -> Dict[str, Any]:
    """
    Resolve business identity from normalized name.
    
    Returns business_id and metadata for audit trail.
    """
    # Phase 1: Hash-based (always deterministic)
    normalized = business_label_normalized.upper().strip()
    business_id = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    
    # Phase 2: Alias lookup (if mapping provided)
    if mapping_file and mapping_file.exists():
        # Check for aliases, merge if found
        pass
    
    return {
        "business_id": business_id,
        "business_name_canonical": normalized,
        "business_name_source": business_label_normalized,
        "resolution_method": "hash_based",
        # ... other fields
    }

Stage 0 Integration Point:

# In repo_b/stage0/raw_intake.py -> normalize_to_canonical()
for record in records:
    # ... existing normalization ...
    
    # NEW: Resolve business identity
    business_identity = resolve_business_identity(
        business_label_normalized=record["business_label_normalized"],
        source_file=source_filename,
        period_label=period_label
    )
    
    # Attach business_id to canonical record
    record["business_id"] = business_identity["business_id"]
    record["business_resolution_metadata"] = business_identity  # For audit

No-Code Ingestion Safety

Requirements:

No manual CSV edits: Business IDs generated automatically
API-friendly: Can be called via REST endpoint
Idempotent: Same input → same output (safe to retry)
Audit trail: All resolutions logged with provenance

Example API Usage:

POST /api/v1/business/resolve
{
    "business_name": "4tk holdings  LLC  DBA elite commercial cleaning",
    "source_file": "upload_2025-09-15.csv",
    "period_label": "2025-09-01"
}

Response:
{
    "business_id": "985db0ed14b0b5f1",
    "business_name_canonical": "4TK HOLDINGS LLC DBA ELITE COMMERCIAL CLEANING",
    "resolution_method": "hash_based",
    "confidence": 1.0
}

Backward Compatibility

Migration Strategy:

Phase 1 (Current): Use hash-based resolution (no mapping required)
Phase 2 (Future): Optionally load business_id_mapping.csv for alias support
Phase 3 (Future): Migrate mapping to BigQuery table for real-time updates

Repo A Parity:

If Repo A used business_id_mapping.csv for aliases, Phase 2 will support the same CSV format
Hash-based IDs will differ from Repo A’s IDs (by design—clean break)
Financial logic unchanged (business_id is metadata only)

Implementation Plan (NO CODE YET)

Step 1: Design Approval

✅ Analyze business_id_mapping.csv structure
✅ Identify September failure root cause (data quality, not mapping)
✅ Draft resolver design (this document)

Step 2: Create Resolver Module (Future)

Create repo_b/stage0/business_resolver.py
Implement hash-based resolution (Phase 1)
Add unit tests for deterministic behavior

Step 3: Integrate with Stage 0 (Future)

Modify normalize_to_canonical() to call resolver
Attach business_id to canonical records
Add provenance metadata

Step 4: Alias Support (Future, Optional)

Load business_id_mapping.csv for backward compatibility
Implement alias lookup logic
Support manual alias definitions

Step 5: BigQuery Integration (Future, Optional)

Create payroll_raw.dim_business_mapping table
Migrate CSV mappings to BigQuery
Enable real-time alias updates via API

Red Flags & Alignment Check

✅ Resolved

Hidden dependency identified: business_id_mapping.csv usage documented
September failure explained: Data quality issue (agent commission > net payout), not mapping
Design explicit: Hash-based resolution is deterministic and auditable
No financial logic changes: Business ID is metadata only

⚠️ Open Questions

Alias merging: Should Repo B support alias merging like Repo A? (Phase 2 decision)
ID migration: Should Repo B use same business_ids as Repo A? (Hash-based will differ)
Manual overrides: How to handle manual business ID assignments? (Future enhancement)

🔒 Constraints Maintained

✅ No changes to Stage 1–4 financial logic
✅ No changes to dashboard queries
✅ No changes to BigQuery schemas (yet)
✅ Backward compatible with existing canonical JSON format

Next Steps

Review this design for alignment with requirements
Approve Phase 1 (hash-based resolution) for implementation
Defer Phase 2 (alias support) until Repo A migration complete
Document September data quality issue separately (not a resolver problem)

Last Updated: 2025-12-19 Status: Design Proposal (No Implementation) Author: Cursor AI Assistant

Start Here

Operate & Support

Reference

Build & Integrate

Business identity resolution

Business Identity Resolution - Design Proposal

Purpose

Current State Analysis

business_id_mapping.csv Summary

September 2025 Failure Correlation

Proposed Solution: Business Identity Resolver

Design Principles

Inputs

Outputs

Resolution Strategy

Phase 1: Hash-Based (Deterministic, No Mapping Required)

Phase 2: Alias Resolution (Future Enhancement)

Integration with Stage 0

No-Code Ingestion Safety

Backward Compatibility

Implementation Plan (NO CODE YET)

Step 1: Design Approval

Step 2: Create Resolver Module (Future)

Step 3: Integrate with Stage 0 (Future)

Step 4: Alias Support (Future, Optional)

Step 5: BigQuery Integration (Future, Optional)

Red Flags & Alignment Check

✅ Resolved

⚠️ Open Questions

🔒 Constraints Maintained

Next Steps

Start Here

Operate & Support

Reference

Build & Integrate

​Business Identity Resolution - Design Proposal

​Purpose

​Current State Analysis

​business_id_mapping.csv Summary

​September 2025 Failure Correlation

​Proposed Solution: Business Identity Resolver

​Design Principles

​Inputs

​Outputs

​Resolution Strategy

​Phase 1: Hash-Based (Deterministic, No Mapping Required)

​Phase 2: Alias Resolution (Future Enhancement)

​Integration with Stage 0

​No-Code Ingestion Safety

​Backward Compatibility

​Implementation Plan (NO CODE YET)

​Step 1: Design Approval

​Step 2: Create Resolver Module (Future)

​Step 3: Integrate with Stage 0 (Future)

​Step 4: Alias Support (Future, Optional)

​Step 5: BigQuery Integration (Future, Optional)

​Red Flags & Alignment Check

​✅ Resolved

​⚠️ Open Questions

​🔒 Constraints Maintained

​Next Steps

Business Identity Resolution - Design Proposal

Purpose

Current State Analysis

business_id_mapping.csv Summary

September 2025 Failure Correlation

Proposed Solution: Business Identity Resolver

Design Principles

Inputs

Outputs

Resolution Strategy

Phase 1: Hash-Based (Deterministic, No Mapping Required)

Phase 2: Alias Resolution (Future Enhancement)

Integration with Stage 0

No-Code Ingestion Safety

Backward Compatibility

Implementation Plan (NO CODE YET)

Step 1: Design Approval

Step 2: Create Resolver Module (Future)

Step 3: Integrate with Stage 0 (Future)

Step 4: Alias Support (Future, Optional)

Step 5: BigQuery Integration (Future, Optional)

Red Flags & Alignment Check

✅ Resolved

⚠️ Open Questions

🔒 Constraints Maintained

Next Steps