Skip to main content

Identity Stability Guarantees - Audit-Grade Documentation

Date: 2025-01-XX
Status: ⚠️ REQUIRES IMMEDIATE FIXES
Impact: Critical for semantic view deterministic IDs and tenant isolation

Executive Summary

This document provides audit-grade guarantees for business_id and agent identity stability, which are foundational for:
  • Deterministic stage1_row_id and stage3_component_id hashes
  • Tenant isolation enforcement
  • Historical data continuity
  • Commission traceability accuracy
Current Status: ⚠️ Business ID stability is PARTIALLY IMPLEMENTED with critical gaps. Agent identity requires migration to agent_key pattern.

1. Business ID Stability

1.1 Current Implementation

Generation Location: utils/business_id_generator.py
Generation Method: MD5 hash of normalized business name (first 16 hex characters)
Normalization Formula: UPPER(TRIM(business_name))
Code Reference:
def _generate_hash_id(self, normalized_name: str) -> str:
    """Generate deterministic hash-based ID from normalized name"""
    hash_obj = hashlib.md5(normalized_name.encode('utf-8'))
    return hash_obj.hexdigest()[:16]
Storage:
  • CSV file: data/reference/business_id_mapping.csv
  • BigQuery table: payroll-bi-gauntlet.payroll_analytics.dim_business_mapping (or payroll_raw.dim_business_mapping)

1.2 Stability Guarantees

GUARANTEED STABLE:

  1. Deterministic Generation: Same normalized_name → same business_id (MD5 hash)
  2. Cross-Period Stability: business_id does not change across months/periods
  3. Cross-Run Stability: Re-running same period generates identical business_id
  4. Immutable Once Assigned: No automatic regeneration of existing IDs

NOT GUARANTEED (Critical Gaps):

  1. Missing tenant_id in Schema:
    • Current schema (dim_business_mapping_schema.json) does NOT include tenant_id
    • Risk: Cannot enforce tenant isolation at dimension level
    • Impact: Cross-tenant business name collisions possible
  2. No MERGE Strategy:
    • Current loader (load_reference_tables.py) uses WRITE_TRUNCATE
    • Risk: Data loss on reload, no idempotent upserts
    • Impact: Cannot safely update last_seen_date or is_active status
  3. No Upsert Key Definition:
    • No explicit MERGE key documented
    • Required MERGE Key: (tenant_id, normalized_name) (once tenant_id added)

1.3 Required Fixes

Fix 1: Add tenant_id to Schema

File: integration/bigquery/schemas/dim_business_mapping_schema.json Change:
[
  {
    "name": "tenant_id",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Tenant identifier for multi-tenant isolation"
  },
  {
    "name": "business_id",
    ...
  },
  ...
]

Fix 2: Create MERGE SQL Script

File: integration/bigquery/sql/tables/dim_business_mapping.sql (NEW) Required MERGE Statement:
-- Dimension Table: Business Mapping
-- Stable business ID mapping with tenant isolation
-- MERGE Key: (tenant_id, normalized_name)

CREATE TABLE IF NOT EXISTS `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping` (
  tenant_id STRING NOT NULL,
  business_id STRING NOT NULL,
  business_name STRING NOT NULL,
  normalized_name STRING NOT NULL,
  is_active BOOL NOT NULL,
  first_seen_date DATE,
  last_seen_date DATE,
  created_at TIMESTAMP NOT NULL,
  updated_at TIMESTAMP
)
CLUSTER BY tenant_id, normalized_name, business_id;

-- Idempotent MERGE for business mapping updates
MERGE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping` AS target
USING (
  SELECT * FROM UNNEST([
    -- Example: Add businesses here or load from CSV
    STRUCT(
      'creative_benefit_strategies' AS tenant_id,
      '6cd1a713faf7c2ba' AS business_id,
      'Teledata Contractors Inc' AS business_name,
      'TELEDATA CONTRACTORS INC' AS normalized_name,
      TRUE AS is_active,
      DATE('2024-01-01') AS first_seen_date,
      DATE('2025-01-01') AS last_seen_date,
      CURRENT_TIMESTAMP() AS created_at,
      CURRENT_TIMESTAMP() AS updated_at
    )
  ])
) AS source
ON target.tenant_id = source.tenant_id
  AND target.normalized_name = source.normalized_name
WHEN MATCHED THEN
  UPDATE SET
    business_name = source.business_name,
    is_active = source.is_active,
    last_seen_date = GREATEST(COALESCE(target.last_seen_date, DATE('1900-01-01')), source.last_seen_date),
    updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (tenant_id, business_id, business_name, normalized_name, is_active, first_seen_date, last_seen_date, created_at, updated_at)
  VALUES (source.tenant_id, source.business_id, source.business_name, source.normalized_name, source.is_active, source.first_seen_date, source.last_seen_date, source.created_at, source.updated_at);
MERGE Key: (tenant_id, normalized_name)
Immutable Field: business_id (never updated in MERGE, only inserted)

Fix 3: Update BusinessIDGenerator to Include tenant_id

File: utils/business_id_generator.py Change: Add tenant_id parameter to all methods and CSV schema.

1.4 Guarantee Statement (Post-Fix)

After fixes are applied:
BUSINESS ID STABILITY GUARANTEE:
For any given (tenant_id, normalized_name) tuple:
  1. business_id is deterministically generated from normalized_name using MD5 hash (first 16 hex chars)
  2. business_id is immutable once assigned (never updated in MERGE operations)
  3. business_id is stable across periods and reruns (same input → same output)
  4. business_id is tenant-scoped (same normalized name in different tenants → different IDs)
  5. MERGE operations use (tenant_id, normalized_name) as the unique key
  6. No automatic regeneration occurs for existing businesses
Invariant Check:
-- QA: Verify business_id stability (should return 0 rows)
SELECT 
  tenant_id,
  normalized_name,
  COUNT(DISTINCT business_id) AS business_id_count
FROM `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
GROUP BY tenant_id, normalized_name
HAVING COUNT(DISTINCT business_id) > 1;

2. Agent Identity Stability

2.1 Current Implementation

Problem: Upstream agent_id is NOT stable or may be missing. Current Approach:
  • agent_id is minted from agent_name using mint_agent_id_from_name() function
  • Function generates deterministic hash-based IDs: "NAME:" + SHA256(normalized_name)[:12]
  • Used in stage3_snapshots and semantic views
Code Reference (repo_b/upload_to_bq.py):
def mint_agent_id_from_name(agent_name: str) -> str:
    """Generate stable agent_id from agent_name"""
    normalized = normalize_agent_name(agent_name)
    if normalized in {"OWNER", "UNASSIGNED"}:
        return normalized
    hash_obj = hashlib.sha256(normalized.encode('utf-8'))
    return f"NAME:{hash_obj.hexdigest()[:12]}"

2.2 Stability Issues

NOT GUARANTEED:

  1. Name Variations: “Kenny Young” vs “KENNY YOUNG” → different hashes (if normalization inconsistent)
  2. No Canonical Agent Key: Current system uses agent_id which is derived, not authoritative
  3. Hierarchy References: dim_agent_hierarchy uses agent_id but source may be unstable
  4. Semantic View Dependency: stage3_component_id hash includes agent_id, making it fragile

2.3 Required Solution: agent_key Pattern

Proposal: Create dim_agent_identity Table

Purpose: Canonical agent identity mapping with aliases and effective dates Schema (integration/bigquery/schemas/dim_agent_identity_schema.json):
[
  {
    "name": "tenant_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "agent_key",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Stable canonical agent identifier (deterministic hash)"
  },
  {
    "name": "display_name",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Canonical display name (normalized)"
  },
  {
    "name": "aliases",
    "type": "ARRAY<STRING>",
    "mode": "NULLABLE",
    "description": "Known name variations (e.g., ['KENNY YOUNG', 'Kenny Young'])"
  },
  {
    "name": "upstream_agent_id",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Upstream agent_id if available (may be NULL)"
  },
  {
    "name": "effective_start_date",
    "type": "DATE",
    "mode": "NULLABLE"
  },
  {
    "name": "effective_end_date",
    "type": "DATE",
    "mode": "NULLABLE"
  },
  {
    "name": "is_active",
    "type": "BOOL",
    "mode": "REQUIRED"
  }
]
MERGE Key: (tenant_id, agent_key) Generation Logic:
def generate_agent_key(agent_name: str) -> str:
    """Generate stable agent_key from normalized name"""
    normalized = normalize_agent_name(agent_name)  # UPPER, TRIM, collapse spaces
    if normalized in {"OWNER", "UNASSIGNED"}:
        return normalized
    hash_obj = hashlib.sha256(normalized.encode('utf-8'))
    return f"AGENT:{hash_obj.hexdigest()[:16]}"  # Longer than agent_id for uniqueness

2.4 Required Code Changes

Change 1: Update Semantic View to Use agent_key

File: integration/bigquery/sql/views/semantic_view_commission_traceability.sql Current (line 65):
CAST(agent_id AS STRING), '|',
Change To:
CAST(COALESCE(ai.agent_key, s3.agent_id, 'UNKNOWN') AS STRING), '|',
Add Join:
agent_identity AS (
  SELECT
    tenant_id,
    agent_key,
    display_name,
    upstream_agent_id,
    aliases
  FROM `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity`
  WHERE is_active = TRUE
    AND (effective_end_date IS NULL OR effective_end_date >= CURRENT_DATE())
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY tenant_id, agent_key
    ORDER BY effective_start_date DESC
  ) = 1
)
Update Final SELECT:
-- Replace agent_id references with agent_key where appropriate
COALESCE(ai.agent_key, s1c.agent_id) AS agent_key,  -- New canonical key
s1c.agent_id,  -- Keep for backward compatibility

Change 2: Update Tenant Isolation Tests

File: tests/test_semantic_view_qa_invariants.sql Current Check A uses agent_id:
INNER JOIN `payroll-bi-gauntlet.payroll_analytics.dim_agent_hierarchy` d
  ON v.agent_id = d.agent_id
Change To (use agent_key):
INNER JOIN `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity` ai
  ON v.agent_key = ai.agent_key
  AND v.tenant_id = ai.tenant_id
INNER JOIN `payroll-bi-gauntlet.payroll_analytics.dim_agent_hierarchy` d
  ON ai.upstream_agent_id = d.agent_id  -- Or join via agent_key if hierarchy updated
WHERE v.tenant_id != ai.tenant_id

2.5 Guarantee Statement (Post-Migration)

After migration to agent_key:
AGENT IDENTITY STABILITY GUARANTEE:
  1. agent_key is deterministically generated from normalized display_name using SHA256 hash
  2. agent_key is immutable once assigned (never updated in MERGE)
  3. agent_key is stable across periods and reruns
  4. agent_key is tenant-scoped (same name in different tenants → different keys)
  5. Name variations are handled via aliases array (all map to same agent_key)
  6. MERGE operations use (tenant_id, agent_key) as the unique key
  7. upstream_agent_id is nullable and may change without affecting agent_key stability

3. Required Migrations (Non-Breaking)

3.1 Business Mapping Migration

Steps:
  1. Add tenant_id column to existing dim_business_mapping table (ALTER TABLE)
  2. Backfill tenant_id for existing rows (default to 'creative_benefit_strategies')
  3. Update BusinessIDGenerator to include tenant_id in all operations
  4. Switch loader from WRITE_TRUNCATE to MERGE-based upserts
  5. Run QA invariant to verify no business_id churn
Migration SQL:
-- Step 1: Add tenant_id column
ALTER TABLE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
ADD COLUMN IF NOT EXISTS tenant_id STRING;

-- Step 2: Backfill tenant_id
UPDATE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
SET tenant_id = 'creative_benefit_strategies'
WHERE tenant_id IS NULL;

-- Step 3: Make tenant_id REQUIRED (after backfill)
ALTER TABLE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
ALTER COLUMN tenant_id SET OPTIONS(description="Tenant identifier");

-- Step 4: Add clustering
ALTER TABLE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
SET OPTIONS(
  clustering_fields = ['tenant_id', 'normalized_name', 'business_id']
);

3.2 Agent Identity Migration

Steps:
  1. Create dim_agent_identity table
  2. Populate from existing dim_agent_hierarchy + stage3_snapshots (extract unique agents)
  3. Generate agent_key for all existing agents
  4. Update semantic view to use agent_key (backward compatible: keep agent_id)
  5. Update QA invariants to check agent_key stability
Migration SQL:
-- Step 1: Create dim_agent_identity
CREATE TABLE `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity` (
  tenant_id STRING NOT NULL,
  agent_key STRING NOT NULL,
  display_name STRING NOT NULL,
  aliases ARRAY<STRING>,
  upstream_agent_id STRING,
  effective_start_date DATE,
  effective_end_date DATE,
  is_active BOOL NOT NULL,
  created_at TIMESTAMP NOT NULL,
  updated_at TIMESTAMP
)
CLUSTER BY tenant_id, agent_key;

-- Step 2: Populate from existing data
INSERT INTO `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity`
SELECT DISTINCT
  tenant_id,
  -- Generate agent_key (deterministic hash)
  CONCAT('AGENT:', TO_HEX(SHA256(UPPER(TRIM(agent_name))))[:16]) AS agent_key,
  UPPER(TRIM(agent_name)) AS display_name,
  [UPPER(TRIM(agent_name))] AS aliases,  -- Start with canonical name
  agent_id AS upstream_agent_id,
  DATE('2024-01-01') AS effective_start_date,
  NULL AS effective_end_date,
  TRUE AS is_active,
  CURRENT_TIMESTAMP() AS created_at,
  CURRENT_TIMESTAMP() AS updated_at
FROM `payroll-bi-gauntlet.payroll_processed.stage3_snapshots`
WHERE period_label >= DATE('2024-01-01')
  AND agent_name IS NOT NULL
GROUP BY tenant_id, agent_name, agent_id;

4. QA Invariants

4.1 Business ID Churn Detection

File: tests/test_semantic_view_qa_invariants.sql (add new invariant)
-- ============================================================================
-- INVARIANT 7: Business ID Stability (No Churn)
-- ============================================================================
-- Each normalized_name should have exactly one business_id over time

SELECT
  tenant_id,
  normalized_name,
  COUNT(DISTINCT business_id) AS business_id_count,
  ARRAY_AGG(DISTINCT business_id) AS business_ids,
  CASE
    WHEN COUNT(DISTINCT business_id) = 1 THEN 'PASS'
    ELSE 'FAIL'
  END AS stability_status
FROM `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
GROUP BY tenant_id, normalized_name
HAVING COUNT(DISTINCT business_id) > 1  -- Show only failures
ORDER BY business_id_count DESC;

-- Expected: 0 rows (no business_id churn)

4.2 Agent Key Stability

File: tests/test_semantic_view_qa_invariants.sql (add new invariant)
-- ============================================================================
-- INVARIANT 8: Agent Key Stability
-- ============================================================================
-- Each display_name should have exactly one agent_key

SELECT
  tenant_id,
  display_name,
  COUNT(DISTINCT agent_key) AS agent_key_count,
  CASE
    WHEN COUNT(DISTINCT agent_key) = 1 THEN 'PASS'
    ELSE 'FAIL'
  END AS stability_status
FROM `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity`
GROUP BY tenant_id, display_name
HAVING COUNT(DISTINCT agent_key) > 1  -- Show only failures
ORDER BY agent_key_count DESC;

-- Expected: 0 rows (no agent_key churn)

5. Summary: Guarantee Statements

BUSINESS ID STABILITY (Post-Fix)

MERGE Key: (tenant_id, normalized_name)
Immutable Field: business_id
Generation: MD5 hash of normalized_name (first 16 hex chars)
Guarantee: Same (tenant_id, normalized_name) → same business_id forever

AGENT IDENTITY STABILITY (Post-Migration)

MERGE Key: (tenant_id, agent_key)
Immutable Field: agent_key
Generation: SHA256 hash of normalized display_name (prefix: "AGENT:" + 16 hex chars)
Guarantee: Same (tenant_id, normalized_name) → same agent_key forever
Alias Handling: Multiple name variations map to same agent_key via aliases array

6. Action Items

Immediate (Blocking):

  1. ⚠️ Add tenant_id to dim_business_mapping schema
  2. ⚠️ Create MERGE SQL for dim_business_mapping (key: tenant_id, normalized_name)
  3. ⚠️ Update BusinessIDGenerator to include tenant_id
  4. ⚠️ Migrate loader from WRITE_TRUNCATE to MERGE

Short-term (Required for Semantic View):

  1. Create dim_agent_identity table
  2. Update semantic view to use agent_key (backward compatible)
  3. Add QA invariants for ID stability checks

Long-term (Enhancement):

  1. Implement alias resolution for business names
  2. Add audit logging for ID generation
  3. Create admin UI for manual ID overrides

Status: ⚠️ CRITICAL FIXES REQUIRED BEFORE PRODUCTION USE