Identity Stability Guarantees - Audit-Grade Documentation

Date: 2025-01-XX
Status: ⚠️ REQUIRES IMMEDIATE FIXES
Impact: Critical for semantic view deterministic IDs and tenant isolation

Executive Summary

This document provides audit-grade guarantees for business_id and agent identity stability, which are foundational for:

Deterministic stage1_row_id and stage3_component_id hashes
Tenant isolation enforcement
Historical data continuity
Commission traceability accuracy

Current Status: ⚠️ Business ID stability is PARTIALLY IMPLEMENTED with critical gaps. Agent identity requires migration to agent_key pattern.

1. Business ID Stability

1.1 Current Implementation

Generation Location: utils/business_id_generator.py
Generation Method: MD5 hash of normalized business name (first 16 hex characters)
Normalization Formula: UPPER(TRIM(business_name)) Code Reference:

def _generate_hash_id(self, normalized_name: str) -> str:
    """Generate deterministic hash-based ID from normalized name"""
    hash_obj = hashlib.md5(normalized_name.encode('utf-8'))
    return hash_obj.hexdigest()[:16]

Storage:

CSV file: data/reference/business_id_mapping.csv
BigQuery table: payroll-bi-gauntlet.payroll_analytics.dim_business_mapping (or payroll_raw.dim_business_mapping)

1.2 Stability Guarantees

✅ GUARANTEED STABLE:

Deterministic Generation: Same normalized_name → same business_id (MD5 hash)
Cross-Period Stability: business_id does not change across months/periods
Cross-Run Stability: Re-running same period generates identical business_id
Immutable Once Assigned: No automatic regeneration of existing IDs

❌ NOT GUARANTEED (Critical Gaps):

Missing tenant_id in Schema:
- Current schema (dim_business_mapping_schema.json) does NOT include tenant_id
- Risk: Cannot enforce tenant isolation at dimension level
- Impact: Cross-tenant business name collisions possible
No MERGE Strategy:
- Current loader (load_reference_tables.py) uses WRITE_TRUNCATE
- Risk: Data loss on reload, no idempotent upserts
- Impact: Cannot safely update last_seen_date or is_active status
No Upsert Key Definition:
- No explicit MERGE key documented
- Required MERGE Key: (tenant_id, normalized_name) (once tenant_id added)

1.3 Required Fixes

Fix 1: Add `tenant_id` to Schema

File: integration/bigquery/schemas/dim_business_mapping_schema.json Change:

[
  {
    "name": "tenant_id",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Tenant identifier for multi-tenant isolation"
  },
  {
    "name": "business_id",
    ...
  },
  ...
]

Fix 2: Create MERGE SQL Script

File: integration/bigquery/sql/tables/dim_business_mapping.sql (NEW) Required MERGE Statement:

-- Dimension Table: Business Mapping
-- Stable business ID mapping with tenant isolation
-- MERGE Key: (tenant_id, normalized_name)

CREATE TABLE IF NOT EXISTS `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping` (
  tenant_id STRING NOT NULL,
  business_id STRING NOT NULL,
  business_name STRING NOT NULL,
  normalized_name STRING NOT NULL,
  is_active BOOL NOT NULL,
  first_seen_date DATE,
  last_seen_date DATE,
  created_at TIMESTAMP NOT NULL,
  updated_at TIMESTAMP
)
CLUSTER BY tenant_id, normalized_name, business_id;

-- Idempotent MERGE for business mapping updates
MERGE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping` AS target
USING (
  SELECT * FROM UNNEST([
    -- Example: Add businesses here or load from CSV
    STRUCT(
      'creative_benefit_strategies' AS tenant_id,
      '6cd1a713faf7c2ba' AS business_id,
      'Teledata Contractors Inc' AS business_name,
      'TELEDATA CONTRACTORS INC' AS normalized_name,
      TRUE AS is_active,
      DATE('2024-01-01') AS first_seen_date,
      DATE('2025-01-01') AS last_seen_date,
      CURRENT_TIMESTAMP() AS created_at,
      CURRENT_TIMESTAMP() AS updated_at
    )
  ])
) AS source
ON target.tenant_id = source.tenant_id
  AND target.normalized_name = source.normalized_name
WHEN MATCHED THEN
  UPDATE SET
    business_name = source.business_name,
    is_active = source.is_active,
    last_seen_date = GREATEST(COALESCE(target.last_seen_date, DATE('1900-01-01')), source.last_seen_date),
    updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (tenant_id, business_id, business_name, normalized_name, is_active, first_seen_date, last_seen_date, created_at, updated_at)
  VALUES (source.tenant_id, source.business_id, source.business_name, source.normalized_name, source.is_active, source.first_seen_date, source.last_seen_date, source.created_at, source.updated_at);

MERGE Key: (tenant_id, normalized_name)
Immutable Field: business_id (never updated in MERGE, only inserted)

Fix 3: Update BusinessIDGenerator to Include tenant_id

File: utils/business_id_generator.py Change: Add tenant_id parameter to all methods and CSV schema.

1.4 Guarantee Statement (Post-Fix)

After fixes are applied:

BUSINESS ID STABILITY GUARANTEE:
For any given (tenant_id, normalized_name) tuple:

business_id is deterministically generated from normalized_name using MD5 hash (first 16 hex chars)

business_id is immutable once assigned (never updated in MERGE operations)

business_id is stable across periods and reruns (same input → same output)

business_id is tenant-scoped (same normalized name in different tenants → different IDs)

MERGE operations use (tenant_id, normalized_name) as the unique key

No automatic regeneration occurs for existing businesses

Invariant Check:

-- QA: Verify business_id stability (should return 0 rows)
SELECT 
  tenant_id,
  normalized_name,
  COUNT(DISTINCT business_id) AS business_id_count
FROM `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
GROUP BY tenant_id, normalized_name
HAVING COUNT(DISTINCT business_id) > 1;

2. Agent Identity Stability

2.1 Current Implementation

Problem: Upstream agent_id is NOT stable or may be missing. Current Approach:

agent_id is minted from agent_name using mint_agent_id_from_name() function
Function generates deterministic hash-based IDs: "NAME:" + SHA256(normalized_name)[:12]
Used in stage3_snapshots and semantic views

Code Reference (repo_b/upload_to_bq.py):

def mint_agent_id_from_name(agent_name: str) -> str:
    """Generate stable agent_id from agent_name"""
    normalized = normalize_agent_name(agent_name)
    if normalized in {"OWNER", "UNASSIGNED"}:
        return normalized
    hash_obj = hashlib.sha256(normalized.encode('utf-8'))
    return f"NAME:{hash_obj.hexdigest()[:12]}"

2.2 Stability Issues

❌ NOT GUARANTEED:

Name Variations: “Kenny Young” vs “KENNY YOUNG” → different hashes (if normalization inconsistent)
No Canonical Agent Key: Current system uses agent_id which is derived, not authoritative
Hierarchy References: dim_agent_hierarchy uses agent_id but source may be unstable
Semantic View Dependency: stage3_component_id hash includes agent_id, making it fragile

2.3 Required Solution: `agent_key` Pattern

Proposal: Create `dim_agent_identity` Table

Purpose: Canonical agent identity mapping with aliases and effective dates Schema (integration/bigquery/schemas/dim_agent_identity_schema.json):

[
  {
    "name": "tenant_id",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "agent_key",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Stable canonical agent identifier (deterministic hash)"
  },
  {
    "name": "display_name",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Canonical display name (normalized)"
  },
  {
    "name": "aliases",
    "type": "ARRAY<STRING>",
    "mode": "NULLABLE",
    "description": "Known name variations (e.g., ['KENNY YOUNG', 'Kenny Young'])"
  },
  {
    "name": "upstream_agent_id",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Upstream agent_id if available (may be NULL)"
  },
  {
    "name": "effective_start_date",
    "type": "DATE",
    "mode": "NULLABLE"
  },
  {
    "name": "effective_end_date",
    "type": "DATE",
    "mode": "NULLABLE"
  },
  {
    "name": "is_active",
    "type": "BOOL",
    "mode": "REQUIRED"
  }
]

MERGE Key: (tenant_id, agent_key) Generation Logic:

def generate_agent_key(agent_name: str) -> str:
    """Generate stable agent_key from normalized name"""
    normalized = normalize_agent_name(agent_name)  # UPPER, TRIM, collapse spaces
    if normalized in {"OWNER", "UNASSIGNED"}:
        return normalized
    hash_obj = hashlib.sha256(normalized.encode('utf-8'))
    return f"AGENT:{hash_obj.hexdigest()[:16]}"  # Longer than agent_id for uniqueness

2.4 Required Code Changes

Change 1: Update Semantic View to Use `agent_key`

File: integration/bigquery/sql/views/semantic_view_commission_traceability.sql Current (line 65):

CAST(agent_id AS STRING), '|',

Change To:

CAST(COALESCE(ai.agent_key, s3.agent_id, 'UNKNOWN') AS STRING), '|',

Add Join:

agent_identity AS (
  SELECT
    tenant_id,
    agent_key,
    display_name,
    upstream_agent_id,
    aliases
  FROM `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity`
  WHERE is_active = TRUE
    AND (effective_end_date IS NULL OR effective_end_date >= CURRENT_DATE())
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY tenant_id, agent_key
    ORDER BY effective_start_date DESC
  ) = 1
)

Update Final SELECT:

-- Replace agent_id references with agent_key where appropriate
COALESCE(ai.agent_key, s1c.agent_id) AS agent_key,  -- New canonical key
s1c.agent_id,  -- Keep for backward compatibility

Change 2: Update Tenant Isolation Tests

File: tests/test_semantic_view_qa_invariants.sql Current Check A uses agent_id:

INNER JOIN `payroll-bi-gauntlet.payroll_analytics.dim_agent_hierarchy` d
  ON v.agent_id = d.agent_id

Change To (use agent_key):

INNER JOIN `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity` ai
  ON v.agent_key = ai.agent_key
  AND v.tenant_id = ai.tenant_id
INNER JOIN `payroll-bi-gauntlet.payroll_analytics.dim_agent_hierarchy` d
  ON ai.upstream_agent_id = d.agent_id  -- Or join via agent_key if hierarchy updated
WHERE v.tenant_id != ai.tenant_id

2.5 Guarantee Statement (Post-Migration)

After migration to agent_key:

AGENT IDENTITY STABILITY GUARANTEE:

agent_key is deterministically generated from normalized display_name using SHA256 hash

agent_key is immutable once assigned (never updated in MERGE)

agent_key is stable across periods and reruns

agent_key is tenant-scoped (same name in different tenants → different keys)

Name variations are handled via aliases array (all map to same agent_key)

MERGE operations use (tenant_id, agent_key) as the unique key

upstream_agent_id is nullable and may change without affecting agent_key stability

3. Required Migrations (Non-Breaking)

3.1 Business Mapping Migration

Steps:

Add tenant_id column to existing dim_business_mapping table (ALTER TABLE)
Backfill tenant_id for existing rows (default to 'creative_benefit_strategies')
Update BusinessIDGenerator to include tenant_id in all operations
Switch loader from WRITE_TRUNCATE to MERGE-based upserts
Run QA invariant to verify no business_id churn

Migration SQL:

-- Step 1: Add tenant_id column
ALTER TABLE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
ADD COLUMN IF NOT EXISTS tenant_id STRING;

-- Step 2: Backfill tenant_id
UPDATE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
SET tenant_id = 'creative_benefit_strategies'
WHERE tenant_id IS NULL;

-- Step 3: Make tenant_id REQUIRED (after backfill)
ALTER TABLE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
ALTER COLUMN tenant_id SET OPTIONS(description="Tenant identifier");

-- Step 4: Add clustering
ALTER TABLE `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
SET OPTIONS(
  clustering_fields = ['tenant_id', 'normalized_name', 'business_id']
);

3.2 Agent Identity Migration

Steps:

Create dim_agent_identity table
Populate from existing dim_agent_hierarchy + stage3_snapshots (extract unique agents)
Generate agent_key for all existing agents
Update semantic view to use agent_key (backward compatible: keep agent_id)
Update QA invariants to check agent_key stability

Migration SQL:

-- Step 1: Create dim_agent_identity
CREATE TABLE `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity` (
  tenant_id STRING NOT NULL,
  agent_key STRING NOT NULL,
  display_name STRING NOT NULL,
  aliases ARRAY<STRING>,
  upstream_agent_id STRING,
  effective_start_date DATE,
  effective_end_date DATE,
  is_active BOOL NOT NULL,
  created_at TIMESTAMP NOT NULL,
  updated_at TIMESTAMP
)
CLUSTER BY tenant_id, agent_key;

-- Step 2: Populate from existing data
INSERT INTO `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity`
SELECT DISTINCT
  tenant_id,
  -- Generate agent_key (deterministic hash)
  CONCAT('AGENT:', TO_HEX(SHA256(UPPER(TRIM(agent_name))))[:16]) AS agent_key,
  UPPER(TRIM(agent_name)) AS display_name,
  [UPPER(TRIM(agent_name))] AS aliases,  -- Start with canonical name
  agent_id AS upstream_agent_id,
  DATE('2024-01-01') AS effective_start_date,
  NULL AS effective_end_date,
  TRUE AS is_active,
  CURRENT_TIMESTAMP() AS created_at,
  CURRENT_TIMESTAMP() AS updated_at
FROM `payroll-bi-gauntlet.payroll_processed.stage3_snapshots`
WHERE period_label >= DATE('2024-01-01')
  AND agent_name IS NOT NULL
GROUP BY tenant_id, agent_name, agent_id;

4. QA Invariants

4.1 Business ID Churn Detection

File: tests/test_semantic_view_qa_invariants.sql (add new invariant)

-- ============================================================================
-- INVARIANT 7: Business ID Stability (No Churn)
-- ============================================================================
-- Each normalized_name should have exactly one business_id over time

SELECT
  tenant_id,
  normalized_name,
  COUNT(DISTINCT business_id) AS business_id_count,
  ARRAY_AGG(DISTINCT business_id) AS business_ids,
  CASE
    WHEN COUNT(DISTINCT business_id) = 1 THEN 'PASS'
    ELSE 'FAIL'
  END AS stability_status
FROM `payroll-bi-gauntlet.payroll_analytics.dim_business_mapping`
GROUP BY tenant_id, normalized_name
HAVING COUNT(DISTINCT business_id) > 1  -- Show only failures
ORDER BY business_id_count DESC;

-- Expected: 0 rows (no business_id churn)

4.2 Agent Key Stability

File: tests/test_semantic_view_qa_invariants.sql (add new invariant)

-- ============================================================================
-- INVARIANT 8: Agent Key Stability
-- ============================================================================
-- Each display_name should have exactly one agent_key

SELECT
  tenant_id,
  display_name,
  COUNT(DISTINCT agent_key) AS agent_key_count,
  CASE
    WHEN COUNT(DISTINCT agent_key) = 1 THEN 'PASS'
    ELSE 'FAIL'
  END AS stability_status
FROM `payroll-bi-gauntlet.payroll_analytics.dim_agent_identity`
GROUP BY tenant_id, display_name
HAVING COUNT(DISTINCT agent_key) > 1  -- Show only failures
ORDER BY agent_key_count DESC;

-- Expected: 0 rows (no agent_key churn)

5. Summary: Guarantee Statements

✅ BUSINESS ID STABILITY (Post-Fix)

MERGE Key: (tenant_id, normalized_name)
Immutable Field: business_id
Generation: MD5 hash of normalized_name (first 16 hex chars)
Guarantee: Same (tenant_id, normalized_name) → same business_id forever

✅ AGENT IDENTITY STABILITY (Post-Migration)

MERGE Key: (tenant_id, agent_key)
Immutable Field: agent_key
Generation: SHA256 hash of normalized display_name (prefix: "AGENT:" + 16 hex chars)
Guarantee: Same (tenant_id, normalized_name) → same agent_key forever
Alias Handling: Multiple name variations map to same agent_key via aliases array

6. Action Items

Immediate (Blocking):

⚠️ Add tenant_id to dim_business_mapping schema
⚠️ Create MERGE SQL for dim_business_mapping (key: tenant_id, normalized_name)
⚠️ Update BusinessIDGenerator to include tenant_id
⚠️ Migrate loader from WRITE_TRUNCATE to MERGE

Short-term (Required for Semantic View):

Create dim_agent_identity table
Update semantic view to use agent_key (backward compatible)
Add QA invariants for ID stability checks

Long-term (Enhancement):

Implement alias resolution for business names
Add audit logging for ID generation
Create admin UI for manual ID overrides

Status: ⚠️ CRITICAL FIXES REQUIRED BEFORE PRODUCTION USE

Start Here

Operate & Support

Reference

Build & Integrate

IDENTITY STABILITY GUARANTEES

Identity Stability Guarantees - Audit-Grade Documentation

Executive Summary

1. Business ID Stability

1.1 Current Implementation

1.2 Stability Guarantees

✅ GUARANTEED STABLE:

❌ NOT GUARANTEED (Critical Gaps):

1.3 Required Fixes

Fix 1: Add `tenant_id` to Schema

Fix 2: Create MERGE SQL Script

Fix 3: Update BusinessIDGenerator to Include tenant_id

1.4 Guarantee Statement (Post-Fix)

2. Agent Identity Stability

2.1 Current Implementation

2.2 Stability Issues

❌ NOT GUARANTEED:

2.3 Required Solution: `agent_key` Pattern

Proposal: Create `dim_agent_identity` Table

2.4 Required Code Changes

Change 1: Update Semantic View to Use `agent_key`

Change 2: Update Tenant Isolation Tests

2.5 Guarantee Statement (Post-Migration)

3. Required Migrations (Non-Breaking)

3.1 Business Mapping Migration

3.2 Agent Identity Migration

4. QA Invariants

4.1 Business ID Churn Detection

4.2 Agent Key Stability

5. Summary: Guarantee Statements

✅ BUSINESS ID STABILITY (Post-Fix)

✅ AGENT IDENTITY STABILITY (Post-Migration)

6. Action Items

Immediate (Blocking):

Short-term (Required for Semantic View):

Long-term (Enhancement):

Start Here

Operate & Support

Reference

Build & Integrate

​Identity Stability Guarantees - Audit-Grade Documentation

​Executive Summary

​1. Business ID Stability

​1.1 Current Implementation

​1.2 Stability Guarantees

​✅ GUARANTEED STABLE:

​❌ NOT GUARANTEED (Critical Gaps):

​1.3 Required Fixes

​Fix 1: Add tenant_id to Schema

​Fix 2: Create MERGE SQL Script

​Fix 3: Update BusinessIDGenerator to Include tenant_id

​1.4 Guarantee Statement (Post-Fix)

​2. Agent Identity Stability

​2.1 Current Implementation

​2.2 Stability Issues

​❌ NOT GUARANTEED:

​2.3 Required Solution: agent_key Pattern

​Proposal: Create dim_agent_identity Table

​2.4 Required Code Changes

​Change 1: Update Semantic View to Use agent_key

​Change 2: Update Tenant Isolation Tests

​2.5 Guarantee Statement (Post-Migration)

​3. Required Migrations (Non-Breaking)

​3.1 Business Mapping Migration

​3.2 Agent Identity Migration

​4. QA Invariants

​4.1 Business ID Churn Detection

​4.2 Agent Key Stability

​5. Summary: Guarantee Statements

​✅ BUSINESS ID STABILITY (Post-Fix)

​✅ AGENT IDENTITY STABILITY (Post-Migration)

​6. Action Items

​Immediate (Blocking):

​Short-term (Required for Semantic View):

​Long-term (Enhancement):

Identity Stability Guarantees - Audit-Grade Documentation

Executive Summary

1. Business ID Stability

1.1 Current Implementation

1.2 Stability Guarantees

✅ GUARANTEED STABLE:

❌ NOT GUARANTEED (Critical Gaps):

1.3 Required Fixes

Fix 1: Add `tenant_id` to Schema

Fix 2: Create MERGE SQL Script

Fix 3: Update BusinessIDGenerator to Include tenant_id

1.4 Guarantee Statement (Post-Fix)

2. Agent Identity Stability

2.1 Current Implementation

2.2 Stability Issues

❌ NOT GUARANTEED:

2.3 Required Solution: `agent_key` Pattern

Proposal: Create `dim_agent_identity` Table

2.4 Required Code Changes

Change 1: Update Semantic View to Use `agent_key`

Change 2: Update Tenant Isolation Tests

2.5 Guarantee Statement (Post-Migration)

3. Required Migrations (Non-Breaking)

3.1 Business Mapping Migration

3.2 Agent Identity Migration

4. QA Invariants

4.1 Business ID Churn Detection

4.2 Agent Key Stability

5. Summary: Guarantee Statements

✅ BUSINESS ID STABILITY (Post-Fix)

✅ AGENT IDENTITY STABILITY (Post-Migration)

6. Action Items

Immediate (Blocking):

Short-term (Required for Semantic View):

Long-term (Enhancement):