System Metadata Extraction — ELT Transition Validation
Summary (3–6 bullets)
- Captures schema-level and logic-level metadata required for ELT transition validation.
- Documents transaction_events_raw field types/modes and complex value handling.
- Extracts the process_row() return contract, nullability behavior, and serialization details.
- Provides staging context (GCS paths, table existence) and recommended load-job setup.
When to use this (3–6 bullets)
- When implementing or reviewing ELT ingestion for onboarding/intake events.
- During schema mapping work between Python payloads and BigQuery structures.
- When troubleshooting array/null/decimal serialization issues in load jobs.
- Before creating or validating staging-table strategy for transition rollout.
What you’ll walk away with (2–5 bullets)
- An implementation-ready field mapping and type contract for ELT handoff.
- Clear serialization rules for arrays, nulls, and decimal strings.
- A concrete staging/load configuration baseline to execute safely.
Objective: Extract specific schema and logic definitions for “The Architect” to validate the ELT transition
1. Schema Definition
1.1 transaction_events_raw Schema
Source: bq show --schema --format=json payroll-bi-gauntlet.payroll_analytics.transaction_events_raw
DDL File: integration/bigquery/sql/tables/transaction_events_raw.sql
Full JSON Schema:
1.2 Critical Field Types & Modes
error_codes:
- Type: STRING
- Mode: REPEATED (ARRAY<STRING> in BigQuery)
- Nullable: No (REPEATED fields are arrays; empty array = [])

warnings:
- Type: STRING
- Mode: REPEATED (ARRAY<STRING> in BigQuery)
- Nullable: No (REPEATED fields are arrays; empty array = [])

metadata:
- Not present in schema (no metadata field exists)

Decimal fields:
- credit_raw: STRING (nullable)
- debit_raw: STRING (nullable)
- total_raw: STRING (nullable)
- Rationale: audit-grade money handling; no float math, exact decimal precision

Nullable STRING fields:
- source_system, member_id_raw, payee_agent_id_raw, credit_raw, debit_raw, total_raw, period_code_raw, pay_periods_raw
2. Logic Extraction: process_row()
2.1 Function Signature
File: api/routes/intake_processor.py:574-713
2.2 Return Signature
Return Type: Dict[str, Any]
Return Keys (exact structure):
2.3 None Value Handling
Pattern: process_row() returns Python None for optional fields, not empty strings.
Examples:
- member_id_raw: row.get("member_id", "").strip() or None → returns None if empty
- credit_raw: (row.get("credit_raw") or row.get("credit") or "").strip() or None → returns None if empty
- error_codes: error_codes if error_codes else None → returns None if the list is empty
- warnings: warnings if warnings else None → returns None if the list is empty
- business_name_raw: never None (required field; defaults to "" if missing, but validation will mark the row as REJECTED)
2.4 Complex Type Serialization
error_codes:
- Python Type: List[str] (e.g., ["MISSING_BUSINESS_NAME", "INVALID_DECIMAL_CREDIT"])
- Return Value: List[str] | None (empty list becomes None)
- BigQuery Binding: must be converted to bigquery.ArrayQueryParameter("error_codes", "STRING", error_codes or [])
- Note: the current upsert function handles the None → [] conversion: row.get("error_codes") or []

warnings:
- Python Type: List[str] (e.g., ["MISSING_MEMBER_ID"])
- Return Value: List[str] | None (empty list becomes None)
- BigQuery Binding: must be converted to bigquery.ArrayQueryParameter("warnings", "STRING", warnings or [])
- Note: the current upsert function handles the None → [] conversion: row.get("warnings") or []

Raw-row JSON field:
- Python Type: str (JSON-encoded string)
- Serialization: json.dumps(row, sort_keys=True) (full input row dict as JSON)
- BigQuery Type: STRING (not the JSON type; stored as text)

Decimal fields (credit_raw, debit_raw, total_raw):
- Python Type: str | None (decimal string like "1234.56", or None)
- Serialization: already strings (no conversion needed)
- BigQuery Type: STRING (not NUMERIC, per ADR-002)
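The None → [] coercion that must happen before array binding can be sketched as follows. `array_param_value` is a hypothetical helper; it reproduces the documented upsert expression `row.get(key) or []`, whose result then feeds `bigquery.ArrayQueryParameter(key, "STRING", ...)`:

```python
def array_param_value(row: dict, key: str) -> list:
    # REPEATED fields cannot bind None; coerce to [] as the upsert does.
    return row.get(key) or []


processed = {"error_codes": None, "warnings": ["MISSING_MEMBER_ID"]}
assert array_param_value(processed, "error_codes") == []
assert array_param_value(processed, "warnings") == ["MISSING_MEMBER_ID"]
assert array_param_value({}, "error_codes") == []
```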
2.5 Row Status Logic
- If the error_codes list is non-empty → "REJECTED"
- If the error_codes list is empty → "ACCEPTED"
- Warnings do NOT affect status (only error_codes)
3. Staging Context
3.1 GCS Bucket Naming Convention
Environment Variable: ONBOARDING_STORAGE_BUCKET
Config File: api/config/onboarding.py
Current Patterns:
- Intake Upload (api/routes/intake.py:173):
  Example: gs://payroll-bi-gauntlet-onboarding-storage/intake/creative_benefit_strategies/f97d7357-986d-46d4-822c-74eb3d467ac3/December 2025 Combnied Payroll.csv
- Onboarding Files (api/routes/onboarding.py:107):
- Bulk Import (api/services/upload_store.py:274-280):
3.2 Staging Table Existence
Query: bq ls payroll_analytics | grep -i staging
Result: no existing staging tables found
Conclusion: transaction_events_raw_staging must be created from scratch.
Recommended DDL (mirror target schema):
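The DDL body itself was not captured above. As a hedged sketch only, a staging table mirroring the fields documented in §1.2 might look like the following; the field list is assembled from this report and is NOT the full target schema, so the actual DDL should be generated from the real `transaction_events_raw` definition:

```python
# Sketch only: field list comes from §1.2 of this report, not the full schema.
STAGING_DDL = """
CREATE TABLE IF NOT EXISTS
  `payroll-bi-gauntlet.payroll_analytics.transaction_events_raw_staging` (
    source_system       STRING,
    member_id_raw       STRING,
    payee_agent_id_raw  STRING,
    credit_raw          STRING,  -- decimal string, per ADR-002 (not NUMERIC)
    debit_raw           STRING,
    total_raw           STRING,
    period_code_raw     STRING,
    pay_periods_raw     STRING,
    error_codes         ARRAY<STRING>,
    warnings            ARRAY<STRING>
)
"""

assert "transaction_events_raw_staging" in STAGING_DDL
```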
4. Summary for ELT Implementation
4.1 Schema Mapping (Python → BigQuery)
| Python Return Value | BigQuery Type | Binding Method |
|---|---|---|
| str (non-null) | STRING | ScalarQueryParameter("field", "STRING", value) |
| None | STRING (nullable) | ScalarQueryParameter("field", "STRING", None) |
| List[str] (non-empty) | ARRAY<STRING> | ArrayQueryParameter("field", "STRING", value) |
| None (for arrays) | ARRAY<STRING> (empty) | ArrayQueryParameter("field", "STRING", []) |
| int | INT64 | ScalarQueryParameter("field", "INT64", value) |
| str (timestamp) | TIMESTAMP | ScalarQueryParameter("field", "TIMESTAMP", value) |
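The mapping table can be sketched as a type-dispatch helper. `binding_spec` is hypothetical: it keys off the Python type only, so in practice array fields and TIMESTAMP columns must be identified by field name (a bare `None` here defaults to a nullable STRING, matching the table's second row):

```python
def binding_spec(value):
    """Return (param_kind, bq_type, bound_value) per the mapping table."""
    if isinstance(value, list):
        # Non-empty list or [] both bind as ARRAY<STRING>.
        return ("ArrayQueryParameter", "STRING", value)
    if isinstance(value, bool):
        raise TypeError("BOOL is not covered by the mapping table")
    if isinstance(value, int):
        return ("ScalarQueryParameter", "INT64", value)
    # str or None: nullable STRING.
    return ("ScalarQueryParameter", "STRING", value)


assert binding_spec(["A", "B"]) == ("ArrayQueryParameter", "STRING", ["A", "B"])
assert binding_spec(7) == ("ScalarQueryParameter", "INT64", 7)
assert binding_spec(None) == ("ScalarQueryParameter", "STRING", None)
```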
4.2 NDJSON Serialization for Load Job
Format: newline-delimited JSON (one record per line)
Example Record:
- error_codes and warnings must be serialized as JSON arrays ([] for empty, null for None, or ["CODE1","CODE2"] for values)
- A BigQuery load job with source_format=NEWLINE_DELIMITED_JSON will parse arrays correctly
- Decimal strings remain strings (no conversion)
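The serialization rules above can be sketched with the standard library alone. `to_ndjson` is an illustrative helper: `json.dumps` already renders Python lists as JSON arrays and `None` as `null`, which a `NEWLINE_DELIMITED_JSON` load job accepts, and decimal strings pass through untouched:

```python
import json


def to_ndjson(records: list[dict]) -> str:
    """One JSON object per line; lists become arrays, None becomes null."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


ndjson = to_ndjson([
    {"credit_raw": "1234.56", "error_codes": [], "warnings": None},
    {"credit_raw": None, "error_codes": ["INVALID_DECIMAL_CREDIT"], "warnings": []},
])
lines = ndjson.splitlines()
assert json.loads(lines[0])["error_codes"] == []          # empty array survives
assert json.loads(lines[0])["credit_raw"] == "1234.56"    # decimal stays a string
assert json.loads(lines[1])["credit_raw"] is None         # null for None
```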
4.3 Load Job Configuration
Recommended Config:
End of System Metadata Extraction