TASK: Extract & Freeze BigQuery Contract (Schema + Excel→BQ Expectations)
Context
Before finalizing the Repo B Excel mapping, we must explicitly document what BigQuery expects today. Repo A is the current source of truth; Repo B must not break any downstream loads.

Important Note: BigQuery primarily loads from CSV snapshots and DataFrames, not Excel files directly. However, Excel sheet structures may inform downstream processes, so we document both:
- Primary: BigQuery schema expectations (columns, types, constraints)
- Secondary: Excel sheet/column names (for completeness and downstream compatibility)
Goal
Produce a single “BQ Contract” document that states:
- Which BigQuery dataset(s)/tables/views are written to or read from
- The exact schema expectations (columns + types) for each table/view involved in the pipeline
- Any Excel sheet-name + column dependencies used for ingestion/parsing (even if indirect)
- CSV/DataFrame column mappings to BigQuery columns
- What is optional vs required
Step 1: Locate all BQ-touch points in the repo
Search the repo for the following (a scan sketch follows this list):
- BigQuery client initialization (`google.cloud.bigquery`, `bigquery.Client`)
- Table/view creation SQL (`CREATE TABLE`, `CREATE VIEW`, `MERGE`, `INSERT`)
- ETL loaders (`load_table_from_dataframe`, `load_table_from_file`, `insert_rows`)
- Schema definitions (`integration/bigquery/schemas/*.json`)
- Dataset/table name constants
- Any “Excel → DataFrame → BQ” loaders (indirect path)
- Any references to sheet names like “Agent Commissions”, “Full Detail”, etc. (even if indirect)
- Views that align with Excel outputs (e.g., `analytics_agent_report_v5` matches “V5 Excel output”)
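A minimal scan sketch, assuming a conventional Python/SQL repo layout; the patterns simply mirror the checklist above, and the file extensions are assumptions to adjust:

```python
# Hedged helper: walk the repo and flag lines matching the BQ touch-point checklist.
import re
from pathlib import Path

# Patterns taken from the checklist above; extend as new touch points surface.
PATTERNS = [
    r"google\.cloud\.bigquery", r"bigquery\.Client",
    r"CREATE\s+(?:TABLE|VIEW)", r"\bMERGE\b", r"\bINSERT\b",
    r"load_table_from_(?:dataframe|file)", r"insert_rows",
    r"Agent Commissions", r"Full Detail",
]

for path in Path(".").rglob("*"):
    if path.suffix not in {".py", ".sql", ".json"}:
        continue  # assumed extension set; widen if loaders live elsewhere
    text = path.read_text(errors="ignore")
    for pattern in PATTERNS:
        for match in re.finditer(pattern, text):
            line_no = text.count("\n", 0, match.start()) + 1
            print(f"{path}:{line_no}: matched {pattern}")
```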
Step 2: Extract the contract
For each table/view involved, record:
- Dataset name
- Table/view name
- Column list (in order if order matters)
- Data types (STRING/NUMERIC/DATE/etc.)
- Key constraints (unique keys, required fields, partitioning fields if present)
- Source mapping: which Repo B outputs (CSV/DataFrame columns) map to which BQ columns
- Any transforms/normalizations applied before load
Known tables and views (a schema-dump sketch follows this list):
- `payroll_raw.stage1_snapshots` (raw ingestion layer)
- `payroll_processed.stage3_snapshots` (TPA-normalized data)
- `payroll_analytics.ceo_metrics` (executive KPIs)
- `payroll_analytics.growth_loss` (employee churn analysis)
- Key views: `analytics_agent_report_v5`, `analytics_ceo_snapshot`, etc.
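Where read-only credentials are available, the live schema can be dumped straight into the contract. A minimal sketch using the standard `google-cloud-bigquery` API (`get_table`, `SchemaField`), assuming the default project resolves the dataset names above:

```python
# Read-only schema dump for docs/BQ_CONTRACT.md; touches no production code.
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

TABLES = [
    "payroll_raw.stage1_snapshots",
    "payroll_processed.stage3_snapshots",
    "payroll_analytics.ceo_metrics",
    "payroll_analytics.growth_loss",
]

for name in TABLES:
    table = client.get_table(name)  # dataset.table resolves against the default project
    print(f"\n{name} (partitioning: {table.time_partitioning})")
    for field in table.schema:
        # field.mode (REQUIRED/NULLABLE/REPEATED) feeds the optional-vs-required section
        print(f"- {field.name}: {field.field_type} ({field.mode})")
```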
Step 2.5: Document Data Flow
Document the actual data flow: Repo B output → CSV/DataFrame → BigQuery load (not Excel → BQ directly), plus which views sit downstream. A sketch of the indirect load path follows.
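For orientation, the indirect path we are documenting typically has this shape; the file name below is hypothetical, and this is illustrative only, not Repo A code:

```python
# Illustrative shape of the indirect load path (NOT production code; this task is read-only).
import pandas as pd
from google.cloud import bigquery

df = pd.read_csv("stage1_snapshot.csv")  # hypothetical Repo B CSV snapshot
client = bigquery.Client()
# The destination table plus the DataFrame's column names/dtypes are exactly
# what the contract must freeze.
client.load_table_from_dataframe(df, "payroll_raw.stage1_snapshots").result()
```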
Step 3: Identify Excel dependencies explicitly
Find any code that:
- Reads Excel sheets by name (for any purpose, even if not a direct BQ input)
- Assumes specific sheet names in Repo B outputs
- Assumes specific columns inside those sheets
Then document:
- REQUIRED_SHEETS (exact strings, if any)
- REQUIRED_COLUMNS per sheet (if any)
- OPTIONAL sheets/columns
- Mapping: Excel sheet/column → CSV/DataFrame column → BigQuery column (an illustrative entry follows)
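One hypothetical shape for such a mapping entry (every sheet/column name below is a placeholder until confirmed against Repo A):

```python
# Hypothetical mapping entry; all names are placeholders, not extracted values.
SHEET_TO_BQ = {
    "Agent Commissions": {  # Excel sheet name (candidate REQUIRED_SHEETS entry)
        # Excel column -> (CSV/DataFrame column, BigQuery column)
        "Agent Name": ("agent_name", "agent_name"),
        "Commission $": ("commission_amount", "commission_amount"),
    },
}
```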
Step 4: Produce the deliverable
Create `docs/BQ_CONTRACT.md` containing:
- Source of truth section (where schema came from: SQL files, JSON schemas, Python code)
- Data Flow section (Repo B → CSV → BQ path)
- Tables/Views section (with full schema details)
- Excel Dependencies section (sheet names, columns, mappings, even if indirect)
- CSV/DataFrame Mappings section (column mappings from Repo B outputs to BQ)
- Change Policy: what Repo B may add (extras) vs. what it must not change
Constraints
- Do NOT change any production code in this task.
- This is read-only extraction + documentation only.
- Document both direct and indirect dependencies (even if Excel isn’t directly loaded).
Deliverables
- `docs/BQ_CONTRACT.md`: the complete contract document
- A short summary in the terminal: “What BQ expects / What Repo B must preserve”
Why This Sequencing
- It de-risks “Agent Commissions” vs “Agent2 Breakdown” naming decisions
- It prevents breaking ingestion jobs that may be hardcoded
- It gives us an audit artifact (“contract”) we can pin to a tag/commit
- It documents the actual data flow (CSV/DataFrame → BQ, not Excel → BQ directly)
Once `docs/BQ_CONTRACT.md` exists, we’ll use it to finalize the mapping layer with zero guesswork.