Skip to main content

TASK: Extract & Freeze BigQuery Contract (Schema + Excel→BQ Expectations)

Context

Before finalizing Repo B Excel mapping, we must explicitly document what BigQuery expects today. Repo A is the current source of truth; Repo B must not break any downstream loads. Important Note: BigQuery primarily loads from CSV snapshots and DataFrames, not Excel files directly. However, Excel sheet structures may inform downstream processes, so we document both:
  • Primary: BigQuery schema expectations (columns, types, constraints)
  • Secondary: Excel sheet/column names (for completeness and downstream compatibility)

Goal

Produce a single “BQ Contract” document that states:
  1. Which BigQuery dataset(s)/tables/views are written to or read from
  2. The exact schema expectations (columns + types) for each table/view involved in the pipeline
  3. Any Excel sheet-name + column dependencies used for ingestion/parsing (even if indirect)
  4. CSV/DataFrame column mappings to BigQuery columns
  5. What is optional vs required

Step 1: Locate all BQ-touch points in the repo

Search the repo for:
  • BigQuery client initialization (google.cloud.bigquery, bigquery.Client)
  • Table/view creation SQL (CREATE TABLE, CREATE VIEW, MERGE, INSERT)
  • ETL loaders (load_table_from_dataframe, load_table_from_file, insert_rows)
  • Schema definitions (integration/bigquery/schemas/*.json)
  • Dataset/table name constants
  • Any “Excel → DataFrame → BQ” loaders (indirect path)
  • Any references to sheet names like “Agent Commissions”, “Full Detail”, etc. (even if indirect)
  • Views that align with Excel outputs (e.g., analytics_agent_report_v5 matches “V5 Excel output”)
Output a list of file paths + short notes on what each file does.

Step 2: Extract the contract

For each table/view involved:
  • Dataset name
  • Table/view name
  • Column list (in order if order matters)
  • Data types (STRING/NUMERIC/DATE/etc.)
  • Key constraints (unique keys, required fields, partitioning fields if present)
  • Source mapping: Which Repo B outputs (CSV/DataFrame columns) map to BQ columns
  • Any transforms/normalizations applied before load
If the schema is defined in SQL, copy the column definitions (no long dumps—just the schema blocks). If it’s inferred from pandas, list the final output columns and the intended BQ type. Focus Areas:
  • payroll_raw.stage1_snapshots (raw ingestion layer)
  • payroll_processed.stage3_snapshots (TPA-normalized data)
  • payroll_analytics.ceo_metrics (executive KPIs)
  • payroll_analytics.growth_loss (employee churn analysis)
  • Key views: analytics_agent_report_v5, analytics_ceo_snapshot, etc.

Step 2.5: Document Data Flow

Document the actual data flow:
Repo B Engine → CSV Snapshots → BigQuery ETL Loader → BigQuery Tables
Repo B Excel → (human consumption, not direct BQ input)
Views: Some views align with Excel outputs (e.g., `agent_report_v5` matches V5 Excel)

Step 3: Identify Excel dependencies explicitly

Find any code that:
  • Reads Excel sheets by name (for any purpose, even if not direct BQ input)
  • Assumes specific sheet names in Repo B outputs
  • Assumes specific columns inside those sheets
Note: BigQuery primarily loads from CSV/DataFrame sources, not Excel directly. Excel sheet names are documented for completeness and downstream process compatibility. Produce:
  • REQUIRED_SHEETS (exact strings, if any)
  • REQUIRED_COLUMNS per sheet (if any)
  • OPTIONAL sheets/columns
  • Mapping: Excel sheet/column → CSV/DataFrame column → BigQuery column

Step 4: Produce the deliverable

Create docs/BQ_CONTRACT.md containing:
  • Source of truth section (where schema came from: SQL files, JSON schemas, Python code)
  • Data Flow section (Repo B → CSV → BQ path)
  • Tables/Views section (with full schema details)
  • Excel Dependencies section (sheet names, columns, mappings - even if indirect)
  • CSV/DataFrame Mappings section (column mappings from Repo B outputs to BQ)
  • Change Policy: what Repo B can add (extras) vs must not change

Constraints

  • Do NOT change any production code in this task.
  • This is read-only extraction + documentation only.
  • Document both direct and indirect dependencies (even if Excel isn’t directly loaded).

Deliverables

  1. docs/BQ_CONTRACT.md - Complete contract document
  2. A short summary in the terminal: “What BQ expects / What Repo B must preserve”

Why This Sequencing

  • It de-risks “Agent Commissions” vs “Agent2 Breakdown” naming decisions
  • It prevents breaking ingestion jobs that may be hardcoded
  • It gives us an audit artifact (“contract”) we can pin to a tag/commit
  • It documents the actual data flow (CSV/DataFrame → BQ, not Excel → BQ directly)
Once Cursor returns docs/BQ_CONTRACT.md, we’ll use it to finalize the mapping layer with zero guesswork.