TASK: Extract & Freeze BigQuery Contract (Schema + Excel→BQ Expectations)

Context

Before finalizing Repo B Excel mapping, we must explicitly document what BigQuery expects today. Repo A is the current source of truth; Repo B must not break any downstream loads. Important Note: BigQuery primarily loads from CSV snapshots and DataFrames, not Excel files directly. However, Excel sheet structures may inform downstream processes, so we document both:

Primary: BigQuery schema expectations (columns, types, constraints)
Secondary: Excel sheet/column names (for completeness and downstream compatibility)

Goal

Produce a single “BQ Contract” document that states:

Which BigQuery dataset(s)/tables/views are written to or read from
The exact schema expectations (columns + types) for each table/view involved in the pipeline
Any Excel sheet-name + column dependencies used for ingestion/parsing (even if indirect)
CSV/DataFrame column mappings to BigQuery columns
What is optional vs required

Step 1: Locate all BQ-touch points in the repo

Search the repo for:

BigQuery client initialization (google.cloud.bigquery, bigquery.Client)
Table/view creation SQL (CREATE TABLE, CREATE VIEW, MERGE, INSERT)
ETL loaders (load_table_from_dataframe, load_table_from_file, insert_rows)
Schema definitions (integration/bigquery/schemas/*.json)
Dataset/table name constants
Any “Excel → DataFrame → BQ” loaders (indirect path)
Any references to sheet names like “Agent Commissions”, “Full Detail”, etc. (even if indirect)
Views that align with Excel outputs (e.g., analytics_agent_report_v5 matches “V5 Excel output”)

Output a list of file paths + short notes on what each file does.

Step 2: Extract the contract

For each table/view involved:

Dataset name
Table/view name
Column list (in order if order matters)
Data types (STRING/NUMERIC/DATE/etc.)
Key constraints (unique keys, required fields, partitioning fields if present)
Source mapping: Which Repo B outputs (CSV/DataFrame columns) map to BQ columns
Any transforms/normalizations applied before load

If the schema is defined in SQL, copy the column definitions (no long dumps—just the schema blocks). If it’s inferred from pandas, list the final output columns and the intended BQ type. Focus Areas:

payroll_raw.stage1_snapshots (raw ingestion layer)
payroll_processed.stage3_snapshots (TPA-normalized data)
payroll_analytics.ceo_metrics (executive KPIs)
payroll_analytics.growth_loss (employee churn analysis)
Key views: analytics_agent_report_v5, analytics_ceo_snapshot, etc.

Step 2.5: Document Data Flow

Document the actual data flow:

Repo B Engine → CSV Snapshots → BigQuery ETL Loader → BigQuery Tables
Repo B Excel → (human consumption, not direct BQ input)
Views: Some views align with Excel outputs (e.g., `agent_report_v5` matches V5 Excel)

Step 3: Identify Excel dependencies explicitly

Find any code that:

Reads Excel sheets by name (for any purpose, even if not direct BQ input)
Assumes specific sheet names in Repo B outputs
Assumes specific columns inside those sheets

Note: BigQuery primarily loads from CSV/DataFrame sources, not Excel directly. Excel sheet names are documented for completeness and downstream process compatibility. Produce:

REQUIRED_SHEETS (exact strings, if any)
REQUIRED_COLUMNS per sheet (if any)
OPTIONAL sheets/columns
Mapping: Excel sheet/column → CSV/DataFrame column → BigQuery column

Step 4: Produce the deliverable

Create docs/BQ_CONTRACT.md containing:

Source of truth section (where schema came from: SQL files, JSON schemas, Python code)
Data Flow section (Repo B → CSV → BQ path)
Tables/Views section (with full schema details)
Excel Dependencies section (sheet names, columns, mappings - even if indirect)
CSV/DataFrame Mappings section (column mappings from Repo B outputs to BQ)
Change Policy: what Repo B can add (extras) vs must not change

Constraints

Do NOT change any production code in this task.
This is read-only extraction + documentation only.
Document both direct and indirect dependencies (even if Excel isn’t directly loaded).

Deliverables

docs/BQ_CONTRACT.md - Complete contract document
A short summary in the terminal: “What BQ expects / What Repo B must preserve”

Why This Sequencing

It de-risks “Agent Commissions” vs “Agent2 Breakdown” naming decisions
It prevents breaking ingestion jobs that may be hardcoded
It gives us an audit artifact (“contract”) we can pin to a tag/commit
It documents the actual data flow (CSV/DataFrame → BQ, not Excel → BQ directly)

Once Cursor returns docs/BQ_CONTRACT.md, we’ll use it to finalize the mapping layer with zero guesswork.

Start Here

Operate & Support

Reference

Build & Integrate

BQ CONTRACT EXTRACTION PROMPT

TASK: Extract & Freeze BigQuery Contract (Schema + Excel→BQ Expectations)

Context

Goal

Step 1: Locate all BQ-touch points in the repo

Step 2: Extract the contract

Step 2.5: Document Data Flow

Step 3: Identify Excel dependencies explicitly

Step 4: Produce the deliverable

Constraints

Deliverables

Why This Sequencing

Start Here

Operate & Support

Reference

Build & Integrate

​TASK: Extract & Freeze BigQuery Contract (Schema + Excel→BQ Expectations)

​Context

​Goal

​Step 1: Locate all BQ-touch points in the repo

​Step 2: Extract the contract

​Step 2.5: Document Data Flow

​Step 3: Identify Excel dependencies explicitly

​Step 4: Produce the deliverable

​Constraints

​Deliverables

​Why This Sequencing

TASK: Extract & Freeze BigQuery Contract (Schema + Excel→BQ Expectations)

Context

Goal

Step 1: Locate all BQ-touch points in the repo

Step 2: Extract the contract

Step 2.5: Document Data Flow

Step 3: Identify Excel dependencies explicitly

Step 4: Produce the deliverable

Constraints

Deliverables

Why This Sequencing