Zero-Dollar Data Engineer
A complete guide to building a production-ready data engineering platform using only free tiers. Real pipelines, real warehousing, real compliance - zero budget required.
What this guide covers
GitHub Actions, Cloudflare D1, AWS S3, BigQuery Sandbox, dbt Core, Google Colab, Databricks Community Edition, Tableau Public, Microsoft Presidio, and DVC - all free, all production-grade patterns.
Why Kaggle Won't Get You Hired
Kaggle is a trap. It does not teach you how data moves from an application database to a warehouse, how to handle PII and compliance requirements, how to schedule pipelines that run reliably at 3 AM, or how to debug a broken data flow at 2 PM on a Friday. Real-world data work is 90% plumbing, 10% modeling. This guide is the antidote.
5 things Kaggle does NOT teach
- How data physically moves from application databases to analytics systems
- PII identification, masking, and compliance requirements (GDPR, CCPA, HIPAA)
- Scheduling and orchestrating pipelines that run reliably without human intervention
- Debugging silent failures - pipelines that run but produce wrong results
- Idempotency, data versioning, and making pipelines safe to re-run
The Architecture
A zero-dollar modern data stack that mirrors what enterprises use - just on free tiers. Every component maps directly to a paid enterprise equivalent.
ZERO-DOLLAR MODERN DATA STACK
┌─────────────────────────────────────────────────────────────────┐
│ DATA GENERATION │
│ Python Faker + Cloudflare D1 → GitHub Actions Cron (FREE) │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PII COMPLIANCE GATEWAY │
│ Cloudflare Worker + Microsoft Presidio (FREE) │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ DATA LAKE │
│ AWS S3 Free Tier (5 GB) or Cloudflare R2 (10 GB/month) │
└───────────────┬─────────────────────────────────────────────────┘
│
┌────────┴────────┐
▼ ▼
┌──────────────┐ ┌─────────────────────────────────────────┐
│ ETL ENGINE │ │ ORCHESTRATION │
│ dbt Core │ │ GitHub Actions (2000 min/month FREE) │
└──────┬───────┘ └────────────────────┬────────────────────┘
└───────────────┬───────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ DATA WAREHOUSE │
│ BigQuery Sandbox (1 TB queries/month FREE) │
└────────────────┬──────────────────────┬───────────────────────┬─┘
│ │ │
▼ ▼ ▼
┌──────────────────────┐ ┌─────────────────────┐ ┌──────────────────────┐
│ MODEL TRAINING │ │ VISUALIZATION │ │ DATA VERSIONING │
│ Google Colab (FREE) │ │ Tableau Public │ │ DVC + GitHub │
│ Databricks CE │ │ Looker Studio │ │ (FREE) │
└──────────────────────┘ └─────────────────────┘ └──────────────────────┘
Stack Components
Python Faker, Cloudflare D1, GitHub Actions Cron
Faker creates realistic synthetic data. D1 is serverless SQLite at the edge. Actions runs cron jobs for free.
Cloudflare Worker + Microsoft Presidio
Workers intercept data at the edge before it hits storage. Presidio detects and masks 30+ PII entity types.
AWS S3 Free Tier (5 GB) or Cloudflare R2 (10 GB/month)
Industry-standard object storage. Parquet format keeps files small and query-fast.
BigQuery Sandbox, dbt Core, GitHub Actions
BigQuery gives 1 TB free queries per month. dbt transforms data with version-controlled SQL. Actions orchestrates it all.
Google Colab, Databricks Community Edition, MLflow local
Colab gives free GPU access. Databricks CE provides a full Spark environment. MLflow tracks experiments locally.
Tableau Public, Looker Studio
Tableau Public is free and your work is publicly discoverable by recruiters. Looker Studio connects directly to BigQuery.
Your First Week
You do not need 4 months. The entire foundation can be working in 7 days. Here is the daily plan.
Day 1: Create a GitHub repo called zero-dollar-data-stack. Install Python and Faker. Write generate_data.py producing 1,000 synthetic customers and 5,000 transactions. Commit and push.
Day 2: Add a GitHub Actions cron workflow that runs your generator script hourly. Validate the YAML, trigger a manual run, confirm it completes in the Actions tab.
Day 3: Create a Cloudflare account. Deploy your first Worker. Route generated data through a basic PII masking function. Verify masked emails in the Worker response.
Day 4: Set up AWS S3 free tier or Cloudflare R2. Write Python to convert your masked data to Parquet and upload it. Confirm the file appears in the bucket.
Day 5: Create a BigQuery Sandbox project. Write a Python ETL script to load your Parquet file from S3 into a BigQuery table. Run your first SQL query against your own pipeline.
Day 6: Install dbt Core. Initialize a dbt project. Write a staging model that cleans and standardizes your BigQuery data. Add not_null and unique tests. Run dbt from the CLI.
Day 7: Connect Tableau Public to BigQuery. Build one meaningful chart (daily transaction volume or customer tier distribution). Publish it. Your zero-dollar stack is live.
Code Examples
Production-ready code snippets for every layer of the stack. Copy, adapt, and commit these to your repo.
1. Synthetic Data Generation with Faker
# generate_customers.py
from pathlib import Path
import random
from faker import Faker
import pandas as pd

fake = Faker()
Faker.seed(42)
random.seed(42)

def generate_customers(n: int) -> pd.DataFrame:
    records = []
    for _ in range(n):
        records.append({
            'customer_id': fake.uuid4(),
            'name': fake.name(),
            'email': fake.email(),
            'phone': fake.phone_number(),
            'ssn': fake.ssn(),  # PII - will be masked
            'address': fake.address().replace('\n', ', '),
            'signup_date': fake.date_between(start_date='-2y', end_date='today').isoformat(),
            'plan': random.choice(['free', 'pro', 'enterprise']),
            'monthly_spend': round(random.uniform(0, 500), 2),
            'churned': random.choice([0, 0, 0, 1]),  # ~25% churn rate
        })
    return pd.DataFrame(records)

if __name__ == '__main__':
    Path('data/raw').mkdir(parents=True, exist_ok=True)  # ensure target dir exists
    df = generate_customers(1000)
    df.to_parquet('data/raw/customers.parquet', index=False)
    print(f"Generated {len(df)} customers → data/raw/customers.parquet")
2. GitHub Actions Schedule
# .github/workflows/daily_pipeline.yml
name: Daily Data Pipeline
on:
  schedule:
    - cron: '0 3 * * *'  # 3 AM UTC every day
  workflow_dispatch:     # allow manual trigger
jobs:
  generate-and-load:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install faker pandas pyarrow boto3 google-cloud-bigquery
      - name: Generate synthetic data
        run: python generate_customers.py
      - name: Mask PII via Cloudflare Worker
        env:
          WORKER_URL: ${{ secrets.WORKER_URL }}
        run: python mask_pii.py
      - name: Upload to S3
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: python upload_to_s3.py
      - name: Load to BigQuery
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GOOGLE_CREDENTIALS }}
        run: python load_to_bigquery.py
      - name: Run dbt models
        run: |
          pip install dbt-bigquery
          dbt run --profiles-dir .
          dbt test --profiles-dir .
3. PII Masking Cloudflare Worker
// worker.js - deployed to Cloudflare Workers
const PII_PATTERNS = {
  email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
  phone: /(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
};

function maskPII(text) {
  let masked = text;
  masked = masked.replace(PII_PATTERNS.email, '[EMAIL_REDACTED]');
  masked = masked.replace(PII_PATTERNS.phone, '[PHONE_REDACTED]');
  masked = masked.replace(PII_PATTERNS.ssn, '[SSN_REDACTED]');
  return masked;
}

export default {
  async fetch(request, env) {
    if (request.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }
    const body = await request.json();
    const clean = {};
    const maskedFields = [];
    for (const [key, value] of Object.entries(body)) {
      if (typeof value === 'string') {
        clean[key] = maskPII(value);
        if (clean[key] !== value) maskedFields.push(key);
      } else {
        clean[key] = value;
      }
    }
    // Log PII detection event to D1 audit table - only the fields actually masked
    await env.DB.prepare(
      'INSERT INTO pii_audit (ts, record_id, fields_masked) VALUES (?, ?, ?)'
    ).bind(Date.now(), body.customer_id, JSON.stringify(maskedFields)).run();
    return Response.json(clean);
  },
};
4. ETL to BigQuery
# load_to_bigquery.py
import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account
import json, os
credentials = service_account.Credentials.from_service_account_info(
    json.loads(os.environ['GOOGLE_CREDENTIALS'])
)
client = bigquery.Client(credentials=credentials, project='your-project-id')

def load_parquet_to_bq(parquet_path: str, table_id: str) -> None:
    df = pd.read_parquet(parquet_path)
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
        source_format=bigquery.SourceFormat.PARQUET,
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # wait for completion
    table = client.get_table(table_id)
    print(f"Loaded {table.num_rows} rows into {table_id}")

if __name__ == '__main__':
    load_parquet_to_bq(
        parquet_path='data/masked/customers.parquet',
        table_id='your-project-id.raw.customers',
    )
5. BigQuery Query from Colab
# Google Colab - authenticate and query BigQuery
from google.colab import auth
from google.cloud import bigquery
import pandas as pd
auth.authenticate_user() # opens browser OAuth flow
client = bigquery.Client(project='your-project-id')
query = """
    SELECT
        plan,
        COUNT(*) AS total_customers,
        AVG(monthly_spend) AS avg_spend,
        SUM(churned) / COUNT(*) AS churn_rate
    FROM `your-project-id.marts.customers`
    GROUP BY plan
    ORDER BY avg_spend DESC
"""
df = client.query(query).to_dataframe()
print(df.to_string(index=False))
# Quick visualisation
import matplotlib.pyplot as plt
df.plot(x='plan', y='churn_rate', kind='bar', figsize=(8, 4))
plt.title('Churn Rate by Plan')
plt.tight_layout()
plt.show()
6. Idempotent Load Pattern
# idempotent_load.py - safe to re-run any number of times
from google.cloud import bigquery
def idempotent_merge(client, staging_table: str, target_table: str, key: str) -> None:
    """
    MERGE is idempotent: running it twice produces the same result.
    Never use APPEND without a deduplication step.
    """
    merge_sql = f"""
        MERGE `{target_table}` AS target
        USING `{staging_table}` AS source
        ON target.{key} = source.{key}
        WHEN MATCHED THEN
            UPDATE SET
                target.email = source.email,
                target.monthly_spend = source.monthly_spend,
                target.churned = source.churned,
                target.updated_at = CURRENT_TIMESTAMP()
        WHEN NOT MATCHED THEN
            INSERT ROW
    """
    job = client.query(merge_sql)
    job.result()
    print(f"Merge complete: {staging_table} -> {target_table} on {key}")
7. DVC Data Versioning
# bash - set up DVC with S3 remote
pip install dvc[s3]
# Initialize DVC in your repo
dvc init
git commit -m "Initialize DVC"
# Configure S3 as the remote storage
dvc remote add -d myremote s3://your-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"
# Track your data files
dvc add data/raw/customers.parquet
dvc add data/masked/customers.parquet
git add data/raw/customers.parquet.dvc data/masked/customers.parquet.dvc .gitignore
git commit -m "Add DVC tracked datasets v1"
# Push data to S3
dvc push
# After next pipeline run - track new version
dvc add data/raw/customers.parquet
git add data/raw/customers.parquet.dvc
git commit -m "Update dataset v2 - 5000 records"
dvc push
# Reproduce a previous version
git checkout v1
dvc pull
Pro Tips
Lessons from production pipelines that beginner guides skip. These are the details that separate engineers who understand data systems from those who just learned the syntax.
Idempotency - always use MERGE or WRITE_TRUNCATE
Never use APPEND without a deduplication step. If your pipeline reruns due to a failure, APPEND creates duplicate rows silently. MERGE updates existing rows and inserts new ones - the result is identical whether you run it once or ten times. WRITE_TRUNCATE replaces the table entirely on each run. Both are safe. APPEND without dedup is not.
Secret management - never hardcode credentials
Every credential that appears in your source code will eventually end up in a git history, a log file, or a Slack message. Use GitHub Actions secrets for CI/CD, python-dotenv for local development (and .env in .gitignore), and environment variables everywhere else. If you have ever committed a key, rotate it immediately - assume it is compromised.
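For local development, a minimal illustration of that pattern - a stdlib-only stand-in for python-dotenv's load_dotenv (in a real project, just pip install python-dotenv; the load_env name and .env format here follow the usual convention):

```python
# load_env.py - minimal stand-in for python-dotenv, stdlib only.
# In CI there is no .env file; secrets arrive via GitHub Actions env vars.
import os

def load_env(path: str = '.env') -> None:
    """Read KEY=VALUE lines into os.environ without overwriting existing vars."""
    if not os.path.exists(path):
        return  # no .env (e.g. running in CI) - rely on real env vars
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#') or '=' not in line:
                continue  # skip blanks, comments, malformed lines
            key, _, value = line.partition('=')
            os.environ.setdefault(key.strip(), value.strip())

load_env()
bucket = os.environ.get('S3_BUCKET')  # read config from the environment, never from source
```

Keep .env in .gitignore from the very first commit; adding it later does not remove it from history.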
Monitoring without paying - Python logging to GitHub Actions
GitHub Actions workflow summaries are free and persistent. After each pipeline run, write row counts, null rates, schema changes, and PII detection counts to the summary using the GITHUB_STEP_SUMMARY environment variable. You get a searchable audit log of every pipeline run at zero cost. Add a simple row count assertion - if output rows are less than 90% of input rows, fail the job.
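A sketch of that pattern, assuming the standard GITHUB_STEP_SUMMARY behavior (Actions sets it to a file path; Markdown appended to that file renders in the run summary). The write_summary helper and metric names are illustrative:

```python
# pipeline_summary.py - write run metrics to the GitHub Actions job summary
# and fail the job on a suspicious row-count drop.
import os

def write_summary(rows_in: int, rows_out: int, null_emails: int) -> None:
    lines = [
        '## Pipeline Run Report',
        '| Metric | Value |',
        '|---|---|',
        f'| Input rows | {rows_in} |',
        f'| Output rows | {rows_out} |',
        f'| Null emails | {null_emails} |',
    ]
    summary_path = os.environ.get('GITHUB_STEP_SUMMARY')
    if summary_path:  # inside GitHub Actions
        with open(summary_path, 'a') as f:
            f.write('\n'.join(lines) + '\n')
    else:  # local run - just print
        print('\n'.join(lines))
    # Simple assertion: a large silent drop should fail the job, not pass quietly
    if rows_out < 0.9 * rows_in:
        raise RuntimeError(f'Row count dropped: {rows_in} -> {rows_out}')

write_summary(rows_in=1000, rows_out=998, null_emails=0)
```

Call this at the end of every load step; the summary accumulates per run, giving a free audit trail.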
Data versioning with DVC
Git tracks code. DVC tracks data. Without DVC, you cannot reproduce results from three months ago because you do not know what the data looked like then. DVC stores a small .dvc pointer file in git (a content hash) and stores the actual data in S3. Checking out a git tag restores both the code and the exact data version that was used. This is a core ML engineering practice and almost no portfolio projects demonstrate it.
The Enterprise Comparison
Every component in this stack maps directly to its enterprise equivalent. When you join a company using Snowflake and Airflow, you already understand the patterns.
| Layer | Your Free Stack | Enterprise Equivalent | Cost Saved |
|---|---|---|---|
| Operational DB | Cloudflare D1 (SQLite) | AWS RDS PostgreSQL | $200+/mo |
| Data Lake | AWS S3 / Cloudflare R2 | Databricks Delta Lake | $500+/mo |
| Warehouse | BigQuery Sandbox | Snowflake / Redshift | $400+/mo |
| Orchestration | GitHub Actions Cron | Apache Airflow / Prefect | $100+/mo |
| Transformation | dbt Core | dbt Cloud | $100+/mo |
| ML Platform | Colab + Databricks CE | Databricks ML / SageMaker | $1000+/mo |
| BI | Tableau Public / Looker Studio | Tableau Server / Looker | $300+/mo |
| Monitoring | GitHub Actions Summaries | Monte Carlo / Great Expectations | $200+/mo |
Take the Challenge
Reading is not building. Start this week. The first milestone is straightforward and takes less than two hours.
Week 1 Mission
1. Create a GitHub repository named zero-dollar-data-stack
2. Install Python 3.11+ and the Faker library (pip install faker pandas pyarrow)
3. Copy the Faker data generator from the Code Examples section and adapt it for your domain
4. Generate your first 100 synthetic customers as a Parquet file
5. Push to GitHub with a clear README describing what the data represents
6. Come back next week and build the GitHub Actions cron job to automate it

The goal is not perfection. The goal is a public GitHub commit with real code that you can point to. That commit is worth more than 10 Kaggle notebooks.
The Invisible 80%
The zero-dollar stack covers the visible parts of data engineering. But experienced practitioners know that roughly 80% of what determines whether a project succeeds or fails is invisible in any tutorial. Here are the layers nobody teaches.
This is not about adding more tools. It is about building the mindset that separates someone who can run a pipeline from someone who owns one.
The Economics of Data
Every data project is a business decision dressed in technical clothing. Most beginners ask "Can we build this model?" Senior practitioners ask "Should we build this model, and what does it cost if we are wrong?"
A false positive in fraud detection costs investigation time. A false negative costs actual fraud. These are not equal. Design your model around the cost that matters most.
If building a model costs $50k in engineering time and saves $10k per year, it is a bad investment. Calculate before you build, not after.
Messy code today costs more to fix tomorrow. Every shortcut is a loan with compound interest. Pay it down before it compounds.
While you are optimizing one model, what are you not building? A 1% accuracy improvement rarely justifies 6 months of engineering time.
Data Governance and Lineage
Nobody trusts data automatically. Stakeholders ask: where did this number come from? Who transformed it? When? Why does it differ from last week? Without lineage, you cannot answer. With lineage, you trace any number back to its source in minutes.
Use OpenLineage (free) with Marquez. Document every transformation in dbt docs. Every ETL job writes a manifest: timestamp, row counts, source query.
Great Expectations checks in your GitHub Actions. Define what the data must look like before it touches the warehouse.
DataHub or Amundsen (both open source). A catalog of what data exists, what it means, and who owns it.
A Cloudflare Worker that checks BigQuery timestamps and sends an email if data is stale by more than N hours.
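A minimal manifest writer along those lines - the field names are illustrative, not a formal OpenLineage event:

```python
# write_manifest.py - emit a small lineage manifest after every ETL job,
# so any number can be traced to the run and query that produced it.
import json
from datetime import datetime, timezone

def write_manifest(job_name: str, source_query: str,
                   rows_in: int, rows_out: int,
                   path: str = 'manifest.json') -> dict:
    manifest = {
        'job': job_name,
        'ran_at': datetime.now(timezone.utc).isoformat(),
        'source_query': source_query,
        'rows_in': rows_in,
        'rows_out': rows_out,
    }
    with open(path, 'w') as f:
        json.dump(manifest, f, indent=2)
    return manifest

m = write_manifest('load_customers',
                   'SELECT * FROM raw.customers', 1000, 1000)
```

Commit the manifests (or upload them next to the data in S3) and lineage questions become a grep, not an archaeology project.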
Experimentation Infrastructure
Most pipelines stop at prediction: "We predict which users will churn." A complete pipeline closes the loop: "We predicted churn, ran an intervention, and measured the causal impact." Prediction without measurement is not a data product - it is a hypothesis.
The ability to turn models on and off for specific user segments. Without this, you cannot isolate model impact from other changes.
Randomized assignment of treatment and control. "Churn went down" is not evidence. "Churn went down in the treatment group, not the control group" is evidence.
When you cannot randomize, use synthetic controls or difference-in-differences. The absence of a control group does not mean the absence of causal claims.
Capture actual outcomes to retrain the model. A deployed model with no feedback mechanism degrades silently over time.
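One common way to get stable randomized assignment with zero extra infrastructure is to hash the customer ID with an experiment salt - a sketch (the experiment name is a placeholder):

```python
# assign_treatment.py - deterministic 50/50 A/B assignment.
# Hashing (experiment + customer_id) gives a stable, reproducible split
# with no assignment table to maintain.
import hashlib

def assign(customer_id: str, experiment: str = 'churn_email_v1') -> str:
    digest = hashlib.sha256(f'{experiment}:{customer_id}'.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99, approximately uniform
    return 'treatment' if bucket < 50 else 'control'

# The same customer always lands in the same group for a given experiment:
assert assign('cust-42') == assign('cust-42')
```

Changing the experiment salt reshuffles everyone, so each experiment gets an independent randomization.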
Incident Response and On-Call
Data pipelines fail at 3 AM on Sundays. They fail when your CEO is presenting. They fail when you are on vacation. The question is not whether your pipeline will fail - it is how fast you can detect and recover.
GitHub Actions sends email on failure. UptimeRobot (free tier) can ping a health-check endpoint. You should know about failures before stakeholders do.
A /runbooks directory in your repo with markdown files for common failure modes. "Pipeline failed" should trigger a procedure, not a search.
After every incident, write a blameless post-mortem. What happened, why, and what changes prevent recurrence. Track in GitHub Issues.
Define what "working" means: data freshness under 2 hours, row counts within 20% of expected, schema unchanged. Monitor these explicitly.
Data Quality as Code
Trust is built on quality, not promises. Your pipeline should stop if data quality checks fail. Bad data should never reach the warehouse.
max(timestamp) is within the expected window. If data is more than 2 hours old, halt and alert.
Row count is within 20% of the historical average. A 90% drop usually means a broken source, not a quiet day.
Column names and types match the contract. A silent upstream schema change is a classic cause of silent downstream failures.
Statistical distributions of key fields have not shifted dramatically. Use Evidently AI (free) for drift detection.
COUNT(DISTINCT id) = COUNT(*). Duplicate primary keys corrupt every downstream join.
Every foreign key points to a valid record. Orphaned transaction records without a customer are a data modeling failure.
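The checks above can be wired in as a single gate function that raises before anything is loaded. A sketch against a pandas DataFrame, using the thresholds from the text (tune them for your data); freshness and drift are omitted here because they need a timestamp column and history:

```python
# quality_gate.py - stop the pipeline if data quality checks fail.
# Column names and the 20% volume threshold follow the text above.
import pandas as pd

def quality_gate(df: pd.DataFrame, expected_rows: int) -> None:
    # Volume: row count within 20% of the historical average
    if abs(len(df) - expected_rows) > 0.2 * expected_rows:
        raise ValueError(f'Volume check failed: got {len(df)} rows')
    # Uniqueness: duplicate primary keys corrupt downstream joins
    if df['customer_id'].duplicated().any():
        raise ValueError('Uniqueness check failed: duplicate customer_id')
    # Schema: column names match the contract
    expected_cols = {'customer_id', 'plan', 'monthly_spend'}
    missing = expected_cols - set(df.columns)
    if missing:
        raise ValueError(f'Schema check failed: missing {missing}')

df = pd.DataFrame({
    'customer_id': ['a', 'b', 'c'],
    'plan': ['free', 'pro', 'free'],
    'monthly_spend': [0.0, 49.0, 0.0],
})
quality_gate(df, expected_rows=3)  # passes silently
```

Run this between the masking step and the S3 upload, so bad data never reaches the lake, let alone the warehouse.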
Security and Compliance by Design
Do not add security at the end. Build it into every layer from the start. The practices you develop on a free-tier project are identical to what enterprises enforce.
Your ETL service account should only be able to write to one table. Not all tables. Not all datasets. Only what it needs.
Credentials live in GitHub Actions secrets or a .env file that is gitignored. Never in source code. Never in the commit history.
Raw data should be deleted after 90 days. Processed aggregates after 2 years. Schedule cleanup jobs. Old data is liability, not asset.
BigQuery audit logs and Cloudflare Worker logs tell you who accessed what data and when. Even in development, build this habit.
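For the 90-day retention rule, a sketch of the decision logic, assuming date-partitioned keys like raw/dt=YYYY-MM-DD/... (a common but not universal convention); pair it with boto3's delete_objects in the actual cleanup job:

```python
# retention.py - find raw objects past the retention window.
# Assumes keys carry a dt=YYYY-MM-DD partition segment.
from datetime import date, timedelta

def expired_keys(keys: list[str], today: date,
                 retention_days: int = 90) -> list[str]:
    cutoff = today - timedelta(days=retention_days)
    out = []
    for key in keys:
        try:
            dt = date.fromisoformat(key.split('dt=')[1][:10])
        except (IndexError, ValueError):
            continue  # skip keys that don't follow the convention
        if dt < cutoff:
            out.append(key)
    return out

keys = ['raw/dt=2023-01-01/customers.parquet',
        'raw/dt=2024-05-01/customers.parquet']
print(expired_keys(keys, today=date(2024, 6, 1)))
# → ['raw/dt=2023-01-01/customers.parquet']
```

Schedule it as one more GitHub Actions cron step; S3 lifecycle rules can do the same thing declaratively if you prefer configuration over code.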
Disaster Recovery and Business Continuity
If someone runs DROP TABLE orders right now, how long until you recover? Hours? Days? The answer determines whether this is a portfolio project or a production system.
BigQuery time travel lets you query any table as it existed up to 7 days ago. Enable it. Test it. Add it to your runbook.
S3 versioning keeps every version of every object. DVC versions datasets alongside code. A rollback should take minutes.
Git for code. DVC for data. MLflow for models. Everything should be reversible. If it is not reversible, it should require a second approval.
Know your recovery time objective before the outage happens. Write it down. A plan you write during an incident is not a plan - it is improvisation.
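Time travel is plain SQL: FOR SYSTEM_TIME AS OF queries the table as it existed at a past timestamp (up to 7 days back by default). A small helper to build the statement - run it with the BigQuery client from the ETL example; the table ID is a placeholder:

```python
# time_travel.py - build a BigQuery time-travel query for disaster recovery.
def time_travel_sql(table_id: str, hours_ago: int) -> str:
    """Return SQL selecting the table as it existed `hours_ago` hours ago."""
    return (
        f'SELECT * FROM `{table_id}` '
        f'FOR SYSTEM_TIME AS OF '
        f'TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours_ago} HOUR)'
    )

sql = time_travel_sql('your-project-id.raw.customers', 24)
print(sql)
```

Recovering from an accidental DROP or bad load is then a CREATE TABLE ... AS over this SELECT - test it once before you need it, and put the exact commands in your runbook.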
The Human Layer
You can build the perfect model. If nobody uses it, it is worthless. Adoption is a design problem, not a communication problem.
People do not trust what they do not understand. Explain in business terms. Show examples. Be willing to say "I do not know" and come back with the answer.
If your tool requires 5 steps instead of 2, nobody uses it. Optimize for user experience first, technical elegance second.
Data often challenges existing beliefs. "The data suggests..." lands better than "You were wrong about...". Frame insights as opportunities, not indictments.
If you are the only person who understands how it works, you are the bottleneck. Write docs. Record walkthroughs. Make yourself replaceable.
Before You Call It Complete
Technical
- Pipeline handles failures gracefully and alerts on SLO breaches
- Data quality checks run automatically on every load
- Any number can be traced back to its source
- Recovery from a disaster scenario has been tested
- Code is documented, versioned, and reproducible
Business
- The ROI of the project is calculated and documented
- Cost of a false positive vs false negative is known
- A non-technical person can use the output without help
- Assumptions and limitations are explicitly documented
- You can explain the architecture in an interview without notes
Notable Free Resources
Official documentation and community resources for every tool in the stack.
Faker - Python synthetic data generation library
Cloudflare D1 - Serverless SQLite at the edge
BigQuery Sandbox - 1 TB free queries per month, no credit card
GitHub Actions Docs - 2,000 free minutes per month on private repos; unlimited for public repos
dbt Core - Transform data with version-controlled SQL
Tableau Public - Free BI with public portfolio publishing
Databricks Community - Free Spark environment with notebook UI
DVC - Data version control for ML pipelines
Microsoft Presidio - Open source PII detection and anonymization