Zero-Dollar Data Engineer
A complete guide to building a production-ready data engineering platform using only free tiers. Real pipelines, real warehousing, real compliance - zero budget required.
What this guide covers
GitHub Actions, Cloudflare D1, AWS S3, BigQuery Sandbox, dbt Core, Google Colab, Databricks Community Edition, Tableau Public, Microsoft Presidio, and DVC - all free, all production-grade patterns.
Why Kaggle Won't Get You Hired
Kaggle is a trap. It does not teach you how data moves from an application database to a warehouse, how to handle PII and compliance requirements, how to schedule pipelines that run reliably at 3 AM, or how to debug a broken data flow at 2 PM on a Friday. Real-world data work is 90% plumbing, 10% modeling. This guide is the antidote.
5 things Kaggle does NOT teach
- How data physically moves from application databases to analytics systems
- PII identification, masking, and compliance requirements (GDPR, CCPA, HIPAA)
- Scheduling and orchestrating pipelines that run reliably without human intervention
- Debugging silent failures - pipelines that run but produce wrong results
- Idempotency, data versioning, and making pipelines safe to re-run
The Architecture
A zero-dollar modern data stack that mirrors what enterprises use - just on free tiers. Every component maps directly to a paid enterprise equivalent.
ZERO-DOLLAR MODERN DATA STACK
┌─────────────────────────────────────────────────────────────────┐
│ DATA GENERATION │
│ Python Faker + Cloudflare D1 → GitHub Actions Cron (FREE) │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PII COMPLIANCE GATEWAY │
│ Cloudflare Worker + Microsoft Presidio (FREE) │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ DATA LAKE │
│ AWS S3 Free Tier (5 GB) or Cloudflare R2 (10 GB/month) │
└───────────────┬─────────────────────────────────────────────────┘
│
┌────────┴────────┐
▼ ▼
┌──────────────┐ ┌─────────────────────────────────────────┐
│ ETL ENGINE │ │ ORCHESTRATION │
│ dbt Core │ │ GitHub Actions (2000 min/month FREE) │
└──────┬───────┘ └────────────────────┬────────────────────┘
└───────────────┬───────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ DATA WAREHOUSE │
│ BigQuery Sandbox (1 TB queries/month FREE) │
└────────────────┬──────────────────────┬───────────────────────┬─┘
│ │ │
▼ ▼ ▼
┌──────────────────────┐ ┌─────────────────────┐ ┌──────────────────────┐
│ MODEL TRAINING │ │ VISUALIZATION │ │ DATA VERSIONING │
│ Google Colab (FREE) │ │ Tableau Public │ │ DVC + GitHub │
│ Databricks CE │ │ Looker Studio │ │ (FREE) │
└──────────────────────┘ └─────────────────────┘ └──────────────────────┘
Stack Components
Python Faker, Cloudflare D1, GitHub Actions Cron
Faker creates realistic synthetic data. D1 is serverless SQLite at the edge. Actions runs cron jobs for free.
Cloudflare Worker + Microsoft Presidio
Workers intercept data at the edge before it hits storage. Presidio detects and masks 30+ PII entity types.
AWS S3 Free Tier (5 GB) or Cloudflare R2 (10 GB/month)
Industry-standard object storage. Parquet format keeps files small and query-fast.
BigQuery Sandbox, dbt Core, GitHub Actions
BigQuery gives 1 TB free queries per month. dbt transforms data with version-controlled SQL. Actions orchestrates it all.
Google Colab, Databricks Community Edition, MLflow local
Colab gives free GPU access. Databricks CE provides a full Spark environment. MLflow tracks experiments locally.
Tableau Public, Looker Studio
Tableau Public is free and your work is publicly discoverable by recruiters. Looker Studio connects directly to BigQuery.
Your First Week
You do not need 4 months. The entire foundation can be working in 7 days. Here is the daily plan.
Day 1: Create a GitHub repo called zero-dollar-data-stack. Install Python and Faker. Write generate_data.py producing 1,000 synthetic customers and 5,000 transactions. Commit and push.
Day 2: Add a GitHub Actions cron workflow that runs your generator script hourly. Validate the YAML, trigger a manual run, confirm it completes in the Actions tab.
Day 3: Create a Cloudflare account. Deploy your first Worker. Route generated data through a basic PII masking function. Verify masked emails in the Worker response.
Day 4: Set up AWS S3 free tier or Cloudflare R2. Write Python to convert your masked data to Parquet and upload it. Confirm the file appears in the bucket.
Day 5: Create a BigQuery Sandbox project. Write a Python ETL script to load your Parquet file from S3 into a BigQuery table. Run your first SQL query against your own pipeline.
Day 6: Install dbt Core. Initialize a dbt project. Write a staging model that cleans and standardizes your BigQuery data. Add not_null and unique tests. Run dbt from the CLI.
Day 7: Connect Tableau Public to BigQuery. Build one meaningful chart (daily transaction volume or customer tier distribution). Publish it. Your zero-dollar stack is live.
Code Examples
Production-ready code snippets for every layer of the stack. Copy, adapt, and commit these to your repo.
1. Synthetic Data Generation with Faker
# generate_customers.py
from pathlib import Path
import random
from faker import Faker
import pandas as pd

fake = Faker()
Faker.seed(42)
random.seed(42)

def generate_customers(n: int) -> pd.DataFrame:
    records = []
    for _ in range(n):
        records.append({
            'customer_id': fake.uuid4(),
            'name': fake.name(),
            'email': fake.email(),
            'phone': fake.phone_number(),
            'ssn': fake.ssn(),  # PII - will be masked
            'address': fake.address().replace('\n', ', '),
            'signup_date': fake.date_between(start_date='-2y', end_date='today').isoformat(),
            'plan': random.choice(['free', 'pro', 'enterprise']),
            'monthly_spend': round(random.uniform(0, 500), 2),
            'churned': random.choice([0, 0, 0, 1]),  # ~25% churn rate
        })
    return pd.DataFrame(records)

if __name__ == '__main__':
    Path('data/raw').mkdir(parents=True, exist_ok=True)  # ensure target dir exists
    df = generate_customers(1000)
    df.to_parquet('data/raw/customers.parquet', index=False)
    print(f"Generated {len(df)} customers → data/raw/customers.parquet")
2. GitHub Actions Schedule
# .github/workflows/daily_pipeline.yml
name: Daily Data Pipeline
on:
  schedule:
    - cron: '0 3 * * *'  # 3 AM UTC every day
  workflow_dispatch:     # allow manual trigger
jobs:
  generate-and-load:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install faker pandas pyarrow boto3 google-cloud-bigquery
      - name: Generate synthetic data
        run: python generate_customers.py
      - name: Mask PII via Cloudflare Worker
        env:
          WORKER_URL: ${{ secrets.WORKER_URL }}
        run: python mask_pii.py
      - name: Upload to S3
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: python upload_to_s3.py
      - name: Load to BigQuery
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GOOGLE_CREDENTIALS }}
        run: python load_to_bigquery.py
      - name: Run dbt models
        run: |
          pip install dbt-bigquery
          dbt run --profiles-dir .
          dbt test --profiles-dir .
3. PII Masking Cloudflare Worker
// worker.js - deployed to Cloudflare Workers
const PII_PATTERNS = {
  email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
  phone: /(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
};

function maskPII(text) {
  let masked = text;
  masked = masked.replace(PII_PATTERNS.email, '[EMAIL_REDACTED]');
  masked = masked.replace(PII_PATTERNS.phone, '[PHONE_REDACTED]');
  masked = masked.replace(PII_PATTERNS.ssn, '[SSN_REDACTED]');
  return masked;
}

export default {
  async fetch(request, env) {
    if (request.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }
    const body = await request.json();
    const clean = {};
    const maskedFields = [];
    for (const [key, value] of Object.entries(body)) {
      if (typeof value === 'string') {
        clean[key] = maskPII(value);
        if (clean[key] !== value) maskedFields.push(key);
      } else {
        clean[key] = value;
      }
    }
    // Log PII detection event to D1 audit table - only the fields actually masked
    await env.DB.prepare(
      'INSERT INTO pii_audit (ts, record_id, fields_masked) VALUES (?, ?, ?)'
    ).bind(Date.now(), body.customer_id, JSON.stringify(maskedFields)).run();
    return Response.json(clean);
  },
};
4. ETL to BigQuery
# load_to_bigquery.py
import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account
import json, os
credentials = service_account.Credentials.from_service_account_info(
    json.loads(os.environ['GOOGLE_CREDENTIALS'])
)
client = bigquery.Client(credentials=credentials, project='your-project-id')

def load_parquet_to_bq(parquet_path: str, table_id: str) -> None:
    df = pd.read_parquet(parquet_path)
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
        source_format=bigquery.SourceFormat.PARQUET,
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # wait for completion
    table = client.get_table(table_id)
    print(f"Loaded {table.num_rows} rows into {table_id}")

if __name__ == '__main__':
    load_parquet_to_bq(
        parquet_path='data/masked/customers.parquet',
        table_id='your-project-id.raw.customers',
    )
5. BigQuery Query from Colab
# Google Colab - authenticate and query BigQuery
from google.colab import auth
from google.cloud import bigquery
import pandas as pd
auth.authenticate_user() # opens browser OAuth flow
client = bigquery.Client(project='your-project-id')
query = """
    SELECT
        plan,
        COUNT(*) AS total_customers,
        AVG(monthly_spend) AS avg_spend,
        SUM(churned) / COUNT(*) AS churn_rate
    FROM `your-project-id.marts.customers`
    GROUP BY plan
    ORDER BY avg_spend DESC
"""
df = client.query(query).to_dataframe()
print(df.to_string(index=False))
# Quick visualisation
import matplotlib.pyplot as plt
df.plot(x='plan', y='churn_rate', kind='bar', figsize=(8, 4))
plt.title('Churn Rate by Plan')
plt.tight_layout()
plt.show()
6. Idempotent Load Pattern
# idempotent_load.py - safe to re-run any number of times
from google.cloud import bigquery
def idempotent_merge(client, staging_table: str, target_table: str, key: str) -> None:
    """
    MERGE is idempotent: running it twice produces the same result.
    Never use APPEND without a deduplication step.
    """
    merge_sql = f"""
        MERGE `{target_table}` AS target
        USING `{staging_table}` AS source
        ON target.{key} = source.{key}
        WHEN MATCHED THEN
            UPDATE SET
                target.email = source.email,
                target.monthly_spend = source.monthly_spend,
                target.churned = source.churned,
                target.updated_at = CURRENT_TIMESTAMP()
        WHEN NOT MATCHED THEN
            INSERT ROW
    """
    job = client.query(merge_sql)
    job.result()
    print(f"Merge complete: {staging_table} -> {target_table} on {key}")
7. DVC Data Versioning
# bash - set up DVC with S3 remote
pip install dvc[s3]
# Initialize DVC in your repo
dvc init
git commit -m "Initialize DVC"
# Configure S3 as the remote storage
dvc remote add -d myremote s3://your-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"
# Track your data files
dvc add data/raw/customers.parquet
dvc add data/masked/customers.parquet
git add data/raw/customers.parquet.dvc data/masked/customers.parquet.dvc .gitignore
git commit -m "Add DVC tracked datasets v1"
# Push data to S3
dvc push
# After next pipeline run - track new version
dvc add data/raw/customers.parquet
git add data/raw/customers.parquet.dvc
git commit -m "Update dataset v2 - 5000 records"
dvc push
# Reproduce a previous version
git checkout v1
dvc pull
Pro Tips
Lessons from production pipelines that beginner guides skip. These are the details that separate engineers who understand data systems from those who just learned the syntax.
Idempotency - always use MERGE or WRITE_TRUNCATE
Never use APPEND without a deduplication step. If your pipeline reruns due to a failure, APPEND creates duplicate rows silently. MERGE updates existing rows and inserts new ones - the result is identical whether you run it once or ten times. WRITE_TRUNCATE replaces the table entirely on each run. Both are safe. APPEND without dedup is not.
Secret management - never hardcode credentials
Every credential that appears in your source code will eventually end up in a git history, a log file, or a Slack message. Use GitHub Actions secrets for CI/CD, python-dotenv for local development (and .env in .gitignore), and environment variables everywhere else. If you have ever committed a key, rotate it immediately - assume it is compromised.
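For local development, a minimal illustration of that pattern - a stdlib-only stand-in for python-dotenv's load_dotenv (in a real project, just pip install python-dotenv; the load_env name and .env format here follow the usual convention):

```python
# load_env.py - minimal stand-in for python-dotenv, stdlib only.
# In CI there is no .env file; secrets arrive via GitHub Actions env vars.
import os

def load_env(path: str = '.env') -> None:
    """Read KEY=VALUE lines into os.environ without overwriting existing vars."""
    if not os.path.exists(path):
        return  # no .env (e.g. running in CI) - rely on real env vars
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#') or '=' not in line:
                continue  # skip blanks, comments, malformed lines
            key, _, value = line.partition('=')
            os.environ.setdefault(key.strip(), value.strip())

load_env()
bucket = os.environ.get('S3_BUCKET')  # read config from the environment, never from source
```

Keep .env in .gitignore from the very first commit; adding it later does not remove it from history.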
Monitoring without paying - Python logging to GitHub Actions
GitHub Actions workflow summaries are free and persistent. After each pipeline run, write row counts, null rates, schema changes, and PII detection counts to the summary using the GITHUB_STEP_SUMMARY environment variable. You get a searchable audit log of every pipeline run at zero cost. Add a simple row count assertion - if output rows are less than 90% of input rows, fail the job.
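A sketch of that pattern, assuming the standard GITHUB_STEP_SUMMARY behavior (Actions sets it to a file path; Markdown appended to that file renders in the run summary). The write_summary helper and metric names are illustrative:

```python
# pipeline_summary.py - write run metrics to the GitHub Actions job summary
# and fail the job on a suspicious row-count drop.
import os

def write_summary(rows_in: int, rows_out: int, null_emails: int) -> None:
    lines = [
        '## Pipeline Run Report',
        '| Metric | Value |',
        '|---|---|',
        f'| Input rows | {rows_in} |',
        f'| Output rows | {rows_out} |',
        f'| Null emails | {null_emails} |',
    ]
    summary_path = os.environ.get('GITHUB_STEP_SUMMARY')
    if summary_path:  # inside GitHub Actions
        with open(summary_path, 'a') as f:
            f.write('\n'.join(lines) + '\n')
    else:  # local run - just print
        print('\n'.join(lines))
    # Simple assertion: a large silent drop should fail the job, not pass quietly
    if rows_out < 0.9 * rows_in:
        raise RuntimeError(f'Row count dropped: {rows_in} -> {rows_out}')

write_summary(rows_in=1000, rows_out=998, null_emails=0)
```

Call this at the end of every load step; the summary accumulates per run, giving a free audit trail.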
Data versioning with DVC
Git tracks code. DVC tracks data. Without DVC, you cannot reproduce results from three months ago because you do not know what the data looked like then. DVC stores a small .dvc pointer file in git (a content hash) and stores the actual data in S3. Checking out a git tag restores both the code and the exact data version that was used. This is a core ML engineering practice and almost no portfolio projects demonstrate it.
The Enterprise Comparison
Every component in this stack maps directly to its enterprise equivalent. When you join a company using Snowflake and Airflow, you already understand the patterns.
| Layer | Your Free Stack | Enterprise Equivalent | Cost Saved |
|---|---|---|---|
| Operational DB | Cloudflare D1 (SQLite) | AWS RDS PostgreSQL | $200+/mo |
| Data Lake | AWS S3 / Cloudflare R2 | Databricks Delta Lake | $500+/mo |
| Warehouse | BigQuery Sandbox | Snowflake / Redshift | $400+/mo |
| Orchestration | GitHub Actions Cron | Apache Airflow / Prefect | $100+/mo |
| Transformation | dbt Core | dbt Cloud | $100+/mo |
| ML Platform | Colab + Databricks CE | Databricks ML / SageMaker | $1000+/mo |
| BI | Tableau Public / Looker Studio | Tableau Server / Looker | $300+/mo |
| Monitoring | GitHub Actions Summaries | Monte Carlo / Great Expectations | $200+/mo |
Take the Challenge
Reading is not building. Start this week. The first milestone is straightforward and takes less than two hours.
Week 1 Mission
1. Create a GitHub repository named zero-dollar-data-stack
2. Install Python 3.11+ and the Faker library (pip install faker pandas pyarrow)
3. Copy the Faker data generator from the Code Examples section and adapt it for your domain
4. Generate your first 100 synthetic customers as a Parquet file
5. Push to GitHub with a clear README describing what the data represents
6. Come back next week and build the GitHub Actions cron job to automate it

The goal is not perfection. The goal is a public GitHub commit with real code that you can point to. That commit is worth more than 10 Kaggle notebooks.
The Invisible 80%
The zero-dollar stack covers the visible parts of data engineering. But experienced practitioners know that roughly 80% of what determines whether a project succeeds or fails is invisible in any tutorial. Here are the layers nobody teaches.
This is not about adding more tools. It is about building the mindset that separates someone who can run a pipeline from someone who owns one.
The Economics of Data
Every data project is a business decision dressed in technical clothing. Most beginners ask "Can we build this model?" Senior practitioners ask "Should we build this model, and what does it cost if we are wrong?"
A false positive in fraud detection costs investigation time. A false negative costs actual fraud. These are not equal. Design your model around the cost that matters most.
If building a model costs $50k in engineering time and saves $10k per year, it is a bad investment. Calculate before you build, not after.
Messy code today costs more to fix tomorrow. Every shortcut is a loan with compound interest. Pay it down before it compounds.
While you are optimizing one model, what are you not building? A 1% accuracy improvement rarely justifies 6 months of engineering time.
Data Governance and Lineage
Nobody trusts data automatically. Stakeholders ask: where did this number come from? Who transformed it? When? Why does it differ from last week? Without lineage, you cannot answer. With lineage, you trace any number back to its source in minutes.
Use OpenLineage (free) with Marquez. Document every transformation in dbt docs. Every ETL job writes a manifest: timestamp, row counts, source query.
Great Expectations checks in your GitHub Actions. Define what the data must look like before it touches the warehouse.
DataHub or Amundsen (both open source). A catalog of what data exists, what it means, and who owns it.
A Cloudflare Worker that checks BigQuery timestamps and sends an email if data is stale by more than N hours.
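A minimal manifest writer along those lines - the field names are illustrative, not a formal OpenLineage event:

```python
# write_manifest.py - emit a small lineage manifest after every ETL job,
# so any number can be traced to the run and query that produced it.
import json
from datetime import datetime, timezone

def write_manifest(job_name: str, source_query: str,
                   rows_in: int, rows_out: int,
                   path: str = 'manifest.json') -> dict:
    manifest = {
        'job': job_name,
        'ran_at': datetime.now(timezone.utc).isoformat(),
        'source_query': source_query,
        'rows_in': rows_in,
        'rows_out': rows_out,
    }
    with open(path, 'w') as f:
        json.dump(manifest, f, indent=2)
    return manifest

m = write_manifest('load_customers',
                   'SELECT * FROM raw.customers', 1000, 1000)
```

Commit the manifests (or upload them next to the data in S3) and lineage questions become a grep, not an archaeology project.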
Experimentation Infrastructure
Most pipelines stop at prediction: "We predict which users will churn." A complete pipeline closes the loop: "We predicted churn, ran an intervention, and measured the causal impact." Prediction without measurement is not a data product - it is a hypothesis.
The ability to turn models on and off for specific user segments. Without this, you cannot isolate model impact from other changes.
Randomized assignment of treatment and control. "Churn went down" is not evidence. "Churn went down in the treatment group, not the control group" is evidence.
When you cannot randomize, use synthetic controls or difference-in-differences. The absence of a control group does not mean the absence of causal claims.
Capture actual outcomes to retrain the model. A deployed model with no feedback mechanism degrades silently over time.
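One common way to get stable randomized assignment with zero extra infrastructure is to hash the customer ID with an experiment salt - a sketch (the experiment name is a placeholder):

```python
# assign_treatment.py - deterministic 50/50 A/B assignment.
# Hashing (experiment + customer_id) gives a stable, reproducible split
# with no assignment table to maintain.
import hashlib

def assign(customer_id: str, experiment: str = 'churn_email_v1') -> str:
    digest = hashlib.sha256(f'{experiment}:{customer_id}'.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99, approximately uniform
    return 'treatment' if bucket < 50 else 'control'

# The same customer always lands in the same group for a given experiment:
assert assign('cust-42') == assign('cust-42')
```

Changing the experiment salt reshuffles everyone, so each experiment gets an independent randomization.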
Incident Response and On-Call
Data pipelines fail at 3 AM on Sundays. They fail when your CEO is presenting. They fail when you are on vacation. The question is not whether your pipeline will fail - it is how fast you can detect and recover.
GitHub Actions sends email on failure. UptimeRobot (free tier) can ping a health-check endpoint. You should know about failures before stakeholders do.
A /runbooks directory in your repo with markdown files for common failure modes. "Pipeline failed" should trigger a procedure, not a search.
After every incident, write a blameless post-mortem. What happened, why, and what changes prevent recurrence. Track in GitHub Issues.
Define what "working" means: data freshness under 2 hours, row counts within 20% of expected, schema unchanged. Monitor these explicitly.
Data Quality as Code
Trust is built on quality, not promises. Your pipeline should stop if data quality checks fail. Bad data should never reach the warehouse.
max(timestamp) is within the expected window. If data is more than 2 hours old, halt and alert.
Row count is within 20% of the historical average. A 90% drop usually means a broken source, not a quiet day.
Column names and types match the contract. A silent upstream schema change is a classic cause of silent downstream failures.
Statistical distributions of key fields have not shifted dramatically. Use Evidently AI (free) for drift detection.
COUNT(DISTINCT id) = COUNT(*). Duplicate primary keys corrupt every downstream join.
Every foreign key points to a valid record. Orphaned transaction records without a customer are a data modeling failure.
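The checks above can be wired in as a single gate function that raises before anything is loaded. A sketch against a pandas DataFrame, using the thresholds from the text (tune them for your data); freshness and drift are omitted here because they need a timestamp column and history:

```python
# quality_gate.py - stop the pipeline if data quality checks fail.
# Column names and the 20% volume threshold follow the text above.
import pandas as pd

def quality_gate(df: pd.DataFrame, expected_rows: int) -> None:
    # Volume: row count within 20% of the historical average
    if abs(len(df) - expected_rows) > 0.2 * expected_rows:
        raise ValueError(f'Volume check failed: got {len(df)} rows')
    # Uniqueness: duplicate primary keys corrupt downstream joins
    if df['customer_id'].duplicated().any():
        raise ValueError('Uniqueness check failed: duplicate customer_id')
    # Schema: column names match the contract
    expected_cols = {'customer_id', 'plan', 'monthly_spend'}
    missing = expected_cols - set(df.columns)
    if missing:
        raise ValueError(f'Schema check failed: missing {missing}')

df = pd.DataFrame({
    'customer_id': ['a', 'b', 'c'],
    'plan': ['free', 'pro', 'free'],
    'monthly_spend': [0.0, 49.0, 0.0],
})
quality_gate(df, expected_rows=3)  # passes silently
```

Run this between the masking step and the S3 upload, so bad data never reaches the lake, let alone the warehouse.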
Security and Compliance by Design
Do not add security at the end. Build it into every layer from the start. The practices you develop on a free-tier project are identical to what enterprises enforce.
Your ETL service account should only be able to write to one table. Not all tables. Not all datasets. Only what it needs.
Credentials live in GitHub Actions secrets or a .env file that is gitignored. Never in source code. Never in the commit history.
Raw data should be deleted after 90 days. Processed aggregates after 2 years. Schedule cleanup jobs. Old data is liability, not asset.
BigQuery audit logs and Cloudflare Worker logs tell you who accessed what data and when. Even in development, build this habit.
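For the 90-day retention rule, a sketch of the decision logic, assuming date-partitioned keys like raw/dt=YYYY-MM-DD/... (a common but not universal convention); pair it with boto3's delete_objects in the actual cleanup job:

```python
# retention.py - find raw objects past the retention window.
# Assumes keys carry a dt=YYYY-MM-DD partition segment.
from datetime import date, timedelta

def expired_keys(keys: list[str], today: date,
                 retention_days: int = 90) -> list[str]:
    cutoff = today - timedelta(days=retention_days)
    out = []
    for key in keys:
        try:
            dt = date.fromisoformat(key.split('dt=')[1][:10])
        except (IndexError, ValueError):
            continue  # skip keys that don't follow the convention
        if dt < cutoff:
            out.append(key)
    return out

keys = ['raw/dt=2023-01-01/customers.parquet',
        'raw/dt=2024-05-01/customers.parquet']
print(expired_keys(keys, today=date(2024, 6, 1)))
# → ['raw/dt=2023-01-01/customers.parquet']
```

Schedule it as one more GitHub Actions cron step; S3 lifecycle rules can do the same thing declaratively if you prefer configuration over code.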
Disaster Recovery and Business Continuity
If someone runs DROP TABLE orders right now, how long until you recover? Hours? Days? The answer determines whether this is a portfolio project or a production system.
BigQuery time travel lets you query any table as it existed up to 7 days ago. Enable it. Test it. Add it to your runbook.
S3 versioning keeps every version of every object. DVC versions datasets alongside code. A rollback should take minutes.
Git for code. DVC for data. MLflow for models. Everything should be reversible. If it is not reversible, it should require a second approval.
Know your recovery time objective before the outage happens. Write it down. A plan you write during an incident is not a plan - it is improvisation.
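Time travel is plain SQL: FOR SYSTEM_TIME AS OF queries the table as it existed at a past timestamp (up to 7 days back by default). A small helper to build the statement - run it with the BigQuery client from the ETL example; the table ID is a placeholder:

```python
# time_travel.py - build a BigQuery time-travel query for disaster recovery.
def time_travel_sql(table_id: str, hours_ago: int) -> str:
    """Return SQL selecting the table as it existed `hours_ago` hours ago."""
    return (
        f'SELECT * FROM `{table_id}` '
        f'FOR SYSTEM_TIME AS OF '
        f'TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours_ago} HOUR)'
    )

sql = time_travel_sql('your-project-id.raw.customers', 24)
print(sql)
```

Recovering from an accidental DROP or bad load is then a CREATE TABLE ... AS over this SELECT - test it once before you need it, and put the exact commands in your runbook.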
The Human Layer
You can build the perfect model. If nobody uses it, it is worthless. Adoption is a design problem, not a communication problem.
People do not trust what they do not understand. Explain in business terms. Show examples. Be willing to say "I do not know" and come back with the answer.
If your tool requires 5 steps instead of 2, nobody uses it. Optimize for user experience first, technical elegance second.
Data often challenges existing beliefs. "The data suggests..." lands better than "You were wrong about...". Frame insights as opportunities, not indictments.
If you are the only person who understands how it works, you are the bottleneck. Write docs. Record walkthroughs. Make yourself replaceable.
Before You Call It Complete
Technical
- Pipeline handles failures gracefully and alerts on SLO breaches
- Data quality checks run automatically on every load
- Any number can be traced back to its source
- Recovery from a disaster scenario has been tested
- Code is documented, versioned, and reproducible
Business
- The ROI of the project is calculated and documented
- Cost of a false positive vs false negative is known
- A non-technical person can use the output without help
- Assumptions and limitations are explicitly documented
- You can explain the architecture in an interview without notes
Notable Free Resources
Official documentation and community resources for every tool in the stack.
Faker - Python synthetic data generation library
Cloudflare D1 - Serverless SQLite at the edge
BigQuery Sandbox - 1 TB free queries per month, no credit card
GitHub Actions Docs - 2,000 free minutes per month on private repos; unlimited for public repos
dbt Core - Transform data with version-controlled SQL
Tableau Public - Free BI with public portfolio publishing
Databricks Community - Free Spark environment with notebook UI
DVC - Data version control for ML pipelines
Microsoft Presidio - Open source PII detection and anonymization