Data Simplified for Beginners

The data project playbook

The order of operations is critical, and getting it wrong is one of the most common and costly mistakes. Here is the precise, step-by-step workflow used in industry - from problem definition to deployment - with common pitfalls flagged at every stage.

The most expensive mistake in data science

Data leakage - letting information from your test set influence training - produces models that look great on paper but fail completely in production. This guide is structured to help you understand exactly when and why it happens, and how to avoid it.

Part 0

Foundational Terminology

Before writing a single line of code, you need to understand what you're working with. These terms appear in every job description, paper, and codebase.

Data

Raw, unprocessed facts and figures. A list of sales transactions is data.

Dataset

A structured collection of data, typically in tabular format - rows and columns.

Feature / Variable / Attribute

A column in your dataset. A characteristic being measured, such as customer_age or product_price.

Observation / Record / Instance

A row in your dataset. A single data point - one customer, one transaction.

Target Variable

The column you are trying to predict. Also called the label, output, or dependent variable.

Model

A mathematical function learned from data that maps inputs (features) to outputs (predictions).
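At its simplest, a model really is just a function from features to a prediction. A toy sketch (the weights and column names here are made up for illustration; a real model learns its weights from training data):

```python
# A "model" is just a function: features in, prediction out.
# These weights are hard-coded for illustration; in practice a
# learning algorithm estimates them from the training data.

def churn_model(customer_age: float, monthly_spend: float) -> float:
    """Toy linear model returning a churn score in [0, 1]."""
    score = 0.8 - 0.005 * customer_age - 0.002 * monthly_spend
    return min(max(score, 0.0), 1.0)   # clamp to the valid range

print(round(churn_model(customer_age=30, monthly_spend=100), 2))  # → 0.45
```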

Types of Data

Numerical (Quantitative) - measured as numbers

Continuous

Can take any value within a range. Examples: Height, Temperature, Salary. Can be measured.

Discrete

Takes only specific, countable values. Examples: Number of children, Number of cars. Can be counted.

Categorical (Qualitative) - represents groups or categories

Nominal

Categories with no order. Examples: Color (Red, Blue), City (London, Paris).

Ordinal

Categories with a meaningful order. Examples: Education (High School < Bachelor's < Master's).

Binary

Only two categories. Examples: Yes/No, Churn/Not Churn, 0/1.

Time-Series

Data points indexed in time order. Examples: Stock price per minute, daily website traffic. Requires specialised handling to avoid lookahead bias.
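The "specialised handling" boils down to one rule: split chronologically, never randomly, so training only sees the past. A minimal sketch with an invented daily-traffic series:

```python
# Time-ordered data must be split chronologically: train on the past,
# test on the future. A random split would leak "future" rows into
# training (lookahead bias). Toy (date, visits) records for illustration.

records = [("2024-01-0%d" % d, 100 + d) for d in range(1, 10)]

records.sort(key=lambda r: r[0])       # ensure time order first
cut = int(len(records) * 0.8)          # oldest 80% becomes training data
train, test = records[:cut], records[cut:]

# Every training date precedes every test date - no future leaks in
assert max(d for d, _ in train) < min(d for d, _ in test)
```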

Phase 0

Problem Definition & Framing

Before any technical work, define the business goal. What problem are you solving? How will the model's output be used? This step determines your target variable, evaluation metric, and whether ML is even the right tool.

Industry rule

A technically perfect model that solves the wrong problem is worthless. Never skip this phase. Define success in business terms first - "reduce customer churn by 10%", not "achieve 90% accuracy".

Define your target variable

What exactly are you predicting? A specific column (e.g., will_churn), a numeric value (e.g., next_month_revenue), or a cluster label?

Define your success metric

Accuracy, Precision, Recall, F1, RMSE, AUC-ROC - choose the metric that reflects the actual cost of being wrong in your business context.
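Precision, recall, and F1 are simple enough to compute by hand from confusion counts, which makes the trade-off concrete. The counts below are hypothetical, not from any real model:

```python
# Hypothetical confusion counts for a churn model:
# 40 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)   # of predicted churners, how many really churned?
recall    = tp / (tp + fn)   # of actual churners, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of both

print(round(precision, 3), round(recall, 3), round(f1, 3))  # → 0.8 0.667 0.727
```

If a false negative (a churner you miss) costs more than a false positive (a retention offer wasted), you optimise for recall; the reverse favours precision.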

Phase 1

Data Acquisition & Ingestion

Getting data from its source - files, databases, APIs, or cloud storage.

# Python (Pandas)
import pandas as pd

df = pd.read_csv('data.csv')                          # CSV file
df = pd.read_excel('data.xlsx')                       # Excel file
df = pd.read_sql('SELECT * FROM orders', conn)        # SQL database
df = pd.read_json('data.json')                        # JSON file
df = pd.read_parquet('data.parquet')                  # Parquet (columnar)
# PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataProject").getOrCreate()

df = spark.read.csv('s3://bucket/data.csv', header=True, inferSchema=True)
df = spark.read.parquet('s3://bucket/data.parquet')   # Preferred format at scale
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://host/db") \
    .option("dbtable", "orders") \
    .option("user", "user").option("password", "pwd") \
    .load()                                           # JDBC database

Common pitfalls

  • Forgetting to set a random seed - leads to non-reproducible results across runs.
  • Not checking the character encoding of a text file - results in garbled strings.
  • Assuming all data is in one place. Real-world data is scattered across databases, lakes, and APIs.

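The first pitfall is cheap to avoid. Seeding makes every "random" operation repeatable, so a rerun reproduces the same sample or split:

```python
import random

# The same seed always yields the same sequence, so sampling and
# splitting become reproducible across runs.
random.seed(42)
first_run = [random.randint(0, 100) for _ in range(5)]

random.seed(42)
second_run = [random.randint(0, 100) for _ in range(5)]

assert first_run == second_run   # identical with the same seed
```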
Phase 2

Initial Exploration & Summary Statistics

Get a first look at the data's structure, basic properties, and obvious quality issues. A max age of 150 or a min salary of 0 should trigger immediate investigation.

# Python (Pandas)
df.info()             # Schema, dtypes, non-null counts
df.shape              # (rows, columns)
df.head(10)           # First 10 rows
df.sample(5)          # Random sample  - better for bias detection

# Numerical columns
df.describe()         # count, mean, std, min, 25%, 50%, 75%, max

# Categorical columns
df['column'].value_counts()          # Frequency distribution
df['column'].value_counts(normalize=True)  # As proportions

# Check for mixed types
df.dtypes                            # A numeric column read as 'object' = mixed types
# PySpark
df.printSchema()                            # Schema and types
print((df.count(), len(df.columns)))        # Shape equivalent
df.show(10)
df.sample(False, 0.01).show(5)             # ~1% random sample

df.describe().show()                        # Summary stats
df.groupBy('column').count().orderBy('count', ascending=False).show()

Common pitfalls

  • Running df.describe() and moving on without reading the values. Max age of 150 is a data error, not a quirk.
  • Missing mixed data types - a column that should be numeric but has "N/A" strings gets silently read as object dtype.
  • Confusing mean vs median: if they differ significantly, the data is skewed and the mean is misleading.

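The mean-vs-median pitfall is easy to demonstrate with the standard library. The salaries below are invented to show the effect of a single extreme value:

```python
from statistics import mean, median

# Right-skewed salaries: one huge value drags the mean upward
# while the median stays near the "typical" worker.
salaries = [30_000, 32_000, 35_000, 38_000, 40_000, 500_000]

print(mean(salaries))    # → 112500 - inflated by the single outlier
print(median(salaries))  # → 36500  - a far better "typical" value here
```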
Phase 3

Data Cleaning & Preprocessing

The most time-consuming phase - transforming raw data into a clean format suitable for analysis and modelling. Some cleaning must happen before EDA to avoid being misled.

Handling Missing Values

# Python (Pandas)  - Detection
df.isnull().sum()                          # Count nulls per column
df.isnull().mean() * 100                   # Null % per column

# Drop: only if nulls are few and missing at random
df.dropna(subset=['critical_column'])

# Impute numerical: prefer median (robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())

# Impute categorical: use the mode OR an explicit 'Unknown' - pick one
df['city'] = df['city'].fillna(df['city'].mode()[0])
# df['city'] = df['city'].fillna('Unknown')   # alternative
# PySpark
from pyspark.sql.functions import col, count, when, isnan

# Count nulls per column
df.select([count(when(col(c).isNull() | isnan(c), c)).alias(c) for c in df.columns]).show()

# Drop rows where specific column is null
df = df.dropna(subset=['critical_column'])

# Impute numerical with the approximate median (approxQuantile)
df = df.fillna({'age': df.approxQuantile('age', [0.5], 0.01)[0]})

# Fill categorical with 'Unknown'
df = df.fillna({'city': 'Unknown'})

Outlier Detection & Treatment

# IQR method (Python)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]

# Option 1: Remove (only if data entry error)
df = df[(df['salary'] >= lower) & (df['salary'] <= upper)]

# Option 2: Winsorize / cap at percentiles (preferred)
df['salary'] = df['salary'].clip(lower=lower, upper=upper)

Feature Scaling - Critical Timing Rule

When to scale

Fit the scaler ONLY on the training data, then transform both train and test sets. Never fit on the full dataset - that leaks test set statistics into training.

# Python  - CORRECT order (after splitting)
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()                         # Z-score: (x - mean) / std
X_train_scaled = scaler.fit_transform(X_train)    # Fit AND transform training
X_test_scaled  = scaler.transform(X_test)         # Transform ONLY  - no fit

# MinMaxScaler for bounded range [0, 1]
mm_scaler = MinMaxScaler()
X_train_mm = mm_scaler.fit_transform(X_train)
X_test_mm  = mm_scaler.transform(X_test)

# Save the fitted scaler  - you'll need it at inference time
import joblib
joblib.dump(scaler, 'scaler.pkl')

Standardisation (Z-score)

Output centred around 0, std of 1. Use when data follows Gaussian distribution or distribution is unknown. Works with SVM, Linear Regression, Neural Networks.

Normalisation (Min-Max)

Output in range [0, 1]. Use when data has known bounds or for neural networks with sigmoid/tanh activations. Sensitive to outliers.
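Both transforms are simple arithmetic, which a by-hand sketch on a small made-up list makes clear:

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Standardisation (z-score): centre at 0, scale by standard deviation
mu, sigma = mean(values), pstdev(values)
z_scores = [(v - mu) / sigma for v in values]

# Normalisation (min-max): squash into [0, 1]
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

print([round(z, 2) for z in z_scores])  # → [-1.41, -0.71, 0.0, 0.71, 1.41]
print(min_max)                          # → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Note how min-max pins the extremes to 0 and 1: a single outlier would compress every other value into a narrow band, which is exactly why it is outlier-sensitive.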

Common pitfalls

  • Data Leakage: imputing missing values using statistics from the full dataset before splitting. The test mean/median should be unknown at training time.
  • Treating all outliers as bad data without investigating root cause - outliers are often the most interesting signal (fraud, rare events).
  • Scaling before splitting - the scaler learns test set statistics and produces an overly optimistic evaluation.
  • Forgetting to convert categorical columns to numbers before passing to scikit-learn models.

Phase 4

Exploratory Data Analysis (EDA)

Understand the patterns, relationships, and story within the data. This is an iterative process - visualise everything, then go back and clean more if needed.

# Python  - Univariate Analysis
import matplotlib.pyplot as plt
import seaborn as sns

# Numerical: distribution shape, skewness, outliers
df['salary'].hist(bins=50)
sns.boxplot(x=df['salary'])

# Categorical: frequency balance
df['department'].value_counts().plot(kind='bar')

# Python  - Bivariate Analysis
# Numerical vs Numerical: correlation and scatter
sns.scatterplot(data=df, x='experience_years', y='salary')
corr_matrix = df.select_dtypes(include='number').corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

# Categorical vs Numerical
sns.boxplot(data=df, x='education_level', y='salary')

# Categorical vs Categorical
pd.crosstab(df['purchased'], df['marketing_channel'], normalize='index')
# PySpark  - Summary and distributions at scale
from pyspark.sql.functions import corr, col, skewness, kurtosis

# Distribution stats
df.select(
    skewness('salary').alias('salary_skew'),
    kurtosis('salary').alias('salary_kurt')
).show()

# Correlation between two numerical columns
df.select(corr('experience_years', 'salary')).show()

# Categorical frequency
df.groupBy('department').count().orderBy('count', ascending=False).show()

# Cross-tabulation
df.crosstab('purchased', 'marketing_channel').show()

Common pitfalls

  • Skipping EDA entirely and jumping to modelling - you will build a model on data you don't understand.
  • Only looking at numbers and skipping visualisation. Anscombe's Quartet is 4 datasets with identical summary stats that look completely different when plotted.
  • Confusing correlation with causation. Two variables moving together does not mean one causes the other.

Phase 5

Feature Engineering & Selection

Creating new features or transforming existing ones to help your model learn better. All engineering steps must be learned from training data only and then applied to the test set.

# Python  - Feature Creation
# Date parts
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['signup_day_of_week'] = df['signup_date'].dt.dayofweek
df['signup_month']       = df['signup_date'].dt.month

# Combining features
df['area'] = df['height'] * df['width']

# Binning continuous to categorical
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 45, 65, 100],
                         labels=['Young', 'Middle', 'Senior', 'Elder'])

# Log transform to reduce skewness
import numpy as np
df['log_salary'] = np.log1p(df['salary'])   # log1p handles zeros

# Encoding categorical variables
# One-Hot Encoding (for nominal categories)
df = pd.get_dummies(df, columns=['city'], drop_first=True)

# Ordinal Encoding (for ordered categories)
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df['priority_encoded'] = oe.fit_transform(df[['priority']])
# PySpark  - Feature Engineering
from pyspark.sql.functions import dayofweek, month, log1p, col
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Date parts
df = df.withColumn('signup_day_of_week', dayofweek('signup_date'))
df = df.withColumn('signup_month', month('signup_date'))

# Log transform
df = df.withColumn('log_salary', log1p(col('salary')))

# String to numeric index, then one-hot
indexer = StringIndexer(inputCol='city', outputCol='city_idx')
encoder = OneHotEncoder(inputCol='city_idx', outputCol='city_vec')

# Assemble all features into a single vector (required for Spark ML)
assembler = VectorAssembler(
    inputCols=['age', 'log_salary', 'city_vec', 'signup_day_of_week'],
    outputCol='features'
)

Common pitfalls

  • Data leakage through aggregation: creating a feature like average_customer_spend using the full dataset before splitting introduces future information.
  • The Curse of Dimensionality: creating too many features with no domain justification makes models slow and prone to overfitting.
  • Skipping domain knowledge: the best feature ideas come from understanding the business, not from automated feature generation.

Phase 6

Model Training, Validation & Testing

The golden rule of splitting

Split your data FIRST, before any step that learns from or uses the target variable. This is non-negotiable. All subsequent cleaning, scaling, and feature engineering that "learns" parameters must happen after this split, using only training data statistics.

# Python  - The correct split-first workflow
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler     # needed for the scaler below
from sklearn.metrics import classification_report, roc_auc_score

X = df.drop('target', axis=1)
y = df['target']

# Stratify ensures class balance is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# NOW apply your scaler (fit on train only)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# Cross-validation on training set (no test data involved)
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Train on full training set
model.fit(X_train_sc, y_train)

# Final evaluation  - done ONCE on the held-out test set
y_pred = model.predict(X_test_sc)
print(classification_report(y_test, y_pred))
print(f"Test AUC: {roc_auc_score(y_test, model.predict_proba(X_test_sc)[:, 1]):.3f}")
# PySpark ML Pipeline  - encapsulates transformers + model in correct order
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier as SparkRF
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Split first
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Define pipeline stages (fit happens only on train_df)
pipeline = Pipeline(stages=[indexer, encoder, assembler, SparkRF(
    featuresCol='features',
    labelCol='target',
    numTrees=100,
    seed=42
)])

# Train
model = pipeline.fit(train_df)

# Evaluate on test
predictions = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol='target', metricName='areaUnderROC')
print(f"Test AUC: {evaluator.evaluate(predictions):.3f}")

Common pitfalls

  • Splitting after any step that learns from the target variable - the most common and most costly mistake.
  • Using the test set multiple times to "improve" the model. Once you evaluate on the test set, you are done. Using it to make decisions leaks its information.
  • Not stratifying the split for imbalanced classification - the test set may end up with no examples of the minority class.
  • Reporting CV score on training data as if it were the final evaluation. CV estimates generalisation; the held-out test set measures it.

Phase 7

Model Interpretation & Deployment

Explaining the model's behaviour and putting it into production in a form that other systems can use.

# Save the model AND the preprocessing objects together
import joblib

joblib.dump(scaler, 'artifacts/scaler.pkl')
joblib.dump(model,  'artifacts/model.pkl')

# Serve via FastAPI
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

app = FastAPI()

scaler_loaded = joblib.load('artifacts/scaler.pkl')
model_loaded  = joblib.load('artifacts/model.pkl')

class PredictionRequest(BaseModel):
    features: list[float]

@app.post('/predict')
def predict(request: PredictionRequest):
    X = np.array(request.features).reshape(1, -1)
    X_scaled = scaler_loaded.transform(X)          # Use SAME fitted scaler
    proba = model_loaded.predict_proba(X_scaled)[0, 1]
    return {'churn_probability': round(float(proba), 4)}
# PySpark  - save and load
model.save('s3://bucket/models/churn_v1')

# Load for batch inference
from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load('s3://bucket/models/churn_v1')
batch_predictions = loaded_model.transform(new_data_df)

Common pitfalls

  • Handing over a Jupyter notebook to engineering and calling it "deployed". A notebook is not a service.
  • Forgetting to save and version the preprocessing objects (scaler, encoder) alongside the model - loading the model without the same scaler produces garbage predictions.
  • Not setting up monitoring. Concept drift - when real-world data distribution changes over time - will silently degrade your model's performance.

The correct order - at a glance

  1. Problem Definition - define target variable and success metric
  2. Data Acquisition - load from files, databases, or cloud storage
  3. Initial Exploration - schema, dtypes, summary stats, basic quality check
  4. SPLIT DATA into train / validation / test - this is the first gate
  5. Data Cleaning on train only - missing values, outliers, type conversions
  6. EDA on training data only - distributions, correlations, bivariate analysis
  7. Feature Engineering on train, apply transform to test - encoding, scaling, new features
  8. Model Training with cross-validation on training set
  9. Hyperparameter Tuning using validation set
  10. Final Evaluation on held-out test set - done ONCE
  11. Deployment - save model + preprocessing artifacts, build API, set up monitoring

Part 8

The Missing Pieces

The workflow above is standard. But it is incomplete. Real-world projects require additional layers that tutorials ignore.

Version Control (Not Just for Code)

Git for code. DVC for data. MLflow for models. If you cannot reproduce a result from three months ago, you do not have a process - you have chaos.

Git

Version control for code and notebooks. Commit after every meaningful change.

DVC

Data Version Control. Track large datasets and model artifacts alongside your Git history.

MLflow

Log parameters, metrics, and model artifacts for every experiment run.

Data Validation

Before you do anything, check that the data meets your expectations. Define a contract: "Column age must be between 0 and 120. Column country must exist in our reference list." If the data violates the contract, stop. Alert someone. Do not proceed with garbage.

# Using Great Expectations for data contracts
import great_expectations as ge

df_ge = ge.from_pandas(df)

# Define your data contract
df_ge.expect_column_values_to_be_between('age', min_value=0, max_value=120)
df_ge.expect_column_values_to_not_be_null('customer_id')
df_ge.expect_column_values_to_be_in_set('country', {'US', 'UK', 'CA', 'AU'})

results = df_ge.validate()
if not results['success']:
    raise ValueError("Data contract violated. Pipeline halted.")

Handling Imbalanced Data

If 1% of your cases are fraud, a model that predicts "not fraud" for everything is 99% accurate - and completely useless. Learn precision, recall, F1-score. Learn SMOTE. Learn class weights. Accuracy is a trap.

# Option 1: Class weights (built into most sklearn estimators)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42)

# Option 2: SMOTE oversampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Always evaluate with the right metrics for imbalanced problems
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))  # precision, recall, F1

The accuracy trap

On a 99:1 imbalanced dataset, a model predicting the majority class always achieves 99% accuracy. F1-score, AUC-ROC, and precision-recall curves are the metrics that actually matter.
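The trap takes five lines to demonstrate, here with a synthetic 99:1 label set:

```python
# 990 negatives, 10 positives; a "model" that always predicts the
# majority class scores 99% accuracy yet catches zero positive cases.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000                      # majority-class "model"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = tp / sum(y_true)                # fraction of positives caught

print(accuracy)  # → 0.99
print(recall)    # → 0.0 - useless despite "99% accuracy"
```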

The Bias-Variance Tradeoff

Diagnose before you tune.

Overfitting (High Variance)

High training accuracy + low test accuracy

Simplify the model. Add regularization. Get more data. Reduce features.

Underfitting (High Bias)

Low training accuracy + low test accuracy

Add complexity. Try a more powerful model. Add more features. Reduce regularization.
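The diagnosis itself is mechanical: compare training and test scores. A rough sketch - the thresholds below are illustrative rules of thumb, not universal constants:

```python
# Rough diagnostic: a model that cannot fit its own training data is
# underfitting; a large train-test gap signals overfitting.
# The 0.7 and 0.1 thresholds are illustrative, not universal.

def diagnose(train_score: float, test_score: float) -> str:
    if train_score < 0.7:                  # can't even fit the training set
        return "underfitting (high bias)"
    if train_score - test_score > 0.1:     # large generalisation gap
        return "overfitting (high variance)"
    return "reasonable fit"

print(diagnose(train_score=0.99, test_score=0.75))  # → overfitting (high variance)
print(diagnose(train_score=0.55, test_score=0.53))  # → underfitting (high bias)
```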

Model Monitoring

Deployment is not the end. It is the beginning. Monitor for data drift (input features changing distribution) and concept drift (the relationship between features and target changing). When performance degrades, retrain.

Data Drift

The statistical distribution of input features shifts over time. The model was trained on last year's users; today's users behave differently.

Concept Drift

The relationship between features and the target variable changes. Fraud patterns evolve as attackers adapt to your detection.

Performance Monitoring

Track precision, recall, and AUC on a rolling window of production predictions. Set alerting thresholds for degradation.

Retraining Triggers

Retrain on schedule (weekly/monthly) or on-demand when metrics breach thresholds. Never let a stale model silently fail.
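One common way to quantify data drift and trigger retraining is the Population Stability Index (PSI), which compares a feature's binned distribution at training time against production. A minimal sketch with invented bucket proportions:

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index between two binned distributions
    (each a list of bucket proportions summing to 1)."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

train_dist = [0.25, 0.25, 0.25, 0.25]   # feature's bucket shares at training
prod_dist  = [0.40, 0.30, 0.20, 0.10]   # same buckets observed in production

score = psi(train_dist, prod_dist)
print(round(score, 3))  # → 0.228
# A common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 retrain
```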

Part 9

The Who - Different Roles, Different Mindsets

"Data scientist" is not one job. It is many. And the differences matter. In a startup, you do all of them. In a large enterprise, you specialize. Know which you want - and which you are walking into.

Data Analyst - "What happened?" Dashboards. SQL. Business presentations. Tools: SQL, Tableau, storytelling.

Data Engineer - "How do we access it reliably?" Pipelines. Warehouses. Cloud infrastructure. Tools: Python, Spark, Airflow, dbt.

ML Engineer - "How do we put it in production?" APIs. Scaling. Monitoring. CI/CD. Tools: Docker, FastAPI, MLflow, cloud.

Data Scientist - "Why will it happen? What should we do?" EDA. Feature engineering. Modeling. A/B testing. Tools: Statistics, Python, ML, communication.

Research Scientist - "What is the next breakthrough?" Papers. New architectures. Deep theory. Tools: Deep math, PyTorch, research writing.
Part 10

The Hidden Layer - What Nobody Teaches

Technical skill is table stakes. The professionals who advance are the ones who master what no course covers.

The Stakeholder Translation

You speak p-values and confidence intervals. Your boss speaks revenue and retention. Learn to translate. Your communication skills determine your impact - and your career trajectory.

Avoid this

"The model has an AUC-ROC of 0.85."

Say this instead

"This model will help us identify 20% more high-value customers, adding roughly $50k in monthly revenue."

The Ethics Layer

Data reflects the world. The world has biases. If you train on historical hiring data, and your company historically hired mostly men for senior roles, your model will "learn" that maleness predicts seniority. You are the gatekeeper. Check for bias. Test for fairness. Ask: does this model harm anyone?

The Technical Debt Layer

Notebooks are for exploration. Production requires functions, docstrings, tests, and version control. Write code that your future self - and your colleagues - can understand.

The Data Maturity Model

Not every company is ready for machine learning. If you join a Stage 1 company expecting Stage 4 work, you will be frustrated and fail. Match your expectations to reality.

Stage 1 Descriptive

"What happened?"

They need reliable dashboards and consistent data definitions.

Stage 2 Diagnostic

"Why did it happen?"

They need root cause analysis and drill-down capability.

Stage 3 Predictive

"What will happen?"

They need statistical models and ML pipelines.

Stage 4 Prescriptive

"What should we do?"

They need automated decision systems and closed-loop optimization.

Part 11

The Mindset - See, Think, Observe, Speak

The framework that ties everything together. Most beginners stay at "See." The professionals move through all four levels.

See The Technician

Load the data. Check the types. Run df.describe(). This is table stakes. Anyone can do it.

Do not stop here.

Think The Analyst

Explore possibilities. Generate hypotheses. Connect data to business context. Why are there so many nulls? Why do sales spike in December? You are not just viewing data - you are interrogating it.

Observe The Detective

Validate your thinking with evidence. Find actual patterns. "I thought nulls were random. But after grouping by region, I observe they only occur in one country. That is not random. That is a clue." Move from guessing to knowing.

Speak The Storyteller

Translate observations into action. "If we get 10% more users to finish the tutorial, we will retain an extra 500 customers this quarter - $2M in revenue. I recommend we send a reminder email on Day 2." This is where you stop being the "person who shows charts" and become the "person who drives decisions."

The Question That Changes Everything

Most people ask

"What code should I write?"

The professional asks

"What story is this data trying to tell me?"

Stop watching data. Start thinking with it. Start observing what it hides. Start speaking what it reveals. That is the difference between someone who works with data - and someone who transforms how their organization uses it.

Pick a domain that genuinely interests you. Domain knowledge eventually outweighs coding ability.

© 2026 Pavan Yellathakota  ·  pye.pages.dev/resources/data-beginners