
📌 Project Overview

This project implements a complete machine learning pipeline to detect fraudulent transactions using the IEEE-CIS dataset. It includes:

  • Deep EDA in R and Python
  • Robust preprocessing and feature engineering
  • Ensemble modeling (XGBoost, LightGBM, CatBoost)
  • A FastAPI deployment for real-time predictions
  • Unit testing with pytest for pipeline robustness

🚀 Check the full repository: GitHub


🔍 Exploratory Data Analysis

Full notebook: RPubs EDA report

To lay the foundation for modeling, we explored class balance, missingness, temporal patterns and categorical signals—selecting only the most actionable insights below.


1. Class Imbalance

Transaction amount distribution (log scale)

  • 96.5 % non-fraud vs. 3.5 % fraud
  • Baseline models were trained on raw data to measure “lift” from later oversampling/threshold tuning

2. Missing-Value Patterns

Missing data heatmap (sample of top 50 vars)

  • Many id_ & Vxxx features > 85 % missing
  • Suggests dropping or engineering “missingness” flags before modeling

3. Categorical Fraud Rates

Fraud rate by ProductCD

  • Product C has a ~12 % fraud rate vs. ~3.5 % overall
  • Discover cards and mobile devices also show elevated risk

4. Temporal Signal

Fraud ratio by hour of day

  • Fraud peaks around 6 – 9 AM (UTC)
  • No strong day-of-week effect, but morning flag can improve feature set

Key Takeaways

  • Severe imbalance (3.5 % fraud) → plan SMOTE/threshold tuning
  • High missingness in many features → use missingness indicators or drop
  • Strong categorical drivers (ProductCD, card networks) → prioritize in feature engineering
  • Temporal window matters → add IsMorning/IsNight flags (see the sketch below)
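To make these takeaways concrete, here is a minimal pandas sketch of the kinds of flags they suggest. It assumes the merged train DataFrame with the standard TransactionDT (seconds offset) and isFraud columns; the exact cut-offs (85 % missingness, the 6 – 9 AM window) are illustrative rather than the notebook's exact values.

# Assumes `train` is the merged IEEE-CIS transaction/identity DataFrame

# Missingness indicators for heavily missing features (> 85 % NaN)
missing_share = train.isna().mean()
high_missing = missing_share[missing_share > 0.85].index
for col in high_missing:
    train[f'{col}_missing'] = train[col].isna().astype('int8')

# Hour-of-day and simple time-window flags from the TransactionDT offset
train['TransactionHour'] = (train['TransactionDT'] // 3600) % 24
train['IsMorning'] = train['TransactionHour'].between(6, 9).astype('int8')
train['IsNight'] = (~train['TransactionHour'].between(6, 21)).astype('int8')

# Sanity check: fraud ratio by hour should reproduce the morning peak
print(train.groupby('TransactionHour')['isFraud'].mean().sort_values(ascending=False).head())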

🔧 Data Preprocessing Pipeline

Full notebook: 02_Preprocessing_Modeling_in_Python.ipynb

To prepare the IEEE-CIS dataset for modeling, we built a simple, reproducible pipeline that applies:

  • 📥 Missing-value imputation
    • Numerical → median
    • Categorical → most frequent
  • 🔤 Categorical encoding
    • LabelEncoder per column, saved to models/label_encoders.pkl
  • 📏 Feature scaling
    • StandardScaler on all numeric features, saved to models/scaler.pkl
  • 💾 Data export
    • Cleaned DataFrame written to data/processed/train_clean.csv

Pseudocode overview

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
import joblib

# 1) Impute: median for numeric, most frequent for categorical
df[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_cols])

# 2) Encode: one LabelEncoder per categorical column, kept for reuse at inference
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col].astype(str))

# 3) Scale numeric features
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# 4) Persist transformers and cleaned data
joblib.dump(scaler, 'models/scaler.pkl')
joblib.dump(encoders, 'models/label_encoders.pkl')
df.to_csv('data/processed/train_clean.csv', index=False)

Why this matters

  • Reproducibility: The same logic runs in both the notebooks and the FastAPI service (see the reload sketch below).
  • Modularity: Transformers are versioned via joblib, so you can swap in SMOTE, PCA or any new step later without touching your API.
  • Performance: Preprocessing the full ~590 K × 434 dataset takes only seconds, keeping the inference path lightweight.
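Because those artifacts are persisted, the FastAPI service can reload them instead of re-implementing the logic. A hypothetical inference-side helper (the function name and the handling of unseen categories are illustrative, not the repository's exact code):

import joblib
import pandas as pd

# Reload the transformers fitted and saved during preprocessing
scaler = joblib.load('models/scaler.pkl')
encoders = joblib.load('models/label_encoders.pkl')  # dict: column -> fitted LabelEncoder

def preprocess_for_inference(raw: pd.DataFrame, num_cols, cat_cols) -> pd.DataFrame:
    # Apply the persisted encoders and scaler to incoming data
    df = raw.copy()
    for col in cat_cols:
        # LabelEncoder raises on unseen categories; a real service would need
        # a fallback (e.g. mapping unknowns to a sentinel value)
        df[col] = encoders[col].transform(df[col].astype(str))
    df[num_cols] = scaler.transform(df[num_cols])
    return df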

Next up, we’ll load this clean data, split it with stratification, and benchmark our first models (Logistic Regression, Random Forest, XGBoost, etc.).


🏁 Baseline Models

To set a performance floor, we trained two vanilla classifiers on the fully preprocessed data (no sampling, no feature selection). All scores are stratified 5-fold CV averages:

| Model | Accuracy | Precision | Recall | F1-Score | ROC AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.972 | 0.825 | 0.264 | 0.400 | 0.856 |
| Random Forest | 0.980 | 0.942 | 0.449 | 0.608 | 0.933 |
ROC AUC curve plots: Logistic Regression and Random Forest.
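For reference, a minimal sketch of that evaluation protocol, assuming X and y hold the features and labels from data/processed/train_clean.csv (the estimator settings shown are plain defaults, not necessarily the notebook's exact configuration):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Stratified folds keep the 3.5 % fraud ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

baselines = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
}

for name, model in baselines.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    print(name, {m: round(scores[f'test_{m}'].mean(), 3) for m in scoring})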

Key takeaways:

  • Random Forest improves ROC AUC by ~0.08 over Logistic Regression.
  • Logistic Regression still offers very fast training & simple interpretation.

For a full side‐by‐side of all models (including tuned ensembles and stacking), check out:
Model Comparison Heatmap


⚙️ Phase 2: Advanced Models & Ensembles

After establishing our baseline, we pushed three state-of-the-art gradient boosters through hyperparameter tuning, and then combined them via two ensemble strategies. All scores are stratified 5-fold CV averages on the preprocessed data.


1 XGBoost

We initialized XGBoost with common defaults and later tuned it via GridSearchCV:

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

Key metrics

  • 🚀 ROC AUC: 0.9234
  • 🎯 Precision (fraud): 0.957
  • 🔍 Recall (fraud): 0.589
  • 🎛️ F1-Score (fraud): 0.729

Observations:

  • Strong out-of-the-box performance, beating Logistic Regression by +0.07 AUC.
  • High precision means few false alarms; recall (0.59) leaves room for further improvement via oversampling.

2 LightGBM

LightGBM is a fast, memory-efficient gradient boosting framework:

from lightgbm import LGBMClassifier

lgb = LGBMClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

Key metrics

  • 🚀 ROC AUC: 0.9157
  • 🎯 Precision (fraud): 0.880
  • 🔍 Recall (fraud): 0.410
  • 🎛️ F1-Score (fraud): 0.560

Observations:

  • Slightly lower AUC than XGBoost but still highly discriminative.
  • Very fast training on large data—ideal for rapid iteration.

3 CatBoost

CatBoost handles categorical features natively and offers built-in class-weighting options for imbalanced data:

from catboost import CatBoostClassifier

cb = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    random_state=42,
    verbose=False
)

Key metrics

  • 🚀 ROC AUC: 0.9174
  • 🎯 Precision (fraud): 0.946
  • 🔍 Recall (fraud): 0.497
  • 🎛️ F1-Score (fraud): 0.651

Observations:

  • Competitive AUC close to LightGBM with minimal encoding effort.
  • Good balance of precision/recall out of the box.

In the next phase, we’ll perform hyperparameter tuning to maximize our classification performance—focusing especially on the F1-score, which is crucial when the positive (fraud) class is both rare and of primary interest.

⚙️ Phase 3: Hyperparameter Tuning

In this phase, we applied grid and randomized searches to each of our core models—Logistic Regression, Random Forest, XGBoost, LightGBM and CatBoost—using a small subsample for speed, then retrained the best estimators on the full data to measure real-world performance.
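As an illustration of that setup, here is a minimal sketch of the randomized search for XGBoost; the parameter ranges and n_iter are trimmed and illustrative, and X_sample/y_sample stand for the stratified subsample used for speed:

from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Illustrative search space; the notebook's grids are larger
param_dist = {
    'n_estimators': [200, 500],
    'max_depth': [6, 8, 10],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'min_child_weight': [1, 5],
    'gamma': [0, 1],
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric='logloss', random_state=42),
    param_distributions=param_dist,
    n_iter=30,
    scoring='f1',   # the rare fraud class makes F1 the natural target
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    n_jobs=-1,
    random_state=42,
)

# Search on the small subsample, then refit the best parameters on the full data
search.fit(X_sample, y_sample)
best_xgb_model = XGBClassifier(**search.best_params_, eval_metric='logloss', random_state=42)
best_xgb_model.fit(X_train, y_train)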

| Model | Best Parameters | ROC AUC | Precision (fraud) | Recall (fraud) | F1-Score (fraud) |
|---|---|---|---|---|---|
| Logistic Regression | C=0.01, penalty='l2', solver='liblinear' | 0.818 | 0.825 | 0.264 | 0.400 |
| Random Forest | n_estimators=300, max_depth=20, min_samples_split=2, min_samples_leaf=2, max_features='sqrt' | 0.911 | 0.940 | 0.370 | 0.530 |
| XGBoost | subsample=0.8, n_estimators=500, max_depth=10, learning_rate=0.05, colsample_bytree=0.6, gamma=0, min_child_weight=1 | 0.969 | 0.957 | 0.589 | 0.729 |
| LightGBM | subsample=0.6, n_estimators=500, max_depth=10, learning_rate=0.1, colsample_bytree=1.0, num_leaves=63, min_child_samples=10 | 0.964 | 0.930 | 0.620 | 0.740 |
| CatBoost | iterations=500, depth=8, learning_rate=0.1, border_count=128, l2_leaf_reg=1, bagging_temperature=0 | 0.960 | 0.950 | 0.500 | 0.651 |

🔍 Key Takeaways

  • Defaults were strong: Logistic Regression actually saw a drop in AUC after tuning, indicating its default C=1 was near-optimal for our data.
  • Tree-based gains: XGBoost tuning delivered the best AUC (0.969), closely followed by LightGBM and CatBoost.
  • Precision vs. recall: All tuned boosters maintain very high precision (≥ 0.93), but recall ranges from 0.50 (CatBoost) to 0.62 (LightGBM), suggesting further imbalance strategies (e.g. the threshold tuning sketched below) could improve fraud capture.
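One low-cost way to trade a little precision for more recall is the threshold tuning flagged back in the EDA takeaways. A minimal sketch using out-of-fold probabilities from the tuned XGBoost model (variable names follow the earlier snippets):

import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Out-of-fold fraud probabilities so the threshold is not tuned on seen data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
proba = cross_val_predict(best_xgb_model, X_train, y_train, cv=cv,
                          method='predict_proba', n_jobs=-1)[:, 1]

# Sweep candidate thresholds and keep the one that maximizes F1 on the fraud class
precision, recall, thresholds = precision_recall_curve(y_train, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]
print(f'Best threshold: {best_threshold:.3f}, F1: {f1[:-1].max():.3f}')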

🔗 Phase 4: Ensemble Models

In the final stage, we combined our top-tuned boosters into two ensemble strategies to maximize robustness and performance.


1 Soft Voting Ensemble

We created a VotingClassifier that averages the predicted probabilities of our three best models:

from sklearn.ensemble import VotingClassifier

ensemble_voting = VotingClassifier(
    estimators=[
        ("xgb", best_xgb_model),
        ("lgb", best_lgb_model),
        ("cat", best_cat_model)
    ],
    voting="soft",      # average probabilities
    weights=[1, 1, 1],  # equal weighting
    n_jobs=-1
)

ensemble_voting.fit(X_train, y_train)

Voting Ensemble Metrics

  • 🚀 ROC AUC: 0.9648
  • 🎯 Precision (fraud): 0.955
  • 🔍 Recall (fraud): 0.571
  • 🎛️ F1-Score (fraud): 0.715

2 Stacking Ensemble

Next, we used a StackingClassifier to learn optimal combinations of base-model outputs, with a LogisticRegression meta‐learner:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stacking = StackingClassifier(
    estimators=[
        ("xgb", best_xgb_model),
        ("lgb", best_lgb_model),
        ("cat", best_cat_model)
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=False,
    cv=5,
    n_jobs=-1
)

stacking.fit(X_train, y_train)

Stacking Ensemble Metrics

  • 🚀 ROC AUC: 0.9691
  • 🎯 Precision (fraud): 0.924
  • 🔍 Recall (fraud): 0.679
  • 🎛️ F1-Score (fraud): 0.783

With the best fraud F1-score (0.783) and recall (0.679) at a comparable ROC AUC, the stacking ensemble emerges as the overall winner, as the model comparison heatmap also shows.


🧪 Testing

We implemented tests to ensure pipeline integrity (an illustrative example follows this list):

  • ✅ Preprocessing pipeline does not crash with valid data
  • ✅ API responds with expected output structure
  • ✅ Unit tests managed with pytest
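A trimmed, illustrative example of what such tests might look like (the preprocessing import path, sample payload, and response keys are hypothetical, not the repository's exact schema):

# tests/test_pipeline.py (illustrative)
import pandas as pd
from fastapi.testclient import TestClient

from src.main import app  # the FastAPI app served by uvicorn

client = TestClient(app)

def test_preprocessing_does_not_crash_on_valid_data():
    from src.preprocessing import preprocess  # hypothetical entry point
    sample = pd.DataFrame({'TransactionAmt': [49.99], 'ProductCD': ['W']})
    cleaned = preprocess(sample)
    assert len(cleaned) == 1

def test_predict_endpoint_returns_expected_structure():
    payload = {'TransactionAmt': 49.99, 'ProductCD': 'W'}
    response = client.post('/predict', json=payload)
    assert response.status_code == 200
    body = response.json()
    assert 'prediction' in body and 'fraud_probability' in body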

🖥️ Deployment

The final model was locally deployed using FastAPI. Key features:

  • /predict endpoint returns prediction and fraud probability
  • Interactive Swagger UI available at /docs
  • Can be run locally via Uvicorn or deployed in a Docker container

Run locally:

uvicorn src.main:app --reload

Or with Docker

docker pull alexmatiasastorga/fraud-api:latest
docker run -d -p 8000:8000 alexmatiasastorga/fraud-api
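For orientation, here is a minimal sketch of how the /predict endpoint might be wired up; the request fields, artifact path, and response keys are illustrative, and the real service re-applies the persisted preprocessing before calling the model:

# src/main.py (illustrative sketch)
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title='Fraud Detection API')
model = joblib.load('models/model.pkl')  # hypothetical path to the final ensemble

class Transaction(BaseModel):
    TransactionAmt: float
    ProductCD: str
    # ...remaining transaction fields

@app.post('/predict')
def predict(tx: Transaction):
    df = pd.DataFrame([tx.dict()])
    # The persisted scaler/encoders are applied here so the model sees the
    # same features as during training (omitted for brevity)
    proba = float(model.predict_proba(df)[0, 1])
    return {'prediction': int(proba >= 0.5), 'fraud_probability': proba}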

📌 Conclusion

This project demonstrates a real-world machine learning workflow from raw data to deployment. Future improvements may include:

  • DAG automation with Apache Airflow
  • Cloud deployment (Render or AWS)
  • Monitoring with MLflow or Prometheus
