Project 01

Network Anomaly Detection

Production-grade ML pipeline for network intrusion detection — trained on 2.83M labelled flows, containerised FastAPI inference service deployed on AWS ECS Fargate, with automated CI/CD and runtime drift monitoring.

[Figure: Network Anomaly Detection pipeline diagram]

Overview

Built a full production ML pipeline for network intrusion detection on the CICIDS2017 dataset — 2.83M labelled flows, 15 attack classes, 48 features after preprocessing.

The core model is an XGBoost classifier (macro F1 = 0.9036) trained with sample weights to handle class imbalance without distorting the data distribution. An Isolation Forest, trained exclusively on benign traffic, acts as a novelty gate — flagging out-of-distribution flows before XGBoost classifies them.

The pipeline runs end-to-end in production: a FastAPI inference API with four endpoints is containerised with Docker, pushed to AWS ECR, and deployed on ECS Fargate. GitHub Actions handles CI/CD — a push to main triggers a full rebuild and redeploy. Model artifacts are stored in S3 and pulled at startup via boto3. Runtime drift is monitored per feature using Kolmogorov–Smirnov tests, with results logged to MLflow.

2.83M labelled flows · 0.9036 XGBoost macro F1 · 15 attack classes

Pipeline

01 Preprocessing
Ingests 8 CICIDS2017 CSVs, drops zero-variance, duplicate, and high-correlation features, fits a StandardScaler, and exports feature_names.json and scaler.pkl to artifacts/. 79 raw features reduced to 48.
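A condensed sketch of this stage; the data directory and label column name are illustrative rather than the exact preprocess.py internals, and the correlation drop is shown under Design Decisions below:

```python
import json
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the eight CICIDS2017 CSVs (directory and label column are illustrative).
frames = [pd.read_csv(p) for p in sorted(Path("data").glob("*.csv"))]
df = pd.concat(frames, ignore_index=True)

X = df.drop(columns=["Label"])
X = X.loc[:, X.nunique() > 1]    # drop zero-variance features
X = X.loc[:, ~X.T.duplicated()]  # drop exact duplicate columns

# Fit the scaler on the surviving features and export the artifacts.
scaler = StandardScaler().fit(X)
with open("artifacts/feature_names.json", "w") as f:
    json.dump(list(X.columns), f)
joblib.dump(scaler, "artifacts/scaler.pkl")
```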
02 Training
XGBoost classifier trained with sample-weight class balancing (chosen over SMOTE to preserve the true data distribution). Isolation Forest trained exclusively on BENIGN traffic — its role is novelty detection, not classification.
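A minimal sketch of the sample-weight approach, assuming X_train and y_train come from the preprocessing stage (y_train label-encoded) and using illustrative hyperparameters:

```python
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Inverse-frequency weights: rare attack classes count more per sample,
# but no synthetic rows are added to the training data.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

clf = XGBClassifier(objective="multi:softprob")  # illustrative config
clf.fit(X_train, y_train, sample_weight=weights)
```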
03 Ensemble
Isolation Forest gates incoming flows — flagged (anomalous) flows are passed to XGBoost for attack classification; unflagged flows bypass XGBoost entirely and are returned as benign. All runs logged to MLflow for experiment tracking and comparison.
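A sketch of the gating logic; the function name is hypothetical and X is assumed to be a NumPy array in the training feature order:

```python
import numpy as np

def gate_and_classify(X, iso_forest, xgb_clf, benign_label="BENIGN"):
    """Illustrative ensemble step: forward only IF-flagged flows to XGBoost."""
    preds = np.full(len(X), benign_label, dtype=object)
    flagged = iso_forest.predict(X) == -1  # sklearn IF: -1 = anomaly, 1 = inlier
    if flagged.any():
        preds[flagged] = xgb_clf.predict(X[flagged])
    return preds
```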
04 Inference API
FastAPI application exposing four endpoints: /predict/classify, /predict/anomaly, /predict/ensemble, /predict/batch. Model artifacts pulled from AWS S3 at startup via boto3. Interactive docs at /docs.
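A minimal sketch of the startup pattern, with placeholder bucket and key names and a deliberately simplified endpoint (the real app validates requests with the Pydantic models in schemas.py):

```python
import io
from contextlib import asynccontextmanager

import boto3
import joblib
from fastapi import FastAPI

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Pull the serialized model from S3 once, at container startup.
    s3 = boto3.client("s3")
    buf = io.BytesIO()
    s3.download_fileobj("my-model-bucket", "models/xgb.pkl", buf)  # placeholder names
    buf.seek(0)
    models["xgb"] = joblib.load(buf)
    yield
    models.clear()

app = FastAPI(lifespan=lifespan)

@app.post("/predict/classify")
def classify(flow: dict):
    # Simplified: assumes the payload's values are already in feature order.
    return {"label": str(models["xgb"].predict([list(flow.values())])[0])}
```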
05 CI/CD & Drift Monitoring
GitHub Actions pipeline: push to main → Docker build → push to AWS ECR → ECS Fargate redeploy. drift.py computes per-feature Kolmogorov–Smirnov statistics against the training distribution at runtime. KS > 0.5 triggers a high-drift alert; results logged to MLflow.
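A sketch of the drift check; the function name is hypothetical, and the reference and live arrays are assumed to share the training feature order:

```python
import mlflow
import numpy as np
from scipy.stats import ks_2samp

KS_ALERT = 0.5  # KS statistic above this triggers a high-drift alert

def check_drift(train_ref: np.ndarray, live: np.ndarray, feature_names):
    with mlflow.start_run(run_name="drift_check"):
        for i, name in enumerate(feature_names):
            # Two-sample KS test: live distribution vs training distribution.
            stat, _ = ks_2samp(train_ref[:, i], live[:, i])
            mlflow.log_metric(f"ks_{name}", stat)
            if stat > KS_ALERT:
                print(f"HIGH DRIFT: {name} (KS={stat:.3f})")
```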

Model Performance

| Model | Approach | Macro F1 |
|---|---|---|
| Random Forest v1 | SMOTE oversampling | 0.8785 |
| Random Forest v2 | class_weight='balanced' | 0.8696 |
| Keras DNN | Categorical cross-entropy | 0.3371 |
| XGBoost | Sample weights (production) | 0.9036 |
| Isolation Forest | BENIGN-only novelty detection | n/a |
| IF → XGBoost Ensemble | Novelty gate + classifier | 0.87* |

* Ensemble F1 is lower than standalone XGBoost by design — the IF gate forwards out-of-distribution flows that XGBoost was never trained on. The ensemble's value is breadth of detection, not benchmark maximisation.

Design Decisions

Sample weights over SMOTE

SMOTE introduces synthetic samples that can distort minority-class decision boundaries. Sample weights rebalance training without touching the data distribution — preferable for production where inference runs on raw flows.

No log1p transformation

Tree-based models are invariant to monotonic feature transformations. Log1p was evaluated and deliberately dropped; the notebooks document this with supporting evidence.
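A toy sanity check of the invariance claim (illustrative, not from the notebooks): a decision tree fitted on raw features and one fitted on log1p-transformed features induce the same partition of the training rows, so their training-set predictions match exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 5))   # skewed, non-negative features
y = (X[:, 0] > 1).astype(int)

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
logd = DecisionTreeClassifier(random_state=0).fit(np.log1p(X), y)

# log1p is strictly monotonic, so split orderings are preserved and
# both trees partition the same training rows identically.
assert (raw.predict(X) == logd.predict(np.log1p(X))).all()
```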

Isolation Forest as a novelty gate, not a classifier

IF is trained on BENIGN-only traffic to flag out-of-distribution flows. Hyperparameter tuning was deliberately skipped — overfitting the contamination parameter to labelled benchmarks would undermine its real purpose.
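A minimal sketch of the gate's training, assuming the preprocessed training matrix and string labels; hyperparameters stay at their defaults, per the rationale above:

```python
from sklearn.ensemble import IsolationForest

# Fit on benign rows only (labels assumed to be CICIDS2017 strings).
X_benign = X_train[y_train == "BENIGN"]

iso = IsolationForest(random_state=42)  # defaults kept, incl. contamination="auto"
iso.fit(X_benign)
# At inference, iso.predict(X) == -1 marks a flow as out-of-distribution.
```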

Deterministic preprocessing

Correlation-based feature dropping uses np.triu(k=1) to guarantee a stable, reproducible column set regardless of execution order.
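A sketch of that pattern, assuming X is the feature DataFrame from preprocessing and using an illustrative 0.95 threshold:

```python
import numpy as np

corr = X.corr().abs()
# Keep the upper triangle only (k=1 excludes the diagonal), so each
# correlated pair is inspected exactly once and the dropped column is
# always the later one in the fixed column order: deterministic across runs.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X = X.drop(columns=to_drop)
```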

Project Structure


```
network-anomaly-detector/
├── notebooks/
│   ├── 01_eda.ipynb               # Exploratory analysis, class distribution
│   ├── 02_preprocessing.ipynb     # Feature engineering decisions
│   ├── 03_model_training.ipynb    # Model comparison (RF, XGBoost, DNN)
│   └── 04_anomaly_detection.ipynb # Isolation Forest, ensemble construction
├── src/
│   ├── preprocess.py              # Feature selection, scaling, artifact export
│   ├── train_xgb.py               # XGBoost training
│   ├── train_anomaly.py           # Isolation Forest training
│   ├── evaluate_xgb.py            # Classification evaluation
│   ├── evaluate_anomaly.py        # Anomaly evaluation
│   └── ensemble.py                # IF → XGBoost pipeline · MLflow logging
├── api/
│   ├── main.py                    # FastAPI app, lifespan, S3 loading
│   ├── predict.py                 # Endpoint logic
│   └── schemas.py                 # Pydantic request/response models
├── monitor/
│   └── drift.py                   # KS-test drift detection · MLflow
├── artifacts/                     # git-ignored
│   ├── feature_names.json
│   └── scaler.pkl
├── models/                        # git-ignored · stored in AWS S3
│   ├── xgb.pkl
│   └── iso_forest.pkl
├── dashboard.html                 # Portfolio demo (DEMO/LIVE mode)
└── .github/workflows/deploy.yml   # CI/CD → ECR → ECS Fargate
```

What I Learned

Getting XGBoost to 0.90 macro F1 was the straightforward part — the harder work was designing the system around it. Keeping preprocessing deterministic, giving the Isolation Forest a genuinely different purpose from the classifier, and arguing why a lower ensemble F1 is actually the correct outcome all required thinking beyond benchmark numbers.

Deploying to ECS Fargate and wiring up CI/CD made clear how much production ML differs from notebook ML. Artifact management, environment variables, container versioning, and drift monitoring never appear in a Kaggle competition — but they dominate real deployments.