Project 01

Network Anomaly Detection

Production-grade ML pipeline for network intrusion detection — trained on 2.83M labelled flows, containerised FastAPI inference service deployed on AWS ECS Fargate, with automated CI/CD and runtime drift monitoring.

[Figure: Network Anomaly Detection pipeline diagram]

Overview

Built a full production ML pipeline for network intrusion detection on the CICIDS2017 dataset — 2.83M labelled flows, 15 attack classes, 48 features after preprocessing.

The core model is an XGBoost classifier (macro F1 = 0.9036) trained with sample weights to handle class imbalance without distorting the data distribution. An Isolation Forest, trained exclusively on benign traffic, acts as a novelty gate — flagging out-of-distribution flows before XGBoost classifies them.

The pipeline runs end-to-end in production: a FastAPI inference API with four endpoints is containerised with Docker, pushed to AWS ECR, and deployed on ECS Fargate. GitHub Actions handles CI/CD — a push to main triggers a full rebuild and redeploy. Model artifacts are stored in S3 and pulled at startup via boto3. Runtime drift is monitored per feature using Kolmogorov–Smirnov tests, with results logged to MLflow.

2.83M labelled flows · 0.9036 XGBoost macro F1 · 15 attack classes

Pipeline

01 Preprocessing
Ingests 8 CICIDS2017 CSVs, drops zero-variance, duplicate, and high-correlation features, fits a StandardScaler, and exports feature_names.json and scaler.pkl to artifacts/. 79 raw features reduced to 48.
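A condensed sketch of this stage; the data directory and label column name are illustrative rather than the exact preprocess.py internals, and the correlation drop is shown under Design Decisions below:

```python
import json
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the eight CICIDS2017 CSVs (directory and label column are illustrative).
frames = [pd.read_csv(p) for p in sorted(Path("data").glob("*.csv"))]
df = pd.concat(frames, ignore_index=True)

X = df.drop(columns=["Label"])
X = X.loc[:, X.nunique() > 1]    # drop zero-variance features
X = X.loc[:, ~X.T.duplicated()]  # drop exact duplicate columns

# Fit the scaler on the surviving features and export the artifacts.
scaler = StandardScaler().fit(X)
with open("artifacts/feature_names.json", "w") as f:
    json.dump(list(X.columns), f)
joblib.dump(scaler, "artifacts/scaler.pkl")
```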
02 Training
XGBoost classifier trained with sample-weight class balancing (chosen over SMOTE to preserve the true data distribution). Isolation Forest trained exclusively on BENIGN traffic — its role is novelty detection, not classification.
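A minimal sketch of the sample-weight approach, assuming X_train and y_train come from the preprocessing stage (y_train label-encoded) and using illustrative hyperparameters:

```python
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Inverse-frequency weights: rare attack classes count more per sample,
# but no synthetic rows are added to the training data.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

clf = XGBClassifier(objective="multi:softprob")  # illustrative config
clf.fit(X_train, y_train, sample_weight=weights)
```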
03 Ensemble
Isolation Forest gates incoming flows — flagged (anomalous) flows are passed to XGBoost for attack classification; unflagged flows bypass XGBoost entirely and are returned as benign. All runs logged to MLflow for experiment tracking and comparison.
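A sketch of the gating logic; the function name is hypothetical and X is assumed to be a NumPy array in the training feature order:

```python
import numpy as np

def gate_and_classify(X, iso_forest, xgb_clf, benign_label="BENIGN"):
    """Illustrative ensemble step: forward only IF-flagged flows to XGBoost."""
    preds = np.full(len(X), benign_label, dtype=object)
    flagged = iso_forest.predict(X) == -1  # sklearn IF: -1 = anomaly, 1 = inlier
    if flagged.any():
        preds[flagged] = xgb_clf.predict(X[flagged])
    return preds
```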
04 Inference API
FastAPI application exposing four endpoints: /predict/classify, /predict/anomaly, /predict/ensemble, /predict/batch. Model artifacts pulled from AWS S3 at startup via boto3. Interactive docs at /docs.
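A minimal sketch of the startup pattern, with placeholder bucket and key names and a deliberately simplified endpoint (the real app validates requests with the Pydantic models in schemas.py):

```python
import io
from contextlib import asynccontextmanager

import boto3
import joblib
from fastapi import FastAPI

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Pull the serialized model from S3 once, at container startup.
    s3 = boto3.client("s3")
    buf = io.BytesIO()
    s3.download_fileobj("my-model-bucket", "models/xgb.pkl", buf)  # placeholder names
    buf.seek(0)
    models["xgb"] = joblib.load(buf)
    yield
    models.clear()

app = FastAPI(lifespan=lifespan)

@app.post("/predict/classify")
def classify(flow: dict):
    # Simplified: assumes the payload's values are already in feature order.
    return {"label": str(models["xgb"].predict([list(flow.values())])[0])}
```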
05 CI/CD & Drift Monitoring
GitHub Actions pipeline: push to main → Docker build → push to AWS ECR → ECS Fargate redeploy. drift.py computes per-feature Kolmogorov–Smirnov statistics against the training distribution at runtime. KS > 0.5 triggers a high-drift alert; results logged to MLflow.
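A sketch of the drift check; the function name is hypothetical, and the reference and live arrays are assumed to share the training feature order:

```python
import mlflow
import numpy as np
from scipy.stats import ks_2samp

KS_ALERT = 0.5  # KS statistic above this triggers a high-drift alert

def check_drift(train_ref: np.ndarray, live: np.ndarray, feature_names):
    with mlflow.start_run(run_name="drift_check"):
        for i, name in enumerate(feature_names):
            # Two-sample KS test: live distribution vs training distribution.
            stat, _ = ks_2samp(train_ref[:, i], live[:, i])
            mlflow.log_metric(f"ks_{name}", stat)
            if stat > KS_ALERT:
                print(f"HIGH DRIFT: {name} (KS={stat:.3f})")
```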

Model Performance

| Model | Approach | Macro F1 |
|---|---|---|
| Random Forest v1 | SMOTE oversampling | 0.8785 |
| Random Forest v2 | class_weight='balanced' | 0.8696 |
| Keras DNN | Categorical cross-entropy | 0.3371 |
| XGBoost | Sample weights (production) | 0.9036 |
| Isolation Forest | BENIGN-only novelty detection | n/a |
| IF → XGBoost Ensemble | Novelty gate + classifier | 0.87* |

* Ensemble F1 is lower than standalone XGBoost by design — the IF gate forwards out-of-distribution flows that XGBoost was never trained on. The ensemble's value is breadth of detection, not benchmark maximisation.

Design Decisions

Sample weights over SMOTE

SMOTE introduces synthetic samples that can distort minority-class decision boundaries. Sample weights rebalance training without touching the data distribution — preferable for production where inference runs on raw flows.

No log1p transformation

Tree-based models are invariant to monotonic feature transformations. Log1p was evaluated and deliberately dropped; the notebooks document this with supporting evidence.
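A toy sanity check of the invariance claim (illustrative, not from the notebooks): a decision tree fitted on raw features and one fitted on log1p-transformed features induce the same partition of the training rows, so their training-set predictions match exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 5))   # skewed, non-negative features
y = (X[:, 0] > 1).astype(int)

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
logd = DecisionTreeClassifier(random_state=0).fit(np.log1p(X), y)

# log1p is strictly monotonic, so split orderings are preserved and
# both trees partition the same training rows identically.
assert (raw.predict(X) == logd.predict(np.log1p(X))).all()
```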

Isolation Forest as a novelty gate, not a classifier

IF is trained on BENIGN-only traffic to flag out-of-distribution flows. Hyperparameter tuning was deliberately skipped — overfitting the contamination parameter to labelled benchmarks would undermine its real purpose.
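A minimal sketch of the gate's training, assuming the preprocessed training matrix and string labels; hyperparameters stay at their defaults, per the rationale above:

```python
from sklearn.ensemble import IsolationForest

# Fit on benign rows only (labels assumed to be CICIDS2017 strings).
X_benign = X_train[y_train == "BENIGN"]

iso = IsolationForest(random_state=42)  # defaults kept, incl. contamination="auto"
iso.fit(X_benign)
# At inference, iso.predict(X) == -1 marks a flow as out-of-distribution.
```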

Deterministic preprocessing

Correlation-based feature dropping uses np.triu(k=1) to guarantee a stable, reproducible column set regardless of execution order.
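A sketch of that pattern, assuming X is the feature DataFrame from preprocessing and using an illustrative 0.95 threshold:

```python
import numpy as np

corr = X.corr().abs()
# Keep the upper triangle only (k=1 excludes the diagonal), so each
# correlated pair is inspected exactly once and the dropped column is
# always the later one in the fixed column order: deterministic across runs.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X = X.drop(columns=to_drop)
```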

Project Structure


```
network-anomaly-detector/
├── notebooks/
│   ├── 01_eda.ipynb               # Exploratory analysis, class distribution
│   ├── 02_preprocessing.ipynb     # Feature engineering decisions
│   ├── 03_model_training.ipynb    # Model comparison (RF, XGBoost, DNN)
│   └── 04_anomaly_detection.ipynb # Isolation Forest, ensemble construction
├── src/
│   ├── preprocess.py              # Feature selection, scaling, artifact export
│   ├── train_xgb.py               # XGBoost training
│   ├── train_anomaly.py           # Isolation Forest training
│   ├── evaluate_xgb.py            # Classification evaluation
│   ├── evaluate_anomaly.py        # Anomaly evaluation
│   └── ensemble.py                # IF → XGBoost pipeline · MLflow logging
├── api/
│   ├── main.py                    # FastAPI app, lifespan, S3 loading
│   ├── predict.py                 # Endpoint logic
│   └── schemas.py                 # Pydantic request/response models
├── monitor/
│   └── drift.py                   # KS-test drift detection · MLflow
├── artifacts/                     # git-ignored
│   ├── feature_names.json
│   └── scaler.pkl
├── models/                        # git-ignored · stored in AWS S3
│   ├── xgb.pkl
│   └── iso_forest.pkl
├── dashboard.html                 # Portfolio demo (DEMO/LIVE mode)
└── .github/workflows/deploy.yml   # CI/CD → ECR → ECS Fargate
```

What I Learned

Getting XGBoost to 0.90 macro F1 was the straightforward part — the harder work was designing the system around it. Keeping preprocessing deterministic, giving the Isolation Forest a genuinely different purpose from the classifier, and arguing why a lower ensemble F1 is actually the correct outcome all required thinking beyond benchmark numbers.

Deploying to ECS Fargate and wiring up CI/CD made clear how much production ML differs from notebook ML. Artifact management, environment variables, container versioning, and drift monitoring never appear in a Kaggle competition — but they dominate real deployments.