Fraud Detection Model: A Machine Learning Case Study

Project Overview

This project demonstrates how machine learning can be applied to detect fraudulent financial transactions. Using a Kaggle dataset of ~200 companies’ financial statements, I developed a classification model that achieved about 91% accuracy in identifying fraud. The goal was not only predictive performance but also interpretability, so that investigators could understand and trust the results.

Data Set

Size: ~200 companies’ financial statements, 32 features
Features included: transaction type, amount, balance data, origin/destination accounts, fraud label
Target variable: binary classification — fraudulent (1) vs. legitimate (0)
Preprocessing: handled nulls, encoded categorical features (e.g., transaction type), normalized numerical fields.

Methodology

Data Preparation
- Cleaned missing values and reformatted categorical fields.
- Encoded the Fraud column as binary (1 = fraud, 0 = non-fraud).
- Normalized numeric features to improve comparability.
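
The preparation steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the column names (`debt_ratio`, `net_profit`), the sample rows, the use of median imputation, and the choice of min-max scaling are all assumptions, since the write-up does not say which techniques were used.

```python
from statistics import median

# Hypothetical rows; column names and values are illustrative only.
rows = [
    {"debt_ratio": 0.62, "net_profit": 120.0, "Fraud": "Yes"},
    {"debt_ratio": None, "net_profit": -40.0, "Fraud": "No"},
    {"debt_ratio": 0.35, "net_profit": 300.0, "Fraud": "No"},
    {"debt_ratio": 0.91, "net_profit": None, "Fraud": "Yes"},
]
NUMERIC = ["debt_ratio", "net_profit"]


def impute_median(rows, column):
    """Replace missing values in one column with that column's median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = fill


def min_max_scale(rows, column):
    """Rescale one column to the [0, 1] range."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    for r in rows:
        r[column] = (r[column] - lo) / span


for col in NUMERIC:
    impute_median(rows, col)   # assumption: median imputation
    min_max_scale(rows, col)   # assumption: min-max normalization

# Encode the label as 1 = fraud, 0 = non-fraud.
for r in rows:
    r["Fraud"] = 1 if r["Fraud"] == "Yes" else 0
```

Imputing before scaling keeps the filled-in values inside the same [0, 1] range as the observed ones.
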
Feature Engineering
- Created dummy variables for categorical entries.
- Examined correlations between financial ratios and fraud outcomes.
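
As a rough sketch of these two steps, the snippet below builds dummy variables for a categorical column and computes a Pearson correlation between a numeric ratio and the fraud label. The field names, the category values, and the numbers are hypothetical and chosen only to show the mechanics.

```python
from math import sqrt

# Hypothetical records; names and values are illustrative only.
records = [
    {"transaction_type": "transfer", "debt_ratio": 0.80, "fraud": 1},
    {"transaction_type": "payment", "debt_ratio": 0.30, "fraud": 0},
    {"transaction_type": "transfer", "debt_ratio": 0.75, "fraud": 1},
    {"transaction_type": "cash_out", "debt_ratio": 0.40, "fraud": 0},
    {"transaction_type": "payment", "debt_ratio": 0.55, "fraud": 1},
    {"transaction_type": "cash_out", "debt_ratio": 0.20, "fraud": 0},
]


def one_hot(records, column):
    """Return one 0/1 dummy column per category, keyed by 'column=value'."""
    categories = sorted({r[column] for r in records})
    return {
        f"{column}={cat}": [1 if r[column] == cat else 0 for r in records]
        for cat in categories
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


dummies = one_hot(records, "transaction_type")
fraud = [r["fraud"] for r in records]
debt = [r["debt_ratio"] for r in records]

print(dummies["transaction_type=transfer"])  # dummy column for one category
print(round(pearson(debt, fraud), 3))        # ratio vs. fraud label
```

With a binary label, the Pearson coefficient is the point-biserial correlation, which makes it a convenient first screen for which ratios move with the fraud outcome.
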
Modeling
- Trained Logistic Regression, Decision Trees, and Random Forest classifiers.
- Applied cross-validation to evaluate stability of results.
- Tuned hyperparameters of Random Forest for maximum recall (minimizing false negatives).
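
One way to set up this comparison and the recall-focused tuning is sketched below with scikit-learn. It assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in for the real dataset. The fold count, the parameter grid, and the random seeds are illustrative assumptions, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 200 rows, 32 features, balanced classes.
X, y = make_classification(
    n_samples=200, n_features=32, n_informative=8, random_state=0
)

# Stratified folds keep the fraud/legitimate ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Compare the three model families on cross-validated recall.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv_recall = {
    name: cross_val_score(model, X, y, cv=cv, scoring="recall")
    for name, model in models.items()
}
for name, scores in cv_recall.items():
    # The spread across folds is a rough indicator of stability.
    print(f"{name}: recall {scores.mean():.2f} +/- {scores.std():.2f}")

# Tune the Random Forest with recall as the selection criterion.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}  # illustrative grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=cv, scoring="recall"
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

Selecting on recall alone can favor models that flag too many cases, so it is worth checking precision for the chosen configuration as well.
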
Evaluation Metrics
- Accuracy, Precision, Recall, F1 Score.
- Confusion matrix to visualize false positives/negatives.
- ROC/AUC to measure classification strength.
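
To make these metrics concrete, the sketch below computes them from scratch on a small set of hypothetical labels and model scores. The data and the 0.5 decision threshold are assumptions for illustration and are not results from the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the chance a random fraud case outscores a random legitimate one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Recall is the metric tied to missed fraud cases (false negatives), while AUC summarizes how well the scores rank fraud above legitimate cases regardless of the threshold.
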

Findings

Random Forest achieved ~91% accuracy, outperforming the Logistic Regression and Decision Tree baselines.
Key predictive features included debt ratios, liabilities, and profit measures.
Because the dataset was relatively balanced, accuracy was not inflated by a dominant majority class, and the model achieved strong recall (a low rate of missed fraud cases).
The project demonstrated that machine learning can add measurable value in fraud risk assessment, complementing traditional auditing methods.
