
Fraud Detection Model: A Machine Learning Case Study
Project Overview
This project demonstrates how machine learning can be applied to detect fraudulent financial transactions. Using a dataset of ~200 companies’ financial statements from Kaggle, I developed a classification model that achieved about 91% accuracy in identifying fraud. The goal was not only predictive performance but also interpretability, so that investigators could understand and trust the results.
Data Set
Size: ~200 companies’ financial statements, 32 features
Features included: transaction type, amount, balance data, origin/destination accounts, fraud label
Target variable: binary classification — fraudulent (1) vs. legitimate (0)
Preprocessing: handled nulls, encoded categorical features (e.g., transaction type), normalized numerical fields.
Methodology
Data Preparation
Cleaned missing values and reformatted categorical fields.
Encoded the Fraud column as binary (1 = fraud, 0 = non-fraud).
Normalized numeric features to improve comparability.
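The three preparation steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the column names, the sample rows, the median fill for missing values, and the min-max form of normalization are all assumptions, since the write-up does not say which imputation or scaling method was used.

```python
from statistics import median

# Hypothetical rows; column names and values are illustrative only.
rows = [
    {"debt_ratio": 0.42, "net_profit": 120.0, "Fraud": "Yes"},
    {"debt_ratio": None, "net_profit": 80.0, "Fraud": "No"},
    {"debt_ratio": 0.90, "net_profit": -40.0, "Fraud": "Yes"},
    {"debt_ratio": 0.30, "net_profit": None, "Fraud": "No"},
]
numeric_cols = ["debt_ratio", "net_profit"]


def prepare(rows, numeric_cols, label_col="Fraud"):
    """Fill missing numerics, min-max scale them, and binarize the label."""
    cleaned = [dict(r) for r in rows]  # copy so the input rows stay untouched
    for col in numeric_cols:
        # Assumption: missing values are replaced by the column median.
        observed = [r[col] for r in cleaned if r[col] is not None]
        fill = median(observed)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
        # Assumption: "normalized" means min-max scaling to [0, 1].
        lo = min(r[col] for r in cleaned)
        hi = max(r[col] for r in cleaned)
        span = hi - lo
        for r in cleaned:
            r[col] = (r[col] - lo) / span if span else 0.0
    for r in cleaned:
        # Encode the label as 1 = fraud, 0 = non-fraud.
        is_fraud = str(r[label_col]).strip().lower() in {"yes", "1", "true", "fraud"}
        r[label_col] = 1 if is_fraud else 0
    return cleaned


prepared = prepare(rows, numeric_cols)
```

In practice the same steps are usually done with pandas or scikit-learn transformers, fitted on the training split only so that test data does not leak into the fill values or scaling ranges.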
Feature Engineering
Created dummy variables for categorical entries.
Examined correlations between financial ratios and fraud outcomes.
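As a rough sketch of these two steps, the snippet below builds dummy (one-hot) columns for a categorical field and computes a Pearson correlation between one numeric feature and the binary fraud label. The helper names, the category values, and the numbers are hypothetical, and Pearson correlation is an assumed choice; the write-up does not state which correlation measure was used.

```python
from math import sqrt


def one_hot(values):
    """Return (column_names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    names = [f"is_{c}" for c in categories]
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return names, encoded


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only.
transaction_type = ["transfer", "payment", "transfer", "cash_out"]
debt_ratio = [0.9, 0.3, 0.8, 0.2]
fraud = [1, 0, 1, 0]

names, dummies = one_hot(transaction_type)
r = pearson(debt_ratio, fraud)
```

With a binary label, the Pearson coefficient is the point-biserial correlation, so a value near 1 here would mean that higher debt ratios go with the fraud class in this toy sample.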
Modeling
Trained Logistic Regression, Decision Trees, and Random Forest classifiers.
Applied cross-validation to evaluate stability of results.
Tuned the Random Forest hyperparameters to maximize recall (minimizing false negatives).
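The modeling steps above might look something like the following scikit-learn sketch. It is an illustration under stated assumptions rather than the project's code: the data is synthetic (generated to match the reported 200 rows and 32 features), and the fold count, parameter grid, and random seeds are placeholders chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 200 rows, 32 features.
X, y = make_classification(
    n_samples=200, n_features=32, n_informative=8, random_state=0
)

# Stratified folds keep the fraud/legitimate ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean cross-validated recall for each candidate model.
cv_recall = {
    name: cross_val_score(model, X, y, cv=cv, scoring="recall").mean()
    for name, model in models.items()
}

# Assumed grid; the write-up does not list the hyperparameters that were tuned.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}

# scoring="recall" makes the search prefer settings that miss fewer fraud cases.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="recall", cv=cv
)
search.fit(X, y)
best_params, best_recall = search.best_params_, search.best_score_
```

Selecting on recall alone can favor models that flag too many legitimate cases, so it is worth checking precision for the chosen settings as well.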
Evaluation Metrics
Accuracy, Precision, Recall, F1 Score.
Confusion matrix to visualize false positives/negatives.
ROC/AUC to measure classification strength.
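To make the metric definitions concrete, here is a small dependency-free sketch that computes them from labels and predicted scores. The labels, scores, and the 0.5 decision threshold are made up for illustration and are not results from this project; the function names are likewise hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating 1 as fraud and 0 as legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,  # share of flagged cases that are fraud
        "recall": recall,        # share of fraud cases that were flagged
        "f1": f1,                # harmonic mean of precision and recall
    }


def roc_auc(y_true, scores):
    """Chance that a random fraud case outscores a random legitimate one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, while AUC is computed from the score ranking and so summarizes performance across all thresholds.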
Findings
Random Forest achieved ~91% accuracy, outperforming baseline models.
Key predictive features included debt ratios, liabilities, and profit measures.
Because the dataset was relatively balanced, the model was not biased toward the legitimate class and achieved strong recall (a low rate of missed fraud cases).
The project demonstrated that machine learning can add measurable value in fraud risk assessment, complementing traditional auditing methods.