Ok, let's create a Machine Learning model to detect financial fraud.

For me, one of the most interesting and practical applications of ML is fraud detection in financial transactions, a field that can be approached with classification models or anomaly detection techniques.

In general, exploring and experimenting with fraud detection models is somewhat more complex due to the lack of publicly available datasets. This is understandable, as few financial institutions are willing to expose potentially sensitive financial and operational information to the public domain.

The objective of this article is to present a simple approach to the problem of fraud detection, which I hope will serve as either an initial reference or a starting point for developing more complex models.

This project is based on a synthetic dataset created with PaySim, a simulator that generates data from private real-world transactions of a mobile money service implemented in an African country. The synthetic dataset simulates normal transactional operations, into which malicious behavior has been injected in order to evaluate the performance of fraud detection models. The original dataset is available on the Kaggle platform (https://www.kaggle.com/ntnu-testimon/paysim1).

The code for this project can be found at:

https://github.com/ortizcarlos/fraud_detection/blob/master/Financial%20fraud%20detection.ipynb

Let's start with some data exploration, to gain an understanding of key information such as dataset size, columns, missing data, outliers, etc.

The dataset is moderately large with 6,362,620 rows and 11 columns.
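As a reference, something along the following lines can be used to load the dataset and check its size, column types, and missing values (the CSV file name and the df variable are illustrative placeholders, not necessarily those of the original notebook; adjust them to your local copy):

    import pandas as pd

    # Load the PaySim CSV downloaded from Kaggle (adjust the file name to your copy)
    df = pd.read_csv('paysim_transactions.csv')

    print(df.shape)           # expected: (6362620, 11)
    print(df.dtypes)          # column names and data types
    print(df.isnull().sum())  # missing values per column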

Definition of the dataset columns (copied from Kaggle)

Types of transactions and distribution

‘PAYMENT’, ‘TRANSFER’, ‘CASH_OUT’, ‘DEBIT’, ‘CASH_IN’

Distribution by transaction type (type)

Let’s see how many frauds we have.
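A quick way to count them (a sketch, assuming the df DataFrame loaded above and the isFraud label column defined on Kaggle):

    # 1 = fraudulent transaction, 0 = legitimate transaction
    print(df['isFraud'].value_counts())
    print(df['isFraud'].value_counts(normalize=True) * 100)  # as percentages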

Only 8,213 rows out of a total of just over 6 million represent fraud (only 0.13%!). This is to be expected: in financial datasets, the number of fraudulent samples is almost always very small compared to the total number of transactions.

This class imbalance (frauds represent only 0.13% of the total) is a very important factor to consider when building our model.

Now, let’s see how many frauds there are by transaction type:
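A simple cross-tabulation of type against isFraud shows this (again a sketch based on the df DataFrame assumed above):

    # Number of legitimate (0) and fraudulent (1) transactions per transaction type
    print(pd.crosstab(df['type'], df['isFraud']))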

According to the data, frauds have only occurred in CASH_OUT and TRANSFER transactions.

Distribution of the transaction amount (Column 'amount')

Let’s visualize histograms of the transaction amount by transaction type (type), trying to identify any patterns in the data.
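One possible way to produce these histograms (a sketch; the figure layout and bin count are arbitrary choices, not necessarily those of the original notebook):

    import matplotlib.pyplot as plt

    # One histogram of 'amount' per transaction type
    types = df['type'].unique()
    fig, axes = plt.subplots(1, len(types), figsize=(20, 4))
    for ax, t in zip(axes, types):
        df.loc[df['type'] == t, 'amount'].hist(bins=50, ax=ax)
        ax.set_title(t)
        ax.set_xlabel('amount')
    plt.tight_layout()
    plt.show()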

The previous histograms show that, in every case, the data is concentrated on the left, with very pronounced outliers in the CASH_OUT and TRANSFER transaction types, which are exactly the types of transactions where we have frauds. Let's see if visualizing the interquartile range of the data reveals something more interesting.

Let’s now visualize the distribution of the amount in fraudulent transactions.
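For example (a sketch, reusing the df DataFrame and the matplotlib import from above):

    # Distribution of 'amount' for fraudulent transactions only
    df.loc[df['isFraud'] == 1, 'amount'].hist(bins=50)
    plt.xlabel('amount')
    plt.ylabel('frequency')
    plt.title('Amount in fraudulent transactions')
    plt.show()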

The histogram of the frauds also shows that the data is concentrated on the left (positive skew), with outliers at around 10M.

Feature Engineering

Let’s start by converting the ‘type’ column, which is categorical, with the values (‘PAYMENT’, ‘TRANSFER’, ‘CASH_OUT’, ‘DEBIT’, ‘CASH_IN’).

For the ‘type’ column, one could consider an ordinal representation such as 1 = PAYMENT, 2 = TRANSFER, 3 = CASH_OUT, and so on. However, this might lead the model to assume that numerically adjacent categories (for example, PAYMENT and TRANSFER) are more similar to each other than distant ones (for example, PAYMENT and CASH_IN), when in reality there is no ordinal relationship between them. Therefore, it’s better to perform a One Hot Encoding.
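With pandas this can be done with get_dummies (a sketch; dropping the nameOrig and nameDest account identifiers is an assumption of this example, since raw account ids carry no generalizable signal for a simple model):

    # One-hot encode 'type' so each transaction type becomes its own 0/1 column
    df_model = pd.get_dummies(df, columns=['type'], prefix='', prefix_sep='', dtype=int)

    # Drop the account identifier columns (assumption: they are not used as features)
    df_model = df_model.drop(columns=['nameOrig', 'nameDest'])
    print(df_model.columns.tolist())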

Correlation of the features with the fraud indicator

Let’s now look at the correlation of all the columns in the dataset with respect to the label (isFraud), which is the value our model should be able to predict.
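This can be computed directly from the one-hot-encoded DataFrame (a sketch, assuming the df_model DataFrame from the previous step):

    # Correlation of every feature with the fraud label, sorted from highest to lowest
    corr_with_fraud = df_model.corr()['isFraud'].sort_values(ascending=False)
    print(corr_with_fraud)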

We can observe that the columns most correlated with the isFraud label are:

Positively: amount, TRANSFER, isFlaggedFraud, CASH_OUT, oldbalanceOrg

Negatively: PAYMENT and CASH_IN (which makes sense, since there are no frauds for these transaction types).

Data selection for training and testing

Before splitting the dataset into training and testing, it is important to find a splitting method that is more effective for the model than a purely random one. In this case, since we have so few records indicating fraud, it would be beneficial for the training and testing datasets to have an equal proportion of rows with frauds. This type of split is called stratified splitting, and for this, we will use StratifiedShuffleSplit from scikit-learn.
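A minimal sketch of such a split, assuming the df_model DataFrame built earlier (the 80/20 split ratio and the random_state are illustrative choices):

    from sklearn.model_selection import StratifiedShuffleSplit

    X = df_model.drop(columns=['isFraud'])
    y = df_model['isFraud']

    # A single stratified split: the ~0.13% fraud ratio is preserved in both subsets
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_idx, test_idx in sss.split(X, y):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    print(y_train.mean(), y_test.mean())  # both should be close to 0.0013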

Model selection and training phase

Let’s start by creating a model based on the Random Forest algorithm, which is a flexible and easy-to-use algorithm that produces good results even without tuning the hyperparameters. Additionally, it is useful for handling large datasets with many features, although this particular dataset is not that large.
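A training sketch with scikit-learn (the hyperparameters shown are near-default, illustrative values, not necessarily those used in the notebook):

    from sklearn.ensemble import RandomForestClassifier

    # Random Forest with mostly default hyperparameters; n_jobs=-1 uses all CPU cores
    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)

    y_pred = rf.predict(X_test)
    print('Accuracy:', rf.score(X_test, y_test))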

After training the model, we find... wow, an accuracy of 99.99%! But… not so fast! Let’s not be dazzled by that value. Remember that the rows with frauds represent ONLY 0.13% of the total, so an accuracy of 99.99% could simply be telling us that our model is excellent at predicting NON-fraudulent transactions 🙂

For classification models, there are more suitable metrics to evaluate performance, including the confusion matrix, precision, and recall. When our model predicts a fraud, precision tells us what percentage of those predictions are actually frauds. Meanwhile, recall (also known as sensitivity or completeness) tells us what percentage of the total frauds our model is able to identify.

First, we calculate the confusion matrix, from which we obtain TP (true positives), FP (false positives), TN (true negatives), and FN (false negatives). With these, we calculate precision = TP / (TP + FP) and recall = TP / (TP + FN).
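In code, this could look as follows (a sketch based on the y_test and y_pred variables assumed above; scikit-learn's precision_score and recall_score functions give the same results as the manual formulas):

    from sklearn.metrics import confusion_matrix

    # For binary labels, scikit-learn lays out the matrix as:
    # [[TN, FP],
    #  [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print('Precision:', precision, 'Recall:', recall)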

The results are still very good! We have a precision of 100% and a recall of 91.64%!

This means that our model is able to recognize 91.64% of the frauds, and when it predicts a fraud, it does so with 100% accuracy!

However, we must be cautious and ensure that the good values of these metrics are not simply due to the model overfitting the training data…

Cross validation

To ensure that the model’s performance is as good as the previous metrics suggest, let’s implement a mechanism called cross-validation. This basically involves splitting the training dataset into k smaller data blocks (folds), and then, for each fold:

  • The model is trained using k-1 blocks of the training dataset.
  • The resulting model is validated with the remaining block or fold that has not been seen by the model, using a predefined metric (in this case, F1).
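A minimal sketch of this evaluation, assuming the rf model and the training split defined above (cv=5 folds is an illustrative choice; for a classifier, scikit-learn stratifies the folds automatically):

    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validation on the training set, scored with F1
    scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='f1', n_jobs=-1)
    print('F1 per fold:', scores)
    print('Mean F1:', scores.mean())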
