Using information present at the time of transaction, can I predict whether a credit card purchase is fraudulent?
I found a great dataset on Kaggle to give this a try.
Dataset: https://www.kaggle.com/datasets/kelvinkelue/credit-card-fraud-prediction
My work: https://www.kaggle.com/code/kleina47/credit-card-fraud/edit
This dataset has 21 features and 1 target variable (is_fraud). At first glance, the most interesting features are transaction amount (amt) and city population; I'm also interested in engineering a distance feature, and I'll include a dummy variable for gender.
To engineer a distance feature, I used the geopy library. The dataset includes the latitude and longitude of the transaction (merch_lat & merch_long) as well as the latitude and longitude of the card holder (lat & long).
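For reference, here's a minimal sketch of that calculation; the column names come from the dataset, while the file name and helper function are my own placeholders:

```python
# Hypothetical sketch: distance (in km) between the card holder's home
# coordinates and the merchant's coordinates for each transaction.
import pandas as pd
from geopy.distance import geodesic

df = pd.read_csv("fraud_data.csv")  # placeholder file name

def holder_to_merchant_km(row):
    # geodesic() takes two (lat, long) tuples and returns a Distance object
    return geodesic((row["lat"], row["long"]),
                    (row["merch_lat"], row["merch_long"])).km

df["distance_km"] = df.apply(holder_to_merchant_km, axis=1)
```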
I scaled the data using StandardScaler and created dummy variables with pandas before splitting into training and test sets.
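A rough sketch of that preprocessing, continuing from the DataFrame above (the exact feature list in my notebook may differ slightly):

```python
# Sketch: select features, add a gender dummy, scale, and split.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

numeric_cols = ["amt", "city_pop", "distance_km"]
features = df[numeric_cols].copy()
features = features.join(pd.get_dummies(df["gender"], prefix="gender", drop_first=True))

# Standardize the numeric columns
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])

X_train, X_test, y_train, y_test = train_test_split(
    features, df["is_fraud"], test_size=0.2, random_state=42)
```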
The initial results were surprisingly good! I achieved 0.9957 accuracy, with most features being significant. Unfortunately, the engineered distance variable was not significant (p-value of 0.28), so I dropped it.
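For completeness, a rough sketch of the baseline model (the hyperparameters here are my own assumptions; the 0.9957 accuracy is the figure from my notebook):

```python
# Sketch of the baseline logistic regression and its (misleading) accuracy.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # ~0.9957 in my notebook
```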
When re-running the model without it, the accuracy stayed exactly the same, and all of the remaining variables were significant.
This is where I initially left the model, but as I've been learning more about accuracy vs precision vs recall, I decided to revisit the analysis. And... the results were quite disappointing...
When revisiting this problem, I decided to create a confusion matrix to understand more about the model's results. Unfortunately, the model predicted fraud exactly 0 times. So why was this the case?
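A minimal sketch of that check, assuming the fitted model and test split from the snippets above:

```python
# The confusion matrix that exposed the problem.
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
# With the original model, the "predicted fraud" column was all zeros.
```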
It turns out that I never looked at the frequency of fraudulent transactions. If I had done this from the beginning, I would have realized that only 2145 out of over 550k transactions were fraudulent.
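The sanity check I should have run first, assuming the same DataFrame as above:

```python
# How imbalanced is the target? Roughly 2145 fraud rows out of 550k+ total.
print(df["is_fraud"].value_counts())
```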
Accuracy only measures the share of transactions classified correctly. So even though the model failed to identify a single fraudulent transaction, its accuracy was still very high, because the correctly classified legitimate transactions drowned out the misses.
In this case, even though the model was highly accurate, it was useless in practice, missing all of the actual fraud!
This problem prompted me to dig into how I could improve recall, particularly for a dataset with so few positive events.
The most impactful change I tried was the class_weight parameter in sklearn's LogisticRegression. This penalizes misclassified fraud cases more heavily than misclassified legitimate ones, based on how rare fraud is in the dataset, causing the model to predict fraud far more often. The impact was improving recall from 0 to 0.75. However, precision decreased to 0.05 because of the increased number of false positives.
I learned that you can also set the class weights manually. I doubled the fraud class weight from 130 to 260, and while it did not change the results in any meaningful way, this is useful to know for future classification models.
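Here's a rough sketch of both weighting approaches. I'm assuming the 'balanced' setting, which sets each class weight to n_samples / (n_classes * n_class_count) and works out to roughly 130 for the fraud class here; the manual dict is how you'd push that to 260:

```python
# Sketch of the two weighting approaches discussed above.
from sklearn.linear_model import LogisticRegression

# 'balanced' weights classes inversely to their frequency (~130 for fraud here)
balanced_model = LogisticRegression(max_iter=1000, class_weight="balanced")
balanced_model.fit(X_train, y_train)

# Manually doubling the fraud weight (values taken from the text above)
manual_model = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 260})
manual_model.fit(X_train, y_train)
```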
When I tried the new weighting method, I added all of the independent variables back into the model. Now that recall had improved, I wanted to make sure that a) all of the independent variables were significant and b) the new model was not overfitting to the training data.
a) I've previously used statsmodels to assess the significance of independent variables, but statsmodels does not have a class_weight option like sklearn. To achieve a similar effect, I oversampled the fraud cases for a new statsmodels logistic regression. With that, all of the independent variables were significant.
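A sketch of that workaround, assuming simple random oversampling with replacement (the exact oversampling ratio in my notebook may differ):

```python
# Duplicate fraud rows in the training data so the minority class carries
# comparable weight, then fit a statsmodels Logit to read off the p-values.
import pandas as pd
import statsmodels.api as sm

train = X_train.copy()
train["is_fraud"] = y_train.values

fraud = train[train["is_fraud"] == 1]
legit = train[train["is_fraud"] == 0]

# Upsample fraud rows (with replacement) to match the legitimate count
fraud_upsampled = fraud.sample(n=len(legit), replace=True, random_state=42)
balanced = pd.concat([legit, fraud_upsampled])

X_bal = sm.add_constant(balanced.drop(columns="is_fraud").astype(float))
logit = sm.Logit(balanced["is_fraud"], X_bal).fit()
print(logit.summary())  # p-values for each feature
```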
b) Finally, I checked the classification report on the training set, and it showed almost identical precision and recall to the test set, indicating that the model was not overfitting.
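A sketch of that check, using the weighted model from the earlier snippet:

```python
# Compare precision/recall on the training and test sets.
from sklearn.metrics import classification_report

print(classification_report(y_train, balanced_model.predict(X_train)))
print(classification_report(y_test, balanced_model.predict(X_test)))
# Nearly identical fraud precision/recall across the two suggests no overfitting.
```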
A few other things I tried that yielded no improvements:
I set the classification threshold to 0.2, but in my opinion this yielded too many false positives. Depending on capacity and risk tolerance, a lower threshold could be considered (sketched after this list), but I'll leave it as is for now.
I also tried SMOTE (Synthetic Minority Over-sampling Technique), which synthetically creates more fraud cases based on the existing ones in the dataset. In my case, this seemed to have little to no effect on the model.
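Rough sketches of both experiments, assuming the weighted model and split from the earlier snippets (imblearn is a separate package, installed as imbalanced-learn):

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1) Lowering the decision threshold: predict fraud whenever the model's
#    fraud probability exceeds 0.2 instead of the default 0.5.
fraud_proba = balanced_model.predict_proba(X_test)[:, 1]
y_pred_02 = (fraud_proba >= 0.2).astype(int)
print(classification_report(y_test, y_pred_02))

# 2) SMOTE: synthesize new minority-class rows in the training set,
#    then refit a plain logistic regression on the resampled data.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
smote_model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, smote_model.predict(X_test)))
```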
This was a lot of fun, and I learned a lot about how to assess the efficacy of a model beyond just looking at accuracy. If you have any feedback on my approach, send me a message here. Thanks for reading!