Drawing on my background in security, I wanted to see how well I could predict whether a web authentication is malicious. Furthermore, I wanted to think about how I might be able to balance user experience (seamless auth) with security.
The dataset can be found on Kaggle. You can find my Jupyter notebook on my GitHub. Thanks for reading!
There are some interesting fields here. At first glance, I would assume that protocol type, login attempts, and IP reputation would be strong indicators. I'm also assuming that failed_logins is the number of failed login attempts during the current session. One feature I would have loved to see is location. Maybe it's already baked into the engineered IP reputation feature, but it would have been great to compare the current location to the last sign-in location.
Because the data were already very clean, with no null values or spurious results, preparing them was very straightforward. The dataset consists of almost 10k login attempts and has a surprisingly high number of malicious login attempts. My assumption is that these were manually labeled by some incident response team.
The independent variables show no signs of correlation, so there's no need to worry about multicollinearity 👍
I've been learning about sklearn's Pipeline class, so I decided to give it a try for this project. After splitting the data into train and test sets, I created a preprocessor that generates dummy variables using OneHotEncoder while also scaling the numeric variables. Then, I added the preprocessor to the pipeline, along with a Logistic Regression model.
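For anyone who hasn't used pipelines before, here's a minimal sketch of roughly what that setup can look like. It is not the code from my notebook: the data is synthetic, and the column names (login_attempts, ip_reputation_score, protocol_type, attack_detected), the label rule, and the 80/20 split are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names and values are illustrative assumptions.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "login_attempts": rng.integers(1, 10, size=n),
    "ip_reputation_score": rng.random(n),
    "protocol_type": rng.choice(["TCP", "UDP", "ICMP"], size=n),
})
# Made-up label rule so the model has something to learn.
score = 0.4 * df["login_attempts"] + 3.0 * df["ip_reputation_score"]
df["attack_detected"] = (score + rng.normal(0, 1, n) > 3.5).astype(int)

numeric_cols = ["login_attempts", "ip_reputation_score"]
categorical_cols = ["protocol_type"]

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols],
    df["attack_detected"],
    test_size=0.2,
    random_state=42,
    stratify=df["attack_detected"],
)

# Scale numeric columns and one-hot encode categorical columns in one step.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Chain preprocessing and the classifier so both are fit on training data only.
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The nice part of wrapping everything in a pipeline is that the scaler and encoder only ever learn from the training split, so nothing leaks from the test set.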
I checked the accuracy of the Logistic Regression on both the train and test samples. Both came out at .74, which suggests no overfitting as well as fairly decent results.
The confusion matrix and classification report indicate that precision is also .74, but that recall is only .66 for malicious login attempts.
Depending on the company, we may want to tune the model to have better recall, since false positives only have the consequence of forcing authentication, while false negatives mean a significant security event.
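One common way to trade precision for recall is to lower the probability cutoff used to flag a login as malicious. The sketch below illustrates the idea in plain Python. The probabilities and labels are made up for illustration (they are not from the dataset), and precision_recall_at is a hypothetical helper name.

```python
def precision_recall_at(threshold, probs, labels):
    """Precision and recall for the malicious class (label 1) at a cutoff."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred == 1 and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred == 1 and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if pred == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up predicted probabilities and true labels, for illustration only.
probs = [0.10, 0.30, 0.40, 0.45, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.50, 0.35):
    precision, recall = precision_recall_at(threshold, probs, labels)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
# threshold=0.50 precision=0.75 recall=0.75
# threshold=0.35 precision=0.67 recall=1.00
```

In this toy example, dropping the cutoff from 0.50 to 0.35 catches every malicious login at the cost of one extra false alarm.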
Using the statsmodels library, I fit a new logistic regression so that we could take a look at p-values. The p-values suggest that we can drop network packet size, protocol_type, and unusual_access_time, so I did just that.
When I reran the logistic regression without these variables, accuracy remained at .74, but recall improved to .74 while precision decreased to .66. I think this is probably better (depending on the nature of the business) because, again, it means more of the malicious attempts are actually being detected, even if more benign attempts are wrongly flagged as malicious. In other words, better safe than sorry.
I checked out a few other Kaggle examples, and all of them were right around .74 accuracy when using Logistic Regression (although random forest was able to improve to around .84).
Next, I cross-validated the accuracy of the v2 model to confirm there was no overfitting, and I'm happy to report that all folds were in the same ballpark.
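If you want to try a similar check yourself, a minimal sketch with sklearn's cross_val_score looks something like this. The data comes from make_classification rather than the Kaggle dataset, and the choice of 5 folds is an assumption on my part.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Kaggle dataset.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, random_state=0
)

# The scaler is refit inside each fold because it is part of the pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the fold accuracies are tightly clustered (a small standard deviation), that's a good sign the model isn't just memorizing one particular split.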
Lastly, I wanted to see how the logistic regression probabilities were distributed. I plotted the predicted probabilities on a histogram, and I decided to also draw two thresholds (see green and red lines).
The idea of the thresholds is that anything below the green line would allow normal authentication. Anything between the green and the red lines may require additional verification, like MFA. Anything above the red line may require additional verification as well as notify the user of suspected malicious behavior.
Based on this curve, that would mean 48.7% of users would be able to authenticate normally, 46.2% would be asked to use MFA for that sign-in, and 5.2% would be sent a malicious activity notification in addition to MFA.
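To make the tiering concrete, here's a small sketch of how predicted probabilities could be mapped to the three actions. The cutoffs (0.3 and 0.9), the function name auth_action, and the sample probabilities are all hypothetical placeholders, not the values behind the green and red lines in my plot.

```python
from collections import Counter

def auth_action(p_malicious: float, low: float = 0.3, high: float = 0.9) -> str:
    """Map a predicted probability of malicious activity to an auth tier.

    The default cutoffs are hypothetical placeholders.
    """
    if p_malicious < low:
        return "allow"            # below the green line: normal authentication
    if p_malicious < high:
        return "mfa"              # between the lines: require MFA
    return "mfa_and_notify"       # above the red line: MFA plus a notification

# Made-up predicted probabilities, for illustration only.
probs = [0.05, 0.20, 0.35, 0.60, 0.95]
counts = Counter(auth_action(p) for p in probs)

for action in ("allow", "mfa", "mfa_and_notify"):
    print(f"{action}: {100 * counts[action] / len(probs):.1f}%")
# allow: 40.0%
# mfa: 40.0%
# mfa_and_notify: 20.0%
```

In practice, you'd pick the two cutoffs based on how much friction the business is willing to add for how much risk reduction.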
If you want to check out any other notebooks on this dataset, I really enjoyed these two:
https://www.kaggle.com/code/nukimayasari/cybersecurity-intrusion
https://www.kaggle.com/code/madhuraatmarambhagat/cybersecurity-intrusion-prediction
If you have any feedback on my approach, send me a message here. Thanks for reading!