Predicting credit card customer churning and devising strategies for customer retention

Background

I will analyse a dataset consisting of demographic and credit card status information on 10,000 customers of a bank (source) to build a model for predicting churning and to identify crucial features, so that the bank can devise more effective retention strategies. This analysis thus tackles a supervised classification problem.

This analysis demonstrates my skills in data exploration, manipulation and visualisation, machine learning model training, and telling stories with data to solve business problems.

I will use pandas and numpy for data exploration and manipulation, matplotlib and seaborn for data visualisation, and scikit-learn and CatBoost for data preprocessing, machine learning model building and evaluation.

Exploratory data analysis

First of all, let's glance over the data types of the features and check whether any missing values are present in the dataset. No missing values are observed, and all features are stored in reasonable types. Attrition_Flag will be the target variable, indicating whether a customer has churned or not.
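As a minimal sketch (assuming the data is stored locally as a CSV named BankChurners.csv), this first inspection could look like the following:

```python
import pandas as pd

# Assumed file name for the credit card customer dataset; adjust the path as needed
df = pd.read_csv("BankChurners.csv")

# Inspect data types and check for missing values in each column
df.info()
print(df.isnull().sum())
```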

Moreover, the last two columns of the dataset seem to be the output of a previous attempt at building a Naive Bayes model to predict churning. Let's drop these two columns, since we are more interested in how the existing features can help predict customer churning. The CLIENTNUM column, which merely serves as a customer identifier, should also be dropped since it will not be a useful feature for the classification task.
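A small sketch of this clean-up step, assuming the Naive Bayes columns are indeed the last two columns of the data frame:

```python
# Drop the customer identifier and the last two columns
# (the pre-computed Naive Bayes predictions), as none of them are useful features
df = df.drop(columns=["CLIENTNUM"])
df = df.drop(columns=df.columns[-2:])
```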

The next step is to look at the summary statistics of the columns. Starting with the numerical features, there do not seem to be any abnormal values caused by data entry or other human errors, and all values fall within reasonable ranges. Here are some preliminary insights:

As for the categorical features, there are some interesting patterns exhibited as well:

Let's also visualise the features by the churning status of customers. For the categorical features, the two groups of customers do not appear to differ much substantively: based on chi-squared tests with $\alpha$ at the 0.05 level, only Gender and Income Category show significant differences. Whether and how strongly these features are related to customer churning can be examined later by looking at their importance in the trained classification model.
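For illustration, the chi-squared tests could be run with scipy roughly as sketched below (the categorical column names are assumed to follow the dataset's naming):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Assumed categorical feature names; adjust to match the actual columns
cat_features = ["Gender", "Education_Level", "Marital_Status",
                "Income_Category", "Card_Category"]

for col in cat_features:
    # Contingency table of the feature against the churn status
    contingency = pd.crosstab(df[col], df["Attrition_Flag"])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    print(f"{col}: chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
```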

As for the numerical features, the Kruskal-Wallis test, which tests whether the medians differ between churning and non-churning customers, shows (with $\alpha$ at the 0.05 level as the threshold) that significant differences exist for all features except Customer Age and Months on book. The fact that the difference in the median of Months on book between churning and non-churning customers is not statistically significant suggests that how long a customer has used the bank's services may not be closely related to the probability of churning.
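A comparable sketch for the Kruskal-Wallis tests, assuming the target takes the values "Attrited Customer" and "Existing Customer":

```python
from scipy.stats import kruskal

# Run the Kruskal-Wallis test on every numerical feature
for col in df.select_dtypes(include="number").columns:
    churned = df.loc[df["Attrition_Flag"] == "Attrited Customer", col]
    existing = df.loc[df["Attrition_Flag"] == "Existing Customer", col]
    stat, p_value = kruskal(churned, existing)
    print(f"{col}: H = {stat:.2f}, p-value = {p_value:.4f}")
```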

In terms of the more substantive differences, here are some observations from the violin plots of these numerical features (a plotting sketch follows the list):

  1. The distribution of the average utilisation rate of credit limits for churning customers peaks at 0%, whereas existing customers typically utilise around 20% of their credit limit
  2. Both the median transaction amount (Total Trans Amt) and transaction count (Total Trans Ct) over the last 12 months are higher for existing customers than for churning ones
  3. Lastly, churned customers on average contacted the bank more frequently and had more inactive months in the last year than existing customers
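The violin plots referenced above could be produced along these lines (Total_Trans_Ct is assumed to be the column holding the total transaction count):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Violin plot of one numerical feature, split by churn status
sns.violinplot(data=df, x="Attrition_Flag", y="Total_Trans_Ct")
plt.title("Total transaction count by churn status")
plt.show()
```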

Predicting customer churning with machine learning models

Now that I have explored the data and performed some data transformations when needed, it is time to build the models to predict customer churning. Specifically, I will try one linear classifier (Logistic Regression) and two ensemble methods (Random Forest and CatBoost).

The reason I picked CatBoost as the gradient boosting method for this analysis is that 5 out of the 19 features are categorical, and CatBoost offers both one-hot encoding for binary features and ordered target encoding for high-cardinality features, which essentially first permutes the dataset and then target-encodes each sample using only the samples that come before it (source). Ordered target encoding avoids the excessive sparsity introduced by one-hot encoding high-cardinality features, which would hamper the performance of tree-based models.

First of all, one-third of the dataset will be held out as the test set for model validation, with the split stratified on the target variable (Attrition_Flag) due to the class imbalance mentioned before. Let's also recode the target variable into 1 or 0, with 1 meaning a customer has churned and 0 otherwise.
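A sketch of the split and the recoding, assuming the churned customers are labelled "Attrited Customer":

```python
from sklearn.model_selection import train_test_split

# Recode the target: 1 = churned, 0 = existing customer
y = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
X = df.drop(columns=["Attrition_Flag"])

# Hold out one third of the data, stratified on the imbalanced target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42
)
```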

Building baseline models

I will first test a few classification models with their default parameters to obtain their baseline performance and decide how to proceed. Note that the class weights are adjusted via the model parameters (class_weight in Logistic Regression and Random Forest, scale_pos_weight in CatBoost), set to the ratio of the number of samples in the majority class (existing customers) to that in the minority class (churned customers).
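The baseline models could be set up roughly as follows (preprocessing pipelines for the scikit-learn models, such as encoding the categorical features, are omitted for brevity; cat_features is the list of categorical column names from before):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

# Ratio of existing (majority) to churned (minority) customers in the training set
ratio = (y_train == 0).sum() / (y_train == 1).sum()

models = {
    "Logistic Regression": LogisticRegression(class_weight={0: 1, 1: ratio}, max_iter=1000),
    "Random Forest": RandomForestClassifier(class_weight={0: 1, 1: ratio}, random_state=42),
    "CatBoost": CatBoostClassifier(scale_pos_weight=ratio, cat_features=cat_features, verbose=0),
}
```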

As for the metrics used to judge model performance, apart from log loss, which is the loss function minimised when training the classification models, I will also use the F1 score: the harmonic mean of precision (the proportion of customers predicted to churn who actually churn) and recall (the proportion of actually churning customers that the model correctly predicts as churning). The F1 score is appropriate here because the cost of wrongly predicting a churning customer as not churning (a false negative) is larger than the cost of predicting a non-churning customer as churning (a false positive) in this context. Moreover, under class imbalance the F1 score is more sensitive to false positives than the ROC AUC score, thereby reflecting the model's actual performance on the positive class more realistically (source).
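For reference, the F1 score combines the two quantities as follows:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$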

After running the models, the CatBoost model performs significantly better than the rest on both F1 score and log loss, in 5-fold cross-validation and on the test set. Even without feature selection and hyperparameter tuning, the CatBoost model already achieves an F1 score of 0.9078 and a log loss of 0.0908 on the test data. I will now move on with the CatBoost model to see whether I can further improve its performance.

Selecting features for CatBoost

The next step is to eliminate insignificant features to make the model more parsimonious, but how many should I remove? This can be decided by comparing the log loss of the model with a given number of features removed against that of the model trained with all features. The catboost module provides the convenient select_features method to do so. Let's see how the log loss of the model would change if I removed up to 9 features from the dataset.
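A hedged sketch of how select_features could be called for this purpose (the parameter values here are illustrative, not necessarily the exact ones used):

```python
from catboost import CatBoostClassifier, Pool, EFeaturesSelectionAlgorithm

train_pool = Pool(X_train, y_train, cat_features=cat_features)
test_pool = Pool(X_test, y_test, cat_features=cat_features)

model = CatBoostClassifier(scale_pos_weight=ratio, verbose=0)

# Recursively eliminate up to 9 features, tracking the change in loss at each step
summary = model.select_features(
    train_pool,
    eval_set=test_pool,
    features_for_select=list(X_train.columns),
    num_features_to_select=X_train.shape[1] - 9,
    steps=9,
    algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
    train_final_model=False,
    plot=True,
)
```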

Looking at the above plot, it seems I can remove up to 5 features without the model incurring a higher log loss than when it is trained with all available features. Thus, I will create the parsimonious model with the remaining 14 features as input.

Consistent with the line plot above tracking the change in log loss as features are removed, the simplified CatBoost model performs very similarly to the one trained on all features across all of the evaluation metrics. Specifically, the F1 score drops by only 0.0005, while the log loss increases by only 0.0002.

Hyperparameter tuning

The next step I can take to improve the performance of the CatBoost model is hyperparameter tuning. I will use hyperopt, which applies Bayesian optimisation to search for the optimal hyperparameter values of machine learning models.
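A minimal sketch of how hyperopt could drive the search (the search space below is an illustrative assumption, not the exact one used):

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.model_selection import cross_val_score
from catboost import CatBoostClassifier

# Illustrative search space for a few CatBoost hyperparameters
space = {
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
    "depth": hp.quniform("depth", 4, 10, 1),
    "l2_leaf_reg": hp.uniform("l2_leaf_reg", 1, 10),
}

def objective(params):
    model = CatBoostClassifier(
        learning_rate=params["learning_rate"],
        depth=int(params["depth"]),
        l2_leaf_reg=params["l2_leaf_reg"],
        scale_pos_weight=ratio,
        cat_features=cat_features,
        verbose=0,
    )
    # Minimise the mean 5-fold cross-validated log loss
    loss = -cross_val_score(model, X_train, y_train, cv=5, scoring="neg_log_loss").mean()
    return {"loss": loss, "status": STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=Trials())
```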

With some hyperparameter tuning, the CatBoost model further improves its performance, achieving both a lower log loss and a higher F1 score on the test set. Specifically, the log loss decreased by 0.007 while the F1 score increased by 0.004 on the test set for the CatBoost model with tuned hyperparameters. We are now ready to look deeper into the model's performance and identify features that are closely related to customer churning.

Looking deeper into the CatBoost model

With the primary business goal of finding potentially churning customers so that the bank can reach out in advance for retention, it is necessary to look deeper into the recall of the model, since this metric is the most relevant in this context. Looking at the classification report below, the model performs really well at identifying churning customers, with a recall of about 0.9628, only about 0.04 below the maximum possible value of 1.
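The report itself can be generated with scikit-learn; in the sketch below, final_model is assumed to be the tuned CatBoost classifier fitted on the training set:

```python
from sklearn.metrics import classification_report

# final_model: the tuned CatBoost classifier (hypothetical variable name)
y_pred = final_model.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
```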

Even though the precision of the model is only about 0.8645, in this context a model which generates more false positives (i.e. predicts that some existing customers will churn) is likely more desirable than one which is worse at correctly flagging customers who will churn soon, because failing to identify churning customers means the bank losing sources of revenue.

But how should the bank reach out to customers? In other words, which features of the customers should the bank design its retention strategies around? We can identify the features that matter most for predicting customer churning by using the SHAP values of each feature, which gauge the impact on the model's prediction of a feature taking a certain value relative to its baseline value.
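The SHAP values for the CatBoost model can be computed with the shap package, roughly as sketched below (final_model again denotes the tuned classifier):

```python
import shap
from catboost import Pool

# TreeExplainer supports CatBoost; pass a Pool so categorical features are handled
explainer = shap.TreeExplainer(final_model)
shap_values = explainer.shap_values(Pool(X_test, y_test, cat_features=cat_features))

# Bar plot of mean absolute SHAP values, then the beeswarm plot
shap.summary_plot(shap_values, X_test, plot_type="bar")
shap.summary_plot(shap_values, X_test)
```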

In terms of the magnitude of each feature's effect on the model's prediction, the bar plot below, which shows the mean absolute SHAP value of each feature, indicates that Total_Trans_Ct and Total_Trans_Amt are the two most important features in predicting whether a customer will churn. Total_Revolving_Bal ranks third in importance. But what about the direction of each feature's impact on the model?

We can look deeper into how each feature's value is associated with the model's output using the shap module's beeswarm plot. Blue represents low feature values, whereas red represents high feature values; the model output is the probability of a customer churning as predicted by the model. From the plot below, there are several interesting observations:

  1. The higher the total transaction count in the last year for a customer, the less likely he/she is to churn.
  2. Surprisingly, a high transaction amount in the last 12 months actually increases the probability of a customer churning.
  3. The larger the increase in a customer's transaction count and amount from Q4 to Q1, the less likely he/she is to churn.
  4. If a customer contacted the bank more often in the last year, then he/she is more likely to churn later.
  5. Lastly, customers who buy more products from the bank are less likely to churn.

Based on the above SHAP value plots, I suggest that the bank design its customer retention strategies with the following points in mind:

  1. The bank should reach out to customers whose transaction counts in the last 12 months are below the average level and investigate the reasons behind this in order to devise strategies for retaining them, since low transaction counts might mean they are no longer willing to use the bank's credit cards.
  2. It is advisable to provide interest rate discounts for customers with high transaction amounts so that they will be less likely to opt for business competitors who offer interest rates that decrease as credit card spending increases.
  3. It may also be beneficial to promote other products while reaching out to potentially churning customers, since increasing the number of products they purchase from the bank may reduce their probability of churning later as well.
  4. Lastly, it is imperative that the bank address customers' concerns or complaints efficiently, particularly when a customer has been contacting the bank more frequently than before, since frequent contact may signify that the customer is not completely satisfied with the bank's services and may churn if his/her demands are not addressed properly.