Customer Churn Prediction with R

Churn prediction can help a business forecast the potential churn rate for a customer.Subsequently, a business can use churn prediction to better retain customers. The cost of acquiring a new customer is substantially higher than keeping an existing one and a business can save a lot of money by retaining their current customers. Predicting churn rate allows a business to do that at a higher level. Customers leave their companies frequently and even with low percentages of customer churn a business can lose a lot of money due to customers leaving. Thus, it is important for a business to smartly use their data from the past and present to gain a better understanding of their customers. Having the ability to predict if a customer is going to leave is a very valuable tool that can improve a companies’ revenue.

The data used in this project has been gathered from IBM community datasets and consist of 21 features and 7043 observations. Each feature lists the characteristics of each customer.

To prepare the data for further analysis, we started with identifying the missing values and discovered that the dataset had 11 missing values. Therefore, eliminated all rows containing these values.We then Identified variable types, created convenience vectors, transformed categorical variables into factors and eliminated irrelevant variables like customerID.

Exploratory data analysis

To understand the numeric variables and their relationship with each other, we plot a correlation matrix.

Due to the correlation between total charges and monthly charges, we removed totalCharges to prevent multicollinarity.

We then checked for outliars in the data, The following Box plots show that there are no outliars in our numeric data.

All of the categorical variables have a very wide distribution, and thus all of them will be included for the analysis.

Churn Prediction Model:

We split the sample dataset into two subsets: a training set and a test set. The customers are randomly assigned to one of the two subsets and each of the two subsets is given a random group of clients. To construct our models, the training dataset is utilized. Later, we use the test set to evaluate the model's predictability. For the data we used a 80/20 split for training and testing sets.

Decision Tree:

The Dectision tree model indicates that Contract, InternetService and Tenure are some important factors that determine customer churn.

We tested our decision tree model on the test set and evaluated the model performace using a confusion matrix which is table of the number of correct and incorrect predictions made by the model.

Decision Tree Accuracy: 78.2%

Random Forest:

A random forest classifier combine several decision trees to reduce overfitting and bias-related inaccuracy, producing good and more accurate outcome.

The variable importance plot below shows that tenure, MonthlyCharges and contract are the three most important variables in determing customer churn for our dataset.

Random Forest Accuracy: 79.2%

Click to view Code