Predicting Customer Churn Using PySpark on Amazon EMR

Vincent Chong
10 min read · Feb 14, 2023

Problem Introduction

Churn prediction is a crucial task in data science, as it helps businesses identify customers who are likely to leave and take actions to retain them. By analyzing historical data on customer behaviour, a churn prediction model can identify patterns and trends that indicate a high risk of churn. This allows a company to take proactive measures to retain at-risk customers, such as offering promotions or providing personalized customer service. Overall, churn prediction can help companies increase revenue and customer loyalty by reducing customer turnover.

Strategy to solve the problem

In this project, we will be attempting to predict customer churn based on a dummy dataset provided by Udacity as part of the Data Scientist Capstone Project. Due to the large size of the dataset, we will be conducting the entire assessment using PySpark on Amazon Web Services (AWS).

We start by loading and cleaning the dataset before conducting some Exploratory Data Analysis (EDA) to understand the dataset better.

Once the dataset is cleaned, we proceed to Feature Engineering to transform the data so that it is suitable to be used in machine learning models.

The dataset is split into training and testing sets. After training a machine learning model on the training set, its performance is evaluated on the testing set. This process is repeated with multiple machine learning models for performance comparison.

The best performing machine learning model will be chosen based on the F1 score, because the dataset is imbalanced (it contains far more users who did not churn than users who churned).

Metrics

The metric used to measure model performance is the F1 score. This choice follows from the characteristics of the problem: it is a classification problem with an imbalanced target variable. In other words, there are far more users who did not churn than users who churned in the dataset. Accuracy would be misleading here, as a model could achieve a high accuracy score simply by predicting that no users churn at all.
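For reference, the F1 score is the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)

A model that labels every user as not churned would still score well on accuracy for this dataset, but its recall on churned users, and therefore its F1 score for that class, would be near zero.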

Launch EMR Cluster and Notebook on AWS

We will be working on the project in Amazon EMR. Below is a short description of Amazon EMR.

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, Presto, and more.

Before starting on EMR Studio with Workspaces, it is important to first create an IAM role with the correct permissions to use EMR. Otherwise, you will not be able to create the EMR cluster and link it to the Notebook.

Amazon EMR for Data Science

Exploratory Data Analysis

Data cleaning is a crucial step in the data science process, as it ensures that the data used for analysis is accurate, reliable, and ready for use. Data cleaning involves identifying and removing errors, inconsistencies, and outliers in the data, as well as formatting the data in a consistent and usable format.

First, we will read in the full Sparkify dataset with a file size of 12GB. Due to the large size of the dataset, we will be conducting the entire assessment with PySpark on AWS.
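Below is a minimal sketch of how the data can be loaded. The S3 path is an assumption based on the Udacity project setup; substitute the location of your own copy of the dataset.

```python
from pyspark.sql import SparkSession

# Create a Spark session on the EMR cluster
spark = SparkSession.builder.appName("Sparkify Churn").getOrCreate()

# Assumed location of the full Sparkify event log (~12 GB)
event_data = "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"
df = spark.read.json(event_data)

df.printSchema()   # columns of the dataset
df.show(5)         # first few rows
print(df.count())  # total number of rows
```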

The dataset columns and first few rows are shown below.

Columns of Dataset
First Few Rows and Columns of Dataset
Total Number of Rows for Dataset

Next, we will conduct Exploratory Data Analysis (EDA) on the dataset to better understand the data.

A simple check confirmed that there are no duplicates found in the dataset.

A check for missing values confirmed that there are quite a significant number of missing values for the columns of artist, firstName, lastName, gender, length, location, registration, song and userAgent.
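A sketch of both checks in PySpark, assuming the DataFrame `df` from the loading step:

```python
from pyspark.sql import functions as F

# Duplicate check: compare row counts before and after dropDuplicates()
print(df.count() - df.dropDuplicates().count())  # 0 means no duplicates

# Missing-value check: count nulls in every column
df.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns
]).show()
```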

Number of Missing Values in Dataset

There are a large number of missing values in the artist and song columns. Checking the distinct pages on which artist and song are missing shows that NextSong is the only page where these columns are always populated. In other words, when a user visits a page that does not play a song, no artist or song is recorded in those columns. It is therefore worth keeping these rows so that the algorithm can study user behaviour based on page visits.

There are also no missing values in the userId and page columns. Hence, we will be using the entire dataset for model training.

DataFrame of userId

Total Users

Aggregating the userId column shows that there are 22,278 unique users in the dataset.
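For reference, a count along these lines produces that figure:

```python
# Number of distinct users in the event log
n_users = df.select("userId").distinct().count()
print(n_users)  # 22,278 in the full dataset
```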

Total Unique Users in Dataset

Total Users Churned vs Users Not Churned

The number of users who churned vs users who did not churn in the dataset is shown below:
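One way these counts can be derived is sketched here. It assumes that a visit to the Cancellation Confirmation page marks a churned user, which is the standard churn definition for the Sparkify dataset:

```python
from pyspark.sql import functions as F

# Users who hit the Cancellation Confirmation page are treated as churned
churned_users = (df.filter(F.col("page") == "Cancellation Confirmation")
                   .select("userId").distinct())

n_churned = churned_users.count()
n_not_churned = df.select("userId").distinct().count() - n_churned
print(n_churned, n_not_churned)
```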

Users Churned vs Users Not Churned in Dataset
Bar Chart: Users Churned vs Users Not Churned

Based on the bar chart above, we can see that the data is imbalanced between users who churned and users who did not churn. Care will need to be taken when preparing the data for the machine learning algorithm and when evaluating the results.

Gender Distribution

Gender is evenly distributed, as shown in the bar chart below:

Gender Distribution in Dataset
Bar Chart: Gender Distribution

Average Number of Page Visits per User

A user’s page visits can be a useful indicator of their behaviour. Aggregating the total page visits of all users shows that the number of visits varies widely.
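A sketch of these aggregations, reusing `df` from above (the per-user pivot is one possible way to turn page counts into features):

```python
# Total visits per page across all users
df.groupBy("page").count().orderBy("count", ascending=False).show()

# Per-user visit counts, one column per page
user_page_counts = df.groupBy("userId").pivot("page").count().fillna(0)
```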

Total Page Visits by All Users

The average page count per user is tabulated below:

Average Page Count per User
Bar Chart: Average Page Count per User

Feature Engineering

Feature engineering is a crucial step in the data science process, where raw data is transformed into useful features for machine learning models. It involves techniques such as feature scaling, normalization, feature selection and creating new features from existing data. Proper feature engineering can greatly improve the performance and accuracy of the model.

Calculating Average Page Count per Day for Each User

Instead of using each user’s total page count to train the machine learning model, we will average the page counts per day for each user. The idea is that some users may have been using Sparkify for much longer than newer users, so their accumulated page counts would differ simply because of tenure. Taking the average page count per day for each user therefore allows the machine learning model to differentiate users more accurately.
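A sketch of this calculation. It assumes the event timestamp column is `ts` in milliseconds, as in the Sparkify schema:

```python
from pyspark.sql import functions as F

# Convert the millisecond timestamp to a calendar date
df_days = df.withColumn("date", F.to_date(F.from_unixtime(F.col("ts") / 1000)))

# Page counts per user per day, then the mean across days for each user
daily_counts = df_days.groupBy("userId", "date", "page").count()
avg_daily = (daily_counts.groupBy("userId", "page")
                         .agg(F.avg("count").alias("avg_daily_count")))

# Pivot so each page becomes one feature column per user
avg_features = (avg_daily.groupBy("userId")
                         .pivot("page")
                         .agg(F.first("avg_daily_count"))
                         .fillna(0))
```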

Calculating the Variance of Total Page Count for Each User

Another possible way of differentiating user behaviour is to include a variance calculation based on page counts for each user. A user’s variance is calculated over their total page count for each day.

The calculated variance will be large for users who have low page counts on some days and high page counts on others, and low for users whose page counts are consistent from day to day.

This may or may not turn out to be a useful indicator for predicting churn.
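A sketch of this, reusing `df_days` from the previous step:

```python
from pyspark.sql import functions as F

# Total page count per user per day
daily_totals = df_days.groupBy("userId", "date").count()

# Variance of the daily totals; users active on only one day have an
# undefined variance, which we fill with 0
user_variance = (daily_totals.groupBy("userId")
                             .agg(F.variance("count").alias("page_count_var"))
                             .fillna(0))
```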

Calculated Variance for Each User

We will also create a new column called label where:
0 = user churned
1 = user not churned

Note that this labelling choice is because the MulticlassClassificationEvaluator has its default metricLabel set to 0. In other words, the positive class is denoted as 0 by the evaluator, which we will be using to evaluate the machine learning models.
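One way to build this label column, reusing `churned_users` from the EDA step:

```python
from pyspark.sql import functions as F

# Left-join the churned users (label 0); everyone else gets label 1
labels = (df.select("userId").distinct()
            .join(churned_users.withColumn("label", F.lit(0)), "userId", "left")
            .fillna(1, subset=["label"]))
```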

After consolidating the calculated average page count and variance into a single DataFrame, we will have a DataFrame similar to below:

DataFrame after Feature Engineering

Train Test Split

We will split the dataset into training and testing sets. The training set will be used to train the machine learning model, while the testing set will be used to evaluate the performance of the trained model.

Below is the total count of the users churned (0) and users not churned (1).

Total Count

After splitting the dataset so that 80% of the data is used for the training set and the remaining 20% for the testing set, we will get something similar to below:
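A minimal sketch of the split, where `features_df` stands in for the consolidated feature DataFrame (the name is an assumption) and the seed simply makes the split reproducible:

```python
# 80/20 train-test split
train, test = features_df.randomSplit([0.8, 0.2], seed=42)

train.groupBy("label").count().show()
test.groupBy("label").count().show()
```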

Training Set Count
Testing Set Count

Due to the dataset imbalance, we will attempt to increase the amount of label 0 data using oversampling. Oversampling is the process of artificially increasing the number of samples by randomly duplicating records from the minority class.
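One way this can be done in PySpark (a sketch; sampling with replacement at a fraction above 1 yields roughly the desired number of duplicates):

```python
from pyspark.sql import functions as F

# Class counts in the training set; Rows behave as (label, count) tuples
counts = dict(train.groupBy("label").count().collect())
ratio = counts[1] / counts[0]  # copies of the minority class needed

# Duplicate minority-class (label 0) rows by sampling with replacement
minority = train.filter(F.col("label") == 0)
oversampled = minority.sample(withReplacement=True, fraction=ratio, seed=42)
train_balanced = train.filter(F.col("label") == 1).union(oversampled)
```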

Training Set Count after Oversampling

It is important to note that oversampling should only be done on the training set, never before the dataset is split into training and testing sets. Otherwise, some of the duplicated data might appear in both the training and testing sets, causing the machine learning model to overfit and the evaluation to be overly optimistic.

Modelling

Choosing the right machine learning model is important, as it can greatly impact the performance and accuracy of the predictions.

For this project, we are dealing with a classification problem. Therefore, we will test out several machine learning models which are suitable for classification problems:

  • Logistic Regression (with hyperparameter tuning)
  • Random Forest Classifier
  • Gradient Boosting Tree Classifier

Note: Logistic Regression modelling is done with hyperparameter tuning as an example to show the workflow of finding the hyperparameters that give the best model performance. More details in the Hyperparameter Tuning section below.
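A sketch of the shared training and evaluation workflow for all three models. The feature columns are assumptions carried over from the feature-engineering sketches above:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import (LogisticRegression,
                                       RandomForestClassifier, GBTClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assemble every non-ID, non-label column into a single feature vector
assembler = VectorAssembler(
    inputCols=[c for c in train_balanced.columns if c not in ("userId", "label")],
    outputCol="features")

evaluator = MulticlassClassificationEvaluator(metricName="f1")

for clf in (LogisticRegression(), RandomForestClassifier(), GBTClassifier()):
    model = clf.fit(assembler.transform(train_balanced))
    predictions = model.transform(assembler.transform(test))
    print(type(clf).__name__, evaluator.evaluate(predictions))
```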

Logistic Regression

The performance results for Logistic Regression are shown below:

Logistic Regression Model: Prediction vs Label
Logistic Regression Model: Confusion Matrix

Based on the confusion matrix, the results are interpreted as:
True positive = 609
False positive = 404
True negative = 2532
False negative = 954

Logistic Regression Model: F1 Score

Random Forest Classifier

Random Forest Classifier Model: Prediction vs Label
Random Forest Classifier Model: Confusion Matrix

Based on the confusion matrix, the results are interpreted as:
True positive = 805
False positive = 209
True negative = 3203
False negative = 282

Random Forest Classifier Model: F1 Score

Gradient Boosting Tree Classifier

Gradient Boosting Tree Classifier Model: Prediction vs Label
Gradient Boosting Tree Classifier Model: Confusion Matrix

Based on the confusion matrix, the results are interpreted as:
True positive = 785
False positive = 215
True negative = 3212
False negative = 282

Gradient Boosting Tree Classifier Model: F1 Score

Hyperparameter Tuning

Hyperparameter tuning is done for the Logistic Regression model as an example. Logistic Regression has multiple hyperparameters that can be set, such as regParam and elasticNetParam, which both have a default value of 0. Different hyperparameter values may yield different model performance.

To find optimal hyperparameter values for Logistic Regression, we loop through the training process multiple times with different hyperparameter values and compare the model performances. The hyperparameter values that yield the best model performance will then be selected for the final trained model.
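A sketch of this loop using PySpark’s built-in grid search; the grid values are illustrative assumptions:

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

# Cross-validation picks the combination with the best F1 score
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(assembler.transform(train_balanced))
best_lr = cv_model.bestModel  # trained with the best hyperparameters
```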

Conclusion / Reflection

After training the models and obtaining the F1 scores for each model, we can see that the Random Forest Classifier model achieved the highest F1 score.

F1 Score Comparison

So far, we have only done hyperparameter tuning for the Logistic Regression model, and yet Random Forest still delivered the highest model performance. This makes sense, as Random Forest is known to perform well thanks to characteristics such as not assuming a linear relationship in the data, implicitly weighting the features it considers most important, and using ensemble learning.

Future Works / Improvements

  1. To try working with different time windows, such as only taking three weeks of a user’s activity. The idea is that user behaviour may change over time, so activity from long ago may no longer be relevant.
  2. To further explore other possibilities of creating new features that can improve model performance.
  3. To try other classification models such as Support Vector Machine for comparison.
  4. To try hyperparameter tuning for other models too.

Experience on Using Amazon EMR for Data Science

Some of the technical difficulties I experienced when trying to run the notebook on Amazon EMR were:

  1. You need to register yourself as an IAM user with the right permissions to use Amazon EMR. Otherwise, you will have issues such as not being able to connect the clusters to your workspace notebook. It took me a while to figure this out because the error did not specifically mention that it was a permission issue.
  2. Some common libraries such as Pandas and Matplotlib are not available in Amazon EMR and need to be installed manually before importing.
  3. Plotting graphs follows a different set of steps in the Amazon EMR PySpark environment compared to the Udacity workspace.

Due to debugging and solving errors that arose when transferring the Notebook from the Udacity workspace to the Amazon EMR workspace, and rerunning the cases multiple times, the entire process ultimately cost me $30.

Github Repo

https://github.com/vincent-chw/Udacity_Sparkify.git
