Starbucks Capstone Challenge

Data Scientist Nanodegree — Capstone Project

The data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

The data used in this project is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. In the data set, informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, we can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

In this project, I combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type, and I do some experiments to find a model that predicts if a user will view or complete an offer.

There are three data sets:

Here is the schema and explanation of each variable in the files:

portfolio.json

- id (string): offer id
- offer_type (string): type of offer, i.e. BOGO, discount, informational
- difficulty (int): minimum required spend to complete an offer
- reward (int): reward given for completing an offer
- duration (int): time the offer is valid, in days
- channels (list of strings)

profile.json

- age (int): age of the customer
- became_member_on (int): date when the customer created an app account
- gender (str): gender of the customer ('M', 'F', or 'O' for other)
- id (str): customer id
- income (float): customer's income

transcript.json

- event (str): record description (transaction, offer received, offer viewed, offer completed)
- person (str): customer id
- time (int): time in hours since the start of the test; the data begins at time t=0
- value (dict of strings): either an offer id or a transaction amount, depending on the record

In this project, I’ll go through the steps listed below:

1. Explore each dataset to understand the available features.
2. Clean and preprocess the data.
3. Answer the demographic questions with a statistical, exploratory analysis.
4. Build and compare ML models to predict whether a user will view or complete an offer.

I’ll use the data provided by Starbucks to answer the following questions:

1. What is the mean age of the customers?
2. Which is the most successful offer in the portfolio?
3. Which offers are the most successful for each age group?
4. Which offers are the most successful for each gender?
5. Can we predict whether a user will view or complete an offer?

I've done some experiments to find a model that predicts whether a user will view or complete an offer. This is a classification problem, and the metrics commonly used in these cases are accuracy, precision, recall, and the F1-score.

After exploring the dataset, I decided to use the F1-score as the main metric to evaluate the models: the classes are not completely balanced (roughly 60–40), and I didn’t want to risk drawing wrong conclusions by relying only on the accuracy, which is unreliable when the dataset is unbalanced.

However, I’ll present a comparison of all the metrics in a table.

First, I need to get an idea of the datasets to figure out which features are available and whether I need to transform them into a new dataset that better fits our needs. For instance, I need to know whether the datasets contain null values, their sizes, the format and range of the values in each column, how the values are distributed, whether there are potential outliers, etc.

The first rows of the portfolio data set are shown below:

Portfolio Dataset

The distribution of the values in the numerical features is shown in the images below.

Portfolio dataset — Distribution of the values in the numerical features

The exploration above shows that:

The first rows of the profile data set are shown below:

Profile dataset head

The distribution of the values of the numerical features is shown in the images below.

As we can see, there are some null values in the profile dataset. The gender feature takes values in the set [None, ‘F’, ‘M’, ‘O’].

The statistics of the numerical features are shown in the table below.

The exploration I’ve conducted has shown that:

The first rows of the transcript data set are shown below:

Transcript dataset

The exploration I’ve conducted on this dataset has shown that:

In the portfolio dataset, we have seen that the channels feature contains a list of values. For our analysis, it’s better to transform it into a series of dummy variables.
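
A minimal sketch of this transformation, assuming the file is newline-delimited JSON (as in the Udacity workspace) and that channels holds lists such as ['email', 'web', 'mobile']:

```python
import pandas as pd

# Load the portfolio data (assumed to be newline-delimited JSON).
portfolio = pd.read_json('portfolio.json', orient='records', lines=True)

# Turn the lists in 'channels' into one 0/1 dummy column per channel.
channel_dummies = portfolio['channels'].str.join('|').str.get_dummies()
portfolio = portfolio.drop(columns='channels').join(channel_dummies)
```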

In the transcript dataset, we should transform the value column. It contains a sort of dictionary with one key and one value, so it’s better to create dedicated columns for each kind of value.
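
A sketch of the split, assuming the dictionary keys are 'offer id', 'offer_id', and 'amount' (the spelling of the offer-id key varies between event types in this dataset):

```python
import pandas as pd

transcript = pd.read_json('transcript.json', orient='records', lines=True)

# Pull the offer id (spelled 'offer id' or 'offer_id' depending on the
# event type) and the transaction amount into their own columns.
transcript['offer_id'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))
transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))
transcript = transcript.drop(columns='value')
```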

I found some entries with an age of 118. Considering that such a high age could be an outlier, I checked whether the 118-year-old profiles have missing or odd data in the other columns, and I found that all those rows have null values in the gender and income columns. For this reason, I decided to treat them as outliers and removed them from the dataset.
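
In pandas terms, the check and the removal could look like the sketch below (assuming the profile data is loaded into a DataFrame named profile):

```python
import pandas as pd

profile = pd.read_json('profile.json', orient='records', lines=True)

# Fraction of null gender/income among the age-118 rows; both print as 1.0.
suspicious = profile[profile['age'] == 118]
print(suspicious[['gender', 'income']].isnull().mean())

# Treat those rows as outliers and drop them.
profile = profile[profile['age'] != 118].reset_index(drop=True)
```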

Below is the distribution of the age after removing the rows with 118 in the age column; it clearly has a Gaussian shape.

Profile dataset — Age distribution after removing the rows with value 118

This influenced the stats of the age in the dataset, which became:

Profile Dataset — Age stats
Transcript dataset after creating the new features from the value column

Now that I’ve curated the datasets, I can proceed with the exploratory analysis, using a statistical approach to answer the questions above (except the last one, which needs an ML model).

The mean age can be extracted directly from the profile dataset and, after removing the outliers, I’ve found that it is 54. Below I show the detailed distribution, which has a Gaussian shape centered on the mean value.

To answer this question, I’ve filtered the transcript dataset to create two datasets, one with the offer received rows and one with the offer completed rows of the event column, because I’m not interested in the transaction amounts or in whether a profile merely viewed an offer.
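
A sketch of this filtering and of the per-offer success rate, continuing from the transcript cleaning above (the offer_id column is the one introduced there):

```python
received = transcript[transcript['event'] == 'offer received']
completed = transcript[transcript['event'] == 'offer completed']

# Success rate per offer: completions divided by receipts
# (informational offers are never 'completed', hence the fillna).
success_rate = (
    completed['offer_id'].value_counts()
    / received['offer_id'].value_counts()
).fillna(0).sort_values(ascending=False)
print(success_rate)
```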

The resulting bar chart shows that the best offer in the portfolio, in terms of success, is number 6. After that, the most successful offers are numbers 5, 8, and 3.

To answer this question, I first needed to merge the two transcript data frames I created in the step above (offers completed and offers received) with the profile dataset, to include the age column of the profiles.

Below is an example of the success rate by age for offers 6 and 8.

The success rate of the offers 6 and 8

Dividing by age using every possible age value is not very helpful, and probably not necessary, because the variables we are working with do not change fast. For this reason, I decided to group the profiles by decade.
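
One possible way to build the decade groups is pd.cut; the bin edges below are an assumption:

```python
import pandas as pd

# Map each age to its decade: [10, 20) -> '10s', [20, 30) -> '20s', ...
bins = list(range(10, 121, 10))
labels = [f'{b}s' for b in bins[:-1]]
profile['age_group'] = pd.cut(profile['age'], bins=bins,
                              right=False, labels=labels)
```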

Then I did the same grouping on the data frames of completed and received offers, and used them to find the success rate of the offers for each age group.

The success rate of the offers for each age group

The graph above shows the most successful offers for each age group, which are:

That being said, offer 6 is the best offer overall because it has a consistently high success rate across all age groups.

To answer this question, I’ve again merged the two datasets created from the transcript using the event column (offers received and completed) with the profile dataset, this time to add the gender column.

Then I calculated the success rate for each offer, splitting the data by gender. The results are shown in the graph below.

The success rate of the offers by gender

The graph above shows the most successful offers for each gender, which are:

That being said, offer 8 seems to be effective only on people with gender value equal to O, while offer 6 seems to be the most effective overall.

The last step was required to answer the final question, which needs an ML model.

After the above analysis, I wanted to go on and try to find a model that can help me predict the response of a customer to an offer. To do this, the first thing is to decide which covariates the model should use and which target variable we want to predict.

I created one DataFrame containing the data from all the datasets. By doing this, all the records already linked to each other were grouped in a single row. For this first approach, I decided to remove the rows containing missing values.
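
A sketch of the merge, assuming the cleaned DataFrames from the previous steps and their key columns (person and offer_id in the transcript, id in the profile and portfolio):

```python
# Link each transcript row to its customer profile and its offer details.
df = (
    transcript
    .merge(profile, left_on='person', right_on='id', how='left')
    .merge(portfolio, left_on='offer_id', right_on='id',
           how='left', suffixes=('_profile', '_offer'))
)

# For this first approach, drop the rows containing missing values.
df = df.dropna().reset_index(drop=True)
```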

The merged and curated dataset contained 148805 rows without null values.

Merged dataframe info

To help the model work better, I also transformed the string values into numerical values. In the end, to train the models, I’ve used the input (X) and target (y) arrays below:

X data frame — Input features for the models
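
As a sketch of the encoding step, assuming the merged DataFrame df from above with event as the target:

```python
from sklearn.preprocessing import LabelEncoder

# Encode every string column (gender, offer_type, event, ids, ...) as
# integers so the models can consume them.
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns='event')
y = df['event']
```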

The first thing I did, to figure out which metric to use to evaluate the models, was to check whether the dataset was balanced in terms of events.

The events transformed into numeric values are:

By removing the null values from the dataset, I lost all the transaction events, which is good because I don’t want to predict whether there was a transaction. Actually, I don’t even want to predict whether an offer was received, so I can remove the rows with event equal to 0. Let’s see if the y dataset becomes balanced.

The dataset is not perfectly balanced, with 60.58% offers received and 39.41% offers completed, even though we can’t consider it very unbalanced. Thus, I’ll calculate all the metrics for completeness, but I’ll use the F1-score as the metric to assess the performance of the models, because the F1-score in its weighted form takes the class balance into account and is more reliable than the accuracy when working with unbalanced classes.

To find an answer to the question, I’ve decided to use and compare the models below (descriptions from Wikipedia):

Below is the comparison table of the models I tried, listing the values of each metric, together with the F1-score of the best model after tuning some hyperparameters:

Bar graph of the performance metrics for each model
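
A minimal sketch of how such a comparison could be run; only the three models named in the text are included, and the split parameters are assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boost': GradientBoostingClassifier(random_state=42),
}

# Fit each model and report all four metrics; the weighted averages
# account for the 60/40 class balance.
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f'{name}: '
          f'acc={accuracy_score(y_test, y_pred):.3f}, '
          f'prec={precision_score(y_test, y_pred, average="weighted"):.3f}, '
          f'rec={recall_score(y_test, y_pred, average="weighted"):.3f}, '
          f'f1={f1_score(y_test, y_pred, average="weighted"):.3f}')
```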

The best model overall in terms of F1-score is the Gradient Boost Classifier (along with the Logistic Regression). The Random Forest Classifier was good too. However, the Gradient Boost Classifier is the best model when it comes to predicting events on unseen data, and the standard deviation of its scores is the lowest. This means the Gradient Boost Classifier has the lowest risk of overfitting the data.

This result is not surprising: the Gradient Boost Classifier is an ensemble technique that reduces variance by building trees sequentially, and reduces bias by combining many weak predictors such as decision trees. It also uses a learning rate to keep the model from overfitting; by tweaking the learning rate, we can find the right balance between low bias and low variance.

This makes it a robust model we can rely on. Thus, I was interested in seeing if I could improve the F1-score for the training and test datasets.

I’ve found that the best model in terms of F1-score for both the train and test datasets is the Gradient Boost Classifier. I wanted to see whether there was a better set of hyperparameters that would improve the model’s performance. To do this, I took advantage of cross-validated GridSearchCV one more time, adding some other hyperparameters to the parameter grid, including the learning rate and the maximum depth, which are the most important parameters of this predictor.
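
A sketch of the search; the grid values below are illustrative assumptions, not the ones used in the project:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid including the learning rate and maximum depth.
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [2, 3, 5],
    'n_estimators': [100, 200],
}

# 5-fold cross-validated search optimizing the weighted F1-score.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```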

The best model calculated using the GridSearchCV reached an F1-score of 0.755 for the training set and 0.753 for the test set.

The performance on the training dataset didn’t improve, but the model did better on new data. This means we have found an improved model, based on the Gradient Boost Classifier, that will predict events with better performance when deployed in production.

The study showed that:

The best performance, considering both the training and test datasets, is obtained using the Gradient Boost Classifier with:

The Gradient Boost Classifier is sensitive to outliers that have larger residuals compared to non-outliers, and it could overfit the data if the number of trees is too large. Thus, a further improvement could be to try to:

One thing that surprised me was finding out that people between 100 and 109 years old, like the people with gender ‘O’, usually prefer offer 8. In a future study, I could check whether there’s a correlation between those two groups.
