Summary of a real-life Challenge:

This worldwide company has collected a large volume of data about its existing customers, including demographic features and information about purchases they have made. The company is particularly interested in analyzing customer data to determine any apparent relationships between the demographic features known about the customers and the likelihood of a customer purchasing a product. Additionally, the analysis should determine whether a customer's average monthly spend with the company can be predicted from known customer characteristics.

About the Dataset:

This dataset consists of two files: Customers.csv (customer demographic data) and Sales.csv (sales data for existing customers).

Some of the initial questions were:

  1. Can you tell us the median Yearly Income, ranked by Occupation?

  2. Is the same order reflected in the median Average Monthly Spend by Occupation?

  3. Which group of customers accounts for a distinctly high range of Average Monthly Spend values?

  4. Are there any apparent correlations with Average Monthly Spend?

  5. Apply what you have learned about the data to building, testing, and optimizing a predictive machine learning model.

After preliminary data exploration, I used descriptive and multivariate analysis along with summary statistics to determine the features of interest in relation to the response variable.

Correlation coefficients were determined using statistical methods and visualizations. After computing descriptive statistics for the features of interest, I built a binary classification model that predicts the likelihood of a customer purchasing a product, and a regression model that predicts a customer's average monthly spend with the company from these features.

After cleaning the data, I performed feature engineering on the dataset to create an important but missing Age variable (sketched below).
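The code for this step is not shown here; the following is a minimal sketch of how such an Age feature could be derived, assuming the raw Customers.csv contains a BirthDate column and using a hypothetical data-collection date as the reference point (both assumptions, not given in the text above):

awcustomer <- read.csv("Customers.csv", stringsAsFactors = FALSE)
reference_date <- as.Date("1998-01-01")             # assumed collection date
birth_date     <- as.Date(awcustomer$BirthDate)     # assumed column name
awcustomer$Age <- as.integer(floor(
  as.numeric(difftime(reference_date, birth_date, units = "days")) / 365.25))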

A look at the cleaned and normalized dataset:

| CustomerID | City | StateProvinceName | CountryRegionName | Education | Occupation | Gender | MaritalStatus | HomeOwnerFlag | NumberCarsOwned | NumberChildrenAtHome | TotalChildren | YearlyIncome | Age | BikeBuyer | AvgMonthSpend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11000 | Rockhampton | Queensland | Australia | Bachelors | Skilled Manual | M | M | 1 | -0.2957460 | 1.1632792 | 1 | -0.3341607 | 31 | YES | 2.0779835 |
| 11001 | Seaford | Victoria | Australia | Bachelors | Skilled Manual | M | S | 0 | -0.2957460 | -0.5943713 | 0 | -0.4046181 | 32 | NO | 1.1500079 |
| 11002 | Hobart | Tasmania | Australia | Bachelors | Clerical | M | M | 1 | 0.7983891 | 1.1632792 | 2 | 0.4237297 | 31 | YES | 0.0707635 |
| 11003 | North Ryde | New South Wales | Australia | Bachelors | Management | F | S | 0 | 0.7983891 | -0.5943713 | 0 | 1.2803321 | 29 | YES | 1.1150997 |

The histogram above shows that the Yearly Income of customers is slightly right-skewed; in other words, while some customers have a high yearly income, most are at the lower end of the income range.

Next, after exploring the income feature, I looked at relationships between some of the variables in the dataset.

The following scatter-plot matrix was generated initially to compare numeric features with one another. The key features in this matrix are shown here:
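The matrix itself is not reproduced here; below is a minimal sketch of how such a plot can be generated with base R, assuming the cleaned data frame is called new.awcustomer (as it is in the modelling code later on):

numeric_features <- c("YearlyIncome", "Age", "NumberCarsOwned",
                      "TotalChildren", "HomeOwnerFlag", "AvgMonthSpend")
pairs(new.awcustomer[, numeric_features],
      main = "Scatter-plot matrix of key numeric features",
      pch = 20, col = rgb(0, 0, 1, 0.2))   # semi-transparent points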

The scatter-plot matrix above shows an apparent relationship between YearlyIncome and some of the numeric features, specifically NumberCarsOwned and AvgMonthSpend; Age, HomeOwnerFlag, and TotalChildren also appear to be related to one another.

The correlation coefficients of the numeric columns were then calculated, with the following results:
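A sketch of the computation with base R's cor(), again assuming the new.awcustomer data frame:

numeric_cols <- c("CustomerID", "HomeOwnerFlag", "NumberCarsOwned",
                  "NumberChildrenAtHome", "TotalChildren",
                  "YearlyIncome", "Age", "AvgMonthSpend")
round(cor(new.awcustomer[, numeric_cols], use = "complete.obs"), 7)   # pairwise Pearson correlations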

Correlation matrix of some numeric features:

|  | CustomerID | HomeOwnerFlag | NumberCarsOwned | NumberChildrenAtHome | TotalChildren | YearlyIncome | Age | AvgMonthSpend |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 1.0000000 | -0.0291990 | -0.0121010 | -0.0397062 | -0.0712446 | -0.0294232 | -0.0128701 | 0.0027548 |
| HomeOwnerFlag | -0.0291990 | 1.0000000 | 0.2097869 | 0.3684593 | 0.5781974 | 0.3562434 | 0.6421242 | 0.2907846 |
| NumberCarsOwned | -0.0121010 | 0.2097869 | 1.0000000 | 0.0203926 | 0.0304295 | 0.4773002 | 0.0417541 | 0.2753784 |
| NumberChildrenAtHome | -0.0397062 | 0.3684593 | 0.0203926 | 1.0000000 | 0.6060767 | 0.0059898 | 0.3262236 | 0.1451770 |
| TotalChildren | -0.0712446 | 0.5781974 | 0.0304295 | 0.6060767 | 1.0000000 | 0.0220138 | 0.5476922 | 0.0261377 |
| YearlyIncome | -0.0294232 | 0.3562434 | 0.4773002 | 0.0059898 | 0.0220138 | 1.0000000 | 0.0260389 | 0.5301257 |
| Age | -0.0128701 | 0.6421242 | 0.0417541 | 0.3262236 | 0.5476922 | 0.0260389 | 1.0000000 | 0.1112192 |
| AvgMonthSpend | 0.0027548 | 0.2907846 | 0.2753784 | 0.1451770 | 0.0261377 | 0.5301257 | 0.1112192 | 1.0000000 |

The table above shows the correlation coefficients between each pair of these features.

These correlations corroborate the scatter-plot matrix, showing positive correlations between YearlyIncome and several features, and moderate to strong positive correlations among some of the other numeric features.

Now, I will start answering some of the initial questions:

Next, I calculated the median Yearly Income by Occupation. The result shows that customers classified as Professionals have the highest median income.
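A sketch of the computation (the incomes are standardized, which is why some of the medians below are negative):

tapply(new.awcustomer$YearlyIncome, new.awcustomer$Occupation, median)   # median income per occupation group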

##       Clerical     Management         Manual   Professional Skilled Manual 
##      0.3000543      1.1366634     -1.3558265      1.9651905     -0.5282283

Then, a plot to validate the above result.

Is the same order reflected in the median Average Monthly Spend by Occupation? The answer is yes, as we can see from the computations below:

##       Clerical     Management         Manual   Professional Skilled Manual 
##     0.09112657     0.33548379    -1.09866037     0.58275003    -0.37140674

A plot to validate the result

Which group of customers accounts for a distinctly high range of Average Monthly Spend values?
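A sketch of the grouping, assuming for simplicity that all customers are between 20 and 60 years old:

age_band   <- cut(new.awcustomer$Age, breaks = c(20, 30, 40, 50, 60),
                  include.lowest = TRUE)                          # 10-year age bands
gender_age <- interaction(new.awcustomer$Gender, age_band, sep = " ")
tapply(new.awcustomer$AvgMonthSpend, gender_age, median)          # median spend per gender/age group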

##  F [20,30]  M [20,30]  F (30,40]  M (30,40]  F (40,50]  M (40,50] 
## -0.3019178 -0.2241134 -0.2372726  1.0222334 -0.4849768  0.6422601 
##  F (50,60]  M (50,60] 
## -0.1706591 -0.2227987

We can see from the above computation that the customers with a distinctly high range of Average Monthly Spend are males in the 30 to 40 years age group.

The plot below confirms that male customers generally have a higher Average Monthly Spend than female customers.
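The plot itself is not reproduced here; a minimal ggplot2 sketch of the comparison, assuming the same new.awcustomer data frame:

library(ggplot2)
ggplot(new.awcustomer, aes(x = Gender, y = AvgMonthSpend, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Average Monthly Spend by Gender",
       y = "AvgMonthSpend (standardized)")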

As the plot below shows, the largest share of customers falls in the 30 to 40 years age bucket.

This demographic pattern is also reflected in the income of customers who are bike buyers, as we can see from the plot below:

Another look at the demographics shows that almost all bike buyers have children at home, as shown by the boxplot below.

The plot below shows that most of the customers live in the United States, followed by Australia.

Now I will move on to predictive machine learning modelling.

I will build, test, and optimize predictive machine learning models based on the above analysis of the customer data.

I have concluded that the apparent relationships between AvgMonthSpend, YearlyIncome, and individual features are helpful for deriving predictive heuristics. However, relationships are often more complex and may only become apparent when multiple features are considered in combination with one another.

To help identify these more complex relationships, I will use ensemble modelling: boosting, a neural network, a random forest, and a generalized linear model.

Binary Classification / Two-Class Logistic Regression:

First, I will classify customers based on the likelihood of a customer buying a product, using a stochastic gradient boosting algorithm with accuracy as the evaluation metric.

I will split the dataset into two parts, stratified on the outcome variable: 75% will be used for training and 25% for validating the model.

library(caret)   # createDataPartition() and train() come from caret

index <- createDataPartition(new.awcustomer$BikeBuyer, p = 0.75,
                             list = FALSE, times = 1)
trainingSet <- new.awcustomer[index, ]
testingSet  <- new.awcustomer[-index, ]
names(trainingSet)
##  [1] "CustomerID"           "FirstName"            "LastName"            
##  [4] "AddressLine1"         "City"                 "StateProvinceName"   
##  [7] "CountryRegionName"    "PostalCode"           "PhoneNumber"         
## [10] "Education"            "Occupation"           "Gender"              
## [13] "MaritalStatus"        "HomeOwnerFlag"        "NumberCarsOwned"     
## [16] "NumberChildrenAtHome" "TotalChildren"        "YearlyIncome"        
## [19] "Age"                  "BikeBuyer"            "AvgMonthSpend"
##  [1] "CustomerID"           "CountryRegionName"    "Education"           
##  [4] "Occupation"           "Gender"               "MaritalStatus"       
##  [7] "HomeOwnerFlag"        "NumberCarsOwned"      "NumberChildrenAtHome"
## [10] "TotalChildren"        "YearlyIncome"         "Age"                 
## [13] "BikeBuyer"            "AvgMonthSpend"

I will be using most of the predictors for estimating feature importance.

predictors <- c("Age", "AvgMonthSpend", "Education", "Gender", "HomeOwnerFlag", "MaritalStatus", "NumberCarsOwned", "NumberChildrenAtHome", "YearlyIncome")

Model fitting uses repeated 5-fold cross-validation (5 repeats):

fitControl <- trainControl(
  method = "repeatedcv",   # repeated k-fold cross-validation
  number = 5,              # 5 folds
  repeats = 5,             # repeated 5 times
  classProbs = TRUE)       # keep class probabilities for later predictions

Next, define a full tuning grid for the model hyperparameters:

grid <- expand.grid(n.trees = c(10, 20, 50, 100, 500, 1000),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = c(3, 5, 10),
                    interaction.depth = c(5, 15))

I will use the top predictors based on feature importance:

predictors <- c("NumberCarsOwned",
                "NumberChildrenAtHome", "YearlyIncome")

Next, I will train the stochastic gradient boosting model with the tuned hyperparameters:

outcome <- "BikeBuyer"   # outcome variable for the two-class model

garbagefit <- capture.output(
  fitModel <- train(trainingSet[, predictors], trainingSet[, outcome],
                    method = 'gbm', trControl = fitControl, tuneGrid = grid))

Here is a model summary for the boosting algorithm:

## Stochastic Gradient Boosting 
## 
## 18361 samples
##     9 predictor
##     2 classes: 'NO', 'YES' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 18361, 18361, 18361, 18361, 18361, 18361, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7516347  0.4955513
##   1                  100      0.7613137  0.5106843
##   1                  150      0.7670166  0.5224110
##   2                   50      0.7669743  0.5223365
##   2                  100      0.7810386  0.5518195
##   2                  150      0.7818269  0.5533816
##   3                   50      0.7805810  0.5501909
##   3                  100      0.7850074  0.5592848
##   3                  150      0.7861815  0.5616602
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Plot the model result
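With caret, the tuning results can be plotted directly from the fitted train object; a one-line sketch:

plot(fitModel)   # accuracy across the tuning grid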

Here we can see the feature importance estimated using the stochastic gradient boosting algorithm:
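The estimates below come from caret's varImp; a sketch, assuming a boosted model fitted on the full predictor set (nine variables appear in the listing):

gbmImp <- varImp(fitModel, scale = TRUE)   # scaled 0-100 importance scores
print(gbmImp)
plot(gbmImp)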

## gbm variable importance
## 
##                      Overall
## NumberChildrenAtHome 100.000
## YearlyIncome          75.196
## NumberCarsOwned       71.758
## Gender                12.074
## MaritalStatus         11.400
## Age                    9.414
## Education              8.527
## AvgMonthSpend          8.231
## HomeOwnerFlag          0.000

Model summary with Random Forest algorithm
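The code for the random forest, neural network, and GLM fits is not shown above; presumably they reuse the same caret train interface with a different method argument. A sketch for the random forest (mtry values matching the summary below; the neural network and logistic regression would use method = 'nnet' and method = 'glm'):

rfFit <- train(trainingSet[, predictors], trainingSet[, outcome],
               method = "rf",
               tuneGrid = expand.grid(mtry = c(2, 5, 9)))   # caret's default bootstrap resampling
print(rfFit)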

## Random Forest 
## 
## 18361 samples
##     9 predictor
##     2 classes: 'NO', 'YES' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 18361, 18361, 18361, 18361, 18361, 18361, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.7891239  0.5681127
##   5     0.7696971  0.5306554
##   9     0.7656904  0.5225921
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

Plot Random Forest result

Here we can see the feature importance estimated using the Random Forest algorithm:

## rf variable importance
## 
##                      Overall
## YearlyIncome         100.000
## NumberChildrenAtHome  98.147
## NumberCarsOwned       73.776
## AvgMonthSpend         54.350
## Age                   30.025
## Education             10.673
## MaritalStatus          7.805
## Gender                 7.453
## HomeOwnerFlag          0.000

Model summary with Neural Network algorithm

## Neural Network 
## 
## 18361 samples
##     9 predictor
##     2 classes: 'NO', 'YES' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 18361, 18361, 18361, 18361, 18361, 18361, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0e+00  0.6131501  0.1637974
##   1     1e-04  0.6165517  0.1826788
##   1     1e-01  0.6950287  0.3821672
##   3     0e+00  0.6935175  0.3716345
##   3     1e-04  0.6986048  0.3774730
##   3     1e-01  0.7482447  0.4900290
##   5     0e+00  0.7287786  0.4593159
##   5     1e-04  0.7301300  0.4519707
##   5     1e-01  0.7623993  0.5169948
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were size = 5 and decay = 0.1.

Neural Network result

Model summary with the Logistic Regression (Generalized Linear) Model:

## Generalized Linear Model 
## 
## 18361 samples
##     9 predictor
##     2 classes: 'NO', 'YES' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 18361, 18361, 18361, 18361, 18361, 18361, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7020021  0.3975654

Feature importance estimation using the Logistic Regression (GLM) model:

## glm variable importance
## 
##                               Overall
## NumberChildrenAtHome         100.0000
## YearlyIncome                  34.2288
## AvgMonthSpend                 27.4186
## GenderM                       26.9424
## NumberCarsOwned               22.3450
## EducationPartial High School  19.3328
## Age                           13.2939
## EducationHigh School           8.3649
## EducationPartial College       3.7832
## HomeOwnerFlag                  1.5195
## MaritalStatusS                 0.2697
## EducationGraduate Degree       0.0000

We have seen that the feature importance estimates of the different models differ, and can thus be combined to get a more holistic view of the importance of each predictor. Two main uses of variable importance from the various models are:

  1. Predictors that are important for the majority of models represent genuinely important predictors.

  2. For ensembling, I would use predictions from models that have significantly different variable importance, as their predictions are also expected to be different, provided all of them are sufficiently accurate (see the sketch after this list).
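Purely as an illustration of the ensembling idea (not code from the original analysis), the class probabilities of the fitted models could simply be averaged; rfFit and nnFit are hypothetical objects for the random forest and neural network fits:

prob_gbm <- predict(fitModel, testingSet[, predictors], type = "prob")$YES
prob_rf  <- predict(rfFit,    testingSet[, predictors], type = "prob")$YES   # hypothetical rf fit
prob_nn  <- predict(nnFit,    testingSet[, predictors], type = "prob")$YES   # hypothetical nnet fit
ensemble_prob  <- (prob_gbm + prob_rf + prob_nn) / 3                         # average the class probabilities
ensemble_class <- ifelse(ensemble_prob > 0.5, "YES", "NO")                   # decision at the 0.5 threshold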

Let’s take a look at the predictions from the boosting model

predictions <- predict.train(object = fitModel,
                             testingSet[, predictors], type = "raw")
table(predictions)   # class counts on the test set
## predictions
##   NO  YES 
## 1665 2924

Predicted class probabilities:

predictionsP <- predict.train(object = fitModel,
                              testingSet[, predictors], type = "prob")
head(predictionsP)   # first few predicted probabilities
##          NO       YES
## 1 0.3827253 0.6172747
## 2 0.3827253 0.6172747
## 3 0.3827253 0.6172747
## 4 0.3837658 0.6162342
## 5 0.3827253 0.6172747
## 6 0.3827253 0.6172747

Next, I will start Regression Modelling:

The second part of this real-world challenge is using the analysis to determine whether a customer’s average monthly spend with the company can be predicted from known customer characteristics.

After creating a classification model to predict the likelihood of a customer buying a product, I created a regression model to predict a customer’s average monthly spend with the company. The model was trained with 70% of the data, and tested with the remaining 30%.

regindex <- createDataPartition(new.awcustomer$AvgMonthSpend, p = 0.70,
                                list = FALSE)
regtrain <- new.awcustomer[regindex, ]
regtest  <- new.awcustomer[-regindex, ]

Checking the structure of the training data:

## 'data.frame':    12855 obs. of  19 variables:
##  $ CustomerID          : int  11000 11001 11002 11003 11004 11005 11006 11008 11009 11010 ...
##  $ AddressLine1        : Factor w/ 12742 levels "00, rue Saint-Lazare",..: 3729 1815 6319 1176 8454 8148 2306 6867 3830 8732 ...
##  $ City                : Factor w/ 269 levels "Ballard","Baltimore",..: 205 226 122 175 264 96 158 11 120 96 ...
##  $ StateProvinceName   : Factor w/ 54 levels "Alabama","Alberta",..: 37 50 45 28 28 37 28 50 37 37 ...
##  $ CountryRegionName   : Factor w/ 6 levels "Australia","Canada",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PostalCode          : Factor w/ 323 levels "1002","10210",..: 80 44 116 23 33 67 18 56 79 67 ...
##  $ PhoneNumber         : Factor w/ 8836 levels "1 (11) 500 555-0110",..: 53 1 75 53 22 42 75 55 1 60 ...
##  $ Education           : Factor w/ 5 levels "Bachelors","Graduate Degree",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Occupation          : Factor w/ 5 levels "Clerical","Management",..: 5 5 1 2 2 2 2 2 5 1 ...
##  $ Gender              : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 1 1 2 1 ...
##  $ MaritalStatus       : Factor w/ 2 levels "M","S": 1 2 1 2 2 2 2 2 2 2 ...
##  $ HomeOwnerFlag       : int  1 0 1 0 0 1 1 1 0 0 ...
##  $ NumberCarsOwned     : num  -0.296 -0.296 0.798 0.798 -0.296 ...
##  $ NumberChildrenAtHome: num  1.163 -0.594 1.163 -0.594 -0.594 ...
##  $ TotalChildren       : int  1 0 2 0 0 0 0 0 0 0 ...
##  $ YearlyIncome        : num  -0.334 -0.405 0.424 1.28 1.222 ...
##  $ Age                 : int  31 32 31 29 28 31 31 32 33 33 ...
##  $ BikeBuyer           : chr  "YES" "NO" "YES" "YES" ...
##  $ AvgMonthSpend       : num  2.078 1.15 0.0708 1.1151 0.1319 ...

I will fit the model with six-fold cross-validation:

regFit <- trainControl(method = 'cv', number = 6, 
                           summaryFunction=defaultSummary)

Next, I will tune the model parameters:

Grid <- expand.grid( n.trees = seq(50,1000,50), n.minobsinnode = c(3,5,10),
                     interaction.depth = c(30), shrinkage = c(0.1))

I will use the best features for training:

regFormula <- AvgMonthSpend ~ YearlyIncome + Age + Gender + NumberChildrenAtHome + MaritalStatus 

Training the model

garbagemod <- capture.output(
  fitRegMod <- train(regFormula, data = regtrain, method = 'gbm',
                     trControl = regFit, tuneGrid = Grid,
                     metric = 'RMSE', maximize = FALSE))

Cross-validated stochastic gradient boosting model:

Checking for feature importance

## gbm variable importance
## 
##                       Overall
## YearlyIncome         100.0000
## Age                   56.0970
## GenderM               19.0761
## NumberChildrenAtHome   0.2826
## MaritalStatusS         0.0000

A Plot of feature importance

Next, model residuals and predictions
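The code for this step is not shown above; a sketch of how the test-set predictions and residuals could be obtained from the fitted fitRegMod and the regtest split defined earlier:

regPred     <- predict(fitRegMod, newdata = regtest)   # test-set predictions
regResidual <- regtest$AvgMonthSpend - regPred         # residuals
sqrt(mean(regResidual^2))                              # RMSE on the hold-out set
round(head(regPred, 10), 1)                            # first ten predictions, shown below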

Let’s take a look at the predictions

##  [1]  1.0 -0.4 -0.3 -1.3  0.4 -0.4 -0.2 -0.4 -1.2 -0.3

I would like to hear from you; kindly send me your feedback.