This worldwide company has collected a large volume of data about its existing customers, including demographic features and information about the purchases they have made. The company is particularly interested in analyzing the customer data to identify any apparent relationships between the demographic features known about the customers and the likelihood of a customer purchasing a product. Additionally, the analysis should determine whether a customer’s average monthly spend with the company can be predicted from known customer characteristics.
The dataset consists of two files: Customers.csv (customer demographic data) and Sales.csv (sales data for existing customers).
Can you tell us the median Yearly Income, ranked by Occupation?
Is the same order reflected in the median Average Monthly Spend by Occupation?
Which group of customers accounts for a distinctly high range of Average Monthly Spend values?
Are there any apparent correlations with Average Monthly Spend?
Apply what you have learned about the data to building, testing, and optimizing a predictive machine learning model.
After preliminary data exploration, I used descriptive and multivariate analysis together with summary statistics to determine the features of interest in relation to the response variables.
Correlation coefficients were computed using statistical methods and visualizations. Based on the descriptive statistics of the features of interest, two models were then built from these features: a binary classification model that predicts the likelihood of a customer purchasing a product, and a regression model that predicts a customer’s average monthly spend with the company.
| CustomerID | City | StateProvinceName | CountryRegionName | Education | Occupation | Gender | MaritalStatus | HomeOwnerFlag | NumberCarsOwned | NumberChildrenAtHome | TotalChildren | YearlyIncome | Age | BikeBuyer | AvgMonthSpend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11000 | Rockhampton | Queensland | Australia | Bachelors | Skilled Manual | M | M | 1 | -0.2957460 | 1.1632792 | 1 | -0.3341607 | 31 | YES | 2.0779835 |
| 11001 | Seaford | Victoria | Australia | Bachelors | Skilled Manual | M | S | 0 | -0.2957460 | -0.5943713 | 0 | -0.4046181 | 32 | NO | 1.1500079 |
| 11002 | Hobart | Tasmania | Australia | Bachelors | Clerical | M | M | 1 | 0.7983891 | 1.1632792 | 2 | 0.4237297 | 31 | YES | 0.0707635 |
| 11003 | North Ryde | New South Wales | Australia | Bachelors | Management | F | S | 0 | 0.7983891 | -0.5943713 | 0 | 1.2803321 | 29 | YES | 1.1150997 |
The histogram above shows that the yearly income of customers is slightly right-skewed; in other words, while some customers have a high yearly income, most are at the lower end of the income range.
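A minimal sketch of how such a histogram can be produced; awcustomer is an assumed name for the raw data read from Customers.csv:

# Sketch: distribution of yearly income (awcustomer is an assumed data frame name)
hist(awcustomer$YearlyIncome,
     breaks = 30,
     main = "Distribution of Yearly Income",
     xlab = "Yearly Income")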
After exploring the income feature, an attempt was made to identify relationships between features in the dataset.
The following scatter-plot matrix was generated initially to compare numeric features with one another. The key features in this matrix are shown here:
The scatter-plot matrix above suggests an apparent relationship between YearlyIncome and some of the numeric features, specifically NumberCarsOwned and AvgMonthSpend; Age, HomeOwnerFlag and TotalChildren also appear to be related to one another.
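A scatter-plot matrix of this kind can be produced with pairs(); a minimal sketch, where the selection of columns is an assumption:

# Sketch: scatter-plot matrix of the numeric features of interest
numericCols <- c("HomeOwnerFlag", "NumberCarsOwned", "NumberChildrenAtHome",
                 "TotalChildren", "YearlyIncome", "Age", "AvgMonthSpend")
pairs(new.awcustomer[, numericCols], pch = 20, cex = 0.3)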
The correlation coefficients of the numeric columns were then calculated, with the following results:
| Feature | CustomerID | HomeOwnerFlag | NumberCarsOwned | NumberChildrenAtHome | TotalChildren | YearlyIncome | Age | AvgMonthSpend |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 1.0000000 | -0.0291990 | -0.0121010 | -0.0397062 | -0.0712446 | -0.0294232 | -0.0128701 | 0.0027548 |
| HomeOwnerFlag | -0.0291990 | 1.0000000 | 0.2097869 | 0.3684593 | 0.5781974 | 0.3562434 | 0.6421242 | 0.2907846 |
| NumberCarsOwned | -0.0121010 | 0.2097869 | 1.0000000 | 0.0203926 | 0.0304295 | 0.4773002 | 0.0417541 | 0.2753784 |
| NumberChildrenAtHome | -0.0397062 | 0.3684593 | 0.0203926 | 1.0000000 | 0.6060767 | 0.0059898 | 0.3262236 | 0.1451770 |
| TotalChildren | -0.0712446 | 0.5781974 | 0.0304295 | 0.6060767 | 1.0000000 | 0.0220138 | 0.5476922 | 0.0261377 |
| YearlyIncome | -0.0294232 | 0.3562434 | 0.4773002 | 0.0059898 | 0.0220138 | 1.0000000 | 0.0260389 | 0.5301257 |
| Age | -0.0128701 | 0.6421242 | 0.0417541 | 0.3262236 | 0.5476922 | 0.0260389 | 1.0000000 | 0.1112192 |
| AvgMonthSpend | 0.0027548 | 0.2907846 | 0.2753784 | 0.1451770 | 0.0261377 | 0.5301257 | 0.1112192 | 1.0000000 |
The table above shows the correlation coefficient for every pair of numeric features.
These values are consistent with the scatter-plot matrix: YearlyIncome is moderately to strongly positively correlated with AvgMonthSpend (0.53) and NumberCarsOwned (0.48), and there are moderate to strong positive correlations among several of the other numeric features, such as Age and HomeOwnerFlag (0.64) and TotalChildren and NumberChildrenAtHome (0.61).
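The correlation matrix above can be reproduced along these lines (a sketch; the column selection mirrors the table):

# Sketch: pairwise Pearson correlations of the numeric columns
corCols <- c("CustomerID", "HomeOwnerFlag", "NumberCarsOwned", "NumberChildrenAtHome",
             "TotalChildren", "YearlyIncome", "Age", "AvgMonthSpend")
round(cor(new.awcustomer[, corCols]), 7)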
Next, I calculated the median Yearly Income by Occupation. The result shows that customers classified as Professionals have the highest median income.
## Clerical Management Manual Professional Skilled Manual
## 0.3000543 1.1366634 -1.3558265 1.9651905 -0.5282283
Is the same order reflected in the median Average Monthly Spend by Occupation? The answer is yes, as we can see from the computations below:
## Clerical Management Manual Professional Skilled Manual
## 0.09112657 0.33548379 -1.09866037 0.58275003 -0.37140674
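Both rankings above can be computed with tapply; a minimal sketch (note that the values are on the standardized scale used throughout this report):

# Sketch: median yearly income and median average monthly spend by occupation
tapply(new.awcustomer$YearlyIncome, new.awcustomer$Occupation, median)
tapply(new.awcustomer$AvgMonthSpend, new.awcustomer$Occupation, median)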
Which group of customers accounts for a distinctly high range of Average Monthly Spend values?
## F [20,30] M [20,30] F (30,40] M (30,40] F (40,50] M (40,50]
## -0.3019178 -0.2241134 -0.2372726 1.0222334 -0.4849768 0.6422601
## F (50,60] M (50,60]
## -0.1706591 -0.2227987
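The group medians above can be computed by binning Age into 10-year buckets and crossing the buckets with Gender; a minimal sketch, assuming customer ages fall between 20 and 60 as the group labels suggest:

# Sketch: median AvgMonthSpend by gender and 10-year age bucket
ageGroup <- cut(new.awcustomer$Age, breaks = c(20, 30, 40, 50, 60),
                include.lowest = TRUE)
tapply(new.awcustomer$AvgMonthSpend,
       interaction(new.awcustomer$Gender, ageGroup, sep = " "),
       median)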
The medians above show that the customers with a distinctly high range of Average Monthly Spend values are males in the 30 to 40 age group.
The plot below confirms that male customers generally have a higher Average Monthly Spend than female customers.
As the plot below shows, the largest concentration of customers is in the 30 to 40 year age buckets.
This demographic pattern is also reflected in the income of customers who are bike buyers, as the next plot shows:
Another look at the demographics, in the boxplot below, shows that almost all bike buyers have children at home.
The plot below shows that most of the customers live in the United States, followed by Australia.
Based on the above analysis of the customer data, I will now start building, testing and optimizing predictive machine learning models.
I have concluded that apparent relationships between AvgMonthSpend, YearlyIncome and individual features are helpful in determining predictive heuristics. However, relationships are often more complex and may only become apparent when multiple features are considered in combination with one another.
To help identify these more complex relationships, I will use ensemble modelling with several model families: stochastic gradient boosting, a neural network, a random forest and a generalized linear model.
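Since all of these will be trained with caret, their resampling performance can later be compared side by side. A minimal sketch, assuming the fitted models are named fitGbm, fitRf, fitNnet and fitGlm and share the same trainControl settings:

# Sketch: compare the resampling results of several caret models
# (fitGbm, fitRf, fitNnet and fitGlm are assumed names for train() objects)
resamps <- resamples(list(GBM = fitGbm, RF = fitRf, NNET = fitNnet, GLM = fitGlm))
summary(resamps)                      # accuracy and Kappa across resamples
bwplot(resamps, metric = "Accuracy")  # side-by-side comparison plot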
First, I will build a classification of customers based on the likelihood of a customer buying a product, using a stochastic gradient boosting algorithm tuned for accuracy.
I will split the dataset into two parts, stratified on the outcome variable: 75% will be used for training and 25% for validating the model.
# Stratified 75/25 split on the BikeBuyer outcome
index <- createDataPartition(new.awcustomer$BikeBuyer, p = 0.75, list = FALSE, times = 1)
trainingSet <- new.awcustomer[index, ]
testingSet <- new.awcustomer[-index, ]
names(trainingSet)
## [1] "CustomerID" "FirstName" "LastName"
## [4] "AddressLine1" "City" "StateProvinceName"
## [7] "CountryRegionName" "PostalCode" "PhoneNumber"
## [10] "Education" "Occupation" "Gender"
## [13] "MaritalStatus" "HomeOwnerFlag" "NumberCarsOwned"
## [16] "NumberChildrenAtHome" "TotalChildren" "YearlyIncome"
## [19] "Age" "BikeBuyer" "AvgMonthSpend"
## [1] "CustomerID" "CountryRegionName" "Education"
## [4] "Occupation" "Gender" "MaritalStatus"
## [7] "HomeOwnerFlag" "NumberCarsOwned" "NumberChildrenAtHome"
## [10] "TotalChildren" "YearlyIncome" "Age"
## [13] "BikeBuyer" "AvgMonthSpend"
I will be using most of the predictors for estimating feature importance.
predictors <- c("Age", "AvgMonthSpend", "Education", "Gender", "HomeOwnerFlag", "MaritalStatus", "NumberCarsOwned", "NumberChildrenAtHome", "YearlyIncome")
I will fit the model using 5-fold cross-validation, repeated 5 times.
# Repeated 5-fold cross-validation, keeping class probabilities
fitControl <- trainControl(
  method = "repeatedcv",
  classProbs = TRUE,
  number = 5,
  repeats = 5)
Next, I will tune the model parameters over a full grid:
# Full tuning grid for the gbm hyperparameters
grid <- expand.grid(n.trees = c(10, 20, 50, 100, 500, 1000),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = c(3, 5, 10),
                    interaction.depth = c(5, 15))
I will use the top predictors based on feature importance.
predictors <- c("NumberCarsOwned",
"NumberChildrenAtHome", "YearlyIncome")
Next, I will train the stochastic gradient boosting algorithm with the model hyperparameters:
outcome <- "BikeBuyer"  # (assumed) name of the outcome column
garbagefit <- capture.output(
  fitModel <- train(trainingSet[, predictors], trainingSet[, outcome],
                    method = 'gbm', trControl = fitControl, tuneGrid = grid))
Here is a model summary for the boosting algorithm:
## Stochastic Gradient Boosting
##
## 18361 samples
## 9 predictor
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 18361, 18361, 18361, 18361, 18361, 18361, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7516347 0.4955513
## 1 100 0.7613137 0.5106843
## 1 150 0.7670166 0.5224110
## 2 50 0.7669743 0.5223365
## 2 100 0.7810386 0.5518195
## 2 150 0.7818269 0.5533816
## 3 50 0.7805810 0.5501909
## 3 100 0.7850074 0.5592848
## 3 150 0.7861815 0.5616602
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Plot the model result
Here we can see the feature importance estimates from the stochastic gradient boosting algorithm:
## gbm variable importance
##
## Overall
## NumberChildrenAtHome 100.000
## YearlyIncome 75.196
## NumberCarsOwned 71.758
## Gender 12.074
## MaritalStatus 11.400
## Age 9.414
## Education 8.527
## AvgMonthSpend 8.231
## HomeOwnerFlag 0.000
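The importance scores above are scaled by caret so that the most important variable is 100. A minimal sketch of how the table and the corresponding plot can be produced from the fitted model:

# Sketch: scaled variable importance from the fitted gbm model
gbmImp <- varImp(fitModel, scale = TRUE)
print(gbmImp)
plot(gbmImp, main = "Feature importance: stochastic gradient boosting")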
Model summary with Random Forest algorithm
## Random Forest
##
## 18361 samples
## 9 predictor
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 18361, 18361, 18361, 18361, 18361, 18361, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7891239 0.5681127
## 5 0.7696971 0.5306554
## 9 0.7656904 0.5225921
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Plot Random Forest result
Here we can see the feature importance estimates from the Random Forest algorithm:
## rf variable importance
##
## Overall
## YearlyIncome 100.000
## NumberChildrenAtHome 98.147
## NumberCarsOwned 73.776
## AvgMonthSpend 54.350
## Age 30.025
## Education 10.673
## MaritalStatus 7.805
## Gender 7.453
## HomeOwnerFlag 0.000
Model summary with Neural Network algorithm
## Neural Network
##
## 18361 samples
## 9 predictor
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 18361, 18361, 18361, 18361, 18361, 18361, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.6131501 0.1637974
## 1 1e-04 0.6165517 0.1826788
## 1 1e-01 0.6950287 0.3821672
## 3 0e+00 0.6935175 0.3716345
## 3 1e-04 0.6986048 0.3774730
## 3 1e-01 0.7482447 0.4900290
## 5 0e+00 0.7287786 0.4593159
## 5 1e-04 0.7301300 0.4519707
## 5 1e-01 0.7623993 0.5169948
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
Neural Network result
Model summary with the Generalized Linear Model (logistic regression)
## Generalized Linear Model
##
## 18361 samples
## 9 predictor
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 18361, 18361, 18361, 18361, 18361, 18361, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7020021 0.3975654
Feature importance estimation using the logistic regression (generalized linear) model:
## glm variable importance
##
## Overall
## NumberChildrenAtHome 100.0000
## YearlyIncome 34.2288
## AvgMonthSpend 27.4186
## GenderM 26.9424
## NumberCarsOwned 22.3450
## EducationPartial High School 19.3328
## Age 13.2939
## EducationHigh School 8.3649
## EducationPartial College 3.7832
## HomeOwnerFlag 1.5195
## MaritalStatusS 0.2697
## EducationGraduate Degree 0.0000
We have seen that the feature importance estimates differ between models, and combining them can therefore give a more holistic view of the importance of each predictor. Two main uses of variable importance from multiple models are:
- Predictors that are important for the majority of models represent genuinely important predictors.
- For ensembling, I would use predictions from models that have significantly different variable importance, as their predictions are also expected to differ (a probability-averaging sketch follows below). One thing that must be ensured, however, is that all of the models are sufficiently accurate.
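A simple way to combine such models is to average their predicted class probabilities. A minimal sketch, assuming the other classifiers have been trained with caret and are named fitGbm, fitRf and fitGlm:

# Sketch: average the predicted probability of "YES" from several models
# (fitGbm, fitRf and fitGlm are assumed names for caret train() objects)
probGbm <- predict(fitGbm, testingSet[, predictors], type = "prob")$YES
probRf <- predict(fitRf, testingSet[, predictors], type = "prob")$YES
probGlm <- predict(fitGlm, testingSet[, predictors], type = "prob")$YES

ensembleProb <- (probGbm + probRf + probGlm) / 3
ensembleClass <- ifelse(ensembleProb > 0.5, "YES", "NO")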
# Predicted classes for the hold-out (test) set
predictions <- predict.train(object = fitModel,
                             testingSet[, predictors], type = "raw")
## predictions
## NO YES
## 1665 2924
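These predicted classes can be scored against the actual labels in the hold-out set. A minimal sketch, assuming testingSet$BikeBuyer can be coerced to a factor with levels NO and YES:

# Sketch: confusion matrix of predicted vs. actual BikeBuyer on the test set
confusionMatrix(predictions,
                factor(testingSet$BikeBuyer, levels = c("NO", "YES")))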
Predicted class probabilities:
predictionsP<-predict.train(object = fitModel,
testingSet[,predictors],type="prob")
## NO YES
## 1 0.3827253 0.6172747
## 2 0.3827253 0.6172747
## 3 0.3827253 0.6172747
## 4 0.3837658 0.6162342
## 5 0.3827253 0.6172747
## 6 0.3827253 0.6172747
The second part of this real-world challenge is using the analysis to determine whether a customer’s average monthly spend with the company can be predicted from known customer characteristics.
After creating a classification model to predict the likelihood of a customer buying a product, I created a regression model to predict a customer’s average monthly spend with the company. The model was trained with 70% of the data, and tested with the remaining 30%.
# 70/30 split on the regression target AvgMonthSpend
regindex <- createDataPartition(new.awcustomer$AvgMonthSpend, p = 0.70,
                                list = FALSE)
regtrain <- new.awcustomer[regindex, ]
regtest <- new.awcustomer[-regindex, ]
## 'data.frame': 12855 obs. of 19 variables:
## $ CustomerID : int 11000 11001 11002 11003 11004 11005 11006 11008 11009 11010 ...
## $ AddressLine1 : Factor w/ 12742 levels "00, rue Saint-Lazare",..: 3729 1815 6319 1176 8454 8148 2306 6867 3830 8732 ...
## $ City : Factor w/ 269 levels "Ballard","Baltimore",..: 205 226 122 175 264 96 158 11 120 96 ...
## $ StateProvinceName : Factor w/ 54 levels "Alabama","Alberta",..: 37 50 45 28 28 37 28 50 37 37 ...
## $ CountryRegionName : Factor w/ 6 levels "Australia","Canada",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ PostalCode : Factor w/ 323 levels "1002","10210",..: 80 44 116 23 33 67 18 56 79 67 ...
## $ PhoneNumber : Factor w/ 8836 levels "1 (11) 500 555-0110",..: 53 1 75 53 22 42 75 55 1 60 ...
## $ Education : Factor w/ 5 levels "Bachelors","Graduate Degree",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Occupation : Factor w/ 5 levels "Clerical","Management",..: 5 5 1 2 2 2 2 2 5 1 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 1 1 2 1 ...
## $ MaritalStatus : Factor w/ 2 levels "M","S": 1 2 1 2 2 2 2 2 2 2 ...
## $ HomeOwnerFlag : int 1 0 1 0 0 1 1 1 0 0 ...
## $ NumberCarsOwned : num -0.296 -0.296 0.798 0.798 -0.296 ...
## $ NumberChildrenAtHome: num 1.163 -0.594 1.163 -0.594 -0.594 ...
## $ TotalChildren : int 1 0 2 0 0 0 0 0 0 0 ...
## $ YearlyIncome : num -0.334 -0.405 0.424 1.28 1.222 ...
## $ Age : int 31 32 31 29 28 31 31 32 33 33 ...
## $ BikeBuyer : chr "YES" "NO" "YES" "YES" ...
## $ AvgMonthSpend : num 2.078 1.15 0.0708 1.1151 0.1319 ...
I will fit the model with six-fold cross-validation.
regFit <- trainControl(method = 'cv', number = 6,
summaryFunction=defaultSummary)
Next, I will tune the model parameters:
Grid <- expand.grid( n.trees = seq(50,1000,50), n.minobsinnode = c(3,5,10),
interaction.depth = c(30), shrinkage = c(0.1))
I will use the best features for training:
regFormula <- AvgMonthSpend ~ YearlyIncome + Age + Gender + NumberChildrenAtHome + MaritalStatus
Training the model
garbagemod <- capture.output(
fitRegMod <- train(regFormula, data=regtrain, method = 'gbm',
trControl=regFit,tuneGrid=Grid,metric='RMSE',maximize=FALSE))
Cross-validated stochastic gradient boosting model
Checking for feature importance
## gbm variable importance
##
## Overall
## YearlyIncome 100.0000
## Age 56.0970
## GenderM 19.0761
## NumberChildrenAtHome 0.2826
## MaritalStatusS 0.0000
A plot of feature importance
Next, model residuals and predictions
Let’s take a look at the predictions
## [1] 1.0 -0.4 -0.3 -1.3 0.4 -0.4 -0.2 -0.4 -1.2 -0.3
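Finally, the regression model can be scored on the hold-out set. A minimal sketch (the RMSE is on the standardized AvgMonthSpend scale used in this report):

# Sketch: predictions and root-mean-squared error on the regression test set
regPred <- predict(fitRegMod, newdata = regtest)
RMSE(regPred, regtest$AvgMonthSpend)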