Customer Segmentation Using Machine Learning

Customer segmentation is the process of dividing a customer base into groups of individuals that are similar in certain ways relevant to marketing, such as age, gender, interests, and spending habits. It enables companies to target specific groups with tailored promotions, products, or services that are most likely to resonate with them. Machine learning has become a popular tool for automating the process of customer segmentation, providing a more efficient and effective way to identify patterns and relationships within customer data.

There are several machine learning methods for performing customer segmentation, including:

Clustering algorithms: These algorithms divide customers into groups based on their characteristics and behaviour. For example, k-means clustering partitions a dataset into k clusters, where k is chosen in advance.
Decision trees: These algorithms use a tree-like model to identify the most important variables that influence customer behaviour. By using decision trees, companies can determine which customers are most likely to respond to certain marketing campaigns or products.
Neural networks: These algorithms can be used to model complex relationships between customers and their behaviour. Neural networks can identify patterns in customer data that are not easily recognizable through traditional methods.
Association rule learning: This method finds the relationships between customer attributes and behaviours, such as buying habits and product preferences. Association rule learning can help companies understand which products are frequently purchased together and target customers accordingly.
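
As a rough illustration of the clustering approach listed above, the sketch below runs k-means on a tiny made-up dataset. The income and spending numbers are hypothetical and are not taken from the grocery data used later; the point is only that k is fixed before fitting and every customer receives exactly one cluster label.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: [annual income (thousands), annual spend (thousands)]
customers = np.array([
    [20, 2], [22, 3], [25, 2],      # lower income, lower spend
    [80, 20], [85, 22], [90, 25],   # higher income, higher spend
])

# k must be chosen up front; k-means then assigns every row to one of k clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster index for each customer
print(kmeans.cluster_centers_)  # one centroid per cluster
```

The label numbers themselves are arbitrary; only the grouping they induce is meaningful.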

Advantages of Machine Learning for Customer Segmentation

One of the key benefits of using machine learning for customer segmentation is its ability to process vast amounts of data in real time. This allows companies to quickly identify new trends and patterns in customer behaviour and make more informed marketing decisions. Additionally, machine learning algorithms can continuously learn and improve over time, providing a more accurate picture of customer behaviour.
Another advantage of using machine learning for customer segmentation is that it eliminates the need for manual data analysis. This can be a time-consuming and error-prone process, particularly when working with large datasets. Machine learning algorithms can automate the process of data analysis, providing companies with more accurate and reliable results.

We will now perform unsupervised clustering on the customer records from a grocery store's database.

Importing Libraries

# Importing the Libraries
import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from matplotlib.colors import ListedColormap
from sklearn import metrics
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")
np.random.seed(42)

Loading Data

# Loading the dataset
dataset = pd.read_csv("marketing_campaign.csv", sep="\t")
print("Number of datapoints in the dataset:", len(dataset))
dataset.head()

Output:

Data Cleaning

In this section, we will carry out the following tasks:

Data Cleaning
Feature Engineering

To understand which cleaning steps are needed, let's first examine the information contained in the data.

# Get information about the features (column names, non-null counts and dtypes) of the dataset
dataset.info()

Output:

We may deduce and note the following from the output described above:

Income has missing values (only 2194 of its values are non-null).
Dt_Customer, which records the date a customer was added to the database, is not parsed as a DateTime.
The data frame contains some categorical features (those with dtype: object), so we will need to encode them numerically later on.
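
The observations above can also be checked programmatically rather than read off the printed summary. The sketch below uses a hypothetical three-row frame (the values are made up for illustration) to count missing values per column and to list the non-numeric columns that will need encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the issues noted above
toy = pd.DataFrame({
    "Income": [58000.0, np.nan, 71000.0],
    "Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"],
    "Education": ["Graduation", "PhD", "Master"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Columns that are not numeric and therefore need parsing or encoding
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```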

We will start by removing the rows with missing Income values.

# Remove the rows that contain NA values using .dropna()
dataset = dataset.dropna()
no = len(dataset)
print(f"After removing the rows with missing values, {no} datapoints remain in the dataset")

Output:

The next step is to build a feature from "Dt_Customer" that indicates how long a customer has been registered in the company's database. To keep things simple, we measure this value relative to the most recent customer in the records.

To do so, we first find the most recent and the earliest enrollment dates on record.

# Parse the enrollment dates (assuming the file stores them day-first, e.g. 21-08-2013)
dataset["Dt_Customer"] = pd.to_datetime(dataset["Dt_Customer"], dayfirst=True)
dates = []
for i in dataset["Dt_Customer"]:
    i = i.date()
    dates.append(i)
# Dates of the most recent and oldest customer enrollments on record
newest_date = max(dates)
print(f"Date of the most recent customer's enrollment in the records: {newest_date}")
oldest_date = min(dates)
print(f"Date of the oldest customer's enrollment in the records: {oldest_date}")

Output:

Next, we create a feature ("Customer_For_How_Much_Time") that holds the number of days each customer has been enrolled, relative to the most recent enrollment date on record.

# Create the feature "Customer_For_How_Much_Time" (tenure in days)
days = []
d_1 = max(dates) # enrollment date of the newest customer
for i in dates:
    d = d_1 - i
    days.append(d.days) # keep whole days; a raw timedelta would later convert to nanoseconds
dataset["Customer_For_How_Much_Time"] = days
dataset["Customer_For_How_Much_Time"] = pd.to_numeric(dataset["Customer_For_How_Much_Time"], errors="coerce")

To better understand the data, we will now examine the distinct values of the categorical features.

print("Total categories for the Marital Status feature:\n", dataset["Marital_Status"].value_counts(), "\n")
print("Total categories for the feature Education:\n", dataset["Education"].value_counts())

Output:

We will carry out the following steps to engineer some new features:

Extract a customer's "Age" from "Year_Birth," which holds their year of birth.
Add a new feature called "Spent" that holds the customer's total spending across all categories over a two-year period.
Create the feature "Living_With" from "Marital_Status" to capture whether the customer lives alone or with a partner.
Create the feature "Children" to hold the total number of kids and teenagers living in the household.
Add the feature "Family_Size" to describe the household further.
Create the feature "Is_Parent" to indicate whether the customer is a parent.
Simplify "Education" into three categories by grouping its values.
Remove some of the redundant features.

# Engineering Features
# Now, we need to engineer the features according to the requirements

# Age of the customer as of 2021
dataset["Age"] = 2021-dataset["Year_Birth"]

# Total spending on numerous products
dataset["Spent"] = dataset["MntWines"]+ dataset["MntFruits"]+ dataset["MntMeatProducts"]+ dataset["MntFishProducts"]+ dataset["MntSweetProducts"]+ dataset["MntGoldProds"]

# Living situation derived from marital status ("Alone" or "Partner")
dataset["Living_With"]=dataset["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})

# A feature that counts the number of kids in the home
dataset["Children"]=dataset["Kidhome"]+dataset["Teenhome"]

# Total number of household members feature
dataset["Family_Size"] = dataset["Living_With"].map({"Alone": 1, "Partner": 2}) + dataset["Children"]

# Feature related to parenting
dataset["Is_Parent"] = np.where(dataset.Children> 0, 1, 0)

# Dividing educational levels into three categories
dataset["Education"]=dataset["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Postgraduate", "PhD":"Postgraduate"})

# For clarity
dataset=dataset.rename(columns={"MntWines": "Wines","MntFruits":"Fruits","MntMeatProducts":"Meat","MntFishProducts":"Fish","MntSweetProducts":"Sweets","MntGoldProds":"Gold"})

# Removing some of the redundant features
to_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
dataset = dataset.drop(to_drop, axis=1)

Now that we have some additional features, let's look at the summary statistics for the data.

dataset.describe()

Output:

The statistics above show large gaps between the mean and the maximum values of Income and Age, which hints at outliers.

Note that the maximum age is 128 years because the data is old and ages were calculated relative to 2021.

We need to take a broader look at the data. We will plot a few selected features.

# Plotting a few selected features
# Setting up colour preferences
sns.set(rc={"axes.facecolor":"#FFF9ED","figure.facecolor":"#FFF9ED"})
pallet = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
cmap = colors.ListedColormap(pallet)
# Features to plot
to_be_plotted = [ "Income", "Recency", "Customer_For_How_Much_Time", "Age", "Spent", "Is_Parent"]
print("Pair Plot of a Few Selected Features: A subset of data")
# pairplot creates its own figure; the points are coloured by Is_Parent
sns.pairplot(dataset[to_be_plotted], hue="Is_Parent", palette=["#682F2F","#F3AB60"])
plt.show()

Output:

Pair Plot of a Few Selected Features: A subset of data

Clearly, the Income and Age features contain a few outliers. We will remove them from the data.

# Removing the outliers by setting upper limits on Age and Income
dataset = dataset[(dataset["Age"]<90)]
dataset = dataset[(dataset["Income"]<600000)]
l=len(dataset)
print(f"After removing the outliers, {l} data points remain")

Output:

Let's now examine the correlations between the features (leaving out the categorical features at this stage).

# Correlation matrix (numeric columns only, since the categorical columns are not encoded yet)
corrmat = dataset.corr(numeric_only=True)
plt.figure(figsize=(20,20))
sns.heatmap(corrmat,annot=True, cmap=cmap, center=0)

Output:

<AxesSubplot: >

The new features are in place and the data is reasonably clean. We will move on to the next phase: preprocessing the data.

Data Preprocessing

In this section, we will preprocess the data so that clustering can be performed.

The data is preprocessed using the steps below:

Label encoding the categorical features
Scaling the features using the standard scaler
Creating a subset dataframe for dimensionality reduction
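
To make the scaling step less of a black box, here is a minimal sketch on a hypothetical two-column frame (the values are invented). It checks that StandardScaler matches the manual formula z = (x - mean) / std, where the standard deviation is the population one (ddof=0).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
toy = pd.DataFrame({"Income": [30000.0, 60000.0, 90000.0],
                    "Children": [0.0, 1.0, 2.0]})

scaled = StandardScaler().fit_transform(toy)

# StandardScaler computes z = (x - mean) / std using the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

After scaling, both columns have mean 0 and unit variance, so a distance-based method is not dominated by the feature with the largest raw values.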

# Obtain a list of the categorical variables
s = (dataset.dtypes == 'object')
object_columns = list(s[s].index)

print("The dataset's categorical variables are:", object_columns)

Output:

# Label encode the object dtypes.
LE = LabelEncoder()
for i in object_columns:
    dataset[i] = LE.fit_transform(dataset[i])

print("Now, all attributes are numerical.")

Output:

# making a duplicate of the data
copy_dataset = dataset.copy()
# Removing the features on deals accepted and promotions to create a subset of the dataframe
columns_to_delete = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
copy_dataset = copy_dataset.drop(columns_to_delete, axis=1)
# Scaling
standard_scaler = StandardScaler()
standard_scaler.fit(copy_dataset)
scaled_dataset = pd.DataFrame(standard_scaler.transform(copy_dataset),columns= copy_dataset.columns )
print(" Now, every feature is scaled ")

Output:

# The scaled data that will be used for dimensionality reduction
print("Dataframe to be used for further modelling:")
scaled_dataset.head()

Output:

Dimensionality Reduction

Dimensionality reduction is a technique used in machine learning and data science to reduce the number of features or dimensions in a dataset, while retaining as much information as possible. The goal is to simplify the data while preserving its structure and relationships between variables.

Principal Component Analysis (PCA) is a statistical technique that is used to analyze the structure of complex data sets, such as high-dimensional data sets. It is used to identify patterns in the data, which can then be used to reduce the dimensionality of the data, making it easier to visualize and interpret.
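
One way to judge how much information a reduction keeps is the explained variance ratio of the retained components. The sketch below is an illustration on synthetic data, not on the customer data: five observed features are generated from two hidden factors (an assumption made purely for this example), so three components should retain nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: 5 observed features driven by 2 hidden factors plus small noise
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = hidden @ mixing + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=3)
reduced = pca.fit_transform(X)

print(reduced.shape)                        # (200, 3)
print(pca.explained_variance_ratio_)        # share of variance kept by each component
print(pca.explained_variance_ratio_.sum())  # total share kept by the 3 components
```

On real data the retained share is usually lower, so it is worth checking this sum before trusting a three-dimensional projection.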

This section covers the following steps:

PCA-based dimensionality reduction
Plotting the reduced dataframe

#Initiating PCA to reduce dimensions, aka features, to 3
pca = PCA(n_components=3)
pca.fit(scaled_dataset)
PCA_dataset = pd.DataFrame(pca.transform(scaled_dataset), columns=(["col1","col2", "col3"]))
PCA_dataset.describe().T

Output:

#A Reduced Dimensional 3D Data Projection
x =PCA_dataset["col1"]
y =PCA_dataset["col2"]
z =PCA_dataset["col3"]
# To plot
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x,y,z, c="maroon", marker="o" )
ax.set_title("A Reduced Dimensional 3D Data Projection")
plt.show()

Output:

Clustering

We will now perform the clustering using Agglomerative Clustering, a hierarchical clustering technique. It merges samples successively until the desired number of clusters is reached.
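
As a small illustration of this merging behaviour, the sketch below clusters six hypothetical one-dimensional points that form three obvious groups. It uses scikit-learn's default Ward linkage; the points are made up for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 1-D toy data with three visually obvious groups
points = np.array([[1.0], [1.2], [5.0], [5.1], [9.0], [9.3]])

# Each point starts as its own cluster; the closest clusters are merged
# repeatedly until only n_clusters remain (Ward linkage is the default)
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(points)

print(labels)
```

Neighbouring points end up with the same label, while the label numbers themselves carry no meaning.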

The steps involved in the clustering:

Determining the number of clusters to form using the elbow method
Clustering with Agglomerative Clustering
Examining the resulting clusters with a scatter plot
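
The idea behind the elbow method in the first step can be sketched without a plotting helper: fit k-means for a range of k and record the inertia (the sum of squared distances of samples to their nearest centroid). The data below is synthetic, with three well-separated blobs assumed for illustration, so the inertia should drop sharply up to k = 3 and flatten afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Hypothetical toy data: three well-separated blobs in 2-D
centers = np.array([[0, 0], [10, 0], [5, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia = sum of squared distances of samples to their nearest centroid
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the value of k after which adding another cluster yields only a marginal reduction in inertia.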

# Quick look at the elbow method to determine how many clusters to form.
print('The number of clusters to form will be determined using the elbow method:')
Elbow_method = KElbowVisualizer(KMeans(), k=10)
Elbow_method.fit(PCA_dataset)
Elbow_method.show()

Output:

<AxesSubplot: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

According to the elbow plot above, four clusters will be the best choice for this data. Next, we will fit the Agglomerative Clustering model to obtain the final clusters.

# Initiating the Agglomerative Clustering model
aggCluster = AgglomerativeClustering(n_clusters=4)
# Fitting the model and predicting the clusters
yhat_aggCluster = aggCluster.fit_predict(PCA_dataset)
PCA_dataset["Clusters"] = yhat_aggCluster
# Adding the Clusters feature to the original dataframe
dataset["Clusters"]= yhat_aggCluster

Let's look at the 3-D distribution of the clusters that were formed.

# Plotting the clusters
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z, s=40, c=PCA_dataset["Clusters"], marker='o', cmap = cmap )
ax.set_title("The Plot Of The Clusters")
plt.show()

Output:

Evaluating Models

Since this clustering is unsupervised, we have no labelled target against which to evaluate or score the model. The goal of this section is to study the patterns in the clusters that were formed and determine their nature.

To do that, we will look at the data in the context of the clusters through exploratory data analysis and draw conclusions.
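
Although there are no labels to score against, internal measures can still give a rough indication of how well separated the clusters are. The sketch below is an optional addition rather than part of the original walkthrough: it computes the silhouette score on two synthetic, well-separated blobs (hypothetical data), where a value close to 1 is expected.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Hypothetical toy data: two well-separated blobs
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Silhouette ranges from -1 to 1; values near 1 mean compact, well-separated clusters
score = silhouette_score(X, labels)
print(round(score, 3))
```

Real customer data rarely separates this cleanly, so such a score is best read alongside the exploratory plots rather than as a substitute for them.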

# Plotting a countplot of the clusters
pal = ["#682F2F","#B9C0C9", "#9F8A78","#F3AB60"]
pl = sns.countplot(x=dataset["Clusters"], palette= pal)
pl.set_title("Distribution Of The Clusters")
plt.show()

Output:

The clusters seem to be fairly evenly distributed.

pl = sns.scatterplot(data = dataset,x=dataset["Spent"], y=dataset["Income"],hue=dataset["Clusters"], palette= pal)
pl.set_title("Cluster's Income and Spending Profile")
plt.legend()
plt.show()

Output:

The income vs. spending plot shows the cluster pattern:

Group 0: high spending and average income
Group 1: high spending and high income
Group 2: low spending and low income
Group 3: high spending and low income
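
A visual reading like the one above can be cross-checked numerically by aggregating per cluster. The sketch below uses a hypothetical six-row frame in place of the clustered customer data; the column names mirror those used in this walkthrough, but the values are invented.

```python
import pandas as pd

# Hypothetical toy frame standing in for the clustered customer data
toy = pd.DataFrame({
    "Clusters": [0, 0, 1, 1, 2, 2],
    "Income":   [50000, 54000, 80000, 84000, 25000, 27000],
    "Spent":    [900, 1100, 1500, 1700, 60, 100],
})

# Mean income and spending per cluster, plus the cluster size
profile = toy.groupby("Clusters").agg(
    mean_income=("Income", "mean"),
    mean_spent=("Spent", "mean"),
    customers=("Spent", "size"),
)
print(profile)
```

Applying the same aggregation to the real dataframe would give one row per cluster, which makes labels such as "high spending and average income" easier to verify.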

Next, we will look at the detailed distribution of the clusters across the different products in the data, namely: Wines, Fruits, Meat, Fish, Sweets, and Gold.

plt.figure()
pl=sns.swarmplot(x=dataset["Clusters"], y=dataset["Spent"], color= "#CBEDDD", alpha=0.5 )
pl=sns.boxenplot(x=dataset["Clusters"], y=dataset["Spent"], palette=pal)
plt.show()

Output:

It is evident from the plot above that cluster 1 is our largest group of customers, closely followed by cluster 0. We can explore what each cluster spends on in order to design targeted marketing strategies.

Next, let's look at how our past campaigns performed.

# Creating a feature that holds the total number of accepted promotions
dataset["Total_Promos"] = dataset["AcceptedCmp1"]+ dataset["AcceptedCmp2"]+ dataset["AcceptedCmp3"]+ dataset["AcceptedCmp4"]+ dataset["AcceptedCmp5"]
# Plotting the number of accepted campaigns overall.
plt.figure()
pl = sns.countplot(x=dataset["Total_Promos"],hue=dataset["Clusters"], palette= pal)
pl.set_title("Amount Of Accepted Promotions")
pl.set_xlabel("Total Number Of Promotions Accepted")
plt.show()

Output:

The campaigns have not received a strong response so far, with very few participants overall. Moreover, no customer accepted all five of them. Perhaps better-designed and better-targeted campaigns are needed to increase sales.

# Plotting the number of deals purchased
plt.figure()
pl=sns.boxenplot(y=dataset["NumDealsPurchases"],x=dataset["Clusters"], palette= pal)
pl.set_title("Number of Deals Purchased")
plt.show()

Output:

Unlike the campaigns, the deals did well. The best results came from clusters 0 and 3. However, cluster 1, which contains our top customers, is not very interested in the deals. Nothing seems to attract cluster 2 strongly.

# For more details on the purchasing behaviour
Places =["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases",  "NumWebVisitsMonth"]

for i in Places:
    # jointplot creates its own figure, so no plt.figure() call is needed
    sns.jointplot(x=dataset[i],y = dataset["Spent"],hue=dataset["Clusters"], palette= pal)
    plt.show()

Output:

Profiling

Now that the clusters have been formed and their purchasing patterns examined, let's see who belongs to each of them. We will profile the clusters to determine who our star customers are and who needs more attention from the retail store's marketing team.

To do that, we will plot some of the features that describe the customers' personal traits, grouped by the cluster they are in, and draw conclusions from the results.

Personal = [ "Kidhome","Teenhome","Customer_For_How_Much_Time", "Age", "Children", "Family_Size", "Is_Parent", "Education","Living_With"]

for i in Personal:
    # jointplot creates its own figure, so no plt.figure() call is needed
    sns.jointplot(x=dataset[i], y=dataset["Spent"], hue =dataset["Clusters"], kind="kde", palette=pal)
    plt.show()

Output:

Cluster Number 0:

Are definitely parents
Have at most four members in the family and at least two
Include single parents as a subset
Mostly have a teenager at home
Are relatively older

Cluster Number 1:

Are definitely not parents
Have at most two members in the family
Include slightly more couples than single people
Span all ages
Are a high-income group

Cluster Number 2:

Are mostly parents
Have at most three members in the family
Mostly have one kid (typically not a teenager)
Are relatively younger

Cluster Number 3:

Are definitely parents
Have at most five members in the family and at least two
Mostly have a teenager at home
Are relatively older
Are a lower-income group

We performed unsupervised clustering, using dimensionality reduction followed by agglomerative clustering. We formed four clusters and used them to profile customers according to their family structures, income levels, and spending habits. These profiles can be used to plan better marketing strategies.

In conclusion, customer segmentation is a critical aspect of marketing strategy, and machine learning has become an increasingly popular tool for automating the process. By using machine learning algorithms to process vast amounts of customer data, companies can quickly identify new trends and patterns, target specific customer segments with tailored promotions, and make more informed marketing decisions. With its ability to process data in real time, eliminate the need for manual analysis, and continuously improve over time, machine learning is a powerful tool for customer segmentation.
