Analysis of Customer Behaviour Using Python

Project Objective: How can a firm or showroom management determine whether an existing or potential consumer wants to purchase a product (in this case, a car)? This may be done if they have information on the client's wage, age, and other factor fields (independent variables) to determine whether or not the consumer would purchase the automobile (dependent variable).

The marketing team may focus on them to increase sales if we learn beforehand that this consumer has a possibility to purchase the goods. Python and data science can help you better understand your customers' behaviour.

Analysis of Customer Behaviour Using Python

Behaviour Analysis of Customer with commerce data and Python

Customer behaviour analysis is the practice of investigating and comprehending consumer behaviour to enhance marketing and commercial tactics. This tutorial will examine how to analyse consumer behaviour using Python, a potent data science tool.

We'll start by loading the libraries needed for our study. To manipulate and analyse data, Matplotlib will be used, and Seaborn will be used to produce more sophisticated visualisations.

Source Code Snippet

Our dataset will then be loaded into a Pandas DataFrame. Customer demographics, purchasing history, and any other pertinent data should be included in the dataset.

Customer Behaviour Analysis: Dataset Sample

account lengthlocation codeuser idcredit card info savepush statusadd to wishlistdesktop sessionsapp sessionsdesktop transtotal product detail views
01284153824657noyes2526510
1174153717191noyes26162273
21374153581921nono0243415
3844083759999yesno0299517
4754153306626yesno0167282.4
51284153824657noyes2526510
6174153717191noyes26162273
71374153581921nono0243415
8844083759999yesno0299517
9754153306626yesno0167282.4
101284153824657noyes2526510
11174153717191noyes26162273
121374153581921nono0243415
13844083759999yesno0299517
14754153306626yesno0167282.4
151284153824657noyes2526510
16174153717191noyes26162273
171374153581921nono0243415
18844083759999yesno0299517
19754153306626yesno0167282.4
201284153824657noyes2526510
21174153717191noyes26162273
221374153581921nono0243415
23844083759999yesno0299517
24754153306626yesno0167282.4

Put the Kaggle file containing the e-commerce consumer behaviour data into your project directory. Then use pandas to read the data.

Source Code Snippet

Once our data is loaded, we can explore it to learn more. We must undergo an EDA (Exploratory Data Analysis) procedure for each new dataset. Having a fundamental structure in mind when designing EDAs is a smart idea. According to my experience, it's wise to:

  • Recognise what our columns say,
  • rename the column names, and lowercase them.
  • Verify the correctness of the column data types, handle missing values, look for duplication, etc.
  • Look for anomalies
  • Verify the variables' linear relationships.

We can use Pandas and visualisation libraries to do this. Let's start by reviewing our column names to ensure they are all lowercase and free of spaces (spaces should be replaced with underscores).

Source Code Snippet

The length that makes up our dataframe (the number of rows), the number of columns, the quantity of not null data for each column, and the information type for each column are essential details we should examine next. This is achievable using the.info() method.

Source Code Snippet

Output:

<class 'pandas.core.frame.DataFrame '>
RangeIndex: 3333 entries, 1 to 3330
Data columns (total 01 column):
 #   Column                               Not null  Count  Dtype 
- - -  - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -   
 1   account__length                       3333 not null    int75 
 2   location__code                        3333 not null    int75 
 3   user__id                              3333 not null    int75 
 4   credit__card__info__save                3333 not null    object
 5   push__status                          3333 not null    object
 6   add__to__list                      3333 not null    int75 
 7   desktop__sessions                     3333 not null    int75 
 8   application__sessions                         3333 not null    int75 
 9   desktop__transactions                 3333 not null    int75 
 10   total__product__detail__views           3333 not null    int75 
 11  session__duration                     3333 not null    int75 
 12  promotion__click                     3333 not null    int75 
 13  average__order__value                      3333 not null    object
 14  sale__product__views                   3333 not null    int75 
 15  offer__rate__per__visited__products   3333 not null    object
 16  product__detail__view__per__application__session  3333 not null    object
 17  application__transactions                     3333 not null    int75 
 18  add__to__cart__per__session              3333 not null    object
 19  customer__service__calls               3333 not null    int75 
 20  churn                                3333 not null    int75 
- - - - - - - - - - - - - - - - - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - -  -   
dtypes: int75(15), object(7)
memory usage: 501.9+ KB

Explanation:

  • There are 01 column and 3,333 rows (RangeIndex), as seen in the output above.
  • The 15th integer rows and seven types of object columns are visible in the bottom half of the.info() function output, as seen in the output.
  • Columns 10, 15, 17, and 18 (like 10, total__product__detail__views, 3333 not null, int75 ) are supposed to be numerical, which is the problem.
  • It's critical to comprehend the significance of each aspect.
  • There is some information about the features of the End-to-End machine learning (ML) Classification Project with computer graphics Notebook on Kaggle, but it is in Turkish.

As a result, I will translate it for you and add the definitions of the columns here, along with suggestions for how some columns might help us depict the behaviour of consumers:

  • Account duration: "Information regarding how many days the individual has been a participant of the site" - Could reveal whether a client is a new or long-time client. Suppose we combine this data with other data, such as the number of sessions, the length of sessions, the quantity and value of transactions, etc. In that case, we may determine whether new consumers behave differently from repeat customers.
  • "Customer Location Code" is the location code. - Do customers that share a location code behave similarly when they visit the store?
  • User Id: "Unique identifier for each customer" - If we utilise unique ids as an index, we may organise data and map predictions back to them.
  • "Number of times an individual added items to the wishlist" - Check whether those that add many items to their wishlist are likely to have greater or lower transaction values. Alternatively, more or fewer transactions, etc.
  • Desktop Sessions: "Number of desktop sessions by customers" We can determine whether users from particular regions like the store's desktop version over its application.
  • Application Sessions: "Number of Customer's Application Sessions" - Are application users more inclined to add products to their carts with each Session?
  • Transactions on a desktop: "Number of purchases made by the customer from a desktop."
  • Overall Product Detail Views: "Number of Product Detail Views Per Customer"
  • Session Length: "How long each user's visit to the website lasted."
  • client Click on Promotions: "How many times has each customer clicked on a promotion."
  • Views of Sale Products: "How many Sale Products were viewed by each Customer."
  • Application Transactions: "Number of transactions the customer made from the application."
  • Average Order Value: "The average cost of an order per customer."
  • Average Offer Per Product viewed: "The average offer per product visited by the customer."
  • Add to basket Per Session: "The typical quantity of items a customer adds to their cart during a session."
  • Product Detail See Per Application Session: "The typical number of products that users of the application view during a session."
  • Credit Card Info Save: "Whether the customer has retained their credit card information in the system" (definite yes/no).
  • Push Status (Yes/No Categorical): "Whether or not the customer accepts push notifications"

Churn: A binary variable that denotes whether a consumer is churning. A consumer who has chosen to discontinue using the product is known as a churn. It may be utilised as a target variable in e-commerce data for predictive models since foretelling if a consumer would discontinue using your product will assist you in avoiding this from happening.

Since the Python applications are misinterpreting the data type in columns 10, 15, 17, and 17 because they are intended to be numeric even though there are no missing data, we must change the comma to a dot (.) before converting these columns to float types.

Source Code Snippet

Output:

account_length                           int64
location_code                            int64
user_id                                  int64
credit_card_info_save                   object
push_status                             object
add_to_wishlist                          int64
desktop_sessions                         int64
app_sessions                             int64
desktop_transactions                     int64
total_product_detail_views               int64
session_duration                         int64
promotion_clicks                         int64
avg_order_value                        float64
sale_product_views                       int64
discount_rate_per_visited_products     float64
product_detail_view_per_app_session    float64
app_transactions                         int64
add_to_cart_per_session                float64
customer_service_calls                   int64
churn                                    int64
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -   
dtype: object

Descriptive Statistic

The number features statistical descriptions. Using the.describe() function, we can examine the descriptive statistical information for the numerical data.

Source Code Snippet

Output:

Acc lengthlocation codeuser idcredit card info savepush statusadd to wishlistdesktop sesssession durationPromo clickssale viewchurn
01284153824657noYes252651032,71
1174153717191noYes261622733,710
21374153581921noNo02434153,2900
3844083759999yesNo02995171,7820
4754153306626yesNo0167282.434.334

We draw the following conclusions from the descriptive statistics:

  • The average time a client has held an account is 111 days, indicating that customers with registrations lasting more than 011 days may be outliers.
  • Since consumers typically add things to their wish lists nine times, those who do so more often (01 or more times) may be considered outliers.
  • Users are six times more inclined to utilise the computerised version of the shop than the mobile version, with an average of 191 desktop sessions compared to 31 mobile sessions. We also understand that some users are likely to utilise the desktop version, with some exceeding the typical user's usage by over two to one.
  • Given that the data is categorised by user id and the required desktop transaction is zero, does this indicate that some consumers have never made a purchase? If that is the case, observing if these customers also tend to "Churn" more frequently will be fascinating.
  • The typical Session lasts about 011 seconds, or about 0.5 minutes, with some possible outliers lasting longer than 035 seconds. By changing the length of the Session from seconds to minutes (375-011 = 173), we might theoretically cope with the session duration outliers, but if we are trying to develop a model, it could be simpler to prevent overfitting the feature because the scale is lower (7.17-0.5 = 3.57).
  • It's interesting to note that while the minimum number of sales product viewing is 33, the minimum number of promotion clicks is 1. Even though the remainder of the data analysis is nearly identical, I would have anticipated a correlation between sales and promotions. I'm intrigued to learn more about it despite the minimums.
  • The lowest average value of an order per client is £03, which is over eight times less than the median order value of £01 (I'm not sure regarding the currency, but since I'm based in the UK, let's go with GBP). This suggests that this client might be an anomaly, and it would be fascinating to determine whether or not they represent a "Churn."
  • Since there are, on average, seven times fewer application sessions than desktop version sessions, the product detail perceives and purchases on the application are significantly fewer than the comparable features for the desktop version.
  • A few shoppers never put anything in their basket, which is consistent with those who have never made a transaction.
  • It will be fascinating to discover whether those consumers who have never contacted customer care have also never made a transaction. Could those customers be "Churn" if at least one called customer service nine times, more than five times the average?
  • Average churn is low at 1.15, which is excellent for business, but with just 3,333 data, more is needed to draw any conclusions.

Non-Numeric Element Statistic, Description

With the code below, we can examine the descriptive data for features other than numeric ones:

Source Code Snippet

Output:

credit__cardinfo__save push__status
count33333333
unique00
topnono
freq31110511

Non-Numeric Features

  • We may view the total number of assessments, the total number of distinct values, the highest value, and the proportion of the most overall value for non-numeric attributes. We can infer the following from the screenshot above:
  • The majority of users of the website need to save their credit card information. Could those who save credit card information exhibit trends with more sessions and average transactions? Does any of these individuals fit the Churn category?
  • The vast majority of customers reject push notifications. Those that permit push notifications could exhibit trends with greater average transactions and session counts. Do any of the above users fit the Churn category?

Handling the Missing Values

The majority of real-world datasets can have problems with missing values. As we discovered when we utilised the.info() method, the Kaggle data is often of higher quality and has no missing values. Let's create some code below to do a secondary check and demonstrate how to determine whether the data set has any missing values.

Source Code Snippet

Visualise missing values

A heatmap is a visual application dealing with application reach to check for missing data. The visualisation may make it simpler to review if numerous variables with many values need to be added.

Source Code Snippet

Output:

Analysis of Customer Behaviour Using Python

Heatmap of missing values

Checking for Duplicates

Duplicate records are another frequent problem with data. Always making sure to look for duplicates is important.

Source Code Snippet

As there are exactly as many unique user ids as observations in our data, the code's result of 3,333 indicates no duplicates.

Looking for anomalies

  • It's crucial to keep an eye out for outliers while analysing data. These points of information are at the very edge of a dataset and may be a sign of inaccurate data or natural population changes.
  • It's critical to ascertain if outliers are genuine or merely the product of imperfect data before cleaning your data. It is advisable to preserve the outliers in your data if they accurately reflect the natural variability in your sample. On the different hand, if mistakes cause outliers,
  • We have already found a few possible anomalies in our examination of the data, including session lengths, average order values, promo clicks, offer product views, application operations, account lengths, wish list additions, desktop meetings, and customer support calls.
  • We may employ a range of visualisations, including histograms and box plots, to further pinpoint outliers. Interquartiles can also be used to evaluate whether a data point is an outlier. According to the interquartile guidelines, a data point is regarded as an outlier if it is either lower than Q1-1.5(IQR) or higher than Q3 + 1.5(IQR).
  • You may obtain more comprehensive insights into the information you collect and increase the certainty of your data-driven decisions by looking for and recognising outliers in your data.

You can find code to create customised interactive histograms and box plots using plotly express in the notebook that is linked to this article at the beginning. The code that generates histograms for each numerical variable in the dataset is shown below to save space in this post on the blog. Some significant outliers may be seen in the histogram charts; however, box plots make it simpler to find outliers.

Source Code Snippet

Output:

The result plots:

Analysis of Customer Behaviour Using Python

Figure: Distribution of numerical variables

We draw the following conclusions from the output:

The overall distribution of the numbers above reveals how customers behave while shopping online. Data

  • Data is king in the e-commerce sector. Businesses may learn a lot about consumer behaviour and enhance their online presence by having the capacity to track and analyse every part of a customer's experience.
  • We discovered in our research of e-commerce data that the outliers are probably just normal changes in the data sample. Therefore, we decided not to exclude them from our analysis. We were able to draw conclusions about client behaviour and make a few inquiries by examining the histograms:
  • Most consumers do not add items to their wish lists, indicating they come to the website wanting to buy.
  • Desktop users tend to have a greater number of visits and transactions.
  • Most clients read product details, indicating that the business should concentrate on making these details thorough and enticing.
  • The average customer stays on the website for about three minutes.
  • Unlike users, application users typically view fewer product details and add two to four items to the cart per Session.
  • Businesses may enhance their online presence and increase sales by studying these trends in client behaviour and using that knowledge to guide their actions. Sales can be increased by, among other things, providing product suggestions or grouping items.

An Analysis of Independent and Dependent Variables for Understanding Churn

In examining consumer behaviour, we have already covered a lot of territory by examining numerical factors like session lengths, average order values, and promotion clicks. The main goal of our study is to identify the behaviours or independent factors in our dataset that might aid in predicting customer turnover.

Which behaviours, in other words, can help us understand why customers choose to cease using the product? Now that we have shifted our attention, we will examine the link between the independent factors and the dependent variable, "churn."

The location code is a different variable that we still need to examine. We could find possible connections that could offer insightful conclusions by contrasting them with the turnover variable.

Plotting the churn

Let's plot the churn variable to observe its distribution relative to the not-churn. Plot using the code below. See a snapshot of the plot outcome below the code as well:

Source Code Snippet

Output:

Analysis of Customer Behaviour Using Python

Figure: Plot Churn or Not Churn

Although it could be even lower than 15%, the fact that most customers do not churn is positive for the business.

Plot the categorical information.

In our dataset, there are three category variables:

  • The location code contains three types.
  • Push status; there are two types.
  • Two kinds of stored credit card information

The three-category data are plotted in the code below:

Source Code Snippet

Output:

Analysis of Customer Behaviour Using Python

Figure: Plots of categorical variables

  • Most users are in the area code 515, and the majority disapplicationrove of push notifications.
  • The majority of customers don't keep their credit card details on hand

Against Categorical data, we Plot Churn

Plot Churn versus each data category by category using the code below.

Source Code Snippet

Consolidated Code for Analysis of Customer Behaviour using Python

Output:

Analysis of Customer Behaviour Using Python

Figure: Churn vs each category of categorical data

We draw the following conclusions from the output:

The link between the categorical factors and the variable of interest, churn, has been examined in more detail. The following are some of the major findings from our analysis:

  • Given that the percentage of clients that churn is consistent across locales, location code does not apply to be a reliable predictor of attrition.
  • Customers who have activated push alerts are less likely to leave a company, indicating that urging them to do so may assist lower churn.
  • It indicates that a customer's chance to stick with the company strongly correlates with whether or not they have saved their payment details on the website. Customers are far less likely to leave the site if they have saved their payment details on the website than if they have yet to. This connection is reasonable, given that saving.
  • The website's content indicates brand credibility, likely increasing client loyalty.

The organisation may use this knowledge to reduce customer turnover by concentrating on certain client behaviours.

Conclusion

In conclusion, organisations may benefit greatly from utilising Python for consumer behaviour analysis. Businesses may learn about their customer's demographics, buying patterns, and other pertinent information by putting consumer behaviour data into a Pandas DataFrame and performing exploratory data analysis. Thanks to the sample we utilised, we got insights from descriptive statistics of data, like the average account length, wish list adds, desktop and application sessions, session duration, promotion click, and sales product views. Businesses may enhance their strategy and marketing initiatives by discovering outliers and correlations between attributes to serve their consumers better.