Retail Cost Optimization using Python

To maximize sales and profit, it is important to determine the best selling cost for goods and services. This tutorial is for those who want to understand how to use machine learning to optimize retail costs. It walks you through the task of retail cost optimization with machine learning using Python.

Retail Cost Optimization: Finding the ideal equilibrium between the cost you charge for your items and the number of units you can market at that cost is the key to optimizing retail pricing.

The ultimate goal is to set a price that lets you profit the most while still attracting enough customers to buy your goods. Finding the optimum cost that maximizes your sales and profits while maintaining customer satisfaction involves using data and pricing strategies.

Therefore, you need information on product pricing, service costs, and everything else that influences product costs to complete the retail cost optimization process. For this purpose, we have located the perfect dataset.

In the following part, we'll walk you through retail cost optimization using machine learning.

Retail Cost dataset

Pricing is extremely important in the fiercely competitive retail sector, both to draw customers and to increase profitability. Pricing decisions are complicated and affected by several variables, including market demand, competition, targeted profit margins, and cost of goods sold (COGS). To maximize sales while retaining profitability, businesses must optimize their pricing methods.

The dataset used here was published on Kaggle and focuses on the optimization of retail costs. Once its columns have been reviewed, we will begin the retail cost optimization task by importing the appropriate Python libraries and loading the data.

The dataset's columns are listed below:

product_id1: A unique identifier for every item in the dataset.
product_category_name1: The name of the product category that the item falls under.
month_year: The month and year of the retail transaction, or the date the data was recorded.
qty: The quantity of the product sold or bought in a certain transaction.
total_cost1: The total cost of the item, including any relevant taxes or discounts.
freight_cost: The cost of the product's shipment or freight.
unit_cost: The cost of a single product unit.
product_name_lenght: The number of characters in the product name (the column name is spelled this way in the dataset).
product_description_lenght: The number of characters in the product description (again spelled this way in the dataset).
product_photos_qty: The number of product photos available for the item.
product_weight_g: The weight of the item in grams.
product_score: A rating or score based on a product's popularity, quality, or other important aspects.
customers: The total number of buyers of the goods in a specific transaction.

Main table (for reference only):

product_id1	product_category_name1	month_year	qty	total_cost1	freight_cost	unit_cost
sofa1	sofa_bath_table	########	1	46.26	16.1	46.26
sofa1	sofa_bath_table	########	4	148.86	12.24444	46.26
sofa1	sofa_bath_table	########	6	286.8	14.84	46.26
sofa1	sofa_bath_table	########	4	184.8	14.2886	46.26
sofa1	sofa_bath_table	########	2	21.2	16.1	46.26
sofa1	sofa_bath_table	########	4	148.86	16.1	46.26
sofa1	sofa_bath_table	########	11	446.86	16.84284	41.6418
sofa1	sofa_bath_table	########	6	242.24	16.24	42.22
sofa1	sofa_bath_table	########	12	862.81	16.64468	42.22
sofa1	sofa_bath_table	########	18	812.82	14.84244	42.22
sofa1	sofa_bath_table	########	18	682.84	16.46246	42.22
sofa1	sofa_bath_table	########	14	612.88	14.24616	42.22
sofa1	sofa_bath_table	########	12	862.81	11.26642	42.22
sofa1	sofa_bath_table	########	6	122.26	14.228	42.22
sofa1	sofa_bath_table	########	8	412.22	21.4186	42.22
sofa1	sofa_bath_table	########	8	414.22	16.44486	42.24
garden6	garden_tools	########	6	412.4	42.68	62.2
garden6	garden_tools	########	4	248.2	44.21668	82.644
garden6	garden_tools	########	21	1266	42.8286	28.6882

The remaining fields are described below:

weekday_1: The day of the week when the transaction occurred.
weekend: A binary marker indicating whether the transaction occurred during the weekend (1) or not (0).
holiday: A binary flag indicating whether the transaction took place on a holiday (1) or not (0).
month: The month in which the transaction took place.
year: The year the transaction took place.
s: The effect of seasonality.
comp_1, comp_2, comp_4: Details or variables about competitors' offers, pricing, or other pertinent elements.
ps1, ps2, ps4: Product scores or ratings of the competitors' items.
fp1, fp2, fp4: Freight or shipment costs of the competitors' items.

Using this data will enable you to create a data-driven pricing optimization plan that maximizes revenue.

Continuation of Main Table (for reference only):

weekday_1	weekend	holiday	month	year	s	volume	comp_1	ps1	fp1
8	1	6	2118	11.26842	4811	82.2	4.2	16.1112
8	1	6	2118	6.614116	4811	82.2	4.2	14.86222
11	1	8	2118	12.18166	4811	82.2	4.2	14.22484
8	1	8	2118	2.224884	4811	82.2	4.2	14.66686
2	1	2	2118	6.666666	4811	82.2	4.2	18.88662
2	2	11	2118	8.444444	4811	82.2	4.2	21.68214
8	4	11	2118	41.66666	4811	82.2	4.2	16.224
11	1	12	2118	16.66668	4811	88.48824	4.2	18.82844
8	2	1	2118	18.86811	4811	86.2	4.2	12.48464
8	2	2	2118	16.82244	4811	86.2	4.2	12.21212
2	1	4	2118	16.88886	4811	86.2	4.2	12.28246
2	1	4	2118	12.14264	4811	86.146	4.2	12.24
8	4	6	2118	11.26842	4811	84.64262	4.2	16.88148
2	1	6	2118	6.614116	4811	82.2	4.2	24.11666
2	1	8	2118	12.18166	4811	88.24444	4.2	12.262
8	1	8	2118	2.224884	4811	84	4.2	18.26681
8	1	4	2118	2.118144	12666	62.2	4.1	42.68
11	2	4	2118	8.688681	12666	82.64444	4.1	44.21668
8	1	6	2118	12.14862	12666	62.2	4.1	12.8426

Reading the Data

Source Code Snippet

import pandas as pdd
import plotly.express as pxx
import plotly.graph_objects as go
import plotly.io as pioq
pioq.templates.default  =  "plotly_white"

information  =  pdd.read_csv('retail_cost1.csv')
print(information.head())

Output:

  product_id1 product_category_name1  month_year  qty  total_cost1  \
1       sofa1        sofa_bath_table  11-16-2118    1        46.26   
1       sofa1        sofa_bath_table  11-16-2118    4       148.86   
2       sofa1        sofa_bath_table  11-18-2118    6       286.81   
4       sofa1        sofa_bath_table  11-18-2118    4       184.81   
4       sofa1        sofa_bath_table  11-12-2118    2        21.21   

   freight_cost  unit_cost  product_name_lenght  product_description_lenght  \
1      16.111111       46.26                   42                         161   
1      12.244444       46.26                   42                         161   
2      14.841111       46.26                   42                         161   
4      14.288611       46.26                   42                         161   
4      16.111111       46.26                   42                         161   
   product_photos_qty  ...  comp_1  ps1        fp1      comp_2  ps2  \
1                   2  ...    82.2  4.2  16.111828  216.111111  4.4   
1                   2  ...    82.2  4.2  14.862216  212.111111  4.4   
2                   2  ...    82.2  4.2  14.224844  216.111111  4.4   
4                   2  ...    82.2  4.2  14.666868  122.612814  4.4   
4                   2  ...    82.2  4.2  18.886622  164.428811  4.4   
         fp2  comp_4  ps4        fp4  lag_cost  
1   8.861111   46.26  4.1  16.111111      46.21  
1  21.422111   46.26  4.1  12.244444      46.26  
2  22.126242   46.26  4.1  14.841111      46.26  
4  12.412886   46.26  4.1  14.288611      46.26  
4  24.424688   46.26  4.1  16.111111      46.26  
[5 rows x 30 columns]

Before continuing, let's check to see if the information contains null values:

Source Code Snippet

print(information.isnull().sum())

Output:

product_id1                    1
product_category_name1         1
month_year                    1
qty                           1
total_cost1                   1
freight_cost                 1
unit_cost                    1
product_name_lenght           1
product_description_lenght    1
product_photos_qty            1
product_weight_g              1
product_score                 1
customers                     1
weekday_1                       1
weekend                       1
holiday                       1
month                         1
year                          1
s                             1
volume                        1
comp_1                        1
ps1                           1
fp1                           1
comp_2                        1
ps2                           1
fp2                           1
comp_4                        1
ps4                           1
fp4                           1
lag_cost                     1
dtype: int64

Let's now examine the information's descriptive statistics:

Source Code Snippet

print(information.describe())

Output:

              qty   total_cost1  freight_cost  unit_cost  \
count  686.111111    686.111111     686.111111  686.111111   
mean    14.426662   1422.818828      21.682281  116.426811   
std     16.444421   1811.124111      11.181818   86.182282   
min      1.111111     12.211111       1.111111   12.211111   
26%      4.111111    444.811111      14.861212   64.211111   
61%     11.111111    818.821111      18.618482   82.211111   
86%     18.111111   1888.422611      22.814668  122.221111   
max    122.111111  12126.111111      82.861111  464.111111   

       product_name_lenght  product_description_lenght  product_photos_qty  \
count           686.111111                  686.111111          686.111111   
mean             48.821414                  868.422418            1.224184   
std               2.421816                  666.216116            1.421484   
min              22.111111                  111.111111            1.111111   
26%              41.111111                  442.111111            1.111111   
61%              61.111111                  611.111111            1.611111   
86%              68.111111                  214.111111            2.111111   
max              61.111111                 4116.111111            8.111111   

       product_weight_g  product_score   customers  ...      comp_1  \
count        686.111111     686.111111  686.111111  ...  686.111111   
mean        1848.428621       4.186614   81.128118  ...   82.462164   
std         2284.818484       1.242121   62.166661  ...   48.244468   
min          111.111111       4.411111    1.111111  ...   12.211111   
26%          448.111111       4.211111   44.111111  ...   42.211111   
61%          261.111111       4.111111   62.111111  ...   62.211111   
86%         1861.111111       4.211111  116.111111  ...  114.266642   
max         2861.111111       4.611111  442.111111  ...  442.211111   
              ps1         fp1      comp_2         ps2         fp2      comp_4  \
count  686.111111  686.111111  686.111111  686.111111  686.111111  686.111111   
mean     4.162468   18.628611   22.241182    4.124621   18.621644   84.182642   
std      1.121662    2.416648   42.481262    1.218182    6.424184   48.846882   
min      4.811111    1.126442   12.211111    4.411111    4.411111   12.211111   
26%      4.111111   14.826422   64.211111    4.111111   14.486111   64.886814   
61%      4.211111   16.618284   82.221111    4.211111   16.811866   62.211111   
86%      4.211111   12.842611  118.888882    4.211111   21.666248   22.221111   
max      4.611111   68.241111  442.211111    4.411111   68.241111  266.611111   
              ps4         fp4   lag_cost  
count  686.111111  686.111111  686.111111  
mean     4.112181   18.266118  118.422684  
std      1.244222    6.644266   86.284668  
min      4.611111    8.681111   12.861111  
26%      4.211111   16.142828   66.668861  
61%      4.111111   16.618111   82.211111  
86%      4.111111   12.448888  122.221111  
max      4.411111   57.231111  364.111111  

[8 rows x 27 columns]

Let's now examine how the product costs were distributed:

Source Code Snippet

figure = pxx.histogram(information,   x = 'total_cost1', 
                   nbins = 21,    title = 'Distribution of Total Cost')
figure.show()

Output:

Let's now examine the unit cost distribution using the following plot:

Source Code Snippet

figure = pxx.box(information,   y = 'unit_cost',  title = 'Box Plot of Unit Cost')
figure.show()

Output:

Let's now examine the correlation between quantity and overall pricing:

Source Code Snippet

figure = pxx.scatter(information,  x = 'qty',  y = 'total_cost1',   title = 'Quantity vs Total Cost', trendline = "ols") 
figure.show()

Output:

The relationship between quantity and total cost is therefore linear. This implies that the pricing strategy is based on a fixed unit cost, with the total cost obtained by multiplying the quantity by the unit cost.
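One quick way to check this reading is to divide the total cost by the quantity and compare the result with the recorded unit cost. The sketch below uses a tiny hand-made DataFrame (the values are illustrative, not taken from the dataset) with the same column names as the tutorial; `implied_unit_cost` and `max_gap` are names introduced only for this example.

```python
import pandas as pd

# Illustrative rows only; column names mirror the dataset used in this tutorial.
sample = pd.DataFrame({
    "qty": [1, 3, 6, 4],
    "unit_cost": [45.95, 45.95, 45.95, 45.95],
})
sample["total_cost1"] = sample["qty"] * sample["unit_cost"]

# If pricing follows a fixed unit cost, total / qty should reproduce unit_cost.
sample["implied_unit_cost"] = sample["total_cost1"] / sample["qty"]
max_gap = (sample["implied_unit_cost"] - sample["unit_cost"]).abs().max()
print(f"Largest gap between implied and recorded unit cost: {max_gap:.6f}")
```

On the real data, a gap that is large for some rows would suggest discounts, taxes, or a unit cost that changed within the period.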

Let's now examine the average overall pricing for the various product categories:

Source Code Snippet

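# Average the total cost per category first; passing raw rows would stack (sum) them
avg_total_by_category = information.groupby('product_category_name1')['total_cost1'].mean().reset_index()
figure = pxx.bar(avg_total_by_category, x = 'product_category_name1', y = 'total_cost1',
             title = 'Average Total Cost by Product Category')
figure.show()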

Output:

Let's now use a box plot to examine the variation of total costs by weekday_1:

Source Code Snippet

figure = pxx.box(information, x = 'weekday_1',  y = 'total_cost1',  title = 'Box Plot of Total Cost by Weekday_1')
figure.show()

Output:

Let's now examine the box plot used to display the breakdown of total costs per holiday:

Source Code Snippet

figure = pxx.box(information, x = 'holiday',  y = 'total_cost1',  title = 'Box Plot of Total Cost by Holiday')
figure.show() 

Output:

Let's now examine the relationship between the numerical characteristics:

Source Code Snippet

# Correlate numeric columns only; the identifier and date columns are strings
correlation_matrix = information.select_dtypes(include = 'number').corr()
figure = go.Figure(go.Heatmap(x = correlation_matrix.columns, y = correlation_matrix.columns, z = correlation_matrix.values))
figure.update_layout(title = 'Correlation Heatmap of Numerical Features')
figure.show()

Output:

Optimizing retail costs requires a thorough examination of competitors' pricing tactics. Depending on the retailer's positioning and strategy, monitoring and benchmarking against competitors' costs can help identify opportunities to price competitively, either below or above the competition. Let's now compute the average cost difference from the competitor for each product category:

Source Code Snippet

# Difference between our unit cost and the first competitor's cost
information['comp_cost_diff'] = information['unit_cost'] - information['comp_1']
avg_cost_diff_by_category = information.groupby('product_category_name1')['comp_cost_diff'].mean().reset_index()
figure = pxx.bar(avg_cost_diff_by_category, x = 'product_category_name1',
             y = 'comp_cost_diff', title = 'Average Competitor Cost Difference by Product Category')
figure.update_layout(
    xaxis_title = 'Product Category',
    yaxis_title = 'Average Competitor Cost Difference')
figure.show()

Output:

Well-Known Optimization Methods

Traditionally, marketers made most of their pricing decisions intuitively, paying little attention to consumer behavior, market trends, the impact of promotions and holidays, or how sensitively each item's sales responded to cost. Thanks to advances in computing power that allow enormous amounts of data to be analyzed over time, most firms now use big data technology to optimize pricing decisions. This yields a more competitive cost while ensuring that clearance, revenue, or margin objectives are met.

Determining the best cost or discount rate can be difficult for several reasons. One is the intricate structure of pricing, which frequently involves several elements that must be optimized together, such as price lists, discounts, and special offers. Another is the complexity of demand and profit forecasts, which makes new pricing strategies hard to evaluate. Choosing the right modelling and optimization methods is a further technical difficulty.

The process for determining costs that maximize one metric while keeping another above a minimum is described below. It considers consumer behavior, holidays, competitor pricing, the impact of cannibalization, the effectiveness of promotions, and most importantly, how to set costs given these factors. For instance, a shop might need to keep margins above a minimum of 21% while maximizing the sale of winter goods during the summer.

To help with implementation, I have also included the relevant code in several parts.

It is usually worthwhile to write as many steps as possible in PySpark to speed up the code for large-data operations. PySpark's libraries are continually being created and enhanced, and the preprocessing (common computations and aggregations) and generalized linear modelling libraries are mature and quite useful. Since PySpark currently lacks fully tested optimization libraries that meet the needs of our situation, the initial stages are done in PySpark and the optimization step in Python.

For users who don't have a Spark setup, or whose data is small enough, the same functionality can be implemented in Python with the appropriate syntax adjustments.
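For readers without a Spark setup, the "common computations and aggregations" step might look like the following in pandas. This is only a minimal sketch on a tiny made-up table: the rows are invented, the column names follow the dataset described earlier, and `avg_unit_cost` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up transactions; column names follow the dataset described earlier.
transactions = pd.DataFrame({
    "product_id1": ["sofa1", "sofa1", "garden6", "garden6"],
    "month": [5, 5, 5, 6],
    "qty": [1, 3, 2, 4],
    "total_cost1": [45.95, 137.85, 120.0, 240.0],
})

# Aggregate to one row per product and month.
monthly = (
    transactions
    .groupby(["product_id1", "month"], as_index=False)
    .agg(qty=("qty", "sum"), total_cost1=("total_cost1", "sum"))
)

# Average cost actually paid per unit in each product-month.
monthly["avg_unit_cost"] = monthly["total_cost1"] / monthly["qty"]
print(monthly)
```

The same group-and-aggregate pattern maps fairly directly onto a PySpark `groupBy(...).agg(...)` call when the data no longer fits in memory.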

Clustering

This stage can be applied in one of two ways. 1) Group stores with comparable product elasticity and customer behavior to reduce the number of models and address the issue of data sparsity. Each key is then modeled per store cluster rather than per store. 2) If computing capacity isn't an issue, the cluster identifier can be used as a feature that helps the model learn from related stores.

To determine the clusters based on the information at hand, k-means clustering methods can be utilized.
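As a rough illustration of the k-means step, the sketch below clusters simulated stores with scikit-learn. The two features (an elasticity estimate and an average basket size), the choice of two clusters, and all values are assumptions made for the example, not properties of the dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical store-level features: [estimated elasticity, average basket size].
group_a = rng.normal(loc=[-1.0, 20.0], scale=[0.1, 2.0], size=(20, 2))
group_b = rng.normal(loc=[-2.5, 60.0], scale=[0.1, 2.0], size=(20, 2))
store_features = np.vstack([group_a, group_b])

# Scale first so that no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(store_features)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
store_cluster = kmeans.fit_predict(scaled)
print(store_cluster)
```

The resulting `store_cluster` labels can then be used either to fit one model per cluster or as an extra feature, matching the two options described above.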

Modelling

Employing a cost elasticity model is typically recommended since the coefficients may be utilized to create the optimization equations and determine if the appropriate characteristics are given the appropriate weight.

Cost elasticity measures the percentage change in the quantity demanded in response to a one percent change in cost. When you plot the log of volume against the log of cost, the slope of the fitted line is the elasticity coefficient.
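To make the log-log idea concrete, here is a small sketch that simulates demand with a known elasticity and recovers it as the slope of a straight-line fit. The elasticity of -1.5, the intercept, and the noise level are all assumed values chosen for the simulation.

```python
import numpy as np

rng = np.random.default_rng(42)

true_elasticity = -1.5  # assumed value used only to simulate demand
cost = rng.uniform(20.0, 100.0, size=200)
noise = rng.normal(0.0, 0.05, size=200)
volume = np.exp(8.0 + true_elasticity * np.log(cost) + noise)

# The slope of log(volume) regressed on log(cost) is the elasticity estimate.
slope, intercept = np.polyfit(np.log(cost), np.log(volume), deg=1)
print(f"Estimated elasticity: {slope:.2f}")
```

A slope below -1 indicates elastic demand, where a cost increase reduces volume by a larger percentage than the increase itself.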

Before fitting the model, you should standardize or normalize the values and take any additional preprocessing steps needed to ensure the data complies with the linear regression assumptions.

Depending on the necessity for variable selection, you can start by testing a general linear regression (OLS) before moving on to ridge/lasso models.
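The sketch below shows one possible way to run that comparison with scikit-learn, standardizing the features inside a pipeline and fitting OLS, ridge, and lasso in turn. The simulated features, the true coefficients, and the `alpha` values are assumptions for illustration and would need tuning on real data.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300

# Simulated features: log cost, a holiday flag, and one irrelevant column.
log_cost = rng.normal(4.0, 0.3, n)
holiday = rng.integers(0, 2, n).astype(float)
irrelevant = rng.normal(0.0, 1.0, n)
X = np.column_stack([log_cost, holiday, irrelevant])
y = 8.0 - 1.5 * log_cost + 0.2 * holiday + rng.normal(0.0, 0.05, n)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
}

coefs = {}
for name, estimator in models.items():
    # Standardize inside the pipeline so the scaling is learned from the fit data.
    pipe = make_pipeline(StandardScaler(), estimator)
    pipe.fit(X, y)
    coefs[name] = pipe.steps[-1][1].coef_
    print(name, np.round(coefs[name], 3))
```

Because the features are standardized, the coefficients are on a common scale. Lasso tends to push the irrelevant column toward zero, which is the variable-selection behavior mentioned above.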

The coefficients that result from the model provide you with the parameters of the equation once you have finished choosing and tweaking the model.

Consequently, your final equation for every single item will look like this:

log(volume1) = cost_elasticity * log(cost1) + β2 * cannibal1_cost1 + β3 * cannibal2_cost1 + β4 * holidayflag2 + … (further terms depending on variable importance)    (Equation 1)
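Once coefficients of this kind are available, each candidate cost can be scored by the volume the model predicts for it. The sketch below uses a plain grid search to pick the cost that maximizes profit while keeping the margin above a minimum. The coefficients, the unit purchase cost, the choice of profit as the objective, and the grid itself are all assumptions for illustration; only the 21% minimum margin comes from the example given earlier in the text.

```python
import numpy as np

# Assumed coefficients in the spirit of Equation 1 (illustrative values only).
intercept = 12.0            # hypothetical constant term
cost_elasticity = -2.5      # hypothetical own-cost elasticity
unit_purchase_cost = 30.0   # hypothetical cost of goods sold per unit
min_margin = 0.21           # minimum margin from the example in the text


def predicted_volume(selling_cost):
    """Volume implied by log(volume) = intercept + elasticity * log(cost)."""
    return np.exp(intercept + cost_elasticity * np.log(selling_cost))


# Candidate selling costs to evaluate.
candidates = np.arange(31.0, 120.0, 0.1)
margin = (candidates - unit_purchase_cost) / candidates
profit = (candidates - unit_purchase_cost) * predicted_volume(candidates)

# Keep only candidates that satisfy the margin constraint, then take the best.
feasible = margin >= min_margin
best_index = int(np.argmax(np.where(feasible, profit, -np.inf)))
best_cost = candidates[best_index]
print(f"Best feasible cost: {best_cost:.2f}, margin: {margin[best_index]:.2%}")
```

Swapping the objective, for example maximizing `predicted_volume(candidates)` instead of `profit`, only changes the array passed to `np.argmax`. A dedicated solver would be a natural next step when several products are optimized jointly.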

Cannibalization

Cannibalization is a well-established dynamic. It describes the decline in sales (in both units and dollars) of a company's existing products due to the launch of a new product or the presence of comparable items. One common illustration is how sales of a product from brand A are absorbed by a similarly priced product from brand B. Cannibalization is a fairly pronounced effect that is frequently ignored when modelling for cost optimization. The idea is that including in the model at least the cost or discount percentage of the top five cannibals of each product reveals a lot about the effect through the coefficients and drives the cost in the appropriate direction during optimization.
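A minimal sketch of this idea, assuming a single cannibal product and simulated data: the cannibal's cost enters the regression as an extra column, and the sign of its coefficient indicates the direction of the effect. The demand equation and every number below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400

own_cost = rng.uniform(20.0, 60.0, n)
cannibal_cost = rng.uniform(20.0, 60.0, n)  # cost of a competing, similar item

# Assumed demand: a higher own cost lowers volume, a pricier cannibal raises it.
log_volume = (
    6.0
    - 1.2 * np.log(own_cost)
    + 0.01 * cannibal_cost
    + rng.normal(0.0, 0.05, n)
)

# Ordinary least squares with an intercept, log own cost, and cannibal cost.
design = np.column_stack([np.ones(n), np.log(own_cost), cannibal_cost])
coef, *_ = np.linalg.lstsq(design, log_volume, rcond=None)
intercept, own_elasticity, cannibal_effect = coef
print(f"own elasticity: {own_elasticity:.2f}, cannibal effect: {cannibal_effect:.4f}")
```

A positive cannibal coefficient means the product gains volume when the cannibal becomes more expensive, which is the information the optimizer needs when the two costs are set together.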

Optimization Model for Retail Cost with Machine Learning

Let's now train a machine learning model to optimize retail costs. We can train a model for this problem as shown below:

Source Code Snippet

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Features and target
X1 = information[['qty', 'unit_cost', 'comp_1',
                  'product_score', 'comp_cost_diff']]
y = information['total_cost1']

# Hold out 20% of the rows for testing
X1_train, X1_test, y_train, y_test = train_test_split(X1, y,
                                                      test_size = 0.2,
                                                      random_state = 42)

# Train a decision tree regression model
model = DecisionTreeRegressor()
model.fit(X1_train, y_train)

Let's now make predictions and compare the predicted and actual retail costs:

Source Code Snippet

y_pred = model.predict(X1_test)
figure = go.Figure()
# Each point is one test row: actual cost on x, predicted cost on y
figure.add_trace(go.Scatter(x = y_test, y = y_pred, mode = 'markers',
                         marker = dict(color = 'blue'),
                         name = 'Predicted vs. Actual Retail Cost'))
# Diagonal reference line where predicted equals actual
figure.add_trace(go.Scatter(x = [min(y_test), max(y_test)], y = [min(y_test), max(y_test)],
                         mode = 'lines',
                         line = dict(color = 'red'),
                         name = 'Ideal Prediction'))
figure.update_layout(
    title = 'Predicted vs. Actual Retail Cost',
    xaxis_title = 'Actual Retail Cost',
    yaxis_title = 'Predicted Retail Cost')
figure.show()

Output:
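A plot gives a visual impression of the fit, and error metrics put a number on it. The sketch below shows how `mean_squared_error` and `r2_score` from scikit-learn could be applied to a decision tree's test predictions. It runs on simulated data so that it is self-contained; the features and target are stand-ins, and the scores it prints say nothing about performance on the real dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Simulated stand-ins for the qty and unit_cost columns.
qty = rng.integers(1, 50, size=500)
unit_cost = rng.uniform(20.0, 200.0, size=500)
X = np.column_stack([qty, unit_cost])
y = qty * unit_cost  # stand-in for total_cost1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Root mean squared error is in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}, R^2: {r2:.3f}")
```

With the tutorial's own variables, the same two calls would take `y_test` and `y_pred` from the snippet above.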

Consolidated Code for Retail Cost Optimization using Python

import pandas as pdd
import plotly.express as pxx
import plotly.graph_objects as go
import plotly.io as pioq
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

pioq.templates.default = "plotly_white"

# Load the data and take a first look
information = pdd.read_csv('retail_cost1.csv')
print(information.head())

# Check for null values
print(information.isnull().sum())

# Descriptive statistics
print(information.describe())

# Distribution of total cost
figure = pxx.histogram(information,
                   x = 'total_cost1',
                   nbins = 21,
                   title = 'Distribution of Total Cost')
figure.show()

# Distribution of unit cost
figure = pxx.box(information,
             y = 'unit_cost',
             title = 'Box Plot of Unit Cost')
figure.show()

# Quantity vs total cost (trendline = "ols" requires statsmodels)
figure = pxx.scatter(information,
                 x = 'qty',
                 y = 'total_cost1',
                 title = 'Quantity vs Total Cost', trendline = "ols")
figure.show()

# Average total cost by product category
avg_total_by_category = information.groupby('product_category_name1')['total_cost1'].mean().reset_index()
figure = pxx.bar(avg_total_by_category, x = 'product_category_name1',
             y = 'total_cost1',
             title = 'Average Total Cost by Product Category')
figure.show()

# Total cost by weekday and by holiday
figure = pxx.box(information, x = 'weekday_1',
             y = 'total_cost1',
             title = 'Box Plot of Total Cost by Weekday_1')
figure.show()
figure = pxx.box(information, x = 'holiday',
             y = 'total_cost1',
             title = 'Box Plot of Total Cost by Holiday')
figure.show()

# Correlation heatmap (numeric columns only)
correlation_matrix = information.select_dtypes(include = 'number').corr()
figure = go.Figure(go.Heatmap(x = correlation_matrix.columns,
                           y = correlation_matrix.columns,
                           z = correlation_matrix.values))
figure.update_layout(title = 'Correlation Heatmap of Numerical Features')
figure.show()

# Average cost difference from the first competitor, by category
information['comp_cost_diff'] = information['unit_cost'] - information['comp_1']
avg_cost_diff_by_category = information.groupby('product_category_name1')['comp_cost_diff'].mean().reset_index()
figure = pxx.bar(avg_cost_diff_by_category,
             x = 'product_category_name1',
             y = 'comp_cost_diff',
             title = 'Average Competitor Cost Difference by Product Category')
figure.update_layout(
    xaxis_title = 'Product Category',
    yaxis_title = 'Average Competitor Cost Difference'
)
figure.show()

# Features and target
X1 = information[['qty', 'unit_cost', 'comp_1',
                  'product_score', 'comp_cost_diff']]
y = information['total_cost1']

# Hold out 20% of the rows for testing
X1_train, X1_test, y_train, y_test = train_test_split(X1, y,
                                                      test_size = 0.2,
                                                      random_state = 42)

# Train a decision tree regression model
model = DecisionTreeRegressor()
model.fit(X1_train, y_train)

So, this is how Python and machine learning can be used to optimize retail pricing.

Summary

The ultimate goal of retail cost optimization is to set a cost that maximizes your profit while attracting a large enough customer base to support your business. Finding the optimum cost that maximizes your revenue and sales while maintaining customer satisfaction involves using data and pricing strategies. I hope you enjoyed reading this post on Python-based machine learning for retail pricing optimization.
