Rossmann Store Sales Prediction

Introduction:

The demand for a good or service is constantly shifting. No firm can improve its financial performance without effectively predicting customer demand and future sales of its products or services. Sales forecasting estimates the demand for, or sales of, a given product over a predetermined time frame. In this post, I'll demonstrate how machine learning can be used to forecast sales, using a real-world business challenge from Kaggle. The case study is solved end to end, so you will see every step of how such a problem is tackled in practice.

Problem Statement

Rossmann operates more than 3,000 drug stores in seven European countries.

Rossmann store managers must forecast their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their own circumstances, the accuracy of the results can vary widely.

Error Metric: Root Mean Square Percentage Error (RMSPE).

The metric's formula is as follows:

\mathrm{RMSPE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^{2} }

where y_i is a store's sales on a given day, \hat{y}_i is the corresponding prediction, and days with zero sales are ignored in the scoring.
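For reference, a minimal NumPy implementation of this metric (the function name rmspe is my own) could look like:

    import numpy as np

    def rmspe(y_true, y_pred):
        # Root Mean Square Percentage Error; days with zero sales are
        # ignored, as in the competition's scoring.
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        mask = y_true != 0
        pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
        return np.sqrt(np.mean(pct_err ** 2))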

Objectives:

  1. Use the data to forecast sales for the next six weeks.
  2. Minimize the specified metric as much as possible.

Data:

The files are as follows:

  1. train.csv
  2. test.csv
  3. store.csv

Data fields:

  • Id: a unique identifier for a (Store, Date) pair in the test set.
  • Store: a unique Id for each store.
  • Sales: the daily turnover; this is what we are predicting.
  • Customers: the number of customers on a given day.
  • Open: an indicator for whether the store was open; 0 = closed, 1 = open.

StateHoliday is a flag that denotes a state holiday. With few exceptions, all stores are normally closed on state holidays. Note that all schools are closed on public holidays and weekends. a stands for a public holiday, b for the Easter holiday, c for Christmas, and 0 for none.

SchoolHoliday indicates whether the (Store, Date) was affected by the closure of public schools.

StoreType distinguishes between four different store models: a, b, c, and d.

Assortment - specifies three levels of assortment: basic, extra, and extended.

CompetitionDistance - the distance in meters to the nearest competitor store.

CompetitionOpenSince[Month/Year] - gives the approximate year and month the nearest competitor opened.

Promo - indicates whether a store is running a promotion on that day.

Promo2 is a continuing promotion that some stores run: 0 means the store is not participating, 1 means it is.

Promo2Since[Year / Week] - specifies the calendar year and week when the store first joined Promo2.

PromoInterval defines the months in which Promo2 is launched at regular intervals.

Exploratory Data Analysis (EDA)

Let us use EDA to obtain insights into the provided data.

The following is information about train.csv:


As we can see, we have around 1 million data points. Also, because this is a time-series prediction problem, we must sort the data by date.

Our target variable, in this case, is Sales.

The following is information about store.csv:


We have 1,115 distinct stores. Many of the columns in this table contain null values; we'll deal with them shortly.

Let's look at the details of the columns in the data now.

Promo:


Sales and Customers compared against the Promo column.

We can see that both revenue and customer numbers increase significantly during promotions. This shows that promotions have a favorable impact on a store.

Sales:


Average Sales by week of the year.

It is also worth noting that Christmas and New Year's (see the spike at week 52 in the graph) lead to a surge in sales. Because Rossmann sells health and beauty items, it is reasonable to assume that people buy more of them around the holidays and New Year's, which might explain the rapid surge in sales.

DayOfWeek:


Sales and Customers compared against the DayOfWeek column.

Since most stores are closed on Sundays, sales and customer counts drop sharply on that day.

Monday sales are the highest of the week, which might be a rebound effect from the Sunday closures.

Another crucial point is that stores operating during school holidays had higher sales than usual.

Customers and Sales compared across StoreTypes.

We can see that stores of type a have the largest number of customers and sales. StoreType d ranks second in both.

Conclusions of EDA:

  • Type a is the most popular and busiest store type.
  • The number of customers correlates strongly with sales.
  • Promotions increase both sales and customer counts in all stores.
  • Stores open during school holidays had higher sales than on typical days.
  • More stores stay open during school holidays than during state holidays.
  • Sales increase during Christmas week, perhaps because people buy more beauty items over the holiday season.
  • Missing values in CompetitionOpenSince[Year/Month] do not imply an absence of competition: CompetitionDistance is often non-null even when those two fields are null.
  • Decomposing the Sales series with wavelet decomposition reveals some seasonality in the data.

Feature Engineering

Outliers Column:

In this column, we flag whether each Sales value is an outlier based on the Median Absolute Deviation (MAD).

\mathrm{MAD} = \mathrm{median}\left( \left| x_i - \mathrm{median}(x) \right| \right)

We built the outlier column per store, i.e., we computed it for each unique store independently and then concatenated the results.
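A minimal sketch of this per-store computation (assuming train.csv is loaded as the DataFrame train; the 3.5 cutoff on the modified z-score is a common convention and an assumption here, not necessarily the exact threshold used):

    import pandas as pd

    def mad_outliers(sales: pd.Series, threshold: float = 3.5) -> pd.Series:
        # Flag points whose modified z-score (based on MAD) exceeds the threshold.
        med = sales.median()
        mad = (sales - med).abs().median()
        if mad == 0:
            return pd.Series(False, index=sales.index)
        modified_z = 0.6745 * (sales - med) / mad
        return modified_z.abs() > threshold

    # Compute the flag independently for each store, then recombine.
    train["Outlier"] = train.groupby("Store")["Sales"].transform(mad_outliers)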

Date Features:

First, we convert the Date column with pandas' to_datetime method. We can then extract further attributes from it.
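For example (the derived columns shown are an illustrative subset):

    import pandas as pd

    train["Date"] = pd.to_datetime(train["Date"])
    train["Year"] = train["Date"].dt.year
    train["Month"] = train["Date"].dt.month
    train["Day"] = train["Date"].dt.day
    train["WeekOfYear"] = train["Date"].dt.isocalendar().week.astype(int)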

This week's, last week's, and next week's holidays:

We designed three features that give the total number of holidays in the current week, the previous week, and the following week, as sketched below.
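One way to build these counts, bucketing unique state-holiday dates by the week they fall in (a sketch; the column names are illustrative):

    # Unique state-holiday dates, grouped by weekly period.
    hol_dates = train.loc[train["StateHoliday"] != "0", "Date"].drop_duplicates()
    hol_per_week = hol_dates.groupby(hol_dates.dt.to_period("W")).size()

    week = train["Date"].dt.to_period("W")
    train["HolidaysThisWeek"] = week.map(hol_per_week).fillna(0).astype(int)
    train["HolidaysLastWeek"] = (week - 1).map(hol_per_week).fillna(0).astype(int)
    train["HolidaysNextWeek"] = (week + 1).map(hol_per_week).fillna(0).astype(int)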

State Holiday Calculator:

A helper function generates two new features: one gives the number of days until the next state holiday, the other the number of days since the last one. A minimal version is sketched below.
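The original helper isn't shown here; a version based on binary search over the sorted holiday dates might look like this (values at the very start and end of the date range are clamped):

    import numpy as np

    hol = np.sort(train.loc[train["StateHoliday"] != "0", "Date"]
                       .drop_duplicates().values)
    dates = train["Date"].values
    idx = np.searchsorted(hol, dates)

    # Nearest holiday on each side of every date.
    next_hol = hol[np.minimum(idx, len(hol) - 1)]
    prev_hol = hol[np.maximum(idx - 1, 0)]
    train["DaysToStateHoliday"] = (next_hol - dates).astype("timedelta64[D]").astype(int)
    train["DaysAfterStateHoliday"] = (dates - prev_hol).astype("timedelta64[D]").astype(int)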

Promotion and school holiday counters:

In addition to the features above, I built four more that give the number of days before and after a promotion, and before and after a school holiday.

Close dummy variable: this feature takes the value +1 if the store was closed yesterday or will be closed tomorrow, and -1 otherwise.
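A sketch of this flag using per-store shifts of the Open column (the name CloseDummy is my own):

    import numpy as np

    train = train.sort_values(["Store", "Date"]).reset_index(drop=True)
    g = train.groupby("Store")["Open"]
    # Closed yesterday or closed tomorrow within the same store.
    closed_adjacent = g.shift(1).eq(0) | g.shift(-1).eq(0)
    train["CloseDummy"] = np.where(closed_adjacent, 1, -1)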

Removing data points with zero Sales: rows with zero sales are dropped, since they indicate that the store was closed for whatever reason. If a store isn't open, we can simply forecast zero sales.

Sales_per_day, Customers_per_day, and Sales_per_customers_per_day:

The feature names are self-explanatory: average sales per day, average customers per day, and average sales per customer per day for each store.
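These per-store averages can be computed on open days and merged back (a sketch; the exact definitions may differ):

    # Per-store daily averages, merged back as store-level features.
    store_stats = (train[train["Open"] == 1]
                   .groupby("Store")
                   .agg(Sales_per_day=("Sales", "mean"),
                        Customers_per_day=("Customers", "mean")))
    store_stats["Sales_per_customers_per_day"] = (
        store_stats["Sales_per_day"] / store_stats["Customers_per_day"])
    train = train.merge(store_stats, on="Store", how="left")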

Open Competition and Open Promo: we convert CompetitionOpenSince and Promo2Since from year/month (or year/week) units into a single count of months, as sketched below.
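A common way to express both as "months open so far" (assuming store.csv has been merged into train and the date features above exist):

    # Months since the nearest competitor opened.
    train["CompetitionOpen"] = (
        12 * (train["Year"] - train["CompetitionOpenSinceYear"])
        + (train["Month"] - train["CompetitionOpenSinceMonth"]))
    # Months since the store joined Promo2 (weeks converted to months).
    train["PromoOpen"] = (
        12 * (train["Year"] - train["Promo2SinceYear"])
        + (train["WeekOfYear"] - train["Promo2SinceWeek"]) / 4.0)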

PromoInterval feature generation: PromoInterval comes as a comma-separated list of months, e.g., "May,Aug,Nov". We split it so that each month becomes its own binary feature.
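A sketch of the split into per-month indicator columns (the month abbreviations follow the dataset's own spelling, including "Sept"):

    all_months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
    month_lists = train["PromoInterval"].fillna("").str.split(",")
    for m in all_months:
        # 1 if Promo2 restarts in month m for this store, else 0.
        train[f"PromoInterval_{m}"] = month_lists.apply(lambda ms: int(m in ms))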

Sales Variation and Acceleration: with y_t denoting Sales on day t,

\mathrm{Variation}_t = y_t - y_{t-1}

\mathrm{Acceleration}_t = \mathrm{Variation}_t - \mathrm{Variation}_{t-1} = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2})
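In pandas these are simple per-store differences:

    train = train.sort_values(["Store", "Date"])
    train["Variation"] = train.groupby("Store")["Sales"].diff(1)         # first difference
    train["Acceleration"] = train.groupby("Store")["Variation"].diff(1)  # second difference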

Fourier Features:

I use numpy's fft module to compute Fourier frequencies and amplitudes, then employ them as features.
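A sketch of extracting the k strongest frequencies and amplitudes per store (k = 3 here is an arbitrary choice):

    import numpy as np

    def fourier_features(sales, k=3):
        # Frequencies and amplitudes of the k strongest non-constant components.
        fft_vals = np.fft.rfft(sales - np.mean(sales))
        freqs = np.fft.rfftfreq(len(sales))
        amps = np.abs(fft_vals)
        top = np.argsort(amps)[::-1][:k]
        return np.concatenate([freqs[top], amps[top]])

    store_fourier = train.groupby("Store")["Sales"].apply(
        lambda s: fourier_features(s.values))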

Other features:

These capture important patterns involving DayOfWeek, promotions, holidays, and so on.

External Information:

There are two sources of external information. One is state data, which maps each store to its German state; the other is weather data for a given state on a given day.

VIF Analysis: after adding all of the features, we ran a Variance Inflation Factor (VIF) analysis to check for collinearity between them. Features with high collinearity were removed.
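A typical VIF computation with statsmodels (a sketch; X_numeric is a hypothetical DataFrame of the numeric features, and the cutoff of 10 is a common rule of thumb, not necessarily the one used here):

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def vif_table(X: pd.DataFrame) -> pd.DataFrame:
        # One VIF value per numeric feature column.
        return pd.DataFrame({
            "feature": X.columns,
            "VIF": [variance_inflation_factor(X.values, i)
                    for i in range(X.shape[1])],
        })

    vifs = vif_table(X_numeric)
    low_collinearity_features = vifs.loc[vifs["VIF"] <= 10, "feature"]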

Let's do some modeling right now.

Modeling

Base model:

We used sklearn's Pipeline and ColumnTransformer to preprocess the data.

Numerical values are imputed with the median and then scaled; categorical values are imputed with the most frequent value.
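A minimal version of such a preprocessing pipeline (the column lists are illustrative):

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_cols = ["CompetitionDistance", "DayOfWeek", "Promo"]   # illustrative
    categorical_cols = ["StoreType", "Assortment"]                 # illustrative

    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ])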

Now divide the data into training and validation sets.

Splitting the data into test and train for Outliers
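Because this is a time-series problem, the split should be chronological. A sketch of such a split plus an XGBoost run with RMSPE as a custom evaluation metric and early stopping after 250 rounds (feature_cols and the hyperparameters are illustrative; newer xgboost versions take custom_metric instead of feval):

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    # Hold out the final six weeks of the training period as validation.
    train = train.sort_values("Date")
    cutoff = train["Date"].max() - pd.Timedelta(weeks=6)
    tr, val = train[train["Date"] <= cutoff], train[train["Date"] > cutoff]

    def rmspe_eval(preds, dmatrix):
        # Custom RMSPE metric for early stopping; zero-sales days are ignored.
        y = dmatrix.get_label()
        mask = y != 0
        return "rmspe", float(np.sqrt(np.mean(((y[mask] - preds[mask]) / y[mask]) ** 2)))

    dtrain = xgb.DMatrix(tr[feature_cols], label=tr["Sales"])
    dval = xgb.DMatrix(val[feature_cols], label=val["Sales"])
    model = xgb.train({"eta": 0.03, "max_depth": 10, "subsample": 0.9},
                      dtrain, num_boost_round=10000,
                      evals=[(dtrain, "train"), (dval, "test")],
                      feval=rmspe_eval, early_stopping_rounds=250)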

Output:

Will train until test error hasn't decreased in 250 rounds.
[0]      train-rmspe:0.99963    test-rmspe:0.999863
[250]    train-rmspe:0.41216    test-rmspe:0.487971
[500]    train-rmspe:0.19972    test-rmspe:0.188309
...
Stopping. Best iteration:
[8750]   train-rmspe:0.146971   test-rmspe:0.189741

Feature Selection:

After building the base model above, I performed forward selection over all of the additional features created in the feature engineering stage.

The following are the new features in the pipeline:

Reading store data

Meta Learning:

This method works as follows (a code sketch follows the list):

  1. Split the data into an 80-20 train-test split.
  2. Split the train portion into two parts, D1 and D2.
  3. Draw 9 samples from D1 and train a Random Forest regressor on each of them.
  4. Predict D2 with these 9 models. Use the 9 predictions as features and D2's original targets as the output to train a new (meta) model.
  5. For the test set, generate predictions from the 9 base models and pass those 9 predictions as features to the meta model. Its prediction is the final prediction.
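A compact sketch of this scheme with scikit-learn (X and y are the prepared feature matrix and target; the model choices are illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # 80-20 split, then split the train part into D1 and D2.
    X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, shuffle=False)

    rng = np.random.default_rng(42)
    base_models = []
    for _ in range(9):
        # Bootstrap sample of D1 for each base model.
        idx = rng.integers(0, len(X_d1), size=len(X_d1))
        m = RandomForestRegressor(n_estimators=100, n_jobs=-1)
        m.fit(X_d1.iloc[idx], y_d1.iloc[idx])
        base_models.append(m)

    # Base-model predictions on D2 become the meta model's features.
    meta_X = np.column_stack([m.predict(X_d2) for m in base_models])
    meta_model = LinearRegression().fit(meta_X, y_d2)

    # Final prediction: the meta model applied to the 9 base predictions.
    test_meta_X = np.column_stack([m.predict(X_test) for m in base_models])
    final_pred = meta_model.predict(test_meta_X)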

Reading sales data

Output:


Conclusion:

TABLE 1
Model                    | RMSPE on test set | Kaggle private score
-------------------------|-------------------|---------------------
SGD Regressor            | 0.250             | 0.234
Decision Tree Regressor  | 0.1767            | 0.16442
Random Forest Regressor  | 0.164             | 0.139
Light GBM Regressor      | 0.157             | 0.121
Stacking with 9 models   | 0.1988            | 0.17375

Table for all scores.

The table above suggests that the Light GBM Model is the best model.

Future Work:

With developments in deep learning, an LSTM may be an excellent starting point for improving performance on this dataset.

Other ensemble strategies can be applied to see whether they enhance the result.