Electricity Consumption Prediction Using Machine Learning

In today's fast-paced world, electricity consumption holds a vital position in fulfilling the energy needs of modern societies. As the demand for electricity keeps growing, optimizing energy usage becomes extremely important. Thankfully, advancements in technology have led to the emergence of machine learning, a potent tool that can predict electricity consumption with remarkable accuracy.

Forecasting electricity consumption is a complex undertaking. It requires analyzing large volumes of historical data, such as past electricity usage, weather patterns, time of day, and seasonal changes. Although traditional methods offer some insights, they often struggle to capture intricate connections between these variables.

Enter machine learning, a branch of artificial intelligence that equips computers with the ability to learn from data and make predictions without being explicitly programmed. Machine learning algorithms excel at discovering hidden patterns and correlations in vast datasets, making them a perfect fit for electricity consumption prediction.

Advantages of Electricity Consumption Prediction Using Machine Learning

Electricity consumption prediction using machine learning offers numerous advantages that can revolutionize the way we manage and optimize our energy resources. Some of the key advantages include:

Accurate Forecasts: Machine learning algorithms can analyze historical data with precision and identify intricate patterns, leading to more accurate electricity consumption forecasts. This accuracy helps utilities and grid operators plan more effectively for future demands, ensuring a stable and reliable energy supply.
Demand Response Management: Machine learning models can analyze historical consumption patterns to predict peak demand periods. This insight allows utilities to implement demand response strategies, encouraging consumers to shift their electricity usage to off-peak hours, thereby reducing the strain on the grid during peak times.
Renewable Energy Integration: Machine learning can predict electricity consumption based on weather patterns, enabling better integration of renewable energy sources like solar and wind. By aligning renewable energy generation with peak demand periods, we can further reduce our reliance on fossil fuels and promote sustainability.
Customer Empowerment: Machine learning enables personalized electricity consumption forecasts for individual consumers. This empowers customers to make informed decisions about their energy usage, leading to potential cost savings and a more sustainable lifestyle.
Grid Stability and Reliability: Accurate electricity consumption predictions help grid operators maintain grid stability and reliability. By anticipating changes in demand, they can balance energy generation and distribution more effectively, reducing the risk of power outages and ensuring a smooth functioning grid.
Cost Optimization: Machine learning models can optimize energy production and distribution, leading to cost savings for both energy providers and consumers. These cost optimizations can result in more competitive electricity prices and improved financial outcomes for all stakeholders.

Challenges of Electricity Consumption Prediction Using Machine Learning

Predicting electricity consumption using machine learning comes with its fair share of challenges. While machine learning offers promising solutions, it is essential to be aware of the hurdles that may arise during the process. Some of the key challenges include:

Data Quality and Quantity: One of the fundamental requirements for accurate predictions is high-quality data. Inadequate or inconsistent data can lead to unreliable results. Moreover, obtaining a sufficient amount of historical data can be challenging, especially for new or evolving areas.
Complexity of Data Patterns: Electricity consumption data is often influenced by various factors, such as weather conditions, holidays, and industrial activities. Identifying and modeling these complex patterns can be demanding and requires sophisticated algorithms.
Seasonal and Weather Variations: Electricity consumption exhibits strong seasonal and weather-related variations. Incorporating these fluctuations into the predictive models can be intricate, as it involves handling non-linear relationships.
Non-Stationarity: Electricity consumption patterns might change over time due to evolving technologies, population growth, or economic changes. Adapting the predictive models to handle non-stationary data is crucial for accurate long-term forecasts.
Model Selection and Tuning: There are numerous machine learning algorithms available, and selecting the most suitable one for a specific dataset can be challenging. Fine-tuning the model's hyperparameters is also crucial to achieving optimal performance.

About the Dataset

This dataset is a daily time series of electricity demand, generation, and prices in Spain from 2014 to 2018. It is gathered from ESIOS, a website managed by REE (Red Electrica Española), which is the Spanish TSO (Transmission System Operator).

A TSO's main function is to operate the electrical system and to invest in new transmission (high voltage) infrastructure.

(https://www.ree.es/en/about-us/business-activities/electricity-business-in-Spain)

As a system operator, REE forecasts electricity demand and supply offers and runs daily auctions. The result of these daily auctions is a PBF (Plan Básico de Funcionamiento), a basic schedule of energy production on top of which several mechanisms are triggered to ensure supply.

Energy and prices data can be downloaded from: https://www.esios.ree.es/en

OMIE (Operador del Mercado Iberico de Electricidad) is responsible for running those daily auctions and also offers interesting data.

http://www.omie.es/en/inicio

Content

Original values are kept, so some names in Spanish are shown. The column name describes each time series, so I provide a description of each name:

Demanda programada PBF total (MWh): Scheduled Total Demand
Demanda real (MW): Actual demanded power
Energía asignada en Mercado SPOT Diario España (MWh): Energy traded in daily spot Spanish market (OMIE)
Energía asignada en Mercado SPOT Diario Francia (MWh): Energy traded in daily spot French market
Generación programada PBF Carbón (MWh): Scheduled Coal electricity generation
Generación programada PBF Ciclo combinado (MWh): Scheduled Combined Cycle electricity generation
Generación programada PBF Eólica (MWh): Scheduled Wind electricity generation
Generación programada PBF Gas Natural Cogeneración (MWh): Scheduled Natural Gas electricity Co-generation
Generación programada PBF Nuclear (MWh): Scheduled Nuclear electricity generation
Generación programada PBF Solar fotovoltaica (MWh): Scheduled Photovoltaic electricity generation
Generación programada PBF Turbinación bombeo (MWh): Scheduled Reversible-Hydro electricity generation
Generación programada PBF UGH + no UGH (MWh): Scheduled Total Hydroelectricity generation
Generación programada PBF total (MWh): Scheduled Total electricity generation
Precio mercado SPOT Diario ESP (€/MWh): Daily spot Spain market price
Precio mercado SPOT Diario FRA (€/MWh): Daily spot France market price
Precio mercado SPOT Diario POR (€/MWh): Daily spot Portugal market price
Rentas de congestión mecanismos implícitos diario Francia exportación (€/MWh): Daily spot export from France price
Rentas de congestión mecanismos implícitos diario Francia importación (€/MWh): Daily spot import to France price
Rentas de congestión mecanismos implícitos diario Portugal exportación (€/MWh): Daily spot export from Portugal price
Rentas de congestión mecanismos implícitos diario Portugal importación (€/MWh): Daily spot import to Portugal price

Note: Original data format is maintained, just in case it is necessary to append new data downloaded from Esios. As a result, geo columns are null.

Code:

Importing Libraries

import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis, shapiro

Reading the Dataset

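path = "/kaggle/input/spain_energy_market.csv"
data = pd.read_csv(path, sep=",", parse_dates=["datetime"])
# Keep only the scheduled total demand series
data = data[data["name"]=="Demanda programada PBF total"].copy()
# Convert the dates back to datetime64 so the index is a DatetimeIndex
# (needed by asfreq and by index attributes such as .year and .quarter)
data["date"] = pd.to_datetime(data["datetime"].dt.date)
data.set_index("date", inplace=True)
data = data[["value"]]
data = data.asfreq("D")
data = data.rename(columns={"value": "energy"})
data.info()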

Output:

data.plot(title="Energy Demand")
plt.ylabel("MWh")
plt.show()

Output:

We're in luck! There are no missing values in the dataset, and the series covers 2014 to 2018. Next, let's calculate some date-related features and rolling statistics to support the analysis.

data["year"] = data.index.year
data["qtr"] = data.index.quarter
data["mon"] = data.index.month
# DatetimeIndex.week was deprecated and later removed from pandas; isocalendar() replaces it
data["week"] = data.index.isocalendar().week
data["day"] = data.index.weekday
data["ix"] = range(0,len(data))
# Rolling mean and standard deviation over 7, 30, 90 and 365 days
data[["movave_7", "movstd_7"]] = data.energy.rolling(7).agg(["mean", "std"])
data[["movave_30", "movstd_30"]] = data.energy.rolling(30).agg(["mean", "std"])
data[["movave_90", "movstd_90"]] = data.energy.rolling(90).agg(["mean", "std"])
data[["movave_365", "movstd_365"]] = data.energy.rolling(365).agg(["mean", "std"])

# Pass figsize to plot(); a separate plt.figure() call would only create an empty figure
data[["energy", "movave_7"]].plot(title="Daily Energy Demand in Spain (MWh)", figsize=(20,16))
plt.ylabel("(MWh)")
plt.show()

Output:

EDA (Exploratory Data Analysis)

Analyzing the target variable involves studying its seasonality and trend. Our aim is to visually understand the patterns and fluctuations in the time series data without heavily relying on statistical techniques such as decomposition. By graphically examining the data, we can gain insights into the underlying patterns and trends that may exist.

Target Analysis (Normality)

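mean = np.mean(data.energy.values)
std = np.std(data.energy.values)
# Named skewness so the imported skew() function is not overwritten
skewness = skew(data.energy.values)
# kurtosis() returns excess (Fisher) kurtosis; adding 3 gives Pearson's kurtosis
ex_kurt = kurtosis(data.energy)
print("Skewness: {} \nKurtosis: {}".format(skewness, ex_kurt+3))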

Output:

In terms of data distribution, negative skewness indicates that the data is not perfectly symmetrical and has a longer left tail. Additionally, the kurtosis value below 3 suggests that the tails of the distribution are slightly thinner compared to a normal distribution. This characteristic is known as platykurtic, indicating that the likelihood of encountering extreme values is lower than in a normal distribution.
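To make the two statistics concrete, here is a minimal sketch that computes skewness and Pearson kurtosis directly from central moments. The function name `moment_summary` and the sample values are made up for illustration; the formulas are the standard biased moment estimators, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` use by default (the latter after subtracting 3).

```python
import numpy as np

def moment_summary(x):
    """Return (skewness, Pearson kurtosis) from biased central moments."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (population variance)
    m3 = np.mean(centered ** 3)  # third central moment
    m4 = np.mean(centered ** 4)  # fourth central moment
    skewness = m3 / m2 ** 1.5
    pearson_kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    return skewness, pearson_kurtosis

# Hypothetical sample: mostly typical days plus one unusually low value,
# which produces a long left tail and therefore negative skewness
left_tailed = [10.0] * 9 + [1.0]
skew_left, _ = moment_summary(left_tailed)

# A uniform sample has thinner tails than a normal one (platykurtic, kurtosis < 3)
_, kurt_uniform = moment_summary(np.linspace(0.0, 1.0, 1001))

print("Skewness of left-tailed sample:", skew_left)
print("Pearson kurtosis of uniform sample:", kurt_uniform)
```

A negative first value and a second value below 3 correspond to the two properties described above for the energy series.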

def shapiro_test(data, alpha=0.05):
    stat, pval = shapiro(data)
    print("H0: Data was drawn from a Normal Distribution")
    if (pval<alpha):
        print("pval {} is lower than significance level: {}, therefore null hypothesis is rejected".format(pval, alpha))
    else:
        print("pval {} is higher than significance level: {}, therefore null hypothesis cannot be rejected".format(pval, alpha))
        
shapiro_test(data.energy, alpha=0.05)

Output:

# histplot with a KDE overlay replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data.energy, kde=True, stat="density")
plt.title("Target Analysis")
plt.xticks(rotation=45)
plt.xlabel("(MWh)")
plt.axvline(x=mean, color='r', linestyle='-', label=r"$\mu$: {0:.2f}".format(mean))
plt.axvline(x=mean+2*std, color='orange', linestyle='-')
plt.axvline(x=mean-2*std, color='orange', linestyle='-')
plt.legend()
plt.show()

Output:

In general, the data does not exhibit a normal distribution: it has a longer left tail (negative skewness) and a lower likelihood of extreme values than normally distributed data.

# Add 90-day rolling quantiles of the daily energy demand
data_rolling = data.energy.rolling(window=90)
data['q10'] = data_rolling.quantile(0.1)
data['q50'] = data_rolling.quantile(0.5)
data['q90'] = data_rolling.quantile(0.9)

data[["q10", "q50", "q90"]].plot(title="Volatility Analysis: 90-rolling percentiles")
plt.ylabel("(MWh)")
plt.show()

Output:

data.groupby("qtr")["energy"].std().divide(data.groupby("qtr")["energy"].mean()).plot(kind="bar")
plt.title("Coefficient of Variation (CV) by qtr")
plt.show()

Output:

data.groupby("mon")["energy"].std().divide(data.groupby("mon")["energy"].mean()).plot(kind="bar")
plt.title("Coefficient of Variation (CV) by month")
plt.show()

Output:

data[["movstd_30", "movstd_365"]].plot(title="Heteroscedasticity analysis")
plt.ylabel("(MWh)")
plt.show()

Output:

When considering shorter time periods such as quarters and months, volatility tends to vary, but over the long term (in a yearly window), it remains relatively stable. As a result, potential predictors need to account for the seasonal pattern in variance.
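The coefficient of variation used in the bar charts above is simply the standard deviation divided by the mean within each group. The following sketch shows the calculation on a synthetic daily series; the values, the seed, and the choice to make the first quarter noisier are assumptions for illustration only, not properties of the Spanish dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily "demand" with larger noise in Q1 (hypothetical values)
rng = np.random.default_rng(0)
idx = pd.date_range("2016-01-01", "2017-12-31", freq="D")
noise_scale = np.where(idx.quarter == 1, 50.0, 10.0)
demand = pd.Series(1000.0 + rng.normal(0.0, 1.0, len(idx)) * noise_scale, index=idx)

# Coefficient of variation per quarter: std / mean
by_quarter = demand.groupby(demand.index.quarter)
cv = by_quarter.std() / by_quarter.mean()
print(cv)
```

Because the mean is similar in every quarter here, the quarter with the largest noise scale also has the largest CV, which is the kind of seasonal pattern in variance the text refers to.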

data[["movave_30", "movave_90"]].plot(title="Seasonal Analysis: Moving Averages")
plt.ylabel("(MWh)")
plt.show()

Output:

sns.boxplot(data=data, x="qtr", y="energy")
plt.title("Seasonality analysis: Distribution over quarters")
plt.ylabel("(MWh)")
plt.show()

Output:

sns.boxplot(data=data, x="day", y="energy")
plt.title("Seasonality analysis: Distribution over weekdays")
plt.ylabel("(MWh)")
plt.show()

Output:

As anticipated, there are distinct seasonal patterns observed in the data when considering quarters and weekdays (with Monday represented as 0).

data_mon = data.energy.resample("M").sum().to_frame("energy")
data_mon["ix"] = range(0, len(data_mon))
data_mon[:5]

Output:

sns.regplot(data=data_mon,x="ix", y="energy")
plt.title("Trend analysis: Regression")
plt.ylabel("(MWh)")
plt.xlabel("")
plt.show()

Output:

sns.boxplot(data=data.loc["2014":"2017"], x="year", y="energy")
plt.title("Trend Analysis: Annual Box-plot Distribution")
plt.ylabel("(MWh)")
plt.show()

Output:

The energy demand shows a positive linear trend, or a slightly damped trend, which can be attributed to the steady economic growth resulting from the recovery from a previous recession.
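As a rough numerical complement to the regression plot, a straight line can be fitted to monthly totals with `np.polyfit`. This is a sketch on made-up numbers (a slope of 50 per month plus a yearly cycle), not the article's data, and the variable names are illustrative.

```python
import numpy as np

# Hypothetical monthly totals: linear growth plus a 12-month seasonal cycle
months = np.arange(48)
monthly = 20000.0 + 50.0 * months + 300.0 * np.sin(2 * np.pi * months / 12)

# Degree-1 least-squares fit returns (slope, intercept)
slope, intercept = np.polyfit(months, monthly, deg=1)
print("Estimated trend per month:", slope)
```

The estimated slope is positive but not exactly 50, because the seasonal cycle is not orthogonal to the time index over a finite window. A fitted trend on seasonal data is therefore best read as an approximation.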

Feature Engineering

The current challenge lies in developing automated features that can effectively handle seasonality, trend, and changes in volatility. These features should be able to adapt to the varying patterns and fluctuations observed in the data.

Standardizing the data is a necessary step to enable the application of models that are sensitive to scale, such as neural networks or support vector machines (SVM). By standardizing the data, we ensure that the distribution shape remains unchanged while only altering the first and second moments, namely the mean and standard deviation. This process allows for more accurate and effective modeling of the data using these particular machine learning algorithms.

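data["target"] = data.energy.add(-mean).div(std)
# histplot with a KDE overlay replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data["target"], kde=True, stat="density")
plt.show()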

Output:
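The code above standardizes with the mean and standard deviation of the full series. A common variant, sketched here under the assumption that you want to keep test-period information out of the scaling, estimates both moments on the training period only and reuses them everywhere. The series and the split date are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series spanning a train/test boundary
idx = pd.date_range("2017-12-25", periods=14, freq="D")
energy = pd.Series(np.linspace(600.0, 730.0, 14), index=idx)

# Estimate the first two moments on the training period only
train = energy.loc[:"2017-12-31"]
mu, sigma = train.mean(), train.std(ddof=0)

# Standardize the whole series with the training moments
target = (energy - mu) / sigma

# Invert the transformation to get back to the original units (MWh)
restored = target * sigma + mu
```

The inverse step is the same operation used later in the article to convert predictions back to MWh.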

features = []
corr_features=[]
targets = []
tau = 30 #forecasting periods

for t in range(1, tau+1):
    data["target_t" + str(t)] = data.target.shift(-t)
    targets.append("target_t" + str(t))
    
for t in range(1,31):
    data["feat_ar" + str(t)] = data.target.shift(t)
    #data["feat_ar" + str(t) + "_lag1y"] = data.target.shift(350)
    features.append("feat_ar" + str(t))
    #corr_features.append("feat_ar" + str(t))
    #features.append("feat_ar" + str(t) + "_lag1y")
        
    
for t in [7, 14, 30]:
    # Aggregation order must match the column order: mean, std, min, max
    data[["feat_movave" + str(t), "feat_movstd" + str(t), "feat_movmin" + str(t) ,"feat_movmax" + str(t)]] = data.energy.rolling(t).agg(["mean", "std", "min", "max"])
    features.append("feat_movave" + str(t))
    #corr_features.append("feat_movave" + str(t))
    features.append("feat_movstd" + str(t))
    features.append("feat_movmin" + str(t))
    features.append("feat_movmax" + str(t))
    
months = pd.get_dummies(data.mon,
                              prefix="mon",
                              drop_first=True)
months.index = data.index
data = pd.concat([data, months], axis=1)

days = pd.get_dummies(data.day,
                              prefix="day",
                              drop_first=True)
days.index = data.index
data = pd.concat([data, days], axis=1)


features = features + months.columns.values.tolist() + days.columns.values.tolist()

corr_features = ["feat_ar1", "feat_ar2", "feat_ar3", "feat_ar4", "feat_ar5", "feat_ar6", "feat_ar7", "feat_movave7", "feat_movave14", "feat_movave30"]

# Calculate correlation matrix
corr = data[["target_t1"] + corr_features].corr()

top5_mostCorrFeats = corr["target_t1"].apply(abs).sort_values(ascending=False).index.values[:6]


# Plot heatmap of correlation matrix
sns.heatmap(corr, annot=True)
plt.title("Pearson Correlation with 1 period target")
plt.yticks(rotation=0); plt.xticks(rotation=90)  # fix ticklabel directions
plt.tight_layout()  # fits plot area to the plot, "tightly"
plt.show()  # show the plot

Output:

g = sns.pairplot(data=data[top5_mostCorrFeats].dropna(), kind="reg")
# Set the title on the whole figure; plt.title would only label the last subplot
g.fig.suptitle("Most important features Matrix Scatter Plot", y=1.02)
plt.show()

Output:

Some features, such as AR_6 (AutoRegressive lag 6) and MOVAVE_7 (7-day moving average), exhibit a relatively strong linear correlation with the target variable. To validate this assumption and further investigate their predictive power, we will build various models and evaluate their performance using these features. By assessing the models' accuracy and predictive capabilities, we can determine the extent to which these features contribute to the overall predictive power of the models.
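For readers new to this framing, the sketch below shows on a six-value toy series how `shift` turns a time series into a supervised table: positive shifts create lagged (autoregressive) features, negative shifts create future targets, and `rolling` adds a moving average. The column names mirror the article's naming, but the data is invented.

```python
import pandas as pd

# Toy series standing in for the standardized target
s = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    index=pd.date_range("2018-01-01", periods=6, freq="D"),
    name="target",
)

frame = s.to_frame()
frame["feat_ar1"] = s.shift(1)                # value one day earlier
frame["feat_ar2"] = s.shift(2)                # value two days earlier
frame["feat_movave3"] = s.rolling(3).mean()   # 3-day moving average ending today
frame["target_t1"] = s.shift(-1)              # value one day ahead (what we predict)

# Rows without a full set of lags or without a future value are dropped
supervised = frame.dropna()
print(supervised)
```

Only three of the six rows survive `dropna`, which is why the number of usable observations shrinks as more lags and longer horizons are added.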

Model Building

In this step, we have built two candidate models using a convenient feature in Scikit-Learn called MultiOutput Regression. This feature allows us to efficiently and automatically fit models that can predict multiple target variables simultaneously. By leveraging this framework, we can train our models to predict several target variables in a streamlined manner. This not only simplifies the modeling process but also enables us to evaluate the models' performance across multiple targets effectively.
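As a minimal sketch of the multi-output idea, the example below wraps a linear model in scikit-learn's `MultiOutputRegressor`, which fits one copy of the estimator per target column. The data is synthetic and noise-free, so the fit is essentially exact. Note that the article's later code passes a multi-column `y` directly to `RandomForestRegressor`, which scikit-learn also accepts without the wrapper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

# Synthetic features and two targets that are exact linear functions of them
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
Y = np.column_stack([
    X @ np.array([1.0, 2.0, 0.0]),   # hypothetical target for horizon 1
    X @ np.array([0.0, -1.0, 3.0]),  # hypothetical target for horizon 2
])

# One LinearRegression is fitted per column of Y
model = MultiOutputRegressor(LinearRegression()).fit(X, Y)
pred = model.predict(X[:5])
print(pred.shape)  # one prediction column per target
```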

First, we will fit a baseline model using linear regression and compare it to a more advanced model, such as Random Forest. The linear regression model does not require extensive hyperparameter tuning and provides a solid foundation for our analysis. However, there are several considerations to keep in mind:

Non-Normal Distribution and Varied Variance: The target variable does not follow a perfect normal distribution and exhibits varying levels of variance. Inference in linear regression assumes normally distributed errors with constant variance, so a skewed target with changing volatility makes violations of these assumptions more likely. We need to be cautious of potential deviations from these assumptions.
Multicollinearity Among Predictors: There is a high degree of multicollinearity among the predictor variables, meaning that some predictors are highly correlated with each other. This can introduce challenges in interpreting the individual effects of these predictors on the target variable and may impact the model's performance.
Non-Independence of Observations: The observations in our dataset may not be independent, which violates one of the key assumptions of linear regression. Non-independence can arise from various factors, such as temporal dependencies or clustering within the data. We need to consider this when interpreting the model results and evaluating its accuracy.
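One simple way to check the independence concern is the Durbin-Watson statistic on a model's residuals. It is not used in the original analysis and is added here only as an illustrative diagnostic. Values near 2 suggest little first-order autocorrelation, while values well below 2 suggest positive autocorrelation. The residual series below are simulated.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
n = 2000

# Independent residuals (white noise)
white = rng.normal(size=n)

# Strongly autocorrelated residuals: AR(1) process with coefficient 0.9
ar = np.empty(n)
ar[0] = white[0]
for t in range(1, n):
    ar[t] = 0.9 * ar[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar = durbin_watson(ar)
print("white noise:", dw_white, "| AR(1):", dw_ar)
```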

On the other hand, an advanced model such as Random Forest requires careful hyperparameter tuning to achieve optimal performance. Typically, this is done using techniques like GridSearch and Cross Validation (CV). However, using traditional CV methods with time series data poses challenges. This is because the data should not be shuffled as it follows a specific time structure.

Fortunately, Scikit-Learn provides a helpful solution called TimeSeries Split. This technique allows us to perform GridSearch in a time-aware manner by preserving the temporal order of the data. It splits the data into sequential time-based folds, ensuring that each fold respects the chronological order of the observations.

By using TimeSeries Split, we can iteratively train and evaluate our Random Forest model with different combinations of hyperparameters. This approach enables us to find the best set of hyperparameters that maximizes the model's performance on unseen future data points.

Applying hyperparameter tuning in a time-aware manner is essential for time series data, as it ensures that our model's performance is more realistic and reliable. By leveraging the TimeSeries Split functionality in Scikit-Learn, we can effectively optimize our Random Forest model without violating the temporal structure of the data.
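The splitting behaviour described above can be inspected on a small index before applying it to real data. This sketch uses 20 dummy observations and a `max_train_size` of 8, both arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 dummy observations in chronological order
X = np.arange(20).reshape(-1, 1)

# Three folds; each training window keeps at most the 8 most recent observations
splitter = TimeSeriesSplit(n_splits=3, max_train_size=8)
folds = list(splitter.split(X))

for train_idx, val_idx in folds:
    print("train:", train_idx[0], "to", train_idx[-1],
          "| validation:", val_idx[0], "to", val_idx[-1])
```

In every fold the training indices come strictly before the validation indices, so the model is never evaluated on observations older than the ones it was trained on.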

data_feateng = data[features + targets].dropna()
nobs= len(data_feateng)
print("Number of observations: ", nobs)

Output:

Splitting Data

To ensure an unbiased evaluation of our model's performance and conduct thorough residual analysis, we reserve the data points from the year 2018 as a separate holdout dataset. This means that we keep this data untouched during the model development process.

X_train = data_feateng.loc["2014":"2017"][features]
y_train = data_feateng.loc["2014":"2017"][targets]

X_test = data_feateng.loc["2018"][features]
y_test = data_feateng.loc["2018"][targets]

n, k = X_train.shape
print("Total number of observations: ", nobs)
print("Train: {}{}, \nTest: {}{}".format(X_train.shape, y_train.shape,
                                              X_test.shape, y_test.shape))

plt.plot(y_train.index, y_train.target_t1.values, label="train")
plt.plot(y_test.index, y_test.target_t1.values, label="test")
plt.title("Train/Test split")
plt.xticks(rotation=45)
plt.legend()
plt.show()

Output:

Baseline Model: Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

reg = LinearRegression().fit(X_train, y_train["target_t1"])
p_train = reg.predict(X_train)
p_test = reg.predict(X_test)

RMSE_train = np.sqrt(mean_squared_error(y_train["target_t1"], p_train))
RMSE_test = np.sqrt(mean_squared_error(y_test["target_t1"], p_test))

print("Train RMSE: {}\nTest RMSE: {}".format(RMSE_train, RMSE_test) )

Output:

Train a Random Forest with Time Series Split to tune Hyperparameters

In this example, we illustrate the use of the TimeSeriesSplit framework. Each fold trains only on observations that precede its validation block, and max_train_size caps the training window at the two most recent years before that block, so the training data stays close to the start of the period being forecast.

from sklearn.model_selection import TimeSeriesSplit, ParameterGrid

splits = TimeSeriesSplit(n_splits=3, max_train_size=365*2)
for train_index, val_index in splits.split(X_train):
    print("TRAIN:", len(train_index), "TEST:", len(val_index))
    # The split returns integer positions, so select rows with .iloc
    y_train["target_t1"].iloc[train_index].plot()
    y_train["target_t1"].iloc[val_index].plot()
    plt.show()

Output:

from sklearn.ensemble import RandomForestRegressor

splits = TimeSeriesSplit(n_splits=3, max_train_size=365*2)
rfr = RandomForestRegressor()
# Create a dictionary of hyperparameters to search
rfr_grid = {"n_estimators": [500], 
        'max_depth': [3, 5, 10, 20, 30], 
        'max_features': [4, 8, 16, 32, 59], 
        'random_state': [123]}
rfr_paramGrid = ParameterGrid(rfr_grid)

def TimeSplit_ModBuild(model, paramGrid, splits, X, y):
    from sklearn.metrics import mean_squared_error

    # Loop over each time split and evaluate every hyperparameter combination in it
    for train_index, val_index in splits.split(X):
        _X_train_ = X.iloc[train_index]
        _y_train_ = y.iloc[train_index]
        _X_val_ = X.iloc[val_index]
        _y_val_ = y.iloc[val_index]

        train_scores = []
        val_scores = []

        # Loop through the parameter grid, set the hyperparameters, and save the RMSE scores
        for g in paramGrid:
            model.set_params(**g)
            model.fit(_X_train_, _y_train_)
            p_train = model.predict(_X_train_)
            p_val = model.predict(_X_val_)
            score_train = np.sqrt(mean_squared_error(_y_train_, p_train))
            score_val = np.sqrt(mean_squared_error(_y_val_, p_val))
            train_scores.append(score_train)
            val_scores.append(score_val)

        # Pick the combination with the lowest validation RMSE in this fold
        best_idx = np.argmin(val_scores)

        print("Best-Fold HyperParams:: ", paramGrid[best_idx])
        print("Best-Fold Train RMSE: ", train_scores[best_idx])
        print("Best-Fold Val RMSE: ",val_scores[best_idx])
        print("\n")

    # Return the scores and best index from the last (most recent) fold
    return train_scores, val_scores, best_idx


CV_rfr_tup = TimeSplit_ModBuild(rfr, rfr_paramGrid, splits, X_train, y_train["target_t1"])

Output:

best_rfr_idx = CV_rfr_tup[2]
best_rfr_grid = rfr_paramGrid[best_rfr_idx]
best_rfr = RandomForestRegressor().set_params(**best_rfr_grid).\
    fit(X_train.loc["2016":"2017"], y_train.loc["2016":"2017", "target_t1"])

Utilizing Random Forest yields a significant enhancement in comparison to Linear Regression. However, it is essential to exercise caution as Random Forest models are constructed by bootstrapping the data, which may result in the loss of some time structure within the dataset.

Feature Importance

# Get feature importances from our random forest model
importances = best_rfr.feature_importances_

# Get the index of importance from greatest importance to least
sorted_index = np.argsort(importances)[::-1]
sorted_index_top = sorted_index[:10]
x = range(len(sorted_index_top))

# Create tick labels 
labels = np.array(features)[sorted_index_top]
plt.bar(x, importances[sorted_index_top], tick_label=labels)
plt.title("Feature importance analysis")
# Rotate tick labels to vertical
plt.xticks(rotation=45)
plt.show()

Output:

The results obtained from the model do not align with the findings of the correlation analysis, highlighting the influence of complex relationships and interactions on model performance. This aspect is crucial to consider, particularly when dealing with models such as ARIMA.

Model Assessment

When evaluating the performance of the model, the Mean Absolute Percent Error (MAPE) is chosen as the performance metric instead of the commonly used Root Mean Square Error (RMSE). MAPE is considered more appropriate for this analysis as it is easier to interpret and communicate. The MAPE will be calculated using a one-period ahead model for the test period.
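For reference, both metrics can be written in a few lines of NumPy. The helper names `mape` and `rmse` and the three sample values are illustrative and not part of the original notebook.

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percent Error, in percent. Assumes no actual value is zero."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100.0

def rmse(actual, forecast):
    """Root Mean Square Error, in the units of the data."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.sqrt(np.mean((actual - forecast) ** 2))

# Hypothetical demand values and forecasts (MWh)
actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 180.0, 400.0])

print("MAPE (%):", mape(actual, forecast))
print("RMSE (MWh):", rmse(actual, forecast))

# Without the absolute value, over- and under-forecasts cancel each other out
signed_mean = np.mean((actual - forecast) / actual) * 100.0
print("Mean signed percent error (%):", signed_mean)
```

MAPE is scale-free, which makes it easy to communicate, while RMSE stays in MWh and penalizes large misses more heavily. The last line shows why the absolute value matters: the signed errors here average to zero even though two of the three forecasts are off by 10%.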

p_train = best_rfr.predict(X_train)
train_resid_1step = y_train["target_t1"]- p_train

p_test = best_rfr.predict(X_test)
test_resid_1step = y_test["target_t1"]- p_test

test_df = y_test[["target_t1"]]*std+mean
test_df["pred_t1"] = p_test*std+mean
test_df["resid_t1"] = test_df["target_t1"].add(-test_df["pred_t1"])
test_df["abs_resid_t1"] = abs(test_df["resid_t1"])
# Use the absolute residual so positive and negative errors do not cancel in the MAPE
test_df["ape_t1"] = test_df["abs_resid_t1"].div(test_df["target_t1"])

test_MAPE = test_df["ape_t1"].mean()*100
print("1-period ahead forecasting MAPE: ", test_MAPE)

Output:

test_df[["target_t1", "pred_t1"]].plot()

plt.title("1-period ahead Forecasting")
plt.ylabel("(MWh)")
plt.legend()
plt.show()

Output:

The MAPE value is slightly above 10%, which is quite remarkable considering the strong dependence of electricity demand on weather conditions. Moreover, it is important to note that February experienced exceptionally cold temperatures, making the result even more astonishing.

plt.scatter(y=y_train["target_t1"],x=p_train, label="train")
plt.scatter(y=y_test["target_t1"],x=p_test, label="test")
plt.title("1-period ahead Actual vs forecasting ")
plt.ylabel("Actual")
plt.xlabel("Forecast")
plt.legend()
plt.show()

Output:

By plotting the actual values against the forecasted values, we can visually assess the model's ability to fit the training data and generalize it to the test data.

Residual Analysis

test_resid_1step.plot.hist(bins=10, title="Test 1-step ahead residuals distribution")
plt.xlabel("Residuals")
plt.show()

Output:

test_resid_1step.plot(title="Test 1-step ahead residuals time series")
plt.ylabel("Residuals")
plt.show()

Output:

plt.scatter(x=y_test["target_t1"].values, y=test_resid_1step.values)
plt.title("Test 1-step ahead residuals vs Actual values")
plt.ylabel("Residuals")
plt.xlabel("Actual values")
plt.show()

Output:

Forecasting

Multi-period ahead model building

Once we determine the optimal set of hyperparameters, we can train a new instance of the Random Forest model using the most recent and relevant data. Typically, it is recommended to have at least two years of data to generate a long-term daily forecast. Let's proceed with retraining a collection of Random Forest models using the MultiOutput Regression feature.

multi_rfr = RandomForestRegressor().set_params(**best_rfr_grid).\
    fit(X_train.loc["2016":"2017"], y_train.loc["2016":"2017"])

p_train = multi_rfr.predict(X_train)
train_resid_1step = y_train- p_train

p_test = multi_rfr.predict(X_test)
test_resid_1step = y_test- p_test

Lastly, it is important to evaluate the forecasting accuracy using the MAPE (Mean Absolute Percent Error) metric across multiple periods and determine if it remains consistent and stable.

periods = [1, 7, 14, 30]

ytest_df = y_test*std+mean
ptest_df = pd.DataFrame(data=p_test*std+mean, index=test_df.index, columns=["pred_t" + str(i) for i in range(1, 31)])
test_df = pd.concat([ytest_df, ptest_df], axis=1)

test_MAPE = []

for t in periods:
    test_df["resid_t" + str(t)] = test_df["target_t" + str(t)].add(-test_df["pred_t" + str(t)])
    test_df["abs_resid_t" + str(t)] = abs(test_df["resid_t" + str(t)])
    test_df["ape_t" + str(t)] = test_df["abs_resid_t" + str(t)].div(test_df["target_t" + str(t)])
    test_MAPE.append(round(test_df["ape_t" + str(t)].mean(), 4)*100)

print("MAPE test: ", test_MAPE)

Output:

mape_df = pd.DataFrame(index=periods, data={"test_MAPE": test_MAPE})
mape_df.plot(kind="bar", legend=False)
plt.title("Mean Absolute Percent Error in Test")
plt.xlabel("Forecasting Period")
plt.ylabel("%")
plt.xticks(rotation=0)
plt.show()

Output:

As expected, forecasting accuracy is better at shorter horizons. It is worth noting that having more data does not always guarantee better results. Additionally, the MAPE tends to increase as the forecasting horizon extends, but overall, it demonstrates a relatively stable pattern.

Actual VS Forecasted

As mentioned before, a convenient method to evaluate the model's fit is by plotting the actual values against the forecasted values and examining the distribution of data points.

#f, ax = plt.subplots(nrows=3,ncols=2)
for t in periods:
    test_df[["target_t" + str(t), "pred_t" + str(t)]].plot(x="pred_t" + str(t), y="target_t" + str(t) ,kind="scatter")
    plt.title("{}-period(s) ahead forecasting".format(t))
    plt.xlabel("Forecasted (MWh)")
    plt.ylabel("Actual values (MWh)")
    plt.xticks(rotation=45)
    plt.show()

Output:

It is evident that as the forecasting period becomes longer, there is an increase in the scattering of data points, particularly for extreme values.

Forecasting 30 days ahead

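# Forecast dates start the day after the last test observation
forecast_range = pd.date_range(start=test_df.index.max() + pd.Timedelta(days=1), periods=tau, freq="D")
len(forecast_range)

Output:

# The last test row holds the predictions for horizons 1 to tau made from the most recent observation
forecast = p_test[-1, :]*std+mean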

test_df["target_t1"].plot(label="actual")
plt.scatter(x=test_df.index, y=test_df["pred_t1"], c="r", alpha=0.2, label="test preds")
plt.plot(forecast_range, forecast, c="r", alpha=0.5, label="forecasting")
plt.ylabel("(MWh)")
plt.xticks(rotation=45)
plt.title("Forecasting Daily Electricity Consumption (MWh) in Spanish Market (2018)")
plt.legend()
plt.show()

Output:

Future Aspects of Machine Learning

Machine learning holds immense potential for the energy industry. By analyzing vast historical data, including electricity usage, weather patterns, and seasonal variations, machine learning algorithms can provide accurate predictions. Challenges such as complex relationships between variables are being addressed with advanced techniques. The future of this field looks promising, with improved accuracy, integration of IoT and smart grid data, and real-time predictive analytics. This will enable efficient energy distribution, demand-side response, and seamless integration of renewable energy sources. Moreover, machine learning will support predictive maintenance for energy infrastructure and foster energy conservation and sustainability. The collaboration between AI and human expertise will be essential, and transparent AI models will build trust and accountability. Overall, machine learning is set to transform the energy sector and pave the way for a more sustainable and efficient energy ecosystem.

Conclusion

Electricity consumption prediction using machine learning is a game-changer in the energy industry. By harnessing the power of data and advanced algorithms, we are unlocking new possibilities for efficient energy management and a greener tomorrow. As machine learning continues to evolve, we can look forward to a future where electricity consumption becomes more sustainable, economical, and environmentally friendly. Embracing this cutting-edge approach will pave the way for a brighter and more sustainable energy future.
