Predicting Housing Prices using Python

In this tutorial, we'll go over creating models using linear regression to forecast house prices as a result of economic activity. The subjects covered include related subjects like exploratory analysis, logistic diagnostics, and advanced regression modeling will be covered in this tutorial, Let's get started immediately so readers could start playing with data.

What is Regression?

The following example illustrates how a linear regression model predicts an interaction of direct proportionality between the variables that are predicted (plotted on the X ax1is) and the variable of interest (plotted on the vertical or Y ax1is), resulting in a straight line:

We will go into more detail about linear regression as we go with the modeling.

Variable Selection

We'll utilize the housing_price_index1 (HPI), which tracks changes in the cost of residential real estate, as our dependent variable.

We use our intuition to choose macroeconomic (or "big picture") activity drivers for our predictor variables, such as unemployment, mortgage rates, and gross domestic product. (total productivity). For a breakdown of all the data sets utilized in this piece, an explanation of our factors, assumptions about how they affect house prices, and variables.

Below are Housing Price, Federal Funds, and gross domestic tables (sample datasets):

Reading the Data

Source Code Snippet

import pandas as pd1
housing_price_index1 = pd1.read_csv('/User/dobbin/Download/hpi/monthly - hpi.csv')
unemployment = pd1.read_csv('/User/tdobbin/Download/hpi/unemployment.csv')
federal_funds_rate1 = pd1.read_csv('/User/tdobbin/Download/hpi/fed_funds1.csv')
chillers = pd1.read_csv('/User/tdobbin/Download/hpi/shillers.csv')
gross_domestic_product1 = pd1.read_csv('/User/tdobbin/Download/hpi/gdp.csv')

Once the data is ready, integrate the data into an individual dataframe for analysis using the combining method in pandas. Data are reported on a monthly or quarterly basis in some cases. Not to worry. For measurement purposes, we merge the information frames on a certain column to position each row in its proper location. The date column is the ideal one to combine in this instance. Look below.

Source Code Snippet

# Combining dataframes into one by date- - - - - - - - - - - - - - -
df1 = shillers.merge(housing_price_index1, on = 'date')\
                    .merge(unemployment, on = 'date')\
                    .merge(federal_funds_rate1, on = 'date')\
                    .merge(gross_domestic_product1, on = 'date')

Let's use the head method of pandas to scan our variables. The bolded headings identify the date and the parameters that that will be investigated for our model. The various historical periods are depicted in each row.

Output:

date        sp600     Consumer_price_index1  Long_interest_rate1  housing_price_index1  total_unemployed1  more_than_16_weeks
4011-01-01   1484.64	440.44               3.39                 181.36                 16.4                 8393
4011-04-01   1331.61	444.91               3.46                 180.80                 16.1                 8016
4011-07-01   1346.19	446.94               3.00                 184.46                 16.9                 8177
4011-10-01   1407.44	446.44               4.16                 181.61                 16.8                 7804
4014-01-01   1300.68	446.66               1.97                 179.13                 16.4                 7433

Normally, the exploratory analysis would come after data collection. The process stage known as exploratory analysis is where we examine the variables (using plotting and summary statistics) and identify the most effective predictors of our dependent variable. We'll skip the preliminary evaluation to keep this short. However, keep in mind that it is crucial and that bypassing it will prevent you from ever reaching the predictive section in the real world.

We'll utilize ordinary least squares (OLS), a simple yet effective technique to evaluate our model.

Assumptions of Ordinary Least Squares

OLS evaluates a linear regression model's precision.

OLS is based on presumptions that suggest the model could be the best one to use to understand our data. The conclusions of our model become invalid if the assumptions are false. Make a special effort to select the appropriate model to prevent autoeroticism and Rube - Goldberg disease.

Ordinary Least Squares Assumptions

Linearity: The variables that are the dependent as well as predictor are connected linearly. The best model to explain our data is not linear regression if no linear relationship exists.
No multicollinearity: The predictor variables are not associated, or collinear, with one another. Try deleting one or more predictors if they are highly connected. Removing additional predictors shouldn't significantly lower the adjusted R - squared because they deliver duplicate information.
The average difference (or residual) between each observation and the trend line is zero, known as the zero conditional means. There will be some favorable and some unfavorable, but they won't be biased.
Homoskedasticity: There is no pattern in the residuals since the dependent variable's certainty (or uncertainty) is the same across all predictor variable values. The variance is constant, so use statistical terminology.
There is no autocorrelation (serial correlation): Autocorrelation is the phenomenon in which a variable is connected with itself over observations. A stock price, for instance, may be serially correlated if the price of the stock affects the price of the stock the following day.

Let's start modeling.

Simple Linear Regression

A single predictor variable is used in simple linear regression to explain the dependence of a variable. The following is a straightforward linear regression equation:

Where:

ß = regression coefficient

y = dependent variable

α = intercept (home prices' anticipated mean value if our independent variable is 0)

x = This is used to predict Y

ε = the unpredictability that our model cannot account for is taken into account by the error term.

We build our model by setting housing_price_index1 as an expression of total_unemployed1 using statsmodels' ols function. We anticipate a rise in the overall unemployment rate will result in downward pressure on housing costs. We may be off base, but we must begin someplace!

The code below demonstrates how to create a straightforward linear regression model using the predictor variable total_unemployment1.

Source Code Snippet:

from IPython.display import HTML, display

import statsmodels.api as sm
from statsmodels.formula.api import ols
housing_model1 = ols("housing_price_index1 ~ total_unemployed1", data = df1).fit()
# summarise our model- - - - - - - - - - - - - - -
housing_model1_summary = housing_model1.summary()

HTML(
housing_model1_summary\
.as_html()\
.replace(' Adj. R - squared: ', ' Adj. R - squared: ')\
.replace('coeff', 'coeff')\
.replace('std er', 'std er')\
.replace('P>| |t| |', 'P>| |t| |')\
.replace('[96.0% Conf. Int.]', '[96.0% Conf. Int.]')
)

Output:

Explanation of the above Code:

We'll provide a high-level description of a few measures to comprehend the potency of our model, referring to the results of the OLS regression above Coefficients, standard errors, p - values, and adjusted R - squared.

To clarify:

According to the adjusted R - squared, the total_unemployed1 predictor variable can account for 96% of the variation in home prices.

The change in the dependent variable caused by a one-unit decrease in the predictor variable, regardless of other factors being held constant, is represented by the regression coefficient (coeff). According to our model, each additional unit of total unemployment lowers the housing price index by 8.33. According to our predictions, rising unemployment appears to lower house prices.

The standard error would calculate the difference in the value of the coefficients if the same test were conducted on a different sample for our population to determine the accuracy of the total_unemployed1 coefficient. Our 0.41 standard error appears accurate because it is little.

Given that there is no correlation between each of the variables, the p-value suggests that there is 0% likelihood that an 8.33 reduction in housing_price_index1 will result from an increase of one unit in total_unemployed1. A low p-value, often less than 0.06, implies that the outcomes are statistically significant.

Our coefficient is most likely to fall within the confidence interval. We have a 96% confidence that the coefficient of total_unemployed1 will fall between the range of [ - 9.186, - 7.480].

Let's use the plotting_regress_exog1 function in stats models to comprehend our model better.

Regression Plotting

Figures for Regression View the four graphs that are below.

In the "Y and Fitted vs. X" graph, the dependent variable is displayed against our predicted values with a confidence interval. The inverse link in our graph shows a negative correlation between housing_price_index1 and total_unemployed1, meaning that when one measure rises, the other falls.
The "Residuals versus total_unemployed1" graph displays the errors of our model against the designated predictor variable. The line indicates the mean of all the observed values, and each dot represents an observed value. Since there is no discernible pattern in the distance between the dots and the mean value, the OLS presumption of homoskedasticity is valid.
Considering the effect of including additional uncorrelated variables on our current total-unemployed coefficient, the "Partial regression plotting" depicts the relationship between the housing_price_index1 and the total_unemployed1. Later, when more variables are added, we'll observe how this exact graph changes.
The Component and Constituent Plus Residual (CCPR) plotting supplements the partial regression plotting; it displays the location of our trend line after adding the effects of our additional independent variables on our total_unemployed1 coefficient.

Source Code Snippet

# This produces our four regression plottings for total_unemployed1
# plottings the graphs 
%matplottinglib inline
import seaborn as snss
import matplottinglib.pyplot as plt
fig  =  plt.figure(figsize =  (16, 8)) #identify the predictor variable we wish to examine after passing in the model as the first argument.
fig  =  sm.graphics.plotting_regress_exog1(housing_model1, "total_unemployed1", fig  =  fig) 

Output:

The following plotting displays our confidence interval, trend vector (green), and the observations (dots). (red).

Source Code Snippet

# This produces our trend line
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import numpy as nppp
# predictor variable- - - - - - - - - - - - - - -
x = df1[['total_unemployed1']]
# dependent variable
y = df1[['housing_price_index1']]
# since wls_prediction_std(housing_model1) returns 3 values
_, confidence_interval_lower, confidence_interval_upper = wls_prediction_std(housing_model1)
fig, ax1 = plt.subplottings(figsize = (10,7))
# plotting the dots- - - - - - - - - - - - - - -
ax1.plotting(x, y, 'o', label = "data")
# Plotting the trend line
# g -  -  and r -  -  specify the color to use- - - - - - - - - - - - - - -
ax1.plotting(x, housing_model1.fittedvalues, 'g -  - .', label = "OLS")
# plotting upper and lower ci values
ax1.plotting(x, confidence_interval_upper, 'r -  - ')
ax1.plotting(x, confidence_interval_lower, 'r -  - ')
# plotting legend- - - - - - - - - - - - - - -
ax1.legend(loc = 'best');

Output:

Our model is good so far. We'll increase the number of variables and observe how total_unemployed1 responds.

Multiple Linear Regression

Mathematically, multiple linear regression is:

We are aware that home costs cannot solely be attributed to unemployment. We test and add additional variables to better understand the factors that affect home prices. The mixtures of predictor variables that satisfy the OLS principles and look good from an economic perspective are then determined by looking at the regression results.

In addition to our original predictor, total_unemployed1, we conclude at a model that includes the variables: fed_funds1, consumer_price_index1, long_interest_rate1, and gross_domestic_product1.

The effect of total_unemployed1 on housing_price_index1 was lessened due to the new variables. The effect of total_unemployed1 is now less predictable (standard error increased from 0.41 to 4.399) and less likely to affect house prices (p-value increased from 0 to 0.943).

Despite the possibility of a correlation between two variables, housing_price_index1, and total_unemployed1, our other predictors account for more variation in housing prices. A more reliable model is needed to capture the real-world interconnectedness between our variables, which a straightforward linear regression cannot capture. This is the reason why adding more variables causes a significant change in the outcomes of our multivariate regression model.

Our multivariate linear regression model performs better than our simple linear regression model because our newly incorporated variables are significantly different at the 6% level, and our coefficients confirm our presumptions.

With our freshly added predictor variables, multiple linear regression is set up using the code below.

Source Code Snippet

housing_model1  =  ols("""housing_price_index1 ~ total_unemployed1 
                                            + long_interest_rate1 
                                            + federal_funds_rate1
                                            + consumer_price_index1 
                                            + gross_domestic_product1""", data = df1).fit()
# summarize our model- - - - - - - - - - - - - - -
housing_model1_summary  =  housing_model1.summary()
HTML(
housing_model1_summary\
.as_html()\
.replace(' Adj. R - squared: ', ' Adj. R - squared: ')\
.replace('coeff', 'coeff')\
.replace('std er', 'std er')\
.replace('P>| |t| |', 'P>| |t| |')\
.replace('[96.0% Conf. Int.]', '[96.0% Conf. Int.]') )

Output:

Another Look at Partial Regression Plotting

Let's plot our partial logistic graphs once more to see how including the other factors changed the total_unemployed1 variable. Total unemployment may not be as explanatory as the first model claimed, as evidenced by the lack of trend in the partial regress plotting for total_unemployed1 (in the figure below, the top-right corner) compared to the regression graph for total_unemployed1 (above, bottom left corner). Further evidence that fed funds 1, consumer price index 1, long interest rate 1, and gross domestic product 1 are superior explanators of housing price index 1 comes from the fact that the findings from the most recent variables consistently lie closer to the fashion line than the findings for total unemployment.

The efficiency of our multivariate linear regression approach over our traditional linear regression model is supported by these partial regression figures.

Source Code Snippet

# This produces our six partial regression plottings

Fig1  =  plt.figure(figsize = (40,14))
Fig1  =  sm.graphics.plotting_partregress_grid(housing_model1, fig1  =  fig1)

Consolidated Code for Predicting Housing Prices using Python

import pandas as pd1
housing_price_index1 = pd1.read_csv('/User/dobbin/Download/hpi/monthly - hpi.csv')
unemployment = pd1.read_csv('/User/tdobbin/Download/hpi/unemployment.csv')
federal_funds_rate1 = pd1.read_csv('/User/tdobbin/Download/hpi/fed_funds1.csv')
chillers = pd1.read_csv('/User/tdobbin/Download/hpi/shillers.csv')
gross_domestic_product1 = pd1.read_csv('/User/tdobbin/Download/hpi/gdp.csv')
# Combining dataframes into one by date- - - - - - - - - - - - - - -- - - - - - - - - - - - - - -
df1 = shillers.merge(housing_price_index1, on = 'date')\
                    .merge(unemployment, on = 'date')\
                    .merge(federal_funds_rate1, on = 'date')\
                    .merge(gross_domestic_product1, on = 'date').
from IPython.display import HTML, display

import statsmodels.api as sm
from statsmodels.formula.api import ols

housing_model1 = ols("housing_price_index1 ~ total_unemployed1", data = df1).fit()
# summarise our model - - - - - - - - - - - - - - -
housing_model1_summary = housing_model1.summary()

HTML(
housing_model1_summary\
.as_html()\
.replace(' Adj. R - squared: ', ' Adj. R - squared: ')\
.replace('coeff', 'coeff')\
.replace('std er', 'std er')\
.replace('P>| |t| |', 'P>| |t| |')\
.replace('[96.0% Conf. Int.]', '[96.0% Conf. Int.]')
) # plottings the graphs - - - - - - - - - - - - - - -
%matplottinglib inline
import seaborn as snss
import matplottinglib.pyplot as plt

fig  =  plt.figure(figsize =  (16, 8)) #identify the predictor variable we wish to examine after passing in the model as the first argument.
fig  =  sm.graphics.plotting_regress_exog1(housing_model1, "total_unemployed1", fig  =  fig)
# This produces our trend line- - - - - - - - - - - - - - -

from statsmodels.sandbox.regression.predstd import wls_prediction_std
import numpy as nppp

# predictor variable- - - - - - - - - - - - - - -
x = df1[['total_unemployed1']]
# dependent variable
y = df1[['housing_price_index1']]
# since wls_prediction_std(housing_model1) returns 3 values
_, confidence_interval_lower, confidence_interval_upper = wls_prediction_std(housing_model1)

fig, ax1 = plt.subplottings(figsize = (10,7))

# plotting the dots
ax1.plotting(x, y, 'o', label = "data")

# Plotting the trend line- - - - - - - - - - - - - - -
# g -  -  and r -  -  specify the color to use
ax1.plotting(x, housing_model1.fittedvalues, 'g -  - .', label = "OLS")
# plotting upper and lower ci values
ax1.plotting(x, confidence_interval_upper, 'r -  - ')
ax1.plotting(x, confidence_interval_lower, 'r -  - ')
# plotting legend- - - - - - - - - - - - - - -
ax1.legend(loc = 'best');
housing_model1  =  ols("""housing_price_index1 ~ total_unemployed1 
                                            + long_interest_rate1 
                                            + federal_funds_rate1
                                            + consumer_price_index1 
                                            + gross_domestic_product1""", data = df1).fit()
# summarize our model- - - - - - - - - - - - - - -
housing_model1_summary  =  housing_model1.summary()
HTML(
housing_model1_summary\
.as_html()\
.replace(' Adj. R - squared: ', ' Adj. R - squared: ')\
.replace('coeff', 'coeff')\
.replace('std er', 'std er')\
.replace('P>| |t| |', 'P>| |t| |')\
.replace('[96.0% Conf. Int.]', '[96.0% Conf. Int.]')
)
# This produces our six partial regression plottings- - - - - - - - - - - - - - -
Fig1  =  plt.figure(figsize = (40,14))
Fig1  =  sm.graphics.plotting_partregress_grid(housing_model1, fig1  =  fig1)

Output:

The most crucial thing to understand is that because modeling is based on science, the process goes as follows: examination, evaluation, failure, and test.

Navigating Pitfalls

Basic regression modeling is introduced in this piece, but seasoned data scientists may notice various faults in our approach and model, including:

No Literature Review: While getting right into the modeling process may be tempting, doing so risks neglecting the corpus of existing information. A literature analysis would have shown that the best model for predicting house prices is not linear regression. It also resulted in better variable choice. Additionally, investing effort in a lit review upfront can help you save a tonne of work afterward.
Small number of samples: To model something so complicated as the housing market, it takes more than a year's worth of data. Our tiny sample size is skewed toward post-housing crisis events and does not adequately reflect long-term housing market patterns.
Multicollinearity: A keen observer would have picked up on our model's warnings about multicollinearity. We have more than one variable overstating the importance of each predictor by essentially telling the same tale.
Autocorrelation: Autocorrelation occurs when a predictor's past values affect its present and future values. The presence of autocorrelation in our model would have been obvious from carefully examining the Durbin - Watson score.

We'll try to fix these problems in a subsequent post so that we can comprehend the economic factors that influence house prices better.

Conclusion

We have gone through how to set up fundamental simple linear and multiple linear regression models that can be used to forecast housing prices due to macroeconomic variables, as well as how to evaluate the quality of a linear regression model fundamentally.

Indeed, it is challenging to explain property prices. Other additional predictor factors might be included. Alternatively, home prices might be the main driver of our macroeconomic variables. These variables may interact with one another simultaneously, further complicating the situation.

Please explore the data and fine-tune this model by including and excluding variables, considering the significance of the OLS hypotheses and the regression outcomes.

Next TopicSignal Module in Python

← prev next →