Python Tutorial

Linear regression models in Machine Learning are used to predict the future values of an attribute. In this model, we have specific independent attributes, also known as predictors. The model takes in these predictors, fits a straight line to the data, and gives us a model to predict the value of the dependent attribute using specific values of these independent attributes. By fitting, we mean optimizing the parameters to get an optimal solution.

We can try different combinations of the independent attributes to find which one predicts the value most accurately, but this takes a lot of work. The question, then, is how to find quickly which attributes are crucial for the model. There are many ways, such as adjusted R-squared and Mean Squared Error, in which we predict the values of the dependent attribute using the model and use the difference between the actual and predicted values to judge the model's accuracy.
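As a rough illustration of the two metrics named above, the sketch below computes Mean Squared Error and adjusted R-squared with NumPy. The function names and the toy numbers are assumptions made for this example only; they do not come from the dataset used later.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def adjusted_r_squared(y_true, y_pred, n_features):
    """R-squared penalized for the number of predictors in the model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1))


# Toy values, made up purely for illustration
y_actual = [3.0, 5.0, 7.0, 9.0, 11.0]
y_predicted = [2.5, 5.5, 7.0, 9.5, 10.5]

mse = mean_squared_error(y_actual, y_predicted)
adj_r2 = adjusted_r_squared(y_actual, y_predicted, n_features=1)
print("MSE:", mse)
print("Adjusted R-squared:", adj_r2)
```

A lower MSE and a higher adjusted R-squared both point to a better fit. Unlike plain R-squared, the adjusted value can fall when a feature that adds little is included.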

Another statistical approach to solve this problem is Hypothesis Testing. We will create a hypothesis, calculate the value of the statistic, and according to the level of significance and p-value, judge the quality of the model fit.

What do we do in Manual Feature Elimination?

The steps are as follows:

Build an ML model with all the desired features.
Delete the features that do not add value to the result of the model. These are the features with high p-values.
Test the correlation between the remaining features and, from each strongly correlated pair, drop one feature.
Rebuild the model with the new set of features and repeat the process.

Usually, researchers advise one to maintain a balance between automated and manual selection to get an optimal number of features. We will discuss how to use Hypothesis Testing in the selection of the features.

Before going to hypothesis testing, let us understand the Linear Regression model and its parameters.

In linear regression, we fit a straight line to the data. A straight line has the following equation:

y = β₀ + β₁x

Where y is the dependent feature, x is the independent feature, β₀ is the intercept of the straight line, and β₁ is the slope of the straight line. For simplicity, we are using only one independent feature.
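For a single independent feature, the intercept and slope that minimize the squared error have a closed-form solution. The sketch below computes them with NumPy. The function name `fit_line` and the toy points are assumptions made for illustration.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares intercept and slope for one independent feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    intercept = y.mean() - slope * x.mean()
    return float(intercept), float(slope)


# Toy points that lie exactly on the line y = 1 + 2x
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_line(x_demo, y_demo)
print(f"intercept = {intercept}, slope = {slope}")
```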

Since we are more interested in the features than in the model's overall fit, we will ignore the intercept. We will focus on the slope of the line, i.e., the feature's coefficient. We will use the built-in diabetes dataset and two of its features, one independent and one dependent.

Code

# Python program to plot a scatter plot

# Importing the required classes
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt

# Loading the dataset
X, Y = load_diabetes(return_X_y = True)

# Taking a single independent feature
X = X[ : , 0]

# Plotting the scatter plot
plt.scatter(X, Y)
plt.title("Scatter Plot")
plt.xlabel("Independent Feature (x)")
plt.ylabel("Dependent Feature (y)")
plt.show()

Output:

Scatter Plot

We will fit a regression model to the dataset and plot the regression line.

Code

# Python program to plot the regression line

# Importing the required classes
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Loading the dataset
X, Y = load_diabetes(return_X_y = True)

# Taking a single independent feature
# Reshaping the arrays to make them consistent for the model
X = X[ : , 0].reshape(-1, 1)

# Fitting the linear regression model to the dataset
lr = LinearRegression()
lr.fit(X, Y)

# Getting the values of intercept and slope calculated using the linear regression model
print(f"The intercept of the linear equations is {lr.intercept_} and the slope of the line is {lr.coef_[0]}")

# Finding the predicting values using the trained model
# We will find the dependent feature of the same set of independent features to compare the results
Y_pred = lr.predict(X)

# Plotting the scatter plot and the regression line using the predicted values
plt.scatter(X, Y, label = "Original Data")
plt.plot(X, Y_pred, color = 'r', label = "Regression Line", linewidth = 3)
plt.title("Scatter Plot and Regression Line")
plt.xlabel("Independent Feature (x)")
plt.ylabel("Dependent Feature (y)")
plt.legend()
plt.show()

Output:

The intercept of the linear equations is 152.13348416289594 and the slope of the line is 304.18307452830607

The graph shows that the points are widely scattered, with no clear trend between the two features. Even when the data do not follow a linear trend, scikit-learn will still fit a linear model to them. However, in this case, the error term would be large and the accuracy low. Thus, being able to fit a straight line does not imply that the data can be explained by a regression line. Therefore, we need other measures to determine if the feature is right for the ML model we are working on.

In our example, to test whether x is important, we will run a hypothesis test on β₁.

Steps to perform a Hypothesis Test

Claim or declare the hypothesis.
Set the criterion for making a decision, which is called the Level of Significance.
Calculate the test statistic.
Make the decision by comparing the p-value of the test statistic with the Level of Significance.

Step 1

We will start by stating the hypothesis. The hypothesis will be based on the value of β₁. Since this is the null hypothesis, we have to declare an equality statement related to β₁.

We will assume that this β₁ is not significant. This means that x and y have no relationship between them. This will happen when the slope of the line is zero.

Hence, β₁=0

Null Hypothesis (H₀): β₁=0

Alternative Hypothesis (H_A): β₁≠0

Step 2

Now we have to set a boundary that tells us whether to reject or fail to reject the null hypothesis. Commonly used levels of significance are 1%, 5%, and 10%. We will take the level of significance as 5%.

Step 3

Now comes the main part of hypothesis testing. We have to calculate the test statistic, which measures the significance of x in the regression model on y. The test statistic is not compared with the level of significance directly. Instead, we compare the p-value corresponding to the calculated value of the test statistic with the level of significance. Let us see what this means.

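We will calculate the t-score value for the mean of the independent feature x.

t = (x̄ − μ) / (s / √n)

Where x̄ is the sample mean, μ is the population's mean, and s is the standard deviation of the selected sample. n is the sample size. Together, s/√n is known as the standard error. For the regression slope, the same form applies with the estimate of β₁ in place of x̄, the hypothesized value 0 in place of μ, and the standard error of the estimate in the denominator.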
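The t-score built from the sample mean, the population mean, and the standard error s/√n can be computed as in the sketch below. The function name `t_score` and the sample values are assumptions made for this example.

```python
import numpy as np


def t_score(sample, mu):
    """t-score of a sample mean against a hypothesized population mean mu."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s = sample.std(ddof=1)            # sample standard deviation
    standard_error = s / np.sqrt(n)   # s / sqrt(n)
    return float((sample.mean() - mu) / standard_error)


# Made-up measurements, tested against a hypothesized mean of 5.0
measurements = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
print("t-score:", t_score(measurements, mu=5.0))
```

`ddof=1` is used so that s is the sample standard deviation rather than the population one.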

Now, we have to find the p-value. We will use the cumulative probability table for the t-distribution, also known as the t-table, to find the p-value for the t-score.
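Instead of reading a printed t-table, the p-value can be taken from the t-distribution in SciPy. The sketch below assumes a two-sided test, which matches an alternative hypothesis of the form "not equal to". The helper name `two_sided_p_value` and the example numbers are hypothetical.

```python
from scipy import stats


def two_sided_p_value(t_value, degrees_of_freedom):
    """Probability of a t-score at least this extreme, in either direction,
    when the null hypothesis is true."""
    return float(2.0 * stats.t.sf(abs(t_value), df=degrees_of_freedom))


# Example: a t-score of 2.0 with 30 degrees of freedom
p_value = two_sided_p_value(2.0, degrees_of_freedom=30)
print("p-value:", p_value)
```

The degrees of freedom depend on the test: n − 1 for a one-sample t-test, and n − 2 for the slope of a regression with one independent feature.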

Decide on the basis of the p-value with respect to the given significance level value.

Step 4

Now, we will see the rule for rejecting or failing to reject the null hypothesis. In the rule below, 0.05 is the level of significance. At 5%, the null hypothesis is rejected when the p-value is less than 0.05.

If the p-value < 0.05, we reject the null hypothesis, and β₁ is significant.
If the p-value > 0.05, we fail to reject the null hypothesis, and β₁ is not significant.

If we fail to reject the null hypothesis, we have no evidence that β₁ differs from zero (in other words, β₁ is insignificant), so the feature is of no use in the model. Similarly, if we reject the null hypothesis, it means that β₁ is not zero, and the fitted line is significant.
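The four steps can be run end to end on the single feature used earlier. The sketch below is one way to do it, assuming SciPy is available: `scipy.stats.linregress` reports a two-sided p-value for the null hypothesis that the slope is zero. The variable names are chosen for illustration.

```python
from scipy import stats
from sklearn.datasets import load_diabetes

# Same data as before: the first feature as x, the target as y
X, y = load_diabetes(return_X_y=True)
x = X[:, 0]

alpha = 0.05  # Step 2: level of significance

# Step 3: fit the line and get the p-value for H0: slope = 0
result = stats.linregress(x, y)
t_statistic = result.slope / result.stderr

# Step 4: apply the decision rule
reject_null = result.pvalue < alpha

print(f"slope = {result.slope:.3f}, standard error = {result.stderr:.3f}")
print(f"t-statistic = {t_statistic:.3f}, p-value = {result.pvalue:.6f}")
if reject_null:
    print("Reject the null hypothesis: the slope is significant.")
else:
    print("Fail to reject the null hypothesis: the slope is not significant.")
```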

We have been using only one independent feature for all this time. Let us see now how the above notations will change for the multiple linear regression models.

The linear equation for a multiple regression model is as follows:

y = β₀ + β₁x₁ + β₂x₂ + ... + β_k x_k

Where k is the total number of independent features in the model.

Here are the null and alternative hypotheses for the multiple linear regression model.

Null Hypothesis (H₀): β₁ = β₂ = β₃ = ... = β_k = 0

Alternative Hypothesis (H_A): β_i ≠ 0 for at least one i, where i ranges from 1 to k.

Example in Python

Let us now see the implementation of the hypothesis test in Python. We will use the same dataset, but this time we will consider all the independent features and one dependent feature. We must fit a multiple linear regression model to this data to predict the target, a quantitative measure of diabetes progression. Let us take a look at the various columns of the dataset.

Here we have the attribute names and the top 5 rows of independent and dependent features.

Code

# Python program to display the dataset to be used

# Importing the required function
from sklearn.datasets import load_diabetes

# Loading the dataset once and reading the features, target, and feature names from it
ld = load_diabetes()
X, Y = ld.data, ld.target
attributes = ld.feature_names

# We will take all the features this time
# Looking at the attribute names and the top 5 rows of the dataset
print("Features Names: \n", attributes)
print("Independent Features: \n", X[ : 5, :])
print("Dependent Features: \n", Y[ : 5])

Output:

Features Names:
 ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
Independent Features:
 [[ 0.03807591  0.05068012  0.06169621  0.02187239 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990749 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632753 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06833155 -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 -0.00567042 -0.04559945 -0.03419447
  -0.03235593 -0.00259226  0.00286131 -0.02593034]
 [-0.08906294 -0.04464164 -0.01159501 -0.03665608  0.01219057  0.02499059
  -0.03603757  0.03430886  0.02268774 -0.00936191]
 [ 0.00538306 -0.04464164 -0.03638469  0.02187239  0.00393485  0.01559614
   0.00814208 -0.00259226 -0.03198764 -0.04664087]]
Dependent Features:
 [151.  75. 141. 206. 135.]

This time we will use the statsmodels library to fit the linear regression model. We are using this library because it has a method to display the summary statistics of the linear fit. For each coefficient, the summary includes the estimate, the standard error, the t-statistic, the p-value, and the 95% confidence interval.

Code

# Python program to display the summary statistic of the linear regression fit

# Importing the required modules
from sklearn.datasets import load_diabetes
import statsmodels.api as sm

# Loading the dataset
X, Y = load_diabetes(return_X_y = True)

# Adding a constant column so that the model estimates an intercept
X_lm = sm.add_constant(X)

# Fitting the linear regression model to the dataset
lm = sm.OLS(Y, X_lm).fit()

# Printing the summary statistic
print(lm.summary())

Output:

Now, look at the p-value and the t-statistic of the constant and of each coefficient. All the attributes whose p-value is greater than the level of significance (0.05) are not significant to the model. The statsmodels library makes hypothesis testing simple with just one method call.

From the above table, we can conclude that x1, x7, x8, and x10 are insignificant for the regression model.
