# Implementation of Linear Regression using Python

Linear regression is a statistical technique to describe relationships between dependent variables with a number of independent variables. This tutorial will discuss the basic concepts of linear regression as well as its application within Python.

In order to give an understanding of the basics of the concept of linear regression, we begin with the most basic form of linear regression, i.e., "Simple linear regression".

## Simple Linear Regression

Simple linear regression (SLR) is a method to predict a response using one feature. It is believed that both variables are linearly linked. Thus, we strive to find a linear equation that can predict an answer value(y) as precisely as possible in relation to features or the independently derived variable(x).

Let's consider a dataset in which we have a number of responses y per feature x: For simplification, we define:

x as feature vector, i.e., x = [x1, x2, x3, …., xn],

y as response vector, i.e., y = [y1, y2, y3 …., yn]

for n observations (in above example, n = 10).

A scatter plot of the above dataset looks like: - The next step is to identify the line that is most suitable for this scatter graph so that we can anticipate the response of any new value for a feature. (i.e., the value of x is that is not in the dataset)

This line is referred to as the regression line.

The equation of the regression line can be shown as follows: Here,

• h(xi ) signifies the predicted response value for ith
• ?0 and ?1xi ) are regression coefficients and represent y-intercept and slope of regression line respectively.

In order to build our model, we need to "learn" or estimate the value of the regression coefficients and . After we've determined those coefficients, then we are able to make use of this model in order to forecast the response!

In this tutorial, we're going to employ the concept of Least Squares.

Let's consider:

yi = ?0+ ?1xi + ?i=h(xi )+ ?i ? ?i= yi- h(xi )

Here, ?i is a residual error in ith observation.

So, our goal is to minimize the total residual error.

We have defined the cost function or squared error, J as: and our mission is to find the value of ?0 and ?1 for which J(?0,?1) is minimum.

Without going into the mathematical details, we are presenting the result below: Where, ssxy would be the sum of the cross deviations of "y" and "x": And ssxx would be the sum of squared deviations of "x" Code:

Output:

```Estimated coefficients are :
b_0 = -0.4606060606060609
b_1 = 1.1696969696969697 ```

## Multiple linear regression

Multiple linear regression attempts explain how the relationships are among several elements and then respond by applying a linear equation with the data. Clearly, it's not anything more than an extension of linear regression.

Imagine a set of data that has one or more features (or independent variables) as well as one response (or dependent variable).

The dataset also is comprised of additional n rows/observations.

We define:

X(Feature Matrix) = it is a matrix of size "n * p" where "xij" represents the values of jth attribute for the ith observation.

Therefore, And,

y (Response Vector) = It is a vector of size n where represents the value of response for the ith observation. The regression line for "p" features are represented as: Where h(xi) is the predicted response value for the ith observation point and ?0,?1,?2,....,?p are the regression coefficients.

We can also write: Where, ?i is representing the residual error in the ith observation point.

We can also generalize our linear model a little more by representing the attribute matrix of "X" as: Therefore, the linear model can be expressed in the terms of matrices as shown below:

y=X?+?

Where, We now determine an estimation of the b, i.e., b' by with an algorithm called the Least Squares method. As previously mentioned, this Least Squares method is used to find b' for the case where the residual error of total is minimized.

We will present the results as shown below: Where ' is the matrix's transpose and -1 is the matrix's that is the reverse.

With the help of the lowest square estimates b', the multi-linear regression model is now calculated by: where y' is the estimated response vector.

Code:

Output:

```Regression Coefficients are:  [-8.95714048e-02  6.73132853e-02  5.04649248e-02  2.18579583e+00
-1.72053975e+01  3.63606995e+00  2.05579939e-03 -1.36602886e+00
2.89576718e-01 -1.22700072e-02 -8.34881849e-01  9.40360790e-03
-5.04008320e-01]
Variance score is: 0.7209056672661751 ```

In the above example, we calculate the accuracy score using the Explained Variance Score.

We define:

explained_variance_score = 1 - Var{y - y'}/Var{y}

where y' is the estimated output target, where y is the equivalent (correct) to the target's output, where Var is Variance, which is the square of the standard deviation.

The most ideal score is 1.0. Lower scores are worse.

## Assumptions:

Here are the main assumptions that the linear regression model is based on with regard to the dataset on the basis on which it is utilized:

• Linear relationships: Relationship between feature and response variables must be linear. The assumption of linearity is tested with the help of scatter plots. As we can see, the 1st figure is a representation of linearly related variables, whereas the variables in the 3rd and 2nd figures are probably non-linear. Thus, 1st figure can make more accurate predictions by using linear regression. • Little or no multi-collinearity: is the assumption that there is minimal or no multicollinearity present in the data. Multi-collinearity happens when the features (or independent variables) aren't independent of each other.
• Little or no auto-correlation: Another theory could be that there's not much or no autocorrelation within the data. Autocorrelation is when residual errors aren't independent of one another.
• Homoscedasticity: is refer to an instance where errors are a factor (that is, the "noise" or random disturbance in the relationship between independent variables and dependent variables) that remains the same for all independent variables. Figure 1 is homoscedastic, while figure 2 displays heteroscedasticity. At the end of the tutorial, we'll discuss some of the applications of linear regression.

## Applications:

Following are the fields of applications based on Linear Regression:

• Trend lines: illustrate the changes in the quantity of data as the passing in time (like GDP or oil prices, for instance.). They typically have a linear connection. Therefore, linear regression could be used to predict future value. However, the method is not able to meet the requirements of scientific credibility in situations when other possible changes may alter the data.
• Economics: Linear Regression is the primary tool used in economics. It can be employed to predict the amount of spending by consumers and fixed investment spending investments in inventory, purchases of exports of a country, spending on imports, demand to store liquid assets, demand for labour, and supply.
• Financial: A model of capital value assets makes use of linear regression to study and quantify the risk factors of investing.
• Biology: Linear regression is a method to explain causal relationships between variables in biological systems.

## Help Others, Please Share   