Python Data Analytics
Data analysis can help us obtain useful information from data and provide answers to our queries. Further, based on the observed patterns, we can predict the outcomes of different business policies.
Understanding the basics of Data Analytics
The data we work with during analysis is mostly in CSV (comma-separated values) format. Usually, the first row in a CSV file serves as the header.
Python offers a wide range of libraries that make analysis possible without writing long code.
Examples of some of these packages are pandas, NumPy, SciPy, Matplotlib, and scikit-learn.
Importing & Exporting Datasets
The two essential things that we must take care of while importing a dataset are the format of the file and the path at which it is stored.
It can be done in the following way-
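A minimal sketch using pandas, with an in-memory CSV standing in for a real file (the column names are invented; with a file on disk you would pass its path instead):

```python
import io
import pandas as pd

# In-memory CSV standing in for a real file on disk.
csv_data = io.StringIO("make,price\nhonda,13950\naudi,17450\n")

# read_csv treats the first row as the header by default.
df = pd.read_csv(csv_data)
print(df.columns.tolist())  # ['make', 'price']
```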
If the dataset doesn't contain a header, we can specify it in the following way-
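A sketch along the same lines for a headerless file (again with invented data): header=None stops pandas from consuming the first row as column names, and names supplies our own labels.

```python
import io
import pandas as pd

# The same data but without a header row.
csv_data = io.StringIO("honda,13950\naudi,17450\n")

# header=None: do not treat the first row as column names;
# names=... supplies labels of our own.
df = pd.read_csv(csv_data, header=None, names=["make", "price"])
```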
To look at the first five and last five rows of the dataset, we can make use of df.head() and df.tail() respectively.
Let's have a look at how we can export the data. If we have a DataFrame that we want to save in .csv format, then-
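A minimal export sketch with pandas; the DataFrame contents and the filename "automobile.csv" are just placeholders:

```python
import pandas as pd

df = pd.DataFrame({"make": ["honda", "audi"], "price": [13950, 17450]})

# to_csv writes the frame to disk; index=False omits the row-index column.
df.to_csv("automobile.csv", index=False)
```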
Data Wrangling
Data wrangling is the process of converting data from a raw format into one that can be used for analysis.
Let us see what this part encompasses-
How to deal with missing values?
Missing values - Some entries are left blank because the information is unavailable. They are usually represented as NaN, "?", or 0.
Let us discuss how we can deal with them-
The best option is to replace missing numerical values with the mean of the attribute and missing categorical values with the mode.
Sometimes a situation might occur when we have to drop the missing values; this can be done using the dropna() method-
If we want to drop a row, we have to specify the axis as 0. If we want to drop a column, we have to specify the axis as 1.
Moreover, if we want these changes to directly occur in the dataset, we will specify one more parameter inplace = True.
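The dropna() behaviour described above can be sketched like this (the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [13950, np.nan, 17450],
                   "horsepower": [111, 102, np.nan]})

# axis=0 drops every row that contains a missing value ...
rows_dropped = df.dropna(axis=0)

# ... while axis=1 drops every column that contains one.
cols_dropped = df.dropna(axis=1)

# inplace=True modifies df itself instead of returning a copy.
df.dropna(axis=0, inplace=True)
```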
Now let's see how the values can be replaced-
Here, we make a variable that stores the mean of the attribute (whose missing values we want to replace) and then pass it to the replace() method.
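A sketch of the mean-replacement step, assuming a hypothetical 'horsepower' attribute:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [111.0, np.nan, 102.0, np.nan, 115.0]})

# Store the mean of the attribute (NaNs are skipped automatically) ...
mean = df["horsepower"].mean()

# ... then replace every NaN with it.
df["horsepower"] = df["horsepower"].replace(np.nan, mean)
```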
How to proceed with data formatting?
It refers to the process of bringing the data into a comprehensible format, for example, renaming a variable to make it understandable.
Normalization of Data
The features present in the dataset can have very different ranges of values, which can result in a biased prediction. Therefore, we must bring them into a range where they are comparable.
To do the same, we can apply one of the following techniques to an attribute- simple feature scaling (dividing by the maximum), min-max scaling, or z-score (standard score) normalization.
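Three common rescaling techniques (simple feature scaling, min-max scaling, and z-score normalization) can be sketched on a hypothetical 'length' attribute:

```python
import pandas as pd

df = pd.DataFrame({"length": [150.0, 170.0, 190.0, 210.0]})

# Simple feature scaling: divide by the maximum (result in (0, 1]).
simple = df["length"] / df["length"].max()

# Min-max scaling: result spans [0, 1] exactly.
min_max = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

# Z-score: zero mean, unit (sample) standard deviation.
z_score = (df["length"] - df["length"].mean()) / df["length"].std()
```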
How to convert categorical variables into numeric variables?
Under this, we use a process called "One-Hot Encoding". Let's say an attribute holds categorical values: we make a dummy variable for each possible category and assign it 0 or 1 based on whether that category occurs in the attribute.
To convert categorical variables into dummy variables (0 or 1), we use the pandas get_dummies() method.
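A minimal sketch with pandas' get_dummies(), using an invented 'fuel' attribute:

```python
import pandas as pd

df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One dummy column per category, holding 1 (True) where the
# category occurs in that row and 0 (False) otherwise.
dummies = pd.get_dummies(df["fuel"])
```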
Binning in Python
It refers to the process of converting numeric variables into categorical variables.
Let's say we have taken an attribute 'price' from a dataset. We can divide its data into three categories based on the range and then denote them with names such as low-price, mid-price, and high-price.
We can obtain the range boundaries using NumPy's linspace() method.
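A sketch of the whole binning step, with invented prices; np.linspace produces 4 equally spaced edges (3 bins of equal width), and pandas' cut() then assigns each value to a bin:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [5118, 10295, 15250, 22470, 31600, 45400]})

# Four equally spaced edges => three bins of equal width.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
labels = ["low-price", "mid-price", "high-price"]

# include_lowest=True keeps the minimum value inside the first bin.
df["price-binned"] = pd.cut(df["price"], bins, labels=labels, include_lowest=True)
```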
Exploratory Data Analysis
We can obtain a statistical summary of our dataset using the describe() method, e.g. df.describe(). Categorical variables can be summarized using the value_counts() method.
The groupby() method of pandas can be applied to categorical variables. It groups rows into subsets based on the different categories and can involve single or multiple variables.
Let us have a look at an example that would help us to understand how it can be used in Python.
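A sketch with invented data, grouping first by a single variable and then by two:

```python
import pandas as pd

df = pd.DataFrame({"drive-wheels": ["fwd", "fwd", "rwd", "fwd"],
                   "body-style": ["sedan", "hatchback", "sedan", "hatchback"],
                   "price": [13950, 7295, 17450, 6575]})

# Average price per body style (single grouping variable) ...
by_style = df.groupby("body-style")["price"].mean()

# ... and per (drive-wheels, body-style) pair (multiple variables).
by_both = df.groupby(["drive-wheels", "body-style"])["price"].mean()
```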
Correlation measures the extent to which two variables are interdependent.
To get a visual idea of the kind of correlation that exists between two variables, we can plot a graph and interpret how a rise in the value of one attribute affects the other.
In statistical terms, we can quantify the correlation using the Pearson correlation, which gives us the correlation coefficient and the P-value.
Let us have a look at the criteria- a coefficient close to +1 indicates a strong positive relationship, a coefficient close to -1 a strong negative relationship, and a coefficient near 0 little or no relationship. A small P-value (for example, below 0.05) means we can be confident in the computed coefficient.
We can use it in our code via the scipy.stats package.
Let's say we want to calculate the correlation between two attributes, attribute1 and attribute2-
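A sketch using scipy's stats.pearsonr on two invented, nearly linear attributes:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({"attribute1": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "attribute2": [2.1, 3.9, 6.2, 8.0, 9.8]})

# pearsonr returns the correlation coefficient and the P-value.
coef, p_value = stats.pearsonr(df["attribute1"], df["attribute2"])
```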
Further to check the correlation between all the variables, we can create a heatmap.
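A sketch of a correlation heatmap, assuming seaborn and matplotlib are available (the data is invented; the Agg backend just renders off-screen):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"price": [13950, 7295, 17450, 6575],
                   "horsepower": [111, 69, 123, 62],
                   "city-mpg": [21, 31, 19, 35]})

# Pairwise correlation of all numeric columns ...
corr = df.corr()

# ... drawn as a colour-coded grid; annot=True prints the values in the cells.
ax = sns.heatmap(corr, annot=True, cmap="RdBu")
```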
Relationship between two categorical variables
The relationship between two categorical variables can be examined using the chi-square test for association.
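A sketch with scipy's chi2_contingency on two invented categorical attributes; we first cross-tabulate them, then run the test:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"fuel": ["gas", "gas", "diesel", "diesel", "gas", "diesel"],
                   "aspiration": ["std", "std", "turbo", "turbo", "turbo", "std"]})

# Contingency table: counts of each (fuel, aspiration) combination.
table = pd.crosstab(df["fuel"], df["aspiration"])

# Chi-square test of independence on the table.
chi2, p_value, dof, expected = chi2_contingency(table)
```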
How to Develop a Model?
First, let us understand what a model is.
A model can be thought of as an equation that relates one or more independent variables to the outcome we want to predict.
Simple Linear Regression - As the name suggests, it involves only a single independent variable to make a prediction.
Multiple Regression - It involves multiple independent variables to make a prediction.
The equation for a simple linear regression can be represented as-
y = b0 + b1*x
x - independent (predictor) variable
y - dependent (target) variable
b0 - intercept, b1 - slope
To implement Linear Regression in Python-
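A minimal sketch with scikit-learn's LinearRegression; the engine-size/price data is invented and exactly linear, so the fit is easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical engine sizes (2-D, as scikit-learn expects) and prices.
X = np.array([[1.5], [2.0], [2.5], [3.0]])
y = np.array([12000.0, 16000.0, 20000.0, 24000.0])  # exactly 8000 * size

lm = LinearRegression()
lm.fit(X, y)            # estimates the intercept b0 and the slope b1

prediction = lm.predict([[2.2]])
```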
Using Visualization to evaluate our model
Creating plots is a good practice since they show the strength of correlation and whether the direction of the relationship is positive or negative.
Let us have a look at the different plots that can help us to evaluate our model-
1. Using Regression Plot
2. Using Residual Plot
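Both plots can be sketched with seaborn (assuming it is installed) on invented, noisy linear data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
df = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(0, 1, 50)})

# 1. Regression plot: scatter of the data with the fitted line on top.
fig1, ax1 = plt.subplots()
sns.regplot(x="x", y="y", data=df, ax=ax1)

# 2. Residual plot: residuals should scatter randomly around zero
#    if a linear model is appropriate.
fig2, ax2 = plt.subplots()
sns.residplot(x="x", y="y", data=df, ax=ax2)
```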
In Sample Evaluation
Here, we will discuss how we can evaluate our model numerically. The two ways of doing so are-
1. Mean Square Error(MSE)
This method takes the difference between the actual and predicted value, squares it, and then finally calculates their average.
We can implement the same in Python using-
R-squared is also known as the coefficient of determination. It shows the closeness of the data to the fitted regression line. It can be computed in Python using the score() method.
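Both metrics can be sketched with scikit-learn; the data here is invented and exactly linear, so the MSE should be essentially zero and R-squared essentially 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])     # exactly y = 2x + 1

lm = LinearRegression().fit(X, y)
y_pred = lm.predict(X)

mse = mean_squared_error(y, y_pred)    # average of the squared errors
r2 = lm.score(X, y)                    # coefficient of determination
```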
In a nutshell, we have to take care of the following when evaluating a model- a lower MSE and an R-squared value closer to 1 indicate a better fit.
How to Evaluate a Model?
Evaluating our model is an integral step since it tells us how well our data fits the model. Now, we will discuss how we can use the training data to predict results.
The key idea is to split our dataset into training and testing. The training dataset is used to build our model and the testing dataset is used to assess the performance of our model.
It can be implemented in Python using-
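A sketch with scikit-learn's train_test_split, holding out 30% of invented data for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (made up)
y = np.arange(10)

# test_size=0.3 keeps 30% of the samples for testing;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```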
Overfitting and Underfitting
Overfitting - It is the condition when the model fits the noise rather than the underlying function.
Underfitting - It is the condition when the model is too simple to fit the data.
Ridge Regression
Ridge regression is used when we are dealing with polynomial variables of a high degree (for example, the tenth degree). Here we introduce a parameter called alpha, which controls how strongly large coefficients are penalized. Let us see how we can implement this in Python.
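A sketch of ridge regression on invented quadratic data, combining scikit-learn's PolynomialFeatures with Ridge; alpha is the penalty strength mentioned above:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (30, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(0, 0.1, 30)   # quadratic signal + noise

# Expand the input to degree-10 polynomial features, then fit a ridge
# model; alpha penalizes large coefficients and so tames overfitting.
X_poly = PolynomialFeatures(degree=10).fit_transform(X)
ridge = Ridge(alpha=0.1).fit(X_poly, y)

r2 = ridge.score(X_poly, y)
```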