Analysing Data in Python

What is Data Analysis?

Data analysis is the process of extracting useful information from data and using past data to identify and predict trends. It covers a variety of methods, including collecting, cleaning, transforming, and organizing data. Data analytics turns raw, often unstructured data into information that can be used to solve business problems. The resulting insights and statistics are usually presented as charts, tables, and graphs, which make the information easier to understand.

Methodology for Data Analysis

  1. Data Collection
  2. Data Preparation
  3. Data Exploration
  4. Data Modelling
  5. Data Evaluation

Let's understand these steps in detail.

1. Data Collection

Data collection is the first step in data analytics: the data is gathered from various sources, such as databases, social media, and more.

2. Data Preparation

The next step is to prepare the data. The data is cleaned and checked for null values, duplicate and null values are removed or otherwise handled, and the data is converted to an appropriate format so that it is ready for the later steps of the analysis.

3. Data Exploration

Data exploration is the process of examining and visualizing the data with charts and graphs to uncover trends that are not obvious from the raw values. Visualizing the data makes it easier to understand.

4. Data Modelling

Data modelling is the process of building a model and training it on the data with machine learning algorithms, so that the model can be used to make predictions and extract trends from the data.

5. Data Evaluation

Data evaluation is the process of deriving results from the analysis, measuring their accuracy, and comparing them with the expected results.

Data Analytics with Python

Data analysis can be done in different programming languages, including Python, R, etc. Python is a popular choice for data analytics for the following reasons:

  1. Python has a simple, easy-to-understand syntax, which makes it a suitable language for data analytics.
  2. Python is a flexible language and provides different packages for data analytics.
  3. Python offers libraries for data visualizations that can be useful for analyzing and extracting insights from the data.
  4. Python's libraries make data manipulation and statistical computation straightforward.

Packages and Libraries for Data Analytics

Python offers a range of libraries for data analytics. These are:

  • Pandas: Pandas is a Python library used for data analysis. It handles the missing data, performs mathematical computation, and reads data from various file sources like CSV, JSON, text, etc.
  • NumPy: NumPy provides multi-dimensional arrays along with fast numerical operations on them, including linear algebra routines.
  • Matplotlib: This library is used to make charts, plots, and graphs, which help to visualize the data and analyze it more easily.
  • SciPy: This library offers various algorithms for statistics, linear algebra, optimization, and many other scientific computing problems.
  • Scikit-Learn: The Scikit-Learn library helps to make regression, classification, and clustering models. It provides different modules to implement these models.

Let's use these libraries for data analytics in Python.

Analysing Data using NumPy

NumPy is a Python library for array processing. It provides fast computation on multi-dimensional arrays along with a range of tools for working with them.

What are NumPy Arrays?

A NumPy array is a grid of elements of the same type, indexed by a tuple of non-negative integers. The shape of the array is a tuple of integers giving its size along each dimension. The rank is the number of dimensions of the array (1-D, 2-D, 3-D, etc.). Arrays can be created from Python sequences such as lists and tuples. Indexing starts from 0, so the valid indexes of an array with n elements run from 0 to n-1. For example, if array a has 10 elements and we want its 5th element, we access it as a[4].

NumPy provides different functions and methods to create arrays and transform them. There are different ways by which we can analyze the data using the arrays.

Let's create some NumPy arrays and analyze data with them.

Firstly, we will install the numpy library using the pip command:

pip install numpy

After installing the library, we will import it:

import numpy as np

Code:
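A minimal sketch that would produce output of the form shown below. The element values are taken from that output, and the variable name arr follows the text.

```python
import numpy as np

# Element values copied from the output shown in the text
arr = np.array([78, 889, 12, 45, 566, 90])

print("The array is :", arr)
print("The type of arr is : ", type(arr))
```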

Output:

The array is : [ 78 889  12  45 566  90]
The type of arr is :  <class 'numpy.ndarray'>

We have made a simple numpy array of integers using the np.array( ) function. Then, we printed the array and the type of the array object.

Now, we will create arrays with different dimensions.

Code:
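A sketch along these lines would create the three arrays; the shapes and data types are inferred from the output below and are an assumption. Because np.empty( ) leaves the memory uninitialized, the printed values may differ from run to run.

```python
import numpy as np

# np.empty allocates memory without initialising it, so the printed
# values are whatever happens to be in that memory (often zeros).
arr1 = np.empty(1, dtype=int)        # 1-D array with one element
arr2 = np.empty((2, 2), dtype=int)   # 2-D array, 2 x 2
arr3 = np.empty((3, 3), dtype=str)   # 2-D array, 3 x 3, string dtype

print("Array 1:", arr1)
print("Array 2:", arr2)
print("Array 3:", arr3, type(arr3))
```

When defined starting values matter, np.zeros( ) or np.full( ) are safer choices than np.empty( ).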

Output:

Array 1: [0]
Array 2: [[0 0]
 [0 0]]
Array 3: [['' '' '']
 ['' '' '']
 ['' '' '']] <class 'numpy.ndarray'>

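We have created arrays of different dimensions using the np.empty( ) function, which takes the desired shape and an optional data type. Note that np.empty( ) does not initialize the elements, so the values printed (zeros and empty strings here) are whatever happened to be in memory and can differ between runs.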

We can create the multi-dimensional array and add values using np.array( ) directly.

Code:
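A minimal sketch using the values from the output below; the extra shape and ndim lines are an addition to illustrate the shape and rank described earlier.

```python
import numpy as np

# Nested lists become rows of a 2-D array
arr4 = np.array([[1, 2], [2, 2], [3, 4]])

print("Array 4:", arr4)
print("Shape:", arr4.shape)  # (rows, columns)
print("Rank:", arr4.ndim)    # number of dimensions
```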

Output:

Array 4: [[1 2]
 [2 2]
 [3 4]]

We have created a 3 x 2 array using the np.array( ) function and added values to it.

We can do mathematical calculations on the arrays.

Code:
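A sketch that reproduces the results below. The operand values are not given in the text; they are inferred here from the printed sums and differences, so treat them as an assumption.

```python
import numpy as np

# Operands inferred from the results shown in the text
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[2, 2], [4, 5]])

# Each operator works element by element on arrays of the same shape
print("Addition of array 1 and array 2:", arr1 + arr2)
print("Subtraction of array 1 and array 2:", arr1 - arr2)
print("Multiplication of array 1 and array 2:", arr1 * arr2)
print("Division of array 1 and array 2:", arr1 / arr2)
```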

Output:

Addition of array 1 and array 2: [[3 4]
 [7 9]]
Subtraction of array 1 and array 2: [[-1  0]
 [-1 -1]]
Multiplication of array 1 and array 2: [[ 2  4]
 [12 20]]
Division of array 1 and array 2: [[0.5  1.  ]
 [0.75 0.8 ]]

We have created two 2 x 2 arrays and then performed element-wise addition, subtraction, multiplication, and division on them. Note that * multiplies matching elements; it is not matrix multiplication.

We can access parts of an array using operations such as indexing and slicing.

  • Indexing means accessing an array element by its index. The index starts from 0.

Code:
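A sketch of 1-D indexing. Only some elements of the original array can be recovered from the outputs in this section, assuming the indexing and slicing examples use the same array; the remaining values are hypothetical fillers and are marked as such.

```python
import numpy as np

# Values at indices 0, 2-6 and 10 come from the outputs in the text;
# the values at indices 1, 7, 8 and 9 are hypothetical fillers.
arr = np.array([1, 23, 45, 67, 100, 34, 566, 89, 90, 11, 12])

print("arr[5]:", arr[5])
print("arr[10]:", arr[10])
print("arr[2]:", arr[2])
print("arr[0]:", arr[0])
```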

Output:

arr[5]: 34
arr[10]: 12
arr[2]: 45
arr[0]: 1

Code:
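A sketch of indexing into a 4 x 3 array. Only the five elements printed below are known from the text; every other value is a hypothetical filler.

```python
import numpy as np

# Known from the text: [0][2]=3, [1][0]=10, [2][2]=56, [3][0]=1, [3][2]=45.
# All other values are hypothetical fillers.
arr2 = np.array([
    [7, 8, 3],
    [10, 20, 30],
    [40, 50, 56],
    [1, 60, 45],
])

print("arr2[3][2]:", arr2[3][2])
print("arr2[1][0]:", arr2[1][0])
print("arr2[2][2]:", arr2[2][2])
print("arr2[3][0]:", arr2[3][0])
print("arr2[0][2]:", arr2[0][2])

# arr2[3, 2] selects the same element as arr2[3][2] in a single step
print("arr2[3, 2]:", arr2[3, 2])
```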

Output:

arr2[3][2]: 45
arr2[1][0]: 10
arr2[2][2]: 56
arr2[3][0]: 1
arr2[0][2]: 3

We have created a multi-dimensional array of 4 x 3 dimensions and printed the elements at different indexes.

  • Slicing returns a range of elements from an array. The slice arr[start:stop] includes the element at index start and excludes the one at index stop. Let's implement the slicing of the array in Python.

Code:
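A sketch of 1-D slicing. The values at indices 2 to 6 are taken from the output below; the values outside that range are hypothetical fillers.

```python
import numpy as np

# Indices 2-6 come from the output in the text; the rest are fillers.
arr2 = np.array([1, 23, 45, 67, 100, 34, 566, 89, 90, 11, 12])

# The stop index is excluded: 2:5 returns the elements at indices 2, 3, 4
print("arr2[2:5]:", arr2[2:5])
print("arr2[2:7]:", arr2[2:7])
```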

Output:

arr2[2:5]: [ 45  67 100]
arr2[2:7]: [ 45  67 100  34 566]

We have sliced a 1-D array with different ranges.

Code:
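A sketch of slicing rows from a 2-D array. The two middle rows come from the output below; the first and last rows are hypothetical fillers.

```python
import numpy as np

arr2 = np.array([
    [11, 22, 33],  # hypothetical filler row
    [19, 64, 82],  # from the output in the text
    [90, 35, 46],  # from the output in the text
    [44, 55, 66],  # hypothetical filler row
])

# Slicing along the first axis selects whole rows (indexes 1 and 2)
print("arr2[1:3]:", arr2[1:3])
```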

Output:

arr2[1:3]: [[19 64 82]
 [90 35 46]]

We have sliced a multi-dimensional array, selecting the rows at indexes 1 and 2.

NumPy provides many other functions, such as concatenating arrays, deleting elements, appending elements, sorting, searching, and calculating the mean and median. (NumPy has no built-in mode function; scipy.stats.mode( ) can be used for that.)
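As a brief illustration of some of these functions, the following sketch uses small made-up arrays (a and b are hypothetical names and values, not from the original text):

```python
import numpy as np

# Hypothetical sample arrays for illustration
a = np.array([3, 1, 2])
b = np.array([6, 5, 4])

joined = np.concatenate([a, b])        # join two arrays end to end
ordered = np.sort(joined)              # sorted copy: 1 to 6
extended = np.append(ordered, 7)       # add an element at the end
trimmed = np.delete(extended, [0, 1])  # drop the elements at indices 0 and 1

print("Trimmed array:", trimmed)
print("Mean:", np.mean(trimmed))
print("Median:", np.median(trimmed))
print("Index of 6:", np.where(trimmed == 6)[0])
```

Each of these functions returns a new array and leaves its input unchanged.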

Analysing Data with Pandas

Pandas is a Python library used for data analysis. It is designed for tabular data and can handle large data sets. It can read files like CSV, JSON, text, etc. It provides functions for cleaning, exploring, transforming, and analyzing the data, including checking and handling null and duplicate values.

Pandas works with labeled data and provides two main data structures for it: Series and DataFrame.

What is the Pandas Series?

A Pandas Series is a 1-D labeled array that can store data of any type. A series can be thought of as a single column in an Excel sheet. The labels are known as the index. If no labels are given, pandas uses integer positions starting from 0 as the index.

Let's implement the Pandas Series in Python.

Firstly, we will install the library using the pip command:

pip install pandas

After installation, we have to import the library:

import pandas as pd

Now, we will create a series in pandas and add data.

Code:
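A minimal sketch matching the output below; the variable name series is a hypothetical choice.

```python
import pandas as pd

# With no explicit index, pandas labels the values 0, 1, 2, ...
series = pd.Series([1, 2, 3, 4, 5, 6, 7])

print("SERIES:")
print(series)
```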

Output:

SERIES:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

We first imported the library and then used pd.Series( ) to create a series and add data to it.

What is Pandas DataFrame?

A Pandas DataFrame is a two-dimensional data structure made up of rows, columns, and data. It can be created using the pd.DataFrame( ) constructor.

Let's implement the Pandas data frame in Python.

Code:
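A minimal sketch matching the output below; the variable name df is a hypothetical choice.

```python
import pandas as pd

# A flat list becomes a single column, labeled 0 by default
df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7])

print(df)
```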

Output:

	0
0	1
1	2
2	3
3	4
4	5
5	6
6	7

We have imported the library and, using pd.DataFrame( ), made a data frame and added data to it. Since no column names were given, the single column is labeled 0.

We can use pandas to read CSV data and create a data frame from it.

Code:
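The name of the CSV file is not given in the text, so this sketch reads from an in-memory stand-in built from the five rows shown in the output below. With a real file, its path would be passed to pd.read_csv( ) in place of the StringIO object. The variable name data follows the text.

```python
import io
import pandas as pd

# Stand-in for the CSV file, built from the five rows shown in the text.
# With a real file, pass its path to pd.read_csv instead.
csv_text = """Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,FAB0d41d5b5d22c,Ferrell LLC,https://price.net/,Papua New Guinea,Horizontal empowering knowledgebase,1990,Plastics,3498
2,6A7EdDEA9FaDC52,"Mckinney, Riley, and Day",http://www.hall-buchanan.info/,Finland,User-centric system-worthy leverage,2015,Glass / Ceramics / Concrete,4952
3,0bFED1ADAE4bcC1,Hester Ltd,http://sullivan-reed.com/,China,Switchable scalable moratorium,1971,Public Safety,5287
4,2bFC1Be8a4ce42f,Holder-Sellers,https://becker.com/,Turkmenistan,De-engineered systemic artificial intelligence,2004,Automotive,921
5,9eE8A6a4Eb96C24,Mayer Group,http://www.brewer.com/,Mauritius,Synchronized needs-based challenge,1991,Transportation,7870
"""

data = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default
print(data.head())
```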

Output:

	Index	Organization Id	Name	Website	Country	Description	Founded	Industry	Number of employees
0	1	FAB0d41d5b5d22c	Ferrell LLC	https://price.net/	Papua New Guinea	Horizontal empowering knowledgebase	1990	Plastics	3498
1	2	6A7EdDEA9FaDC52	Mckinney, Riley, and Day	http://www.hall-buchanan.info/	Finland	User-centric system-worthy leverage	2015	Glass / Ceramics / Concrete	4952
2	3	0bFED1ADAE4bcC1	Hester Ltd	http://sullivan-reed.com/	China	Switchable scalable moratorium	1971	Public Safety	5287
3	4	2bFC1Be8a4ce42f	Holder-Sellers	https://becker.com/	Turkmenistan	De-engineered systemic artificial intelligence	2004	Automotive	921
4	5	9eE8A6a4Eb96C24	Mayer Group	http://www.brewer.com/	Mauritius	Synchronized needs-based challenge	1991	Transportation	7870

We imported the library and, using the read_csv( ) function, read the data set, which contains records of different organizations and their industry types, into a data frame. Using the data.head( ) function, we displayed the first 5 records of the data set.

Now, we will use different functions and methods to explore and analyze the data set, such as checking for null values and summarizing the columns.

Checking null values

Code:
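A sketch of the null check. To stay self-contained it uses a small stand-in frame with a few columns and values from the rows shown earlier, not the full 100-row data set.

```python
import pandas as pd

# Small stand-in frame using values from the rows shown earlier
data = pd.DataFrame({
    "Name": ["Ferrell LLC", "Hester Ltd", "Mayer Group"],
    "Country": ["Papua New Guinea", "China", "Mauritius"],
    "Founded": [1990, 1971, 1991],
    "Number of employees": [3498, 5287, 7870],
})

# isnull() marks each missing cell as True; sum() counts them per column
null_counts = data.isnull().sum()
print(null_counts)
```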

Output:

Index                  0
Organization Id        0
Name                   0
Website                0
Country                0
Description            0
Founded                0
Industry               0
Number of employees    0
dtype: int64

Using data.isnull( ).sum( ), we have counted the null values in each column of the data set. In this data set, there are no null values.

Getting brief information about the data frame

Code:
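A sketch of the call, again on a small stand-in frame with values from the rows shown earlier, so the counts it prints will differ from the 100-row output below.

```python
import pandas as pd

# Small stand-in frame using values from the rows shown earlier
data = pd.DataFrame({
    "Name": ["Ferrell LLC", "Hester Ltd", "Mayer Group"],
    "Country": ["Papua New Guinea", "China", "Mauritius"],
    "Founded": [1990, 1971, 1991],
    "Number of employees": [3498, 5287, 7870],
})

# info() prints the index range, column names, non-null counts and dtypes
data.info()
```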

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 63 to 65
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Index                100 non-null    int64 
 1   Organization Id      100 non-null    object
 2   Name                 100 non-null    object
 3   Website              100 non-null    object
 4   Country              100 non-null    object
 5   Description          100 non-null    object
 6   Founded              100 non-null    int64 
 7   Industry             100 non-null    object
 8   Number of employees  100 non-null    int64 
dtypes: int64(3), object(6)
memory usage: 7.8+ KB

Using the data.info( ) function, we get a summary of the data set. It includes the columns of the data set, their data types, the number of entries in the data set, and the non-null value count of each column.

Getting a description of the data

Code:
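A sketch of the call on a small stand-in frame with values from the rows shown earlier; the statistics it prints will therefore differ from the 100-row output below.

```python
import pandas as pd

# Small stand-in frame using values from the rows shown earlier
data = pd.DataFrame({
    "Name": ["Ferrell LLC", "Hester Ltd", "Mayer Group"],
    "Founded": [1990, 1971, 1991],
    "Number of employees": [3498, 5287, 7870],
})

# describe() summarises the numeric columns only by default
summary = data.describe()
print(summary)
```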

Output:

	Index	Founded	Number of employees
count	100.000000	100.000000	100.000000
mean	50.500000	1995.410000	4964.860000
std	29.011492	15.744228	2850.859799
min	1.000000	1970.000000	236.000000
25%	25.750000	1983.500000	2741.250000
50%	50.500000	1995.000000	4941.500000
75%	75.250000	2010.250000	7558.000000
max	100.000000	2021.000000	9995.000000

We have used the data.describe( ) function to get a description of the data in the data frame. For each numeric column, it gives summary statistics: the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.

Exploratory data analysis with Pandas is an important part of data analysis in Python. It involves checking for and handling imperfections and errors in the data, such as removing duplicates, changing data formats, and manipulating the columns of the data set.

Analysing Data with Matplotlib

Matplotlib is a library that creates charts and graphs, including bar graphs, scatter plots, line charts, etc., for exploring and analyzing data. It helps to understand and analyze the data more efficiently, and it offers a simple interface for visualizing data in graphical form.

To use matplotlib, we need to install the library using the pip command:

pip install matplotlib

Matplotlib has a module named Pyplot, which offers different functions for creating charts and graphs.

Firstly, we will import the pyplot module of the matplotlib library:

import matplotlib.pyplot as plt

Now, we will create charts and graphs, including bar plots, histograms, scatter plots, etc.

To make charts and graphs, we will use the iris data set. We read the CSV file using the read_csv( ) function of the pandas library and create a data frame from it.

Code:
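The file name is not given in the text, so this sketch reads an in-memory stand-in built from the five rows shown in the output below. The variable name iris is a hypothetical choice; with a real file, its path would be passed to pd.read_csv( ).

```python
import io
import pandas as pd

# Stand-in for the iris CSV file, built from the five rows in the text
iris_csv = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
"""

iris = pd.read_csv(io.StringIO(iris_csv))
print(iris.head())
```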

Output:

	Id	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	 1	     5.1	     3.5	         1.4	         0.2	      Iris-setosa
1	 2	     4.9	     3.0	         1.4	         0.2	      Iris-setosa
2	 3	     4.7	     3.2	         1.3	         0.2	      Iris-setosa
3	 4	     4.6	     3.1	         1.5	         0.2	      Iris-setosa
4	 5	     5.0	     3.6	         1.4	         0.2	      Iris-setosa

We imported the pandas library, read the iris data using the read_csv( ) method, and then printed the first 5 records using the head( ) function.

Creating Bar Plot

Code:
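A sketch of the bar plot on the five iris rows shown earlier. The choice of which column goes on which axis, the title text, and the bar width are assumptions, since the text only says the plot is between SepalLengthCm and PetalLengthCm.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data: the five iris rows shown earlier
iris = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# A narrow width keeps bars at nearby x positions from overlapping
bars = plt.bar(iris["SepalLengthCm"], iris["PetalLengthCm"], width=0.05)
plt.title("Bar Plot")  # hypothetical title text
plt.xlabel("SepalLengthCm")
plt.ylabel("PetalLengthCm")
plt.show()  # the parentheses are needed to actually display the figure
```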

Output:

<function matplotlib.pyplot.show(close=None, block=None)>

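[Figure: bar plot of SepalLengthCm and PetalLengthCm]

We imported the matplotlib.pyplot module, and using plt.bar( ), we made a bar plot between SepalLengthCm and PetalLengthCm. Using the title( ) function, we added the title to the graph. The output line above shows the function object itself, which is what a notebook echoes when plt.show is written without parentheses; calling plt.show( ) displays the figure. The same applies to the plots that follow.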

Creating Histogram

Code:
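A sketch of the histogram on the five iris rows shown earlier. All five PetalWidthCm values in those rows are 0.2, so this stand-in produces a single bar; the full data set would give a wider spread. The title text and bin count are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data: PetalWidthCm from the five iris rows shown earlier
iris = pd.DataFrame({"PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 0.2]})

# hist() returns the bin counts, the bin edges and the drawn bars
counts, bin_edges, patches = plt.hist(iris["PetalWidthCm"], bins=10)
plt.title("Histogram")  # hypothetical title text
plt.xlabel("PetalWidthCm")
plt.ylabel("Frequency")
plt.show()
```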

Output:

<function matplotlib.pyplot.show(close=None, block=None)>

[Figure: histogram of PetalWidthCm]

We imported the matplotlib.pyplot module, and using plt.hist( ), we made a histogram of PetalWidthCm. Using the title( ) function, we added the title to the graph.

Creating Scatter Plot

Code:
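A sketch of the scatter plot on the five iris rows shown earlier. The axis assignment and title text are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data: the five iris rows shown earlier
iris = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# One marker is drawn per row: x = sepal length, y = petal length
points = plt.scatter(iris["SepalLengthCm"], iris["PetalLengthCm"])
plt.title("Scatter Plot")  # hypothetical title text
plt.xlabel("SepalLengthCm")
plt.ylabel("PetalLengthCm")
plt.show()
```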

Output:

<function matplotlib.pyplot.show(close=None, block=None)>

[Figure: scatter plot of SepalLengthCm and PetalLengthCm]

We imported the matplotlib.pyplot module, and using the plt.scatter( ) function, we made a scatter plot between SepalLengthCm and PetalLengthCm. Using the title( ) function, we added the title to the graph.

Python provides another library for visualizing data: Seaborn. It is built on top of matplotlib and offers a higher-level interface that makes it easier to create attractive, colorful statistical charts and graphs.

To use seaborn, we must install it using the pip command:

pip install seaborn

Then, we will import the library:

import seaborn as sns

We can create different plots like heatmaps, box plots, etc.

Creating Heatmap

Code:
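The text does not say what the heatmap shows. A common choice, assumed here, is the correlation matrix of the numeric iris columns. This sketch uses three columns from the five rows shown earlier; the title text and color map are also assumptions.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in data: three numeric columns from the five iris rows shown earlier
iris = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# Pairwise correlation between the columns (assumed content of the heatmap)
corr = iris.corr()

# annot=True writes each correlation value inside its cell
ax = sns.heatmap(corr, annot=True, cmap="coolwarm")
ax.set_title("Heatmap")  # hypothetical title text
plt.show()
```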

Output:

[Figure: heatmap]

We imported the libraries, and using the sns.heatmap( ) function, we created the heatmap.