Analysing Data in Python

What is Data Analysis?

Data analysis is the process of extracting useful information from data and predicting trends on the basis of past data. It involves a variety of methods, including collecting, modifying, and organizing data. Data analytics converts unstructured data into useful information that can be used to solve many business problems. The resulting insights and statistics can be presented as charts, images, tables, and graphs, which make the information easier to understand and analyze.

Methodology for Data Analysis
The methodology consists of five steps. Let's understand them in detail.

1. Data Collection
Data collection is the first step in data analytics: gathering data from various sources, such as databases and social media.

2. Data Preparation
The next step is to prepare the data. The data is cleaned and checked for null values, duplicate and null values are removed, and the data is converted to the appropriate format. This makes the data ready for the later stages of analysis.

3. Data Exploration
Data exploration is the process of visualizing the data with different charts and graphs to uncover trends that are not obvious in the raw data. Visualizing the data makes it easier to understand.

4. Data Modelling
Data modelling is the process of building a model and training it on the data with machine learning algorithms, so that it can be used to make predictions and extract trends from the data.

5. Data Evaluation
Data evaluation is the process of deriving results from the analysis, evaluating their accuracy, and comparing them with the expected results.

Data Analytics with Python

Data analysis can be done using different programming languages, including Python and R. Python is often the preferred language for data analytics.
Packages and Libraries for Data Analytics

Python offers a range of libraries for data analytics. This tutorial covers NumPy, Pandas, Matplotlib, and Seaborn.
Let's implement these libraries for data analytics in Python.

Analysing Data using NumPy

NumPy is a Python library for array processing. It provides fast computation on multidimensional arrays, along with many tools for working with them.

What are NumPy Arrays?

An array is a grid of elements that all have the same type, indexed by a tuple of non-negative integers. The shape of an array is a tuple of integers giving its size along each dimension, and the rank is the number of dimensions (1-D, 2-D, 3-D, etc.). Arrays can be created from Python sequences such as lists and tuples.

Array indexing starts at 0, so valid indexes run from 0 to n-1, where n is the number of elements in the array. For example, if array a has 10 elements, the 5th element is a[4].

NumPy provides different functions and methods to create arrays and transform them, and there are different ways to analyze data using arrays. Let's implement NumPy arrays and analyze data with them.

Firstly, we will install the numpy library using the pip command: pip install numpy

After installing the library, we will import it: import numpy as np

Output:

    The array is : [ 78 889 12 45 566 90]
    The type of arr is : <class 'numpy.ndarray'>

We made a simple NumPy array of integers using the np.array( ) function, then printed the array and its type.

Now, we will create arrays with different dimensions.

Output:

    Array 1: [0]
    Array 2: [[0 0]
     [0 0]]
    Array 3: [['' '' '']
     ['' '' '']
     ['' '' '']]
    <class 'numpy.ndarray'>

We created arrays of different dimensions using the np.empty( ) function. Note that np.empty( ) only allocates the array and does not initialize its entries, so the values it holds are arbitrary and may differ from the zeros shown here; np.zeros( ) guarantees a zero-filled array. We can also create a multi-dimensional array and supply its values directly using np.array( ).
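The array-creation steps described above can be sketched as follows. The original code listings are not available, so this is a reconstruction under assumptions: the 1-D values are taken from the printed output, the variable names are illustrative, and np.zeros( ) is used in place of np.empty( ) so that the contents are predictable.

```python
import numpy as np

# 1-D array built from a Python list (values taken from the output shown above)
arr = np.array([78, 889, 12, 45, 566, 90])
print("The array is :", arr)
print("The type of arr is :", type(arr))

# Arrays of different dimensions. np.zeros is an assumption here: unlike
# np.empty, it guarantees that every entry is initialised.
arr1 = np.zeros(1, dtype=int)        # 1-D, one element
arr2 = np.zeros((2, 2), dtype=int)   # 2-D, 2 x 2
arr3 = np.zeros((3, 3), dtype=str)   # 2-D, 3 x 3 array of empty strings
print("Array 1:", arr1)
print("Array 2:", arr2)
print("Array 3:", arr3)

# A 3 x 2 array created directly from nested lists
arr4 = np.array([[1, 2], [2, 2], [3, 4]])
print("Array 4:", arr4)
print("Shape:", arr4.shape, "Rank:", arr4.ndim)
```

The shape and ndim attributes correspond to the shape and rank described in the text.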
Output:

    Array 4: [[1 2]
     [2 2]
     [3 4]]

We created a 3 x 2 array using the np.array( ) function and added values to it.

We can do mathematical calculations on the arrays.

Output:

    Addition of array 1 and array 2: [[3 4]
     [7 9]]
    Subtraction of array 1 and array 2: [[-1 0]
     [-1 -1]]
    Multiplication of array 1 and array 2: [[ 2 4]
     [12 20]]
    Division of array 1 and array 2: [[0.5 1. ]
     [0.75 0.8 ]]

We created two 2 x 2 arrays and then performed addition, subtraction, multiplication, and division on them. Each operation is applied element by element.

We can also access and transform arrays using operations such as indexing and slicing.
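The element-wise arithmetic above can be sketched as follows. The two operand arrays are not shown in the text; the values below are inferred from the printed results and should be treated as an assumption, as should the names a and b.

```python
import numpy as np

# Operands inferred from the printed results (assumption, not the original code)
a = np.array([[1, 2], [3, 4]])
b = np.array([[2, 2], [4, 5]])

# Arithmetic operators on arrays work element by element
print("Addition of array 1 and array 2:", a + b)
print("Subtraction of array 1 and array 2:", a - b)
print("Multiplication of array 1 and array 2:", a * b)
print("Division of array 1 and array 2:", a / b)

# For comparison: the matrix product uses the @ operator, not *
print("Matrix product:", a @ b)
```

Note that * multiplies matching entries, which is why the result differs from the matrix product printed on the last line.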
Output:

    arr[5]: 34
    arr[10]: 12
    arr[2]: 45
    arr[0]: 1

We accessed individual elements of a 1-D array by their indexes.

Output:

    arr2[3][2]: 45
    arr2[1][0]: 10
    arr2[2][2]: 56
    arr2[3][0]: 1
    arr2[0][2]: 3

We created a 4 x 3 array and printed the elements at different indexes.
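Indexing and slicing can be sketched as follows. The original arrays are only partly recoverable from the printed outputs, so every entry that does not appear in an output is an invented placeholder, and the names grid and table are illustrative.

```python
import numpy as np

# 1-D array: positions 0, 2, 3, 4, 5, 6 and 10 match the printed outputs;
# the remaining entries are invented placeholders.
arr = np.array([1, 23, 45, 67, 100, 34, 566, 8, 9, 10, 12])
print("arr[5]:", arr[5])
print("arr[0]:", arr[0])

# Slicing: arr[start:stop] includes start and excludes stop
print("arr[2:5]:", arr[2:5])
print("arr[2:7]:", arr[2:7])

# 4 x 3 array: only the entries printed in the text are known, the rest are invented
grid = np.array([[7, 8, 3], [10, 11, 12], [54, 55, 56], [1, 44, 45]])
print("grid[3][2]:", grid[3][2])
print("grid[3, 2]:", grid[3, 2])  # equivalent, and the more idiomatic form

# Slicing a 2-D array along its first axis selects whole rows.
# Rows 1 and 2 match the printed output; rows 0 and 3 are invented.
table = np.array([[5, 6, 7], [19, 64, 82], [90, 35, 46], [11, 12, 13]])
print("table[1:3]:", table[1:3])
```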
Output:

    arr2[2:5]: [ 45 67 100]
    arr2[2:7]: [ 45 67 100 34 566]

We sliced a 1-D array with different ranges. A slice includes the start index and excludes the stop index.

Output:

    arr2[1:3]: [[19 64 82]
     [90 35 46]]

We sliced a multi-dimensional array, which selects whole rows.

NumPy offers many other functions, such as concatenating arrays, deleting elements, adding elements to an array, sorting, searching, and calculating the mean and median.

Analysing Data with Pandas

Pandas is a Python library used for data analysis. It is commonly used with large data sets and can read files such as CSV, JSON, and text. It provides functions for cleaning, exploring, transforming, and analyzing data, including checking and handling null and duplicate values. Pandas works with labeled data and provides two main data structures: Series and DataFrame.

What is the Pandas Series?

A Pandas Series is a 1-D labeled array that can store data of any type. A Series can be thought of as a single column in an Excel sheet. The labels are known as the index. By default, the index consists of integers starting from 0.

Let's implement the Pandas Series in Python.

Firstly, we will install the library using the pip command: pip install pandas

After installation, we have to import the library: import pandas as pd

Now, we will create a series in pandas and add data.

Output:

    SERIES:
    0    1
    1    2
    2    3
    3    4
    4    5
    5    6
    6    7
    dtype: int64

We first imported the library and then used pd.Series( ) to create a series and add data to it.

What is Pandas DataFrame?

A Pandas DataFrame is a two-dimensional data structure consisting of rows, columns, and data. It can be created using the pd.DataFrame( ) constructor.

Let's implement the Pandas data frame in Python.

Output:

       0
    0  1
    1  2
    2  3
    3  4
    4  5
    5  6
    6  7

We imported the library and used pd.DataFrame( ) to create a data frame and add data to it.
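The Series and DataFrame steps above can be sketched as follows. The input values are taken from the printed outputs; the variable names and the final labeled example are illustrative assumptions rather than the original code.

```python
import pandas as pd

# A Series built from a list gets a default integer index starting at 0
series = pd.Series([1, 2, 3, 4, 5, 6, 7])
print("SERIES:")
print(series)

# A DataFrame built from the same list has a single column labeled 0
frame = pd.DataFrame([1, 2, 3, 4, 5, 6, 7])
print(frame)

# Illustrative addition: explicit column and row labels
labeled = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])
print(labeled.loc["b", "value"])
```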
We can use pandas to read CSV data and create data frames from it.

Output:

       Index  Organization Id  Name                      Website                         Country           Description                                     Founded  Industry                     Number of employees
    0  1      FAB0d41d5b5d22c  Ferrell LLC               https://price.net/              Papua New Guinea  Horizontal empowering knowledgebase             1990     Plastics                     3498
    1  2      6A7EdDEA9FaDC52  Mckinney, Riley, and Day  http://www.hall-buchanan.info/  Finland           User-centric system-worthy leverage             2015     Glass / Ceramics / Concrete  4952
    2  3      0bFED1ADAE4bcC1  Hester Ltd                http://sullivan-reed.com/       China             Switchable scalable moratorium                  1971     Public Safety                5287
    3  4      2bFC1Be8a4ce42f  Holder-Sellers            https://becker.com/             Turkmenistan      De-engineered systemic artificial intelligence  2004     Automotive                   921
    4  5      9eE8A6a4Eb96C24  Mayer Group               http://www.brewer.com/          Mauritius         Synchronized needs-based challenge              1991     Transportation               7870

We imported the library and used the read_csv( ) function to read the data set, which describes different organizations and their industry types, into a data frame. Using the data.head( ) function, we displayed the first 5 records of the data set.

Now, we will use different functions and methods to analyze the data set. We will explore the data by checking for and handling duplicate and null values, among other operations.

Checking null values

Output:

    Index                  0
    Organization Id        0
    Name                   0
    Website                0
    Country                0
    Description            0
    Founded                0
    Industry               0
    Number of employees    0
    dtype: int64

Using data.isnull( ).sum( ), we counted the null values in each column of the data set. In this data set, there are no null values.

Getting brief information about the data frame
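The inspection calls used in this walkthrough can be tried on a small in-memory sample. This is a sketch under assumptions: the CSV file name is not given in the text, so io.StringIO stands in for the file, and only five rows and a subset of the columns (copied from the head( ) output above) are included. The duplicated( ) check is an illustrative addition.

```python
import io

import pandas as pd

# Stand-in for the CSV file: five rows and five columns copied from the
# head() output; the real data set has 100 rows and 9 columns.
csv_text = """Index,Name,Country,Founded,Number of employees
1,Ferrell LLC,Papua New Guinea,1990,3498
2,"Mckinney, Riley, and Day",Finland,2015,4952
3,Hester Ltd,China,1971,5287
4,Holder-Sellers,Turkmenistan,2004,921
5,Mayer Group,Mauritius,1991,7870
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.head())               # first five records
print(data.isnull().sum())       # number of null values per column
print(data.duplicated().sum())   # number of fully duplicated rows
data.info()                      # column names, dtypes, non-null counts
print(data.describe())           # summary statistics for numeric columns
```

With a real file, the io.StringIO(csv_text) argument would be replaced by the path to the CSV file.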
Output of data.info( ):

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 100 entries, 63 to 65
    Data columns (total 9 columns):
     #   Column               Non-Null Count  Dtype
    ---  ------               --------------  -----
     0   Index                100 non-null    int64
     1   Organization Id      100 non-null    object
     2   Name                 100 non-null    object
     3   Website              100 non-null    object
     4   Country              100 non-null    object
     5   Description          100 non-null    object
     6   Founded              100 non-null    int64
     7   Industry             100 non-null    object
     8   Number of employees  100 non-null    int64
    dtypes: int64(3), object(6)
    memory usage: 7.8+ KB

Using the data.info( ) function, we get information about the data set: its columns, their data types, the number of entries, and the non-null count for each column.

Getting a description of the data

Output:

                Index      Founded  Number of employees
    count  100.000000   100.000000           100.000000
    mean    50.500000  1995.410000          4964.860000
    std     29.011492    15.744228          2850.859799
    min      1.000000  1970.000000           236.000000
    25%     25.750000  1983.500000          2741.250000
    50%     50.500000  1995.000000          4941.500000
    75%     75.250000  2010.250000          7558.000000
    max    100.000000  2021.000000          9995.000000

We used the data.describe( ) function to get a description of the data in the data frame. It returns summary statistics for each numeric column: the count, mean, standard deviation, minimum, quartiles, and maximum.

Exploratory data analysis using Pandas is an important part of data analysis in Python. It involves checking for and handling imperfections and errors in the data, such as removing duplicates, changing the data format, and manipulating the columns of the data set.

Analysing Data with Matplotlib

Matplotlib is a library for creating charts and graphs, including bar graphs, scatter plots, and line charts, for exploring and analyzing data. Visualizing data this way makes it easier to understand. Matplotlib is a simple, easy-to-use library for presenting data in graphical form.
To use matplotlib, we need to install the library using the pip command: pip install matplotlib

Matplotlib has a module named pyplot, which offers different functions for creating charts and graphs. Firstly, we will import it: import matplotlib.pyplot as plt

Now, we will create charts and graphs, including bar plots, histograms, and scatter plots. To make them, we will use the iris data set. We read the CSV file into a data frame using the read_csv( ) function of the pandas library.

Output:

       Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
    0   1            5.1           3.5            1.4           0.2  Iris-setosa
    1   2            4.9           3.0            1.4           0.2  Iris-setosa
    2   3            4.7           3.2            1.3           0.2  Iris-setosa
    3   4            4.6           3.1            1.5           0.2  Iris-setosa
    4   5            5.0           3.6            1.4           0.2  Iris-setosa

We imported the pandas library, read the iris data using the read_csv( ) method, and then printed the first 5 records using the head( ) function.

Creating Bar Plot

Output:

    <function matplotlib.pyplot.show(close=None, block=None)>

We imported matplotlib.pyplot and, using plt.bar( ), made a bar plot between SepalLengthCm and PetalLengthCm. Using the title( ) function, we added a title to the graph.

The text shown as output here is the representation of the plt.show function object, which a notebook echoes when plt.show is written without parentheses. Writing plt.show( ) calls the function and displays the figure without printing this line. The same applies to the histogram and scatter plot below.

Creating Histogram

Output:

    <function matplotlib.pyplot.show(close=None, block=None)>

Using plt.hist( ), we made a histogram of PetalWidthCm. Using the title( ) function, we added a title to the graph.

Creating Scatter Plot

Output:

    <function matplotlib.pyplot.show(close=None, block=None)>

Using the plt.scatter( ) function, we made a scatter plot between SepalLengthCm and PetalLengthCm. Using the title( ) function, we added a title to the graph.

Python provides another library for visualizing data: Seaborn. Seaborn is built on top of matplotlib and offers a higher-level interface with more polished, colorful default styles.

To use seaborn, we must install it using the pip command: pip install seaborn

Then, we will import the library: import seaborn as sns

We can create different plots, such as heatmaps and box plots.

Creating Heatmap

We imported the libraries and, using the sns.heatmap( ) function, created the heatmap.
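The plots described in this section can be sketched in one figure as follows. Several things here are assumptions: the iris CSV is replaced by a small in-memory frame (the first five rows come from the head( ) output, the other four are illustrative values), the data passed to sns.heatmap( ) in the original is unknown so a correlation matrix is assumed, and the heatmap is drawn with matplotlib's imshow( ) instead of seaborn to keep the sketch to one plotting library.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for the iris data set (see the assumptions stated above)
iris = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6, 3.2, 3.2, 3.3, 2.7],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Bar plot: one bar per row, sepal length on x and petal length as height
axes[0, 0].bar(iris["SepalLengthCm"], iris["PetalLengthCm"], width=0.05)
axes[0, 0].set_title("Bar plot")

# Histogram of petal width
axes[0, 1].hist(iris["PetalWidthCm"], bins=5)
axes[0, 1].set_title("Histogram")

# Scatter plot of sepal length against petal length
axes[1, 0].scatter(iris["SepalLengthCm"], iris["PetalLengthCm"])
axes[1, 0].set_title("Scatter plot")

# Heatmap of the correlation matrix, drawn with imshow
corr = iris.corr()
image = axes[1, 1].imshow(corr.to_numpy(), cmap="viridis", vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[1, 1].set_yticks(range(len(corr.columns)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Correlation heatmap")
fig.colorbar(image, ax=axes[1, 1])

fig.tight_layout()

# Render to an in-memory PNG; in a notebook, plt.show() would be used instead
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

With seaborn installed, the heatmap panel could instead be drawn with sns.heatmap(corr, ax=axes[1, 1]).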