Label Encoding in Python
Before we begin learning the categorical variable encoding, let us understand the basics of data types and their scales at first. It becomes essential for the learners to understand these topics in order to proceed to work with categorical variable encoding. As we all know, Data is a distinct type of information usually formatted in a specific manner. We can categorize Data into three types, called, structured data, semi-structured data, and unstructured data.
The data embodied in the form of matrix with rows and columns are denoted as structured data. These data can be stored as a table in SQL database, rows and columns in excel sheet, or delimiter separated in CSV.
The Data which is not embodied in the form of matrix is said to be semi-structured data and unstructured data. We can generally store the Semi-Structured data in XML files, JSON format and a lot more, whereas the unstructured data in form of images, e-mails, videos, log data, and textual data.
Let us consider a provided business problem based on machine learning or data science. If we deal with structured data only and the data gathered is a combination of continuous variables and categorical variables, most of the algorithms in Machine Learning will not understand or not be able to work with categorical variables.
This statement means that algorithms in machine learning perform considerably better in accuracy and other performance metrics when data is represented in numeric form instead of categorical to a model for training and testing.
Hence, the categorical data must be encoded into numbers before using it to fit and evaluate a model.
Some Algorithms of Machine Learning, like Tree-based (such as Decision Tree, Random Forest) algorithms, perform better in handling categorical variables. The best training in any project associated with data science is to convert categorical data to a numerical data.
Since the objective is now clear, let us understand several types of categorical data before we start building any statistical models, deep learning models, or machine learning models and start encoding or transforming categorical data to numerical forms.
Understanding Nominal Scale
Nominal Scale is defined as the variables that are names only. They are used to label variables. Nominal scales never overlap with each other, and do not have any numerical significance.
Note: Nominal Scale refers only to those variables that just name.
Here are some examples that represent data for the nominal Scale. Once we collect the data, we have to assign a numerical code representing a nominal variable generally.
For instance, we can assign a numerical code 1 to address Delhi, 2 to address Mumbai, 3 to address Chennai, and 4 to address Bangalore for a categorical variable - In which city does the person reside.
Important Note: The assigned numerical value does not have any attached mathematical value.
Above statement implies that basic mathematical operations like division, multiplication, subtraction, or addition are pointless. As a result, operations like Delhi/Mumbai or Chennai + Bangalore make no sense at all.
Understanding Ordinal Scale
An Ordinal Scale refers to a variable in which the data value is stored from an ordered set. For instance, data utilizes Likert scale for representing customer feedback survey that is finite, as shown in the table below:
Customer Feedback: 5-Points Likert Scale
In the above case, we have collected the feedback data with the help of a five-point Likert scale. We assigned the numerical code 1 to Very Poor, 2 for Poor, 3 for Satisfactory, 4 for Good, and 5 for Very Good. We can also observe that 5 is better than 4 and much better than 3. However, it will not make any sense if we subtract very good from satisfactory.
As we already know, most machine learning algorithms function exclusively with numerical values or data. This is the reason behind encoding the categorical features into a representation compatible using the models.
Thus, there are several well-known approaches available for encoding, including:
However, we will be covering Label Encoding only throughout this tutorial:
Understanding Label Encoding
In Python Label Encoding, we need to replace the categorical value using a numerical value ranging between zero and the total number of classes minus one. For instance, if the value of the categorical variable has six different classes, we will use 0, 1, 2, 3, 4, and 5.
Now, let us understand label encoding with the data of COVID-19 cases in India across states as an example. While observing the following data frame, we will find out that the State column consists of a categorical value that is not pretty machine-friendly. The other columns consist of numerical values. Now, let us try label encoding for State Column.
Covid-19 cases in India across states
As we can observe, after performing label encoding, we have assigned the numerical value to each and every categorical value. Some of us might wonder why the numbering is unordered (Top-Down). The reason is that we have assigned the numbers in alphabetical order which implies that Delhi is assigned to 0, Gujarat to 1, Karnataka to 2 and so on.
Label Encoding using Python
The sklearn library of Python offers users pre-defined functions in order to work with Label Encoding on the dataset.
As we can observe, we have created an object of the LabelEncoder class and then use the object to apply the label encoding to the data.
There are primarily two ways available for Label Encoding:
Label Encoding using the scikit-learn library
Let us begin with the process of Label Encoding. The primary step for dataset encoding is to have a dataset.
So, we have created a simple dataset here.
Example: Creating the dataset
As we can observe, we have created a dictionary 'data' and transformed it into a Data Frame with the help of the DataFrame() function of pandas.
Geniune Data Frame: Gender Name 0 F Cindy 1 M Carl 2 M Johnny 3 F Stacey 4 M Andy 5 F Sara 6 M Victor 7 F Martha 8 F Mindy 9 M Max
As we can observe from the above dataset, we have the variable called 'Gender' that has labels as 'F' and 'M', respectively.
Moving ahead, let us try importing the LabelEncoder class. We will then apply the class on the 'Gender' variable of the dataset.
Geniune Data Frame: Gender Name 0 F Cindy 1 M Carl 2 M Johnny 3 F Stacey 4 M Andy 5 F Sara 6 M Victor 7 F Martha 8 F Mindy 9 M Max [0 1] Data Frame after Label Encoding: Gender Name 0 0 Cindy 1 1 Carl 2 1 Johnny 3 0 Stacey 4 1 Andy 5 0 Sara 6 1 Victor 7 0 Martha 8 0 Mindy 9 1 Max
In the above example, we have imported pandas and preprocessing modules of the scikit-learn library. We have then defined the data as a dictionary and printed a data frame for reference.
Later on, we have used the fit_transform() method in order to add label encoder functionality pointed by the object to the data variable. We have printed the unique code with respect to the Gender and the final Data Frame after performing label encoding.
Moving ahead, let us discuss another method of Label Encoding that is with the help of Category codes.
Label Encoding using Category codes
Before we get into the process of label encoding using Category codes, let us check the data types of the variables of the dataset.
We can check the data type using the dtypes function as shown below:
Gender object Name object dtype: object
Once we have checked the data type of the variable 'Gender', we will transform and convert it to category type.
This can be seen in the following snippet of code:
Gender category Name object dtype: object
Now, let us try transforming the labels into integer types with the pandas.DataFrame.cat.codes function. Here is a complete example based on label encoding using the category codes:
Genuine Data Frame: Gender Name 0 F Cindy 1 M Carl 2 M Johnny 3 F Stacey 4 M Andy 5 F Sara 6 M Victor 7 F Martha 8 F Mindy 9 M Max Data Frame after Label Encoding using Category codes: Gender Name 0 0 Cindy 1 1 Carl 2 1 Johnny 3 0 Stacey 4 1 Andy 5 0 Sara 6 1 Victor 7 0 Martha 8 0 Mindy 9 1 Max
In the above example, we have imported the pandas library and defined data as a dictionary. We have then printed the original data frame for reference. After that, we have converted the data type of the variable 'Gender' into a category and used the pandas.DataFrame.cat.codes function to transform it into category codes. At last, we have printed the result after label encoding using Category codes.