Label Encoding in Python

An Introduction

Before we begin learning the categorical variable encoding, let us understand the basics of data types and their scales at first. It becomes essential for the learners to understand these topics in order to proceed to work with categorical variable encoding. As we all know, Data is a distinct type of information usually formatted in a specific manner. We can categorize Data into three types, called, structured data, semi-structured data, and unstructured data.

The data embodied in the form of matrix with rows and columns are denoted as structured data. These data can be stored as a table in SQL database, rows and columns in excel sheet, or delimiter separated in CSV.

The Data which is not embodied in the form of matrix is said to be semi-structured data and unstructured data. We can generally store the Semi-Structured data in XML files, JSON format and a lot more, whereas the unstructured data in form of images, e-mails, videos, log data, and textual data.

Let us consider a provided business problem based on machine learning or data science. If we deal with structured data only and the data gathered is a combination of continuous variables and categorical variables, most of the algorithms in Machine Learning will not understand or not be able to work with categorical variables.

This statement means that algorithms in machine learning perform considerably better in accuracy and other performance metrics when data is represented in numeric form instead of categorical to a model for training and testing.

Hence, the categorical data must be encoded into numbers before using it to fit and evaluate a model.

Some Algorithms of Machine Learning, like Tree-based (such as Decision Tree, Random Forest) algorithms, perform better in handling categorical variables. The best training in any project associated with data science is to convert categorical data to a numerical data.

Since the objective is now clear, let us understand several types of categorical data before we start building any statistical models, deep learning models, or machine learning models and start encoding or transforming categorical data to numerical forms.

Understanding Nominal Scale

Nominal Scale is defined as the variables that are names only. They are used to label variables. Nominal scales never overlap with each other, and do not have any numerical significance.

Note: Nominal Scale refers only to those variables that just name.

Here are some examples that represent data for the nominal Scale. Once we collect the data, we have to assign a numerical code representing a nominal variable generally.

What is person's Gender?What is person's Marital Status?In which city does the person reside?
MaleSingleDelhi
FemaleMarriedMumbai
DivorcedChennai
WidowedBangalore
City where Person liveAssigned Numerical Code
Delhi1
Mumbai2
Chennai3
Bangalore4

For instance, we can assign a numerical code 1 to address Delhi, 2 to address Mumbai, 3 to address Chennai, and 4 to address Bangalore for a categorical variable - In which city does the person reside.

Important Note: The assigned numerical value does not have any attached mathematical value.

Above statement implies that basic mathematical operations like division, multiplication, subtraction, or addition are pointless. As a result, operations like Delhi/Mumbai or Chennai + Bangalore make no sense at all.

Understanding Ordinal Scale

An Ordinal Scale refers to a variable in which the data value is stored from an ordered set. For instance, data utilizes Likert scale for representing customer feedback survey that is finite, as shown in the table below:

Customer Feedback: 5-Points Likert Scale

FeedbackAssigned Numerical Code
Very Poor1
Poor2
Satisfactory3
Good4
Very Good5

In the above case, we have collected the feedback data with the help of a five-point Likert scale. We assigned the numerical code 1 to Very Poor, 2 for Poor, 3 for Satisfactory, 4 for Good, and 5 for Very Good. We can also observe that 5 is better than 4 and much better than 3. However, it will not make any sense if we subtract very good from satisfactory.

As we already know, most machine learning algorithms function exclusively with numerical values or data. This is the reason behind encoding the categorical features into a representation compatible using the models.

Thus, there are several well-known approaches available for encoding, including:

  1. Label Encoding
  2. One-hot Encoding
  3. Ordinal Encoding

However, we will be covering Label Encoding only throughout this tutorial:

Understanding Label Encoding

In Python Label Encoding, we need to replace the categorical value using a numerical value ranging between zero and the total number of classes minus one. For instance, if the value of the categorical variable has six different classes, we will use 0, 1, 2, 3, 4, and 5.

Now, let us understand label encoding with the data of COVID-19 cases in India across states as an example. While observing the following data frame, we will find out that the State column consists of a categorical value that is not pretty machine-friendly. The other columns consist of numerical values. Now, let us try label encoding for State Column.

Covid-19 cases in India across states

StateConfirmedDeathsRecovered
Maharashtra28428111194158140
Tamil Nadu1563692236107416
Delhi118645354597693
Karnataka51422208919729
Gujarat45481208932103
Uttar Pradesh43441104626675

As we can observe, after performing label encoding, we have assigned the numerical value to each and every categorical value. Some of us might wonder why the numbering is unordered (Top-Down). The reason is that we have assigned the numbers in alphabetical order which implies that Delhi is assigned to 0, Gujarat to 1, Karnataka to 2 and so on.

State (Nominal Scale)State (Label Encoding)
Maharashtra3
Tamil Nadu4
Delhi0
Karnataka2
Gujarat1
Uttar Pradesh5

Label Encoding using Python

The sklearn library of Python offers users pre-defined functions in order to work with Label Encoding on the dataset.

Syntax:

As we can observe, we have created an object of the LabelEncoder class and then use the object to apply the label encoding to the data.

There are primarily two ways available for Label Encoding:

  1. LabelEncoder class with the help of scikit-learn library
  2. Category Codes

Label Encoding using the scikit-learn library

Let us begin with the process of Label Encoding. The primary step for dataset encoding is to have a dataset.

So, we have created a simple dataset here.

Example: Creating the dataset

As we can observe, we have created a dictionary 'data' and transformed it into a Data Frame with the help of the DataFrame() function of pandas.

Output:

Geniune Data Frame:

  Gender    Name
0      F   Cindy
1      M    Carl
2      M  Johnny
3      F  Stacey
4      M    Andy
5      F    Sara
6      M  Victor
7      F  Martha
8      F   Mindy
9      M     Max

As we can observe from the above dataset, we have the variable called 'Gender' that has labels as 'F' and 'M', respectively.

Moving ahead, let us try importing the LabelEncoder class. We will then apply the class on the 'Gender' variable of the dataset.

Example:

Output:

Geniune Data Frame:


  Gender    Name
0      F   Cindy
1      M    Carl
2      M  Johnny
3      F  Stacey
4      M    Andy
5      F    Sara
6      M  Victor
7      F  Martha
8      F   Mindy
9      M     Max
[0 1]
Data Frame after Label Encoding:

   Gender    Name
0       0   Cindy
1       1    Carl
2       1  Johnny
3       0  Stacey
4       1    Andy
5       0    Sara
6       1  Victor
7       0  Martha
8       0   Mindy
9       1     Max

Explanation:

In the above example, we have imported pandas and preprocessing modules of the scikit-learn library. We have then defined the data as a dictionary and printed a data frame for reference.

Later on, we have used the fit_transform() method in order to add label encoder functionality pointed by the object to the data variable. We have printed the unique code with respect to the Gender and the final Data Frame after performing label encoding.

Moving ahead, let us discuss another method of Label Encoding that is with the help of Category codes.

Label Encoding using Category codes

Before we get into the process of label encoding using Category codes, let us check the data types of the variables of the dataset.

We can check the data type using the dtypes function as shown below:

Output:

Gender    object
Name      object
dtype: object

Once we have checked the data type of the variable 'Gender', we will transform and convert it to category type.

This can be seen in the following snippet of code:

Output:

Gender    category
Name        object
dtype: object

Now, let us try transforming the labels into integer types with the pandas.DataFrame.cat.codes function. Here is a complete example based on label encoding using the category codes:

Example:

Output:

Genuine Data Frame:

  Gender    Name
0      F   Cindy
1      M    Carl
2      M  Johnny
3      F  Stacey
4      M    Andy
5      F    Sara
6      M  Victor
7      F  Martha
8      F   Mindy
9      M     Max

Data Frame after Label Encoding using Category codes:

   Gender    Name
0       0   Cindy
1       1    Carl
2       1  Johnny
3       0  Stacey
4       1    Andy
5       0    Sara
6       1  Victor
7       0  Martha
8       0   Mindy
9       1     Max

Explanation:

In the above example, we have imported the pandas library and defined data as a dictionary. We have then printed the original data frame for reference. After that, we have converted the data type of the variable 'Gender' into a category and used the pandas.DataFrame.cat.codes function to transform it into category codes. At last, we have printed the result after label encoding using Category codes.