5 Categorical Encoding Tricks You Need to Know Today as a Data Scientist

Introduction:

In this tutorial, we will learn about five categorical encoding techniques. Handling categorical variables correctly is essential for building accurate models in data science. Categorical encoding transforms categorical variables into a numerical format that machine learning algorithms can understand. With many techniques available, choosing the right one can significantly impact the performance of your models. In this tutorial, we will explore five essential categorical encoding tricks that every data scientist should know. These techniques will help you handle various types of categorical data efficiently, improve your model's accuracy, and streamline your data preprocessing workflow.

What is the Meaning of Categorical Data?

Categorical data is also referred to as nominal data or ordinal data. Its values are divided into several groups or categories. While numerical data measures objects quantitatively, categorical data represents qualitative or descriptive characteristics. Understanding categorical data is essential when working with machine learning models, as these models typically require numerical inputs. Categorical variables are often represented as strings or labels and take a limited number of values. Some examples of categorical data are given below -

  1. The city a person resides in, for example, Kolkata, Mumbai, Delhi, etc.
  2. The department in which a person works. Examples include IT, Finance, Human Resources, Production, etc.
  3. The highest educational degree of a person, like High School, Diploma, etc.
  4. The grades a student receives, for example, A+, A, E, O+, O, etc.

What Kinds of Categorical Data are Available?

Categorical data is of two types, which are given below -

1. Ordinal Data:

Ordinal data has categories with a natural sequence or order. It is crucial to preserve this order when encoding ordinal data.

For example, a person's highest educational degree has a natural ranking (High School, Bachelor's, Master's, PhD) that indicates their qualifications, which can be a significant factor in assessing their suitability for a job.

2. Nominal Data:

Nominal data has categories with no specific order or sequence. When encoding nominal data, it is important to capture whether a category is present or absent; the order of the categories is irrelevant. For example, the city a person lives in is useful information, but there is no ranking among cities: living in Kolkata is considered the same as living in Delhi in terms of order. Understanding the nature of categorical data helps data scientists choose appropriate encoding techniques. The category_encoders Python package offers various methods for encoding categorical data. You can install it using the code below:
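
pip install category_encoders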

Steps to Encode Categorical Data:

1. Identify the Categorical Features:

First, examine your data to identify features that contain non-numerical values, such as text labels or categories like color, size, and customer type. These are the features that require encoding.

2. Select an Encoding Technique:

There are many encoding methods, each with its own advantages and disadvantages. Here are some popular options:

  1. One-Hot Encoding: This is ideal for nominal features. It creates a new binary feature for each category, where 1 indicates the category's presence and 0 indicates its absence.
  2. Label Encoding: This method assigns a numerical value to each category. However, it implies an ordinal relationship among the categories, which might not always be appropriate.
  3. Ordinal Encoding: It is similar to label encoding. However, it should only be used when the categories have a meaningful order, such as low, medium, and high.

3. Apply the Encoding Technique:

Once you choose an encoding method, you can apply it to your data. Many machine learning libraries provide functions for this purpose. For example, with one-hot encoding, you create a new binary feature for each category of the original feature.

4. Test and Refine the Technique:

This step is optional. Sometimes, it is beneficial to experiment with different encoding techniques to see how each one affects the performance of your machine learning model. This can help you determine the best method for your specific dataset.

Ordinal Encoding or Label Encoding:

We use this categorical data encoding technique when the categorical feature is ordinal. This means that the order of the categories is important and must be preserved. In label encoding, each category is converted to an integer value representing its position in the sequence. For example, suppose we have data on people's education levels. These levels have a natural order that we want to preserve. The code is given below -
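
A minimal sketch using scikit-learn's OrdinalEncoder; the sample education values are taken from the output shown below.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data with education levels that have a natural order
df = pd.DataFrame({'Education': ['High School', "Bachelor's", "Master's",
                                 'PhD', 'High School', "Master's"]})

# Pass the categories explicitly so their order is preserved
levels = [['High School', "Bachelor's", "Master's", 'PhD']]
encoder = OrdinalEncoder(categories=levels)
df['Education_Encoded'] = encoder.fit_transform(df[['Education']]).ravel()
print(df)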

Output:

Now, we run the above code and find the result. The result is given below -

 
     Education  Education_Encoded
0  High School                0.0
1   Bachelor's                1.0
2     Master's                2.0
3          PhD                3.0
4  High School                0.0
5     Master's                2.0

In this example, the ordinal encoding assigns numerical values to the education levels in the specified order, preserving their natural sequence.

One-Hot Encoding:

We use this categorical data encoding technique when the features are nominal. Each category is mapped to a binary variable containing either 0 or 1. These new binary features are called dummy variables. The number of dummy variables equals the number of levels in the categorical variable.

For example, suppose we have a dataset with a feature whose values represent different colors, like "Red", "Green", and "Blue". The code is given below -
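
A minimal sketch using the pandas get_dummies function; the sample color values are inferred from the output shown below.

import pandas as pd

# Sample data with a nominal color feature
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})

# get_dummies creates one binary column per category;
# astype(int) keeps the values as 0/1 integers
one_hot = pd.get_dummies(df['Color'], prefix='Color').astype(int)
print(one_hot)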

Output:

Now, we run the above code and find the result. The result is given below -

 
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            1          0
4           0            0          1   

In this example, one-hot encoding creates three new binary columns. The columns are "Color_Blue," "Color_Green," and "Color_Red." Each column indicates the presence (1) or absence (0) of the corresponding color in the original feature.

Dummy Encoding:

Dummy coding is similar to one-hot encoding in that both methods convert categorical data into binary or dummy variables. However, while one-hot encoding uses N binary variables for N categories, dummy coding uses N-1 binary variables. This slight difference makes dummy coding a more efficient encoding method compared to one-hot encoding.

Here is a short example of dummy coding. Suppose you have a categorical variable representing three colors: red, green, and blue.

Color
Red
Green
Blue

Using dummy coding, you would create two binary variables (N-1 = 3-1 = 2).

Color    Dummy1 (Green)    Dummy2 (Blue)
Red      0                 0
Green    1                 0
Blue     0                 1

Here, "Red" is the reference category and is represented by 0 in both dummy1 and dummy2 variables. "Green" is represented by 1 in the first dummy variable and 0 in the second. "Blue" is represented by 0 in the first dummy variable and 1 in the second.

Now, we show this example in code. The code is given below -
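
A minimal sketch using pandas; declaring 'Red' as the first category is an assumption made here so that drop_first treats it as the reference level, as in the table above.

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})

# Declare 'Red' as the first category so that drop_first
# uses it as the reference level
df['Color'] = pd.Categorical(df['Color'],
                             categories=['Red', 'Green', 'Blue'])

# drop_first=True yields N-1 dummy variables; the reference
# category ('Red') becomes the all-zeros row
dummies = pd.get_dummies(df['Color'], prefix='Color',
                         drop_first=True).astype(int)
print(dummies)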

Output:

Now, we run the above code and find the result. In this example, "Red" is the reference category and is represented by 0 in both dummy variables "Color_Green" and "Color_Blue". "Green" is represented by 1 in "Color_Green" and 0 in "Color_Blue", and "Blue" is represented by 0 in "Color_Green" and 1 in "Color_Blue". The result is given below -

 
   Color_Green  Color_Blue
0            0           0
1            1           0
2            0           1   

Disadvantages of One-Hot Encoding and Dummy Encoding:

One-hot encoding and dummy encoding are two powerful and effective methods for categorical data encoding. They are also very popular among data scientists, but they may be less effective in the following cases -

  1. When there are a large number of levels in the data, both one-hot encoding and dummy encoding become less effective. If a feature has many distinct categories, a correspondingly large number of dummy variables is required to encode it. For instance, a column with 30 distinct values would need 30 new variables for encoding.
  2. When a dataset contains multiple categorical features, a similar issue arises. Each categorical feature, with its numerous categories, leads to the creation of many binary features. For example, a dataset with 10 or more categorical columns will result in a large number of binary features, each representing one of the categories within those columns.

In both cases, these encoding methods produce a sparse dataset: many columns are filled mostly with zeros and contain only a few ones. This means they create numerous dummy features without adding significant information. Additionally, they can lead to the dummy variable trap, a situation where features are highly correlated and the value of one variable can be predicted from the others. The increased dataset size can also slow down the model's training and degrade overall performance, making the model computationally expensive. Moreover, these encodings are not optimal for tree-based models.

Effect Encoding:

This encoding technique is also called Deviation Encoding or Sum Encoding. Effect encoding is similar to dummy encoding, but with one key difference: dummy encoding uses 0 and 1 to represent the data, while effect encoding uses 1, 0, and -1. Specifically, the row that contains only 0s in dummy encoding is represented by -1s in effect encoding. An example of effect encoding is given below -
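
The category_encoders package provides a SumEncoder for this purpose, but its column naming differs from the output shown below, so here is a minimal manual sketch; the column names E1 and E2 and the choice of 'Green' as the reference level are taken from that output.

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green',
                             'Red', 'Green', 'Blue']})

# Effect (sum) coding: the reference level 'Green' is encoded
# as -1 in every column instead of all zeros
effect_map = {
    'Red':   [1, 0],
    'Blue':  [0, 1],
    'Green': [-1, -1],
}
df['E1'] = df['Color'].map(lambda c: effect_map[c][0])
df['E2'] = df['Color'].map(lambda c: effect_map[c][1])
print(df)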

Output:

Now, we run the above code and find the result. In this output, E1 and E2 represent the effect encoding of the Color variable. Green is the reference level and is implicitly encoded as -1 in both columns, so each level's encoding reflects its deviation from the overall mean. The result is given below -

 
   Color  E1  E2
0    Red   1   0
1   Blue   0   1
2  Green  -1  -1
3    Red   1   0
4  Green  -1  -1
5   Blue   0   1   

Hash Encoder:

To understand hash encoding, it is essential to first grasp the concept of hashing. Hashing is the process of converting an input of arbitrary size into a fixed-size value using a hashing algorithm, which generates the hash value from the input. Hashing is a one-way process, meaning that it is not possible to reverse-engineer the original input from its hash value. Hashing has various applications, such as data retrieval, data integrity verification, and cryptography. There are many hash functions, for example, the Message Digest family (MD2, MD5) and the Secure Hash Algorithms (SHA family).

The hash encoder is similar to one-hot encoding in that it represents categorical features in new dimensions, but the number of output dimensions is fixed. The n_components parameter lets you set this size: a feature with 5 categories and a feature with 100 categories can both be represented by the same N new features. By default, the hash encoder in the category_encoders package uses the MD5 hash algorithm, but users can specify a different algorithm if they prefer. An example of a hash encoder is given below -
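
A minimal sketch using scikit-learn's HashingVectorizer, matching the description below; the sample category values are assumed for illustration.

from sklearn.feature_extraction.text import HashingVectorizer

# Sample categorical values to hash into a fixed number of features
categories = ['Red', 'Green', 'Blue', 'Green', 'Red']

# n_features fixes the output dimensionality; alternate_sign=False
# keeps the hashed values non-negative
vectorizer = HashingVectorizer(n_features=5, alternate_sign=False)
hashed = vectorizer.transform(categories)
print(hashed.toarray())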

In this example, the HashingVectorizer transforms categorical data into a fixed number of hashed features (5 in this case). The alternate_sign parameter is set to False to ensure non-negative values. The output is a two-dimensional array representing the hashed features of each category.

Binary Encoding:

Binary encoding combines elements of hash encoding and one-hot encoding. In this method, categorical features are first converted into numerical form using an ordinal encoder. These numerical values are then transformed into their binary equivalents, and the binary digits are split into separate columns. Binary encoding is particularly effective in machine learning when dealing with features that have a large number of categories. An example of binary encoding is given below -
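
A minimal sketch using the BinaryEncoder from the category_encoders package; the sample values 'A', 'B', and 'C' are taken from the output shown below.

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C']})

# BinaryEncoder first ordinal-encodes the categories, then splits
# the binary representation of each ordinal into separate columns
encoder = ce.BinaryEncoder(cols=['Category'])
encoded = encoder.fit_transform(df)
print(encoded)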

Output:

Now, we run the above code and find the result. Here, the BinaryEncoder has transformed the categorical values 'A', 'B', and 'C' into the binary-encoded format. The categories are first ordinally encoded to 1, 2, and 3, respectively, and then each number is represented in binary form and split into separate columns. The result is given below -

 
   Category_0  Category_1
0           0           1
1           1           0
2           1           1
3           0           1
4           1           0
5           1           1

Target Encoding:

Target encoding is a method used in data preprocessing to convert categorical variables into numerical values. Whereas one-hot encoding produces a binary column for each category, target encoding assigns a single numerical value to each category based on its relationship with the target variable. This approach is often used in classification tasks, replacing each categorical value with the mean (or another statistic) of the target for that category.

Target encoding captures important information from categorical data without expanding the feature space, making it suitable for models such as decision trees and gradient boosting. It is a Bayesian encoding technique: Bayesian encoders use information about the target variable to encode categorical data. In target encoding, the mean of the target variable is calculated for each category, and this mean replaces the category value. For a categorical target variable, each category is replaced by the posterior probability of the target. An example of target encoding is given below -
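
A minimal sketch that computes the per-category mean of the target directly; the sample data is taken from the output shown below. Note that the TargetEncoder class in the category_encoders package applies smoothing toward the global mean by default, which would shift these values slightly.

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'Target':   [1, 2, 3, 1, 2, 3, 1],
})

# Replace each category with the mean of the target values
# observed for that category
df['Encoded_Category'] = df.groupby('Category')['Target'].transform('mean')
print(df)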

Output:

Now, we run the above code and find the result. In this example, the target encoding converts the categorical feature "Category" to numeric values based on the mean of the "Target" variable: each category is replaced with the mean of the target values corresponding to that category. The result is given below -

 
  Category  Target  Encoded_Category
0        A       1               1.0
1        B       2               2.0
2        C       3               3.0
3        A       1               1.0
4        B       2               2.0
5        C       3               3.0
6        A       1               1.0   

Conclusion:

In this tutorial, we learned about five categorical encoding techniques. These are basic techniques that every data scientist should know. Encoding categorical data is an important aspect of feature engineering, and it is essential to choose the appropriate encoding scheme based on the dataset and the model being used. We explored various encoding techniques, highlighting their challenges and suitable use cases.
