Best Popular Python Libraries for Data Engineers | NLP

In this tutorial, we will learn about popular Python libraries for data engineers. These libraries make a data engineer's life easier. Python is one of the most popular languages for machine learning. We will list these libraries along with example code.

Top Six Libraries of Python

1. Pendulum

Pendulum is a convenient library for working with dates and times in Python applications. Many other libraries handle dates and times, but Pendulum is easier to use than most of them. It provides a cleaner and more intuitive syntax than the standard Python datetime module and offers additional features, such as time zone support, intuitive and robust parsing, easier interval creation, and localization. Pendulum can handle a wide range of use cases, from simple date/time operations to more complex scenarios, such as scheduling and recurring events.

It offers a simple, human-friendly API for formatting dates and times. We can install it with pip: pip install pendulum

Now, let's understand the following example.

Example -

Output:

datetime object: 2023-02-17T16:37:49.251059+00:00
Converted datetime object: 2023-02-17T17:37:49.251059+01:00
Specific datetime object: 2022-03-15T12:30:00-04:00
Difference: 42

Explanation -

In the above code, we first create a new datetime object for the current date and time in the UTC time zone. We then convert this object to the 'Europe/Paris' time zone and print the result. Next, we create a new datetime object for a specific date and time, and print the result. Finally, we calculate the difference in days between two datetime objects and print the result. This code demonstrates just a few of the many features of the Pendulum library.

2. Humanize

The Humanize library formats numbers, strings, and dates in an easy-to-read, human-friendly way. It provides various functions that help in converting complex data into more easily understandable forms, particularly for display purposes. For instance, it can convert dates into relative time representations, such as "3 hours ago" or "tomorrow", or transform large numbers into more readable formats, such as "1.2 million" instead of "1200000". The Humanize library can also format file sizes, durations, and other types of data in a more human-friendly way. We can install it with pip: pip install humanize

Let's understand the following example.

Example -

Output:

Current Date and Time: now
500,000,000
Duration  a day
File size 1.1 GB

Explanation -

In the above code, we first import the Humanize library and the Python datetime module. We then use the naturaltime() function to convert the current date and time to a human-readable string that represents the time relative to the current time, such as "now" or "2 days ago". We also use the intcomma() function to convert a large number to a human-readable string with commas separating the thousands, millions, billions, etc. We use the naturaldelta() function to convert a duration into a more readable form such as "a day" or "2 hours". Finally, we use the naturalsize() function to convert a file size (in bytes) to a more readable format such as "1.1 GB" or "2.0 MB".

3. pgeocode

The pgeocode library is useful for applications that require geographical information or mapping, such as logistics, supply chain management, and online retail. It is a Python library for high-performance offline querying of GPS coordinates, region names, and municipality names from postal (ZIP) codes. It gives Python users an interface to the GeoNames geographical database, a comprehensive dataset of geographic locations and their associated data, such as country, region, and postal code.

With pgeocode, you can obtain information about a single postal code or a list of postal codes within a given country. The result includes location-related fields such as latitude and longitude, place name, state or region, county, and country code.

We can install it with pip: pip install pgeocode

Example -

Output:

postal_code          10001
country_code            US
place_name        New York
state_name        New York
state_code              NY
county_name       New York
county_code           61.0
community_name         NaN
community_code         NaN
latitude           40.7484
longitude         -73.9967
accuracy               4.0
Name: 0, dtype: object

Explanation -

In the above code, we first import the pgeocode library and create an instance of the Nominatim class for the United States by passing the 'us' country code. We then use the query_postal_code() method to query for location information based on the postal code 10001. The method returns a pandas Series object with the location information, including the city, state, latitude and longitude, and more. Finally, we display the location information using the print function.

Note that pgeocode depends on the pandas library, so you may need to install it as a prerequisite. The library also supports other queries, such as looking up places by name and computing the distance between two postal codes.

4. ftfy

Sometimes foreign-language text in the data doesn't display correctly. This problem is known as mojibake: distorted or jumbled text that results from an encoding or decoding mistake. The ftfy ("fixes text for you") library can resolve such problems. It is especially useful when working with text data that has been scraped from the web or that has been encoded using non-standard methods. It can be used in a variety of applications, including data cleaning, text analysis, and natural language processing. It also cleans up other kinds of formatting damage, such as leftover HTML entities and inconsistent line breaks. We can install it with pip: pip install ftfy
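To make the idea of mojibake concrete, the sketch below uses only the standard library to produce garbled text by decoding UTF-8 bytes with the wrong codec, and then reverses that mistake by hand. It does not use ftfy; it only illustrates the kind of damage ftfy is designed to detect and undo automatically. The sample string is an arbitrary choice.

```python
original = "Héllo, Wörld!"

# Encode as UTF-8, then decode the bytes with the wrong codec (Latin-1).
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Undo the mistake by reversing the two steps.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

Reversing the steps by hand only works when you already know which codec was misapplied, which is rarely the case with scraped data.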

Let's understand the following example.

Example -

Output:

Héllo, Wörld!
大中华民国
à perturber la réflexion

Explanation -

In the above code, the ftfy.fix_text() function is used to fix a specific problem with the text. The first example fixes mixed character encodings, the second example fixes mojibake, and the third example cleans up text formatting.

5. Sketch

Sketch is an AI-based code-writing assistant built for users of the pandas library. It uses machine learning to understand the context of the user's data and suggest appropriate code, which makes data manipulation and analysis tasks quicker and easier.

It relies on compact representations of large datasets, known as data sketches, which are efficient in space and fast to query. Data structures of this kind include Count-Min Sketch, HyperLogLog, Bloom Filter, and MinHash, which can be used for different purposes such as estimating the frequency of items in a dataset, identifying near-duplicates, or measuring similarity between datasets.

This library can be useful in a wide range of applications such as data mining, machine learning, natural language processing, network analysis, and web analytics. It is an open-source library and can be installed with pip: pip install sketch

Let's understand the following example.

Example -
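The listing for this example is missing, and the CountMinSketch class it describes cannot be confirmed as part of the sketch package. The block below is therefore a hand-written, standard-library illustration of a Count-Min Sketch with the update() and estimate() methods named in the explanation, not the library's API. It reads "10 hash functions and 1000 rows" as a depth of 10 and a width of 1000 counters per hash function, and the input list is an assumption chosen to match the output.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter: one row of counters per hash function."""

    def __init__(self, depth: int = 10, width: int = 1000):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Salting the hash with the row number gives one hash function per row.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch(depth=10, width=1000)
for fruit in ["apple", "banana", "apple", "cherry", "apple"]:
    sketch.update(fruit)

for fruit in ["apple", "banana", "cherry", "orange"]:
    print(sketch.estimate(fruit))
```

Memory use is fixed at depth × width counters no matter how many distinct items are added, which is the trade-off that makes the estimates approximate.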

Output:

3
1
1
0

Explanation -

In the above code, we create a CountMinSketch object with 10 hash functions and 1000 rows. We then update the sketch with some data (the strings "apple", "banana", and "cherry") using the update() method. Finally, we estimate the frequency of some items using the estimate() method.

The output shows that "apple" appears three times in the dataset, "banana" and "cherry" each appear once, and "orange" does not appear at all. Note that the frequency estimates are approximate and may be slightly inaccurate due to the probabilistic nature of the Count-Min Sketch data structure.

6. rembg

Rembg is an open-source library written in Python that uses deep learning to remove the background of images. It is designed to work with images that contain a single foreground object with a clear boundary between the foreground and the background, such as photos of people, animals, or objects on a plain background.

The library uses a pre-trained convolutional neural network (CNN) model to predict the probability of each pixel in the image belonging to the foreground or the background. The model is trained on a large dataset of annotated images and is able to accurately detect the boundary between the foreground and the background.

It can be used either as a standalone command-line tool or as a Python library that can be integrated into other applications.

Rembg is useful in a wide range of applications such as image processing, computer vision, and photography. It can be used to remove the background from images for use in other projects or to create cutout images that can be placed on different backgrounds. It can be installed with pip: pip install rembg

Let's understand the following example.

Example -
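The code for this example is missing from the text. The sketch below follows the explanation that comes next and assumes the rembg and opencv-python packages are installed. The function name remove_background and the file names in the example call are hypothetical. The imports sit inside the function because both packages are heavy, and rembg may download its model the first time it runs.

```python
def remove_background(input_path: str, output_path: str) -> None:
    """Read an image, strip its background, and write the result (hypothetical helper)."""
    # Imported here because both packages are heavy optional dependencies.
    import cv2  # third-party: pip install opencv-python
    from rembg import remove  # third-party: pip install rembg

    image = cv2.imread(input_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {input_path}")
    # remove() accepts the image as a NumPy array and returns the cutout.
    result = remove(image)
    # Saving as PNG preserves transparency where the background used to be.
    cv2.imwrite(output_path, result)


# Example call with hypothetical file names:
# remove_background("input.png", "output.png")
```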

Explanation -

In the above code, we import the remove() function from rembg along with the cv2 library. We read the image from the input path, remove its background with remove(), and save the result to the output path.

Conclusion

This tutorial covered some handy libraries that are useful for Python developers. You may already know most of them, and you can use them according to your requirements.