FuzzyWuzzy Python Library
In this tutorial, we will learn how to match strings using the third-party Python library fuzzywuzzy and how to determine how similar they are, using various examples.
Python provides a few methods to compare two strings. A few main methods are given below.
But there is another tool that can be used effectively for comparison, known as fuzzywuzzy. It is quite effective at recognizing two strings that refer to the same thing but are written slightly differently. Sometimes we also need a program that can automatically identify a misspelling.
Fuzzy string matching is the process of finding strings that approximately match a given pattern. The library uses the Levenshtein distance to calculate the difference between sequences.
This library can help map databases that lack a common key, for example when joining two tables by company name and the names are written differently in each table.
Let's start with a simple example: comparing two strings, str1 and str2, with the == operator.
When the strings match exactly (100%), the comparison returns True. But what if we make a change in str2?
After such a change, for example in letter case, the comparison returns False: the strings look almost identical to the human eye, but not to the interpreter. However, we can solve this particular problem by converting both strings to lower case.
But if the strings differ in their characters, for example in punctuation, lower-casing no longer helps and we run into another problem.
To resolve such problems, we need more effective tools to compare strings, and fuzzywuzzy is a well-suited tool for measuring how similar two strings are.
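The comparisons described above can be sketched as follows. The example strings and variable names here are assumptions chosen for illustration and are not necessarily the ones used in the original example.

```python
# Hypothetical example strings, chosen only for illustration.
str1 = "Welcome to Javatpoint"
str2 = "Welcome to Javatpoint"    # identical to str1
str3 = "welcome to javatpoint"    # differs only in letter case
str4 = "Welcome to Javatpoint."   # has an extra trailing full stop

exact_match = str1 == str2                      # identical strings compare equal
case_mismatch = str1 == str3                    # a case difference breaks equality
lowered_match = str1.lower() == str3.lower()    # lower-casing removes the case difference
punct_mismatch = str1.lower() == str4.lower()   # the extra character still breaks equality

print(exact_match, case_mismatch, lowered_match, punct_mismatch)
# prints: True False True False
```

Exact comparison is all-or-nothing, which is why a similarity score is more useful for strings that are almost the same.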
The Levenshtein Distance
The Levenshtein distance is used to calculate the distance between two sequences, such as two strings. It is the minimum number of single-character edits needed to change one string into the other. These edits can be insertions, deletions or substitutions.
We will use the above function on the earlier example, where we were trying to compare "Welcome to javatpoint." with "Welcome to javatpoint". We can see that both strings are likely to be the same because the Levenshtein distance between them is small.
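One common way to compute this distance is dynamic programming over the two strings. The function below is a minimal sketch of that approach; the name levenshtein and its structure are our own choices for illustration, not code taken from the fuzzywuzzy library.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # Keep the shorter string in b so each row of the table stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
print(distance)  # prints: 3
```

A small distance relative to the length of the strings suggests that they are probably two spellings of the same thing.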
The FuzzyWuzzy Package
The name of this library is somewhat odd and funny, but the library itself is very useful. It has its own way of comparing two strings and returns a score out of 100 that indicates how closely they match. To work with this library, we need to install it in our Python environment.
We can install this library using the pip command: pip install fuzzywuzzy
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Now type the following command and press enter.
Let's understand the following methods of the fuzzywuzzy library.
The fuzz module is used to compare two given strings at a time. It returns a score out of 100 after comparing them with one of several methods.
The ratio() method is one of the most important methods of the fuzz module. It compares two strings and scores them on the basis of how closely they match. Let's understand the following example.
As we can see in the above code, the fuzz.ratio() method returned a high score, which means there is only a very slight difference between the strings.
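To see where such a score can come from, here is a rough standard-library sketch that uses difflib.SequenceMatcher as a stand-in scorer. The helper name simple_ratio is hypothetical, and fuzzywuzzy's own scores may differ slightly depending on how it is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity score from 0 to 100 based on difflib's matching blocks."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# With fuzzywuzzy installed, the corresponding library call is fuzz.ratio(s1, s2).
score_same = simple_ratio("Welcome to javatpoint", "Welcome to javatpoint")
score_close = simple_ratio("Welcome to javatpoint.", "Welcome to javatpoint")
print(score_same, score_close)  # prints: 100 98
```

The single extra character lowers the score only slightly, unlike the == operator, which would simply report a mismatch.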
The fuzzywuzzy library provides another powerful method, partial_ratio(). It is used to handle more complex string comparisons such as substring matching. Let's see the following example.
The partial_ratio() method can detect a substring, so in that case it yields 100% similarity. It follows the "optimal partial" logic: given a shorter string of length k and a longer string of length m, the algorithm finds the best-matching substring of length k in the longer string.
This method does not guarantee an accurate result: if the order of the words in the string changes, the score may be misleading.
But the fuzzywuzzy module provides a solution. Let's understand the following example.
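The "best-matching substring of length k" idea can be sketched with a simple sliding window. This is a simplified illustration under our own assumptions: the helper name partial_ratio_sketch is hypothetical, it uses difflib as the scorer, and the library itself chooses candidate windows differently.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Score the shorter string against every same-length window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    k = len(shorter)
    if k == 0:
        return 0  # our own convention for empty input
    best = 0.0
    for start in range(len(longer) - k + 1):
        window = longer[start:start + k]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# With fuzzywuzzy installed, the corresponding library call is fuzz.partial_ratio(s1, s2).
substring_score = partial_ratio_sketch("javatpoint", "Welcome to javatpoint")
print(substring_score)  # prints: 100
```

Because one window of the longer string equals the shorter string exactly, the score is 100 even though the full strings differ in length.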
59 74 100
In the above code, we have used the token_sort_ratio() method, which has an advantage over partial_ratio(). In this method, the strings are split into tokens, the tokens are sorted alphabetically and joined back together, and the resulting strings are compared. But there is another situation to consider: what if the strings differ widely in length?
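The token-sorting step described above can be sketched as follows. The helper name token_sort_ratio_sketch and the example strings are hypothetical, and the sketch only lower-cases and splits on whitespace, whereas the library applies its own preprocessing.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Sort the tokens of each string alphabetically, then compare the results."""

    def normalize(text: str) -> str:
        return " ".join(sorted(text.lower().split()))

    a, b = normalize(s1), normalize(s2)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# With fuzzywuzzy installed, the corresponding library call is fuzz.token_sort_ratio(s1, s2).
first = "United States of America"
second = "America United States of"
plain_score = int(round(100 * SequenceMatcher(None, first, second).ratio()))
sorted_score = token_sort_ratio_sketch(first, second)
print(plain_score < sorted_score, sorted_score)  # prints: True 100
```

Sorting the tokens makes the comparison insensitive to word order, which is exactly the case where a plain ratio gives a misleadingly low score.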
Let's understand the following example.
40 64 61 95
In the above code, we have used another method called fuzz.token_set_ratio(), which performs a set operation: it takes out the common tokens and then makes pairwise ratio() comparisons.
The intuition is that the sorted intersection of the tokens is always exactly the same in both strings, so the score increases when the intersection makes up a larger share of the full string or when the remaining tokens are closer to each other.
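The set operation and the pairwise comparisons can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, difflib is used as a stand-in scorer, and the sketch only lower-cases and splits on whitespace.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare the common tokens with each string's common-plus-remaining tokens."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    # Take the best of the three pairwise comparisons.
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# With fuzzywuzzy installed, the corresponding library call is fuzz.token_set_ratio(s1, s2).
short_text = "javatpoint tutorial"
long_text = "the javatpoint tutorial for python beginners"
set_score = token_set_ratio_sketch(short_text, long_text)
print(set_score)  # prints: 100
```

Here every token of the shorter string also appears in the longer one, so the intersection equals the whole shorter string and the score is 100 despite the large difference in length.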
The fuzzywuzzy package provides the process module, which allows us to find the string with the highest similarity in a list of choices. Let's understand the following example.
[('hello', 90), ('Hello Good', 90), ('Morning', 90), ('Good Evenining', 59)]
('hello', 90)
The above code returns the choices together with their matching scores, followed by the single best match from the given list of strings.
The fuzz module also provides WRatio, which the process module uses as its default scorer and which usually gives a better result than the simple ratio. It handles lower and upper case and some other parameters too. Let's understand the following example.
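The idea of ranking a list of choices against a query can be sketched with the standard library alone. The function names extract_sketch and extract_one_sketch, the scorer and the example data are all hypothetical; the library's process.extract() and process.extractOne() use their own scorer and preprocessing.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """Case-insensitive similarity score from 0 to 100."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, limit=3):
    """Return up to limit (choice, score) pairs, best match first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def extract_one_sketch(query, choices):
    """Return the single best (choice, score) pair, or None if there are no choices."""
    ranked = extract_sketch(query, choices, limit=1)
    return ranked[0] if ranked else None


# Hypothetical company names, echoing the table-joining use case.
choices = ["apple inc", "Apple Incorporated", "Microsoft Corp", "Alphabet"]
ranked = extract_sketch("Apple Inc.", choices, limit=3)
best = extract_one_sketch("Apple Inc.", choices)
print(ranked)
print(best)
```

Returning the score alongside each choice lets the caller apply a threshold and reject weak matches instead of always accepting the top result.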
In this tutorial, we have discussed how to match strings and determine how close they are to each other. The examples are simple, but they are enough to show how a computer treats mismatched strings. Many real-life applications, such as spell checking and DNA sequence matching in bioinformatics, are based on fuzzy string matching.