FP-Growth Algorithm in Python

In the era of big data, extracting meaningful insights from vast datasets is a critical task for organizations, scientists, and data analysts. One key challenge is finding patterns and relationships within this data that can yield actionable information for decision-making, marketing strategies, and much more. The FP-Growth (Frequent Pattern Growth) algorithm, introduced by Han et al. in 2000, is a powerful and efficient tool designed to address this challenge.

The FP-Growth (Frequent Pattern Growth) algorithm is especially well suited to mining frequent itemsets: sets of items or attributes that frequently occur together in a dataset. These itemsets act as the building blocks for producing association rules, which describe relationships and dependencies between items or attributes. FP-Growth has found applications in a wide range of domains, from retail and e-commerce to bioinformatics and web usage mining.

Principles of the FP-Growth Algorithm

The FP-Growth (Frequent Pattern Growth) algorithm is built on a set of key principles and data structures that enable it to efficiently discover frequent itemsets in large datasets. To fully grasp how the algorithm works, it is essential to understand these principles:

1. Frequent Item sets:

At the core of the FP-Growth algorithm are frequent itemsets. These are sets of items (or attributes) that appear together in a dataset with a frequency exceeding a predefined threshold known as the minimum support. The minimum support is a user-defined parameter that determines the minimum number of times an itemset must occur to be considered "frequent." Itemsets that meet this support threshold are of interest because they represent patterns or relationships within the data.
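For example, the support counting behind this definition can be shown in a few lines of standard-library Python (the transactions below are hypothetical, invented purely for illustration):

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions: each is a set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
min_support = 3  # an itemset must appear in at least 3 transactions

# Count every itemset of size 1 and 2 across all transactions.
counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
print(frequent)
```

With min_support = 3, only ('bread',), ('milk',), ('butter',), and ('bread', 'milk') survive the cut. This brute-force recounting is exactly what FP-Growth avoids: the FP-Tree encodes the same information compactly after a single counting pass.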

2. FP-Tree (Frequent Pattern Tree)

The central data structure used by FP-Growth is the FP-Tree (Frequent Pattern Tree). This tree-like structure compactly encodes the transactional dataset, making it possible to perform frequent itemset mining in a highly optimized manner.

FP-Tree Components:

  • Nodes: Every node in the FP-Tree represents a single item from the dataset, along with a count of how many transactions pass through that node.
  • Branches: Branches between nodes indicate that the items represented by the connected nodes occur together in one or more transactions.
  • Header Table: This table accompanies the FP-Tree and contains information about the items, their frequencies, and pointers to the nodes where they are located in the tree. The Header Table is essential for efficient navigation within the tree during the mining process.

3. FP-Tree Construction:

Building the FP-Tree involves a few fundamental stages:

i. Scanning the Dataset

First, the dataset is scanned to count the frequency of every item. This step identifies which items meet or exceed the minimum support threshold.

ii. Sorting Items

The identified frequent items are then arranged in descending order of their frequency. Ordering the items this way ensures that the most frequent items appear closest to the root, which keeps the tree compact.

iii. Building the FP-Tree

For every transaction in the dataset, items are inserted into the FP-Tree one by one, in descending order of their frequency. If an item already exists on the current path, the count of its node is incremented; otherwise, a new branch is created from the current node to represent that item.

By building the FP-Tree in this way, the algorithm efficiently captures the structure of frequent itemsets in the dataset, facilitating the subsequent mining step.
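The construction steps above can be sketched in plain Python. This is a minimal sketch, assuming a simple Node class and dictionaries for children; the transactions and item names are hypothetical:

```python
from collections import Counter, defaultdict

class Node:
    """One FP-Tree node: an item, its count, and links to children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count item frequencies and keep only the frequent items.
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}

    root = Node(None, None)
    header = defaultdict(list)  # header table: item -> nodes holding it

    # Pass 2: insert each transaction, most frequent items first.
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item in node.children:     # shared prefix: bump the count
                node.children[item].count += 1
            else:                         # new branch for this item
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header

transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
root, header = build_fp_tree(transactions, min_support=2)
```

After these four transactions, the root has a single child "b" with count 4, because every transaction shares that most-frequent prefix; "c" appears as two separate nodes, which the header table tracks.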

4. Mining Frequent Item sets

Once the FP-Tree is constructed, the FP-Growth algorithm proceeds to mine frequent itemsets efficiently. The mining process consists of the following fundamental stages:

i. Mining Conditional FP-Trees

Beginning with the least frequent item, the algorithm builds a conditional FP-Tree for each item from the prefix paths that lead to it, then recursively mines that smaller tree. This process produces the frequent itemsets that contain each item.

ii. Combining Results:

The frequent itemsets obtained from the conditional FP-Trees are combined to produce the final set of frequent itemsets for the whole dataset.

In this way, the FP-Growth algorithm discovers frequent patterns and relationships within the data, making it a valuable tool for a wide range of applications.
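To make the conditional step concrete, here is a small sketch. The prefix paths below are a hypothetical hand-copied excerpt of an FP-Tree; the conditional pattern base for an item is simply those prefixes weighted by the item's count at the end of each path:

```python
from collections import Counter

# Hypothetical prefix paths ending at item "c", read off an FP-Tree:
# each entry is (prefix_items, count_of_"c"_at_the_end_of_that_path).
prefix_paths = [
    (("b",), 1),        # path b -> c occurred once
    (("b", "a"), 1),    # path b -> a -> c occurred once
]

# The conditional pattern base for "c" counts each prefix item,
# weighted by how often "c" appeared below it.
cond_base = Counter()
for prefix, count in prefix_paths:
    for item in prefix:
        cond_base[item] += count

print(dict(cond_base))  # items that co-occur with "c", with counts
```

Here "b" co-occurs with "c" twice and "a" only once, so with a minimum support of 2 this branch of the recursion would report the itemset ("b", "c") but not ("a", "c").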

Steps Involved in the FP-Growth Algorithm

The FP-Growth algorithm follows a recursive approach to mine frequent itemsets. The main steps are as follows:

Data Preprocessing:

Gather transactional data: collect the data as transactions, where every transaction consists of a set of items. These items can represent products purchased by customers, items in a basket, or anything in a similar context.

Determine Minimum Support:

Set a minimum support threshold, which is the minimum number of times an itemset must appear in the transactions to be considered "frequent." This threshold is typically defined by the user based on the dataset and the specific requirements.

Scan the Dataset and Count Item Frequencies:

Go through the dataset once and count the frequency of every item in the transactions. This step identifies which items meet or exceed the minimum support threshold.

Sort Items by Frequency:

Sort the frequent items in descending order of their frequency. Arranging the items in this manner guarantees that the most frequent items appear first, which is vital for the efficiency of the algorithm.
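For example, assuming hypothetical global frequencies from the first scan, each transaction would be reordered like this before insertion:

```python
# Hypothetical global item frequencies from the first dataset scan.
freq = {"item2": 4, "item3": 3, "item1": 2, "item4": 2}

transaction = ["item4", "item1", "item2"]
# Descending frequency, with the item name as a deterministic tie-breaker.
ordered = sorted(transaction, key=lambda i: (-freq[i], i))
print(ordered)  # ['item2', 'item1', 'item4']
```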

Build the FP-Tree:

Create an empty FP-Tree. For every transaction in the dataset, insert its items into the FP-Tree one by one, in the order determined by their frequency.

If an item already exists on the current path in the tree, increment the count of that item's node. If it does not, create a new node for the item.

This process results in the growth of the FP-Tree, which efficiently captures the frequency and structure of itemsets in the dataset.

Create the Header Table:

Alongside the FP-Tree, create a Header Table. This table contains a list of items along with pointers to their corresponding nodes in the FP-Tree. The Header Table is essential for efficient navigation within the tree during the mining process.

Mine Conditional FP-Trees:

Beginning with the least frequent item, recursively mine conditional FP-Trees. For every item, collect its prefix paths from the FP-Tree to build a conditional FP-Tree for that item.

Apply the FP-Growth algorithm recursively to the conditional FP-Tree to find the frequent itemsets that contain that item.

Continue this process for each frequent item, creating conditional FP-Trees and mining frequent itemsets.

Combine Results:

Combine the frequent itemsets obtained from the conditional FP-Trees to produce the final set of frequent itemsets for the whole dataset.

Generate Association Rules (Optional):

If desired, generate association rules from the frequent itemsets. Association rules typically consist of two parts: an antecedent (left-hand side) and a consequent (right-hand side).

Compute the confidence of each rule, which is a measure of how often the rule holds true. Association rules can help uncover meaningful relationships between items.
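Confidence is simply the support of the full itemset divided by the support of the antecedent. With hypothetical support counts:

```python
# Hypothetical support counts from a frequent-itemset pass.
support = {("item2",): 4, ("item2", "item3"): 3}

# Confidence of the rule (item2) => (item3):
confidence = support[("item2", "item3")] / support[("item2",)]
print(confidence)  # 0.75
```

A confidence of 0.75 means that 75% of the transactions containing item2 also contain item3.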

Output Frequent Item sets and Association Rules:

Finally, present the frequent itemsets and association rules to the user. These results give insights into patterns and relationships within the dataset, which can be valuable for applications such as market basket analysis, recommendation systems, and more.

Example:

For this example, we will use a simplified transaction dataset where each list represents a transaction containing items:

Step 1: Importing the Required Library

We need to import the pyfpgrowth library to use the FP-Growth algorithm:

Syntax:

Step 2: Setting the Minimum Support Threshold

Define the minimum support threshold, which determines the minimum number of times an itemset must appear in the transactions to be considered "frequent." In this example, let us set the minimum support to 2:

Step 3: Finding Frequent Item sets

Use pyfpgrowth.find_frequent_patterns to find frequent item sets in the transactions:

Here we declare a variable patterns; the call returns a dictionary in which the keys are frequent itemsets and the values are their support counts. We pass transactions and min_support as the parameters.

Step 4: Generating Association Rules

Use pyfpgrowth.generate_association_rules to generate association rules from the frequent itemsets:

Here we declare a variable rules and pass patterns and min_confidence as the parameters. In this step, you can adjust the min_confidence parameter to set a minimum confidence threshold for your association rules.

Step 5: Printing the Results

Finally, you can print the frequent itemsets and association rules, along with their support and confidence values:

Code:

The code above will display the frequent item sets and association rules found in your dataset, along with their respective support and confidence values.
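The steps above rely on the third-party pyfpgrowth package (installable with pip install pyfpgrowth), whose two calls are pyfpgrowth.find_frequent_patterns(transactions, min_support) and pyfpgrowth.generate_association_rules(patterns, min_confidence). For readers without that package installed, here is a self-contained standard-library sketch of the same pipeline on a small hypothetical dataset. It enumerates itemsets by brute force rather than building an FP-Tree, so it illustrates the interface and the definitions of support and confidence, not FP-Growth's performance:

```python
from itertools import combinations
from collections import Counter

def find_frequent_patterns(transactions, min_support):
    """Return {itemset_tuple: support_count} for all itemsets that
    appear in at least min_support transactions (brute force)."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

def generate_association_rules(patterns, min_confidence):
    """Return rules keyed by antecedent, as {antecedent: (consequent,
    confidence)}, keeping only rules that meet min_confidence."""
    rules = {}
    for itemset, support in patterns.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                if antecedent not in patterns:
                    continue
                consequent = tuple(i for i in itemset if i not in antecedent)
                confidence = support / patterns[antecedent]
                if confidence >= min_confidence:
                    rules[antecedent] = (consequent, confidence)
    return rules

# Hypothetical transactions for illustration.
transactions = [
    ["item1", "item2", "item3"],
    ["item2", "item3"],
    ["item1", "item2"],
    ["item2", "item4"],
]
patterns = find_frequent_patterns(transactions, min_support=2)
rules = generate_association_rules(patterns, min_confidence=0.7)
print(patterns)
print(rules)
```

Unlike the real library, this sketch enumerates every subset of every transaction, which is exponential in transaction length; it is meant only to show what the two calls compute on a toy input.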

Example:

Output:

Here are the Frequent Item sets:
('item2',): 4
('item2', 'item3'): 3
('item2', 'item4'): 2
('item1', 'item2'): 2
('item1', 'item2', 'item3'): 2
('item1', 'item2', 'item4'): 2
Here are the Association Rules:
('item1',) => ('item2', 'item3') (Confidence: 1.0)
('item1', 'item2') => ('item3',) (Confidence: 1.0)
('item1', 'item2') => ('item4',) (Confidence: 1.0)
('item1', 'item2', 'item3') => ('item4',) (Confidence: 1.0)
('item1', 'item2', 'item4') => ('item3',) (Confidence: 1.0)
('item2', 'item3') => ('item4',) (Confidence: 0.6666666666666666)
('item2', 'item4') => ('item3',) (Confidence: 1.0)
('item3',) => ('item2', 'item4') (Confidence: 0.6666666666666666)
('item4',) => ('item2', 'item3') (Confidence: 0.6666666666666666)

Explanation:

The output of the FP-Growth algorithm reveals significant insights into the given dataset. First, the frequent itemsets show which combinations of items frequently occur together. Notably, 'item2' appears in all transactions, making it the most frequent item. The itemset ('item2', 'item3') is also noteworthy, suggesting that 'item3' is frequently purchased along with 'item2'. Furthermore, the algorithm identified ('item2', 'item4') as a frequent itemset, indicating a common association between 'item2' and 'item4'.

Additionally, the association rules shed light on the connections between these items. For instance, the rule ('item1',) => ('item2', 'item3') suggests that whenever 'item1' is purchased, 'item2' and 'item3' are consistently included in the transaction. The confidence of 1.0 signifies a very strong association. Likewise, other rules such as ('item1', 'item2') => ('item3',) and ('item1', 'item2') => ('item4',) reveal consistent associations among these items.

Applications of FP-Growth Algorithm:

The FP-Growth algorithm finds applications in many domains because of its ability to efficiently mine frequent itemsets and discover relationships within huge datasets. Here are a few practical applications of the algorithm:

  • Market Basket Analysis:

Retailers use FP-Growth for market basket analysis to understand customer purchasing behavior. By identifying frequently co-occurring products, retailers can optimize store layouts and product placements and design effective cross-selling and upselling strategies.

  • Recommender Systems:

Recommender systems in e-commerce and content platforms use frequent itemsets to make personalized product or content recommendations. FP-Growth identifies item associations, enabling such systems to suggest related items or content to users based on their preferences.

  • Anomaly Detection:

In cybersecurity and fraud detection, FP-Growth can be applied to recognize unusual patterns or sequences of events that may indicate suspicious activity. By detecting anomalous itemsets or sequences, irregularities can be flagged for further investigation.

  • Bioinformatics:

In bioinformatics, FP-Growth is used to find associations among genes, proteins, or biological processes in large datasets. This information helps in understanding genetic interactions, disease pathways, and drug discovery.

  • Web Usage Mining:

Analyzing user behavior on websites is essential for improving the user experience and refining content recommendations. FP-Growth recognizes patterns in user navigation, leading to better site design and personalized content suggestions.

  • Text Mining:

In text analysis, FP-Growth can be applied to find co-occurring terms or phrases in documents. This is valuable for topic modeling, document clustering, and identifying frequently associated words in natural language processing tasks.

Advantages of the FP-Growth algorithm

The FP-Growth algorithm offers several benefits that make it a valuable tool for frequent pattern mining and association rule generation:

  • Efficiency:

FP-Growth is highly efficient, especially for large datasets, as it requires only two passes over the data: one to count item frequencies and another to build the FP-Tree. This efficiency makes it faster than many other frequent pattern mining algorithms, such as the Apriori algorithm.

  • Compact Data Representation:

The FP-Tree structure encodes the dataset compactly, reducing memory requirements compared with repeated scans of the raw transaction database. This compact representation also speeds up pattern mining.

  • Reduced Candidate Generation:

Unlike Apriori, FP-Growth does not generate candidate itemsets explicitly, which reduces computational overhead. It mines frequent itemsets directly from the FP-Tree, eliminating the need for multiple candidate-generation iterations.

  • Parallelization:

FP-Growth can be parallelized effectively, because FP-Trees for different subsets of the data can be constructed simultaneously, making the algorithm well suited to distributed and parallel computing environments.

  • Scalability:

The algorithm's efficiency and reduced memory use make it scalable to very large datasets, which is critical in the era of big data.

  • Mining Rare Patterns:

FP-Growth is effective at mining not only frequent patterns but also rare or infrequent patterns (by lowering the support threshold), giving a more comprehensive view of the associations in the data.

  • Highly Customizable:

Users can adjust the minimum support and minimum confidence thresholds according to their specific requirements, allowing flexible pattern mining.

  • Association Rule Generation:

FP-Growth not only finds frequent itemsets but also generates association rules with associated confidence values, providing deeper insight into the relationships in the data.

  • Wide Range of Applications:

The algorithm has applications across many domains, including retail, healthcare, web mining, and more, making it versatile for a variety of use cases.

  • Robustness to Data Noise:

FP-Growth is relatively robust to noise and outliers in the data, making it suitable for real-world datasets that may contain inconsistencies.

  • Real-time Analytics:

FP-Growth can be used for real-time analytics, allowing organizations to make timely decisions based on up-to-date transaction data.

Disadvantages of the FP-Growth algorithm

While the FP-Growth algorithm offers many benefits, it also has specific limitations and disadvantages:

  • Memory Utilization:

In some cases, building the FP-Tree can require significant memory, especially when dealing with extremely large datasets. This can lead to memory-related issues and limit the algorithm's applicability on machines with restricted RAM.

  • Initial Scan:

Although FP-Growth reduces the number of passes over the data compared with Apriori, it still requires an initial scan to count item frequencies. For very large datasets, this initial scan can be time-consuming.

  • Sparse Datasets:

FP-Growth may not perform as efficiently on very sparse datasets, where most items have low support. In such cases, the FP-Tree may not provide significant benefits over other algorithms.

  • Difficulty in Handling Streaming Data:

FP-Growth is typically designed for batch processing of static datasets. Adapting it to handle streaming data with dynamic updates can be challenging and may require additional techniques.

  • Lack of Parameter Tuning Guidance:

Determining suitable values for the minimum support and minimum confidence thresholds can be a challenge. Choosing these thresholds incorrectly may lead to uninteresting or overwhelming results.

  • Complex Implementation:

Implementing the FP-Growth algorithm from scratch can be complicated, especially the construction and manipulation of the FP-Tree and the header table. This complexity can be a barrier for those not familiar with the algorithm.

  • Limited to Categorical Data:

FP-Growth is primarily designed for categorical data, where items are discrete rather than continuous. Handling continuous or numerical data may require preprocessing or binning.

  • Limited to Single-Level Item sets:

The basic FP-Growth algorithm mines frequent itemsets of any size; however, it may not directly support mining itemsets across multiple levels or hierarchies. Additional adaptations may be required for such cases.

  • Interpretability of Rules:

While FP-Growth generates association rules, interpreting those rules can be challenging, especially when dealing with a large number of rules or complex itemsets.

  • Not Suitable for All Data Types:

FP-Growth is most effective for transactional data, where items co-occur in transactions. It may not be appropriate for other kinds of data, such as sequences or time-series data.