Huffman encoding

What is encoding?

Encoding is the process of converting data from one form or representation to another. Such conversion is needed for several purposes, including data storage, transmission, and processing. Encodings come in many formats, each tailored to a specific context, and apply to many data types, including text, numerical data, images, and audio.

Introduction

Efficiency matters in data storage, transmission, and processing. Many data compression techniques have been developed to make the most of limited storage and bandwidth. Among these, Huffman encoding stands out as an effective way of reducing data size without losing any information. This article looks at the concept of Huffman encoding, a worked example, an implementation, and its applications.

The Huffman Encoding Concept

Huffman encoding, also known as Huffman coding, is a lossless data compression method published in 1952 by David A. Huffman. It is based on a simple yet powerful principle: symbols that appear more frequently are assigned shorter binary codes, whereas symbols that occur less frequently are assigned longer codes. Because common symbols use fewer bits, the overall data size is reduced. No code is a prefix of another code, so the encoded bit stream can be decoded unambiguously.
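
To make the saving concrete, here is a minimal sketch comparing a fixed-length code with a variable-length one. The message and the code table `variable_codes` are made-up illustrations, not taken from a real file.

```python
from collections import Counter

# Hypothetical message in which 'a' is far more common than 'b' or 'c'.
message = "aaaaaabc"
frequencies = Counter(message)

# Illustrative prefix code: the most frequent symbol gets the shortest code.
variable_codes = {"a": "0", "b": "10", "c": "11"}

# A fixed-length code needs 2 bits per symbol to tell 3 symbols apart.
fixed_bits = 2 * len(message)
variable_bits = sum(frequencies[s] * len(variable_codes[s]) for s in frequencies)

print("Fixed-length bits:", fixed_bits)        # 16
print("Variable-length bits:", variable_bits)  # 10
```

The more skewed the frequencies are, the larger the gap between the two totals becomes.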

How Does Huffman Encoding Work?

Frequency Analysis: The first stage in Huffman encoding is to analyze the input data and generate a frequency table that records how often each symbol occurs. Symbols can be text characters, pixel values in a picture, or any other data unit.
Building the Huffman Tree: The Huffman tree is built from the frequency table. Each symbol starts as a leaf node. The two least frequent nodes are combined into a new node whose frequency is their sum, and this procedure is repeated until only one node remains, which becomes the tree's root. Higher-frequency symbols end up closer to the root.
Code Assignment: Once the tree is complete, a binary code is assigned to each symbol. A '0' is appended to the code when traversing a left branch, and a '1' is appended when traversing a right branch. The path from the root to a leaf node is the code for that leaf's symbol.
Data Compression: The input data is compressed by replacing each symbol with its Huffman code. The resulting bit stream is shorter than a fixed-length encoding whenever the symbol frequencies are uneven.

Example

Assume you have a text file containing the following characters and their frequencies:

A: 5
B: 9
C: 12
D: 13
E: 16

Step 1: Analyse Frequency

To begin, make a frequency table for the characters in your input data:

Character  |  Frequency
-----------------------
    A     |      5
    B     |      9
    C     |     12
    D     |     13
    E     |     16

Step 2: Constructing the Huffman Tree

You now construct a Huffman tree using these frequencies. Create a leaf node for each character, labelled with its frequency.

Then, continually combine the two nodes with the lowest frequencies to produce a new internal node with the total of the two nodes' frequencies. Continue this procedure until only one node is left, which will be the Huffman tree's root.
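
The merge order for the example frequencies can be traced with a short sketch based on Python's `heapq` module. The `merges` list and the bracketed labels are illustrative helpers introduced here to show which nodes are combined at each step; they are not part of any standard API.

```python
import heapq

# Frequencies from the example above.
frequencies = {"A": 5, "B": 9, "C": 12, "D": 13, "E": 16}

# Each heap entry is (frequency, label); heapq pops the smallest frequency first.
heap = [(freq, symbol) for symbol, freq in frequencies.items()]
heapq.heapify(heap)

merges = []
while len(heap) > 1:
    freq1, label1 = heapq.heappop(heap)
    freq2, label2 = heapq.heappop(heap)
    merged = (freq1 + freq2, "(" + label1 + "+" + label2 + ")")
    merges.append((label1, label2, merged[0]))
    heapq.heappush(heap, merged)

for label1, label2, total in merges:
    print(label1, "+", label2, "->", total)
# A + B -> 14
# C + D -> 25
# (A+B) + E -> 30
# (C+D) + ((A+B)+E) -> 55
```

The last merged node has frequency 55, the sum of all character frequencies, and is the root of the tree.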

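For the frequencies above, A (5) and B (9) are merged first into a node of weight 14. C (12) and D (13) are merged next into a node of weight 25. The 14 node and E (16) then form a node of weight 30, and finally the 25 and 30 nodes form the root of weight 55. Placing the lower-frequency node on the left at each merge, the completed Huffman tree looks like this:

              (55)
            0/    \1
         (25)      (30)
        0/  \1    0/  \1
        C    D  (14)    E
               0/  \1
               A    B

Step 3: Assigning a Code

Now, based on its location in the tree, each character receives a binary code. Starting at the root, append '0' for each left branch and '1' for each right branch. The path from the root to a character's leaf is that character's Huffman code.

A: 100
B: 101
C: 00
D: 01
E: 11

The more frequent characters C, D, and E receive 2-bit codes, while the less frequent A and B receive 3-bit codes.

Step 4: Data Compression

With the Huffman codes assigned, you can now encode your input data. For example, if your original text was "BEAD", the encoded version would be "1011110001" (101 for B, 11 for E, 100 for A, 01 for D).

To decode the data, begin at the root of the Huffman tree and follow the code bits, left for '0' and right for '1', until you reach a leaf node representing a character. Output that character, return to the root, and continue with the next bit.

In this case, "1011110001" decodes to "BEAD".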
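
Steps 2 to 4 can also be sketched compactly in Python for the example frequencies. The function names `huffman_codes`, `encode`, and `decode` are illustrative, and the sketch assumes at least two distinct symbols. It places the first node popped from the heap on the left ('0'); other conventions or tie-breaking rules yield different codes with the same total encoded length.

```python
import heapq
from itertools import count


def huffman_codes(frequencies):
    """Return a {symbol: code} table for a {symbol: frequency} mapping."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tiebreak), {symbol: ""})
            for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq1, _, left = heapq.heappop(heap)
        freq2, _, right = heapq.heappop(heap)
        # Prefix '0' to every code in the left subtree and '1' in the right.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (freq1 + freq2, next(tiebreak), merged))
    return heap[0][2]


def encode(text, codes):
    return "".join(codes[symbol] for symbol in text)


def decode(bits, codes):
    """Grow a buffer bit by bit until it matches a code (valid for prefix codes)."""
    lookup = {code: symbol for symbol, code in codes.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:
            decoded.append(lookup[buffer])
            buffer = ""
    return "".join(decoded)


codes = huffman_codes({"A": 5, "B": 9, "C": 12, "D": 13, "E": 16})
bits = encode("BEAD", codes)
print(codes)                # A: 100, B: 101, C: 00, D: 01, E: 11 (in some order)
print(bits)                 # 1011110001
print(decode(bits, codes))  # BEAD
```

Decoding with a lookup table works here only because no Huffman code is a prefix of another, so the first match in the buffer is always the right one.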

Implementation

import heapq
from collections import Counter


class HuffmanNode:
    """A node in the Huffman tree; leaves carry a symbol, internal nodes do not."""

    def __init__(self, symbol, frequency):
        self.symbol = symbol
        self.frequency = frequency
        self.left = None
        self.right = None

    def __lt__(self, other):
        # Lets heapq order nodes by frequency.
        return self.frequency < other.frequency


def build_huffman_tree(data):
    """Build a Huffman tree for data and return its root (None for empty data)."""
    if not data:
        return None
    # Count how often each symbol occurs.
    frequency_table = Counter(data)
    priority_queue = [HuffmanNode(symbol, frequency)
                      for symbol, frequency in frequency_table.items()]
    heapq.heapify(priority_queue)
    # Repeatedly merge the two least frequent nodes until one root remains.
    while len(priority_queue) > 1:
        left = heapq.heappop(priority_queue)
        right = heapq.heappop(priority_queue)
        parent = HuffmanNode(None, left.frequency + right.frequency)
        parent.left = left
        parent.right = right
        heapq.heappush(priority_queue, parent)
    return priority_queue[0]


def build_huffman_codes(node, code, huffman_codes):
    """Walk the tree, appending '0' for a left branch and '1' for a right branch."""
    if node is None:
        return
    if node.left is None and node.right is None:
        # Leaf node. A tree with a single symbol has an empty path, so use '0'.
        huffman_codes[node.symbol] = code or '0'
        return
    build_huffman_codes(node.left, code + '0', huffman_codes)
    build_huffman_codes(node.right, code + '1', huffman_codes)


def huffman_encode(data):
    """Return the encoded bit string and the tree needed to decode it."""
    root = build_huffman_tree(data)
    huffman_codes = {}
    build_huffman_codes(root, '', huffman_codes)
    encoded_data = ''.join(huffman_codes[symbol] for symbol in data)
    return encoded_data, root


def huffman_decode(encoded_data, root):
    """Follow the bits from the root; emit a symbol at each leaf, then restart."""
    if root is None:
        return ''
    if root.left is None and root.right is None:
        # Single-symbol tree: every bit stands for that one symbol.
        return root.symbol * len(encoded_data)
    decoded_data = []
    node = root
    for bit in encoded_data:
        node = node.left if bit == '0' else node.right
        if node.left is None and node.right is None:
            decoded_data.append(node.symbol)
            node = root
    return ''.join(decoded_data)


data = "this is an example for huffman encoding"
encoded_data, huffman_tree = huffman_encode(data)
decoded_data = huffman_decode(encoded_data, huffman_tree)
print("Original data:", data)
print("Encoded data:", encoded_data)
print("Decoded data:", decoded_data)

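Output:

The program prints the original text, the encoded data as a string of '0' and '1' characters, and the decoded text. The exact encoded bits depend on how ties between equal frequencies are broken, but the decoded data is always identical to the original: "this is an example for huffman encoding".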
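
Note that an encoder that returns a string of '0' and '1' characters stores each bit as a full character, so the string itself is not smaller than the input. To realize the saving, the bits have to be packed into bytes. The following is a minimal sketch under that assumption; `pack_bits` and `unpack_bits` are hypothetical helper names, and the sample bit string is made up.

```python
def pack_bits(bit_string):
    """Pack a string of '0'/'1' characters into bytes.

    Returns (packed_bytes, padding), where padding is the number of zero
    bits appended to fill the last byte.
    """
    padding = (8 - len(bit_string) % 8) % 8
    padded = bit_string + "0" * padding
    packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return packed, padding


def unpack_bits(packed, padding):
    """Reverse pack_bits: expand bytes to bits and drop the padding."""
    bit_string = "".join(format(byte, "08b") for byte in packed)
    return bit_string[:len(bit_string) - padding] if padding else bit_string


bits = "1011110001"  # hypothetical encoder output
packed, padding = pack_bits(bits)
print(len(bits), "bit characters ->", len(packed), "bytes")  # 10 bit characters -> 2 bytes
print(unpack_bits(packed, padding) == bits)                  # True
```

The padding count must be stored alongside the packed bytes, otherwise the decoder could interpret the trailing zero bits as extra symbols.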

Huffman Encoding Applications

Huffman encoding is used in a variety of areas, including:

File Compression: ZIP and GZIP use the DEFLATE algorithm, which combines LZ77 dictionary matching with Huffman coding, to reduce file size for storage or transmission. This is particularly useful for storing large amounts of data and sending it across the internet.
Image Compression: Formats such as JPEG use Huffman coding as an entropy-coding stage to represent image data compactly, which allows faster transmission and more economical storage.
Text Compression: Huffman encoding is used for text compression, frequently in conjunction with other techniques, to reduce the space taken up by text documents.
Network Data Transmission: In data communication, Huffman encoding reduces the amount of data carried across a network, conserving bandwidth and speeding up transfers.
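
As a quick illustration of the file-compression case, Python's standard `zlib` module implements DEFLATE, which uses Huffman coding as one of its stages. The sample text below is a made-up, deliberately repetitive input, so the ratio it shows is not representative of real files.

```python
import zlib

# Hypothetical sample: repetitive text compresses well.
raw = ("this is an example for huffman encoding " * 50).encode("utf-8")

compressed = zlib.compress(raw, 9)  # 9 = highest compression level
restored = zlib.decompress(compressed)

print("Original size:", len(raw), "bytes")
print("Compressed size:", len(compressed), "bytes")
print("Round trip OK:", restored == raw)
```

Most of the reduction here comes from DEFLATE's matching of repeated substrings; the Huffman stage then shortens the remaining symbols.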

Conclusion

Huffman encoding is a key idea in information theory and data compression. Its ability to reduce data size without losing information has made it a foundational component of many compression methods and applications. Understanding how it works, and where it is applied, helps you choose and combine compression techniques effectively. Huffman encoding remains a useful technique whether you are working with text, images, or other data types.
