Suppose we have a data file of 100,000 (10^5) characters. Normal storage: 8 bits per character (ASCII), i.e. 8 x 10^5 bits for the file. But we want to compress the file and store it compactly. Suppose only six characters appear in the file, with the following frequencies (in thousands):
a 45 b 13 c 12 d 16 e 9 f 5
How can we represent the data in a compact way?
(i) Fixed-length code: Each letter is represented by the same number of bits. With six characters, a fixed-length code needs at least 3 bits per character, since 2 bits give only 4 distinct codewords:
a 000 b 001 c 010 d 011 e 100 f 101
For a file with 10^5 characters, we need 3 x 10^5 bits.
(ii) Variable-length code: It can do considerably better than a fixed-length code by giving frequent characters short codewords and infrequent characters long codewords.
a 0 b 101 c 100 d 111 e 1101 f 1100
Number of bits = (45 x 1 + 13 x 3 + 12 x 3 + 16 x 3 + 9 x 4 + 5 x 4) x 1000 = 2.24 x 10^5 bits
Thus, 224,000 bits are enough to represent the file, a saving of approximately 25% over the 300,000 bits of the fixed-length code. This is an optimal character code for this file.
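The arithmetic above can be checked with a few lines of Python. This is only a sketch: the frequencies (in thousands) are read off the bit-count formula, and the names `freq`, `fixed`, `variable` and `total_bits` are illustrative, not part of the original text.

```python
# Character frequencies in thousands, taken from the bit-count formula.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}

# The two codes from the text.
fixed = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100", "f": "101"}
variable = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}


def total_bits(code, freq, scale=1000):
    """Sum of (frequency x codeword length) over all characters."""
    return scale * sum(freq[ch] * len(code[ch]) for ch in freq)


fixed_bits = total_bits(fixed, freq)        # 300000
variable_bits = total_bits(variable, freq)  # 224000
saving = 1 - variable_bits / fixed_bits     # roughly 0.253

print(fixed_bits, variable_bits, round(saving * 100, 1))
```

The saving comes out at about 25.3%, which matches the "approximately 25%" quoted above.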
No prefix of the encoding of one character may be equal to the complete encoding of another character. Codes with this property, in which no codeword is a prefix of any other codeword, are called prefix codes. For example, 1100 and 11001 cannot both be codewords of a prefix code, because 1100 is a prefix of 11001.
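The prefix condition is easy to test mechanically. The following is a minimal sketch, assuming codewords are given as strings of '0' and '1'; the helper name `is_prefix_free` is hypothetical, and the quadratic pairwise check is chosen for clarity rather than speed.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of a different codeword."""
    words = list(codewords)
    for i, first in enumerate(words):
        for j, second in enumerate(words):
            # Skip comparing a codeword with itself.
            if i != j and second.startswith(first):
                return False
    return True


variable_code = ["0", "101", "100", "111", "1101", "1100"]
ok_variable = is_prefix_free(variable_code)    # True
ok_invalid = is_prefix_free(["1100", "11001"])  # False: 1100 is a prefix of 11001
print(ok_variable, ok_invalid)
```

Any fixed-length code with distinct codewords passes this check as well, since a codeword can only be a prefix of an equally long codeword if the two are identical.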
Prefix codes are desirable because they simplify decoding. Encoding is always simple for any binary character code: we concatenate the codewords representing each character of the file. Decoding is also straightforward with a prefix code. Since no codeword is a prefix of any other, the codeword that begins an encoded file is unambiguous: we identify that first codeword, translate it back to the original character, and repeat the process on the remainder of the encoded file.
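Encoding and decoding with a prefix code can be sketched as follows. This is an illustrative implementation, not a prescribed one: the function names `encode` and `decode` are assumptions, and the bit stream is modelled as a Python string of '0' and '1' characters for readability.

```python
# Variable-length prefix code from the text.
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}


def encode(text, code):
    """Concatenate the codeword of every character."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Read bits left to right, emitting a character whenever the bits
    collected so far form a complete codeword. This is unambiguous only
    because the code is a prefix code."""
    inverse = {word: ch for ch, word in code.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)


bits = encode("abc", code)             # "0" + "101" + "100"
decoded = decode("001011101", code)    # parses as 0 . 0 . 101 . 1101
print(bits, decoded)
```

Here "abc" encodes to 0101100, and the string 001011101 splits uniquely into 0, 0, 101, 1101, which decodes to "aabe".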
Greedy Algorithm for constructing a Huffman Code:
Huffman invented a greedy algorithm that creates an optimal prefix code called a Huffman Code.
The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It starts with a set of |C| leaves (C is the set of characters, so |C| is the number of characters) and performs |C| - 1 'merging' operations to create the final tree. Each merge takes the two least-frequent nodes and replaces them with a new node whose frequency is the sum of theirs. In the Huffman algorithm, n denotes the number of characters in C, z denotes the newly created parent node, and x and y are the left and right children of z, respectively.
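The bottom-up construction can be sketched in Python with the standard `heapq` module acting as the min-priority queue. This is one possible rendering under stated assumptions, not the definitive algorithm text: the tree is represented by nested tuples `(x, y)` with single-character strings as leaves, the input is assumed to contain at least two characters, and the names `huffman` and `codewords` are illustrative. The variables `n`, `x`, `y` and `z` follow the description above.

```python
import heapq
from itertools import count


def huffman(freq):
    """Build a Huffman tree from a {character: frequency} mapping.
    Leaves are characters; an internal node z is the tuple (x, y)."""
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    n = len(heap)
    for _step in range(n - 1):            # exactly n - 1 merging operations
        fx, _tx, x = heapq.heappop(heap)  # least-frequent node: left child
        fy, _ty, y = heapq.heappop(heap)  # next least-frequent: right child
        z = (x, y)                        # new parent node
        heapq.heappush(heap, (fx + fy, next(tie), z))
    return heap[0][2]                     # the single remaining node is the root


def codewords(tree, prefix=""):
    """Label left edges 0 and right edges 1, and read off each leaf's path."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    table = codewords(left, prefix + "0")
    table.update(codewords(right, prefix + "1"))
    return table


freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
table = codewords(huffman(freq))
cost = sum(freq[ch] * len(table[ch]) for ch in freq)  # 224 (thousand bits)
print(table, cost)
```

For the example frequencies, the resulting codeword lengths are 1 for a, 3 for b, c and d, and 4 for e and f, giving a cost of 224 thousand bits. Swapping the left and right child at any merge changes the individual codewords but not their lengths, so it gives an equally good code.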