Huffman Coding
From Wikipedia:
“In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol.”
Since Huffman encoding uses variable-length codes, it needs a mechanism to distinguish the code of each symbol; this mechanism is called the prefix condition. It means that no code can be a prefix of another. Another way to state this is that for any two distinct codes Ci and Cj, Ci is not a prefix of Cj.
Huffman Preprocessing
There are three types of Huffman coding:
- Adaptive: The probabilities are calculated on the fly while the file is read for compression, so the Huffman coding tree evolves and a single symbol can have different codes over time.
- Semi-Adaptive: The file to be compressed is read first to determine the probability of each symbol.
- Static: All symbols have a pre-determined probability.
Prefix condition
Example of a variable-length code that doesn’t satisfy the prefix condition:
Symbol | Code |
---|---|
A | 0 |
B | 01 |
C | 00 |
As you can see, when you read 00 from a file, it can be AA or C, because the code 0 is a prefix of the code 00.
Example of a variable-length code that satisfies the prefix condition:
Symbol | Code |
---|---|
A | 0 |
B | 10 |
C | 110 |
D | 1110 |
E | 1111 |
As you can see, when you read a file it is impossible to mistake one symbol for another. Such a code can also be represented by a coding tree.
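To make the prefix condition concrete, here is a small Python check (a sketch of my own, not from the original article) that tests whether a code table is prefix-free; it rejects the first table and accepts the second:

```python
def is_prefix_free(codes):
    """Return True if no code in the table is a prefix of another."""
    values = sorted(codes.values())
    # In lexicographic order a prefix sorts right before its extensions,
    # so checking neighbouring pairs is enough.
    return all(not b.startswith(a) for a, b in zip(values, values[1:]))

print(is_prefix_free({"A": "0", "B": "01", "C": "00"}))  # False: 0 prefixes 01 and 00
print(is_prefix_free({"A": "0", "B": "10", "C": "110",
                      "D": "1110", "E": "1111"}))        # True
```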
Huffman coding tree
To create a Huffman coding tree you need an alphabet and the probability associated with each symbol, for example:
Symbol | Probability |
---|---|
A | 0.5 |
B | 0.2 |
C | 0.2 |
D | 0.1 |
The following tree is created (since B and C have the same probability, it can differ from the one presented):
Algorithm
The algorithm to create this tree is the following:
- Create a node for each symbol and add them to a pool;
- Sort the pool by ascending probability;
- Choose the two nodes with the lowest probability;
- Create a new node whose probability is the sum of the two;
- Connect the two chosen nodes as children of the new node;
- Remove the two chosen nodes from the pool and add the new node;
- Go back to step 2 until only one node (the root) remains.
Figure: steps 1 .. 2; steps 3 .. 6; repeat until only the root remains.
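The pool maps naturally onto a priority queue. Below is a minimal Python sketch of the construction and of reading the codes off the tree (the Node and build_tree names are my own, not from the article); a counter breaks ties so the heap never has to compare Node objects:

```python
import heapq
from itertools import count

class Node:
    def __init__(self, prob, symbol=None, left=None, right=None):
        self.prob, self.symbol = prob, symbol
        self.left, self.right = left, right

def build_tree(probabilities):
    tiebreak = count()  # heapq compares tuples; the counter avoids comparing Nodes
    pool = [(p, next(tiebreak), Node(p, s)) for s, p in probabilities.items()]
    heapq.heapify(pool)  # steps 1-2: one node per symbol, ordered by probability
    while len(pool) > 1:
        # steps 3-6: join the two least probable nodes under a new parent
        p1, _, n1 = heapq.heappop(pool)
        p2, _, n2 = heapq.heappop(pool)
        heapq.heappush(pool, (p1 + p2, next(tiebreak),
                              Node(p1 + p2, left=n1, right=n2)))
    return pool[0][2]  # the single remaining node is the root

def code_table(node, prefix=""):
    # The path from the root to a leaf (0 = left, 1 = right) is that symbol's code.
    if node.symbol is not None:
        return {node.symbol: prefix or "0"}
    table = code_table(node.left, prefix + "0")
    table.update(code_table(node.right, prefix + "1"))
    return table

root = build_tree({"A": 0.5, "B": 0.2, "C": 0.2, "D": 0.1})
print(code_table(root))  # e.g. {'A': '0', 'C': '10', 'D': '110', 'B': '111'}
# The B/C tie means the exact codes can differ from the table below,
# but the average code length (1.8 bits per symbol) is the same.
```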
Which translates to:
Symbol | Probability |
---|---|
A | 0 |
B | 11 |
C | 101 |
D | 100 |
As a side note, for alphabets whose symbols have independent probabilities (i.e. no dependencies between symbols, unlike the “qu” pair in some Latin-derived languages), no algorithm produces better symbol-by-symbol prefix codes than Huffman coding.
The encoding itself is done with a dictionary (the table mapping each Symbol to its Code) by directly replacing every symbol with its code.
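As a sketch (using the code table above and an invented sample string), encoding is just one lookup per symbol; the string of '0'/'1' characters stands in for the real bit stream:

```python
table = {"A": "0", "B": "11", "C": "101", "D": "100"}  # the table above
encoded = "".join(table[symbol] for symbol in "ABACADA")
print(encoded)  # 011010101000: 12 bits for 7 symbols
```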
Coding
To compress a file, the alphabet and the respective probabilities must be saved as well, so that the decoder can rebuild the tree. Normally this is done by saving the total number of characters and the number of occurrences of each symbol, which are all integers. Storing integers rather than floating-point probabilities saves space in the compressed file and avoids parsing problems on the decoder side.
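One possible header layout, purely as an illustration (my own format, not a standard one, assuming single-byte symbols and files opened in binary mode): the number of distinct symbols followed by a (symbol, count) pair for each, all fixed-size integers:

```python
import struct

def write_header(f, counts):
    """counts maps each single-byte symbol to its number of occurrences."""
    f.write(struct.pack(">H", len(counts)))    # number of distinct symbols
    for symbol, n in counts.items():
        f.write(symbol.encode("latin-1"))      # 1 byte: the symbol itself
        f.write(struct.pack(">I", n))          # 4 bytes: its occurrence count

def read_header(f):
    (distinct,) = struct.unpack(">H", f.read(2))
    counts = {}
    for _ in range(distinct):
        symbol = f.read(1).decode("latin-1")
        (counts[symbol],) = struct.unpack(">I", f.read(4))
    return counts  # enough to recompute the probabilities and rebuild the tree
```

The total number of characters is the sum of the counts, and both sides must use the same deterministic tie-breaking rule when building the tree, otherwise the decoder’s codes will not match the encoder’s.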
Decoding
To decode a file, all that is needed is to read it bit by bit and follow the coding tree until a leaf (a node with no children) is reached; its symbol is sent to the output, and reading restarts at the root.
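A minimal decoding sketch, reusing the Node class from the tree-building example (the tree here is hand-built to match the article’s table, since the heap construction may break the B/C tie differently):

```python
def decode(bits, root):
    out, node = [], root
    for bit in bits:                     # read the stream bit by bit
        node = node.left if bit == "0" else node.right
        if node.symbol is not None:      # reached a leaf: emit its symbol
            out.append(node.symbol)
            node = root                  # and restart at the root
    return "".join(out)

def leaf(s):
    return Node(0, s)  # probabilities no longer matter once the tree shape is fixed

# Hand-built tree for the article's table: A=0, B=11, C=101, D=100
root = Node(0, left=leaf("A"),
            right=Node(0, left=Node(0, left=leaf("D"), right=leaf("C")),
                       right=leaf("B")))
print(decode("011010101000", root))  # ABACADA, the sample string encoded earlier
```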
Thank you for reading.