Almighty Bus Error

Loading search...

Huffman Coding

From Wikipedia:

In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol.

Since Huffman encoding uses a variable-length codes, it needs a mechanism to be able to distinguish each symbol code, this mechanism is called prefix condition. It means that for any 2 codes, one can’t be the prefix of another. Another way to present this notion is for each code Ci and Cj, Ci isn’t the prefix of Cj.

Huffman Preprocessing

There are three types of Huffman coding:

  • Adaptive: The Huffman coding tree evolves (a single symbol can have different codes over time) while reading the file for compression by calculating the probability on-the-fly.
  • Semi-Adaptive: The file to be compressed is read first to determine the probability of each symbol.
  • Static: All symbols have a pre-determined probability.

Prefix condition

Example of a variable-length code that doesn’t have the prefix condition:

SymbolCode
A0
B01
C00

As you can see, when you read 00 from a file, it can be AA or C because the code 0, is a prefix of 00.

Example of a variable-code with prefix condition:

SymbolCode
A0
B10
C110
D1110
E1111

As you can see, when you read a file, it is impossible to mistake a symbol for another, this can also be represented by a coding tree.

Huffman coding tree

To create a Huffman coding tree you need an alphabet and the associated probability to each symbol, for example:

SymbolProbability
A0.5
B0.2
C0.2
D0.1

The following tree is created (since B and C have the same probability is can differ from the one presented):

Algorithm

The algorithm to create this tree is the following:

  1. Create a node for each symbol and add them to a pool;
  2. Sort the pool by ascendending probability;
  3. Choose the two nodes with less probability;
  4. Create a new node with the sum the probability;
  5. Connect the two nodes to a new node with the sum of the probability;
  6. Remove the two chosen nodes from the pool and add the new node;
  7. Back to step 2 until all nodes are connected.

Step 1 .. 2

Step 3 .. 6

Repeat

Which translates to:

SymbolProbability
A0
B11
C101
D100

As a side note, there is currently no algorithm that creates better codesfor alphabets that have independent probabilities (e.g. symbols don’t have dependencies between them as some latin languages have with “qu”).

The encoding is done with a dictionary (the table with the relation Symbol – Code) by directly replacing a symbol with its code.

Coding

To compress the file it’s needed to save the alphabet and the respective probabilities to be able to decode the text(to rebuild the tree). Normally this is done by saving the total number of characters and the number of occurrences of each symbol which are all integers. That saves storing size on the compressed file and avoids parsing problems on the decoder side.

Decoding

To decode a file all that is needed is to read bit-by-bit and follow the coding tree until a leaf(node with no childs) is reached and a symbol is sent to the output.

Thank you for reading.