CSE 382 Algorithms and Data Structures

Barbara Nostrand, Ph.D.

Electrical Engineering and Computer Science


Laboratory 11 - Huffman Compression

Introduction

Huffman Coding is an entropy encoding algorithm used for lossless data compression.  Huffman Coding automatically builds a dictionary of variable-length codes for common strings of symbols in a source file.  Next,  this dictionary is used to replace common lengthy code groups with short code symbols.  The combined length of the dictionary and the encoded file is often dramatically shorter than the original file.  In 1951,  David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam.  The professor,  Robert M. Fano,  assigned a term paper on the problem of finding the most efficient binary code.  Huffman,  unable to prove any codes were the most efficient,  was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.  Huffman outdid his professor,  who had developed a sub-optimal algorithm with Claude Shannon.  Huffman avoided the major flaw of the suboptimal Shannon-Fano coding by building the tree from the bottom up instead of from the top down. 

Huffman compression replaces variable length source code groups with variable length compression code groups.  Common letters like "T" and words like "AND" use fewer bits than unusual letters like "X".  Huffman Compression and Huffman Trees are discussed on pages 398-402 and 447-454.  A partial implementation of the HuffmanTree class begins on page 451. 

Building and Representing Trees

There are fundamentally two ways to represent trees: 

  1. As a linked structure of an internal Node class. 
  2. As a linked structure of trees. 
Aside from being able to:  adddelete, and retrieve data from our tree,  we also want to be able to:  retrieve a subtreedetach a subtree,  and attach a subtree.  For Huffman trees,  this suggests that we might want to have a HuffmanTree class and an internal HuffmanLeaf class instead of having a Tree class and a Node class.  In short,  using a recursive definition for Tree may simplify our design and implementation tasks.  Essentially,  a Tree would consist of two different kinds of node:  an internal node and a leaf node.  Now notice that an internal node has to keep all of the same information as a Tree object!  Consequently,  we can replace or specification for our Tree class with our definition for InternalNode!  Thus,  we use the design we worked out for InternalNode for Tree and dispense with having a separate InternalNose class.  Finally,  recall that our left and right links must be able to attach to either leaves or trees.  One approach to handling this is to define a Leaf as a subclass of Tree which contains an additional data attribute.  This works when you want to place all of the data at the leaves of the tree.  If,  however,  you want to locate data internally,  then you simply include a data attribute in your Tree class and dispense with having a separate Leaf class. 

Trees as Recursive Data Structures

Any of the pointer based tree representations involves a recursive data structure

  1. You can implement a BinarySearchTree class by having a BinarySearchTree class and a separate internal Node class.  In this case,  the Node class is a recursive data structure as each Node contains a pair of left and right attribute variables each of which is of class Node
  2. You can implement a BinarySearchTree class by having a BinarySearchTree class and a separate internal Leaf class which has a data attribute variable,  but lacks left and right attribute variables.  In this case,  BinarySearchTree is a recursive data structure as each left and right attribute variable will belong to class BinarySearchTree
  3. Finally,  you can simply define a BinarySearchTree class where each BinarySearchTree object will have an Object valued data attribute as well as left and right branches to subtrees.  These BinarySearchTree attribute variables make this realization of BinarySearchTree into a recursive data structure. 

The Java Collection Framework

We have been exploring a number of ADTs (Abstract Data Types) which are commonly used in designing software systems.  You might wonder why Tree is not part of the Java Collections Framework.  The reason for this is that we use trees as components to construct more intuitive ADTs such as dictionaries and sets which are in the Java Collections Framework.  The Map interface effectively replaces the old Dictionary class.  Both of these data structures allow you to look up a data element by its key value.  The Set interface defines set membership operations.  Both of these interfaces have sorted extensions which add order properties to each of them.  A SortedSet allows you to sequentially access the members of the set.  A SortedMap allows you to sequentially access the keys of the map.  Here is a simplified inheritance diagram for the Java Collections Framework

Dictionaries and Maps

In our investigation of vectors and arrays,  we've seen structures that collect objects in which the objects can be indexed by non-negative integers.  Are there other ways to index collections?  Yes,  we can index arrays and vectors using integer compatible values such as characters.  However,  we would like to index more abstract collections with other objects such as:  Strings,  Complex Humbers,  or Points.  It turns out that we can use just about any class of objects as keys to retrieve values from unordered collections.  Please understand that while we can place integers and characters in a definite order,  other useful indices can not be ordered.  Consequently,  we need a more general data structure.  Currently,  Java provides the Map interface for this purpose.  The Map interface replaces the earlier Dictionary class.  In Java,  a map is a well defined relation between two distinct Sets.  Thus,  we have a data structure based on a mathematical concept much as Relational Databases are themselves based on a mathematical concept.  Maps are similar to discrete functions,  except that they are usually sparsely defined over the domain of possible keys.  Essentially,  a Map is a Set of key/value pairs such that there can be at most one pair corresponding to each possible key

Dictionary Class

HashTable Class

The HashTable is an old idea.  In Java,  the HashTable class extends the Dictionary class.  Many software engineers believe that these two classes are obsolete,  but you are likely to encounter them if you maintain legacy systems or work with legacy designs.  Hashtables were invented when sparse tables with a large range of index values had to be stored in very small computer memories.  The idea is to convert the natural index which you might use if you had enough space to store records for every possible index value.  A hash function converts the unrestriced natural index value to a new restriced index value.  The number of possible these hash function values is made to be equal to the number of records that can be accomodated in the hash table

For example,  you might decide to use the last four digits of the Social Security Number for your hash function and use this to store personel records in a table of only 10,000 cells.  Unfortunately,  this can result in collisions when people share the same four last digits.  This can be overcome by various probing techniques which start at the calculated cell index and search for the cell belonging to the target record.  Another technique called separate chaining employs a restricted pointer system to solve the problem. 

Map Interface

An object that maps keys to values.  A map cannot contain duplicate keys;  each key can map to at most one value.  This interface takes the place of the Dictionary class,  which was a totally abstract class rather than an interface. 

Nested Class Summary
static interface Map.Entry
          A map entry (key-value pair).
 
Method Summary
 void clear()
          Removes all mappings from this map (optional operation).
 boolean containsKey(Object key)
          Returns true if this map contains a mapping for the specified key.
 boolean containsValue(Object value)
          Returns true if this map maps one or more keys to the specified value.
 Set entrySet()
          Returns a set view of the mappings contained in this map.
 boolean equals(Object o)
          Compares the specified object with this map for equality.
 Object get(Object key)
          Returns the value to which this map maps the specified key.
 int hashCode()
          Returns the hash code value for this map.
 boolean isEmpty()
          Returns true if this map contains no key-value mappings.
 Set keySet()
          Returns a set view of the keys contained in this map.
 Object put(Object key, Object value)
          Associates the specified value with the specified key in this map (optional operation).
 void putAll(Map t)
          Copies all of the mappings from the specified map to this map (optional operation).
 Object remove(Object key)
          Removes the mapping for this key from this map if it is present (optional operation).
 int size()
          Returns the number of key-value mappings in this map.
 Collection values()
          Returns a collection view of the values contained in this map.
 

Set Interface

A Collection that contains no duplicate elements.  More formally,  sets contain no pair of elements e1 and e2 such that e1.equals(e2),  and at most one null element.  As implied by its name,  this interface models mathematical sets. 

Method Summary
 boolean add(Object o)
          Adds the specified element to this set if it is not already present (optional operation).
 boolean addAll(Collection c)
          Adds all of the elements in the specified collection to this set if they're not already present (optional operation).
 void clear()
          Removes all of the elements from this set (optional operation).
 boolean contains(Object o)
          Returns true if this set contains the specified element.
 boolean containsAll(Collection c)
          Returns true if this set contains all of the elements of the specified collection.
 boolean equals(Object o)
          Compares the specified object with this set for equality.
 int hashCode()
          Returns the hash code value for this set.
 boolean isEmpty()
          Returns true if this set contains no elements.
 Iterator iterator()
          Returns an iterator over the elements in this set.
 boolean remove(Object o)
          Removes the specified element from this set if it is present (optional operation).
 boolean removeAll(Collection c)
          Removes from this set all of its elements that are contained in the specified collection (optional operation).
 boolean retainAll(Collection c)
          Retains only the elements in this set that are contained in the specified collection (optional operation).
 int size()
          Returns the number of elements in this set (its cardinality).
 Object[] toArray()
          Returns an array containing all of the elements in this set.
 Object[] toArray(Object[] a)
          Returns an array containing all of the elements in this set; the runtime type of the returned array is that of the specified array.
 

Getting Started

Create a new directory for the Software Library,  and download the file archive into it.  After you expand the archive,  you will need to use the HuffmanTree files in the CH08 directory and the BitString files from the CH09 directory.  You should also refer to the discussion on how to build a custom HuffmanTree class begining on page 448 of our textbook. 

Software Engineering

Complete all methods of class HuffmanTree and test them out using a document file and a Java source file on your computer.  For testing purposes,  you can use your source file for this project and any computer-readable multi-page text document.  To complete this assignment you need to both compress a file and decompress the compressed version of the file.  The resulting decompressed version should be identical to the original version of the file.  In addition,  your program should report the lengths of both files and the compression ratio. 

For full credit,  be sure to implement a generic HuffmanTree class.  This will allow your Huffman tree to store strings or groups of pixels in an image file.  A generic HuffmanTree<T> requires a generic inner class HuffData<T> where the T is the data type of symbol.  Each parameter type <HUffData> in our class HuffmanTree would be replaced by <HuffData<T>> which indicates that T is a type parameter for class HuffData.  For extra credit,  produce a String based version of your Huffman compression program and compare compression efficiency between the character oriented version and the string oriented version.  Be sure to turn in both versions of your program. 

Project Summary

You will be using a variety of data structures to implement this project.  Effectively,  your Huffman tree will implement a Dictionary.  You should expect your Huffman tree to be fairly flat like the one in figure 8.35 of our textbook. 

  1. Your transmitter will scan the input file to derive character frequency statistics.  Collect these statistics in an array of integers where the index value corresponds to the character code and the value counts the number of occurances of that character. 
  2. Use the character frequency data which you just collected to construct a Huffman tree.  Since this step builds a Huffman tree,  you should think of doing this inside your Huffman tree constructor.  In this case,  you can work directly with nodes,  and dispense with worrying about trees until you have finished constructing your Huffman tree.  Your Huffman tree constructor should accept your character frequency table as an argument.  We will talk about adding and removing nodes from the priority queue instead of trees as in the textbook.  Each node should have the following structure: 
  3. Itteratively construct a leaf node for each character with non-zero frequency,  and insert each of these nodes into your priority queue
  4. Itteratively remove pairs of nodes from the priority queue until you are left with a single node which becomes the root of your Huffman tree. 
  5. Each time you remove a pair of nodes from the priority queue,  create a new node which is linked to the two nodes you removed from your priority queue.  Set the weight of this new node equal to the sum of the weights of the two nodes that it is linked to.  Then,  put this new node into the priority queue
  6. Continue iterating until your priority queue contains only one node.  Remove this last node from the priority queue and make it the root of your Huffman tree. 
  7. Remember to correctly assign a value to the size attribute for your Huffman tree. 
  8. The bit values for a Huffman code are determined by whether the corresponding cell was reached by a left branch (0) or a right branch (1).  There is no code termination mark as each code automatically terminates when you reach a leaf. 
  9. Perform a pre-order traversal of your Huffman tree,  and perform the following operations: 
    1. Send a left or right signal each time you follow a left or right branch down. 
    2. When you reach a leaf: 
      1. Send the original character code for the character stored there. 
      2. Enter the bit string Huffman encoding for the character into a dictionary indexed by the charcter.  This will allow you to look up the Huffman encoding for each character that you will be sending to the receiver. 
  10. Your transmitter will have to transmit your Huffman tree to the decoder over the same channel that you send the file itself.  Note that you will have to encode your tree transmission in such a way that your decoder will be able to distinguish between branches and leaves.  The shortest way to do this is as follows:  For the purposes of this assignment,  you can restrict yourself to 7-bit ASCII,  but you will have to correctly deal with the leading 0 bits as Java uses Unicode to represent characters.  This is quite easy as, except for adding nine leading zero bits,  the first 128 Unicode characters are identical to 7-bit ASCII.  However,  you should make your system more general purpose by transmitting the integer number of bits in each symbol prior to transmitting your Huffman tree.  This allows the receiver to correcdtly decode the Huffman tree that it receives from you. 
  11. Your transmitter will read the file twice and send it only once. 
  12. Your receiver will only receive the file once. 
  13. Your receiver will need to receive and reconstruct the custom Huffman Tree.  Please note that your receiver will have to use a different constructor than the one used by the transmitter.  In particular,  while the transmitter constructs the Huffman tree bottom-up,  the receiver will reconstruct the Huffman tree top-down beginning with the root node. 
  14. Your receiver will need to use the Huffman tree to reconstruct the file.  This is accomplished by using the incoming bit stream to construct a path from the root to a leaf.  When a leaf is reached,  the decoder emits the character stored at the leaf,  and immediately returns to the root of the Huffman Tree. 
As you will be automatically generating your Huffman tree,  carefully study the Case Study begining on page 448 of our textbook.  Your Huffman tree constructor will use a priority queue based on statistics gathered during the first pass over the file to be transmitted.  The tree-constructor algorithm in our textbook builds the Huffman tree from the bottom up.  This means that the leaves are added first,  and the least frequent character should emerge from the priority queue first. 

Sending the Huffman Tree

You must transmit the Huffman tree before you transmit the encrypted file itself.  Consequently,  you must visit each node in your Huffman tree.  Ideally,  your traversial should make reconstructing the tree as easy as possible.  Please recall that Huffman trees are not necessarily balanced.  Consequently,  we will probably want to execute a preorder traversal in the following manner: 


      private static void traverse(Node node) {
          send(node);
          if (node == null) return;
          traverse(node.left());
          traverse(node.right());
          }
Note that this passes off the responsibility for actually encoding the segment for each node to the send(Node node) method.  Also,  you will have to send an indication that a null has been encountered. 

Phrases you should now understand:

Map (replaces Dictionary),  Set,  hashing. 

CollaborationYou will complete this project with your lab partner. 
ImplementationYou must implement your system as a Java application.  You may use Jgrasp.  Do not use netbeans or produce a Java applet. 
EnvironmentYour program must execute on the podium computer in our classroom.
Turn inYou must turn in both a report containing complete program sources and an executable electronic version of your program.

Last modified: 2008 FEB 11
bnostran@syr.edu