Chapter 48

Home

Huffman Coding Basics

Huffman coding compresses data by giving frequent symbols short binary codes and rare symbols long codes. The greedy algorithm builds an optimal prefix-free tree by repeatedly merging the two least frequent symbols. It's used in JPEG, PNG, MP3, and gzip.

Fixed-Length vs Variable-Length

ASCII uses 8 bits per character — every symbol gets the same length. Huffman assigns shorter codes to frequent symbols. "e" might get 01 while "z" gets 11010. On average, the message is shorter.

Fixed: e→00 t→01 a→10 z→11 each symbol = 2 bits Huffman: e→0 t→10 a→110 z→111 frequent = short, rare = long

The key constraint: no code is a prefix of another code (prefix-free). Otherwise, when decoding, you wouldn't know where one code ends and the next begins.

The Greedy Algorithm

Start with each symbol as a leaf node with its frequency. Repeatedly:

1. Pick the two nodes with the smallest frequencies

2. Merge them into a new internal node with frequency = sum

3. Attach left (0) and right (1) children

4. Repeat until one node remains (the root)

A:5 B:9 C:12 D:13 E:45 ↓ merge A(5)+B(9) → 14 ↓ merge 14+C(12) → 26 ↓ merge 26+D(13) → 39 ↓ merge 39+E(45) → 84 (root)

C++ — Huffman tree node

struct Node {
    char ch;
    int freq;
    Node *left, *right;

    Node(char c, int f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};

// Custom comparator for min-heap
struct Cmp {
    bool operator()(Node* a, Node* b) { return a->freq > b->freq; }
};

Node* build_huffman(map<char, int>& freq) {
    priority_queue<Node*, vector<Node*>, Cmp> pq;

    for (auto& [ch, f] : freq)
        pq.push(new Node(ch, f));

    while (pq.size() > 1) {
        Node* left = pq.top(); pq.pop();
        Node* right = pq.top(); pq.pop();

        Node* parent = new Node('\0', left->freq + right->freq);
        parent->left = left;
        parent->right = right;

        pq.push(parent);
    }
    return pq.top();  // root
}

Rust — Huffman tree node

use std::collections::BinaryHeap;
use std::cmp::Reverse;

#[derive(Eq, PartialEq)]
struct Node {
    ch: Option<char>,
    freq: usize,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

impl Ord for Node {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.freq.cmp(&other.freq)
    }
}
impl PartialOrd for Node { /* delegate to cmp */ }

fn build_huffman(freq: &HashMap<char, usize>) -> Node {
    let mut pq: BinaryHeap<Reverse<Box<Node>>> = BinaryHeap::new();
    for (&ch, &f) in freq {
        pq.push(Reverse(Box::new(Node { ch: Some(ch), freq: f,
                                        left: None, right: None })));
    }
    while pq.len() > 1 {
        let left = pq.pop().unwrap().0;
        let right = pq.pop().unwrap().0;
        let parent = Box::new(Node {
            ch: None,
            freq: left.freq + right.freq,
            left: Some(left),
            right: Some(right),
        });
        pq.push(Reverse(parent));
    }
    *pq.pop().unwrap().0
}

💡 Greedy choice: always merge the smallest

The two smallest frequencies are merged at each step. This is the optimal choice because it gives the rarest symbols the longest codes (deepest in the tree), which has the least impact on total message length. The proof is an exchange argument: if an optimal tree doesn't merge the two smallest, you can swap and get a tree that's at least as good.

Key Takeaways

Practice