Single-Pass Huffman Coding
While working on something else, I figured out a nice Haskell implementation of Huffman coding, and I thought I’d share it here. I’ll go through a few techniques for transforming a multi-pass algorithm into a single-pass one first, and then I’ll show how to use them for Huffman. If you just want to skip to the code, it’s provided at the end 1.
The algorithm isn’t single-pass in the sense of Adaptive Huffman Coding: it still uses the normal Huffman algorithm, but the input is transformed in the same traversal that builds the tree to transform it.
Circular Programming
There are several techniques for turning multi-pass algorithms into single-pass ones in functional languages. Perhaps the most famous is circular programming: using laziness to eliminate a pass. R. S. Bird (1984) used this to great effect in solving the repmin problem:
Given a tree of integers, replace every integer with the minimum integer in the tree, in one pass.
For an imperative programmer, the problem is relatively easy: first, write the code to find the minimum value in the tree in the standard way, using a loop and a “smallest so far” accumulator. Then, inside the loop, after updating the accumulator, set the value of the leaf to be a reference to the accumulator.
At first, that solution may seem necessarily impure: we’re using global, mutable state to update many things at once. However, as the paper shows, we can claw back purity using laziness:
data Tree a = Leaf a | Tree a :*: Tree a
repMin :: Tree Integer -> Tree Integer
= ys where
repMin xs = go xs
(m, ys) Leaf x) = (x, Leaf m)
go (:*: ys) = (min x y, xs' :*: ys')
go (xs where
= go xs
(x,xs') = go ys (y,ys')
There and Back Again
Let’s say we don’t have laziness at our disposal: are we hosed? No 2! Danvy and Goldberg (2005) explore this very issue, by posing the question:
Given two lists, xs and ys, can you zip xs with the reverse of ys in one pass?
The technique used to solve the problem is named “There and Back Again”; it should be clear why from one of the solutions:
= walk xs const where
convolve xs ys = k [] ys
walk [] k :xs) k = walk xs (\r (y:ys) -> k ((x,y) : r) ys) walk (x
The traversal of one list builds up the function to consume the other. We could write repmin in the same way:
= uncurry ($) . go where
repMin Leaf x) = (Leaf, x)
go (:*: ys) = (\m -> xs' m :*: ys' m, min xm ym) where
go (xs = go xs
(xs',xm) = go ys (ys',ym)
Cayley Representations
If you’re doing a lot of appending to some list-like structure, you probably don’t want to use actual lists: you’ll end up traversing the left-hand-side of the append many more times than necessary. A type you can drop in to use instead is difference lists (Hughes 1986):
type DList a = [a] -> [a]
rep :: [a] -> DList a
= (++)
rep
abs :: DList a -> [a]
abs xs = xs []
append :: DList a -> DList a -> DList a
= (.) append
append
is
in this representation. In fact, for any monoid with a slow mappend
, you
can use the same trick: it’s called the Cayley representation, and
available as Endo
in Data.Monoid.
rep :: Monoid a => a -> Endo a
= Endo (mappend x)
rep x
abs :: Monoid a => Endo a -> a
abs (Endo f) = f mempty
instance Monoid (Endo a) where
mempty = Endo id
mappend (Endo f) (Endo g) = Enfo (f . g)
You can actually do the same transformation for “monoids” in the categorical sense: applying it to monads, for instance, will give you codensity (Rivas and Jaskelioff 2014).
Traversable
Looking back—just for a second—to the repmin example, we should be
able to spot a pattern we can generalize. There’s really nothing
tree-specific about it, so why can’t we apply it to lists? Or other
structures, for that matter? It turns out we can: the mapAccumL
function is tailor-made to
this need:
repMin :: Traversable t => t Integer -> t Integer
= ys where
repMin xs ~(Just m), ys) = mapAccumL f Nothing xs
(Nothing x = (Just x, m)
f Just y) x = (Just (min x y), m) f (
The tilde before the Just
ensures
this won’t fail on empty input.
Huffman Coding
Finally, it’s time for the main event. Huffman coding is a very multi-pass algorithm, usually. The steps look like this:
- Build a frequency table for each character in the input.
- Build a priority queue from that frequency table.
- Iteratively pop elements and combine them (into Huffman trees) from the queue until there’s only one left.
- That Huffman tree can be used to construct the mapping from items back to their Huffman codes.
- Traverse the input again, using the constructed mapping to replace elements with their codes.
We can’t skip any of these steps: we can try perform them all at once, though.
Let’s write the multi-pass version first. We’ll need the frequency table:
frequencies :: Ord a => [a] -> Map a Int
= Map.fromListWith (+) . map (flip (,) 1) frequencies
And a heap, ordered on the frequencies of its elements (I’m using a skew heap here):
data Heap a
= Nil
| Node {-# UNPACK #-} !Int a (Heap a) (Heap a)
instance Monoid (Heap a) where
mappend Nil ys = ys
mappend xs Nil = xs
mappend h1@(Node i x lx rx) h2@(Node j y ly ry)
| i <= j = Node i x (mappend h2 rx) lx
| otherwise = Node j y (mappend h1 ry) ly
mempty = Nil
Next, we need to build the tree3. We can use the tree type from above.
buildTree :: Map a Int -> Maybe (Tree a)
= prune . toHeap where
buildTree = Map.foldMapWithKey (\k v -> Node v (Leaf k) Nil Nil)
toHeap Nil = Nothing
prune Node i x l r) = case mappend l r of
prune (Nil -> Just x
Node j y l' r' ->
mappend (Node (i+j) (x :*: y) Nil Nil) (mappend l' r')) prune (
Then, a way to convert between the tree and a map:
toMapping :: Ord a => Tree a -> Map a [Bool]
Leaf x) = Map.singleton x []
toMapping (:*: ys) =
toMapping (xs fmap (True:) (toMapping xs)) (fmap (False:) (toMapping ys)) Map.union (
And finally, putting the whole thing together:
huffman :: Ord a => [a] -> (Maybe (Tree a), [[Bool]])
= (tree, map (mapb Map.!) xs) where
huffman xs = frequencies xs
freq = buildTree freq
tree = maybe Map.empty toMapping tree mapb
Removing the passes
The first thing to fix is the toMapping
function: at every level, it
calls union
, a complex and
expensive operation. However, union
and empty
form a monoid, so we can use the
Cayley representation to reduce the calls to a minimum. Next, we want to
get rid of the fmap
s: we can
do that by assembling a function to perform the fmap
as we go,
as in convolve
4.
toMapping :: Ord a => Tree a -> Map a [Bool]
= go tree id Map.empty where
toMapping tree Leaf x) k = Map.insert x (k [])
go (:*: ys) k =
go (xs . (:) True) . go ys (k . (:) False) go xs (k
Secondly, we can integrate the toMapping
function with the buildTree
function, removing another
pass:
buildTree :: Ord a => Map a Int -> Maybe (Tree a, Map a [Bool])
= prune . toHeap where
buildTree = Map.foldMapWithKey (\k v -> Node v (Leaf k, leaf k) Nil Nil)
toHeap Nil = Nothing
prune Node i x l r) = case mappend l r of
prune (Nil -> Just (fmap (\k -> k id Map.empty) x)
Node j y l' r' ->
mappend (Node (i+j) (cmb x y) Nil Nil) (mappend l' r'))
prune (= Map.insert x (k [])
leaf x k = xs (k . (:) True) . ys (k . (:) False)
node xs ys k = (xt :*: yt, node xm ym) cmb (xt,xm) (yt,ym)
Finally, to remove the second pass over the list, we can copy repmin,
using mapAccumL
to both
construct the mapping and apply it to the structure in one go.
huffman :: (Ord a, Traversable t) => t a -> (Maybe (Tree a), t [Bool])
= (fmap fst tree, ys) where
huffman xs = mapAccumL f Map.empty xs
(freq,ys) = (Map.insertWith (+) x 1 fm, mapb Map.! x)
f fm x = buildTree freq
tree = maybe Map.empty snd tree mapb
And that’s it!
Generalization
The similarity between the repmin function and the solution above is suggestive: is there a way to encode this idea of making a multi-pass algorithm single-pass? Of course! We can use an applicative:
data Circular a b c =
Circular !a
-> c)
(b
instance Functor (Circular a b) where
fmap f (Circular tally run) = Circular tally (f . run)
instance Monoid a =>
Applicative (Circular a b) where
pure x = Circular mempty (const x)
Circular fl fr <*> Circular xl xr =
Circular
mappend fl xl)
(-> fr r (xr r))
(\r
liftHuffman :: Ord a
=> a -> Circular (Map a Int) (Map a [Bool]) [Bool]
= Circular (Map.singleton x 1) (Map.! x)
liftHuffman x
runHuffman :: Ord a
=> Circular (Map a Int) (Map a [Bool]) r -> (Maybe (Tree a), r)
Circular smry run) =
runHuffman (maybe (Nothing, run Map.empty) (Just *** run) (buildTree smry)
huffman :: (Ord a, Traversable t)
=> t a -> (Maybe (Tree a), t [Bool])
= runHuffman . traverse liftHuffman huffman
Thanks to it being an applicative, you can do all the fun lensy things with it:
showBin :: [Bool] -> String
= map (bool '0' '1')
showBin
>>> let liftBin = fmap showBin . liftHuffman
>>> (snd . runHuffman . (each.traverse) liftBin) ("abb", "cad", "c")
"01","11","11"],["00","01","10"],["00"]) ([
Bringing us back to the start, it can also let us solve repmin!
liftRepMin :: a -> Circular (Option (Min a)) a a
= Circular (pure (pure x)) id
liftRepMin x
runRepMin :: Circular (Option (Min a)) a b -> b
Circular m r) = r (case m of
runRepMin (Option (Just (Min x)) -> x)
repMin :: (Ord a, Traversable t) => t a -> t a
= runRepMin . traverse liftRepMin repMin
Related
So the Circular
type
is actually just the product of reader and writer, and is closely
related to the sort type.
It’s also related to the Prescient
type, which I noticed after I’d written the above.
References
Huffman coding single-pass implementation:
↩︎import Data.Map.Strict (Map) import qualified Data.Map.Strict as Map import Data.Traversable (mapAccumL) data Heap a = Nil | Node {-# UNPACK #-} !Int a (Heap a) (Heap a) instance Monoid (Heap a) where mappend Nil ys = ys mappend xs Nil = xs mappend h1@(Node i x lx rx) h2@(Node j y ly ry) | i <= j = Node i x (mappend h2 rx) lx | otherwise = Node j y (mappend h1 ry) ly mempty = Nil data Tree a = Leaf a | Tree a :*: Tree a buildTree :: Ord a => Map a Int -> Maybe (Tree a, Map a [Bool]) = prune . toHeap where buildTree = Map.foldMapWithKey (\k v -> Node v (Leaf k, leaf k) Nil Nil) toHeap Nil = Nothing prune Node i x l r) = case mappend l r of prune (Nil -> Just (fmap (\k -> k id Map.empty) x) Node j y l' r' -> mappend (Node (i+j) (cmb x y) Nil Nil) (mappend l' r')) prune (= Map.insert x (k []) leaf x k = xs (k . (:) True) . ys (k . (:) False) node xs ys k = (xt :*: yt, node xm ym) cmb (xt,xm) (yt,ym) huffman :: (Ord a, Traversable t) => t a -> (Maybe (Tree a), t [Bool]) = (fmap fst tree, ys) where huffman xs = mapAccumL f Map.empty xs (freq,ys) = (Map.insertWith (+) x 1 fm, mapb Map.! x) f fm x = buildTree freq tree = maybe Map.empty snd tree mapb
Well, that’s a little bit of a lie. In terms of asympostics, Pippenger (1997) stated a problem that could be solved in linear time in impure Lisp, but in pure Lisp. R. Bird, Jones, and Moor (1997) then produced an algorithm that could solve the problem in linear time, by using laziness. So, in some cases, laziness will give you asymptotics you can’t get without it (if you want to stay pure).↩︎
There’s actually a nicer version of the
buildTree
function which usesStateT (Heap a) Maybe
, but it’s equivalent to this one under the hood, and I though might be a little distracting.↩︎Something to notice about this function is that it’s going top-down and bottom-up at the same time. Combining the maps (with
(.)
) is done bottom-up, but building the codes is top-down. This means the codes are built in reverse order! That’s why the accumulating parameter (k
) is a difference list, rather than a normal list. As it happens, if normal lists were used, the function would be slightly more efficient through sharing, but the codes would all be reversed.↩︎