Lazily Grouping in Haskell
Here’s a cool trick:
minimum :: Ord a => [a] -> a
minimum = head . sortThis is in Haskell, not as you might expect. And this isn’t because Haskell is using some weird linear-time sorting algorithm; indeed, the following is :
maximum :: Ord a => [a] -> a
maximum = last . sortNo: since the implementation of minimum above only
demands the first element of the list, and since sort has
been carefully implemented, only a linear amount of work will be done to
retrieve it.
It’s not easy to structure programs to have the same property as
sort does above: to be maximally lazy, such that
unnecessary work is not performed. Today I was working on a maximally
lazy implementation of the following program:
groupOn :: Eq k => (a -> k) -> [a] -> [(k,[a])]
groupOn = ...
>>> groupOn (`rem` 2) [1..5]
[(1,[1,3,5]),(0,[2,4])]
>>> groupOn (`rem` 3) [5,8,3,6,2]
[(2,[5,8,2]),(0,[3,6])]This function groups the elements of a list according to some key function. The desired behaviour here is a little subtle: we don’t want to just group adjacent elements, for instance.
groupOn (`rem` 3) [5,8,3,6,2] ≢ [(2,[5,8]),(0,[3,6]),(2,[2])]And we don’t want to reorder the elements of the list by the keys:
groupOn (`rem` 3) [5,8,3,6,2] ≢ [(0,[3,6]),(2,[5,8,2])]These constraints make it especially tricky to make this function
lazy. In fact, at first glance, it seems impossible. What should, for
instance, groupOn id [1..] return? It can’t even fill out
the first group, since it will never find another 1.
However, it can fill out the first key. And, in fact, the
second. And it can fill out the first element of the first group.
Precisely:
groupOn id [1..] ≡ [(1,1:⊥), (2,2:⊥), (3,3:⊥), ...Another example is groupOn id (repeat 1), or
groupOn id (cycle [1,2,3]). These each have
partially-defined answers:
groupOn id (repeat 1) ≡ (1,repeat 1):⊥
groupOn id (cycle [1,2,3]) ≡ (1,repeat 1):(2,repeat 2):(3,repeat 3):⊥So there is some kind of well-defined lazy semantics for this function. The puzzle I was interested in was defining an efficient implementation for these semantics.
The Slow Case
The first approximation to a solution I could think of is the following:
groupOn :: Ord k => (a -> k) -> [a] -> [(k, [a])]
groupOn k = Map.toList . Map.fromListWith (++) . map (\x -> (k x, [x]))In fact, if you don’t care about laziness, this is probably the best solution: it’s , it performs well (practically as well as asymptotically), and it has the expected results.
However, there are problems. Primarily this solution cares about ordering, which we don’t want. We want to emit the results in the same order that they were in the original list, and we don’t necessarily want to require an ordering on the elements (for the efficient solution we will relax this last constraint).
Instead, let’s implement our own “map” type that is inefficient, but more general.
type Map a b = [(a,b)]
insertWith :: Eq a => (b -> b -> b) -> a -> b -> Map a b -> Map a b
insertWith f k v [] = [(k,v)]
insertWith f k v ((k',v'):xs)
| k == k' = (k',f v v') : xs
| otherwise = (k',v') : insertWith f k v xs
groupOn :: Eq k => (a -> k) -> [a] -> [(k, [a])]
groupOn k = foldr (uncurry (insertWith (++))) [] . map (\x -> (k x, [x]))The problem here is that it’s not lazy enough.
insertWith is strict in its last argument, which means that
using foldr doesn’t gain us anything laziness-wise.
There is some extra information we can use to drive the result: we know that the result will have keys that are in the same order as they appear in the list, with duplicates removed:
groupOn :: Eq k => (a -> k) -> [a] -> [(k, [a])]
groupOn k xs = map _ ks
where
ks = map k xsFrom here, we can get what the values should be from each key by filtering the original list:
groupOn :: Eq k => (a -> k) -> [a] -> [(k,[a])]
groupOn key xs = map (\k -> (k, filter ((k==) . key) xs)) (nub (map key xs))Using a kind of Schwartzian transform yields the following slight improvement:
groupOn :: Eq k => (a -> k) -> [a] -> [(k,[a])]
groupOn key xs = map (\k -> (k , map snd (filter ((k==) . fst) ks))) (nub (map fst ks))
where
ks = map (\x -> (key x, x)) xsBut this traverses the same list multiple times unnecessarily. The
problem is that we’re repeating a lot of work between nub
and the rest of the algorithm.
The following is much better:
groupOn :: Eq k => (a -> k) -> [a] -> [(k,[a])]
groupOn key = go . map (\x -> (key x, x))
where
go [] = []
go ((k,x):xs) = (k,x:map snd y) : go ys
where
(y,ys) = partition ((k==).fst) xsFirst, we perform the Schwartzian transform optimisation. The work of
the algorithm is done in the go helper. The idea is to
filter out duplicates as we encounter them: when we encounter
(k,x) we can keep it immediately, but then we split the
rest of the list into the components that have the same key as this
element, and the ones that differ. The ones that have the same key can
form the collection for this key, and those that differ are what we
recurse on.
This partitioning also avoids re-traversing elements we know to be
already accounted for in a previous group. I think that this is the most
efficient (modulo some inlining and strictness improvements) algorithm
that can do groupOn with just an Eq
constraint.
A Faster Version
The reason that the groupOn above is slow is that every
element returned has to traverse the entire rest of the list to remove
duplicates. This is a classic pattern of quadratic behaviour: we can
improve it by using the same trick as quick sort, by partitioning the
list into lesser and greater elements on every call.
groupOnOrd :: Ord k => (a -> k) -> [a] -> [(k,[a])]
groupOnOrd key = go . map (\x -> (key x, x))
where
go [] = []
go ((k,x):xs) = (k,x:e) : go lt ++ go gt
where
(e,lt,gt) = foldr split ([],[],[]) xs
split ky@(k',y) ~(e,lt,gt) = case compare k' k of
LT -> (e, ky:lt, gt)
EQ -> (y:e, lt, gt)
GT -> (e, lt, ky:gt)While this is , and it does group elements, it also reorders the underlying list. Let’s fix that by tagging the incoming elements with their positions, and then using those positions to order them back into their original configuration:
groupOnOrd :: Ord k => (a -> k) -> [a] -> [(k,[a])]
groupOnOrd k = map (\(_,k,xs) -> (k,xs)) . go . zipWith (\i x -> (i, k x, x)) [0..]
where
go [] = []
go ((i, k, x):xs) = (i, k, x : e) : merge (go l) (go g)
where
(e, l, g) = foldr split ([],[],[]) xs
split ky@(_,k',y) ~(e, l, g) = case compare k' k of
LT -> (e , ky : l, g)
EQ -> (y:e, l, g)
GT -> (e , l, ky : g)
merge [] gt = gt
merge lt [] = lt
merge (l@(i,_,_):lt) (g@(j,_,_):gt)
| i <= j = l : merge lt (g:gt)
| otherwise = g : merge (l:lt) gtThis is close, but still not right. This isn’t yet lazy. The
merge function is strict in both arguments.
However, we have all the information we need to unshuffle the lists
without having to inspect them. In split, we know which
direction we put each element: we can store that info without using
indices.
groupOnOrd :: Ord k => (a -> k) -> [a] -> [(k,[a])]
groupOnOrd k = catMaybes . go . map (\x -> (k x, x))
where
go [] = []
go ((k,x):xs) = Just (k, x : e) : merge m (go l) (go g)
where
(e, m, l, g) = foldr split ([],[],[],[]) xs
split ky@(k',y) ~(e, m, l, g) = case compare k' k of
LT -> ( e, LT : m, ky : l, g)
EQ -> (y:e, EQ : m, l, g)
GT -> ( e, GT : m, l, ky : g)
merge [] lt gt = []
merge (EQ : xs) lt gt = Nothing : merge xs lt gt
merge (LT : xs) (l:lt) gt = l : merge xs lt gt
merge (GT : xs) lt (g:gt) = g : merge xs lt gtWhat we generate here is a [Ordering]: this list tells
us the result of all the compare operations on the input list. Then, in
merge, we invert the action of split,
rebuilding the original list without inspecting either lt
or gt.
And this solution works! It’s , and fully lazy.
>>> map fst . groupOnOrd id $ [1..]
[1..]
>>> groupOnOrd id $ cycle [1,2,3]
(1,repeat 1):(2,repeat 2):(3,repeat 3):⊥
>>> groupOnOrd (`rem` 3) [1..]
(1,[1,4..]):(2,[2,5..]):(0,[3,6..]):⊥The finished version of these two functions, along with some benchmarks, is available here.