groupBy
Here’s a useful function from Data.List:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
==) "aabcdda"
groupBy (-- ["aa","b","c","dd","a"]
However, as has been pointed out before1,
groupBy
expects an equivalence
relation, and can exhibit surprising behavior when it doesn’t get one.
Let’s say, for instance, that we wanted to group numbers that were close
together:
groupClose :: [Integer] -> [[Integer]]
= groupBy (\x y -> abs (x - y) < 3) groupClose
What would you expect on the list [1, 2, 3, 4, 5]
? All
in the same group? Well, what you actually get is:
1,2,3],[4,5]] [[
This is because the implementation of groupBy
only compares to the first
element in each group:
= []
groupBy _ [] :xs) = (x:ys) : groupBy eq zs
groupBy eq (xwhere (ys,zs) = span (eq x) xs
Brandon Simmons gave a definition
of groupBy
that is perhaps more
useful, but it used explicit recursion, rather than a fold.
A definition with foldr
turned
out to be trickier than I expected. I found some of the laziness
properties especially difficult:
>>> head (groupBy (==) (1:2:undefined))
1]
[>>> (head . head) (groupBy (==) (1:undefined))
1
>>> (head . head . tail) (groupBy (==) (1:2:undefined))
2
Here’s the definition I came up with, after some deliberation:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
= build (\c n ->
groupBy p xs let f x a q
| q x = (x : ys, zs)
| otherwise = ([], c (x : ys) zs)
where (ys,zs) = a (p x)
in snd (foldr f (const ([], n)) xs (const False)))
{-# INLINE groupBy #-}
Seemingly benign changes to the function will break one or more of the above tests. In particular, the laziness of a “where” binding needs to be taken into account. Here’s an early attempt which failed:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
= build (\c n ->
groupBy p xs let f x a q d
| q x = a (p x) (d . (:) x)
| otherwise = d [] (a (p x) (c . (:) x))
in foldr f (\_ d -> d [] n) xs (const False) (\ _ y -> y))
Once done, though, it works as expected:
>>> groupBy (==) "aaabcccdda"
"aaa","b","ccc","dd","a"]
[>>> groupBy (==) []
[]>>> groupBy (<=) [1,2,2,3,1,2,0,4,5,2]
1,2,2,3],[1,2],[0,4,5],[2]] [[
It’s the fastest version I could find that obeyed the above laziness properties.
The GHC page on the issue unfortunately seems to indicate the implementation won’t be changed. Ah, well. Regardless, I have a repository with the implementation above (with extra fusion machinery added) and comparisons to other implementations.
There are several threads on the libraries mailing list on this topic:
- 2006
- mapAccumL - find max in-sequence subsequence
- 2007
- Data.List.groupBy with non-transitive equality predicate (this is the longest discussion on the topic)
- 2008
- Generalize groupBy in a useful way?
- 2009
- nubBy seems broken in recent GHCs