groupBy
Here’s a useful function from Data.List:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
==) "aabcdda"
groupBy (-- ["aa","b","c","dd","a"]
However, as has been pointed out before1, groupBy
expects an equivalence relation, and can exhibit surprising behavior when it doesn’t get one. Let’s say, for instance, that we wanted to group numbers that were close together:
groupClose :: [Integer] -> [[Integer]]
= groupBy (\x y -> abs (x - y) < 3) groupClose
What would you expect on the list [1, 2, 3, 4, 5]
? All in the same group? Well, what you actually get is:
1,2,3],[4,5]] [[
This is because the implementation of groupBy
only compares to the first element in each group:
= []
groupBy _ [] :xs) = (x:ys) : groupBy eq zs
groupBy eq (xwhere (ys,zs) = span (eq x) xs
Brandon Simmons gave a definition of groupBy
that is perhaps more useful, but it used explicit recursion, rather than a fold.
A definition with foldr
turned out to be trickier than I expected. I found some of the laziness properties especially difficult:
>>> head (groupBy (==) (1:2:undefined))
1]
[>>> (head . head) (groupBy (==) (1:undefined))
1
>>> (head . head . tail) (groupBy (==) (1:2:undefined))
2
Here’s the definition I came up with, after some deliberation:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
= build (\c n ->
groupBy p xs let f x a q
| q x = (x : ys, zs)
| otherwise = ([], c (x : ys) zs)
where (ys,zs) = a (p x)
in snd (foldr f (const ([], n)) xs (const False)))
{-# INLINE groupBy #-}
Seemingly benign changes to the function will break one or more of the above tests. In particular, the laziness of a “where” binding needs to be taken into account. Here’s an early attempt which failed:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
= build (\c n ->
groupBy p xs let f x a q d
| q x = a (p x) (d . (:) x)
| otherwise = d [] (a (p x) (c . (:) x))
in foldr f (\_ d -> d [] n) xs (const False) (\ _ y -> y))
Once done, though, it works as expected:
>>> groupBy (==) "aaabcccdda"
"aaa","b","ccc","dd","a"]
[>>> groupBy (==) []
[]>>> groupBy (<=) [1,2,2,3,1,2,0,4,5,2]
1,2,2,3],[1,2],[0,4,5],[2]] [[
It’s the fastest version I could find that obeyed the above laziness properties.
The GHC page on the issue unfortunately seems to indicate the implementation won’t be changed. Ah, well. Regardless, I have a repository with the implementation above (with extra fusion machinery added) and comparisons to other implementations.
There are several threads on the libraries mailing list on this topic:
- 2006
- mapAccumL - find max in-sequence subsequence
- 2007
- Data.List.groupBy with non-transitive equality predicate (this is the longest discussion on the topic)
- 2008
- Generalize groupBy in a useful way?
- 2009
- nubBy seems broken in recent GHCs