## groupBy

Posted on January 7, 2018
Tags: ,

Here’s a useful function from Data.List:

``````groupBy :: (a -> a -> Bool) -> [a] -> [[a]]

groupBy (==) "aabcdda"
-- ["aa","b","c","dd","a"]``````

However, as has been pointed out before1, `groupBy` expects an equivalence relation, and can exhibit surprising behavior when it doesn’t get one. Let’s say, for instance, that we wanted to group numbers that were close together:

``````groupClose :: [Integer] -> [[Integer]]
groupClose = groupBy (\x y -> abs (x - y) < 3)``````

What would you expect on the list `[1, 2, 3, 4, 5]`? All in the same group? Well, what you actually get is:

``[[1,2,3],[4,5]]``

This is because the implementation of `groupBy` only compares to the first element in each group:

``````groupBy _  []           =  []
groupBy eq (x:xs)       =  (x:ys) : groupBy eq zs
where (ys,zs) = span (eq x) xs``````

Brandon Simmons gave a definition of `groupBy` that is perhaps more useful, but it used explicit recursion, rather than a fold.

A definition with `foldr` turned out to be trickier than I expected. I found some of the laziness properties especially difficult:

``````>>> head (groupBy (==) (1:2:undefined))
[1]
1
2``````

Here’s the definition I came up with, after some deliberation:

``````groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p xs = build (\c n ->
let f x a q
| q x = (x : ys, zs)
| otherwise = ([], c (x : ys) zs)
where (ys,zs) = a (p x)
in snd (foldr f (const ([], n)) xs (const False)))
{-# INLINE groupBy #-}``````

Seemingly benign changes to the function will break one or more of the above tests. In particular, the laziness of a “where” binding needs to be taken into account. Here’s an early attempt which failed:

``````groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p xs = build (\c n ->
let f x a q d
| q x = a (p x) (d . (:) x)
| otherwise = d [] (a (p x) (c . (:) x))
in foldr f (\_ d -> d [] n) xs (const False) (\ _ y -> y))``````

Once done, though, it works as expected:

``````>>> groupBy (==) "aaabcccdda"
["aaa","b","ccc","dd","a"]
>>> groupBy (==) []
[]
>>> groupBy (<=) [1,2,2,3,1,2,0,4,5,2]
[[1,2,2,3],[1,2],[0,4,5],[2]]``````

It’s the fastest version I could find that obeyed the above laziness properties.

The GHC page on the issue unfortunately seems to indicate the implementation won’t be changed. Ah, well. Regardless, I have a repository with the implementation above (with extra fusion machinery added) and comparisons to other implementations.

1. There are several threads on the libraries mailing list on this topic:

2006
mapAccumL - find max in-sequence subsequence
2007
Data.List.groupBy with non-transitive equality predicate (this is the longest discussion on the topic)
2008
Generalize groupBy in a useful way?
2009
nubBy seems broken in recent GHCs
↩︎