groupBy

Posted on January 7, 2018

Tags: Haskell, Folds

Here’s a useful function from Data.List:

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]

groupBy (==) "aabcdda"
-- ["aa","b","c","dd","a"]

However, as has been pointed out before¹, groupBy expects an equivalence relation, and can exhibit surprising behavior when it doesn’t get one. Let’s say, for instance, that we wanted to group numbers that were close together:

groupClose :: [Integer] -> [[Integer]]
groupClose = groupBy (\x y -> abs (x - y) < 3)

What would you expect on the list [1, 2, 3, 4, 5]? All in the same group? Well, what you actually get is:

[[1,2,3],[4,5]]

This is because the implementation of groupBy only compares to the first element in each group:

groupBy _  []           =  []
groupBy eq (x:xs)       =  (x:ys) : groupBy eq zs
                           where (ys,zs) = span (eq x) xs

Brandon Simmons gave a definition of groupBy that is perhaps more useful, but it used explicit recursion, rather than a fold.

A definition with foldr turned out to be trickier than I expected. I found some of the laziness properties especially difficult:

>>> head (groupBy (==) (1:2:undefined))
[1]
>>> (head . head) (groupBy (==) (1:undefined))
1
>>> (head . head . tail) (groupBy (==) (1:2:undefined))
2

Here’s the definition I came up with, after some deliberation:

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p xs = build (\c n ->
  let f x a q
        | q x = (x : ys, zs)
        | otherwise = ([], c (x : ys) zs)
        where (ys,zs) = a (p x)
  in snd (foldr f (const ([], n)) xs (const False)))
{-# INLINE groupBy #-}

Seemingly benign changes to the function will break one or more of the above tests. In particular, the laziness of a “where” binding needs to be taken into account. Here’s an early attempt which failed:

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p xs = build (\c n -> 
  let f x a q d
        | q x = a (p x) (d . (:) x)
        | otherwise = d [] (a (p x) (c . (:) x))
  in foldr f (\_ d -> d [] n) xs (const False) (\ _ y -> y))

Once done, though, it works as expected:

>>> groupBy (==) "aaabcccdda"
["aaa","b","ccc","dd","a"]
>>> groupBy (==) []
[]
>>> groupBy (<=) [1,2,2,3,1,2,0,4,5,2]
[[1,2,2,3],[1,2],[0,4,5],[2]]

It’s the fastest version I could find that obeyed the above laziness properties.

The GHC page on the issue unfortunately seems to indicate the implementation won’t be changed. Ah, well. Regardless, I have a repository with the implementation above (with extra fusion machinery added) and comparisons to other implementations.

There are several threads on the libraries mailing list on this topic:

2006

mapAccumL - find max in-sequence subsequence

2007

Data.List.groupBy with non-transitive equality predicate (this is the longest discussion on the topic)

2008

Generalize groupBy in a useful way?

2009

nubBy seems broken in recent GHCs

↩︎