groupBy

Posted on January 7, 2018
Tags: Haskell, Folds

Here’s a useful function from Data.List:

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]

groupBy (==) "aabcdda"
-- ["aa","b","c","dd","a"]

However, as has been pointed out before1, groupBy expects an equivalence relation, and can exhibit surprising behavior when it doesn’t get one. Let’s say, for instance, that we wanted to group numbers that were close together:

groupClose :: [Integer] -> [[Integer]]
groupClose = groupBy (\x y -> abs (x - y) < 3)

What would you expect on the list [1, 2, 3, 4, 5]? All in the same group? Well, what you actually get is:

[[1,2,3],[4,5]]

This is because the implementation of groupBy only compares to the first element in each group:

groupBy _  []           =  []
groupBy eq (x:xs)       =  (x:ys) : groupBy eq zs
                           where (ys,zs) = span (eq x) xs

Brandon Simmons gave a definition of groupBy that is perhaps more useful, but it used explicit recursion, rather than a fold.

A definition with foldr turned out to be trickier than I expected. I found some of the laziness properties especially difficult:

>>> head (groupBy (==) (1:2:undefined))
[1]
>>> (head . head) (groupBy (==) (1:undefined))
1
>>> (head . head . tail) (groupBy (==) (1:2:undefined))
2

Here’s the definition I came up with, after some deliberation:

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p xs = build (\c n ->
  let f x a q
        | q x = (x : ys, zs)
        | otherwise = ([], c (x : ys) zs)
        where (ys,zs) = a (p x)
  in snd (foldr f (const ([], n)) xs (const False)))
{-# INLINE groupBy #-}

Seemingly benign changes to the function will break one or more of the above tests. In particular, the laziness of a “where” binding needs to be taken into account. Here’s an early attempt which failed:

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p xs = build (\c n -> 
  let f x a q d
        | q x = a (p x) (d . (:) x)
        | otherwise = d [] (a (p x) (c . (:) x))
  in foldr f (\_ d -> d [] n) xs (const False) (\ _ y -> y))

Once done, though, it works as expected:

>>> groupBy (==) "aaabcccdda"
["aaa","b","ccc","dd","a"]
>>> groupBy (==) []
[]
>>> groupBy (<=) [1,2,2,3,1,2,0,4,5,2]
[[1,2,2,3],[1,2],[0,4,5],[2]]

It’s the fastest version I could find that obeyed the above laziness properties.

The GHC page on the issue unfortunately seems to indicate the implementation won’t be changed. Ah, well. Regardless, I have a repository with the implementation above (with extra fusion machinery added) and comparisons to other implementations.


  1. There are several threads on the libraries mailing list on this topic:

    2006
    mapAccumL - find max in-sequence subsequence
    2007
    Data.List.groupBy with non-transitive equality predicate (this is the longest discussion on the topic)
    2008
    Generalize groupBy in a useful way?
    2009
    nubBy seems broken in recent GHCs