## groupBy

Here’s a useful function from Data.List:

```
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy (==) "aabcdda"
-- ["aa","b","c","dd","a"]
```

However, as has been pointed out before^{1}, `groupBy`

expects an equivalence relation, and can exhibit surprising behavior when it doesn’t get one. Let’s say, for instance, that we wanted to group numbers that were close together:

```
groupClose :: [Integer] -> [[Integer]]
groupClose = groupBy (\x y -> abs (x - y) < 3)
```

What would you expect on the list `[1, 2, 3, 4, 5]`

? All in the same group? Well, what you actually get is:

`[[1,2,3],[4,5]]`

This is because the implementation of `groupBy`

only compares to the first element in each group:

```
groupBy _ [] = []
groupBy eq (x:xs) = (x:ys) : groupBy eq zs
where (ys,zs) = span (eq x) xs
```

Brandon Simmons gave a definition of `groupBy`

that is perhaps more useful, but it used explicit recursion, rather than a fold.

A definition with `foldr`

turned out to be trickier than I expected. I found some of the laziness properties especially difficult:

```
>>> head (groupBy (==) (1:2:undefined))
[1]
>>> (head . head) (groupBy (==) (1:undefined))
1
>>> (head . head . tail) (groupBy (==) (1:2:undefined))
2
```

Here’s the definition I came up with, after some deliberation:

```
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p xs = build (\c n ->
let f x a q
| q x = (x : ys, zs)
| otherwise = ([], c (x : ys) zs)
where (ys,zs) = a (p x)
in snd (foldr f (const ([], n)) xs (const False)))
{-# INLINE groupBy #-}
```

Seemingly benign changes to the function will break one or more of the above tests. In particular, the laziness of a “where” binding needs to be taken into account. Here’s an early attempt which failed:

```
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p xs = build (\c n ->
let f x a q d
| q x = a (p x) (d . (:) x)
| otherwise = d [] (a (p x) (c . (:) x))
in foldr f (\_ d -> d [] n) xs (const False) (\ _ y -> y))
```

Once done, though, it works as expected:

```
>>> groupBy (==) "aaabcccdda"
["aaa","b","ccc","dd","a"]
>>> groupBy (==) []
[]
>>> groupBy (<=) [1,2,2,3,1,2,0,4,5,2]
[[1,2,2,3],[1,2],[0,4,5],[2]]
```

It’s the fastest version I could find that obeyed the above laziness properties.

The GHC page on the issue unfortunately seems to indicate the implementation won’t be changed. Ah, well. Regardless, I have a repository with the implementation above (with extra fusion machinery added) and comparisons to other implementations.

There are several threads on the libraries mailing list on this topic:

- 2006
- mapAccumL - find max in-sequence subsequence
- 2007
- Data.List.groupBy with non-transitive equality predicate (this is the longest discussion on the topic)
- 2008
- Generalize groupBy in a useful way?
- 2009
- nubBy seems broken in recent GHCs