Byte-sized Algorithms: Removing Duplicates

Contrary to popular belief among software engineers, algorithms really do come up in the real world sometimes! This post is just some basic Leetcode-esque stuff I needed to do at work recently. It shouldn’t surprise or impress anyone who has thought about this before or taken an algorithms course, but it was mildly interesting to think and write about.

# Problem Statement

Suppose you have a large (but in-memory) array of strings and you want to remove all of its duplicates as efficiently as possible, i.e. a function signature like this:

// Dedupe returns an array containing all non-unique elements in the input.
func Dedupe(array []string) []string


We’re going to ignore the obvious solution that probably exists in many languages: returning a set initialized with the array, e.g. in Python:

from typing import List

def dedupe(array: List[str]) -> List[str]:
return list(set(array))


This isn’t ideal since you construct two new objects: the set and then the final list. Plus, Golang (and other systems-y languages) don’t have native set types, so let’s explore a few variations to this problem.

## Naive Solution

The simplest approach would be to use a map (or dictionary or hashmap) to track the unique elements, then turn that back into an array. This is basically the Python approach, except we are emulating its built-in set type with a map:1

// Dedupe removes all non-unique elements from the given array.
func Dedupe(array []string) []string {
set := make(map[string]bool, len(array))

for _, item := range array {
if _, ok := set[item]; !ok {
set[item] = true
}
}

i := 0
unique := make([]string, len(set))
for key := range set {
unique[i] = key
i++
}

return unique
}


For an array of size n, this takes O(2n) time2 and uses O(2n) space (yes, the constant doesn’t really matter for Big-O, but it does when comparing approaches).

This approach is painful because (a) we over-estimate the necessary capacity of the new map and (b) we need two loops through our data structures.

We could solve (b) by pre-allocating unique sooner, then do the array insertion as we do existence checks:

// Dedupe removes all non-unique elements from the given array.
func Dedupe(array []string) []string {
set := make(map[string]bool, len(array))
unique := make([]string, 0, len(array))

for _, item := range array {
if _, ok := set[item]; !ok {
set[item] = true
unique = append(unique, item)
}
}

return unique
}


However, this makes (a) worse: the capacity of unique is much higher than it needs to be.

## Unsorted Case

If the input list is unsorted, we can do a lot better than this by using the input list itself as the final storage location:

// Dedupe removes all non-unique elements from the given array *in place*.
func Dedupe(array []string) []string {
set := make(map[string]bool, len(array))

count := 0
for _, item := range array {
if _, ok := set[item]; !ok {
set[item] = true
array[count] = item
count++
}
}

return array[:count]
}


Obviously, if the input list must stay unmodified, this is an irrelevant optimization.

For an array of size n, this version takes O(n) time and uses O(n) space, our best approach yet.

## Sorted Case

If the input list is sorted, we can eliminate the need for a map entirely! Any time a new value is encountered, we adjust the storage index. Note that we shouldn’t sort the list in the unsorted case, since sorting takes O(n log n) time which makes matters worse, not better.

// Dedupe removes all non-unique elements from the given *sorted* array.
func Dedupe(array []string) []string {
prevItem := array // assumption: array has at least one element
count := 1

for i := 1; i < len(array); i++ {
if array[i] == prevItem {
continue
}

array[count] = array[i]
count++
}

return array[:count]
}


This is the best possible scenario, as we need no extra storage space and do the bare minimum amount of work (albeit still O(n)) to de-duplicate the array.

1. I know that map[string]struct{} is a more idiomatic and efficient way to represent a set in Golang, but it’s less readable to someone unfamiliar with the language, so I prefer bool here. ↩︎

2. This assumes constant-time map lookups and insertions, both of which are true in the amortized (“amortized” = did it enough times) case in most languages. There is the subtlety of string comparisons here, though: an equality check is O(m) where m is the length of the shortest string. Technically, then, the complexity is more like O(2nm). However, this factor is present in every solution, so we can “factor it out” in a sense, since we’re comparing them. ↩︎