Finding Bieber: On removing duplicates from a set of documents
So I have two million song lyrics in a big file. Don't ask me how I got it. The point is that I want to find the most poetic phrase of all time.
Problem is, the origins of this file are so sketchy it would make a Pearls Before Swine cartoon look like a Da Vinci. There could well be thousands of copies of Justin Bieber's Eenie Meenie, all frenetically typed in by a horde of snapchatting teenagers using mom's Windows Vista laptop with the missing shift key.
I don't want my analysis to include the copies and covers of the same song. So I have two problems to solve:
- How can we know whether two songs are actually the same?
- And how can we do this quickly, over the whole collection?
But first
When dealing with text, or images, or sound files, or whatever kind of media tickles your pickle, we want to transform them into numbers that computers can use. We turn them into feature vectors, or what Ph.D.-toting natural language processing experts call them when they really want to get technical for a stodgy old formal publication -- the kind that will put them on the tenure track -- and want the most technically precise phrase available: "bags of words". I am not making this up.

Let's say we had this paragon of poesy:
Eenie, meenie, miney, mo Catch a bad chick by her toe If she holla If, if, if she holla, let her go
A bag of words is a set of words and their counts. The ordering of the words is lost, to simplify things; order is rarely important for this kind of analysis anyway.
{a: 1, bad: 1, by: 1, catch: 1, chick: 1, eenie: 1, go: 1, her: 2, holla: 2, if: 4, let: 1, meenie: 1, miney: 1, mo: 1, she: 2, toe: 1}
We could go even simpler and remove the counts, if we feel they aren't important.
{a, bad, by, catch, chick, eenie, go, her, holla, if, let, meenie, miney, mo, she, toe}
As we process each document from the database, the first thing we do is turn it into a bag of words. In Python it's a one-liner.
def makeBag(song): return set(song.lower().replace(",", " ").split())
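A quick sanity check, assuming the makeBag one-liner above (Counter from the standard library is only needed if you also want the counted version):

from collections import Counter

lyric = "Eenie, meenie, miney, mo Catch a bad chick by her toe If she holla If, if, if she holla, let her go"

# The set version, using the one-liner above
print(makeBag(lyric))

# If the counts matter, Counter does the bookkeeping for us
print(Counter(lyric.lower().replace(",", " ").split()))
# Counter({'if': 4, 'she': 2, 'her': 2, 'holla': 2, 'eenie': 1, ...})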
Comparing two bags
Let's say we had three sets. One is the song above. In the second, the teenaged transcriber thought "miney" should be spelled "miny". The third is Frank Sinatra's Fly Me to the Moon. We would like a distance function, so that if two songs differ by only one word, the distance is small, and if they are completely different, the distance is large.

To find the answer, we have to travel to 1907, and accompany Swiss professor Paul Jaccard on his trip to the Alps to do some serious botany. He noticed that there were different clusters of plants in different regions, and wondered if these clusters could be used to determine the ecological evolution of the area. He writes:
In order to reply to this question I have considered, in an alpine district of fair size, various natural sub-divisions, presenting, besides numerous resemblances in their ecological conditions (i.e. conditions dependent on soil and climate), a few characteristic differences, and I have sought to determine, by comparison, the influence of these resemblances and differences on the composition of flora.

He counted all of the different plants in different regions, and came up with this formula to compare how similar two different regions are:
Number of species common to the two districts / total number of species in the two districts.
This gives 0 if the sets share no common elements, and 1 if they are the same. But that's the opposite of what we need, so we subtract it from one to obtain a distance function.
def Jaccard(A, B):
    intersection = len(A & B)
    union = len(A | B)
    return 1.0 - float(intersection) / union
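To tie this back to the three sets from before, here's a small check using the makeBag and Jaccard functions above. The "miny" transcription and the Sinatra fragment are just for illustration:

eenie = makeBag("Eenie, meenie, miney, mo Catch a bad chick by her toe If she holla If, if, if she holla, let her go")
# The transcriber who thought "miney" should be spelled "miny"
eenie_typo = (eenie - {"miney"}) | {"miny"}
sinatra = makeBag("Fly me to the moon Let me play among the stars")

print(Jaccard(eenie, eenie_typo))  # about 0.12 -- the bags share 15 of their 17 distinct words
print(Jaccard(eenie, sinatra))     # about 0.96 -- almost nothing in common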
Now we have a distance function. To find all the duplicate songs, we just run it on every pair of songs that we have. With two million songs, that's only, umm, about two trillion comparisons. If we can do 10,000 per second we could be done in about six years. Maybe we could split it up, use some cloud instances, pay a few thousand dollars for compute time and be done in a day.
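The back-of-the-envelope arithmetic, in case you want to check it:

n = 2000000
pairs = n * (n - 1) // 2                  # every unordered pair of songs
rate = 10000.0                            # Jaccard comparisons per second, as assumed above
years = pairs / rate / (3600 * 24 * 365)
print(pairs)  # 1999999000000 -- about two trillion
print(years)  # roughly 6.3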
Or we could use algorithms.
Time to get LSH'd
I have two little girls and coincidentally, they have approximately two million individual socks, with no two pairs alike. Yet it doesn't take me three years to sort them, because I use a locality sensitive hash.
I take a sock, and if it's pinkish, I put it in the top left. Purple goes in the top right, and colours in the middle go in between. Blues go on the bottom, greens have their own spot. By the time I run out of socks to sort, the pairs are near each other on the carpet. Then it's a simple matter to join them together.
Actually, over the years, I have further refined the system, because "pinkish" is ambiguous. Children's socks are a mix of shapes, eyes, cats & dinosaurs of all colours. If the sock has any blue at all, no matter how small, it goes top left. Otherwise, if it has any red, other colours notwithstanding, it goes bottom left. Otherwise, if it has any green whatsoever, top right. Otherwise, bottom right.
This is known as:
MinHash
Now let's travel to 1997. Titanic and The Full Monty are in theaters. Some people pay to see the film 9 or 10 times. (Titanic, I mean) This is unsurprising because the only thing on TV is the OJ Simpson trial. On the WWW, then known as the World Wide Wait, AltaVista is one of the top search engines for finding the status of the Trojan Room Coffee Pot.
Computer scientist Andrei Broder, who has been with AltaVista from near the beginning, is working on the duplicates problem. As the web expands, a lot of the search results that come up are duplicates of other pages. For search, this is Very Annoying. Broder devises a way of quickly searching these millions of pages for duplicates.
MinHash is a function that reduces a text document to a single number. Documents that share many of the same words get numbers that are near each other.
How is this done?
Suppose you build a dictionary of all the words that could possibly occur in your documents, and you number them.
0 aardvark
1 abacus
2 abacuses
3 abaft
4 abalone
5 abandoned
...

The minhash would take this dictionary, and take your document, and assign it the number of the minimum word that occurs. That's it.
So if your document is "The aardvark abandoned his abacus" then the number assigned would be 0 (because aardvark is the zero'th word in the dictionary). In fact, every document that talks about an aardvark would hash to 0.
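In code, the single-dictionary version is tiny. A minimal sketch, with a toy six-word dictionary (the names word_index and minhash_one are mine, not from any library):

word_index = {"aardvark": 0, "abacus": 1, "abacuses": 2,
              "abaft": 3, "abalone": 4, "abandoned": 5}

def minhash_one(document):
    # The number of the lowest-numbered dictionary word that occurs in the document
    return min(word_index[w] for w in document if w in word_index)

print(minhash_one({"the", "aardvark", "abandoned", "his", "abacus"}))  # 0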
But what if, by chance, there is a document that is similar to our aardvark text but misspells it? Then it would hash to some other number entirely.
To guard against this, we actually take several random permutations of the dictionary and average the minhashes we get from each of them.
Dictionary 1    | Dictionary 2    | Dictionary 3
0 abacus        | 0 abalone       | 0 abacus
1 abalone       | 1 abacus        | 1 abaft
2 abacuses      | 2 abacuses      | 2 abacuses
3 aardvark      | 3 abandoned     | 3 abalone
4 abaft         | 4 aardvark      | 4 abandoned
5 abandoned     | 5 abaft         | 5 aardvark
- Document: "The aardvark abandoned his abacus"
- Minhash under dictionary 1: 0
- Minhash under dictionary 2: 1
- Minhash under dictionary 3: 0
- Combined minhash: (0 + 1 + 0) / 3 = 0.333333333
import random

def MinHash(corpus, k=5):
    # Map from each word to its position in each of the k shuffled dictionaries
    words = {}
    for word in corpus:
        words[word] = []
    for i in range(k):
        shuffled = list(corpus)
        random.shuffle(shuffled)
        for j in range(len(shuffled)):
            words[shuffled[j]].append(j)

    def hash(document):
        # For each of the k hash functions, find the lowest-valued word in the
        # document, then average: sum of min(h_i(w) for w in document) over i, divided by k
        vals = [-1] * k
        for word in document:
            if word in words:
                m = words[word]
                for i in range(k):
                    if vals[i] == -1 or m[i] < vals[i]:
                        vals[i] = m[i]
        return float(sum(vals)) / k

    return hash
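A quick check of the hash function on some toy documents (the corpus and documents here are made up):

corpus = ["the", "aardvark", "abandoned", "his", "abacus", "abacuses",
          "abaft", "abalone", "fly", "me", "to", "moon"]
h = MinHash(corpus, k=5)

doc1 = {"the", "aardvark", "abandoned", "his", "abacus"}
doc2 = {"the", "aardvark", "abandoned", "his", "abacuses"}  # one word changed
doc3 = {"fly", "me", "to", "the", "moon"}

print(h(doc1), h(doc2))  # usually close to each other
print(h(doc3))           # usually further from the other two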
Finally
Using MinHash, you can mark duplicates in three passes through the data, and a sort. (A sketch of the whole pipeline in code follows this list.)

- In the first pass, build a dictionary of all the words that occur, and use it to create the hash function.
- In the second pass, compute the minhash of each document.
- Sort the documents by their minhash (if you can afford to do so) or place them into buckets. In either case, documents that are similar will theoretically be close together.
- Finally, go through the list (if sorted) or nearby buckets, and compare documents within a certain window using a more refined comparison function, such as Jaccard distance. Anything that is close enough to being the same is a duplicate.
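Here is a minimal sketch of that pipeline, assuming the makeBag, Jaccard, and MinHash functions from above. The window size and the 0.3 distance threshold are numbers I picked for illustration, not tuned values:

def find_duplicates(songs, k=5, window=50, threshold=0.3):
    # Pass 1: collect every word that occurs and build the hash function
    corpus = set()
    for song in songs:
        corpus |= makeBag(song)
    h = MinHash(list(corpus), k)

    # Pass 2: compute the minhash of each document, and sort by it
    bags = [makeBag(song) for song in songs]
    order = sorted(range(len(songs)), key=lambda i: h(bags[i]))

    # Pass 3: walk the sorted list, comparing each document to its neighbours
    # within the window using the more refined Jaccard distance
    duplicates = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + 1 + window]:
            if Jaccard(bags[i], bags[j]) < threshold:
                duplicates.append((i, j))
    return duplicates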
Oh yeah
I will assume that the most poetic words of all time, in English, are the ones most likely to end a line. After analysis of two million song lyrics, with near duplicates removed, they are: at all, no more, don't know, all alone, right now, all night, at night, far away, like that, let go, too late.

And the number one most poetic phrase in the history of music:
oh oh