Finding Bieber: On removing duplicates from a set of documents
So I have two million song lyrics in a big file. Don't ask me how I got it. The point is that I want to find the most poetic phrase of all time.
Problem is, the origins of this file are so sketchy it would make a Pearls Before Swine cartoon look like a Da Vinci. There could well be thousands of copies of Justin Bieber's Eenie Meenie, all frenetically typed in by a horde of snapchatting teenagers using mom's Windows Vista laptop with the missing shift key.
I don't want my analysis to include the copies and covers of the same song. So I have two problems to solve:
- How can we know whether two songs are actually the same?
- And how can we do this quickly, over the whole collection?
But first
When dealing with text, or images, or sound files, or whatever kind of media tickles your pickle, we want to transform them into numbers that computers can use. We turn them into feature vectors, or what Ph.D.-toting natural language processing experts call them when they really want to get technical for a stodgy old formal publication -- the kind that will put them on the tenure track -- and want the most technically precise phrase available: "bags of words". I am not making this up.

Let's say we had this paragon of poesy:
Eenie, meenie, miney, mo Catch a bad chick by her toe If she holla If, if, if she holla, let her go
A bag of words is a set of words and their counts. The ordering of the words is lost, to simplify things; order is rarely important for this kind of analysis anyway.
{a: 1, bad: 1, by: 1, catch: 1, chick: 1, eenie: 1, go: 1, her: 2, holla: 2, if: 4, let: 1, meenie: 1, miney: 1, mo: 1, she: 2, toe: 1}
We could go even simpler and remove the counts, if we feel they aren't important.
{a, bad, by, catch, chick, eenie, go, her, holla, if, let, meenie, miney, mo, she, toe}
As we process each document from the database, the first thing we do is turn it into a bag of words. In Python it's a one-liner.
def makeBag(song): return set(song.lower().replace(",", " ").split())
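A quick sanity check, assuming the makeBag one-liner above (Counter from the standard library is only needed if you also want the counted version):

from collections import Counter

lyric = "Eenie, meenie, miney, mo Catch a bad chick by her toe If she holla If, if, if she holla, let her go"

# The set version, using the one-liner above
print(makeBag(lyric))

# If the counts matter, Counter does the bookkeeping for us
print(Counter(lyric.lower().replace(",", " ").split()))
# Counter({'if': 4, 'she': 2, 'her': 2, 'holla': 2, 'eenie': 1, ...})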
Comparing two bags
Let's say we had three sets. One is the song above. In the second, the teenaged transcriber thought "miney" should be spelled "miny". The third is Frank Sinatra's Fly Me to the Moon. We would like a distance function, so that if two songs differ by only one word, the distance is small, and if they are completely different, the distance is large.

To find the answer, we have to travel to 1907, and accompany Swiss professor Paul Jaccard on his trip to the Alps to do some serious botany. He noticed that there were different clusters of plants in different regions, and wondered if these clusters could be used to determine the ecological evolution of the area. He writes:
In order to reply to this question I have considered, in an alpine district of fair size, various natural sub-divisions, presenting, besides numerous resemblances in their ecological conditions (i.e. conditions dependent on soil and climate), a few characteristic differences, and I have sought to determine, by comparison, the influence of these resemblances and differences on the composition of flora.

He counted all of the different plants in different regions, and came up with this formula to compare how similar two different regions are:
Number of species common to the two districts / total number of species in the two districts.
This gives 0 if the sets share no common elements, and 1 if they are the same. But that's the opposite of what we need, so we subtract it from one to obtain a distance function.
def Jaccard(A, B):
    intersection = len(A & B)
    union = len(A | B)
    return 1.0 - float(intersection) / union
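To tie this back to the three sets from before, here's a small check using the makeBag and Jaccard functions above. The "miny" transcription and the Sinatra fragment are just for illustration:

eenie = makeBag("Eenie, meenie, miney, mo Catch a bad chick by her toe If she holla If, if, if she holla, let her go")
# The transcriber who thought "miney" should be spelled "miny"
eenie_typo = (eenie - {"miney"}) | {"miny"}
sinatra = makeBag("Fly me to the moon Let me play among the stars")

print(Jaccard(eenie, eenie_typo))  # about 0.12 -- the bags share 15 of their 17 distinct words
print(Jaccard(eenie, sinatra))     # about 0.96 -- almost nothing in common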
Now we have a distance function. To find all the duplicate songs, we just run it on every pair of songs that we have. With two million songs, that's only, umm, about two trillion comparisons. If we can do 10,000 per second we could be done in about six years. Maybe we could split it up, use some cloud instances, pay a few thousand dollars for compute time and be done in a day.
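The back-of-the-envelope arithmetic, in case you want to check it:

n = 2000000
pairs = n * (n - 1) // 2                  # every unordered pair of songs
rate = 10000.0                            # Jaccard comparisons per second, as assumed above
years = pairs / rate / (3600 * 24 * 365)
print(pairs)  # 1999999000000 -- about two trillion
print(years)  # roughly 6.3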
Or we could use algorithms.
Time to get LSH'd
I have two little girls and coincidentally, they have approximately two million individual socks, with no two pairs alike. Yet it doesn't take me three years to sort them, because I use a locality sensitive hash.
I take a sock, and if it's pinkish, I put it in the top left. Purple goes in the top right, and colours in the middle go in between. Blues go on the bottom, greens have their own spot. By the time I run out of socks to sort, the pairs are near each other on the carpet. Then it's a simple matter to join them together.
Actually, over the years, I have further refined the system, because "pinkish" is ambiguous. Children's socks are a mix of shapes, eyes, cats & dinosaurs of all colours. If the sock has any blue at all, no matter how small, it goes top left. Otherwise, if it has any red, other colours notwithstanding, it goes bottom left. Otherwise, if it has any green whatsoever, top right. Otherwise, bottom right.
This is known as:
MinHash
Now let's travel to 1997. Titanic and The Full Monty are in theaters. Some people pay to see the film 9 or 10 times. (Titanic, I mean) This is unsurprising because the only thing on TV is the OJ Simpson trial. On the WWW, then known as the World Wide Wait, AltaVista is one of the top search engines for finding the status of the Trojan Room Coffee Pot.
Computer scientist Andrei Broder, who has been with AltaVista from near the beginning, is working on the duplicates problem. As the web expands, a lot of the search results that come up are duplicates of other pages. For search, this is Very Annoying. Broder devises a way of quickly searching these millions of pages for duplicates.
MinHash is a function that reduces a text document to a single number. Documents that share many of the same words get numbers that are near each other.
How is this done?
Suppose you build a dictionary of all the words that could possibly occur in your documents, and you number them.
0 aardvark
1 abacus
2 abacuses
3 abaft
4 abalone
5 abandoned
...

The minhash would take this dictionary, and take your document, and assign it the number of the minimum word that occurs. That's it.
So if your document is "The aardvark abandoned his abacus" then the number assigned would be 0 (because aardvark is the zero'th word in the dictionary). In fact, every document that talks about an aardvark would hash to 0.
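In code, the single-dictionary version is tiny. A minimal sketch, with a toy six-word dictionary (the names word_index and minhash_one are mine, not from any library):

word_index = {"aardvark": 0, "abacus": 1, "abacuses": 2,
              "abaft": 3, "abalone": 4, "abandoned": 5}

def minhash_one(document):
    # The number of the lowest-numbered dictionary word that occurs in the document
    return min(word_index[w] for w in document if w in word_index)

print(minhash_one({"the", "aardvark", "abandoned", "his", "abacus"}))  # 0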
But what if, by chance, there is a document that is similar to our aardvark text but misspells it? Then it would hash to some other number entirely.
To guard against this, we actually take several random permutations of the dictionary and average the minhashes we get from each of them.
Dictionary 1    | Dictionary 2    | Dictionary 3
0 abacus        | 0 abalone       | 0 abacus
1 abalone       | 1 abacus        | 1 abaft
2 abacuses      | 2 abacuses      | 2 abacuses
3 aardvark      | 3 abandoned     | 3 abalone
4 abaft         | 4 aardvark      | 4 abandoned
5 abandoned     | 5 abaft         | 5 aardvark
- Document: "The aardvark abandoned his abacus"
- Minhash under dictionary 1: 0
- Minhash under dictionary 2: 1
- Minhash under dictionary 3: 0
- Combined minhash: (0 + 1 + 0) / 3 = 0.333333333
import random

def MinHash(corpus, k=5):
    # Map from each word to its position in each of the k shuffled dictionaries
    words = {}
    for word in corpus:
        words[word] = []
    for i in range(k):
        shuffled = list(corpus)
        random.shuffle(shuffled)
        for j in range(len(shuffled)):
            words[shuffled[j]].append(j)

    def hash(document):
        # For each of the k hash functions, find the lowest-valued word in the
        # document, then average: sum of min(h_i(w) for w in document) over i, divided by k
        vals = [-1] * k
        for word in document:
            if word in words:
                m = words[word]
                for i in range(k):
                    if vals[i] == -1 or m[i] < vals[i]:
                        vals[i] = m[i]
        return float(sum(vals)) / k

    return hash
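A quick check of the hash function on some toy documents (the corpus and documents here are made up):

corpus = ["the", "aardvark", "abandoned", "his", "abacus", "abacuses",
          "abaft", "abalone", "fly", "me", "to", "moon"]
h = MinHash(corpus, k=5)

doc1 = {"the", "aardvark", "abandoned", "his", "abacus"}
doc2 = {"the", "aardvark", "abandoned", "his", "abacuses"}  # one word changed
doc3 = {"fly", "me", "to", "the", "moon"}

print(h(doc1), h(doc2))  # usually close to each other
print(h(doc3))           # usually further from the other two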
Finally
Using MinHash, you can mark duplicates in three passes through the data, and a sort. (A sketch of the whole pipeline in code follows this list.)

- In the first pass, build a dictionary of all the words that occur, and use it to create the hash function.
- In the second pass, compute the minhash of each document.
- Sort the documents by their minhash (if you can afford to do so) or place them into buckets. In either case, documents that are similar will theoretically be close together.
- Finally, go through the list (if sorted) or nearby buckets, and compare documents within a certain window using a more refined comparison function, such as Jaccard distance. Anything that is close enough to being the same is a duplicate.
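Here is a minimal sketch of that pipeline, assuming the makeBag, Jaccard, and MinHash functions from above. The window size and the 0.3 distance threshold are numbers I picked for illustration, not tuned values:

def find_duplicates(songs, k=5, window=50, threshold=0.3):
    # Pass 1: collect every word that occurs and build the hash function
    corpus = set()
    for song in songs:
        corpus |= makeBag(song)
    h = MinHash(list(corpus), k)

    # Pass 2: compute the minhash of each document, and sort by it
    bags = [makeBag(song) for song in songs]
    order = sorted(range(len(songs)), key=lambda i: h(bags[i]))

    # Pass 3: walk the sorted list, comparing each document to its neighbours
    # within the window using the more refined Jaccard distance
    duplicates = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + 1 + window]:
            if Jaccard(bags[i], bags[j]) < threshold:
                duplicates.append((i, j))
    return duplicates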
Oh yeah
I will assume that the most poetic words of all time, in English, are the ones most likely to end a line. After analysis of two million song lyrics, with near duplicates removed, they are: at all, no more, don't know, all alone, right now, all night, at night, far away, like that, let go, too late.

And the number one most poetic phrase in the history of music:
oh oh