Finding Bieber: On removing duplicates from a set of documents
Using a locality sensitive hash, you can mark duplicates in millions of items in no time.

"I tried to book the meeting room but it's booked all week for the layoffs," said James.
Times were bleak at BlackBerry. Last Friday 11% of the workforce, over two thousand people, were laid off. On that day I had passed a red-faced man being escorted out, flanked by a security goon and a serious-looking woman from Organizational Development.
On the bright side, our team was -- mostly -- still there, and there were plenty of spare monitors lying around. I grabbed a couple and now I had three. There was a lot of room on my desk -- I had grown tired of bringing all my photos and desk toys back and forth every Thursday just in case I was fired. Now I just kept them at home, still packed in a bag hanging off a bicycle hook in the garage.
"But there aren't any layoffs happening now, are there?" I mused. "I mean, maybe if you were on vacation or something last Friday, but they wouldn't need the meeting room all week."
Chris said, "I checked the room and there's just this guy there. He comes in every morning and just sits there reading a novel."
When our meeting was finished I stopped by the Marconi room. Inside sat a bearded man, alone, reading a novel. His badge had the prominent red escort-required visitor's stripe. I hurried off before he could see me.
Sure enough, each day that week as I walked by this room, he was there, just reading. And each day he would leave by 5.
A couple of weeks later we were finally able to book the room. As we flung our BlackBerrys and notebooks on the table, Chris was finally able to resolve the mystery.
"I found out who that guy was."
The mystery had been solved!
Chris continued. "He's a counsellor the company brought in to help us deal with the layoffs."
"How'd you find out?" I asked. I had been reading all of the email announcements and hadn't seen anything about this.
"I asked him."
For a micro-ISV, selling to businesses can be more lucrative than selling to consumers. Instead of making a few dollars per sale and hoping for thousands of sales, you sell to only a few customers, and charge much higher rates. But the rates are high for a reason. It takes more time and money to sell to businesses.
In software, the simplest things can turn into a nightmare, especially at a large company.
Let's say you have millions of pictures of faces tagged with names. Given a new photo, how do you find the name of the person that the photo most resembles?
In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in high-dimensional space. When you look at points in space, they tend to form clusters, and you can infer a lot about a point by looking at the ones near it.
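To make the "points in space" idea concrete, here is a minimal sketch of the most direct approach: compare the new record against every stored one and keep the closest. The function name `nearest_name`, the three-element vectors, and the names in the toy data are all made up for illustration, and the choice of Euclidean distance is an assumption. A real photo would have thousands of elements.

```python
import math

def nearest_name(records, query):
    """Return the name whose vector is closest to `query`.

    `records` is a list of (name, vector) pairs. Distance is plain
    Euclidean distance, which is an assumption made for this sketch.
    Returns None if `records` is empty.
    """
    best_name, best_dist = None, math.inf
    for name, vector in records:
        dist = math.dist(vector, query)  # requires Python 3.8+
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Hypothetical toy data: each "photo" is reduced to three numbers.
records = [
    ("alice", [0.9, 0.1, 0.3]),
    ("bob",   [0.2, 0.8, 0.5]),
    ("carol", [0.4, 0.4, 0.9]),
]

print(nearest_name(records, [0.25, 0.75, 0.5]))  # prints "bob"
```

This scans every record for every query, so it slows down linearly as the collection grows. That is fine for a few thousand records and painful for millions.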