Finding Bieber: On removing duplicates from a set of documents

"I tried to book the meeting room but it's booked all week for the layoffs," said James.
Times were bleak at BlackBerry. Last Friday 11% of the workforce, over two thousand people, were laid off. On that day I had passed a red-faced man being escorted out, flanked by a security goon and a serious-looking woman from Organizational Development.
On the bright side, our team was -- mostly -- still there, and there were plenty of spare monitors lying around. I grabbed a couple and now I had three. There was a lot of room on my desk -- I had grown tired of bringing all my photos and desk toys back and forth every Thursday just in case I was fired. Now I just kept them at home, still packed in a bag hanging off a bicycle hook in the garage.
"But there aren't any layoffs happening now, are there?" I mused. "I mean, maybe if you were on vacation or something last Friday, but they wouldn't need the meeting room all week."
Chris said, "I checked the room and there's just this guy there. He comes in every morning and just sits there reading a novel."
When our meeting was finished I stopped by the Marconi room. Inside sat a bearded man, alone, reading a novel. His badge had the prominent red escort-required visitor's stripe. I hurried off before he could see me.
Sure enough, each day that week as I walked by this room, he was there, just reading. And each day he would leave by 5.
A couple of weeks later we were finally able to book the room. As we flung our BlackBerrys and notebooks on the table, Chris was finally able to resolve the mystery.
"I found out who that guy was."
The mystery had been solved!
Chris continued. "He's a counsellor the company brought in to help us deal with the layoffs."
"How'd you find out?" I asked. I had been reading all of the email announcements and hadn't seen anything about this.
"I asked him."
In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in a high-dimensional space. When you look at points in space, they tend to form clusters, and you can infer a lot about a point by looking at the ones near it.
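
To make the "points in space" idea concrete, here is a minimal Python sketch that treats each record as a list of numbers and ranks the other records by straight-line (Euclidean) distance. The record names and the four-element vectors are made up for illustration, and comparing against every record like this is only an assumption about the simplest possible approach; with thousands of elements per record and many records it would likely be too slow.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of elements."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbours(query, records, k=2):
    """Return the names of the k records closest to the query point.

    `records` maps a (hypothetical) record name to its vector of elements.
    This compares the query against every record, one at a time.
    """
    ranked = sorted(records, key=lambda name: euclidean_distance(query, records[name]))
    return ranked[:k]

# Toy records: real ones would have hundreds or thousands of elements each.
records = {
    "photo_a": [0.9, 0.1, 0.8, 0.2],
    "photo_b": [0.8, 0.2, 0.9, 0.1],
    "photo_c": [0.1, 0.9, 0.2, 0.8],
}

query = [0.9, 0.1, 0.7, 0.3]
closest = nearest_neighbours(query, records, k=2)
print(closest)
```

In this toy data photo_a and photo_b sit close together while photo_c is far from both, so the query, which resembles photo_a, ranks photo_a first and photo_b second.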