barcamp (comic)

You might want to visit DialAbc.com, which has more results. Stay here if you are interested in the theory behind it.
This article was actually written in 2002. Here, I explain a technique for figuring out which words are in which phone numbers. Full C source code is included.
Computers are wonderful things. We take for granted that they can do stuff like the above in microseconds, but to non-computer scientists it seems like magic. If you've ever wondered what computer scientists do all day, this will help.
I developed the algorithm for spellophone over Christmas break 2001. It wasn't easy -- I went through three different drafts until I found the right one. Would you believe that the first one took over ten minutes to run on a ten-digit number on my old Pentium 166?
The first algorithm I tried is known as the "brute force" approach. I had the computer go through every possible combination of letters and check it for words. I thought I had a clever way of doing that because I used a trie to store all of the words in the dictionary. A trie is like a very smart parrot. You can shout letters at it, and it will squawk if anything you say forms a word. For example, if you say "D-O-G" at it, it will squawk because DOG is a word. If you then say "M-A" it will squawk again because DOGMA is also a word.
Despite the clever use of the trie, the program still ran too slowly. Can you guess why? Think about how many possible letters are in a telephone number.
If you look at a phone keypad, you'll see that each digit has three letters on it. Some of them have four on newer phones. So you can only make three one-letter words with one digit. But if you add a second digit, say "22", you can spell "AA, AB, AC, BA, BB, BC, CA, CB, CC", or 9 words. That's quite a bit more than one! It turns out that if you add 10 digits, there are 1049760 possible combinations of letters. And for each possible combination that spells actual words, there are lots of different ways those words can be placed together between dashes. So it turns out that the computer might have to go through millions of combinations, trying to pick out the ones that make sense. Unfortunately, even today's computers can't process all of those words fast enough. So I had to find a better way.
Then something struck me. Last year, I had learned of something that was meant to deal with just this type of problem in one of my computer science courses. It's called "Dynamic Programming", or DP for short. Dynamic Programming is a way of programming that solves problems a little bit at a time. It's best for problems where each piece of the the problem is built on the previous one, so once you solve all of these little pieces you can put them together and solve the entire puzzle. If I could figure out a way to make DP work, then the computer would only need to check each word once, and the program might work in seconds instead of minutes!
I racked my brain, trying to remember what Professor Chan had said in his thick accent. With DP, you have to make a grid of squares, and each square represents a small part of the solution. After a lot of thinking, I drew a grid on paper. Across the top, I wrote Starting Position, and along the side, I wrote Length of word. Each square would contain all of the words that you could spell if you started on a certain digit and used a certain number of letters.
My hands trembled as I stepped through the algorithm on paper. I used the phone number "78225" because I knew it spelled "QUACK." (I used to work for a company called Quack.com and that was part of their number). Here's what I came up with. The partial words are in normal printing, and the finished words in each square are in bold:
Starting digit | |||||
---|---|---|---|---|---|
Length | 7 (PQRS) | 8 (TUV) | 2 (ABC) | 2 (ABC) | 5 (JKL) |
P, Q, R, S | T, U, V | A, B, C | A, B, C | J, K, L | |
PU, QU, RU, ST, SU | TA, UB, VA | AA, AB, AC, BA, CA | AL, BL, CL | . | |
PUB, PUC, QUA, RUB, STA, SUC | TAB, TAC, VAC | ACK, BAL, CAJ, CAK, CAL | . | . | |
PUBA, QUAC, RUBA, RUBB, STAB, STAC, SUCC | TACK | . | . | . | |
QUACK, STABL, STACK | . | . | . | . |
What is good about this is the table could be calculated very quickly. Each square builds on the data that was already processed in the square above. I had done it in five minutes on paper -- the computer could do it in the blink of an eye. Also, it is pretty easy to string all of the words together so that the longest ones appear first.
Dynamic Programming always involves two steps -- first creating the solution table and then analyzing it to piece the solution together. The way I chose to piece the solution together is pretty simple:
It's harder to explain than to do. Using the above data, you get the following words in this order: QUACK, STACK, STAB-5, PUBA-5, PUB-25, 7-TACK, 78-AB-5
dialabc.com is the nicest web site that I have seen to search for words in phone numbers.
In 2003 I used to have my phone number on my web page. "Only women are afraid to have their phone number on their web page!" I thought. How naive I was.
Then, out of the blue, I got a phone call frome someone in, I don't know, Alamaba? This guy woke me up at like, 10 in the morning (at this time I was a jobless bum and slept all day). He wanted to know how to adapt it to run on Windows.
I would have been happy to tell him, except that I only had a cell phone, and in Canada I was being charged for long distance incoming calls. So I just gave him some vague answers about makefiles and /usr/dict/words and quickly got this guy off the line.
So the motto is, don't post your phone number on your web page: otherwise, weird people call you and they might wake you up.
In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in high dimensional space. When you look at a points in space, they tend to form clusters, and you can infer a lot by looking at ones nearby.