What does your phone number spell?
You might want to visit DialAbc.com, which has more results. Stay here if you are interested in the theory behind it.
This article was actually written in 2002. Here, I explain a technique for figuring out which words are in which phone numbers. Full C source code is included.
How did the computer do that?
Quick answer:
- TrieStore.h
- TrieStore.c
- main.c
- Compile using gcc -o spellophone *.c on unix/linux. A competent C programmaer can adapt it to run on Windows in about 5 minutes.
Long answer:
Computers are wonderful things. We take for granted that they can do stuff like the above in microseconds, but to non-computer scientists it seems like magic. If you've ever wondered what computer scientists do all day, this will help.
I developed the algorithm for spellophone over Christmas break 2001. It wasn't easy -- I went through three different drafts until I found the right one. Would you believe that the first one took over ten minutes to run on a ten-digit number on my old Pentium 166?
The first algorithm I tried is known as the "brute force" approach. I had the computer go through every possible combination of letters and check it for words. I thought I had a clever way of doing that because I used a trie to store all of the words in the dictionary. A trie is like a very smart parrot. You can shout letters at it, and it will squawk if anything you say forms a word. For example, if you say "D-O-G" at it, it will squawk because DOG is a word. If you then say "M-A" it will squawk again because DOGMA is also a word.
Despite the clever use of the trie, the program still ran too slowly. Can you guess why? Think about how many possible letters are in a telephone number.
If you look at a phone keypad, you'll see that each digit has three letters on it. Some of them have four on newer phones. So you can only make three one-letter words with one digit. But if you add a second digit, say "22", you can spell "AA, AB, AC, BA, BB, BC, CA, CB, CC", or 9 words. That's quite a bit more than one! It turns out that if you add 10 digits, there are 1049760 possible combinations of letters. And for each possible combination that spells actual words, there are lots of different ways those words can be placed together between dashes. So it turns out that the computer might have to go through millions of combinations, trying to pick out the ones that make sense. Unfortunately, even today's computers can't process all of those words fast enough. So I had to find a better way.
Dynamic Programming
I looked at what the computer was doing, and it seemed to me that it was wasting a lot of its time asking the trie about the same words over and over again. For example, if the last three digits in phone number didn't actually spell anything, it would still check them thousands of times in combination with the first part of the phone number.Then something struck me. Last year, I had learned of something that was meant to deal with just this type of problem in one of my computer science courses. It's called "Dynamic Programming", or DP for short. Dynamic Programming is a way of programming that solves problems a little bit at a time. It's best for problems where each piece of the the problem is built on the previous one, so once you solve all of these little pieces you can put them together and solve the entire puzzle. If I could figure out a way to make DP work, then the computer would only need to check each word once, and the program might work in seconds instead of minutes!
I racked my brain, trying to remember what Professor Chan had said in his thick accent. With DP, you have to make a grid of squares, and each square represents a small part of the solution. After a lot of thinking, I drew a grid on paper. Across the top, I wrote Starting Position, and along the side, I wrote Length of word. Each square would contain all of the words that you could spell if you started on a certain digit and used a certain number of letters.
My hands trembled as I stepped through the algorithm on paper. I used the phone number "78225" because I knew it spelled "QUACK." (I used to work for a company called Quack.com and that was part of their number). Here's what I came up with. The partial words are in normal printing, and the finished words in each square are in bold:
Starting digit | |||||
---|---|---|---|---|---|
Length | 7 (PQRS) | 8 (TUV) | 2 (ABC) | 2 (ABC) | 5 (JKL) |
P, Q, R, S | T, U, V | A, B, C | A, B, C | J, K, L | |
PU, QU, RU, ST, SU | TA, UB, VA | AA, AB, AC, BA, CA | AL, BL, CL | . | |
PUB, PUC, QUA, RUB, STA, SUC | TAB, TAC, VAC | ACK, BAL, CAJ, CAK, CAL | . | . | |
PUBA, QUAC, RUBA, RUBB, STAB, STAC, SUCC | TACK | . | . | . | |
QUACK, STABL, STACK | . | . | . | . |
What is good about this is the table could be calculated very quickly. Each square builds on the data that was already processed in the square above. I had done it in five minutes on paper -- the computer could do it in the blink of an eye. Also, it is pretty easy to string all of the words together so that the longest ones appear first.
Dynamic Programming always involves two steps -- first creating the solution table and then analyzing it to piece the solution together. The way I chose to piece the solution together is pretty simple:
- Begin at the leftmost column that you haven't worked on yet.
- Get a word from the bottom-most square.
- Now put that word in your phone number, and go to the digits that are now left-over at the end. Start at step 1.
- Once you run out of words, try the next one from the square you chose in step 2 before you went off to the left-over digits. If there are no more words left in the square, continue up-wards. Repeat step 3.
- Once you have exhausted the left-most column, get rid of it, add the number instead of a letter, and start over until there are no more columns left in the table.
It's harder to explain than to do. Using the above data, you get the following words in this order: QUACK, STACK, STAB-5, PUBA-5, PUB-25, 7-TACK, 78-AB-5
Other work
There are quite a few web sites that do this kind of thing. Phonespell.org has a good description of how their algorithm works, in the F.A.Q. section.dialabc.com is the nicest web site that I have seen to search for words in phone numbers.
In 2003 I used to have my phone number on my web page. "Only women are afraid to have their phone number on their web page!" I thought. How naive I was.
Then, out of the blue, I got a phone call frome someone in, I don't know, Alamaba? This guy woke me up at like, 10 in the morning (at this time I was a jobless bum and slept all day). He wanted to know how to adapt it to run on Windows.
I would have been happy to tell him, except that I only had a cell phone, and in Canada I was being charged for long distance incoming calls. So I just gave him some vague answers about makefiles and /usr/dict/words and quickly got this guy off the line.
So the motto is, don't post your phone number on your web page: otherwise, weird people call you and they might wake you up.
VP trees: A data structure for finding stuff fast
Let's say you have millions of pictures of faces tagged with names. Given a new photo, how do you find the name of person that the photo most resembles?In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in high dimensional space. When you look at a points in space, they tend to form clusters, and you can infer a lot by looking at ones nearby.