Steve Hanov's Programming Blog

I found Security Vulnerability in your web application

2022-01-28T10:54:57-05:00

I found Security Vulnerability in your web application. For security purpose can we report vulnerability here,then will i get bounty reward in PayPal or Bitcoin for Security bug ?

Is it just me, or are security consultants swarming web sites, looking for bugs unasked, and emailing you demanding thousands of dollars in bounties for security flaws? If you search the web for the exact email above, you will get hundreds of hits. Whether the cost is to an unsolicited consultant, or due to a data breach, flaws in your product can be pricey. In this page, I'm going to write down some basic things to check in your website to make attacks harder, so at least you're not being shaken down for stupid mistakes.

I have ceased any bounty payments, because the things they generally find are not very likely to happen. I have to make a business decision: Do I want a perfect web application, or can I live with a few possible attacks? I am not Apple. I am not a bank. This is going to sound foolish, but at the end of the day I simply cannot afford to pay out half of my revenue each and every month for these things, because I still have to provide for my family.

Still, here is a list of the things I have learned. I am sure there is a larger list somewhere, but I have not found it.

Don't put sensitive information in URLs

When a user is logged in, what is in the URL at the top of the screen? Is there an identifier in there that should not be shared?

If the user pasted any URL on your site, while logged in, to Twitter, could anybody get access to something they should not? Could other users edit this user's files?
If a script on your page that you have no control over reads window.location and sends it back to its creator, is this a problem?

When possible, store sensitive identifiers in cookies that cannot be accessed in Javascript. More importantly, always verify on the server that the user has access to the data that he or she is requesting.
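
Here is a minimal sketch of that server-side check. The `DOCUMENT_OWNERS` table, the `Forbidden` exception and the `load_document` helper are all hypothetical names made up for this example, not part of any framework. The point is only that the owner is compared against the user from the session before anything is returned.

```python
# Hypothetical data: maps a document id to the id of the user who owns it.
DOCUMENT_OWNERS = {"doc-1": "alice", "doc-2": "bob"}


class Forbidden(Exception):
    """Raised when the logged-in user may not access the requested item."""


def load_document(session_user, document_id):
    # The user id comes from the server-side session, never from the URL
    # or the request body, so the caller cannot swap it.
    owner = DOCUMENT_OWNERS.get(document_id)
    if owner is None or owner != session_user:
        # Same error for "missing" and "not yours", so ids cannot be probed.
        raise Forbidden(document_id)
    return {"id": document_id, "owner": owner}
```

With a check like this in place, an identifier that leaks through a pasted URL is far less useful to whoever finds it.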

Verify any information you include in generated emails

One consultant found a way to send password reset emails from me to any user, and replace the URL in the email with his own. How did he do this? Because I was lazy.

In my code, I had abstracted the password reset emails into a reusable library. The code needed to know which web site it was resetting the password for when it created the email. So I just included it in the POST request. And this resulted in a costly bounty I had to pay.
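
One way to avoid that mistake, sketched under assumptions: the request may say which site it is for, but it only picks an entry from a list kept on the server. The `RESET_BASE_URLS` table, the example.com addresses and `build_reset_link` are hypothetical names for illustration, not my actual library.

```python
from urllib.parse import urlencode

# Hypothetical allow-list: the only sites this library may send resets for.
RESET_BASE_URLS = {
    "app": "https://app.example.com/reset",
    "docs": "https://docs.example.com/reset",
}


def build_reset_link(site_key, token):
    # site_key may come from the request, but it only selects an entry.
    # The URL itself is never taken from user input.
    base = RESET_BASE_URLS.get(site_key)
    if base is None:
        raise ValueError("unknown site: %r" % (site_key,))
    return base + "?" + urlencode({"token": token})
```

Anything that is not a known key, including a full URL supplied by an attacker, is rejected before an email is generated.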

Rate limit generated emails

On that note, can anyone make you send an unlimited number of emails? Any code that automatically sends an email needs rate limiting.

Prevent password guessing

It's a fact that most people's passwords can be guessed in less than 100 tries. That's why they should use a password manager and create longer and more random passwords.

So one of the first things a consultant will do is figure out your API call to login and run it through their script file of passwords. If they don't get stopped right away, then this will result in a significant bounty.

You need to rate limit login attempts in at least two ways. You should of course limit the attempts for a particular login name. But then the consultant will just cycle through different login names, so you also need to limit attempts based on IP address or other information.
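
A rough sketch of limiting on two keys at once, using a sliding window held in memory. The class name, the limits of 5 and 20 attempts, and the 300 second window are assumptions chosen for illustration, not recommendations. A real deployment would probably keep the counters in shared storage and expire idle keys.

```python
import time
from collections import defaultdict, deque


class AttemptLimiter:
    """Allows at most `limit` attempts per key within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.attempts = defaultdict(deque)  # key -> timestamps of attempts

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        stamps = self.attempts[key]
        # Forget attempts that have aged out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True


# Illustrative limits only.
by_name = AttemptLimiter(limit=5, window=300)
by_ip = AttemptLimiter(limit=20, window=300)


def may_attempt_login(login_name, ip_address, now=None):
    # Evaluate both limiters so each attempt counts against both keys.
    name_ok = by_name.allow(login_name.lower(), now)
    ip_ok = by_ip.allow(ip_address, now)
    return name_ok and ip_ok
```

The same limiter could sit in front of any code that sends email, keyed on the recipient address and the requesting IP.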

Put limits on file uploads

Can you accept an image on your website? Do you resize it on the server? Great, let me dig out my 18 megapixel .png file for you. It's only 1KB so your code will accept it. Without proper checks, it will crash your server.
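
One cheap check, sketched here for PNG only: read the width and height from the IHDR chunk at the start of the file and refuse to decode anything too large. The `MAX_PIXELS` limit and the function names are assumptions for this example. It only looks at the declared dimensions, so limits on upload size and on the decoder itself are still needed.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_PIXELS = 4000 * 4000  # illustrative limit, an assumption


def png_dimensions(data):
    """Returns (width, height) read from the IHDR chunk of a PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are big-endian 32-bit integers at offsets 16 and 20.
    return struct.unpack(">II", data[16:24])


def is_safe_to_decode(data):
    width, height = png_dimensions(data)
    return 0 < width * height <= MAX_PIXELS
```

Only the first 24 bytes of the upload are inspected, so the check costs almost nothing even when the decoded image would be huge.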

Protect "secret" web urls

Do you have any secret web urls in your app for checking its status, or performing administration? Maybe you use Go, and left in the code for /debug/pprof? All secret urls can be easily guessed using automated tools that just try every word or combination of characters and see if they get a 404 error.

Assuming the consultant can list all the secret urls on your web site, can they do anything bad?

UI Redressing

If you allow your web site to be placed into an IFRAME, it opens up a lot of attacks. If the attacker can trick a user into clicking on their url, then they will open up your web site inside their own page, superimpose their login button over yours, and steal the user's passwords.

Of course, anyone can do this without the iframe permissions by creating an exact copy of your site. But the point is to make it a little harder for them. Here is some information on IFRAME breaking.
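
Besides script-based frame breaking, browsers honour two response headers that refuse framing: `X-Frame-Options` and the `frame-ancestors` directive of `Content-Security-Policy`. The helper below is a sketch that assumes your responses carry their headers in a plain dictionary. The function name is made up, and it does not handle a policy that already contains a `frame-ancestors` directive.

```python
def add_anti_framing_headers(headers):
    """Returns a copy of the response headers with framing disabled."""
    result = dict(headers)
    # Older header, still widely honoured by browsers.
    result["X-Frame-Options"] = "DENY"
    # Newer equivalent; 'none' forbids every embedding page.
    directive = "frame-ancestors 'none'"
    existing = result.get("Content-Security-Policy")
    if existing:
        result["Content-Security-Policy"] = existing + "; " + directive
    else:
        result["Content-Security-Policy"] = directive
    return result
```

If some of your pages are meant to be embedded by your own site, `SAMEORIGIN` and `frame-ancestors 'self'` are the less strict variants.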

How to detect if an object has been garbage collected in Javascript

2020-05-25T10:35:28-05:00

If you are writing an application in Javascript, soon you will have to worry about memory leaks. But it is difficult to even know if a memory leak exists. This handy method can help.

WeakMap

At first, you may think that WeakMap will do it. WeakMap/WeakSet will hold onto things for you, but don't prevent an object from being garbage collected. The instant an object is GC'd, it is removed from the WeakMap or WeakSet.

So, the obvious solution is to check if an object is still inside the WeakMap. If it is missing, it has been GC'd. This won't work.

The problem is that WeakMap and WeakSet are designed so that you cannot get at what's inside without already knowing it is there. In order to look up an item, you need to already have that item. These collections don't even have a size property.

To check if an object is inside of a WeakMap, you must already have a reference to it, and therefore you are preventing it from being garbage collected.

So, what good are they? WeakMap is best used to link objects together. For example, if you have a bunch of <img> elements and you want to associate some data with them, you could simply do img.myextraproperty="blah". But your IDE may complain because the HTMLImageElement does not have this property. Instead, you can use WeakMap. If the extra property is a single value true then use a WeakSet.

The Real Solution

Some browsers, including Chrome but not Firefox, have the ability to check the amount of Javascript memory used. So the solution to test if an object is there, is to make it sufficiently large so that it has a noticeable impact on memory.

In the code below, I use a WeakMap to associate a 1 gigabyte object with whatever you pass in. When the object is freed, and the garbage collector is run, you would expect at least 1 GB of memory to be freed as well. That is what the code checks for. The process takes at least 10 seconds, because it seems that Chrome only runs the garbage collector every 10 seconds.

View on CodePen

/** Determines if an object is freed
@param obj is the object of interest
@param freeFn is a function that frees the object.
@returns a promise that resolves to {freed: boolean, memoryDiff:number}

@author Steve Hanov <steve.hanov@gmail.com>
*/
function isObjectFreed(obj, freeFn) {
  return new Promise( (resolve) => {
    if (!performance.memory) {
      throw new Error("Browser not supported.");
    }

    // When obj is GC'd, the large array will also be GCd and the impact will
    // be noticeable.
    const allocSize = 1024*1024*1024;
    const wm = new WeakMap([[obj, new Uint8Array(allocSize)]]);

    // wait for memory counter to update
    setTimeout( () => {
      const before = performance.memory.usedJSHeapSize;

      // Free the memory
      freeFn();

      // wait for GC to run, at least 10 seconds
      setTimeout( () => {
        const diff = before - performance.memory.usedJSHeapSize;
        resolve({ 
          freed: diff >= allocSize, 
          memoryDiff: diff - allocSize
        });
      }, 10000);
    }, 100);
  });
}

let foo = {bar:1};

isObjectFreed(foo, () => foo = null).then( (result) => {
  document.write(`Object GCd:${result.freed}, ${result.memoryDiff} bytes freed`)
}, (error) => {
  document.write(`Error: ${error.message}`)
})

How I use it

I use this method as part of my test suite for Zwibbler, my Javascript drawing app. It has a destroy() method that is supposed to remove all resources. But occasionally, I would have some event listener that was not removed, that would keep a reference to the entire application. So when using it inside something like React or Angular, where it can be repeatedly shown and hidden by the view framework, it is vitally important that resources be completely freed.

My favourite Google Cardboard Apps

2017-02-15T10:46:40-05:00

I have never been a gamer. The most I've played was Super Mario Bros (the original). I then took a break for a decade or so and spent a few weeks with Simcity 4.

Something happened last week. Overnight, I've become addicted to games. The cause was this:

It arrived the day after I ordered it from Google. This is very surprising as I am in Canada.

I now have nearly a hundred games on my old Samsung S5. Here are a few of my favourites.

First, Some Cardboard Basics

It took me a few hours to figure out how to navigate in various games. I'll save you the trouble. There are three methods of control.

Stare-to-click

The most frequently used is the stare. Stare at a button on the screen, and a circular indicator will count down for a second or so. Keep looking until it runs out to click.

Stare and click

An alternative method of control is a button on the cardboard unit itself. The original Cardboard has a slider on the left side that moves a magnet. However, the newly redesigned units have a button on the top right that you can press. It moves a lever that taps the top of the screen. (As a side effect, if somebody sends you a Facebook message while you are playing, it will send them a thumbs up.)

Walk in place

A minority of games require you to walk in place to move. Simply bounce your head up and down slightly and you will travel in the direction that you are looking.

Other control methods

You don't need any other controller. Only a couple of games require a Bluetooth or USB gamepad to move. One game studio uses the device's camera and a large QR code that you point in front of you to move around. However, it falls back to the point-and-stare method.

How to sit

It is best to play in a darkened room on a swivel chair. This lets you look around without getting too tired or bumping into things. Also, my phone tends to drift slowly to the right, so I can keep spinning to keep up.

Don't forget to wear headphones. The sound is 3D too, so your ears will tell you where things are happening.

Terrifying experiences

Quite a few games consist of wandering around a haunted house. However, the act of walking makes gameplay difficult. That's why I like these two games. You sit in one place and stuff happens around you. Remember to keep looking around, because the ghouls will patiently wait for you to look away before terrorizing you from behind.

In Chair in a room you can select from two stories. They each take about 15 minutes to play. You are fixed in one spot, sitting on a chair in the dark. You must gradually piece together the story from what is happening around you.

Sisters includes two experiences as well. In the Blair Witch trailer, you are standing outside an abandoned house in the forest and stuff happens. When you have completed it, play Sisters. You are sitting on a couch during a storm, and two creepy dolls sit on the mantelpiece. The power goes out. You're on your own.

Wandering around

Alien Apartment sets itself apart by its remarkable attention to detail. Using a unique control method -- tilt your head to walk, and tilt again to stop -- wander around a neat, modern apartment. The massive living room windows overlook an alien world. The textures, subtle lighting, and spacy soundtrack make for a visual feast.

Alien Apartment is only one scene of a larger work, Whispering Eons, but I've not tried it yet.

Sit and watch

In A Time in Space 2, a cute robot leads you on a short space adventure. For some reason, the stereoscopic image is full-screen rather than tailored for cardboard, so it makes me a little nauseous. But the experience is worth it.

VR Cave stands out because of its incredible detail and lends itself well to 3D. You float through a cave on a predefined track. It is stunning to see the crystals whiz by inches from your face while looking down vast bottomless caverns of rock. Sit on a swivel chair and slowly spin while you are doing it, so that you can trick yourself into feeling your feet drag on the rock as you go.

I tried a few roller coasters. VR Roller Coaster is reasonable.

Need to chill out? Install Android Dreams and sit back while your driver serenely lands your craft in a futuristic city at night. Gaze out the window as you cruise between giant billboards and skyscrapers.

Within has a few dozen VR movies to choose from. My favourites are the two science documentaries on gravity waves and the robots of Boston Dynamics. You can experience several VR music videos, or sit in the audience during a taping of Saturday Night Live.

The Youtube Virtual Reality Channel includes many short videos in VR. You can try them without cardboard by dragging the screen around, but the headset makes them shine.

The Veer app has a curated selection of videos that were much better than the Youtube ones.

Puzzle

I feel like I'm in the movie Tron when I'm playing Gravity Pull. Solve puzzles by putting boxes onto weight sensitive pads, unlocking each door to the next room.

Action

Sorry, you're going to have to part with some cash, because the best action games cost a couple of bucks.

Proton Pulse is brickbreaker. Move your head to bounce the ball off the glass prism and destroy all of the floating bricks.

Install Minos Starfighter. In your swivel chair, you'll feel like you're in an X-Wing. Make sure you look up and down as you spin because the attacking ships are coming at you from all directions. They explode with a satisfying fireball under the wrath of your cannons.

WTF

You have to see this to believe it. For someone completely uninitiated to Japanese anime, Nagomi's Earcleaning VR is a constant stream of WTF moments. In the game, you are visiting your young, attractive cousin, and in dialog laced with innuendo, she beckons you to lean on her lap. Then you hear audible scratching as she proceeds to clean your ears. She looks hurt and mortified if you try to escape. Better lie down and ponder the quirkiness of Japanese culture.

What my kids like

Do you have small kids? At ages five and seven, mine like these apps:

What are your favourite VR apps?

O(n) Delta Compression With a Suffix Array

2017-01-24T10:16:32-05:00

ABSTRACT

The difference between two sequences A and B can be compactly stored using COPY/INSERT operations. The greedy algorithm for finding these operations relies on an efficient way of finding the longest matching part of A for any given position in B. This article describes how to use a suffix array to find the optimal sequence of operations in time proportional to the length of the input sequences. As a preprocessing step, we find and store the longest match in A for every position in B in two passes over the suffix array that has been enhanced with longest common prefix (LCP) information.

INTRODUCTION

The days of losing work due to a power outage are over. In modern applications, users expect their work to be saved immediately. They also make mistakes, such as accidentally deleting large sections, and they expect to be able to restore their work to a prior state. To facilitate this, we need a way to retrieve or reconstruct every version of a document. We can take advantage of the similarities between the versions to minimize the space they consume.

INSERT/DELETE ALGORITHMS

Given two sequences of items A and B, we can compute a set of operations that will transform A into B. That way, only these operations need be stored. Two main classes of differencing algorithms are commonly used. Version control tools that deal with source code are often based on the longest common subsequence. For example, given the text:

A: The quick brown fox jumped over the lazy dog.

B: The lazy dog jumped over the quick brown fox.

The longest common subsequence is:

(The )(jumped over the )(.)

Although the text is written out here for clarity, only the position and length of each matching block is used. Using this information, one can directly compute the ranges where the text changed and thus derive INSERT / DELETE operations.

The example above can be encoded:

AT 5 DELETE "quick brown fox "
AT 5 INSERT "lazy dog"
AT 30 DELETE "lazy dog"
AT 30 INSERT "quick brown fox "

While INSERT/DELETE changes are easy to see visually in side-by-side comparison tools, they are suboptimal for file storage because they cannot exploit out-of-sequence similarities.

We can fix this by adding a MOVE operation. It is possible to change pairs of INSERT/DELETE commands to MOVE as a post-processing step. However, a more pressing issue is that any algorithm based on the longest common subsequence has an O(N^2) runtime in its worst case. Modern differencing tools often use Myers' O(ND) algorithm, whose running time is proportional to the product of the length of the strings and the number of differences between them. Of course, when the two texts share little similarity, D approaches N and that will take a very long time.

THE COPY/INSERT ALTERNATIVE

Tichy describes a system based only on COPY/INSERT. Starting with an empty string, it is possible to recreate B by copying from various positions of A. Portions that do not exist in A can be created using the INSERT command. INSERT can also be used when its encoding would be smaller than the equivalent copy.

A: The quick brown fox jumped over the lazy dog.

COPY (The )
COPY (lazy dog)
COPY ( jumped over the )
COPY (quick brown fox)
INSERT (.)
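
To make the encoding concrete, here is a small sketch that replays such a list of operations against A. The tuple format and the name `apply_delta` are assumptions for this example, and the zero-based offsets were counted by hand. Note that the third copy starts at the space before "jumped" so that the words stay separated.

```python
def apply_delta(a, operations):
    """Rebuilds B from A and a list of COPY/INSERT operations.

    Each operation is ("COPY", position_in_a, length) or ("INSERT", text).
    This tuple encoding is an assumption made for the example.
    """
    out = []
    for op in operations:
        if op[0] == "COPY":
            _, position, length = op
            out.append(a[position:position + length])
        elif op[0] == "INSERT":
            out.append(op[1])
        else:
            raise ValueError("unknown operation: %r" % (op[0],))
    return "".join(out)


A = "The quick brown fox jumped over the lazy dog."
delta = [
    ("COPY", 0, 4),    # "The "
    ("COPY", 36, 8),   # "lazy dog"
    ("COPY", 19, 17),  # " jumped over the "
    ("COPY", 4, 15),   # "quick brown fox"
    ("INSERT", "."),
]
B = apply_delta(A, delta)
```

Here `B` comes out as "The lazy dog jumped over the quick brown fox.", and only four pairs of small integers plus one character had to be stored.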

Finding COPY/INSERT operations is much simpler than LCS based algorithms.

# q is the current position in B.
q = 0
# Start of the pending run of unmatched items in B, or -1 if there is none.
insertFrom = -1
while q < len(B):
    # Find p and l such that (p, q, l) is a maximal block move.
    position, length = findLongestMatch(q)
    if length > 0:
        if insertFrom >= 0:
            outputInsertCommand(insertFrom, q - insertFrom)
            insertFrom = -1
        outputCopyCommand(position, length)
        q += length
    else:
        # No match here: remember where the unmatched run began and move on.
        if insertFrom == -1:
            insertFrom = q
        q += 1

if insertFrom >= 0:
    outputInsertCommand(insertFrom, q - insertFrom)

The runtime of the algorithm is dependent on the findLongestMatch function. Given a position in B, it finds the position in A with the longest matching sequence.

The brute force solution, which simply compares each possible position of A each time, performs surprisingly well when the sequences are mostly similar. It is not called very often, because the matches it finds are long. For sequences that are different, it again devolves into an O(N^2) running time.
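
For reference, a brute force version might look like the sketch below. It takes both sequences as arguments, which is an assumption of this example (the loop above only passes the position), and it returns the first position in A that achieves the longest match.

```python
def find_longest_match(a, b, q):
    """Returns (position, length) of the longest prefix of b[q:] found in a."""
    best_position, best_length = -1, 0
    for p in range(len(a)):
        length = 0
        while (p + length < len(a) and q + length < len(b)
               and a[p + length] == b[q + length]):
            length += 1
        if length > best_length:
            best_position, best_length = p, length
    return best_position, best_length


A = "The quick brown fox jumped over the lazy dog."
B = "The lazy dog jumped over the quick brown fox."
match = find_longest_match(A, B, 4)  # the match for "lazy dog ..."
```

Every call scans all of A, which is where the quadratic behaviour comes from when the matches are short.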

The algorithm presented by MacDonald (2000) for use in the XDFS file system uses a preprocessing step to achieve O(N) operation. At each offset in A, the next 16 characters are placed into a lookup table and mapped to that position. FindLongestMatch is then:

code = next 16 characters in B, starting at index
If code exists in the table,
    Return the length of the longest match at A[table[code]:] and B[index:]
Else
   No match at this position.

This algorithm is very fast and reasonably thorough. However, it is often not optimal. To guarantee O(N) time, the algorithm makes a tradeoff. When the table of positions is built, existing entries are "clobbered" by later ones. Only one position for each code is stored. If all of the positions were stored, then the algorithm would have to check each one, which would make the overall runtime O(N^2) with certain combinations of inputs.
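
The clobbering tradeoff is easy to see in a toy version. This sketch uses the block itself as the dictionary key, which is a simplification of mine. A real implementation would more likely reduce each block to a fixed-size checksum. The function names are made up, and the test data uses blocks of 4 characters instead of 16 to keep it readable.

```python
BLOCK = 16  # block length used in the text


def build_table(a, block=BLOCK):
    """Maps each block-length substring of a to the last offset where it occurs."""
    table = {}
    for offset in range(len(a) - block + 1):
        # Later offsets overwrite ("clobber") earlier ones with the same key.
        table[a[offset:offset + block]] = offset
    return table


def find_match(a, b, index, table, block=BLOCK):
    """Returns (position, length) of a match for b[index:], or (-1, 0)."""
    code = b[index:index + block]
    position = table.get(code)
    if len(code) < block or position is None:
        return -1, 0
    length = block
    while (position + length < len(a) and index + length < len(b)
           and a[position + length] == b[index + length]):
        length += 1
    return position, length


a = "abcdXabcdY"
table = build_table(a, block=4)
suboptimal = find_match(a, "abcdX", 0, table, block=4)
```

Here `suboptimal` is a match of length 4 at offset 5, even though offset 0 would have matched all 5 characters. The earlier entry for "abcd" was overwritten.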

THE SUFFIX ARRAY

A suffix array is an ordered list of positions in the string. Modern suffix array construction algorithms will bucket sort certain positions in the string, and then use the information to "induce" the positions of the other characters. With this clever trick, a suffix array can be created from a sequence in O(N) time where N is the length of the sequence.

It is often useful to build an enhanced suffix array. In addition to the suffix array of length N, the longest common prefix (LCP) array of length N-1 is built. It contains the length of the longest common prefix between each entry and the next. By exploiting the commonalities in the sequence, these prefixes can also be computed in O(N) time, either during the suffix array creation, or as a separate step.

To find commonalities between two different sequences, they are appended together, separated by a character not found in either string, and a suffix array is constructed. An example is here:

A: "mississippi" + "\u0001"

B: "sips and misses" + "\u0000"

String   Index   Lcp  15 characters
   B       15     0 |"\u0000"
A          11     0 |"\u0001sips and misses\u0000"
   B        4     1 |" and misses\u0000"
   B        8     0 |" misses\u0000"
   B        5     0 |"and misses\u0000"
   B        7     0 |"d misses\u0000"
   B       13     0 |"es\u0000"
A          10     1 |"i\u0001sips and misses\u0000"
A           7     2 |"ipp\u0001sips and misses"
   B        1     1 |"ips and misses\u0000"
   B       10     3 |"isses\u0000"
A           4     4 |"issippi\u0001sips and mis"
A           1     0 |"ississippi\u0001sips and "
   B        9     4 |"misses\u0000"
A           0     0 |"mississippi\u0001sips and"
   B        6     0 |"nd misses\u0000"
A           9     1 |"pi\u0001sips and misses\u0000"
A           8     1 |"ppi\u0001sips and misses\u0000"
   B        2     0 |"ps and misses\u0000"
   B       14     1 |"s\u0000"
   B        3     1 |"s and misses\u0000"
   B       12     1 |"ses\u0000"
A           6     3 |"sippi\u0001sips and misse"
   B        0     2 |"sips and misses\u0000"
A           3     1 |"sissippi\u0001sips and mi"
   B       11     2 |"sses\u0000"
A           5     3 |"ssippi\u0001sips and miss"
A           2     0 |"ssissippi\u0001sips and m"

Note 1. The position of sequence 1 and 2 can be easily determined by its offset into the concatenated string.

Note 2. The LCP at index i refers to the commonality with position i+1 in the suffix array.

Note 3. The LCP between any two entries in the suffix array is the minimum of the LCP of all adjacent entries between them.
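
Notes 2 and 3 can be checked on the example with a deliberately naive construction. This sketch sorts the suffixes directly, which is nowhere near linear time and is not the induced-sorting method described above. The helper names are mine.

```python
def suffix_array(s):
    """Naive construction by sorting suffixes; fine for small examples only."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def common_prefix(x, y):
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def lcp_array(s, sa):
    """lcp[i] is the common prefix length of suffixes sa[i] and sa[i+1]."""
    return [common_prefix(s[sa[i]:], s[sa[i + 1]:]) for i in range(len(sa) - 1)]


text = "mississippi" + "\u0001" + "sips and misses" + "\u0000"
sa = suffix_array(text)
lcp = lcp_array(text, sa)


def lcp_between(i, j):
    # Note 3: the LCP of entries i < j is the minimum over the adjacent pairs.
    return min(lcp[i:j])
```

Offsets below 12 belong to A and its separator, and the rest belong to B, which is all Note 1 is saying.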

The common parts of the combined string AB are near each other in the suffix array. However, the common parts of A and B are not necessarily adjacent. If either string has commonalities with itself, then there will be two or more adjacent entries in the suffix array belonging to the same string.

Examining the suffix array above, we see that the location "sips and misses" in B has below it a match in string A of length 2. However, above it is a match of length 3. Our algorithm must consider both possibilities.

We do this in two passes. In the forward pass, we find the longest common prefixes between B and any part of A previous to it in the suffix array. In the reverse pass, we find the LCPs between B and any part of A that is after it in the suffix array.

def longestMatches(self):
        # returns, for every position in B, a tuple with the longest matching 
        # position in A and the length of that match.
        result = [None] * self.length2

        # forward pass
        lcp = 0
        aIndex = 0
        for i in range(len(self.sa)):
            if self.sa[i] < self.length1:
                # string in A
                lcp = self.lcp[i]
                aIndex = self.sa[i]
            else:
                # string in B.
                result[self.sa[i] - self.length1] = (aIndex, lcp)
                lcp = min(lcp, self.lcp[i])

        # reverse pass
        lcp = 0
        aIndex = 0
        for i in range(len(self.sa)-1, -1, -1):
            if self.sa[i] < self.length1:
                # string in A
                aIndex = self.sa[i]
                if i > 0:
                    lcp = self.lcp[i-1]
            else:
                # string in B.
                lcp = min(lcp, self.lcp[i])
                bIndex = self.sa[i] - self.length1
                oldAIndex, oldLcp = result[bIndex]
                if lcp > oldLcp:
                    result[bIndex] = (aIndex, lcp)

        return result

ANALYSIS

Even though building the suffix array is an O(N) algorithm, it will always be slower compared to the XDFS method because it requires several passes through the data. If optimal encoding is required, and the algorithm must be O(N) for all inputs, then the suffix array method may be considered. Preprocessing operations, such as removing the common prefix and suffix from the input, are important to reduce the problem size.

RESOURCES

Here is the Python source code that I used to test this technique. It contains a transcription of the SAIS suffix array construction algorithm from C into Python.

Finding Bieber: On removing duplicates from a set of documents

2014-11-10T08:00:00-05:00

So I have two million song lyrics in a big file. Don't ask me how I got it. The point is that I want to find the most poetic phrase of all time.

Problem is, the origins of this file are so sketchy it would make a Pearls Before Swine cartoon look like a Da Vinci. There could well be thousands of copies of Justin Bieber's Eenie Meenie, all frenetically typed in by a horde of snapchatting teenagers using mom's Windows Vista laptop with the missing shift key.

I don't want my analysis to include the copies and covers of the same song. So I have two problems to solve:

How can we know whether two songs are actually the same?
And how can we do this quickly, over the whole collection?

But first

When dealing with text, or images, or sound files, or whatever kind of media tickles your pickle, we want to transform them into numbers that computers can use. We turn them into feature vectors, or what most Ph.D toting natural language processing experts call, when they really want to get technical for a stodgy old formal publication -- one that will put them on the tenure track -- when they want to choose the most technically precise phrase, they call them: "bags of words". I am not making this up.

Let's say we had this paragon of poesy:

Eenie, meenie, miney, mo
Catch a bad chick by her toe
If she holla
If, if, if she holla, let her go

A bag of words is a set of words, and their counts. The ordering of the words is lost to simplify things. Order is rarely important anyway.

{a: 1, bad: 1, by: 1, catch: 1, chick: 1, eenie: 1, go: 1, her: 2, holla: 2, if: 4, let: 1, meenie: 1, miney: 1, mo: 1, she: 2, toe: 1}

We could go even simpler and remove the counts, if we feel they aren't important.

{a, bad, by, catch, chick, eenie, go, her, holla, if, let, meenie, miney, mo, she, toe}

As we process each document from the database, the first thing we do is turn it into the bag of words. In Python it's a one-liner.

def makeBag(song):
    # Lowercase first so that "Eenie" and "eenie" count as the same word.
    return set(song.lower().replace(",", " ").split())

Comparing two bags

Let's say we had three sets. One is the song above. In the second, the teenaged transcriber thought "miney" should be spelled "miny". The third is Frank Sinatra's Fly me to the moon. We would like a distance function, so that if two songs differ by only one word, then the distance would be small, and if they are completely different, the distance is large.

To find the answer, we have to travel to 1907, and accompany Swiss professor Paul Jaccard on his trip to the Alps to do some serious botany. He noticed that there were different clusters of plants in different regions, and wondered if these clusters could be used to determine the ecological evolution of the area. He writes:

In order to reply to this question I have considered, in an alpine district of fair size, various natural sub-divisions, presenting, besides numerous resemblances in their ecological conditions (i.e. conditions dependent on soil and climate), a few characteristic differences, and I have sought to determine, by comparison, the influence of these resemblances and differences on the composition of flora.

He counted all of the different plants in different regions, and came up with this formula to compare how similar two different regions are:

Number of species common to the two districts / total number of species in the two districts.

This gives 0 if the sets share no common elements, and 1 if they are the same. But that's the opposite of what we need, so we subtract it from one to obtain a distance function.

def Jaccard(A, B):  
    intersection = len(A & B)
    union = len(A | B)
    return 1.0 - float(intersection)/union
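
Here is a quick check against the three kinds of song mentioned earlier, using shortened lyrics. The function names and the two-line excerpts are just for this sketch.

```python
def make_bag(song):
    return set(song.lower().replace(",", " ").split())


def jaccard_distance(a, b):
    return 1.0 - float(len(a & b)) / len(a | b)


original = make_bag("Eenie, meenie, miney, mo\nCatch a bad chick by her toe")
misspelled = make_bag("Eenie, meenie, miny, mo\nCatch a bad chick by her toe")
sinatra = make_bag("Fly me to the moon\nLet me play among the stars")

near = jaccard_distance(original, misspelled)  # 10 shared words out of 12
far = jaccard_distance(original, sinatra)      # no shared words
```

The misspelled copy lands at a distance of about 0.17, while the unrelated song sits at exactly 1.0.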

Now we have a distance function. To find all the duplicate songs, we just run it on every pair of songs that we have. With two million songs, that's only, umm, two trillion comparisons. If we can do 10000/second we could be done in a little over six years. Maybe we could split it up, use some cloud instances, pay a few thousand dollars for compute time and be done in a day.

Or we could use algorithms.

Time to get LSH'd

I have two little girls and coincidentally, they have approximately two million individual socks, with no two pairs alike. Yet it doesn't take me years to sort them, because I use a locality sensitive hash.

I take a sock, and if it's pinkish, I put it in the top left. Purple goes in the top right, and colours in the middle go in between. Blues go on the bottom, greens have their own spot. By the time I run out of socks to sort, the pairs are near each other on the carpet. Then it's a simple matter to join them together.

Actually, over the years, I have further refined the system, because "pinkish" is ambiguous. Children's socks are a mix of shapes, eyes, cats & dinosaurs of all colours. If the sock has any blue at all, no matter how small, it goes top left. Otherwise, if it has any red, other colours notwithstanding, it goes bottom left. Otherwise, if it has any green whatsoever, top right. Otherwise, bottom right.

This is known as:

MinHash

Now let's travel to 1997. Titanic and The Full Monty are in theaters. Some people pay to see the film 9 or 10 times. (Titanic, I mean) This is unsurprising because the only thing on TV is the OJ Simpson trial. On the WWW, then known as the World Wide Wait, AltaVista is one of the top search engines for finding the status of the Trojan Room Coffee Pot.

Computer Scientist Andrei Broder, who has been with AltaVista from near the beginning, is working on the duplicates problem. As the web expands, a lot of the search results that come up are duplicates of other pages. For search, this is Very Annoying. Broder devises a way of quickly searching these millions of pages for duplicates.

MinHash is a function that reduces a text document to a single number. Documents that share many of the same words have numbers that are near each other.

How is this done?

Suppose you build a dictionary of all the words that could possibly occur in your documents, and you number them.

0 aardvark
1 abacus
2 abacuses
3 abaft
4 abalone
5 abandon
...

The minhash would take this dictionary, and take your document, and assign it the number of the minimum word that occurs. That's it.

So if your document is "The aardvark abandoned his abacus" then the number assigned would be 0 (because aardvark is the zero'th word in the dictionary). In fact, every document that talks about an aardvark would hash to 0.

But what if, by chance, there is a document that is similar to our aardvark text but misspells it? Then they would hash to some other number entirely.

To guard against this, we actually take several random permutations of the dictionary and average the minhash against each of them.

0 abacus
1 abalone
2 abacuses
3 aardvark
4 abaft
5 abandoned

0 abalone
1 abacus
2 abacuses
3 abandoned
4 aardvark
5 abaft

0 abacus
1 abaft
2 abacuses
3 abalone
4 abandoned
5 aardvark

Document: "The aardvark abandoned his abacus"
Minhash under first dictionary: 0
Minhash under dictionary 2: 1
Minhash under dictionary 3: 0
Combined minhash: (0 + 1 + 0) / 3 = 0.333333333
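
If you want to check the arithmetic, here is a small sketch that replays the example with the three shuffled dictionaries listed above. The function names are mine and purely illustrative.

```python
# The three shuffled dictionaries from the example; list index = word number.
DICTIONARIES = [
    ["abacus", "abalone", "abacuses", "aardvark", "abaft", "abandoned"],
    ["abalone", "abacus", "abacuses", "abandoned", "aardvark", "abaft"],
    ["abacus", "abaft", "abacuses", "abalone", "abandoned", "aardvark"],
]

def minhash_one(document_words, dictionary):
    """Lowest dictionary position of any known word in the document.

    Raises ValueError if the document shares no word with the dictionary.
    """
    positions = {word: i for i, word in enumerate(dictionary)}
    return min(positions[w] for w in document_words if w in positions)

def combined_minhash(document_words, dictionaries):
    """Average of the per-dictionary minhashes."""
    values = [minhash_one(document_words, d) for d in dictionaries]
    return sum(values) / len(values)

document = "the aardvark abandoned his abacus".split()
per_dictionary = [minhash_one(document, d) for d in DICTIONARIES]
combined = combined_minhash(document, DICTIONARIES)
print(per_dictionary)  # [0, 1, 0]
print(combined)        # one third
```

Words that are not in the dictionary ("the", "his") are simply skipped.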

As you use more and more dictionaries to compute the hash, then documents that share similar sets of words begin to hash to similar values. If you like code, here's some python.

import random

def MinHash(corpus, k = 5):
    # Map from each word to its position under each of the k shuffles
    words = {}
    for word in corpus:
        words[word] = []

    for i in range(k):
        # Shuffle the distinct words, so that a word repeated in the
        # corpus still gets exactly one position per shuffle.
        shuffled = list(words)
        random.shuffle(shuffled)
        for j in range(len(shuffled)):
            words[shuffled[j]].append(j)

    def hash(document):
        # For each of the k shuffles, find the lowest-numbered word in
        # the document, then average the k minimums.
        # -1 marks "no known word seen yet".
        vals = [-1] * k
        for word in document:
            if word in words:
                m = words[word]
                for i in range(k):
                    if vals[i] == -1 or m[i] < vals[i]:
                        vals[i] = m[i]

        return sum(vals) / float(k)

    return hash

Finally

Using MinHash, you can mark duplicates in three passes through the data, and a sort.

In the first pass, build a dictionary of all the words that occur, and use it to create the hash function.
In the second pass, compute the minhash of each document.
Sort the documents by their minhash (if you can afford to do so) or place them into buckets. In either case, documents that are similar will theoretically be close together.
Finally, go through the list (if sorted) or nearby buckets, and compare documents within a certain window using a more refined comparison function, such as Jaccard distance. Anything that is close enough to being the same is a duplicate.
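
The steps above can be sketched end to end as follows. This is a toy version under my own assumptions: documents are plain strings split on whitespace, everything fits in memory, and the window size and similarity threshold are arbitrary. It uses Jaccard similarity (one minus the Jaccard distance) for the refined comparison.

```python
import random

def make_minhash(vocabulary, k=20, seed=0):
    """Build a hash function from k random shuffles of the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted(vocabulary)
    tables = []
    for _ in range(k):
        shuffled = vocab[:]
        rng.shuffle(shuffled)
        tables.append({word: i for i, word in enumerate(shuffled)})

    def minhash(words):
        # Lowest position per shuffle; len(vocab) if no word is known.
        mins = [min((t[w] for w in words if w in t), default=len(vocab))
                for t in tables]
        return sum(mins) / k
    return minhash

def jaccard(a, b):
    """Jaccard similarity of two word collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def find_duplicates(documents, window=3, threshold=0.8):
    """Return sorted index pairs (i, j) of documents judged duplicates."""
    docs = [d.lower().split() for d in documents]
    # Pass 1: collect the vocabulary and build the hash function.
    vocabulary = {w for d in docs for w in d}
    minhash = make_minhash(vocabulary)
    # Pass 2: hash every document, then sort the indices by hash.
    order = sorted(range(len(docs)), key=lambda i: minhash(docs[i]))
    # Pass 3: compare each document with its next few neighbours only.
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + 1 + window]:
            if jaccard(docs[i], docs[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

lyrics = ["oh oh baby", "let it go", "oh oh baby", "far far away"]
print(find_duplicates(lyrics))  # [(0, 2)]
```

The window is what keeps this cheap: each document is compared with a handful of neighbours instead of with every other document.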

Oh yeah

I will assume that the most poetic words of all time, in English, are the ones most likely to end a line. After analysis of 2 million song lyrics, with near duplicates removed, they are:

at all
no more
don't know
all alone
right now
all night
at night
far away
like that
let go
too late

And the number one most poetic phrase in the history of music:

oh oh

Let's read a Truetype font file from scratch

2014-11-05T15:45:16-05:00

Here are the steps we will follow:

When the file is dragged onto the web page, we want to read it.
We need to be able to interpret the numbers in the file, even though they were made for C programs to read.
We have to find the number of characters in the file and the positions of the glyph outlines in the file
We have to interpret the format of the glyph outlines
Finally, we have to render them to the web page.

Reading Files from Javascript

Whoa, that sounds dangerous. But javascript can't read any file on your computer; just the ones you happen to drag over the web page, intentionally or accidentally. We do that by handling the dragover and drop events. When the drop event is received, it contains a reference to the file and then our code is allowed to read it. This is done without any interactions with the server.

We also have to handle the dragover event and cancel it, because otherwise the browser never delivers the drop event.

var dropTarget = document.getElementById("dropTarget");
dropTarget.ondragover = function(e) {
    e.preventDefault();
};
dropTarget.ondrop = function(e) {
    e.preventDefault();

    if (!e.dataTransfer || !e.dataTransfer.files ||
            e.dataTransfer.files.length === 0) {
        alert("Your browser didn't include any files in the drop event");
        return;
    }

    var reader = new FileReader();
    // Attach the handler before starting the asynchronous read.
    reader.onload = function() {
        ShowTtfFile(reader.result);
    };
    reader.readAsArrayBuffer(e.dataTransfer.files[0]);

};

You can't do much with the HTML5 File object. To get its data, you have to use the FileReader to read it asynchronously. You can choose to read it as a base64 encoded string or an array buffer. We choose an ArrayBuffer.

Interpreting the C structures

TrueType files were designed when computers had very little memory. They were designed to be mapped into RAM and read in place. C structures were even placed directly in the file. Opening a TrueType file was just a matter of loading it in. There was no need to do anything else. We will do a similar thing, but we will need a way to easily seek around the file and read numbers in various formats.

Here's a class that lets you do that.

function BinaryReader(arrayBuffer)
{
    assert(arrayBuffer instanceof ArrayBuffer);
    this.pos = 0;
    this.data = new Uint8Array(arrayBuffer);
}

BinaryReader.prototype = {
    seek: function(pos) {
        assert(pos >=0 && pos <= this.data.length);
        var oldPos = this.pos;
        this.pos = pos;
        return oldPos;
    },

    tell: function() {
        return this.pos;
    },

    getUint8: function() {
        assert(this.pos < this.data.length);
        return this.data[this.pos++];
    },

    getUint16: function() {
        return ((this.getUint8() << 8) | this.getUint8()) >>> 0;
    },

    getUint32: function() {
       return this.getInt32() >>> 0;
    },

    getInt16: function() {
        var result = this.getUint16();
        if (result & 0x8000) {
            result -= (1 << 16);
        }
        return result;
    }, 

    getInt32: function() {
        return ((this.getUint8() << 24) | 
                (this.getUint8() << 16) |
                (this.getUint8() <<  8) |
                (this.getUint8()      ));
    }, 

    getFword: function() {
        return this.getInt16();
    },

    get2Dot14: function() {
        return this.getInt16() / (1 << 14);
    },

    getFixed: function() {
        return this.getInt32() / (1 << 16);
    },

    getString: function(length) {
        var result = "";
        for(var i = 0; i < length; i++) {
            result += String.fromCharCode(this.getUint8());
        }
        return result;
    },

    getDate: function() {
        // Seconds since midnight, January 1, 1904. Months in Date.UTC
        // are zero-based, so January is 0.
        var macTime = this.getUint32() * 0x100000000 + this.getUint32();
        var utcTime = macTime * 1000 + Date.UTC(1904, 0, 1);
        return new Date(utcTime);
    }
};

Fixed point numbers

Besides unsigned and signed 8, 16, and 32 bit numbers, there are some other types of things that appear in font files. The Fixed type is a way of representing fractions in a fixed number of bits. It is ordinary fixed-point arithmetic, only in binary instead of base 10. Suppose we wanted to write (in base 10) the number 1.53 but our decimal point key is broken. We would instead write 153. To convert it back, we divide by 100. In binary it works the same way, except that we divide by a power of two.
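
To make that concrete, here is a small sketch in Python of the two fixed-point types the reader class handles: Fixed (16 bits after the point) and 2.14 (14 bits after the point). The function names and the sample bytes are mine, chosen so the answers are easy to verify by hand.

```python
import struct

def read_fixed(data, offset=0):
    """16.16 fixed point: big-endian signed 32-bit integer / 2**16."""
    (raw,) = struct.unpack_from(">i", data, offset)
    return raw / (1 << 16)

def read_2dot14(data, offset=0):
    """2.14 fixed point: big-endian signed 16-bit integer / 2**14."""
    (raw,) = struct.unpack_from(">h", data, offset)
    return raw / (1 << 14)

print(read_fixed(bytes([0x00, 0x01, 0x80, 0x00])))  # 1.5
print(read_2dot14(bytes([0x70, 0x00])))             # 1.75
print(read_2dot14(bytes([0xC0, 0x00])))             # -1.0
```

In the first example the raw integer is 0x00018000 = 98304, and 98304 / 65536 = 1.5.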

A note on Javascript numbers

Javascript has a wishy-washy "number" type. Officially it is always a 64-bit double precision number, but the bitwise operators treat it as a 32-bit integer, and when you least expect it their results come back signed.

But you can force it to be unsigned using the "unsigned shift right" operator (>>>). Shifting by 0 leaves the bits alone but reinterprets the result as unsigned.

Finding the treasures

The TrueType font format is described by Apple here. The truetype file is prefixed with something called the "offset" table that tells you where everything else is in the file. We will have to go diving into various tables to find the actual outlines of the fonts.

The tables also have a checksum to ensure they are right. This is obtained by adding up all the 4-byte integers in them, modulo 2³². Here's the code to read the offsets.

function TrueTypeFont(arrayBuffer)
{
    this.file = new BinaryReader(arrayBuffer);
    this.tables = this.readOffsetTables(this.file);
    this.readHeadTable(this.file);
    this.length = this.glyphCount();
}

TrueTypeFont.prototype = {
    readOffsetTables: function(file) {
        var tables = {};
        this.scalarType = file.getUint32();
        var numTables = file.getUint16();
        this.searchRange = file.getUint16();
        this.entrySelector = file.getUint16();
        this.rangeShift = file.getUint16();

        for( var i = 0 ; i < numTables; i++ ) {
            var tag = file.getString(4);
            tables[tag] = {
                checksum: file.getUint32(),
                offset: file.getUint32(),
                length: file.getUint32()
            };

            if (tag !== 'head') {
                assert(this.calculateTableChecksum(file, tables[tag].offset,
                            tables[tag].length) === tables[tag].checksum);
            }
        }

        return tables;
    },

    calculateTableChecksum: function(file, offset, length)
    {
        var old = file.seek(offset);
        var sum = 0;
        var nlongs = ((length + 3) / 4) | 0;
        while( nlongs-- ) {
            sum = (sum + file.getUint32() & 0xffffffff) >>> 0;
        }

        file.seek(old);
        return sum;
    },
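
For comparison, here is a rough Python sketch of the same checksum on a tiny made-up table. One assumption of mine: a table whose length is not a multiple of four is padded with zero bytes for the sum, rather than reading whatever follows it in the file.

```python
import struct

def table_checksum(data, offset, length):
    """Sum the table's big-endian 32-bit words modulo 2**32."""
    table = bytes(data[offset:offset + length])
    # Pad to a multiple of four bytes with zeros (an assumption here).
    table += b"\x00" * (-len(table) % 4)
    total = 0
    for (word,) in struct.iter_unpack(">I", table):
        total = (total + word) & 0xFFFFFFFF
    return total

# Six bytes: 0xFFFFFFFF followed by 0x0002, padded to 0x00020000.
sample = bytes([0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02])
print(hex(table_checksum(sample, 0, 6)))  # 0x1ffff
```

The sum overflows 32 bits once, which is why the leading 1 is dropped.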

Okay, now we know where all the various tables are in the file. But one that we will need later is the "head" table, which contains the dimensions of the font, and importantly, the format of the glyph index.

    readHeadTable: function(file) {
        assert("head" in this.tables);
        file.seek(this.tables["head"].offset);

        this.version = file.getFixed();
        this.fontRevision = file.getFixed();
        this.checksumAdjustment = file.getUint32();
        this.magicNumber = file.getUint32();
        assert(this.magicNumber === 0x5f0f3cf5);
        this.flags = file.getUint16();
        this.unitsPerEm = file.getUint16();
        this.created = file.getDate();
        this.modified = file.getDate();
        this.xMin = file.getFword();
        this.yMin = file.getFword();
        this.xMax = file.getFword();
        this.yMax = file.getFword();
        this.macStyle = file.getUint16();
        this.lowestRecPPEM = file.getUint16();
        this.fontDirectionHint = file.getInt16();
        this.indexToLocFormat = file.getInt16();
        this.glyphDataFormat = file.getInt16();
    },

There are many tables to obtain the characteristics of the font, or the horizontal distance between glyphs, or the minimum recommended height, creation date, etc. But I want to stay focused on the buried treasure -- the glyph outlines.

The glyph outlines are contained in the "glyf" section. The glyphs are highly compressed and each one is a different length. To find a particular one quickly, we have to first go to the "loca" table.

It is simply an array of two-byte or four-byte values, depending on the "indexToLocFormat" in the header. When this is set to one, the values are four bytes long and give the position of a glyph in the glyf table. Otherwise, they are two bytes long, and give the position of the glyph divided by two in the glyf table. File formats make confusing tradeoffs to be small.

    getGlyphOffset: function(index) {
        assert("loca" in this.tables);
        var table = this.tables["loca"];
        var file = this.file;
        var offset, old;

        if (this.indexToLocFormat === 1) {
            old = file.seek(table.offset + index * 4);
            offset = file.getUint32();
        } else {
            old = file.seek(table.offset + index * 2);
            offset = file.getUint16() * 2;
        }

        file.seek(old);

        return offset + this.tables["glyf"].offset;
    },

Given any glyph index, we can now locate its exact position from the start of the file. Now things get a little complicated.

Conceptually the glyph can be one of two structures, which share a common header. (diagram)

When two shapes are drawn on top of each other, it is convention that the second will cut out the first one if it has a different winding order. That is, if the points are specified going clockwise instead of counter-clockwise and vice-versa. Fonts use this convention to build up shapes from contours. For example, the letter O will have two contours -- one for the outer circle, and one for the inner one.
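
If you want to find out which way a contour winds, one common way (not something the font file stores) is the sign of its area from the shoelace formula. A minimal sketch, with made-up square contours standing in for the outside and inside of a letter:

```python
def signed_area(points):
    """Shoelace formula. With the y axis pointing up, as in font units,
    the result is positive for counter-clockwise contours and negative
    for clockwise ones."""
    area = 0.0
    for i, (x0, y0) in enumerate(points):
        x1, y1 = points[(i + 1) % len(points)]
        area += x0 * y1 - x1 * y0
    return area / 2.0

outer = [(0, 0), (0, 10), (10, 10), (10, 0)]  # clockwise
inner = [(3, 3), (7, 3), (7, 7), (3, 7)]      # counter-clockwise
print(signed_area(outer))  # -100.0
print(signed_area(inner))  # 16.0
```

The two contours have opposite signs, so the inner one would cut a hole in the outer one.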

But there are two kinds of glyphs. The simple type is made of contours, as above. The compound type is made up of other glyphs. To draw the glyph, we have to draw each of the component glyphs and shift them around. This is made to handle characters with accents. Accented versions of the letters can therefore take very little space.

Let's keep focused on getting the treasure. We will ignore the compound glyphs. We just want to extract those sweet outlines.

Interpreting the outlines

This function will read the glyph header, and then call the right function to read it.

    readGlyph: function(index) {
        var offset = this.getGlyphOffset(index);
        var file = this.file;

        if (offset >= this.tables["glyf"].offset + this.tables["glyf"].length)
        {
            return null;
        }

        assert(offset >= this.tables["glyf"].offset);
        assert(offset < this.tables["glyf"].offset + this.tables["glyf"].length);

        file.seek(offset);

        var glyph = {
            numberOfContours: file.getInt16(),
            xMin: file.getFword(),
            yMin: file.getFword(),
            xMax: file.getFword(),
            yMax: file.getFword()
        };

        assert(glyph.numberOfContours >= -1);

        if (glyph.numberOfContours === -1) {
            this.readCompoundGlyph(file, glyph);
        } else {
            this.readSimpleGlyph(file, glyph);
        }

        return glyph;
    },

The simple glyphs are stored in a compressed format. They can deal with repeated points, and small movements from one point to the next very well. This is done using a series of one-byte flags. Each flag-byte indicates whether the corresponding point is stored in one byte or two bytes, for each of the X and Y coordinates. After the flags come the X coordinates, and finally the Y coordinates. The great thing about this is that if either the X or the Y coordinate doesn't change, no coordinate bytes are stored for it at all; a single bit in the flags records this.

When we read a glyph, we will assemble the points together into one array of (x, y) coordinates, plus one of the flags which is very important for rendering.

    readSimpleGlyph: function(file, glyph) {

        var ON_CURVE        =  1,
            X_IS_BYTE       =  2,
            Y_IS_BYTE       =  4,
            REPEAT          =  8,
            X_DELTA         = 16,
            Y_DELTA         = 32;

        glyph.type = "simple";
        glyph.contourEnds = [];
        var points = glyph.points = [];

        for( var i = 0; i < glyph.numberOfContours; i++ ) {
            glyph.contourEnds.push(file.getUint16());
        }

        // skip over instructions
        file.seek(file.getUint16() + file.tell());

        if (glyph.numberOfContours === 0) {
            return;
        }

        var numPoints = Math.max.apply(null, glyph.contourEnds) + 1;

        var flags = [];

        for( i = 0; i < numPoints; i++ ) {
            var flag = file.getUint8();
            flags.push(flag);
            points.push({
                onCurve: (flag & ON_CURVE) > 0
            });

            if ( flag & REPEAT ) {
                var repeatCount = file.getUint8();
                assert(repeatCount > 0);
                i += repeatCount;
                while( repeatCount-- ) {
                    flags.push(flag);
                    points.push({
                        onCurve: (flag & ON_CURVE) > 0
                    });
                }
            }
        }

        function readCoords(name, byteFlag, deltaFlag, min, max) {
            var value = 0;

            for( var i = 0; i < numPoints; i++ ) {
                var flag = flags[i];
                if ( flag & byteFlag ) {
                    if ( flag & deltaFlag ) {
                        value += file.getUint8();
                    } else {
                        value -= file.getUint8();
                    }
                } else if ( ~flag & deltaFlag ) {
                    value += file.getInt16();
                } else {
                    // value is unchanged.
                }

                points[i][name] = value;
            }
        }

        readCoords("x", X_IS_BYTE, X_DELTA, glyph.xMin, glyph.xMax);
        readCoords("y", Y_IS_BYTE, Y_DELTA, glyph.yMin, glyph.yMax);
    }
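
The flag scheme is easier to follow on a tiny example. Below is a rough Python sketch of the same decoding, run on eight hand-made bytes that describe three points. It assumes well-formed input and starts reading at the first flag byte; the sample data is invented for illustration.

```python
import struct

ON_CURVE, X_IS_BYTE, Y_IS_BYTE, REPEAT, X_DELTA, Y_DELTA = 1, 2, 4, 8, 16, 32

def decode_points(data, num_points):
    """Decode the flags, then all x deltas, then all y deltas."""
    pos = 0
    flags = []
    while len(flags) < num_points:
        flag = data[pos]
        pos += 1
        flags.append(flag)
        if flag & REPEAT:
            flags.extend([flag] * data[pos])
            pos += 1

    def read_axis(byte_flag, delta_flag):
        nonlocal pos
        value, out = 0, []
        for flag in flags:
            if flag & byte_flag:
                # One unsigned byte; the delta flag carries the sign.
                delta = data[pos]
                pos += 1
                value += delta if flag & delta_flag else -delta
            elif not flag & delta_flag:
                # Two-byte signed delta.
                (delta,) = struct.unpack_from(">h", data, pos)
                pos += 2
                value += delta
            # Otherwise the coordinate repeats the previous value.
            out.append(value)
        return out

    xs = read_axis(X_IS_BYTE, X_DELTA)
    ys = read_axis(Y_IS_BYTE, Y_DELTA)
    return [(x, y, bool(f & ON_CURVE)) for x, y, f in zip(xs, ys, flags)]

# flags: 0x37, 0x14, 0x21 | x data: 10, then 300 as two bytes | y data: 20, 5
sample = bytes([0x37, 0x14, 0x21, 0x0A, 0x01, 0x2C, 0x14, 0x05])
print(decode_points(sample, 3))
# [(10, 20, True), (10, 15, False), (310, 15, True)]
```

The second point stores nothing for x and the third stores nothing for y, because in both cases the flag says the coordinate is unchanged.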

Drawing the glyphs in the web page

Finally we have something to show for all the effort. We want to draw the glyphs. HTML5 has its handy canvas API that will let us draw shapes.

Here's the function that controls the whole thing. It takes an array buffer from the drag & drop event, and creates our TrueType object from it. Then it removes any previous glyphs from the screen. For each character, it creates a <canvas> element and scales the font so that its em size (the unit named after the letter 'M') comes out to 64 pixels. The font also has to be flipped vertically, because its coordinates assume zero is in the lower left of the screen, but our coordinates are in the top left.

function ShowTtfFile(arrayBuffer)
{
    var font = new TrueTypeFont(arrayBuffer);

    var width = font.xMax - font.xMin;
    var height = font.yMax - font.yMin;
    var scale = 64 / font.unitsPerEm;

    var container = document.getElementById("font-container");

    while(container.firstChild) {
        container.removeChild(container.firstChild);
    }

    for( var i = 0; i < font.length; i++ ) {
        var canvas = document.createElement("canvas");
        canvas.style.border = "1px solid gray";
        canvas.width = width * scale;
        canvas.height = height * scale;
        var ctx = canvas.getContext("2d");
        ctx.scale(scale, -scale);
        ctx.translate(-font.xMin, -font.yMin - height);
        ctx.fillStyle = "#000000";
        ctx.beginPath();
        if (font.drawGlyph(i, ctx)) {
            ctx.fill();
            container.appendChild(canvas);
        }
    }

}

All that's left to show you is how they are drawn. In this function we ignore the curves and simply connect each point in the outline. However, in reality, some points are actually control points in a quadratic bezier curve.


    drawGlyph: function(index, ctx) {

        var glyph = this.readGlyph(index);

        if ( glyph === null || glyph.type !== "simple" ) {
            return false;
        }

        var p = 0,
            c = 0,
            first = 1;

        while (p < glyph.points.length) {
            var point = glyph.points[p];
            if ( first === 1 ) {
                ctx.moveTo(point.x, point.y);
                first = 0;
            } else {
                ctx.lineTo(point.x, point.y);
            }

            if ( p === glyph.contourEnds[c] ) {
                c += 1;
                first = 1;
            }

            p += 1;
        }

        return true;
    }
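
The function above connects every point with straight lines. To give an idea of what the off-curve points are for, here is a small Python sketch that evaluates a quadratic bezier curve between two on-curve points with one control point in between. The coordinates are made up. On a canvas, quadraticCurveTo draws this kind of curve for you.

```python
def quadratic_bezier(p0, p1, p2, steps=8):
    """Points along the curve from p0 to p2 with control point p1.

    B(t) = (1-t)^2 * p0 + 2*(1-t)*t * p1 + t^2 * p2, for t in [0, 1].
    """
    points = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((a * p0[0] + b * p1[0] + c * p2[0],
                       a * p0[1] + b * p1[1] + c * p2[1]))
    return points

curve = quadratic_bezier((0, 0), (50, 100), (100, 0), steps=4)
print(curve[0], curve[2], curve[4])
# (0.0, 0.0) (50.0, 50.0) (100.0, 0.0)
```

The curve starts and ends on the on-curve points and is pulled toward the control point without touching it.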

The code

The code that I am using these days is on GitHub here. It has been enhanced to handle character kerning, and CMaps which map codepoints to glyph indices.

A Quick Measure of Sortedness

2014-09-11T17:45:08-05:00

How do you measure the "sortedness" of a list? There are several ways. In the literature this measure is called the "distance to monotonicity" or the "measure of disorder" depending on who you read. It is still an active area of research when items are presented to the algorithm one at a time. In this article, I consider the simpler case where you can look at all of the items at once.

The Kendall distance between two lists is the number of adjacent swaps it would take to turn one list into another. So, for [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and [10, 1, 2, 3, 4, 5, 6, 7, 8, 9], it would take nine swaps.
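
Measured against the sorted list, this is the same as counting the pairs that are out of order. A brute-force sketch (quadratic time; a merge sort variant can do it in O(n log n)):

```python
def kendall_distance(items):
    """Number of out-of-order pairs, which equals the number of
    adjacent swaps needed to sort the list."""
    n = len(items)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if items[i] > items[j])

print(kendall_distance([10, 1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 9
```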

Edit distance is another method. We could take the 10, and move it after the 9, in one operation. The edit distance is inversely related to the longest increasing subsequence. In the list [1, 2, 3, 5, 4, 6, 7, 9, 8], the longest increasing subsequence is [1, 2, 3, 5, 6, 7, 9], of length seven, and since the list has nine elements it is two moves away from being a sorted list. The longest increasing subsequence can be calculated in O(n log n) time. A drawback of this method is its large granularity. For a list of ten elements, the measure can only take the distinct values 0 through 9.
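
One standard way to get the length of the longest increasing subsequence in O(n log n) is the binary search trick sketched below. It only returns the length, not the subsequence itself.

```python
from bisect import bisect_left

def lis_length(items):
    """Length of the longest strictly increasing subsequence."""
    # tails[i] is the smallest value that can end an increasing
    # subsequence of length i + 1 seen so far.
    tails = []
    for item in items:
        i = bisect_left(tails, item)
        if i == len(tails):
            tails.append(item)
        else:
            tails[i] = item
    return len(tails)

print(lis_length([1, 2, 3, 5, 4, 6, 7, 9, 8]))  # 7
```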

Here, I propose another measure for sortedness. The procedure is to sum the absolute difference between the position of each element in the sorted list, x, and where it ends up in the unsorted list, f(x). We divide by the square of the length of the list and multiply by two, because this gives us a nice number between 0 and 1. Subtracting from 1 makes it range from 0, for completely unsorted, to 1, for completely sorted.

A simple genetic algorithm in python for sorting a list using the above fitness function is presented below.

import random

def procreate(A):
    A = A[:]
    first = random.randint(0, len(A) - 1)
    second = random.randint(0, len(A) - 1)
    A[first], A[second] = A[second], A[first]
    return A

def score(A):
    diff = 0.
    for index, element in enumerate(A):
        diff += abs(index - element)

    return 1.0 - diff / len(A) ** 2 * 2

def genetic(root, procreateFn, scoreFn, generations = 1000, children=6):
    maxScore = 0.
    for i in range(generations):
        print("Generation {0}: {1} {2}".format(i, maxScore, root))
        maxChild = None
        for j in range(children):
            child = procreateFn(root)
            score = scoreFn(child)
            print("    child score {0:.2f}: {1}".format(score, child))
            if maxScore < score:
                maxChild = child
                maxScore = score
        if maxChild is not None:
            root = maxChild
    return root

A = [a for a in range(10)]
random.shuffle(A)
genetic(A, procreate, score)

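Note that under this metric, the completely reversed list is not the only worst case. For a list of even length it scores exactly 0, but so does any list that swaps the first half of the positions with the second half. For a list of odd length n, the reversed list scores 1/n² rather than 0.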

Spearman's coefficient, mentioned in the comments, might be what you are looking for.

My thoughts on various programming languages

2014-07-06T20:36:53-05:00

I hate all the languages. Once, I tried to make my own language, but I couldn't figure out what language to do it in, so I never started.

Most of the time, you don't have any choice of what language to work in. Whatever language I'm using, I've learned to appreciate both its strengths and weaknesses.

Java

People who like Java like typing. I mean: actually hitting keys on the keyboard. You have to keep repeating yourself over and over.

The whole java system was designed by an insane person whose answer to everything is to use a Design Pattern. If you see design patterns as a way of working around problems in the language, you will see that Java has many.

On the other hand, the folks at Sun really put the work in to make Java a specification that works on embedded platforms, so we're stuck with it there. I wouldn't really trust Python or C to run my desktop on my phone.

Also, what's with all those folders? I have to use Eclipse, against my will, because it knows how to jump around all those 1000 character path names. Would it really hurt anybody if I kept the 10 objects in my application in the same folder?

C

C is precise. When I write something in C, and it is done, I know that it will work. It's like painting a masterpiece with a single-hair brush. Having to code in that level of detail is a different mindset. When you sit down to write something in C, you have to plan it out before you start. Otherwise it's a lot of work to change it later.

If you have enough experience, memory leaks are rare. It's second nature -- malloc/free come in pairs. You can't forget one. It would be like forgetting to flush or turn off the lights. You just do it.

That being said, if you're going to paint a house, you don't want to be using a fine brush. You want huge rollers. If I'm writing a whole application, or a system, I would avoid C if I can.

It is difficult to make large changes to a C program. When I'm working on an algorithm, and I know that the first cut won't be right, often I will code in python first and then translate it into C by hand when it's done.

C++

It's C with a string class. And arrays and lists and heaps of queues to implement whatever you desire. A word to the wise: don't try to make your own templates. It's too hard. Aside from that, C++ makes C better, and you can write some very nice software in C++. The extra features make it scale up to larger systems with only moderate difficulty, as long as everyone follows the same conventions.

Javascript

This is the language that nobody loves. But javascript loves you. When you were first learning it, you might have written some very bad code that used an array as a dictionary, with other objects as keys, but it's totally okay, because that code is still running flawlessly and will continue to run as long as browsers run javascript.

Javascript lacks a linker, so all the code shares the same namespace, but everyone knows that, so everything still works together.

coffeescript

Coffeescript is a translator that takes a strange ruby-like language and turns it into Javascript, line by line. It is javascript with all extraneous syntax -- braces, brackets, extra keywords -- removed. Only the essential meaning of the code remains.

Coffeescript is nice. When you have to write tonnes of code, coffeescript will make you at least 25% faster. You can see that many more lines on the screen at once.

When you code in coffeescript you have to be very aware of what Javascript is going to be generated. That's the problem. You have to know Javascript first. Anyone new coming to your project has to first learn Javascript, and only then learn coffeescript, and then learn your codebase.

node.js

I wanted to like it. I think I gave it a good go. It's the callbacks that got me. I just know that someday, for whatever reason, one of those callbacks isn't going to happen and then my app is going to be stalled waiting forever. That's no way to live.

Also, nearly nothing is built in. But if you have to do X, there are always a dozen modules to choose from that do the same thing. Which do you choose? Which will get support if you have problems?

Scala

Scala is a functional, typed language that compiles to JVM code.

I learned Scala on the job. Yup, a startup was actually using it for their production system, and I joined them fairly late.

This allowed me to see the ugly side of Scala: Type inference. Types are inferred to the extreme. Everything has a type, but figuring out what that type is means checking different files several levels back. And Scala inherits Java's folder insanity, so it means delving into several levels of folders to find the right file to look up the type.

In short, Scala was great -- for the original developers. Newcomers had a long learning curve to learn the existing code.

Erlang

Erlang is also one that I wanted to love. I really tried. It is a beautiful functional language that lets you make wonderful little modules that communicate in precise ways, and your system can run for 10 years because it can deal with unexpected problems, restart what's needed, and keep going.

Unfortunately it's baroque. Development seems to have stopped at about the time that Berkeley invented sockets. Almost nothing needed in the modern era is included. Why is it so much work to make a simple web service?

Go

Go is easy to learn, even for newcomers. It uses language concepts from 40 years ago to build a robust, asynchronous system, and lets you code it as if it were synchronous. You can write 1000 threads that work together safely in Go without your brain hurting.

It still needs some work in library support. When I want to do X, which library should I use -- the one on github from 2011 or the one from 2013 that is half finished? One is linked to from the official pages, but the official pages don't seem all that up to date. Sigh, I guess I'll have to write my own...

Python

There's a library for everything in python, and if you use Linux it's usually clear which is the top one, because it's installable with one command.

If you have to do some number crunching or scientific computing you will be well-served by choosing Python.

Strings can be both text and data in python, so you have to learn about text encodings early on.

Python 3

Python 3 shares many characteristics with python, though it is a different language. Since it's newer it's not supported as much. I want to use it, but there's always that one library I need that only has python 2 support.

A little VIM hacking

2014-07-06T03:26:18-05:00

The strange man reading a novel in the meeting room

2014-07-06T03:07:33-05:00

In late 2011, we were crammed around the table in the break room.

"I tried to book the meeting room but it's booked all week for the layoffs," said James.

Times were bleak at BlackBerry. Last Friday 11% of the workforce, over two thousand people, were laid off. On that day I had passed a red faced man being escorted out, flanked by a security goon and a serious looking woman from Organizational Development.

On the bright side, our team was -- mostly -- still there, and there were plenty of spare monitors lying around. I grabbed a couple and now I had three. There was a lot of room on my desk -- I had grown tired of bringing all my photos and desk toys back and forth every Thursday just in case I was fired. Now I just kept them at home, still packed in a bag hanging off a bicycle hook in the garage.

"But there aren't any layoffs happening now, are there?" I mused. "I mean, maybe if you were on vacation or something last Friday, but they wouldn't need the meeting room all week."

Chris said, "I checked the room and there's just this guy there. He comes in every morning and just sits there reading a novel."

When our meeting was finished I stopped by the Marconi room. Inside sat a bearded man, alone, reading a novel. His badge had the prominent red escort-required visitor's stripe. I hurried off before he could see me.

Sure enough, each day that week as I walked by this room, he was there, just reading. And each day he would leave by 5.

A couple of weeks later we were finally able to book the room. As we flung our BlackBerrys and notebooks on the table, Chris was finally able to resolve the mystery.

"I found out who that guy was."

The mystery had been solved!

Chris continued. "He's a counsellor the company brought in to help us deal with the layoffs."

"How'd you find out?" I asked. I had been reading all of the email announcements and hadn't seen anything about this.

"I asked him."

You can cheat so your web site seems faster than it is

2013-12-13T05:47:55-05:00

I couldn't sleep. Took some neo citron for my cold and rewrote the rhymebrain instant algorithm.

When you start typing "ox" into rhymebrain, there is almost 100% chance that you are going to type "oxygen". Likewise for "or" it is "orange". There is a perceived time savings if rhymebrain prefetches the results for what it thinks you are going to type while you are still typing. On the other hand, if you go ahead and enter "oxen" then it will actually take longer to get the results, since the requests are serialized by the browser. Prefetching can waste time.

Most people only type in one of 1000 words, which easily fit into some <script defer> javascript.

Previously, I was precalculating the completion which maximized the probability of the word, using some bastardized half-remembered version of Bayes' law. But I think there is a better approach, by running simulations on existing data.

I have lots of data of what people typed into the search box.

Google-mad-scientist-and-textbook-writer Peter Norvig once wrote a spellchecker in python that works efficiently by brute force. (http://norvig.com/spell-correct.html) I took some inspiration from that and wrote a python program that does, by brute force:

for each possible prefix of all the words,
   for each word with that prefix,
      assume the word is completed.
      calculate time saved if the word is completed * count
      for each other word with that prefix,
         subtract time wasted * count

...where count is the number of times the word was entered into Rhymebrain's search box by people. This time of year, there is a large savings for completing "Christ" to "christmas" because people enter it so often.
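
Here is a rough Python sketch of that loop, not the actual program. The counts are invented, and time_saved and time_wasted are hypothetical per-search costs that I set to 1 for simplicity. Instead of the innermost loop it subtracts the combined count of all the other words with the same prefix, which gives the same number.

```python
from collections import defaultdict

def best_prefetches(counts, time_saved=1.0, time_wasted=1.0):
    """Map each prefix to (net_saving, word) for its best completion.

    counts: dict of word -> number of times it was searched.
    Prefixes where no guess has a positive net saving are left out.
    """
    by_prefix = defaultdict(list)
    for word in counts:
        for end in range(1, len(word)):
            by_prefix[word[:end]].append(word)

    best = {}
    for prefix, words in by_prefix.items():
        total = sum(counts[w] for w in words)
        for word in words:
            # Saved when the guess is right, wasted for everyone else.
            net = (time_saved * counts[word]
                   - time_wasted * (total - counts[word]))
            if net > 0 and (prefix not in best or net > best[prefix][0]):
                best[prefix] = (net, word)
    return best

counts = {"oxygen": 90, "oxen": 5, "orange": 50, "order": 40}
table = best_prefetches(counts)
print(table["ox"])   # (85.0, 'oxygen')
print(table["or"])   # (10.0, 'orange')
print("o" in table)  # False: no single guess pays off after just "o"
```

Sorting table.items() by the net saving gives the list of prefixes worth shipping to the browser.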

Now we have a big list of the time savings (or waste) of prefetching each word from each prefix. Sort them by savings and put the top ones into a javascript file: http://rhymebrain.com/prefix.js

Now when you use rhymebrain and enter one of the 1800 words in the list, and type slowly enough, then the results will instantly pop onto the screen when you press enter.

Finally getting tired. Almost time to wake up and feed breakfast to kids

Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US

2013-11-13T04:45:08-05:00

Warning

For qualified advice, ask an organization such as the chamber of commerce. Here is some information from BDO for Canadians on selling to the USA. If you can pay for advice, you should set up an appointment with an adviser.

After many months, your software sale is complete! You've got a purchase order, sent the invoice, delivered the software. You're already handling some support issues from users at BigCorp. Then BANG! Martha from Procurement emails back, as a favour, just to let you know that BigCorp has not received your W8 form with a valid tax id, and therefore will be withholding 30% of the purchase price of your multi-thousand dollar product for taxes, so that the crazy government in the crazy USA that NEITHER YOU NOR ANYBODY FROM YOUR COMPANY HAS EVER SET FOOT IN can buy more guns.

OK, stop and take a deep breath. Refrain from calling your lawyer yet. A quick Googling will tell you that:

You don't need a W8-form, Martha is wrong.
Wait that first post was wrong. You absolutely need a W8-form but Martha is incorrect -- you can leave US Tax-ID blank.
Actually the first two posters are wrong you need to get a tax ID and start paying US taxes, and also file paperwork each quarter to collect sales taxes separately in each state that you are selling in.
What, really? That can't be right!

Confident people on Internet forums are a wealth of misinformation about the W8-BEN form. The rules are spelled out for you on the form's instructions. The part that is confusing everybody is: what you can get away with. If your customer's procurement is being ultra strict, it is difficult to argue because they must follow the law and you are just a foreigner.

The W8-BEN form must be kept on file by US companies that make payments of any type to foreigners. The government never sees this form unless the company is audited, so the rules are loosely enforced. Most companies will accept these explanations when they ask for one:

My company has no US presence so I cannot provide this form.
I do not have a US taxpayer ID so I cannot provide this form.

But the rules that Martha is following are quite clear, no matter how inconvenient it makes your life. You need to provide this form upon request, and it must have a valid US-taxpayer ID for you to claim benefits under a tax treaty, unless the payment is for dividends.

Because Martha's company is making payments to foreigners, they are a withholding agent, and their finance and legal department are being extra careful to follow the rules in this document for withholding agents.

The good news is that it should only take you about two hours to obtain the documentation to appease Martha. If Martha receives the W8-BEN form, and your country has a tax treaty with the US, then she will probably pay your invoice. Here are the steps.

Obtain a valid US Taxpayer ID

If you are a foreign corporation, the taxpayer identification number you need is called an Employer Identification Number (EIN). It doesn't matter if you have employees, or if you never pay any US taxes due to a treaty.

Download the SS-4 form, and print it out. Fill it in. Most of it is your company name and address, and date of incorporation, so it will be simple.
Now you can mail or fax it in. If you need the number right away, you will have to call.
Phone the US government using the toll number from the instructions of the SS-4 form. As a foreigner you cannot do it online.
You will be told the wait time is 15 minutes, but you will then wait one or more hours on hold for your opportunity to speak with a tax official. Make sure you have your phone charger.
The tax official will be tired and rude because she has to deal with indignant foreigners all day. Smile as you talk and be polite. Explain that you need an EIN to fill out the W8-BEN form. Do not offer any other information. The rest of the call will be you spelling out your company name and address from the SS-4 form letter by letter. (See the NATO Phonetic Alphabet)
She will verbally give you your very own EIN over the phone.
If you are told you need an ITIN then thank her, hang up and try again with a different agent.

Find your applicable tax treaty

If you are lucky, then decades ago, your country negotiated a tax treaty with the United States. The instructions to withholding agents contain a table of tax treaties and section numbers that they will refer to. They are listed in Publication 515. The actual text of the treaties is helpfully available here. You will need the section number from the table of contents. For my Canadian company's software sales, I used "Independent Personal Services" because for some reason the 1980 tax treaty failed to foresee the need for a section on "Cloud Platform for Creating UML Diagrams".

Fill out the W8-BEN form

You'd better have a stack of them, because larger companies are asking for original forms with "wet signature". But in the past it has been convenient to put a scanned copy on your web site, in the same place where you accept purchase orders.

Follow the instructions for the form, filling in your name, address, foreign tax id with your corporation's business number, and enter your new EIN on line 6. On line 9-10, fill in your country and put in the tax treaty section number (eg for Canada: XVII Independent Personal Services)

Good luck!

I hope that you are able to get your invoice paid. Always remember that the people you are dealing with are doing their job, and they have to follow the rules. If you are rude and do not follow instructions, then they will complain about you at the watercooler and possibly misplace your invoice. If you are polite and give them their paperwork, then you will get your money, and they will remember you and future sales with the same company will be a breeze.

Get tips on improving your software business, right in your inbox

Enter your email and I will send you tips on selling software right in your inbox. I have been selling software since 1998, and whether it's consulting products, adsense, or software as a service, I have done it all, and I want to tell you what I wish I knew when I started.

Asana's shocking pricing practices, and how you can get away with it too

2013-11-06T02:21:20-05:00

If one apple costs $1, how much would five apples cost? How about 500?

In everyday life, when you buy more of something, you get more bananas for your buck. The fixed costs decrease. If you sell a lot of apples to one person, you don't have to wrap each one, you don't have to pay fixed transaction fees on each sale, and you don't have to worry about finding someone to buy the other 499 apples. The savings are passed on to the consumer. Often, software is priced this way too. That's why I love Asana's pricing page. It breaks the rules.

Asana prices their product based on its value. It lets teams coordinate about projects and tasks they are working on. Asana is very clear about the value they give. In fact, the pricing page tells you that the only difference between the paid and free versions is that "premium plans allow you to coordinate with more team members, as well as the features listed in the table above. All other user features are exactly the same."

It's a mathematical law that as the number of people in a team grows, the number of communication paths grows quadratically. A company with 100 people using it is therefore getting much more value out of it than a company of 15 people, so they pay higher per-seat costs.
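
As a quick check on that claim: among n people there are n(n-1)/2 possible pairs. A minimal sketch (the function name is mine, not anything from Asana):

```python
def communication_paths(people):
    """Number of distinct pairs among `people` team members: n * (n - 1) / 2."""
    return people * (people - 1) // 2

for team_size in (15, 100):
    print(team_size, "people:", communication_paths(team_size), "paths")
```

That gives 105 paths for 15 people and 4950 for 100, so the headcount grows by less than 7 times while the number of paths grows by about 47 times.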

Homework

What is the one thing that gives your software value? Are you directly charging for that thing, or something else? How can you take advantage of team effects to provide more value when more people use it?

5 Ways PowToon Made Me Want to Buy Their Software

2013-10-31T21:28:19-05:00

Powtoon is online software that lets you create animated powerpoint presentations, without the steep learning curve of Adobe Flash. The selling techniques they use are simple and powerful. Even though I saw through their tricks at every step along the way, I am now a customer and proud of it. It is worthwhile to look at what they did, because these are simple things that you can do to improve your software business.

Help users remember your app with email

As soon as I signed up with powtoon, the emails started. By the fifth day of the constant barrage I started marking them as spam. But for those five days, my inbox was a constant reminder that Powtoon was there and waiting for me. Since it was on my mind, I mentioned it to three other people in conversations I had. If the emails weren't there, I might be still struggling to remember the name of that cartoon site I signed up with, instead of being a paying user. What was that name again? Powertoon.io? Powrtoon.com?

Homework

If you offer software as a service, you should have an email campaign that offers helpful tips, as a way of reminding people of where they've been. Emailing every day is excessive, however.

Go read the user manual $23 ebook we're giving you FREE!

Sadly, for users coming from Google Docs or PowerPoint, Powtoon's first-use experience is still confusing and it is enormously helpful to read the manual. Many people read the manual without realizing it because it is delivered in the form of an Ebook, with a value attached to it. "You could buy this book from Amazon for $23!" says the copy, "or download it here FREE".

When I get a free book I usually toss it in the recycling. But if it's worth money, I'll probably read it.

Homework

Sometimes users need to read the manual to get the most out of your product. If reading the manual is a part of your sales funnel, you should do everything you can to get people to read it. Package it as an E-book to get more people through this stage.

Price for value, not competition

Powtoon could have looked at Adobe Creative suite and priced their product cheaper. But Adobe is not the competition. Instead, the much larger, richer market of skilled powerpoint folks are looking at Powtoon, and comparing it with professional video design. From that perspective it looks like a bargain.

Homework

Unless your goal is to be acquihired, it is difficult to build a business on $9/month. If your software has no plan that costs at least $100/month, try to think of some feature that would be a must-have for businesses. For big companies, cost doesn't matter, as long as your highest plan has something they need.

A clear reason to buy

It is easy to make software that hides the premium features away. They are grayed out and pushed to the end of the list. I've done it myself. For users, it's like living in a cosy room. If you ignore the locked door behind the couch, you can forget that it is the foyer of a mansion.

At every step, Powtoon reminds you that you are missing the majority of its capabilities. While choosing a background tune, I had to scroll through hundreds of songs that I could play but couldn't use. Only about five songs were available, but they were strewn through the list. Every other item of clipart is only available in the highest-cost plan, but again, you have to scroll through them. You can make a great video without paying anything. You are free to use any song or image that you can upload. But you leave with the deep sense that the software is crippled.

Homework

Do users on your free plan say that they don't see any reason to buy? If the premium features are hidden away, it's time to make them more visible.

Price segmentation

The Powtoon Agency plan serves two purposes. Firstly, it lets users who can pay more do so. Some users don't care how much something costs and don't want to have to think about it. These users have an option to pay more. But I am not one of them.

Secondly, the Agency plan is listed first on the pricing page, and psychologically anchors the value of the rest of the plans. When I first used Powtoon I was struck by the high cost. I couldn't justify it, so I resolved to use the free plan. But a few days later I saw this:

A few days after I signed in, Powtoon sent me this offer. It made me rethink things in a hurry. With this sale, and the "savings" of over $400 buy-now-before-it's-too-late plastered on my monitor every time I sign in, they successfully reframed the middle tier as a bargain.

Homework

In your pricing page, what techniques do you use to get people from the free tier into the middle tier plan? How can you use the high-cost tier to reframe the value of the other plans?

So what did I end up making?

I whipped together a video for my consulting product, Zwibbler. Zwibbler is a drop-in solution that lets users draw in your web app.

How I run my business selling software to Americans

2013-04-30T14:09:51-05:00

I first realized I had overpaid when I received my articles of incorporation from the law firm. Was it because they were in a leather bound binder? Was it because it had been shipped overnight from Toronto to Waterloo, a distance of 83 km, such a distance that I could have driven there and picked it up and then returned and paid less than the cost of shipping it? Instead I had to wait two business days for the Fedex truck to drive back to Cambridge since I wasn't home, and then I had to call in to arrange to pick it up at a "conveniently located" Fedex office a week later.

No, I was miffed because I had paid $1500 to incorporate, when I could have done it over the web for far less. Still, it is a very nice binder.

Since that time I have slowly been finding ways to optimize my business, which consists of selling software to Americans. I sell it all kinds of ways.

A typical month:

WebSequenceDiagrams subscriptions	$1600
WebSequenceDiagrams Server Sales	$1600
Rhymebrain.com Adsense Revenue	$1700
Zwibbler.com licensing and consulting	$2000

*My annual reports are on Google Plus, where nobody reads them.

According to the Canada Revenue Agency, I'm a profitable small business. The only thing preventing me from spending it all on iPads, Google Glasses and Surface Tablets is the fact that I have to feed my lovely family.

I keep costs down by feeding them chocolate flavoured soylent in a bucket

Here are some tips on running a business in Canada selling software to Americans.

Incorporation

Incorporation is a good choice. While it gives a valuable sense of security (albeit false) against lawsuits, the most useful benefit is income deferral. I started this company while I was working full time. If it were a sole proprietorship, I would have had to pay the top tax rate of 40% on everything I earned. This would have been a huge disincentive to growing my business.

Evil Tip for starting a company while working somewhere else

Whenever someone asks if it's legal, point them to the corporate policy and claim that there's a simple form that you fill out, and loudly complain that it takes the legal department eight months to answer any emails. With luck, the person that asked will launch into his own stories about the slow legal department, thus deflecting the conversation to a more useful topic.

With all of the profits inside a corporation, I had to pay only the 16% corporate tax on them. But I can keep the profits there until I feel like withdrawing them. It's like having an extra RRSP.

However, incorporation does have some added responsibilities. First, I have to pay Intuit TurboTax $200 every year to file my taxes. And that software only does about 10% of the work -- I have to maintain a balance sheet and income statement for the year so I can get the numbers to enter into TurboTax. Still, I figure we are about even, because Intuit also bought the server edition of WebSequenceDiagrams.

Evil Tip for doing your own taxes

It is easy to make a lot of mistakes the first time. But hiring an accountant costs $2000, while penalties from the government for making a mistake are maybe about $50 tops. I'll re-evaluate this when I'm making sufficiently more profit.

After you incorporate, you can't do very much until you get:

A business bank account

Canada has a cartel of five major banks. Stay away from them. I was explaining banking to my 3 year old daughter (her twitter account):

Me: Banks are a place where you keep your money.

Lillian: WHY?

Me: Because they give you interest... (thinking) but then they take it away and charge you more money.

Lillian: WHY?

Me: I guess you put your money in a bank to keep it safe, and every month they take some away.

Lillian: WHY?

Me: I don't know. If you keep your money in the bank they will slowly take it away from you.

Lillian: I WILL KEEP MY MONIES BESIDE MY POTTY.

Me: Good. Now it's time to watch Dora. Daddy's got to go buy some bitcoin.

Instead, I use a local credit union, which has a pay-as-you-go account. For $5 a month I can keep all of my profits there and write cheques. They wanted to sell me a business cheque book. What is it with all these leather-bound things? Does my business have to have everything wrapped in cow skin to appear successful? I imagine it might be useful in a narrow range of situations:

Me in line at the grocery store: Will you take a cheque for these Ruffles^TM brand potato chips?

Attractive cashier: Um, noooo. What do you think this is, like 1985? Don't you have a Paypass chip?

Me: What about... from THIS chequebook? (whips out the corinthian leather-bound Execu-Check 5000 with dual-signature, day-planner, and matching gold pens.)

Attractive cashier: Oooh, no problem, Mr. Hanov. What are you doing later?

My wife: He'll be sleeping in the basement. Let's go.

I managed to get them to give me personal cheques with my business name written on them by asking very nicely. Credit unions are nice that way.

Unfortunately you have to deal with big banks sometimes. I needed to get:

A credit card

After a lot of research, I selected the Bank of Montreal credit card for businesses, because there is no fee and every December I get some cash back for using it. I filled out the application with my personal information, and since I was working at the time, there was no problem getting it.

Evil tip for paying for things

Currency exchange is expensive. As a rule, I pay for US things with US dollars, and Canadian things with Canadian dollars. This was only a problem with Microsoft Office 365, which insisted on charging my Paypal account in CAD. I had to tell Microsoft that I live in Beverly Hills to use my USD Paypal account. Because that's the only zip code I know.

I've only talked about the Canadian side so far. But many Canadian software companies get all their revenue in US dollars. There is an important trick for dealing with this, which I will get to shortly. But first:

Paypal

Paypal is utterly horrible to use and develop for. For example, to cancel a subscription for a user from last August, I have to page through them all, 25 at a time, waiting 5-10 seconds for each page load, until I get to August. If I didn't know when the subscription started, then I would be there for much longer reading all of the names. For developers, Paypal offers a special sandbox area which hasn't worked in months, and special paypal IPN notifications which are broken for several months out of the year.

But once it's finally working, Paypal works everywhere. It lets me enter in tax rates for all of the Canadian provinces and territories (Why aren't they there already?). It lets me accept orders from Israel, France, the UK, Germany, Australia, New Zealand, Finland, Norway. When companies that I sell to refuse to use Paypal, it lets me pay $35 to enter the credit card details manually. The fee is outrageous, but it's a steal compared to anywhere else I can use. (Update: Use stripe.com.)

I use Paypal for most of my sales, but I have a small nagging fear that one day, the US Government (which to us Canadians appears quite insane, so they would do this kind of thing) will take all of my money to fight terror by arming day cares. But when I transfer money out of Paypal into Canada, they use over-the-counter exchange rates. If you are using Paypal for currency conversion you must stop immediately, because there is a better way, which I will explain to you shortly. But first, to get the money out of Paypal without currency conversion, you need:

A US dollar bank account

My credit union couldn't offer all the services I needed to run my global company. I searched around and I found a reasonable deal with the Royal Bank of Canada USD Checking account. It is regularly $9/month, but with a minimum balance of $2500 the price drops to $2.

I need the USD bank account so I can accept incoming bank transfers with no currency losses. Outside of North America, it is common to pay for large items by exchanging bank account information, so the buyer can transfer the cash directly into the seller's accounts. North American banks discourage this behaviour by levying huge fees. When I invoice a customer, I include the bank details and in a few weeks I receive the full amount, minus the $15 fee for RBC, and $25 for some mysterious "intermediary bank". Still, a flat $40 charge looks pretty good on amounts greater than $1000 when compared to Paypal's fees for the same.

Evil tip for well-connected, wealthy European financiers

An intermediary bank is a good business to get into. Also, drugs.

So I have a Canadian Dollar account in a credit union, US dollars in Paypal, and US dollars in RBC. How do I get my money into good old Canadian loonies and toonies? That's where my favourite part comes in.

XE.com

If you are a Canadian company whose revenue comes in USD, you should immediately get an account at a currency broker. Once set up, it is a simple matter to transfer money between a US and Canadian bank account, at crazy-low conversion fees.

For example, I just went to Paypal and XE and priced out transferring $5000 USD into Canada. Today, the difference is not huge, but the spread has been much higher in the past.

Paypal	$4,929.33 CAD
XE.com	$4,980.00 CAD

I do not want 1 to 3% of my revenue disappearing off the top, so I use XE. When I signed up, I registered my US account and the Canadian one with them by copying the numbers off some cheques, and now I can initiate a transfer in seconds.
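
To put a number on the two quotes above, here is a small sketch that works out how much of the transfer Paypal's rate gives up compared with XE's. The function name is mine; the figures are the ones from the comparison.

```python
def spread_percent(better_quote, worse_quote):
    """Percentage of the better quote that is lost by taking the worse one."""
    return (better_quote - worse_quote) / better_quote * 100

paypal_cad = 4929.33  # CAD offered by Paypal for $5000 USD
xe_cad = 4980.00      # CAD offered by XE.com for the same $5000 USD

lost = xe_cad - paypal_cad
print(f"Paypal costs {lost:.2f} CAD more, or {spread_percent(xe_cad, paypal_cad):.2f}%")
```

On that day the gap was about 50.67 CAD, roughly 1% of the transfer, which is the low end of the 1 to 3% range.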

Charging tax

I added this section due to the comments. As a Canadian business, if you make more than $30,000 in revenue over the past year (see the CRA web site for the specific rules), then you have to register for GST/HST. You may register before that point, though, and it may be advantageous, because registered businesses effectively don't pay GST/HST on what they buy. Keep track of all the stuff you buy for your business, tell the CRA the total each January, and you will get it all back in a nice fat cheque. This works out well when you, for instance, buy a $5000 MacBook Pro.

Registering for GST/HST, however, requires you to do certain things.

Fill out the GST/HST return each year. It is a simple two-page form. Just report your total sales, the tax you collected, and the tax you paid. That's it!
When you invoice a Canadian customer, then include your HST registration number and the tax charged in the invoice. This varies by province, so make sure you look it up on Wikipedia before you invoice those freeloading Albertans.

You aren't obligated to collect taxes for the governments of other countries. Just invoice the subtotal, and they can work out their own damn taxes. This does lead to a ridiculous HST return, where you've collected $100k in revenue and only $100 in HST. That's OK... software is a global product, and there are hardly any Canadians in this global market.

Do you charge in US dollars? That's OK too. It may look weird, but you can invoice a Canadian in USD. I do this because all my accounts are set up in USD. Also include HST in USD on the amount, as if it were in CAD. For reporting purposes, I convert the total based on the yearly average USD conversion rate on the CRA web site, but you are free to use other methods.
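
Here is a rough sketch of that arithmetic, not tax advice. The 13% rate and the 1.03 CAD-per-USD yearly average are hypothetical placeholders; look up the real rate for the customer's province and the real average on the CRA web site.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def invoice_totals(subtotal_usd, hst_rate, usd_to_cad):
    """Return (hst_usd, total_usd, hst_cad) for an invoice issued in USD."""
    # HST is charged in USD on the USD amount, as if it were in CAD.
    hst_usd = (subtotal_usd * hst_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    total_usd = subtotal_usd + hst_usd
    # For the return, convert the tax collected using the yearly average rate.
    hst_cad = (hst_usd * usd_to_cad).quantize(CENT, rounding=ROUND_HALF_UP)
    return hst_usd, total_usd, hst_cad

# Hypothetical numbers: a $1000 USD sale, 13% HST, 1.03 CAD per USD.
hst_usd, total_usd, hst_cad = invoice_totals(
    Decimal("1000.00"), Decimal("0.13"), Decimal("1.03")
)
```

Using Decimal rather than floats keeps the cents exact, which matters when the totals have to match the invoice.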

In January you will have to pay, or receive, the difference between the taxes you collected and the taxes you paid for stuff.

How do you do it?

Do you have a different way for optimizing your cash flow? Did I miss anything? Please share your tips in the comments.

0, 1, Many, a Zillion

2013-04-05T17:17:13-05:00

There are only four numbers in computer programs:

0, 1, many, "a zillion"

If you have 2 or more of anything, you are, in general, better off using loops to process many of them.

But what is "a zillion?"

Zillion is a made-up number. Your system cannot hold a zillion items in memory. It cannot show a zillion items on the screen.

Doesn't work for "a zillion":

Select employee name:

Doesn't work for "a zillion":

def handleFiles( filenames: Array[String] ) {
    val results = openFiles(filenames).readAll().processAll()
    results
}

* The program first opens all the files, and then processes them. The OS will run out of file handles.
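
One way around that, sketched here in Python rather than Scala: open, process, and close one file at a time, and hand the results back lazily so the caller never holds all of them at once. The process function is a placeholder I made up; it just counts lines.

```python
from typing import Iterable, Iterator

def process(lines: Iterable[str]) -> int:
    """Placeholder for the real per-file work: here it just counts lines."""
    return sum(1 for _ in lines)

def handle_files(filenames: Iterable[str]) -> Iterator[int]:
    """Yield one result per file, with at most one file handle open at a time."""
    for name in filenames:
        # The with-block closes the file before the result is handed back
        # and before the next file is opened.
        with open(name, encoding="utf-8") as f:
            result = process(f)
        yield result
```

Because handle_files is a generator, `for result in handle_files(names):` works the same way whether names holds three files or comes from a stream of a zillion.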

Doesn't work for "a zillion":

Changing software from handling "many" to "a zillion" is hard if the program is already written.

Decide when you need to handle a zillion.

Give your Commodore 64 new life with an SD card reader

2012-08-01T21:22:38-05:00

This August marks the 30th anniversary of the most successful computer model in history. One company put personal computers into people's homes, and launched an entire industry overnight. For an entire decade, despite attempts at marketing improvements, the original platform stood the test of time, virtually unchanged. Even today, the Commodore 64 is celebrated by a community of hobbyists.

In honour of the anniversary month, here are some up to date instructions on how to read old data from the disks. Floppy disks only have a usable lifespan of 10 years, and most of them in my collection are over 25 already. At this point, the chemical binder between the magnetic particles and the plastic substrate is degrading, and the data can literally fall out of the disks. Fortunately, up to date software from the opencbm project tries very hard to read the data and can often succeed when a disk appears to be unusable when connected to a C64.

Here's what you need to read your floppy disk into your PC:

A commodore 1541 disk drive, in good condition, and power supply.
The drive cable.
This ZoomFloppy adapter
a USB cable.

In addition, if you want to read disk images from an SD card hooked into the real Commodore 64, you will need the uIEC adapter, which can be substituted for a real floppy drive.

Transferring floppy disks to the PC

I want to scan in my copy of Jumpman so its digital bits are preserved.

This copy is totally genuine EPYX product. At least that's what the guy at the swap meet told me, behind the K-mart.

You have to download the special build of the opencbm software from here. I am using the windows version. Read the manual very carefully to install the driver, before you can use the command line tools.

Then, plug the ZoomFloppy into the Commodore drive using the commodore cable (make sure the drive is off!), and into your PC using a USB cable.

Confusingly, there are some empty ports. If you want, you can hook other things into them for decoration, I guess.

Now we turn on the drive, put in the disk and cross our fingers. From a command line, I type:

d64copy -r 16 8 "jumpman.d64"

This means:

Retry bad blocks up to 16 times before giving up.
Copy from device #8. Remember, on Commodore, the first floppy drive was always device 8.
Copy the contents of the disk to the file called jumpman.d64

A minute later, and it is done. Unfortunately, we have a bad sector. Hopefully it is in an unused part of the disk, or in some graphics that won't cause a crash.

At this point, we can run the game in the Vice emulator (On windows this distribution seems to work).

Running games from SD cards

No serious commodore enthusiast is without his or her SD card reader. I happen to have the uIEC which Jim Brain soldered together right in front of me at the World of Commodore 2011 in Toronto.

It has to plug into the back of the Commodore, presumably for power, and then the drive cable goes into it as well. Then, it is ready to act exactly like a floppy drive.

But how do folders work?

You can just about fit all the Commodore software in existence onto one card, so how do you access it? You can switch disk images by sending special drive commands to it in BASIC. I've placed the jumpman.d64 file on the SD card. Here is the command to switch to it:

OPEN1,8,15:PRINT#1,"CD:JUMPMAN.D64":CLOSE1

There are many other commands for the uIEC. It has some buttons on the circuit board that let you swap disks on the fly without typing commands, but these have to be set up by listing them in a special file. All the other commands are described here.

A major drawback of the uIEC is that while it emulates the standard Commodore drive operation, the commodore 1541 drive was actually another computer that you could load programs on and run. Some programs did this for copy protection or to implement custom fast-loaders. More recently the retro demo-scene uses it for extra storage and computing power. I was disappointed when the Second Reality demo didn't work.

Well, I have Jumpman loaded anyway, so off to an evening of retro fun.

The cheater method

Of course, if you don't want to go to all this trouble, you could obtain just about any software you can think of using pokefinder. Just Bing it.

20 lines of code that will beat A/B testing every time

2012-05-28T21:10:15-05:00

Zwibbler.com is a drop-in solution that lets users draw on your web site.

A/B testing is used far too often, for something that performs so badly. It is defective by design: Segment users into two groups. Show the A group the old, tried and true stuff. Show the B group the new whiz-bang design with the bigger buttons and slightly different copy. After a while, take a look at the stats and figure out which group presses the button more often. Sounds good, right? The problem is staring you in the face. It is the same dilemma faced by researchers administering drug studies. During drug trials, you can only give half the patients the life saving treatment. The others get sugar water. If the treatment works, group B lost out. This sacrifice is made to get good data. But it doesn't have to be this way.

In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads. It has been working. Both Google and Microsoft are focusing on using more information about visitors to predict what to show them. Strangely, anything better than A/B testing is absent from mainstream tools, including Google Analytics, and Google Website optimizer. I hope to change that by raising awareness about better techniques.

With a simple 20-line change to how A/B testing works, that you can implement today, you can always do better than A/B testing -- sometimes, two or three times better. This method has several good points:

It can reasonably handle more than two options at once, e.g. A, B, C, D, E, F, G, ...
New options can be added or removed at any time.

But the most enticing part is that you can set it and forget it. If your time is really worth $1000/hour, you really don't have time to go back and check how every change you made is doing and pick options. You don't have time to write rambling blog entries about how you got your site redesigned and changed this and that and it worked or it didn't work. Let the algorithm do its job. These 20 lines of code automatically find the best choice quickly, and then use it until it stops being the best choice.

The Multi-armed bandit problem

Picture from Microsoft Research

The multi-armed bandit problem takes its terminology from a casino. You are faced with a wall of slot machines, each with its own lever. You suspect that some slot machines pay out more frequently than others. How can you learn which machine is the best, and get the most coins in the fewest trials?

Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. One strategy that has been shown to perform well time after time in practical problems is the epsilon-greedy method. We always keep track of the number of pulls of the lever and the amount of rewards we have received from that lever. 10% of the time, we choose a lever at random. The other 90% of the time, we choose the lever that has the highest expectation of rewards.

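import random

# Each lever starts at 1 reward out of 1 trial, as in the example below.
trials = {"orange": 1, "green": 1, "white": 1}
rewards = {"orange": 1.0, "green": 1.0, "white": 1.0}

def choose():
    if random.random() < 0.1:
        # exploration!
        # choose a random lever 10% of the time.
        choice = random.choice(list(trials))
    else:
        # exploitation!
        # for each lever, calculate the expectation of reward.
        # This is the total reward given by that lever divided by
        # the number of trials of the lever.
        # choose the lever with the greatest expectation of reward.
        choice = max(trials, key=lambda lever: rewards[lever] / trials[lever])
    # increment the number of times the chosen lever has been played.
    trials[choice] += 1
    # store test data in redis, choice in session key, etc..
    return choice

def reward(choice, amount):
    # add the reward to the total for the given lever.
    rewards[choice] += amount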

Why does this work?

Let's say we are choosing a colour for the "Buy now!" button. The choices are orange, green, or white. We initialize all three choices to 1 win out of 1 try. It doesn't really matter what we initialize them to, because the algorithm will adapt. So when we start out, the internal test data looks like this.

Orange	Green	White
1/1 = 100%	1/1=100%	1/1=100%

Then a web site visitor comes along and we have to show them a button. We choose the first one with the highest expectation of winning. The algorithm thinks they all work 100% of the time, so it chooses the first one: orange. But, alas, the visitor doesn't click on the button.

Orange	Green	White
1/2 = 50%	1/1=100%	1/1=100%

Another visitor comes along. We definitely won't show them orange, since we think it only has a 50% chance of working. So we choose Green. They don't click. The same thing happens for several more visitors, and we end up cycling through the choices. In the process, we refine our estimate of the click through rate for each option downwards.

Orange	Green	White
1/4 = 25%	1/4=25%	1/4=25%

But suddenly, someone clicks on the orange button! Quickly, the browser makes an Ajax call to our reward function $.ajax({url: "/reward?testname=buy-button"}); and our code updates the results:

Orange	Green	White
2/5 = 40%	1/4=25%	1/4=25%

When our intrepid web developer sees this, he scratches his head. What the F*? The orange button is the worst choice. Its font is tiny! The green button is obviously the better one. All is lost! The greedy algorithm will choose orange forever now!

But wait, let's see what happens if Orange is really the suboptimal choice. Since the algorithm now believes it is the best, it will always be shown. That is, until it stops working well. Then the other choices start to look better.

Orange	Green	White
2/9 = 22%	1/4=25%	1/4=25%

After many more visits, the best choice, if there is one, will have been found, and will be shown 90% of the time. Here are some results based on an actual web site that I have been working on. We also have an estimate of the click through rate for each choice.

Orange	Green	White
114/4071 = 2.8%	205/6385=3.2%	59/2264=2.6%
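
To watch the whole process play out without waiting for real visitors, here is a small simulation sketch. The click-through rates in TRUE_RATES are made-up values for illustration only (they are not measurements from the site above), and simulate() is a hypothetical helper name.

```python
import random

# Hypothetical true click-through rates, chosen only for illustration.
TRUE_RATES = {"orange": 0.028, "green": 0.032, "white": 0.026}

def simulate(visitors, epsilon=0.1, seed=1):
    rng = random.Random(seed)
    # Start every lever at 1 reward out of 1 trial, as in the walk-through.
    trials = {name: 1 for name in TRUE_RATES}
    rewards = {name: 1.0 for name in TRUE_RATES}
    for _ in range(visitors):
        if rng.random() < epsilon:
            # explore: show a random button
            choice = rng.choice(list(TRUE_RATES))
        else:
            # exploit: show the button with the best estimate so far
            choice = max(trials, key=lambda n: rewards[n] / trials[n])
        trials[choice] += 1
        # Did this simulated visitor click?
        if rng.random() < TRUE_RATES[choice]:
            rewards[choice] += 1.0
    return trials, rewards

trials, rewards = simulate(20000)
for name in TRUE_RATES:
    print(name, "%d/%d = %.1f%%" % (rewards[name], trials[name],
                                    100.0 * rewards[name] / trials[name]))
```

When the true rates are as close together as these, different seeds can leave different colours in the lead for a long stretch, so treat any single run as an anecdote.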

Edit: What about the randomization?

I have not discussed the randomization part. The randomization of 10% of trials forces the algorithm to explore the options. It is a trade-off between trying new things in hopes of something better, and sticking with what it knows will work. There are several variations of the epsilon-greedy strategy. In the epsilon-first strategy, you can explore 100% of the time in the beginning and once you have a good sample, switch to pure-greedy. Alternatively, you can have it decrease the amount of exploration as time passes. The epsilon-greedy strategy that I have described is a good balance between simplicity and performance. Learning about the other algorithms, such as UCB, Boltzmann Exploration, and methods that take context into account, is fascinating, but optional if you just want something that works.
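
The variations mentioned above differ only in how much exploration they allow at a given moment, so they can be sketched as three small schedule functions. The function names, the 1000-visit cut-off and the halving schedule are all assumptions made for this sketch; the text above does not prescribe particular numbers.

```python
def epsilon_fixed(visit, epsilon=0.1):
    # Plain epsilon-greedy: explore 10% of the time, always.
    return epsilon

def epsilon_first(visit, explore_visits=1000):
    # Epsilon-first: explore 100% of the time at the start, then pure greedy.
    return 1.0 if visit < explore_visits else 0.0

def epsilon_decreasing(visit, start=1.0, halve_every=1000):
    # Decreasing: the exploration rate halves every `halve_every` visits.
    return start * 0.5 ** (visit / halve_every)
```

Any of these can stand in for the constant 0.1 in the exploration test, for example `random.random() < epsilon_first(visit)`, as long as you keep a count of visits.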

Wait a minute, why isn't everybody doing this?

Statistics are hard for most people to understand. People distrust things that they do not understand, and they especially distrust machine learning algorithms, even if they are simple. Mainstream tools don't support this, because then you'd have to educate people about it, and about statistics, and that is hard. Some common objections might be:

Showing the different options at different rates will skew the results. (No it won't. You always have an estimate of the click through rate for each choice)
This won't adapt to change. (Your visitors probably don't change. But if you really want to, in the reward function, multiply the old reward value by a forgetting factor)
This won't handle changing several things at once that depend on each other. (Agreed. Neither will A/B testing.)
I won't know what the click is worth for 30 days so how can I reward it?
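
The forgetting factor mentioned in the second objection could look something like the sketch below. Two things here are assumptions of the sketch rather than anything stated above: the value 0.999, and the choice to decay the trial counts along with the rewards on every trial, which keeps the ratio a click-through rate instead of letting it drift towards zero.

```python
FORGETTING_FACTOR = 0.999  # hypothetical value; closer to 1.0 forgets more slowly

trials = {"orange": 1.0, "green": 1.0, "white": 1.0}
rewards = {"orange": 1.0, "green": 1.0, "white": 1.0}

def record_trial(choice):
    # Fade every lever's history a little, then count the new trial.
    for name in trials:
        trials[name] *= FORGETTING_FACTOR
        rewards[name] *= FORGETTING_FACTOR
    trials[choice] += 1.0

def reward(choice, amount):
    # Add the reward to the (already faded) total for the given lever.
    rewards[choice] += amount

def expectation(choice):
    # Recent visitors count for more than old ones in this estimate.
    return rewards[choice] / trials[choice]
```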

More blog entries

[comic] Appreciation of xkcd comics vs. technical ability

2012-02-14T08:00:00-05:00

VP trees: A data structure for finding stuff fast

2011-12-02T08:00:00-05:00

Let's say you have millions of pictures of faces tagged with names. Given a new photo, how do you find the name of the person that the photo most resembles?

Suppose you have scanned short sections of millions of songs, and for each five second period you have a rough list of the frequencies and beat patterns contained in them. Given a new audio snippet, can you find the song to which it belongs?

What if you have data from thousands of web site users, including usage frequency, when they signed up, what actions they took, etc. Given a new user's actions, can you find other users like them and predict whether they will upgrade or stop using your product?

In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in high dimensional space. When you look at points in space, they tend to form clusters, and you can infer a lot by looking at ones nearby.

In this blog entry, I will half-heartedly describe some data structures for spatial search. Then I will launch into a detailed explanation of VP-Trees (Vantage Point Trees), which are simple, fast, and can easily handle low or high dimensional data.

Data structures for spatial search

When a programmer wants to search for points in space, perhaps the first data structure that springs to mind is the K-D tree. In this structure, we repeatedly subdivide all of the points along a particular dimension to form a tree structure.

With high dimensional data, the benefits of the K-D tree are soon lost. As the number of dimensions increases, the points tend to scatter and it becomes difficult to pick a good splitting dimension. Hundreds of students have gotten their master's degrees by coding up K-D trees and comparing them with an alphabet soup of other trees. (In particular, I like this one.)

The authors of Data Mining: Practical Machine Learning Tools and Techniques suggest using Ball Trees. Each node of a Ball tree describes a bounding sphere, using a centre and a radius. To make the search efficient, the nodes should use the minimal sphere that completely contains all of its children, and overlaps the least with other sibling spheres in the tree.

Ball trees work, but they are difficult to construct. It is hard to figure out the optimal placement of spheres to minimize the overlap. For high dimensional data, the structure can be huge. The nodes must store their centre, and if a point has thousands of coordinates, it occupies a lot of storage. Moreover, you need to be able to calculate these fake sphere centres from the other points. What, exactly, does it mean to calculate a point between two sets of users' web usage history?

Fortunately, there are methods of building tree structures which do not require manipulation of the individual coordinates. The things that you put in them do not need to resemble points. You only need a way to figure out how far apart they are.

Entering metric space

Imagine you are blindfolded and placed in a gymnasium filled with other blindfolded people. Even worse: you also lost all sense of direction. When others talk, you can sense how far away they are, but not where they are in the room. Eventually, some basic laws become clear.

If there is no distance between you and the other person, you are standing in the same spot.
When you talk to another person, they perceive you as being the same distance away as you perceive them.
When you talk to person A and person B, the distance to A is never more than the distance to B plus the distance from A to B. In other words, the shortest distance between two people is a straight line. Distance is never negative.
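
These laws can be checked mechanically for any candidate distance function. The sketch below tests them on a small sample of points; passing is only evidence, not proof, that the function is a metric. The names euclidean and looks_like_metric, and the sample points, are invented for this illustration.

```python
import itertools
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def looks_like_metric(distance, points, tolerance=1e-9):
    """Check the metric-space laws on a sample of points."""
    for a, b in itertools.product(points, repeat=2):
        d = distance(a, b)
        if d < 0:
            return False          # distance is never negative
        if (d == 0) != (a == b):
            return False          # zero distance only at the same spot
        if abs(d - distance(b, a)) > tolerance:
            return False          # both people perceive the same distance
    for a, b, c in itertools.product(points, repeat=3):
        if distance(a, b) > distance(a, c) + distance(c, b) + tolerance:
            return False          # a detour through c is never shorter
    return True

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.5, 3.0)]
print(looks_like_metric(euclidean, points))
```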

This is a metric space. The great thing about metric spaces is that the things that you put in them do not need to do a lot. All you need is a way of calculating the distances between them. You do not need to be able to add them together or find bounding shapes or find points midway between them. The data structure that I want to talk about is the Vantage Point Tree (a generalization of the BK-tree that is eloquently reviewed in Damn Cool Algorithms).

Each node of the tree contains one of the input points, and a radius. Under the left child are all points which are closer to the node's point than the radius. The other child contains all of the points which are farther away. The tree requires no other knowledge about the items in it. All you need is a distance function that satisfies the properties of a metric space.

How searching a VP-Tree works

Let us examine one of these nodes in detail, and what happens during a recursive search for the nearest neighbours to a target.

Suppose we want to find the two nearest neighbours to the target, marked with the red X. Since we have no points yet, the node's center p is the closest candidate, and we add it to the list of results. (It might be bumped out later). At the same time, we update our variable tau which tracks the distance of the farthest point that we have in our results.

Then, we have to decide whether to search the left or right child first. We may end up having to search them both, but we would like to avoid that most of the time.

Since the target is closer to the node's center than its outer shell, we search the left child first, which contains all of the points closer than the radius. We find the blue point. Since it is farther away than tau we update the tau value.

Do we need to continue the search? We know that we have considered all the points that are within the distance radius of p. However, it is closer to get to the outer shell than the farthest point that we have found. Therefore there could be closer points just outside of the shell. We do need to descend into the right child to find the green point.

If, however, we had reached our goal of collecting the n nearest points, and the target point is farther from the outer shell than the farthest point that we have collected, then we could have stopped looking. This results in significant savings.
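
The search just described can be sketched compactly in Python. This is an illustration under a few assumptions, not a reference implementation: the names Node, build and nearest are invented here, k is assumed to be at least 1, and the build step sorts the points at every level for brevity where a median-selection routine would be faster.

```python
import heapq
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Node:
    def __init__(self, point, threshold=0.0, left=None, right=None):
        self.point = point          # the vantage point stored at this node
        self.threshold = threshold  # the radius: median distance to the rest
        self.left = left            # points no farther than the radius
        self.right = right          # points at or beyond the radius

def build(points, distance, rng=random):
    if not points:
        return None
    points = list(points)
    # Pick a random vantage point and take it out of the list.
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return Node(vantage)
    points.sort(key=lambda p: distance(vantage, p))
    median = len(points) // 2
    threshold = distance(vantage, points[median])
    return Node(vantage, threshold,
                build(points[:median], distance, rng),
                build(points[median:], distance, rng))

def nearest(root, target, k, distance):
    heap = []         # (-distance, point) pairs; the worst kept result is on top
    tau = [math.inf]  # distance of the farthest result once we hold k of them

    def visit(node):
        if node is None:
            return
        dist = distance(node.point, target)
        if dist < tau[0]:
            if len(heap) == k:
                heapq.heappop(heap)      # bump out the current farthest result
            heapq.heappush(heap, (-dist, node.point))
            if len(heap) == k:
                tau[0] = -heap[0][0]
        if dist < node.threshold:
            # Target is inside the shell: look at the inner points first,
            # then outside only if the shell is within tau of the target.
            visit(node.left)
            if dist + tau[0] >= node.threshold:
                visit(node.right)
        else:
            visit(node.right)
            if dist - tau[0] <= node.threshold:
                visit(node.left)

    visit(root)
    # Sort so that the nearest point comes first.
    return [point for _, point in sorted(heap, reverse=True)]
```

A call such as `nearest(build(points, euclidean), target, 2, euclidean)` returns the two points closest to target, nearest first.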

Implementation

Here is an implementation of the VP Tree in C++. The recursive search() function decides whether to follow the left, right, or both children. To efficiently maintain the list of results, we use a priority queue. (See my article, Finding the top k items in a list efficiently for why).

I tried it out on a database of all the cities in the world, and the VP tree search was 3978 times faster than a linear search through all the points. You can download the C++ program that uses the VP tree for this purpose here.

It is worth repeating that you must use a distance metric that satisfies the triangle inequality. I spent a lot of time wondering why my VP tree was not working. It turns out that I had not bothered to find the square root in the distance calculation. This step is important to satisfy the requirements of a metric space, because if straight-line distances satisfy a <= b + c, it does not necessarily follow that a² <= b² + c².
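
A concrete case makes the problem visible. Take three points on a line, one unit apart; the point coordinates and function names below are chosen just for this illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared(a, b):
    # Same as euclidean() but without the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

p, q, r = (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)   # three points on a line

# With the square root, going from p to r directly is no longer than
# going through q: 2 <= 1 + 1.
print(euclidean(p, r) <= euclidean(p, q) + euclidean(q, r))   # True

# Without it, the "direct" distance is 4 but the detour costs only 1 + 1,
# so the triangle inequality fails and the tree's pruning rule breaks.
print(squared(p, r) <= squared(p, q) + squared(q, r))         # False
```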

Here is the output of the program when you search for cities by latitude and longitude.

Create took 15484122
Search took 36
ca,waterloo,Waterloo,08,43.4666667,-80.5333333
 0.0141501
ca,kitchener,Kitchener,08,43.45,-80.5
 0.025264
ca,bridgeport,Bridgeport,08,43.4833333,-80.4833333
 0.0396333
ca,elmira,Elmira,08,43.6,-80.55
 0.137071
ca,baden,Baden,08,43.4,-80.6666667
 0.161756
ca,floradale,Floradale,08,43.6166667,-80.5833333
 0.163351
ca,preston,Preston,08,43.4,-80.35
 0.181762
ca,ayr,Ayr,08,43.2833333,-80.45
 0.195739
---
Linear search took 143212
ca,waterloo,Waterloo,08,43.4666667,-80.5333333
 0.0141501
ca,kitchener,Kitchener,08,43.45,-80.5
 0.025264
ca,bridgeport,Bridgeport,08,43.4833333,-80.4833333
 0.0396333
ca,elmira,Elmira,08,43.6,-80.55
 0.137071
ca,baden,Baden,08,43.4,-80.6666667
 0.161756
ca,floradale,Floradale,08,43.6166667,-80.5833333
 0.163351
ca,preston,Preston,08,43.4,-80.35
 0.181762
ca,ayr,Ayr,08,43.2833333,-80.45
 0.195739

Construction

I'm too lazy to implement a delete or insert function. It is most efficient to simply build the tree by repeatedly partitioning the data. We build the tree from the top down from an array of items. For each node, we first choose a point at random, and then partition the list into two sets: The left children contain the points that are closer than the median distance, and the right children contain the points that are farther away. Then we recursively repeat this until we have run out of points.

// A VP-Tree implementation, by Steve Hanov. (steve.hanov@gmail.com)
// Released to the Public Domain
// Based on "Data Structures and Algorithms for Nearest Neighbor Search" by Peter N. Yianilos
#include <stdlib.h>
#include <algorithm>
#include <vector>
#include <stdio.h>
#include <queue>
#include <limits>

template<typename T, double (*distance)( const T&, const T& )>
class VpTree
{
public:
    VpTree() : _root(0) {}

    ~VpTree() {
        delete _root;
    }

    void create( const std::vector<T>& items ) {
        delete _root;
        _items = items;
        _root = buildFromPoints(0, items.size());
    }

    void search( const T& target, int k, std::vector<T>* results,
        std::vector<double>* distances)
    {
        std::priority_queue<HeapItem> heap;

        _tau = std::numeric_limits<double>::max();
        search( _root, target, k, heap );

        results->clear(); distances->clear();

        while( !heap.empty() ) {
            results->push_back( _items[heap.top().index] );
            distances->push_back( heap.top().dist );
            heap.pop();
        }

        std::reverse( results->begin(), results->end() );
        std::reverse( distances->begin(), distances->end() );
    }

private:
    std::vector<T> _items;
    double _tau;

    struct Node 
    {
        int index;
        double threshold;
        Node* left;
        Node* right;

        Node() :
            index(0), threshold(0.), left(0), right(0) {}

        ~Node() {
            delete left;
            delete right;
        }
    }* _root;

    struct HeapItem {
        HeapItem( int index, double dist) :
            index(index), dist(dist) {}
        int index;
        double dist;
        bool operator<( const HeapItem& o ) const {
            return dist < o.dist;   
        }
    };

    struct DistanceComparator
    {
        const T& item;
        DistanceComparator( const T& item ) : item(item) {}
        bool operator()(const T& a, const T& b) {
            return distance( item, a ) < distance( item, b );
        }
    };

    Node* buildFromPoints( int lower, int upper )
    {
        if ( upper == lower ) {
            return NULL;
        }

        Node* node = new Node();
        node->index = lower;

        if ( upper - lower > 1 ) {

            // choose an arbitrary point and move it to the start
            int i = (int)((double)rand() / RAND_MAX * (upper - lower - 1) ) + lower;
            std::swap( _items[lower], _items[i] );

            int median = ( upper + lower ) / 2;

            // partition around the median distance
            std::nth_element( 
                _items.begin() + lower + 1, 
                _items.begin() + median,
                _items.begin() + upper,
                DistanceComparator( _items[lower] ));

            // what was the median?
            node->threshold = distance( _items[lower], _items[median] );

            node->index = lower;
            node->left = buildFromPoints( lower + 1, median );
            node->right = buildFromPoints( median, upper );
        }

        return node;
    }

    void search( Node* node, const T& target, int k,
                 std::priority_queue<HeapItem>& heap )
    {
        if ( node == NULL ) return;

        double dist = distance( _items[node->index], target );
        //printf("dist=%g tau=%g\n", dist, _tau );

        if ( dist < _tau ) {
            if ( heap.size() == k ) heap.pop();
            heap.push( HeapItem(node->index, dist) );
            if ( heap.size() == k ) _tau = heap.top().dist;
        }

        if ( node->left == NULL && node->right == NULL ) {
            return;
        }

        if ( dist < node->threshold ) {
            if ( dist - _tau <= node->threshold ) {
                search( node->left, target, k, heap );
            }

            if ( dist + _tau >= node->threshold ) {
                search( node->right, target, k, heap );
            }

        } else {
            if ( dist + _tau >= node->threshold ) {
                search( node->right, target, k, heap );
            }

            if ( dist - _tau <= node->threshold ) {
                search( node->left, target, k, heap );
            }
        }
    }
};

Why you should go to the Business of Software Conference Next Year

2011-10-29T00:19:36-05:00

Most people, having already paid $2000.00 of their hard earned money, and then having flown, driven, or otherwise travelled to Boston to attend a conference, and then having paid an additional $250/night plus $33/night parking and "tourism taxes" to the Seaport Hotel -- most people, after all this, are unlikely to say that it was a waste of time and they should have stayed home watching the remaining salvaged episodes of Doctor Who on Netflix.

In fact, I found it quite useful.

The talks by Clayton Christensen (author of The Innovator's Dilemma), Rory Sutherland (expert on behavioural economics) and the dozens of entrepreneurs (both serial and parallel) were all very fascinating and useful, and they were all broadcast for free, and they will soon be up for streaming, for free.

So why go through all of this effort to physically go to the conference?

One of the conference rooms at Business of Software 2011.

What the World Trade Center in Boston lacks in number of bathrooms, it more than makes up for in hallways. It has roughly 1000 miles of hallways in which you can bump into successful business people. And every one of them is trying to meet you and get your take on important, urgent business-related matters like, "Have you seen an empty bathroom?"

Seriously, when not at the conference, and people ask what I do, I have learned to say something like "I do computers". People here understand when I talk about NoSQL databases, SaaS models, and programmer development tools. The amount of time until their eyes glaze over is well over the 60 second mark.

You also get some inside info. People aren't shy talking about their pricing. How much does the super-mega-ultra corporate option cost? The one where instead of a price, it says "Call"? These people will tell you, because they don't get to talk about it much, and they are honestly trying to help.

I talked to CEOs and CTOs of 3 to 30 person companies. I talked to VPs, Cloud Engineers, and Intrapreneurs of big companies. For many, this is the first opportunity to talk to an outsider about their businesses. It is like psychotherapy. Often they would come to a sudden realization. "Hey," a micro-ISV would say, "I just have a fear of releasing the next version because it's missing some difficult features. I should just do it anyway!" If you go to this conference, you probably already know what you should do to improve your business. But having Jason Cohen, or some seasoned CEO, tell you in person moves it up onto the todo list.

General trends

Disruption - Disruption is big. If you're not disruptive, you might as well be selling mainframes and typewriters. Companies are disrupting each other at an astounding rate. Sometimes, while one company is busy disrupting an industry, another one will sneak up behind it and try to disrupt it when it is not looking. That is why companies need to be agile and pivot frequently.
Metrics - The info-geeks have taken over. Founders are demanding dashboards for their business, updated in real time. But not only for themselves -- every click of the web site, and every cancellation is streamed to every employee to give an accurate picture of the health of the company. A special version containing only the "Customer Happiness Index" and a huge happy face is streamed to the investors.
Crowd-sourced employee recognition - At least three companies are working on this. It can be hard for bosses to identify their best contributors to allocate bonuses. The idea is to crowd-source this from their workforce. "So we'll give them a button -- so whenever anybody does something nice, other people will just push it and they get a -- a pony point --- yeah! And then I just have to add them all up to find the best contributors!" If you've worked at a large company for more than a year, you already know what an awesome idea this is. Just rename "pony" to "stab" and invert the score.
Skype - Ask anybody, in tiny or large companies. Odds are that they bypass their Enterprise Collabosoft GrouperWare system and secretly use Skype to communicate. Just a minute while I go privately Skype to people about why Microsoft should acquire my startup.
Dishonesty - Jason Cohen gave a talk about how honesty in business can differentiate you. If you are a small company, he says, you should not try to hide it. Companies will be refreshed by your truthfulness, and it sets the correct expectations at the outset. Most of the attendees believe honesty is a great idea. Companies should all be honest! Because Jason Cohen says it pays! But if you are in a uniquely special business, such as storing data securely in the cloud, or selling software as a service, or selling licensed software, or you offer a limited or very diverse product line, or you have competition -- in these very special situations, honesty definitely will never work. At least, that's the going opinion.

I hope you're convinced of the value that Business of Software has to offer, and I hope to see you there next year. I should be finished Doctor Who by then.

Four ways of handling asynchronous operations in node.js

2011-09-30T21:30:00-05:00

Javascript was not designed to do asynchronous operations easily. If it were, then writing asynchronous code would be as easy as writing blocking code. Instead, developers in node.js need to manage many levels of callbacks.

Today, we will examine four different methods of performing the same task asynchronously, in node.js. We will read in all the files in the current folder:

In parallel, using callback functions
Sequentially, using callback functions
In parallel, using promises
Sequentially, using promises

This will help you decide which to use for your particular situation. It is simply a matter of taste. If you want to know which is the best method, the absolute best way to go is probably to switch to Golang.

Reading the files in parallel using callbacks

Our toy problem is to read all of the files in the current folder.

With our toy problem, we can't do everything in parallel. To read the files in the folder, we first need to know which files are in the folder. Thus we start with the readdir() function. We wait for the operation to complete. Then, for each file, we use readFile() to get the contents of the file.

Here is the documentation for the functions, from node.js.

fs.readdir(path, [callback])
Asynchronous readdir(3). Reads the contents of a directory. The callback gets two arguments (err, files) where files is an array of the names of the files in the directory excluding '.' and '..'.

fs.readFile(filename, [encoding], [callback])
Asynchronously reads the entire contents of a file.
The callback is passed two arguments (err, data), where data is the contents of the file.
If no encoding is specified, then the raw buffer is returned.

All of the readFile() operations happen at once, and then we wait for the results to come in. We simply count how many times a readFile() operation has completed. When all of the files have been read, we know we are done.

    // Read all files in the folder in parallel.
    var fs = require("fs");

    fs.readdir( ".", function( err, files) {
        if ( err ) {
            console.log("Error reading files: ", err);
        } else {
            // keep track of how many we have to go.
            var remaining = files.length;
            var totalBytes = 0;

            if ( remaining == 0 ) {
                console.log("Done reading files. totalBytes: " +
                    totalBytes);
            }

            // for each file,
            for ( var i = 0; i < files.length; i++ ) {
                // read its contents.
                fs.readFile( files[i], function( error, data ) {
                    if ( error ) {
                        console.log("Error: ", error);
                    } else {
                        totalBytes += data.length;
                        console.log("Successfully read a file.");
                    }
                    remaining -= 1;
                    if ( remaining == 0 ) {
                        console.log("Done reading files. totalBytes: " +
                            totalBytes);
                    }
                });
            }
        }
    });

Reading the files sequentially using callbacks

It is usually most efficient to do the above. Order the computer to do everything at once, and let the operating system sort it out. But that's not always what you want. Sometimes you need to impose an order and do things sequentially.

Here is an example of reading each file one at a time. The for loop is gone. It is replaced by a recursive function. The function checks to see if it has reached the last file. If so, it is done. Otherwise, it calls itself to process the next file in the list.

    // Read all the files in the folder in sequence, using callbacks
    var fs = require("fs");

    fs.readdir( ".", function( error, files ) {
        if ( error ) {
            console.log("Error listing file contents.");
        } else {
            var totalBytes = 0;

            // This function repeatedly calls itself until the files are all read.
            var readFiles = function(index) {
                if ( index == files.length ) {
                    // we are done.
                    console.log( "Done reading files. totalBytes = " + 
                        totalBytes );
                } else {

                    fs.readFile( files[index], function( error, data ) {
                        if ( error ) {
                            console.log( "Error reading file. ", error );
                        } else {
                            totalBytes += data.length;
                            readFiles(index + 1);
                        }
                    });
                }

            };

            readFiles(0);
        }
    });

Reading the files in parallel using Promises

A promise (also known as a future, and sometimes a channel) is a concept from the 1970s that has recently become popular for Javascript programming. Promises are implemented in the node.js module promise. This module is not included unless you add it.

When you call a function and expect a return value, and the value is not yet available, the function instead returns a promise. The caller can then store the promise for later or schedule a subsequent operation when it completes. Promises can be seen as a more specialized form of node.js's EventEmitter, where the only two events are reject or resolve. Instead of using "on" to listen for the events, we use "then".

Here is a really simple example of promises being used.

var promise = doSomeAsynchronousOperation();
promise.then( function(result) {
    // yay! I got the result.
}, function(error) {
    // The promise was rejected with this error.
});

function doSomeAsynchronousOperation()
{
   var promise = new Promise.Promise();
   fs.readFile( "somefile.txt", function( error, data ) {
        if ( error ) {
            promise.reject( error );
        } else {
            promise.resolve( data );
        }
    });

    return promise;
}

Promises may be easier to deal with for some people, because functions that return promises are harder to misuse. You could forget whether a callback belongs in the 3rd or 4th parameter, but you can't make that mistake with a return value. Another advantage is that they encapsulate the recursive function loop above. You can easily construct a super-promise from an array of promises, which will resolve only when each of its members resolves. That's what we do in the code below, with Promise.all().

    var fs = require("fs");
    var Promise = require("promise");

    // Wrap the io functions with ones that return promises.
    var readdir_promise = Promise.convertNodeAsyncFunction(fs.readdir);
    var readFile_promise = Promise.convertNodeAsyncFunction( fs.readFile );

    var p = readdir_promise( "." );
    p.then( function( files ) {

        // Create an array of promises
        var promises = [];

        for ( var i = 0; i < files.length; i++ ) {
            promises.push( readFile_promise( files[i] ) );
        }

        Promise.all( promises ).then( function(results) {
            var totalBytes = 0;
            for ( i = 0; i < results.length; i++ ) {
                totalBytes += results[i].length;
            }
            console.log("Done reading files. totalBytes = " + totalBytes);
        }, function( error ) {
            console.log("Error reading files");
        });

    }, function( error ) {
        console.log( "readdir failed.");

    });

Reading the files sequentially using Promises

By default, promises don't support sequential operations very well. But we can build on them using a PromiseSequence, which adds the ability to define a series of steps and loops, which are performed sequentially.

Here is the program again, reading a file. Instead of indenting the code by many levels, we are able to write it in more of a sequential style. Also, the above examples had two places where errors had to be handled. With a promise sequence, the errors for any operation in the sequence are handled in one place.

We first add the readdir() operation to the sequence, and then add a loop() to the sequence. The loop executes repeatedly until exitLoop() is called. Since there are no further steps, the argument to exitLoop resolves the promise and the program ends.

    // Read all the files in the folder in a sequence, using Promises
    var fs = require("fs");
    var Promise = require("promise");
    var PromiseSequence = require("./PromiseSequence").PromiseSequence;

    // Wrap the io functions with ones that return promises.
    var readdir_promise = Promise.convertNodeAsyncFunction(fs.readdir);
    var readFile_promise = Promise.convertNodeAsyncFunction( fs.readFile );

    var seq = new PromiseSequence();
    var index = 0;
    var totalBytes = 0;
    var files = null;

    seq.add( function() {
        return readdir_promise( "." );
    });

    seq.loop( 
        // The "next" function of the loop takes the result of the readdir and
        // reads the file. It is executed when the loop is entered, and again after
        // each time the body is executed.
        function( files_arg ) {
            files = files_arg;
            if ( index == files.length ) {
                seq.exitLoop(totalBytes);
                return;
            } else {
                console.log("Reading file " + files[index]);
                return readFile_promise( files[index++] );
            }
        },

        // The "body" function of the loop is called with the result of the "next" function.
        // It simply sums the length of the file.
        function( contents ) {
            totalBytes += contents.length;
        }
    );

    seq.run().then( function(total) {
        console.log("Done reading file. Total bytes: " + total);
    }, function(error) {
        console.log("Error reading files: ", error);
    });

Why does my brain hurt?

I have been programming for over two decades. I know lots of languages, but only Javascript makes my brain hurt when I have to do simple things. If you have a lot of asynchronous operations to perform, and you have the choice, please consider your language selection very carefully.

Zero load time file formats

2011-06-29T17:27:08-05:00

Sometimes you cannot afford to load data files from disk. Maybe you need results immediately, or the data is simply too large to fit into memory. A technique that I like to use is an on-disk data structure. Here is a toy example for instantly accessing lists of related words.

In this article, I address the problem of the time needed to load data into memory from disk. However, I do not make any optimization for disk caches or blocks. I am not going to talk about B-Trees or cache-oblivious structures.

No waiting

Using an on-disk data structure, there is no need to load the whole file into memory or parse it. Instead of opening a file and reading its contents, we will use a memory mapped file. We tell the operating system the file name, and it will lazily load the parts of the file only when we access them. These parts remain in the disk cache even after our program exits. So if you later start the program again, it will execute similar queries more quickly. We let the operating system do the caching for us. In Python, this is done using the mmap module, which makes the file appear as a very long string.
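
To make the idea concrete, here is a minimal sketch in Python 3 (the article's own listing further down is Python 2). The file name and contents are made up for illustration. It writes a tiny binary file, maps it, and reads from it with ordinary slicing.

```python
import mmap
import os
import struct
import tempfile

# Write a small binary file: a 4-byte little-endian count followed by
# null-terminated words. The name "example.dat" is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "example.dat")
with open(path, "wb") as f:
    f.write(struct.pack("<I", 3))
    f.write(b"cat\x00dog\x00emu\x00")

with open(path, "rb") as f:
    # Length 0 maps the whole file. Nothing is read until we touch it.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        count = struct.unpack("<I", m[0:4])[0]  # slice it like a bytes object
        end = m.find(b"\x00", 4)                # search within the mapping
        first_word = m[4:end].decode("ascii")

print(count, first_word)
```

In Python 3 the mapping behaves like a bytes object rather than a string, so indexing a single position gives an integer.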

Toy example

Here is an on-disk structure for looking up related words that I prepared. (Download 11 MB of it). It has three sections: A header, an index, and a word section. The header contains the number of words. The index contains a list of pointers to word records. The word section contains the word records. It is constructed so that we can instantly query for words related to a given word by jumping around to different parts of the file.

    --- header
    4 bytes: number of words

    --- index section. The words are listed in alphabetical order, so you can
    --- look one up using binary search.
    for each word:
        4 byte ptr to word record

    --- word section:
    for each word:
       null terminated text
       4 bytes: number of related words
       for each link,
           ptr to linked word record

Here is a short python program for accessing the data file.

#!/usr/bin/python
# An on-disk data structure for finding related words
# By Steve Hanov. This code and data file are released to the public domain.

import sys, mmap, struct

class FrozenThesaurus:
    def __init__(self, filename):
        self.f = open(filename, "rb")
        self.mmap = mmap.mmap( self.f.fileno(), 0, access=mmap.ACCESS_READ )

    def getDword( self, ptr ):
        # return the 32 bit number beginning at the given byte offset in the
        # file.
        return struct.unpack("<I", self.mmap[ptr:ptr+4])[0]

    def getString( self, ptr ):
        # return the null terminated string beginning at the given byte offset.
        result = []
        while self.mmap[ptr] != "\x00":
            result.append(self.mmap[ptr])
            ptr += 1
        return "".join(result)

    def getWordCount(self):
        # Retrieve the number of words in the file.
        return self.getDword(0)

    def getWord(self, index):
        # Retrieve a word, given its index. The index must be less than the word
        # count.
        return self.getString( self.getDword(4 + index * 4) )

    def getIndexOf( self, word ):        
        # perform a binary search through the index for the given word.
        high = self.getWordCount()
        low = -1

        while (high - low > 1):
            probe = (high + low) // 2

            candidate = self.getWord(probe)

            if candidate == word:
                return probe
            elif candidate < word:    
                low = probe
            else:
                high = probe

        return None

    def getRelatedWords( self, word ):
        # Returns the list of related words to the given word.
        results = []

        index = self.getIndexOf( word )
        if index is None: return results

        ptr = self.getDword( 4 + index * 4 )

        # skip past the word text
        while self.mmap[ptr] != '\x00': ptr += 1
        ptr += 1

        numRelated = self.getDword( ptr )
        for i in range(numRelated):
            ptr += 4
            results.append( self.getString( self.getDword( ptr ) ) )

        return results

data = FrozenThesaurus("thesaurus.dat")
print data.getRelatedWords(sys.argv[1])
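
The reader assumes the data file already exists. As a rough sketch of the other half, here is one way the file could be written in Python 3, following the layout described above. The function name build_thesaurus and the sample words are my own assumptions, not the code that produced the downloadable file.

```python
import os
import struct
import tempfile

def build_thesaurus(related, filename):
    """Write `related` (dict: word -> list of related words) using the
    header / index / word-record layout. Returns each word's record offset."""
    # Every word mentioned anywhere needs its own record.
    all_words = set(related)
    for links in related.values():
        all_words.update(links)
    # Sort by encoded bytes so a binary search over the raw file works.
    words = sorted(all_words, key=lambda w: w.encode("utf-8"))
    encoded = {w: w.encode("utf-8") for w in words}

    # First pass: work out where each record will start.
    offsets = {}
    pos = 4 + 4 * len(words)            # header + index
    for w in words:
        offsets[w] = pos
        pos += len(encoded[w]) + 1 + 4 + 4 * len(related.get(w, []))

    # Second pass: write header, index, then the records themselves.
    with open(filename, "wb") as out:
        out.write(struct.pack("<I", len(words)))
        for w in words:
            out.write(struct.pack("<I", offsets[w]))
        for w in words:
            links = related.get(w, [])
            out.write(encoded[w] + b"\x00")
            out.write(struct.pack("<I", len(links)))
            for link in links:
                out.write(struct.pack("<I", offsets[link]))
    return offsets

path = os.path.join(tempfile.mkdtemp(), "tiny.dat")
offsets = build_thesaurus({"big": ["large", "huge"], "large": ["big"]}, path)
with open(path, "rb") as f:
    data = f.read()
print(len(data), offsets)
```

Two passes are needed because a record can point forward to a word whose offset is not known until every earlier record has been sized.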

When shouldn't you use this?

SQLite also reads only the pages of the file that a query touches, and it can optionally use memory-mapped I/O. If you create the proper indices, SQLite will match the performance of any file format that you can come up with yourself, though the file may be larger. If you can store your data in a relational database, you should not go through the trouble of creating your own on-disk data structure. In particular, a thesaurus could easily be stored in an SQLite database.
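
For comparison, here is a minimal sketch of the same thesaurus lookup with Python's built-in sqlite3 module. The table, column and index names are assumptions made up for this example.

```python
import sqlite3

# ":memory:" keeps the example self-contained. Pass a file path instead
# to get an on-disk database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE related (word TEXT NOT NULL, related_word TEXT NOT NULL);
    CREATE INDEX related_by_word ON related (word);
""")
conn.executemany(
    "INSERT INTO related VALUES (?, ?)",
    [("big", "large"), ("big", "huge"), ("large", "big")],
)

def get_related_words(conn, word):
    # The index on `word` lets SQLite find the rows without a full scan.
    rows = conn.execute(
        "SELECT related_word FROM related WHERE word = ? ORDER BY related_word",
        (word,),
    )
    return [row[0] for row in rows]

print(get_related_words(conn, "big"))
```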

Finding the top K items in a list efficiently

2011-06-04T22:56:39-05:00

Algorithms will always matter. Sure, processor speeds are still increasing. But the problems that we want to solve using those processors are increasing in size faster. People who are dealing with social network graphs, or analyzing Twitter posts, or searching images, or solving any of the hundreds of problems in vogue would be wasting time without the fastest possible hardware. But they would be sitting around forever if they weren't using the right tools.

That's why I get sad when I see code like this:

# find the top 10 results
results = sorted(results, reverse=True)[:10]

Anything involving a sort will usually take O(nlogn) time, which, when dealing with lots of items, will keep you waiting around for several seconds or even minutes. An O(nlogn) algorithm, for large N, simply cannot be run in realtime when users are waiting.

The Heap

Finding the top K items can be done in O(nlogk) time, which is much, much faster than O(nlogn), using a heap (wikipedia). Or, since I usually end up rewriting everything in C++ eventually, a priority queue.

The strategy is to go through the list once, and as you go, keep a list of the top k elements that you found so far. To do this efficiently, you have to always know the smallest element in this top-k, so you can possibly replace it with one that is larger. The heap structure makes it easy to maintain this list without wasting any effort. It is like a lazy family member who always does the absolute minimum amount of work. It only does enough of the sort to find the smallest element, and that is why it is fast.

Here's some code to demonstrate the difference between a linear search, and a heap search to find the top K elements in a large array. The heap search is 4 times faster, despite the test being biased in favour of the linear search. The linear search ends up executing in compiled C inside python itself, while the heap search is completely in interpreted python. If they were both in C, the difference in performance would be more pronounced.

#!/usr/bin/python
import heapq
import random
import time

def createArray():
    array = range( 10 * 1000 * 1000 )
    random.shuffle( array )
    return array

def linearSearch( bigArray, k ):
    return sorted(bigArray, reverse=True)[:k]

def heapSearch( bigArray, k ):
    heap = []
    # Note: below is for illustration. It can be replaced by 
    # heapq.nlargest( k, bigArray )
    for item in bigArray:
        # If we have not yet found k items, or the current item is larger than
        # the smallest item on the heap,
        if len(heap) < k or item > heap[0]:
            # If the heap is full, remove the smallest element on the heap.
            if len(heap) == k: heapq.heappop( heap )
            # add the current element as the new smallest.
            heapq.heappush( heap, item )
    return heap

start = time.time()
bigArray = createArray()
print "Creating array took %g s" % (time.time() - start)

start = time.time()
print linearSearch( bigArray, 10 )    
print "Linear search took %g s" % (time.time() - start)

start = time.time()
print heapSearch( bigArray, 10 )    
print "Heap search took %g s" % (time.time() - start)

Creating array took 7.15145 s
[9999999, 9999998, 9999997, 9999996, 9999995, 9999994, 9999993, 9999992, 9999991, 9999990]
Linear search took 10.9981 s
[9999990, 9999992, 9999991, 9999994, 9999993, 9999998, 9999997, 9999996, 9999999, 9999995]
Heap search took 2.66371 s

Also, if you see code like this, you should go directly to the Wikipedia page on the selection algorithm:

# find the median
median = sorted(results)[len(results)/2]
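
As a sketch of what a selection algorithm looks like, here is a simple quickselect in Python 3. It is one of several selection algorithms, chosen here for brevity. It finds the k-th smallest item in expected linear time without sorting the whole list.

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (k = 0 is the minimum).

    Expected O(n) time. The input list is not modified."""
    items = list(items)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while True:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        larger = [x for x in items if x > pivot]
        equal_count = len(items) - len(smaller) - len(larger)
        if k < len(smaller):
            items = smaller                  # answer is among the smaller items
        elif k < len(smaller) + equal_count:
            return pivot                     # answer is the pivot itself
        else:
            k -= len(smaller) + equal_count  # skip past smaller and equal items
            items = larger

data = [5, 1, 9, 3, 7, 3, 8]
median = quickselect(data, len(data) // 2)
print(median)
```

Each pass throws away the part of the list that cannot contain the answer, which is why it does less work than a full sort.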

My Heap in Javascript: Github

An instant rhyming dictionary for any web site

2011-06-04T21:30:00-05:00

Many good web applications, and many bad ones, have an API, and my hobby project, RhymeBrain.com, is no exception. The trouble is: the target users of both the web site and the API don't know the difference between Javascript and Java. They don't even know what "A.P.I." stands for. The most they can do is edit some HTML and paste in some code, so that's what my API has to be.

I managed to get it down to that using a technique called JSONP. This technique avoids any problems with cross domain requests, and allows non-coders to use the API using a short, customizable HTML snippet.

Demonstration

Paste this in your web site:

<form action="javascript:RhymeBrainSubmit()">
  <input type=text id="RhymeBrainInput">
  <input type=submit value="Rhyme">
</form>
<script type="text/javascript">
  var RhymeBrainMaxResults = 50;
</script>
<script type="text/javascript" src="http://rhymebrain.com/external.js"></script>
Rhyme results are provided by <a href="http://rhymebrain.com">RhymeBrain.com</a>

... and you get this:

How it works

1. The API user pastes a form and reference to a script into their web page.

2. The script inserts some more DIV elements to contain the search results. The DIVs could have been included in the pasted code, but this way is more flexible in case I want to change it later on.

var resultDiv;
document.write("<div id='RhymeBrainResultDiv'></div><div style='clear:both'></div>");
resultDiv = document.getElementById("RhymeBrainResultDiv");

When the user clicks the Rhyme button, this function is called. It creates a special URL that points back to a program on my server. For example, this one: http://rhymebrain.com/talk?function=getRhymes&maxResults=10&word=javascript&jsonp=RhymeBrainResponse. It adds a brand new script element to the end of the document, and sets its source to point back to a program on the rhymebrain server.

function RhymeBrainSubmit()
{
    var input = document.getElementById("RhymeBrainInput");
    var word = input.value;

    $(resultDiv).empty();
    resultDiv.appendChild( img );

    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src = "http://rhymebrain.com/talk?function=getRhymes" +
        "&word=" + encodeURIComponent(word) +
        "&maxResults=" + RhymeBrainMaxResults + 
        "&jsonp=RhymeBrainResponse";

    document.body.appendChild(script);
}

3. The server does some super intensive processing involving C, mmap() and 100 MB files, and sends back a response. The response happens to be executable javascript.

RhymeBrainResponse([ {"word":"equipped", "freq":22,"score":"300","flags":"c","syllables":"2"},
 {"word":"manuscript", "freq":22,"score":"300","flags":"bc","syllables":"3"},
 {"word":"script", "freq":21,"score":"300","flags":"bc","syllables":"1"},
 {"word":"shipped", "freq":21,"score":"300","flags":"c","syllables":"1"},
 {"word":"slipped", "freq":21,"score":"300","flags":"c","syllables":"1"},
 {"word":"stripped", "freq":21,"score":"300","flags":"c","syllables":"1"},
 {"word":"dipped", "freq":20,"score":"300","flags":"c","syllables":"1"},
 {"word":"tipped", "freq":20,"score":"300","flags":"c","syllables":"1"},
 {"word":"whipped", "freq":20,"score":"300","flags":"c","syllables":"1"},
 {"word":"gripped", "freq":19,"score":"300","flags":"c","syllables":"1"}]);

4. The browser executes the javascript, which calls the function name that was passed as the jsonp parameter to the script URL. It calls a function which displays the results on the web page.

That's all there is to it. JSONP neatly sidesteps the problem of cross-domain requests, since script tags can be included from any domain, and it provides a way for non-coders to create mashups.
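
The server side of step 3 is not shown above. As a hedged sketch, this is roughly all a JSONP endpoint has to do, written here in Python 3. The function name jsonp_response is hypothetical, and the check on the callback name is my own addition, not something the article describes.

```python
import json
import re

# Assumption of this sketch: only plain identifiers are accepted as callback
# names, so a caller cannot inject arbitrary script through the parameter.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

def jsonp_response(callback, payload):
    """Wrap a JSON-serializable payload in a call to `callback`."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid JSONP callback name")
    return "%s(%s);" % (callback, json.dumps(payload))

body = jsonp_response("RhymeBrainResponse",
                      [{"word": "script", "syllables": "1"}])
print(body)
```

The response would be served with a JavaScript content type, since the browser runs it as a script rather than parsing it as data.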

Succinct Data Structures: Cramming 80,000 words into a Javascript file.

2011-03-20T22:47:59-05:00

Let's continue our short tour of data structures for storing words. Today, we will over-optimize John Resig's Word Game. Along the way, we shall learn about a little-known branch of computer science, called succinct data structures.

John wants to load a large dictionary of words into a web application, so his Javascript program can quickly check if a word is in the dictionary. He could transfer the words as a long string, separated by spaces. This doesn't take much space once it is gzip-compressed by the web server. However, we also have to consider the amount of memory used in the browser itself. In a mobile application, memory is at a premium. If the user switches tabs, everything not being used is swapped out to flash memory. This results in long pauses when switching back.

One of the best data structures for searching a dictionary is a trie. The speed of search does not depend on the number of words in the dictionary. It depends only on the number of letters in the word. For example, here is a trie containing the words "hat", "it", "is", and "a". The trie seems to compress the data, since words sharing the same beginnings only show up once.

We need to solve two problems. If we transmit the word list to the web browser, it then has to build the trie structure. This takes up a lot of time and memory. To save time, we could pre-encode the trie on the server in JSON format, which is parsed very quickly by the web browser. However, JSON is not a compact format, so some bandwidth is wasted downloading the data to the browser. We could avoid the wasted bandwidth by compressing the trie using a more compact format. The data is then smaller, but the web browser still has to decompress it to use it. In any case, the browser needs to create the trie in memory.

This leads us to the second major problem. Despite appearances, tries use a lot of memory to store all of those links between nodes.

Fortunately, there is a way to store these links in a tiny amount of space.

Succinct Data Structures

Succinct data structures were introduced in Guy Jacobson's 1989 thesis, which you cannot read because it is not available anywhere. Fortunately, this important work has been referenced by many other papers since then.

A succinct data structure encodes data very efficiently, so that it does not need to be decoded to be used. Everything is accessed in-place, by reading bits at various positions in the data. To achieve optimal encoding, we use bits instead of bytes. All of our structures are encoded as a series of 0's and 1's.

Two important functions for succinct structures are:

rank(x) - returns the number of bits set to 1, up to and including position x
select(y) - returns the position of the yth 1. This is the inverse of the rank function. For example, if select(8) = 10, then rank(10) = 8.

Corresponding functions exist to find the rank/select of 0's instead of 1's. The rank function can be implemented in O(1) time using a lookup table (called a "directory"), which summarizes the number of 1's in certain parts of the string. The select() function is implemented in O(logn) time by performing binary search on the rank() function. It is possible to implement select in constant time, but it is complicated and space-hungry.

p	0	1	2	3	4	5	6	7
Bit	1	1	0	0	0	0	0	1
rank(p)	1	2	2	2	2	2	2	3
select(p)		0	1	7
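
Here is a small Python 3 sketch of the two functions, checked against the table above. It stores the bits as a string of "0" and "1" characters and computes rank with a naive scan instead of a directory, so it shows the definitions rather than the O(1) technique.

```python
def rank(bits, x):
    """Number of 1 bits in positions 0..x inclusive (naive O(n) scan)."""
    return bits[:x + 1].count("1")

def select(bits, y):
    """Position of the y-th 1 bit (y counts from 1), found by binary
    search on rank(), as described above."""
    low, high = -1, len(bits) - 1
    if y < 1 or rank(bits, high) < y:
        raise ValueError("there is no y-th 1 bit")
    # Invariant: rank(low) < y <= rank(high)
    while high - low > 1:
        mid = (low + high) // 2
        if rank(bits, mid) >= y:
            high = mid
        else:
            low = mid
    return high

bits = "11000001"
print([rank(bits, p) for p in range(len(bits))])
print([select(bits, y) for y in (1, 2, 3)])
```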

A Succinct Trie

Here's a trie containing the words "hat", "is", "it", and "a".

First, we add a "super root". This is just an additional node above the root. It's there to make the math work out later.

We then process the nodes in level order -- that is, we go row by row and process the nodes left to right. We encode them to the bit string in that order.

In the picture below, I've labeled each node in level order for convenience. I've also placed each node's encoding above it. The encoding is a "1" for each child, plus a 0. So a node with 5 children would be "111110" and a node with no children is "0".

Now, we encode the nodes one after another. In the example, the bits would be 10111010110010000. I've separated them out in this table so you can see what's going on, but only the middle row is actually stored.

Position	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
Bit	1	0	1	1	1	0	1	0	1	1	0	0	1	0	0	0	0
Node			0				1		2			3	4		5	6	7

We then encode the data for each node after that. To get the data for a given node, just read it directly from that node's index in the data array.

hiaatst
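
The encoding steps above can be sketched in a few lines of Python 3. This is an illustration using nested dictionaries, not the code from Bits.js. Like the worked example, it leaves out any end-of-word marker. Children are visited in insertion order, which is why the words are listed as "hat", "it", "is", "a".

```python
from collections import deque

def build_trie(words):
    """Build a trie as nested dicts: letter -> child node."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
    return root

def encode_trie(root):
    """Return (bit string, node letters) in level order."""
    bits = ["10"]       # the super root has exactly one child: the real root
    letters = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        bits.append("1" * len(node) + "0")   # one 1 per child, then a 0
        for letter, child in node.items():
            letters.append(letter)           # data for the child node
            queue.append(child)
    return "".join(bits), "".join(letters)

bits, letters = encode_trie(build_trie(["hat", "it", "is", "a"]))
print(bits)
print(letters)
```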

Getting the data

The main thing that we want to do with a trie is follow links from each node to its children. Using our encoding, we can follow a link using a simple formula. If a node is numbered i, then the number of its first child is select₀(i + 1) - i. The second child is the one after that, and so forth. To obtain the number of children, look up the first child of the i+1th node and subtract, since they are stored consecutively.

For example: We want the first child of node 2. The 3rd 0 is at position 7. Seven minus two is five. Therefore the first child is numbered 5. Similarly the first child of node 3 is found to be 7 by this formula (no, it doesn't really exist, but it works for the calculation). So node 2 has 7 minus 5 equals 2 children.
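
The arithmetic in this example can be checked with a short Python 3 sketch. The helper names are my own, and select0 is a plain linear scan rather than the directory-based version. The function has_path walks the letters of a word from the root, so it only tells you that the letters form a path in the trie. A real dictionary would also store an end-of-word flag with each node.

```python
BITS = "10111010110010000"   # the example trie, super root included
DATA = "hiaatst"             # letter of node k is DATA[k - 1]; node 0 is the root

def select0(bits, y):
    """Position of the y-th 0 bit (y counts from 1)."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == "0":
            seen += 1
            if seen == y:
                return pos
    raise ValueError("fewer than y zero bits")

def first_child(bits, i):
    return select0(bits, i + 1) - i

def child_count(bits, i):
    # Children are numbered consecutively, so subtract neighbouring results.
    return first_child(bits, i + 1) - first_child(bits, i)

def has_path(bits, data, word):
    node = 0
    for letter in word:
        start = first_child(bits, node)
        for child in range(start, start + child_count(bits, node)):
            if data[child - 1] == letter:
                node = child
                break
        else:
            return False     # no child carries this letter
    return True

print(first_child(BITS, 2), child_count(BITS, 2))
print(has_path(BITS, DATA, "hat"), has_path(BITS, DATA, "hi"))
```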

Demo

Here is a demonstration, hosted on my faster server. (Source code: Bits.js) (It doesn't work in RSS readers -- go to my blog to see it.) Paste a list of words in the top text area (or click Load dictionary to load one). Click "Encode" to create the trie and encode it. This step can be very slow, because I did not optimize the encoding process. Once encoding is complete, you can use the Lookup button to check if words are in the dictionary.

Using this encoding method, a 611K dictionary containing 80000 words is compressed to 216K, or 132K gzipped. The browser does not need to decode it to use it. The whole trie takes as much space as a 216K string.

Details

The directory contains the information needed to compute the rank and select functions quickly. The trie is the bitstring representing the trie and the connections between all of its nodes.

To avoid problems with UTF encoding formats and escaped characters, the bit strings are encoded in BASE-64. All of the bit decoding functions are configured to operate on BASE-64 encoded units, so that the input string does not need to be decoded before being used.

We only handle the letters "a" to "z" in lower case. That way, we can encode each letter in 5 bits.

You can trade speed for space by increasing the L2 constant and setting L1 = L2*L2. These constants control the number of bits summarized in each section of the rank directory. L2 is the maximum number of bits that have to be scanned to implement rank(). More bits per section means fewer directory entries, but the select() and rank() functions will take longer to scan the range of bits.

Caveats

I described how to create an MA-FSA in a previous article. There is no known way to succinctly encode one. You must store one pointer for each edge. However, as the number of words increases, an MA-FSA (also known as a DAWG) may eventually become more compact than the trie. This is because a trie does not compress common word endings together.

Throw away the keys: Easy, Minimal Perfect Hashing

2011-03-09T18:00:00-05:00

CORRECTION: In this article, I incorrectly state that an acyclic finite state automaton (aka a DAWG) cannot be used to retrieve values associated with its keys. I have since learned that it can. By storing in each internal node the number of leaf nodes that are reachable from it, we can, upon arriving at the destination key, know its unique index number. Then its value can be looked up in an array. I am now using this technique on rhymebrain.com.

This technique is described in Kowaltowski, T.; CL. Lucchesi (1993), "Applications of finite automata representing large vocabularies", Software-Practice and Experience 1993

Here's my implementation using an MA-FSA as a map: Github
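
As a rough illustration of the counting idea in the correction above, here is a Python 3 sketch. To keep it short it uses a plain trie, and it counts the words reachable from each node rather than leaf nodes; both are my own simplifications, not the linked implementation. The same walk works on an MA-FSA, because a node's count depends only on what lies below it.

```python
class Node:
    def __init__(self):
        self.edges = {}      # letter -> Node
        self.final = False   # True if a word ends at this node
        self.count = 0       # number of words reachable from this node

def fill_counts(node):
    node.count = (1 if node.final else 0) + sum(
        fill_counts(child) for child in node.edges.values())
    return node.count

def build(words):
    root = Node()
    for word in words:
        node = root
        for letter in word:
            node = node.edges.setdefault(letter, Node())
        node.final = True
    fill_counts(root)
    return root

def index_of(root, word):
    """Return the word's position in sorted order, or None if it is absent."""
    index = 0
    node = root
    for letter in word:
        if node.final:
            index += 1       # a shorter word ends here and sorts first
        for other in sorted(node.edges):
            if other >= letter:
                break
            index += node.edges[other].count   # skip whole subtrees
        node = node.edges.get(letter)
        if node is None:
            return None
    return index if node.final else None

words = ["hat", "it", "is", "a", "an", "hats"]
values = sorted(words)       # any array laid out in the same sorted order
root = build(words)
print([(w, index_of(root, w)) for w in values])
```

The index is unique per word, so it can be used to look up a value in an ordinary array.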

In part 1 of this series, I described how to find the closest match in a dictionary of words using a Trie. Such searches are useful because users often mistype queries. But tries can take a lot of memory -- so much that they may not even fit in the 2 to 4 GB limit imposed by 32-bit operating systems.

In part 2, I described how to build a MA-FSA (also known as a DAWG). The MA-FSA greatly reduces the number of nodes needed to store the same information as a trie. They are quick to build, and you can safely substitute an MA-FSA for a trie in the fuzzy search algorithm.

There is a problem. Since the last node in a word is shared with other words, it is not possible to store data in it. We can use the MA-FSA to check if a word (or a close match) is in the dictionary, but we cannot look up any other information about the word!

If we need extra information about the words, we can use an additional data structure along with the MA-FSA. We can store it in a hash table. Here's an example of a hash table that uses separate chaining. To look up a word, we run it through a hash function, H() which returns a number. We then look at all the items in that "bucket" to find the data. Since there should be only a small number of words in each bucket, the search is very fast.

Notice that the table needs to store the keys (the words that we want to look up) as well as the data associated with them. It needs them to resolve collisions -- when two words hash to the same bucket. Sometimes these keys take up too much storage space. For example, you might be storing information about all of the URLs on the entire Internet, or parts of the human genome. In our case, we already store the words in the MA-FSA, and it is redundant to duplicate them in the hash table as well. If we could guarantee that there were no collisions, we could throw away the keys of the hash table.

Minimal perfect hashing

Perfect hashing is a technique for building a hash table with no collisions. It is only possible to build one when we know all of the keys in advance. Minimal perfect hashing implies that the resulting table contains one entry for each key, and no empty slots.

We use two levels of hash functions. The first one, H(key), gets a position in an intermediate array, G. The second function, F(d, key), uses the extra information from G to find the unique position for the key. The scheme will always return a value, so it works as long as we know for sure that what we are searching for is in the table. Otherwise, it will return bad information. Fortunately, our MA-FSA can tell us whether a value is in the table. If we did not have this information, then we could also store the keys with the values in the value table.

In the example below, the words "blue" and "cat" both hash to the same position using the H() function. However, the second level hash, F, combined with the d-value, puts them into different slots.

How do we find the intermediate table, G? By trial and error. But don't worry, if we do it carefully, according to this paper, it only takes linear time.

In step 1, we place the keys into buckets according to the first hash function, H.

In step 2, we process the buckets largest first and try to place all the keys each one contains into empty slots of the value table using F(d=1, key). If that is unsuccessful, we keep trying with successively larger values of d. It sounds like it would take a long time, but in reality it doesn't. Since we try to find the d value for the buckets with the most items early, they are likely to find empty spots. When we get to buckets with just one item, we can simply place them into the next unoccupied spot.

Here's some python code to demonstrate the technique. In it, we use H = F(0, key) to simplify things.

#!/usr/bin/python
# Easy Perfect Minimal Hashing 
# By Steve Hanov. Released to the public domain.
#
# Based on:
# Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud, 
# "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121
# also a good reference:
# Compress, Hash, and Displace algorithm by Djamal Belazzougui,
# Fabiano C. Botelho, and Martin Dietzfelbinger
import sys

DICTIONARY = "/usr/share/dict/words"
TEST_WORDS = sys.argv[1:]
if len(TEST_WORDS) == 0:
    TEST_WORDS = ['hello', 'goodbye', 'dog', 'cat']

# Calculates a distinct hash function for a given string. Each value of the
# integer d results in a different hash value.
def hash( d, str ):
    if d == 0: d = 0x01000193

    # Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ 
    for c in str:
        d = ( (d * 0x01000193) ^ ord(c) ) & 0xffffffff;

    return d

# Computes a minimal perfect hash table using the given python dictionary. It
# returns a tuple (G, V). G and V are both arrays. G contains the intermediate
# table of values needed to compute the index of the value in V. V contains the
# values of the dictionary.
def CreateMinimalPerfectHash( dict ):
    size = len(dict)

    # Step 1: Place all of the keys into buckets
    buckets = [ [] for i in range(size) ]
    G = [0] * size
    values = [None] * size
    
    for key in dict.keys():
        buckets[hash(0, key) % size].append( key )

    # Step 2: Sort the buckets and process the ones with the most items first.
    buckets.sort( key=len, reverse=True )        
    for b in xrange( size ):
        bucket = buckets[b]
        if len(bucket) <= 1: break
        
        d = 1
        item = 0
        slots = []

        # Repeatedly try different values of d until we find a hash function
        # that places all items in the bucket into free slots
        while item < len(bucket):
            slot = hash( d, bucket[item] ) % size
            if values[slot] != None or slot in slots:
                d += 1
                item = 0
                slots = []
            else:
                slots.append( slot )
                item += 1

        G[hash(0, bucket[0]) % size] = d
        for i in range(len(bucket)):
            values[slots[i]] = dict[bucket[i]]

        if ( b % 1000 ) == 0:
            print "bucket %d    \r" % (b),
            sys.stdout.flush()

    # Only buckets with 1 item remain. Process them more quickly by directly
    # placing them into a free slot. Use a negative value of d to indicate
    # this.
    freelist = []
    for i in xrange(size): 
        if values[i] == None: freelist.append( i )

    for b in xrange( b, size ):
        bucket = buckets[b]
        if len(bucket) == 0: break
        slot = freelist.pop()
        # We subtract one to ensure it's negative even if the zeroeth slot was
        # used.
        G[hash(0, bucket[0]) % size] = -slot-1 
        values[slot] = dict[bucket[0]]
        if ( b % 1000 ) == 0:
            print "bucket %d    \r" % (b),
            sys.stdout.flush()

    return (G, values)        

# Look up a value in the hash table, defined by G and V.
def PerfectHashLookup( G, V, key ):
    d = G[hash(0,key) % len(G)]
    if d < 0: return V[-d-1]
    return V[hash(d, key) % len(V)]

print "Reading words"
dict = {}
line = 1
for key in open(DICTIONARY, "rt").readlines():
    dict[key.strip()] = line
    line += 1

print "Creating perfect hash"
(G, V) = CreateMinimalPerfectHash( dict )

for word in TEST_WORDS:
    line = PerfectHashLookup( G, V, word )
    print "Word %s occurs on line %d" % (word, line)
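
The listing above is Python 2 and relies on the caller never asking for a missing key. Here is a condensed Python 3 sketch of the same two-step construction that takes the other route mentioned earlier: it stores each key next to its value, so a lookup can detect a key that is not in the table. The function names and the sample words are my own. It assumes a non-empty dictionary.

```python
def fnv_hash(d, key):
    # Same FNV-style hash as the listing above: each d gives a different hash.
    if d == 0:
        d = 0x01000193
    for ch in key:
        d = ((d * 0x01000193) ^ ord(ch)) & 0xFFFFFFFF
    return d

def create_minimal_perfect_hash(mapping):
    size = len(mapping)
    buckets = [[] for _ in range(size)]
    G = [0] * size
    V = [None] * size
    for key in mapping:                          # step 1: fill the buckets
        buckets[fnv_hash(0, key) % size].append(key)
    buckets.sort(key=len, reverse=True)          # step 2: largest first

    b = 0
    while b < size and len(buckets[b]) > 1:
        bucket = buckets[b]
        d = 1
        while True:                              # trial and error on d
            slots = [fnv_hash(d, key) % size for key in bucket]
            if len(set(slots)) == len(slots) and all(V[s] is None for s in slots):
                break
            d += 1
        G[fnv_hash(0, bucket[0]) % size] = d
        for slot, key in zip(slots, bucket):
            V[slot] = (key, mapping[key])
        b += 1

    free = [i for i in range(size) if V[i] is None]
    while b < size and buckets[b]:               # single-item buckets
        key = buckets[b][0]
        slot = free.pop()
        G[fnv_hash(0, key) % size] = -slot - 1   # negative means "direct slot"
        V[slot] = (key, mapping[key])
        b += 1
    return G, V

def lookup(G, V, key, default=None):
    d = G[fnv_hash(0, key) % len(G)]
    slot = -d - 1 if d < 0 else fnv_hash(d, key) % len(V)
    stored_key, value = V[slot]
    return value if stored_key == key else default   # reject missing keys

fruit = ["apple", "banana", "cherry", "date", "elderberry", "fig",
         "grape", "honeydew", "kiwi", "lemon", "mango"]
mapping = {word: line for line, word in enumerate(fruit, start=1)}
G, V = create_minimal_perfect_hash(mapping)
print(lookup(G, V, "cherry"), lookup(G, V, "zucchini"))
```

Storing the keys costs the space that the keyless table was meant to save, so this variant only makes sense when nothing else can confirm membership.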

Experimental Results

I prepared separate lists of randomly selected words to test whether the runtime is really linear as claimed.

Number of items	Time (s)
100000	2.24
200000	4.48
300000	6.68
400000	9.27
500000	11.71
600000	13.81
700000	16.72
800000	18.78
900000	21.12
1000000	24.816

Here's a pretty chart.

CMPH

CMPH is an LGPL library that contains a really fast implementation of several perfect hash function algorithms. In addition, it compresses the G array so that it can still be used without decompressing it. It was created by the authors of one of the papers that I cited. Botelho's thesis is a great introduction to perfect hashing theory and algorithms.

gperf

Gperf is another open source solution. However, it is designed to work with small sets of keys, not large dictionaries of millions of words. It expects you to embed the resulting tables directly in your C source files.

Dr. Daoud's Page

Dr. Daoud contacted me below. Minimal Perfect Hashing Resources contains an even better implementation of the algorithm with a more compact representation. He originally invented the algorithm in 1987 and I am honoured that he adapted my python algorithm to make it even better!

Why don't web browsers do this?

2011-02-06T17:10:14-05:00

In the 80's, computers started instantly. They were READY to go when they first turned on.

Over the next few decades, people wanted to do more things and operating systems got slower to initialize. To solve this, OS and hardware manufacturers created hibernate and standby modes.

Now, many people have stopped using native applications and moved to the web. When I load Facebook or Gmail, it takes dozens of seconds to start up, and minutes over a slower connection. During this time:

The source files for the application are loaded from the server.
The source code is compiled and run.
Requests are made to retrieve the application state from the server.
The DOM is manipulated to present the state to the user.

It would be trivial to snapshot the DOM and application state in Javascript and provide access to these snapshots with a simple API. The API would also allow you to discard an application version that is too old, or convert the state to the newer one. Then, application startup would be instantaneous.

Or, without any co-operation from standards, browsers can do this RIGHT NOW and snapshot commonly used pages instead of discarding them when users close a tab. When the url is re-entered, from the application perspective it is just as if the machine went into standby and then resumed. The browser could take cookie expiration into account, or to be totally safe, web pages could opt in with a meta tag.

Just sayin'.

Fun with Colour Difference

2011-02-04T10:46:55-05:00

Are you looking for a nifty way to choose colours that stand out? Are you the type of person who is not satisfied until you have mathematically proven that your choice is optimal?

One way to do it is to treat red, green, and blue colour values as coordinates in a cube. Two colours are different if the distance between their coordinates is large. But the RGB colour space is not perceptually uniform. Because of the way the human eye works, lots of greens look the same, but we can easily see the difference between subtle shades of yellow. That's why George Takei is hawking TVs. It's also why perceptually uniform colour spaces, such as LAB or LUV warp the cube, as if it were made of play-dough, and left out in the sun for a while. The result is that the differences between the coordinates almost correspond to the perceived difference between two colours for most people.
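To make the two notions of distance concrete, here is a rough Python 3 sketch. It is not the Colours.js code; the function names are made up for illustration, and it assumes sRGB input with a D65 white point, using the standard conversion formulas from the Wikipedia pages on sRGB and CIELAB.

```python
import math

def _srgb_to_linear(c):
    # Undo the sRGB gamma curve for one 0..255 channel.
    c = c / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def _f(t):
    # The piecewise cube-root function used by CIELAB.
    delta = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

def rgb_to_lab(rgb):
    # sRGB -> linear RGB -> XYZ (D65) -> L*a*b*
    r, g, b = (_srgb_to_linear(c) for c in rgb)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    fx, fy, fz = _f(x / 0.95047), _f(y / 1.0), _f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def distance(p, q):
    # Plain Euclidean distance between two coordinate triples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rgb_distance(c1, c2):
    return distance(c1, c2)

def lab_distance(c1, c2):
    return distance(rgb_to_lab(c1), rgb_to_lab(c2))
```

Calling rgb_distance and lab_distance on the same pair of colours shows how differently the two spaces rank them.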

The colour-space calculations are all on wikipedia, and they are dead simple to implement. For fun, I put them into a simple force system using Javascript (You'll need an HTML5 browser to view. If you're using an RSS reader, you'll have to go to my blog to see it.)

Explanation

Below are all of the CSS colours which have names commonly recognized by most browsers. Every colour name from "AliceBlue" to "Gainsboro" to "YellowGreen" is there. The circles float freely, and are repelled by each other and the four sides of their container.

When you click on a colour, the background changes to that colour. All of the circles are then attracted to a vertical position based on how different they are from the background. Those near the top are close to the background colour. Those near the bottom are further away from the background. You can change the colour space in which the distance is calculated by clicking on the band at the top of the container.

For HSV, the H parameter is divided by 360 before the distance is calculated, to make its influence fair, since the S and V values range from 0 to 1.
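In Python 3 the same scaling comes for free, because the standard colorsys module already returns hue in the range 0 to 1 (that is, degrees divided by 360). A small sketch, with a made-up function name, assuming 0..255 RGB input:

```python
import colorsys
import math

def hsv_distance(rgb1, rgb2):
    # colorsys.rgb_to_hsv expects floats in 0..1 and returns (h, s, v),
    # each also in 0..1, so all three components carry equal weight.
    hsv1 = colorsys.rgb_to_hsv(*(c / 255.0 for c in rgb1))
    hsv2 = colorsys.rgb_to_hsv(*(c / 255.0 for c in rgb2))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(hsv1, hsv2)))
```

Note that this sketch treats hue as a straight line rather than a circle, so two reds on either side of the 0/360 boundary come out as far apart.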

Observations

In RGB, black is the most different from white. But in LAB, black is in the middle somewhere, and other dark colours are more distant.

To my eye, both RGB and LAB perform well for finding differences, but HSV results in some odd choices. To choose a contrasting colour, use RGB or LAB and avoid picking anything less than a third of the way down.

Source code

The source code is released to the public domain.

Colours.js - A simple library for converting between text, RGB, HSV, and LAB colours.
driver.js - Does the force calculations based on this wikipedia article, and handles mouse clicks.

You might be interested in a previous article about exploiting colour difference for edge detection.

Compressing dictionaries with a DAWG

2011-01-24T21:40:57-05:00

Last time, I wrote about how to speed up spell checking using a trie (also known as a prefix tree). However, for large dictionaries, a trie can waste a lot of memory. If you're trying to squeeze an application into a mobile device, every kilobyte counts. Consider this trie, with two words in it.

It can be shortened in a way so that any program accessing it would not even notice.

As early as 1988, Scrabble^TM programs were using structures like the above to shrink their dictionaries. Over the years, the structure has been called many things. Some web pages call it a DAWG (Directed Acyclic Word Graph). But computer scientists have adopted the name "Minimal Acyclic Finite State Automaton", because some papers were already using the name DAWG for something else.

The most obvious way to build a MA-FSA, as suggested in many other web pages, is to first build the trie, and look for duplicate branches. I tried this on a list of 7 million words that I had. I wrote the algorithm in C++, but no matter how hard I tried, I kept running out of memory. A trie (or prefix tree) uses a lot of memory compared to a DAWG. It would be much better if one could create the DAWG right away, without first creating a trie. Jan Daciuk describes such a method in his paper. The central idea is to check for duplicates after you insert each word, so that the structure never gets huge.

Ensure that words are inserted in alphabetical order. That way, when you insert a word, you will then know for sure whether the previous word ended an entire branch. For example, "cat" followed by "catnip" does not result in a branch, because the "nip" is just added to the end. But when you follow it with "cats" you know that the "nip" part of the previous word needs checking.
Each time you complete a branch in the trie, check it for duplicate nodes. When a duplicate is found, redirect all incoming edges to the existing one and eliminate the duplicate.

The paper that I am paraphrasing, by Jan Daciuk and others, also describes a way to insert words out of order. But it is more complicated. In most cases, you can arrange to add your words in alphabetical order.

What's a duplicate node?

Two nodes are considered the same if they are both the final part of a word, or they are both not the final part of a word. They also need to have exactly the same edges pointing to exactly the same other nodes.
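One way to express that test is to give each node a signature and keep a registry of the signatures seen so far. The Python 3 sketch below is only an illustration with hypothetical names (Node, signature, deduplicate). It assumes that a node's children have already been replaced by their canonical copies, so children can be compared by identity.

```python
class Node:
    def __init__(self, final=False):
        self.final = final   # True if a word ends at this node
        self.edges = {}      # letter -> child Node

def signature(node):
    # Two nodes are duplicates when they agree on "final" and have the
    # same letters leading to the very same child objects.
    return (node.final,
            tuple(sorted((letter, id(child))
                         for letter, child in node.edges.items())))

def deduplicate(node, registry):
    # Return the canonical node for this signature, registering the
    # node itself if nothing equivalent has been seen before.
    return registry.setdefault(signature(node), node)
```

Sorting the edges means the signature does not depend on the order in which letters were added.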

We start eliminating duplicates starting from the bottom of the branch, so each elimination can reveal more duplicates. Eventually, the branch of the trie zips together with a prior branch.

Step 1:

Several steps later:

Why go through so much trouble?

If you have a large word list, you could run it through gzip and get much better compression. The reason for storing a dictionary this way is to save space and remain easily searchable, without needing to decompress it first. Tries and MA-FSAs can support fuzzy search and prefix queries, so you can do spell checking and auto-completion. They can easily scale up to billions of entries. They have even been used to store large portions of the human genome. If you don't care about memory or speed, just store your words in an SQL database, or spin up 100 machines "in the cloud". I don't mind. More power to you!

MA-FSAs can be stored in as little as 4 bytes per edge-connector, as described by this web page.
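I don't reproduce that page's exact format here, but to give a feel for how an edge can fit in 4 bytes, here is a Python 3 sketch of one possible layout. The bit assignment is my own assumption: 5 bits for a letter a-z, 1 bit marking the end of a word, 1 bit marking the last edge of a node, and 25 bits for the index of the next node's first edge.

```python
import struct

def pack_edge(letter, end_of_word, last_sibling, next_index):
    # Hypothetical layout: [next_index:25][last_sibling:1][end_of_word:1][letter:5]
    code = ord(letter) - ord("a") + 1
    if not 1 <= code <= 26:
        raise ValueError("only the letters a-z fit in 5 bits")
    if not 0 <= next_index < (1 << 25):
        raise ValueError("next_index must fit in 25 bits")
    value = (next_index << 7) | (int(last_sibling) << 6) | (int(end_of_word) << 5) | code
    return struct.pack("<I", value)   # exactly 4 bytes, little-endian

def unpack_edge(data):
    (value,) = struct.unpack("<I", data)
    letter = chr((value & 0x1F) + ord("a") - 1)
    return (letter, bool(value & 0x20), bool(value & 0x40), value >> 7)
```

With 25 bits of index, this layout can address about 33 million edges.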

Implementation

Here's a python implementation. I tried it and it could easily handle seven million words in a couple minutes.

#!/usr/bin/python
# By Steve Hanov, 2011. Released to the public domain.
import sys
import time

DICTIONARY = "/usr/share/dict/words"
QUERY = sys.argv[1:]

# This class represents a node in the directed acyclic word graph (DAWG). It
# has a list of edges to other nodes. It has functions for testing whether it
# is equivalent to another node. Nodes are equivalent if they have identical
# edges, and each identical edge leads to identical states. The __hash__ and
# __eq__ functions allow it to be used as a key in a python dictionary.
class DawgNode:
    NextId = 0
    
    def __init__(self):
        self.id = DawgNode.NextId
        DawgNode.NextId += 1
        self.final = False
        self.edges = {}

    def __str__(self):        
        arr = []
        if self.final: 
            arr.append("1")
        else:
            arr.append("0")

        for (label, node) in self.edges.iteritems():
            arr.append( label )
            arr.append( str( node.id ) )

        return "_".join(arr)

    def __hash__(self):
        return self.__str__().__hash__()

    def __eq__(self, other):
        return self.__str__() == other.__str__()

class Dawg:
    def __init__(self):
        self.previousWord = ""
        self.root = DawgNode()

        # Here is a list of nodes that have not been checked for duplication.
        self.uncheckedNodes = []

        # Here is a list of unique nodes that have been checked for
        # duplication.
        self.minimizedNodes = {}

    def insert( self, word ):
        if word < self.previousWord:
            raise Exception("Error: Words must be inserted in alphabetical " +
                "order.")

        # find common prefix between word and previous word
        commonPrefix = 0
        for i in range( min( len( word ), len( self.previousWord ) ) ):
            if word[i] != self.previousWord[i]: break
            commonPrefix += 1

        # Check the uncheckedNodes for redundant nodes, proceeding from last
        # one down to the common prefix size. Then truncate the list at that
        # point.
        self._minimize( commonPrefix )

        # add the suffix, starting from the correct node mid-way through the
        # graph
        if len(self.uncheckedNodes) == 0:
            node = self.root
        else:
            node = self.uncheckedNodes[-1][2]

        for letter in word[commonPrefix:]:
            nextNode = DawgNode()
            node.edges[letter] = nextNode
            self.uncheckedNodes.append( (node, letter, nextNode) )
            node = nextNode

        node.final = True
        self.previousWord = word

    def finish( self ):
        # minimize all uncheckedNodes
        self._minimize( 0 );

    def _minimize( self, downTo ):
        # proceed from the leaf up to a certain point
        for i in range( len(self.uncheckedNodes) - 1, downTo - 1, -1 ):
            (parent, letter, child) = self.uncheckedNodes[i];
            if child in self.minimizedNodes:
                # replace the child with the previously encountered one
                parent.edges[letter] = self.minimizedNodes[child]
            else:
                # add the state to the minimized nodes.
                self.minimizedNodes[child] = child;
            self.uncheckedNodes.pop()

    def lookup( self, word ):
        node = self.root
        for letter in word:
            if letter not in node.edges: return False
            node = node.edges[letter]

        return node.final

    def nodeCount( self ):
        return len(self.minimizedNodes)

    def edgeCount( self ):
        count = 0
        for node in self.minimizedNodes:
            count += len(node.edges)
        return count

        
dawg = Dawg()
WordCount = 0
words = open(DICTIONARY, "rt").read().split()
words.sort()
start = time.time()    
for word in words:
    WordCount += 1
    dawg.insert(word)
    if ( WordCount % 100 ) == 0: print "%d\r" % WordCount,
dawg.finish()
print "Dawg creation took %g s" % (time.time()-start)    

EdgeCount = dawg.edgeCount()
print "Read %d words into %d nodes and %d edges" % ( WordCount,
        dawg.nodeCount(), EdgeCount )
print "This could be stored in as little as %d bytes" % (EdgeCount * 4)    

for word in QUERY:
    if not dawg.lookup( word ):
        print "%s not in dictionary." % word
    else:
        print "%s is in the dictionary." % word

Updated code on github: using a DAWG as a map

Using this code, a list of 7 million words, taking up 63 MB, was translated into 6 million edges. Although it took more than a gigabyte of memory in Python, such a list could be stored in as little as 24 MB. Of course, gzip could do better, but the result would not be quickly searchable.

Extensions

A MA-FSA is great for testing whether words are in a dictionary. But in the form I gave, it's not possible to retrieve values associated with the words. It is possible to include associated values in the automaton. Such structures are called "Minimal Acyclic Finite State Transducers". In fact, the algorithm I described above can be easily modified to include a value. However, it causes the number of nodes to blow up, and you are much better off using a minimal perfect hash function in addition to your MA-FSA to store your data. I discuss this in part 3.

Fast and Easy Levenshtein distance using a Trie

2011-01-14T20:07:53-05:00

If you have a web site with a search function, you will rapidly realize that most mortals are terrible typists. Many searches contain misspelled words, and users will expect these searches to magically work. This magic is often done using Levenshtein distance. In this article, I'll compare two ways of finding the closest matching word in a large dictionary. I'll describe how I use it on rhymebrain.com not for corrections, but to search 2.6 million words for rhymes, for every request, with no caching, on my super-powerful sock-drawer datacenter:

Algorithm #1

The levenshtein function takes two words and returns how far apart they are. It's an O(N*M) algorithm, where N is the length of one word, and M is the length of the other. If you want to know how it works, go to this wikipedia page.

But comparing two words at a time isn't useful. Usually you want to find the closest matching words in a whole dictionary, possibly with many thousands of words. Here's a quick python program to do that, using the straightforward, but slow way. It uses the file /usr/share/dict/words. The first argument is the misspelled word, and the second argument is the maximum distance. It will print out all the words with that distance, as well as the time spent actually searching. For example:

smhanov@ubuntu1004:~$ ./method1.py goober 1
('goober', 0)
('goobers', 1)
('gooier', 1)
Search took 4.5575 s

Here's the program:

#!/usr/bin/python
#By Steve Hanov, 2011. Released to the public domain
import time
import sys

DICTIONARY = "/usr/share/dict/words";
TARGET = sys.argv[1]
MAX_COST = int(sys.argv[2])

# read dictionary file
words = open(DICTIONARY, "rt").read().split();

# for brevity, we omit transposing two characters. Only inserts,
# removals, and substitutions are considered here.
def levenshtein( word1, word2 ):
    columns = len(word1) + 1
    rows = len(word2) + 1

    # build first row
    currentRow = [0]
    for column in xrange( 1, columns ):
        currentRow.append( currentRow[column - 1] + 1 )

    for row in xrange( 1, rows ):
        previousRow = currentRow
        currentRow = [ previousRow[0] + 1 ]

        for column in xrange( 1, columns ):

            insertCost = currentRow[column - 1] + 1
            deleteCost = previousRow[column] + 1

            if word1[column - 1] != word2[row - 1]:
                replaceCost = previousRow[ column - 1 ] + 1
            else:                
                replaceCost = previousRow[ column - 1 ]

            currentRow.append( min( insertCost, deleteCost, replaceCost ) )

    return currentRow[-1]

def search( word, maxCost ):
    results = []
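    # Use a separate loop variable so the word we are searching for is
    # not overwritten by each dictionary entry.
    for candidate in words:
        cost = levenshtein( word, candidate )

        if cost <= maxCost:
            results.append( (candidate, cost) )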

    return results

start = time.time()
results = search( TARGET, MAX_COST )
end = time.time()

for result in results: print result        

print "Search took %g s" % (end - start)

Runtime

For each word, we have to fill in an N x M table. An upper bound for the runtime is O( <number of words> * <max word length> ^2 )

Improving it

Sorry, now you need to know how the algorithm works and I'm not going to explain it. (You really need to read the wikipedia page.) The important things to know are that it fills in a N x M sized table, like this one, and the answer is in the bottom-right square.

		k	a	t	e
	0	1	2	3	4
c	1	1	2	3	4
a	2	2	1	2	3
t	3	3	2	1	2

But wait, what's it going to do when it moves on to the next word after cat? In my dictionary, that's "cats" so here it is:

		k	a	t	e
	0	1	2	3	4
c	1	1	2	3	4
a	2	2	1	2	3
t	3	3	2	1	2
s	4	4	3	2	2

Only the last row changes. We can avoid a lot of work if we can process the words in order, so we never need to repeat a row for the same prefix of letters. The trie data structure is perfect for this. A trie is a giant tree, where each node represents a partial or complete word. Here's one with the words cat, cats, catacomb, and catacombs in it (courtesy of zwibbler.com). Nodes that represent a word are marked in black.
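To see the row reuse in isolation, here is a small Python 3 sketch (the helper name next_row is mine) that computes one row of the table from the row above it. It rebuilds the "cat" table against "kate", and then gets the row for "cats" with one more call instead of starting over.

```python
def next_row(previous_row, letter, target):
    # Compute the table row for `letter`, given the row above it.
    current_row = [previous_row[0] + 1]
    for column in range(1, len(target) + 1):
        insert_cost = current_row[column - 1] + 1
        delete_cost = previous_row[column] + 1
        # The comparison adds 1 only when the letters differ.
        replace_cost = previous_row[column - 1] + (target[column - 1] != letter)
        current_row.append(min(insert_cost, delete_cost, replace_cost))
    return current_row

target = "kate"
rows = [list(range(len(target) + 1))]       # the row for the empty prefix
for letter in "cat":
    rows.append(next_row(rows[-1], letter, target))

# "cats" shares the prefix "cat", so only one extra row is needed.
cats_row = next_row(rows[-1], "s", target)
```

The last entry of each row is the distance between that prefix and the target, so rows[-1][-1] is the distance for "cat" and cats_row[-1] is the distance for "cats".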

With a trie, all shared prefixes in the dictionary are collapsed into a single path, so we can process them in the best order for building up our levenshtein tables one row at a time. Here's a python program to do that:

#!/usr/bin/python
#By Steve Hanov, 2011. Released to the public domain
import time
import sys

DICTIONARY = "/usr/share/dict/words";
TARGET = sys.argv[1]
MAX_COST = int(sys.argv[2])

# Keep some interesting statistics
NodeCount = 0
WordCount = 0

# The Trie data structure keeps a set of words, organized with one node for
# each letter. Each node has a branch for each letter that may follow it in the
# set of words.
class TrieNode:
    def __init__(self):
        self.word = None
        self.children = {}

        global NodeCount
        NodeCount += 1

    def insert( self, word ):
        node = self
        for letter in word:
            if letter not in node.children: 
                node.children[letter] = TrieNode()

            node = node.children[letter]

        node.word = word

# read dictionary file into a trie
trie = TrieNode()
for word in open(DICTIONARY, "rt").read().split():
    WordCount += 1
    trie.insert( word )

print "Read %d words into %d nodes" % (WordCount, NodeCount)

# The search function returns a list of all words that are no more than the
# given maximum distance from the target word
def search( word, maxCost ):

    # build first row
    currentRow = range( len(word) + 1 )

    results = []

    # recursively search each branch of the trie
    for letter in trie.children:
        searchRecursive( trie.children[letter], letter, word, currentRow, 
            results, maxCost )

    return results

# This recursive helper is used by the search function above. It assumes that
# the previousRow has been filled in already.
def searchRecursive( node, letter, word, previousRow, results, maxCost ):

    columns = len( word ) + 1
    currentRow = [ previousRow[0] + 1 ]

    # Build one row for the letter, with a column for each letter in the target
    # word, plus one for the empty string at column 0
    for column in xrange( 1, columns ):

        insertCost = currentRow[column - 1] + 1
        deleteCost = previousRow[column] + 1

        if word[column - 1] != letter:
            replaceCost = previousRow[ column - 1 ] + 1
        else:                
            replaceCost = previousRow[ column - 1 ]

        currentRow.append( min( insertCost, deleteCost, replaceCost ) )

    # if the last entry in the row indicates the optimal cost is less than the
    # maximum cost, and there is a word in this trie node, then add it.
    if currentRow[-1] <= maxCost and node.word != None:
        results.append( (node.word, currentRow[-1] ) )

    # if any entries in the row are less than the maximum cost, then 
    # recursively search each branch of the trie
    if min( currentRow ) <= maxCost:
        for letter in node.children:
            searchRecursive( node.children[letter], letter, word, currentRow, 
                results, maxCost )

start = time.time()
results = search( TARGET, MAX_COST )
end = time.time()

for result in results: print result        

print "Search took %g s" % (end - start)

Here are the results:

smhanov@ubuntu1004:~$ ./method1.py goober 1
Read 98568 words into 225893 nodes
('goober', 0)
('goobers', 1)
('gooier', 1)
Search took 0.0141618 s

The second algorithm is over 300 times faster than the first. Why? Well, we create at most one row of the table for each node in the trie. The upper bound for the runtime is O(<max word length> * <number of nodes in the trie>). For most dictionaries, this is considerably less than O(<number of words> * <max word length>^2).

Saving memory

Building a trie can take a lot of memory. In Part 2, I discuss how to construct a MA-FSA (or DAWG) which contains the same information in a more compact form.

RhymeBrain

In December, I realized that Google had released their N-grams data, a list of all of the words in all of the books that they have scanned for their Books search feature. When I imported them all into RhymeBrain, my dictionary size at once increased from 260,000 to 2.6 million, and I was having performance problems.

I already stored the words in a trie, indexed by pronunciation instead of letters. However, to search it, I was first performing a quick and dirty scan to find words that might possibly rhyme. Then I took that large list and ran each one through the levenshtein function to calculate RhymeRank^TM. The user is presented with only the top 50 entries of that list.

After a lot of deep thinking, I realized that the levenshtein function could be evaluated incrementally, as I described above. Of course, I might have realized this sooner if I had read one of the many scholarly papers on the subject, which describe this exact method. But who has time for that? :)

With the new algorithm, queries take between 19 and 50 ms even for really long words, but the best part is that I don't need to maintain two separate checks (quick and full), and the RhymeRank^TM algorithm is performed uniformly for each of the 2.6 million words on my 1GHz Acer Aspire One datacenter.

(Previous articles on RhymeBrain)

Other references

In his article How to write a spelling corrector, Peter Norvig approaches the problem using a different way of thinking. He first stores his dictionary in a hash-table for fast lookup. Then he goes through hundreds, or even thousands of combinations of spelling mutations of the target word and checks if each one is in the dictionary. This system is clever, but breaks down quickly if you want to find words with an error greater than 1. Also, it would not work for me, since I needed to modify the cost functions for insert, delete, and substitution.
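For comparison, here is a Python 3 sketch in the spirit of that approach. It is not Norvig's code: the names are my own, and it leaves out transpositions to match the edit operations used in this article. It generates every string one edit away from the target and keeps the ones found in a set.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    # Every string reachable by one delete, replace, or insert.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + replaces + inserts)

def candidates(word, dictionary):
    # Dictionary words at edit distance 0 or 1 from `word`.
    return {w for w in edits1(word) | {word} if w in dictionary}

dictionary = {"goober", "goobers", "gooier", "cat"}
matches = candidates("goober", dictionary)
```

To reach distance 2 you would have to apply edits1 again to every one of those strings, which is why the number of lookups grows so quickly.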

In the blog article Fuzzy String Matching, the author presents a recursive solution using memoization (caching). This is equivalent to flood-filling a diagonal band across the table. It gives a runtime of O(k * <number of nodes in the trie>), where k is the maximum cost. You can modify my algorithm above to fill in only some entries of the table. I tried it, but it made the examples too complex and actually slowed it down. I blame my python skills.

Update: I just realized the author has created a new solution for dictionary search, also based on tries. I quickly tried it on my machine and dictionary, and got a time of 0.009301 s, assuming the prefix tree is prebuilt. It's slightly faster for an edit distance of 1! But something's going on, because it takes 1.5 s for an edit distance of 4, whereas my code takes only 0.44 s. Phew!

And of course, you could create a levenshtein automaton, essentially a big honking regular expression that matches all possible misspellings. But to do it efficiently you need to write big honking gobs of code. (The code in the linked article does not do it efficiently, but states that it is possible to do so.) Alternatively, you could enumerate all possible misspellings and insert them into a MA-FSA or DAWG to obtain the regular expression.

The Curious Complexity of Being Turned On

2010-11-29T10:26:01-05:00

The imaginary Larmin Corp is designing the next killer product: A mood ring. Okay, it's too big to wear around your finger and is more of a wrist device. But it works with 80% accuracy and it's got its own app store and it is expected to be a big hit at CES. There's a snag: unnamed sources are attributing the delay in the product launch to the "On/Off" problem. Larmin Corp denied all rumours and promptly launched lawsuits against the unnamed sources, their children, and pets, and everyone at the bar that night.

Here's how the device works:

It is comprised of two parts: The mood detector, and the User Interface (UI)
The user interface runs all the time (It uses your brain waves for energy)
The mood detector can be turned on and off. However, interpreting your brain waves is complex business, so it can take several seconds to switch on or off.

How do you turn on and off this system? One way is to ignore the delays and pretend it takes no time to turn on and off. The UI simply freezes up until the task is done.

The problem is that sometimes the mood detector takes a little longer to turn on, and users think it's crashed and exhibit extreme anger. Some even start banging the entire device on the desk.

So we don't freeze the UI during the turn-on procedure. But this leads to the following behaviour:

As users get impatient waiting for it to turn on, they keep restarting the procedure. But if you try to turn off the detector while it is turning on, it crashes. The UI team first decides to handle this by adding another layer above the mood detector. If you send a command to it, and the mood detector is busy, it stores it in a queue for later. As soon as the mood detector completes, the layer replays the next queued action.

The problem is the user gets impatient and starts repeatedly hitting the button, and the device eventually gets so many commands queued up that it just sits there, repeatedly turning on and off until the user slams it against the wall and the battery falls out. Also, if the mood detector ever turns on while the user is angry, it screws up the detector's calibration. (In version 1, users are instructed to be in a neutral mood when activating the ring).

So the design architects bring out the big guns and propose an "OnOffManager". Instead of using a queue of commands, the OnOffManager remembers the last requested state and uses it.
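Larmin Corp's code is as imaginary as the company, but a minimal Python 3 sketch of the idea might look like this. The method names are invented; transition_done stands in for whatever callback the detector fires when it finishes switching.

```python
class OnOffManager:
    """Remembers only the most recently requested state."""

    def __init__(self):
        self.actual = False      # state the detector is really in
        self.requested = False   # last state anybody asked for
        self.busy = False        # True while the detector is switching
        self.transitions = 0     # how many switches have been started

    def request(self, on):
        # Overwrite any earlier request instead of queueing it.
        self.requested = on
        self._maybe_start()

    def transition_done(self):
        # Called when the detector finishes switching.
        self.actual = not self.actual
        self.busy = False
        self._maybe_start()

    def _maybe_start(self):
        # Only start switching if idle and the states disagree.
        if not self.busy and self.requested != self.actual:
            self.busy = True
            self.transitions += 1
```

However many times the button is mashed while the detector is busy, at most one follow-up switch is started once it finishes.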

This works pretty well, except that during the design phase the graphics designer gets fed up with the whole debate and simply grays out the button with an ajax spinny thing, so that any further clicks are ignored during turn on. The OnOffManager code is left in, because it took six months to design, but it is never exercised. Everyone lives happily ever after.

Wait, scratch that. Shortly before release, someone writes a location aware app which periodically turns on the mood detector and sends its status to Facebook. Another group is working on the highly secretive "mood gestures" app, which turns off the mood detector if the user thinks a certain sequence of moods. It's not long before somebody complains about their mood ring randomly turning on and off all the time. After analysis, we see the following sequence:

It's an easy fix. The On/Off manager is modified to keep a count of every app that wants it on. The mood detector is only turned on when the counter goes from 0 to 1, and off when the counter goes from 1 to 0. All other states are ignored.
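As a Python 3 sketch (class and method names are invented, and switching delays are ignored to keep it short), the counting version could be:

```python
class CountingOnOffManager:
    def __init__(self):
        self.count = 0            # number of apps that currently want it on
        self.detector_on = False

    def acquire(self):
        self.count += 1
        if self.count == 1:       # 0 -> 1: the first interested app
            self.detector_on = True

    def release(self):
        if self.count == 0:
            return                # unbalanced release: ignored in this sketch
        self.count -= 1
        if self.count == 0:       # 1 -> 0: the last interested app left
            self.detector_on = False
```

The detector now stays on for as long as at least one app still wants it.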

Everything is great, until the charismatic C.E.O. of Larmin Corp, George Jalopsky, is giving a keynote speech. During the speech, his mood ring turns on without him realizing it. The video goes viral. Jalopsky flies back in a huff and holds a meeting of all of software development. "You must fix this problem," he cries, waving his wrists around, projecting bright crimson onto the walls and the faces of the engineers. "When I turn it off, I want it to stay off!"

All development is halted and design committees are formed. Soon, no meeting room at Larmin is available because they are all full of developers talking about the problem. Curious discussions like this one are overheard: "If I turn you on, but George turns you off, are you on or are you off?" This is followed by snickers.

And then someone proposes a solution: The mood ring will have a "soft off" function. You can turn it off, but it will still be allowed to turn on again by third party apps, unless you turn it really off. Provisional patents are quickly filed for the software and the design of the power button itself. The software folks toss around the ideas of what off, and really off mean, and whether it makes sense for the ring to be really turned on *snicker*.

Eventually, they come up with the generalized solution. The On/Off manager shall have two counters. One is for a special class of apps that are designated as "System Apps". There's only one for now -- the on off button. But there could be a plurality in the future. The other counter works as before for third-party apps. If the system counter is 0, the mood detector is off and any other commands are ignored. However, if the system counter is non-zero, then the value of the app counter is used to determine if the mood detector should be on. This is illustrated in the following sequence diagram.
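Here is one reading of that rule as a Python 3 sketch, with invented names. It assumes that app requests are still counted while the system counter is zero; they just have no effect until the power button is on again. Whether Larmin's engineers would agree is anybody's guess.

```python
class TwoCounterOnOffManager:
    def __init__(self):
        self.system_count = 0   # "System Apps", e.g. the on/off button
        self.app_count = 0      # third-party apps

    @property
    def detector_on(self):
        # Off whenever the system counter is 0; otherwise the app
        # counter decides.
        return self.system_count > 0 and self.app_count > 0

    def system_on(self):
        self.system_count += 1

    def system_off(self):
        self.system_count = max(0, self.system_count - 1)

    def app_on(self):
        self.app_count += 1

    def app_off(self):
        self.app_count = max(0, self.app_count - 1)
```

With this rule, a third-party app can no longer light up the ring after George has really turned it off.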

And that, folks, is how something like turning the system on and off can grow in complexity very quickly. Soon, Larmin Corp will add low power modes and the special "BlueMood" peripheral, which transmits the moods to other users, but due to brain wave interference patterns, it only works when the mood detector is off even though the user has buttons for both independently.

Come back next time, to read about how the moods are sent from the detector to the display, in "States of confusion".

Cross-domain communication the HTML5 way

2010-11-25T12:30:12-05:00

Making a web application mashable -- useable in another web page -- has some challenges in the area of cross-domain communications. Here is how I solved those problems for Zwibbler.com. (See the API demo here)

Zwibbler consists of a large javascript program and a little HTML. The javascript part uses Ajax methods to send POST requests back to the zwibbler.com server, to render some items without the limitations of the CANVAS tag. In particular, this allows it to support PDF output as well as SVG and PNG.

If you want to include Zwibbler.com on another web site, the zwibbler application still needs to communicate with zwibbler.com in order to perform these tasks. However, browsers will not allow this due to security restrictions. Javascript code can only communicate with the server that the main web page came from.

HTML allows you to embed one web page inside another, in the <iframe> element. They remain essentially separated. The container web site is only allowed to talk to its web server, and the iframe is only allowed to talk to its originating server. Furthermore, because they have different origins, the browser disallows any contact between the two frames. That includes function calls, and variable accesses.

But what if you want to get some data in between the two separate windows? For example, a zwibbler document might be a megabyte long when converted to a string. I want the containing web page to be able to get a copy of that string when it wants, so it can save it. Also, it should be able to access the saved PDF, PNG, or SVG image that the user produces. HTML5 provides a restricted way to communicate between different frames of the same window, called window.postMessage(). The postMessage function takes two parameters:

A string to pass
The target's origin, or "*" to allow any origin.

For example, to pass a message from the container web page to the iframe, we use:

iframe.contentWindow.postMessage("hello there", "http://zwibbler.com");

The receiver of the message must have previously registered for an HTML event named "message". This event arrives via the same mechanism as mouse clicks.

window.addEventListener("message", function( event ) {
    if ( event.data === "hello there" ) {
        // event.origin contains the host of the sending window.
        alert("Why, hello to you too, " + event.origin);
    }
}, false );

Problem 1: Two way communication

This method of communication is one way, but for a method call, we have to allow two way communication. We add a simple wrapper on top, called a Messenger, to allow two way communication. Each time you call a method in the iframe, you pass a reply function that is called with the results of that method call. We use JSON for the parameter marshalling.

The Messenger object must also keep track of how to direct the replies it receives. It assigns each request a unique ticket, and stores them in a table along with the reply function. When a reply with a matching ticket is received, the corresponding function is called:

Messenger.prototype = {
    init: function( targetFrame, targetDomain) {
        // The DOM node of the target iframe.
        this.targetFrame = targetFrame;

        // The domain, including http:// of the target iframe.
        this.targetDomain = targetDomain;
        
        // A map from ticket number strings to functions awaiting replies.
        this.replies = {};
        this.nextTicket = 0;

        var self = this;
        window.addEventListener("message", function(e) {
            self.receive(e);
        }, false );
    },

    send: function( functionName, args, replyFn ) {
        var ticket = "ticket_" + (this.nextTicket++);
        var text = JSON.stringify( {
            "function": functionName,
            "args": args,
            "ticket": ticket
        });

        if ( replyFn ) {
            this.replies[ticket] = replyFn;
        }

        // targetFrame is the iframe's DOM node, so post to its window.
        this.targetFrame.contentWindow.postMessage( text, this.targetDomain );
    },

The receive function first checks the origin of the message. If it is not the one that we expected, then we ignore the message. Maybe it's from another iframe, such as an ad or a game that happens to be on the same page. It then checks to see if it has a ticket number. If so, it decodes the arguments and calls the associated reply function.

    receive: function( e ) {          
        if ( e.origin !== this.targetDomain ) {
            // not for us: ignore.
            return;
        }

        var json;

        try {
            json = JSON.parse( e.data );
        } catch(except) {
            alert( "Syntax error in response from " + e.origin + ": " + e.data );
            return;
        }

        if ( !(json["ticket"] in this.replies ) ) {
            // no reply ticket.
            return;
        }

        var replyFn = this.replies[json["ticket"]];
        delete this.replies[json["ticket"]];

        var args = [];
        if ( "args" in json ) {
            args = json["args"];
        }

        replyFn.apply( undefined, args );
    },

Problem 2: Delayed loading

There is one other complexity to handle. When we load the iframe, it takes some time to initialize before it is ready to receive events. If you send a message before it has registered to receive it, I'm not sure what happens, but it didn't work when I tried it.

So we have to add a bit of logic to the above code. When the iframe completes initializing, it sends a message consisting of the text "ready" to its parent window. If the Messenger is asked to send a message, and it has not yet received the "ready" message, then instead of sending it, it adds it to a queue for later. When it finally receives the ready message, it loops through the queue and finally sends all of the waiting messages to the iframe.
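
A minimal sketch of that queueing logic follows. The names `ReadyQueue`, `deliver` and `markReady` are made up for illustration and are not taken from component.js; `deliver` stands for whatever function actually posts the text to the iframe.

```javascript
// Sketch only: these names are hypothetical and not taken from component.js.
// "deliver" is whatever function actually posts the text to the iframe.
function ReadyQueue( deliver ) {
    this.deliver = deliver;
    this.ready = false;
    this.pending = [];
}

ReadyQueue.prototype.send = function( text ) {
    if ( this.ready ) {
        this.deliver( text );
    } else {
        // The iframe has not announced itself yet, so hold the message.
        this.pending.push( text );
    }
};

// Call this when the "ready" message arrives from the iframe.
ReadyQueue.prototype.markReady = function() {
    var queued = this.pending;
    var i;
    this.ready = true;
    this.pending = [];
    for ( i = 0; i < queued.length; i++ ) {
        this.deliver( queued[i] );
    }
};
```

In the Messenger, `deliver` would wrap the postMessage call. The receive function would also have to check for the bare "ready" text before calling JSON.parse, because that text on its own is not valid JSON.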

The complete code is contained in component.js

Five essential steps to prepare for your next programming interview

2010-09-27T18:00:00-05:00

There are at least two kinds of programming interviews. One type is where you are asked for details about your prior work experience. The other one is where they put you in a room, give you a problem, and stare at you while you fumble around with markers on a whiteboard for 45 minutes. The first focuses on what you have done in the past. The second focuses on what you can do in the room right now without looking anything up. You should be prepared for either.

Step 1: Get your stories straight

You will spend a large chunk of time in a job interview talking about things that you have done in the past. If you haven't figured out a half dozen stories that best represent your skills, then you need to do that now. Here is a list of questions from a standard list. Many of them are stupid, but trust me -- they force you to think about yourself. Even if you aren't asked a question identical to one on this list, you will use your prepared answers during an interview. The point of this exercise is to build a repertoire of examples from your work life that you can use to answer questions.

Tell me about yourself
What are your short-term goals? What about in 2 and 5 years from now?
What is your own vision/mission statement?
What do you think you will be looking for in the job following this position?
Why do you feel you will be successful in this work?
What other types of work are you looking for in addition to this role?
What supervisory or leadership roles have you had?
What experience have you had working on a team?
What have been your most satisfying/disappointing experiences?
What are your strengths/weaknesses?
What kinds of problems do you handle the best?
How do you reduce stress and try to achieve balance in your life?
How did you handle a request to do something contrary to your moral code or business ethics?
What was the result the last time you tried to sell your idea to others?
Why did you apply to our organization and what do you know about us?
What do you think are advantages/disadvantages of joining our organization?
What is the most important thing you are looking for in an employer?
What were some of the common characteristics of your past supervisors?
What characteristics do you think a person would need to have to work effectively in our company with its policies of staying ahead of the competition?
What courses did you like best/least? Why?
What did you learn or gain from your part-time/summer/co-op/internship experiences?
What are your plans for further studies?
Why are your grades low?
How do you spend your spare time?
If I asked your friends to describe you, what do you think they would say?
What frustrates you the most?
When were you last angry at work and what was the outcome?
What things could you do to increase your overall effectiveness?
What was the toughest decision you had to make in the last year? Why was it difficult?
Why haven't you found a job yet?
You don't seem to have any experience in ___ (e.g., sales, fundraising, bookkeeping), do you?
Why should I hire you?

Source: The University of Waterloo Career Development Manual

The problem is that they require deep thought and introspection to answer, so it's important to do that thinking in advance. Take an hour and think about the answers to these questions (you can use the same answer for more than one). For questions where you need to tell a story, your answer should follow this format:

20 seconds: Describe the situation. "The code was crashing and the whole team had to stop and figure out why."
30 seconds: Describe what you did. "I thought of doing a memory dump, and I noticed that the AbstractMemberCreationFactory had a lot of instances but it was supposed to be a singleton."
20 seconds: Describe the results. "I fixed the memory leak with one line of code and we shipped the product on time. Later on, I added a test to make sure this wouldn't happen again."

Before each interview, go through the entire list and practice your answers out loud. Doing this will give you an edge over the other candidates, because it will make you more comfortable during the interview. When asked a question, other candidates will be staring at the ceiling saying "ummm", trying to remember everything that happened to them in the past five years. Meanwhile, you'll smile, look the interviewer in the eye, and launch into your story.

Step 2: Build confidence by solving the most common programming exercises beforehand

Pianists have to learn a specific set of short pieces before they advance to the next level. These tunes will never be a hit at parties, but they exercise particular things, such as the right hand little finger, or syncopation. Likewise, certain problems keep coming up in programming interviews, although you will probably never, ever use them in your code. You will probably be asked one of these time-worn classics.

Reverse a singly linked list (in one pass through the list)
Reverse a string (in one pass). Reverse the order of words in a paragraph (in two passes)
Draw a circle of arbitrary size by printing out "*" characters. (hint: calculating whether to go "one down, two over" is the wrong approach)
Convert an integer to a string. Convert a string to an integer. (Manually, of course, by looping through each digit somehow.)
Write a function to return the number of 1's in the binary representation of an integer.
Write a function that will display all possible arrangements of letters in a string. Example: abc acb bac bca cab cba

Always start with the easiest solution that works, without considering the runtime. Then, try to make it faster. It's better to have something that works than spend all your time trying to optimize and end up with a page full of scribbles.

Don't cheat yourself by looking up the answers

The first time I tried to reverse a singly-linked list, it was between classes at school. I wasn't rushing, and it took me over half an hour to go from the slow and obvious solution to the fast one. But when I verified that my answer was correct, I was thrilled! I knew that I could tackle this question without looking up the answer. During interviews, when I was given a problem that I hadn't seen before, that experience gave me the confidence I needed to avoid blanking and keep trying.

Step 3: Practice your problem-solving

Some interviewers believe that being able to solve brain-teasers equates to good programming ability. In case you get one of these, you should develop a passing interest in puzzles and techniques for solving them. A visit to your local library will result in a dozen books, filled with puzzles to practice. Pick some interesting problems to tackle, and resist looking up the answers until you have spent at least a half hour on each one. Before long you'll have no difficulty helping foxes & ducks cross rivers, or figuring out how to escape locked rooms by burning various lengths of rope.

Step 4: Show genuine enthusiasm

A powerful technique is to show real enthusiasm. As human beings, we can't help responding in kind and becoming excited to work with you. On the other hand, we also have evolved the ability to see through fake smiles, so it's vital that you be genuinely yourself.

The best interviewers will try to get you to talk about something that you are passionate about, even if it doesn't directly relate to the job. Most interviewers, however, will not. You will have to think about something that you've done that excites you, and look for opportunities to talk about it. Do this early in the interview. After the first 10 minutes it is probably too late, since the interviewers will have already ranked you.

Picture yourself coming in to work at this new job on the first day, turning on the new quad-core development workstation, meeting some interesting new friends, and learning about life at the company. There's got to be something exciting about that. Otherwise, why are you applying?

Step 5: Sleep

The "Tip of the tongue" phenomenon -- the inability to recall names, words, and facts -- increases dramatically if you have a sleep debt. Don't be caught struggling to remember an important detail during an interview. Instead, get a good night's sleep (7-9 hours).

Finding awesome developers in programming interviews

2010-09-13T18:00:00-05:00

In a job interview, I once asked a very experienced embedded software developer to write a program that reverses a string and prints it on the screen. He struggled with this basic task. This man was awesome. Give him a bucket of spare parts, and he could build a robot and program it to navigate around the room. He had worked on satellites that are now in actual orbit. He could have coded circles around me. But the one thing that he had never, ever needed to do was: display something on the screen.

Some people have a knack for asking the right questions to spot awesome developers in a job interview. Other interviewers dread it, come in with their tail between their legs, ask a few questions from the Internet and just go along with the group decision. But interviewing is an essential skill for most developers. A bad hire has terrible long term consequences, because eventually a sub-par employee may bring others into the organization. On the other hand, unfairly excluding an awesome candidate also hurts.

A programming interview includes at least three parts. In part I, we prove any assumptions we have after reading the resume. In part II, we get a sense for how much true experience the candidate has. Finally, we test this experience using a few spot checks and a coding question.

Part I: Testing assumptions from the resume

Once I was interviewing a candidate along with a co-worker. When it was done, I thought the candidate had done okay, but not brilliantly. My co-worker, on the other hand, seemed angry. "He lied about technology X. He obviously has not worked with it. Definitely a no-hire." Technology X was not even important to us. "If he lied about that," my co-worker went on, "I don't trust anything else on the resume."

Candidates should use the resume to portray themselves in a positive light. (See The Completely Honest Resume). However, there is a line where this positive portrayal becomes misrepresentation. In the example above, I wasn't as concerned as my colleague, because I already assume that everything on the resume is false until proven otherwise. If the resume says, "expert in technology X", then I will assume that the candidate merely knows the name of technology X. If the resume says, "Worked in a group that created a multi-threaded stock trading platform," then I will assume that the candidate merely chose the colours for the background. I used to be less strict until I met the guy with 10 years of experience who couldn't write code. If someone says that they wrote the text formatter in OpenOffice, or has a Ph.D, it is easy to make assumptions about their skills. Assume nothing. All must be tested.

For each relevant item on the resume, I first try to get a sense of what the candidate actually did. Then, I get him or her to prove it by talking about it.

Created a real-time operating system as a course project.
How large a group did you work in? A group of 15? Oh, okay then, what specific part did you work on? The message queue? Great! Can you describe what happens when a high priority task sends a message to a low priority task?
Developed from scratch an audio transfer protocol for wireless security systems.
How large was your team? Just you? Wow, how did you test it? Why didn’t you use RTP?
Fixed bugs in the XYZEngine.
Can you describe a bug that you found particularly challenging, and how you fixed it?

Part II: Finding true experience

Having more experience is a good indication of awesomeness. Experienced developers have made mistakes. They know when, and when not, to apply design patterns. They have a sixth sense about what part of requirements will probably change, and what part will probably stay the same. They know when to be lazy and when to be pedantic. It is true experience which makes the gap between awesome and mediocre programmers so wide.

But not all experience is the same. It is certainly possible for someone to gain solid skills in a couple of years, simply by working on lots of different things, writing and rewriting countless lines of code, and making many mistakes. On the other hand, it is also possible for someone to spend a decade writing one-line changes to a single project, without learning anything new.

Finding hidden time

There are lots of great developers who started coding when they were in their second year of university. By the time they get out of school, they will have had a few years of experience. On the other hand, some awesome developers started learning their art at an early age. I know several people who wrote some non-trivial programs in their teens or earlier. This information is nowhere to be found on their resume, and must be coaxed out during an interview.

Why did you get into the software development field?
What's the first programming language that you ever learned?

Density of Experience

Many awesome programmers do all of their coding at work. These are great, well-rounded individuals that you should definitely hire. However, doing personal programming projects outside of work or class is a pretty good indicator of awesomeness. A candidate with personal programming time simply has more flight time under his or her belt, and will be better for it. No personal projects? These other indicators will also count for some points:

Working on smaller teams or groups.
Working on a wide variety of projects
Detailed knowledge of several layers of abstraction on a large project
Being the main contributor in a group project

Part III: Verifying experience

After gaining a sense of the candidate's true level of experience, it is important to verify that experience by testing their programming abilities. A few minutes of time is completely inadequate for a true test, but that's all that's available. We can get an idea of the breadth and depth of knowledge of the candidate by asking questions about different areas of software development. Of course, your perception of the candidate's skills will be biased by your own experiences. You cannot judge the correctness of answers in topics that are unfamiliar to you. That's why there are several interviewers.

The specific topics depend on the job requirements. Nevertheless, some example areas are:

data structures and algorithms
multithreading
bit manipulation
memory allocation
objects and inheritance, design patterns
recursion
compilation and how computers run programs

Each area that I choose has a selection of basic questions (“What’s a semaphore?”). These are so basic that if the candidate has done any work at all in the area, he or she would be able to answer. Each area also has some more detailed follow-up questions. The way in which a candidate answers can prove or disprove awesomeness. For example, something is amiss if you ask a seasoned embedded programmer to convert 0x4c to binary, and they start by writing down 4 x 16 + 12 instead of translating each hex digit straight into four bits (0100 1100).

The Coding Question

Usually, after all of the above, I have a very good idea whether the candidate will pass or fail, but the coding question removes all doubt. It is so important, that even phone interviews are not exempt. To be useful, a coding question requires careful thought and planning before the interview. Asked the wrong way, the response will be useless.

First, one must choose a question based on what the candidate has had experience with. You may have a clever problem that becomes easy if you think of converting everything to intersecting 3D planes. Save it for the lunch hour with your colleagues. If the job does not involve 3D graphics, candidates would be unfairly excluded.

The question must be precisely worded. "Write a function to shuffle a deck of cards" is woefully ambiguous. Provide the function header and avoid misunderstandings, which are all too common. If you are not careful, the candidate will answer a harder or easier problem than the one you asked. The harder one is nice, unless it causes him or her to freeze up. The easier one provides no information. To prevent a huge waste of time, ask for a verbal outline of the solution after a few minutes, to check if the candidate is on the right track.

The influence of the order of questions

The order in which you ask questions can profoundly influence the thought processes of the candidate. For example, I used to ask questions about hash tables when I thought the candidate knew about them. Later on in the interview, I would ask a coding question that had nothing to do with hash tables. Candidates would invariably decide to use a hash table in their implementation, with the keys being unique, consecutive integers starting at 0. If I avoided talking about hash tables, the candidates would instead choose to use an array.

Candidates are also strongly influenced by you in their choice of language. For example, if you say the job primarily involves Java, every candidate will swear that, by golly, Java is his best and favourite language to work in. He will choose to use it for all coding questions, realizing too late that he can't remember how to declare variables in the language he is "best" at.

Avoid language bias

It's terribly easy to be biased toward a specific programming language that you use at your company. By fixating on a particular tool, you throw away a lot of awesome developers. Do not try to determine if the candidate is awesome at programming in C or Java or whatever. Instead, you should be trying to find out if the candidate is awesome at programming in the language that he or she knows best.

Going further

The guidelines above do not address everything. They focus on experience, and they might miss awesome developers that have little experience, but a lot of innate ability. In particular, interviewers may want to test problem solving ability using puzzles that don't require any coding.

The interviewing technique that I have described here is based on proving a hypothesis, probability, and gut instinct. The hypothesis is that the candidate is an awesome developer. What traits does an awesome developer have? You cannot directly measure those traits, so you have to instead ask: What is the probability that the candidate has those traits given that he or she can answer a particular question quickly? It is not possible to assess a candidate within an interview with 100% success, but by asking thoughtful questions, you can come close.

Compress your JSON with automatic type extraction

2010-08-16T21:00:47-05:00

JSON is a horribly inefficient data format for data exchange between a web server and a browser. For one, it converts everything to text. The value 3.141592653589793 takes only 8 bytes of memory, but JSON.stringify() expands it to 17. A second problem is its excessive use of quotes, which add two bytes to every string. Thirdly, it has no standard format for using a schema. When multiple objects are serialized in the same message, the key names for each property must be repeated, even though they are the same for each object.

JSON used to have an advantage because it could be directly parsed by a javascript engine, but even that advantage is gone because of security and interoperability concerns. About the only thing JSON has going for it is that it is usually more compact than the alternative, XML, and it is well supported by many web programming languages.

Compression of JSON data is useful when large data structures must be transmitted from the web browser to the server. In that direction, it is not possible to use gzip compression, because it is not possible for the browser to know in advance whether the server supports gzip. The browser must be conservative, because the server may have changed abilities between requests.

Today, let's tackle the most pressing problem: the need to constantly repeat key names over and over. I will present a Javascript library for compressing JSON strings by automatically deriving a schema from multiple objects. The library can be used as a drop-in replacement for the methods JSON.stringify() and JSON.parse(), except that it lacks support for a reviver function. In combination with Rison, the savings could be significant.

Download it here

Suppose you have to transmit several thousand points and rectangles. JSON might encode them like this (without the comments):

[
    { // This is a point
        "x": 100, 
        "y": 100
    },

    { // This is a rectangle
        "x": 100, 
        "y": 100,
        "width": 200,
        "height": 150
    },

    {}, // an empty object

    ... // thousands more
]

A lot of the space is taken up by repeating the key names "x", "y", "width", and "height". They only need to be stored once for each object type:

{
    "templates": [ ["x", "y"], ["x", "y", "width", "height"] ],
    "values": [ 
        { "type": 1, "values": [ 100, 100 ] }, { "type": 2, "values": [100, 100, 200, 150 ] }, {} ]
}

Each object in the original input is transformed. Instead of listing the keys, the "type" field refers to a list of keys in the "templates" array. (The type is one-based instead of zero-based, and I will explain why later.) But we are still repeating "x" and "y". The rectangle shares these properties with the point type, and there is no need to repeat them in the schema:

{
    "templates": [ [0, "x", "y"], [1, "width", "height"] ],
    "values": [ 
        { "type": 1, "values": [ 100, 100 ] }, { "type": 2, "values": [100, 100, 200, 150 ] }, {} ]
}

We prefix each key list in the schema with a number. This number is the one-based index of a prior key list, whose keys are prepended to form the combined list. Zero means the empty object, which is why we use one-based indices.

But we can still go a little further. Instead of having a separate "type" field in each object, we stick the type as the first element of the values array.

{
    "templates": [ [0, "x", "y"], [1, "width", "height"] ],
    "values": [ 
        { "values": [ 1,  100, 100 ] }, { "values": [2, 100, 100, 200, 150 ] }, {} ]
}

Finally, since we are trying to save space, we rename our properties, and stick in a format code so we can detect that compressed JSON is used.

{
    "f": "cjson",
    "t": [ [0, "x", "y"], [1, "width", "height"] ],
    "v": [ { "": [1,  100, 100 ] }, { "": [2, 100, 100, 200, 150 ] }, {} ]
}
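
To make the format concrete, here is a rough sketch of how the final form above could be expanded back into plain objects. It is not the CJSON library's code: the function names `templateKeys`, `expandValue` and `expand` are made up for illustration, and the sketch assumes that the only object stored without the "" key is the empty one.

```javascript
// Sketch only: hypothetical helpers, not the CJSON library's own code.

// Resolve a one-based template index into its full key list by following
// the parent index stored in the first slot of each template.
function templateKeys( templates, index ) {
    if ( index === 0 ) {
        return []; // zero means the empty object: no keys at all
    }
    var template = templates[index - 1];
    return templateKeys( templates, template[0] ).concat( template.slice( 1 ) );
}

function expandValue( templates, value ) {
    if ( Array.isArray( value ) ) {
        return value.map( function( item ) {
            return expandValue( templates, item );
        });
    }
    if ( value === null || typeof value !== "object" ) {
        return value; // numbers, strings, booleans and null pass through
    }

    var result = {};
    if ( "" in value ) {
        // The first element is the type; the rest are the values, in key order.
        var packed = value[""];
        var keys = templateKeys( templates, packed[0] );
        keys.forEach( function( key, i ) {
            result[key] = expandValue( templates, packed[i + 1] );
        });
    }
    return result;
}

function expand( doc ) {
    if ( doc.f !== "cjson" ) {
        return doc; // not in the compressed format: leave it alone
    }
    return expandValue( doc.t, doc.v );
}
```

Passing the final example above to `expand` gives back the original point, rectangle and empty object.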

Automatic type extraction

The hard part is finding the objects which share sets of keys. It sounds a lot like the Set Cover problem, and if so, finding an optimal solution is NP-hard. Instead, we will approximate the solution using a tree structure. While we are building the value array, when we encounter an object, we add all of its keys to the tree in the order that we encounter them.

At the end of the process, we can traverse the nodes of the tree and create the templates. Nodes which represent the end of a key list (shown in gray) must have an entry in the key list. Although not illustrated here, nodes with multiple children are also points where the child object types inherit from a common parent, so they also get an entry.

The astute reader will realize that the final schema depends on the order that we inserted the keys into the tree. For example, if, when we encountered the rectangle, we inserted the keys "width" and "height" before "x", and "y", the algorithm would not find any common entries.
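
The tree approach can be sketched roughly as follows. This is my own illustration under stated assumptions, not the library's code: `makeNode`, `insertKeys` and `buildTemplates` are hypothetical names, and the sketch only derives the templates, without rewriting the values.

```javascript
// Sketch only: hypothetical helpers, not the CJSON library's own code.
function makeNode( key ) {
    // children maps a key name to the next node; isEnd marks the end of
    // some object's key list.
    return { key: key, children: Object.create( null ), isEnd: false };
}

// Add one object's keys to the tree, in the order they were encountered.
function insertKeys( root, keys ) {
    var node = root;
    keys.forEach( function( key ) {
        if ( !( key in node.children ) ) {
            node.children[key] = makeNode( key );
        }
        node = node.children[key];
    });
    node.isEnd = true;
}

// Walk the tree and emit a template at every node that ends a key list
// or has more than one child. Each template starts with the one-based
// index of its parent template (0 for none).
function buildTemplates( root ) {
    var templates = [];

    function visit( node, parentIndex, pendingKeys ) {
        var childNames = Object.keys( node.children );
        var keys = pendingKeys;
        if ( node !== root ) {
            keys = pendingKeys.concat( [ node.key ] );
            if ( node.isEnd || childNames.length > 1 ) {
                templates.push( [ parentIndex ].concat( keys ) );
                parentIndex = templates.length;
                keys = [];
            }
        }
        childNames.forEach( function( name ) {
            visit( node.children[name], parentIndex, keys );
        });
    }

    visit( root, 0, [] );
    return templates;
}
```

Inserting ["x", "y"] and then ["x", "y", "width", "height"] yields the two templates from the example, [0, "x", "y"] and [1, "width", "height"]. Inserting the rectangle's keys as ["width", "height", "x", "y"] instead yields two unrelated templates, which is the order dependence described above.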

It is possible to gain more efficient packing by using a greedy algorithm. In the greedy algorithm, before we begin, an initial pass through all the objects would be made to build a list of unique object types. Then when it comes time to insert keys into the tree, they are first sorted so that the ones which occur in the most unique types are inserted first. However, this method adds a lot of extra processing and I feel the gains would not be worthwhile.

Real world savings

Here is an actual document from my web site, Zwibbler.com. Click on "Transform" to see how CJSON compresses it vs. JSON.

Download

Download the CJSON code here.

"Your program is stupid. It doesn't work," my wife told me

2010-06-19T10:31:12-05:00

The HSV colour wheel, based on barycentric coordinates, is my favourite colour selection device. It discourages picking unnatural looking saturated colours. Instead, it gives the realistic designer colours more space in the triangle. That's why I chose it for Zwibbler.com, my online Javascript sketching application.

I was working on the drawing tool one evening and my dear wife happened to start using it to draw stuff. I watched her and asked her to change the colour of what she had drawn to blue.

Naturally, she clicked on the blue outer portion of the colour wheel.

"This is stupid. It doesn't work," she complained. I love that I can always count on her for honest feedback!

The colour wheel works, of course. It works exactly the same way as Inkscape and other graphics design software. Clicking on the ring sets the hue. But when the saturation is zero, the hue component doesn't matter, because the absence of colour is always gray.

On Zwibbler, this always happens, because the default colours are black and white.

With one line of code, I made the colour wheel behave the way she expected, and eliminated a negative experience for first time users.

(Update in response to criticism.) The fix I made matches expectations of new users, as well as experienced designers. If the current colour has saturation level 0 (i.e. it's white/black/gray), and you click on the outer ring, then obviously your intention is to eventually increase the saturation. Otherwise what you are doing has no effect. The program sets the saturation component to 1.0 only in that specific case. Otherwise, it leaves your current position in the triangle alone, allowing you to rotate the hue value.

The simple and obvious way to walk through a graph

2010-06-03T08:00:00-05:00

At some point in your programming career you may have to go through a graph of items and process them all exactly once. If you keep following neighbours, the path might loop back on itself, so you need to keep track of which ones have been processed already.

function process_node( node )
{
    if ( node.processed ) {
        return;
    }

    node.processed = true;

    var i;
    for( i = 0; i < node.neighbours.length; i++ ) {
        process_node( node.neighbours[i] );
    }

    // code to process the node goes here. It is executed only
    // once per node.
}

The code works, but it only works once! The next time you try it, all the nodes will already be marked as processed, so nothing will happen.

Here's a neat trick that elegantly solves the problem. Instead of using a boolean flag in each node, use a generation count.

var CurrentGeneration = 0;
function process_node( node )
{
    if ( node.generation == CurrentGeneration ) {
        return;
    }
    
    node.generation = CurrentGeneration;

    var i;
    for( i = 0; i < node.neighbours.length; i++ ) {
        process_node( node.neighbours[i] );
    }

    // code to process the node goes here. It is executed only
    // once per node.
}

function process_all_nodes( root )
{
    CurrentGeneration += 1;
    process_node( root );
}
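
To see the generation count at work, here is a small self-contained sketch that walks a three-node cycle twice. The `visit` and `visitAll` names and the callback parameter are my own additions for the demonstration; the technique is the same as above.

```javascript
// Sketch only: a callback-based variant of the code above, for demonstration.
var currentGeneration = 0;

function visit( node, callback ) {
    if ( node.generation === currentGeneration ) {
        return; // already seen during this walk
    }

    node.generation = currentGeneration;

    var i;
    for ( i = 0; i < node.neighbours.length; i++ ) {
        visit( node.neighbours[i], callback );
    }

    callback( node );
}

function visitAll( root, callback ) {
    // A new generation makes every old mark stale, so nothing needs clearing.
    currentGeneration += 1;
    visit( root, callback );
}

// A graph that loops back on itself: a -> b -> c -> a
var a = { name: "a", neighbours: [] };
var b = { name: "b", neighbours: [] };
var c = { name: "c", neighbours: [] };
a.neighbours.push( b );
b.neighbours.push( c );
c.neighbours.push( a );

function collectNames( root ) {
    var names = [];
    visitAll( root, function( node ) {
        names.push( node.name );
    });
    return names;
}

var firstWalk = collectNames( a );  // each node appears exactly once
var secondWalk = collectNames( a ); // and again on the second walk
```

Both walks visit all three nodes exactly once, without any pass to reset flags in between.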

It's simple and obvious, right? So why didn't I think of it in eight years?

Asking users for steps to reproduce bugs, and other dumb ideas

2010-05-27T21:00:00-05:00

A common misconception about software development:

When a bug occurs, users will enter it into a tracking system with detailed information on how to reproduce it.
A developer walks through the given steps to reproduce the issue, finds the problem, and submits a fix.

That's based on several bad assumptions. Most users will not bother to enter bugs into your system. It is an unselfish act of altruism to enter a bug report. The user knows that it could be months, or even years before the bug is fixed, but she needs to be finished with your app by 5pm today.

The second problem is that many bugs are not reproducible. Maybe the bug depends on something the user was doing, the fact that she always clicks on the okay button instead of pressing enter, or because she installed a printer driver that replaced part of the OS with a modified, out of date library. Maybe your application is the first thing she uses in the morning while the hard drive is still chugging and 99 other programs are trying to update themselves at the same time.

Even worse: sometimes the problem just goes away on its own.

It's tempting to dismiss bug reports that you cannot reproduce. As a developer, you have enough work to do. The least they can do is tell you how to reproduce the problem, so that you'll have a chance at fixing it. Thousands of open bugs are closed or left to languish for years because they cannot be reproduced.

Refusing to fix a serious problem until you have reproducible steps is a cop-out excuse for lazy developers, who would rather get back to working on the physics engine for that secret flight simulator easter egg. That excuse might work for non-commercial software, but in the commercial software business, it will lose customers and get you fired.

How to fix non-reproducible bugs

Use a logging system

The best way to fix non-reproducible issues is to have adequate logging in the first place. Whenever the user does something, selects a menu, clicks "cancel", or inhales, record that action somewhere. Keep the file small, so the old parts scroll away. Take care to scrub anything that would violate privacy. Then, when a problem occurs, you can ask the user to attach the log file to the problem report.

You can use a well designed log as input to a test framework. You can then automatically reproduce the issue as many times as you need to test the fix, and ultimately make it part of your regression test suite.
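As a rough sketch of the "keep the file small, so the old parts scroll away" idea, here is a fixed-size ring of recent user actions in C. The names (action_log, log_action, log_get, log_dump) and the sizes are made up for illustration, and it only keeps the entries in memory; a real application would also have to decide where to write them and what to scrub.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOG_CAPACITY 4      /* hypothetical: keep only the last 4 actions */
#define LOG_ENTRY_LEN 64    /* hypothetical: longest action description */

typedef struct {
    char entries[LOG_CAPACITY][LOG_ENTRY_LEN];
    unsigned long total;    /* number of actions ever recorded */
} action_log;

/* Record one user action, overwriting the oldest entry when full. */
void log_action(action_log *log, const char *what)
{
    char *slot;
    assert(log != NULL && what != NULL);
    slot = log->entries[log->total % LOG_CAPACITY];
    strncpy(slot, what, LOG_ENTRY_LEN - 1);
    slot[LOG_ENTRY_LEN - 1] = '\0';
    log->total++;
}

/* Number of entries currently retained. */
unsigned long log_count(const action_log *log)
{
    return log->total < LOG_CAPACITY ? log->total : LOG_CAPACITY;
}

/* Return the i-th retained entry, oldest first (i < log_count). */
const char *log_get(const action_log *log, unsigned long i)
{
    unsigned long first = log->total - log_count(log);
    assert(i < log_count(log));
    return log->entries[(first + i) % LOG_CAPACITY];
}

/* Write the retained entries, oldest first, e.g. to attach to a report. */
void log_dump(const action_log *log, FILE *out)
{
    unsigned long i;
    for (i = 0; i < log_count(log); i++)
        fprintf(out, "%s\n", log_get(log, i));
}
```

Zero the structure once at startup (for example with memset), call log_action from every menu and button handler, and call log_dump when the user files a report. Because each entry is a plain line of text, the same file could later be replayed by a test harness.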

Otherwise, fix by inspection

In 1981, Mark Weiser studied how experts debug software. The very best programmers create a mental slice of the program, so they only have to think about a few functions at a time. Weiser defined a program slice as the minimal program that still reproduces the problem. He developed an automatic method for finding the program slice to make debugging easier.

Unfortunately, thirty years later we are still doing things manually, and debugging requires lots of creative detective work. Use source control to ensure that you are looking at the same version of the software that your customer is using. Work backwards from the error message, keeping careful notes of the reverse call graph. If the X variable was set, what were the possible values of Y? Could this else clause have run? It's grueling work, and it takes days, but at the end of it, you will have a list of potential paths through the system that caused the error. And now you can methodically fix each one of them, without ever having reproduced the problem.

Sometimes, though, you will find that the problem logically could not have happened. In that case, you can either add more logging, or detect and recover, or do both.

Otherwise, detect and recover

Okay, so you had adequate logging. You've mentally traced through the source code for days. You've written additional tests for different theories and failed to reproduce the problem. The only way it could have happened is if a hole in the universe opened up and changed the laws of physics for a moment, or cosmic rays rewrote some register values. You will just have to detect and recover.

If a slow memory leak is causing the application to crash after three weeks, make sure your application restarts itself every few hours. If your complicated data structure somehow gets into an inconsistent state, write a function to go through and fix it after every single change. You get the idea.
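Here is a minimal sketch of the "go through and fix it" approach, assuming a hypothetical linked list that caches its tail pointer and element count. The structure and the function name list_check_and_repair are invented for this example; the point is only that the repair routine recomputes the cached fields from the one thing it trusts (the chain of next pointers) and reports how many it had to correct.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    int value;
} node;

typedef struct {
    node *head;
    node *tail;     /* cached: last node, or NULL when empty */
    size_t count;   /* cached: number of nodes */
} list;

/* Walk the list from head and rebuild the cached fields.
 * Returns the number of cached fields that were wrong, so the caller
 * can log that a repair happened. */
int list_check_and_repair(list *l)
{
    node *last = NULL;
    node *n;
    size_t seen = 0;
    int repairs = 0;

    assert(l != NULL);
    for (n = l->head; n != NULL; n = n->next) {
        last = n;
        seen++;
    }
    if (l->tail != last)  { l->tail = last;  repairs++; }
    if (l->count != seen) { l->count = seen; repairs++; }
    return repairs;
}
```

Calling this after every change costs a full walk of the list each time, which is the price of turning a crash into a log entry. A non-zero return value is worth recording, because it tells you the mystery bug struck again.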

Detecting and recovering from a mysterious bug is sometimes the only way to turn a major show-stopper into a minor annoyance. It is not the final solution, but it is a way to give you more time to find the real cause.

Get to it

The next time you get a serious bug that you can't reproduce, don't close it. Fix it, as if your job depended on it.

Creating portable binaries on Linux

2010-04-29T18:55:31-05:00

Distributing applications on Linux is hard. Sure, with modern package management, installing software is easy. But if you are distributing an application, you probably need one Windows version, plus umpteen different versions for Linux. In this article, we'll create a dummy application that targets the following operating systems, which are commonly used in business environments:

Windows Server 2003
Windows XP
Red Hat Enterprise Linux 3
Red Hat Enterprise Linux 4
Ubuntu 6.06.2
Ubuntu 8.04
Ubuntu 9.10

As evidence that the problem is hard, try downloading Firefox. It fails to start on many of the above platforms, due to missing libraries.

The sample application

The sample application is called plookup (download source and all binaries). It runs from the command line and takes a hostname, looks it up and prints out the IP address. Ignoring the security flaws, it has several monkey wrenches thrown in that make it hard to port to different versions of Linux:

It uses C++, which causes headaches when dynamically linking
It uses socket functions, which cause migraines when statically linking

Will distributing source code solve the problem?

In theory, distributing source code seems to be an easy way to get around the problem, assuming your end user 1) has administrative access, 2) can install a compiler, 3) knows how to run a configure script, 4) has the technical knowledge to interpret the output of a configure script and download the appropriate dependencies, and 5) has the technical expertise to resolve conflicts in library versions.

In other words, it's a completely unreasonable solution for software written for normal human beings.

Building on Windows

With the appropriate build flags, you can produce software on Windows 7 that will run on all versions of Windows since Windows 95. By defining WINVER and some other macros, you'll be warned at compile time if you're using a feature that will break your program on earlier versions.

CL /EHsc /Feplookup.exe /DWINVER=0x0400 /D_WIN32 /D_WIN32_WINDOWS=0 /D_WIN32_IE=0x0400 wsock32.lib *.cpp

We're done with Windows. On to Linux!

Static linking on Linux FAIL

Static linking has gained an undeserved reputation for being portable. But see what happens when you try to create a statically linked version of plookup:

steve@ubuntu:~/plookup$ g++ -static -static-libgcc -o plookup main.cpp
/tmp/ccMhUffR.o: In function `hostlookup(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)':
main.cpp:(.text+0x15): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

It defeats the purpose. The warning isn't kidding, either. Your app might run fine for months, then all it takes is one update and it will crash. Even if you didn't use any socket functions, you might have other problems.

Note: If you statically link using the GNU compiler, then according to the LGPL you also have to distribute your object files so the end user can re-link them against another version of the C libraries that they could have modified. Because your customers love hacking on strcat in their spare time.

Dynamic Linking

I built and tested the sample application on many different platforms. This table summarizes the results of the experiment.

	Build on Red Hat EL 3	Build on Red Hat EL 4	Build on Ubuntu 6.06.2 g++3.3	Build on Ubuntu 6.06.2 g++4.0	Build on Ubuntu 8.04	Build on Ubuntu 9.10
Runs on Red Hat EL 3?	Yes	No	Yes	No	No	No
Runs on Red Hat EL 4?	No	Yes	No	Yes	Yes	Yes
Runs on Ubuntu 6.06.2?	No	Yes	No	Yes	Yes	Yes
Runs on Ubuntu 8.04?	No	Yes	No	Yes	Yes	Yes
Runs on Ubuntu 9.10?	No	Yes	No	Yes	Yes	Yes

Analysis

There are two classes of systems. Those with libstdc++5.0 (Red Hat EL 3), and those with libstdc++6.0 (All others). If you build on a system with one version, your application will only run on systems which have that library.

Ubuntu 6.06.2 is a special case. It does not have libstdc++5.0 by default, but you can add it by installing and building with g++3.3. That makes Ubuntu 6.06.2 a great build environment for portable binaries.

Linux Standard Base

The Linux Standard Base (LSB) has an excellent utility that will predict which versions of Linux your application won't run on. But I had some problems using the rest of their toolchain:

It's not clear what you have to download to use their toolchain.
It's not clear how to use their toolchain (Hint: It's a wrapper around gcc)
Their toolchain doesn't work with recent versions of gcc
Once I finally found a distribution that would work with the lsb tools, it did not produce a portable application.

I spent six hours on LSB one weekend and gave up.

Summary

To distribute binaries on Linux,

Dynamically link the standard libraries
Produce one binary using g++3.3 for older systems
Produce another binary using g++4.0 for newer systems

Hopefully in time the situation will improve. According to the FAQ on libstdc++ 3:

The GNU C/C++/FORTRAN/ compiler (gcc, g++, etc) is widely considered to be one of the leading compilers in the world. Its development has recently been taken over by the GCC team. All of the rapid development and near-legendary portability that are the hallmarks of an open-source project are being applied to libstdc++.


Bending over: How to sell your software to large companies

2010-04-16T18:00:00-05:00

This post also appears, with my permission, on A Smart Bear, with additional editorial comments by Jason Cohen, founder of Smart Bear Software.

For a micro-ISV, selling to businesses can be more lucrative than selling to consumers. Instead of making a few dollars per sale and hoping for thousands of sales, you sell to only a few customers, and charge much higher rates. But the rates are high for a reason. It takes more time and money to sell to businesses.

Legal Issues

Consumers rarely read software license agreements. Most corporate customers don't read them either, but some have legal departments that must approve any agreement that the company makes, no matter how small. Your EULA will be examined with the same fervor as a billion dollar acquisition.

The license agreement's primary purpose, then, is to get past the customer's legal team quickly, because they stand between you and a sale. It helps if it is fair and well balanced at the start. That way, if they add crazy one-sided terms, you can negotiate without sounding unreasonable.

Some terms that you may be asked for:

"If you go out of business, we get all of your source code." The request is common. The customer sees it as an insurance policy in case a smaller supplier disappears, and the assurance it provides may be so important that they are unwilling to drop it. Source code escrow services will hold on to your source code for a fee (hint: get the buyer to pay). Opt for a more informal arrangement if they don't specifically ask for this service.
"If someone sues us over your product, you have to pay our legal costs." Indemnification is also a standard clause that is difficult to get removed. If you can't stomach any risk of personal bankruptcy, incorporating your micro-ISV is a must.
Support details. Are you going to be providing free technical support for this product in perpetuity? I hope not.
"What happens if the product is defective?" It's only fair to offer a full refund if the customer is not satisfied.

A good software license agreement that you can re-use in a variety of situations can cost anywhere from $1000 to $5000. It pays to shop around.

The procurement process

Quotations

A quotation looks just like an invoice, except that it has an expiry date. Sixty days ought to be long enough for the client to make a decision, even if the whole department goes on consecutive vacations.

Evaluation Version

The purchasing process can take a long time, so you might be asked to provide an evaluation version while the details of the sale are worked out. It's a great idea, because after the buyer incorporates your product into their processes, they aren't going to walk away from the deal easily. However, it is unclear whether the customer acknowledges any of your license terms during the evaluation period. The product should have a time-limited expiry and other technical measures to ensure compliance.

Purchase orders

You and your buyer have patiently waited five months for the company's legal team to review your license. Now, the signed copies have been faxed (yes, faxed!) back and forth. At last, they'll click on that Paypal button on your order page...

Think again. Once a large business has agreed to buy your product, you are expected to send it to them for free. They do not have to pay you a dime until they feel like it. Instead, they will send a purchase order.

The good news is that purchase orders are a legally binding promise to pay you, after all of the terms have been fulfilled. Here is a diagram to illustrate the procedure:

[Sequence diagram of the purchase order procedure, originally an editable image on WebSequenceDiagrams.com]

If you are lucky, they will use PDF files for the purchase order and invoice. But you will probably have to send and receive some more faxes.

"Please read this 100 page document about our invoicing process.."

Sometimes, after everything is agreed, you'll be asked to perform some kind of insanely complex invoicing procedure. The instructions are laced with stern, upper case warnings that if the invoice doesn't follow the proper format, lacks item category labels (found in document B), or is submitted during the wrong hours, it will be ignored.

If you have priced your product appropriately it will be worth it to spend a few hours to learn their codes and procedures. If the price is too low, you can try your luck and (politely) ask if there are any other options. (Do not mention why!)

I have successfully gotten a couple of companies to use a reseller instead of dealing with me directly, but it is only because they had an existing relationship with the reseller. See below.

"We will release the funds after you provide your US social security number.."

US customers will sometimes ask for a Taxpayer Identification Number (TIN). If you are not a US taxpayer, you don't need one. Once you point this out, they may be okay with it, or they will ask you to fill out a US form W-8BEN and send it to them. The form is scary because it states that your "income" will be subject to a 30% withholding tax. Don't worry: the purchase price is not "an amount subject to withholding", and sellers do not need to start doing US taxes (if they aren't already). Some US businesses feel that they must keep this form on file for all suppliers, and it's easier to comply than argue. Here's some more information.

"We only pay using Bankers' Scrolls made from papyrus"

Many companies have a policy against using Paypal. It's best to use an old fashioned check if you can. You can suggest, but never insist on a method of payment. Money is money! Some international customers only use bank transfers. If so, call your bank for the information that you need to provide them, and expect about $30 of the payment to go to fees.

I absolutely love cheques. I walk into the bank and pay $0.60 to deposit a $5000 cheque, whereas Stripe or Paypal would have cost me $125 for the same. More and more, though, US companies are using electronic transfers. RBC randomly charges me $15 for these, but not all the time.

Resellers

Imagine you are asked to buy some software from, say, Adobe.

You go to their web site,
try to find the link to buy,
figure out how to pay,
get to the checkout page,
then stop and search Google for "adobe coupon codes",
go back to step 1
keep refreshing your email for the link,
download the software.
keep a record of the receipt somewhere.

Now imagine you have to do this for 1000 different items, at 1000 different web sites. It gets to be a very large job. Some companies have outsourced their procurement and license management to resellers.

A reseller is simply an intermediary who pays you and provides the software to their client. It's also their job to ask for a discount, but there is no need to provide one. They have been told to acquire your product, and have already been paid a fee as a percentage of your price.

As a consequence, the reseller will be very quick to renew your software and pay the maintenance fee each year. This often occurs even if the original buyer is no longer at the company.

You don't need to be listed in the reseller's catalogue, and you don't need to have a relationship with the reseller. Just be prepared for emails asking, "Do you work with resellers?" and respond yes, because they have a client who definitely wants to buy your software.

Follow instructions, invoice quickly, and you will learn to love resellers. The only annoying thing is on mornings when you get six of them asking for quotes on the same product, and you know a big company is just trying to shop around, causing more work for you.

Keep smiling

Selling to big companies can be frustrating. Throughout the process, it is important to stay professional and pleasant. Sometimes, it may appear that your customer is trying to screw you. Even if they are, it is your job to be jovial, point it out, and assume that it is a simple oversight. It makes no business sense to throw money away because of a rude email.

Regular Expression Matching can be Ugly and Slow

2010-03-13T11:00:00-05:00

If you open the first few pages of O'Reilly's Beautiful Code, you will find a well written chapter by Brian Kernighan (Personal motto: "No, I didn't invent C. Who told you that?"). The non-C inventing professor describes how a limited form of regular expressions can be implemented elegantly in only a few lines of C code.

/* match: search for re anywhere in text */
int match(char *re, char *text)
{
   if (re[0] == '^')
      return matchhere(re+1, text);
   do { /* must look at empty string */
      if (matchhere(re, text))
         return 1;
   } while (*text++ != '\0');
   return 0;
}

/* matchhere: search for re at beginning of text */
int matchhere(char *re, char *text)
{
   if (re[0] == '\0')
      return 1;
   if (re[1] == '*')
      return matchstar(re[0], re+2, text);
   if (re[0] == '$' && re[1] == '\0')
      return *text == '\0';
   if (*text!='\0' && (re[0]=='.' || re[0]==*text))
      return matchhere(re+1, text+1);
   return 0;
}

/* matchstar: search for c*re at beginning of text */
int matchstar(int c, char *re, char *text)
{
   do { /* a * matches zero or more instances */
      if (matchhere(re, text))
         return 1;
   } while (*text!='\0' && (*text++==c || c=='.'));
   return 0;
}

However, if you try to actually use it for something you will realize that it is really just a glorified filename globber, which is a much simpler problem than regular expressions. In particular, it doesn't support alternation (|), subexpressions, character ranges, or capturing.

I wondered what it would take to implement the whole shebang: capturing regular expressions, as concisely as possible, in C.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int _capslot = 0;

int match_char(char* re, char str)
{
    return *re == '\\' ?
        re[1] == 'd' ? isdigit(str) :
            re[1] == 's' ? isspace(str) :
            re[1] == 'w' ? isalnum(str) || str == '_' :
            re[1] == 'n' ? str == '\n' :
            re[1] == 'r' ? str == '\r' :
            re[1] == 't' ? str == '\t' :
            re[1] == str 
        : (*re == '.' || *re == str);
}

// Performs a capturing match-here operation. Returns the number of characters
// matched.
//
// re: Indirect pointer to regular expression
// str: Indirect pointer to string to match.
// skip: must be 0. (Used internally for recursion)
// capture: Returned array of captured subexpressions. To determine the size
// needed, count the number of unescaped '(' in the expression. Subexpressions
// may be nested.
int parse_expr(char** re, char** str, int skip, char* capture[])
{
    int ok = 0;
    char* backup_str = *str;
    for( ;; ) {
        int ok2 = 1, skip2 = skip;
        while( **re != 0 && **re != '|' && **re != ')' ) {
            int ok3 = 1, cnt=0, backup_capslot = _capslot;
            char* backup_re = *re;
            for( ;; *re = backup_re, _capslot = backup_capslot ) {
                char* backup_str = *str;
                if ( **re == 0 ) {
                } else if ( **re == '(' ) {
                    char* start = *str;
                    int slot = _capslot++;
                    (*re)++;
                    ok3 = !!parse_expr(re,str,skip2, capture);
                    if ( ok3 && capture ) {
                        free( capture[slot] );
                        capture[slot] = (char*)malloc( *str-start+1 );
                        memcpy( capture[slot], start, *str-start );
                        capture[slot][*str-start] = 0;
                    }
                    if ( **re == ')' ) (*re)++;
                } else if ( **re == '[' ) {
                    int inv = 0;
                    (*re)++;
                    ok3 = 0;
                    while( **re && **re != ']' ) {
                        if ( **re == '^' ) {
                            inv = 1;
                        } else if ( **re != '\\' && (*re)[1] == '-' ) {
                            ok3 |= **str >= (*re)[0] && **str <= (*re)[2];
                            *re += 2;
                        } else if ( match_char( *re, **str ) ) {
                            ok3 = 1;
                        }
                        (*re)++;
                    }

                    ok3 ^= inv;
                    ok3 && (*str)++;

                    if ( **re == ']' ) (*re)++;
                } else if ( skip2 ) {
                    *re += (**re == '\\') + 1;
                } else if ( match_char(*re, **str ) ) {
                    *re += (**re == '\\') + 1;
                    (*str)++;
                } else {
                    *re += (**re == '\\') + 1;
                    ok3 = 0;
                }
                if ( !ok3 ) *str = backup_str;
                cnt += ok3;
                if ( **re != '+' && **re != '*' || skip2 || !ok3 ) break;
            }

            if ( **re == '?' ) {
                (*re)++;
                ok3 = 1;
            } else if ( **re == '+' ) {
                (*re)++;
                ok3 = cnt >= 1;
            } else if ( **re == '*' ) {
                (*re)++;
                ok3 = cnt >= 0;
            }

            ok2 &= ok3;
            skip2 |= !ok2;
        }
        ok |= ok2;
        if ( **re != '|' ) break;
        (*re)++;
        ok && (skip = 1) || (*str = backup_str);
    } 
    if ( !ok ) *str = backup_str;
    return *str - backup_str;
}

int main(void)
{
    char* text = "September 22, 1966";
    char* expr = "([A-Za-z]+) (\\d+), (\\d+)";

    char* fields[3] = { 0 };  // must start as NULL: parse_expr frees old captures

    parse_expr( &expr, &text, 0, fields );

    printf("Month: %s, Day: %s, Year: %s\n", fields[0], fields[1], fields[2]);

    return 0;
}

This is horrible on so many levels. First of all, it is impossible to understand. Secondly, it uses backtracking, which means if the expression did not match a certain way it keeps trying until all possibilities are exhausted. Thirdly, I forgot to put in "$" and "^".

There is a better way

A more efficient way to match regular expressions has been known for decades, and is described nicely in Regular Expression Matching can be Simple And Fast. I won't repeat it here (yet).

However, I will point out some code that is beautiful.

In 1993, Anthony Howe's concise implementation of grep won the International Obfuscated C Code Contest. It lazily constructs a finite state machine for the expression as it goes, forming only the parts needed to match or reject the input text.

Howe kindly released an unobfuscated version of his entry. If you are interested in the innards of regular expressions, after you read Regular Expression Matching can be Simple And Fast, then take a look at the elegant representation of NFAs in Howe's comments. He describes how the state machine is stored using a flat array. No structures, nodes, or pointers are needed.

If you're used to making an object for everything, then spending 15 minutes to understand this is guaranteed to expand your mind.

The full source code is now on github. (Look for agag.c).

/*
 * Non-deterministic Finite-state Automaton
 *
 * An NFA is a directed graph whose nodes are called states.  Each
 * state is either a labeled state denoting a literal value, or a
 * NIL state that has no label.  Each state has at most two edges
 * leaving it.  There is only one start state and one final state,
 * which may be the same state.
 *
 * Various state-structures can be represented by an array.  These 
 * structures have only one start point being the left most index,
 * and one end point being the right most index.
 *
 *	a		...[a]...
 *
 *	ab		...[a][b]...
 *
 *	^a		[bol][a]...
 *
 *	a$		...[a][eol]
 *
 *	a.b		...[a][wild][b]...
 *	        		_
 *			       v 
 *	a*		...[][a][]...
 *			     ____^
 *
 *	a?		...[][a]...
 *			     ___^
 *			            _______________
 *			           /               v
 *	a|b		...[][a][/][stop][pass][b]...
 *			     _____________^
 *
 *	[a0-9]		...[class][a][0][1][2][3][4][5][6][7][8][9][stop]...
 *
 *	[^a0-9]		...[nclass][a][0][1][2][3][4][5][6][7][8][9][stop]...
 *
 *			            _____________  _______________
 *			           /             v/               v
 *	a|b|c		...[][a][/][stop][][b][/][stop][pass][c]...
 *			     _____________^_____________^
 *
 *			        _            _______________
 *			       v           /               v
 *	a*(b|c)		...[][a][][][b][/][stop][pass][c]...
 *			     ____^   _____________^
 */

#define BOL		('\n')
#define EOL		('\n')
#define STOP		('\n')
#define WILD		(0)
#define CLASS		(-1)
#define NCLASS		(-2)
#define PASS		(-3)
#define JUMP(n)		(-4-(n))
#define CHAR(c)		(c)
#define ISJUMP(x)	((x) <= PASS)
#define ISCHAR(x)	(0 <= (x))
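To make the flat-array idea concrete, here is a small sketch of a non-backtracking matcher whose state machine lives in a plain array of ints. It is not Howe's encoding: the opcodes (OP_CHAR, OP_ANY, OP_SPLIT, OP_JMP, OP_MATCH) and function names are my own, in the style of the "virtual machine" formulation of Thompson's algorithm, and the program for ab*c is assembled by hand rather than compiled from a pattern.

```c
#include <assert.h>
#include <string.h>

enum { OP_CHAR, OP_ANY, OP_SPLIT, OP_JMP, OP_MATCH };
#define MAX_PROG 64

typedef struct {
    int pc[MAX_PROG];
    int n;
} thread_list;

/* Hand-assembled program for the pattern ab*c (hypothetical encoding). */
const int PROG_AB_STAR_C[] = {
    /*  0 */ OP_CHAR, 'a',
    /*  2 */ OP_SPLIT, 5, 9,      /* either take another b, or move on */
    /*  5 */ OP_CHAR, 'b',
    /*  7 */ OP_JMP, 2,
    /*  9 */ OP_CHAR, 'c',
    /* 11 */ OP_MATCH
};

/* Add pc to the list, following JMP and SPLIT edges so the list only
 * holds instructions that consume a character (or MATCH). */
static void add_thread(const int *prog, thread_list *l, int *mark, int gen, int pc)
{
    if (mark[pc] == gen) return;  /* already on the list for this step */
    mark[pc] = gen;
    switch (prog[pc]) {
    case OP_JMP:
        add_thread(prog, l, mark, gen, prog[pc + 1]);
        break;
    case OP_SPLIT:
        add_thread(prog, l, mark, gen, prog[pc + 1]);
        add_thread(prog, l, mark, gen, prog[pc + 2]);
        break;
    default:
        l->pc[l->n++] = pc;
        break;
    }
}

/* Return 1 if the whole of text is accepted by prog, else 0. */
int nfa_match(const int *prog, int prog_len, const char *text)
{
    thread_list a, b, *cur = &a, *next = &b, *tmp;
    int mark[MAX_PROG];
    int gen = 1, i;

    assert(prog_len <= MAX_PROG);
    memset(mark, 0, sizeof mark);
    cur->n = 0;
    add_thread(prog, cur, mark, gen, 0);

    for (; *text != '\0'; text++) {
        next->n = 0;
        gen++;
        for (i = 0; i < cur->n; i++) {
            int pc = cur->pc[i];
            if (prog[pc] == OP_CHAR && prog[pc + 1] == *text)
                add_thread(prog, next, mark, gen, pc + 2);
            else if (prog[pc] == OP_ANY)
                add_thread(prog, next, mark, gen, pc + 1);
        }
        tmp = cur; cur = next; next = tmp;
    }
    for (i = 0; i < cur->n; i++)
        if (prog[cur->pc[i]] == OP_MATCH)
            return 1;
    return 0;
}
```

Every input character is examined once, and the set of live positions can never grow beyond the length of the program, so there is no backtracking blow-up. The only "pointers" are array indices stored in the array itself.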

C++: A language for next generation web apps

2010-01-26T18:00:00-05:00

On Monday, I was pleased to be an uninvited speaker at Waterloo Devhouse, hosted in Postrank's magnificent office. After making some surreptitious alterations to their agile development wall, I gave a tongue-in-cheek talk on how C++ can fit in to a web application.

There were other cool presentations there too. Check it oot! Er, out!

During this presentation, I hope to convince you that the C++ programming language is ideal for developing your next web application.

You might be asking, Steve, Why C++? Why would I subject myself to this horrible language where you have to manage your own memory?

One reason is efficiency. Not only is C++ inherently fast, it also forces you to think differently. It discourages the use of overly complicated data structures that are built in to other languages. In C++, nesting more than one or two structures results in something that is simply too awkward to use. Instead, you are forced to seriously consider using the simplest possible representation.

In addition, just about any library that you'd want to use has already been written, and has a free implementation available on the internet. JSON and CGI decoders are freely available. One often-overlooked task is UTF-8 to wide-character conversion, but this is easily achieved in a 10-line function.
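For the curious, a conversion function along those lines might look like the sketch below. It is written in plain C, runs a little over ten lines once error handling is included, and the name utf8_to_utf32 is my own. It assumes a NUL-terminated input, substitutes U+FFFD for malformed or truncated sequences, and as a simplification does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 from src into out (at most max code points).
 * Returns the number of code points written. */
size_t utf8_to_utf32(const char *src, uint32_t *out, size_t max)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    while (*s != 0 && n < max) {
        uint32_t cp;
        int extra;   /* continuation bytes still expected */
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; }
        s++;
        while (extra > 0) {
            /* A missing continuation byte ends the sequence early. */
            if ((*s & 0xC0) != 0x80) { cp = 0xFFFD; break; }
            cp = (cp << 6) | (*s & 0x3F);
            s++;
            extra--;
        }
        out[n++] = cp;
    }
    return n;
}
```

On platforms where wchar_t is 32 bits wide the output can be copied straight into a wide string; where it is 16 bits (Windows), code points above U+FFFF would still need to be split into surrogate pairs.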

The only thing you do have to be careful with is SQL database access. MySQL is GPL'd, so you can't even link to its client library in a closed source app. SQLite seems to be free, except if you do business in Germany it is $1000 because they have a different definition of public domain.

Here is the first strategy that you might use to incorporate C++ into your web app. In this diagram, there are two things between the browser and your application: the web server, and the PHP script. It is not as efficient as it could be, and there are also security implications with calling command line programs from PHP.

In this model, you write your C++ program as a CGI script directly, using one of the freely available query string parsing libraries. It is efficient and clean. When your browser requests information, the web server starts your program, which spits out the response in JSON format. This response is then relayed back to the browser.
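As an illustration of how little a CGI program needs to do, here is a sketch in plain C. The function names (query_param, build_response), the "word" parameter, and the shape of the JSON are all invented for the example and are not taken from any real application. It also skips %XX and '+' decoding, which a proper query string library would handle.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the raw value of parameter `name` from a query string such as
 * "max=5&word=postrank" into out. Returns 1 if found, 0 otherwise.
 * Simplification: no %XX or '+' decoding is done. */
int query_param(const char *query, const char *name, char *out, size_t size)
{
    size_t name_len = strlen(name);
    assert(size > 0);
    while (query != NULL && *query != '\0') {
        const char *end = strchr(query, '&');
        size_t len = end ? (size_t)(end - query) : strlen(query);
        if (len > name_len && strncmp(query, name, name_len) == 0
                && query[name_len] == '=') {
            size_t value_len = len - name_len - 1;
            if (value_len >= size) value_len = size - 1;
            memcpy(out, query + name_len + 1, value_len);
            out[value_len] = '\0';
            return 1;
        }
        query = end ? end + 1 : NULL;
    }
    out[0] = '\0';
    return 0;
}

/* Build a complete CGI response (header, blank line, JSON body) in buf.
 * Returns what snprintf returns. Only quotes and backslashes are
 * escaped; control characters are assumed not to occur. */
int build_response(const char *query, char *buf, size_t size)
{
    char word[64];
    char escaped[2 * sizeof word];
    size_t i, j = 0;

    query_param(query, "word", word, sizeof word);
    for (i = 0; word[i] != '\0'; i++) {
        if (word[i] == '"' || word[i] == '\\')
            escaped[j++] = '\\';
        escaped[j++] = word[i];
    }
    escaped[j] = '\0';
    return snprintf(buf, size,
        "Content-Type: application/json\r\n\r\n{\"word\": \"%s\"}\n", escaped);
}
```

A main function would pass getenv("QUERY_STRING") to build_response and write the buffer to standard output; the web server takes care of everything else. A missing QUERY_STRING (a NULL pointer) is treated the same as an empty one.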

I use the above strategy for RhymeBrain. It includes advanced statistical algorithms that let it sound out any word that you put in. For example, you can enter the word "postrank" and see that it rhymes with such gems as "blank, prank, drank, tanked, and stank".

Because it's written in C++, I can run it on my super-powerful datacenter. It sits atop my sock drawer at RhymeBrain headquarters.

This powerful 1GHz beast can load the entire 2.6 million word database and fire back the response in about 50 milliseconds. It does this from a cold start, for each request.

But there is a third strategy: If you write your own webserver, you can cut out the middleman and serve the request directly. Your javascript code makes a request, and all your web server has to do is call a function to send back the results.

Writing a webserver isn't that hard. Here is the complete implementation of the Hibachi web server. It supports virtual hosts, and Perl and PHP scripting, among other things. It was written by former Waterloo-ite Anthony Howe, and won the 2004 International Obfuscated C Code Contest.

Inspired by Hibachi, I wrote my own webserver and built WebSequenceDiagrams (which runs in a real data center..). Doing it this way reveals a new business model. It is possible to package up the entire web application into a single installer that runs on Windows and Linux. Since it's all integrated, there is no need for customers to fuss around with Apache or the numerous other moving parts that could break if I shipped separate components. (You can run it in your organization for as little as $99).

The only disadvantage is that the final executable is really small -- around 700K. It may be a little smaller than some customers are expecting. During the presentation, it was suggested that I ship it as an appliance, in a huge box, to compensate.

Conclusion

I hope that I've convinced you of the benefits of C++ in your next web application:

Reduced hardware costs
Readily available libraries for web tasks
Portability
Extreme flexibility in deployment

qb.js: An implementation of QBASIC in Javascript

2010-01-08T21:30:00-05:00

If your browser supports the proposed CANVAS tag, you will see a screen below containing a BASIC program. This only implements enough of the language to run NIBBLES.BAS

Many programmers seldom think about how their compiler or scripting language is implemented. To them, it is a tool, and the less it gets in the way, the better. However, for a craftsman, knowing how your tools work can help you do a better job. It helps to think on several levels at once. Take this code, for example:

function createDummyString( length )
{
    var result = "";
    for ( var i = 0; i < length; i++ ) {
        result = result + " "; 
    }

    return result;
}

In some languages, strings are immutable, so result could be repeatedly created, copied, and destroyed millions of times in this one function call.

Those who are familiar with compilers have the ability to think on another level. For them, visible about 10 cm behind the computer screen is a whole other level of code, which contains how the programming language might be implemented. Beyond that, further in the distance, lies assembly language, with its precarious branch prediction, happy cache hits, and miserable misses.

Here is an implementation of QBASIC in Javascript. In this next series of blog entries, we will explore its inner workings, covering all parts of the compilation process.

What works

Only text mode is supported. The most common commands (enough to run nibbles) are implemented. These include:

Subs and functions
Arrays
User types
Shared variables
Loops
Input from screen

What doesn't work

Graphics modes are not supported
No statements are allowed on the same line as IF/THEN
Line numbers are not supported
Only the built-in functions used by NIBBLES.BAS are implemented
All subroutines and functions must be declared using DECLARE

This is far from being done. In the comments, AC0KG points out that P=1-1 doesn't work.

In short, it would need another 50 or 100 hours of work and there is no reason to do this.

The parser is slow because we are using a simple Earley parser. (EarleyParser.js). Update in 2014: If I ever do this again I would use a "Packrat" parser.

License

License is GPL v3.

Overview of the system

The compiler takes your BASIC program and converts it into a list of user data types, data from DATA statements, and bytecode statements. Here's a pretty picture of the previous sentence:

If you just look at the Javascript source it will be confusing to figure out what goes where. So here is a map of all the important "classes" of the system:

Console

(Source) A canvas that represents the screen and captures keyboard input

Virtual machine

(Source) The virtual machine executes bytecode. The bytecode instructions may manipulate the stack, jump to a new address, or use the console functions. They may also execute system functions (such as LEFT$) or system subroutines (such as CLS).

Types

(Source) Each type (single, double, integer, long, array, user) has functions to create an initial value, or to copy a value into a variable. Upon a copy, for instance, the integer type performs rounding and truncates the value to 16 bits.
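
As a rough sketch of what such a type object could look like, here is a hypothetical integer type. The names IntegerType and createInitialValue are assumptions for illustration, not the names used in the source. Math.round is also a simplification: it may not match QBASIC's own rounding of exact halves.

```javascript
// Hypothetical sketch of an integer type; names are illustrative and are
// not taken from the project's source.
var IntegerType = {
    // Initial value for a freshly created INTEGER variable.
    createInitialValue: function() {
        return 0;
    },

    // Round to the nearest whole number, then keep only the low 16 bits,
    // interpreted as a signed value (-32768 to 32767).
    copy: function( value ) {
        return ( Math.round( value ) << 16 ) >> 16;
    }
};
```

With this sketch, copying 1.6 stores 2, and copying 32768 wraps around to -32768 because only 16 bits survive.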

Codegenerator

(Source) The code generator visits each node of the abstract syntax tree and generates bytecode for it. It uses the type information added by the TypeChecker as an aid.

TypeChecker

(Source) The TypeChecker's job is to catch any errors before we try compiling. Without it, you could write a program that tries to multiply an array by the string "George". Then the virtual machine would say "WTF?!" and crash. It fills in the type for any expression in the syntax tree.

GlrParser

(Source) The GlrParser is an implementation of Tomita's GLR parser. It uses a RuleSet to parse the program into an abstract syntax tree.

Unfortunately, there is a problem with my implementation of the GLR parser. I think I know how to solve it, but at the moment, the system uses the very slow, but concise Earley parser, in EarleyParser.js.

Tokenizer

(Source) If you try to use javascript's built in Regex object for splitting text into tokens, you will soon pull your hair out and run screaming through the halls. Instead, the tokenizer implements a simple Thompson NFA with lazy evaluation. In some cases, this technique is faster than Javascript's own RegEx functions!

Ruleset

(Source) The ruleset contains grammar rules. In addition, it can remove redundant rules, and compute the FIRST and FOLLOW sets.

Ruleparser

(Source) The RuleParser is implemented on top of the RuleSet. It uses the parser to parse rules, and transforms them to add goodies like comma separated lists, kleene star, and alternation operators. But its main purpose is to allow me to say, "My parser uses itself to parse its own rules!"

QBasic

(Source) Finally, qbasic.js contains all of the grammar rules and Abstract Syntax Tree nodes for BASIC programs.

Virtual machine

The most straightforward way of creating a BASIC compiler in Javascript is to directly translate the BASIC into Javascript functions. But this approach will not work for two reasons. First, there is "goto" which, although it is a reserved word, is not yet implemented in Javascript. (Obviously, the ECMAScript community finds "with", prototype inheritance, and the rules of the 'this' keyword to be far less confusing than allowing "goto".) It is possible to automatically move statements around to eliminate GOTO, but you don't want to go there.

The other problem is that browsers tend to freeze until javascript programs finish running. To avoid freezing the browser until the program ends, we break the program into small chunks, and execute a few of those chunks every so often using a javascript timer. This gives the appearance of a running program and doesn't freeze the browser.

Bytecode solves both of those problems. By breaking the program into bytecode instructions, we can implement goto by just changing which instruction we are going to execute next. We can also suspend execution any time to allow the user to interact with the browser.
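
Here is a minimal sketch of the time-slicing idea, assuming a vm object with pc and instructions fields like the ones described in this article. The function runInSlices, the sliceSize parameter, and the schedule callback are all made up for illustration. The real virtual machine may organize this differently.

```javascript
// Run up to `sliceSize` instructions, then yield to the browser and
// continue later. `schedule` defaults to setTimeout, but is a parameter
// so the sketch can also be driven without a browser.
function runInSlices( vm, sliceSize, schedule ) {
    schedule = schedule || function( fn ) { setTimeout( fn, 0 ); };

    function runSlice() {
        var executed = 0;
        while ( executed < sliceSize && vm.pc < vm.instructions.length ) {
            var next = vm.instructions[vm.pc++];
            next.instr.execute( vm, next.arg );
            executed++;
        }
        // More work left: give the browser a turn, then resume.
        if ( vm.pc < vm.instructions.length ) {
            schedule( runSlice );
        }
    }

    runSlice();
}
```

Because all of the program's state lives in the vm object rather than on the Javascript call stack, stopping between any two instructions costs nothing.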

The virtual machine executes programs, which consist of types, an array of instructions, and a set of variable names which are shared. It has some data:

pc, the program counter (the index of the next instruction to execute)
a stack of frames. A frame maps variable names to their values. Each time a function is called, the frame is added to the stack. Each time it returns, the frame is removed, thus destroying all local variables.
an execution stack. The instructions can manipulate this stack. Two types of things can be pushed onto the stack: either a value (which is a javascript string, number, or null), or a reference to a value (which can be the ScalarVariable or ArrayVariable objects).
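
The three pieces of data above can be sketched roughly as follows. Frame, MiniVm, and the method names are hypothetical and only meant to show how the stack of frames gives each call its own local variables. The real code is in virtualmachine.js.

```javascript
// Illustrative only: a frame is a map of variable names to values, plus
// the return address saved by the call.
function Frame( returnPc ) {
    this.returnPc = returnPc;
    this.variables = {};
}

function MiniVm() {
    this.pc = 0;
    this.frames = [ new Frame( -1 ) ];   // the main program's frame
    this.stack = [];                     // execution stack for values and references
}

// The frame that variable lookups currently use.
MiniVm.prototype.currentFrame = function() {
    return this.frames[this.frames.length - 1];
};

// Enter a subroutine: remember where to return, start with no locals.
MiniVm.prototype.call = function( address ) {
    this.frames.push( new Frame( this.pc ) );
    this.pc = address;
};

// Leave a subroutine: drop its locals and resume after the call site.
MiniVm.prototype.ret = function() {
    var frame = this.frames.pop();
    this.pc = frame.returnPc;
};
```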

Executing Instructions

Javascript excels at looking up things in objects and mapping strings to functions. This makes the virtual machine instruction lookup very efficient.

All of the information about each instruction is stored in a single object, called "Instructions". That means we can dispatch instructions very simply, leaving the heavy work of figuring out where the code is to the javascript runtime:

while( this.pc < this.instructions.length ) {
    var next = this.instructions[this.pc++];
    next.instr.execute( this, next.arg );
}

Each instruction can manipulate the stack, set variables, or change the pc to jump to another location.
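
To make the jumping part concrete, here is a small self-contained sketch with a cut-down, hypothetical instruction table. It counts down from 3 by repeatedly jumping backwards, which is all a GOTO really is at this level. The steps counter exists only so the behaviour can be checked.

```javascript
// Hypothetical instruction table with just enough instructions to loop.
var Instructions = {
    PUSHCONST: { execute: function( vm, arg ) { vm.stack.push( arg ); } },
    "-": {
        execute: function( vm, arg ) {
            var rhs = vm.stack.pop();
            var lhs = vm.stack.pop();
            vm.stack.push( lhs - rhs );
        }
    },
    COPYTOP: {
        execute: function( vm, arg ) {
            vm.stack.push( vm.stack[vm.stack.length - 1] );
        }
    },
    // Jump when the popped value is non-zero: this is all GOTO needs.
    BNZ: {
        execute: function( vm, arg ) {
            if ( vm.stack.pop() !== 0 ) {
                vm.pc = arg;
            }
        }
    }
};

function run( instructions ) {
    var vm = { pc: 0, stack: [], instructions: instructions, steps: 0 };
    while ( vm.pc < vm.instructions.length ) {
        var next = vm.instructions[vm.pc++];
        next.instr.execute( vm, next.arg );
        vm.steps++;
    }
    return vm;
}

// Count down from 3 to 0 by jumping back to instruction 1.
var countdown = [
    { instr: Instructions.PUSHCONST, arg: 3 },   // [0] counter = 3
    { instr: Instructions.PUSHCONST, arg: 1 },   // [1]
    { instr: Instructions["-"] },                // [2] counter = counter - 1
    { instr: Instructions.COPYTOP },             // [3] copy for BNZ to consume
    { instr: Instructions.BNZ, arg: 1 }          // [4] loop while non-zero
];
var finished = run( countdown );
```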

Example

Here's some basic code:

A = 1
B = 2
PRINT A + B

And here's the bytecode produced to run the above statements:

   ' L1 A = 1
[0] pushconst 1
[1] pushref A
[2] assign
   ' L2 B = 2
[3] pushconst 2
[4] pushref B
[5] assign
   ' L3 PRINT A + B
[6] pushvalue A
[7] pushvalue B
[8] +
[9] syscall print
[10] pushconst '\n'
[11] syscall print
[12] ret
[13] end

PUSHCONST 1 pushes a number 1 onto the stack. PUSHREF A is a bit complicated, though. It pushes a reference to variable A onto the stack. Since there was no prior value of A, it has to create one with the default type of SINGLE and add the mapping from the name "A" to that variable. After all that housekeeping, it does the push. The state of the virtual machine looks like this:

The ASSIGN instruction expects a reference and a value on the stack. It removes them, and assigns the value to the variable that the reference points to. Here's the actual javascript implementation of the instruction:

    ASSIGN: {
        name: "assign",
        numArgs: 0,
        execute: function( vm, arg )
        {
            // Copy the value into the variable reference.
            // Stack: left hand side: variable reference
            // right hand side: value to assign.

            var lhs = vm.stack.pop();
            var rhs = vm.stack.pop();

            lhs.value = lhs.type.copy( rhs );
        }
    },

Let's look at instructions 6 to 8. We've seen PUSHREF and PUSHCONST; now what's PUSHVALUE? This instruction looks up the variable by name and pushes its current value onto the stack. After instruction 7, the state of the virtual machine is this:

All of the instructions are pretty simple to implement. Here's how the "+" instruction works. Thanks to Javascript, it works for both strings and numbers.

    "+": {
        name: "+",
        numArgs: 0,
        execute: function( vm, arg )
        {
            var rhs = vm.stack.pop();
            var lhs = vm.stack.pop();
            vm.stack.push( lhs + rhs );
        }
    },

Finally, we call the "print" system function, which just pops the result off the stack and displays it on the screen.
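
To see the whole example run end to end, here is a scaled-down sketch of a virtual machine that executes the bytecode listed above. TinyVm, Ops, and the array-based instruction format are assumptions made to keep the sketch short. The types, frames, and console of the real implementation are left out, so PRINT here just appends to a string.

```javascript
// A scaled-down stand-in for the real classes; names and structure here are
// assumptions for illustration only.
function ScalarVariable( value ) {
    this.value = value;
}

function TinyVm( instructions ) {
    this.instructions = instructions;
    this.pc = 0;
    this.stack = [];
    this.variables = {};   // a single frame: name -> ScalarVariable
    this.output = "";      // stands in for the console
}

// Look up a variable, creating it with a default value of 0 on first use.
TinyVm.prototype.getVariable = function( name ) {
    if ( !this.variables.hasOwnProperty( name ) ) {
        this.variables[name] = new ScalarVariable( 0 );
    }
    return this.variables[name];
};

var Ops = {
    pushconst: function( vm, arg ) { vm.stack.push( arg ); },
    pushref:   function( vm, arg ) { vm.stack.push( vm.getVariable( arg ) ); },
    pushvalue: function( vm, arg ) { vm.stack.push( vm.getVariable( arg ).value ); },
    assign:    function( vm ) {
        var lhs = vm.stack.pop();   // the reference, pushed last
        var rhs = vm.stack.pop();   // the value, pushed first
        lhs.value = rhs;
    },
    "+":       function( vm ) {
        var rhs = vm.stack.pop();
        var lhs = vm.stack.pop();
        vm.stack.push( lhs + rhs );
    },
    print:     function( vm ) { vm.output += String( vm.stack.pop() ); }
};

TinyVm.prototype.run = function() {
    while ( this.pc < this.instructions.length ) {
        var next = this.instructions[this.pc++];
        Ops[next[0]]( this, next[1] );
    }
    return this.output;
};

// The bytecode for: A = 1, B = 2, PRINT A + B
var program = [
    [ "pushconst", 1 ], [ "pushref", "A" ], [ "assign" ],
    [ "pushconst", 2 ], [ "pushref", "B" ], [ "assign" ],
    [ "pushvalue", "A" ], [ "pushvalue", "B" ], [ "+" ], [ "print" ],
    [ "pushconst", "\n" ], [ "print" ]
];
var output = new TinyVm( program ).run();
```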

Here are the other instructions:

ASSIGN

Pop two things off the stack, and assign the value to the reference.

MEMBER_VALUE

Pop the reference to the user defined structure, and push the value of the named member.

MEMBER_DEREF

Like MEMBER_VALUE, but pushes a reference to the named member.

ARRAY_DEREF

Pop the array indices, and push a reference to the given location of the array.

PUSHCONST

Push the literal value onto the stack.

RET

Destroy the current variable map, and restore the previous one, jumping to the previous value of PC.

GOSUB

Like CALL, but instead of using a new variable map, copy the current one.

CALL

Create a new, empty variable map, store PC in it, and jump to the given location.

JMP

Immediately jump to the given location.

BNZ

Pop the top of the stack and jump to the given location if it is non-zero.

MOD, /, *, +, -, AND, OR, NOT, <>, >=, <=, <, >, =

Pop one or two arguments off the top of the stack, perform the given operation, and push the result onto the stack.

BZ

Pop the top of the stack and jump to the given location if it is zero.

END

Halt the VM.

NEW

Create a new unnamed instance of the variable of the specified type, and push it onto the stack.

PUSHTYPE

Push the named type onto the stack

PUSHVALUE

Push the value of the given variable onto the stack.

PUSHREF

Push a reference to the given variable onto the stack.

POP

Pop the top of the stack and discard it.

POPVAL

Set the given variable's value to be the value at the top of the stack.

RESTORE

Set data index to the given value

COPYTOP

Duplicate the top of the stack.

FORLOOP

Using counter, step, and end value on the stack, determine if the for loop should continue. If not, jump to the given location.

SYSCALL

Call the given system function or subroutine (e.g., LOCATE, CLS, or LEFT$).
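
Of the instructions above, FORLOOP hides the most logic, because the test depends on the sign of the step. Here is a hedged sketch of how it could work. The order in which the three values sit on the stack is an assumption, and both function names are hypothetical.

```javascript
// Decide whether a FOR loop body should run again. A positive step counts
// up towards `end`; a negative step counts down towards it.
function forLoopContinues( counter, step, end ) {
    if ( step >= 0 ) {
        return counter <= end;
    }
    return counter >= end;
}

// Hypothetical FORLOOP handler: pops end, step and counter (pushed in the
// order counter, step, end) and jumps to `arg` when the loop is finished.
function executeForLoop( vm, arg ) {
    var end = vm.stack.pop();
    var step = vm.stack.pop();
    var counter = vm.stack.pop();
    if ( !forLoopContinues( counter, step, end ) ) {
        vm.pc = arg;
    }
}
```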

Console

The console performs these functions:

Printing to the screen
Converting QBASIC's colour codes to HTML colours
Keeping track of cursor position
Blinking the cursor
Allowing line oriented user input
Keeping a keyboard buffer for INKEY$, and converting DOM keycodes to qbasic keyboard codes.

The console uses an HTML Canvas object for display. It has an image of every character in the IBM character set. When you want to print something, it first draws a solid rectangle of the background colour at that position. Then it copies the character's image, leaving alone any transparent pixels.
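
The two drawing steps can be sketched with the standard canvas calls fillRect and drawImage. The glyph size and the layout of the character image (8x16 pixel cells, 32 per row) are assumptions for this sketch, as is the function name. The real console may store its characters differently.

```javascript
// Assumed layout: glyphs are 8x16 pixels, arranged 32 per row in one image.
var CHAR_WIDTH = 8, CHAR_HEIGHT = 16, GLYPHS_PER_ROW = 32;

// Draw character `code` at text position (col, row) on a 2D canvas context.
function drawCharacter( ctx, glyphImage, code, col, row, backgroundColour ) {
    var x = col * CHAR_WIDTH;
    var y = row * CHAR_HEIGHT;

    // 1. Paint the cell with the background colour.
    ctx.fillStyle = backgroundColour;
    ctx.fillRect( x, y, CHAR_WIDTH, CHAR_HEIGHT );

    // 2. Copy the glyph on top; its transparent pixels leave the background showing.
    var sx = ( code % GLYPHS_PER_ROW ) * CHAR_WIDTH;
    var sy = Math.floor( code / GLYPHS_PER_ROW ) * CHAR_HEIGHT;
    ctx.drawImage( glyphImage, sx, sy, CHAR_WIDTH, CHAR_HEIGHT,
                   x, y, CHAR_WIDTH, CHAR_HEIGHT );
}
```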

In input mode, anything the user types is displayed on the screen and copied to a buffer. When you hit enter, input mode ends, and a completion function is called to restart the virtual machine and process the input.

When you hit a key while not in input mode, it is converted into a QBASIC character code and added to the keyboard buffer. This buffer is used for the INKEY$ function.
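
The keyboard buffer is essentially a queue. A minimal sketch, with a hypothetical KeyboardBuffer class and the conversion from DOM keycodes left out, might look like this:

```javascript
// Hypothetical keyboard buffer: key presses queue up until the program asks for them.
function KeyboardBuffer() {
    this.keys = [];
}

// Called from a keydown handler when the console is not in input mode.
KeyboardBuffer.prototype.onKey = function( qbasicCode ) {
    this.keys.push( qbasicCode );
};

// INKEY$ returns the oldest waiting key, or an empty string if none is waiting.
KeyboardBuffer.prototype.inkey = function() {
    return this.keys.length > 0 ? this.keys.shift() : "";
};
```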

Until next time

We've covered the runtime system of our virtual computer. We've left out how to handle arrays and function calls. For those features, you'll have to look up how the ARRAY_DEREF and CALL instructions are implemented, in virtualmachine.js.

Zwibbler: A simple drawing program using Javascript and Canvas

2009-12-28T16:17:47-05:00

Tested under Firefox 3.5.6 and Google Chrome 3.0.195.38

Introduction

This project extends the technique I created for imprecise line-drawing to create an entire vector graphics application, similar to Inkscape. It is written almost entirely in Javascript, except for a server-side program that renders text.

See below if you are interested in the implementation and coding parts.

Features

Download and share your drawings in PNG, PDF, or SVG formats.
Box, circle, lines, and curve primitive shapes
Shadows when supported by browser
Text in several hand-drawn fonts rendered on the server
Rotate & scale shapes individually or in groups
Select colours using an HSV colour wheel.
Unlimited levels of Undo/Redo

Usage

Users who are familiar with vector graphics programs such as CorelDRAW! or Inkscape will have no problem using it, because some of the most common keyboard shortcuts work the same way. Otherwise, it will take some getting used to.

List of keyboard shortcuts

The keyboard is the only way to do some things.

C	Start drawing a new curve
L	Start drawing a new line
Ctrl+D	Duplicate selection
Page Up	Move selected shapes toward you
Page Down	Move selected shapes away from you
Home	Bring selected shapes to front
End	Send selected shapes to back
Ctrl+G	Group selected shapes
Ctrl+Shift+G	Un-group selected shapes
+	Zoom in
-	Zoom out
Shift +	Restore normal zooming
Arrow Keys	Move around while zoomed-in

Saving and loading

Drawings can be stored on the website by clicking on "Save". If you do not have an account, the drawings will be deleted in a few hours, or if you close your web browser. As soon as you create an account, the drawings will be transferred to long term storage.

If you don't want to create an account, the "Save" option will also allow you to download your drawing as an image.

Drawing simple shapes

Click on the rectangle or circle tool in the toolbar, and the desired shape will immediately appear in the upper left of the drawing area. You can then drag it to where you want it to go.

No two shapes are alike. If you create two circles, they will be slightly different. However, if you duplicate a shape using Ctrl+D, the duplicate will be exactly the same.

Drawing lines and curves

To draw a line, click the line tool in the toolbar, then click anywhere in the drawing window to place the first point. You can then click again to place the second point, and so on. To end the line, double click. If you end the line or curve close enough to where you started it, then it will create a closed shape.

Hint: The line tool can be activated by pressing "L", and the curve tool can be activated by pressing "C".

Lines have a sloppiness property that can be set after drawing them. To set the property, select the line you have drawn. The options pertaining to the line will appear to the left of the drawing window.

Curves do not have sloppiness, but smoothness. By increasing smoothness, the curve is modified to make rounder corners.

Drawing text

Text is placed the same way as circles or rectangles. Click on the text tool, and some default text is placed in the upper left of the image. You can change what the text says by clicking on the text, and then modifying the "text" property that appears to the right of the canvas.

Example: Editing text using the properties area

You can move, scale, or rotate text. However, in this version of the drawing software, scaling will result in loss of quality. Instead, change the font size in the properties area.

Selecting things

While not in line or curve mode, you can select things by dragging a box around them using the mouse. The box must fully enclose the shapes to select them. You can also add a shape to an existing selection by pressing the shift key while clicking it.

Overlapping shapes

You can move a shape to the front using the Home key on the keyboard. Send it to the back using the End key.

Point edit mode

For some shapes, you can move the corners around by entering "Edit" mode. You can enter edit mode by clicking on an already-selected object. In edit mode, the corners that you can move are highlighted using a blue box.

Example: Point edit mode

Delete

Delete all objects in the selection by hitting the Delete key on the keyboard.

Grouping

To avoid having to repeatedly select complex groups of shapes, you can "group" the current selection using Ctrl+G. Now, when you click on any one of the group members, all will be selected. You can break apart any selected groups using Ctrl+Shift+G.

Copy/Paste

You can copy pieces of your drawings between documents and browser windows. While there are shapes selected, click on "Copy" and they will be transferred to the zwibbler clipboard, stored in your browser. You can then open another document later or in another browser window and paste the shapes there.

Duplicate

While there are shapes selected, you can duplicate them by pressing Ctrl+D. Copies of the objects will appear over the existing ones. You will have to move them aside to see the effect.

Zooming in

Increase magnification by pressing +. Decrease by pressing -. Restore to original magnification by pressing Shift+'+'. While zoomed in, you can move around using the arrow keys on the keyboard.

Bugs

The last 10% of work to make a polished program takes 90% of the time. I know about these bugs:

Scaling text reduces its quality.
There's no way to set the background colour of a drawing.

Implementation Notes

I should write something about the implementation. Hmm, the undo stack is pretty standard stuff but may be new to some people. But the thing I was happiest to find was the concept of abstracted Mouse Behaviours.

Design patterns for mouse behaviour are slim pickings. When a novice programmer designs a mouse-intensive application, he or she will tend to make a very complex function that deals with all possible cases when the mouse is clicked or moved. It can quickly grow to be unmaintainable.

To deal with this complexity, we separate the possible states of the system into various MouseBehaviour objects, which implement onMouseDown(x,y), onMouseUp(x,y), and onMouseMove(x,y). (This is a variation on the State design pattern). The default mouse behaviour is the selection mode. If you press the mouse button while over something, it then replaces the default mouse behaviour with a new mouse behaviour. When you lift the button, the new behaviour ends and reverts back to the previous behaviour. This keeps everything simple. There are several mouse behaviours:

DefaultBehaviour -- Depending on where you click, it replaces the current behaviour with a new instance of one of the other behaviours.
SelectBoxBehaviour -- if you click on an empty area of the image, it handles drawing the select box. When you lift the button, it selects everything in the rectangle you have drawn.
TransformSelection -- if you clicked on a selection handle, then it transforms the selected shapes as you move the mouse.
MoveEditNodeBehaviour -- If you click a blue edit node, then it moves the node.
DrawLinesBehaviour -- Places points in a line shape until you double-click to end it.

These objects can be found in DrawView.js
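
For readers who have not seen this pattern before, here is a stripped-down sketch of two of the behaviours. The class names follow the list above, but the bodies, the View class, and the selectedRect field are assumptions for illustration, not the contents of DrawView.js.

```javascript
// Illustrative only: the class names follow the list above, but the bodies
// are assumptions and not the contents of DrawView.js.
function View() {
    this.selectedRect = null;
    this.behaviour = new DefaultBehaviour( this );
}

// The view simply forwards every mouse event to the current behaviour.
View.prototype.onMouseDown = function( x, y ) { this.behaviour.onMouseDown( x, y ); };
View.prototype.onMouseMove = function( x, y ) { this.behaviour.onMouseMove( x, y ); };
View.prototype.onMouseUp   = function( x, y ) { this.behaviour.onMouseUp( x, y ); };

function DefaultBehaviour( view ) {
    this.view = view;
}
// A real version would hit-test here and pick a behaviour to match.
DefaultBehaviour.prototype.onMouseDown = function( x, y ) {
    this.view.behaviour = new SelectBoxBehaviour( this.view, x, y );
};
DefaultBehaviour.prototype.onMouseMove = function( x, y ) {};
DefaultBehaviour.prototype.onMouseUp = function( x, y ) {};

function SelectBoxBehaviour( view, x, y ) {
    this.view = view;
    this.startX = x;  this.startY = y;
    this.endX = x;    this.endY = y;
}
SelectBoxBehaviour.prototype.onMouseDown = function( x, y ) {};
SelectBoxBehaviour.prototype.onMouseMove = function( x, y ) {
    this.endX = x;
    this.endY = y;
};
// Lifting the button records the rectangle and hands control back.
SelectBoxBehaviour.prototype.onMouseUp = function( x, y ) {
    this.view.selectedRect = {
        left: Math.min( this.startX, x ), top: Math.min( this.startY, y ),
        right: Math.max( this.startX, x ), bottom: Math.max( this.startY, y )
    };
    this.view.behaviour = new DefaultBehaviour( this.view );
};
```

The point of the design is that the view's event handlers never grow: adding a new kind of interaction means adding a new behaviour object, not another branch in a giant onMouseDown function.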

License

You don't need a project/solution to use the VC++ debugger

2009-11-05T08:00:00-05:00

You learn a lot of things on the job as a programmer. Years ago, at my first coop position, I was a little confused when my boss went to Visual C++, and tried to open the .EXE file as a project. What a dolt! I thought. That's not going to work.

Luckily I kept my mouth shut. You don't need to create projects or solution files to use Visual C++ as a debugger. Just open up the EXE file and run it. If it has debugging information, you can also manually open up the source files and create break points and everything.

Boring Date (comic)

2009-10-26T08:00:00-05:00

STARTUP INK

It's a re-run, from before I used computerized lettering.

barcamp (comic)

2009-10-19T08:00:00-05:00

STARTUP INK

How IE

2009-10-15T08:00:00-05:00

The <canvas> tag is the newfangled way of displaying vector graphics in a web browser. Before, all graphics were images, Flash animations, or even thousands of one-pixel <div>s. Finally, Internet browsers have caught up to the 1970s and will be able to draw lines and curves programmatically, and you don't have to pay $699 USD for the privilege.

At the time of this writing, Internet Explorer at version 8.0 still lacks the <canvas> tag. But you can easily add the capability by including a short javascript file in your page. At first glance, that's astounding. How do you implement an entire vector graphics API in a few lines of Javascript?

Actually, IE has had the ability to do vector graphics for years. For IE 5.0, Microsoft was aware of how useful it would be, and also keenly aware of how long standardization takes, so they went ahead and implemented something called VML (Vector Markup Language), after it had been submitted to the standards process. SVG is simply VML (combined with some competing submissions) after it went through the process of being standardized.

For example, this code will draw an ellipse in VML.

<style>v\:* { behavior:url(#default#VML); display:inline-block }</style>

<xml:namespace ns="urn:schemas-microsoft-com:vml" prefix="v" />

<v:oval style="width:100pt;height:50pt" fillcolor="red">
</v:oval>

In 2005, Emil A Eklund created a simple translation layer in Javascript that emulates canvas.

IE doesn’t support SVG natively either, it does support something called VML though, and it’s been around since the 5.0 days, if I remember correctly. VML does pretty much the same thing as SVG (as far as basic drawing is concerned).
Using VML, in combination with behaviors, it should therefor be possible to emulate a subset of SVG or Canvas in IE. That’s the idea I got a few days ago when working on a basic drawing abstraction, and according to Google a few others have thought along those lines as well. Couldn’t find any actual implementation though so I decided to make my own, how hard could it be?
If such implementation could be created it would open up the world of client side drawing to web site developers and allow all kind of neat widgets to be developed.

So there it is. The ExplorerCanvas script isn't magic. It inserts VML tags into your web page when you call its various drawing routines.
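
To give a feel for the idea (this is not ExplorerCanvas's actual code), a drawing routine in such a layer could build a VML tag as a string and hand it to the browser. The helper name vmlOval is made up for this sketch.

```javascript
// Hypothetical helper: build the VML markup for an ellipse as a string.
// In IE this string could then be added to the page, for example with
// element.insertAdjacentHTML( "beforeEnd", markup ).
function vmlOval( widthPt, heightPt, fillColour ) {
    return '<v:oval style="width:' + widthPt + 'pt;height:' + heightPt + 'pt"' +
           ' fillcolor="' + fillColour + '"></v:oval>';
}
```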

I didn't know you could mix and match (comic)

2009-10-12T08:00:00-05:00

STARTUP INK

Sign here (comic)

2009-10-05T08:00:00-05:00

STARTUP INK

It's a dirty job... (comic)

2009-08-24T08:00:00-05:00

STARTUP INK

The PenIsland Problem: Text-to-speech for domain names

2009-08-20T08:08:08-05:00

"expertsexchange.com" is a domain name that can be read in multiple, unintended ways. Howshouldatexttospeechsystemresolvethisambiguity?

Recently, I was contracted to run a list of domain names through the custom-built pronunciation engine that powers my rhyming web site. On the first attempt, I found that the results were embarrassingly bad. A quick inspection revealed the problem: most domain names are severalwordsstucktogether.

When a pronunciation by analogy system encounters an unknown word, it searches its knowledge base for words that look similar, and tries to stitch together their pronunciations. In this case, it was doing just what it was supposed to do. For example, lots of words end with an 'e', and usually that 'e' is silent when at the end of a word. But stick another word on, and the system would try to pronounce the 'e', just like a six-year-old learning to read by sounding out each letter. Most people, on the other hand, would recognize the two words and say them each individually.

Try these domains in the AT&T text to speech system, which many consider to be the best in the world, at http://www.research.att.com/~ttsweb/tts/demo.php.

thepiratebay.com (sounds like separately?)
mydreamcloset.com (huh?)
torrentspy.com (sounds like a Polish name)
123greetings.com (AT&T is ridiculous with this one)

This world-class system mispronounces them all, even when given the huge hint of the ".com" at the end.

Time for a bit of dynamic programming. After finding an appropriate scoring function, we can break up text the same way a human reader would. We also use some simple heuristics to say numbers properly.
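
Here is a minimal sketch of that kind of dynamic programming. The scoring function used here, "prefer the split with the fewest words", is an assumption chosen for simplicity and is not necessarily the one my engine uses. The tiny dictionary is made up as well.

```javascript
// Split `text` into dictionary words, using as few words as possible.
// Returns an array of words, or null if no complete split exists.
function segment( text, dictionary ) {
    // best[i] holds the best split of the first i characters.
    var best = [ [] ];
    for ( var end = 1; end <= text.length; end++ ) {
        best[end] = null;
        for ( var start = 0; start < end; start++ ) {
            var word = text.substring( start, end );
            if ( best[start] === null || !dictionary.hasOwnProperty( word ) ) {
                continue;
            }
            // Extending the best split that ends at `start` gives a candidate.
            if ( best[end] === null || best[start].length + 1 < best[end].length ) {
                best[end] = best[start].concat( [ word ] );
            }
        }
    }
    return best[text.length];
}

// A tiny made-up dictionary; a real one would hold many thousands of words.
var words = { the: 1, he: 1, pi: 1, pirate: 1, rate: 1, at: 1, bay: 1, a: 1 };
var parts = segment( "thepiratebay", words );   // the, pirate, bay
```

Note that "the pi rate bay" is also a valid split with this dictionary. The scoring function is what makes the three-word reading win.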

Although I don't have a speech synthesizer, you can check the raw pronunciation output using this form. The phonemes correspond to the ones in the CMU pronouncing dictionary.

Pitching to VCs #2 (comic)

2009-08-17T08:00:00-05:00

STARTUP INK

Building a better rhyming dictionary

2009-08-13T08:00:00-05:00

Back in 2007, I created a rhyming engine based on the public domain Moby pronouncing dictionary. It simply reads the dictionary and looks for rhyming words by comparing the suffix of the words' pronunciations. Since that time, I have made some improvements.

Using a combination of techniques from artificial intelligence, math, and linguistics, the rhyming engine can now figure out how to say any word that you enter. That means if you enter a word that is not in the dictionary, it will still be able to find some rhymes.

Rather than looking for technically perfect rhymes, it suggests words that would sound good together in song or poetry. For example, we sometimes ignore consonants, as suggested by this 1985 paper. That way, fervently will rhyme with urgently despite the v/g mismatch.
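
As a rough illustration of the difference between strict and relaxed matching, here is a sketch that compares pronunciation suffixes. The pronunciations are hand-written in an ARPAbet-like notation, and treating every consonant as interchangeable is a deliberately crude assumption; the real engine is more selective about which consonants it ignores.

```javascript
// Hand-written, ARPAbet-style pronunciations; digits mark vowel stress.
var pronunciations = {
    fervently: [ "F", "ER1", "V", "AH0", "N", "T", "L", "IY0" ],
    urgently:  [ "ER1", "JH", "AH0", "N", "T", "L", "IY0" ],
    gently:    [ "JH", "EH1", "N", "T", "L", "IY0" ]
};

// The part of the word that matters for rhyming: everything from the
// last primary-stressed vowel to the end.
function rhymeSuffix( phonemes ) {
    for ( var i = phonemes.length - 1; i >= 0; i-- ) {
        if ( /1$/.test( phonemes[i] ) ) {
            return phonemes.slice( i );
        }
    }
    return phonemes.slice();
}

// Strict: suffixes must match phoneme for phoneme.
function isStrictRhyme( a, b ) {
    return rhymeSuffix( a ).join( " " ) === rhymeSuffix( b ).join( " " );
}

// Loose: vowels must match, but any consonant may stand in for another.
function isLooseRhyme( a, b ) {
    function blurConsonants( phonemes ) {
        var result = [];
        for ( var i = 0; i < phonemes.length; i++ ) {
            result.push( /\d$/.test( phonemes[i] ) ? phonemes[i] : "C" );
        }
        return result.join( " " );
    }
    return blurConsonants( rhymeSuffix( a ) ) === blurConsonants( rhymeSuffix( b ) );
}
```

With these definitions, fervently and urgently fail the strict test because of the v/g mismatch, but pass the loose one.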

There is a legal advantage to this technique as well. Many of the standard word lists used by natural language processing researchers include words from an old edition of the Oxford dictionary, and so cannot be used for "commercial purposes". That's why both Rhymezone and Write Express have a relatively limited dictionary size. My rhyming engine can sidestep this issue, since it only needs to be seeded with a small number of words from unrestricted sources, and it can then import words in bulk, and guess the pronunciations without using any restricted content.

I couldn't resist doing some premature optimization. It uses one of my favourite data structures -- the trie. The program starts, reads the entire 260,000 word database, and completes in 60 ms on my netbook web server. It takes about 8 MB of memory. I guess that equates to about 0.48 mega-byteseconds per request.

Why is this hard?

Text to speech for English is still a hard problem to solve, and it is an active area of research. Consider the words rough, through, bough, thought, dough, cough, or photOgraph, photOgraphy, or physics, lymphatic, and loophole. In the 80's, and still today in many cases, text to speech is done by hiring specially trained linguists to develop the thousands of rules necessary to create pronunciations. It is only in the last 10 years or so that this task has been automated. My system has over 200,000 hints on how to interpret each part of a word given its context. With further refinements, this could probably be reduced to tens of thousands, which is still a lot.

Does Android team with eccentric geeks? (comic)

2009-08-10T08:00:00-05:00

STARTUP INK

Comment spam defeated at last

2009-08-06T08:00:00-05:00

For years when running this blog, I would have to log in each day and delete a dozen comments due to spam. This was a chore, and I tried many ways to stem the tide.

Finally, a few months ago, I found a way that worked 100% of the time. This raw text file shows what I'm up against, containing all server variables and full text of every comment I've gotten in the last couple of months.

Here's the code for the comment form below. Can you spot my solution? (No, it's not the "http:" part, which almost worked)

    <div class="roundedcornr_top_473174"><div></div></div>
      <div class="roundedcornr_content_473174">
        <div id=commentBox class=commentBox> 
        <div class=pad>
        <h2>Post comment</h2>

        <form action="/blog/index.php" method=POST onsubmit="return validateCommentForm(this);">
        Real Name: <input type=text name=displayname /><br>
        <span style="visibility:hidden"> Your Email (Not displayed): <input type=text name="email"/></span><br>
        Text only. No HTML. If you write "http:" your message will be ignored.
        <br>
        <textarea cols=60 name=comment rows=10 wrap=soft ></textarea><br>
        <input type=submit value="Post" />

        <input type=hidden name=id value="75">
        </form>
        </div>
        <div class=comment>
        </div>
        </div>

      </div>

Pitching to VCs (comic)

2009-08-03T08:00:00-05:00

STARTUP INK

How QBASIC almost got me killed

2009-07-30T08:00:00-05:00

Back in high school, I had too much free time, so I decided to play a joke on my computer teacher. I created an exact clone of the school's DOS system using QBasic. It would pretend to execute three commands: DIR, DEL *.*, and FORMAT. The simulation was so realistic that during development, I was kicked out of the lab. Usually students would be playing Secret Agent or Jill of the Jungle.

The day arrived when my project was ready to be unleashed upon the world. I waited until the teacher was hovering nearby and then I started my application, running the FORMAT command on the network drive. Some classmates were watching the screen and she hurried over to see what all the fuss was about.

The reaction was immediate. She stared at the screen, eyes wide open, and mouth agape, as the terrible seconds ticked by. At that moment I regretted my deception and tried to abort the demo. But QBasic didn't understand CTRL-C during the SLEEP command. Pressing CTRL-C just interrupted the current SLEEP, so it caused the percentage to advance faster. I had to hold down the abort keys and wait until it advanced to 100% before I could prove that everything was really okay.

But then it said:

Unable to read from drive X:

Abort, Retry, Fail? _

That was the closest I've ever come to being murdered.

Source code for Fake DOS
My QBASIC implementations of Pacman and Jumpman (with EXE, use speed 170)
My QBASIC implementation

Blame the extensions (comic)

2009-07-27T08:00:00-05:00

STARTUP INK

How to run a linux based home web server

2009-07-23T08:00:00-05:00

There are plenty of places you can go if you just want to put up some static web pages for free, or very low cost. But costs go up very quickly if you need to do any more than that, or if you get spikes in traffic. Sometimes you need complete control over the server, and don't want to pay $20 to $40 a month for a VPS. In this article, I'll describe step by step how to set up a home web server using Ubuntu, capable of handling modest spikes in traffic.

There are several things you need:

Connection
Hardware
Domain name
DNS server
Web server software properly configured for your system

The Connection

Many large internet providers forbid running any servers in their usage agreement. They might not actually check for a long time, but it's no fun to have your Internet cut off without warning after months with no problems. You can only do this if you are lucky enough to find an ISP that allows servers. In Canada, try Teksavvy.

You will want as high an uplink speed as you can get. Unless your website is extremely popular, bandwidth caps will not be an issue, but be aware of what they are.

The Hardware

Over your home internet connection, traffic will be naturally smoothed by the low uplink bandwidth. Incoming requests will be queued and served in order, not all at once. CPU speed is not going to be an issue, and a one GHz machine is plenty of power. Although this blog's web server has only 512 MB of memory, I recommend at least 1 GB of RAM.

Your router will also be an important part of the system, but quality varies. I have a Linksys WRT54GL. I downloaded and installed dd-wrt on it. The new firmware unlocks a lot of hidden features, including dynamic DNS updating. However, this is not strictly necessary, because your linux system can handle this step just as easily.

Domain name

You can get a domain name for 10 to 15 dollars. Try DomainsAtCost or GoDaddy. After you buy one from a registrar, you will get an account with them. After logging in, you can renew the domain name. But the most important privilege is to be able to set the name server. That will have to wait until the next step...

A few months after getting your domain name you will start getting emails from companies claiming they have registered it in another country, and offering to sell it back to you. You can safely ignore these scams.

DNS server

Because your IP address will often change, you will need a dynamic DNS updating service. DynDns.com provides very good service, with updates taking effect within seconds. However, they do charge a fee. ZoneEdit provides a free service for up to five domains, although updates can take several minutes to propagate.

After signing up with a dynamic DNS updating service, you will get:

a username and password
IP addresses of nameservers

Save the username and password for later. Go back to your domain name registrar and enter the nameservers into their system.

Configuring your server

Install the latest Ubuntu Desktop edition, and then add the additional software you will need for web serving:

sudo apt-get install mysql-client mysql-server php5 apache2 php5-mysql ddclient

You will be prompted several times for your mysql root password. This is not the same as your linux root password. As long as you are behind a firewall, it is okay to press enter and leave it blank.

Now check that your web server works by browsing to http://localhost in a web browser. Apache should show a default page telling you that it works.

Fixing ddclient

DDclient is a program that continually monitors your true IP address. When it changes, it updates your dynamic DNS service so that your domain name will resolve to your home web server.

Unfortunately, the version of ddclient that you can install on Ubuntu 9.04 (and quite possibly later ones) is screwed up. Let's fix it.

wget http://downloads.sourceforge.net/sourceforge/ddclient/ddclient-3.8.0.tar.gz
tar xvzf ddclient-3.8.0.tar.gz
sudo cp ddclient-3.8.0/ddclient /usr/sbin/ddclient
sudo ln -s /etc/init.d/ddclient /etc/rc2.d/S99ddclient

Edit /etc/ddclient/ddclient.conf. If you use zoneedit, the file should look like this. Replace username and password and domain with your true values.

protocol=zoneedit1 
use=web
server=dynamic.zoneedit.com 
login=yourlogin 
password='yourpassword' 
yourdomain.com

Now erase the ddclient cache and restart it.

sudo rm /var/cache/ddclient/*
sudo /etc/init.d/ddclient stop
sudo /etc/init.d/ddclient start

After a few minutes you should be able to get to your web server by entering your domain name. If things aren't working, running 'sudo ddclient --verbose' will help you figure out what it's doing.

Configuring Apache2

There is one last thing. Because your machine has limited memory and bandwidth, you will need to set some parameters in /etc/apache2/apache2.conf.

The KeepAlive setting keeps your server busy for a few seconds after a page is served, allowing clients to download images without starting a new connection. The default setting is far too long, and it would cause your server to choke under heavy load. Set KeepAliveTimeout to 1 or 2 seconds.

The MaxClients setting determines how many connections your web server can handle at one time. Each connection takes around 5 megabytes of unshared RAM, so you will have to set this value taking into account the amount of RAM on your machine, while leaving room for other system processes. For 1 GB, a nice safe value is 130, but your mileage may vary. If your server becomes unresponsive, reboot it and lower this number.
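
To make the arithmetic concrete, here is a minimal sketch in C. The function name and the reserved_mb parameter are my own inventions for illustration, and the 370 MB reserve used in the note below is an assumption picked so that the result lands on the 130 quoted above. Measure your own system before trusting it.

```c
#include <assert.h>

/* Estimate a MaxClients value from the RAM budget.
 * total_mb:    physical RAM in megabytes
 * reserved_mb: RAM kept back for the OS, MySQL and so on (an assumed figure)
 * per_conn_mb: unshared RAM per Apache connection (about 5 MB per the text)
 */
int estimate_max_clients(int total_mb, int reserved_mb, int per_conn_mb)
{
    assert(per_conn_mb > 0);
    if (reserved_mb >= total_mb)
        return 0;                       /* nothing left for Apache */
    return (total_mb - reserved_mb) / per_conn_mb;
}
```

For example, estimate_max_clients(1024, 370, 5) returns 130, while reserving nothing at all would give 204.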

After any change to server settings, use "sudo apache2ctl graceful" to safely reload everything without dropping connections that are in progress. Your users won't notice a thing.

Optimize for bandwidth

When your bandwidth is limited, the number of visitors you can handle is inversely proportional to the size of the page and all of its images. If you have 120 KB/s of uplink bandwidth, and your page is 120 KB in size, then you can handle no more than one visitor per second. The single most important thing you can do is make your images as small as possible. Use low quality JPG for photos, and 32 or 256 colour PNG files for everything else.
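
As a quick sketch of that calculation in C (the function name is hypothetical, and it ignores protocol overhead and compression):

```c
/* Pages served per second = uplink bandwidth / page weight.
 * Both figures are in kilobytes; the result is a ceiling, not a promise. */
double max_visitors_per_second(double uplink_kb_per_s, double page_kb)
{
    if (page_kb <= 0.0)
        return 0.0;                     /* avoid dividing by zero */
    return uplink_kb_per_s / page_kb;
}
```

max_visitors_per_second(120.0, 120.0) gives 1.0, and trimming the page to 30 KB raises the ceiling to 4.0.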

Type 'sudo a2enmod deflate' and restart apache to allow your HTML, scripts, and css files to be transmitted compressed.

If you use common Javascript libraries, use the Google hosted versions of them.

Include all your own CSS and Javascript in the same file as your HTML, so that only one page needs to be requested. During times of heavy load, 99.9% of your visitors only load a single page of your web site anyway, so it makes no sense to split things up into different files.

In conclusion

If you need 99.999% reliability, you should look for a hosted VPS solution. However, for modest needs, you can achieve 98.9% reliability for no cost other than what you are paying your internet provider.

Microsoft's generosity knows no end for a year (comic)

2009-07-19T14:24:53-05:00

STARTUP INK

Using the Acer Aspire One as a web server

2009-07-18T12:08:19-05:00

My web site has been going up and down over the night. I've intentionally been trying to elicit reddit traffic so I can test different parameters for apache server optimization, to handle high traffic over slow connections.

A netbook can be ideal for a home web server. They are cheap, and use less power than a CFL light bulb. The only trade-off is that it won't reboot on a power failure. Fortunately, the built in UPS will sustain it through all but the longest power outages.

My Via Artigo died several weeks ago while I was using it. I have no air conditioner, so it must have been about 30 degrees, which may have contributed to its demise. I took it apart and poked at it, but the monitor and disk light wouldn't light up. So I ended up getting a $179 refurbished Acer Aspire One, which is now happily serving up this web site.

Specs

1.6 GHz Atom processor (with hyperthreading)
512 MB memory
8 GB SSD
Wifi, ethernet, VGA ports
3 USB ports and 2 SD card slots (?!)

Initial experiences

I turned it on briefly and it had a Linux-based OS on it that boots up in less than 5 seconds and looks confusingly like Windows XP. Enough about that.

Because I only install Ubuntu on USB keys, I was able to plug it in, boot up, and my web site was up and running in only minutes.

The fan is the quietest I have ever heard on any laptop. It makes almost no sound at all.

Lousy performance as a desktop system

My old Artigo had a Via 1 GHz CPU and 1 GB of memory, and it could be used as a desktop system. Not so with the Acer. Even with the increased processor power, the decrease in RAM eats up all the benefit. After installing Xubuntu, the system is too sluggish to handle common desktop tasks. It badly needs a RAM upgrade, but this requires you to completely disassemble it.

The tiny speakers are junk. Imagine listening to a movie through ear-buds one metre away. That's how bad it is. But it doesn't matter, because it's a server.

Handling Server Load

With Ubuntu 9.04, after a day of serving up moderate web activity, the Linux Atheros Wifi driver was stuck in a bad state so it was offline. I switched it to a plug-in cable and haven't had any more issues.

The system went non-responsive an hour after I posted my last comic strip. It had a blank screen and needed rebooting. However, I hadn't optimized apache2 yet. Using the appropriate settings, I believe that it can handle between 30 and 40 visits/minute, or anything reddit and ycombinator can throw at it. I'm still tweaking the apache parameters, and I'll post something about them soon.

Things to watch out for when running a home web server

Ensure that MaxClients is set to a reasonable value (memory divided by 7 MB?) in your apache2.conf file. If you come home to find your machine non-responsive and thrashing, this is the cause. To be ultra-safe, you can flip the "KeepAlive" setting to Off, but your pages will load more slowly.
If serving over a dynamic IP address, make sure your machine is configured to automatically update it. I installed ddclient to do this, but on Ubuntu it wasn't automatically restarting after a reboot.
As always, be careful with large images. Save photos as low quality JPGs, and screenshots as 16-colour .PNG files. Don't put up anything over 50 kb.
The limiting factor is bandwidth, not CPU. Enable gzip compression using 'a2enmod deflate'

When programmers design web sites (comic)

2009-07-13T07:19:52-05:00

STARTUP INK

Finding great ideas for your startup

2009-07-08T21:36:26-05:00

STARTUP INK

"I just don't have any ideas." This is the #1 stumbling block for budding entrepreneurs. Here are a few techniques to get the creative juices flowing.

Copy somebody else, but fix the problems

If you have a lack of ideas, you might be mentally discarding lots of things. Ideas don't have to sound good to be successful. If you take a look at this list of ideas from 2008, none of them are original or earth shattering.

If somebody's already doing it, don't let that stop you. Instead, do it better! For years, the technical question and answer site that came up most often in searches was Experts Exchange. But it obnoxiously demands payments all the time, and is notoriously hard to read. Last year, StackOverflow popped up, with its large, user-friendly fonts and zero costs, and quickly rose to popularity.

Copy an idea and fill a niche

Ebay is a one stop shop for buying anything online. You could certainly sell gift cards on Ebay, but why go to the trouble when you can sell through Giftah, a whole site devoted to pawning unwanted gift cards? The founders of Giftah were unfazed by the many existing options, and built their site anyway.

Implement ideas from academia

There is a clear separation between the practical world of business, and the "publish or perish" world of academia. The research journals are filled with ideas that are ripe for the plundering. Entire branches of research from decades ago were simply dropped because they were a solution without a problem, or the author achieved tenure. Start here, and you might find something useful.

This is one of the harder paths of entrepreneurship. Still, if you're smart enough to develop a complete understanding of a complex subject, and translate it into something useful and practical, you can turn obscurity into opportunity. Google did it with PageRank. More recently, Pittsburgh Pattern Recognition is a company that went this route, taking state of the art facial mining software and applying it to old episodes of Star Trek as an effective demonstration of their technology.

Solve a business problem

Are you selling to consumers? It's a great start because you're a consumer and you know what you like, and it's nice to build something that your friends can use. But many startups quickly switch gears and sell their services to businesses instead. Why? Simple scaling. You have to be big to handle sales, marketing, and support, while still finding the resources to develop your core technology. Dealing with ten thousand customers requires a lot more infrastructure than dealing with ten.

Quack.com started off with the dream of running a phone portal, to let people call in and get their email, news, and other information using voice recognition. But their first paying customer was the Lycos search engine, and later on they were bought by America Online, which gave them the resources to scale. (Unfortunately things didn't work out under AOL's direction.)

Businesses expect a more complete solution. For example: Kelly sells popcorn to consumers in a mall for $3 a bag, works 8 hours a day, and after rent makes $200 a day. "Grandpa" Joe sells popcorn to business -- for $500 he'll cart his popcorn machine to your corporate event. Over three hours, he'll make cheery conversation while doling out the yellow goodness, and clean up when he leaves, and all it takes is a phone call.

To jumpstart things, use the fortune cookie trick. Take an idea that is traditionally marketed to consumers, and add "to business" after it. It doesn't take much imagination to figure out a way that a consumer business can be converted to a B2B.

We provide...
social networks
video and photo sharing
contact management
news and blog post aggregation
greeting cards
online training
...to business

Do what you know

Ali "Brickbreaker" Asaria is the founder of Well.ca, Canada's online drugstore, and the son of a pharmacist. Ali noticed that the market for health and beauty products was under-served -- many other online drugstores seem focused on drugs only. He created an extensive business plan, built a web site, sought funding, and now runs a successful company.

Be flexible

The odds are that you will end up building something completely different from what you started with. Mike Lazaridis founded Research in Motion, the creator of BlackBerries, in 1984 to do custom hardware development. An early product was a digital barcode system for Hollywood film editing. Later, they contracted with another company to do wireless point of sale terminals. That worked out pretty well, and they were left with a bunch of radio modems lying around the office. The rest is history.

Start your engines...

Ideas by themselves are of little worth. It is the hard work, personal risk, sleepless nights, plus a lot of luck and networking that has real value and can help you be successful. How many hundreds of creative minds used flash to stream video, or created tiny video streaming websites, before someone went and built YouTube? Having an idea is the easy part. The real issue is finding the spark of excitement and drive to follow through with it, even when the world is saying you will fail.

Game Theory, Salary Negotiation, and Programmers

2009-06-25T22:41:12-05:00

Disclaimer: Use these tips at your own risk. Don't get career advice from bloggers.

When you get a new job, you can breathe a sigh of relief, but not for long. You have an offer letter in your hand, and it is easy to miss one of the most important opportunities of your life: the starting salary. Here's the tale of two programmers. When getting a job, Goofus didn't negotiate. Gallant asked and got an extra $2500. They both get yearly raises of 3%.

Year	Goofus	Gallant
1	45000	47500
2	46350	48925
3	47740	50392
4	49172	51904
5	50647	53461

After five years, Gallant has made an extra $13272, enough to get his car paid off, or keep his Macbook software up to date.

Goofus is in prison because he had to become a spam lord to pay child support for his six kids.
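
The salary table can be reproduced with a few lines of C. This is only a sketch: total_pay is a made-up helper, and it keeps the fractions of a dollar that the table drops.

```c
/* Total pay over `years`, starting at `start` with a fixed yearly raise.
 * `raise` is a fraction, e.g. 0.03 for 3%. */
double total_pay(double start, double raise, int years)
{
    double salary = start;
    double total = 0.0;
    for (int year = 1; year <= years; year++) {
        total += salary;
        salary *= 1.0 + raise;          /* compound the raise for next year */
    }
    return total;
}
```

total_pay(47500, 0.03, 5) - total_pay(45000, 0.03, 5) comes to about 13272.84, so the $2500 head start is worth more than five times its face value after five years.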

Everything in life is negotiable. C.E.O.s and corporate executives are simply people that learned this at an early age. The things that are most negotiable are the things written in black and white in indelible ink. They are engraved in silver, carved in stone, simply to trick you into thinking you cannot negotiate. "Just sign here. It's a formality." "It's a preprinted form, it can't be changed."

Do not be intimidated. Salary negotiation is a game, and the first to give a number loses.

When it goes wrong?

A friend once confided in me one of the biggest mistakes of his life. He got a job in technology, and they asked him his salary expectations. Just being out of school, he gave them a very low number for the type of work, in the low 40s. Then the worst thing happened: They gave him the job, and his lowball salary. He felt unmotivated and ripped off. He wasn't working there long.

He broke the golden rule of salary negotiation. If you say a number, you lose. If you are asked on a form, leave it blank. If someone is pressing you for a number, just repeat: "I expect to be paid fairly based on my skills." The chances of you mentioning a number and getting it right are low.

Let's prove it using some game theory. In the rules of this game, there are two possible salaries: high and low. You and the company's recruiter both write your salary expectations on separate cards, and then you each flip them over one at a time. Sounds simple, right? Here's the catch: The second person to go gets to change their vote after the first move. If both cards match, you get the job at the agreed upon salary. Here are all possible outcomes if the job applicant goes first:

Your expectations (rows) vs. company's expectations (columns)
	Company: Low	Company: High
You: Low	LL	LL (changed)
You: High	HL	HH

You only get the high salary 25% of the time, and in one unfortunate case (HL) they have security escort you from the building. Let's look at what happens if the company reveals their card first:

Company's expectations (rows) vs. your expectations (columns)
	You: Low	You: High
Company: Low	LL	LL (changed)
Company: High	HH (changed)	HH

When the company is the first to give a number, you always get the job, and you have a 50% chance of getting a high salary.
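
Both tables can be checked by brute force. The sketch below is my own reading of the rules, not anything official: the second mover changes cards only when it helps them (the company drops to low if you asked low, and you match whatever the company shows). All names are hypothetical.

```c
enum salary { LOW, HIGH };
enum outcome { NO_JOB, JOB_LOW, JOB_HIGH };

/* Applicant reveals first. The company may change its card afterwards:
 * it drops to LOW if the applicant asked LOW, but it never raises a LOW
 * card to match a HIGH ask. */
enum outcome applicant_first(enum salary you, enum salary company)
{
    if (you == LOW)
        return JOB_LOW;                           /* LL, or HIGH changed to LOW */
    return company == HIGH ? JOB_HIGH : NO_JOB;   /* HH or HL */
}

/* Company reveals first. The applicant moves second and simply matches
 * whatever the company showed. */
enum outcome company_first(enum salary company, enum salary you)
{
    (void)you;                                    /* your card is changed to match */
    return company == HIGH ? JOB_HIGH : JOB_LOW;
}

/* Count how many of the four card combinations end in a high salary. */
int count_high(int company_goes_first)
{
    int high = 0;
    for (int you = LOW; you <= HIGH; you++) {
        for (int co = LOW; co <= HIGH; co++) {
            enum outcome o = company_goes_first
                ? company_first((enum salary)co, (enum salary)you)
                : applicant_first((enum salary)you, (enum salary)co);
            if (o == JOB_HIGH)
                high++;
        }
    }
    return high;
}
```

count_high(0) is 1 out of 4 when you go first, and count_high(1) is 2 out of 4 when the company does, which matches the 25% and 50% figures.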

When it goes right

Another friend of mine was working in a job that didn't challenge her, so she applied around and was interviewed for a much better position. The problem was, it wasn't paying that much more than her old job. For the same money, she could keep her old job and play on the Internet much of the time.

She called the recruiter and said she would love the position and was very excited about it, and looking forward to the new challenges, except for the trifling little detail of the salary, wasn't there anything they could do about that? A few hours later she got the job -- at much higher pay.

The manager trying to hire you has chosen you over all the other candidates, and this gives you the upper hand. In a larger company, the person you will be working for often has no idea what happens after the interview, since you are now dealing with an HR person. If you get away, this HR person has failed, and it will be disappointing to the manager. Use this to your advantage.

Final tips

The economy is turning around, and more jobs are popping up. The fact that you are reading this blog means that you have an interest in your craft, and this puts you in the top 10% of candidates. If you are on the hunt for a job, remember these tips:

Know what you are worth. Do salary research on Monster.com or Monster.ca, or ask your friends if they have any information on pay scales.
The opportunity to negotiate is after you've been interviewed, but before you've accepted. Don't even mention salary during the interview, unless they bring up the subject first. It is in your interest to postpone salary discussion as late as possible, after they are sure you are the best candidate.
During or after the interview, do try to get a sense of whether you are the preferred candidate.
Companies will rarely give you the opportunity to negotiate, or even bring up the topic of salary at all. You will probably have to call them about it. They will say no. Be persistent if you are confident in your skills.
If you are first to mention a number, you are a loser.
If the company mentions a number, then do the following: Repeat it back to them, then stare at them while counting to 30 in your head. 90% of the time they will then give you a higher number.
Do try to negotiate, even if the salary surpasses your wildest dreams.
Even if your asking doesn't work right away, it may be remembered and lead to more perks later on.

Coding tips they don't teach you in school

2009-06-23T09:45:19-05:00

Here are some C coding tips, because I have been unable to post anything for a while.

Some of these time-saving shortcuts are intended for small projects or prototyping code.

The nested ? : trick

The switch statement is very efficient, and the compiler will often implement it as a table lookup so it doesn't have to do any comparisons. But it sure can tire your fingers in a hurry:

switch( number ) {

    case 1: str = "one"; break;
    case 2: str = "two"; break;
    case 3: str = "three"; break;
    case 4: str = "four"; break;
    case 5: str = "five"; break;
    default: str = "unknown number"; break;
}

If getting the code out is more important than its speed, you can use a nested conditional operator to save typing:

str = number == 1 ? "one" : 
      number == 2 ? "two" :
      number == 3 ? "three" :
      number == 4 || rand() == 42 ? "four" :
      number == 5 ? "five" :
      "unknown number";

You can freak out your coworkers if you do it all on one line.

DeMorgan's theorem of negativity

// if blah blah blah blah blah blah....
if ( !(A && B) )

is the same as

// if blah blah
if ( !A || !B )

Pick the one that is easier to understand and read out loud. It is usually the second.
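
If you don't trust the algebra, there are only four cases to try. A small sketch (the function name is mine) that checks this form of the theorem and its mirror image, !(A || B) versus !A && !B:

```c
/* Returns 1 if both forms of each condition agree for every A and B. */
int demorgan_holds(void)
{
    for (int a = 0; a <= 1; a++) {
        for (int b = 0; b <= 1; b++) {
            if ((!(a && b)) != (!a || !b))
                return 0;
            if ((!(a || b)) != (!a && !b))
                return 0;
        }
    }
    return 1;
}
```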

Rounding numbers using integer math

Suppose you have 2001 boolean variables, so you want to keep them in a bitmap of bytes, 8 at a time. How do you declare this array?

#define NUMBER_OF_BITS 2001
unsigned char bitmap[ NUMBER_OF_BITS / 8 ]; // FAIL!

The computer uses integer math, so 2001 / 8 is 250 and there is one bit left over. When you store the last bit, you will write past the end of the array and corrupt memory. You can round numbers up like this:

unsigned char bitmap[ (NUMBER_OF_BITS + 7) / 8 ];

This works because integer division always rounds down. However, if you add the divisor less one, then you will force it to always round up. It will also work for 32 bit integers:

unsigned int bitmap[ (NUMBER_OF_BITS + 31) / 32 ];
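
Here is a sketch of how the rounded-up array might be used. The helper names are hypothetical, bits are numbered from 0, and it assumes 8-bit bytes as in the text.

```c
#define NUMBER_OF_BITS 2001
#define BITS_PER_BYTE  8

/* Round up so that the last, partly used byte still gets storage. */
unsigned char bitmap[(NUMBER_OF_BITS + BITS_PER_BYTE - 1) / BITS_PER_BYTE];

void set_bit(unsigned bit)
{
    bitmap[bit / BITS_PER_BYTE] |= (unsigned char)(1u << (bit % BITS_PER_BYTE));
}

void clear_bit(unsigned bit)
{
    bitmap[bit / BITS_PER_BYTE] &= (unsigned char)~(1u << (bit % BITS_PER_BYTE));
}

int test_bit(unsigned bit)
{
    return (bitmap[bit / BITS_PER_BYTE] >> (bit % BITS_PER_BYTE)) & 1;
}
```

The array is 251 bytes long, so set_bit(2000) lands safely in the last byte instead of one past the end.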

Real rounding via integer division

Adding something before dividing is a general technique. This code rounds a and b to the nearest 10.

int a = 12;
int b = 16;

printf("Round a: %d\n", (a + 5) / 10 * 10 );
printf("Round b: %d\n", (b + 5) / 10 * 10 );
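
The same trick works for any step size if you add half the step before dividing. A minimal sketch with a made-up function name. It assumes non-negative inputs, because integer division truncates toward zero (since C99) and would round negative values the wrong way.

```c
#include <assert.h>

/* Round a non-negative value to the nearest multiple of `step`.
 * Ties round up. */
int round_to_nearest(int value, int step)
{
    assert(value >= 0 && step > 0);
    return (value + step / 2) / step * step;
}
```

round_to_nearest(12, 10) is 10 and round_to_nearest(16, 10) is 20, the same answers as the code above.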

Multiply by arbitrary numbers using shift

Sometimes, for good reasons, you need to hand optimize your code. But most of the time, somebody is just showing off, and you will see code like this:

b = ( a << 1 );

This is the same as multiplying by 2. Because of the way binary numbers are represented, you can multiply by any power of two by shifting by that power. You can also multiply by other numbers.

// b = 10*a, which is 8*a + 2*a
b = ( a << 3 ) + ( a << 1 );

Division isn't as nice looking.
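
A sketch of both directions, with hypothetical function names. I use unsigned values so that the shifts are well defined for every input. Only division by a power of two stays this simple.

```c
/* Multiply by 10 with shifts: 10*a = 8*a + 2*a. */
unsigned times_ten(unsigned a)
{
    return (a << 3) + (a << 1);
}

/* Division by a power of two is still a single shift for unsigned values. */
unsigned divide_by_eight(unsigned a)
{
    return a >> 3;
}
```

In practice a modern compiler will usually make this substitution for you when you write a * 10, so the plain version is the one to check in.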

Getting array sizes

In C, you can get the total size of an array in bytes using the sizeof operator. Dividing it by the size of each element will give you the number of elements in the array.

MyHugeStructure array[100];
int i;

for( i = 0; i < sizeof(array)/sizeof(*array); i++ ) 
{
    array[i].id = i;
}

The downside is that if you ever change the array to be dynamically allocated using new or malloc(), then sizeof() will only return the pointer size. Of course, in a large project you should have a defined constant for the number of entries.
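
Many projects wrap the division in a macro. The sketch below uses the common ARRAY_SIZE name. Record is a hypothetical stand-in for MyHugeStructure, and number_records is just an illustration.

```c
#include <stddef.h>

/* Number of elements in a true array. Do not use it on a pointer. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

typedef struct { int id; } Record;      /* stand-in for MyHugeStructure */

Record records[100];

/* Fill in the ids and return how many elements were visited. */
size_t number_records(void)
{
    size_t i;
    for (i = 0; i < ARRAY_SIZE(records); i++)
        records[i].id = (int)i;
    return i;
}
```

Using size_t for the index also avoids the signed/unsigned comparison warning that an int counter can trigger.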

Else Flattening

Deeply nested else clauses are hard to understand. But if they look like this, you are in luck:

if ( A ) {

} else {
   if ( B ) {

   } else {
       if ( C ) {

       }
   }
}

Save money on wide screen monitors by flattening the elses.

if ( A ) {

} else if ( B ) {

} else if ( C ) {

}

Black is White and Up is Down if you're not using a mainstream compiler

If your code has to run on a lot of different platforms, you should know that GCC and Microsoft VC++ are very similar and this similarity will trick you into thinking that all compilers are alike. But the C standard leaves a lot of stuff out that will surprise you, if you do work on embedded systems with crappy, but standards-conforming compilers.

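What is sizeof(char)? It is always 1, by definition, but a char is not always 8 bits. Switch to a compiler for some DSPs, and a char might be 16 or 32 bits wide, the same size as an int. The C standard only says that a char has at least 8 bits and cannot be larger than an int.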

What happens if you add something to the maximum integer?

int a = INT_MAX; // from <limits.h>
a++;
printf("%d\n", a);

In mainstream compilers you usually get a negative number, since the number space wraps around. But the C standard says the result is undefined. So some compilers will leave a at the maximum value, INT_MAX.

And don't even think about bit-shifting a negative number:

int a = -2;
printf("%d\n", a >> 1 );
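
Right-shifting a negative number is implementation-defined in C, so the result above depends on the compiler. If the code has to behave the same everywhere, one option is to avoid the risky operations entirely. A sketch, with helper names that are mine:

```c
#include <limits.h>

/* Add 1 without ever executing a signed overflow: saturate at INT_MAX. */
int increment_saturating(int a)
{
    return a == INT_MAX ? INT_MAX : a + 1;
}

/* Halve a possibly negative number without shifting it.
 * Division truncates toward zero in C99 and later, so -3 / 2 is -1. */
int halve(int a)
{
    return a / 2;
}
```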

When a reporter mangles your elevator pitch

2009-06-01T18:00:30-05:00

If a reporter asks you about your new startup company, be careful what you say.

The statement that sounds best will be quoted.
Some of what you say will be re-ordered or deleted.
Long, rambling descriptions will be paraphrased and condensed.

Here is a pitch from a new startup company, taken from an article in The KW Record on Wednesday, April 1st, 2009:

"We are positioning ourselves to disrupt the entire computing experience," said Ted Livingston, co-founder of Unsynced, an emerging company that won free patent-filing services from law firm Miller Thomson LLP at yesterday's competition for the students from the VeloCity residence...
The initial software will let people keep all of their music files on their BlackBerry and also be able to manage those tunes from any computer without having to download an application like Apple's iTunes.
"All of your music can be on your BlackBerry, but if you want to play it on a computer anywhere in the world, you just plug it in," said Livingston, a third year engineering student.

This pitch starts with a generic, ho-hum opening, the kind that makes my eyes skip down a couple of paragraphs. The reporter is only going to give you a few sentences, so use them wisely, and don't throw them away with verbal fluff about "disrupting the entire computing experience."

The pitch fails to differentiate itself from existing solutions. This is no different from a $5 USB key. You can keep your music files on it. When you stick it in your computer a player appears and starts playing your music. In fact, the BlackBerry already has mass storage support, so when you plug it in your files appear without any special software.

To be fair, perhaps Livingston couldn't give too much away, if he is going to use his new, free patent filing services.

The pitch below did get me excited:

Remember how you struggled to not show your disappointment at Christmas, when your Aunt May gave you a gift card to a book store but you really wanted the cash toward your new cellphone?
Giftah.com will replace that disappointment with smiles by helping people sell those unwanted gift cards, said co-founder Nick Belyaev, a fourth year UW math student.

I want to use this web site now. Unfortunately, the ideas that sound best may not always be the most successful in practice. According to Joel Spolsky, the ideas that work sound the dumbest:

If you explain it, and everyone says "Oh yeah, that would work, I'm surprised that's not being done," then it is being done. However, if you explain it and they say, "That wouldn't work, because of blah. It could never possibly work. You could never have auctions on the Internet because people are untrustworthy and they will use it to steal your money by pretending to sell you a laptop and not sending you the laptop, so you can't have auctions on the web." But as it turns out, you can have auctions on the web.
Whatever the idea is, it has to have a fatal flaw at first glance -- or has to sound like a terrible idea. You have to believe in it for some reason, which you just have trouble explaining to anyone except your brother-in-law who joins you in your startup, or your college roommate who doesn't really get it. Because you do need someone to join you, but the idea has to be not obvious and it has to sound bad. Otherwise it's getting done.

If you have a company, and a reporter asked you to explain it, what would you say?

Update in 2011: Ted recently donated $1M to the University of Waterloo. So it seems he has improved his elevator pitch in the past two years!

Test Driven Development without Tears

2009-05-13T20:46:35-05:00

Every company that I have worked for has its own method of testing, and I've gained a lot of experience in what works and what doesn't. At last, the stack of conflicting confidentiality agreements that I got as a co-op student has expired, so I can talk about it. (I never signed them anyway.)

Warning: My recollection of events may be different from what actually occurred. Do you remember what you were doing 10 years ago?

Don't make testing a pain in the ass.

Two places I worked at had test systems that were painful to use. They were Microsoft and Soma.

Microsoft's is the worse of the two. The system of my particular team in 2000 was so bad that developers couldn't even use it. Only dedicated Software Test Engineers had access to it and could run the full suite. Setting it up on a new PC could not be done in a day. Eventually at the end of the summer I managed to get a copy running, and it found lots of bugs in my module, but by then I ran out of time to fix them. Yes, this was all my fault because I should have been running the tests earlier. I didn't run them because they were so darn hard to get a hold of and set up.

Testing must be done continuously and during development, by developers.

Soma had the right idea. They were working on a software phone that ran on Linux. Before checking in a change, we were required to write a test for it and run the full regression suite without any failures.

The software and tests were all written in Java, so to create a test, you'd create a Java object with a procedure that called the high level APIs to start a phone call, and figure out a way to use the feature you were working on.

Some of the features needed a complex set of steps to invoke. If you wanted to test the user hitting a #-code to add a participant during a five-way conference call, you had to first set up five fake calls using the Java API. There were functions for dialing a number and such. Whenever an API changed, all the test objects would have to be updated.

The test system was downright painful to use. Early versions ran in real time, with real timeouts. To test that the dial tone only played for 30 seconds before the off-hook beeps started, you had to sit there and wait while the system did nothing until the timer expired. There was a speeded-up mode where timers would expire immediately. But even in speed mode, the Java system was so bogged down that it would take several hours to run everything. While I was there, they were in the process of building a cluster of 100 PCs just to run all the tests.

The problem was that all of the tests were system level. Programming in Java eventually leads to a system that I call object-soup: More abstraction can seem to be better, but it leads to more classes, and more classes have more references to each other. The system grew so complex that there was no way to reset the state to the middle of a five-way call, without actually performing all of the call setup. Each one had to be set up for every test. Unit testing of an individual class was meaningless, but it was impossible to separate a feature from the rest of the system.

You can't test GUIs

In 1999, Corel's flagship product was its DRAW! Vector graphics suite. As I remember it, testing was completely manual. Of course, developers were expected to test their changes as much as possible, but CorelDRAW is a huge program with thousands of features and modes of operation. After we'd done some cursory testing, we'd check a change into Visual Source Safe and wait for bugs to come in from beta testers. When a bug was submitted, the test specialist for my team, Mona, would reproduce it and create an issue report.

The system was clearly broken. We had about 100,000 open bugs and even some of the more serious ones (copying text with certain bullet styles results in a crash) had been unfixed for several versions.

The problem was the lack of any regression testing. We relied too much on developers manually going through all the code paths, and beta testers to submit reports. The engines were physically separated into libraries, but they were tightly coupled into the UI, so automated testing was impossible. If you broke something in a little-used feature, it might be years until someone noticed.

Regression testing is only feasible if the process is automated, but to this day, I still can't think of a good way to automatically test GUIs. If your program has a graphical user interface, you are stuck with laborious manual testing. The best thing you can do is to separate your program into two parts: a user interface part, and an engine part that does the real work. The user interface part will have to be tested by moving the mouse and going through all the options. The engine part has to have a well defined API, with inputs and outputs that you can test automatically.

Note: Most people stop reading at this point.

Three rules for Test Driven Development

In a modern software company, developers should be running some kind of regression tests with every change they make. If you make it painful, then your developers will be unproductive. If you make it painful and mandatory, then your developers will be unproductive and unhappy. For effective test driven development, you need three things:

The test system should support the developer, not the other way around.
A developer must be able to create a new test in under 10 minutes.
The test suite should be capable of running hundreds of tests in under 5 minutes.

If the test system is painful to use and create tests for, then developers will not create tests. There has to be some payoff for spending the effort of creating a test, in terms of finishing and going home early. Otherwise, the tests will be put off to later, or they won't get done, or you will have to have a separate team whose only job it is to create tests, and then you are no longer doing test driven development.

The test system must support the easy creation of tests. You should be able to take a bug report or some log from the field, run it through a tool, and out pops your test that is ready to run to reproduce the issue. If the tests are written in Java that is quite hard to do. Ideally instead of writing code, you will have some other kind of input, like a list of events that occurred since system startup. You can run the events through your system to exactly reproduce its state. These types of tests take no development effort to create, and the best part is they don't depend on function names and classes. You could rewrite your system from scratch, and as long as it takes the same input, your tests will still work.

Finally, the test suite should be fast. You're going to want your automated build system to run the tests after every few changes. If you think about it, a developer might make, on average, one or two changes a day. If you have 100 developers, you will have 100 to 200 changes a day. Developers will need to be able to run the tests at their desk, too. If passing the regression tests is mandatory before committing a change, then the suite should take only a few minutes to run, so you don't have developers leaving for a three-hour lunch, checking their stocks, or having swordfights.

File importer example

Let's say we are responsible for writing file importers for a word processor. We are fixing a bug in a file importer for Microsoft DOC format. The input is a .DOC file, and the output is a series of function calls that modify a document. So when we read some text, we call document.addText(), and when we get a new font, we call document.setFont(), etc.

But instead of maintaining a real document and displaying on the screen, we have a generic test document. When we call document.addText(), we just record this fact to a text file. So after our importer runs, we might have something like this:

called addText("This document is copyright")
called setFont("Symbol", 10)
called addText("c")
called setFont("Times New Roman", 10 )
called addText("1995")

Suppose we had hundreds of .doc files, taken from actual bug reports by actual users. We put them into a folder called "input" and check them into our source control system. We run them through the test document. This only takes a few seconds, because all it's doing is creating text files. We then take these output .txt files, and check them into our source control system.

If, one year later, I'm making a change and the output of the test suite changes in any way, then it is either a bug, an improvement, or irrelevant. I'll revise the change to fix the problem, or update the checked-in text files with the new results.
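Here is a minimal sketch of that recording test document and the comparison step, in JavaScript. The names RecordingDocument, firstDifference and fakeImport are made up for illustration; a real importer would be called where fakeImport is, and the expected text would be read from the checked-in .txt file.

```javascript
// Hypothetical test double: it records every call instead of editing a real document.
class RecordingDocument {
    constructor() {
        this.lines = [];
    }
    addText(text) {
        this.lines.push("called addText(" + JSON.stringify(text) + ")");
    }
    setFont(name, size) {
        this.lines.push("called setFont(" + JSON.stringify(name) + ", " + size + ")");
    }
    toString() {
        return this.lines.join("\n") + "\n";
    }
}

// Compare fresh output against the checked-in expected text.
// Returns null on a match, or a description of the first line that differs.
function firstDifference(expected, actual) {
    const exp = expected.split("\n");
    const act = actual.split("\n");
    const count = Math.max(exp.length, act.length);
    for (let i = 0; i < count; i++) {
        if (exp[i] !== act[i]) {
            return "line " + (i + 1) + ": expected " + JSON.stringify(exp[i]) +
                   ", got " + JSON.stringify(act[i]);
        }
    }
    return null;
}

// Stand-in for a real importer run (assumption: the importer only talks to the
// document through addText() and setFont()).
function fakeImport(doc) {
    doc.addText("This document is copyright");
    doc.setFont("Symbol", 10);
    doc.addText("c");
}
```

A test run is then: create a RecordingDocument, run the importer on one input file, and call firstDifference() with the stored text and doc.toString(). Anything other than null means the output changed and somebody has to decide whether that is a bug or an improvement.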

We now have a regression test system that is easy to use and quick to run. The tests are not Java files. They take absolutely no effort to write -- we just save an attachment. In fact, it's easier to fix an issue after creating a test for it, because you can run it over and over again. Developers will naturally create tests as part of doing their job, without even being asked.

Now we can re-write stuff without fear. We can re-write the file importers from scratch, and make sure they work on all the same documents in exactly the same way. We can also re-write the rest of the system, as long as the interface to addText() and setFont() still works the same way.

Sure, there are some bad parts. If you change document.setFont() so that it needs a font encoding parameter, you will need to update all the test scripts. But these changes aren't difficult to manage, and the benefits far outweigh the inconvenience.

In Conclusion

If you are setting up a regression test system, it should be effortless to create a new test, and it should be able to run hundreds of tests in five minutes. Most importantly, the test system should make it easier to fix bugs, so developers will naturally want to create new tests.

Drawing Graphs with Physics

2009-05-08T21:42:40-05:00

I use graphviz whenever I need to draw state machine diagrams. Drawing circles connected with lines is a hard problem for computers, because they have to decide where to place the circles so the diagram makes sense. These types of diagrams are called graphs.

To my surprise, I found that there is a very simple way to arrange graphs that can be expressed in only a few lines of code, using force-directed placement [Fruchterman, 1991]. We pretend that the nodes on the graph all strongly repel each other. On the other hand, nodes that are connected attract each other with a weaker force. It is as if you had a bunch of statically charged styrofoam balls connected with springs, and it's very fun to watch.

Here's the code that moves the nodes around.

Springs.prototype.recalc = function()
{
    var width = this.ctx.canvas.width;
    var height = this.ctx.canvas.height;

    // K is related to how long the edges should be.
    var k = 200.0;

    // C limits the speed of the movement. Things become slower over time.
    var C = Math.log( this.frame + 1 ) * 100;
    this.frame++;

    // calculate repulsive forces
    for ( var vindex = 0; vindex < this.graph.nodes.length; vindex++ ) 
    {
        var v = this.graph.nodes[vindex];

        // Initialize velocity to none.
        v.vx = 0.0;
        v.vy = 0.0;

        // for each other node, calculate the repulsive force and adjust the velocity
        // in the direction of repulsion.
        for ( var uindex = 0; uindex < this.graph.nodes.length; uindex++ ) 
        {
            if ( vindex == uindex ) {
                continue;
            }

            var u = this.graph.nodes[uindex];

            // D is short hand for the difference vector between the positions
            // of the two vertices
            var Dx = v.x - u.x;
            var Dy = v.y - u.y;
            var len = Math.pow( Dx*Dx+Dy*Dy, 0.5 ); // distance
            if ( len == 0 ) continue;
            var mul = k * k / (len*len*C);
            v.vx += Dx * mul;
            v.vy += Dy * mul;
        }
    }

    // calculate attractive forces
    for ( var eindex = 0; eindex < this.graph.edges.length; eindex++ ) 
    {
        var e = this.graph.edges[eindex];

        // each edge is a pair of nodes, e[0] and e[1]
        var Dx = e[0].x - e[1].x;
        var Dy = e[0].y - e[1].y;
        var len = Math.pow( Dx * Dx + Dy * Dy, 0.5 ); // distance.
        if ( len == 0 ) continue;

        var mul = len * len / k / C;
        var Dxmul = Dx * mul;
        var Dymul = Dy * mul;
        // attract both nodes towards each other.
        e[0].vx -= Dxmul;
        e[0].vy -= Dymul;
        e[1].vx += Dxmul;
        e[1].vy += Dymul;
    }

    // Here we go through each node and actually move it in the given direction.
    for ( var vindex = 1; vindex < this.graph.nodes.length; vindex++ ) 
    {
        var v = this.graph.nodes[vindex];
        var len = v.vx * v.vx + v.vy * v.vy;
        var max = 10;
        if (this.frame > 20) max = 2.0;
        if ( len > max*max ) 
        {
            len = Math.pow( len, 0.5 );
            v.vx *= max / len;
            v.vy *= max / len;
        }
        v.x += v.vx;
        v.y += v.vy;
    }
};

It's really fun to play with this method. Here's what happens if we add a bit of gravity to the system. (Click to reset)

Coding that made me want to play The Incredible Machine.

Keeping Abreast of Pornographic Research in Computer Science

2009-04-25T08:00:09-05:00

Burgeoning numbers of Ph.D.s and grad students are choosing to study pornography. Techniques for the analysis of "objectionable images" are gaining increased attention (and grant money) from governments and research institutions around the world, as well as Google. But what, exactly, does computer science have to do with porn? In the name of academic pursuit, let's roll up our sleeves and plunge deeply into this often hidden area that lies between the covers of top-shelf research journals.

Lena

One cannot do research in image processing without an encounter with Lena (pronounced Lenna). The image of the woman with a feathered hat has become the de-facto test image for many algorithms, and appears in thousands of articles and conference papers. And it is of pornographic pedigree:

Alexander Sawchuk estimates that it was in June or July of 1973 when he, then an assistant professor of electrical engineering at the USC Signal and Image Processing Institute (SIPI), along with a graduate student and the SIPI lab manager, was hurriedly searching the lab for a good image to scan for a colleague's conference paper. They had tired of their stock of usual test images, dull stuff dating back to television standards work in the early 1960s. They wanted something glossy to ensure good output dynamic range, and they wanted a human face. Just then, somebody happened to walk in with a recent issue of Playboy.
The engineers tore away the top third of the centerfold so they could wrap it around the drum of their Muirhead wirephoto scanner, which they had outfitted with analog-to-digital converters (one each for the red, green, and blue channels) and a Hewlett Packard 2100 minicomputer. The Muirhead had a fixed resolution of 100 lines per inch and the engineers wanted a 512 x 512 image, so they limited the scan to the top 5.12 inches of the picture, effectively cropping it at the subject's shoulders.

The rest of the story (and the rest of Lena) can be found here. Indeed, the 70s marked the beginning of a long relationship between computer science and pornography. However, after the birth of the world wide web, things really got hot and heavy.

Finding Naked People

In the 1990s the world wide web began to explode, pumping information of all kinds into the homes of the technologically savvy at rates as high as 9600 bits per second. It was the time when search engines such as Webcrawler, Altavista, and Yahoo began the arduous task of spidering the scattered bits of information in Internet servers everywhere. The problem was that someone might search for a completely innocuous query such as the Trojan Room Coffee Pot, and come up with images that were unexpected and inappropriate, and depending on one's tastes, objectionable.

It's not likely to be on his business card, but David A. Forsyth is an expert in web pornography, having served on the NRC committee for this topic. It is evident from his web page that he has a sense of humour, which explains the superbly descriptive title for his 1996 paper, Finding Naked People. Forsyth was one of the first researchers to study the problem of identifying objectionable content.

One of Forsyth's research areas is tracking people in images and videos and figuring out their pose. In the general case, the system has to cope with the fact that people can wear clothes. It would be easier if the subjects all wore the same colour, or didn't wear anything at all. Finding Naked People describes a way of first masking out areas of skin. The areas are then grouped together into human figures (visualized by drawing a stick figure on the image). The crux of the paper is the grouping algorithm. The grouper knows rules such as how limbs fit together into a body, and the fact that a person cannot have more than two arms. Using the rules, it figures out how to superimpose a body onto the skin patches. If it can successfully do this, the image is probably a naked person. If it cannot, then it is probably something else, like a lamp.

Here is a visualization of the skin probability field from the paper, with the grouper output segments superimposed on top:

More probability masks can be found in Proceedings of the 4th European Conference on Computer Vision, volume II on page 598. Be careful -- the pages tend to stick.

It's better with more than one

Finding Naked People piqued a lot of interest in the field of objectionable images, and the skin matching idea is now the first step in many algorithms. However, as James Ze Wang of Stanford notes, "it takes about 6 minutes on a workstation for the figure grouper in their algorithm to process a suspect image passed by the skin filter."

In their System for Screening Objectionable Images, Wang and his colleagues describe the WIPE™ method for screening content. They use a wavelet edge detection algorithm to obtain the shape of the image. Edge detection transforms an image into the outlines of the object. Wavelet edge detection allows them to tune it to detect sharp or increasingly blurry edges until well-defined shapes appear.

Image moments allow one to treat any shape as a flat, physical object (like a plate). You can figure out the centre of gravity, axis of symmetry, and other properties that don't change when you move, rotate, or change the size of the object. This typically results in a set of 3 to 7 numbers that you can use to compare how similar shapes are. They were used in early OCR (optical character recognition) algorithms circa 1962.
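To make the plate analogy concrete, here is a small sketch that computes moments of a binary image (an array of rows of 0/1 values). The function names are mine, and the final number is the first of Hu's moment invariants, which I picked as a representative example. It is not necessarily one of the numbers Wang's system uses.

```javascript
// Raw moment M_pq: sum of x^p * y^q over every pixel that is set.
function rawMoment(image, p, q) {
    let sum = 0;
    for (let y = 0; y < image.length; y++) {
        for (let x = 0; x < image[y].length; x++) {
            sum += Math.pow(x, p) * Math.pow(y, q) * image[y][x];
        }
    }
    return sum;
}

// Centre of gravity of the shape. Assumes at least one pixel is set.
function centroid(image) {
    const area = rawMoment(image, 0, 0);
    return { x: rawMoment(image, 1, 0) / area, y: rawMoment(image, 0, 1) / area };
}

// Central moment mu_pq: measured relative to the centroid, so moving the
// shape around the image does not change it.
function centralMoment(image, p, q) {
    const c = centroid(image);
    let sum = 0;
    for (let y = 0; y < image.length; y++) {
        for (let x = 0; x < image[y].length; x++) {
            sum += Math.pow(x - c.x, p) * Math.pow(y - c.y, q) * image[y][x];
        }
    }
    return sum;
}

// Normalized central moment eta_pq: dividing by a power of the area removes
// most of the dependence on size (only approximately, on a pixel grid).
function normalizedMoment(image, p, q) {
    const area = rawMoment(image, 0, 0);
    return centralMoment(image, p, q) / Math.pow(area, 1 + (p + q) / 2);
}

// First Hu invariant: eta_20 + eta_02.
function firstHuInvariant(image) {
    return normalizedMoment(image, 2, 0) + normalizedMoment(image, 0, 2);
}
```

Two copies of the same shape at different positions give the same firstHuInvariant() value, which is what lets you compare shapes by comparing a handful of numbers.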

Wang uses both edge detection and image moments in the analysis. His algorithm is different from modern ones, because an image must pass a series of five YES/NO tests. Future algorithms would combine the detectors using statistical methods and give a probability estimate.

If the image is small, it is assumed to be an icon, and allowed. Icons (such as a mail envelope) were frequently used on the world wide web in the 1990s.
If the image contains few continuous tones, it is considered to be a drawing and is allowed to pass.
If a great portion of the colours of the image are human body colors, then the image is rejected as porn. The algorithm is pretty smart -- if a patch identified as skin has lots of edges in it, it is probably not really skin and is removed from the analysis. (This also counts as the texture matching step)
Finally, the edge (outline) image is converted into 21 numbers representing the translation, scale, and rotation invariant moments. If the 21 numbers are too close to anything already in the database, the image is rejected.

Here are some examples where the algorithm fails. We have blurred them to protect the eyes of the gentle reader. For high resolution versions, you'll have to refer to Proceedings of the 4th International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services on page 20 (the dog-eared one).

Getting a leg up on skin models

Skin detection is an important step in porn detection, but figuring out which colours represent skin is a hard problem. Colour depends on the lighting used in the photo, the ethnicity of the participants, and the quality and noise level. Michael J. Jones and James M. Rehg at Compaq studied the problem in detail. They first manually labeled hundreds of images, highlighting all the areas that were skin using a custom drawing application. Once you have billions of pixels that you know are skin, and billions that you know are not, you can easily classify them using introductory math:

The paper describes how to find the probability function, P, using a database of images painstakingly highlighted by an army of enthusiastic research interns. However, as a porn detector, the method needs work.
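One simple form that introductory math can take is a likelihood ratio over colour histograms: count how often each colour shows up among the skin pixels and among the non-skin pixels, and call a pixel skin when it is sufficiently more likely under the skin counts. The sketch below assumes that form. The bin count, the threshold theta and all the function names are placeholders of mine, not values from the paper.

```javascript
// Quantize an 8-bit RGB colour into one of bins^3 histogram cells.
function binIndex(r, g, b, bins) {
    const step = 256 / bins;
    return (Math.floor(r / step) * bins + Math.floor(g / step)) * bins + Math.floor(b / step);
}

// Count labelled pixels (an array of [r, g, b] triples) into a histogram.
function buildHistogram(pixels, bins) {
    const counts = new Float64Array(bins * bins * bins);
    for (const [r, g, b] of pixels) {
        counts[binIndex(r, g, b, bins)]++;
    }
    return { counts: counts, total: pixels.length, bins: bins };
}

// P(rgb | class): the fraction of that class's labelled pixels that landed
// in the same histogram cell as this colour.
function likelihood(hist, r, g, b) {
    return hist.counts[binIndex(r, g, b, hist.bins)] / hist.total;
}

// Skin if P(rgb | skin) / P(rgb | not skin) >= theta. Written without the
// division so a colour never seen in either set is simply "not skin".
function isSkin(skinHist, nonSkinHist, r, g, b, theta) {
    const pSkin = likelihood(skinHist, r, g, b);
    const pNonSkin = likelihood(nonSkinHist, r, g, b);
    return pSkin > 0 && pSkin >= theta * pNonSkin;
}
```

Raising theta makes the detector pickier (fewer false alarms, more missed skin), and lowering it does the opposite.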

It will be obvious to anyone who has bought a digital camera recently how to improve this system. Before reading on, can you spot the solution?

Taking the ogle out of Google

In recent years, Google has had its hands full with the problem of pornographic imagery. Henry A. Rowley, Yushi Jing, and Shumeet Baluja at the Mountain View campus have developed a system that combines skin detection with a number of different features. After applying face detection, they can deduce that the pixels around the face represent skin colour, and therefore find other skin pixels in the image. If the face is the majority of the image, as in a portrait, the image is safe. They use a colour histogram to detect artificial images such as screen shots. (So dirty cartoons are safe?)

Doing what only Google could, they must have set a record for the rate of pornographic analysis. They evaluated the speed of the algorithm on a corpus of around 1.5 billion thumbnail images of less than 150x150 pixels. "Processing the entire corpus took less than 8 hours," the team bragged, "using 2,500 computers."

Bags of visual words (Arm, leg, or . . .?)

In 2008, Thomas Deselaers et al. came up with a unique way of finding porn, from the world of artificial intelligence. Large news databases can automatically classify news articles based on the words in them. Articles containing the names of political figures or sports jargon can be easily categorized by machines that don't need to really understand what the article is about. Techniques exist so that the machines can learn on their own which words or names are important. The same methods can be applied to images, using visual words.

To create the visual vocabulary, they extract image patches around "points of interest", parts of the image that are likely to contain features. They are then scaled to a common size, and analyzed using PCA to find commonalities. It is similar to face detection, but for things that aren't faces. It also takes colour into account in the analysis. Because colour is a part of the "vocabulary" already, skin detection is unnecessary.
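Once a vocabulary exists, turning an image into a "bag of visual words" is just counting. The sketch below assumes each patch has already been reduced to a short numeric descriptor and that the vocabulary is a list of representative descriptors. How the vocabulary is learned is outside the sketch, and assigning each patch to its nearest word is my simplification, not necessarily what the paper does.

```javascript
// Squared Euclidean distance between two descriptors of equal length.
function squaredDistance(a, b) {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
        const d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Index of the vocabulary entry closest to the descriptor.
function nearestWord(descriptor, vocabulary) {
    let best = 0;
    for (let i = 1; i < vocabulary.length; i++) {
        if (squaredDistance(descriptor, vocabulary[i]) <
            squaredDistance(descriptor, vocabulary[best])) {
            best = i;
        }
    }
    return best;
}

// Histogram of visual words for one image, normalized so that images with
// different numbers of patches can be compared.
function bagOfVisualWords(descriptors, vocabulary) {
    const histogram = new Array(vocabulary.length).fill(0);
    if (descriptors.length === 0) {
        return histogram;
    }
    for (const d of descriptors) {
        histogram[nearestWord(d, vocabulary)]++;
    }
    return histogram.map(function (count) { return count / descriptors.length; });
}
```

The resulting histogram plays the same role as word counts in a news article: it is the input to whatever classifier decides which category the image belongs to.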

Using this technique, Deselaers is even able to go beyond simple YES/NO classification and reach a new level of precision. The algorithm can rate images into one of five categories of increasing levels of offensiveness, from benign, to lightly dressed, to partly nude, fully nude, and porn. The paper contains examples from each category, and is guaranteed to offend somebody.

Corpus non indutus

At the end of the Google paper, the authors speculate on how to spur further advances:

...because of the ubiquity of the Internet, search engines, and the widespread proliferation of electronic images, adult-content detection is an important problem to address. To improve the rate of progress in this field it would be useful to establish a large fixed test set which can be used by both researchers and commercial ventures.

Yes, bring on the grant-sponsored porn, so that researchers can make the world a better place. But despite the years of study, one question remains unanswered: if such a corpus existed, how would we find it?

For a good time, read this

Exploiting perceptual colour difference for edge detection

2009-04-16T19:42:41-05:00

Estonian Translation

I once took a computer vision class. Every algorithm we learned was done on grayscale images, as if it were 1950 and we couldn't afford those newfangled colour VDTs. I put up my hand and asked why we discard all of this lovely colour information. The prof answered that some people have tried it, but it doesn't give much benefit.

But in some situations, colour can be important. Here's an image in GIMP:

First, we convert to grayscale:

Umm......... I don't think we need to go any further here. 'Nuf said.

Traditional approaches for colour

Computer vision researchers use two approaches for dealing with colour. The first is to break the image into red, green, and blue images, do the algorithm three times, and combine the results. The second is to treat the RGB values as a vector and subtract the vectors to determine colour difference. These approaches can work, but they don't take into account human perception of colour.

The RGB values in an image have very little to do with how the human eye works. They have more to do with the voltage fired at phosphors in a 1980s-era CRT display than with human vision. (Let's conveniently ignore the fact that LCDs have completely different properties. Everybody else does.)

To compare two colours, you have to first convert them to a different colour space. If you find yourself subtracting RGB values, you're doing it wrong.

Colour Difference

People can distinguish skin colours very well, so we can tell if a potential mate is healthy or not. This is more important than distinguishing different shades of green leaves, which would be mere distractions from the delicious bouquet of fruits hanging in trees. Describing the exact shade of an azure sky is simply not required for survival. We are hypersensitive to differences in yellows, reds, and oranges.

In 1915, Albert Munsell created a Colour Atlas, a book of colour paint samples. He came up with a way of logically arranging them that made sense to his eye. Each colour was indexed by hue, chroma (roughly, saturation), and value (also known as lightness). We still use his system today, and most of modern colour theory is based on Munsell's ideas.

The CIE is an international authority in charge of standardizing things to do with light and colour. In 1960 they created a mathematical arrangement of colour, and improved it in 1976. In this space, the more different two colours are, the farther apart they are in the drawing. The arrangement is not perfect, but the math is simple enough to work with. It's no good to create an international standard that nobody understands.

Finding outlines in images

Edge detection is usually the first step in any computer vision algorithm, because it converts an image into a black and white outline that is easier to work with.

You can count the vertical and horizontal edges and figure out if a photo is taken in a city or a forest. City buildings are often square.
You can treat it as a flat, physical object and figure out the center of gravity and angle of symmetry, and get some numbers that you can compare two images with, even if one is rotated or scaled.
You can mark strong intersecting edges as corners, and compare the locations of corners in two images to figure out how to line them up, or if they are the same object.

A nice, easy way to perform edge detection is to use the Sobel operator:

Translation: We create two new images, Gx and Gy. The pixels in the new images get their values from the difference in the values of the pixels around their corresponding position in the old image. In the Gx image, the differences are in the left-to-right direction. In the Gy image, the differences are in the up-and-down direction. After the two new images are created, we combine them into one by adding the pixels together in a special way: we square the values, add them together, and take the square root. The result is the edge-detected image.
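For the grayscale case, that description could be sketched like this. The image is assumed to be an array of rows of gray values, and border pixels are simply left at zero, which is one of several reasonable choices.

```javascript
// Sobel edge magnitude for a grayscale image given as an array of rows.
function sobel(gray) {
    const height = gray.length;
    const width = gray[0].length;
    const out = [];
    for (let y = 0; y < height; y++) {
        out.push(new Array(width).fill(0));
    }
    for (let y = 1; y < height - 1; y++) {
        for (let x = 1; x < width - 1; x++) {
            // Gx: right-hand column minus left-hand column, middle row weighted 2.
            const gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]) -
                       (gray[y - 1][x - 1] + 2 * gray[y][x - 1] + gray[y + 1][x - 1]);
            // Gy: bottom row minus top row, middle column weighted 2.
            const gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]) -
                       (gray[y - 1][x - 1] + 2 * gray[y - 1][x] + gray[y - 1][x + 1]);
            // Combine: square, add, take the square root.
            out[y][x] = Math.sqrt(gx * gx + gy * gy);
        }
    }
    return out;
}
```

On an image that is black on the left and white on the right, the output is zero in the flat areas and large along the boundary.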

I like the Sobel algorithm because it is easy to adapt to colour difference. Instead of subtracting the pixel gray values, we subtract the two colours using the distance in L*u*v* colour space, which is almost perceptually uniform.
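Here is a sketch of that colour distance. It assumes the pixels are 8-bit sRGB with a D65 white point, which is a guess about the input images and not something my C program necessarily does. The function names are made up for the example.

```javascript
// Undo the sRGB gamma curve (assumption: input is 8-bit sRGB).
function srgbToLinear(c) {
    const v = c / 255;
    return v <= 0.04045 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
}

// Linear RGB to CIE XYZ, using the standard sRGB/D65 matrix.
function rgbToXyz(r, g, b) {
    const R = srgbToLinear(r), G = srgbToLinear(g), B = srgbToLinear(b);
    return [
        0.4124 * R + 0.3576 * G + 0.1805 * B,
        0.2126 * R + 0.7152 * G + 0.0722 * B,
        0.0193 * R + 0.1192 * G + 0.9505 * B
    ];
}

// u', v' chromaticity coordinates. Pure black has no chromaticity, so return zeros.
function uvPrime(X, Y, Z) {
    const d = X + 15 * Y + 3 * Z;
    return d === 0 ? [0, 0] : [4 * X / d, 9 * Y / d];
}

// Reference white, taken from the same matrix so that white maps to u* = v* = 0.
const LUV_WHITE = rgbToXyz(255, 255, 255);
const LUV_WHITE_UV = uvPrime(LUV_WHITE[0], LUV_WHITE[1], LUV_WHITE[2]);

// sRGB to CIE L*u*v*.
function rgbToLuv(r, g, b) {
    const [X, Y, Z] = rgbToXyz(r, g, b);
    const yr = Y / LUV_WHITE[1];
    const L = yr > Math.pow(6 / 29, 3) ? 116 * Math.cbrt(yr) - 16
                                       : Math.pow(29 / 3, 3) * yr;
    const [u, v] = uvPrime(X, Y, Z);
    return [L, 13 * L * (u - LUV_WHITE_UV[0]), 13 * L * (v - LUV_WHITE_UV[1])];
}

// Colour difference: straight-line distance in L*u*v* space.
function colourDifference(rgb1, rgb2) {
    const a = rgbToLuv(rgb1[0], rgb1[1], rgb1[2]);
    const b = rgbToLuv(rgb2[0], rgb2[1], rgb2[2]);
    return Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
}
```

In the edge detector, colourDifference() takes the place of the subtraction of two gray values. Since a distance has no sign, the details of how the neighbouring differences are combined need a little care.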

Results

I implemented the two algorithms in 400 lines of C code. When the program is run with a given PNG image, it reads it and performs edge detection first on the grayscale version, and then on the colour version using colour differences. In both cases, the images are normalized so that the highest difference is white, and the lowest is black. Here are the results.

Note: The grayscale image appears blurred, and the colour one does not. This is an error in the blog entry and should be corrected later. The implementation blurs both versions before performing edge detection, to reduce noise.

Grayscale original	Traditional Edge Detection
Colour Original	Colour Edge Detection

Grayscale original	Traditional Edge Detection
Colour Original	Colour Edge Detection

Experiment: Deleting a post from the Internet

2009-04-12T13:40:20-05:00

Once you post something on the Internet, it is hard to get rid of it. As an experiment, I deleted one of my past posts, and I tried to remove all traces of it.

I selected my post about Technical Interview tips, because it is mildly popular, but never did very well. It was on reddit for only a couple of hours. Yet it regularly received a lot of hits from Google looking for interview tips for RIM. In my opinion the writing needed work, so I deleted it. Forever.

First, I removed it from my blog. I have a checkbox that says whether a post is shown or not. Unchecking it removes it from the main page, and whenever people try to see it, they get the main article listing instead.

RSS Reader caches

That wasn't good enough, because the article was still available in RSS readers. When Google Reader retrieves my blog entries, it simply merges the updated ones with its own database. The Atom specification does not define any way to delete posts, but it does allow updates. I had to put the post back, but remove its contents. Then, when the RSS reader did the merge, it would update its database to contain the empty post.

Google Cache

My post still appeared in Google, and you could read it by clicking on the cached link. To remove it from the Google cache, I had to make the page return an HTTP 404 error. I tried using the .htaccess file:

redirect gone /~smhanov/blog/?id=43

Unfortunately this had no effect on my web server. Apache's Redirect directive matches only the URL path, not the query string, so a rule containing ?id=43 never matches. I had to change my blog software to return a 404 HTTP status if that entry is retrieved:

    if ( $_GET['id'] == '43' ) { 
        header("HTTP/1.0 404 Not Found");
        exit;
    }

Comments about the post appeared on reddit. Since I was the original submitter to reddit, I have the option to delete it:

Clicking delete didn't work as advertised. You can still get to the post, but it is marked as [deleted]. This is a real problem on Reddit's part, because people might post something under the mistaken belief that they can remove it later. The button should be descriptive of what it will actually do. Software shouldn't lie.

Conclusion

The main text of the article is nowhere to be found. The problem is any comments or blog reactions will still be there, although they will have broken links. The experiment is a partial success.

The best way to hide your embarrassing past is to bury it with new things. For example, if you search for my name you won't find my Lion King fan-fiction anywhere in the first few pages of results.

Is 2009 the year of Linux malware?

2009-04-03T23:24:27-05:00

It is common knowledge that Linux users needn't worry about viruses because users don't run as root. I've never understood the reasoning behind this. Here are a few of the malicious things that a program can do without being root on Ubuntu 8.10:

Start a program every time you log in: add an entry to .config/autostart.
Configure Firefox to route all traffic through a remote proxy: change a line in .mozilla/firefox/*/prefs.js.
Replace everything in your "System Settings" menu with a command that asks you for your password, then does something else before invoking the real program: add a file to .local/share/applications.
Download and install other programs in the background: putting them in .gnome2/system32 seems somehow appropriate.
Run a server of any kind (web/ftp/irc/etc): just pick a port above 1024, and update the firewall with UPnP.
Install a new Firefox plugin: put it in .mozilla/firefox/*/extensions/ and call it "Ubuntu System Integration Plugin Helper".

Once malware has its grubby code all over your home folder, you are one fake dialog box away from giving it complete control over your system:

If you have ever run a program or script that wasn't included in your distribution, then you could have been infected with malware. (You weren't.)

Have you installed a zipfile full of video codecs from somewhere?
Have you ever run a script from ubuntuforums that promised to make it easy to get something working?
Have you ever been told by a web site to add a line to your sources.list file? Instant rootkit!

If you are interested in more examples, The Malware Project (PDF) is a great read that takes you step by step through an actual social engineering experiment with users. The results will surprise you.

Ubuntu in particular must be very enticing for malware writers, because:

It is easy to get new users to run things. There are thousands of annoyances with desktop Linux that can only be fixed by dropping to the command line, or downloading something to do it for you.
It has a rich, portable API. Malware writers have access to all Unix commands and a rich programming environment that is guaranteed to be available on every desktop, allowing them to search and change any file in your home folder, or even implement complex network protocols.
Open source makes it easy to copy other programs. If you can change sources.list, you can then replace top, ps, and System Monitor with exact clones that neglect to display your processes. This is much easier than hacking up the Windows Task Manager internal memory. Or just do everything in kernel mode for ultimate CAPTCHA-cracking, DDoS power.
People are unprepared. The "fact" that Linux can't get viruses is constantly repeated all over the web.

Is 2009 the year of the Linux desktop malware? How long until we see headlines like, "Researchers find massive botnet based on Linux 2.30"?

Email Etiquette

2009-04-01T20:20:03-05:00

If you begin your emails with "Hi, <name>!" then they will seem less rude.

> steve, we were testing your changes and they errored. can you pls 
> look at these tracefiles?
> cam

You dolt, you set HPLX_PARAMETER_7=A0, but it should be set to AO. 
Please re-run the test.

Compare to:

> steve, we were testing your changes and they errored. can you pls 
> look at these tracefiles?
> cam

Hi, Cam!

You dolt, you set HPLX_PARAMETER_7=A0, but it should be set to AO. 
Please re-run the test.

See what I mean?

How a programmer reads your resume (comic)

2009-03-26T21:11:26-05:00

Japanese Translation by Yasushi Aoki

Unauthorized Chinese translation, I think.

Here are the real tips

How to recognize a good programmer
Another Resume Tip - From Joel on Software
Ten Tips for a Slightly Less Awful Resume - Advice from Steve Yegge. An entertaining read.
Getting your resume read - From Joel on Software

How wide should you make your web page?

2009-03-22T15:58:15-05:00

Based on 22500 unique IP addresses over the past week, reddit users have these browser widths:

The numbers on the bottom are browser widths (minus 40), and the numbers on the side are the counts of unique visitors with that width.

Data collection

At the time this blog posting was made, my blog had hand-drawn borders in the page title. They were generated by a CGI program on the fly. For each visitor, a number about 40 pixels below their browser width is recorded in my server logs. I wrote a Perl script to create a histogram for unique visitors, and pasted the result into gnumeric to create the chart.
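The counting itself is simple. Here is the idea sketched in JavaScript; it is not the original Perl script. It assumes the log has already been boiled down to lines of the form "<ip> <width>", which is a made-up format for the example, and it counts only the first width seen for each IP address.

```javascript
// Build a histogram of browser widths, one count per unique IP address.
// logLines: array of strings like "203.0.113.7 1240" (hypothetical format).
// bucketSize: width of each histogram bucket in pixels.
function widthHistogram(logLines, bucketSize) {
    const seen = new Set();
    const histogram = new Map();
    for (const line of logLines) {
        const parts = line.trim().split(/\s+/);
        const ip = parts[0];
        const width = parseInt(parts[1], 10);
        // Skip malformed lines and visitors we have already counted.
        if (!ip || Number.isNaN(width) || seen.has(ip)) {
            continue;
        }
        seen.add(ip);
        const bucket = Math.floor(width / bucketSize) * bucketSize;
        histogram.set(bucket, (histogram.get(bucket) || 0) + 1);
    }
    return histogram;
}
```

The result maps each bucket's lower bound to a visitor count, ready to paste into a spreadsheet.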

Results

The peaks seem to correspond to screen widths of 1024, 1280, 1400, 1600, and 1920, with 1280 being the greatest peak. This indicates that most people have their browser window maximized.

There is also a section of uniformity between 1000 and 1280. This seems to be a sweet spot among those who do not maximize their browsers.

Surprisingly, there are still a few people running at 800x600 resolutions. And one at over 2500 pixels across.

Usability Nightmare: Xfce Settings Manager

2009-03-21T22:22:04-05:00

Quick! Where do you go to increase the text size in all your applications? Can you pick the right button on the first try? Do you feel lucky, punk?

I might try:

Desktop, because it's at the top, I see it first, and text is a part of my desktop experience.
Display, because that's where the setting is in Windows XP.
User Interface, because everything I have ever wanted to change is a user interface setting.
Panel, because I also want to change the text size on the panel.
Window Manager, because windows can display text on them.
Window Manager Tweaks, because I want to tweak how text is shown.

Or maybe you're in the wrong settings manager. The first hurdle was in the Settings menu. Text settings aren't mentioned at all, although some things are in two or three places. I'm glad I'm not setting up a printer!

It turns out that we chose wisely. Unfortunately, in the Xfce Settings Manager, text sizes are distributed between two categories. Choose Window Manager for the window title bars, and User Interface for everything else.

Can't you come up with a better way of organizing settings than Windows 3.1? Has the state of the art in Settings dialogs not advanced since 1992?

At least Microsoft's control panel was unambiguous. It even had a description in the status bar of what you were going to click on.

Today's Usability Tips

Did you notice that I never mentioned the word font? New users might call things by different names. More people in the world know what "text size" means than "font size".
Manuals have gone the way of the floppy disk. Instead of writing a manual, spend time just watching someone use your software, so you won't need one in the first place.
People read the screen from the top down. Take advantage of this, and put the most common settings first.
If your software has many settings, thoughtlessly dividing them into categories is a sure way of making them more confusing.
If you are copying user interface layout from Windows 3.1, at least do it right.

cairo blur image surface

2009-03-14T13:34:45-05:00

This really should have been included in cairo. Instead, everyone that wants to have shadows has to roll their own blur function. Here's my take on it. I'll even release this into the public domain.

This is used in the back-end for www.websequencediagrams.com.

// to build:
// gcc -I/usr/include/cairo -lcairo -o blur blur.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "cairo.h"

void cairo_image_surface_blur( cairo_surface_t* surface, double radius )
{
    // Steve Hanov, 2009
    // Released into the public domain.
    
    // get width, height
    int width = cairo_image_surface_get_width( surface );
    int height = cairo_image_surface_get_height( surface );
    unsigned char* dst = (unsigned char*)malloc(width*height*4);
    unsigned* precalc = 
        (unsigned*)malloc(width*height*sizeof(unsigned));
    unsigned char* src = cairo_image_surface_get_data( surface );
    double mul=1.f/((radius*2)*(radius*2));
    int channel;
    
    // The number of times to perform the averaging. According to wikipedia,
    // three iterations is good enough to pass for a gaussian.
    const int MAX_ITERATIONS = 3;
    int iteration;

    memcpy( dst, src, width*height*4 );

    for ( iteration = 0; iteration < MAX_ITERATIONS; iteration++ ) {
        for( channel = 0; channel < 4; channel++ ) {
            int x,y;

            // precomputation step.
            unsigned char* pix = src;
            unsigned* pre = precalc;

            pix += channel;
            for (y=0;y<height;y++) {
                for (x=0;x<width;x++) {
                    int tot=pix[0];
                    if (x>0) tot+=pre[-1];
                    if (y>0) tot+=pre[-width];
                    if (x>0 && y>0) tot-=pre[-width-1];
                    *pre++=tot;
                    pix += 4;
                }
            }

            // blur step.
            pix = dst + (int)radius * width * 4 + (int)radius * 4 + channel;
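            for (y=radius;y<height-radius;y++) {
                for (x=radius;x<width-radius;x++) {
                    int l = x < radius ? 0 : x - radius;
                    int t = y < radius ? 0 : y - radius;
                    int r = x + radius >= width ? width - 1 : x + radius;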
                    int b = y + radius >= height ? height - 1 : y + radius;
                    int tot = precalc[r+b*width] + precalc[l+t*width] - 
                        precalc[l+b*width] - precalc[r+t*width];
                    *pix=(unsigned char)(tot*mul);
                    pix += 4;
                }
                pix += (int)radius * 2 * 4;
            }
        }
        memcpy( src, dst, width*height*4 );
    }

    free( dst );
    free( precalc );
}

int main(int argc, char* argv[])
{
    cairo_surface_t* surface;
    cairo_t* ctx;
    cairo_text_extents_t text_extents;
    cairo_font_extents_t font_extents;
    double FontSize = 100;
    double radius = 7;
    double width, height;

    if ( argc != 3 ) {
        printf("Syntax: %s <output.png> \"<text>\"\n", argv[0]);
        return -1;
    }

    // Get text size.
    surface = cairo_image_surface_create( CAIRO_FORMAT_ARGB32, 10, 10 );
    ctx = cairo_create( surface );
    cairo_set_font_size( ctx, FontSize );

    cairo_font_extents( ctx, &font_extents );
    cairo_text_extents( ctx, argv[2], &text_extents );

    height = font_extents.ascent + font_extents.descent + radius * 2;
    width = text_extents.x_advance + radius * 2;

    cairo_destroy( ctx );
    cairo_surface_destroy( surface );

    // Draw text.
    surface = cairo_image_surface_create( CAIRO_FORMAT_ARGB32, width, height );

    ctx = cairo_create( surface );
    cairo_set_font_size( ctx, FontSize );

    cairo_move_to( ctx, 0 + radius, font_extents.ascent + radius );
    cairo_show_text( ctx, argv[2] );
    cairo_fill( ctx );

    cairo_image_surface_blur( surface, 5 );

    cairo_move_to( ctx, 0, font_extents.ascent );
    cairo_show_text( ctx, argv[2] );
    cairo_fill( ctx );

    cairo_destroy( ctx );

    cairo_surface_write_to_png( surface, argv[1] );

    cairo_surface_destroy( surface );

    return 0;
}
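
The precomputation step in cairo_image_surface_blur is a summed-area table: each entry holds the sum of every pixel above and to the left of it, so the total of any box can be read back with four lookups. Here is a minimal Python sketch of that idea on a plain 2D list. The function names are mine, not part of cairo, and it handles a single channel only.

```python
def summed_area_table(img):
    """Return a table where table[y][x] is the sum of img[0..y][0..x], inclusive."""
    height, width = len(img), len(img[0])
    table = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = img[y][x]
            if x > 0:
                total += table[y][x - 1]
            if y > 0:
                total += table[y - 1][x]
            if x > 0 and y > 0:
                # The top-left region was added twice, so subtract it once.
                total -= table[y - 1][x - 1]
            table[y][x] = total
    return table


def box_sum(table, left, top, right, bottom):
    """Sum of the pixels with left < x <= right and top < y <= bottom."""
    return (table[bottom][right] + table[top][left]
            - table[bottom][left] - table[top][right])


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
table = summed_area_table(img)
print(box_sum(table, 0, 0, 2, 2))  # 5 + 6 + 8 + 9
```

With this indexing a box spanning x - radius to x + radius covers 2 * radius pixels on each side, which is why the C code multiplies by 1 / ((radius * 2) * (radius * 2)).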

Automatically remove wordiness from your writing

2009-03-04T11:37:18-05:00

I recently started re-reading William Zinsser's On Writing Well. Zinsser emphasizes simplicity in writing. To reduce wordiness, he implores the writer to remove needless words and phrases:

"I might add," "It should be pointed out," "It is interesting to note that" how many sentences begin with these dreary clauses announcing what the writer is going to do next? If you might add, add it. If it should be pointed out, point it out. If it is interesting to note, make it interesting. Being told that something is interesting is the surest way of tempting the reader to find it dull; are we not all stupefied by what follows when someone says, "This will interest you"? As for the inflated prepositions and conjunctions, they are the innumerable phrases like "with the possible exception of" (except), "due to the fact that" (because), "he totally lacked the ability to" (he couldn't), "until such time as" (until), "for the purpose of" (for).

It's not only dry corporation speak that you should worry about. Actually, what I mean to say is that a little bit of wordiness totally creeps into informal writing way more than you'd think. If you do any sort of writing on the web, you seriously need to think about editing, and more often than not, this tool can help point out some bad habits.

You might be concerned that your writing will lose its personality. Zinsser goes on to say:

You will reach for gaudy similes and tinseled adjectives, as if "style" were something you could buy at a style store and drape onto your words in bright decorator colors. (Decorator colors are the colors that decorators come in.) Resist this shopping expedition: there is no style store. ... Style is organic to the person doing the writing, as much a part of him as his hair, or, if he is bald, his lack of it. Trying to add style is like adding a toupee.

You don't want your blog to wear a toupee, do you? Writing style isn't about needless words. Once you remove them, your thoughts will shine through, clearer and more powerful, and you can then build them back up. This takes time, but your readers will appreciate it.

By using sources on the web, I came up with about 600 simple substitution rules to cut out wordy phrases, and encoded them into a python script. Along with other sources, I used Jeff Atwood's Coding Horror blog to train it, ~~[edit] as he seems to have a high wordiness factor~~, because I wondered if I could get a web celebrity to notice my little blog, and it totally worked.

Try it out above. Paste your entire blog article, essay, or email into it. Download the python source here.
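
To give a feel for what a substitution rule looks like, here is a minimal sketch. It is not the script linked above: the four rules are just the examples from the Zinsser quote, and the names RULES and remove_wordiness are made up for illustration.

```python
import re

# A handful of rules taken from the Zinsser passage quoted earlier.
# The real script has about 600 of these; they are not reproduced here.
RULES = {
    "with the possible exception of": "except",
    "due to the fact that": "because",
    "until such time as": "until",
    "for the purpose of": "for",
}

# Try longer phrases first so a long rule wins over a shorter one it contains.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(RULES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def remove_wordiness(text):
    """Replace each wordy phrase with its shorter equivalent."""
    def substitute(match):
        phrase = match.group(1)
        replacement = RULES[phrase.lower()]
        # Keep a leading capital if the phrase started a sentence.
        if phrase[0].isupper():
            replacement = replacement.capitalize()
        return replacement

    return _PATTERN.sub(substitute, text)


print(remove_wordiness(
    "Due to the fact that it rained, we waited until such time as it stopped."
))
```

Plain phrase matching like this cannot tell when a phrase is being quoted rather than used, so treat the output as suggestions, not final copy.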

Unfortunately, Internet Explorer 6 has some problem with my code...

Why Perforce is more scalable than Git

2009-02-23T21:31:22-05:00

Okay, say you work at a company that uses Perforce (on Windows). So you're happily tapping away using Perforce for years and years. Perforce is pretty fast -- I mean, it has this "nocompress" option that you can tweak and turn on and off depending on where you are, and it generally lets you get your work done. If you change your client spec, it synchronizes only the files it needs to. Wow, that blows the mind! Perforce is great, why would you ever need anything else? And it's way better than CVS.

Suddenly you have to clone something with git, and BAM! The world is changed. You feel it in the water. You feel it in the earth. You smell it in the air. Once you've experienced git, there is no going back, man. Git is the stuff man. You might have checked out firefox -- but have you checked out firefox ooon GIT?

So many really obvious things are missing in p4. Want to restore your source tree to a pristine state? "git clean -fd". Want to store your changes temporarily to work on something else? "git stash". Share some code with a cube-mate without checking in? "git push". Want to automatically detect out of bounds array accesses and add missing semicolons to all your code? "git umm-nice-try"

Branching on git is like opening a new tab in a browser. It's a piece of cake. You can branch for EVERY SINGLE BUGFIX. And you wrote the code, so you get to merge it back in, because you are the expert.

Branching on Perforce is kind of like performing open heart surgery. It should only be done by professionals: experts in the art who really know what they are doing. You have to create a "branch spec" file using a special syntax. If you screw up, the entire company will know and forever deride you as the idiot who deleted "//depot/main". The merging is done by gatekeepers. Hope they know what they're doing!

Now, if you have been using git for a few days you might discover this tool called "git-p4". "AHA!" you might say, "I can import from my company's p4 server into git and work from that, and then submit the changes back when I am done," you might say. But you would be wrong, for a number of reasons.

git-p4 can't handle large repositories

Really. It's just a big python script, and it works by downloading the entire p4 repository into a python object, then writing it into git. If your repo is more than a couple of gigs, you'll be out of memory faster than you can skim reddit.

But that problem's fixable. I was able to hack up git-p4 to do things a file at a time in about an hour. The real problem is:

Git can't handle large repositories

Okay this is subjective because it depends on your definition of large. When I say large, I mean about 6 gigs or so. Because your company's source tree is probably that large. If you have the power, you will use it. Maybe you check in binaries of all your build tools, or maybe for some reason you need to check in the object files of the nightly builds, or something silly like that. P4 can handle this because it runs on a cluster of servers somewhere in the bowels of your company's IT department, administered by an army of drones tending to its every need. It has been developed since 1995 to handle the strain. Google also uses Perforce, and when it started to show its strain, Larry Page personally went to Perforce's headquarters and threatened to direct large amounts of web traffic up their executives' whazzoos until they did something about it.

Git has none of that. The typical git user considers the linux kernel to be a "large project". If you've looked at Linus's git rant on Google code, take a listen to see how he sidesteps the question of scalability.

Don't believe me? Fine. Go ahead and wait a minute after every git command while it scans your entire repo. It's maddening because it's long enough to be annoying, but not enough time to skim Geekologie.

The solution

You know what? I don't think many people really use distributed source control. The centralized model is here to stay. Most git users (especially those using Github) use the centralized model anyway.

Ask yourself this: Is it really that important to duplicate the entire history on every single PC? Do you really need to peruse changelist 1 of KDE from an airplane? In most cases, NO. What you really want is the other stuff: easy branching, clean, and stash, and the ability to transfer changes to another client. The distributed stuff isn't really asked for, or needed. It just makes it hard to learn.

Just give me a version control system that lets me do these things and I'll be happy:

Let me merge changes into my coworker's repos, without having to check them in first.
Let me "stash" stuff cause it's really handy. Clean is nice to have too.
Make branching easy.
Don't waste 40% of my disk space with a .git folder, when this could be stored on a central server.

Is that really so hard?

Optimizing Ubuntu to run from a USB key or SD card

2009-01-25T20:17:06-05:00

If you've installed Ubuntu on a USB key or SD card, you are probably experiencing the annoying slowness of Firefox. It freezes up for a couple of seconds every time you click a link. Like many things on Ubuntu, it doesn't work right out of the box and needs some tweaking. Fortunately, by following the tips below, you can make your USB or SD card based linux system fly!

Tip 1: Stop Firefox from writing to disk

Firefox 3 has a hard-to-fix bug that causes linux to write to disk every time you visit a web page. But it doesn't just write Firefox stuff -- it causes your entire system to dump all changes to disk. Unfortunately, your USB key might only have a 6 MB/s write speed, causing everything to freeze up.

Under the Privacy setting, uncheck "Keep my history for..."
Under the Advanced Tab, select Network and ensure that you use up to 0 MB of disk space for the cache.

Tip 2: Use preload

The preload daemon is a program that constantly looks at the programs you are running and figures out which ones you are most likely to use. When you start your computer, it automatically loads these programs and libraries from disk in the background, so when you start firefox, for example, it will pop up right away. The background is described in the author's Master's thesis.

It's kind of like putting magnets under your pillow to improve health. Maybe it's having an effect, but I can't tell. I install it anyway:

sudo aptitude install preload

Tip 3: Compress your files

This tip can wreck your system, and to undo it you will need to be able to use a command line editor like nano, emacs, or vim. At worst you will need to mount the USB key on another linux system to recover (by editing /etc/fstab). If you can't do that, then skip this tip.

On solid state storage, space is expensive. Ubuntu uses a huge amount of space with all the programs it installs. The /usr folder contains your programs, and it is usually 1.8 GB. Using squashfs, it can be compressed to 0.7 GB. Since read speeds are so slow, you can actually gain performance because there is less data to read. I've adapted these instructions from here.

Install squashfs and unionfs:

sudo apt-get install squashfs-tools unionfs-tools

Add the following lines to /etc/modules:

unionfs
squashfs
loop

Remove apparmor. Otherwise, the cups print server will stop working:

sudo apt-get purge apparmor

Make space for the filesystem:

sudo mkdir -p /.filesystems/usr/overlay

Compress your filesystem:

sudo mksquashfs /usr /.filesystems/usr/usr.sqfs

Add these lines to /etc/fstab:

/.filesystems/usr/usr.sqfs /usr squashfs ro,loop,nodev 0 0
unionfs /usr unionfs nodev,noatime,dirs=/.filesystems/usr/overlay=rw:/usr=ro 0 0

Switch to runlevel 1. (Ubuntu will close all open programs, then ask you what to do. Choose to open a root shell.)

sudo init 1

Move aside the old /usr directory and create a new mount point:

mv /usr /usr.old
mkdir /usr

Test whether you previously edited fstab successfully by typing:

mount -a

If you get error messages or your /usr directory shows up empty, either fix your fstab or undo the changes before continuing.

Now reboot and make sure it all works:

reboot

If it works, remove the /usr.old directory to reclaim the space.

Tip 4: Use memory instead of disk

This tip can also lead to data loss. If you do it, you will have to always shut down your computer properly from now on, because unexpected power failures will lead to data loss.

Linux usually ensures that all changes are written to disk every few seconds. Since disk writes are so slow, you can change your system to keep things in memory longer. All changes will be written to memory, and the excruciatingly slow writes happen in the background while you continue working. This has an instant, noticeable effect, but it can lead to data loss.

Add these lines to /etc/sysctl.conf, and reboot.

vm.swappiness = 0
vm.dirty_background_ratio = 20
vm.dirty_expire_centisecs = 0
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 0

The problem: using this tip means that your system stops writing changes to disk until you shut down or type "sync" at a command line. If your system loses power unexpectedly, you will get bad blocks. I did. You can limit the amount of data loss in the event of a power failure to one minute by setting vm.dirty_writeback_centisecs = 6000.
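
The centisecond units are easy to misread, so here is a small sketch that parses the lines above and prints the timing values in seconds. It only reads text, it does not touch your system, and the names parse_sysctl and centisecs_to_seconds are made up for this example.

```python
# The settings from this tip, kept as text so the parser has something to read.
SYSCTL_SNIPPET = """\
vm.swappiness = 0
vm.dirty_background_ratio = 20
vm.dirty_expire_centisecs = 0
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 0
"""


def parse_sysctl(text):
    """Parse 'key = value' lines, ignoring blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def centisecs_to_seconds(centisecs):
    """A centisecond is one hundredth of a second."""
    return centisecs / 100


settings = parse_sysctl(SYSCTL_SNIPPET)
for key, value in sorted(settings.items()):
    if key.endswith("_centisecs"):
        print(f"{key}: {value} cs = {centisecs_to_seconds(value):g} s")
    else:
        print(f"{key}: {value}")
```

By the same conversion, the 6000 suggested above works out to 60 seconds.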

A side effect is that shutting down your computer may take several minutes where it appears to be doing nothing. Don't cut the power until it's done, because it is busy writing all those changes to disk.

UMA Questions Answered

2008-12-20T10:08:32-05:00

Geez, I just went through my server logs and it is clear that people have lots of questions on UMA. Whenever someone asks a question in Google, and my web page pops up, and they click on it, I can see what they typed into Google. So in a way, all of you people on the Internet are able to tell me what to write about. So here is my page with what you want to know about UMA.

Because I work for RIM, there is a general employee directive to not give any technical support online. So this article relates to all cell phones in general. I certainly don't want to get fired. (Ssssh! They don't know about this blog). Oh yeah, almost forgot:

The postings on this site are my own and don't necessarily represent the position, opinions or strategies of RIM

You might also be interested in my general description of UMA, and whether you can get free long distance over UMA.

"how does UMA connection work"

From your perspective, you get one phone number that will work over your Internet connection when you are at home, and over cell towers when you are outside. If you travel to another country, you could possibly make calls to your home town and not be charged long distance. However, UMA is not Skype. Other than that exact situation, you will normally be charged for long distance calls depending on your plan.

Normally your phone sends its signals to a cell tower, which forwards it to a server on the carrier's network. With UMA, your phone logs into the carrier's network through your internet connection and sends its signals directly to that server. That means you can access all of the same services over UMA, like voice, data, and SMS. Unfortunately, it means that all of those same services go through your carrier's network, perhaps unnecessarily. Some phones support something called "Internet Offload", in which UMA is only used for voice calls, but all the data goes directly over the Internet. Got it?

"connecting a uma phone to the network."

Connecting UMA is hard. Normal cell phones are designed so that any idiot with a bank account can use them. But unfortunately, UMA is based on the Wifi network, and Wifi has a lot of things that you can screw up. To connect to UMA, six things have to happen:

Your cell phone has to be specifically designed for UMA. It has to contain a Wifi radio on it, or it just isn't going to work. You can't take your 10 year old Motorola and sign up for UMA service. It has to be fairly new -- probably made in the last two years. Furthermore, it has to contain the UMA software on it. You pretty much have to have the phone and OS designed to accommodate UMA, because the UMA application needs direct access to secure sockets, audio, and SIM card security functions.
You have to sign up for UMA service with your carrier. Their security gateway knows whether you are paying for their service and it won't let you through if you didn't sign up.
You have to have the security certificate installed on your phone. Each carrier (for example, T-Mobile, or Rogers in Canada) has a certificate that has to be loaded onto your phone for you to connect. It is kind of like a password, but it is far too long to type in manually, so it has to be loaded with special software. This probably happened when your phone was manufactured and customized for the carrier. Unfortunately, that means that it is locked to a particular carrier, unless you can figure out a way to load a new certificate. This is what gives people problems when they try to use a phone bought on ebay.
You have to successfully connect to Wifi. This will involve scanning for the Wifi network using your phone, and probably entering a WEP or WPA password of some kind. You should try this on your laptop first, to see if your Internet is set up properly. If your laptop can't connect, then there is some other problem with your Wifi router.
Your phone has to be configured to use the UMA connection. The mobile phone standards dictate that all phones have to have some way of choosing how UMA is used. You have to be able to choose between cellular-only, cellular-preferred, UMA only, and UMA preferred modes of operation. The only option that makes sense is UMA-preferred. (This is sometimes called "Wifi preferred")
Finally, UMA must successfully connect. Turn off your microwave oven. Stop bittorrent. At this point, one missed packet can cause a huge delay. The connection phase retry can take up to 32 minutes, because the mobile standards describe a precise scheme of retries. Specifically, it tries three times, waiting 30 seconds between tries. If those tries fail, it waits 2 minutes, then tries three more times. On failure, it doubles the two minutes to 4 minutes, and so forth, until it eventually waits 32 minutes between retries. Then it stops doubling the timeout. To avoid the 32 minute wait, you can probably make it try again immediately by turning off and on the phone.
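
The retry scheme in the last step is easier to follow as a schedule. This is a rough sketch under my reading of the description above: three tries 30 seconds apart, then a wait that starts at 2 minutes and doubles up to a 32 minute cap. It ignores how long each try itself takes, and the function names are invented for the example.

```python
def backoff_waits(count, first_wait=120, max_wait=32 * 60):
    """Waits (in seconds) between rounds of tries: 2 min, doubling, capped at 32 min."""
    waits = []
    wait = first_wait
    for _ in range(count):
        waits.append(wait)
        wait = min(wait * 2, max_wait)
    return waits


def attempt_times(rounds, tries_per_round=3, try_gap=30):
    """Seconds from the first try at which each connection attempt starts."""
    times = []
    now = 0
    for wait in backoff_waits(rounds):
        for attempt in range(tries_per_round):
            times.append(now)
            if attempt < tries_per_round - 1:
                now += try_gap
        # All tries in this round failed: back off before the next round.
        now += wait
    return times


print([w // 60 for w in backoff_waits(6)])  # waits in minutes
print(attempt_times(3))                     # attempt start times in seconds
```

After five failed rounds the wait has reached the 32 minute cap, which is where rebooting the phone starts to look attractive.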

"how to use uma phone without sim"

It is not possible to use UMA without signing up for an UMA plan with a carrier. That is because the SIM card actually performs the authentication process to login to UMA. No SIM means no UMA.

"does uma work over different carriers"

If you travel, you can use your UMA phone to connect to your home carrier and make calls without roaming charges. Your main problem will be connecting your phone to wireless hotspots, because often you will have to login to the hotspot and click "I agree" to some usage agreement. If your phone browser isn't able to display the button, then you won't be able to connect to Wifi. But if you pass this hurdle, your phone should work fine. Just make sure it is actually using the UMA network when you make your call. In this case, it might be good to set it to UMA only mode.

"how to make uma server"

So you want to make an UMA server so you can run your own phone service. Okay, well, first you have to remember that UMA is simply a wrapper around the regular cell phone messages so they can be transported over the Internet. So all you have to do is implement the entire rest of that phone network. Feel free to download the 3GPP specifications and implement your own phone network. Don't let me stop you. Out of the 1,000,000 pages of telephony standards, I think you'd only have to read about 50,000.

Or you could just go write a SIP server like a normal person.

"which carriers have uma"

Umm, I don't know. I know T-Mobile US is doing it and Rogers Canada is as well. There is a complete list on wikipedia.

"best uma phone"

The best UMA cell phone is the BlackBerry, which does support Internet Offload. The latest model at this point is the 8900. Go out and buy that one as soon as you can. But be careful if it is used.

"UMA skype"

Skype shouldn't be used over UMA, because as I have said, on most phones, all data goes through the carrier's servers, and is chargeable. Your Skype usage would cost thousands of dollars in data charges. If your phone supports Skype and Wifi, and you can turn off UMA, do so to avoid these charges.

"is uma service free"

That depends on your plan and contract. You did read it, didn't you? Usually you have to pay extra for UMA, but then you get unlimited calling while on UMA. But read your plan very carefully.

If you make long distance calls, you will probably be charged extra.

But the number one search topic for UMA is:

"uma nude"

I don't have all the answers about UMA! For that, you'll have to look elsewhere.

See sound without drugs

2008-12-09T19:19:33-05:00

To further my understanding of frequency analysis and the fast Fourier transform, I have created an application that just turns on the microphone and continually plots the FFT magnitude of what it records. It allows control over the window size and sampling rate.

Download SoundLab

Because the FFT magnitude of a real-valued signal is symmetric (the second half mirrors the first), we only display the first half.
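
You can check the mirror symmetry yourself with a few lines of Python. This sketch uses a naive O(n^2) discrete Fourier transform from the standard library rather than a real FFT, and a made-up test signal. It is not SoundLab's code.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform; fine for a tiny demo, too slow for audio."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


N = 16
# A real-valued test signal: one tone at bin 3 plus a quieter one at bin 5.
signal = [
    math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.cos(2 * math.pi * 5 * t / N)
    for t in range(N)
]
magnitudes = [abs(c) for c in dft(signal)]

# Bin k and bin N - k always have the same magnitude for real input.
for k in range(1, N // 2):
    print(k, round(magnitudes[k], 3), round(magnitudes[N - k], 3))
```

Each row prints the same magnitude twice, so plotting bins 0 through N/2 loses nothing.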

It's fun to whistle and see the result. I think it would be cool to make a game out of this. For example, you could make pong and control the paddle with sound. I plan to call it Whistle Hero.

For some more fun, run SoundLab at the same time as WaveStudio, an older application I made that lets you draw a waveform and hear it. That way you can see a waveform that you create in the time domain in the frequency domain, and hear it all at the same time. It's actually quite challenging to try to make a perfect soundwave, and eliminate all of the harmonics.

Stock Picking using Python

2008-10-18T17:16:08-05:00

The stock market is a lot different than it was just a few months ago. Once again, I present my stock selections, as found via python script. Comparing it with last time, you will find most of the same names are on there.

Financial data for 1600 public companies listed on the TSX is downloaded from http://finance.google.com.
The annual and quarterly revenue and earnings are scraped from the HTML file using an sgrep query.
Each company is filtered according to the following criteria:
- Resource, mining, and energy stocks are excluded.
- Stocks with PE ratio higher than 50 are excluded.
- Stocks which had negative revenue or earnings in the past two years are excluded.
The remaining stocks are sorted by growth and displayed here.
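
The filtering steps above can be sketched in a few lines. This is not the script on Github: the Company record, its field names, and the three sample companies are all hypothetical, and I am assuming "growth" means revenue growth, since that is how the table below appears to be sorted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Company:
    # Hypothetical record layout; the real script's data structures are not shown.
    name: str
    sector: str
    pe_ratio: float
    revenue: List[float]    # annual figures, oldest first
    earnings: List[float]   # annual figures, oldest first
    revenue_growth: float   # percent


EXCLUDED_SECTORS = {"resource", "mining", "energy"}


def passes_filter(company):
    """Apply the three exclusion rules from the list above."""
    if company.sector in EXCLUDED_SECTORS:
        return False
    if company.pe_ratio > 50:
        return False
    last_two_years = company.revenue[-2:] + company.earnings[-2:]
    if any(value < 0 for value in last_two_years):
        return False
    return True


def pick(companies):
    """Keep the companies that pass, highest growth first."""
    survivors = [c for c in companies if passes_filter(c)]
    return sorted(survivors, key=lambda c: c.revenue_growth, reverse=True)


# Made-up sample data, purely to exercise the filter.
sample = [
    Company("Example A", "retail", 12, [10, 15], [1, 2], 50),
    Company("Example B", "mining", 8, [10, 30], [1, 5], 200),
    Company("Example C", "software", 9, [5, 9], [1, 3], 80),
    Company("Example D", "software", 75, [5, 9], [1, 3], 80),
    Company("Example E", "retail", 10, [5, 9], [-1, 3], 80),
]
print([c.name for c in pick(sample)])
```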

Source code on Github

                        Revenue Growth     EPS Growth         Price    P/E
                        (Years positive)   (Years positive)
--------------------------------------------------------------------------
                        183% (2)           509% (2)           12.50    5
                        165% (2)           127% (2)           5.59     8
                        140% (2)           629% (2)           6.36     3
                        94% (3)            58% (2)            14.50    6
                        88% (3)            57% (3)            3.07     7
                        82% (2)            785% (3)           1.95     1
                        78% (3)            35% (2)            12.00    22
                        70% (3)            119% (2)           4.00     12
                        65% (3)            86% (3)            70.25    22
                        58% (3)            68% (3)            4.67     6
                        53% (2)            477% (2)           3.10     44
                        52% (2)            68% (2)            7.12     7
Appliances Income Fund  51% (2)            5% (2)             3.99     5
                        44% (3)            241% (2)           4.08     14
                        43% (3)            46% (3)            15.21    7
                        43% (2)            43% (2)            12.07    14
                        40% (3)            977% (3)           14.00    12
                        38% (3)            34% (2)            10.13    6
                        37% (3)            23% (3)            15.00    5
                        37% (3)            42% (3)            8.40     9
                        36% (3)            55% (3)            6.50     4
                        35% (3)            39% (3)            6.39     4
                        35% (3)            209% (3)           6.78     5
                        33% (2)            232% (2)           31.07    31
                        33% (3)            21% (2)            2.90     13
                        30% (3)            123% (2)           5.91     3
                        29% (2)            41% (2)            6.75     12
                        28% (3)            302% (2)           8.00     11
                        28% (3)            44% (3)            23.00    10
                        26% (3)            215% (2)           10.87    6
                        25% (3)            19% (3)            20.86    19
                        24% (3)            30% (2)            7.30     4
                        23% (3)            67% (3)            10.95    7
                        23% (3)            28% (3)            8.50     15
                        23% (3)            24% (3)            18.64    11
                        22% (3)            79% (2)            9.20     7
                        22% (3)            72% (2)            35.30    12
                        22% (3)            45% (2)            5.66     7
                        21% (3)            29% (3)            24.94    18
                        20% (3)            27% (3)            24.35    25
                        20% (3)            30% (3)            30.02    13
                        20% (3)            78% (2)            5.75     5
Investment Trust        20% (3)            2% (2)             9.26     8
                        20% (3)            29% (3)            23.31    26
                        18% (3)            26% (3)            15.00    20
                        17% (3)            18% (3)            16.85    7
                        16% (3)            42% (2)            12.20    11
                        13% (2)            203% (2)           2.82     6
                        13% (3)            38% (3)            7.50     12
                        13% (3)            9% (2)             28.55    19
                        13% (3)            54% (2)            6.80     10
                        10% (3)            13% (3)            35.52    10
                        10% (3)            29% (3)            15.20    9
                        10% (3)            9% (3)             17.86    11
                        10% (3)            90% (2)            9.50     8
                        10% (3)            19% (3)            14.65    10
                        9% (3)             20% (3)            24.08    12
                        8% (3)             16% (3)            45.15    18
                        8% (3)             17% (2)            4.51     7
                        8% (3)             10% (3)            9.70     11
                        8% (3)             6% (3)             8.84     5
                        7% (2)             33% (3)            4.49     19
                        7% (3)             32% (3)            88.00    6
                        7% (2)             5% (3)             12.63    11
                        6% (3)             33% (3)            47.71    9
                        6% (3)             35% (3)            39.62    9
                        5% (3)             67% (3)            12.50    5
                        4% (2)             41% (2)            7.90     13

I stumbled across this page about myself on this rotten company Spoke.com, who, without my permission, gathered my name and employment history together into one place. I object to it, but there was no obvious way to get it removed. After a lot of searching, I found a contact page and filled it out, but I'm not at all confident that it will be acted on.

Spoke, if you want me to remove this entry about you, you can opt-out at any time using the contact form below. After I verify your identity I will put your request into a queue for removal.

Copy a cairo surface to the windows clipboard

2008-09-19T10:00:00-05:00

I just spent several hours debugging clipboard copy of a DIB image. I could copy from my application, and paste into Paint. I could paste into Word. But if I pasted into WordPad, nothing showed up. If I pasted into GIMP, it crashed.

The general procedure is to fill out a BITMAPINFO structure, calculate the size of the image + row padding + the bitmap info structure itself, then allocate a memory handle with GlobalAlloc(). Finally, copy the BITMAPINFO structure into the given memory followed by the image pixel data.

What they don't tell you in MSDN is that, for maximum compatibility with other applications, you must use a positive value for the BITMAPINFOHEADER biHeight member. That means that you have to create the bitmap upside down.

The other thing they don't tell you, unless you're reading really, really carefully, is that you have to insert padding at the end of the rows so that they always end at a DWORD (4 byte) boundary.

Anyway, here's a code snippet. Hopefully it will help somebody someday. If so, give me $2. Really.

// copy a cairo win32 surface (with dib) to the clipboard.
bool
GraphicsClipboardSurface::copyToClipboard(HWND hwnd)
{
    cairo_surface_t* imageSurface = cairo_win32_surface_get_image( _surface );
    if ( imageSurface == NULL ) {
        assert(false);
        return false;
    }

    unsigned char* bits = cairo_image_surface_get_data( imageSurface );

    if ( bits == NULL ) {
        assert( false );
        return false;
    }

    assert( cairo_image_surface_get_format( imageSurface ) == CAIRO_FORMAT_RGB24 );

    BITMAPINFOHEADER bmi;
    unsigned biSizeImage;
    memset( &bmi, 0, sizeof(bmi) );
    bmi.biSize = sizeof(bmi);
    bmi.biWidth = cairo_image_surface_get_width( imageSurface );
    bmi.biHeight= cairo_image_surface_get_height( imageSurface ); 

    unsigned rowPad = ( 4 - ( ( bmi.biWidth * 3 ) & 3 ) ) & 3;

    bmi.biPlanes = 1;
    bmi.biBitCount = 24; // 24 or 32. If 32, high byte is not used.
    bmi.biCompression = BI_RGB;
    biSizeImage = bmi.biWidth * bmi.biHeight * ( bmi.biBitCount / 8 ) + bmi.biHeight * rowPad;
    bmi.biXPelsPerMeter = (LONG)((double)96 * 100 / 2.54 + 0.5) ; // dpix
    bmi.biYPelsPerMeter = (LONG)((double)96 * 100 / 2.54 + 0.5); // dpiy
    bmi.biClrUsed = 0;
    bmi.biClrImportant = 0;

    HGLOBAL hMem = NULL;
    unsigned char* ptr = 0;
    unsigned size;
    bool success = false;

    // OpenClipboard
    if ( !OpenClipboard(hwnd) ) {
        return false;
    }

    // call EmptyClipboard
    if ( !EmptyClipboard() ) {
        goto error;
    }

    // calculate size of the data.
    size = bmi.biSize + biSizeImage;

    // Allocate the data using GlobalAlloc with GMEM_MOVEABLE flag.
    hMem = GlobalAlloc( GMEM_MOVEABLE, size );

    if ( hMem == NULL ) {
        goto error;
    }

    ptr = (unsigned char*)GlobalLock( hMem );
    if ( ptr == 0 ) {
        goto error;
    }

    // copy data to clipboard
    memcpy( ptr, &bmi, bmi.biSize );

    // copy each row of the bitmap in reverse order, adding padding after each
    // row. The extra braces keep these declarations out of the scope that the
    // "goto error" statements above jump across, which standard C++ requires.
    {
        unsigned char* src = bits + bmi.biWidth * (bmi.biHeight-1) * 4;
        unsigned char* dest = ptr + bmi.biSize;
        for ( int i = 0; i < bmi.biHeight; i++ ) {
            for ( int j = 0; j < bmi.biWidth; j++ ) {
                *dest++ = *src++;
                *dest++ = *src++;
                *dest++ = *src++;
                src++;
            }

            dest += rowPad;
            src -= bmi.biWidth * 4 * 2;
        }
    }

    GlobalUnlock( hMem );

    // Call SetClipboardData
    if ( !SetClipboardData( CF_DIB, hMem ) ) {
        goto error;
    }

    hMem = NULL; 

    success = true;
error:
    if ( hMem ) {
        GlobalFree( hMem );
    }
    CloseClipboard();
    
    return success;

On the plus side, websequencediagrams.com Desktop Edition is coming along very nicely. I implemented Print yesterday. It should be ready soon... Hopefully some companies will buy it. Here's a screenshot.

Update: Of course, by 2011 desktop applications are gone. I made the right call by not finishing the desktop version, and just licensing the whole webserver. More information on this strategy is in C++: A language for next generation web apps.

Simulating freehand drawing with Cairo

2008-07-30T20:49:28-05:00

Have a look at this image. You might think I scrawled it on a napkin and scanned it in. Wrong! It was generated completely automatically by an upcoming release of www.websequencediagrams.com, with the new "napkin" style. Getting it to render this way was easy: it took only a tiny bit of math and a change to my line drawing function. The handwriting font is FG Virgil.

Here's the same diagram in a different style:

Here's the C code that I use for my line drawing function, using the cairo API.

#include <math.h>
#include <stdlib.h>
#include <cairo.h>

void
crazyLine( cairo_t* ctx, double fromX, double fromY, double toX, double toY)
{
    // Crazyline. By Steve Hanov, 2008
    // Released to the public domain.

    // The idea is to draw a curve, setting two control points at random 
    // close to each side of the line. The longer the line, the sloppier it's drawn.
    double control1X, control1Y;
    double control2X, control2Y;

    // calculate the length of the line.
    double length = sqrt( (toX-fromX)*(toX-fromX) + (toY-fromY)*(toY-fromY));
    
    // This offset determines how sloppy the line is drawn. It depends on the 
    // length, but maxes out at 20.
    double offset = length/20;
    if ( offset > 20 ) offset = 20;

    // Overshoot the destination a little, as one might if drawing with a pen.
    toX += ((double)rand()/RAND_MAX)*offset/4;
    toY += ((double)rand()/RAND_MAX)*offset/4;

    double t1X = fromX, t1Y = fromY;
    double t2X = toX, t2Y = toY;

    // t1 and t2 are coordinates of a line shifted under or to the right of 
    // our original.
    t1X += offset;
    t2X += offset;
    t1Y += offset;
    t2Y += offset;

    // create a control point at random along our shifted line.
    double r = (double)rand()/RAND_MAX;
    control1X = t1X + r * (t2X-t1X);
    control1Y = t1Y + r * (t2Y-t1Y);

    // now make t1 and t2 the coordinates of our line shifted above 
    // and to the left of the original.

    t1X = fromX - offset;
    t2X = toX - offset;
    t1Y = fromY - offset;
    t2Y = toY - offset;

    // create a second control point at random along the shifted line.
    r = (double)rand()/RAND_MAX;
    control2X = t1X + r * (t2X-t1X);
    control2Y = t1Y + r * (t2Y-t1Y);

    // draw the line! (this only adds the curve to the current path;
    // call cairo_stroke() afterwards to render it)
    cairo_move_to( ctx, fromX, fromY );
    cairo_curve_to( ctx, control1X, control1Y, control2X, control2Y, toX, toY );
}

Free, Raw Stock Data

2008-06-05T21:23:37-05:00

Why can't anybody write a decent stock screener? Google did, but they left out my favourite exchange, the TSX. The best indicator of whether a stock is going to go up in the medium term is growth in earnings, but it is near impossible to find this information for Canadian stocks. I have tried the one at GlobeInvestor.com, but it seems to be written by an imbecile, and its results are quite random.

Frustrated, I wrote my own tool to pull this information from publicly available sources (Only took about 5 hours). Here, at last, is a text file containing the fundamentals for about 1100 securities on the TSX, as of June, 2008.

Right now I list only Revenue and EPS, because that is what I use to screen stocks. My plan is to analyse this data in the near future, and find the next Research in Motion.

Download the Database

Example and format

svc,q,2008-02-29,Revenue,8290000.00
svc,q,2008-02-29,EPS,-0.05
svc,q,2007-11-30,Revenue,17110000.00
svc,q,2007-11-30,EPS,0.00
svc,q,2007-08-31,Revenue,21180000.00
svc,q,2007-08-31,EPS,0.02
svc,q,2007-05-31,Revenue,20020000.00
svc,q,2007-05-31,EPS,0.08
svc,q,2007-02-28,Revenue,15380000.00
svc,q,2007-02-28,EPS,0.05
svc,a,2007-11-30,Revenue,73680000.00
svc,a,2007-11-30,EPS,0.14
svc,a,2006-11-30,Revenue,31660000.00
svc,a,2006-11-30,EPS,0.00
svc,a,2005-11-30,Revenue,15810000.00
svc,a,2005-11-30,EPS,-0.04
svc,a,2004-11-30,Revenue,3260000.00
svc,a,2004-11-30,EPS,-0.11
svc,a,2003-11-30,Revenue,2040000.00
svc,a,2003-11-30,EPS,-0.07

Each line is a comma separated list of the following fields:

Symbol
q (Quarterly) or a (Annual)
Date of the report
Type of data (Revenue or EPS)
Revenue or EPS, both in dollars.
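
If you want to load the file yourself, here is a minimal sketch in Python, assuming every line has exactly the five fields listed above. The names Fundamental and parse_fundamentals are made up for this illustration; they are not part of the download.

```python
import csv
from collections import namedtuple

# Hypothetical record type: one row of the text file.
Fundamental = namedtuple("Fundamental", "symbol period date kind value")

def parse_fundamentals(lines):
    """Parse lines of the form symbol,q|a,date,Revenue|EPS,number."""
    records = []
    for row in csv.reader(lines):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        symbol, period, date, kind, value = row
        records.append(Fundamental(symbol, period, date, kind, float(value)))
    return records

sample = [
    "svc,q,2008-02-29,Revenue,8290000.00",
    "svc,q,2008-02-29,EPS,-0.05",
    "svc,a,2007-11-30,Revenue,73680000.00",
    "svc,a,2007-11-30,EPS,0.14",
]
records = parse_fundamentals(sample)
annual_eps = [r.value for r in records if r.period == "a" and r.kind == "EPS"]
print(annual_eps)  # [0.14]
```

Since parse_fundamentals takes any iterable of lines, passing an open file object works as well as a list.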

Hottest non-energy/non-mining stocks

So here are the hottest stocks, filtered with the following criteria:

Both Revenue and EPS are increasing for at least two years
P/E ratio is positive and less than 50.
Industry is not resources or energy.
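
Here is a rough sketch of the first two criteria in Python. The counting rule is an assumption: it counts consecutive year-over-year increases ending at the latest annual report, which may not be exactly how the table below was produced. The function names and the P/E values in the usage lines are made up for illustration, and the industry filter is left out.

```python
def years_increasing(values):
    """Count consecutive year-over-year increases, ending at the latest year.

    `values` must be ordered oldest to newest.
    """
    count = 0
    # Walk backwards from the newest pair of years to the oldest.
    for previous, current in zip(reversed(values[:-1]), reversed(values[1:])):
        if current > previous:
            count += 1
        else:
            break
    return count

def passes_screen(revenue, eps, pe_ratio):
    """Revenue and EPS both rising for at least two years, and 0 < P/E < 50."""
    return (years_increasing(revenue) >= 2
            and years_increasing(eps) >= 2
            and 0 < pe_ratio < 50)

# Annual figures for "svc" from the example above, oldest first.
svc_revenue = [2040000.0, 3260000.0, 15810000.0, 31660000.0, 73680000.0]
svc_eps = [-0.07, -0.11, -0.04, 0.00, 0.14]

print(years_increasing(svc_revenue))  # 4
print(years_increasing(svc_eps))      # 3
```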

All data is from Friday, June 6, 2008

Stock	Average Revenue Growth (# years positive)	Average EPS Growth (# years positive)	Industry	Price	Price/Earnings Ratio
Yellow Pages Income Fund (ylo.un)	629% (4)	298% (4)	Communications & Media (Publishing & Printing)	$9.80	10
AltaGas Utility Group Inc. (aui)	165% (2)	127% (2)	Utilities (Gas Utilities)	$6.79	10
Canadian Helicopters Income Fund (chl.un)	140% (2)	629% (2)	Transportation and Environmental Services (Transportation)	$13.00	6
IBI Income Fund (ibg.un)	94% (3)	58% (2)	Business Services (Consulting)	$22.90	7
ADF Group, Inc. (drx)	82% (2)	785% (3)	Industrial Products (Metal Fabricators)	$4.95	5
RuggedCom Inc. (rcm)	78% (3)	35% (2)	Industrial Products (Electrical & Electronic)	$14.25	32
Parkbridge Lifestyle Communities Inc. (prk)	70% (3)	119% (2)	Business Services (Computer Software & Processing)	$5.45	18
Glacier Ventures International Corp. (gvc)	62% (5)	57% (3)	Communications & Media (Publishing & Printing)	$4.10	11
Saxon Financial Inc. (sfi)	62% (5)	31% (5)	Financial Services (Investment Companies and Funds)	$14.20	11
Martinrea International Inc. (mre)	58% (3)	68% (3)	Industrial Products (Metal Fabricators)	$8.08	9
Cargojet Income Fund (cjt.un)	53% (2)	477% (2)	Transportation and Environmental Services (Transportation)	$11.15	24
Pollard Banknote Income Fund (pbl.un)	52% (2)	68% (2)	Industrial Products (Misc. Industrial Products)	$7.59	6
Coast Wholesale Appliances Income Fund (cwa.un)	51% (2)	5% (2)	Consumer Products (Household Goods)	$7.72	8
Parkland Income Fund (pki.un)	49% (5)	209% (3)	Merchandising and Lodging (Specialty Stores)	$11.45	7
Armtec Infrastructure Income Fund (arf.un)	43% (3)	88% (4)	Industrial Products (Misc. Industrial Products)	$24.00	11
World Point Terminals Inc. (wpo)	43% (2)	43% (2)	Transportation and Environmental Services (Transportation)	$14.00	17
Gerdau Ameristeel Corporation (gna)	43% (5)	30% (2)	Industrial Products (Steel)	$18.20	11
Logibec Groupe Informatique Ltd. (lgi)	41% (5)	40% (5)	Business Services (Computer Software & Processing)	$20.00	27
GMP Capital Trust (gmp.un)	38% (5)	39% (3)	Financial Services (Investment Houses)	$16.31	8
Aastra Technologies Limited (aah)	37% (4)	34% (2)	Communications & Media (Telecommunications)	$25.68	12
Sleep Country Canada Income Fund (z.un)	31% (4)	31% (4)	Merchandising and Lodging (Specialty Stores)	$20.00	10
Gemcom Software International Inc. (gcm)	31% (5)	37% (3)	Business Services (Computer Software & Processing)	$2.99	23
Melcor Developments Ltd. (mrd)	30% (4)	42% (4)	Real Estate (Developers)	$14.73	7
Equitable Group Inc. (etc)	29% (5)	28% (4)	Financial Services (Finance and Leasing)	$21.35	8
Stella-Jones Inc. (sj)	29% (4)	53% (4)	Industrial Products (Misc. Industrial Products)	$35.26	17
Canaccord Capital Inc. (cci)	29% (2)	34% (2)	Financial Services (Investment Houses)	$9.97	3
Western Financial Group (wes)	29% (5)	21% (2)	Financial Services (Insurance)	$4.17	20
Sceptre Investment Counsel Limited (sz)	29% (2)	41% (2)	Financial Services (Investment Companies and Funds)	$9.21	17
RDM Corporation (rc)	28% (4)	203% (3)	Business Services (Computer Software & Processing)	$1.60	9
Energy Savings Income Fund (sif.un)	26% (5)	73% (4)	Utilities (Gas Utilities)	$14.79	11
Algoma Central Corporation (alc)	22% (4)	50% (4)	Transportation and Environmental Services (Transportation)	$138.00	9
Marsulex Inc. (mlx)	22% (4)	302% (2)	Transportation and Environmental Services (Environmental)	$13.50	16
Gildan Activewear Inc. (gil)	21% (3)	29% (3)	Consumer Products (Household Goods)	$29.40	21
TSX Group, Inc. (x)	21% (4)	22% (5)	Other Services (Other Services)	$43.82	20
Premium Brands Income Fund (pbi.un)	20% (4)	79% (2)	Consumer Products (Food Processing)	$13.10	8
General Donlee Income Fund (gdi.un)	20% (3)	78% (2)	Industrial Products (Metal Fabricators)	$8.75	8
Computer Modelling Group Ltd. (cmg)	20% (5)	24% (5)	Business Services (Computer Software & Processing)	$18.81	21
The Churchill Corporation (cuq)	19% (5)	123% (2)	Real Estate (Contractors)	$21.38	16
Ritchie Bros. Auctioneers (rba)	18% (5)	29% (3)	Merchandising and Lodging (Specialty Stores)	$26.45	37
Stantec Inc. (stn)	18% (5)	22% (5)	Business Services (Consulting)	$29.14	18
Cogeco Cable Inc. (cca)	17% (5)	72% (2)	Communications & Media (Cable)	$39.95	14
Firm Capital Mortgage Investment Trust (fc.un)	17% (5)	2% (2)	Financial Services (Finance and Leasing)	$10.58	10
Shoppers Drug Mart Corporation (sc)	17% (5)	18% (5)	Merchandising and Lodging (Specialty Stores)	$56.17	23
Velan Inc. (vln)	16% (3)	707% (2)	Industrial Products (Misc. Industrial Products)	$12.21	22
Guardian Capital Group Ltd. (gcg)	16% (5)	42% (4)	Financial Services (Investment Companies and Funds)	$8.46	12
Macdonald Dettwiler & Associates Ltd (mda)	16% (5)	17% (5)	Business Services (Computer Software & Processing)	$42.23	18
WFI Industries Ltd. (wfi)	15% (5)	21% (5)	Consumer Products (Misc. Consumer Products)	$25.95	30
easyhome Ltd. (eh)	13% (5)	42% (2)	Merchandising and Lodging (Specialty Stores)	$16.49	14
Descartes Systems Group Inc. (dsg)	13% (2)	203% (2)	Business Services (Computer Software & Processing)	$3.85	9
Toromont Industries Ltd. (tih)	12% (5)	24% (5)	Merchandising and Lodging (Wholesale Distributors)	$30.61	16
Finning International Inc. (ftt)	12% (5)	29% (3)	Merchandising and Lodging (Wholesale Distributors)	$28.00	17
Linamar Corporation (lnr)	11% (5)	27% (4)	Industrial Products (Transportation Equip. & Compnts)	$16.30	10
IGM Financial Inc. (igm)	11% (4)	12% (5)	Financial Services (Investment Companies and Funds)	$45.84	13
CAE, Inc. (cae)	11% (4)	54% (2)	Industrial Products (Transportation Equip. & Compnts)	$13.34	20
Richelieu Hardware Ltd. (rch)	10% (5)	11% (5)	Merchandising and Lodging (Wholesale Distributors)	$21.25	14
North West Company Fund (nwf.un)	10% (3)	13% (5)	Merchandising and Lodging (Department Stores)	$18.27	14
Thomson Reuters Corporation (toc)	8% (2)	30% (2)	()	$37.48	22
Cossette Communication Group Inc. (kos)	8% (5)	17% (2)	Business Services (Advertising Agencies)	$6.15	7
The Forzani Group Ltd. (fgl)	7% (5)	90% (2)	Merchandising and Lodging (Specialty Stores)	$17.02	12
Leon's Furniture Ltd. (lnf)	7% (5)	13% (4)	Merchandising and Lodging (Specialty Stores)	$12.00	14
CML Healthcare Income Fund (clc.un)	7% (2)	18% (4)	Other Services (Medical Services)	$15.42	13
Canadian Pacific Railway Limited (cp)	6% (4)	25% (4)	Transportation and Environmental Services (Transportation)	$69.03	11
Indigo Books & Music Inc. (idg)	5% (3)	116% (5)	Merchandising and Lodging (Specialty Stores)	$14.49	6
TELUS Corporation (t)	5% (5)	45% (4)	Utilities (Telephone Utilities)	$46.07	11
High Liner Foods Incorporated (hlf)	4% (2)	41% (2)	Consumer Products (Food Processing)	$8.85	20

Why are all my lines fuzzy in cairo?

2008-04-04T13:50:07-05:00

Cairo is the hot new cross platform graphics library. It is becoming very popular, because it solves two outstanding problems in a portable way:

Path based drawing
Antialiasing

Both of these problems are astoundingly hard. You would have to read a whole graphics textbook in order to implement basic drawing and antialiasing. Before cairo, your choices were Win32 GDI based drawing, or whatever GTK uses. In addition, cairo is supported in Python.

The problem is that cairo's coordinate system has a quirk that is not obvious. A lot of users might write a program to draw a line and get this:

#!/usr/bin/python
import cairo

def drawLine( ctx, x1, y1, x2, y2 ):    
    ctx.move_to( x1, y1 )
    ctx.line_to( x2, y2 )
    ctx.set_line_width( 1.0 )
    ctx.stroke()    

surface = cairo.ImageSurface(cairo.FORMAT_RGB24, 32, 32)
ctx = cairo.Context( surface ) 
ctx.set_source_rgb( 1.0, 1.0, 1.0 )
drawLine( ctx, 2, 16, 30, 16 )    
drawLine( ctx, 16, 2, 16, 30 )    
surface.write_to_png( "out.png" )

(Magnified 4 times)

The lines are all fuzzy! Even Inkscape, an otherwise well-polished graphics program, has this naive implementation, and it frustrates users to no end, because all of their lines are fuzzy.

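The reason is that cairo's integer coordinates fall on the boundaries between pixels, not in the middle of a pixel. So when you draw a line through coordinate (2, 16), it really runs along the edge between two rows of pixels. A line of width 1.0 at y = 16 covers the band from 15.5 to 16.5, so it covers half of each of the two neighbouring rows, and antialiasing paints both rows at half intensity.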
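
To make that concrete, here is a small sketch that works out how much of each pixel row a horizontal stroke covers. It does not use cairo at all; it only assumes that pixel row k spans the interval from k to k + 1, as in cairo's default coordinate system. The function name row_coverage is made up for this example.

```python
import math

def row_coverage(y_center, line_width=1.0):
    """Fraction of each pixel row covered by a horizontal stroke.

    Pixel row k is assumed to span the interval [k, k + 1).
    """
    top = y_center - line_width / 2.0
    bottom = y_center + line_width / 2.0
    coverage = {}
    for row in range(int(math.floor(top)), int(math.ceil(bottom))):
        overlap = min(bottom, row + 1) - max(top, row)
        if overlap > 0:
            coverage[row] = overlap
    return coverage

print(row_coverage(16.0))  # {15: 0.5, 16: 0.5}
print(row_coverage(16.5))  # {16: 1.0}
```

At y = 16 the stroke is split evenly across two rows, which is the fuzziness. At y = 16.5 it fills exactly one row.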

The immediate solution is to add 0.5 to all your coordinates. If you are doing more complicated drawing, with varying pen widths and scales, you will have to modify it somewhat. Also, this system breaks down as soon as you scale the image smaller, as adding 0.5 starts to make huge errors in where things are. But for an image that is not scaled smaller, please snap the coordinates to avoid the fuzzy lines, and save the eyesight of your users!

#!/usr/bin/python
import cairo

def snapCoords( ctx, x, y ):
    (xd, yd) = ctx.user_to_device(x, y)
    # snap in device space, then convert back to user space
    return ctx.device_to_user( round(xd) + 0.5, round(yd) + 0.5 )

def drawLine( ctx, x1, y1, x2, y2 ):    
    point1 = snapCoords( ctx, x1, y1 )
    point2 = snapCoords( ctx, x2, y2 )
    ctx.move_to( point1[0], point1[1] )
    ctx.line_to( point2[0], point2[1] )
    ctx.set_line_width( 1.0 )
    ctx.stroke()    

surface = cairo.ImageSurface(cairo.FORMAT_RGB24, 32, 32)
ctx = cairo.Context( surface ) 
ctx.set_source_rgb( 1.0, 1.0, 1.0 )
drawLine( ctx, 2, 16, 30, 16 )    
drawLine( ctx, 16, 2, 16, 30 )    
surface.write_to_png( "out.png" )

A simple command line calculator

2008-03-23T15:28:15-05:00

How many times have you needed to calculate something, for example the decimal value of 0x398A3BB, so you pop up Windows Calculator to convert it? I have, lots of times. The problem is I hate to use the mouse. It takes precious deciseconds away from software development time to remove my hands from the keyboard and use the mouse. That's why I created calc.exe. It's a simple command line calculator (and it's also an example of recursive descent parsing).

Download calc.exe
Download calc.c

Examples

C:\>calc 5+5*5
30.000000

c:\>calc 0x30
48.000000

c:\>calc (123456 % 51)/12
3.000000
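
For readers who have not seen recursive descent parsing before, here is a minimal sketch of the technique in Python. It is not a translation of calc.c; the grammar, the class name Parser, and the function names are assumptions chosen to handle the three examples above. Each grammar rule becomes one method, and operator precedence falls out of which method calls which.

```python
import math
import re

# A token is a hex number, a decimal number, or a single operator character.
TOKEN = re.compile(r"\s*(0[xX][0-9a-fA-F]+|\d+\.?\d*|[-+*/%()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expression(self):
        # expression := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term := factor (('*' | '/' | '%') factor)*
        value = self.factor()
        while self.peek() in ("*", "/", "%"):
            op = self.take()
            right = self.factor()
            if op == "*":
                value *= right
            elif op == "/":
                value /= right
            else:
                value = math.fmod(value, right)
        return value

    def factor(self):
        # factor := number | '(' expression ')' | '-' factor
        token = self.take()
        if token == "(":
            value = self.expression()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        if token == "-":
            return -self.factor()
        if token is None or token in ("+", "*", "/", "%", ")"):
            raise ValueError("expected a number")
        if token[:2].lower() == "0x":
            return float(int(token, 16))
        return float(token)

def calc(text):
    parser = Parser(tokenize(text))
    value = parser.expression()
    if parser.peek() is not None:
        raise ValueError("unexpected token %r" % parser.peek())
    return value

print("%f" % calc("5+5*5"))             # 30.000000
print("%f" % calc("0x30"))              # 48.000000
print("%f" % calc("(123456 % 51)/12"))  # 3.000000
```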

Tool for Creating UML Sequence Diagrams

2008-03-03T19:51:21-05:00

If you have to draw something called "UML Sequence Diagrams" for work or school, you already know that it can take hours to get a diagram to look right. Here's a web site that will save you some time:

www.websequencediagrams.com

You can just write the diagram out in text, click "Draw", and the web site will spit out an image. Then you can tell your boss that you slaved for hours in MS Visio perfecting every line...

Example

Here's an example of what you'd write. Notice that the syntax is very natural.

Alice->Bob: Authentication Request
alt successful case
    Bob->Alice: Authentication Accepted
else some kind of failure
    Bob->Alice: Authentication Failure
    note right of Bob: Bob clears key cache  
end

... And here's the resulting image:

Hey, wait a minute...

Astute readers will notice that I am the author of websequencediagrams.com. Yesterday I added a note pleading with people to blog about it or at least link to it, and I figured that I should practice what I preach. So this article is nothing but a shameless plug for my other web site.

The fact is, the page doesn't have much text on it, so it's hard for people to come across it by searching alone. I added it to Wikipedia and so far that's where most people find it. The more links I have, the more Googlers will find it.

Alternatives

Quick Sequence Diagram Editor	Java program with confusing syntax.
mscgen	Unix command line program. I was inspired by its syntax, but found it overly verbose.
Sequence Diagram Editor	Nice editor, if you love filling in text boxes. $99
Tracemodeler	A worthy competitor, though we have different views on the use of text. Its author, Yanic, and I try to see who is first to mention our tool on web forums.

Exploring sound with Wavelets

2007-12-27T22:39:00-05:00

Here's a program to create scalograms of sound files. Pictured below is the "Windows XP startup sound". See how the individual frequencies have been isolated visually.

I have created a separate web page for this project... please go there.

Download Installer

I've been curious about wavelets since I did a course project on them.

The wavelet transform is similar to Fourier analysis, in that it figures out which frequencies exist in a given signal. The difference is that it adds another dimension to the data. From a 1-D waveform, you will get a 2-D picture. Each row is a frequency, and the columns are times. So you get a picture of how the frequency changes with time.

The discrete wavelet transform (DWT) does speed up the wavelet transform greatly, and mathematically, no information is lost. However, it is not a very good way to look at the data visually. From 1024 samples, you only get 10 frequency bands. There's no way to, for instance, distinguish individual notes in a song. Here's an example of what you'd get from the DWT. Compare it to the first image, and you see how much information is hidden!

Figure 1: Ten frequency bands from 512 samples. Where did all the information go???

Because of the DWT, very few people give the CWT (continuous wavelet transform) a second glance. The library is filled with books on wavelets that spend two pages on the CWT, and then talk for the rest of the book about applying the DWT. As a result, people think the DWT is all there is.

Another technique, called the wavelet packet transform, gives you a little more detail. But at the end of it, if you have 1024 sound samples, you will have 1024 transformed points. The more times you perform the algorithm, the more detail you lose in time (and the image looks like a pixellated mess).

Continuous Wavelet Transform

My program applies the continuous wavelet transform to a wave file that you load in, and lets you zoom in to see the individual frequencies that make up a sound. Give it a try!

One problem with it is that it generates a lot of data. Analyzing that sound took 170 MB of memory, and a couple of minutes on my computer. If you tried it on a 5 minute MP3 file, that's 5 minutes times 60 seconds times 44100 samples per second times 44100/60 frequency bands = 9.7 billion data points, or about 38 GB of floating point data, and that's if you don't use stereo!
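
Spelled out as a quick calculation, assuming each data point is stored as a 4-byte float (the function name cwt_data_points is made up for this example):

```python
def cwt_data_points(minutes, sample_rate=44100, bytes_per_value=4):
    """Estimate the size of the full transform of a mono recording."""
    samples = minutes * 60 * sample_rate   # one column per sample
    bands = sample_rate // 60              # one row per frequency band
    points = samples * bands
    return points, points * bytes_per_value

points, size_in_bytes = cwt_data_points(5)
print(points)                            # 9724050000
print("%.1f GB" % (size_in_bytes / 1e9)) # 38.9 GB
```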

But it does produce some pretty pictures for short files. Here's a closeup view of the famous tada.wav:

Closeup on a small section of tada.wav

The majestic noise of the "c:\windows\media\recycle.wav" paper crumpling makes a great wallpaper.

How it works

The program loads in a wave file using libsnd. If it is stereo or multichannel, the other channels are ignored and only the first channel is used.
When you see "Rendering... 1%" on the screen, the program is busy calculating the wavelet transform. It first calculates some frequency scales, from 2 samples to sampleRate divided by 60 samples long, and goes through them logarithmically (e.g. 2, 4, 8, 16 samples long).
For each scaling factor, it creates a complex wavelet, with a "real" and an "imaginary" part, whose period is that many samples long. The real part is the cosine function multiplied by a gaussian, and the imaginary part is the same thing, but with a sine function. This is known as the Morlet wavelet, and it is exceptionally good for sound analysis due to the sine and cosine basis.
Once it has created the wavelet, it convolves the wavelet with the signal. Convolution is kind of like smearing one signal with another. To speed up the algorithm, I perform convolution by multiplying the Fourier transforms of the signal and the wavelet. After the convolution, we end up with the strength of the wavelet in the signal at each point in time.
The process is repeated for each scale level.
Now we have real and imaginary data samples. The magnitudes of the data samples are converted into a huge device independent bitmap in memory, so it can be displayed on the screen. I hope you have lots of RAM.
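
The steps above can be sketched in a few lines of Python. This is not the program's actual code. It uses a slow direct sum instead of the Fourier transform trick, so that it needs nothing beyond the standard library, and the envelope width (the cycles parameter) and the normalisation are assumptions of mine. The names morlet and scalogram_row are made up for the example.

```python
import math

def morlet(period, cycles=4.0):
    """Complex wavelet: cosine * gaussian (real), sine * gaussian (imaginary).

    `period` is the length of one oscillation in samples. `cycles` sets how
    wide the gaussian envelope is (an assumed value, not from the program).
    """
    sigma = cycles * period / (2 * math.pi)
    half = int(math.ceil(3 * sigma))
    taps = []
    for n in range(-half, half + 1):
        envelope = math.exp(-(n * n) / (2 * sigma * sigma))
        phase = 2 * math.pi * n / period
        taps.append(complex(envelope * math.cos(phase),
                            envelope * math.sin(phase)))
    norm = sum(abs(t) for t in taps)
    return [t / norm for t in taps]

def scalogram_row(signal, period):
    """Strength of the wavelet in the signal at each point in time."""
    wavelet = morlet(period)
    half = len(wavelet) // 2
    row = []
    for t in range(len(signal)):
        acc = 0j
        for k, w in enumerate(wavelet):
            i = t + k - half
            if 0 <= i < len(signal):
                acc += signal[i] * w.conjugate()
        row.append(abs(acc))  # magnitude of the real and imaginary parts
    return row

# A pure tone that repeats every 16 samples.
signal = [math.sin(2 * math.pi * n / 16) for n in range(256)]
periods = [4, 8, 16, 32, 64]  # logarithmic scales
rows = {p: scalogram_row(signal, p) for p in periods}
strongest = max(periods, key=lambda p: rows[p][128])
print(strongest)  # 16
```

Each row of the result is one line of the picture. The row whose wavelet period matches the tone lights up, and the others stay dark.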

Future Work

If I have time in the new year, I'm going to add some fun stuff:

Drag and drop pitch shifting - This is not as easy as moving pixels on the image... first you have to do something called "phase unwrapping". My first cut at a phase unwrapping algorithm didn't work, so I'm trying to translate some Fortran code from a 1981 paper I found. Does anybody have some C code for this???
Boost/Reduce -- Draw a square with the mouse and boost or reduce the strength of that region. This could be great for manual noise elimination and sound retouching, or restoring that old copy of Brahms playing the piano.

UMA and free long distance

2007-07-29T21:47:40-05:00

Last time, I talked about the UMA technology used in some newer cell phones. Some of you might be thinking, these new cell phones work over the Internet. What's to stop me from travelling to another continent, and then making free long distance calls to local numbers back home?

Technically, nothing's stopping you. But in theory, carrier policy might get in the way. UMA technology makes it possible for the carriers to decide what should happen.

You see, when an UMA phone starts up, it is required to scan the cellular network first, before it does anything else. Then, it will try to get UMA service. This will happen even if you've configured the device to only use WiFi.

As part of the registration process towards the UNC (this is the carrier's server, the one that acts like a cell tower, only over the Internet), the mobile will report the identity of the surrounding network. Part of this report is the MCC, or mobile country code of the network. Using this information, the carrier can easily figure out what country you are in. If they have a database of the exact cells in the area, they could figure out where you are to within a 30 km radius too.

If that doesn't work, devices equipped with GPS will generally report your coordinates as well. This is all happening as soon as you power up the phone.

So if there is cellular coverage, your home carrier will be able to figure out where in the world you are. They can then comply with existing roaming agreements with the carrier in the country you are in, or they could just be evil and charge you more, keeping all the money for themselves, since they don't even need the other carrier.

I suppose you could make sure the phone is out of cellular coverage somehow. Maybe you could wrap your hotel room in aluminum foil. But then the mobile will report that too. It will say that it's out of coverage, and your carrier will know that you are not at home from your IP address, and they may refuse you service.
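
To summarize the reasoning so far, here is a purely illustrative sketch of how a carrier might classify a registration. Nothing in it comes from the UMA specifications or from any real carrier: the function name, the return values, and the policy itself are hypothetical. The only facts used are that 310 is a United States mobile country code and 234 is a United Kingdom one. A real check would also need to handle countries that have more than one MCC.

```python
def roaming_decision(home_mcc, home_country, reported_mcc, ip_country):
    """Classify a UMA registration (hypothetical policy, for illustration only).

    reported_mcc is None when the handset reports that it is out of
    cellular coverage.
    """
    if reported_mcc is not None:
        # The handset scanned the cellular network and reported its MCC.
        return "home" if reported_mcc == home_mcc else "roaming"
    # Out of coverage: fall back to the country guessed from the IP address.
    if ip_country == home_country:
        return "home"
    return "refuse-or-roaming"

# A US subscriber (MCC 310) in different situations.
print(roaming_decision("310", "US", "310", "US"))  # home
print(roaming_decision("310", "US", "234", "GB"))  # roaming
print(roaming_decision("310", "US", None, "GB"))   # refuse-or-roaming
```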

Reality

In reality, carriers aren't all that concerned about this yet. It seems like, for the time being, you can get free long distance using this method. It makes sense for the carriers to extend their UMA service abroad, because otherwise you would simply be benefiting a foreign network.

Here's an ideal scheme: Suppose you have a lot of family in the US, but you live in the UK. So you go to the US, sign up for UMA service from T-Mobile, and they give you a handset and an access point. Say thanks, and then go back home. Plug that AP into your existing broadband internet, and you can now make calls to the US at local rates. This assumes that your Internet bandwidth is cheap, however.

So right now, you can beat the system. But when travelling, I'd take an extra roll of aluminum foil, just in case...

UMA's dirty secrets

2007-07-24T19:57:10-05:00

For more UMA answers, see my more recent article.

What's UMA?

Recently, many carriers have started offering UMA, or WiFi phones. These are cell phones with WiFi capabilities. Don't be fooled -- you won't be able to get free calls and run Skype on them. The UMA technology is meant to extend the carrier's cellular network into your home using your broadband internet connection.

How does UMA work?

An UMA phone operates just like a regular cell phone. It can talk to cellular base stations. But it is dual mode, and it also has a WiFi radio on board. When it finds a WiFi access point, it will attempt to connect to your carrier's servers over the Internet. If the connection is successful, it will "Rove in" and begin sending everything over the Internet.

The carrier's server is called an UMA Network Controller (UNC). From the perspective of the phone, the UNC looks just like a regular cell tower, and it talks to it in the same way as a cellular base station, except that everything is wrapped up and forwarded over the Internet. Communicating in this way has some important differences from the way your laptop accesses the Internet.

Hands in your pocket

When you browse the web from your laptop, the data flows from your laptop to the web site you are visiting, with nobody in between. It is different when you are browsing with a cell phone, however. With a cell phone, you get assigned an IP address in your carrier's core network. The IP address is how your handset is identified on the network. For example, when you browse a web site, the IP address lets the web server know who to send the web page back to.

When you are using a cell phone, the idea is that your IP address will stay the same no matter which cellular tower you are at. So, if you are loading a web page and driving down the highway at 120 km/h, you might switch from cell tower to cell tower, but your IP address will remain the same, and your web page will still load. The carriers accomplish this by giving you an IP address in their core network. When you ask for a web page, your request is forwarded through your cell phone company's servers. Your cell phone company actually downloads the web page for you, and then sends it to your phone.

The same thing happens with UMA. You might rove-in to your WiFi connection, but your IP address will remain the same. Your device is still directly connected to the carrier's core network, and the web page loads through your carrier's servers.

So if you wanted to load Skype on your phone to try and make free phone calls, forget it. It would cost you more in data usage charges than you'd save. Also, it's probably technically impossible, due to the amount of extra work your phone has to do.

UMA efficiency

Another important difference between browsing using your laptop and the cell phone is efficiency. Because your laptop is directly connected to the internet, it has a much greater advantage in terms of speed. Your tiny cell phone, however, is burdened with extra protocols that make loading web pages a very costly operation.

When your laptop is transmitting data, the data is broken up into small chunks, called IP packets. These IP packets can then be transmitted directly over the Internet.

Over UMA, however, the situation is very different. IP packets over UMA are transmitted using the same techniques as if they were going over a cell tower. That means that after your web browser forms an IP packet, it has to be transformed into a form that is recognizable by your carrier's servers. The packet will first be broken up into smaller chunks, called frames. Each frame will then have extra information added to it, called headers, that is needed to be understood by your carrier's network. The extra information does not add much, but what is really costly is the security.

Security

Your UMA phone has a direct pipe into your carrier's core network. This requires a lot of security, because your carrier doesn't want just anyone to have this kind of access. So your phone communicates using a special protocol known as IPSec.

You may be familiar with IPSec already. It's used by a lot of companies that issue their employees laptops. If you have to work from home, you might have some kind of security key, and to log in, you'll start up an application called a VPN Client, and then boom, it's as if you were sitting in your cubicle at work, except that you're at home in your underwear.

UMA phones use the same technology. To connect, they form an IPSec tunnel into the carrier's network. Instead of a password, the phone checks that your SIM card is valid and up to date before letting you on.

IPSec provides great security. The packets are encrypted, and it's pretty much impossible to figure out what they mean, what web pages you're browsing, or what you are saying in your phone calls. However, it has a huge cost in terms of overhead. Each packet has to have extra headers added, and then it's encrypted. This encryption can expand the packets by as much as 30%. This means that your web pages will take 30% longer to load vs. using your laptop, even under the best of conditions.

I filed this patent to try to mitigate the problem.

UMA Advantages

If UMA is so inefficient, why use it at all? I am a strong supporter of UMA, despite its flaws. It's great for the consumer, because it gives you better coverage when you are at home. Also, it affects the pricing of your phone. Many carriers have special discounts, or even unlimited calling when you are on UMA.

You see, it's all part of the strategy to get you to use your cell phone at home. Carriers would much prefer you to use your cell phone all the time, so they can squeeze more revenue out of you. This would be beneficial for the consumer too, because rather than paying for a cell phone plus a landline, you ditch your landline and just pay a little more for your cell phone.

But if everybody did this, without UMA, it wouldn't work, because cell towers can only support a few dozen calls at the same time. UMA is a relatively cheap add-on to a carrier's infrastructure, so it makes sense to add it. Adding a new base station to cover dead spots in a neighbourhood costs a quarter million dollars. A WiFi access point, at wholesale rates, costs maybe $30.

Handsets

Early generation handsets, like the Samsung and Nokia, have a few problems. I have read reviews on the Internet and apparently they were horrible and people are asking for their money back.

There are a few reasons behind this. UMA specifications were only finalized as recently as 2005, and unlike the mature GSM specifications, they leave much open to interpretation. They don't address things like when your phone is supposed to rove in and rove out. There isn't an easy way to figure out if your WiFi connection is stronger than your cell tower. Your typical tower transmits at several watts of power because it has to reach tiny cell phones up to 30 km away, but your typical access point transmits at only a tiny fraction of that power. Your cell phone can't just choose the stronger one. There just isn't an easy way to decide which one to use. If your phone chooses to use the WiFi access point, but it's too weak to be used, then quality will suffer.

Quality of service is also a problem. If your laptop is downloading movies using BitTorrent, and you're trying to make a phone call, it just isn't going to work. Theoretically, a technology called Quality of Service is supposed to fix problems like this, but the technology just isn't there today in 2007. Most access points deployed don't support it at all, or they say they support it but it is completely inadequate. So if you are planning on making internet calls while watching videos online, plan on getting an up-to-date AP with decent QoS.

Finally, most people's access points use the default settings set at the factory. That means that they will be using the same WiFi channel, and two of them placed close together (as in an apartment building) will cause interference. Other appliances, like microwave ovens, will also degrade the signal.

But wait...

Some cell phones can send data natively over WiFi, without going through the carrier's servers. The carrier has no way to track this data, so you can send as much as you want. But in practice, you really have no way of knowing whether the phone is using WiFi directly or the UMA connection.

Installing the Latest Debian on an Ancient Laptop

2007-05-19T16:09:36-05:00

The challenge: Install Linux on a really old laptop. The catch: It has only 32 MB of RAM, no network ports, no CD-ROM, and the floppy drive makes creaking noises. Is it possible? Yes. Is it easy? No. Is it useful? Maybe...

Motivation

Why? Like mountain climbers say: because it's there. As an environmental nut, I don't like to throw away things that still work. But I have a PCMCIA network card and I would rather not have to hunt down and install 10 year old drivers to get it to work with Windows 95. The latest Linux definitely supports more hardware out of the box than Windows 95.

The Laptop

I don't know very much about this laptop. I recall that the system information utility in Windows 95 is "MSD" or "MSDIAG", but it either didn't exist in this installation, or my memory is faulty and I was typing the wrong command.

The only thing I do know is that it has 32 MB of RAM, integrated stereo sound and modem, no network card, and no CDROM drive. When it boots, there is no obvious way to enter the BIOS utility. I tried DEL, F8, etc but I figure it doesn't have one.

Ubuntu's Tragic Failure

I tried really hard to install Ubuntu, which I am familiar with on my other machines. The problem is that Ubuntu has removed the ability to install it from floppies. You have to a) use the CDROM or b) do a network boot.

I wasted hours trying to do the network boot, which I have done before for the machine running this web server (The DELL DVD drive died long ago). I followed the instructions to put the network boot CDROM onto another Linux server, and installed TFTP and a DHCP server. But the ancient Compaq laptop presented a problem.

Normally you could go to http://rom-o-matic.net/ and make a boot floppy, which will boot up, detect the network interface, and then do a network boot (which would start the Ubuntu installer). However, the only network card I had was a PCMCIA 3com card that is supposed to be supported by rom-o-matic. But no matter what I did, etherboot would not detect it and would just sit there dumbly.

It may have been possible to use the PLIP (IP over parallel port networking) option and boot over the network using a special cable. But I don't have the cable, and such an install is so uncommon that it would be a miracle if it worked at all.

Debian saves the day

After some research I found that Debian still supports the floppy install option. All you need are 4 floppies for boot.img, root.img, and two additional disks of network drivers. Of course, I only own one floppy disk so I had to keep re-imaging it during the install.

The minimum memory required for installation is 32 MB, so we are in luck. The problem is that the installer enters a "low memory mode" and doesn't load any kernel modules on its own. Instead, it pops up a list and you have to guess what drivers you are going to need. If you are wrong, you can always click "go back" to back up. I went through the four floppy disks that it asked for, and selected anything that looked like IDE (for the hard drive), 3COM, and PCMCIA (for the network card). Actually, at first I didn't select the IDE components. As a result, the installer offered to partition my floppy disk. I went back to add in the hard disk drivers.

Finally, the installer was working. It connected to the network and downloaded and installed the minimum Debian distribution. The only changes I made were to the partitioning. Initially, it offered a 90 MB swap partition. That seemed small, so I increased it to 400 MB, leaving 1.3 GB of disk space for the install.

Hiccups

The installer seemed to freeze at one stage, while "preparing installation report." I rebooted the machine and it worked the second time.

When the laptop boots up, I get lots of kernel messages about failed I/O operations. However, once it starts everything is okay. I did a surface check using e2fsck -c but the errors persist.

Running Programs

Once everything was set up, I installed gvim, xdm, Xorg, and icewm (which is a great window manager that doesn't take up too much space). When I started X for the first time, the screen was red and didn't look right. It turns out that I had to limit it to 16-bit colour and reduce the resolution to 800x600, the native resolution of the screen. Then everything worked.

Battle of the Browsers

Once the graphics were set up, I used good old lynx, the text based browser, to download Firefox. But once I started my favourite browser, I waited, and waited, and waited...

It turns out that Firefox is a memory pig. It took 15 minutes to start, and it takes up over 100 MB of memory to show a blank window. Ugh!

There isn't a lot of choice of browsers out there. Galeon is part of Gnome, and I definitely didn't want any bloated Gnome packages on my lean but slow machine. Instead, I downloaded the latest version of Opera.

Opera starts in only 10 seconds or so. It's usable, if you don't mind waiting a few seconds between clicks. So Opera wins the browser wars for low-resource machines.

vncviewer

Since the laptop is so slow to use, I mostly use it to connect to other machines using xtightvncviewer. For this purpose, it works very well.

Conclusion

For old machines with no CD-ROM drive, you are better off installing Debian than Ubuntu.
Because of Debian's low-memory mode installer, you'd better be a computer expert to pick the right drivers during installation.
The best package manager is "aptitude", but only when run from the command line, because even the text-based GUI is too slow. It keeps pausing to do housekeeping every time you do something.
The best browser for low-resource machines is Opera.
Aside from a minor hiccup with X.org's graphics detection, all hardware works flawlessly.

Experiments in making money online

2007-04-18T19:04:30-05:00

Is it possible to make money on the internet, if you try really hard? I want to find out. I have always been interested in getting money for doing nothing. In an ideal business, you would do some initial work to get a system set up, and then wait for cash to come in. Here are some results, including revenue earned, from:

Shareware
Adware
Adsense
Donations

Shareware

My experiments in shareware have been a dismal failure. In 2004 I created Hotkey Jumpstart, a utility program that lets you start any program or music file by typing a few letters of its name. Even after posting it on dozens of sites, I have a hard time getting downloads, and hardly anybody registers. In two years, I made a total of $25.12 US.

The Association of Shareware Professionals, which I joined for a year, has a few examples of success. WinZip apparently made lots of money, and its creator could earn a living off it. A few others worked well too.

Apparently, software utilities are a bad category for shareware, and they don't do well at all. I think games would work better, because my wife has bought several flash games online. But to create games, you have to use Adobe Flash creator, and it costs $699 to download. That's pretty hard to justify.

It is also possible that Hotkey Jumpstart doesn't even work on most computers. It has worked whenever I tested it, but if it weren't working at all, I doubt that anybody would bother to email. Having a shareware product makes beta testing difficult.

So shareware hasn't worked for me. I think it would work better in these areas:

BlackBerry applications - Web sites like Handango have gotten people used to having to pay to download something, without trying it out first.
Apple shareware - Because there is so little software for Apple, people are still willing to pay for good applications.
Games - Games are easy to monetize. You just have to make extra levels, or put in a time limit.

Adware

In the year 1999, the term "spyware" didn't exist. We had trojans and viruses, and if I had heard the word spyware I would have assumed it was some kind of trojan that steals passwords (this is not what spyware does today). After I released Banshee Screamer Alarm, it was wildly successful because it was free, and I had consciously made it better than other alarm clocks at the time. It was getting thousands of downloads a month.

I got an email from the marketing director of a company called Onflow. Onflow was trying to compete with Macromedia Flash. Their product was better because it allowed smaller downloads. If they could get their browser plugin installed on a lot of browsers, then they could (like Adobe today) charge advertisers hundreds of dollars for their program to create ads. According to this marketing guy, if I included the Onflow installer with Banshee Screamer Alarm, they would pay me 14 cents a download. I accepted, and I included their installer in my program.

A few months later, I got a check for about $1014 US, which I used to buy much needed clothing (my wardrobe at the time consisted of T-shirts that I got by signing up for things online). Then the checks stopped coming. Apparently Onflow went defunct in the tech crash.

So at one time adware was a very successful model. But what about today? I recently researched this topic. We all remember when the Opera browser had banner ads. At one point, pkzip for windows had banner ads too. I searched for ways of including ads in my programs, but all the companies that do this have apparently gone out of business.

The most successful company is Zango Cash, which apparently pays a huge rate for installs (if their web site is to be believed). I refused to work with them, however. After some research, I found that they are the creators of the CoolWebSearch toolbar, which crippled my grandmother's computer. I spent a couple of hours trying to remove it, so I will not inflict this on people even for 40 cents a download.

Web Ads

When I was first promoting Hotkey Jumpstart, I dropped $60 into the Google adsense program for zero return. I read horrible stories about sweat shops that get paid to sit there all day clicking on Google ads. So the entire adsense program stunk to me. However, when I released PhotoWipe I found that my web site was getting thousands of hits a day, so I signed up for adsense.

My main problem was that people didn't have to visit my web site in order to download PhotoWipe. So I modified the installer to open up a "thank-you for downloading PhotoWipe" web page after you install it. (This is also how I track how many downloads vs installs I have).

On that web page, I put in the Google ad for Picasa, which is actually very relevant. It says "Organize your Photos with Google Picasa". So problem solved! Every install gets exposure to the ad. One important remark: Google claims their "referral" program pays "up to" $2 per install. This is a blatant lie. Actually, I get 10-20 cents per install.

One problem was that (as far as I can tell) Google referral ads don't change their language according to the user, but most of my installs were coming from Japan and Spain. My PHP code takes care of that by choosing the ad based on the HTTP_ACCEPT_LANGUAGE value.
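
The PHP code itself is not shown here, so the following is only a minimal sketch of the idea, written in C++ to match the other listings on this blog. The function name pickAdLanguage, the list of supported languages, and the fallback are all hypothetical. As a simplification, it takes the languages in the order the browser lists them and ignores the q-weights.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Return the first supported language found in an Accept-Language value,
// or `fallback` when none matches. Simplification (assumption): entries are
// taken in listed order and q-weights are ignored.
std::string pickAdLanguage(const std::string& header,
                           const std::vector<std::string>& supported,
                           const std::string& fallback)
{
    std::size_t pos = 0;
    while (pos < header.size()) {
        std::size_t end = header.find(',', pos);
        if (end == std::string::npos) end = header.size();

        // Keep only the primary subtag: "ja-JP;q=0.8" -> "ja".
        std::string tag;
        for (std::size_t i = pos; i < end; ++i) {
            unsigned char c = static_cast<unsigned char>(header[i]);
            if (c == ' ' && tag.empty()) continue;   // skip leading spaces
            if (!std::isalpha(c)) break;             // stop at '-', ';', etc.
            tag += static_cast<char>(std::tolower(c));
        }

        for (std::size_t k = 0; k < supported.size(); ++k) {
            if (tag == supported[k]) return supported[k];
        }
        pos = end + 1;
    }
    return fallback;
}
```

Called with the header "ja-JP,ja;q=0.9,en;q=0.8" and ads available in "en", "ja" and "es", this picks "ja". A browser that sends nothing recognizable gets the fallback.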

Results for ads

If all you want to do is pay for bandwidth, it's okay. Right now, people downloading PhotoWipe consume about 1 GB/day, which costs me $1 from my internet provider. I get about 4-5 installs of Google Pack per day, which is just over $1. So I'm just scraping by with a few cents a day of profit.

Once a week or so, somebody clicks on a $1 ad and my profits skyrocket for that day. Also, I seem to get a few dollars more, for a couple of days, whenever PhotoWipe makes it to the front page of a major web site (usually in Japan). But such earnings are short-lived, and the bandwidth costs make up the difference.

Donations

Since my shareware business is failing so badly, I wondered if donation works. At Donation Coder there is a discussion of it. Overall, it doesn't work.

The "Thank-you for installing PhotoWipe" page also has a paypal "donate" button, as does the Help menu. In one month, I have had three donations that total to $18. At 15,000 installs (that's installs, not downloads), that's pretty dismal.

Conclusions

I will continue this experiment and update this entry as new facts come in. Right now, it looks like Google AdSense is the clear winner for keeping up with bandwidth costs, but there is not enough to make a profit. Donations come in second, but there is not enough data, since they come in so sporadically. Shareware fails for the Windows platform, because people won't download it. If I were unscrupulous, I could probably make a few thousand dollars a month with spyware / trojans.

Draw waveforms and hear them

2007-04-11T19:18:09-05:00

A while back I thought it would be interesting to be able to draw arbitrary waveforms and then listen to how they sound. I had an audio engine just lying around, so I whipped up a quick application to do that.

download WaveStudio.exe

Results

In theory, you can make any sound that you want. The results aren't very interesting. You can draw a sine wave and it sounds muffled. Add some jagged edges and the sound starts to sound more raw and high pitched. But it's okay to demonstrate what a sawtooth vs. sine vs. square wave sound like.
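
For reference, here is a minimal sketch of the three textbook shapes. It is not taken from WaveStudio; the names Wave and onePeriod are made up for illustration. It fills a buffer with one period of each shape, scaled to the range -1 to 1.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

enum class Wave { Sine, Square, Sawtooth };

// One period of the chosen waveform, `n` samples in the range [-1, 1].
std::vector<double> onePeriod(Wave shape, int n)
{
    const double kTwoPi = 6.283185307179586;
    std::vector<double> samples;
    for (int i = 0; i < n; ++i) {
        double phase = static_cast<double>(i) / n;   // 0 <= phase < 1
        double value = 0.0;
        switch (shape) {
        case Wave::Sine:     value = std::sin(kTwoPi * phase);    break;
        case Wave::Square:   value = (phase < 0.5) ? 1.0 : -1.0;  break;
        case Wave::Sawtooth: value = 2.0 * phase - 1.0;           break;
        }
        samples.push_back(value);
    }
    return samples;
}
```

Playing the buffer in a loop gives a steady tone. The square and sawtooth shapes have abrupt jumps, which is where the extra high-frequency content of the "raw" sound comes from.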

Future work

It would be better to be able to draw in the time vs. frequency domain, using standard brush painting tools. That way you could come up with more interesting waveforms.

Cell Phones on Airplanes

2007-04-08T20:22:03-05:00

Much ink has been spilled about the use of cell phones on airplanes. Here's the truth, which will be disappointing to conspiracy theorists: Cell phone signals most definitely have an effect on other electronic equipment. Read on for more.

Want proof? Hold your cell phone near your computer speakers. Turn up the volume. Then make a call. You will be able to hear the GSM signal out loud. Although the microwave frequency isn't in the audible range, the envelope of the radio bursts is, so you are hearing a buzz formed by the radio packets going over the air.

You will only hear these bursts when using the older 2G technology. 3G/HSDPA/LTE still interferes with equipment, but the designers specifically took the speaker interference into account. The radio bursts of newer phones are spread out in frequency and time, so that even though they are there, you can't hear them.

Interference

Sure, cell phones will probably work on airplanes. The range of a cell tower is a maximum of 30 km. How fast is the plane traveling? About 885 km/h, which works out to 14.75 km/minute. Cell phones only take about 5-10s to arrange a handover to another cell during a call, so it's certainly possible.
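
To put those numbers side by side, here is a small sketch of the arithmetic. It assumes the best case, where the plane flies straight over the tower and so crosses the full diameter of the cell. The function name is made up for illustration.

```cpp
#include <cassert>

// Minutes needed to fly straight across one cell, passing over the tower.
double minutesToCrossCell(double rangeKm, double speedKmPerHour)
{
    double diameterKm = 2.0 * rangeKm;           // in one side, out the other
    double kmPerMinute = speedKmPerHour / 60.0;
    return diameterKm / kmPerMinute;
}
```

With a 30 km range and 885 km/h, the plane spends about 4 minutes inside the cell, which is far longer than the 5-10 seconds a handover takes.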

The problem is that since you are so far off the ground, and because the airplane is made of metal (impervious to radio signals), your phone will have to use the maximum transmit power. Such powers could easily interfere with radio equipment.

The real reason

There is a theory on the Internet that, while in the air, your cell phone can see far too many different base stations at once, and it can't handle it. In my tests (done without a SIM card, so that no transmission can occur) I have not had good results. As a protocol stack developer on the BlackBerry, I can make it go into a special mode where it shows all of the cell towers that it can see. What I see on commercial airliners is that it may be able to see one or two cells, but they will very quickly disappear. There won't even be enough time to register.

The real dealbreaker is that airliners tend to fly over open space most of the time. Cell towers cover only moderately populated areas, and most of the time, you won't be in range of a tower.

Detecting C++ memory leaks

2007-04-07T19:38:11-05:00

A while ago I had the problem of detecting memory leaks in my code, and I didn't want to spend lots of money on a brittle software package to do that. It's fairly simple to redefine malloc() and free() to your own functions, to track the file and line number of memory leaks. But what about the new() and delete() operators? It's a little more difficult with C++, if you want to figure out the exact line number of a resource leak.

In this article, I'll explain how you can get a stack trace for where your resource leaks occur. This method is for Microsoft Windows. Linux developers are better served with Valgrind.

Download the source code:

MemExample.zip (Sample project)

Overview

We will use #define to replace the standard implementation of malloc() and free() with ones that record the file and line numbers where they are called. That way, we can track where memory leaks occur for allocations made using the standard C allocation functions.
We will overload the new() and delete() operators to record the address of the function that called them, by walking backwards up the stack.
Finally, we will parse the .map file generated by the linker. This will let us figure out where new() and delete() were called based on the return address information.

The header file

The first thing we'll do is have an #ifdef, because memory tracking is inefficient. You'll want to cut it out in release versions of your code.

debug.h:

#ifdef DEBUG_MEM
#include 
#define malloc(A) _dbgmalloc( __FILE__, __LINE__, (A) )
#define free(A) _dbgfree( __FILE__, __LINE__, (A) )
// ... continue with calloc, realloc, strdup, etc.
#endif

Every *.cpp source file in your program should include this file. It's optional, of course. But if you allocate something in a memory-tracked module, and free it in another that doesn't, your program will crash, since it was allocated with _dbgmalloc() and free'd with free() instead of _dbgfree().
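
Here is a minimal, portable sketch of why the macro trick works. It is not the code from MemExample.zip. To keep the example harmless it uses a separate macro name, TRACED_MALLOC, instead of redefining malloc itself, and the names tracedMalloc, g_lastFile and g_lastLine are made up for illustration.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for _dbgmalloc(): remembers the last call site.
const char* g_lastFile = "";
int g_lastLine = 0;

void* tracedMalloc(const char* file, int line, std::size_t size)
{
    g_lastFile = file;
    g_lastLine = line;
    return std::malloc(size);
}

// The macro expands at the call site, so __FILE__ and __LINE__ name the caller.
#define TRACED_MALLOC(size) tracedMalloc(__FILE__, __LINE__, (size))

// Returns the line number on which the allocation below was written.
int demoCallSite()
{
    int expected = __LINE__ + 1;
    void* p = TRACED_MALLOC(16);
    std::free(p);
    return expected;
}
```

After demoCallSite() runs, g_lastLine holds the line of the TRACED_MALLOC call, not a line inside tracedMalloc(). That is exactly the information a leak report needs.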

The implementation for malloc

void*
_dbgmalloc( const char* file, int line, size_t size )
{
    void* ptr;

    if ( !_init ) {
        return malloc( size );
    }

    ptr = add_record( file, line, size );
    if ( ptr == 0 ) {
        dbgprint(( DMEMORY, "Out of memory." ));
        return 0;
    }

    dbgprint(( DMEMORY, "%s:%d: malloc( %d ) [%p]", file, line, size, ptr ));

    return ptr;
}

void _dbgfree( const char* file, int line, void* ptr )
{
    if ( ptr == 0 ) {
        return;
    }

    if ( !_init ) {
        free( ptr );
        return;
    }

    MemBlock* block = (MemBlock*)ptr - 1;
    int size = block->size;

    del_record( file, line, ptr );

    dbgprint(( DMEMORY, "%s:%d: free( [%p], %d )", file, line, ptr, size ));
}

The add_record() and del_record() functions perform the real work of memory tracking. add_record() allocates the requested amount of memory plus space for extra tracking information. The tracking information is stored in the first few bytes of the memory block, and the returned pointer is offset by this amount. We also reserve extra space at the end of the memory block, so that we can detect writes past the end of the array. We write a specific sequence of bytes (here, 0x12345678) at this location. If those bytes have been modified by the time the block is free'd, then your program has done something it shouldn't have, and the del_record() function will complain.

void*
add_record( const char* file, int line, size_t size )
{
    MemBlock* block;
    assert(_init);

    block = (MemBlock*)malloc( sizeof( MemBlock ) + size + 4 );

    if ( block == 0 ) {
        dbgprint(( DMEMORY, "Out of memory." ));
        return 0;
    }

    block->sentry = SENTRY;
    block->size = size;
    block->line = line;
    block->file = _strdup( file );
    if ( 0 == block->file && file ) {
        free( block );
        dbgprint(( DMEMORY, "Out of memory." ));
        return 0;
    }

    memcpy( (char*)block + sizeof(*block) + size, &SENTRY, 
        sizeof( SENTRY ) );

    EnterCriticalSection(&_cs);
    list_add_tail( &_blockList, &block->list );
    LeaveCriticalSection(&_cs);

    return block + 1;
}
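
The guard-value idea can be tried without any of the Windows pieces. The following is a stripped-down sketch under stated assumptions, not the code from MemExample.zip: Header stands in for MemBlock, there is no block list or critical section, and a simple counter replaces the leak report. All of the names are made up for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

const unsigned kSentry = 0x12345678;   // same guard value as the article

struct Header {           // simplified stand-in for MemBlock
    unsigned sentry;      // guard in front of the user data
    std::size_t size;     // bytes the caller asked for
};

int g_liveBlocks = 0;     // blocks allocated but not yet freed

void* guardedAlloc(std::size_t size)
{
    char* raw = static_cast<char*>(
        std::malloc(sizeof(Header) + size + sizeof(kSentry)));
    if (raw == 0) return 0;

    Header* h = reinterpret_cast<Header*>(raw);
    h->sentry = kSentry;
    h->size = size;
    // memcpy avoids an unaligned store for the trailing guard.
    std::memcpy(raw + sizeof(Header) + size, &kSentry, sizeof(kSentry));
    ++g_liveBlocks;
    return raw + sizeof(Header);     // caller sees only the data area
}

// Returns true if both guards were intact when the block was released.
bool guardedFree(void* ptr)
{
    if (ptr == 0) return true;
    char* raw = static_cast<char*>(ptr) - sizeof(Header);
    Header* h = reinterpret_cast<Header*>(raw);

    unsigned tail;
    std::memcpy(&tail, raw + sizeof(Header) + h->size, sizeof(tail));
    bool intact = (h->sentry == kSentry) && (tail == kSentry);

    --g_liveBlocks;
    std::free(raw);
    return intact;
}
```

Writing even one byte past the end of a block changes the trailing guard, so guardedFree() returns false for it. Any value of g_liveBlocks above zero at exit means something leaked.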

What about new?

That's all fine and good for malloc() and free(), and strdup() and _tcsdup() and calloc() and realloc(), but what about C++? When you call malloc() above, you see that the macro puts in the file and line number information, but this is not possible for the new operator. Instead, we will do it the hard way. We'll redefine the new operator and then search up the stack for the caller's address and store that. Later, we'll parse the linker's map file to figure out which function it was from the address.

Here's the implementation for new() and delete(). They are almost the same as malloc() and free() above, except that they record the return address instead of the file and line information.

void* operator new( size_t size ) throw ( std::bad_alloc )
{
    static bool recurse = false;
    void* ret;
    CrashPosition_t pos;
    if ( recurse || !_init) {
        // Tracking is not active yet: fall back to a plain allocation.
        ret = malloc( size );
        if ( ret == 0 ) {
            throw std::bad_alloc();
        }
        return ret;
    }

    EnterCriticalSection(&_cs);
    pos = getFileLine(1);
    if ( pos.file == 0 ) {
        pos.file = pos.function;
    }

    ret = add_record( pos.file, pos.line, size );
    if ( ret == 0 ) {
        dbgprint(( DMEMORY, "Out of memory." ));
        LeaveCriticalSection(&_cs);
        // A throwing operator new must not return a null pointer.
        throw std::bad_alloc();
    }

    dbgprint(( DMEMORY, "%s:%d: new( %d ) [%p]", pos.file, pos.line, size, ret ));
    LeaveCriticalSection(&_cs);
    return ret;
}


/******************************************************************************
 *****************************************************************************/
void operator delete( void* ptr ) throw ()
{
    CrashPosition_t pos;
    
    if ( !_init ) {
        free( ptr );
        return;
    }

    if ( ptr == 0 ) {
        return;
    }
    EnterCriticalSection(&_cs);

    pos = getFileLine(2);
    LeaveCriticalSection(&_cs);

    dbgprint(( DMEMORY, "%s:%d: delete [%p]", pos.file, pos.line, ptr ));
    del_record( pos.file, pos.line, ptr );
}

Walking the stack

Here's where the magic happens. Because file and line number information is not available to the new operator, we will walk the stack in order to record the return address. Later on, we'll figure out the function name where they were called from.

static int 
GetCallStack( unsigned* stack, int max )
{
    unsigned* my_ebp = 0;
    int i;

    __asm {
        mov eax, ebp
        mov dword ptr [my_ebp], eax;
    }

    // It is not safe to use this function in a WIN32 standard exception handler!
    if ( IsBadReadPtr( my_ebp + 1, 4 ) ) {
        return 0;
    }

    stack[0] = *(my_ebp + 1);
    for ( i = 1; i < max; i++ ) {
        unsigned addr;
        if ( IsBadReadPtr( my_ebp, 4 ) ) {
            break;
        }
        my_ebp = (unsigned*)(*my_ebp);

        if ( IsBadReadPtr( my_ebp + 1, 4 ) ) {
            break;
        }

        addr = *(my_ebp + 1);
        if ( addr ) {
            stack[i] = addr;
        } else {
            break;
        }
    }

    return i;
}
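
If the inline assembly is hard to follow, this portable model of the same walk may help. It is only a sketch: the Frame struct is a made-up stand-in for what sits at EBP in a 32-bit frame (the caller's saved EBP, then the return address), and it does none of the IsBadReadPtr() checking that the real function needs.

```cpp
#include <cassert>

// Model of one x86 stack frame as laid out at EBP:
// [EBP + 0] holds the caller's saved EBP, [EBP + 4] the return address.
struct Frame {
    const Frame* callerFrame;   // saved EBP (0 at the outermost frame)
    unsigned returnAddress;     // where execution resumes in the caller
};

// Collect up to `max` return addresses, innermost first.
int walkFrames(const Frame* frame, unsigned* out, int max)
{
    int count = 0;
    while (frame != 0 && count < max && frame->returnAddress != 0) {
        out[count++] = frame->returnAddress;
        frame = frame->callerFrame;   // same step as my_ebp = *my_ebp
    }
    return count;
}
```

Each frame points at the one that called it, so the walk is just a linked-list traversal. The real stack has the same shape only when the compiler keeps frame pointers, which is worth checking in an optimized build.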

Making the map file

So far, for malloc() and free() calls, we have recorded the file and line number information, but for new() and delete() we have only the return address. How do we figure out which function called new() and delete()?

We will induce the linker to create a .map file. Add these options to your makefile when calling link.exe. Replace example with the name of your executable output file. (The debug.cpp code will assume that the map file has the same base name as the executable).

/MAP:example.map /MAPINFO:LINES

Compiler differences

Note: For Microsoft Visual Studio 2005, Microsoft has removed the MAPINFO:LINES option. So you should either use an earlier version of the compiler, or be content without line numbers. You will still have function names.

The Map File

The Map file contains a list of every function in your program, and the exact addresses at which they are loaded. So, using a binary search, we are able to look up a function given an address. I have implemented this process in Mapfile.cpp, which is called directly from debug.cpp.
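
The lookup step can be sketched as follows. This is not the code from Mapfile.cpp. It assumes the map file has already been parsed into a list sorted by address, and the Symbol struct and findFunction name are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Symbol {            // hypothetical: one parsed line of the .map file
    unsigned address;      // load address where the function starts
    std::string name;
};

static bool addressBefore(unsigned address, const Symbol& symbol)
{
    return address < symbol.address;
}

// `symbols` must be sorted by address. Returns the function whose start
// address is the greatest one not above `address`, or "" if none is.
std::string findFunction(const std::vector<Symbol>& symbols, unsigned address)
{
    std::vector<Symbol>::const_iterator it =
        std::upper_bound(symbols.begin(), symbols.end(), address, addressBefore);
    if (it == symbols.begin()) return "";
    --it;
    return it->name;
}
```

A return address lands somewhere in the middle of a function, so the search looks for the last function that starts at or before it. Note that this sketch cannot tell when an address lies beyond the end of the last function.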

Putting it together

When your program exits, the debug.cpp module will automatically execute this cleanup code. The cleanup code will dump out any unfree'd memory chunks.

static void
dump_blocks()
{
    list_entry_t* entry = list_head( &_blockList );
    while( entry != &_blockList ) {
        MemBlock* block = list_entry( entry, MemBlock, list );
        dbgprint(( DMEMLEAK, "Leaked %d bytes from %s:%d [%08x]",
                    block->size, block->file, block->line, block + 1
                 ));
        
        entry = entry->next;
    }

    if ( list_empty(&_blockList ) ) {
        dbgprint(( DMEMLEAK, "No memory leaks detected." ));
    }
}

dbgprint

To see the memory leaks, you will have to implement a debug message handler. I don't have time to explain this right now, but it should be pretty obvious from the source code. Or, you can replace dbgprint() with OutputDebugString(), or printf(), or MessageBox(), or whatever you want.

Enjoy!

What does your phone number spell?

2007-04-07T18:35:22-05:00

You might want to visit DialAbc.com, which has more results. Stay here if you are interested in the theory behind it.

This article was actually written in 2002. Here, I explain a technique for figuring out which words are in which phone numbers. Full C source code is included.

How did the computer do that?

Quick answer:

TrieStore.h
TrieStore.c
main.c
Compile using gcc -o spellophone *.c on Unix/Linux. A competent C programmer can adapt it to run on Windows in about 5 minutes.

Long answer:

Computers are wonderful things. We take for granted that they can do stuff like the above in microseconds, but to non-computer scientists it seems like magic. If you've ever wondered what computer scientists do all day, this will help.

I developed the algorithm for spellophone over Christmas break 2001. It wasn't easy -- I went through three different drafts until I found the right one. Would you believe that the first one took over ten minutes to run on a ten-digit number on my old Pentium 166?

The first algorithm I tried is known as the "brute force" approach. I had the computer go through every possible combination of letters and check it for words. I thought I had a clever way of doing that because I used a trie to store all of the words in the dictionary. A trie is like a very smart parrot. You can shout letters at it, and it will squawk if anything you say forms a word. For example, if you say "D-O-G" at it, it will squawk because DOG is a word. If you then say "M-A" it will squawk again because DOGMA is also a word.
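
For readers who have not met a trie before, here is a minimal sketch of one. It is not TrieStore.c, which is written in C. This version uses std::map for the children to keep it short, and the names TrieNode, addWord and follow are made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

struct TrieNode {
    bool isWord;                        // true if the path to here spells a word
    std::map<char, TrieNode*> next;     // one child per following letter
    TrieNode() : isWord(false) {}
    ~TrieNode() {
        for (std::map<char, TrieNode*>::iterator it = next.begin();
             it != next.end(); ++it) {
            delete it->second;
        }
    }
};

void addWord(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (std::size_t i = 0; i < word.size(); ++i) {
        TrieNode*& child = node->next[word[i]];
        if (child == 0) child = new TrieNode();
        node = child;
    }
    node->isWord = true;
}

// Follow `text` from the root. Returns 0 if no word starts this way;
// otherwise the node reached, whose isWord flag is the parrot's "squawk".
const TrieNode* follow(const TrieNode& root, const std::string& text)
{
    const TrieNode* node = &root;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::map<char, TrieNode*>::const_iterator it = node->next.find(text[i]);
        if (it == node->next.end()) return 0;
        node = it->second;
    }
    return node;
}
```

After adding DOG and DOGMA, follow(root, "DOG") squawks, follow(root, "DOGM") returns a node that is not a word yet, and follow(root, "CAT") returns 0. That last case matters most for speed, because it means no longer string starting with those letters can be a word either.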

Despite the clever use of the trie, the program still ran too slowly. Can you guess why? Think about how many possible letters are in a telephone number.

If you look at a phone keypad, you'll see that each digit has three letters on it. Some of them have four on newer phones. So you can only make three one-letter words with one digit. But if you add a second digit, say "22", you can spell "AA, AB, AC, BA, BB, BC, CA, CB, CC", or 9 words. That's quite a bit more than three! With 10 digits, there are 3^10 = 59,049 possible combinations of letters, and up to 4^10 = 1,048,576 if the digits have four letters each. And for each possible combination that spells actual words, there are lots of different ways those words can be placed together between dashes. So it turns out that the computer might have to go through millions of combinations, trying to pick out the ones that make sense. Unfortunately, even today's computers can't process all of those words fast enough. So I had to find a better way.
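
The count is simply the number of letters on each key multiplied together. Here is a small sketch, assuming the usual keypad layout where 7 and 9 carry four letters and 0 and 1 carry none. The names kKeypad and countLetterings are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Letters printed on each key of a phone keypad; 0 and 1 carry none.
const char* const kKeypad[10] = {
    "", "", "ABC", "DEF", "GHI", "JKL", "MNO", "PQRS", "TUV", "WXYZ"
};

// Number of letter strings a digit string can spell, one letter per digit.
// Digits without letters (0 and 1) are counted as a single fixed symbol.
unsigned long countLetterings(const std::string& digits)
{
    unsigned long total = 1;
    for (std::size_t i = 0; i < digits.size(); ++i) {
        assert(digits[i] >= '0' && digits[i] <= '9');
        unsigned long letters = std::string(kKeypad[digits[i] - '0']).size();
        total *= (letters == 0) ? 1 : letters;
    }
    return total;
}
```

For "22" this gives 9, matching the list above. Ten digits with three letters each give 59,049, before counting the different ways of splitting the result into words and leftover digits.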

Dynamic Programming

I looked at what the computer was doing, and it seemed to me that it was wasting a lot of its time asking the trie about the same words over and over again. For example, if the last three digits in the phone number didn't actually spell anything, it would still check them thousands of times in combination with the first part of the phone number.

Then something struck me. Last year, I had learned of something that was meant to deal with just this type of problem in one of my computer science courses. It's called "Dynamic Programming", or DP for short. Dynamic Programming is a way of programming that solves problems a little bit at a time. It's best for problems where each piece of the problem is built on the previous one, so once you solve all of these little pieces you can put them together and solve the entire puzzle. If I could figure out a way to make DP work, then the computer would only need to check each word once, and the program might work in seconds instead of minutes!

I racked my brain, trying to remember what Professor Chan had said in his thick accent. With DP, you have to make a grid of squares, and each square represents a small part of the solution. After a lot of thinking, I drew a grid on paper. Across the top, I wrote Starting Position, and along the side, I wrote Length of word. Each square would contain all of the words that you could spell if you started on a certain digit and used a certain number of letters.

My hands trembled as I stepped through the algorithm on paper. I used the phone number "78225" because I knew it spelled "QUACK." (I used to work for a company called Quack.com and that was part of their number). Here's what I came up with. The partial words are in normal printing, and the finished words in each square are in bold:

Starting digit
Length	7 (PQRS)	8 (TUV)	2 (ABC)	2 (ABC)	5 (JKL)
1	P, Q, R, S	T, U, V	A, B, C	A, B, C	J, K, L
2	PU, QU, RU, ST, SU	TA, UB, VA	AA, AB, AC, BA, CA	AL, BL, CL	.
3	PUB, PUC, QUA, RUB, STA, SUC	TAB, TAC, VAC	ACK, BAL, CAJ, CAK, CAL	.	.
4	PUBA, QUAC, RUBA, RUBB, STAB, STAC, SUCC	TACK	.	.	.
5	QUACK, STABL, STACK	.	.	.	.

What is good about this is the table could be calculated very quickly. Each square builds on the data that was already processed in the square above. I had done it in five minutes on paper -- the computer could do it in the blink of an eye. Also, it is pretty easy to string all of the words together so that the longest ones appear first.
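
Here is a minimal sketch of how such a table could be filled in. It is not the code from main.c: it uses std::set lookups in place of the trie, and the names kKeys, Table, allPrefixes, buildTable and wordsAt are made up for illustration. Each square is produced by adding one letter to the strings in the square above it and keeping only those that still begin some dictionary word.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

const char* const kKeys[10] = {
    "", "", "ABC", "DEF", "GHI", "JKL", "MNO", "PQRS", "TUV", "WXYZ"
};

typedef std::vector<std::vector<std::set<std::string> > > Table;

// Every prefix of every dictionary word, so dead ends can be dropped early.
std::set<std::string> allPrefixes(const std::set<std::string>& words)
{
    std::set<std::string> prefixes;
    for (std::set<std::string>::const_iterator w = words.begin();
         w != words.end(); ++w) {
        for (std::size_t n = 1; n <= w->size(); ++n) {
            prefixes.insert(w->substr(0, n));
        }
    }
    return prefixes;
}

// table[start][len] holds every dictionary prefix of length `len` that the
// digits beginning at `start` can spell. Each square extends the strings in
// the square above it (len - 1) by one letter, as in the paper grid.
Table buildTable(const std::string& digits, const std::set<std::string>& words)
{
    const std::set<std::string> prefixes = allPrefixes(words);
    const std::size_t n = digits.size();
    Table table(n, std::vector<std::set<std::string> >(n + 1));

    for (std::size_t start = 0; start < n; ++start) {
        table[start][0].insert("");                 // seed: the empty string
        for (std::size_t len = 1; start + len <= n; ++len) {
            const std::string letters = kKeys[digits[start + len - 1] - '0'];
            const std::set<std::string>& above = table[start][len - 1];
            for (std::set<std::string>::const_iterator p = above.begin();
                 p != above.end(); ++p) {
                for (std::size_t k = 0; k < letters.size(); ++k) {
                    std::string candidate = *p + letters[k];
                    if (prefixes.count(candidate)) {
                        table[start][len].insert(candidate);
                    }
                }
            }
        }
    }
    return table;
}

// The finished words (the bold entries) in one square of the table.
std::set<std::string> wordsAt(const Table& table,
                              const std::set<std::string>& words,
                              std::size_t start, std::size_t len)
{
    std::set<std::string> found;
    const std::set<std::string>& square = table[start][len];
    for (std::set<std::string>::const_iterator p = square.begin();
         p != square.end(); ++p) {
        if (words.count(*p)) found.insert(*p);
    }
    return found;
}
```

With a toy dictionary containing QUACK, STACK, STAB, PUB, TACK and AB, and the digits "78225", the square at start 0 and length 5 holds QUACK and STACK, and the square at start 1 and length 4 holds TACK. The contents of the real table depend on the dictionary used.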

Dynamic Programming always involves two steps -- first creating the solution table and then analyzing it to piece the solution together. The way I chose to piece the solution together is pretty simple:

1. Begin at the leftmost column that you haven't worked on yet.
2. Get a word from the bottom-most square.
3. Now put that word in your phone number, and go to the digits that are now left-over at the end. Start at step 1.
4. Once you run out of words, try the next one from the square you chose in step 2 before you went off to the left-over digits. If there are no more words left in the square, continue up-wards. Repeat step 3.
5. Once you have exhausted the left-most column, get rid of it, add the number instead of a letter, and start over until there are no more columns left in the table.

It's harder to explain than to do. Using the above data, you get the following words in this order: QUACK, STACK, STAB-5, PUBA-5, PUB-25, 7-TACK, 78-AB-5

Other work

There are quite a few web sites that do this kind of thing. Phonespell.org has a good description of how their algorithm works, in the F.A.Q. section.

dialabc.com is the nicest web site that I have seen to search for words in phone numbers.

A Rhyming Engine

2007-04-06T10:31:11-05:00

Here's a rhyming engine, written in 1000 lines of C++ code. It uses the freely available Moby dictionary, and full source code is provided. Give it a try. Read on for technical information.

Please try the updated rhyming web site.

Relationship to Rhymebrain.com
I originally wrote this post in 2007. For the past five years I have been researching linguistics and machine learning techniques and created rhymebrain.com. Rhymebrain, at its core, still uses the same techniques, but it does it much faster and supports any human language. Rhymebrain is proprietary. However, the source code presented with this article is released to the public domain.

How it works

Programs involving the English language are quite interesting, because there really are no rules. English is very complicated. A great resource is the Moby project, which is a public domain dictionary. It includes a text file providing word lists, including pronunciation and parts of speech.

Although I demonstrate only the rhyming part, my WordDatabase object uses both the part of speech and pronunciation information. This is for future expansion. Download source code here

Overview

As part of compilation, a perl script combines the Parts Of Speech and Pronunciation dictionary into a single file, dict.txt.
The rhyming engine reads dict.txt and, for every word, creates a Word object.
The Word object has the part of speech. It also contains a representation of the pronunciation of the word.
To figure out if two words rhyme, we compare the last part of their pronunciation. The more syllables that rhyme, the better the rhyme is.
The output is sorted so that better rhymes are listed first.

Similar projects

I'm not aware of many similar projects. Somebody named "tuffy" made a rhyming dictionary on sourceforge: rhyme.sourceforge.net. However, for the life of me, I can't figure out why it needs to use an external database package. My technique does not need to pre-compute the rhymes, and it is half as many lines of code.

Personally, I use the rhymezone.com rhyming dictionary for my rhyming needs.

The Rhyming API

class WordDatabase
{
public:
    WordDatabase();
    ~WordDatabase();

    bool load(const char* filename);
    bool findRhymes( DynamicArray& results, const char* word, WordFilter* filter,
            WordArray* wordList = 0);
    Word* lookup( const char* whichword );
    void filter( WordArray& results, WordFilter* filter );
    bool makeWords( WordArray& results, const char* text );
    bool loadThesaurus( const char* thesaurus );
    void addSynonym( Word* base, Word* synonym );
    unsigned getSynonyms( WordArray& results, const char* wordtext );

private:	
    StringMap _wordMap;
    DynamicArray _wordArray;
};

In this introduction, I use the word phoneme to mean a part of a word, as pronounced. Using a sequence of phonemes, and whether each phoneme is emphasized, you can completely describe how to pronounce a word, and hence derive its rhymes and syllables.

Preprocessing

The Moby project has separate files for parts of speech and pronunciation. So I wrote a perl script to combine the two files. The resulting word list only includes words that have both part of speech and pronunciation information.

From the readme file, I believe that the Moby pronunciation was derived by a British person. Fortunately, for the purposes of rhyming, it doesn't really matter. "Potato" will rhyme with "Tomato" no matter if you say "Po-tay-to" or "po-tah-toh".

A standard set of phonemes

For this project, I also leave it open to include the CMU dictionary, another free online dictionary. The problem is that these two dictionaries use different sets of phonemes. I had to figure out a mapping so that I could possibly combine both dictionaries later on. The mapping is below. The fields are:

An enumeration for the phoneme
How the phoneme is displayed. This is for debugging purposes only, and is different from what is read in from the dictionary file.
Whether it represents a syllable.

PhoneSet_t PhoneSet[] = {
    { a_in_dab,       "ae",  1 },
    { a_in_air,       "ey",  1 },
    { a_in_far,       "ao",  1 },
    { a_in_day,       "ay",  1 },
    { a_in_ado,       "ah",  1 },
    { ir_in_tire,     "ire", 1 },
    { b_in_nab,       "b",   0 },
    { ch_in_ouch,     "ch",  0 },
    { d_in_pod,       "d",   0 },
    { e_in_red,       "e",   1 },
    { e_in_see,       "ee",  1 },
    { f_in_elf,       "f",   0 },
    { g_in_fig,       "g",   0 },
    { h_in_had,       "h",   0 },
    { w_in_white,     "wh",  0 },
    { i_in_hid,       "i",   1 },
    { i_in_ice,       "eye", 1 },
    { g_in_vegetably, "g",   0 },
    { c_in_act,       "k",   0 },
    { l_in_ail,       "l",   0 },
    { m_in_aim,       "m",   0 },
    { ng_in_bang,     "ng",  0 },
    { n_in_and,       "n",   0 },
    { oi_in_oil,      "oy",  1 },
    { o_in_bob,       "aa",  1 },
    { ow_in_how,      "ow",  1 },
    { o_in_dog,       "ah",  1 },
    { o_in_boat,      "oh",  1 },
    { oo_in_too,      "oo",  1 },
    { oo_in_book,     "ooh", 1 },
    { p_in_imp,       "p",   0 },
    { r_in_ire,       "er",  0 },
    { sh_in_she,      "sh",  0 },
    { s_in_sip,       "s",   0 },
    { th_in_bath,     "dth", 0 },
    { th_in_the,      "th",  0 },
    { t_in_tap,       "t",   0 },
    { u_in_cup,       "uh",  1 },
    { u_in_burn,      "u",   1 },
    { v_in_average,   "v",   0 },
    { w_in_win,       "w",   0 },
    { y_in_you,       "y",   0 },
    { s_in_vision,    "zh",  0 },
    { z_in_zoo,       "z",   0 },
    { a_in_ami,       "a",   1 },
    { n_in_francoise, "n",   0 },
    { r_in_der,       "r",   0 },
    { ch_in_bach,     "chh", 0 },
    { eu_in_bleu,     "eu",  1 },
    { u_in_duboise,   "u",   1 },
    { wa_in_noire,    "WA",  1 }
};

The combined dictionary

Here are the first few lines of the combined dictionary, which includes 110000 words. Each line has three parts. The first is the word, the second is the part of speech (N is noun, etc.) and the third is the pronunciation. Right now, I use only the Moby dictionary, so the pronunciation keys are weird characters as described in the Moby readme file. Each phoneme is separated by a slash character.

A N /eI/
AWOL A '/eI/w/A/l
Aachen N '/A/k/@/n
Aalborg N '/O/lb/O/rg
Aalesund N '/O/l/@/,s/U/n
Aalst N /A/lst
Aalto N '/A/lt/O/
Aar N /A/r
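Splitting such a line into its three fields could be sketched like this. DictEntry and parseDictLine are hypothetical names, not taken from the released source, and the sketch assumes the fields are separated by whitespace and that no field contains a space.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical holder for one line of dict.txt.
struct DictEntry {
    std::string word;
    std::string partOfSpeech;
    std::string pronunciation;   // left in the raw Moby notation
};

// Returns false if the line does not have all three fields.
bool parseDictLine(const std::string& line, DictEntry& out)
{
    std::istringstream in(line);
    return static_cast<bool>(in >> out.word >> out.partOfSpeech >> out.pronunciation);
}
```

The pronunciation is kept as an opaque string here; turning it into phonemes is a separate step.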

Syllables

It took a bit of thinking to figure out how to count the syllables in a word. Believe it or not, it can be derived from the phonemes alone. The first approach I tried was to split the word into syllables based on consonants. This didn't work. In fact, you get the best results if you split based on vowels. (Linguists reading this are now shouting at the monitor, "Of course, you fool!") In the PhoneSet table above, I mark a phoneme with a 1 if it represents a new syllable. The number of syllables in a word can be derived by adding up the values of its phonemes.
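A minimal sketch of that sum follows. PhoneInfo is a hypothetical, cut-down stand-in for a PhoneSet entry, keeping only the display string and the syllable flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, cut-down PhoneSet entry.
struct PhoneInfo {
    const char* display;   // debug spelling, e.g. "oh"
    int isSyllable;        // 1 for vowel sounds, 0 for consonants
};

// Count syllables by summing the syllable flag of every phoneme.
int countSyllables(const std::vector<PhoneInfo>& phonemes)
{
    int total = 0;
    for (std::size_t i = 0; i < phonemes.size(); ++i) {
        total += phonemes[i].isSyllable;
    }
    return total;
}
```

For "potato", spelled as p, oh, t, ay, t, oh, the three vowel phonemes carry a 1 and the consonants carry a 0, so the sum is 3.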

Representing words

On startup, the program takes a second to read the dictionary file. It parses the phonemes (separated by slash characters) into a word structure, where each phoneme is represented by a 16 bit number. The upper 2 bits of each number are:

0 if the phoneme doesn't begin a new syllable
1 if the phoneme is a secondary stress
2 if the phoneme is of primary stress
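One way to pack such a value is sketched below. The exact layout is an assumption: the text only says the upper 2 bits hold the code, so the sketch puts the phoneme enumeration in the lower 14 bits. The names packPhoneme, stressOf and phonemeOf are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bits 15..14 hold the code from the list above,
// bits 13..0 hold the phoneme enumeration value.
const unsigned STRESS_SHIFT = 14;
const std::uint16_t PHONEME_MASK = 0x3FFF;

enum Stress { NoSyllable = 0, Secondary = 1, Primary = 2 };

std::uint16_t packPhoneme(std::uint16_t phonemeId, Stress stress)
{
    unsigned packed = (static_cast<unsigned>(stress) << STRESS_SHIFT)
                    | (phonemeId & PHONEME_MASK);
    return static_cast<std::uint16_t>(packed);
}

Stress stressOf(std::uint16_t packed)
{
    return static_cast<Stress>(packed >> STRESS_SHIFT);
}

std::uint16_t phonemeOf(std::uint16_t packed)
{
    return static_cast<std::uint16_t>(packed & PHONEME_MASK);
}
```

Because the stress code lives in the same number as the phoneme, two phonemes compare equal only when both the sound and the stress match.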

Rhyming

Rhyming is not very complex once the words are in the phoneme format. To determine whether two words rhyme, you simply compare the suffixes of their pronunciation. Stresses are included, as well. Amateur song writers often try to rhyme things like "hello" and "yellow" and they come up with horrible lyrics, since the words are stressed differently. Since the stresses are stored with the phonemes, a simple string comparison will take care of this automatically, and these words will not be found to rhyme.
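The suffix comparison could look roughly like this. It is a sketch rather than the code from the download: Pronunciation and matchingSuffixLength are hypothetical names, and each element is assumed to be a 16 bit phoneme value with the stress stored in its upper bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint16_t> Pronunciation;

// Count how many phonemes match, walking backwards from the end of both words.
std::size_t matchingSuffixLength(const Pronunciation& a, const Pronunciation& b)
{
    std::size_t n = 0;
    while (n < a.size() && n < b.size() &&
           a[a.size() - 1 - n] == b[b.size() - 1 - n]) {
        ++n;
    }
    return n;
}
```

A longer matching suffix means a better rhyme, so results can be sorted by this length. With "hello" and "yellow", the final vowel is the same sound but carries a different stress value, so the very first comparison fails and the length is zero.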

Other features

Although I didn't include it here, the word database can optionally load in the Moby thesaurus, "mobythes.aur". Thus you can figure out if any word has the same meaning as any other.

You can use a "WordFilter" object to filter the results to have a certain number of syllables, or match a set of stresses. For example, you can request a noun phrase of 10 syllables to match the stresses "0101010101", which is iambic pentameter.

Future work

Lots of stuff can be done. I have noticed that in hip hop and rap music, the rhyming rules are relaxed greatly. For example, Eminem might consider time and spine to be good rhymes. By redefining more phonemes to be identical, you can emulate this type of rhyming.
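As a sketch of that idea, the comparison can map each phoneme to a representative before testing for equality. The enumerator names below come from the PhoneSet table, but the relax function and the choice to merge m and n are assumptions made for this example, and stress is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum Phone { i_in_ice, m_in_aim, n_in_and, p_in_imp, s_in_sip, t_in_tap };

// Hypothetical equivalence: treat the nasal consonants m and n as one sound.
Phone relax(Phone p)
{
    return p == n_in_and ? m_in_aim : p;
}

// Suffix comparison on the relaxed phonemes instead of the exact ones.
std::size_t looseSuffixLength(const std::vector<Phone>& a, const std::vector<Phone>& b)
{
    std::size_t n = 0;
    while (n < a.size() && n < b.size() &&
           relax(a[a.size() - 1 - n]) == relax(b[b.size() - 1 - n])) {
        ++n;
    }
    return n;
}
```

With this mapping, "time" (t, i_in_ice, m) and "spine" (s, p, i_in_ice, n) share a two-phoneme suffix, where an exact comparison would find none.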

Rules for Effective C++

2007-04-06T10:25:28-05:00

I used to be a strong supporter of C++. It was the perfect language. In C++, if you want to influence how the hardware instructions are generated, you can do that. If you want to program without pointers and without caring about how memory is allocated, you can do that.

Recently, however, my views have changed after reading Scott Meyers' book, Effective C++. In Meyers' book, he goes through every feature of C++ and shows you how you have to program with extreme care to avoid undefined behaviour. It seems like every modern feature that C++ has was specifically designed to help you shoot yourself in the foot.

I never realized this before, because I simply never use these dangerous features. In this article, I'll show you how to program in C++ safely.

C++'s Broken Exceptions

Take exceptions, for example. Many programmers will tell you that they are a great idea. They let you indicate errors when creating objects, and avoid making you check return codes. You can handle errors in the areas of the code that are prepared to handle them.

What you may not know is that using exceptions in C++ makes a lot of code unsafe. In effect, it means that you cannot use pointers. Take this code, for example:

void foo()
{
    MyObject* obj = new MyObject();

    bar();

    delete obj;    
}

If you are a C++ programmer and you use exceptions, you should see the obvious memory leak. If bar() throws an exception, or calls any function that throws an exception, then obj will not be deleted.

Steve's rules for effective C++

I have been programming in C++ for a decade, and I never realized these flaws until I read Meyers' books. I find C++ to be just fine, because I program in a style that doesn't involve these pitfalls. Here's how you can program in this way too:

Avoid exceptions

Exceptions will only leave you open to memory and resource leaks. Don't use them. The exception to this rule is if you are programming in a style that doesn't use raw pointers, and everything is encapsulated in smart pointers.
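To illustrate the smart pointer style, here is the earlier foo() rewritten with std::unique_ptr. This assumes a C++11 compiler, which postdates this article. The liveObjects counter, the throwing bar() and the fooThrew() helper are made up so that the cleanup can be observed.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

static int liveObjects = 0;   // hypothetical counter, only here to observe cleanup

struct MyObject {
    MyObject()  { ++liveObjects; }
    ~MyObject() { --liveObjects; }
};

// Stand-in for any call that may throw.
void bar()
{
    throw std::runtime_error("bar failed");
}

void foo()
{
    std::unique_ptr<MyObject> obj(new MyObject());

    bar();   // throws; obj's destructor still runs while the stack unwinds

    // no explicit delete needed
}

// Calls foo() and reports whether the exception reached the caller.
bool fooThrew()
{
    try {
        foo();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

After fooThrew() returns, liveObjects is back to zero, whereas the raw pointer version would have leaked the object.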

Constructors should do nothing

Constructors have no way of returning an error code (unless you use exceptions, which are bad). That means that your constructors shouldn't do any real work. Don't try to open up a database connection, or call any functions that could fail. Constructors should be used only to initialize data members.
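A sketch of this two-step style follows, in the same spirit as WordDatabase::load() above. LogFile is a hypothetical class invented for the example.

```cpp
#include <cassert>
#include <cstdio>

class LogFile
{
public:
    LogFile() : _file(0) {}          // only initializes members; cannot fail
    ~LogFile() { close(); }

    // The real work happens here and reports failure through the return value.
    bool open(const char* path)
    {
        close();
        _file = std::fopen(path, "r");
        return _file != 0;
    }

    void close()
    {
        if (_file) {
            std::fclose(_file);
            _file = 0;
        }
    }

    bool isOpen() const { return _file != 0; }

private:
    LogFile(const LogFile&);             // declared but not defined: no copying
    LogFile& operator=(const LogFile&);

    std::FILE* _file;
};
```

The caller constructs the object, then checks the result of open() the same way it would check any other return code.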

Use copy constructors sparingly

Copy constructors are very error prone, because they are another thing that you have to remember to change if you add a data member. You're much better off if you don't allow copying at all. Just pass pointers around. If you must pass by value, then don't put anything in your object, like pointers, that will require special handling. That way, you can use the automatically generated copy constructor. Unlike you, the compiler will never forget anything.
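Both halves of this rule can be sketched as follows, assuming a C++11 compiler for the "= delete" syntax (older code would declare the copy operations private instead). Connection and Point are hypothetical classes.

```cpp
#include <cassert>
#include <type_traits>

// Owns something that needs special handling, so copying is disallowed.
class Connection
{
public:
    Connection() : _handle(0) {}
    Connection(const Connection&) = delete;
    Connection& operator=(const Connection&) = delete;

    int handle() const { return _handle; }

private:
    int _handle;
};

// Plain data with no pointers: the compiler-generated copy is always correct.
struct Point {
    int x;
    int y;
};

// Pass non-copyable objects around by pointer.
int handleOf(const Connection* c)
{
    return c->handle();
}
```

Any attempt to copy a Connection now fails at compile time, while Point can be passed by value without writing a copy constructor at all.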

Use malloc or new without checking the result

I used to write programs that checked every call to malloc() and new() for failure. In Microsoft C++, the new() operator will actually throw an exception if it fails, so checking for NULL is useless anyway. Today's machines have gigabytes of memory, and you don't need to verify every call to malloc() or new().

It's actually quite hard to induce these functions to fail, so even if you did handle their failure, you probably wouldn't test it. Do you really want to be releasing code that you haven't tested? There are cases where it would be better for your program to crash than to continue to operate in an undefined state.

However, there are times where I would check whether a memory allocation failed:

When you are allocating something that is several megabytes, like space for images or files. In this case, it is quite possible for the allocation to fail if the user has opened up too many files in your program.
When you are programming for a nuclear reactor, or space shuttle. Also, a missile guidance system would be acceptable.
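For the large allocation case, one way to get a checkable result is the nothrow form of new, which returns a null pointer instead of throwing. The function name and its use for image data are assumptions for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Returns a buffer of the requested size, or a null pointer if the
// allocation failed. The caller releases it with delete[].
unsigned char* allocateImageBuffer(std::size_t bytes)
{
    return new (std::nothrow) unsigned char[bytes];
}
```

The caller can then tell the user that the image is too large to open, instead of crashing or unwinding through code that is not prepared for an exception.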

When you are programming for an embedded device, however, it might be beneficial to not check the return code of malloc. This applies when there should be enough memory in the heap for all operations. If your process silently fails when memory allocation fails, you might never catch a memory leak that is exhausting your heap. It is much better to fail catastrophically by trying to use the NULL pointer than to fail silently.

Cell Phone Secrets

2007-04-03T17:00:27-05:00

I am a mobile telecommunications "engineer", and I thought I'd explain what to look for in a cell phone. Most guides will review phones on their user interface, but pay little attention to one of the most important pieces: the radio. The radio on GSM cell phones is very mysterious to most people, so here is a guide on how to decode the features of cell phones.

Bands

The bands your mobile phone supports will depend on the region in which you live. In the year 2006, four bands are in common use: 850, 1900, 900, and 1800 MHz. The 850/1900 bands are used in North America, while the rest of the world (with a few exceptions) uses the 900/1800 bands. Your carrier will likely deploy either 850/1900 or 900/1800 in your area, so for maximum coverage your phone should support the two bands in use in your region.

If you are traveling, you will want a Tri-Band or Quad-Band phone. A Tri-Band phone supports the two bands in your country, plus one other one. It may also be advertised as a "World Band" phone. A Quad-Band phone will support all four bands.

All phones sold today should automatically detect and switch between the bands.

GPRS/EDGE

GPRS refers to the ability of the phone to transfer packet data (for example, emails, ring tones, and web pages). The speed of the connection is related to the multislot class. EDGE is an enhancement to GPRS that allows faster data rates, comparable to broadband connections. If all you want is phone calls, you should ask your carrier to disable your packet access, to avoid incurring usage charges by accident.

UMTS/3G

UMTS handsets, also called "3G", support a new technology that will theoretically give you faster data access than EDGE. However, because they are a newer technology, their battery life will be much shorter than that of a GSM-only phone. In addition, UMTS base stations are deployed only in major metropolitan centres. Recently, in the Blackjack phone, users discovered that if they turned off the 3G feature, their battery life doubled!

If you are only making phone calls, you don't need this. It helps the carriers because it moves phone calls off of their congested GSM cells, onto their UMTS cells that hardly anybody uses at the moment. If you do want a data modem, you should consider it, because the higher data speeds will be noticeable.

Multislot class

The multislot class of your phone determines how quickly it will transmit or receive packet data. (For example, emails and web pages, but not voice or SMS messages.) Most phones will be Multislot Class 10, which means they can receive on four different channels simultaneously, or send on two different channels. Higher multislot classes allow more channels, and thus will be faster. However, if the cell tower is being used by more than a few mobiles at the same time, this won't make any difference, because it will run out of channels.

Multislot class only applies to packet data, like web pages or picture messages. Phone calls only use one slot, anyway.

Dual Mode

A dual-mode handset will support two different radio technologies and switch between them when appropriate. For example, because UMTS base stations are deployed only in major cities, a UMTS handheld will probably be able to fall back on GPRS technology when you roam away from the city.

Dual transfer mode

Dual transfer mode handsets are expected to be deployed to some networks in 2007. With dual transfer mode, you will be able to transfer packet data during a phone call. This is something already supported by UMTS but not GSM, so there is a greater push for it in GSM phones.

Flip-Phone

Audio engineers love flip phones, because they bring the microphone closer to the mouth. Companies spend millions of dollars, and hire lots of Ph.D's to try to get the microphone to work when it is on your cheek, but a handset designer will tell you that you can get the best audio quality with a flip phone.

Data Modem

If your phone has data modem capabilities, you will be able to attach it to your computer and use it as a modem. However, be aware that your transfer speeds will be limited. GPRS/EDGE phones have an inherent limitation: The very first packet that you send (after a break of about 5 seconds) will take up to 1.5 seconds to start the transfer, although subsequent packets will be faster. This makes GPRS modems inefficient for the TCP protocol used by all networked PCs today. However, you should still be able to browse at speeds similar to dial-up.

Make sure to read the fine print: Your "unlimited" plan may actually only include a few MB/month, with steep charges if you go over the limit. The limit is especially troubling, because Microsoft Windows will typically send several megabytes of data in a few minutes, because a lot of the software that you have will constantly be checking for updates.

Talk and Standby Time

Talk time and standby time are tested in a standardized way. Because they are not real conditions, handset manufacturers can employ certain tricks to get a better rating. Look for a talk time of at least 4 hours, and a standby time of at least 10 days, whether you need it or not.

All rechargeable batteries have a limited life. They are killed by both heat and time. You should get a lithium battery, which will last for several years. Nickel-Metal-Hydride (NiMH) batteries are good too, but exhibit a "memory-effect", which means that you should charge them only when they are close to empty. NEVER leave a battery in the car in hot weather, because this shaves months off of its life. When you are replacing it, buy only a newly manufactured battery, because they will slowly die even if left in the package. For this reason, don't bother getting a spare when you buy the phone. If you do want to store your battery for a period of time, discharge it to 40%, and keep it in the refrigerator inside a sealed plastic bag. Make sure it is dry, or the contacts will rust.

The more gadgets your phone has, the faster the battery will run out. External memory slots, GPS, and Wi-Fi are all things that will suck the juice out of your battery. Also, if you live on the fringes of coverage and get only 1 to 3 bars of service, your battery will only last a couple of days because the phone will have to transmit at maximum power. GSM phones have to update with the network every 5-15 minutes, so they will consume power even if you are not using them.

Wi-fi

The industry is working on a new feature, called GAN or UMA, which is already available in some areas. With a UMA phone, the carrier will also sell you Internet access and give you a wireless access point. While in your home, your calls will go through the Wi-Fi connection. When you leave your house, the calls will be handed over to the cellular network (even if you are talking). This means that you will always get full service in your home, but it will also reduce the standby time. You should get it only if you want to buy Internet service from your carrier.
