Finding Bieber: On removing duplicates from a set of documents
Using a locality sensitive hash, you can mark duplicates in millions of items in no time.
Let's read a Truetype font file from scratch
A Quick Measure of Sortedness
How do you measure the "sortedness" of a list? There are several ways. In the literature this measure is called the "distance to monotonicity" or the "measure of disorder" depending on who you read. Here, I propose another measure for sortedness.
My thoughts on various programming languages
Some ill-informed remarks on various programming languages.
The strange man reading a novel in the meeting room
Why is a visitor reading a novel all week in the meeting room?
You can cheat so your web site seems faster than it is
You can make your web site seem faster without actually being faster.
Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US
After many months, your software sale is complete! You've got a purchase order, sent the invoice, delivered the software. You're already handling some support issues from users at BigCorp. Then BANG! Martha from Procurement emails back, as a favour, just to let you know that BigCorp has not received your W8 form with a valid tax id, and therefore will be withholding 30% of the purchase price of your multi-thousand dollar product for taxes.
Asana's shocking pricing practices, and how you can get away with it too
If one apple costs $1, how much would five apples cost? How about 500? In everyday life, when you buy more of something, you get more bananas for your buck. But software companies are bucking the trend.
5 Ways PowToon Made Me Want to Buy Their Software
Even though I saw through their tricks at every step along the way, I am now a customer and proud of it. It is worthwhile to look at what they did, because these are simple things that you can do to improve your software business.
How I run my business selling software to Americans
Here's what you can do to get the most out of your business in Canada if all of your revenue comes in US dollars.
0, 1, Many, a Zillion
It's common wisdom that there should only be three numbers in source code. But there are actually four. Here's why.
Give your Commodore 64 new life with an SD card reader
Dust off your old Commodore 64, and you could be the coolest kid on the block by plugging SD cards into it instead of floppies.
20 lines of code that will beat A/B testing every time
A/B testing is used far too often, for something that performs so badly. It is defective by design: Segment users into two groups. Show the A group the old, tried and true stuff. Show the B group the new whiz-bang design with the bigger buttons and slightly different copy. After a while, take a look at the stats and figure out which group presses the button more often. Sounds good, right? The problem is staring you in the face. It is the same dilemma faced by researchers administering drug studies. During drug trials, you can only give half the patients the life saving treatment. The others get sugar water. If the treatment works, group B lost out. This sacrifice is made to get good data. But it doesn't have to be this way.
VP trees: A data structure for finding stuff fast
Let's say you have millions of pictures of faces tagged with names. Given a new photo, how do you find the name of the person that the photo most resembles?
In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in high dimensional space. When you look at points in space, they tend to form clusters, and you can infer a lot by looking at the ones nearby.
Why you should go to the Business of Software Conference Next Year
Most people, having already paid $2000.00 of their hard earned money, and then having flown, driven, or otherwise travelled to Boston to attend a conference, and then having paid an additional $250/night plus $33/night parking and "tourism taxes" to the Seaport Hotel -- most people, after all this, are unlikely to say that it was a waste of time and they should have stayed home watching the remaining salvaged episodes of Doctor Who.
In fact, I found it quite useful.
Four ways of handling asynchronous operations in node.js
Today, we will examine four different methods of performing the same task asynchronously, in node.js.
Type-checked CoffeeScript with jzbuild
Zero load time file formats
When your app needs to be fast, you can't afford to load things from disk. In this toy example, an on-disk data structure helps you instantly look up lists of related words.
Finding the top K items in a list efficiently
Do you use sort() to find the top results? Here's a simple trick that will make your software run much faster.
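One way to avoid a full sort can be sketched as follows. This is only an illustration using the heap functions in Python's standard library; the article's own trick may differ, and the helper name top_k is hypothetical.

```python
import heapq

def top_k(items, k):
    """Return the k largest items, largest first, without sorting everything."""
    # heapq.nlargest tracks only k candidates while scanning the input once,
    # which costs roughly O(n log k) instead of O(n log n) for a full sort.
    return heapq.nlargest(k, items)

scores = [17, 3, 42, 8, 23, 15, 4]
print(top_k(scores, 3))  # prints [42, 23, 17]
```

The saving matters most when k is small compared to the length of the list.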
An instant rhyming dictionary for any web site
Sometimes your API has to be simple enough for non-technical people to use it. Find out how to include a rhyming dictionary on your web page just by copying and pasting.
jQuery creator John Resig needs a little help storing lists of words in his side project. Let's go overkill and explore a little known branch of computer science called Succinct Data Structures.
Throw away the keys: Easy, Minimal Perfect Hashing
Perfect hashing is a technique for building a hash table with no collisions in the minimum possible space. Such tables are easy to build with this simple Python function.
Why don't web browsers do this?
Why don't web pages start as fast as this computer from 1984?
Fun with Colour Difference
Are you looking for a nifty way to choose colours that stand out? Are you the type of person who is not satisfied until you have mathematically proven that your choice is optimal?
Compressing dictionaries with a DAWG
A practical, memory efficient way to store and search large sets of words.
Fast and Easy Levenshtein distance using a Trie
If you have a web site with a search function, you will rapidly realize that most mortals are terrible typists. Many searches contain misspelled words, and users will expect these searches to magically work. This magic is often done using Levenshtein distance. In this article, I'll compare two ways of finding the closest matching word in a large dictionary. I'll describe how I use it on rhymebrain.com.
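For reference, the straightforward approach can be sketched like this: compute the Levenshtein distance to every word and keep the smallest. This is a generic textbook version, not the article's code, and it does not show the trie-based method. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def closest_word(query, words):
    """Linear scan: return the word with the smallest distance to query."""
    return min(words, key=lambda word: levenshtein(query, word))

print(levenshtein("kitten", "sitting"))  # prints 3
print(closest_word("mispelled", ["misspelled", "mistake", "spelling"]))
```

Every lookup recomputes a full table for every dictionary word, which is the cost a smarter structure tries to avoid.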
The Curious Complexity of Being Turned On
In software, the simplest things can turn into a nightmare, especially at a large company.
Cross-domain communication the HTML5 way
Making a web application mashable -- useable in another web page -- has some challenges in the area of cross-domain communications. Here is how I solved those problems for Zwibbler.com, using HTML5 cross domain communication.
Five essential steps to prepare for your next programming interview
They put you in a room, give you a problem, and stare at you while you fumble around with markers on a whiteboard for 45 minutes. With a little preparation, you'll look like a pro.
Minimal usable Ubuntu with one command
If you install the default "ubuntu-desktop" you also get with it a gigabyte of crap that you will never use. But if you don't install the ubuntu desktop, you get a system with a text-only login: prompt, and it's not clear what to install to get it to a usable state.
I have an irrational need to optimize my Ubuntu installation. I did some investigating and came up with this method, which gives a minimal graphical 1.2 GB install, with gnome, networking, and no applications.
Finding awesome developers in programming interviews
In a job interview, I once asked a very experienced embedded software developer to write a program that reverses a string and prints it on the screen. He struggled with this basic task. This man was awesome. Give him a bucket of spare parts, and he could build a robot and program it to navigate around the room. He had worked on satellites that are now in actual orbit. He could have coded circles around me. But the one thing that he had never, ever needed to do was: display something on the screen.
Compress your JSON with automatic type extraction
JSON is a horribly inefficient data format for data exchange between a web server and a browser. Here's how you can fix it.
The simple and obvious way to walk through a graph
At some point in your programming career you may have to go through a graph of items and process them all exactly once. If you keep following neighbours, the path might loop back on itself, so you need to keep track of which ones have been processed already.
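The idea of tracking processed items can be sketched as a breadth-first walk with a visited set. This is a minimal illustration, assuming the graph is stored as a dictionary mapping each node to a list of neighbours; the representation and the name walk are assumptions, not taken from the article.

```python
from collections import deque

def walk(graph, start):
    """Visit every node reachable from start exactly once."""
    visited = {start}        # nodes already queued, so loops are not followed twice
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)   # "process" the node here
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

# A small graph with cycles: a-b, a-c, b-d, c-d.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b"]}
print(walk(graph, "a"))  # prints ['a', 'b', 'c', 'd']
```

Marking a node as visited when it is queued, rather than when it is processed, keeps the same node from entering the queue twice.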
Creating portable binaries on Linux
Distributing applications on Linux is hard. Sure, with modern package management, installing software is easy. But if you are distributing an application, you probably need one Windows version, plus umpteen different versions for Linux. In this article, we'll create a dummy application that targets the following operating systems, which are commonly used in business environments...
Bending over: How to sell your software to large companies
For a micro-ISV, selling to businesses can be more lucrative than selling to consumers. Instead of making a few dollars per sale and hoping for thousands of sales, you sell to only a few customers, and charge much higher rates. But the rates are high for a reason. It takes more time and money to sell to businesses.
Regular Expression Matching can be Ugly and Slow
If you open the first few pages of O'Reilly's Beautiful Code, you will find a well written chapter by Brian Kernighan (Personal motto: "No, I didn't invent C. Who told you that?"). The non-C inventing professor describes how a limited form of regular expressions can be implemented elegantly in only a few lines of C code.
C++: A language for next generation web apps
On Monday, I was pleased to be an uninvited speaker at Waterloo Devhouse, hosted in Postrank's magnificent office. After making some surreptitious alterations to their agile development wall, I gave a tongue-in-cheek talk on how C++ can fit in to a web application.
Now it's a commercial product, but Zwibbler was once a fun side-project, and here are some details on its implementation.
You don't need a project/solution to use the VC++ debugger
You learn a lot of things on the job as a programmer. Years ago, at my first coop position, I was a little confused when my boss went to Visual C++, and tried to open the .EXE file as a project. What a dolt! I thought. That's not going to work.
How IE <canvas> tag emulation works
The PenIsland Problem: Text-to-speech for domain names
Recently, I was contracted to run a list of domain names through the custom-built pronunciation engine that powers my rhyming web site. On the first attempt, I found that the results were embarrassingly bad. A quick inspection revealed the problem: most domain names are severalwordsstucktogether.
Building a better rhyming dictionary
Back in 2007, I created a rhyming engine based on the public domain Moby pronouncing dictionary. It simply reads the dictionary and looks for rhyming words by comparing the suffix of the words' pronunciations. Since that time, I have made some improvements.
Comment spam defeated at last
For years when running this blog, I would have to log in each day and delete a dozen comments due to spam. This was a chore, and I tried many ways to stem the tide.
How QBASIC almost got me killed
The day arrived when my project was ready to be unleashed upon the world. I waited until the teacher was hovering nearby and then I started my application, running the FORMAT command on the network drive. Some classmates were watching the screen and she hurried over to see what all the fuss was about.
How to run a linux based home web server
Sometimes you need complete control over the server, and don't want to pay $20 to $40 a month for a VPS. In this article, I'll describe step by step how to set up a home web server using Ubuntu, capable of handling modest spikes in traffic.
Using the Acer Aspire One as a web server
A netbook can be ideal for a home web server. Netbooks are cheap, and use less power than a CFL light bulb.
Finding great ideas for your startup
"I just don't have any ideas." This is the #1 stumbling block for budding entrepreneurs. Here are a few techniques to get the creative juices flowing.
Game Theory, Salary Negotiation, and Programmers
When you get a new job, you can breathe a sigh of relief, but not for long. You have an offer letter in your hand, and it is easy to miss one of the most important opportunities of your life: the starting salary. Here's what to do to increase your chances.
Coding tips they don't teach you in school
Some time-saving shortcuts for C code that will make your coworkers scream. In Awe.
When a reporter mangles your elevator pitch
If a reporter asks you about your new startup company, be careful what you say.
Test Driven Development without Tears
Every company that I worked for has its own method of testing, and I've gained a lot of experience in what works and what doesn't. At last, that stack of conflicting confidentiality agreements that I got as a coop student have now all expired, so I can talk about it. (I never signed them anyway.)
Drawing Graphs with Physics
To my surprise, I found that there is a very simple way to arrange graphs that can be expressed in only a few lines of code, using force-directed placement...
Keeping Abreast of Pornographic Research in Computer Science
Burgeoning numbers of Ph.D's and grad students are choosing to study pornography. Techniques for the analysis of "objectionable images" are gaining increased attention (and grant money) from governments and research institutions around the world, as well as Google. But what, exactly, does computer science have to do with porn? In the name of academic pursuit, let's roll up our sleeves and plunge deeply into this often hidden area that lies between the covers of top-shelf research journals.
Exploiting perceptual colour difference for edge detection
Think colour isn't important in image processing algorithms? Let's try it both ways, and see for yourself.
Experiment: Deleting a post from the Internet
Once you post something on the Internet, it is hard to get rid of it. As an experiment, I deleted one of my past posts, and I tried to remove all traces of it.
Is 2009 the year of Linux malware?
Is 2009 the year of the Linux desktop malware? How long until we see headlines like, "Researchers find massive botnet based on Linux 2.30"?
If you begin your emails with "Hi, <name>!" then they will seem less rude.
How a programmer reads your resume (comic)
People thought it was a comic, so I never corrected them.
How wide should you make your web page?
Based on 22500 unique IP addresses over the past week.
Usability Nightmare: Xfce Settings Manager
Rant: Why can't anyone make a good settings screen?
cairo blur image surface
This really should have been included in cairo. Instead, everyone that wants to have shadows has to roll their own blur function. Here's my take on it. I'll even release this into the public domain.
Why Perforce is more scalable than Git
Branching on Perforce is kind of like performing open heart surgery. But here's why git can't hope to compete with it.
Optimizing Ubuntu to run from a USB key or SD card
Fortunately, by following the tips below, you can make your USB or SD card based Linux system fly!
UMA Questions Answered
A bunch of questions answered about UMA wireless technology.
See sound without drugs
I have created an application that just turns on the microphone and continually plots the FFT magnitude of what it records. It allows control over the window size and sampling rate.
Stock Picking using Python
Python can tell you which stocks to buy. It's a sure thing!
Rant: Why do companies think they can make money by posting false information about you on the Internet?
Copy a cairo surface to the windows clipboard
I just spent several hours debugging clipboard copy of a DIB image. I could copy from my application, and paste into Paint. I could paste into Word. But if I pasted into WordPad, nothing showed up. If I pasted into GIMP, it crashed.
Free, Raw Stock Data
Scraping financial information is easy with my friend, Python.
Why are all my lines fuzzy in cairo?
Make sure your lines are sharp using this simple trick.
A simple command line calculator
A textbook example of recursive descent parsing.
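As a rough sketch of the technique, a recursive descent calculator gives each grammar rule its own function. The grammar, the class name Parser and the function calculate below are assumptions made for illustration, not the article's code, and the sketch skips error handling for malformed input.

```python
import re

def tokenize(text):
    """Split the input into numbers, operators and parentheses."""
    return re.findall(r"\d+\.?\d*|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        token = self.peek()
        self.pos += 1
        return token

    def expression(self):
        # expression := term (("+" | "-") term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term := factor (("*" | "/") factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        # factor := number | "(" expression ")" | "-" factor
        token = self.next()
        if token == "(":
            value = self.expression()
            self.next()  # consume the closing ")"
            return value
        if token == "-":
            return -self.factor()
        return float(token)

def calculate(text):
    return Parser(tokenize(text)).expression()

print(calculate("2 + 3 * (4 - 1)"))  # prints 11.0
```

Operator precedence falls out of the call structure: expression calls term, and term calls factor, so multiplication binds tighter than addition.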
Tool for Creating UML Sequence Diagrams
If you have to draw something called "UML Sequence Diagrams" for work or school, you already know that it can take hours to get a diagram to look right. Here's a web site that will save you some time.
Exploring sound with Wavelets
Here's a program to create scalograms of sound files.
UMA and free long distance
What's to stop me from travelling to another continent, and then making free long distance calls to local numbers back home? Technically, nothing.
UMA's dirty secrets
Recently, many carriers have started offering UMA, or WiFi phones. These are cell phones with WiFi capabilities. Don't be fooled -- you won't be able to get free calls and run Skype on them. The UMA technology is meant to extend the carrier's cellular network into your home using your broadband internet connection.
Installing the Latest Debian on an Ancient Laptop
Install Linux on a really old laptop. The catch: It has only 32 MB of RAM, no network ports, no CD-ROM, and the floppy drive makes creaking noises. Is it possible? Yes. Is it easy? No. Is it useful? Maybe...
Experiments in making money online
Is it possible to make money on the internet, if you try really hard? I want to find out.
I have always been interested in getting money for doing nothing.
Draw waveforms and hear them
A while back I thought it would be interesting to be able to draw arbitrary waveforms and then listen to how they sound. I had an audio engine just lying around, so I whipped up a quick application to do that.
Cell Phones on Airplanes
Much ink has been spilled about the use of cell phones on airplanes. Here's the truth, which will be disappointing to conspiracy theorists: Cell phone signals most definitely have an effect on other electronic equipment. Read on for more.
Detecting C++ memory leaks
It's fairly simple to redefine malloc() and free() to your own functions, to track the file and line number of memory leaks.
What does your phone number spell?
Here, I explain a technique for figuring out which words are in which phone numbers. Full C source code is included.
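One simple way to approach this can be sketched in Python: convert each dictionary word to its keypad digits and look for those digits inside the number. This is not the article's C code or necessarily its technique; the names below are hypothetical, and the sketch assumes words contain only the letters a to z.

```python
# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}

def word_to_digits(word):
    """Translate a word into the digits you would dial to spell it."""
    return "".join(LETTER_TO_DIGIT[char] for char in word.lower())

def words_in_number(number, words):
    """Return the words whose digit sequence appears somewhere in number."""
    return [word for word in words if word_to_digits(word) in number]

print(word_to_digits("hello"))  # prints 43556
print(words_in_number("5554355612", ["hello", "cab", "dog"]))  # prints ['hello']
```

For a large word list, precomputing a dictionary from digit sequences to words would avoid translating every word for every number.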
A Rhyming Engine
Here's a rhyming engine, written in 1000 lines of C++ code. It uses the freely available Moby dictionary, and full source code is provided.
Rules for Effective C++
The rules for safe C++ code are surprisingly controversial.
Cell Phone Secrets
How to choose a cell phone in 2006, if you want the best possible radio.