Web Comic Aggregator
Posted on: 2007-04-21 19:50:09
Here's a python script that scrapes your favourite comics, and assembles them all onto one page. It also archives them on your server with back/forward buttons.

Many web sites today have RSS feeds. RSS stands for "Really Simple Syndication". Basically, it is a link to an ever-changing XML file with a list of articles. Even web sites like www.gocomics.com syndicate their content online, but what they include is pretty useless: for each day, it's just the name of the comic and the date. Obviously, the content creators want you to visit their web sites and click on ads (and I encourage you to do so).

Technically inclined people (and their spouses) don't have to deal with these web sites, because now you can install this Python script on your server. It will automatically scrape the latest comics from any web page you specify, and make them all available on a single page. All you have to do is come up with a regular expression for the filename of each comic.

The most significant challenge is legal. The reason you don't see sites that show a bunch of your favourite comics on one page is that it's illegal to reproduce them. It's probably legal to download and store them for your own use, however. Please don't use this to build a public web site, because you will get sued.

Overview

Download comics.py here.
  • Create a folder on your server called, for example, comicdb.
  • Inside the folder, create a file database.txt that contains the web addresses and regular expressions for the comics that you want to scrape.
  • Change the DatabaseDir in comics.py to point to the database folder.
  • Put comics.py in the cgi-bin folder of your server.
  • Edit your crontab file so that comics.py update is run once a day.
  • When you run comics.py update, it will download the comics for the day and store them in the database folder.
  • When you load comics.py from a web browser, it presents all the comics on a single web page. It lets you navigate back and forth through the archive of comics that have been stored since you installed it.

Telling it what to scrape

Most web comic sites, like www.gocomics.com, use easily guessable filenames. For example, the Foxtrot comic for April 15, 2007 can be accessed from the URL http://images.ucomics.com/comics/ft/2007/ft070415.gif. Other web sites, like www.dilbert.com, deliberately mangle the URL in some way to make it unpredictable. Dilbert has been inserting random digits in the date URL since 1998. This may thwart very dumb script kiddies. However, by using regular expressions, we will be able to scrape any type of naming scheme.
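To make "easily guessable" concrete, here is a minimal Python 3 sketch that rebuilds the Foxtrot URL above from a date. The function name guess_comic_url is made up for illustration, and the assumption that every date follows the same prefix/year/yymmdd layout comes only from the single example URL.

```python
from datetime import date

def guess_comic_url(day, prefix="ft"):
    """Build an image URL for `day`, assuming the layout of the Foxtrot example."""
    # Assumed layout: .../comics/<prefix>/<4-digit year>/<prefix><yymmdd>.gif
    return "http://images.ucomics.com/comics/{0}/{1}/{0}{2}.gif".format(
        prefix, day.year, day.strftime("%y%m%d"))

print(guess_comic_url(date(2007, 4, 15)))
# prints http://images.ucomics.com/comics/ft/2007/ft070415.gif
```

A site that inserts random digits into the filename defeats this kind of guessing, which is why the script matches URLs found in the page with a regular expression instead.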

The comicdb/database.txt file that I use looks like this:

-
Dilbert
www.dilbert.com
strip(\.sunday)?\.gif
-
Cathy
www.gocomics.com/cathy/
ca\d+.gif
-
For Better Or For Worse
www.gocomics.com/forbetterorforworse
fb\d+.gif
-
Foxtrot
www.gocomics.com/foxtrot
ft\d+.gif
-
Ctrl+Alt+Del
www.ctrlaltdel-online.com/comic.php
\d\d\d\d\d\d\d\d.jpg

Each entry is four lines:

  1. A hyphen: '-'
  2. The name of the comic (for display purposes)
  3. The url of the web site that has the latest comic
  4. A regular expression for the url of the latest comic in that page.
To derive the regular expression, you will have to examine the html source code of the comic's main page and look for the image url that they are using, and then create a regular expression to match it.
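The entry format described above is simple enough to read with a few lines of Python. This is only a sketch of how such a file could be parsed, not the code from comics.py; the name parse_database and the tuple layout are assumptions.

```python
def parse_database(text):
    """Return a list of (name, url, pattern) tuples from database.txt content."""
    # Drop blank lines and surrounding whitespace so stray spacing is harmless.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    i = 0
    while i < len(lines):
        # An entry starts with a lone hyphen followed by three more lines.
        if lines[i] == "-" and i + 3 < len(lines):
            name, url, pattern = lines[i + 1:i + 4]
            entries.append((name, url, pattern))
            i += 4
        else:
            i += 1  # skip anything that is not the start of a complete entry
    return entries

sample = r"""-
Dilbert
www.dilbert.com
strip(\.sunday)?\.gif
-
Foxtrot
www.gocomics.com/foxtrot
ft\d+.gif
"""

for name, url, pattern in parse_database(sample):
    print(name, url, pattern)
```

Note that an unescaped dot, as in ft\d+.gif, matches any character. That is harmless here, but \.gif is the stricter form.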

When you run comics.py update each day, the script will download the web page and look for lines like SRC="blah/blah/blah". When the contents of that attribute match the regular expression that you specify, it will download that image, rename it, and store it in the database folder.
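The matching step can be sketched roughly as follows in Python 3. The helper find_comic_url and the SRC_RE pattern are illustrative assumptions, as is the choice of re.search (matching anywhere in the URL); the real script may differ.

```python
import re

# Capture the value of any src="..." or SRC='...' attribute.
SRC_RE = re.compile(r'src\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def find_comic_url(html, pattern):
    """Return the first src value that matches `pattern`, or None."""
    comic_re = re.compile(pattern)
    for src in SRC_RE.findall(html):
        if comic_re.search(src):
            return src
    return None

html = ('<img SRC="/ads/banner.gif">'
        '<img src="http://images.ucomics.com/comics/ft/2007/ft070415.gif" width="600">')
print(find_comic_url(html, r"ft\d+.gif"))
# prints http://images.ucomics.com/comics/ft/2007/ft070415.gif
```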

Not all comics are published daily. If you run your script every day, it may download the same comics twice. To prevent storing the same comic on different days, it calculates the md5 hash of the image file, and refuses to store it again if it already has another image with the same hash.
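The duplicate check might look something like this. It is a sketch that keeps the known hashes in an in-memory set for brevity; how comics.py actually records them is not shown here, and store_if_new is a hypothetical name.

```python
import hashlib

def store_if_new(image_bytes, seen_hashes):
    """Return True and remember the hash if this image has not been seen before."""
    digest = hashlib.md5(image_bytes).hexdigest()
    if digest in seen_hashes:
        return False  # same image as an earlier day, so do not store it again
    seen_hashes.add(digest)
    return True

seen = set()
print(store_if_new(b"monday strip", seen))   # True: first time this image is seen
print(store_if_new(b"monday strip", seen))   # False: identical bytes, same hash
print(store_if_new(b"tuesday strip", seen))  # True: different image
```

MD5 is fine for spotting identical files like this, since nothing here depends on it being cryptographically strong.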

Scheduling

On Linux, I can schedule the script to update every day using cron. Type crontab -e and add this line to the file:
0 5 * * * /home/yourname/public_html/cgi-bin/comics.py update
This will update every day at 5 am, so your comic page will be ready for you with the latest comics.

Future Work

Flash images

If you read the source code for some comics on www.gocomics.com (Foxtrot, for example) you see that for some bizarre reason they detect whether you are using Windows or Linux. On anything other than Linux, they serve up an Adobe Flash object instead of a normal .gif file:
if ( isLinux ) {  // we've detected Linus
    var linuxContent = '<center><img src="http://images.ucomics.com/comics/ft/2007/ft070415.gif" width="600" height="428" border="0"></center>';document.write(linuxContent);  // insert non-flash content
} else if ( hasProductInstall && !hasReqestedVersion && !isLinux) {
        var productInstallOETags = '<script type=text/javascript>'
+ 'AC_FL_RunContent("codebase","http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab",'
...
...
...
Since Adobe Flash has been available on Linux for years, they will probably remove the normal .gif option soon. Then I will need to find a way of converting Flash to an image on the command line. Or I could just download the .swf file instead and treat it like an image.

Avoiding detection

The script makes no attempt to hide the fact that it's not a browser. The target web site could check if the user agent matches common browser types, or that the person that is requesting the image has a cookie set. That could be easily faked in Python.
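As a rough illustration of how little is involved, this Python 3 sketch builds a request with a browser-style User-Agent and an optional cookie using the standard urllib.request module. The function name, the user agent string, and the cookie value are placeholders, and nothing is actually sent.

```python
import urllib.request

# Placeholder browser-style identifier; any real browser string would do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"

def build_browser_like_request(url, cookie=None):
    """Return a urllib Request carrying browser-like headers (not yet sent)."""
    headers = {"User-Agent": BROWSER_UA}
    if cookie is not None:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)

req = build_browser_like_request("http://www.example.com/comic.gif",
                                 cookie="session=abc123")
print(req.header_items())
```

Passing the result to urllib.request.urlopen would perform the download with those headers.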

Suppose the comic site detects the scraping by looking for people downloading different comics too quickly. Then we could have the script space out the requests over several hours. In addition, we could use Tor to make every request use a different IP address.
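Spacing out the requests could be as simple as picking a random start offset for each comic within a time window. This is a sketch under that assumption; spread_delays and the three-hour window are made up for illustration.

```python
import random

def spread_delays(count, window_seconds, rng=random):
    """Return `count` sorted offsets, in seconds, scattered across the window."""
    return sorted(rng.uniform(0, window_seconds) for _ in range(count))

# Five comics spread over three hours; a seeded generator keeps runs repeatable.
offsets = spread_delays(5, 3 * 3600, random.Random(0))
for comic_number, offset in enumerate(offsets, start=1):
    print("comic %d starts %.0f seconds after the update begins" % (comic_number, offset))
```

The update loop would then sleep until each offset before fetching the corresponding comic.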

How the script could be thwarted

As JavaScript has become universal, the comic web page could use JavaScript to disguise the URL or set cookies. This would be the most difficult for me to circumvent. In that case, I would have to use an open source JavaScript interpreter to render the page internally and then grab the URL. If it came to that, it might be easier to have a script actually load the page in a real browser within an X-VNC session, and clip that part of the screen. The Greasemonkey Firefox add-on could probably do the trick. But this is starting to get complicated...

Raindog

2007-11-12 01:28:23
Awesomesauce. Thanks for this.

Raindog

2007-11-12 01:40:20
Also, it would be super cool if you could modify this to also grab the image title text from the comic, since all the comics I read have title text that augments the comic (like www.xkcd.com).

Randall

2008-01-17 05:28:56
When I try to view comics.py in my browser, I get a copy of the comics.py script. I see that by running comics.py on my server, I generate proper HTML, but why isn't that being shown to me in my browser?

Steve Hanov

2009-07-22 15:32:39
Randall: I have that problem sometimes too. Apache can be very unpredictable. I got around it by renaming comics.py to comics.cgi.

Tim

2010-04-19 11:42:39
Sounds like a sweet script, but when I try to download it, I get an empty file...

jim dorey

2010-06-02 16:33:10
is there some place to find out what the wildcards are?

Steve Hanov

2010-06-03 08:31:56
The wild cards are called "regular expressions".
Email
steve.hanov@gmail.com
