Hate UML?

Draw sequence diagrams in seconds.
http://www.websequencediagrams.com

Compress your JSON with automatic type extraction
Posted on: 2010-08-16 21:00:47

JSON is horribly inefficient data format for data exchange between a web server and a browser. For one, it converts everything to text. The value 3.141592653589793 takes only 8 bytes of memory, but JSON.stringify() expands it to 17. A second problem is its excessive use of quotes, which add two bytes to every string. Thirdly, it has no standard format for using a schema. When multiple objects are serialized in the same message, the key names for each property must be repeated, even though they are the same for each object.

JSON used to have an advantage because it could be directly parsed by a javascript engine, but even that advantage is gone because of security and interoperability concerns. About the only thing JSON going for it is that it is usually more compact than the alternative, XML, and it is well supported by many web programming languages.

Compression of JSON data is useful when large data structures must be transmitted from the web browser to the server. In that direction, it is not possible to use gzip compression, because it is not possible for the browser to know in advance whether the server supports gzip. The browser must be conservative, because the server may have changed abilities between requests.

Today, let's tackle the most pressing problem: the need to constantly repeat key names over and over. I will present a Javascript library for compressing JSON strings by automatically deriving a schema from multiple objects. The library can be used as a drop in replacement for the methods JSON.stringify() and JSON.parse(), except that it lacks support for a reviver function. In combination with Rison, the savings could be significant.

Download it here

Suppose you have to transmit several thousand points and rectangles. JSON might encode them like this (without the comments):

[
    { // This is a point
        "x": 100, 
        "y": 100
    },

    { // This is a rectangle
        "x": 100, 
        "y": 100,
        "width": 200,
        "height": 150
    },

    {}, // an empty object

    ... // thousands more
]

A lot of the space is taken up by repeating the key names "x", "y", "width", and "height". They only need to be stored once for each object type:

{
    "templates": [ ["x", "y"], ["x", "y", "width", "height"] ],
    "values": [ 
        { "type": 1, "values": [ 100, 100 ] }, { "type": 2, "values": [100, 100, 200, 150 ] }, {} ]
}

Each object in the original input is transformed. Instead of listing the keys, the "type" field refers to a list of keys in the schema array. (The type is 1-based, instead of zero based, and I will explain why later). But we are still repeating "x", and "y". The rectangle shared these properties with the point type, and there is no need to repeat them in the schema:

{
    "templates": [ [0, "x", "y"], [1, "width", "height"] ],
    "values": [ 
        { "type": 1, "values": [ 100, 100 ] }, { "type": 2, "values": [100, 100, 200, 150 ] }, {} ]
}
We prefix each key list in the schema with a number. This number is the one-based index of a prior schema which is prepended to it to form the combined list. Zero means the empty object, which is why we use one-based indicies.

But we can still go a little further. Instead of having a separate "type" field in each object, we stick the type as the first element of the values array.

{
    "templates": [ [0, "x", "y"], [1, "width", "height"] ],
    "values": [ 
        { "values": [ 1,  100, 100 ] }, { "values": [2, 100, 100, 200, 150 ] }, {} ]
}

Finally, since we are trying to save space, we rename our properties, and stick in a format code so we can detect that compresed json is used.

{
    "f": "cjson",
    "t": [ [0, "x", "y"], [1, "width", "height"] ],
    "v": [ { "": [1,  100, 100 ] }, { "": [2, 100, 100, 200, 150 ] }, {} ]
}

Automatic type extraction

The hard part is finding the objects which share sets of keys. It sounds a lot like the Set Cover problem, and if so, an optimal solution is NP-complete. Instead, we will approximate the solution using a tree structure. While we are building the value array, when we encounter an object, we add all of its keys to the tree in the order that we encounter them.

At the end of the process, we can traverse the nodes of the tree and create the templates. Nodes which represent the end of a key list (shown in gray) must have entry in the key list. Although not illustrated here, nodes with multiple children are also points where the the child object types inherit from a common parent, so they also get an entry.

The astute reader will realize that the final schema depends on the order that we inserted the keys into the tree. For example, if, when we encountered the rectangle, we inserted the keys "width" and "height" before "x", and "y", the algorithm would not find any common entries.

It is possible to gain more efficient packing by using a greedy algorithm. In the greedy algorithm, before we begin, an initial pass through all the objects would be made to build a list of unique object types. Then when it comes time to insert keys into the tree, they are first sorted so that the ones which occur in the most unique types are inserted first. However, this method adds a lot of extra processing and I feel the gains would not be worthwhile.

Real world savings

Here is an actual document from my web site, Zwibbler.com. Click on "Transform" to see how CJSON compresses it vs. JSON.






Download

Download the CJSON code here.

Further reading

Want more programming tech talk?
Add to Circles on Google Plus
Subscribe to posts

Post comment

Real Name:
Your Email (Not displayed):

Text only. No HTML. If you write "http:" your message will be ignored.
Choose an edit password if you want to be able to edit or delete your comment later.
Editing Password (Optional):

Barryvan

2010-08-16 21:35:06
Interesting notion! My only concern would be the extra processing required to decode and rebuild the original JSON, particularly for large JSON strings, which is where you'd see the most savings. Have you done any performance testing on the encoding and decoding functions?

Babylon.

2010-08-16 23:41:25
I would suggest the time spent transmitting data over the wire is far greater than the time spent munging it.

andy

2010-08-16 23:56:50
I tried compressing your string with zip (default settings) and it compressed to 1882 bytes.

franz

2010-08-17 03:51:47
Your JSON minifying algorithm looks nice, however when you gzip the minified and the non-minified version, the filesizes only differ by about 7%. This is because the Huffman-Encoding in gzip can deal quite well with repeated text structures.

Matt

2010-08-17 05:16:17
In Opera I get a "ReferenceError: Undefined variable: CJSON" when I click Transform. Works fine in IE.

Fredrik C

2010-08-17 07:32:33
Matt: The code works fine in Opera here...

Josh

2010-08-17 11:34:28
Why not use

[ [1, 100, 100 ], [2, 100, 100, 200, 150 ], {} ]

instead of

[ { "": [1, 100, 100 ] }, { "": [2, 100, 100, 200, 150 ] }, {} ]

since the keys are all "" ?

Fábio

2010-08-17 11:50:40
Its a good codification strategy, but if you want to minimize the size, why not go into object serialization or a custom format with gzip applied on top? json is by purpose a verbose codification.

Most people choose json because of its simplicity (readibility) and your algorigthm breaks that feature.

JD

2010-08-17 13:34:47
I agree with Fabio... If you care so much about size, you should just go with custom serialization and compression.

Your process makes json illegible.

Kenny Parnell

2010-10-06 18:23:39
This is working great for us when used in tandem with the userData behavior in IE where space is limited to 128K.

VeganBen

2010-11-24 09:39:05
I was about the write the same final step as Josh, then read the comments.

Couple of further points.. keys do _not_ need to be enquoted.

I like the "values" portion. I think the embedding of templates in each other is an optimisation too far. It does more for obfuscation than it does for either minification or optimisation.

It might be more "optimised" in the data, but the extra process to extract the full template from a nested template can't be helpful.

Steve Hanov

2010-11-26 19:30:09
The reason for representing objects as {"": []} is to distinguish them from arrays. If you used Josh's method, then how would you represent an array?

Robby Pelssers

2011-03-27 10:02:33
If you really want to save on band-width and don't care about readability.. you can also store your data as an array of arrays instead of using objects since javascript arrays are not typed like in e.g. Java. It's comparable to using tuples (see Scala)

[

{"firstname": "Robby", "lastName": "Pelssers"},

{"firstname": "Steve", "lastName": "Hanov"},

...

]

[

["Robby","Pelssers"],

["Steve", "Hanov"],

...

]

Cheers,

Robby

Marc Lepage

2011-03-31 20:40:32
Sorry VeganBen but the quotes around the key names (which are strings) must be present if the JSON is to be valid. A lot of parsers are relaxed to allow them to be omitted, but that is non-standard.

Marc Lepage

2011-03-31 20:42:45
If I didn't care about order of points and rectangles, only that they were preserved within each type, I might be inclined to omit the type flags in the items by storing items separately by type. Something like this:

{

"f": "cjson",

"t": [ [0, "x", "y"], [1, "width", "height"] ],

"v0": [ { "": [ 100, 100 ] }, {} ]

"v1": [ { "": [2, 100, 100, 200, 150 ] }, {} ]

}

not working out

2011-08-07 00:56:09
{"a":{"b":{"b":{"b":{"b":{"b":{"a":{"b":{"b":{"b":{"b":{"b":1}}}}}}}}}}}}

try this.

quest

2011-12-16 08:27:36
Has php version?

dogada

2012-06-05 09:00:05
Great idea, but why to keep explicit index of data schemes. Look at RJSON that build index of schemes during decoding and so uses implicit index www.cliws.com/e/06pogA9VwXylo_GknPEeFA/
Email
steve.hanov@gmail.com

Other posts by Steve

Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US How Asana Breaks the Rules About Per-Seat Pricing 5 Ways PowToon Made Me Want to Buy Their Software How I run my business selling software to Americans 0, 1, Many, a Zillion Give your Commodore 64 new life with an SD card reader 20 lines of code that will beat A/B testing every time [comic] Appreciation of xkcd comics vs. technical ability VP trees: A data structure for finding stuff fast Why you should go to the Business of Software Conference Next Year Four ways of handling asynchronous operations in node.js Type-checked CoffeeScript with jzbuild Zero load time file formats Finding the top K items in a list efficiently An instant rhyming dictionary for any web site Succinct Data Structures: Cramming 80,000 words into a Javascript file. Throw away the keys: Easy, Minimal Perfect Hashing Why don't web browsers do this? Fun with Colour Difference Compressing dictionaries with a DAWG Fast and Easy Levenshtein distance using a Trie The Curious Complexity of Being Turned On Cross-domain communication the HTML5 way Five essential steps to prepare for your next programming interview Minimal usable Ubuntu with one command Finding awesome developers in programming interviews Compress your JSON with automatic type extraction JZBUILD - An Easy Javascript Build System Pssst! Want to stream your videos to your iPod? "This is stupid. Your program doesn't work," my wife told me The simple and obvious way to walk through a graph Asking users for steps to reproduce bugs, and other dumb ideas Creating portable binaries on Linux Bending over: How to sell your software to large companies Regular Expression Matching can be Ugly and Slow C++: A language for next generation web apps qb.js: An implementation of QBASIC in Javascript Zwibbler: A simple drawing program using Javascript and Canvas You don't need a project/solution to use the VC++ debugger Boring Date (comic) barcamp (comic) How IE <canvas> tag emulation works I didn't know you could mix and match (comic) Sign here (comic) It's a dirty job... (comic) The PenIsland Problem: Text-to-speech for domain names Pitching to VCs #2 (comic) Building a better rhyming dictionary Does Android team with eccentric geeks? (comic) Comment spam defeated at last Pitching to VCs (comic) How QBASIC almost got me killed Blame the extensions (comic) How to run a linux based home web server Microsoft's generosity knows no end for a year (comic) Using the Acer Aspire One as a web server When programmers design web sites (comic) Finding great ideas for your startup Game Theory, Salary Negotiation, and Programmers Coding tips they don't teach you in school When a reporter mangles your elevator pitch Test Driven Development without Tears Drawing Graphs with Physics Free up disk space in Ubuntu Keeping Abreast of Pornographic Research in Computer Science Exploiting perceptual colour difference for edge detection Experiment: Deleting a post from the Internet Is 2009 the year of Linux malware? Email Etiquette How a programmer reads your resume (comic) How wide should you make your web page? Usability Nightmare: Xfce Settings Manager cairo blur image surface Automatically remove wordiness from your writing Why Perforce is more scalable than Git Optimizing Ubuntu to run from a USB key or SD card UMA Questions Answered Make Windows XP look like Ubuntu, with Spinning Cube Effect See sound without drugs Standby Preventer Stock Picking using Python Spoke.com scam Stackoverflow.com Copy a cairo surface to the windows clipboard Simulating freehand drawing with Cairo Free, Raw Stock Data Installing Ubuntu on the Via Artigo Why are all my lines fuzzy in cairo? A simple command line calculator Tool for Creating UML Sequence Diagrams Exploring sound with Wavelets UMA and free long distance UMA's dirty secrets Installing the Latest Debian on an Ancient Laptop Dissecting Adsense HTML/ Javascript/ CSS Pretty Printer Web Comic Aggregator Experiments in making money online How much cash do celebrities make? Draw waveforms and hear them Cell Phones on Airplanes Detecting C++ memory leaks What does your phone number spell? A Rhyming Engine Rules for Effective C++ Cell Phone Secrets