Compress your JSON with automatic type extraction
JSON is a horribly inefficient data format for exchanging data between a web server and a browser. For one, it converts everything to text. The value 3.141592653589793 takes only 8 bytes of memory, but JSON.stringify() expands it to 17 characters. A second problem is its excessive use of quotes, which add two bytes to every string. Thirdly, it has no standard way to use a schema: when multiple objects are serialized in the same message, the key names for each property must be repeated, even though they are the same for every object.
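You can verify the first point in any JavaScript console:

// Math.PI is stored as an 8-byte IEEE 754 double, but its JSON text is 17 characters.
JSON.stringify(Math.PI);         // "3.141592653589793"
JSON.stringify(Math.PI).length;  // 17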
JSON used to have the advantage that it could be parsed directly by a JavaScript engine, but even that advantage is gone because of security and interoperability concerns. About the only things JSON has going for it are that it is usually more compact than the alternative, XML, and that it is well supported by many web programming languages.
Compression of JSON data is useful when large data structures must be transmitted from the web browser to the server. In that direction, gzip compression cannot be used, because the browser has no way of knowing in advance whether the server supports it. The browser must be conservative, because the server's capabilities may have changed between requests.
Today, let's tackle the most pressing problem: the need to repeat key names over and over. I will present a JavaScript library for compressing JSON strings by automatically deriving a schema from multiple objects. The library can be used as a drop-in replacement for JSON.stringify() and JSON.parse(), except that it lacks support for a reviver function. In combination with Rison, the savings can be significant.
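As a rough sketch of the intended usage (the names CJSON.stringify and CJSON.parse are my assumption of what a drop-in replacement would be called; check the downloaded source for the exact entry points):

var shapes = [
    { "x": 100, "y": 100 },
    { "x": 100, "y": 100, "width": 200, "height": 150 }
];

// Compress on one side...
var packed = CJSON.stringify(shapes);

// ...and expand on the other. Note: no reviver argument is supported.
var restored = CJSON.parse(packed);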
Suppose you have to transmit several thousand points and rectangles. JSON might encode them like this (without the comments):
[
    {   // This is a point
        "x": 100,
        "y": 100
    },
    {   // This is a rectangle
        "x": 100,
        "y": 100,
        "width": 200,
        "height": 150
    },
    {},  // an empty object
    ...  // thousands more
]
A lot of the space is taken up by repeating the key names "x", "y", "width", and "height". They only need to be stored once for each object type:
{ "templates": [ ["x", "y"], ["x", "y", "width", "height"] ], "values": [ { "type": 1, "values": [ 100, 100 ] }, { "type": 2, "values": [100, 100, 200, 150 ] }, {} ] }
Each object in the original input is transformed. Instead of listing the keys, the "type" field refers to a list of keys in the schema array. (The type is one-based instead of zero-based; I will explain why later.) But we are still repeating "x" and "y". The rectangle shares these properties with the point type, and there is no need to repeat them in the schema:
{ "templates": [ [0, "x", "y"], [1, "width", "height"] ], "values": [ { "type": 1, "values": [ 100, 100 ] }, { "type": 2, "values": [100, 100, 200, 150 ] }, {} ] }We prefix each key list in the schema with a number. This number is the one-based index of a prior schema which is prepended to it to form the combined list. Zero means the empty object, which is why we use one-based indicies.
But we can still go a little further. Instead of having a separate "type" field in each object, we stick the type as the first element of the values array.
{ "templates": [ [0, "x", "y"], [1, "width", "height"] ], "values": [ { "values": [ 1, 100, 100 ] }, { "values": [2, 100, 100, 200, 150 ] }, {} ] }
Finally, since we are trying to save space, we rename our properties and stick in a format code so we can detect that compressed JSON is being used.
{ "f": "cjson", "t": [ [0, "x", "y"], [1, "width", "height"] ], "v": [ { "": [1, 100, 100 ] }, { "": [2, 100, 100, 200, 150 ] }, {} ] }
Automatic type extraction
The hard part is finding the objects which share sets of keys. It sounds a lot like the Set Cover problem, and if so, an optimal solution is NP-complete. Instead, we will approximate the solution using a tree structure. While we are building the value array, when we encounter an object, we add all of its keys to the tree in the order that we encounter them.
At the end of the process, we can traverse the nodes of the tree and create the templates. Nodes which represent the end of a key list (shown in gray) must have an entry in the template list. Although not illustrated here, nodes with multiple children are also points where the child object types inherit from a common parent, so they also get an entry.
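Here is a sketch of both steps, written for illustration rather than taken from the library: insertKeys() records one object's keys as a path in the tree, and buildTemplates() walks the tree, emitting a template at every terminal node and at every node with more than one child.

function KeyNode() {
    this.children = {};    // key name -> child KeyNode
    this.terminal = false; // true if some object's key list ends here
}

function insertKeys(root, obj) {
    var node = root;
    for (var key in obj) {                // keys in the order we encounter them
        if (!node.children[key]) {
            node.children[key] = new KeyNode();
        }
        node = node.children[key];
    }
    node.terminal = true;
}

function buildTemplates(root) {
    var templates = [];
    function visit(node, parentId, keysSinceParent) {
        var id = parentId;
        var childKeys = Object.keys(node.children);
        if (node.terminal || childKeys.length > 1) {
            // This node gets a template: parent index first, then its own keys.
            templates.push([parentId].concat(keysSinceParent));
            id = templates.length;        // one-based index of the new template
            keysSinceParent = [];
        }
        childKeys.forEach(function (key) {
            visit(node.children[key], id, keysSinceParent.concat(key));
        });
    }
    visit(root, 0, []);
    return templates;
}

Feeding the point and the rectangle from the example through insertKeys() and then buildTemplates() reproduces the templates shown earlier: [[0, "x", "y"], [1, "width", "height"]].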
The astute reader will realize that the final schema depends on the order in which we insert the keys into the tree. For example, if, when we encountered the rectangle, we had inserted the keys "width" and "height" before "x" and "y", the algorithm would not have found any common entries.
It is possible to gain more efficient packing by using a greedy algorithm. In the greedy algorithm, an initial pass through all the objects is made to build a list of unique object types. Then, when it comes time to insert keys into the tree, they are first sorted so that the keys which occur in the most unique types are inserted first. However, this method adds a lot of extra processing, and I feel the gains would not be worthwhile.
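For the curious, the key-frequency counting for that greedy variant might look like this (again only a sketch of the idea, not something the library does):

function keyFrequencies(objects) {
    var counts = {};
    var seenTypes = {};
    objects.forEach(function (obj) {
        var keys = Object.keys(obj);
        var signature = keys.slice().sort().join(",");
        if (seenTypes[signature]) {
            return;                       // count each unique key set only once
        }
        seenTypes[signature] = true;
        keys.forEach(function (key) {
            counts[key] = (counts[key] || 0) + 1;
        });
    });
    return counts;
}

// Before inserting an object's keys into the tree, sort them so that the
// most widely shared keys go in first:
//     keys.sort(function (a, b) { return (counts[b] || 0) - (counts[a] || 0); });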
Real world savings
Here is an actual document from my web site, Zwibbler.com. Click on "Transform" to see how CJSON compresses it vs. JSON.
Download
Download the CJSON code here.
Further reading
- Zwibbler: A simple drawing program using Javascript and Canvas
- Type Inference in a Dynamically Typed Language: an implementation of the Cartesian Product algorithm
By utilizing this technique, far greater gains can be realized than are possible with gzip.
Try this:
{
"f": "cjson",
"t": [ [0, "x", "y"], [1, "width", "height"] ],
"v0": [ { "": [ 100, 100 ] }, {} ]
"v1": [ { "": [2, 100, 100, 200, 150 ] }, {} ]
}
[
{"firstname": "Robby", "lastName": "Pelssers"},
{"firstname": "Steve", "lastName": "Hanov"},
...
]
[
["Robby","Pelssers"],
["Steve", "Hanov"],
...
]
Cheers,
Robby
A couple of further points: keys do _not_ need to be enquoted.
I like the "values" portion. I think the embedding of templates in each other is an optimisation too far. It does more for obfuscation than it does for either minification or optimisation.
It might be more "optimised" in the data, but the extra process to extract the full template from a nested template can't be helpful.
Your process makes JSON illegible.
Most people choose JSON because of its simplicity (readability), and your algorithm breaks that feature.
Couldn't you just use
[ [1, 100, 100 ], [2, 100, 100, 200, 150 ], {} ]
instead of
[ { "": [1, 100, 100 ] }, { "": [2, 100, 100, 200, 150 ] }, {} ]
since the keys are all ""?