< >

O(n) Delta Compression With a Suffix Array

Posted eight years ago

ABSTRACT

The difference between two sequences A and B can be compactly stored using COPY/INSERT operations. The greedy algorithm for finding these operations relies on an efficient way of finding the longest matching part of A of any given position in B. This article describes how to use a suffix array to find the optimal sequence of operations in time proportional to the length of the input sequences. As a preprocessing step, we find and store the longest match in A for every position in B in two passes over the suffix array that has been enhanced with longest common prefix information (LCP).

INTRODUCTION

The days of losing work due to a power outage are over. In modern applications, users expect their work to be saved immediately. They also make mistakes, such as accidentally deleting large sections, and they expect to be able to restore their work to a prior state. To facilitate this, we need a way to retrieve or reconstruct every version of a document. We can take advantage of the similarities between the versions to minimize the space they consume.

INSERT/DELETE ALGORITHMS

Given two sequences of items A and B, we can compute a set of operations that will transform A into B. That way, only these operations need be stored. Two main classes of differencing algorithms are commonly used. Version control tools that deal with source code are often based on the longest common subsequence. For example, given the text:

A: The quick brown fox jumped over the lazy dog.

B: The lazy dog jumped over the quick brown fox.

The longest common subsequence is:

(The )(jumped over the )(.)

Although the text is written out here for clarity, only the position and length of each matching block is used. Using this information, one can directly compute the ranges where the text changed and thus derive INSERT / DELETE operations.

The example above can be encoded:

AT 5 DELETE "quick brown fox "
AT 5 INSERT "lazy dog"
AT 30 DELETE "lazy dog"
AT 30 INSERT "quick brown fox "

While INSERT/DELETE changes are easy to see visually in side-by-side comparison tools, they are suboptimal for file storage because they cannot exploit out-of-sequence similarities.

We can fix this by adding a MOVE operations. It is possible to change pairs of INSERT/DELETE commands to MOVE as a post-processing step. However, a more pressing issue is that any algorithm based on the longest common subsequence has a O(N^2) runtime in its worst case. Modern differencing tools often use Meyer's O(ND) algorithm, which is proportional to the length of the strings and the number of differences between them. Of course, when the two texts share little similarity, that will take a very long time.

THE COPY/INSERT ALTERNATIVE

Tichy describes system based only on COPY/INSERT. Starting with an empty string, it is possible to recreate a B by copying from various positions of A. Portions that do not exist in A can be created using the INSERT command. It can also be used when the encoding of the insert command would be smaller than the equivalent copy.

A: The quick brown fox jumped over the lazy dog.

COPY (The )
COPY (lazy dog)
COPY (jumped over the )
COPY (quick brown fox)
INSERT (.)

Finding COPY/INSERT operations is much simpler than LCS based algorithms.

# the position of the string.
q = 0
while q < len(B):
    find p and l such that (p.q.J) is a maximal block move 
    position, length = findLongestMatch(q)
    if length > 0:
        If insertFrom >= 0:
             outputInsertCommand(insertFrom, q - insertFrom)
             insertFrom = -1
        outputCopyCommand(position, length)
        q += length
    elif insertFrom == -1:
        insertFrom = q

if insertFrom >= 0:
     outputInsertCommand(insertFrom, q - insertFrom)

The runtime of the algorithm is dependent on the findLongestMatch function. Given a position in B, it finds the position in A with the longest matching sequence.

The brute force solution, which simply compares each possible position of A each time, performs surprisingly well when the sequences are mostly similar. It is not called very often, because the matches it finds are long. For sequences that are different, it again devolves into an O(N^2) running time.

The algorithm presented by MacDonald (2000) for use in the XDFS file system uses a preprocessing step to achieve O(N) operation. At each offset in A, the next 16 characters are placed into a lookup table and mapped to that position. FindLongestMatch is then:

code = next 16 characters in A
If code exists in the table,
    Return the length of the longest match at A[table[code]:] and B[index:]
Else
   No match at this position.

This algorithm is very fast and reasonably thorough. However, it is often not optimal. To guarantee O(N) time, the algorithm makes a tradeoff. When the table of positions is built, existing entries are "clobbered" by later ones. Only one position for each code is stored. If all of the positions were stored, then the algorithm would have to check each one, which would make the overall runtime O(N^2) with certain combinations of inputs.

THE SUFFIX ARRAY

A suffix array is an ordered list of positions in the string. Modern suffix array construction algorithms will bucket sort certain positions in the string, and then use the information to "induce" the positions of the other characters. With this clever trick, a suffix array can be created from a sequence in O(N) time where N is the length of the sequence.

It is often useful to build an enhanced suffix array. In addition to the suffix array of length N, the longest common prefix (LCP) array of length N-1 is built. IT contains the length of the longest common prefix between each entry and the next. By exploiting the commonalities in the sequence, these prefixes can also be computed in O(N) time, either during the suffix array creation, or as a separate step.

To find commonalities between two different sequences, they are appended together, separated by a character not found in either string, and a suffix array is constructed. An example is here:

A: "mississippi" + "u0001"

B: "sips and misses" + "u0000"

String   Index   Lcp  15 characters
   B       15     0 |"u0000"
A          11     0 |"u0001sips and missesu0000"
   B        4     1 |" and missesu0000"
   B        8     0 |" missesu0000"
   B        5     0 |"and missesu0000"
   B        7     0 |"d missesu0000"
   B       13     0 |"esu0000"
A          10     1 |"iu0001sips and missesu0000"
A           7     2 |"ippu0001sips and misses"
   B        1     1 |"ips and missesu0000"
   B       10     3 |"issesu0000"
A           4     4 |"issippiu0001sips and mis"
A           1     0 |"ississippiu0001sips and "
   B        9     4 |"missesu0000"
A           0     0 |"mississippiu0001sips and"
   B        6     0 |"nd missesu0000"
A           9     1 |"piu0001sips and missesu0000"
A           8     1 |"ppiu0001sips and missesu0000"
   B        2     0 |"ps and missesu0000"
   B       14     1 |"su0000"
   B        3     1 |"s and missesu0000"
   B       12     1 |"sesu0000"
A           6     3 |"sippiu0001sips and misse"
   B        0     2 |"sips and missesu0000"
A           3     1 |"sissippiu0001sips and mi"
   B       11     2 |"ssesu0000"
A           5     3 |"ssippiu0001sips and miss"
A           2     0 |"ssissippiu0001sips and m"

Note 1. The position of sequence 1 and 2 can be easily determined by its offset into the concatenated string.

Note 2. The LCP at index i efers to the commonality with position i+1 in the suffix array.

Note 3. The LCP between any two entries in the suffix array is the minimum of the LCP of all adjacent entries between them.

The common parts of the combined string AB are near each other in the the suffix array. However the common parts of A and B are not necessarily adjacent. If either string has commonalities with itself, then there will be two or more adjacent entries in the suffix array belonging to the same string.

Examining the suffix array above, we see that the location "sips and misses" in B has below it a match in string A of length 2. However, above it is a match of length 3. Our algorithm must consider both possibilities.

We do this in two passes. In the forward pass, we find the longest common prefixes between B and any part of A previous to it in the suffix array. In the reverse pass, we find the LCPs between B and any part of a that is after it in the suffix array.

def longestMatches(self):
        # returns, for every position in B, a tuple with the longest matching 
        # position in A and the length of that match.
        result = [None] * self.length2

        # forward pass
        lcp = 0
        aIndex = 0
        for i in range(len(self.sa)):
            if self.sa[i] < self.length1:
                # string in A
                lcp = self.lcp[i]
                aIndex = self.sa[i]
            else:
                # string in B.
                result[self.sa[i] - self.length1] = (aIndex, lcp)
                lcp = min(lcp, self.lcp[i])

        # reverse pass
        lcp = 0
        aIndex = 0
        for i in range(len(self.sa)-1, -1, -1):
            if self.sa[i] < self.length1:
                # string in A
                aIndex = self.sa[i]
                if i > 0:
                    lcp = self.lcp[i-1]
            else:
                # string in B.
                lcp = min(lcp, self.lcp[i])
                bIndex = self.sa[i] - self.length1
                oldAIndex, oldLcp = result[bIndex]
                if lcp > oldLcp:
                    result[bIndex] = (aIndex, lcp)

        return result

ANALYSIS

Even though building the suffix array is an O(N) algorithm, it will always be slower compared to the XDFS method because it requires several passes through the data. If optimal encoding is required, and the algorithm must be O(N) for all inputs, then the suffix array method may be considered. Preprocessing operations, such as removing the common prefix and suffix from the input, are important to reduce the problem size.

RESOURCES

Here is the Python source code that I used to test this technique. It contains a transcription of the SAIS suffix array construction algorithm from C into Python.

Steve Hanov makes a living working on Rhymebrain.com, rapt.ink, www.websequencediagrams.com, and Zwibbler.com. He lives in Waterloo, Canada.

Post comment

edit

greg

one year ago

nice blog

Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US

After many months, your software sale is complete! You've got a purchase order, sent the invoice, delivered the software. You're already handling some support issues from users at BigCorp. Then BANG! Martha from Procurement emails back, as a favour, just to let you know that BigCorp has not received your W8 form with a valid tax id, and therefore will be withholding 30% of the purchase price of your multi-thousand dollar product for taxes.

Make a web page screenshot service

I'll take you step by step into how to make a service that takes screenshots of webpages and returns them as an image.

Finding Bieber: On removing duplicates from a set of documents

Using a locality sensitive hash, you can mark duplicates in millions of items in no time.

Keeping Abreast of Pornographic Research in Computer Science

Burgeoning numbers of Ph.D's and grad students are choosing to study pornography. Techniques for the analysis of "objectionable images" are gaining increased attention (and grant money) from governments and research institutions around the world, as well as Google. But what, exactly, does computer science have to do with porn? In the name of academic persuit, let's roll up our sleeves and plunge deeply into this often hidden area that lies between the covers of top-shelf research journals.

Boring Date (comic)

Exploring sound with Wavelets

Here's a program to create scalograms of sound files.

What does your phone number spell?

Here, I explain a technique for figuring out which words are in which phone numbers. Full C source code is included.

An instant rhyming dictionary for any web site

Sometimes your API has to be simple enough for non-technical people to use it. Find out how to include a rhyming dictionary on your web page just by copying and pasting.

See sound without drugs

I have created an application that just turns on the microphone and continually plots the FFT magnitude of what it records. It allows control over the window size and sampling rate.

Coding tips they don't teach you in school

Some time-saving shortcuts for C code that will make your coworkers scream. In Awe.