
A Quick Measure of Sortedness

Posted ten years ago

How do you measure the "sortedness" of a list? There are several ways. In the literature this measure is called the "distance to monotonicity" or the "measure of disorder", depending on whom you read. Estimating it is still an active area of research in the streaming setting, where items are presented to the algorithm one at a time. In this article, I consider the simpler case where you can look at all of the items at once.

The Kendall distance (also called the Kendall tau distance) between two lists is the number of adjacent swaps, as in bubble sort, that it would take to turn one list into the other. So, for [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and [10, 1, 2, 3, 4, 5, 6, 7, 8, 9], it would take nine swaps.
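As a rough illustration of the definition above, the distance can be computed by counting pairs that appear in opposite orders in the two lists, since each such pair costs one adjacent swap. This is a minimal quadratic-time sketch; the function name kendall_distance is made up for this example, and it assumes both lists contain the same distinct items.

```python
def kendall_distance(a, b):
    """Count the adjacent swaps needed to turn list a into list b.

    Assumes a and b hold the same distinct items (assumption of this sketch).
    """
    # Position of every item in the target list b.
    position = {item: index for index, item in enumerate(b)}
    # Rewrite a in terms of target positions; b itself would be 0, 1, 2, ...
    ranks = [position[item] for item in a]
    # Each out-of-order pair (an inversion) costs exactly one adjacent swap.
    inversions = 0
    for i in range(len(ranks)):
        for j in range(i + 1, len(ranks)):
            if ranks[i] > ranks[j]:
                inversions += 1
    return inversions


print(kendall_distance([10, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                       [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 9
```

A merge-sort based inversion count would bring this down to O(n log n), at the cost of a longer listing.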

Edit distance is another method. Here a single operation removes an element and reinserts it anywhere, so we could take the 10 and move it after the 9 in one operation. The edit distance is inversely related to the length of the longest increasing subsequence: it equals the number of elements outside that subsequence. In the list [1, 2, 3, 5, 4, 6, 7, 9, 8], one longest increasing subsequence is [1, 2, 3, 5, 6, 7, 9], of length seven, so the nine-element list is two moves away from being sorted. The longest increasing subsequence can be calculated in O(n log n) time. A drawback of this method is its coarse granularity. For a list of ten elements, the measure can only take the distinct values 0 through 9.
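One common way to get the O(n log n) bound is the "patience sorting" approach with binary search. The sketch below is one possible version, not necessarily the one the literature on this measure uses; the names lis_length and moves_to_sort are invented for this example, and it assumes the items are distinct and comparable.

```python
from bisect import bisect_left


def lis_length(items):
    """Length of the longest strictly increasing subsequence, in O(n log n)."""
    # tails[k] is the smallest value that can end an increasing
    # subsequence of length k + 1 among the items seen so far.
    tails = []
    for item in items:
        slot = bisect_left(tails, item)
        if slot == len(tails):
            tails.append(item)   # item extends the longest subsequence
        else:
            tails[slot] = item   # item gives a smaller tail for that length
    return len(tails)


def moves_to_sort(items):
    """Number of elements that lie outside the longest increasing subsequence."""
    return len(items) - lis_length(items)


print(lis_length([1, 2, 3, 5, 4, 6, 7, 9, 8]))     # prints 7
print(moves_to_sort([1, 2, 3, 5, 4, 6, 7, 9, 8]))  # prints 2
```

Because the result is always a whole number of elements, this also shows the granularity problem: a ten-element list can only produce ten different values.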

Here, I propose another measure for sortedness. The procedure is to sum the absolute differences between the position of each element in the sorted list, x, and where it ends up in the unsorted list, f(x). We divide by the square of the length of the list and multiply by two, because the sum can never exceed half the square of the length, which gives us a nice number between 0 and 1. Subtracting from 1 makes it range from 0, for completely unsorted, to 1, for completely sorted.
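Written as a formula, the description above might look like the following. The symbols s (the score) and n (the length of the list) are notation introduced for this sketch; x and f(x) are as defined in the text, with positions counted from 0.

```latex
s = 1 - \frac{2}{n^{2}} \sum_{x=0}^{n-1} \left| x - f(x) \right|
```

For example, in [1, 0, 2, 3] only the first two elements are out of place, each by one position, so the sum is 2 and s = 1 - (2 * 2) / 16 = 0.75.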

A simple genetic algorithm in Python that sorts a list using the above fitness function is presented below.

import random

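def procreate(A):
    # Return a mutated copy of A with two randomly chosen positions swapped.
    # The two positions may coincide, in which case the copy is unchanged.
    A = A[:]
    first = random.randint(0, len(A) - 1)
    second = random.randint(0, len(A) - 1)
    A[first], A[second] = A[second], A[first]
    return A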

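def score(A):
    # Assumes A is a permutation of 0..len(A)-1, so each element's value
    # is also its position in the sorted list.
    diff = 0.
    for index, element in enumerate(A):
        diff += abs(index - element)

    # Normalize by n^2 / 2 and flip, so that 1.0 means completely sorted.
    return 1.0 - 2 * diff / len(A) ** 2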

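def genetic(root, procreateFn, scoreFn, generations=1000, children=6):
    # Start from the root's own score so a worse child never replaces it.
    maxScore = scoreFn(root)
    for i in range(generations):
        print("Generation {0}: {1} {2}".format(i, maxScore, root))
        maxChild = None
        for j in range(children):
            # Use the functions passed in rather than the module-level ones.
            child = procreateFn(root)
            childScore = scoreFn(child)
            print("    child score {0:.2f}: {1}".format(childScore, child))
            if maxScore < childScore:
                maxChild = child
                maxScore = childScore
        # Keep the best child only if it strictly improved on the current root.
        if maxChild is not None:
            root = maxChild
    return root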

A = list(range(10))
random.shuffle(A)
genetic(A, procreate, score)

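Note that under this metric, the completely reversed list has a score of exactly 0 only when the list length n is even; for odd n it scores 1/n², slightly above 0. The reversed list is also not the only list that scores 0: rotating a list of even length by half its length does too.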

Spearman's rank correlation coefficient, mentioned in the comments, might be what you are looking for.
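For comparison, here is a small sketch of Spearman's coefficient using the standard no-ties formula, 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between an element's index and its sorted position. It follows the same convention as the score function above (a permutation of 0..n-1), and the function name spearman is made up for this example. The coefficient is undefined for fewer than two elements; returning 1.0 in that case is an arbitrary choice of this sketch.

```python
def spearman(A):
    """Spearman's rho between a permutation of 0..n-1 and its sorted order."""
    n = len(A)
    if n < 2:
        # Undefined for such lists; 1.0 is an assumption made by this sketch.
        return 1.0
    # Squared distance between where each element is and where it belongs.
    squared = sum((index - element) ** 2 for index, element in enumerate(A))
    return 1.0 - 6.0 * squared / (n * (n * n - 1))


print(spearman(list(range(10))))        # prints 1.0
print(spearman(list(range(9, -1, -1)))) # prints -1.0
```

Unlike the score above, this ranges from -1 for a reversed list to 1 for a sorted list, and it penalizes large displacements more heavily because the differences are squared.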

Steve Hanov makes a living working on Rhymebrain.com, rapt.ink, www.websequencediagrams.com, and Zwibbler.com. He lives in Waterloo, Canada.


Terry

three years ago

Why divide by N squared? It doesn't seem to be homogeneous.


Brian Pin

ten years ago

Nice article!

In "A Quick Measure of Sortedness", you proposed the "difference between the position of each element in the sorted list, x, and where it ends up in the unsorted list, f(x)", but in the code you are actually using the absolute difference of index and value. Why is that? Could you enlighten me?


madlep

ten years ago

Also check out Spearman's coefficient.

I wrote about it a while back at webuild.envato.com/blog/using-stats-to-not-break-search/

Quite a similar approach to what you're describing here.
