Block sorting

Feeds:: Posts; Comments

Block sorting

04/11/2009 by rdn32

This is the first step in the Burrows-Wheeler (bzip) compression algorithm. It doesn’t compress anything in itself, but re-orders the characters of the input in such a way that they will then be easy to compress:

def encode(s)
    rot_nums = (0 ... s.length).collect{|i| i}
    rot_nums.sort! { |n,m| rotation(s,n) <=> rotation(s,m) }
    encoded = String.new(s)
    s.length.times { |n| encoded[n] = s[rot_nums[n]] }
    return encoded, rot_nums.index(s.length - 1)
end

# Contrive to say that the nth rotation of s is the
# one that puts the nth letter of s at the end
def rotation(s, n)
    i = (n + 1) % s.length
    return s[i ... s.length] + s[0 ... i]
end

def decode(encoded, index)
    length = encoded.length
    counts = {}
    occurrence = []
    length.times do |i|
        ch = encoded[i, 1]
        count = (counts[ch] || 0)
        occurrence.push(count)
        counts[ch] = count + 1
    end

    firsts = {}
    accum = 0
    counts.sort().each do |ch,count|
        firsts[ch] = accum
        accum += count
    end

    decoded = String.new(encoded)

    length.times do |i|
        decoded[i] = encoded[index]
        ch = decoded[i,1]
        index = firsts[ch] + occurrence[index]
    end

    decoded.reverse!

    return decoded
end

decoding = ARGV.length > 0 && ARGV.shift == "-d"
if decoding then
    i = $stdin.gets().to_i
    s = $stdin.read()
    $stdout.write(decode(s, i))
else
    s, i = encode($stdin.read())
    puts i
    $stdout.write(s)
end

I find it interesting partly because the whole concept is so strange, and partly because, even so the decoding logic is so simple, I keep having to re-convince myself that it works. (Although this might just be due to me being thick.) The reason why this peculiar re-ordering helps in compression is quite nice, but I’m not going to discuss it here.

The basic concept is that you take an input (e.g. “duck”) and find all of its rotations (“duck”, “uckd”, “ckdu”, “kduc”), sort them alphabetically (“ckdu”, “duck”, “kduc”, “uckd”), and take the last letter of each one (“ukcd”) as your encoded form together with the index into the list of sorted rotations that gives you the original input (the second, so 1).

In the code I don’t actually form a list of rotations, and instead sort some numbers representing the rotations. As I understand it there are cleverer ways of doing the sort so that it performs adequately, but Ruby’s built-in sort is sufficient for me here.

I find it relatively straightforward to convince myself that this transformation is reversible. Think of the rotations of the input as forming a matrix with each row being one rotation: what we are given is the last column of the matrix. It is possible to reconstruct the entire matrix one column at a time, and then use the index that we are also given to find the original input.

If we start with just the last column, and sort it alphabetically, then this gives us the first column which we can put into the matrix. So now we know the first and last columns. If we rotate the entire matrix, moving the last column to the front, and then re-sort, then this gives us the first two columns. So now we know the first two and the last columns. By repeatedly rotating and re-sorting we build up the whole matrix.

However, this isn’t how the algorithm works, although it’s still useful to think in terms of the matrix described above. The decoding algorithm reconstructs the original input by starting at last letter, then repeatedly finding the preceding one until the whole thing is known.

We are given the last column of the matrix, and so we can sort it to give the first column of the matrix. We are also given the row of the matrix that contains the original input, so from this we immediately know the last letter. If we can find out which row it is that has the letter in the first column corresponding to the letter in the last column, then this row will have the preceding letter in the last column (because all the rows are rotations). From this preceding letter we would want to find where it occurred in the first column, from which we could find the next preceding letter. And so on.

So everything hangs on being able to match up letter occurrences in the last column with the corresponding letter occurrences in the first. This is possible because for each letter character, the occurrences appear in the same order in both these two columns. For example, the seventh ‘T’ in the last column matches the seventh ‘T’ in the first. Of course, in the first column all the ‘T’s will appear consecutively – as the rows of the matrix are sorted alphabetically – which is why in the code we don’t actually need to construct the first column explicitly.

There are two things we need to work out in order to do the decoding. One is that for each entry in the last column we need to know which occurrence of it’s letter character it contains: if it is an ‘e’ we want to know if it is the first ‘e’, or the tenth, or whatever. The other thing we want is, for each letter character, which entry in the first column contains the first occurrence of that: this makes it trivial to find the entry that contains any specific occurrence of a letter character.

With these in place, decoding is trivial.

Posted in programming | Tagged algorithm, David Wheeler, Michael Burrows, programming, Ruby | Leave a Comment

Comments RSS

rdn32

Block sorting

Leave a comment Cancel reply

Archives

Categories

arts and crafts

favourite things

juvenilia

out and about

theory

rdn32

Block sorting

Share this:

Related

Leave a comment Cancel reply

Archives

Categories

arts and crafts

favourite things

juvenilia

out and about

theory