SDBM

02/01/2010 by rdn32

I’ve been a bit slow in my learning of Ruby. This time around I’ve re-written Ozan Yigit’s ndbm clone: sdbm.

I’ve cheated a bit. My general plan has been to pick things up through re-writing bits of existing code in Ruby: but to do them from memory (swotting up if necessary) rather than translate them directly. In this particular case it was convenient (for testing) to ensure that my implementation produced files that were bit-for-bit identical to those produced by the original (as incorporated into Perl). So, to avoid incompatibilities, I wrote down the string hash algorithm and the block sizes used for reading and writing the files. It was only a tiny cheat.

[Incidentally, putting code into a post is a bit of a pain. I might not bother again. I formatted the code using GNU enscript, but I had to then remove double spacing that wordpress introduced. Also, it isn’t clear that it is possible to have backslashes in the code—I think wordpress is treating them as the start of an escape sequence that it should process itself.]

Yigit (in the sdbm README file) says: “sdbm is not a very complicated package, at least not after you familiarize yourself with the literature on external hashing.” I am not familiar with the literature, but Yigit’s code is perfectly readable—and a re-implementation is a good way for me to test and clarify my understanding.

Basically, sdbm implements a persistent hash table (aka associative array) with keys and values being considered as strings. Each instance uses two files: a ‘page’ file and a ‘directory’ file. The page file consists of a number of fixed length pages, each of which will contain a number of keys and their associated values. The directory file is used in working out which page to look in.

The big difference between this and an in-memory hash table is in the way it is enlarged as more items are added. In an in-memory hash table you start with an array containing a certain number of slots, each slot being a (possible empty) linked list of key/value pairs. When you insert an item you take the hash of the key modulo the number of slots, and that tells you which slot the item has to go it. When the table needs to be enlarged (to avoid excessively long linked lists) a new, larger array of slots is created, and all the items in the table are re-entered.

But with the data being stored on disk there would be quite a hit in having to go through all the table’s items in order to resize it. So instead the table is grown in a more piecemeal fashion.

A single page is like a slot in an in-memory hash table, in that all the items have keys with hashes that are equivalent in some sense. Here it is that all the hashes have a certain number of low-end bits the same. (A table starts with there being just one page, with none of the hashes’ bits being considered, and hence trivially all the hashes share a common, empty suffix.) When a particular page is found to be too full it gets split into two pages, with one more bit from the hashes being taken into consideration.

Given the hash of a key, if you know how many bits need to be used then taking that many bits from the low end of the hash gives you the number of the page that would contain this key. So, everything then hinges on knowing how many bits to use. This is where the directory file comes in.

The directory is based around a structure called a radix trie, which is something I’d not previously heard of. This is a tree in which nodes correspond to strings using some particular alphabet. The number of children a node can have being equal to the size of the alphabet, with each child having as its string the parent string prefixed by some letter from the alphabet. (So that each parent node has the common suffix of all its children, and the root node has the empty string.)

[Actually, the more common form of a trie is to have each parent be the common prefix of its children, and is used for in user interface implementations for drop-down suggestion completions in text entry boxes.]

In sdbm it is a binary trie that is used, i.e. the strings are bit strings. The leaf nodes correspond to pages, and the bit string for a leaf node is the common suffix of the hashes of the corresponding page’s keys. There is an inherent restriction that a node have either two or no children; and when a page gets split in two then its corresponding node gains its child nodes. So if you have a hash for a key then you navigate down from the root node, looking at one bit of the hash at a time to determine your path; and when you reach a leaf node then you stop, as this corresponds to the page you want to look at.

The odd thing is that the structure as I’ve just described it is entirely implicit. If we number the nodes in a regular manner that we can work out left child / right child / parent relations arithmetically, and so we can store the information we care about in an array. Moreover, all we care about is whether a node has children or not, since we will just be navigating down the trie looking for a leaf node. This is one bit of information, so we just need a bit array, which is what the directory file is.

I was a bit confused about this initially. I think this is partly because I rarely do bit fiddling in my day job, and in this there are two lots of conceptually distinct bit manipulation going on at the same time: on one hand there’s the bits of the hash, and hence the bit strings notionally contained in the trie; on the other there’s the bits in the directory file that encode the structure of the trie itself.

Another thing that took me a while to appreciate (although this is just down to daftness on my part) is that the numbering of pages and the numbering of nodes in the trie are distinct, as internal nodes do not correspond to pages. For example, if page 2 (i.e. page 10b) were to be split, then in the trie the node with bit string “10” would gain two new child nodes—“010” and “110”—whereas there would only be one extra page, the items being split between pages 2 and 6 (i.e. pages 10b and 110b).

Actually, although I found the directory more interesting conceptually, the part that gave me most trouble was the page manipulation code. In principle this isn’t at all complicated, but I found the details fiddly. (I didn’t do myself any favours by taking my time tracking down the Ruby debugger.)

Anyway, here’s the code.

# Ruby re-implementation of Ozan Yigit's sdbm library
class Sdbm
    MAX_SPLIT = 20

    def initialize(name)
        dir_file = name + '.dir'
        pag_file = name + '.pag'

        if File.exists?(dir_file) and File.exists?(pag_file)
            mode = 'r+'
        else
            mode = 'w+'
        end

        @dir = DirStorage.new(name + '.dir', mode)
        @pag = PageStorage.new(name + '.pag', mode)
    end

    def close()
        @pag.close
        @dir.close
    end

    def each()
        @pag.each_page { |page| page.each_pair { |i, k, v| yield(k, v) } }
    end

    def get(key)
        h = sdbm_hash(key)

        page_index = get_page_index(h)
        page = @pag.load_page(page_index)
        page.each_pair { |i, k, v| return v if k == key }

        return nil
    end

    def add(key, val)
        del(key)
        h = sdbm_hash(key)

        MAX_SPLIT.times {
            page_index = get_page_index(h)
            page = @pag.load_page(page_index)
            if page.can_fit?(key, val)
                page.add_pair(key, val)
                @pag.save_page(page)
                return
            else
                new_bit = 1 << @dir.depth(h)
                cur_page = @pag.make_empty_page(page_index)
                new_page = @pag.make_empty_page(page_index | new_bit)
                page.each_pair {
                    |i,k,v|
                    p = (sdbm_hash(k) & new_bit == 0) ? cur_page : new_page
                    p.add_pair(k, v)
                }

                @dir.split(h)

                # Contrive to save last the page we will look at next iteration
                if (h & new_bit == 0)
                    @pag.save_page(new_page)
                    @pag.save_page(cur_page)
                else
                    @pag.save_page(cur_page)
                    @pag.save_page(new_page)
                end
            end
        }

        raise "Failed to add key after MAX_SPLIT attempts"
    end

    def del(key)
        h = sdbm_hash(key)
        page_index = get_page_index(h)
        page = @pag.load_page(page_index)
        deleted = page.del(key)
        @pag.save_page(page) if deleted
        return deleted
    end

    def sdbm_hash(s)
        n = 0
        mask = (1 << 64) - 1
        s.each_byte { |ch| (n = ch + 65599 * n) & mask }
        return n
    end

    def get_page_index(h)
        mask = (1 << @dir.depth(h)) - 1
        return h & mask
    end
end

class Storage
    def initialize(filename, mode, blocksize)
        @blocksize = blocksize
        @file = open(filename, mode)
        @end_index = @file.stat.size() / @blocksize
    end

    def close
        @file.close
    end

    def make_empty_block
        # The string is supposed to have a single null character
        # but wordpress doesn't seem to like this
        return "00" * @blocksize
    end

    def load_block(index)
        if index >= @end_index
            @current_block = make_empty_block()
            @current_index = index
        else
            @file.seek(index * @blocksize)
            @current_block = @file.read(@blocksize)
            @current_index = index
        end

        return @current_block
    end

    def save_block(index, block)
        while @end_index <= index
            @file.seek(@end_index * @blocksize)
            @file.write(make_empty_block())
            @end_index = @end_index + 1
        end

        @file.seek(index * @blocksize)
        @file.write(block)
    end
end

# The directory file is treated as a great big bit vector.
# There is a correspondence between this vector and a binary
# tree/trie. The nodes are numbered as if the tree were
# complete, but the bit corresponding to a node is only set
# if that node exists and has children.
#
# I will try to explain this better in the post.
class DirStorage < Storage
    def initialize(filename, mode)
        super(filename, mode, 4096)
        @index = -1
        @block = nil
    end

    def ensure_loaded(index)
        if @index != index
            @block = load_block(index)
            @index = index
        end
    end

    def save()
        save_block(@index, @block)
    end

    def depth(h)
        depth = 0
        n = get_node(h) + 1

        while n > 1
            depth = depth + 1
            n = n >> 1
        end

        return depth
    end

    def split(h)
        node = get_node(h)
        setbit(node)
    end

    def get_node(h)
        node = 0

        while bitset?(node)
            go_left = h & 1 == 0
            node = (node * 2) + (go_left ? 1 : 2)
            h = h >> 1
        end

        return node
    end

    def bitset?(bit_index)
        byte_index    = bit_index / 8
        bit_in_byte   = bit_index % 8
        block_index   = byte_index / @blocksize
        byte_in_block = byte_index % @blocksize

        return false if block_index >= @end_index

        ensure_loaded(block_index)

        byte = @block[byte_in_block]
        bit = (byte >> bit_in_byte) & 1

        return bit != 0
    end

    def setbit(bit_index)
        byte_index    = bit_index / 8
        bit_in_byte   = bit_index % 8
        block_index   = byte_index / @blocksize
        byte_in_block = byte_index % @blocksize

        ensure_loaded(block_index)

        byte = @block[byte_in_block]
        byte = byte | (1 << bit_in_byte)
        @block[byte_in_block] = byte

        save()
    end
end

class PageStorage < Storage
    def initialize(filename, mode)
        super(filename, mode, 1024)
        # String was supposed to contain 2 null characters but
        # wordpress doesn't seem to like this
        @page = Page.new("0000", -1)
    end

    def each_page()
        for i in (0 ... @end_index)
            yield load_page(i)
        end
    end

    def load_page(page_index)
        if @page.index != page_index
            @page = Page.new(load_block(page_index), page_index)
        end
        return @page
    end

    def save_page(page)
        save_block(page.index, page.content)
        @page = page
    end

    def make_empty_page(page_index)
        return Page.new(make_empty_block(), page_index)
    end
end

# The page format has an array of short (two byte) integers at
# the start of the page (going towards the end) and concatenated
# entry data at the end of the page (going towards the front),
# with free space in the middle. The first integer is the number
# of entries in the page, and subsequent ones index into the page
# to give the starts of the entries. (The end of an entry will with
# be the start of the previous one, or - if it is the first - the
# end of the page itself.)
#
# As described above, the content of the page is, in effect, an
# array of strings. The mapping is given by this being used to
# store a sequence of key/value pairs. (So each pair counts as
# two entries.)
class Page
    def initialize(content, index)
        @content = content
        @index = index
        @num_entries = get_short(0)

        if @num_entries == 0
            @entries_start = @content.length
        else
            @entries_start = entry_start(@num_entries - 1)
        end
    end

    def content()
        return @content
    end

    def index()
        return @index
    end

    def each_pair()
        for i in (0 ... @num_entries / 2)
            key = get_entry(i*2)
            val = get_entry(i*2 + 1)
            yield(i, key, val)
        end
    end

    def can_fit?(key, val)
        need = 4 + key.length + val.length

        if need > @content.length - 2
            # Too large even if the page were empty
            raise "Key/value pair too large to fit on page"
        end

        free_end = @entries_start
        free_start = 2 * (1 + @num_entries)

        return need <= (free_end - free_start)
    end

    def add_pair(key, val)
        add_entry(key)
        add_entry(val)
    end

    def del(key)
        which_pair = nil

        each_pair {
            |i,k,v|
            if (k == key)
                which_pair = i
                break
            end
        }

        return false if which_pair.nil?

        to_free = entry_end(which_pair*2) - entry_start(which_pair*2 + 1)

        for i in (which_pair + 1 ... @num_entries / 2)
            key_entry = i*2
            val_entry = i*2 + 1

            key_start = entry_start(key_entry)
            val_start = entry_start(val_entry)

            pair_start = val_start
            pair_end   = entry_end(key_entry)
            to_move = pair_end - pair_start

            @content[pair_start + to_free, to_move] = @content[pair_start, to_move]
            set_entry_start(key_entry - 2, key_start + to_free)
            set_entry_start(val_entry - 2, val_start + to_free)
        end

        set_num_entries(@num_entries - 2)

        return true
    end

    def add_entry(entry)
        @entries_start -= entry.length
        @content[@entries_start, entry.length] = entry
        set_entry_start(@num_entries, @entries_start)
        set_num_entries(@num_entries + 1)
    end

    def set_entry_start(i, n)
        set_short(i+1, n)
    end

    def entry_start(i)
        return get_short(i+1)
    end

    def entry_end(i)
        return i == 0 ? @content.length : get_short(i)
    end

    def get_entry(i)
        length = entry_end(i) - entry_start(i)
        return @content[entry_start(i), length]
    end

    def set_num_entries(num_entries)
        set_short(0, num_entries)
        @num_entries = num_entries
    end

    def get_short(i)
        return @content[2*i+1] << 8 | @content[2*i]
    end

    def set_short(i, n)
        @content[2*i+1] = ( n >> 8 ) & 0xff
        @content[2*i]   =   n        & 0xff
    end
end

Posted in programming | Tagged data structure, memory, programming, Ruby | 2 Comments

2 Responses

on 12/11/2012 at 4:20 am | Reply Karin

This post gives the light in which we can observe the reality. this is very awesome one and gives indepth information. thanks a lot for this awesome article! With regards, Karin.
- on 12/11/2012 at 3:14 pm | Reply rdn32
  
  That’s very kind of you to say so, Karin, although I’m not entirely sure I understand what you’re saying. Perhaps you could be more specific.

Comments RSS

rdn32

SDBM

2 Responses

Leave a comment Cancel reply

Archives

Categories

arts and crafts

favourite things

juvenilia

out and about

theory

rdn32

SDBM

Share this:

Related

2 Responses

Leave a comment Cancel reply

Archives

Categories

arts and crafts

favourite things

juvenilia

out and about

theory