I’ve been a bit slow in my learning of Ruby. This time around I’ve re-written Ozan Yigit’s ndbm clone: sdbm.
I’ve cheated a bit. My general plan has been to pick things up through re-writing bits of existing code in Ruby: but to do them from memory (swotting up if necessary) rather than translate them directly. In this particular case it was convenient (for testing) to ensure that my implementation produced files that were bit-for-bit identical to those produced by the original (as incorporated into Perl). So, to avoid incompatibilities, I wrote down the string hash algorithm and the block sizes used for reading and writing the files. It was only a tiny cheat.
[Incidentally, putting code into a post is a bit of a pain. I might not bother again. I formatted the code using GNU enscript, but I had to then remove double spacing that wordpress introduced. Also, it isn’t clear that it is possible to have backslashes in the code—I think wordpress is treating them as the start of an escape sequence that it should process itself.]
Yigit (in the sdbm README file) says: “sdbm is not a very complicated package, at least not after you familiarize yourself with the literature on external hashing.” I am not familiar with the literature, but Yigit’s code is perfectly readable—and a re-implementation is a good way for me to test and clarify my understanding.
Basically, sdbm implements a persistent hash table (aka associative array) with keys and values being considered as strings. Each instance uses two files: a ‘page’ file and a ‘directory’ file. The page file consists of a number of fixed length pages, each of which will contain a number of keys and their associated values. The directory file is used in working out which page to look in.
The big difference between this and an in-memory hash table is in the way it is enlarged as more items are added. In an in-memory hash table you start with an array containing a certain number of slots, each slot being a (possible empty) linked list of key/value pairs. When you insert an item you take the hash of the key modulo the number of slots, and that tells you which slot the item has to go it. When the table needs to be enlarged (to avoid excessively long linked lists) a new, larger array of slots is created, and all the items in the table are re-entered.
But with the data being stored on disk there would be quite a hit in having to go through all the table’s items in order to resize it. So instead the table is grown in a more piecemeal fashion.
A single page is like a slot in an in-memory hash table, in that all the items have keys with hashes that are equivalent in some sense. Here it is that all the hashes have a certain number of low-end bits the same. (A table starts with there being just one page, with none of the hashes’ bits being considered, and hence trivially all the hashes share a common, empty suffix.) When a particular page is found to be too full it gets split into two pages, with one more bit from the hashes being taken into consideration.
Given the hash of a key, if you know how many bits need to be used then taking that many bits from the low end of the hash gives you the number of the page that would contain this key. So, everything then hinges on knowing how many bits to use. This is where the directory file comes in.
The directory is based around a structure called a radix trie, which is something I’d not previously heard of. This is a tree in which nodes correspond to strings using some particular alphabet. The number of children a node can have being equal to the size of the alphabet, with each child having as its string the parent string prefixed by some letter from the alphabet. (So that each parent node has the common suffix of all its children, and the root node has the empty string.)
[Actually, the more common form of a trie is to have each parent be the common prefix of its children, and is used for in user interface implementations for drop-down suggestion completions in text entry boxes.]
In sdbm it is a binary trie that is used, i.e. the strings are bit strings. The leaf nodes correspond to pages, and the bit string for a leaf node is the common suffix of the hashes of the corresponding page’s keys. There is an inherent restriction that a node have either two or no children; and when a page gets split in two then its corresponding node gains its child nodes. So if you have a hash for a key then you navigate down from the root node, looking at one bit of the hash at a time to determine your path; and when you reach a leaf node then you stop, as this corresponds to the page you want to look at.
The odd thing is that the structure as I’ve just described it is entirely implicit. If we number the nodes in a regular manner that we can work out left child / right child / parent relations arithmetically, and so we can store the information we care about in an array. Moreover, all we care about is whether a node has children or not, since we will just be navigating down the trie looking for a leaf node. This is one bit of information, so we just need a bit array, which is what the directory file is.
I was a bit confused about this initially. I think this is partly because I rarely do bit fiddling in my day job, and in this there are two lots of conceptually distinct bit manipulation going on at the same time: on one hand there’s the bits of the hash, and hence the bit strings notionally contained in the trie; on the other there’s the bits in the directory file that encode the structure of the trie itself.
Another thing that took me a while to appreciate (although this is just down to daftness on my part) is that the numbering of pages and the numbering of nodes in the trie are distinct, as internal nodes do not correspond to pages. For example, if page 2 (i.e. page 10b) were to be split, then in the trie the node with bit string “10” would gain two new child nodes—“010” and “110”—whereas there would only be one extra page, the items being split between pages 2 and 6 (i.e. pages 10b and 110b).
Actually, although I found the directory more interesting conceptually, the part that gave me most trouble was the page manipulation code. In principle this isn’t at all complicated, but I found the details fiddly. (I didn’t do myself any favours by taking my time tracking down the Ruby debugger.)
Anyway, here’s the code.
# Ruby re-implementation of Ozan Yigit's sdbm library class Sdbm MAX_SPLIT = 20 def initialize(name) dir_file = name + '.dir' pag_file = name + '.pag' if File.exists?(dir_file) and File.exists?(pag_file) mode = 'r+' else mode = 'w+' end @dir = DirStorage.new(name + '.dir', mode) @pag = PageStorage.new(name + '.pag', mode) end def close() @pag.close @dir.close end def each() @pag.each_page { |page| page.each_pair { |i, k, v| yield(k, v) } } end def get(key) h = sdbm_hash(key) page_index = get_page_index(h) page = @pag.load_page(page_index) page.each_pair { |i, k, v| return v if k == key } return nil end def add(key, val) del(key) h = sdbm_hash(key) MAX_SPLIT.times { page_index = get_page_index(h) page = @pag.load_page(page_index) if page.can_fit?(key, val) page.add_pair(key, val) @pag.save_page(page) return else new_bit = 1 << @dir.depth(h) cur_page = @pag.make_empty_page(page_index) new_page = @pag.make_empty_page(page_index | new_bit) page.each_pair { |i,k,v| p = (sdbm_hash(k) & new_bit == 0) ? cur_page : new_page p.add_pair(k, v) } @dir.split(h) # Contrive to save last the page we will look at next iteration if (h & new_bit == 0) @pag.save_page(new_page) @pag.save_page(cur_page) else @pag.save_page(cur_page) @pag.save_page(new_page) end end } raise "Failed to add key after MAX_SPLIT attempts" end def del(key) h = sdbm_hash(key) page_index = get_page_index(h) page = @pag.load_page(page_index) deleted = page.del(key) @pag.save_page(page) if deleted return deleted end def sdbm_hash(s) n = 0 mask = (1 << 64) - 1 s.each_byte { |ch| (n = ch + 65599 * n) & mask } return n end def get_page_index(h) mask = (1 << @dir.depth(h)) - 1 return h & mask end end class Storage def initialize(filename, mode, blocksize) @blocksize = blocksize @file = open(filename, mode) @end_index = @file.stat.size() / @blocksize end def close @file.close end def make_empty_block # The string is supposed to have a single null character # but wordpress doesn't seem to like this return "00" * @blocksize end def load_block(index) if index >= @end_index @current_block = make_empty_block() @current_index = index else @file.seek(index * @blocksize) @current_block = @file.read(@blocksize) @current_index = index end return @current_block end def save_block(index, block) while @end_index <= index @file.seek(@end_index * @blocksize) @file.write(make_empty_block()) @end_index = @end_index + 1 end @file.seek(index * @blocksize) @file.write(block) end end # The directory file is treated as a great big bit vector. # There is a correspondence between this vector and a binary # tree/trie. The nodes are numbered as if the tree were # complete, but the bit corresponding to a node is only set # if that node exists and has children. # # I will try to explain this better in the post. class DirStorage < Storage def initialize(filename, mode) super(filename, mode, 4096) @index = -1 @block = nil end def ensure_loaded(index) if @index != index @block = load_block(index) @index = index end end def save() save_block(@index, @block) end def depth(h) depth = 0 n = get_node(h) + 1 while n > 1 depth = depth + 1 n = n >> 1 end return depth end def split(h) node = get_node(h) setbit(node) end def get_node(h) node = 0 while bitset?(node) go_left = h & 1 == 0 node = (node * 2) + (go_left ? 1 : 2) h = h >> 1 end return node end def bitset?(bit_index) byte_index = bit_index / 8 bit_in_byte = bit_index % 8 block_index = byte_index / @blocksize byte_in_block = byte_index % @blocksize return false if block_index >= @end_index ensure_loaded(block_index) byte = @block[byte_in_block] bit = (byte >> bit_in_byte) & 1 return bit != 0 end def setbit(bit_index) byte_index = bit_index / 8 bit_in_byte = bit_index % 8 block_index = byte_index / @blocksize byte_in_block = byte_index % @blocksize ensure_loaded(block_index) byte = @block[byte_in_block] byte = byte | (1 << bit_in_byte) @block[byte_in_block] = byte save() end end class PageStorage < Storage def initialize(filename, mode) super(filename, mode, 1024) # String was supposed to contain 2 null characters but # wordpress doesn't seem to like this @page = Page.new("0000", -1) end def each_page() for i in (0 ... @end_index) yield load_page(i) end end def load_page(page_index) if @page.index != page_index @page = Page.new(load_block(page_index), page_index) end return @page end def save_page(page) save_block(page.index, page.content) @page = page end def make_empty_page(page_index) return Page.new(make_empty_block(), page_index) end end # The page format has an array of short (two byte) integers at # the start of the page (going towards the end) and concatenated # entry data at the end of the page (going towards the front), # with free space in the middle. The first integer is the number # of entries in the page, and subsequent ones index into the page # to give the starts of the entries. (The end of an entry will with # be the start of the previous one, or - if it is the first - the # end of the page itself.) # # As described above, the content of the page is, in effect, an # array of strings. The mapping is given by this being used to # store a sequence of key/value pairs. (So each pair counts as # two entries.) class Page def initialize(content, index) @content = content @index = index @num_entries = get_short(0) if @num_entries == 0 @entries_start = @content.length else @entries_start = entry_start(@num_entries - 1) end end def content() return @content end def index() return @index end def each_pair() for i in (0 ... @num_entries / 2) key = get_entry(i*2) val = get_entry(i*2 + 1) yield(i, key, val) end end def can_fit?(key, val) need = 4 + key.length + val.length if need > @content.length - 2 # Too large even if the page were empty raise "Key/value pair too large to fit on page" end free_end = @entries_start free_start = 2 * (1 + @num_entries) return need <= (free_end - free_start) end def add_pair(key, val) add_entry(key) add_entry(val) end def del(key) which_pair = nil each_pair { |i,k,v| if (k == key) which_pair = i break end } return false if which_pair.nil? to_free = entry_end(which_pair*2) - entry_start(which_pair*2 + 1) for i in (which_pair + 1 ... @num_entries / 2) key_entry = i*2 val_entry = i*2 + 1 key_start = entry_start(key_entry) val_start = entry_start(val_entry) pair_start = val_start pair_end = entry_end(key_entry) to_move = pair_end - pair_start @content[pair_start + to_free, to_move] = @content[pair_start, to_move] set_entry_start(key_entry - 2, key_start + to_free) set_entry_start(val_entry - 2, val_start + to_free) end set_num_entries(@num_entries - 2) return true end def add_entry(entry) @entries_start -= entry.length @content[@entries_start, entry.length] = entry set_entry_start(@num_entries, @entries_start) set_num_entries(@num_entries + 1) end def set_entry_start(i, n) set_short(i+1, n) end def entry_start(i) return get_short(i+1) end def entry_end(i) return i == 0 ? @content.length : get_short(i) end def get_entry(i) length = entry_end(i) - entry_start(i) return @content[entry_start(i), length] end def set_num_entries(num_entries) set_short(0, num_entries) @num_entries = num_entries end def get_short(i) return @content[2*i+1] << 8 | @content[2*i] end def set_short(i, n) @content[2*i+1] = ( n >> 8 ) & 0xff @content[2*i] = n & 0xff end end
This post gives the light in which we can observe the reality. this is very awesome one and gives indepth information. thanks a lot for this awesome article! With regards, Karin.
That’s very kind of you to say so, Karin, although I’m not entirely sure I understand what you’re saying. Perhaps you could be more specific.