key-value database

一直以为"key-value database"是一个很新的概念，但是维基明确说明了这种概念在很早的Unix系统中就自带了一个基于key-value的数据库dbm(database manager)，更让人惊奇的是这个软件的作者依然是大名鼎鼎的"ken Thompson"。不得不说，看起来一些很新的概念，在很早就已经被提出来，只是由于当时硬件(cpu运算速度、内存大小、磁盘访问速度等)的限制，导致这些概念没有被实现，或者没有流行起来。

The Unix system provides dbm (database manager), which is a 1979 library originally written by Ken Thompson. It is also ported to Microsoft Windows, provided through programming languages such as Perl for Win32. The dbm manages associative arrays of arbitrary data by use of a single key (a primary key). Modern implementations include sdbm, GNU dbm, and Berkeley DB. Although dbm precedes the concept of a NoSQL and is rarely mentioned in modern discourse, it is used by many pieces of software.

dbm

key-value
在dbm的指令中，支持的key、value都是一个datum结构，这是一个典型的无类型结构，只有起始地址和长度，就是一段连续的内存地址空间。key和value都是这种结构。

///@file: gdbm\src\gdbm.h.in
/* The data and key structure. */
typedef struct
{
  char *dptr;
  int   dsize;
} datum;

描述
尽管存储中使用的是原始的datum结构，并不表示每个key/value只有一个filed，事实上一个key/value也可以由多个filed。和C语言一样，多个field在操作时使用C语言中用来表示组合的大括号表示。这一点同样是一个让我感到有些“惊艳”的功能：可以在key/value的基础上自定义如何解释这些内部以二进制存储的结构。define后面只能是key或者content，分别表示key、value的解释方法。
在gdbm自带的doc文档中，说明了这种自定义描述的使用方法。这种描述跟C语言非常类似(毕竟是同一个作者)，里面甚至还有offset、pad、数组等这种C语言中常见的语法概念，甚至和C语言中可以通过field名初始化特定字段。

The string and stringz are special. Both define a string of bytes, similar to ‘char x[]’ in C. The former defines an array of bytes, the latter - a null-terminated string. This makes a difference, in particular, when the string is the only part of datum. Consider the following two definitions:

define key string
define key stringz
Now, suppose we want to store the string "ab" in the key. Using the definition (1), the dptr member of GDBM datum will contain two bytes: ‘a’, and ‘b’. Consequently, the dsize member will have the value 2. Using the definition (2), the dptr member will contain three bytes: ‘a’, ‘b’, and ASCII 0. The dsize member will have the value 3.

The definition (1) is the default for both key and content.

The second form of the define statement is similar to the C struct statement and allows for defining structural data. In this form, the definition part is a comma-separated list of data types and variables enclosed in curly braces. In contrast to the rest of gdbm commands, this command is inherently multiline and is terminated with the closing curly brace. For example:

 	
define content {
        int status,
        pad 8,
        char id[3],
        string name
}
This defines a structure consisting of three members: an integer status, an array of 3 bytes id, and an array of bytes name. Notice the pad statement: it allows to introduce padding between structure members. Another useful statement is offset: it specifies that the member following it begins at the given offset in the structure. Assuming the size of int is 8 bytes, the above definition can also be written as

 	
define content {
        int status,
        offset 16,
        char id[3],
        string name
}
NOTE: The string type can reasonably be used only if it is the last or the only member of the data structure. That’s because it provides no information about the number of elements in the array, so it is interpreted to contain all bytes up to the end of the datum.

对应代码

static int
datum_scan_tag (datum *dat, struct dsegm *ds, struct kvpair *kvlist)
{
  struct xdatum xd;
  int err = 0;
  struct kvpair *kv;

  /* Check keywords for consistency */
  for (kv = kvlist; kv; kv = kv->next)
    {
      if (!kv->key)
	{
	  lerror (&kv->loc,
		  _("mixing tagged and untagged values is not allowed"));
	  return 1;
	}
      if (!dsegm_list_find (ds, kv->key))
	{
	  lerror (&kv->loc, _("%s: no such field in datum"), kv->key);
	  return 1;
	}
    }

  /* Initialize datum */
  memset (&xd, 0, sizeof (xd));

  for (; err == 0 && ds; ds = ds->next)
    {
      switch (ds->type)
	{
	case FDEF_FLD:
	  kv = kvlist_find (kvlist, ds->v.field.name);
	  if (kv)
	    err = dsconv (&xd, ds, kv);
	  else
	    {
	      size_t sz = ds->v.field.type->size * ds->v.field.dim;
	      xd_expand (&xd, xd.off + sz);
	      xd.off += sz;
	    }
	  break;

	case FDEF_OFF:
	  xd_expand (&xd, ds->v.n);
	  xd.off = ds->v.n;
	  break;
	  
	case FDEF_PAD:
	  xd_expand (&xd, xd.off + ds->v.n);
	  xd.off += ds->v.n;
	  break;
	}
    }

  if (err)
    {
      free (xd.dptr);
      return 1;
    }

  dat->dptr  = xd.dptr;
  dat->dsize = xd.dsize;
      
  return 0;
}

int
datum_scan (datum *dat, struct dsegm *ds, struct kvpair *kv)
{
  return (kv->key ? datum_scan_tag : datum_scan_notag) (dat, ds, kv);
}

Extendible Hashing

在内存存储中，dbm使用的是Extendible Hashing-A Fast Access Method for Dynamic Files描述的一种实现，从概念上看这种结构并不复杂，但是在wiki上的python代码看起来让人非常困惑，主要是感觉在拆分逻辑中少了必要的复制操作(当然也可能是我python不是很熟悉而没理解这个代码的正确性)。

一个cpp实现

在github上的一个实现中，其中包含了拆分的一个细节，就是需要将当前的所有directory复制一份，追加在当前directory的后面。

void Directory::grow(void)
{
    for(int i = 0 ; i < 1<<global_depth ; i++ )
        buckets.push_back(buckets[i]);
    global_depth++;
}

并且注意到在插入时还可能递归(Directory::insert函数调用了Directory::insert函数)。

void Directory::insert(int key,string value,bool reinserted)
{
    int bucket_no = hash(key);
    int status = buckets[bucket_no]->insert(key,value);
    if(status==1)
    {
        if(!reinserted)
            cout<<"Inserted key "<<key<<" in bucket "<<bucket_id(bucket_no)<<endl;
        else
            cout<<"Moved key "<<key<<" to bucket "<<bucket_id(bucket_no)<<endl;
    }
    else if(status==0)
    {
        split(bucket_no);
        insert(key,value,reinserted);
    }
    else
    {
        cout<<"Key "<<key<<" already exists in bucket "<<bucket_id(bucket_no)<<endl;
    }
}

gdbm的实现

在gnu的gdbm实现中，可以看到函数主体是在一个while循环中，这也意味着在一次split调用中，可能会执行多次分裂。

///@file: gdbm\src\bucket.c
/* Split the current bucket.  This includes moving all items in the bucket to
   a new bucket.  This doesn't require any disk reads because all hash values
   are stored in the buckets.  Splitting the current bucket may require
   doubling the size of the hash directory.  */
int
_gdbm_split_bucket (GDBM_FILE dbf, int next_insert)
{
  off_t        old_adr[GDBM_HASH_BITS];  /* Address of the old directories. */
  int          old_size[GDBM_HASH_BITS]; /* Size of the old directories. */
  int	       old_count;	/* Number of old directories. */

  int          index;		/* Used in array indexing. */
  int          index1;		/* Used in array indexing. */
  
  /* No directories are yet old. */
  old_count = 0;
  while (dbf->bucket->count == dbf->header->bucket_elems)
    {
      int          new_bits;	/* The number of bits for the new buckets. */
      cache_elem  *newcache[2]; /* Location in the cache for the buckets. */
      off_t        adr_0;	/* File address of the new bucket 0. */
      off_t        adr_1;	/* File address of the new bucket 1. */
      avail_elem   old_bucket;	/* Avail Struct for the old bucket. */
      
      off_t        dir_start0;	/* Used in updating the directory. */
      off_t        dir_start1;
      off_t        dir_end;

      new_bits = dbf->bucket->bucket_bits + 1;

      /*
       * Allocate two new buckets.  They will be populated with the entries
       * from the current bucket (cache_mru->bucket), so make sure that
       * cache_mru remains unchanged until both buckets are fully formed.
       * Newly allocated buckets must be linked right after cache_mru, so
       * that all changed buckets form a contiguous sequence at the beginning
       * of the cache list (this is needed by _gdbm_cache_flush).
       */
      adr_0 = _gdbm_alloc (dbf, dbf->header->bucket_size);
      switch (cache_lookup (dbf, adr_0, dbf->cache_mru, &newcache[0]))
	{
	case cache_new:
	  break;

	case cache_found:
	  /* should not happen */
	  GDBM_DEBUG (GDBM_DEBUG_ERR,
		      "%s: bucket found where it should not",
		      dbf->name);
	  GDBM_SET_ERRNO (dbf, GDBM_BUCKET_CACHE_CORRUPTED, TRUE);
	  return -1;

	case cache_failure:
	  return -1;
	}
      _gdbm_new_bucket (dbf, newcache[0]->ca_bucket, new_bits);

      adr_1 = _gdbm_alloc (dbf, dbf->header->bucket_size);
      switch (cache_lookup (dbf, adr_1, newcache[0], &newcache[1]))
	{
	case cache_new:
	  break;

	case cache_found:
	  /* should not happen */
	  GDBM_DEBUG (GDBM_DEBUG_ERR,
		      "%s: bucket found where it should not",
		      dbf->name);
	  GDBM_SET_ERRNO (dbf, GDBM_BUCKET_CACHE_CORRUPTED, TRUE);
	  return -1;

	case cache_failure:
	  return -1;
	}
      _gdbm_new_bucket (dbf, newcache[1]->ca_bucket, new_bits);

      /* Double the directory size if necessary. */
      if (dbf->header->dir_bits == dbf->bucket->bucket_bits)
	{
	  off_t       *new_dir;		/* Pointer to the new directory. */
	  int          dir_size;	/* Size of the new directory. */
	  off_t        dir_adr; 	/* Address of the new directory. */
	  
	  if (dbf->header->dir_size >= GDBM_MAX_DIR_HALF)
	    {
	      GDBM_SET_ERRNO (dbf, GDBM_DIR_OVERFLOW, TRUE);
	      _gdbm_fatal (dbf, _("directory overflow"));
	      return -1;
	    }
	  dir_size = dbf->header->dir_size * 2;
	  dir_adr  = _gdbm_alloc (dbf, dir_size);
	  if (dir_adr == 0)
	    return -1;
	  new_dir = malloc (dir_size);
	  if (new_dir == NULL)
	    {
	      GDBM_SET_ERRNO (dbf, GDBM_MALLOC_ERROR, TRUE);
	      _gdbm_fatal (dbf, _("malloc error"));
	      return -1;
	    }

	  for (index = 0; index < GDBM_DIR_COUNT (dbf); index++)
	    {
	      new_dir[2*index]   = dbf->dir[index];
	      new_dir[2*index+1] = dbf->dir[index];
	    }
	  
	  /* Update header. */
	  old_adr[old_count] = dbf->header->dir;
	  dbf->header->dir = dir_adr;
	  old_size[old_count] = dbf->header->dir_size;
	  dbf->header->dir_size = dir_size;
	  dbf->header->dir_bits = new_bits;
	  old_count++;
	  
	  /* Now update dbf.  */
	  dbf->header_changed = TRUE;
	  dbf->bucket_dir *= 2;
	  free (dbf->dir);
	  dbf->dir = new_dir;
	}

      /* Copy all elements in dbf->bucket into the new buckets. */
      for (index = 0; index < dbf->header->bucket_elems; index++)
	{
	  bucket_element *old_el = &dbf->bucket->h_table[index];
	  hash_bucket *bucket;
	  int elem_loc;
	  
	  if (old_el->hash_value < 0)
	    {
	      GDBM_SET_ERRNO (dbf, GDBM_BAD_BUCKET, TRUE);
	      return -1;
	    }

	  bucket =
	    newcache[(old_el->hash_value >> (GDBM_HASH_BITS - new_bits)) & 1]->ca_bucket;
	  elem_loc = old_el->hash_value % dbf->header->bucket_elems;
	  while (bucket->h_table[elem_loc].hash_value != -1)
	    elem_loc = (elem_loc + 1) % dbf->header->bucket_elems;
	  bucket->h_table[elem_loc] = *old_el;
	  bucket->count++;
	}
      
      /* Allocate avail space for the newcache[1]->ca_bucket. */
      newcache[1]->ca_bucket->bucket_avail[0].av_adr
	= _gdbm_alloc (dbf, dbf->header->block_size);
      if (newcache[1]->ca_bucket->bucket_avail[0].av_adr == 0)
	return -1;
      newcache[1]->ca_bucket->bucket_avail[0].av_size
	= dbf->header->block_size;
      newcache[1]->ca_bucket->av_count = 1;
      
      /* Copy the avail elements in dbf->bucket to newcache[0]->ca_bucket. */
      newcache[0]->ca_bucket->av_count = dbf->bucket->av_count;
      index = 0;
      if (newcache[0]->ca_bucket->av_count == BUCKET_AVAIL)
	{
	  /* The avail is full, move the first one to newcache[1]->ca_bucket.*/
	  _gdbm_put_av_elem (dbf->bucket->bucket_avail[0],
			     newcache[1]->ca_bucket->bucket_avail,
			     &newcache[1]->ca_bucket->av_count, 
                             dbf->coalesce_blocks);
	  index = 1;
	  newcache[0]->ca_bucket->av_count--;
	}

      index1 = 0;
      for (; index < dbf->bucket->av_count; index++)
	{
	  newcache[0]->ca_bucket->bucket_avail[index1++]
	    = dbf->bucket->bucket_avail[index];
	}
      
      /* Update the directory.  We have new file addresses for both buckets. */
      dir_start1 = (dbf->bucket_dir >> (dbf->header->dir_bits - new_bits)) | 1;
      dir_end = (dir_start1 + 1) << (dbf->header->dir_bits - new_bits);
      dir_start1 = dir_start1 << (dbf->header->dir_bits - new_bits);
      dir_start0 = dir_start1 - (dir_end - dir_start1);
      for (index = dir_start0; index < dir_start1; index++)
	dbf->dir[index] = adr_0;
      for (index = dir_start1; index < dir_end; index++)
	dbf->dir[index] = adr_1;
      
      /* Set changed flags. */
      newcache[0]->ca_changed = TRUE;
      newcache[1]->ca_changed = TRUE;
      dbf->directory_changed = TRUE;
      
      /* Update the cache! */
      dbf->bucket_dir = _gdbm_bucket_dir (dbf, next_insert);
      
      /* Invalidate old cache entry. */
      old_bucket.av_adr  = dbf->cache_mru->ca_adr;
      old_bucket.av_size = dbf->header->bucket_size;
      cache_elem_free (dbf, dbf->cache_mru);
      
      /* Set dbf->bucket to the proper bucket. */
      if (dbf->dir[dbf->bucket_dir] != adr_0)
	{
	  cache_elem *t = newcache[0];
	  newcache[0] = newcache[1];
	  newcache[1] = t;
	}

      _gdbm_put_av_elem (old_bucket,
			 newcache[1]->ca_bucket->bucket_avail,
			 &newcache[1]->ca_bucket->av_count, 
			 dbf->coalesce_blocks);

      lru_unlink_elem (dbf, newcache[0]);
      lru_link_elem (dbf, newcache[0], NULL);
    }

  /* Get rid of old directories. */
  for (index = 0; index < old_count; index++)
    if (_gdbm_free (dbf, old_adr[index], old_size[index]))
      return -1;

  return 0;
}

为什么可能递归

论文中说明了在directory使用的是hash之后的值取若干bit作为directory的索引，这样就避免了一些熟知分布不平均的问题。例如，假设取高3bits作为index，现实数据库中大部分的32bits整数的高3bits都为0。这样它们就很容易分布到同一个directory。
在directory的bits(globalbits)增加到4时，由于同样绝大部分高4bits同样都为0，所以还需要再次增加bits(因为它们还是在同一个directory并导致溢出)；依次类推多次，直到高若干bits可以被划分到不同的page。如果不经过hash，这种场景是很容易发生的。

栗子

下面是使用使用dbm描述的一个例子:

tsecer@harry: tools/gdbmtool 

Welcome to the gdbm tool.  Type ? for help.

gdbmtool> define key { int pid, int tid}
gdbmtool> define content { string name, int age}              
gdbmtool> store {1, 1} { tsecer, 3}
gdbmtool> fetch {1, 1}
name=tsecer\003\000\000\000,age=(not enough data)
gdbmtool>

可以看到这个显示并不符合预期：把整个value内容都作为name的值了。不过man手册早已看透了这一切，明确说明了string要么是唯一字段，要么是最后字段

NOTE: The string type can reasonably be used only if it is the last or the only member of the data structure. That’s because it provides no information about the number of elements in the array, so it is interpreted to contain all bytes up to the end of the datum.