Replace redb with a custom key-value store using an in-memory hashtable dictionary
This is another go at the "small datasets take up too much space as individual files" problem. I tried using redb to store the datasets (which was my original plan when introducing that dependency and using it for the AutoClassify cache), but it only reduced the size from 2.9 GB to 2.2 GB, so not by much.
And indeed, redb does have quite a large on-disk size overhead compared to other key-value stores due to its direct use of CoW B-trees. The other key-value stores out there come with different issues, though: overly complex configuration, large dependency closures, significant runtime overhead, or all of these.
But they are also much more capable than what we need, which is why I went ahead and built a custom key-value store that assumes small values that can be stored inline. It works by mapping a contiguous list of key-value pairs into memory and building an in-memory hashtable dictionary of the non-deleted entries, ensuring fast random access without on-disk space overhead.
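A minimal sketch of that idea (not the actual implementation; the record layout with length-prefixed keys/values and `u32::MAX` as a tombstone marker is an assumption, and the file is read into a buffer rather than mmap'd for simplicity):

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Write};
use std::ops::Range;

/// Assumed record layout: `key_len: u32 | value_len: u32 | key | value`;
/// a value_len of u32::MAX marks a deletion (tombstone).
const TOMBSTONE: u32 = u32::MAX;

struct KvStore {
    file: File,                            // append-only backing file
    data: Vec<u8>,                         // contiguous record list, in memory
    index: HashMap<Vec<u8>, Range<usize>>, // live keys -> value range in `data`
}

impl KvStore {
    /// Load the record list into memory and rebuild the index of live keys.
    fn open(path: &str) -> io::Result<Self> {
        let mut file = OpenOptions::new().read(true).append(true).create(true).open(path)?;
        let mut data = Vec::new();
        file.read_to_end(&mut data)?;
        let mut index = HashMap::new();
        let mut pos = 0;
        while pos < data.len() {
            let klen = u32::from_le_bytes(data[pos..pos + 4].try_into().unwrap()) as usize;
            let vlen = u32::from_le_bytes(data[pos + 4..pos + 8].try_into().unwrap());
            let key = data[pos + 8..pos + 8 + klen].to_vec();
            pos += 8 + klen;
            if vlen == TOMBSTONE {
                index.remove(&key); // later tombstone shadows earlier record
            } else {
                index.insert(key, pos..pos + vlen as usize);
                pos += vlen as usize;
            }
        }
        Ok(Self { file, data, index })
    }

    /// Append a record to the file and the in-memory list, then index it.
    fn put(&mut self, key: &[u8], value: &[u8]) -> io::Result<()> {
        let mut record = Vec::with_capacity(8 + key.len() + value.len());
        record.extend_from_slice(&(key.len() as u32).to_le_bytes());
        record.extend_from_slice(&(value.len() as u32).to_le_bytes());
        record.extend_from_slice(key);
        record.extend_from_slice(value);
        self.file.write_all(&record)?;
        let start = self.data.len() + 8 + key.len();
        self.data.extend_from_slice(&record);
        self.index.insert(key.to_vec(), start..start + value.len());
        Ok(())
    }

    /// Append a tombstone and drop the key from the index.
    fn delete(&mut self, key: &[u8]) -> io::Result<()> {
        let mut record = Vec::with_capacity(8 + key.len());
        record.extend_from_slice(&(key.len() as u32).to_le_bytes());
        record.extend_from_slice(&TOMBSTONE.to_le_bytes());
        record.extend_from_slice(key);
        self.file.write_all(&record)?;
        self.data.extend_from_slice(&record);
        self.index.remove(key);
        Ok(())
    }

    /// Lookups are a single hashtable probe into the in-memory buffer.
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.index.get(key).map(|r| &self.data[r.clone()])
    }
}
```

Deleted records only cost their tombstone until the file is rewritten, so the on-disk format stays a plain concatenation of records with no paging or tree structure on top.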
This should actually also help the AutoClassify cache, which has a fully random access pattern due to using cryptographic hashes as keys, something that is not really amenable to redb's approach of sorting/ordering keys.
It is not transactional at all, but that does not pose a problem: we do not need transactions for synchronization, as a plain mutex suffices.
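Concretely, something like this (hedged; `KvStore` refers to the sketch above, and the key/value bytes are illustrative):

```rust
use std::sync::Mutex;

// All callers share one store behind a plain mutex; each operation holds
// the lock only for the duration of a single call, with no transactions.
fn example(store: &Mutex<KvStore>) -> std::io::Result<()> {
    let mut guard = store.lock().unwrap();
    guard.put(b"dataset/foo", b"...")?;
    let _value = guard.get(b"dataset/foo").map(|v| v.to_vec());
    Ok(())
}
```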
So I guess the next step here is to write a migration for the datasets and see where that lands compared to the 2.2 GB with redb, to find out whether this is a viable approach. (Maybe also looking again at compression, which did not help with redb because of its space overhead.)