Consider using a key-value store for datasets and auxiliary databases
We currently store datasets as well as auxiliary databases (like the auto-classification cache and the duplicates) in plain files, as this is simple and has workable async support via tokio::fs.
This lets us use the kernel's page cache, and thereby all available system memory, without additional tuning on our part. It also supports a high degree of concurrency through the scalable in-kernel implementation of the ext4 file system. The downsides are inefficient caching (each page cache entry is at least one page, e.g. 4 kB) and a large number of system calls to interact with the individual files, which carry significant overhead on contemporary systems.
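To make the caching inefficiency concrete, here is a rough back-of-the-envelope sketch. It assumes a 4096-byte page (the actual size depends on the platform) and ignores any index or metadata overhead a real store would add; the function names are hypothetical and only serve the illustration.

```rust
/// Page size assumed for illustration; the real value depends on the platform.
const PAGE_SIZE: u64 = 4096;

/// Rounds `bytes` up to a whole number of pages.
fn round_up_to_pages(bytes: u64) -> u64 {
    (bytes + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE
}

/// Page cache bytes needed when every entry lives in its own file:
/// each file is cached in whole pages, so small entries waste most of a page.
fn cached_bytes_one_file_per_entry(entry_sizes: &[u64]) -> u64 {
    entry_sizes.iter().map(|&size| round_up_to_pages(size)).sum()
}

/// Lower bound when all entries are packed back to back into a single file.
/// A real key-value store adds index and metadata pages on top of this.
fn cached_bytes_packed(entry_sizes: &[u64]) -> u64 {
    round_up_to_pages(entry_sizes.iter().sum())
}
```

Under these assumptions, 1000 entries of 200 bytes each need about 4 MB of page cache as individual files, but only about 200 kB when packed into one file.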
A reasonable alternative would be to use a key-value store for these databases, but it should be highly performant and ideally use memory-mapped I/O to benefit from the page cache. It should also be highly optimized for read-mostly workloads, as we have little need for scalable write transactions. Candidates are LMDB wrappers like lmdb-rkv or heed, SQLite wrappers like rusqlite, or greenfield implementations like redb. However, all of them have downsides: they introduce additional, unnecessary complexity such as SQL, a lot of tunable parameters, and more opaque file formats compared to manually interacting with plain files.
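One possible way to evaluate the candidates without committing to any of them would be to hide storage behind a small interface and keep the current plain-file layout as the baseline backend. The sketch below is only an illustration: the `KvStore` trait and `PlainFileStore` type are hypothetical names, not existing code. It uses blocking `std::fs` to stay self-contained (the real code would go through tokio::fs) and assumes keys are valid file names.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical minimal interface that any candidate backend would implement.
trait KvStore {
    /// Returns the stored value, or `None` if the key does not exist.
    fn get(&self, key: &str) -> io::Result<Option<Vec<u8>>>;
    /// Stores `value` under `key`, replacing any previous value.
    fn put(&self, key: &str, value: &[u8]) -> io::Result<()>;
}

/// Baseline backend: one plain file per key inside `dir`.
/// Assumes keys are valid file names without path separators.
struct PlainFileStore {
    dir: PathBuf,
}

impl KvStore for PlainFileStore {
    fn get(&self, key: &str) -> io::Result<Option<Vec<u8>>> {
        match fs::read(self.dir.join(key)) {
            Ok(bytes) => Ok(Some(bytes)),
            // A missing file simply means the key is absent.
            Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(err) => Err(err),
        }
    }

    fn put(&self, key: &str, value: &[u8]) -> io::Result<()> {
        fs::create_dir_all(&self.dir)?;
        fs::write(self.dir.join(key), value)
    }
}
```

A backend built on heed, rusqlite, or redb could then implement the same trait, so that benchmarks on read-mostly workloads compare like with like.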