Use memory-mapped secondary indexes to avoid having to keep them in memory
Our secondary indexes for bounding boxes and time ranges are currently held fully in memory. This is not yet a problem because the overall index is small (currently less than 300 MB), but the R* trees are already the largest index segments, with only the actual document store being larger.
To avoid this, we can apply the same technique that Tantivy uses internally, i.e. memory maps and suitable implementations of the relevant data structures that are "flat" and hence can be mapped directly from disk. This way the operating system will handle paging in those parts of the index file which are required in a way that optimizes overall memory usage.
Luckily, someone built suitable replacements for our trees: rstar is replaced by sif-rtree and intervaltree is replaced by sif-itree. This in itself does not require any significant code changes, but it allows us to write the whole trees (as a slice of nodes and hence bytes) to disk and then map that file to back the queries.
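To illustrate what "flat" means here, the following is a minimal sketch and not the layout sif-itree or sif-rtree actually use. The names `Node`, `NODE_SIZE`, `write_flat` and `FlatIndex` are hypothetical, and a sorted array of intervals stands in for a real tree. The point is only that queries run directly against a byte slice, so it does not matter whether those bytes come from a heap buffer or from a mapped file.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Size of one encoded node: two little-endian u64 values (start, end).
const NODE_SIZE: usize = 16;

/// Hypothetical flat node describing the half-open range [start, end).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Node {
    start: u64,
    end: u64,
}

/// Sort the nodes by start and write them as one contiguous byte array.
fn write_flat(path: &Path, nodes: &mut [Node]) -> std::io::Result<()> {
    nodes.sort_by_key(|n| n.start);
    let mut file = fs::File::create(path)?;
    for n in nodes.iter() {
        file.write_all(&n.start.to_le_bytes())?;
        file.write_all(&n.end.to_le_bytes())?;
    }
    Ok(())
}

/// Decode one little-endian u64 from exactly eight bytes.
fn read_u64(raw: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(raw);
    u64::from_le_bytes(buf)
}

/// Read-only view over the raw bytes; nothing is deserialized up front.
struct FlatIndex<'a> {
    bytes: &'a [u8],
}

impl<'a> FlatIndex<'a> {
    /// Accept only buffers that hold a whole number of nodes.
    fn new(bytes: &'a [u8]) -> Option<Self> {
        if bytes.len() % NODE_SIZE == 0 {
            Some(FlatIndex { bytes })
        } else {
            None
        }
    }

    fn len(&self) -> usize {
        self.bytes.len() / NODE_SIZE
    }

    /// Decode the i-th node on demand from its fixed offset.
    fn node(&self, i: usize) -> Node {
        let raw = &self.bytes[i * NODE_SIZE..(i + 1) * NODE_SIZE];
        Node {
            start: read_u64(&raw[..8]),
            end: read_u64(&raw[8..]),
        }
    }

    /// All nodes whose range [start, end) contains `point`.
    fn containing(&self, point: u64) -> Vec<Node> {
        // Binary search for the first node with start > point.
        let (mut lo, mut hi) = (0usize, self.len());
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            if self.node(mid).start <= point {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // Only nodes before that position can contain the point.
        (0..lo)
            .map(|i| self.node(i))
            .filter(|n| point < n.end)
            .collect()
    }
}
```

In this sketch the bytes would come from `std::fs::read`, which still copies the whole file into memory. With a memory map, the same `&[u8]` would be backed by the file itself and the operating system would page in only the parts a query touches.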
The slight increase in code complexity is proportional to the increase in functionality IMHO, but there is a major downside: this requires unsafe code, i.e. code which the compiler cannot check for memory safety and absence of undefined behavior. The unsafe code is somewhat localized, and our CI is still configured so that individual scopes have to opt in using #[allow(unsafe_code)], but it does make the code base more complex to maintain nevertheless.
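As a rough illustration of the opt-in pattern, here is a sketch that is not the code in this PR. It assumes the lint is set to deny (not forbid) and applies it to a hypothetical `flat` module so that the example is self-contained. The `Node` type and both functions are hypothetical too. Each function that reinterprets memory carries its own #[allow(unsafe_code)] and states the invariants it relies on.

```rust
#[deny(unsafe_code)]
mod flat {
    use std::mem::{align_of, size_of};
    use std::slice;

    /// Hypothetical node layout. `repr(C)` fixes the field order, and two
    /// `u64` fields leave no padding, so every bit pattern is a valid `Node`.
    #[repr(C)]
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct Node {
        pub start: u64,
        pub end: u64,
    }

    /// View a slice of nodes as raw bytes (native byte order), e.g. to
    /// write them to disk.
    #[allow(unsafe_code)]
    pub fn as_bytes(nodes: &[Node]) -> &[u8] {
        // SAFETY: `Node` has no padding, `u8` has alignment 1, and the
        // length covers exactly the memory owned by `nodes`.
        unsafe {
            slice::from_raw_parts(
                nodes.as_ptr().cast::<u8>(),
                nodes.len() * size_of::<Node>(),
            )
        }
    }

    /// Reinterpret raw bytes (e.g. a mapped file) as a slice of nodes.
    /// Returns `None` if the buffer is misaligned or has a partial node.
    #[allow(unsafe_code)]
    pub fn from_bytes(bytes: &[u8]) -> Option<&[Node]> {
        let aligned = bytes.as_ptr() as usize % align_of::<Node>() == 0;
        let whole = bytes.len() % size_of::<Node>() == 0;
        if !(aligned && whole) {
            return None;
        }
        // SAFETY: alignment and length were checked above, and any bit
        // pattern is a valid `Node`.
        Some(unsafe {
            slice::from_raw_parts(
                bytes.as_ptr().cast::<Node>(),
                bytes.len() / size_of::<Node>(),
            )
        })
    }
}
```

Any other `unsafe` block added to the module without the attribute would fail to compile, which keeps the set of places that need manual review explicit.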