Use memory-mapped secondary indexes to avoid having to keep them in memory

Adam Reichold requested to merge memory-mapped-secondary-indexes into main

Our secondary indexes for bounding boxes and time ranges are currently held fully in memory. This is not yet a problem because the overall index is small (currently less than 300 MB), but the R* trees are already the largest index segments, with only the actual document store being larger.

To avoid this, we can apply the same technique that Tantivy uses internally, i.e. memory maps and suitable implementations of the relevant data structures that are "flat" and hence can be mapped directly from disk. This way the operating system will handle paging in those parts of the index file which are required in a way that optimizes overall memory usage.

Luckily, someone built suitable replacements for our R* and interval trees, i.e. rstar is replaced by sif-rtree and intervaltree is replaced by sif-itree. This by itself does not require any significant code changes, but it allows us to write the whole trees (as a slice of nodes and hence bytes) to disk and then map that file to back the queries.
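To illustrate what "flat" means here, the following is a minimal sketch using only the standard library. It is not the layout or API of sif-itree or sif-rtree: the record format, `RECORD_LEN`, `encode` and `query_overlaps` are all hypothetical, and a sorted array with binary search stands in for a real interval tree. The point is only that queries run directly against a byte slice, so that slice could just as well be a memory-mapped file instead of a `Vec<u8>`.

```rust
use std::ops::Range;

/// Size of one encoded record: start, end and document id,
/// each stored as a little-endian u64 (hypothetical layout).
const RECORD_LEN: usize = 24;

/// Encodes the items as fixed-size records sorted by range start.
/// The resulting bytes could be written to disk as they are.
pub fn encode(mut items: Vec<(Range<u64>, u64)>) -> Vec<u8> {
    items.sort_by_key(|(range, _)| range.start);
    let mut bytes = Vec::with_capacity(items.len() * RECORD_LEN);
    for (range, doc) in items {
        bytes.extend_from_slice(&range.start.to_le_bytes());
        bytes.extend_from_slice(&range.end.to_le_bytes());
        bytes.extend_from_slice(&doc.to_le_bytes());
    }
    bytes
}

/// Reads one little-endian u64 at the given byte offset.
fn read_u64(bytes: &[u8], offset: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[offset..offset + 8]);
    u64::from_le_bytes(buf)
}

/// Returns the document ids of all records overlapping `wanted`,
/// reading the records in place without deserializing them first.
pub fn query_overlaps(bytes: &[u8], wanted: Range<u64>) -> Vec<u64> {
    let len = bytes.len() / RECORD_LEN;
    // Binary search for the first record starting at or after `wanted.end`;
    // no record from that position onwards can overlap.
    let (mut lo, mut hi) = (0, len);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if read_u64(bytes, mid * RECORD_LEN) < wanted.end {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    // Among the remaining candidates, keep those ending after `wanted.start`.
    (0..lo)
        .filter(|&i| read_u64(bytes, i * RECORD_LEN + 8) > wanted.start)
        .map(|i| read_u64(bytes, i * RECORD_LEN + 16))
        .collect()
}
```

Because `query_overlaps` only touches the records it actually inspects, backing `bytes` with a memory map would let the operating system page in just those parts of the file.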

The slight increase in code complexity is proportional to the increase in functionality IMHO, but there is a major downside: this does require unsafe code, i.e. code which the compiler cannot check for memory safety and absence of undefined behavior. This is somewhat localized and our CI is still configured so that individual scopes have to opt in using #[allow(unsafe_code)], but it nevertheless makes the code base more complex to maintain.
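As a rough sketch of what such a localized opt-in can look like (the module `flat` and the functions `as_bytes` and `as_words` are hypothetical and not taken from this merge request), the `unsafe_code` lint can be denied for a whole scope and then allowed only on the individual functions that reinterpret bytes. The example assumes plain `u64` values as node type, which is valid for any bit pattern, so only length and alignment have to be checked.

```rust
#[deny(unsafe_code)]
pub mod flat {
    use std::mem::{align_of, size_of};

    /// Views a slice of words as raw bytes, e.g. to write it to disk.
    #[allow(unsafe_code)]
    pub fn as_bytes(words: &[u64]) -> &[u8] {
        // SAFETY: u8 has alignment 1, the length covers exactly the same
        // memory as `words`, and the returned borrow has the same lifetime.
        unsafe {
            std::slice::from_raw_parts(
                words.as_ptr().cast::<u8>(),
                words.len() * size_of::<u64>(),
            )
        }
    }

    /// Views raw bytes (e.g. from a memory map) as a slice of words.
    /// Returns `None` if the length or alignment does not fit.
    #[allow(unsafe_code)]
    pub fn as_words(bytes: &[u8]) -> Option<&[u64]> {
        if bytes.len() % size_of::<u64>() != 0 {
            return None;
        }
        if bytes.as_ptr() as usize % align_of::<u64>() != 0 {
            return None;
        }
        // SAFETY: length and alignment were checked above and every bit
        // pattern is a valid u64. The values use native byte order.
        Some(unsafe {
            std::slice::from_raw_parts(
                bytes.as_ptr().cast::<u64>(),
                bytes.len() / size_of::<u64>(),
            )
        })
    }
}
```

Any other function inside `flat` that uses `unsafe` without its own `#[allow(unsafe_code)]` fails to compile, which keeps the parts needing manual review easy to find.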

Edited by Adam Reichold
