
Use compressed archives of datasets when random access is no longer needed to reduce storage requirements.

Adam Reichold requested to merge datasets-compressed-storage into main

We cannot write the compressed representation directly during harvesting because we need random access for deduplication and clustering, but we can at least use a compressed format for long-term storage. To simplify the implementation, this is performed by the indexer instead of the harvester. It is done automatically, so the indexer will also migrate existing dataset directories on its own.
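To illustrate the migration step, here is a minimal sketch of what the indexer's archiving pass could look like, assuming the datasets live as plain files in a directory and using the `tar` and `zstd` crates; the function name, paths, and archive format are placeholders, not the actual code in this merge request.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Hypothetical sketch: if an uncompressed dataset directory is found,
/// pack it into a zstd-compressed tar archive and remove the original,
/// so existing directories are migrated automatically on the next run.
fn archive_datasets(datasets_dir: &Path, archive_path: &Path) -> io::Result<()> {
    if !datasets_dir.is_dir() {
        // Nothing to migrate; the directory was already archived.
        return Ok(());
    }

    let file = File::create(archive_path)?;
    let encoder = zstd::stream::Encoder::new(file, 19)?;

    let mut builder = tar::Builder::new(encoder);
    builder.append_dir_all("datasets", datasets_dir)?;

    // Flush the tar footer and the zstd frame before deleting the source.
    let encoder = builder.into_inner()?;
    encoder.finish()?;

    std::fs::remove_dir_all(datasets_dir)?;
    Ok(())
}
```

Keeping the archive step in the indexer means the harvester never has to know about the long-term storage format and the migration needs no separate tooling.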

The result is that, together with the compression proposed for the controlled vocabularies here, the datasets directory goes from 2.9 GB to 210 MB. The same applies to datasets.old, so we end up with a reduction of more than 5.3 GB, which is quite significant as the VM currently has a local disk of 32 GB. Of course, we still need 3 GB during harvesting, but at least datasets.old will already be archived, so even short-term peak requirements are reduced by almost 2.7 GB.
