Use compressed archives of datasets to reduce storage requirements once random access is no longer needed.
We cannot write the compressed representation directly during harvesting because we need random access for deduplication and clustering, but we can at least use a compressed format for long-term storage. To simplify the implementation, the compression is performed by the indexer instead of the harvester. It happens automatically, so the indexer will also migrate existing dataset directories on its own.
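A minimal sketch of what this archiving step could look like, assuming a Python-based indexer; the helper names (`archive_dataset_dir`, `migrate_existing`) and the tar.xz format are hypothetical and only illustrate the idea of compressing a dataset directory in place and removing the uncompressed copy:

```python
# Sketch only: archive a dataset directory into a compressed tarball for
# long-term storage and delete the original directory afterwards.
import shutil
import tarfile
from pathlib import Path


def archive_dataset_dir(dataset_dir: Path) -> Path:
    """Compress `dataset_dir` into `<name>.tar.xz` next to it, then
    remove the uncompressed directory once the archive is written."""
    # Build the archive path explicitly so names containing dots
    # (e.g. "datasets.old") are preserved as-is.
    archive_path = dataset_dir.parent / (dataset_dir.name + ".tar.xz")
    with tarfile.open(archive_path, "w:xz") as tar:
        # Store contents under the directory name so extraction
        # recreates the original layout.
        tar.add(dataset_dir, arcname=dataset_dir.name)
    shutil.rmtree(dataset_dir)  # free the uncompressed copy
    return archive_path


def migrate_existing(datasets_root: Path) -> None:
    """One-off migration: archive any dataset directories that are
    still stored uncompressed under `datasets_root`."""
    for entry in datasets_root.iterdir():
        if entry.is_dir():
            archive_dataset_dir(entry)
```

Because the migration simply runs over whatever uncompressed directories remain, the same code path handles both newly indexed datasets and the existing ones left over from earlier harvests.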
Together with the compression proposed for the controlled vocabularies here, the datasets directory shrinks from 2.9 GB to 210 MB. The same applies to datasets.old, so we end up with a total reduction of more than 5.3 GB (roughly 2.7 GB saved twice), which is quite significant given that the VM currently has a local disk of only 32 GB. Of course, we still need about 3 GB during harvesting, but at least datasets.old will already be archived, so even the short-term peak requirement is reduced by almost 2.7 GB.