
Drop identical datasets to avoid racing overwrites

Adam Reichold requested to merge do-not-race-duplicates into main

If a task overwrites a dataset while the previous writer is still active, the dataset can end up corrupted. The simplest way to avoid this seems to be to drop the second dataset instead of overwriting the first, which is also more efficient.
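For illustration only, a minimal Rust sketch of the idea, assuming a hypothetical write path (the `Dataset` type, file layout, and the shared `seen` set are illustrative, not this project's actual API): the first task to claim an identifier writes the dataset, and any later task carrying an identical dataset drops it instead of racing the first writer.

```rust
use std::collections::HashSet;
use std::fs;
use std::sync::Mutex;

/// Illustrative dataset record; the real type is project-specific.
struct Dataset {
    identifier: String,
    content: Vec<u8>,
}

/// Hypothetical write path: the first task to claim an identifier writes
/// the dataset; a later task holding an identical dataset drops it rather
/// than overwriting a file the first writer may still be producing.
fn store(dataset: Dataset, seen: &Mutex<HashSet<String>>) -> std::io::Result<()> {
    // `insert` returns false if the identifier was already claimed.
    if !seen.lock().unwrap().insert(dataset.identifier.clone()) {
        // Identical dataset: drop it instead of racing the previous writer.
        return Ok(());
    }
    fs::write(format!("datasets/{}.bin", dataset.identifier), &dataset.content)
}
```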

This has not hit us yet because none of the harvesters spawns additional tasks, so we are currently limited to a single task per source, and identical datasets are a problem only within a given source. But we should fix this nevertheless, as there is no mechanism preventing a harvester from using multiple tasks to increase throughput.

This also changes the nomenclature from "duplicate" to "identical" to differentiate this case of duplicate identifiers from the duplicate detection based on fingerprints.
