Detect duplicates based on key value store of fingerprints (!130) · Merge requests · umwelt-info / metadaten

Adam Reichold requested to merge detect-duplicates into main Nov 22, 2022

This lays the groundwork for duplicate detection by starting with a simple fingerprint based on title and description and an in-memory store.

~~When the number of datasets becomes too large to handle in memory, we can use a simple filesystem-based solution like the SNS cache (and keep only the locking in memory).~~

Long-term, we should definitely use a more flexible fingerprinting scheme that is less susceptible to small changes (e.g. whitespace changes currently lead to a completely different hash) and that possibly takes into account more structured data like resource URLs or DCAT identifiers.
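To illustrate the whitespace problem, here is a minimal Python sketch, not the code in this merge request. The function names, the choice of SHA-256 and the normalization steps (Unicode NFKC, case folding, collapsing whitespace) are all assumptions made for illustration. Normalizing before hashing only removes one class of trivial differences; it is not the more flexible scheme asked for above.

```python
import hashlib
import unicodedata


def naive_fingerprint(title: str, description: str) -> str:
    """Hash title and description exactly as given (hypothetical helper)."""
    data = f"{title}\0{description}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def normalized_fingerprint(title: str, description: str) -> str:
    """Hash the normalized fields so trivial edits do not change the result."""
    return naive_fingerprint(normalize(title), normalize(description))


# Two records that differ only in whitespace and letter case.
a = ("Air quality", "Hourly measurements")
b = ("Air  quality ", "hourly\tmeasurements")

print(naive_fingerprint(*a) == naive_fingerprint(*b))
print(normalized_fingerprint(*a) == normalized_fingerprint(*b))
```

The first comparison is false and the second is true. Any change beyond whitespace and case, such as a corrected typo in the description, would still produce a different hash.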

- Store fingerprint of each collected dataset (and a key identifying the dataset)
- Identify fingerprints associated with more than one key
- Merge those datasets and remove the redundant ones before indexing
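
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation in this merge request: the `Dataset` fields, the SHA-256 fingerprint, the dictionary used as the in-memory store, and the merge policy (keep the first dataset and take over the resources of the others) are all hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    key: str  # hypothetical identifier, e.g. "<source>/<id>"
    title: str
    description: str
    resources: List[str] = field(default_factory=list)


def fingerprint(dataset: Dataset) -> str:
    """Simple fingerprint over title and description (SHA-256 is an assumption)."""
    hasher = hashlib.sha256()
    hasher.update(dataset.title.encode("utf-8"))
    hasher.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    hasher.update(dataset.description.encode("utf-8"))
    return hasher.hexdigest()


def find_duplicates(datasets: List[Dataset]) -> List[List[str]]:
    """Steps 1 and 2: store fingerprint -> keys, then report shared fingerprints."""
    store: Dict[str, List[str]] = defaultdict(list)
    for dataset in datasets:
        store[fingerprint(dataset)].append(dataset.key)
    return [keys for keys in store.values() if len(keys) > 1]


def merge_duplicates(datasets: List[Dataset]) -> List[Dataset]:
    """Step 3: fold redundant datasets into the first one of each group."""
    by_key = {dataset.key: dataset for dataset in datasets}
    for keys in find_duplicates(datasets):
        keeper = by_key[keys[0]]
        for key in keys[1:]:
            redundant = by_key.pop(key)
            for resource in redundant.resources:
                if resource not in keeper.resources:
                    keeper.resources.append(resource)
    return list(by_key.values())


datasets = [
    Dataset("a/1", "Air quality", "Hourly measurements", ["https://example.org/a.csv"]),
    Dataset("b/2", "Air quality", "Hourly measurements", ["https://example.org/b.csv"]),
    Dataset("c/3", "Water levels", "Daily measurements"),
]
for dataset in merge_duplicates(datasets):
    print(dataset.key, dataset.resources)
```

Here `a/1` and `b/2` share a fingerprint, so `b/2` is dropped and its resource URL is carried over to `a/1` before indexing. Which dataset survives and which fields are merged is a policy decision that this sketch does not settle.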

Closes #23 (closed)

Edited Dec 13, 2022 by Adam Reichold
