Detect duplicates based on a key-value store of fingerprints
This lays the groundwork for duplicate detection, starting with a simple fingerprint based on title and description and an in-memory store.
When the number of datasets becomes too large to handle in memory, we can switch to a simple filesystem-based solution like the SNS cache (keeping only the locking in memory).
Long-term we should definitely use a more flexible fingerprinting scheme that is less susceptible to small changes (e.g. currently, whitespace changes lead to a completely different hash) and possibly takes into account more structured data like resource URLs or DCAT identifiers.
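As a rough illustration of what such a more robust scheme could look like (this is a hypothetical sketch, not part of this change): normalizing whitespace and case before hashing would at least make purely cosmetic edits fingerprint-stable.

```python
import hashlib
import re


def normalized_fingerprint(title: str, description: str) -> str:
    """Hypothetical fingerprint that collapses whitespace and case
    before hashing, so cosmetic edits do not change the result."""

    def norm(text: str) -> str:
        # Collapse runs of whitespace to single spaces and lower-case.
        return re.sub(r"\s+", " ", text).strip().lower()

    payload = f"{norm(title)}\0{norm(description)}".encode()
    return hashlib.sha256(payload).hexdigest()


# Whitespace-only differences now map to the same fingerprint.
a = normalized_fingerprint("Air  Quality", "Hourly\nreadings")
b = normalized_fingerprint("air quality", "Hourly readings")
```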
- Store a fingerprint of each collected dataset (and a key identifying the dataset)
- Identify fingerprints associated with more than one key
- Merge those datasets and remove the redundant ones before indexing
Closes #23
Edited by Adam Reichold