Detect duplicates based on a key-value store of fingerprints

Adam Reichold requested to merge detect-duplicates into main

This lays the groundwork for duplicate detection by starting with a simple fingerprint based on title and description, kept in an in-memory store.

When the number of datasets becomes too large to handle in memory, we can use a simple filesystem-based solution like the SNS cache (and keep only the locking in memory).

Long-term we should definitely use a more flexible fingerprinting scheme that is less susceptible to small changes (e.g. a whitespace change currently leads to a completely different hash) and possibly takes into account more structured data like resource URLs or DCAT identifiers.
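To illustrate the whitespace problem, here is a minimal sketch of a fingerprint that normalizes its inputs before hashing. It is written in Python purely for illustration and is not the code in this merge request. The names `normalize` and `normalized_fingerprint` are hypothetical, as is the choice of SHA-256, NFKC normalization and case folding.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Reduce purely cosmetic differences before hashing (illustrative choice)."""
    # Unify Unicode representations and ignore letter case.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Collapse any run of whitespace into a single space and trim the ends.
    return " ".join(text.split())


def normalized_fingerprint(title: str, description: str) -> str:
    """Hash the normalized title and description (hypothetical helper)."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = normalize(title) + "\x00" + normalize(description)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this kind of normalization, `"Air  quality\n"` and `"air quality"` produce the same fingerprint, while any change in wording still produces a different one. Tolerating genuine near-duplicates would need a different technique than an exact hash.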

  • Store fingerprint of each collected dataset (and a key identifying the dataset)
  • Identify fingerprints associated with more than one key
  • Merge those datasets and remove the redundant ones before indexing
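The first two steps above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the implementation in this merge request: `fingerprint`, `FingerprintStore`, the SHA-256 hash and the string keys are all hypothetical, and the merge step is left out because it depends on the dataset model.

```python
import hashlib
from collections import defaultdict


def fingerprint(title: str, description: str) -> str:
    """Exact hash over title and description (hypothetical helper)."""
    hasher = hashlib.sha256()
    hasher.update(title.encode("utf-8"))
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    hasher.update(b"\x00")
    hasher.update(description.encode("utf-8"))
    return hasher.hexdigest()


class FingerprintStore:
    """In-memory map from fingerprint to the keys of datasets sharing it."""

    def __init__(self) -> None:
        self._keys_by_fingerprint: dict[str, set[str]] = defaultdict(set)

    def insert(self, key: str, title: str, description: str) -> None:
        # Step 1: store the fingerprint together with a key identifying the dataset.
        self._keys_by_fingerprint[fingerprint(title, description)].add(key)

    def duplicates(self) -> list[list[str]]:
        # Step 2: report every fingerprint associated with more than one key.
        return [
            sorted(keys)
            for keys in self._keys_by_fingerprint.values()
            if len(keys) > 1
        ]
```

A caller would insert every collected dataset and then pass each group returned by `duplicates()` to whatever merge logic runs before indexing:

```python
store = FingerprintStore()
store.insert("source-a/1", "Air quality", "Hourly measurements")
store.insert("source-b/7", "Air quality", "Hourly measurements")
store.insert("source-a/2", "Water levels", "Daily measurements")
groups = store.duplicates()
```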

Closes #23 (closed)

Edited by Adam Reichold
