Refine deduplication of datasets
Having created the basic technical infrastructure for duplicate detection in !130 (merged), we need to refine two main aspects of dataset deduplication:
- We need a better fingerprint than a hash of title and description. On the one hand, it needs to be more robust against small changes such as whitespace differences introduced by scrapers, and it has to handle snippets instead of full descriptions. On the other hand, it should support well-defined deduplication mechanisms such as the globally unique identifiers used by the DCAT-AP.de standard. A possible approach is sketched after this list.
- We need a better merge policy that actually merges information from multiple datasets instead of discarding all but one. Most importantly, the most recent version of each property should be used wherever this can be determined; see the merge sketch at the end of this description.
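A minimal sketch of the fingerprint idea in Python is shown below. It assumes plain dataset dicts with `title`, `description`, and an optional `guid` field; the field names, the normalization rules, and the 200-character description cutoff are illustrative assumptions, not the final design. The sketch prefers a harvester-provided GUID when available and otherwise hashes the normalized title plus a bounded description prefix, so a truncated snippet and the full description still produce the same fingerprint.

```python
import hashlib
import re
from typing import Optional


def normalize(text: Optional[str]) -> str:
    """Lower-case and collapse whitespace so scraper-introduced
    differences do not change the fingerprint."""
    if not text:
        return ""
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(dataset: dict) -> str:
    """Return a stable deduplication key for a dataset dict.

    Prefers an explicit globally unique identifier (as mandated by
    DCAT-AP.de) when the harvester provides one; falls back to a hash
    of the normalized title plus a bounded description prefix.
    """
    guid = dataset.get("guid")  # assumed field name for the DCAT-AP.de identifier
    if guid:
        return f"guid:{guid.strip()}"

    title = normalize(dataset.get("title"))
    # Only a prefix of the description enters the hash so that snippets
    # and full descriptions of the same dataset collide.
    description = normalize(dataset.get("description"))[:200]
    digest = hashlib.sha256(f"{title}\n{description}".encode("utf-8")).hexdigest()
    return f"hash:{digest}"
```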
The above also implies that we require an extended metadata schema, for example one that includes the GUID mandated by DCAT-AP.de for sources such as the CKAN or CSW harvesters, which provide this information.
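For the merge policy, the following sketch illustrates a property-wise merge that keeps the newest non-empty value per field instead of discarding whole records. It assumes the extended schema provides an ISO-formatted `modified` timestamp per record; the field name and the notion of an "empty" value are assumptions for illustration.

```python
from datetime import datetime
from typing import Iterable


def merge_datasets(duplicates: Iterable[dict]) -> dict:
    """Merge a group of duplicate dataset records into one record.

    For every property, the value from the most recently modified record
    that actually has the property set wins; older records fill the gaps.
    """
    def modified(record: dict) -> datetime:
        value = record.get("modified")  # assumed ISO timestamp from the extended schema
        return datetime.fromisoformat(value) if value else datetime.min

    ordered = sorted(duplicates, key=modified)  # oldest first, newest last
    merged: dict = {}
    for record in ordered:
        for key, value in record.items():
            # Newer records overwrite older ones, but only with non-empty
            # values, so older information is never lost to gaps.
            if value not in (None, "", [], {}):
                merged[key] = value
    return merged
```

Records without a `modified` timestamp sort first in this sketch, so any record with a known modification date overrides them; whether that is the right behaviour for our harvesters is still to be decided.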