Identify duplicate datasets

Created by: adamreichold

It is quite likely that we will harvest the same dataset from multiple sources, e.g. "Zoo Leipzig Jahreszahlen" can be harvested from both govdata.de and opendata.leipzig.de under different IDs.

The DCAT-AP.de implementation guide describes how to identify duplicates based on the dct:identifier field, which in this case forwards the ID from opendata.leipzig.de into the catalogue at govdata.de via a CKAN "extra" field called identifier. (Additionally, its full URL is available via the guid field.)
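A minimal sketch of identifier-based matching, assuming harvested datasets are available as CKAN package dictionaries with an `extras` list of `key`/`value` pairs. The function names and the fallback to the package's own `id` are assumptions for illustration, as is the idea that the forwarded value equals the ID used at the original source:

```python
from collections import defaultdict


def canonical_id(package: dict) -> str:
    """Return the forwarded identifier if present, else the package's own ID."""
    for extra in package.get("extras", []):
        if extra.get("key") == "identifier" and extra.get("value"):
            return extra["value"]
    # Assumption: a package without the extra is the original publication.
    return package["id"]


def group_by_identifier(packages: list[dict]) -> dict[str, list[dict]]:
    """Group packages by canonical ID and keep only groups with duplicates."""
    groups = defaultdict(list)
    for package in packages:
        groups[canonical_id(package)].append(package)
    return {key: group for key, group in groups.items() if len(group) > 1}
```

Packages from catalogues that do not forward an identifier end up in single-element groups and are therefore never reported, which is the limitation of this approach.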

Since this will only work for catalogues participating in DCAT-AP.de pipelines, it might be simpler to resolve duplicates based on the URL of the data itself, which should identify the dataset independently of any intermediaries publishing and identifying it. In this case, the URL https://statistik.leipzig.de/opendata/api/values?kategorie_nr=11&rubrik_nr=4&periode=y&format=csv would serve as the key.
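A rough sketch of URL-based matching, assuming each harvested dataset is reduced to a `(source, dataset_id, resource_urls)` tuple. That shape, the function names and the normalization rules (lowercased scheme and host, sorted query parameters, dropped fragment) are assumptions chosen for illustration and not an agreed design:

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Normalize a resource URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    # Sort query parameters so their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, keep the path as-is, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )


def group_by_resource_url(datasets):
    """Map each normalized URL to the (source, dataset_id) pairs referencing it."""
    index = defaultdict(set)
    for source, dataset_id, urls in datasets:
        for url in urls:
            index[normalize_url(url)].add((source, dataset_id))
    # Only URLs referenced by more than one dataset indicate duplicates.
    return {url: owners for url, owners in index.items() if len(owners) > 1}
```

Whether a single shared resource URL is enough to treat two datasets as duplicates, or whether the full set of resource URLs should match, is left open here.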

Edited Oct 06, 2022 by Adam Reichold