Identify duplicate datasets
Created by: adamreichold
It is quite likely that we will harvest the same dataset from multiple sources, e.g. "Zoo Leipzig Jahreszahlen" can be harvested from both govdata.de and opendata.leipzig.de under different IDs.
The DCAT-AP.de implementation guide describes how to identify duplicates based on the dct:identifier field, which in this case forwards the ID from opendata.leipzig.de into the catalogue at govdata.de via a CKAN "extra" field called identifier. (Additionally, its full URL is available via the guid field.)
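A minimal sketch of the identifier-based approach, assuming each harvested dataset is available as a CKAN package dictionary with the usual `extras` list of `{"key": ..., "value": ...}` entries. The helper names are hypothetical, and whether the forwarded value matches the source portal's own `id` (as assumed here) would need to be checked against real data:

```python
from typing import Optional


def extra_value(package: dict, key: str) -> Optional[str]:
    """Return the value of a CKAN "extra" field, or None if it is absent."""
    for extra in package.get("extras", []):
        if extra.get("key") == key:
            return extra.get("value")
    return None


def upstream_identifier(package: dict) -> str:
    """Prefer the forwarded identifier; fall back to the package's own ID.

    Assumption: the source catalogue has no "identifier" extra and its own
    ID is what the aggregating catalogue forwards.
    """
    return extra_value(package, "identifier") or package["id"]
```

Two packages would then be treated as duplicates if `upstream_identifier` returns the same value for both.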
Since this only works for catalogues participating in DCAT-AP.de pipelines, it might be simpler to resolve duplicates based on the URL of the data itself, in this case https://statistik.leipzig.de/opendata/api/values?kategorie_nr=11&rubrik_nr=4&periode=y&format=csv. That URL should identify the dataset independently of any intermediaries publishing and identifying it.
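The URL-based variant could be sketched as follows. This is only an illustration: the `(source, dataset_id, urls)` tuple layout and both function names are assumptions, and the normalization (lower-casing scheme and host, sorting query parameters, dropping the fragment) is one possible choice that may be too lax or too strict for some portals.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url.strip())
    # Sort query parameters so that their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Scheme and host are case-insensitive; the fragment is dropped.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )


def find_duplicates(datasets):
    """Map each resource URL shared by several datasets to its owners.

    `datasets` is an iterable of (source, dataset_id, resource_urls) tuples.
    """
    by_url = defaultdict(set)
    for source, dataset_id, urls in datasets:
        for url in urls:
            by_url[normalize_url(url)].add((source, dataset_id))
    return {
        url: sorted(owners) for url, owners in by_url.items() if len(owners) > 1
    }
```

Datasets that share at least one normalized resource URL end up in the same entry of the result, regardless of which catalogue they were harvested from or which ID they carry there.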