Implement harvester regression tests for local and CI testing
This adds an xtask called regression-test, which runs the harvester against a checked-in configuration and stored responses to detect whether the resulting datasets changed, and prints any modifications as an easy-to-read diff. For example, the change
diff --git a/src/harvester/csw.rs b/src/harvester/csw.rs
index 75777a6..bf31b83 100644
--- a/src/harvester/csw.rs
+++ b/src/harvester/csw.rs
@@ -128,7 +128,7 @@ pub async fn translate_dataset(
}
}
- let language = identification
+ let _language: crate::dataset::Language = identification
.languages
.first()
.and_then(|language| language.code)
@@ -154,7 +154,6 @@ pub async fn translate_dataset(
source_url: source.source_url().replace("{{id}}", identifier),
resources,
issued,
- language,
tags,
..Default::default()
};
will yield an output like
$ cargo xtask regression-test
Compiling metadaten v0.1.0 (/home/ubuntu/metadaten)
Finished dev-opt [optimized + debuginfo] target(s) in 13.93s
Running `target/dev-opt/harvester`
2023-01-19T16:33:55.283935Z INFO harvester: Harvesting 1 sources
2023-01-19T16:33:55.325318Z INFO harvest{source=Source { name: "uba-gdi", type: Csw, url: "https://gis.uba.de/smartfinder-csw/api", provenance: "/Bund/UBA/GDI", filter: None, source_url: Some("https://gis.uba.de/smartfinder-client/?lang=de#/datasets/iso/{{id}}"), concurrency: 1, batch_size: 100 }}: metadaten::harvester::csw: Harvesting 180 datasets
--- regression-test/datasets.old/uba-gdi/1b65bb0d-0085-41d3-97cc-57427b5f1742.json 2023-01-19 16:33:55.365529252 +0000
+++ regression-test/datasets/uba-gdi/1b65bb0d-0085-41d3-97cc-57427b5f1742.json 2023-01-19 16:33:55.365529252 +0000
@@ -54,5 +54,5 @@
"url": "https://gis.uba.de/maps/resources/apps/lu_umweltzonen"
}
],
- "language": "German"
+ "language": "Unknown"
}
--- regression-test/datasets.old/uba-gdi/229ed9cb-c817-46f9-91f0-f6337148ea19.json 2023-01-19 16:33:55.373529314 +0000
+++ regression-test/datasets/uba-gdi/229ed9cb-c817-46f9-91f0-f6337148ea19.json 2023-01-19 16:33:55.369529283 +0000
@@ -63,5 +63,5 @@
"url": "https://gis.uba.de/website/web/moos/index.html"
}
],
- "language": "German"
+ "language": "Unknown"
}
[..]
--- regression-test/datasets.old/uba-gdi/e5e66abd-f8d5-4284-8264-2f8279a3b175.json 2023-01-19 16:33:55.433529777 +0000
+++ regression-test/datasets/uba-gdi/e5e66abd-f8d5-4284-8264-2f8279a3b175.json 2023-01-19 16:33:55.429529746 +0000
@@ -57,5 +57,5 @@
"url": "https://www.umweltbundesamt.de/europaeische-mobilitaetswoche-aktionen-2022"
}
],
- "language": "German"
+ "language": "Unknown"
}
--- regression-test/datasets.old/uba-gdi/f693f3bc-b13b-44c2-8d55-e34636daf48c.json 2023-01-19 16:33:55.437529808 +0000
+++ regression-test/datasets/uba-gdi/f693f3bc-b13b-44c2-8d55-e34636daf48c.json 2023-01-19 16:33:55.437529808 +0000
@@ -60,5 +60,5 @@
"url": "https://gis.uba.de/website/luft/index.html"
}
],
- "language": "German"
+ "language": "Unknown"
}
Error: "15 datasets were modified, 0 were removed and 0 were added"
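The comparison behind that summary line can be sketched roughly as follows. This is a minimal illustration and not the xtask's actual implementation: the `Summary` type and the `compare` and `check` functions are hypothetical names, and in-memory maps stand in for the JSON files under regression-test/datasets.old and regression-test/datasets.

```rust
use std::collections::BTreeMap;

/// Hypothetical summary of the differences between two dataset snapshots.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Summary {
    pub modified: usize,
    pub removed: usize,
    pub added: usize,
}

/// Compares two snapshots keyed by dataset identifier.
/// The values stand in for the serialized JSON of each dataset.
pub fn compare(old: &BTreeMap<String, String>, new: &BTreeMap<String, String>) -> Summary {
    let mut summary = Summary::default();
    for (id, old_content) in old {
        match new.get(id) {
            // Present in both snapshots but with different content.
            Some(new_content) if new_content != old_content => summary.modified += 1,
            // Present in both snapshots and unchanged.
            Some(_) => {}
            // Present only in the old snapshot.
            None => summary.removed += 1,
        }
    }
    // Present only in the new snapshot.
    summary.added = new.keys().filter(|id| !old.contains_key(*id)).count();
    summary
}

/// Fails with a message shaped like the one shown above if anything changed.
pub fn check(summary: &Summary) -> Result<(), String> {
    if *summary == Summary::default() {
        Ok(())
    } else {
        Err(format!(
            "{} datasets were modified, {} were removed and {} were added",
            summary.modified, summary.removed, summary.added
        ))
    }
}
```

Returning an `Err` for any difference makes the command exit with a failure, which is what lets a CI job flag an unintended change.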
Still to be done before this can be merged:
- The resulting datasets are currently not fully deterministic, as the iteration order of the HashSet we are using for Dataset::tags is randomized, so we probably need either a different data structure or a deterministic hash function.
- The stored responses and datasets are large and should be handled using Git LFS, which must be enabled here and set up when baking development VM images.
- We need a more diverse set of stored responses and data sources to achieve reasonable coverage of our harvesters.
- While running the tests is a single command, intentional changes should be confirmed by committing the modified regression-test/datasets folder to Git, which is probably not obvious without documentation.
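For the first point, one possible direction is to switch to an ordered set. The sketch below assumes a trimmed-down, hypothetical `Dataset` struct and a hypothetical `render_tags` helper standing in for the real serialization; it only shows that a `BTreeSet` yields the same output regardless of insertion order.

```rust
use std::collections::BTreeSet;

/// Hypothetical, trimmed-down stand-in for the real `Dataset` type.
#[derive(Debug, Default)]
pub struct Dataset {
    pub title: String,
    /// `BTreeSet` iterates in sorted order, independent of insertion order
    /// and of any per-process hash seed.
    pub tags: BTreeSet<String>,
}

/// Renders the tags as a JSON-like array; stands in for real serialization
/// and does not escape special characters.
pub fn render_tags(dataset: &Dataset) -> String {
    let quoted: Vec<String> = dataset
        .tags
        .iter()
        .map(|tag| format!("\"{}\"", tag))
        .collect();
    format!("[{}]", quoted.join(", "))
}
```

An alternative would be to keep the HashSet and sort the tags only when writing them out, which leaves the in-memory type unchanged at the cost of a sort per dataset.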
Closes #179
Edited by Adam Reichold