
Store compressed HTTP responses on disk and allow to replay them instead of making network requests

Adam Reichold requested to merge replay-responses into main

Created by: adamreichold

This includes some behind-the-scenes work to encapsulate usage of the HTTP client, which seems reasonable because it ensures that the retry logic is always used. As a nice end result, the harvesters themselves do not have to change materially at all.

The only requirement is that they provide a unique key for each request (basically the file name of the stored body on disk). We can then rerun the harvester on the raw response bodies and change anything about our parsing or translation logic or metadata schema. Of course, we cannot travel back in time: this will not work if the code changes alter the requests we would have made, because the stored responses then no longer match.
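
The record-or-replay idea described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this merge request: the function name `fetch_or_replay`, its parameters, and the closure standing in for the HTTP client are all hypothetical, the key is assumed to be usable as a file name, and compression is left out.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not the harvester's actual API): return the response
/// body for `key`, either by replaying it from disk or by calling `fetch`
/// and recording the result under `dir/key`.
pub fn fetch_or_replay<F>(dir: &Path, key: &str, replay: bool, fetch: F) -> io::Result<Vec<u8>>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    // Assumption: `key` is unique per request and safe to use as a file name.
    let path = dir.join(key);

    if replay {
        // Replay mode: never call `fetch`; a missing file surfaces as an error,
        // which is what happens when a code change alters the requests made.
        return fs::read(&path);
    }

    // Normal mode: perform the request, then store the raw body for later runs.
    let body = fetch()?;
    fs::create_dir_all(dir)?;
    fs::write(&path, &body)?;
    Ok(body)
}
```

With a wrapper of this shape, the calling harvester only supplies the key and the request; whether the body comes from the network or from disk is decided in one place.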

Storing the responses can get large quickly, but they usually compress well. Since our harvester is, for now, mostly waiting for the network anyway, I added Zstd compression, which reduces the responses from our default harvester configuration from 755M to 48M (roughly a 16-fold reduction).

To use this, one just needs to set the REPLAY_RESPONSES environment variable, e.g. by running

> REPLAY_RESPONSES= cargo xtask harvester
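
Note that the command above sets the variable to an empty value. One plausible way to honor that, shown here only as an assumption about how such a switch could be read and not as the harvester's actual code, is to test for the presence of the variable rather than its contents. The helper names `is_enabled` and `replay_enabled` are hypothetical.

```rust
use std::env;
use std::ffi::OsStr;

/// Hypothetical rule: the switch is on whenever the variable is present,
/// even if its value is the empty string.
pub fn is_enabled(value: Option<&OsStr>) -> bool {
    value.is_some()
}

/// Reads REPLAY_RESPONSES from the process environment and applies the rule above.
pub fn replay_enabled() -> bool {
    is_enabled(env::var_os("REPLAY_RESPONSES").as_deref())
}
```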

Below are two examples of how this works:

the first one running the harvester against network resources
> RUST_LOG=info DATA_PATH=data time ./target/release/harvester 
2022-08-12T16:30:09.192182Z  INFO harvester: Harvesting 6 sources
2022-08-12T16:30:09.791575Z  INFO harvest{source=Source { name: "stadt-leipzig", type: Ckan, url: "https://opendata.leipzig.de/", filter: None, source_url: Some("https://opendata.leipzig.de/dataset/{{name}}"), concurrency: 3, batch_size: 100 }}: umwelt_info::harvester::ckan: Harvesting 723 datasets
2022-08-12T16:30:09.849769Z  INFO harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}: umwelt_info::harvester::doris_bfs: Harvesting 507 datasets
2022-08-12T16:30:10.339620Z  INFO harvest{source=Source { name: "uba-gdi", type: Csw, url: "https://gis.uba.de/smartfinder-csw/api", filter: None, source_url: Some("https://gis.uba.de/smartfinder-client/?lang=de#/datasets/iso/{{id}}"), concurrency: 1, batch_size: 10 }}: umwelt_info::harvester::csw: Harvesting 180 datasets
2022-08-12T16:30:10.848271Z  INFO harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Retrieved 713 documents
2022-08-12T16:30:10.854728Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4397 has no valid entry for 'NAME'
2022-08-12T16:30:10.855552Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4467 has no valid entry for 'NAME'
2022-08-12T16:30:10.857523Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4651 has no valid entry for 'NAME'
2022-08-12T16:30:10.859063Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: harvester: Failed to harvest 3 out of 713 datasets (713 were transmitted)
2022-08-12T16:30:11.822389Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=60}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2021010424644
2022-08-12T16:30:12.579735Z  INFO harvest{source=Source { name: "govdata", type: Ckan, url: "https://www.govdata.de/ckan/", filter: None, source_url: Some("https://www.govdata.de/web/guest/suchen/-/details/{{name}}"), concurrency: 5, batch_size: 1000 }}: umwelt_info::harvester::ckan: Harvesting 61432 datasets
2022-08-12T16:30:13.756080Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=220}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2014111011874
2022-08-12T16:30:16.840931Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=380}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009082154
2022-08-12T16:30:17.434641Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=400}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009042313
2022-08-12T16:30:17.497272Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=410}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009011228
2022-08-12T16:30:17.962662Z  INFO harvest{source=Source { name: "geodatenkatalog", type: GeoNetworkQ, url: "http://gdk.gdi-de.org/gdi-de/srv/ger/q", filter: Some("environment"), source_url: Some("http://gdk.gdi-de.org/gdi-de/srv/ger/catalog.search#/metadata/{{id}}"), concurrency: 5, batch_size: 100 }}: umwelt_info::harvester::geo_network_q: Harvesting 5315 datasets
2022-08-12T16:30:18.224830Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=470}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-201004061230
2022-08-12T16:30:18.833466Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=490}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-201006222423
2022-08-12T16:32:50.535827Z  WARN harvest{source=Source { name: "geodatenkatalog", type: GeoNetworkQ, url: "http://gdk.gdi-de.org/gdi-de/srv/ger/q", filter: Some("environment"), source_url: Some("http://gdk.gdi-de.org/gdi-de/srv/ger/catalog.search#/metadata/{{id}}"), concurrency: 5, batch_size: 100 }}:fetch_datasets{summary=false from=4801 to=4900}: umwelt_info::harvester: Overwriting duplicate dataset 094bb2e5-c6fb-451a-bcfd-b52629a7e2ff
12.60user 6.09system 2:50.81elapsed 10%CPU (0avgtext+0avgdata 297584maxresident)k
0inputs+649344outputs (0major+40856minor)pagefaults 0swaps
and the second one using the stored responses from disk
> RUST_LOG=info DATA_PATH=data REPLAY_RESPONSES= time ./target/release/harvester 
2022-08-12T16:33:30.702428Z  INFO harvester: Harvesting 6 sources
2022-08-12T16:33:30.705569Z  INFO harvest{source=Source { name: "uba-gdi", type: Csw, url: "https://gis.uba.de/smartfinder-csw/api", filter: None, source_url: Some("https://gis.uba.de/smartfinder-client/?lang=de#/datasets/iso/{{id}}"), concurrency: 1, batch_size: 10 }}: umwelt_info::harvester::csw: Harvesting 180 datasets
2022-08-12T16:33:30.706461Z  INFO harvest{source=Source { name: "stadt-leipzig", type: Ckan, url: "https://opendata.leipzig.de/", filter: None, source_url: Some("https://opendata.leipzig.de/dataset/{{name}}"), concurrency: 3, batch_size: 100 }}: umwelt_info::harvester::ckan: Harvesting 723 datasets
2022-08-12T16:33:30.709723Z  INFO harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Retrieved 713 documents
2022-08-12T16:33:30.713670Z  INFO harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}: umwelt_info::harvester::doris_bfs: Harvesting 507 datasets
2022-08-12T16:33:30.716274Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4397 has no valid entry for 'NAME'
2022-08-12T16:33:30.716831Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4467 has no valid entry for 'NAME'
2022-08-12T16:33:30.718004Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4651 has no valid entry for 'NAME'
2022-08-12T16:33:30.718910Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: harvester: Failed to harvest 3 out of 713 datasets (713 were transmitted)
2022-08-12T16:33:30.733913Z  INFO harvest{source=Source { name: "geodatenkatalog", type: GeoNetworkQ, url: "http://gdk.gdi-de.org/gdi-de/srv/ger/q", filter: Some("environment"), source_url: Some("http://gdk.gdi-de.org/gdi-de/srv/ger/catalog.search#/metadata/{{id}}"), concurrency: 5, batch_size: 100 }}: umwelt_info::harvester::geo_network_q: Harvesting 5315 datasets
2022-08-12T16:33:30.743831Z  INFO harvest{source=Source { name: "govdata", type: Ckan, url: "https://www.govdata.de/ckan/", filter: None, source_url: Some("https://www.govdata.de/web/guest/suchen/-/details/{{name}}"), concurrency: 5, batch_size: 1000 }}: umwelt_info::harvester::ckan: Harvesting 61432 datasets
2022-08-12T16:33:30.771718Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=60}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2021010424644
2022-08-12T16:33:30.845551Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=220}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2014111011874
2022-08-12T16:33:30.920961Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=380}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009082154
2022-08-12T16:33:30.925359Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=410}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009042313
2022-08-12T16:33:30.942785Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=410}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009011228
2022-08-12T16:33:30.964311Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=460}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-201004061230
2022-08-12T16:33:30.965476Z  WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=490}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-201006222423
2022-08-12T16:33:31.921309Z  WARN harvest{source=Source { name: "geodatenkatalog", type: GeoNetworkQ, url: "http://gdk.gdi-de.org/gdi-de/srv/ger/q", filter: Some("environment"), source_url: Some("http://gdk.gdi-de.org/gdi-de/srv/ger/catalog.search#/metadata/{{id}}"), concurrency: 5, batch_size: 100 }}:fetch_datasets{summary=false from=4801 to=4900}: umwelt_info::harvester: Overwriting duplicate dataset 094bb2e5-c6fb-451a-bcfd-b52629a7e2ff
2.96user 1.68system 0:02.42elapsed 192%CPU (0avgtext+0avgdata 213888maxresident)k
0inputs+553088outputs (0major+50395minor)pagefaults 0swaps

Notice that in the second case not only is the wall time much shorter, but the CPU utilization is also much higher, as the harvester does not have to wait for the network to respond. Having an edit-compile-harvest loop of a few seconds should also be very helpful when developing new harvesters or improving our metadata schema and mapping logic.
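
To put numbers on this, the `time` output of the two runs can be compared with a little arithmetic. The sketch below uses hypothetical helper names and takes the elapsed time of the first run as 2:50.81 = 170.81 s; it yields a wall-time speedup of roughly 70x and CPU utilizations of about 11% and 192%, close to the 10% and 192% that `time` itself reports.

```rust
/// Wall-clock speedup of the replayed run over the network run.
pub fn speedup(network_elapsed_s: f64, replay_elapsed_s: f64) -> f64 {
    network_elapsed_s / replay_elapsed_s
}

/// CPU utilization in percent: CPU time (user + system) divided by wall time.
/// Values above 100 mean that more than one core was busy on average.
pub fn cpu_utilization_percent(user_s: f64, system_s: f64, elapsed_s: f64) -> f64 {
    100.0 * (user_s + system_s) / elapsed_s
}
```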

Closes #56 (closed)
