Z39.50 harvester for DNL-online

The literature database "DNL-online" contains a large number of conservation-related publications, but its session-based website is not amenable to classic scraping techniques. However, we can also access the catalogue via Z39.50, the ASN.1-based protocol used by libraries. Hence, we need a minimal implementation that uses a subset of the Z39.50 functions to retrieve the publications listed in the catalogue and produce datasets based on their metadata, including links back into the web-based OPAC system. It is sufficient if our implementation handles only UTF-8-encoded MARC21 metadata instead of the full generality supported by Z39.50.
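To illustrate how small the MARC21 part could be, here is a minimal sketch of a decoder for the binary record layout (ISO 2709: 24-byte leader, directory of 12-byte entries, then the fields), assuming UTF-8 content only. The function name `parse_marc21` and the shape of the returned mapping are assumptions for illustration, not part of any existing code. Which field would carry the identifier needed for the OPAC link is not covered here.

```python
# Minimal ISO 2709 / MARC21 decoder for UTF-8 records (illustrative sketch).
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
SUBFIELD_DELIMITER = b"\x1f"


def parse_marc21(record: bytes) -> dict[str, list]:
    """Decode one binary MARC21 record into a mapping from tag to fields."""
    leader = record[:24]
    record_length = int(leader[0:5])   # leader bytes 0-4: total record length
    base_address = int(leader[12:17])  # leader bytes 12-16: start of field data
    if record_length != len(record) or not record.endswith(RECORD_TERMINATOR):
        raise ValueError("truncated or malformed MARC21 record")

    # The directory sits between the leader and its own field terminator and
    # consists of 12-byte entries: tag (3), field length (4), offset (5).
    directory = record[24 : base_address - 1]
    fields: dict[str, list] = {}
    for pos in range(0, len(directory), 12):
        entry = directory[pos : pos + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = base_address + int(entry[7:12])
        # The stored length includes the trailing field terminator; drop it.
        data = record[start : start + length - 1]

        if tag.startswith("00"):
            # Control fields carry plain data without indicators or subfields.
            value = data.decode("utf-8")
        else:
            # Data fields: two indicator bytes, then delimited subfields,
            # each starting with a one-character subfield code.
            value = [
                (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
                for chunk in data[2:].split(SUBFIELD_DELIMITER)[1:]
            ]
        fields.setdefault(tag, []).append(value)
    return fields
```

Control fields come back as strings and data fields as lists of `(code, value)` pairs, so a title would typically be looked up via subfield `a` of field `245`. Indicators are dropped in this sketch.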

As the library in question is a comparatively small aDIS installation, it is very important that we do not create undue load on the system during development of the harvester. For example, it would be advisable to test the harvester against other publicly accessible Z39.50 servers to iron out initial issues before testing against the actual target. Additionally, it is necessary that we cache response content in a way similar to what we already do for HTTP exchanges.
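A rough sketch of what age-based caching of responses could look like follows. It is not the existing HTTP mechanism, whose details are not described here; the function `cached_fetch`, its parameters and the on-disk layout are all hypothetical. Since Z39.50 result sets belong to a session, the cache key should presumably describe the logical request (for example the query plus a record offset) rather than the raw protocol message.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def cached_fetch(
    cache_dir: Path,
    key: str,
    fetch: Callable[[], bytes],
    max_age_seconds: float,
) -> bytes:
    """Return a stored response if it is fresh enough, else call `fetch`."""
    # Hash the logical request key to obtain a safe file name.
    path = cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()

    try:
        age = time.time() - path.stat().st_mtime
        if age <= max_age_seconds:
            # Replay the stored response without contacting the server.
            return path.read_bytes()
    except FileNotFoundError:
        pass

    # Cache miss or stale entry: query the server and store the response.
    response = fetch()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(response)
    return response
```

During development, a large `max_age_seconds` would let repeated runs replay earlier responses instead of hitting the target server again.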

A lot of resources on Z39.50, including the official specification and links to existing libraries supporting it, are available at the Library of Congress, which maintains the protocol. Apparently, open source Python libraries do exist, but we have to make sure we can integrate them in a manner that is sufficiently simple to deploy and compatible with the above performance requirements.

Acceptance criteria

Z39.50-based Harvester for DNL-online merged into main branch and deployed to md.umwelt.info.
Each publication available as a separate dataset with a source URL pointing into the web-based OPAC.
Protocol integration that supports our age-based response replay mechanism.
A measurement of the number of requests made and their duration for a single harvest of the catalogue.
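
For the measurement criterion, one simple option would be to wrap each protocol exchange in a timer and count the entries afterwards. The class `RequestStats` below is a hypothetical helper sketched for illustration, not existing harvester code.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class RequestStats:
    """Collects the wall-clock duration of every measured request."""

    durations: list[float] = field(default_factory=list)

    @contextmanager
    def measure(self):
        # Time the enclosed block and record it even if the request fails.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations.append(time.perf_counter() - start)

    @property
    def count(self) -> int:
        return len(self.durations)

    @property
    def total_seconds(self) -> float:
        return sum(self.durations)
```

Each request would then run inside `with stats.measure(): ...`, and `stats.count` and `stats.total_seconds` could be reported at the end of a harvest.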

Edited Mar 24, 2023 by Adam Reichold