Filter data sources using classification based on language models
When harvesting from cross-sectoral data sources like GovData or Geodatenkatalog, we currently filter for environmental and conservation information using the built-in filters provided by the data sources themselves. Those do not necessarily match what we consider (ir)relevant datasets. A promising alternative would be to apply an ML-based classifier on the receiving end to decide whether to include a dataset, i.e. whether it concerns environmental or conservation information.
This would most likely be based on the title and description of each dataset, processed by a language model, for example one of the available pre-trained BERT-based ones. The relevant technical questions are how to integrate this into our harvester (e.g. inline using ONNX or as a separate service with an API) and how to cache the classification results. The functional questions that need solutions are how to actually define and train the classifier (e.g. by fine-tuning a pre-trained model) and how to monitor and evaluate it in operation.
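To make the caching and evaluation questions more concrete, here is a minimal sketch in Python, not a proposal for the actual harvester code. All names in it (`Scorer`, `CachedClassifier`, `keyword_scorer`, `MODEL_VERSION`, `precision_recall`) are hypothetical. The keyword scorer is only a stand-in for a real language model; the assumption is that any backend (ONNX inline or a separate service) can be wrapped as a function from title and description to a relevance score.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical interface: any backend (inline ONNX model, remote service)
# maps (title, description) to a relevance score in [0, 1].
Scorer = Callable[[str, str], float]

# Assumption: the model version is part of the cache key, so that
# retraining the classifier invalidates old cached results.
MODEL_VERSION = "keyword-stub-v0"


def keyword_scorer(title: str, description: str) -> float:
    """Stand-in for a language model: scores by counting hint words."""
    hints = ("environment", "conservation", "emission", "habitat")
    text = f"{title} {description}".lower()
    hits = sum(1 for hint in hints if hint in text)
    return min(1.0, hits / 2)


@dataclass
class CachedClassifier:
    scorer: Scorer
    threshold: float = 0.5
    cache: Dict[str, float] = field(default_factory=dict)
    misses: int = 0  # number of times the scorer was actually invoked

    def _key(self, title: str, description: str) -> str:
        # Hash model version and input text; unchanged datasets hit the cache.
        payload = "\x1f".join((MODEL_VERSION, title, description))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def is_relevant(self, title: str, description: str) -> bool:
        key = self._key(title, description)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.scorer(title, description)
        return self.cache[key] >= self.threshold


def precision_recall(
    classifier: CachedClassifier,
    labelled: Iterable[Tuple[str, str, bool]],
) -> Tuple[float, float]:
    """Compare predictions against manually labelled datasets."""
    tp = fp = fn = 0
    for title, description, expected in labelled:
        predicted = classifier.is_relevant(title, description)
        if predicted and expected:
            tp += 1
        elif predicted:
            fp += 1
        elif expected:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Keying the cache on a hash of model version, title and description means that re-harvesting an unchanged dataset costs no inference, while a changed description or a retrained model triggers reclassification. In a real deployment the in-memory dictionary would presumably be replaced by persistent storage, and the labelled sample for `precision_recall` would have to be curated by hand.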