Optimize website_uba harvester
The initial implementation has several things that can be improved:
- Never call `.unwrap()` on remotely controlled content, i.e. `rsplit`ting an href attribute should not be able to crash the whole harvester. (First sketch below.)
- Instead of many intermediate collections which are repeatedly sorted and deduplicated, collect all links into a single set and filter them before inserting instead of after. (Second sketch below.)
- Instead of repeatedly matching on the path segment, use a helper struct to collect the relevant parameters for a sub page and thereby make inconsistent parameters unrepresentable. (Third sketch below.)
- `fetch_many` can already be applied when crawling for links; it allows crawling to continue even if some intermediate steps fail and it increases concurrency (even if the harvester is configured not to use concurrency on the network level).
- ".isol" should be part of the key passed to `fetch_text`, but not of the dataset key visible to users of the API.
- Deduplication will only work reliably when the collection is already sorted. (Fourth sketch below.)
- Remove selectors which are never used.
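
First sketch: treating `rsplit` on scraped hrefs as fallible instead of calling `.unwrap()`. The function names and the assumed link structure are placeholders for illustration, not the harvester's actual markup.

```rust
/// Hypothetical helper: extract the identifier after the last '/' of an
/// href attribute scraped from a remote page.
fn id_from_href(href: &str) -> Option<&str> {
    // Treat the split as fallible: an href without the expected structure
    // yields `None` instead of panicking the whole harvester.
    href.rsplit('/').next().filter(|id| !id.is_empty())
}

/// Malformed links are skipped via `filter_map` rather than unwrapped.
fn collect_ids<'a>(hrefs: impl Iterator<Item = &'a str>) -> Vec<&'a str> {
    hrefs.filter_map(id_from_href).collect()
}
```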
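
Second sketch: collecting all links into a single `BTreeSet` and filtering before insertion; `is_relevant` and the path prefix are made-up placeholders for whatever checks the harvester actually applies.

```rust
use std::collections::BTreeSet;

/// Placeholder filter standing in for the harvester's actual checks.
fn is_relevant(link: &str) -> bool {
    link.starts_with("/daten/")
}

fn collect_links<'a>(candidates: impl Iterator<Item = &'a str>) -> BTreeSet<String> {
    let mut links = BTreeSet::new();

    for link in candidates {
        // Filter before inserting; the set handles deduplication, so no
        // intermediate collections need to be sorted and deduplicated.
        if is_relevant(link) {
            links.insert(link.to_owned());
        }
    }

    links
}
```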
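
Third sketch: a helper struct that derives all sub-page parameters from the path segment in one place. The field names and segment values are invented for illustration.

```rust
/// Hypothetical parameters of one sub page, grouped so they are always
/// determined together and inconsistent combinations cannot be built.
struct SubPage {
    /// Key under which the page's datasets are stored.
    key: &'static str,
    /// Whether the ".isol" suffix is appended to the fetch key.
    isolated: bool,
}

impl SubPage {
    /// Match on the path segment exactly once; unknown segments are
    /// rejected instead of yielding half-initialized parameters.
    fn from_segment(segment: &str) -> Option<Self> {
        // The segment names are placeholders, not the real sub pages.
        match segment {
            "luft" => Some(Self { key: "luft", isolated: false }),
            "wasser" => Some(Self { key: "wasser", isolated: true }),
            _ => None,
        }
    }
}
```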
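
Fourth sketch: `Vec::dedup` only removes consecutive duplicates, so sorting has to come first; with the set-based approach above this step disappears entirely.

```rust
fn dedup_links(mut links: Vec<String>) -> Vec<String> {
    // `dedup` only drops *consecutive* duplicates, so deduplication is
    // only reliable once the collection has been sorted.
    links.sort_unstable();
    links.dedup();
    links
}
```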
A follow-up to !449 (merged)