
Draft: Run all harvesters on a single local set to allow keeping HTML documents alive over await points.

@OC000021106231 @OC000008373193 This is what it would take to avoid "error: future cannot be sent between threads safely": all harvesters would run on a single thread/CPU. That would solve the issues you were running into with the UIP and HLNUG harvesters, at the price of limiting the scalability of our harvester process.
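
For anybody not familiar with the error, here is a rough, std-only sketch of the situation. `Document`, `parse`, `fetch` and `harvest` are hypothetical stand-ins, not our actual harvester code, and the tiny `block_on` only plays the role of a single-threaded executor. Assuming we stay on Tokio, the real equivalent would presumably be spawning onto a `tokio::task::LocalSet` via `spawn_local`.

```rust
use std::future::Future;
use std::pin::pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in for a parsed HTML document; the `Rc` makes it `!Send`.
struct Document {
    links: Rc<Vec<String>>,
}

// Hypothetical parser: treats every whitespace-separated token as a link.
fn parse(text: &str) -> Document {
    Document {
        links: Rc::new(text.split_whitespace().map(String::from).collect()),
    }
}

// Hypothetical stand-in for an HTTP request.
async fn fetch(url: &str) -> String {
    format!("fetched:{url}")
}

// `document` is still borrowed while we await, so the returned future is `!Send`.
// Handing it to a multi-threaded spawn function with a `Send` bound is what
// produces "future cannot be sent between threads safely".
async fn harvest(text: &str) -> Vec<String> {
    let document = parse(text);
    let mut results = Vec::new();
    for link in document.links.iter() {
        results.push(fetch(link).await);
    }
    results
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Minimal executor that polls on the calling thread only, hence no `Send` bound.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}

fn main() {
    let results = block_on(harvest("a b"));
    println!("{results:?}");
}
```

The point is only that an executor which never moves a future to another thread does not need the `Send` bound, which is why the document can stay alive across the await.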

Admittedly, the limit of a single CPU is also not as bad as it might sound:

  • We currently do not actually use more than 10% of the two CPUs allotted to the harvester VM, so this would just push us up to about 20% of a single CPU for now.
  • Everything involving blocking I/O, like loading cached responses from disk and decompressing them, already runs in a separate thread pool and would still be able to use the second CPU.

One large downside I see is that this encourages a coding pattern that uses more memory than necessary, whereas it is a good thing to drop response text and parsed HTML as soon as possible before moving on to the next item. However, we could still nudge people in that direction during code reviews, and this issue would no longer block development of harvesters.
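
To make the memory-friendly pattern concrete, here is a rough, std-only sketch. All names (`Document`, `parse`, `fetch`, `harvest_scoped`, `assert_send`) are hypothetical and not taken from our code base. The parsed document lives only inside an inner block, so it is dropped before the first await and the resulting future stays `Send`.

```rust
use std::future::Future;
use std::pin::pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in for a parsed HTML document; the `Rc` makes it `!Send`.
struct Document {
    links: Rc<Vec<String>>,
}

// Hypothetical parser: treats every whitespace-separated token as a link.
fn parse(text: &str) -> Document {
    Document {
        links: Rc::new(text.split_whitespace().map(String::from).collect()),
    }
}

// Hypothetical stand-in for an HTTP request.
async fn fetch(url: &str) -> String {
    format!("fetched:{url}")
}

// Copy the needed data out of the document inside a block, so the document
// (and its memory) is gone before the first `.await`.
async fn harvest_scoped(text: &str) -> Vec<String> {
    let links: Vec<String> = {
        let document = parse(text);
        document.links.to_vec()
    };
    let mut results = Vec::new();
    for link in &links {
        results.push(fetch(link).await);
    }
    results
}

// Compile-time check only: accepts nothing but `Send` values.
fn assert_send<T: Send>(value: T) -> T {
    value
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Minimal executor that polls the future on the calling thread.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}

fn main() {
    let future = assert_send(harvest_scoped("a b"));
    println!("{:?}", block_on(future));
}
```

Written this way, a harvester works on either kind of executor, so nudging towards this shape in reviews would cost nothing even if everything runs on a single thread.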

So what do you (or anybody else) think?

To be resolved before this is more than a quick draft:

  • Revert the harvester changes here as they are only illustrative but actually increase memory usage without a good reason.
  • Update the HOWTO document to remove the whole section on non-send futures.
  • Give this a try for an overnight harvester run to see if there actually is a performance degradation.
Edited by Falk Heße
