etcd defrag + backup: Avoid too many leader changes
Created by: garloff
As a k8s cluster user, I want the k8s control plane to always be responsive, stable and safe.
We have a nightly job that defragments etcd and backs it up on all control plane nodes; the start time is randomized slightly so the defragmentation does not happen on all nodes at the same time.
This job has existed for many months already, but due to a missing `--now` in `systemctl enable`, it has never actually been active.
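For illustration, the timer-based randomization and the `--now` pitfall can be sketched with a minimal systemd timer unit; the unit name, schedule, and delay here are assumptions, not the actual SCS units:

```ini
# Hypothetical /etc/systemd/system/etcd-defrag.timer (names/values assumed)
[Unit]
Description=Nightly etcd defragmentation and backup

[Timer]
OnCalendar=*-*-* 02:30:00
# Spread actual start times across control plane nodes so they do not
# all defragment simultaneously:
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```

Note that `systemctl enable etcd-defrag.timer` only creates the `WantedBy` symlink for the next boot; without the `--now` flag the timer is not started in the running system, which is exactly the bug described above.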
As @matofeder points out, the defragmentation may block access to etcd for a while (seconds on typically sized etcd DBs), causing etcd leader changes (on multi-node etcd clusters) or temporary kube-api failures (on single-node etcd clusters).
Things to consider:
- Stronger protection against concurrent defragmentation on multiple etcd nodes, by scheduling the defragmentation of all nodes from the leader instead of relying on the configured randomness in the timer start times.
- Scheduling the leader's etcd defragmentation last, as defragmenting the leader will likely cause a leader change and we want to minimize these. (Starting with the leader would cause several leader changes: leadership may move to a follower that is itself defragmented next, triggering yet another election.)
- Skipping the leader's defragmentation (for up to a week, or indefinitely?) to cause fewer leader changes?
- Skipping defragmentation on single-node etcd installations, where blocking the only member directly causes temporary kube-apiserver failures?
- Leaving this disabled for R4 and doing more real-world tests before R5. (This is not without risk either; we have already seen heavily fragmented etcd instances cause trouble in real life.)
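The leader-last ordering from the considerations above could be sketched as a small shell helper; this is a sketch under assumptions, not the actual SCS implementation, and the endpoint discovery and auth flags are left out:

```shell
#!/bin/sh
# Sketch: defragment etcd members one at a time, with the leader moved to
# the end of the list, to minimize leader changes.

# order_leader_last: reads lines of "ENDPOINT IS_LEADER" (IS_LEADER being
# true/false, e.g. derived from `etcdctl endpoint status --cluster` output)
# and prints the endpoints with the leader last.
order_leader_last() {
    awk '$2 == "true" { leader = $1; next }
         { print $1 }
         END { if (leader != "") print leader }'
}

# Hypothetical driver loop, commented out because it needs a live cluster
# and proper TLS flags:
# for ep in $(member_status_lines | order_leader_last); do
#     etcdctl --endpoints="$ep" defrag
#     sleep 30   # let the cluster settle between members
# done
```

Defragmenting sequentially (one endpoint per `etcdctl defrag` call, with a pause in between) also addresses the concurrent-defragmentation concern without relying on randomized timer offsets.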