Recently I had a task that looked simple on paper: resize the root partitions on several Kubernetes nodes running a Rook-Ceph cluster.
The resize itself was not the most interesting part. The interesting part started after new OSDs appeared in the cluster and Ceph began rebalancing data. The cluster looked healthy, but rebalancing had almost stopped making progress: for more than 8 hours the number of misplaced objects stayed practically unchanged, while most of the remapped PGs sat in backfill_wait.