<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ceph on laslopaul</title><link>https://laslopaul.dev/tags/ceph/</link><description>Recent content in Ceph on laslopaul</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Sun, 17 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://laslopaul.dev/tags/ceph/index.xml" rel="self" type="application/rss+xml"/><item><title>When Ceph Backfill Was Not Actually Stuck</title><link>https://laslopaul.dev/when-ceph-backfill-was-not-actually-stuck/</link><pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate><guid>https://laslopaul.dev/when-ceph-backfill-was-not-actually-stuck/</guid><description>&lt;p&gt;Recently I had a task that looked simple on paper: resize the root partitions on several Kubernetes nodes running a Rook-Ceph cluster.&lt;/p&gt;
&lt;p&gt;The resize itself was not the most interesting part. The interesting part started after new OSDs appeared in the cluster and Ceph began rebalancing data. At that point the cluster looked healthy, but the rebalancing process almost stopped making progress. For more than 8 hours the number of misplaced objects stayed at practically the same figure, while most of the remapped PGs were sitting in &lt;code&gt;backfill_wait&lt;/code&gt;.&lt;/p&gt;</description><content>&lt;p&gt;Recently I had a task that looked simple on paper: resize the root partitions on several Kubernetes nodes running a Rook-Ceph cluster.&lt;/p&gt;
&lt;p&gt;The resize itself was not the most interesting part. The interesting part started after new OSDs appeared in the cluster and Ceph began rebalancing data. At that point the cluster looked healthy, but the rebalancing process almost stopped making progress. For more than 8 hours the number of misplaced objects stayed at practically the same figure, while most of the remapped PGs were sitting in &lt;code&gt;backfill_wait&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It was not an obvious failure, and it was not a classic “Ceph is down” situation. The cluster was alive and the data was safe, but the background recovery work was moving so slowly that it looked stuck.&lt;/p&gt;
&lt;p&gt;At first glance everything looked fine:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph -s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The cluster was reporting &lt;code&gt;HEALTH_OK&lt;/code&gt;. But the PG state told a slightly different story:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;24 remapped pgs
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;705 active+clean
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;23 active+remapped+backfill_wait
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;1 active+remapped+backfilling
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Out of 24 remapped PGs, only one was actually backfilling. The rest were sitting in &lt;code&gt;backfill_wait&lt;/code&gt;. Nothing was broken, but the cluster was moving at a painfully slow pace. The number of misplaced objects stayed almost unchanged for more than 8 hours, so it stopped looking like normal slow housekeeping and started looking like a real throttling issue.&lt;/p&gt;
&lt;h2 id="understanding-what-backfill_wait-really-meant"&gt;Understanding what &lt;code&gt;backfill_wait&lt;/code&gt; really meant&lt;/h2&gt;
&lt;p&gt;At the beginning I treated &lt;code&gt;backfill_wait&lt;/code&gt; as something vague: Ceph wants to backfill, but it is waiting for something.&lt;/p&gt;
&lt;p&gt;In practice, it was more specific than that. A PG in &lt;code&gt;backfill_wait&lt;/code&gt; is queued for backfill, but it does not currently have the required reservation from the involved OSDs. In other words, the PG is ready to move, but some OSD is saying: not now, I do not have a free slot for this work.&lt;/p&gt;
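&lt;p&gt;A minimal sketch of how the waiting PGs and the reservation state could be inspected. &lt;code&gt;osd.N&lt;/code&gt; is a placeholder for a real OSD ID, and the exact output layout may differ between Ceph releases:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# List the PGs that are still queued for backfill.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph pg dump pgs_brief &lt;span class="p"&gt;|&lt;/span&gt; grep backfill_wait
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Show the reservations currently tracked by one OSD (replace N with an OSD ID).&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph tell osd.N dump_recovery_reservations
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The UP and ACTING columns of the first command show which OSDs each waiting PG involves, which helps narrow down where the reservation is being refused.&lt;/p&gt;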
&lt;p&gt;The question became: why was Ceph allowing only one backfill at a time?&lt;/p&gt;
&lt;h2 id="the-misleading-osd-benchmark"&gt;The misleading OSD benchmark&lt;/h2&gt;
&lt;p&gt;The first thing that stood out was one newly added OSD: &lt;code&gt;osd.8&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;When the mClock scheduler is in use, an OSD benchmarks its underlying device at startup and stores the measured IOPS capacity in the config database as &lt;code&gt;osd_mclock_max_capacity_iops_ssd&lt;/code&gt; (or the &lt;code&gt;_hdd&lt;/code&gt; variant). mClock later uses this value to decide how much background work the OSD can handle.&lt;/p&gt;
&lt;p&gt;On this cluster, &lt;code&gt;osd.8&lt;/code&gt; had a very strange value:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config dump &lt;span class="p"&gt;|&lt;/span&gt; grep &lt;span class="s2"&gt;&amp;#34;osd.8&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; grep osd_mclock
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;osd.8 osd_mclock_max_capacity_iops_ssd 349.717058
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The other SSD-backed OSDs had values in a completely different range. Some were around 40,000–70,000 IOPS.&lt;/p&gt;
&lt;p&gt;So Ceph believed that this new OSD was dramatically slower than its neighbors. It was the same type of disk, but from Ceph&amp;rsquo;s point of view it looked like a weak device that should not be given much background work.&lt;/p&gt;
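&lt;p&gt;One way to sanity-check such a value, sketched here on the assumption that the stored number came from the automatic benchmark, is to list the stored capacities for all OSDs and rerun the benchmark by hand. The bench arguments follow the manual benchmarking steps in the Ceph mClock documentation (total bytes, bytes per write, object size, number of objects); treat them as a starting point, not a tuned recipe:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Compare the stored IOPS capacity of every OSD.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config dump &lt;span class="p"&gt;|&lt;/span&gt; grep osd_mclock_max_capacity_iops
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Drop the OSD caches, then run a write benchmark with 4 KiB blocks on the suspicious OSD.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph tell osd.8 cache drop
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph tell osd.8 bench &lt;span class="m"&gt;12288000&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt; &lt;span class="m"&gt;4194304&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;iops&lt;/code&gt; field in the bench result is the number to compare against the stored capacity. Note that the benchmark writes to the OSD, so it adds some load while it runs.&lt;/p&gt;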
&lt;p&gt;The fix was to overwrite the bogus value with a more realistic one:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config &lt;span class="nb"&gt;set&lt;/span&gt; osd.8 osd_mclock_max_capacity_iops_ssd &lt;span class="m"&gt;40000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="mclock-ignoring-the-usual-knobs"&gt;mClock ignoring the usual knobs&lt;/h2&gt;
&lt;p&gt;After that, I expected the usual recovery settings to help:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Allow each OSD to run up to 8 backfill operations at the same time.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# This controls how many PGs can be backfilled concurrently per OSD.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config &lt;span class="nb"&gt;set&lt;/span&gt; osd osd_max_backfills &lt;span class="m"&gt;8&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Allow each OSD to run up to 4 active recovery operations at the same time.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# This affects recovery concurrency, for example when replicas need to be rebuilt.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config &lt;span class="nb"&gt;set&lt;/span&gt; osd osd_recovery_max_active &lt;span class="m"&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The settings were visible in the config dump, so they were definitely stored.&lt;/p&gt;
&lt;p&gt;But the cluster still behaved almost the same way: many PGs in &lt;code&gt;backfill_wait&lt;/code&gt;, only one actively backfilling.&lt;/p&gt;
&lt;p&gt;This is where another Ceph lesson arrived.&lt;/p&gt;
&lt;p&gt;With the mClock scheduler enabled, the traditional recovery and backfill knobs are not necessarily honored. Settings like &lt;code&gt;osd_max_backfills&lt;/code&gt; and &lt;code&gt;osd_recovery_max_active&lt;/code&gt; may be present in the config database, but mClock can still control the effective limits using its own scheduling logic.&lt;/p&gt;
&lt;p&gt;To make mClock honor those settings, this flag has to be enabled:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config &lt;span class="nb"&gt;set&lt;/span&gt; osd osd_mclock_override_recovery_settings &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Without that, the settings looked correct but had no real effect.&lt;/p&gt;
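&lt;p&gt;To confirm that mClock is the active scheduler and to see how it is configured, a quick check could look like this (a sketch; the defaults vary between Ceph releases):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Which op scheduler the OSDs are configured to use.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config get osd osd_op_queue
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Which mClock profile is in effect.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config get osd osd_mclock_profile
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Whether the recovery and backfill knobs are allowed to override mClock.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config get osd osd_mclock_override_recovery_settings
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;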
&lt;h2 id="persistent-config-is-not-always-immediate-config"&gt;Persistent config is not always immediate config&lt;/h2&gt;
&lt;p&gt;Even after enabling the override flag and setting higher values, the running OSDs did not immediately behave differently.&lt;/p&gt;
&lt;p&gt;The persistent configuration was changed, but the live daemons still needed a push. In this case, the useful command was:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph tell &lt;span class="s1"&gt;&amp;#39;osd.*&amp;#39;&lt;/span&gt; injectargs --osd-max-backfills&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt; --osd-recovery-max-active&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This applies the arguments to the running OSD daemons without restarting them.&lt;/p&gt;
&lt;p&gt;So the full picture was actually two-layered:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Persistent configuration&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config &lt;span class="nb"&gt;set&lt;/span&gt; osd osd_mclock_override_recovery_settings &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config &lt;span class="nb"&gt;set&lt;/span&gt; osd osd_max_backfills &lt;span class="m"&gt;8&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config &lt;span class="nb"&gt;set&lt;/span&gt; osd osd_recovery_max_active &lt;span class="m"&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Immediate runtime effect&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph tell &lt;span class="s1"&gt;&amp;#39;osd.*&amp;#39;&lt;/span&gt; injectargs --osd-max-backfills&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt; --osd-recovery-max-active&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The persistent config survives restarts. The injected arguments make the current daemons pick up the values immediately.&lt;/p&gt;
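&lt;p&gt;A sketch of how the two layers could be compared for a single daemon. &lt;code&gt;ceph config get&lt;/code&gt; reads the config database, while &lt;code&gt;ceph config show&lt;/code&gt; reports what the running daemon says it is using; &lt;code&gt;osd.8&lt;/code&gt; is only the example OSD from earlier in this post:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Value stored in the config database.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config get osd osd_max_backfills
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Value reported by the running daemon.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config show osd.8 osd_max_backfills
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the two values differ, the running daemon has not picked up the change yet.&lt;/p&gt;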
&lt;p&gt;After correcting the mClock behavior and injecting the runtime arguments, the cluster PG stats changed to something much more reasonable:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;12 backfilling
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;12 backfill_wait
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;At that point, misplaced objects started dropping visibly and the cluster finally looked like it was making real progress.&lt;/p&gt;
&lt;h2 id="what-i-took-away-from-this"&gt;What I took away from this&lt;/h2&gt;
&lt;p&gt;This was a good Ceph troubleshooting experience because the cluster was not actually broken. It was healthy, but the backfill process had effectively been stuck for more than 8 hours, and the reason was not obvious from the high-level status.&lt;/p&gt;
&lt;p&gt;A few things I learned:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;HEALTH_OK&lt;/code&gt; means the data is safe, not that all background work is finished. A Ceph cluster can be healthy and still spend a lot of time rebalancing or backfilling.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;backfill_wait&lt;/code&gt; usually means there is a reservation or throttling bottleneck. If many PGs are waiting and only one is backfilling, it is worth checking the effective backfill limits instead of assuming the cluster is stuck.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mClock changes the troubleshooting model. The old knobs are still there, but they may not do what you expect unless &lt;code&gt;osd_mclock_override_recovery_settings&lt;/code&gt; is enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The automatic OSD capacity benchmark matters. If a newly added OSD gets a bad benchmark result, Ceph may treat it as a much slower disk and schedule recovery work very conservatively.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There is a difference between changing the config database and changing the behavior of running daemons. Sometimes &lt;code&gt;ceph config set&lt;/code&gt; is not enough for the current situation, and &lt;code&gt;ceph tell ... injectargs&lt;/code&gt; is needed to apply the change immediately.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="commands-i-would-check-next-time"&gt;Commands I would check next time&lt;/h2&gt;
&lt;p&gt;These are the commands I would keep close if I had to troubleshoot a similar situation again:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph -s
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph health detail
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph pg dump_stuck
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph osd df tree
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config dump &lt;span class="p"&gt;|&lt;/span&gt; grep osd_mclock
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph tell osd.N dump_recovery_reservations
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config get osd osd_max_backfills
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config get osd osd_recovery_max_active
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph config get osd osd_mclock_override_recovery_settings
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;And if I need the runtime values to take effect immediately:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ceph tell &lt;span class="s1"&gt;&amp;#39;osd.*&amp;#39;&lt;/span&gt; injectargs --osd-max-backfills&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt; --osd-recovery-max-active&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content></item></channel></rss>