This incident was triggered by the following combination of factors:
- A bulk data change (updates and deletes) produced many dead tuples in the target table.
  - Crucially, these dead tuples were concentrated at one end of the range of values for an indexed column (`job_id`).
  - As a result, any index scan that has to traverse that range of values incurs far more overhead than usual, due to the dense accumulation of dead tuples.
  - Running `VACUUM` on the table cleans up these dead tuples, but they accumulated far faster than the pacing of vacuum maintenance could keep up with (see the monitoring sketch after this list).
- Between vacuums, the primary db has a mechanism that lets it avoid most of the overhead caused by those dead tuples, but the replica dbs cannot use that mechanism. This is why the regression only affected the replicas, not the primary.
  - On the replica dbs, each index scan has to independently repeat the costly work of searching past those dead tuples to find live ones.
  - The primary db avoids repeating that overhead because the first index scan to visit a dead tuple can leave a hint (a "killed tuple" flag) telling future index scans to skip it, but the standby dbs cannot use this mechanism (see the `pageinspect` sketch after this list).
- A frequently run index scan does need to traverse that polluted range of values for that particular column.
  - Any index scan that hits that cluster of roughly 500K dead tuples is slowed by the overhead of serially performing row-visibility checks on them.
  - Many concurrent db sessions performing that same task amplify the overhead, because they contend for per-buffer locks, spending a lot of time in spinlock contention on this serialization point.
- The planner itself frequently runs an index scan that ended up hitting many of those dead tuples.
  - This is not obvious; without the full set of debug symbols, we would not have been able to uncover this key fact.
  - The planner runs this index scan via a special-case code path (`get_actual_variable_endpoint`), where it augments its statistics about the current distribution of values in a column so it can give a better estimate of the selectivity of one of the filter expressions in the query it is planning (see the `EXPLAIN` sketch after this list).
  - Under normal conditions, this code path runs at roughly 10K calls/second on each replica db, and it is normally very cheap (0.01 - 0.06 ms per call).
  - However, due to the overhead of checking and discarding those dead tuples, it became far more expensive.
  - This particular index scan always seeks the min or max value of the column; because our dead tuples accumulated at one of those extremes, the scan had to traverse all of those dead tuples before finding a live tuple.
- The regression escalated quickly in severity because all of the dead tuples were accumulating on the lowest-valued `job_id`s. The overall rate of dead tuple accumulation was only moderately higher than normal, but because the dead tuples were concentrated in the same index range, the overhead of scanning through them was dramatically higher than usual.
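To make the first point observable: dead tuple accumulation versus vacuum pacing for the table behind `index_ci_job_artifacts_on_job_id_and_file_type` can be watched from `pg_stat_user_tables`. A minimal monitoring sketch, assuming the table is `ci_job_artifacts` (this is not taken from the incident tooling):

```sql
-- Dead vs. live tuples and recent (auto)vacuum activity for the affected table.
-- A steadily climbing n_dead_tup between vacuums is the pattern described above.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       autovacuum_count
FROM   pg_stat_user_tables
WHERE  relname = 'ci_job_artifacts';
```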
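The hint mentioned in the second point is the btree "killed tuple" flag: the first scan that finds an index entry pointing only at dead heap rows marks that entry, so later scans can skip it without repeating the visibility check, and hot standbys cannot set these flags. A hedged sketch of how to look at those flags with the `pageinspect` extension (superuser required; the block number is just an example leaf page, and the `dead` column is only present in recent PostgreSQL releases):

```sql
-- Inspect the "dead" flag on the items of one btree leaf page of the index.
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT itemoffset, ctid, dead
FROM   bt_page_items('index_ci_job_artifacts_on_job_id_and_file_type', 1)
LIMIT  20;
```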
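To make the planner-side scan concrete: `get_actual_variable_endpoint` is reached while *planning* a query, when the planner wants the real current minimum or maximum of a column to sharpen its selectivity estimate for a range-style filter near the edge of its statistics. A hedged sketch of the kind of statement whose planning can trigger that probe (the constant is illustrative, not from the incident):

```sql
-- Planning (not executing) this query can make the planner probe
-- index_ci_job_artifacts_on_job_id_and_file_type for the actual lowest job_id.
-- With ~500K dead tuples piled up at that end of the index, the probe has to
-- step over all of them before reaching the first live entry, and on a replica
-- it pays that cost again on every planning cycle.
EXPLAIN
SELECT count(*)
FROM   ci_job_artifacts
WHERE  job_id < 1000;   -- range filter near the low endpoint of job_id
```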
From that last point, we may be able to avoid the regression by adjusting the order of our data changes:
- Originally the sidekiq job performing these updates was implicitly choosing which rows to update next by walking the very same index that featured in this regression: `index_ci_job_artifacts_on_job_id_and_file_type`.
- That guaranteed that the dead tuples would all accumulate at one end of the index, and since the data change was updating virtually every row it visited, every update it had done was adding to the overhead of the index scans run by other clients on the replica dbs.
- The sidekiq job does not necessarily need to perform its data changes in that order; by changing the order so that it is not correlated with the `job_id` values, we can prevent all of the dead tuples from accumulating in a clump at one end of the index (see the sketch below).
- Spreading the dead tuples across a wider range of values for the `job_id` column makes the index pollution much less localized.
- The frequent index scans on the replica dbs only visit the lowest- or highest-valued live tuples, so they should incur far less overhead.
- Periodic vacuums still have to clean up those dead tuples, but spreading them more uniformly across the index makes it much more likely that the vacuum cadence can run often enough to avoid a dense clump of dead tuples at any particular location in the index -- including at the start and end of the index, where they hurt our known case of frequent index scans.
This is essentially the idea outlined in #5952 (comment 749917689)
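A purely illustrative sketch of that ordering change, in terms of how each batch of rows could be selected. The `expire_at` filter, the batch size, and the use of `hashint8` are stand-ins for "any ordering not correlated with job_id", not the actual sidekiq code:

```sql
-- Before: the next batch of rows to update is chosen by walking the job_id
-- index, so each batch's dead tuples land at the low end of
-- index_ci_job_artifacts_on_job_id_and_file_type.
SELECT id
FROM   ci_job_artifacts
WHERE  expire_at < now()          -- hypothetical filter for the bulk update
ORDER  BY job_id
LIMIT  1000;

-- After: the next batch is chosen in an order that is not correlated with
-- job_id (hashing the primary key here is just one way to decorrelate), so the
-- resulting dead tuples are scattered across the whole index instead of piling
-- up at one endpoint.
SELECT id
FROM   ci_job_artifacts
WHERE  expire_at < now()
ORDER  BY hashint8(id)
LIMIT  1000;
```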
See also https://gitlab.com/gitlab-org/gitlab/-/issues/347369