This article was written in collaboration with Bohan Zhang and originally appeared on the OtterTune website.
There are a lot of choices in databases (897 as of April 2023). With so many systems, it’s hard to know what to pick! But there is an interesting phenomenon where the Internet collectively decides on the default choice for new applications. In the 2000s, the conventional wisdom selected MySQL because rising tech stars like Google and Facebook were using it. Then in the 2010s, it was MongoDB because non-durable writes made it “webscale”. In the last five years, PostgreSQL has become the Internet’s darling DBMS. And for good reasons! It’s dependable, feature-rich, extensible, and well-suited for most operational workloads.
But as much as we love PostgreSQL at OtterTune, certain aspects of it are not great. So instead of writing yet another blog article like everyone else touting the awesomeness of everyone’s favorite elephant-themed DBMS, we want to discuss the one major thing that sucks: how PostgreSQL implements multi-version concurrency control (MVCC). Our research at Carnegie Mellon University and experience optimizing PostgreSQL database instances on Amazon RDS have shown that its MVCC implementation is the worst among widely used relational DBMSs, including MySQL, Oracle, and Microsoft SQL Server. And yes, Amazon’s PostgreSQL Aurora still has these problems.
In this article, we’ll dive into MVCC: what it is, how PostgreSQL does it, and why it is terrible. Our goal at OtterTune is to give you fewer things to worry about with your databases, so we’ve thought a lot about dealing with this problem. We’ll cover OtterTune’s solution for managing PostgreSQL’s MVCC issues automatically for RDS and Aurora databases in a follow-up article next week.
What is Multi-Version Concurrency Control?
The goal of MVCC in a DBMS is to allow multiple queries to read and write to the database simultaneously without interfering with each other when possible. The basic idea of MVCC is that the DBMS never overwrites existing rows. Instead, for each (logical) row, the DBMS maintains multiple (physical) versions. When the application executes a query, the DBMS determines which version to retrieve to satisfy the request according to some version ordering (e.g., creation timestamp). The benefit of this approach is that multiple queries can read older versions of rows without getting blocked by another query updating it. Queries observe a snapshot of the database as it existed when the DBMS started that query’s transaction (snapshot isolation). This approach eliminates the need for explicit record locks that block readers from accessing data while writers modify the same item.
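As a concrete illustration of that snapshot behavior, here is a minimal sketch using two concurrent psql sessions and a hypothetical counters table (the table and values are ours, purely for illustration); under PostgreSQL’s REPEATABLE READ isolation level, the reading transaction keeps seeing the snapshot established by its first statement:

-- Session A: open a transaction; its snapshot is fixed by the first query it runs.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT val FROM counters WHERE id = 1;   -- suppose this returns 10

-- Session B: concurrently update the same row and commit. PostgreSQL creates a
-- new physical version of the row instead of overwriting the old one.
UPDATE counters SET val = 20 WHERE id = 1;

-- Session A: still sees the old version, without blocking on Session B's write.
SELECT val FROM counters WHERE id = 1;   -- still returns 10
COMMIT;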
David Reed’s 1978 MIT Ph.D. dissertation, “Concurrency Control in Distributed Database Systems,” was, we believe, the first publication to describe MVCC. The first commercial DBMS implementation of MVCC was InterBase in the 1980s. Since then, nearly every new DBMS created in the last two decades that supports transactions implements MVCC.
A systems engineer has to make several design decisions when building a DBMS that supports MVCC. At a high level, it comes down to the following:
- How to store updates to existing rows.
- How to find the correct version of a row for a query at runtime.
- How to remove expired versions that are no longer visible.
These decisions are not mutually exclusive. In the case of PostgreSQL, it’s how they decided to handle the first question in the 1980s that caused problems with the other two that we still have to deal with today.
For our discussion, we will use the following example of a table containing movie information. Each row in the table includes the movie name, release year, director, and a unique ID serving as the primary key, with secondary indexes on the movie name and director. Here is the DDL command to create this table:
CREATE TABLE movies (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
name VARCHAR(256) NOT NULL,
year SMALLINT NOT NULL,
director VARCHAR(128)
);
CREATE INDEX idx_name ON movies (name);
CREATE INDEX idx_director ON movies (director);
The table contains a primary index (movies_pkey) and two secondary B+Tree indexes (idx_name, idx_director).
PostgreSQL’s Multi-Version Concurrency Control
As discussed in Stonebraker’s system design document from 1987, PostgreSQL was designed from the beginning to support multi-versioning. The core idea of PostgreSQL’s MVCC scheme is seemingly straightforward: when a query updates an existing row in a table, the DBMS makes a copy of that row and applies the changes to this new version instead of overwriting the original row. We refer to this approach as the append-only version storage scheme. But as we now describe, this approach has several non-trivial implications in the rest of the system.
Multi-Versioned Storage
PostgreSQL stores all row versions in a table in the same storage space. To update an existing tuple, the DBMS first acquires an empty slot from the table for the new row version. It then copies the row content of the current version to the new version, and applies the modifications to the row in the newly allocated version slot. You can see this process in the example below when an application executes an update query on the movies database to change the release year of “Shaolin and Wu Tang” from 1985 to 1983:
When an UPDATE query changes a tuple in the table, PostgreSQL copies the original version of the tuple and then applies the change to the new version. In this example, there is no more space in Table Page #1, so PostgreSQL creates the new version in Table Page #2.
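In SQL, the update shown in that figure corresponds to something like the following (the exact WHERE clause is our guess at how the row would be identified):

-- Change the release year of "Shaolin and Wu Tang" from 1985 to 1983.
-- PostgreSQL copies the whole row into a free slot (here on Table Page #2)
-- and applies the change to the copy; the 1985 version stays on disk.
UPDATE movies
   SET year = 1983
 WHERE name = 'Shaolin and Wu Tang';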
Now with two physical versions representing the same logical row, the DBMS needs to record the lineage of these versions so that it knows how to find them in the future. MVCC DBMSs achieve this by creating a version chain via a singly linked list. The version chain only goes in one direction to reduce storage and maintenance overhead. This means that the DBMS has to decide what order to use: newest-to-oldest (N2O) order or oldest-to-newest (O2N). For the N2O order, each tuple version points to its previous version and the version chain’s head is always the latest version. For the O2N order, each tuple version points to its new version, and the head is the oldest tuple version. The O2N approach avoids the need for the DBMS to update indexes to point to a newer version of the tuple each time it’s modified. However, it may take longer for the DBMS to find the latest version during query processing, potentially traversing a long version chain. Most DBMSs, including Oracle and MySQL, implement N2O. But PostgreSQL stands alone in using O2N (except for Microsoft’s In-Memory OLTP engine for SQL Server).
The next issue is how PostgreSQL determines what to record for these version pointers. The header for each row in PostgreSQL contains a tuple id field (t_ctid) holding the tuple id of the next version (or its own tuple id if it is the latest version). Thus, as shown in this next example, when a query requests the latest version of a row, the DBMS traverses the index, lands on the oldest version, and then follows the version chain until it finds the version it needs.
The SELECT query traverses the index to find the tuple with the requested movie name. The index entry points to the oldest version of the tuple, which means PostgreSQL follows the version chain embedded in the original version to find the new version.
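You can observe these version pointers yourself through PostgreSQL’s system columns; a small sketch (the values in the comments are illustrative, not from the article):

-- ctid is the physical location (page number, offset) of the row version;
-- xmin and xmax are the transaction IDs that created and invalidated it.
SELECT ctid, xmin, xmax, name, year
  FROM movies
 WHERE name = 'Shaolin and Wu Tang';
-- Re-running this before and after an UPDATE might show the ctid change from
-- (1,5) to (2,1), i.e., the new version was written to a different page.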
PostgreSQL developers realized early on that there are two problems with its MVCC scheme. First, making a new copy of an entire tuple every time it is updated is expensive. And second, traversing the entire version chain just to find the latest version (which is what most queries want) is wasteful. Of course there is also the problem of cleaning up old versions, but we’ll cover that below.
To avoid traversing the entire version chain, PostgreSQL adds an entry to a table’s indexes for each physical version of a row. That means if there are five physical versions of a logical row, there will be (at most) five entries for that tuple in the index! In the example below, we see that the idx_name index contains entries for each of the “Shaolin and Wu Tang” rows that are on separate pages. This enables direct access to the latest version of the tuple, without the need to traverse the long version chain.
In this example, the index contains multiple entries for the “Shaolin and Wu Tang” tuple (one for each version). Now PostgreSQL uses the index to find the latest version and then immediately retrieves it from Table Page #2 without having to traverse the version chain starting at Table Page #1.
PostgreSQL tries to avoid having to install multiple index entries and storing related versions over multiple pages by creating a new copy in the same disk page (block) as the old version to reduce disk I/O. This optimization is known as heap-only tuple (HOT) updates. The DBMS uses the HOT approach if an update does not modify any columns referenced by a table’s indexes and the new version is stored on the same data page as the old version (if there is space in that page). Now in our example, after the update the index still points to the old version and queries retrieve the latest version by traversing the version chain. During normal operation, PostgreSQL further optimizes this process by removing old versions to prune the version chain.
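With our movies table, whether an update can take the HOT path depends on which columns it touches; a sketch, assuming the row has id = 1 and its page has free space:

-- "year" is not covered by any index, so this update is HOT-eligible:
-- no new index entries are needed if the new version fits on the same page.
UPDATE movies SET year = 1978 WHERE id = 1;

-- "name" is covered by idx_name, so this update cannot be HOT: PostgreSQL
-- inserts new entries into movies_pkey, idx_name, and idx_director.
UPDATE movies SET name = 'Shaolin & Wu Tang' WHERE id = 1;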
Version Vacuum
We’ve established that PostgreSQL makes a copy of rows whenever an application updates them. The next question is how the system removes older versions (called “dead tuples”). The original version of PostgreSQL from the 1980s did not remove dead tuples. The idea was that keeping all the older versions allowed applications to execute “time-travel” queries to examine the database at a particular point in time (e.g., run a SELECT query on the state of the database as it existed at the end of last week). But never removing dead tuples means tables never shrink in size if the application deletes tuples. It also means long version chains for frequently updated tuples, which would slow down queries, except that PostgreSQL adds index entries that allow queries to quickly jump to the correct version instead of traversing the chain. But now, this means the indexes are larger, making them slower and adding additional memory pressure. Hopefully, you can understand now why all these issues are interconnected.
To overcome these problems, PostgreSQL uses a vacuum procedure to clean up dead tuples from tables. The vacuum performs a sequential scan on table pages modified since its last run and finds expired versions. The DBMS considers a version “expired” if it is not visible to any active transaction. This means no current transaction is accessing that version, and future transactions will use the latest “live” version instead. Thus, removing the expired version and reclaiming the space for reuse is safe.
PostgreSQL automatically executes this vacuum procedure (autovacuum) at regular intervals based on its configuration settings. In addition to the global settings that affect the vacuum frequency for all tables, PostgreSQL provides the flexibility to configure autovacuum at the table level to fine-tune the process for specific tables. Users can also trigger the vacuum manually to optimize database performance via the VACUUM SQL command.
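For example, a manual pass over our movies table could look like this (VERBOSE and ANALYZE are optional flags of the VACUUM command):

-- Remove dead tuples from one table; VERBOSE reports what was cleaned up,
-- and ANALYZE refreshes the optimizer statistics in the same pass.
VACUUM (VERBOSE, ANALYZE) movies;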
Why PostgreSQL’s MVCC is the Worst
We will be blunt: if someone is going to build a new MVCC DBMS today, they should not do it the way PostgreSQL does (e.g., append-only storage with autovacuum). In our 2018 VLDB paper (aka “the best paper ever on MVCC”), we did not find another DBMS doing MVCC the way PostgreSQL does it. Its design is a relic of the 1980s, from before the proliferation of log-structured system patterns in the 1990s.
Let’s talk about four problems that arise with PostgreSQL’s MVCC. We will also talk about why other MVCC DBMSs like Oracle and MySQL avoid these problems.
Problem #1: Version Copying
With the append-only storage scheme in MVCC, if a query updates a tuple, the DBMS copies all its columns into the new version. This copying occurs whether the query updates a single column or all of them. As you can imagine, append-only MVCC results in massive data duplication and increased storage requirements. This approach means that PostgreSQL requires more memory and disk storage to store a database than other DBMSs, which means slower queries and higher cloud costs. Instead of copying an entire tuple for a new version, MySQL and Oracle store a compact delta between the new and current versions (think of it like a git diff). Using deltas means that if a query only updates a single column in a tuple for a table with 1000 columns, then the DBMS only stores a delta record with the change to that one column. On the other hand, PostgreSQL creates a new version with the one column that the query changed and the 999 other untouched columns. We will ignore TOAST attributes because PostgreSQL handles them differently.
There was an attempt to modernize PostgreSQL’s version storage implementation. EnterpriseDB started the zheap project in 2013 to replace the append-only storage engine with one that uses delta versions. Unfortunately, the last official update was in 2021, and to the best of our knowledge the effort has fizzled out.
Problem #2: Table Bloat
Expired versions in PostgreSQL (i.e., dead tuples) also occupy more space than delta versions. Although PostgreSQL’s autovacuum will eventually remove these dead tuples, write-heavy workloads can cause them to accumulate faster than the vacuum can catch up, resulting in continuous database growth. The DBMS has to load dead tuples into memory during query execution since the system intermingles dead tuples with live tuples in pages. Unfettered bloat slows query performance by causing the DBMS to incur more IOPS and consume more memory than necessary during table scans. Additionally, inaccurate optimizer statistics caused by dead tuples can lead to poor query plans.
Suppose our movies table has 10 million live and 40 million dead tuples, making 80% of the table obsolete data. Assume also that the table has many more columns than what we are showing and that the average size of each tuple is 1KB. In this scenario, the live tuples occupy 10GB of storage space while the dead tuples occupy ~40GB of storage; the total size of the table is 50GB. When a query performs a full table scan on this table, PostgreSQL has to retrieve all 50GB from the disk and store it in memory, even if most of it is obsolete. Although Postgres has a protection mechanism to avoid polluting its buffer pool cache from sequential scans, it does not help prevent the IO costs.
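To see how close your own tables are to this situation, PostgreSQL’s statistics views track live and dead tuple counts; a sketch (the counts are estimates maintained by the statistics collector, not exact figures):

-- Estimated live vs. dead tuples and total on-disk size, largest offenders first.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
  FROM pg_stat_user_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;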
Even if you make sure that PostgreSQL’s autovacuum is running at regular intervals and able to keep up with your workload (which is not always easy to do, see below), the autovacuum cannot reclaim storage space. The autovacuum only removes dead tuples and relocates live tuples within each page, but it does not reclaim empty pages from the disk.
Although the DBMS truncates a table’s last page when it no longer contains any tuples, the other pages remain on disk. In our example above, even if PostgreSQL removed the 40GB of dead tuples from the movies table, it still retains the 50GB of allocated storage space from the operating system (or, in the case of RDS, from Amazon). To reclaim and return such unused space, one must use VACUUM FULL or the pg_repack extension to rewrite the entire table to a new space with no wasted storage. Neither operation is something one should run without considering the performance implications for production databases; they are resource-intensive and time-consuming operations that will crush query performance. The following figure shows how VACUUM and VACUUM FULL work.
With PostgreSQL’s regular VACUUM operation, the DBMS only removes dead tuples from each table page and reorganizes it to put all the live tuples at the end of the page. With VACUUM FULL, PostgreSQL removes the dead tuples from each page, coalesces and compacts the remaining live tuples to a new page (Table Page #3), and then deletes the unneeded pages (Table Pages #1 / #2).
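In SQL, the full rewrite shown in the figure is triggered like this (VACUUM FULL holds an ACCESS EXCLUSIVE lock on the table for its entire duration, which is why it must be scheduled carefully; pg_repack performs a similar rewrite while only taking brief locks):

-- Rewrite the movies table into fresh pages and return the freed space to the
-- operating system. All reads and writes on the table are blocked while it runs.
VACUUM FULL movies;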
Problem #3: Secondary Index Maintenance
A single update to a tuple requires PostgreSQL to update all the indexes for that table. Updating all the indexes is necessary because PostgreSQL uses the exact physical locations of a version in both primary and secondary indexes. Unless the DBMS stores the new version in the same page as the previous version (HOT update), the system does this for every update.
Returning to our UPDATE query example, PostgreSQL creates a new version by copying the original version into a new page just like before. But it also inserts entries pointing to the new version in the table’s primary key index (movies_pkey) and the two secondary indexes (idx_director, idx_name).
Example of PostgreSQL index maintenance operations with a non-HOT update. The DBMS creates the new version of the tuple in Table Page #2, and then inserts new entries that point to that version in all the table’s indexes.
The need for PostgreSQL to modify all of a table’s indexes for each update has several performance implications. Obviously, this makes update queries slower because the system has to do more work. The DBMS incurs additional I/O to traverse each index and insert the new entries. Accessing an index introduces lock/latch contention in both the index and the DBMS’s internal data structures (e.g., buffer pool’s page table). Again, PostgreSQL does this maintenance work for all a table’s indexes, even if queries are never going to use them (by the way, OtterTune automatically finds unused indexes in your database). These extra reads and writes are problematic in DBMSs that charge users based on IOPS, like Amazon Aurora.
As described above, PostgreSQL avoids updating indexes each time if it can perform a HOT write where the new version is on the same page as the current version. Our analysis of OtterTune customers’ PostgreSQL databases shows that roughly 46% of updates use the HOT optimization on average. Although that’s an impressive number, it still means more than 50% of the updates are paying this penalty.
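You can compute the same ratio for your own tables from PostgreSQL’s statistics view; a sketch:

-- Fraction of updates per table that took the HOT path.
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_update_pct
  FROM pg_stat_user_tables
 WHERE n_tup_upd > 0
 ORDER BY n_tup_upd DESC;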
There are many examples of users struggling with this aspect of PostgreSQL’s MVCC implementation. The most famous testament of this is Uber’s 2016 blog article about why they switched from Postgres to MySQL. Their write-heavy workload was experiencing significant performance problems on tables with many secondary indexes.
Oracle and MySQL do not have this problem in their MVCC implementation because their secondary indexes do not store the physical addresses of new versions. Instead, they store a logical identifier (e.g., tuple id, primary key) that the DBMS then uses to look up the current version’s physical address. This may make secondary index reads slower since the DBMS has to resolve a logical identifier, but these DBMSs have other advantages in their MVCC implementations that reduce the overhead.
Side Comment: There is an error in Uber’s blog post regarding PostgreSQL’s version storage. Specifically, each tuple in PostgreSQL stores a pointer to the new version, not the previous one, as stated in the blog. This results in an O2N version chain ordering rather than the N2O version chain erroneously claimed by Uber.
Problem #4: Vacuum Management
PostgreSQL’s performance relies heavily on the effectiveness of the autovacuum to remove obsolete data and reclaim space (this is why OtterTune immediately checks the health status of the autovacuum when you first connect your database). It does not matter if you are running RDS, Aurora, or Aurora Serverless; all variants of PostgreSQL have the same autovacuum issues. But making sure that PostgreSQL’s autovacuum is running as best as possible is difficult due to its complexity. PostgreSQL’s default settings for tuning the autovacuum are not ideal for all tables, particularly for large ones. For example, the default setting for the configuration knob that controls what percentage of a table PostgreSQL has to update before the autovacuum kicks in (autovacuum_vacuum_scale_factor) is 20%. This threshold means that if a table has 100 million tuples, the DBMS does not trigger the autovacuum until queries update at least 20 million tuples. As such, PostgreSQL may unnecessarily keep around a lot of dead tuples in a table (thereby incurring IO and memory costs) for a long time.
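A common mitigation for large tables is to override that default at the table level; for example (the 1% figure is only an illustration, not a recommendation from this article):

-- Trigger autovacuum on this table after roughly 1% of its rows change
-- (plus the base threshold), instead of the global 20% default.
ALTER TABLE movies SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold    = 1000
);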
Another problem with the autovacuum in PostgreSQL is that it may get blocked by long-running transactions, which can result in the accumulation of more dead tuples and stale statistics. Failing to clean expired versions in a timely manner leads to numerous performance problems, causing more long-running transactions that block the autovacuum process. It becomes a vicious cycle, requiring humans to intervene manually by killing long-running transactions. Consider the graph below that shows the number of dead tuples in an OtterTune customer’s database over two weeks:
The number of dead tuples over time in a PostgreSQL Amazon RDS database.
The sawtooth pattern in the chart shows that the autovacuum performs a major clean-up about once every day. For example, on February 14th, the DBMS cleaned up 3.2 million dead tuples. This graph is actually an example of an unhealthy PostgreSQL database. The chart clearly shows an upward trend in the number of dead tuples because the autovacuum cannot keep up.
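A quick way to spot the long-running transactions that hold back the autovacuum (and produce charts like the one above) is pg_stat_activity; a sketch:

-- Oldest open transactions; these pin old snapshots and prevent the vacuum
-- from removing dead tuples that are still "visible" to them.
SELECT pid,
       now() - xact_start AS xact_age,
       state,
       query
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
 ORDER BY xact_start
 LIMIT 10;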
At OtterTune, we see this problem often in our customers’ databases. One PostgreSQL RDS instance had a long-running query caused by stale statistics after bulk insertions. This query blocked the autovacuum from updating the statistics, resulting in more long-running queries. OtterTune’s automated health checks identified the problem, but the administrator still had to kill the query manually and run ANALYZE after bulk insertions. The good news is that the long query’s execution time went from 52 minutes to just 34 seconds.
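The manual intervention described above amounts to something like the following (the pid is whatever a query like the one above reports; terminating sessions on a production database obviously deserves care):

-- Terminate the stuck backend, then refresh statistics after the bulk load.
SELECT pg_terminate_backend(12345);  -- hypothetical pid of the long-running query
ANALYZE movies;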
Concluding Remarks
There are always hard design decisions one has to make when building a DBMS. And these decisions will cause any DBMS to perform differently on varying workloads. For Uber’s specific write-intensive workload, PostgreSQL’s index write amplification due to MVCC is why they switched to MySQL. But please don’t misunderstand our diatribe to mean that we don’t think you should ever use PostgreSQL. Although its MVCC implementation is the wrong way to do it, PostgreSQL is still our favorite DBMS. To love something is to be willing to work with its flaws (see Dan Savage’s “The Price of Admission”).
So how does one work around PostgreSQL’s quirks? Well, you can spend an enormous amount of time and effort tuning it yourself. Good luck with that.
We’ll cover more about what we can do in our next article.