Upgrading GitHub.com to MySQL 8.0

GitHub uses MySQL to store vast amounts of relational data. This is the story of how we seamlessly upgraded our production fleet to MySQL 8.0.

Over 15 years ago, GitHub started as a Ruby on Rails application with a single MySQL database. Since then, GitHub has evolved its MySQL architecture to meet the scaling and resiliency needs of the platform—including building for high availability, implementing testing automation, and partitioning the data. Today, MySQL remains a core part of GitHub’s infrastructure and our relational database of choice.

This is the story of how we upgraded our fleet of 1200+ MySQL hosts to 8.0. Upgrading the fleet with no impact on our Service Level Objectives (SLOs) was no small feat: planning, testing, and the upgrade itself took over a year and required collaboration across multiple teams within GitHub.

Motivation for upgrading

Why upgrade to MySQL 8.0? With MySQL 5.7 nearing end of life, we upgraded our fleet to the next major version, MySQL 8.0. We also wanted to be on a version of MySQL that gets the latest security patches, bug fixes, and performance enhancements. There are also new features in 8.0 that we want to test and benefit from, including Instant DDLs, invisible indexes, and compressed bin logs, among others.
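
As a quick, hedged illustration of two of those features (the table and index names here are invented, not from GitHub's schema): an 8.0 server can add a column as a metadata-only change and can hide an index from the optimizer to verify it is safe to drop.

    -- Instant DDL: add a column without rebuilding the table (MySQL 8.0.12+)
    ALTER TABLE repositories
      ADD COLUMN archived_at DATETIME NULL,
      ALGORITHM = INSTANT;

    -- Invisible index: hide an index from the optimizer before dropping it
    ALTER TABLE repositories ALTER INDEX idx_owner_id INVISIBLE;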

GitHub’s MySQL infrastructure

Before we dive into how we did the upgrade, let’s take a 10,000-foot view of our MySQL infrastructure:

  • Our fleet consists of 1200+ hosts. It’s a combination of Azure Virtual Machines and bare metal hosts in our data center.
  • We store 300+ TB of data and serve 5.5 million queries per second across 50+ database clusters.
  • Each cluster is configured for high availability with a primary-plus-replicas setup.
  • Our data is partitioned. We leverage both horizontal and vertical sharding to scale our MySQL clusters. We have MySQL clusters that store data for specific product-domain areas. We also have horizontally sharded Vitess clusters for large-domain areas that outgrew the single-primary MySQL cluster.
  • We have a large ecosystem of tools consisting of Percona Toolkit, gh-ost, orchestrator, freno, and in-house automation used to operate the fleet.

All of this adds up to a diverse and complex deployment that needed to be upgraded while maintaining our SLOs.

Preparing the journey

As the primary data store for GitHub, we hold ourselves to a high standard for availability. Due to the size of our fleet and the criticality of MySQL infrastructure, we had a few requirements for the upgrade process:

  • We must be able to upgrade each MySQL database while adhering to our Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
  • We cannot account for all failure modes in our testing and validation stages, so to remain within SLO we needed to be able to roll back to the prior version, MySQL 5.7, without a disruption of service.
  • We have a very diverse workload across our MySQL fleet. To reduce risk, we needed to upgrade each database cluster atomically and schedule around other major changes. This meant the upgrade process would be a long one. Therefore, we knew from the start we needed to be able to sustain operating a mixed-version environment.

Preparation for the upgrade started in July 2022 and we had several milestones to reach even before upgrading a single production database.

Prepare infrastructure for upgrade

We needed to determine appropriate default values for MySQL 8.0 and perform some baseline performance benchmarking. Since we needed to operate two versions of MySQL, our tooling and automation needed to be able to handle mixed versions and be aware of new, different, or deprecated syntax between 5.7 and 8.0.
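
One concrete example of the kind of syntax difference that tooling has to account for, shown as an illustrative snippet rather than anything from our automation: MySQL 5.7 allowed creating an account implicitly through GRANT, while 8.0 removed that form.

    -- Accepted by MySQL 5.7, a parse error on 8.0: GRANT can no longer create
    -- or modify an account
    GRANT SELECT ON mydb.* TO 'app'@'%' IDENTIFIED BY 'example-password';

    -- MySQL 8.0 requires the account to exist first
    CREATE USER 'app'@'%' IDENTIFIED BY 'example-password';
    GRANT SELECT ON mydb.* TO 'app'@'%';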

Ensure application compatibility

We added MySQL 8.0 to Continuous Integration (CI) for all applications using MySQL. We ran MySQL 5.7 and 8.0 side-by-side in CI to ensure that there wouldn’t be regressions during the prolonged upgrade process. We detected a variety of bugs and incompatibilities in CI, helping us remove any unsupported configurations or features and escape any new reserved keywords.
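
For example, rank and groups became reserved words in MySQL 8.0, so queries that use them as identifiers need backtick quoting to run on both versions (the table and column names below are made up):

    -- Parses on 5.7 but fails on 8.0, where RANK and GROUPS are reserved words
    SELECT rank FROM groups WHERE id = 42;

    -- Works on both 5.7 and 8.0
    SELECT `rank` FROM `groups` WHERE id = 42;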

To help application developers transition towards MySQL 8.0, we also enabled an option to select a MySQL 8.0 prebuilt container in GitHub Codespaces for debugging and provided MySQL 8.0 development clusters for additional pre-prod testing.

Communication and transparency

We used GitHub Projects to create a rolling calendar to communicate and track our upgrade schedule internally. We created issue templates that tracked the checklist for both application teams and the database team to coordinate an upgrade.

Project Board for tracking the MySQL 8.0 upgrade schedule

Upgrade plan

To meet our availability standards, we had a gradual upgrade strategy that allowed for checkpoints and rollbacks throughout the process.

Step 1: Rolling replica upgrades

We started with upgrading a single replica and monitoring it while it was still offline to ensure basic functionality was stable. Then, we enabled production traffic and continued to monitor query latency, system metrics, and application metrics. We gradually brought 8.0 replicas online until we had upgraded an entire data center, and then iterated through other data centers. We left enough 5.7 replicas online to be able to roll back, but disabled production traffic to them so that all read traffic was served through 8.0 servers.
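
A minimal sketch of the kind of replication health check run against a freshly upgraded replica before routing reads to it (8.0.22+ syntax; on 5.7 the equivalent is SHOW SLAVE STATUS):

    -- Confirm the upgraded replica is applying changes and not lagging
    SHOW REPLICA STATUS\G
    -- Fields worth watching: Replica_IO_Running, Replica_SQL_Running,
    -- Seconds_Behind_Source, Last_SQL_Error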

The replica upgrade strategy involved gradual rollouts in each data center (DC).

Step 2: Update replication topology

Once all the read-only traffic was being served via 8.0 replicas, we adjusted the replication topology as follows (a sketch of the repointing commands appears after the list):

  • An 8.0 primary candidate was configured to replicate directly under the current 5.7 primary.
  • Two replication chains were created downstream of that 8.0 replica:
      • A set of only 5.7 replicas (not serving traffic, but ready in case of rollback).
      • A set of only 8.0 replicas (serving traffic).
  • The topology was only in this state for a short period of time (hours at most) until we moved to the next step.
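
A minimal sketch of the repointing this involves, with hypothetical hostnames and assuming GTID auto-positioning (which matches the GTID-based replication we use):

    -- On a 5.7 replica being moved under the 8.0 candidate (legacy syntax):
    STOP SLAVE;
    CHANGE MASTER TO
      MASTER_HOST = 'mysql-80-candidate.internal',
      MASTER_AUTO_POSITION = 1;
    START SLAVE;

    -- The 8.0 (8.0.23+) equivalent:
    STOP REPLICA;
    CHANGE REPLICATION SOURCE TO
      SOURCE_HOST = 'mysql-80-candidate.internal',
      SOURCE_AUTO_POSITION = 1;
    START REPLICA;

In practice, topology moves like this are performed by tooling such as orchestrator and our in-house automation rather than by hand.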

To facilitate the upgrade, the topology was updated to have two replication chains.

Step 3: Promote MySQL 8.0 host to primary

We opted not to do direct upgrades on the primary database host. Instead, we would promote a MySQL 8.0 replica to primary through a graceful failover performed with Orchestrator. At that point, the replication topology consisted of an 8.0 primary with two replication chains attached to it: an offline set of 5.7 replicas in case of rollback and a serving set of 8.0 replicas.

Orchestrator was also configured to blacklist 5.7 hosts as potential failover candidates to prevent an accidental rollback in case of an unplanned failover.

Primary failover and additional steps to finalize MySQL 8.0 upgrade for a database

Step 4: Internal facing instance types upgraded

We also have ancillary servers for backups or non-production workloads. Those were subsequently upgraded for consistency.

Step 5: Cleanup

Once we confirmed that the cluster didn’t need to roll back and had been successfully upgraded to 8.0, we removed the 5.7 servers. Validation consisted of at least one complete 24-hour traffic cycle to ensure there were no issues during peak traffic.

Ability to Rollback

A core part of keeping our upgrade strategy safe was maintaining the ability to roll back to the prior version, MySQL 5.7. For read-replicas, we ensured enough 5.7 replicas remained online to serve production traffic load, and rollback was initiated by disabling the 8.0 replicas if they weren’t performing well. For the primary, in order to roll back without data loss or service disruption, we needed to be able to maintain backwards data replication between 8.0 and 5.7.

MySQL supports replication from one release to the next higher release but does not explicitly support the reverse (MySQL Replication compatibility). When we tested promoting an 8.0 host to primary on our staging cluster, we saw replication break on all 5.7 replicas. There were a couple of problems we needed to overcome:

  1. In MySQL 8.0, utf8mb4 is the default character set and uses a more modern utf8mb4_0900_ai_ci collation as the default. The prior version of MySQL 5.7 supported the utf8mb4_unicode_520_ci collation but not the latest version of Unicode utf8mb4_0900_ai_ci.
  2. MySQL 8.0 introduces roles for managing privileges, but this feature did not exist in MySQL 5.7. When an 8.0 instance was promoted to be a primary in a cluster, we encountered problems: our configuration management was expanding certain permission sets to include role statements and executing them, which broke downstream replication in 5.7 replicas. We solved this problem by temporarily adjusting the defined permissions for affected users during the upgrade window (an illustrative example of the offending statements follows this list).
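
To make the second problem concrete, here is an invented example of the kind of role statement that an 8.0 primary accepts but that a 5.7 replica's SQL thread cannot parse, which stops replication:

    -- Valid on MySQL 8.0; unknown syntax to a 5.7 replica
    CREATE ROLE 'app_readonly';
    GRANT 'app_readonly' TO 'some_user'@'%';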

To address the character collation incompatibility, we had to set the default character encoding to utf8 and collation to utf8_unicode_ci.
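
A minimal sketch of the server-level defaults involved; in reality these values are managed through my.cnf and configuration management rather than set interactively:

    -- Keep the 8.0 primary from defaulting to a collation that 5.7 cannot parse
    SET PERSIST character_set_server = 'utf8';
    SET PERSIST collation_server = 'utf8_unicode_ci';

    -- Verify the defaults handed out to new connections and new tables
    SELECT @@character_set_server, @@collation_server;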

For the GitHub.com monolith, our Rails configuration ensured that character collation was consistent and made it easier to standardize client configurations to the database. As a result, we had high confidence that we could maintain backward replication for our most critical applications.

Challenges

Throughout our testing, preparation and upgrades, we encountered some technical challenges.

What about Vitess?

We use Vitess for horizontally sharding relational data. For the most part, upgrading our Vitess clusters was not too different from upgrading the MySQL clusters. We were already running Vitess in CI, so we were able to validate query compatibility. In our upgrade strategy for sharded clusters, we upgraded one shard at a time. VTgate, the Vitess proxy layer, advertises the version of MySQL and some client behavior depends on this version information. For example, one application used a Java client that disabled the query cache for 5.7 servers—since the query cache was removed in 8.0, it generated blocking errors for them. So, once a single MySQL host was upgraded for a given keyspace, we had to make sure we also updated the VTgate setting to advertise 8.0.
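
The version string VTgate advertises is what such clients branch on, so from a client's point of view a keyspace is not fully upgraded until that value changes as well (the output below is illustrative):

    -- Connected through VTgate, a client decides its behavior from this value
    SELECT @@version;
    -- e.g. 5.7.x-Vitess before the advertised version is bumped,
    --      8.0.x-Vitess after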

Replication delay

We use read-replicas to scale our read availability. GitHub.com requires low replication delay in order to serve up-to-date data.

Earlier on in our testing, we encountered a replication bug in MySQL that was patched in 8.0.28:

Replication: If a replica server with the system variable replica_preserve_commit_order = 1 set was used under intensive load for a long period, the instance could run out of commit order sequence tickets. Incorrect behavior after the maximum value was exceeded caused the applier to hang and the applier worker threads to wait indefinitely on the commit order queue. The commit order sequence ticket generator now wraps around correctly. Thanks to Zhai Weixiang for the contribution. (Bug #32891221, Bug #103636)

We happen to meet all the criteria for hitting this bug.

  • We use replica_preserve_commit_order because we use GTID-based replication.
  • We have intensive load for long periods of time on many of our clusters and certainly for all of our most critical ones. Most of our clusters are very write-heavy.

Since this bug was already patched upstream, we just needed to ensure that we deployed a version of MySQL at 8.0.28 or higher.
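
An illustrative sanity check, not our actual automation, that a host is past the patched release and is running the configuration that made us susceptible:

    -- The fix landed in 8.0.28, so the deployed version must be at least that
    SELECT @@version,
           @@gtid_mode,
           @@replica_preserve_commit_order;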

We also observed that replication delay driven by heavy writes was exacerbated in MySQL 8.0, which made it even more important that we avoid heavy bursts of writes. At GitHub, we use freno to throttle write workloads based on replication lag.

Queries would pass CI but fail on production

We knew we would inevitably see problems for the first time in production environments, hence our gradual rollout strategy of upgrading replicas. We encountered queries that passed CI but would fail in production under real-world workloads. Most notably, queries with large WHERE IN clauses, some containing tens of thousands of values, would crash MySQL. In those cases, we needed to rewrite the queries before continuing the upgrade process. Query sampling helped to track and detect these problems. At GitHub, we use Solarwinds DPM (VividCortex), a SaaS database performance monitor, for query observability.
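
The shape of the problem and of the rewrite, with invented table and column names: a single statement carrying tens of thousands of literal values had to be broken into bounded batches (or turned into a join against a temporary table) before the cluster could be upgraded.

    -- Problematic shape: one query with tens of thousands of values in its IN list
    SELECT id, name
    FROM repositories
    WHERE id IN (1, 2, 3 /* , ... tens of thousands more values ... */);

    -- Rewritten shape: the application issues the lookup in bounded chunks
    SELECT id, name FROM repositories WHERE id IN (1, 2, 3 /* ... up to 1000 */);
    SELECT id, name FROM repositories WHERE id IN (1001, 1002 /* ... next 1000 */);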

Learnings and takeaways

Between testing, performance tuning, and resolving identified issues, the overall upgrade process took over a year and involved engineers from multiple teams at GitHub. We upgraded our entire fleet to MySQL 8.0, including staging clusters, production clusters in support of GitHub.com, and instances in support of internal tools. This upgrade highlighted the importance of our observability platform, testing plan, and rollback capabilities. The testing and gradual rollout strategy allowed us to identify problems early and reduce the likelihood of encountering new failure modes during the primary upgrade.

While there was a gradual rollout strategy, we still needed the ability to roll back at every step, and we needed the observability to identify the signals that indicated when a rollback was needed. The most challenging aspect of enabling rollbacks was holding onto the backward replication from the new 8.0 primary to the 5.7 replicas. We learned that consistency in the Trilogy client library gave us more predictability in connection behavior and allowed us to have confidence that connections from the main Rails monolith would not break backward replication.

However, for some of our MySQL clusters with connections from multiple different clients in different frameworks/languages, we saw backwards replication break in a matter of hours, which shortened the window of opportunity for rollback. Luckily, those cases were few and we didn’t have an instance where replication broke before we needed to roll back. But for us this was a lesson that there are benefits to having known and well-understood client-side connection configurations. It emphasized the value of developing guidelines and frameworks to ensure consistency in such configurations.

Prior efforts to partition our data paid off: they allowed us to perform more targeted upgrades for the different data domains. This was important because a single failing query would block the upgrade for an entire cluster, and having different workloads partitioned allowed us to upgrade piecemeal and reduce the blast radius of unknown risks encountered during the process. The tradeoff is that this also means our MySQL fleet has grown.

The last time GitHub upgraded MySQL versions, we had five database clusters and now we have 50+ clusters. In order to successfully upgrade, we had to invest in observability, tooling, and processes for managing the fleet.

Conclusion

A MySQL upgrade is just one type of routine maintenance that we have to perform: it’s critical for us to have an upgrade path for any software we run on our fleet. As part of the upgrade project, we developed new processes and operational capabilities to successfully complete the MySQL version upgrade. Yet we still had too many steps in the upgrade process that required manual intervention, and we want to reduce the effort and time it takes to complete future MySQL upgrades.

We anticipate that our fleet will continue to grow as GitHub.com grows, and we have goals to partition our data further, which will increase the number of MySQL clusters over time. Building in automation for operational tasks and self-healing capabilities can help us scale MySQL operations in the future. We believe that investing in reliable fleet management and automation will allow us to scale GitHub and keep up with required maintenance, providing a more predictable and resilient system.

The lessons from this project provided the foundations for our MySQL automation and will pave the way for future upgrades to be done more efficiently, but still with the same level of care and safety.
