Author: Duodian (Dmall), Tang Wanmin
Introduction
After two years, with the assistance of Feng Guangpu, organizer of the Chengdu chapter of the TiDB Community, TiDB Community offline regional events have returned to Chengdu. In his themed talk "Going Global with Multi-Cloud Architecture: Dmall's TiDB Operation and Maintenance Practice", Tang Wanmin, Head of Domestic Database at Dmall, introduced Dmall's experience deploying and using TiDB for overseas business scenarios. This article is compiled from Tang Wanmin's speech. From it you can learn about Dmall's TiDB journey from scratch, the business scenarios where Dmall uses TiDB, practical experience with multi-cloud architecture, and solutions to problems encountered during version upgrades.
Dmall's TiDB Journey
Currently, Dmall uses TiDB for both domestic and overseas business, with a total of 46 TiDB clusters, over 300 nodes, and more than 400TB of data in the online production environment. These clusters support a wide range of business scenarios, including integrated business and finance, TMS, settlement, procurement and sales, logistics, inventory vouchers, order fulfillment, inventory accounting, and more. For underlying cloud resources, Dmall has selected multiple public clouds including Tencent Cloud, Huawei Cloud, Microsoft Cloud, and VolcEngine based on business requirements in different regions.
Dmall has more than 20 online production environments. The image above shows some of the TiDB clusters in one of these environments. As you can see, the database for integrated business and finance carries very high traffic: incoming and outgoing network traffic is around 500 MB/s. A QPS of 17,000 may not seem high, but most of these queries are large, expensive operations.
Dmall currently runs many different versions of TiDB. As you can see from the image above, we have everything from 5.1.1, 5.1.2, all the way up to the recently upgraded 6.5.8. The most widely used version online is 6.1.5. Why does Dmall run so many versions of TiDB instead of upgrading everything to 6.5.8? In fact, as DBAs, we generally prefer not to touch a running database: any change carries risk, and the vast majority of database problems and risks are caused by changes.
But why do we upgrade at all? First, business requirements change: the current version may no longer meet business needs, so we have to upgrade. Second, newer versions are really attractive, and we want to use the new features so much that this need outweighs the risk of upgrading. Once we decide to upgrade, we conduct research on new TiDB versions to find the best one to upgrade to. At this point, I have to say that the TiDB Community has an incredibly good atmosphere: community members use all versions of TiDB, many people will answer our questions enthusiastically, and there are lots of shared practical experiences with upgrades and deployments, so we have plenty of experience to refer to when choosing a new version.
Dmall and TiDB: Moving Forward Together
As an early user of TiDB, Dmall has a long history with TiDB: we started using it back in 2018, when it was still version 2.0.3. At that time, we wanted to offload some complex queries from MySQL to TiDB, but testing showed that TiDB 2.0.3 really didn't have great performance — it couldn't solve the problems that MySQL couldn't handle either.
It wasn't until TiDB 4.0 was released in 2020, which introduced TiFlash, that we decided to give TiDB another try. At that time, Dmall's business and finance operations were very complex and had an extremely large data volume, which MySQL could no longer handle. After research, we finally decided to migrate this data to TiDB. The timeline was as follows: June 2020, TiDB went live in the test environment; September 2020, production environment was officially upgraded to TiDB 4.0 GA; October 2020, production was upgraded to 4.0.6; April 2021, upgraded to 4.0.9; October 2021, upgraded to 5.1.2; 2022, upgraded to 5.1.4; 2023, upgraded to 6.1.5; and most recently, we upgraded to TiDB 6.5.8. In fact, Dmall upgrades to a new version of TiDB every year, and many of these upgrades were driven by issues we encountered online, which I will cover in detail later.
Choosing a Database Type
What exactly does Dmall use TiDB for, and why did we choose TiDB? I will share the four main scenarios where Dmall uses TiDB.
The first scenario is continuously growing data. TiDB is extremely well suited to this: it supports seamless writes and near-unlimited scaling, unlike MySQL. When MySQL outgrows its storage capacity, you have to perform migrations, implement database and table sharding, and re-establish cluster high availability, all of which carries high migration costs, requires extensive coordination, and introduces plenty of risk. TiDB scales smoothly instead, and in our case reduced storage costs by 70%. In addition, TiDB stores data differently from MySQL: MySQL stores data uncompressed, while TiDB compresses it with its storage engine's compression algorithms. According to our tests, a single TiDB replica can use almost 10 times less storage space than a single MySQL replica. Even for log data, where TiDB keeps 3 replicas while MySQL keeps only two (primary plus replica), TiDB still has a much lower storage cost.
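To see how a figure like the 70% savings can arise despite TiDB keeping one more replica, here is a back-of-envelope sketch with hypothetical numbers (the 5x compression ratio is assumed purely for illustration; the talk's roughly 10x figure is the best case measured):

```python
# Illustrative storage-cost comparison; all numbers are hypothetical.
logical_data_tb = 10      # logical dataset size in TB
mysql_replicas = 2        # primary + replica, stored uncompressed
tidb_replicas = 3         # TiDB's default Raft replication factor
compression_ratio = 5     # assumed compression factor (up to ~10x per the talk)

mysql_storage = logical_data_tb * mysql_replicas                    # 20 TB
tidb_storage = logical_data_tb / compression_ratio * tidb_replicas  # 6 TB

savings = 1 - tidb_storage / mysql_storage
print(f"MySQL: {mysql_storage} TB, TiDB: {tidb_storage} TB, savings: {savings:.0%}")
# prints: MySQL: 20 TB, TiDB: 6.0 TB, savings: 70%
```

With the assumed 5x compression, the extra third replica is more than paid for by the smaller per-replica footprint.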
The second scenario is hot-cold data separation and historical data archiving. When we first started using TiDB, TiDB Data Migration (DM) was not as mature as the current DM Cluster, so we developed our own DRC-TiDB synchronization tool. We use this tool to synchronize data from MySQL to TiDB, moving cold data and historical archived data into TiDB. This allows MySQL to maintain high-performance read and write operations, while TiDB stores the full dataset.
The third scenario is consolidating sharded databases and tables for OLAP aggregate queries. Data in MySQL is spread across different databases and different tables, which makes querying very painful for developers; anyone who has worked with database and table sharding knows how painful it is to run queries and aggregations on sharded data. Developers want to aggregate data from many sharded databases and tables in one place, and TiDB supports this very well. In addition, TiFlash uses columnar storage and can hold the full dataset, so we use TiFlash to accelerate statistical queries.
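As a minimal sketch of how a table is exposed to TiFlash for such analytical queries (the table name is a hypothetical placeholder; the statements are standard TiDB SQL):

```sql
-- Add one columnar TiFlash replica for an existing table (name is illustrative):
ALTER TABLE orders SET TIFLASH REPLICA 1;

-- Check replication progress; AVAILABLE = 1 means TiFlash can serve reads:
SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'orders';
```

Once the replica is available, the optimizer can route aggregate queries to the columnar copy automatically.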
The fourth scenario is replacing Elasticsearch (ES). Inside Dmall, we use Elasticsearch (ES) very heavily, but ES has very high costs. For large-volume data storage, ES requires a lot of machines, so we have used TiDB to replace some ES deployments. To be honest, for certain query scenarios such as fuzzy queries, ES is actually better than TiDB. So when we replace ES, we first run tests: we only replace it if query performance on TiDB is not worse than on ES. After testing, we found that about 60% of our ES deployments can be replaced with TiDB, and overall costs have been greatly reduced.
Dmall's Data Technology Stack Architecture
The image above shows the overall architecture of Dmall's data technology stack. Data from MySQL, warehouse management, sales, and payment databases flows to the TiDB cluster via DRC-TiDB; the finance engine can also directly clean and transform data and read/write it directly in TiDB; other businesses also read and write data directly in TiDB. Downstream of this process, we run analysis directly in TiDB, for example for finance-related APIs, financial accounting, end-to-end tracking systems, business analysis, and more. In addition, for big data requirements, we use TiCDC to sync data to Kafka, then to Spark, and finally to the big data offline data warehouse. For example, offline report requirements are processed in Hive, making this a relatively long process.
Architectural Choices for TiDB Deployment in Overseas Business
Originally, Dmall's overseas business was only deployed on Microsoft Cloud, but we gradually encountered several problems:
- First, Dmall's RTA OS (Dmall's Retail Technology Platform) is deployed in the Singapore region of Microsoft Cloud, but we frequently encountered infrastructure instability issues, such as unexpected restarts of cloud hosts and network anomalies. These issues caused our machines to restart, services to become unavailable, and communication between services to be interrupted;
- Second, IO performance did not meet our expectations. For example, some disks supported ADE disk encryption but had poor IO performance, while disks that did not support encryption could not meet our overseas security requirements;
- Third, the cost of Microsoft Cloud is relatively high.
For these reasons, we moved from a single-cloud deployment on Microsoft Cloud to a dual-cloud "Microsoft Cloud + Huawei Cloud" deployment model. The goal is that if any data center becomes unavailable, TiDB can recover automatically and keep data available. We used Microsoft Cloud and Huawei Cloud to build an active-active metro architecture for RTA in Singapore: applications, middleware, and databases are deployed across the two public clouds, forming three metro data centers. If any single data center becomes unavailable, the business can be recovered quickly, improving the availability of RTA OS.
The image above shows the architecture of the dual-cloud deployment solution. Microsoft Cloud has two availability zones and Huawei Cloud has one, and the TiDB PD cluster is deployed across all three. TiCDC has one node on each of Microsoft Cloud and Huawei Cloud, and TiDB likewise has a set of nodes on each cloud. To implement this, we add a dc label to the TiKV nodes so that Regions are distributed across the three data centers. TiDB has at least 2 nodes deployed across the two data centers, and TiCDC is also deployed across both. We also improved recovery of the DRC-TiDB synchronization link by implementing an active-standby structure: if the MySQL-to-TiDB link goes down on one DRC-TiDB node, the other takes over as standby.
This is an example of the dc labels we added to the TiKV nodes. The zone, dc, rack, and host fields can all be configured by users and are only used to tell PD which machines and zones Regions should be spread across. Note, however, that Placement Rules cannot be combined with this method, as doing so may cause unexpected issues.
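A minimal sketch of what such a labeled topology can look like in a TiUP topology file (host IPs and label values here are hypothetical placeholders, not Dmall's actual configuration):

```yaml
# Hypothetical excerpt of a TiUP cluster topology file.
server_configs:
  pd:
    # Tell PD the label hierarchy so it spreads Region replicas across dc values.
    replication.location-labels: ["zone", "dc", "rack", "host"]

tikv_servers:
  - host: 10.0.1.11   # Microsoft Cloud, AZ 1
    config:
      server.labels: { zone: "sg", dc: "azure-az1", rack: "r1", host: "tikv-01" }
  - host: 10.0.2.11   # Microsoft Cloud, AZ 2
    config:
      server.labels: { zone: "sg", dc: "azure-az2", rack: "r1", host: "tikv-02" }
  - host: 10.0.3.11   # Huawei Cloud
    config:
      server.labels: { zone: "sg", dc: "huawei-az1", rack: "r1", host: "tikv-03" }
```

With the default of 3 Region replicas, PD then places one replica in each dc, so losing any one data center still leaves a Raft majority.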
For the implementation process: we migrated one PD node from Microsoft Cloud to Huawei Cloud, added dc labels to TiKV nodes, migrated some TiKV nodes from Microsoft Cloud to Huawei Cloud, and waited for automatic rebalancing to complete, then migrated one TiDB node from Microsoft Cloud to Huawei Cloud. The entire process was actually very smooth. If we were doing this with MySQL, migrating from a single cloud to multiple clouds would be extremely troublesome, and even migrating MySQL from one cloud to another is very difficult. We have done a lot of MySQL migrations: for example, we recently migrated all MySQL clusters from Tencent Cloud to VolcEngine, and some environments required an enormous amount of work and carried huge risks. But TiDB handles this kind of migration very smoothly.
Issues Encountered During the Multi-Cloud TiDB Cluster Practice
We did encounter some issues when using TiDB — TiDB is not perfect. But the TiDB Community is very open and active. No matter what problem you encounter, if you search for it on AskTUG, many people have already encountered the same problem, and you can get help directly from their shared experiences.
For example, with version 4.0.9, we encountered an OOM (Out of Memory) issue in TiDB Server. There was a bug in the hashagg-to-streamagg transformation for expensive queries: hash aggregation built its hash table entirely in memory, causing TiDB Server to consume a very large amount of memory. We changed the execution plan to use stream aggregation instead, and memory consumption dropped immediately. These issues have all been fixed in newer versions of TiDB. In fact, in TiDB 4.0.9, memory control for TiDB Server, TiKV, and TiFlash was not as good as in current versions. Why are newer versions so much better? Because a large number of community users have reported the issues they hit, the official team has addressed them, and the fixes naturally land in new versions.
Version 4.0.9 also had an issue with TiCDC. TiCDC is mainly used when data needs to be synced to big data systems or other downstream systems, so with a small data volume you probably won't hit many problems. At that time, after we restarted TiCDC, the checkpoint stopped advancing, and TiCDC could not recover even after repeated restarts. This was mainly because the default sort-engine at the time was memory-based: if the machine did not have enough memory, re-sorting after a restart would run out of memory. Newer versions use a unified sorter with disk spillover: when memory is insufficient, data is first spilled to disk before being sorted.
In addition, we encountered a TiDB OOM issue when querying slow logs on the dashboard sorted by a non-time column. It was caused by decoding the plans of very large INSERT statements; it does not occur when INSERT statements are small. We also hit issues such as execution plan drift, and TiFlash failing to start again after being shut down. All of these have been fixed in later versions. In fact, any database works fine when the data volume is small; once you scale up, more problems appear. So when large-data-volume users like Dmall use TiDB, we help iron out these issues, and other users can then use TiDB with more confidence in similar scenarios.
Upgrades are also a common concern for everyone: what if TiDB won't start after an upgrade? We hit this when upgrading to version 5.1.2. The correct order for upgrading TiDB components is: TiFlash, PD, TiKV, Pump, TiDB, Drainer, TiCDC, Prometheus, Grafana, Alertmanager. Once, after upgrading TiFlash, we immediately upgraded the TiCDC component and skipped PD, TiKV, and Pump, which made the upgrade fail. TiUP version 1.5.1 may have had a bug in per-component upgrades; upgrading to a newer version of TiUP fixed the issue.
We also encountered an issue when upgrading to version 6.1.5. After the upgrade, TiDB would not start. After careful inspection, we found that a large DDL was running before we started the upgrade. During the upgrade process, this DDL blocked the data dictionary upgrade operation, so the data dictionary upgrade never completed, which caused TiDB to fail to start.
As you can see from the image above, the alter table mysql.stats_meta add column operation could not complete; this was the exception we hit during the upgrade. I therefore strongly recommend checking for long-running DDL operations before starting an upgrade. This issue occurred because our data volume is very large, so DDL statements take a long time, and we restarted before the DDL had finished. With a small data volume you are unlikely to hit it.
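One way to perform that pre-upgrade check, assuming you can reach the cluster with a MySQL client, is to inspect the DDL job queue (the statements are standard TiDB admin SQL; the job ID below is a placeholder):

```sql
-- List recent DDL jobs; any job not in the "synced" state is still in flight
-- and should be allowed to finish (or be cancelled) before the upgrade starts.
ADMIN SHOW DDL JOBS;

-- If a job must be stopped, cancel it by its JOB_ID (use with care):
-- ADMIN CANCEL DDL JOBS 123;
```

Running this immediately before invoking the upgrade keeps a long-running schema change from blocking the internal data dictionary upgrade.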
In summary, all the issues we have encountered in the past have been fixed in new versions of TiDB. My recommendation to everyone is: use the new version if you can, don't use old versions. Many issues have already been encountered by early users like us, we have reported these issues to the community, and they have been fixed one after another in new versions. I believe that as more people use TiDB, and the community remains as active as it is now, TiDB will only get better and better!
This discussion topic was split from the original thread at https://juejin.cn/post/7368469208647401491