Theme: Nico
Author: @Li Zepeng
Douyin relies on its own recommendation system to push video content that users may be interested in. Interest circles are a core capability of recommendation: by understanding the preference characteristics of core users, judging the similarity of their preferences, we can build interest circles for similar users to achieve precise recommendations.
In the past, interest circles often relied on a single dimension or label, such as content type, duration, geographic characteristics, etc., making it difficult to reveal the underlying logic of user interests. For example, food vlog videos featuring beautiful young ladies from Chongqing, and ancient-style dance videos from the two-dimensional (二次元) fandom. On the surface, their tag types may be completely different, but in-depth analysis shows that the people who like both videos belong to the same group, and they should be categorized into the same interest circle.
Building such an interest circle platform not only requires algorithm strategies, but also poses a major challenge to the underlying data storage architecture. Douyin has a huge amount of daily new data and a wide variety of business tags, and it also needs to meet the real-time requirements of business personnel for complex queries. Previously, the technical team used MySQL as the storage architecture. As a row-based storage database, MySQL has low processing efficiency for large amounts of data. If you need to query hundreds of millions of pieces of data in MySQL, you may need higher-configured hardware, and even adopt strategies such as sharding and read-write separation to improve performance, which will significantly increase hardware costs.
Therefore, the technical team gradually migrated the interest platform to be rebuilt based on ByteHouse. ByteHouse is an OLAP engine that features high query efficiency, relatively low hardware requirements, and good horizontal scalability. If the data volume grows further, processing capacity can be improved by increasing the number of servers. This article will break down how to build an interest circle platform based on an OLAP engine from the perspectives of difficulties in interest circle construction and construction plans.
Introduction to Interest Circle Platform
An interest circle refers to a group of people with the same hobbies. Interest circles can help gain a deeper understanding of short-video creators and content from the user's perspective, uncover the common interests and typical preference characteristics of the core user groups of the creators in the circle, use this as an important tag for classifying creators, and apply it to scenarios such as content distribution, vertical operation, data analysis, and strategic planning to deliver value. Interest circles exist in the form of clusters, which are formed through machine model clustering. Each cluster includes one seed creator and multiple associated creators.
Circle production process: The daily Hive tables in the data warehouse use scheduled tasks to import the data in the Hive tables into the RDS (MySQL) database by partition. At the same time, the precomputation script will regularly write the data in RDS to cache (such as general queries like circle information) or write back to RDS (such as core data like the parent node information of the circle) as needed every day. A successful production process will be marked in the cache to indicate that today's data is valid, otherwise an alarm will be sent to notify the relevant person in charge.
Circle query process: When a user performs a query, the front end sends a request for query scenario data. After the server receives the request, it reads the corresponding cache, database tables and partitions, assembles the data, and finally returns it to the user.
Main Problems
Data Inflation
The daily update version has led to data volume inflation. The circle basic information table adds tens of thousands of pieces of data daily, the circle creator information table adds millions of pieces of data daily, and the circle user information table adds about tens of millions of pieces of data daily, which has reached the performance bottleneck of MySQL's tens of millions of queries per second.
Query efficiency can no longer meet demand. Even with cache acceleration to reduce join queries, single-table query efficiency takes more than 10 seconds. Among them, the time to load the circle understanding page (circle user information table) exceeds 15 seconds, which affects the business user experience to a certain extent. Previously, many optimizations were done, including index optimization, query optimization, cache optimization, and table structure optimization, but the time to update a column/add or modify an index for a single table has exceeded 2 days, and the optimization cost has gradually increased.
Outdated Architecture Is Too Thin to Support Complex Circle Selection Capabilities
Looking at the current situation, the current circle architecture is simple, and to distinguish query scenarios, it directly interacts with the database and only supports simple synchronous queries. When the business requires more complex generalized circle selection conditions, users need to wait more than 15 seconds on the platform.
Looking at future planning, the current synchronous query architecture using RDS as storage can no longer support business scenarios that require complex conditional queries involving multiple tables and features.
Business Feature Inflation
Tag feature inflation: There are more and more tag descriptions for current circles. Since different business parties understand circles from different perspectives, such as vertical tags, circle keyword descriptions, circle quality classification, circle style, etc., the current circle information entity features have reached dozens of types, and it is expected that circle attribute tags will continue to expand.
There are increasing demands for one-stop generalized circle selection of target creators. Currently, creators only include basic information, and business parties hope to use circles and other basic creator characteristics, such as follower count, creator quality, activity level, etc., to meet needs such as creator traffic targeting strategies, and to realize multi-dimensional filtering and sorting functions under complex conditions.
Reconstructing the Interest Circle Platform Based on ByteHouse
RDS, as a row-based database, is more suitable for single-point transaction analysis work, which obviously does not meet the current platform requirements. We selected the storage solution from the perspectives of query scenarios, query performance, storage cost, and migration cost respectively.
Query Scenarios
- Circle information is produced by the model and imported in batches by time partitions, there is no temporary import, which is an append-only scenario.
- There are many circle features, and business parties filter features related to their own business according to their needs. Columnar storage is more suitable than row-based storage.
- Circles are mainly used for analysis and statistics, do not have strong transaction processing requirements, and are oriented towards OLAP business.
Query Performance
- When MySQL performs complex conditional queries on multiple columns, it is difficult to optimize query performance, and it needs to strongly rely on Redis cache for acceleration, otherwise the platform functions will be unavailable.
- Circle scenarios are usually limited to aggregated analysis of local data, such as calculating keyword frequency statistics for circle IDs within a set. If the set range is too large, the index will fail and degrade to a full table scan.
Detailed Scenario Testing
Storage Comparison Before and After Reconstruction
| MySQL | ByteHouse |
|---|---|
| Relational database, supports transactions | Distributed columnar database, supports eventual transactions |
| Row storage mode, suitable for reading only the required row data as little as possible | Columnar storage mode with high data compression ratio, which has a natural advantage for reading large batches of data |
| Single-process multi-threaded service, a single business request query cannot effectively utilize multiple CPU resources | Multi-core parallel processing |
| Oriented towards OLTP business | Oriented towards OLAP business |
Specific Scenario Comparison
Data management information query scenario:
Application tool analysis scenario:
Summary
In summary, after reconstructing Douyin's interest circle platform by replacing MySQL with ByteHouse, the query efficiency of several typical scenarios has been improved by an average of about 100 times, greatly improving the user experience. Due to ByteHouse's excellent query performance and good data compression ratio, servers with medium resources can well meet the requirements, which also reduces the comprehensive hardware cost. In addition, ByteHouse has good horizontal scalability, and if the data volume grows further, the analysis capability can be easily improved by increasing the number of servers.
This is a discussion topic separated from the original topic at https://juejin.cn/post/7348649105708744756

