theme: channing-cyan
This article is an exclusive signed article first published on the Juejin Technology Community. Reproduction is prohibited within 30 days, and reproduction without authorization is prohibited after 30 days. All infringements will be prosecuted!
In the previous article [ClickHouse] Updating and Deleting Data via Happy Match: Is This Really the Solution?, we looked at partitioning (Partition) in MergeTree
Partitioning
Let's use a courier pickup station as an analogy
Courier packages are stacked on shelves according to specific classification and sorting rules
There is a Cainiao Station right downstairs from my home
There are several shelves inside, each marked with a number indicating which shelf it is
Each shelf has multiple layers, and each layer also has its own number
Goods are stacked on each layer following the rule of placing smaller items on top and larger items below
Partitioning in MergeTree works similarly: it stores data in different files according to certain rules
We can specify the partition key with partition by when creating the table
create table express_delivery ( id String comment 'ID', type Int8 comment 'Size: 1=small, 2=medium, 3=large', consignor String comment 'Consignor', consignee String comment 'Consignee', phone String comment 'Phone number', storage_time DateTime comment 'Warehousing time'
) engine = MergeTree()
order by storage_time
partition by (toInt64(phone) % 3 + 1, type)
comment ‘Courier station with only 3 shelves’;
We create a table to store courier information (other fields are omitted)
Assume our courier station only has 3 shelves
We first use the remainder of the phone number modulo 3 to determine the shelf number
Then we determine the layer number based on whether the package is small, medium, or large
Finally, we sort by the warehousing time
(This is just an assumption, not an actual production algorithm)
Therefore, our partition key can be defined as partition by (toInt64(phone) % 3 + 1, type)
Now a new package has arrived
/* Mr. A has a small package */
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values ('0001', 1, 'A先生', 'F先生', '18312340001', now());
Let's check the partition information
SELECT partition, name, rows, active FROM system.parts WHERE table = 'express_delivery';
We can get the following result:
| partition | name | rows | active |
|---|---|---|---|
| (3,1) | 3-1_1_1_0 | 1 | 1 |
Among these, partition=(3,1) is the partition where the first record resides
Now two more packages have arrived
/* Mr. B has a small package */ insert into express_delivery(id, type, consignor, consignee, phone, storage_time) values ('0002', 1, 'B先生', 'F先生', '18312340002', now());
/* Mr. A has a large package */
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values (‘0003’, 3, ‘A先生’, ‘F先生’, ‘18312340001’, now());
Let's look at the current partitions
| partition | name | rows | active |
|---|---|---|---|
| (1,1) | 1-1_2_2_0 | 1 | 1 |
| (3,1) | 3-1_1_1_0 | 1 | 1 |
| (3,3) | 3-3_3_3_0 | 1 | 1 |
The (1,1) partition holds Mr. B's small package
The (3,3) partition holds Mr. A's large package
Ding! Mr. A received a text message (just kidding):
【CK Station】 Pick up your packages at express_delivery with pickup codes 3-1-0001 and 3-3-0003.
When we query data, if the query meets certain conditions, we can directly locate and read the corresponding data file
Just like when you pick up a package, you can directly lock onto the correct shelf and layer using your pickup code
Merging
Recently, Mr. A has been shopping online frequently, so two more packages have arrived
/* Mr. A has a medium package */ insert into express_delivery(id, type, consignor, consignee, phone, storage_time) values ('0004', 2, 'A先生', 'F先生', '18312340001', now());
/* Mr. A has a medium package */
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values (‘0005’, 2, ‘A先生’, ‘F先生’, ‘18312340001’, now());
Let's check the current partitions (older partitions are not shown here)
| partition | name | rows | active |
|---|---|---|---|
| (3,2) | 3-2_4_4_0 | 1 | 1 |
| (3,2) | 3-2_5_5_0 | 1 | 1 |
??? Why are there two entries for the same partition?
After all, based on our current understanding of partitioning
We should only have one partition entry for (3,2) with two records inside that partition
Do you remember what our table engine is called?
MergeTree – the merging tree
Since it's called a merging tree, it naturally needs to merge something
That's right, the thing that gets merged is partitions
Why merge? (Conjecture)
This design is most likely chosen for insertion efficiency, which is why two separate insertions are kept separate
Suppose we do not perform any additional merge operations
What problem would occur if we just append the second batch of data directly after the first batch?
First, there is the concurrency problem
When inserting the second batch, if the first batch hasn't finished inserting yet
Does the second insertion have to wait for the first one to complete before it can start?
In our example, the first batch only has one record
But in real-world scenarios, there could be tens of thousands or hundreds of thousands of records
Waiting for that long is not acceptable
Suppose we can use some technology to allow two insertions to run at the same time without errors
But we have specified a sorting key
This means we need to re-sort all data after both insertions complete
If the entire insertion operation is synchronous
The insertion efficiency would be very low
Therefore, it is better to write each insertion to a new file
Then merge these multiple files into one when the system is idle
This approach both guarantees insertion performance and makes use of idle system resources to merge data
How merging works
Generally speaking, ClickHouse will automatically merge partitions 10 to 20 minutes after data is inserted
Let's check the partitions again after some time has passed
| partition | name | rows | active |
|---|---|---|---|
| (3,2) | 3-2_4_5_1 | 2 | 1 |
The two partitions have been merged into one, and the total number of records is now two
Have you noticed
After merging, the name field changed from the original 3-2_4_4_0 and 3-2_5_5_0 to 3-2_4_5_1
So what do these numbers mean?
Let's recall all the partition data we've seen so far, ordered by insertion time
| partition | name | rows | active |
|---|---|---|---|
| (3,1) | 3-1_1_1_0 | 1 | 1 |
| (1,1) | 1-1_2_2_0 | 1 | 1 |
| (3,3) | 3-3_3_3_0 | 1 | 1 |
| (3,2) | 3-2_4_4_0 | 1 | 1 |
| (3,2) | 3-2_5_5_0 | 1 | 1 |
You can see that after the prefix that matches the partition, the next two numbers increment with each new insertion
Which means the next insertion will be named partition_6_6_0
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values ('0006', 2, 'A先生', 'F先生', '18312340001', now());
Let's check the partition data (older partitions are ignored)
| partition | name | rows | active |
|---|---|---|---|
| (3,2) | 3-2_6_6_0 | 1 | 1 |
As expected, the number increments with each insertion
So what are the rules for these numbers when two partitions are merged?
The first number takes the smaller of the two values
The second number takes the larger of the two values
The third number takes the larger of the two values and adds 1; this can be seen as the number of merges that have occurred
So when merging 3-2_4_4_0 and 3-2_5_5_0 earlier
The first number takes the smaller value between 4 and 5, which is 4
The second number takes the larger value between 4 and 5, which is 5
The third number takes the larger value between 0 and 0, which is 0, then adds 1 to get 1
So the resulting name is 3-2_4_5_1
So now we can predict what the numbers will be after merging 3-2_4_5_1 and 3-2_6_6_0
The first number takes the smaller value between 4 and 6, which is 4
The second number takes the larger value between 5 and 6, which is 6
The third number takes the larger value between 1 and 0, which is 1, then adds 1 to get 2
So after merging, the name should be 3-2_4_6_2
Let's manually trigger the partition merge
optimize table express_delivery partition (3,2);
Let's check the partition data again
| partition | name | rows | active |
|---|---|---|---|
| (3,2) | 3-2_4_6_2 | 3 | 1 |
It matches our prediction exactly, and the total number of records after merging is now 3
Summary
For the MergeTree table engine in ClickHouse
Each insertion writes a new partition file
Then merges data from the same partition when the system is idle
Another important thing to note is that you should avoid generating too many partitions when specifying your partition key, as this will negatively impact performance
For example, when partitioning by time, you can partition by year or month, instead of by day or hour
This is a separate discussion topic split from the original thread at https://juejin.cn/post/7368836713965387815