【ClickHouse】The Merging and Splitting of Data


theme: channing-cyan

This article is an exclusive signed article first published on the Juejin Technology Community. Reproduction is prohibited within 30 days, and reproduction without authorization is prohibited after 30 days. All infringements will be prosecuted!

In the previous article [ClickHouse] Updating and Deleting Data via Happy Match: Is This Really the Solution?, we looked at partitioning (Partition) in MergeTree

Partitioning

Let's use a courier pickup station as an analogy

Courier packages are stacked on shelves according to specific classification and sorting rules

There is a Cainiao Station right downstairs from my home

There are several shelves inside, each marked with a number indicating which shelf it is

Each shelf has multiple layers, and each layer also has its own number

Goods are stacked on each layer following the rule of placing smaller items on top and larger items below

Partitioning in MergeTree works similarly: it stores data in different files according to certain rules

We can specify the partition key with partition by when creating the table

create table express_delivery
(
    id           String comment 'ID',
    type         Int8 comment 'Size: 1=small, 2=medium, 3=large',
    consignor    String comment 'Consignor',
    consignee    String comment 'Consignee',
    phone        String comment 'Phone number',
    storage_time DateTime comment 'Warehousing time'

) engine = MergeTree()
order by storage_time
partition by (toInt64(phone) % 3 + 1, type)
comment ‘Courier station with only 3 shelves’;

We create a table to store courier information (other fields are omitted)

Assume our courier station only has 3 shelves

We first use the remainder of the phone number modulo 3 to determine the shelf number

Then we determine the layer number based on whether the package is small, medium, or large

Finally, we sort by the warehousing time

(This is just an assumption, not an actual production algorithm)

Therefore, our partition key can be defined as partition by (toInt64(phone) % 3 + 1, type)

Now a new package has arrived

/* Mr. A has a small package */
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values ('0001', 1, 'A先生', 'F先生', '18312340001', now());

Let's check the partition information

SELECT partition, name, rows, active FROM system.parts WHERE table = 'express_delivery';

We can get the following result:

partition name rows active
(3,1) 3-1_1_1_0 1 1

Among these, partition=(3,1) is the partition where the first record resides

Now two more packages have arrived

/* Mr. B has a small package */
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values ('0002', 1, 'B先生', 'F先生', '18312340002', now());

/* Mr. A has a large package */
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values (‘0003’, 3, ‘A先生’, ‘F先生’, ‘18312340001’, now());

Let's look at the current partitions

partition name rows active
(1,1) 1-1_2_2_0 1 1
(3,1) 3-1_1_1_0 1 1
(3,3) 3-3_3_3_0 1 1

The (1,1) partition holds Mr. B's small package

The (3,3) partition holds Mr. A's large package

Ding! Mr. A received a text message (just kidding):

【CK Station】 Pick up your packages at express_delivery with pickup codes 3-1-0001 and 3-3-0003.

When we query data, if the query meets certain conditions, we can directly locate and read the corresponding data file

Just like when you pick up a package, you can directly lock onto the correct shelf and layer using your pickup code

Merging

Recently, Mr. A has been shopping online frequently, so two more packages have arrived

/* Mr. A has a medium package */
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values ('0004', 2, 'A先生', 'F先生', '18312340001', now());

/* Mr. A has a medium package */
insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values (‘0005’, 2, ‘A先生’, ‘F先生’, ‘18312340001’, now());

Let's check the current partitions (older partitions are not shown here)

partition name rows active
(3,2) 3-2_4_4_0 1 1
(3,2) 3-2_5_5_0 1 1

??? Why are there two entries for the same partition?

After all, based on our current understanding of partitioning

We should only have one partition entry for (3,2) with two records inside that partition

Do you remember what our table engine is called?

MergeTree – the merging tree

Since it's called a merging tree, it naturally needs to merge something

That's right, the thing that gets merged is partitions

Why merge? (Conjecture)

This design is most likely chosen for insertion efficiency, which is why two separate insertions are kept separate

Suppose we do not perform any additional merge operations

What problem would occur if we just append the second batch of data directly after the first batch?

First, there is the concurrency problem

When inserting the second batch, if the first batch hasn't finished inserting yet

Does the second insertion have to wait for the first one to complete before it can start?

In our example, the first batch only has one record

But in real-world scenarios, there could be tens of thousands or hundreds of thousands of records

Waiting for that long is not acceptable

Suppose we can use some technology to allow two insertions to run at the same time without errors

But we have specified a sorting key

This means we need to re-sort all data after both insertions complete

If the entire insertion operation is synchronous

The insertion efficiency would be very low

Therefore, it is better to write each insertion to a new file

Then merge these multiple files into one when the system is idle

This approach both guarantees insertion performance and makes use of idle system resources to merge data

How merging works

Generally speaking, ClickHouse will automatically merge partitions 10 to 20 minutes after data is inserted

Let's check the partitions again after some time has passed

partition name rows active
(3,2) 3-2_4_5_1 2 1

The two partitions have been merged into one, and the total number of records is now two

Have you noticed

After merging, the name field changed from the original 3-2_4_4_0 and 3-2_5_5_0 to 3-2_4_5_1

So what do these numbers mean?

Let's recall all the partition data we've seen so far, ordered by insertion time

partition name rows active
(3,1) 3-1_1_1_0 1 1
(1,1) 1-1_2_2_0 1 1
(3,3) 3-3_3_3_0 1 1
(3,2) 3-2_4_4_0 1 1
(3,2) 3-2_5_5_0 1 1

You can see that after the prefix that matches the partition, the next two numbers increment with each new insertion

Which means the next insertion will be named partition_6_6_0

insert into express_delivery(id, type, consignor, consignee, phone, storage_time)
values ('0006', 2, 'A先生', 'F先生', '18312340001', now());

Let's check the partition data (older partitions are ignored)

partition name rows active
(3,2) 3-2_6_6_0 1 1

As expected, the number increments with each insertion

So what are the rules for these numbers when two partitions are merged?

The first number takes the smaller of the two values

The second number takes the larger of the two values

The third number takes the larger of the two values and adds 1; this can be seen as the number of merges that have occurred

So when merging 3-2_4_4_0 and 3-2_5_5_0 earlier

The first number takes the smaller value between 4 and 5, which is 4

The second number takes the larger value between 4 and 5, which is 5

The third number takes the larger value between 0 and 0, which is 0, then adds 1 to get 1

So the resulting name is 3-2_4_5_1

So now we can predict what the numbers will be after merging 3-2_4_5_1 and 3-2_6_6_0

The first number takes the smaller value between 4 and 6, which is 4

The second number takes the larger value between 5 and 6, which is 6

The third number takes the larger value between 1 and 0, which is 1, then adds 1 to get 2

So after merging, the name should be 3-2_4_6_2

Let's manually trigger the partition merge

optimize table express_delivery partition (3,2);

Let's check the partition data again

partition name rows active
(3,2) 3-2_4_6_2 3 1

It matches our prediction exactly, and the total number of records after merging is now 3

Summary

For the MergeTree table engine in ClickHouse

Each insertion writes a new partition file

Then merges data from the same partition when the system is idle

Another important thing to note is that you should avoid generating too many partitions when specifying your partition key, as this will negatively impact performance

For example, when partitioning by time, you can partition by year or month, instead of by day or hour


This is a separate discussion topic split from the original thread at https://juejin.cn/post/7368836713965387815