PostgerSQL Usage Q&A 03 - Upsert

This article is part of the PostgreSQL Technical Q&A series. For the origin of this series, you can refer to the opening article:

PostgerSQL Technical Q&A - 00 Why Postgres

Article numbering is only an identifier and has no specific logical order or meaning within the series. Readers do not need to pay much attention to this detail.

This article focuses on issues related to the Upsert operation pattern and its implementation.

What is Upsert

In web application development, there is a very common data processing requirement pattern: you need to update data, but the data may not exist yet, so you need to create the record first. This is the so-called Update Or (If Not Exist) Insert pattern. Traditional standard SQL only provides separate Update and Insert operations. To implement this business requirement, you must first check for existence, then decide whether to use Update or Insert.

In program implementation, this requires first running a Select query. If a matching record exists, execute an update; otherwise, execute an insert. This is not an ideal pattern, because it requires multiple operations and repeated request-response and data transmission between the client and the database server. In addition, it cannot handle bulk operations for multiple records; each record must be processed individually.

To address this, enhanced SQL statements and database systems have introduced support for performing Upsert operations with a single SQL statement.

How does Postgres implement Upsert

In fact, Postgres does not have a standalone SQL statement named Upsert. It is implemented by extending the INSERT statement, specifically through the `insert ... on conflict (IOC)` pattern.

Let's first look at a practical SQL example, then analyze the key points:

// Example data table:

CREATE TABLE udata (
id SERIAL PRIMARY KEY,
ivalue int default 0
);

// Insert data
insert into udata (id,ivalue) values(10,20);

// Insert the same data again, will throw an error
insert into udata (id,ivalue) values(10,20);
ERROR: duplicate key value violates unique constraint “udata_pkey”

// Insert data again with the modified statement
insert into udata (id,ivalue) values(10,20)
on conflict(id) do update set ivalue = excluded.ivalue ;
INSERT 0 1

// Improved INSERT statement that returns the record
insert into udata (id,ivalue) values(10,20) on conflict(id) do update set ivalue = udata.ivalue + excluded.ivalue returning * ;

id | ivalue
----±-------
10 | 40
(1 row)

Checking the PostgreSQL technical documentation, the standard form of the optional `on conflict` clause for INSERT statements is:

[ ON CONFLICT [ conflict_target ] conflict_action ]

Combined with the example above, the key points for applying this operation are:

  • ON CONFLICT must be used in conjunction with an INSERT statement, and is an optional clause for it
  • ON CONFLICT also has two optional components: the conflict target and the conflict action
  • The conflict target is the detection condition that can trigger a conflict. In this example, `id` as the primary key is a conflict target, which needs to be specified in the clause parameters
  • Conflict targets can use primary keys (as shown in the example), as well as unique indexes and composite unique indexes
  • The conflict action is the operation performed when a conflict is detected. In this example, we perform an UPDATE operation, formatted as `do update set ....`, which is similar to a standard UPDATE statement
  • There is a particularly important difference from standard UPDATE to note here: if you want to reference the logical record set of inserted data, you use `excluded`. This can be understood as a logical table named `excluded`, whose field names match the target field names used during insertion
  • The result of INSERT in this case is: `insert 0 1`. Just like a standard INSERT, this output contains two pieces of information: the first is the oid, which is always 0 for INSERT statements; the second is the number of records affected by the insert operation, which is 1 here
  • The conflict action can also choose to do nothing, which may seem redundant, but the difference from an INSERT without the ON CONFLICT clause is that it will not throw a SQL execution error when a conflict occurs, and handles the conflict internally

At this point, I find this way of thinking quite strange and interesting. Following a more intuitive, normal logic, shouldn't the implementation be "Update on noexist insert"? But in fact, we really should think in the order of "data exists first, then update", and additionally, insertion contains complete information while update may only update some fields, so this approach is quite reasonable.

What is EXCLUDED

According to PostgreSQL technical documentation, EXCLUDED is a special temporary table used during the execution of insert on conflict operations. You can reference the values to be inserted from this table, and its field names match the target fields to be inserted. Therefore, using this table requires access permissions for the target insertion table and its fields. It is referenced as EXCLUDED.filedName (case-insensitive in PostgreSQL).

I have not found any other scenarios in PostgreSQL where the special logical table EXCLUDED is used. As a side note, in Trigger usage, OLD and NEW are used as the names of logical record sets. Therefore, EXCLUDED is a keyword in PG SQL, and should be used with caution in related development.

Does PostgreSQL support Merge statements like Oracle?

Simply put, PG did not support this feature before, but it has been supported as of version 15.

The basic logic and requirements of Upsert are simple and clear, but because there was no standard SQL implementation in the early days, various database systems created their own implementations. For example, PostgreSQL uses the Insert On Conflict approach, while Oracle's native implementation uses its own designed Merge Into statement. In fact, there are still some subtle logical differences between Merge Into and IOC in actual operation, so after version 15, PostgreSQL has also added and implemented the Merge Into statement, with usage and logic similar to Oracle's.

Let's analyze this starting from an official example:

MERGE INTO customer_account ca
USING recent_transactions t
ON t.customer_id = ca.customer_id
WHEN MATCHED THEN
  UPDATE SET balance = balance + transaction_value
WHEN NOT MATCHED THEN
  INSERT (customer_id, balance)
  VALUES (t.customer_id, t.transaction_value);

In this example, we can see that the key points of the Merge Into implementation include:

  • Basic syntax is `Merge Into (target table) Using (joined table) On (join condition) (matched operation statement) (unmatched operation statement)`
  • You need to specify a reference dataset to join with `Using`, and set the relevant join condition
  • Due to the nature of the join, the joined dataset requires unique matching (similar to Oracle's implementation)
  • Each joined record is processed after matching
  • If matched, you can choose to update or delete the record
  • If not matched, you can choose to insert the record
  • Of course, both matched and unmatched processing are optional

Therefore, the logic of Merge is conceived from the perspective of an existing record set. Although it can achieve similar goals to insert on conflict, its starting point is slightly different. So there is a logical limitation when using Merge Into: the related data set referenced by `using` needs to have a unique identifier.

Are there other solutions?

Of course. The traditional solution is for the application client to perform query - check existence - update or insert. Obviously, this requires an external application to access the database, retrieve data for operation, and requires multiple executions.

This process can also be implemented as a stored procedure and deployed on the database server. In this case, the client only needs to call the stored procedure, and performance is relatively good. However, this solution is more cumbersome, less flexible, and more difficult to adjust.

I have thought that in PG, using CTE may also help implement this requirement: you first construct a CTE with the input data, then join it with the original table for query, and filter out two cases: data that needs to be updated is processed with an update CTE; data that needs to be inserted is inserted directly into the database. Although this approach is not recommended in actual development, as a thought experiment, I still wrote an example to show that this operation is indeed feasible:

// udata table schema is id(int), ivalue(int)
// Insert two records: (1, 1) and (2, 2)
with 
D (id1, ivalue1) as (values
(1, 1),
(2, 2)
),
U as ( update udata set ivalue = ivalue + ivalue1 from D where id1 = id returning id ),
I as ( insert into udata(id,ivalue) select id1, ivalue1 from D left join udata O on id1 = id where id is null returning id  )
select id, 'I' from I
union all 
select id, 'U' from U;

// Check the result
select * from udata;

Summary

This article discusses how to implement the common business requirement of "update data, insert if it does not exist" in PostgreSQL. It covers several feasible technical solutions, their specific implementation and operation, as well as the advantages, disadvantages and differences between the solutions.


This is a discussion topic separated from the original topic at https://juejin.cn/post/7368761530382876691