Node.js Chapter 37: MySQL Subqueries and Table Joins


Theme: smartblue

Nodejs Chapter 37: MySQL Subqueries and Table Joins

  • In all previous examples, we performed create, read, update, delete operations on a single table, but in normal development, we will never define all fields in a single table.
    • What difference is there between putting all fields in one table versus putting all your code in one file?
    • We learned the concept of modularity back in Chapter 9, and this idea can also be applied to database design. We therefore categorize tables by the different properties and functions of their fields, resulting in a clearer structure
  • However, in frontend code, we can connect individual files into a complete project through the two import/export specifications of ESM or CommonJS.
    • So how do we connect multiple MySQL tables together?
    • And that is exactly the topic we will cover today: Subqueries and Table Joins

Subqueries

What is a Subquery

In MySQL, a subquery (Subquery) is a query nested inside another SQL query. Subqueries can be used in a variety of SQL statements, including SELECT, INSERT, UPDATE, and DELETE statements, as well as in FROM, WHERE, and HAVING clauses. Subqueries allow us to execute one query inside another, giving SQL queries greater flexibility and the ability to handle more complex logic.

  • Simply put, as we mentioned earlier, subqueries establish one-way relationships between tables.
    • This is just like import/export in frontend modularity: a child table nested in the main table cannot reverse query the content of the main table, because our goal is to have the child table serve the main table (PS: the concepts of main table and child table are metaphors I created for easier explanation, MySQL itself does not have these concepts), and the subquery is used by the main query
    • A subquery is like an imported module: it provides data or functionality for the main query (the module that uses it), but the main query cannot directly manipulate the internals of the subquery, and the data processing inside the subquery is invisible to the main query – it is only responsible for providing data.
  • And as for why use subqueries, we explained this in the introduction at the beginning: it implements the idea of modularity, and also brings benefits such as better readability, easier understanding, and more convenient maintenance.

Subquery Example

  • Let's first clarify our approach (two steps):
    1. First, we need two tables, corresponding to the main query (main table) and the subquery (child table)
    2. Establish a subquery association between these two tables

Creating the Tables

  • First, let's start with step one: we will create a user table as the main table, and a child table that records users' login activities.
    • The users table is the main table, which stores the user's basic information; the login_activities table is the child table, which is associated with the id of the users table via a foreign key, and records the user's login activities

Main Query Table

-- Create main table `users`
CREATE TABLE `users` (
    `id` int NOT NULL AUTO_INCREMENT,
    `name` varchar(255) NOT NULL,
    `email` varchar(255),
    PRIMARY KEY (`id`)
);

– Insert data into users
INSERT INTO users (name, email)
VALUES
(‘迷你余’, ‘mini@gmail.com’),
(‘小余’, ‘small@gmail.com’),
(‘中余’, ‘medium@gmail.com’),
(‘大余’, ‘large@gmail.com’),
(‘超大余’, ‘xlarge@gmail.com’);

Subquery Table

  • When initializing the data here, we query the primary key id from the user table (main query table), and fill it into the login_activities table as the foreign key id.
    • The purpose of this is to establish the association between the two tables for the next steps
    • This bulk operation approach also improves insertion efficiency
-- Create child table `login_activities`
CREATE TABLE `login_activities` (
    `activity_id` int NOT NULL AUTO_INCREMENT,
    `user_id` int NOT NULL,
    `login_time` timestamp DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`activity_id`),
    FOREIGN KEY (`user_id`) REFERENCES `users`(`id`)
);

– Insert data into login_activities
– Assume that every user has logged in at least once
INSERT INTO login_activities (user_id)
SELECT id FROM users;

Establishing the Query Association

  • Here we need to master one concept:
    • Parentheses in Subqueries: In SQL, subqueries usually need to be enclosed in parentheses to distinguish the boundary between the subquery and the main query
  • Suppose we want to query the names of all users who have login activity (this is a good approach for filtering active users or removing inactive followers).
    • In the syntax below, we use the IN keyword, IN can replace multiple OR conditions. This means that only one condition inside the parentheses of IN(xxx,xxx,...) needs to be satisfied. It is usually used together with WHERE for filtering.
SELECT `name` FROM `users`
WHERE `id` IN (
    SELECT `user_id` FROM `login_activities`
);
  • We need to look at this in two parts:

    1. The first part is the main query, which retrieves the name data
    SELECT `name` FROM `users`
    
    1. The second part is the core functionality of this example: we want to query users who have logged in at least once after May 1st this year

      • At this point, the first part gets us all usernames
      • Through the subquery (child table), we get the list of usernames that logged in on or after this year's May Day holiday
      SELECT `name` FROM `users`
      WHERE `id` IN (
          SELECT `user_id` FROM `login_activities`
          WHERE `login_time` >= '2024-05-01'
      );
      
  • From the syntax above, I think everyone should understand: a subquery is not actually a specific syntactic rule, but rather an expression of an idea.

    • We split the subquery into two steps, but when viewed individually, each step is a normal, ordinary operation (querying names, querying login times)
    • If we had put everything in a single table in the first place, we would not need this concept at all. It is the many benefits brought by the clear modular structure that promote the functional classification of tables. The concept of subqueries was developed based on this ideological foundation, which is why we say it is an expression of an idea.
    • In terms of operation, it is indeed a very ordinary approach; what is valuable is the idea behind it – when learning, you must understand the design philosophy behind it.
  • We cannot say: "I use subqueries because that is the only way to get the functionality I want." This understanding is incorrect, because the fact that you need this approach is already based on having separated the tables, which means you have already accepted the benefits of structured thinking (improving the flexibility and efficiency of database queries), and subqueries are just a natural solution and extension of this way of thinking.

Join Queries

  • There are many application scenarios in daily development. If you want to combine content from two tables, simply using subqueries alone is not enough.
    • Let's take the previous subquery example: we got the names of active users after May Day through a subquery, but we cannot see their specific most recent login time, right?
    • If we want to see both pieces of information (username, login time) in a single table, that is, merge the content of two tables, subqueries alone cannot do this, because the username and login time data are not in the same table – they come from the main query table and the child table respectively.
  • This is when join queries come into play, they can solve this problem. Join queries are divided into inner joins, outer joins, and cross joins. Let's take a look.

Inner Joins

What is an Inner Join

An inner join returns matching rows from both tables. Only rows that have matching values in the joined columns will appear in the query result.

How to Use Inner Joins

  • The simple introduction above may still be a bit abstract, so let's see how it works.
SELECT * FROM `users`,`login_activities`
  • By initially querying both tables at the same time, we can see that all the content from both tables is combined.
    • But something seems wrong: there are suddenly 25 total rows of data, but we should only have 5 rows.
    • This is because we did not add a WHERE restriction, so the data from the two tables are not matched one-to-one, instead they are combined in a 5×5 Cartesian product, generating 25 different rows of data.

  • An inner join is based on this foundation, and uses primary and foreign keys in the WHERE clause to achieve a one-to-one matching of data.
    • In terms of operation, it is still very simple: you just compare the primary and foreign keys. But as we said before, the idea is what matters.
# Match the primary key id of users to the foreign key id of login_activities
SELECT * FROM `users`,`login_activities` WHERE `users`.`id` = `login_activities`.`user_id`

  • Everything looks good so far. Even if there is a lot of content, we just need to replace the * wildcard in the command with the specific content we want.
SELECT `name`,`login_time` FROM `users`,`login_activities` WHERE `users`.`id` = `login_activities`.`user_id`

  • But inner joins actually have a problem: our data cannot always be perfectly matched. For example, the user table has 12 rows of data, but the login table may only have 8 rows of data.
    • If you do an inner join in this case, only the 8 matching rows will be displayed, and the remaining 4 rows will not be shown. This means the result may not be comprehensive.
    • If I want to achieve: show all usernames, show the login time for users who have logged in recently, and show NULL for those who have not. Inner joins cannot do this – inner joins are like an intersection.

Outer Joins

What is an Outer Join

An outer join returns all rows from at least one of the tables, even if there is no match in the other table. The behavior differs slightly depending on whether it is a left outer join, right outer join, or full outer join.

  • Outer joins are divided into left and right types, and each has its own specific syntax.

Left Outer Join

Returns all rows from the left table (table1), plus matching rows from the right table (table2). If there is no match in the right table, those rows will be filled with null values.

  • Syntax: LEFT JOIN [table name] ON [join condition]
  • The order of tables referenced after FROM also has different effects.
    • The first table is called the driving table, and the join is based on this table.
    • Why is the first table the driving table? Because it is on the left side of LEFT JOIN – a left join is dominated by the left side. For a right outer join, the right side is the driving table.
  • For easier demonstration, I will delete one record from the login table.
DELETE FROM `login_activities` WHERE `user_id` = 5

  • Now the login table has 4 records, and the user table has 5. We will use the user table as the driving table to see if the problem of missing unmatched data from the other table that occurred with inner joins still exists.

  • Previously we used the WHERE keyword for filter conditions, but for left and right outer joins, we use the ON keyword. What is the difference between the two?

The ON Keyword

  • Purpose: The ON keyword is mainly used in join queries to specify the join condition, that is, to define how to match rows from two tables.
  • Context: ON is used in JOIN statements, used to specify how to join rows by comparing columns from two tables.

The WHERE Keyword

  • Purpose: The WHERE keyword is used to specify filter conditions, that is, to determine which rows should be included in the query result.
  • Context: WHERE can be used in any SELECT query, not just join queries. It filters after the data is joined, so you can use WHERE after JOIN to further restrict the result.
# When joining tables, especially for outer joins, it is very important to use ON to define the join condition, because this directly affects which rows are included in the final join result. The ON condition determines how to match data from the two tables, which is the foundation for building the joined result set.
SELECT `name`,`login_time` FROM `users` LEFT JOIN `login_activities` ON `users`.`id` = `login_activities`.`user_id`

  • You can clearly see that the fifth row is also displayed. Data that does not exist in the login table is represented as NULL.

Right Outer Join

Returns all rows from the right table (table2), plus matching rows from the left table (table1). If there is no match in the left table, those rows will be filled with null values.

  • As mentioned earlier, for a right outer join, the right side is the driving table.
    • If we simply change LEFT to RIGHT in the left outer join example above, leaving everything else unchanged, what result do we get?
SELECT `name`,`login_time` FROM `users` RIGHT JOIN `login_activities` ON `users`.`id` = `login_activities`.`user_id`

  • Now we only get 4 rows of data, not 5. This shows that the join is indeed dominated by the table on the right side of RIGHT JOIN.

Summary

  • These two outer join methods give us two options, depending on which table we want to prioritize.
    • Theoretically, these two methods cover the vast majority of usage scenarios, but as we know, there are actually three possible options when joining two tables.

  • As shown in the image above, a left outer join gets A+B, and a right outer join gets B+C. From the diagram, we can easily see that there is actually another option: A+B+C.
    • There is a method for this, called Full Outer Join, which returns all rows from both tables. If there is no match on one side, that side's rows will be filled with null values.
    • However, not all database systems support full outer joins, so we do not cover it here.

This is an independent discussion topic split from the original thread at https://juejin.cn/post/7368367426210578471