Preface
Over my years of work, I have encountered many requirements for exporting large files from the frontend, but I have never actually had a requirement to import a large file from the frontend into the backend database. After all, with full permissions on the MySQL server, I can simply use the `source` command to import everything: a mere 100,000 rows finishes in seconds, with no need to worry about network latency or program efficiency.
How quickly the backend receives an uploaded file from the frontend depends on the server's network bandwidth, which is a network-level issue. Today we will focus on improving efficiency at the application-processing level. In backend applications, the most common approach is multithreading, but which step should be multithreaded, and how should it be wired in? That is what we need to discuss.
Frontend
The last time I implemented a file upload feature, I used a native HTML form. For this example, we are working in a Vue project, and we will use the el-upload component from Element Plus to complete the frontend development for file upload.
The component renders a drag-and-drop upload area reading "Drop file here or click to upload", with the tip "jpg/png files with a size less than 500kb".
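The original component markup was lost in rendering; a minimal el-upload setup matching the text above might look like the sketch below (the `action` URL and the single-file `:limit` are assumptions, not from the original):

```html
<template>
  <!-- drag-and-drop upload posting to the Spring Boot /upload endpoint (URL assumed) -->
  <el-upload
    drag
    action="http://localhost:8080/upload"
    :limit="1">
    <el-icon class="el-icon--upload"><upload-filled /></el-icon>
    <div class="el-upload__text">
      Drop file here or <em>click to upload</em>
    </div>
    <template #tip>
      <div class="el-upload__tip">
        jpg/png files with a size less than 500kb
      </div>
    </template>
  </el-upload>
</template>
```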
Backend Service
We use Spring Boot to implement the backend processing logic for file upload, including receiving files, reading files, inserting data into the database, etc. So if we want to speed up database insertion, we need to consider how to apply multithreaded processing logic in these steps.
First, let's implement the overall code framework for the file upload controller.
@PostMapping("/upload")
@CrossOrigin(origins = "*", maxAge = 3600)
public @ResponseBody Map<String, String> upload(MultipartFile file) throws IOException {
    System.out.println(file.getName());
    BufferedReader br = new BufferedReader(new InputStreamReader(file.getInputStream()));
    String line = null;
    long start = System.currentTimeMillis();
    while ((line = br.readLine()) != null) {
        // implement insertion logic
    }
    long end = System.currentTimeMillis();
    System.out.println("Total time taken to insert into database: " + (end - start) / 1000 + "s");
    Map<String, String> res = new HashMap<>();
    res.put("1", "success");
    return res;
}
We use CrossOrigin to allow cross-origin access, get the uploaded file via MultipartFile, and obtain the file's input stream. We loop with readLine and implement different database insertion methods inside the while loop to test efficiency.
I used a 1-core VPS for MySQL and a mobile hotspot for testing (which was very slow). So the following test results are for reference only and are not production-grade data; after all, no one uses a 1-core server for production anyway.
Based on my limited development experience, I have listed three solutions. Solution 1 uses 10,000 test records (to save time), and Solution 3 uses 100,000 test records.
Solution 1
Solution 1 is the conventional process: insert a row into the database every time a line of data is read.
@Insert("insert into test values(#{a}, #{b}, #{c}, #{d})")
void insertFile(Test test);
Total time taken to insert into database: 1833s
Solution 2
As you can see, inserting row-by-row requires a round trip to the database for every record, which is very time-consuming. Building on Solution 1, we can use batch insertion to improve efficiency, implemented with a MyBatis `<script>` block.
@Insert("<script>" +
        "insert into test values " +
        "<foreach collection='list' item='item' separator=','>" +
        "(#{item.a}, #{item.b}, #{item.c}, #{item.d})" +
        "</foreach>" +
        "</script>")
void insertBatch(List<Test> list);
As you can see, the parameter for the batch insertion here is now a List, and we implement batch insertion via foreach iteration. We just encapsulate the data in a List in the controller.
List<Test> list = new ArrayList<>();
int count = 0;
while ((line = br.readLine()) != null) {
    String[] lines = line.split(",");
    Test test = new Test(lines[0], lines[1], lines[2], lines[3]);
    list.add(test);
    count++;
    if (count % 100 == 0) {
        fileUploadService.insertBatch(list);
        list.clear();
    }
}
// flush the remaining records when the total is not a multiple of 100
if (!list.isEmpty()) {
    fileUploadService.insertBatch(list);
    list.clear();
}
We defined a count counter, and use modulo to insert 100 records per batch. Final result: Total time taken to insert into database: 82s
Solution 3
Solution 2 implements batch processing at the database insertion step. Based on Solution 2, we now use multithreading in the controller to insert data into the database in batches.
When working with multithreading, the first things to consider are locking and contention. Heavy locking would greatly reduce concurrency, so we use the producer/consumer pattern here: the file-reading loop is the producer, and multiple threads act as consumers.
I have two technical options here: one is the concurrent queue ConcurrentLinkedQueue, the other is Disruptor. Both queues are internally implemented with CAS + volatile. Disruptor performs slightly better than ConcurrentLinkedQueue, but it requires more code, so we will use ConcurrentLinkedQueue as the example here.
ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
boolean[] isComplete = {false};
int[] count = {0};
CountDownLatch countDownLatch = new CountDownLatch(6);
AtomicInteger atomicSize = new AtomicInteger(0);
while ((line = br.readLine()) != null) {
    queue.add(line);
    count[0]++;
    if (count[0] == 500) { // start the 6 consumer threads once, after some data is queued
        for (int i = 0; i < 6; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    int num = 0;
                    List<Test> list = new ArrayList<>();
                    // keep consuming until the file is fully read and every record has been polled
                    while (!(isComplete[0] && count[0] == atomicSize.get())) {
                        String row = queue.poll();
                        if (row != null) {
                            String[] lines = row.split(",");
                            list.add(new Test(lines[0], lines[1], lines[2], lines[3]));
                            atomicSize.incrementAndGet();
                            num++;
                            if (num % 100 == 0) {
                                fileUploadService.insertBatch(list);
                                list.clear();
                            }
                        }
                    }
                    // flush this thread's remaining partial batch
                    if (!list.isEmpty()) {
                        fileUploadService.insertBatch(list);
                    }
                    countDownLatch.countDown();
                }
            }).start();
        }
    }
}
isComplete[0] = true; // the whole file has been read, so count[0] is now final
countDownLatch.await(); // wait for all consumers before measuring the elapsed time
Since this is just a test, I used the native Thread construction method in the controller. For production, this can be refactored to use a thread pool, and you can modify some default configurations.
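As a sketch of that refactor (the class name and pool size below are illustrative, not from the article), a fixed thread pool can replace the six hand-built threads, with shutdown() plus awaitTermination() playing the waiting role:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {
    // Run `workers` consumer tasks on a fixed pool and wait for them all to finish.
    static int runConsumers(int workers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger finished = new AtomicInteger(0);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                // real consumer logic (poll the queue, batch insert) would go here
                finished.incrementAndGet();
            });
        }
        pool.shutdown();                            // accept no new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for all submitted tasks
        return finished.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runConsumers(6)); // prints 6
    }
}
```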
This solution requires considering many factors:
1. How do we confirm when the file has been fully read, and when should the thread exit the while loop?
We use two variables to control this process. One is an AtomicInteger, which counts how many records have been consumed across the threads, so that its value eventually matches the total record count `count` read from the file. The other is a boolean `isComplete` flag, set once the entire file has been read; it confirms that the `count` being compared is the final total, rather than an intermediate value that happens to equal the AtomicInteger just because consumption briefly caught up with production.
Also, to solve the problem of anonymous inner classes accessing external variables, I defined all of these variables as array types here.
2. How do we correctly measure the elapsed time?
The main thread reads the file, adds lines to the queue, and starts the consumer threads; it then continues executing subsequent code without waiting for the consumers to finish. This means the total time would be measured before the child threads had finished inserting all the data into the database.
To avoid this situation, we use CountDownLatch to make the main thread wait for all child threads to finish before executing subsequent code.
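As a self-contained illustration of this wait-for-workers pattern (the class name and numbers below are illustrative, not taken from the article's controller), the caller blocks on await() until every worker thread has called countDown():

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchDemo {
    // Sum 1..n across `threads` workers; the latch makes the caller
    // wait until every worker has finished before reading the total.
    static int parallelSum(int n, int threads) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(threads);
        AtomicInteger total = new AtomicInteger(0);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            new Thread(() -> {
                for (int i = id + 1; i <= n; i += threads) {
                    total.addAndGet(i);
                }
                latch.countDown(); // signal that this worker is done
            }).start();
        }
        latch.await(); // block until all workers have counted down
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parallelSum(100, 4)); // prints 5050
    }
}
```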
Test result with 3 threads: Total time taken to insert into database: 72s. After changing to 6 threads, test result: Total time taken to insert into database: 60s
Then I inserted 100,000 records, total time taken for database insertion: 41s. Maybe my mobile hotspot was less laggy this time, or maybe the performance scaled better with the larger workload.
Problem Summary
I don't often implement file upload with Spring Boot, so when I tested the upload, I got the following exception: "the request was rejected because its size (18889091) exceeds the configured maximum (10485760)".
The exception indicates that the file upload request was rejected because it exceeded the size limit. You can fix this by modifying the default limit in application.properties.
spring.servlet.multipart.max-file-size is used to set the maximum file size limit for a single file upload, and spring.servlet.multipart.max-request-size is used to set the maximum size limit for the entire request (including all files and form data).
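For example, raising both limits to 100 MB (the exact values here are illustrative) in application.properties:

```properties
# maximum size of a single uploaded file
spring.servlet.multipart.max-file-size=100MB
# maximum size of the whole multipart request (all files plus form data)
spring.servlet.multipart.max-request-size=100MB
```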
Conclusion
This article compared the row-by-row insertion approach with batch insertion and multithreading: batching cut the time for 10,000 records from 1833s to 82s, and multithreading brought it down further to 60s. Since I work in the big data industry myself, I handle large volumes of data on a daily basis, and for importing and exporting data at scale I still recommend processing on the backend command line.
For example, when importing more than 100,000 records into MySQL, you can use the LOAD DATA INFILE command to directly import the file content into the table, which is very efficient. If you want to implement this solution in Java, you can combine it with shell in the controller. This approach is slightly more complex, but those who are interested can give it a try.
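For reference, a LOAD DATA statement along these lines would bulk-load the file (the file path and CSV delimiters here are assumptions matching the comma-separated format used in the examples above):

```sql
-- requires the file to live under a directory permitted by secure_file_priv
LOAD DATA INFILE '/var/lib/mysql-files/test.csv'
INTO TABLE test
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```

Note that MySQL's secure_file_priv setting restricts which directories LOAD DATA INFILE may read from, so the uploaded file must first be written somewhere the server is allowed to access.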
This is a separate discussion topic split from the original post at https://juejin.cn/post/7369111231834030118



