A Record of a Lindorm Query Timeout Issue



One morning, I discovered that an IoT platform I am responsible for had a severe MQ backlog (growing by around one million messages per hour). We had encountered this problem before, and it was almost always caused by a slowdown in the OSS service we depend on: sometimes other businesses saturated OSS bandwidth, and sometimes OSS itself imposed flow limits under high concurrency.

Troubleshooting

After arriving at the company, I first checked the relevant logs (numerous ERROR alerts had already fired due to severe Lindorm timeouts). Based on the alert logs, I found that the database (Lindorm) connection pool was completely exhausted (we use Druid); the error stack trace was roughly as follows.

org.mybatis.spring.MyBatisSystemException: nested exception is org.apache.ibatis.exceptions.PersistenceException: 
### Error updating database.  Cause: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is com.alibaba.druid.pool.GetConnectionTimeoutException: wait millis 30000, active 20, maxActive 20, runningSqlCount 2 : select  ts
                ,metric
                ,val
        from    monitor_metric
        where   app_bind_rel_id = ?
        and     metric in
         (  
            ?
         ) 
        and     ts >= ? and ts <= ?
        order by app_bind_rel_id, metric, ts
runningSqlCount 1 : SELECT  app_bind_rel_id,oss_key,lng,angle,gmt_create,speed,lat,ts  FROM data_track_point 

WHERE (app_bind_rel_id = ? AND ts > ? AND ts < ?)
runningSqlCount 6 : SELECT ts AS id, concat(?, CAST(lng AS varchar), ?, CAST(lat AS varchar), ?) AS coord, ts
FROM data_track_point
WHERE app_bind…

The problem is easily identifiable from the error: the database connection pool is completely exhausted, and no JDBC connection can be obtained. In the vast majority of cases where a connection pool is exhausted, the cause is slow SQL! So I checked the execution time of Lindorm-related SQL using pre-configured logging, and it matched my guess exactly: queries took 40,000 to 50,000 ms, and writes took 20,000 to 30,000 ms, while normally both queries and writes are completed within 10 ms. It is no wonder the connection pool became exhausted with such slow responses.
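The numbers in the error above (`wait millis 30000, active 20, maxActive 20`) map directly onto Druid pool settings. The sketch below shows how such a pool might be configured; the JDBC URL, property values, and abandoned-connection settings are illustrative assumptions, not our actual production configuration.

```java
import com.alibaba.druid.pool.DruidDataSource;

public class LindormDataSourceConfig {
    // Hypothetical endpoint; a real Lindorm SQL address would go here.
    private static final String JDBC_URL = "jdbc:lindorm:sql://example-host:30060/default";

    public static DruidDataSource buildDataSource() {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl(JDBC_URL);
        // maxActive 20 matches "active 20, maxActive 20" in the error log:
        // once 20 connections are all busy running slow SQL, new callers must wait.
        ds.setMaxActive(20);
        // maxWait 30000 matches "wait millis 30000": after 30 s of waiting,
        // callers give up with GetConnectionTimeoutException.
        ds.setMaxWait(30_000);
        // Reclaim connections held far longer than the normal ~10 ms, so one
        // pathological query cannot pin a connection indefinitely.
        ds.setRemoveAbandoned(true);
        ds.setRemoveAbandonedTimeout(60);
        return ds;
    }
}
```

With every query suddenly taking 40+ seconds, a pool sized and timed like this exhausts itself almost immediately, which is exactly what the stack trace shows.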

My first assumption was that there was a problem with Lindorm (I really should have slapped myself here: after all, this is a mature service from Alibaba Cloud, and the probability of it having an issue is much lower than that of our own service having a problem; I wasted a lot of time on this). So I contacted the Lindorm team to ask if there was any issue with the Lindorm cluster, since even a temporary test table I created, containing only one row of data, took a very long time to query.

After the Lindorm team got involved, they identified the slow SQL and asked our business team to conduct a self-check. But we had not released any new versions recently, so we really could not find the source of the problem at first. The Lindorm team found through the monitoring platform that the read volume of the `data_track_point` table had increased sharply, and asked us to check if there had been any changes to the query method or any major changes to input parameters.

After troubleshooting for a long time, we still had no clue. We checked all read operations related to this table and found no issues (the main service indeed had not been modified).

So we asked the Lindorm team whether they had SQL flow limiting capabilities like dbservice, and got no answer (they likely do not support this yet). They then suggested we try isolating the problematic table so it would not affect the entire cluster's service. We performed the isolation operation on the table first, but it had little to no effect: responses ended up being randomly fast or slow. After further discussion, they explained that because we use the light client (SQL) access method, the isolation policy did not take effect. With this access method, Lindorm routes requests to randomly selected nodes: requests routed to the node hosting the problematic table are very slow, while requests routed to other nodes are fast.

Recovery

So the business team decided to implement flow limiting first. Handling incoming HTTP requests was very simple: since we had already integrated Sentinel, configuring a flow limiting rule in Sentinel took effect immediately. I have to praise Sentinel here: I strongly recommend that any online production service integrate it; it is a true lifesaver. Without Sentinel, we would have had to modify code and release a new version, and even that might not have solved the problem, because Lindorm's responses were extremely slow, and obtaining a connection during application startup would have been particularly slow.
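For context, a Sentinel QPS rule pushed at runtime looks roughly like the sketch below. The resource name `"trackPointQuery"` and the QPS threshold are assumptions for illustration; the real resource name has to match whatever the HTTP handler is registered as.

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

public class TrackPointFlowLimit {
    public static void applyRule() {
        FlowRule rule = new FlowRule();
        // Hypothetical resource name; must match the @SentinelResource or
        // SphU.entry name used by the endpoint being limited.
        rule.setResource("trackPointQuery");
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        // Illustrative cap: allow 10 requests/second while the cluster recovers.
        rule.setCount(10);
        // Rules loaded this way apply immediately, with no code change or redeploy.
        FlowRuleManager.loadRules(Collections.singletonList(rule));
    }
}
```

In practice rules are usually pushed from the Sentinel dashboard or a config center rather than hard-coded, which is what made the mitigation instant in our case.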

But besides HTTP requests, there were also triggers dispatched via Spring Event (these requests were actually not the source of the problem); luckily, we were able to disable them with a workaround configuration.
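A config-driven kill switch for such listeners can be sketched with Spring Boot's `@ConditionalOnProperty`. The event type, listener class, and property name here are all hypothetical; the point is only that flipping one property disables the trigger path without a release.

```java
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

// Hypothetical listener: only registered when the toggle property is true
// (or absent), so setting trigger.track-point.enabled=false switches it off.
@Component
@ConditionalOnProperty(name = "trigger.track-point.enabled",
        havingValue = "true", matchIfMissing = true)
public class TrackPointTriggerListener {

    // Hypothetical event published when a device reports a position.
    public record DeviceReportedEvent(String deviceId) {}

    @EventListener
    public void onDeviceReported(DeviceReportedEvent event) {
        // ... range-query data_track_point for this device and evaluate triggers ...
    }
}
```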

After all excess traffic was blocked, the Lindorm cluster began to recover.

Root Cause Identification

After the business recovered, we started working to find the real root cause. At first we kept suspecting that a faulty Lindorm node was causing slow queries that eventually dragged down the entire cluster: since every SQL query became slow after the initial slow SQL, it was really hard to tell which side the problem was on. We also considered whether it was a query avalanche caused by growing data volume, but that type of issue usually causes a linear or gradual slowdown with clear warning signs, while ours was a sudden complete outage, so this was unlikely to be the cause.

Finally, when checking recent events, we found that there was a full OTA upgrade for all devices overnight. We then suspected that this might be the cause, and the upgrade time point perfectly matched the sharp increase in read volume to Lindorm. We carefully checked the input parameters of the rate-limited interface, and found that some requests passed 0 as the start time. This interface uses the start time to perform a range query on `data_track_point`, so we had basically located the problem at this point.

We asked the client-side team why 0 was passed as the start time, and they told us that the device's system time is reset to 0 after an OTA. But we had done many OTA upgrades before, so why did the problem appear only this time? The main reason is that this data was previously stored in MySQL for only 1 month; after switching to Lindorm, we store 3 months of data (and we use the cold storage feature). Compared to MySQL, query performance is not Lindorm's strength, and scanning such a large range of data is a nightmare for any database.
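The kind of server-side validation that would have caught the `ts = 0` input can be sketched in a few lines: clamp the client-supplied start time to the retention window before building the range query. The 90-day window and the method name are assumptions for illustration.

```java
import java.time.Duration;

public class TrackPointQueryGuard {
    // Assumed retention: Lindorm keeps roughly 3 months of track points.
    static final Duration RETENTION = Duration.ofDays(90);

    /**
     * Clamps a client-supplied start timestamp (epoch millis) so a device whose
     * clock was reset to 0 after an OTA cannot trigger a near-full-table scan.
     */
    static long clampStartTs(long requestedStartTs, long nowMillis) {
        long oldestAllowed = nowMillis - RETENTION.toMillis();
        // ts = 0 (or any pre-retention value) is pulled up to the window edge.
        return Math.max(requestedStartTs, oldestAllowed);
    }
}
```

With this guard in place, a device sending `ts = 0` queries at most 90 days of data instead of the entire table, regardless of what its clock says.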

Lessons Learned

  • All input from client/device endpoints must be validated
  • Range queries on large tables must be handled with extreme, extreme caution
  • Build flow limiting and degradation capabilities for online production business in advance. When encountering this type of problem, implement flow limiting and degradation first to ensure the majority of the business remains unaffected
  • Isolate different modules as much as possible. This makes troubleshooting much easier, and also simplifies implementing flow limiting when needed

This discussion topic was separated from the original thread at https://juejin.cn/post/7368820207592898569