1. Background
Recently, some users reported that content in one of our company's apps was failing to load. We investigated, documented the process, and are sharing it here.
Before getting into the details, let's review some basics:
MTU (maximum transmission unit) is a layer 2 concept: the maximum size of a frame's payload, excluding the frame header and the FCS. The standard Ethernet MTU is 1500 bytes.
MSS (maximum segment size) is the maximum TCP payload size, which each side advertises in its SYN or SYN-ACK packet. Intermediate devices can rewrite the advertised MSS; this is known as MSS clamping.
In practice, when transmitting large amounts of data, TCP tends to send fully loaded packets, that is, segments carrying a full MSS of payload.
The frame length displayed in Wireshark does not include the 4-byte FCS.
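The relationship between these numbers can be sketched in a few lines of Python. This is a minimal illustration that assumes IPv4 and TCP headers without options (20 bytes each) and a 14-byte Ethernet header; the function names are made up for this example.

```python
# Header sizes in bytes, assuming IPv4 and TCP without options.
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
IP_HEADER = 20
TCP_HEADER = 20


def mss_from_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one IP packet of size `mtu`."""
    return mtu - IP_HEADER - TCP_HEADER


def frame_length_from_mtu(mtu: int) -> int:
    """Frame length as Wireshark shows it: Ethernet header included, FCS excluded."""
    return mtu + ETH_HEADER


print(mss_from_mtu(1500))           # 1460
print(frame_length_from_mtu(1500))  # 1514
```

With the standard 1500-byte MTU, a fully loaded frame therefore shows up in Wireshark with a Length of 1514.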
2. Troubleshooting
Let's start with a packet capture screenshot. See if you can spot the anomaly:
From this capture, you can immediately see that in the last few packets, the Client keeps retransmitting, but the Server does not send an ACK.
Tracing back from the retransmissions, you will find that the packet being retransmitted is packet 16, sent by the Client. Meanwhile, packets sent by the Server were ACKed by the Client, which confirms that the Server and Client can still communicate, but the Server never received packet 16.
So the question becomes: why didn't the Server receive packet 16?
Let's look at the characteristics of packet 16: its Length is 1514 (the 1500-byte MTU plus the 14-byte Ethernet header), so it is a fully loaded data frame.
Looking further back, among the four consecutive packets 6, 7, 8, and 9 sent by the Server, the first three all have a Length of 1506. This raises a question: why is the Length not 1514? A 1506-byte frame carries 1506 - 14 - 20 - 20 = 1452 bytes of TCP payload (assuming no IP or TCP options), so it is very likely that an intermediate device rewrote the MSS in the SYN packet sent from Client to Server, changing it to 1452. We have good reason to suspect that the MTU somewhere along the path is smaller than 1500.
Going back even further to packet 2, we can see that the SYN-ACK packet sent from Server to Client carries an MSS of 1460, the standard value, unchanged. This suggests that the intermediate device handles MSS incorrectly and clamped it in only one direction. As a result, the Client is unaware of the smaller MTU along the path and keeps sending fully loaded 1514-byte frames, which are dropped.
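The arithmetic behind this reasoning can be checked with a short sketch. It assumes IPv4 and TCP headers without options, and it further assumes that the clamped MSS of 1452 reflects the real bottleneck, which would imply a path MTU of 1492; the capture itself does not state that value.

```python
ETH_HEADER = 14
IP_HEADER = 20   # assumption: IPv4 without options
TCP_HEADER = 20  # assumption: TCP without options


def tcp_payload(frame_len: int) -> int:
    """TCP payload carried by a frame of the length Wireshark displays."""
    return frame_len - ETH_HEADER - IP_HEADER - TCP_HEADER


def ip_packet_size(frame_len: int) -> int:
    """Size of the IP packet inside the frame, which is what the MTU limits."""
    return frame_len - ETH_HEADER


print(tcp_payload(1506))  # 1452: Server segments, limited by the clamped MSS
print(tcp_payload(1514))  # 1460: Client segments, limited by the unclamped MSS

implied_path_mtu = ip_packet_size(1506)  # 1492, an inference, not a measurement
client_packet = ip_packet_size(1514)     # 1500
print(client_packet > implied_path_mtu)  # True: the Client's packet would not fit
```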
The above only provides partial evidence. How can we confirm this further?
Use ping to send a fully loaded packet with the Don't Fragment flag set:
If you are lucky, you will get a clear response like the one below. We got exactly this :-), which confirms our earlier guess.
If you do not get a clear response, you can still find the minimum MTU along the path by gradually reducing the payload size with the -s parameter.
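The manual probing can also be automated. The sketch below is one possible approach, not the procedure used in this investigation: it swaps the gradual reduction for a binary search over the payload size. `ping_df` assumes the Linux iputils flags (`-M do` sets Don't Fragment, `-s` sets the payload size); other platforms use different flags. The search takes the probe as a parameter so it can be tried without a network.

```python
import subprocess
from typing import Callable, Optional

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_df(host: str, payload: int) -> bool:
    """Send one ping with Don't Fragment set; True if an echo reply came back.

    Assumes Linux iputils ping; flags differ on macOS and Windows.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_path_mtu(probe: Callable[[int], bool],
                  low: int = 0, high: int = 1472) -> Optional[int]:
    """Binary-search the largest payload that passes and return the implied MTU.

    Returns None if even the smallest payload gets no reply.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid       # mid fits, search the upper half
        else:
            high = mid - 1  # mid is too large, search the lower half
    return low + ICMP_OVERHEAD


# Against a real host (hypothetical name):
#   find_path_mtu(lambda size: ping_df("server.example.com", size))
# Simulated path with a 1492-byte MTU:
print(find_path_mtu(lambda size: size + ICMP_OVERHEAD <= 1492))  # 1492
```

A lost reply is indistinguishable from an oversized packet here, so on a lossy path each probe should be repeated before trusting the result.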
If you want to find out exactly which hop has the abnormal MTU, you can probe by specifying the TTL:
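A rough sketch of TTL-based probing follows, again assuming Linux iputils ping (`-t` sets the TTL). The idea, as an assumption about how routers typically behave: a hop before the bottleneck answers a fully loaded Don't Fragment packet with "Time to live exceeded", while the first TTL that gets no such answer points at the hop where the packet no longer fits. Matching on ping's output text is a heuristic, and the function names are invented for this example.

```python
import subprocess
from typing import Callable, Optional


def hop_answers(host: str, ttl: int, payload: int = 1472) -> bool:
    """True if a fully loaded DF ping with this TTL got an echo reply or a
    'Time to live exceeded' message. Assumes Linux iputils ping output."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do",
         "-t", str(ttl), "-s", str(payload), host],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 or "Time to live exceeded" in result.stdout


def first_failing_ttl(probe: Callable[[int], bool],
                      max_ttl: int = 30) -> Optional[int]:
    """Return the first TTL for which the probe fails, or None if all pass."""
    for ttl in range(1, max_ttl + 1):
        if not probe(ttl):
            return ttl
    return None


# Against a real host (hypothetical name):
#   first_failing_ttl(lambda ttl: hop_answers("server.example.com", ttl))
# Simulated path where hops 1 to 3 answer and the packet dies after hop 3:
print(first_failing_ttl(lambda ttl: ttl <= 3))  # 4
```

Some routers never send ICMP errors at all, so a failing TTL is a hint to compare against a traceroute with small packets rather than proof.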
The analysis is not finished yet. Let's look deeper from the business perspective. We know that the request in question is an ordinary HTTP API GET request, which does not involve uploading data. Pause for a moment: can you spot what is anomalous here?
If you didn't spot it, let's look at another TLS-decrypted HTTP API GET request and see if you can find the clue:
Now I will reveal the answer:
Generally speaking, the raw size of an API request sent by the Client is far smaller than 1460 bytes. With HTTP/2 header compression, the request is smaller still, so the Client very rarely sends a fully loaded packet.
So we confirmed with our business team: it turns out that due to new business requirements, some very large parameters were added to requests in the new app version. User reports of this issue also increased after the new version launched, and rolling back to the old version avoids triggering the problem. The business team is also implementing corresponding optimizations based on our feedback.
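Why request size matters can be illustrated with a small sketch of how TCP splits a request into segments. The request sizes below are hypothetical, not the values from our captures, and the sketch ignores TLS record overhead and assumes the Client's unclamped MSS of 1460.

```python
from typing import List


def segment_sizes(request_bytes: int, mss: int = 1460) -> List[int]:
    """Split a request into TCP segment payload sizes: full MSS-sized
    segments first, then one smaller segment for the remainder."""
    full, rest = divmod(request_bytes, mss)
    return [mss] * full + ([rest] if rest else [])


# Hypothetical sizes, chosen only for illustration.
print(segment_sizes(800))   # [800]: no fully loaded segment, passes the path
print(segment_sizes(3000))  # [1460, 1460, 80]: two fully loaded segments
```

A request below the MSS never produces a fully loaded 1514-byte frame, which is consistent with the old version not triggering the problem.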
Old version request size:
New version request size:
Tip: If you want to observe TLS-decrypted packets, you can set the SSLKEYLOGFILE environment variable to log the TLS session keys, then point Wireshark at that file to decrypt the capture. Chrome, curl, Firefox, and other tools support this method.
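The same key log can be produced from your own test client. As a minimal sketch, Python's standard `ssl` module exposes `SSLContext.keylog_filename` (Python 3.8+ with OpenSSL 1.1.1+); the file name and the helper function below are assumptions made for this example.

```python
import os
import ssl
import tempfile


def make_keylog_context(keylog_path: str) -> ssl.SSLContext:
    """Create a default TLS client context that appends session keys to
    keylog_path in the format Wireshark reads."""
    ctx = ssl.create_default_context()
    ctx.keylog_filename = keylog_path  # opens (and creates) the file
    return ctx


# Hypothetical location for the key log.
keylog_path = os.path.join(tempfile.gettempdir(), "tls_keys.log")
ctx = make_keylog_context(keylog_path)

# Pass the context to a client while capturing traffic, for example:
#   import urllib.request
#   urllib.request.urlopen("https://example.com/", context=ctx)
print(keylog_path)
```

In Wireshark, the file is then selected as the "(Pre)-Master-Secret log filename" in the TLS protocol preferences.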
3. Conclusion
This concludes our analysis. Packet capture and analysis is a very efficient debugging method, and it has helped me solve many problems.
However, when I handled this case, the process was far less smooth than this article makes it sound. I was not fluent in applying many of these small pieces of knowledge and did not connect them in the early stages of the investigation. Fortunately, a problem can be examined from multiple angles; by keeping the question in mind and gradually gathering evidence and knowledge, I finally solved the case. I wrote this record in the hope that it will be helpful to you.
Recommended Reading:
Everything About MTU and MSS: https://www.kawabangga.com/posts/4983 (and other articles by this blogger)
Wireshark Network Analysis Is That Simple
The Art of Wireshark Network Analysis
-End-
Author | House
This is a discussion topic separated from the original topic at https://www.bilibili.com/read/cv36265123/





