The Mystery of Path MTU MSS Clamping
I had to get my production server to clamp MSS (size of TCP data “segment” in bytes) to path MTU.
I have CenturyLink fiber at my house, which means PPP-over-Ethernet. I have a Dell R530 rackmount server in my basement that I have set up to route from inside my house to CenturyLink’s network, and vice versa. I wanted to avoid the lowest-bidder-grade WiFi router that CenturyLink makes you buy or rent when you get a fiber connection.
I’m not going to lie: I also did (and still do) extensive DNS ad blocking, and I didn’t want to give that up.
Mystifying symptoms
When I got CenturyLink fiber, I already had a radio modem from Verso Networks, with which I replaced a DSL connection. I originally chose Verso to escape the incredibly annoying Comcast/CenturyLink oligopoly that mostly provides internet to Denver. Verso Networks (the company) was great to work with, and had responsive customer service. Alas, their radio modem was slower than a fiber optic connection.
I got the Dell R530 at about the same time I got the fiber optic connection, so I had the luxury of getting fiber to work with my new computer, while the old server worked with the Verso radio modem.
Getting PPP-over-Ethernet working was tricky, but not impossible. You, too, can do this. Don’t be afraid to say “no” to a vendor-provided crappy WiFi router. The troubles started after I experimentally cut over my entire house to routing by the R530.
TCP connections would always work, but sometimes data didn’t flow back. No other error appeared. It was very site-specific. Google would work. Yahoo would not. Did I note that this was during the shelter-in-place part of the pandemic? My kids could get Zoom to work to get into their classes, but an educational activity site named “Kahoot” wouldn’t work properly. DNS worked properly, contradicting the old system administration adage.
MTU
MTU means “maximum transmission unit”, basically the size in bytes of the biggest single packet a network interface is set up to send without fragmenting it.
You can find MTU with commands like `ip link show dev eth0`.
I just looked up MTU on various machines and interfaces.
I found 1500 on every machine and interface,
except for device `ppp0` on my Dell R530, where it is 1492.
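PPP-over-Ethernet uses 8 bytes of every Ethernet frame’s payload: a 6-byte PPPoE header plus a 2-byte PPP protocol ID. That accounts for the difference between 1500 and 1492.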
The loopback interfaces (`lo`) have an MTU of 65536,
but well, they’re loopbacks.
I had never thought to look into this until now.
MSS
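MSS stands for “maximum segment size”. “Segment” is TCP jargon for the block of bytes of user data in a TCP packet, and MSS is the biggest such block a host will accept. MSS is less than MTU because the IP and TCP headers both use up some bytes of the MTU for their important information: at least 20 bytes each with IPv4, so a 1500-byte MTU leaves an MSS of at most 1460. The ethernet (or WiFi) header doesn’t count against the MTU.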
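The arithmetic is simple enough to sketch in shell. This assumes minimum-size headers with no IP options (20 bytes for IPv4, 40 for IPv6, 20 for TCP) and treats the 12 bytes that Linux typically spends on the TCP timestamp option as an optional extra. The function name `mss_for_mtu` is made up for this example, not part of any tool.

```shell
#!/bin/bash
# Sketch: largest TCP payload that fits in one packet of a given MTU.
# Assumes minimum-size headers: IPv4 20 bytes, IPv6 40 bytes, TCP 20 bytes.
# mss_for_mtu is a made-up helper name, not a standard command.
mss_for_mtu() {
    local mtu=$1
    local ipver=${2:-4}      # 4 or 6, defaults to IPv4
    local tcp_opts=${3:-0}   # bytes of TCP options, e.g. 12 for timestamps
    local ip_hdr=20
    if [ "$ipver" = 6 ]; then ip_hdr=40; fi
    echo $(( mtu - ip_hdr - 20 - tcp_opts ))
}

mss_for_mtu 1500        # plain Ethernet, IPv4: 1460
mss_for_mtu 1492        # PPPoE, IPv4: 1452
mss_for_mtu 1500 4 12   # Ethernet, IPv4, TCP timestamps: 1448
mss_for_mtu 1492 4 12   # PPPoE, IPv4, TCP timestamps: 1440
```

The last two results, 1448 and 1440, are the `mss` values that show up next to `pmtu:1500` and `pmtu:1492` in the `ss` output further down.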
The Wikipedia article is interesting. It says that the default MSS for IPv4 is 536 bytes, but for IPv6 it’s 1220 bytes. I wonder if this is why CenturyLink does not give out IPv6 addresses or prefixes for its fiber optic customers.
I’ll quote a bit of the article:
Where a host wishes to set the maximum segment size to a value other than the default, the maximum segment size is specified as a TCP option, initially in the TCP SYN packet during the TCP handshake. The value cannot be changed after the connection is established.
I did not know that a SYN handshake packet carried this info.
The `iptables` rule (see below) causes the MSS on TCP connections routed
over the PPP connection to be at most 1452 (the 1492-byte MTU minus 40 bytes of IPv4 and TCP headers),
instead of the 1460 that goes with a 1500-byte MTU.
If the MSS value can’t be changed after the connection is established, what happens when Path MTU discovery for a TCP connection gives you an MTU value smaller than the MSS set in the SYN packet? It appears the Linux kernel takes care of this. See the MSS/MTU data below for examples.
MSS is per TCP connection, so it’s a little harder to find. Here’s my script:
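#!/bin/bash
set -euo pipefail
# Per-socket TCP info lines from "ss -it" begin with a tab; keep only those,
# then trim everything before "mss" and everything from " cwnd:" onward.
ss -it | grep $'^\t' |
sed -e 's/^..* mss/mss/' -e 's/ cwnd:..*$//'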
The regular expression given to `grep` is “lines beginning with a tab character”.
Here are some things the above script prints out on various machines:
mss:1448 pmtu:1500 rcvmss:856 advmss:1448
mss:1440 pmtu:1500 rcvmss:1440 advmss:1448
mss:1428 pmtu:1500 rcvmss:536 advmss:1428
mss:1448 pmtu:1500 rcvmss:536 advmss:1448
mss:1448 pmtu:1500 rcvmss:1448 advmss:1448
mss:1440 pmtu:1492 rcvmss:1440 advmss:1440
mss:1400 pmtu:1500 rcvmss:798 advmss:1448
mss:1440 pmtu:1500 rcvmss:555 advmss:1448
mss:32741 pmtu:65535 rcvmss:536 advmss:65483
mss:1448 pmtu:1500 rcvmss:1448 advmss:1448
mss:1400 pmtu:1500 rcvmss:1400 advmss:1448
mss:32768 pmtu:65535 rcvmss:576 advmss:65483
The big MSS values like 32741 are sockets from 127.0.0.1 to 127.0.0.1. No socket going over WiFi or cables or PPP has such a large MSS.
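One way to read lines like these is to subtract `mss` from `pmtu`, which gives the bytes each packet spends on IP and TCP headers and options. Here is a rough sketch with `awk` that takes the trimmed lines on standard input. The name `overhead` is made up for this example.

```shell
#!/bin/bash
# Sketch: for each "mss:N pmtu:M ..." line on stdin, print pmtu - mss.
# overhead is a hypothetical helper, not part of ss or iproute2.
overhead() {
    awk '{
        mss = ""; pmtu = ""
        for (i = 1; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[1] == "mss")  mss  = kv[2]
            if (kv[1] == "pmtu") pmtu = kv[2]
        }
        if (mss != "" && pmtu != "")
            print "mss=" mss, "pmtu=" pmtu, "overhead=" (pmtu - mss)
    }'
}

# Three of the lines from the listing above.
printf '%s\n' \
    'mss:1448 pmtu:1500 rcvmss:856 advmss:1448' \
    'mss:1440 pmtu:1492 rcvmss:1440 advmss:1440' \
    'mss:1400 pmtu:1500 rcvmss:798 advmss:1448' | overhead
```

An overhead of 52 is what 20 bytes of IPv4 header, 20 bytes of TCP header and 12 bytes of timestamp option add up to. A larger overhead, like the 100 on the third line, presumably means the peer or something along the path asked for a smaller MSS.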
One TCP/IP connection from both ends
I did `ssh my.vps` on my Dell R530,
which should open and maintain a TCP connection from my R530
to the Debian VPS hosting my blog.
I ran `ss -it` on both ends to see what the connection looked like.
which end | MSS | PMTU | RCVMSS | ADVMSS |
---|---|---|---|---|
R530 | 1440 | 1492 | 1080 | 1440 |
VPS | 1440 | 1500 | 976 | 1448 |
I ran `ss -it` several times to see what was up.
The value of RCVMSS for the VPS would not stay constant.
- MSS - maximum segment size
- ADVMSS - advertised MSS, what the end machine sends as its own maximum TCP segment size
- RCVMSS - I have not been able to find a coherent definition or explanation
- PMTU - Path Maximum Transmission Unit
One definition of RCVMSS I found doesn’t entirely make sense. If RCVMSS is the maximum segment size you let peers know you will accept, why is PMTU 1500, but RCVMSS 976?
On the VPS end of the socket, PMTU of 1500 is bigger than both MSS and ADVMSS. What’s up with that?
I guess I still don’t understand why an MSS bigger than Path MTU causes problems. Mechanisms exist, like ICMP Fragmentation Needed messages, or Path MTU discovery. It seems like this is a well-known issue. Why do beloved sites like Duck Duck Go seem to be plagued by this?
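One way to poke at this by hand is to send pings with the Don’t Fragment bit set and see which sizes get through. This is a sketch, assuming IPv4 and the Linux iputils `ping`, where `-M do` forbids fragmentation and `-s` sets the ICMP payload size. The helper names are made up, and `example.com` is a placeholder host.

```shell
#!/bin/bash
# ICMP echo payload that makes an IPv4 packet of exactly $1 bytes:
# subtract 20 bytes of IPv4 header and 8 bytes of ICMP header.
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send one unfragmentable ping sized for the given MTU.
# Succeeds only if a reply comes back within 2 seconds.
# probe_mtu is a hypothetical helper; it needs network access to run.
probe_mtu() {
    local host=$1
    local mtu=$2
    ping -c 1 -W 2 -M do -s "$(ping_payload_for_mtu "$mtu")" "$host"
}

ping_payload_for_mtu 1492   # 1464
ping_payload_for_mtu 1500   # 1472
# probe_mtu example.com 1492   # uncomment to try against a real host
```

If a probe sized for 1500 gets neither a reply nor an error while one sized for 1492 works, something on the path is probably dropping the ICMP Fragmentation Needed messages that Path MTU discovery depends on. That would be one possible explanation for connections that open but then stall.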
Try it at home!
I changed a few things on my server, and then rebooted.
I disabled `pmtu-clamping.service`.
I commented out a few lines in `/etc/iptables/iptables.rules`:
#-A FORWARD -o ppp0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
#-A FORWARD -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
After rebooting, interface `ppp0` still has the same MTU.
1031 % ip link show dev ppp0
9: ppp0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1492 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 3
link/ppp
The lynx command line web browser
running on my server directly
was able to conduct a search via `duckduckgo.com`,
but Firefox on my laptop, connected via WiFi, was not.
After trying a few things to see what worked and what didn’t, I restarted PMTU clamping.
1033 % sudo systemctl enable pmtu-clamping.service
[sudo] password for bediger:
Created symlink '/etc/systemd/system/multi-user.target.wants/pmtu-clamping.service' -> '/etc/systemd/system/pmtu-clamping.service'.
1034 % sudo systemctl start pmtu-clamping.service
Now, all is well.
One puzzle: executing `iptables-save`
before turning MSS clamping back on
doesn’t show one of the 2 `iptables` rules I commented out.
`iptables-save` only shows
-A FORWARD -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
The similar rule pertaining to `ppp0` does not appear.
I wonder if the rule specific to `ppp0` does anything.
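One way to find out would be to look at per-rule packet counters: `iptables-save -c` prefixes each rule with `[packets:bytes]`. Below is a sketch that picks the TCPMSS rules out of that output on standard input. The name `tcpmss_counters` is made up, and the sample input is invented for illustration, not real data from any router.

```shell
#!/bin/bash
# Sketch: from "iptables-save -c" output on stdin, print the packet count
# and rule text of every rule that jumps to the TCPMSS target.
# tcpmss_counters is a made-up name.
tcpmss_counters() {
    sed -n 's/^\[\([0-9]*\):[0-9]*] \(.* -j TCPMSS .*\)$/\1 packets: \2/p'
}

# Invented sample input. On a real router, as root, use:
#   iptables-save -c | tcpmss_counters
printf '%s\n' \
    '[0:0] -A FORWARD -o ppp0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu' \
    '[312:18720] -A FORWARD -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu' \
    '[9:540] -A FORWARD -i ppp0 -j ACCEPT' | tcpmss_counters
```

A rule whose packet count stays at 0 while traffic is flowing is never being matched.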
Ultimately, I don’t have any suggestions about when to diagnose
that you need to clamp MSS to PMTU.
There are no real symptoms except confusing irregularities in networking,
where TCP connections (but not UDP!) get set up,
but sometimes data doesn’t flow back and forth.
I didn’t find this solution by collecting facts,
eliminating possibilities by logical deduction or experimentation.
I got frustrated by the circumstances, and posted on the Arch Linux forums,
where someone suggested trying the `iptables` rule above.
I found out a few things while writing this blog entry,
but I still have at least three mysteries I wasn’t able to find an answer to.
Systemd Service File
Below, the contents of my Path MTU clamping Systemd service file,
`/etc/systemd/system/pmtu-clamping.service`:
[Unit]
Description=PMTU clamping for pppoe
Requires=iptables.service
After=iptables.service
[Service]
Type=oneshot
ExecStart=/usr/bin/iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
[Install]
WantedBy=multi-user.target
This prescribes a “oneshot” service: it runs, it works, systemd forgets about it.
It’s always weird to get a service to run at a particular point in a Linux
system’s initial boot, but the `Requires=` and `After=` lines are one such attempt.
I want this `iptables` command to run after NAT gets set up.