TCP client hangs intermitantly in SYN_SENT for 10 seconds
Disabling tcp_tw_reuse and tcp_tw_recycle solve the problem
We have some high volume media servers that have benefitted from some selective TCP parameter tuning. These Debian Linux machines have stageful firewalling on the box (netconn_track), a moderate number of actively communicating connections (hundreds to thousands of clients) and high bandwidth (hundreds of megabits per second to multiple gigabits per second.)
Some years ago when ran into a problem with the number of TCP connections consuming too much memory, both for the stack and netconn_track table. Some of the suggested solutions we implemented included...
- Enabling
tcp_tw_recycle: "It enables fast recycling ofTIME_WAITsockets. The default value is 0 (disabled). The sysctl documentation incorrectly states the default as enabled. It can be changed to 1 (enabled) in many cases. Known to cause some issues with hoststated (load balancing and fail over) if enabled, should be used with caution." * - Enabling
tcp_tw_reuse: "This allows reusing sockets inTIME_WAITstate for new connections when it is safe from protocol viewpoint. Default value is 0 (disabled). It is generally a safer alternative totcp_tw_recycle" *
While we were loath to tune the TCP kernel parameters (modern kernels tune themselves very well, non-TCP-experts tend to screw it up), there was a notable improvement in the systems. Later on, when we had problems with the netconn_track table filling up, we tweaked two more parameters:
- Setting
ip_conntrack_maxto 64k to 128k - Setting
ip_conntrack_tcp_timeout_establishedto 5 minutes, from the default of 5 days
Things have been working very well for some time. Until this week, that is. On Monday I started getting reports of sluggishness in our application. As the app is under heavy development our first debugging was occurring there. Once I realized the app was fine I had a look at the TCP connections with a simple script checking the connection states on both client and server.
while [ 1 ]
do
date
netstat -n | grep 195.49.105.149
echo
sleep 1
done
Below is the output from the client. Immediately the problem jumped out: the client TCP connection was hanging in the SYN_SENT state for ~10s!
Note that this problem was occurring not only with HTTPS (as in the above output) but also HTTP and SSH/SCP. The problem was definitely in the TCP layer.
Thu May 3 15:58:14 CEST 2012
Thu May 3 15:58:15 CEST 2012
tcp4 0 0 192.168.1.69.63583 195.49.105.158.443 SYN_SENT
tcp4 0 0 192.168.1.69.63582 195.49.105.158.443 SYN_SENT
Thu May 3 15:58:16 CEST 2012
tcp4 0 0 192.168.1.69.63583 195.49.105.158.443 SYN_SENT
tcp4 0 0 192.168.1.69.63582 195.49.105.158.443 SYN_SENT
-- snip --
Thu May 3 15:58:25 CEST 2012
tcp4 0 0 192.168.1.69.63583 195.49.105.158.443 SYN_SENT
tcp4 0 0 192.168.1.69.63582 195.49.105.158.443 SYN_SENT
Thu May 3 15:58:26 CEST 2012
tcp4 0 0 192.168.1.69.63583 195.49.105.158.443 ESTABLISHED
tcp4 0 0 192.168.1.69.63582 195.49.105.158.443 ESTABLISHED
-- snip --
Thu May 3 15:58:36 CEST 2012
tcp4 0 0 192.168.1.69.63583 195.49.105.158.443 ESTABLISHED
tcp4 0 0 192.168.1.69.63582 195.49.105.158.443 ESTABLISHED
Thu May 3 15:53:53 CEST 2012
Some research around the Interwebs left me nowhere. The best suggestion was to reboot and hope. That's not the kind of engineered remedy that I was in search of. Instead we walked back the first to TCP stack tuning parameters, and turn off tcp_tw_recycle and tcp_tw_reuse.
That did the trick! I'm fairly certain that things will be OK now, and that the later tuning parameters will protect us from the problems our tcp_tw_recycle and tcp_tw_reuse tweaks were intended to help with. If problems do creep up again with the volume of unused sockets we'll try tweaking only tcp_tw_reuse, as that is deemed safer.
How to write a custom Tomcat logging valve
Or, the joy and pain of subclassing org.apache.catalina.valves.AccessLogValve
In an attempt to aid a regular log analysis task of mine, I decided to write my own customer Tomcat valve. The idea was to pre-filter what I was after and log that to a file.
The result of my custom valve was an access log that rolled over monthly that contained:
- Only PUT requests
- A segment of the URI
- The User-Agent of the client
Unfortunately, the task of coding a custom valve is not as simple as implementing a Filter or extending ServletRequestWrapper. Fortunately, the scope of what I was trying to accomplish was limited. Below is my code for extending AccessLogValve and a couple of tips for anyone venturing into the world of custom logging valves.
Notes on extending AccessLogValve:
- AccessLogValve.invoke(request, response) always needs to be called. If you override it and do not call super.invoke() then the request stops there.
- AccessLogValve is not scoped to the request. It is not thread safe. Do not attempt to keep state unless you have a specific reason to have one request impact another requests' valve behavior.
- Make any the log/no-log decision in .log(Request request, Response response, long time).
- To trim the URI, and filter out the method and protocol from the log,I overrode org.apache.catalina.connector.Request. This was a little trickier than expected as, like AccessLogValve, there is no corresponding wrapper. In the below extension of Request, the code overrides only the methods that are called by the logging patter. If the logging pattern contains anything that might calla property that is not set, then an NPE very well be thrown.
The code: LogAbreviatedPutValve.java
package ch.geekomatic.catalina;
import java.util.Enumeration;
import org.apache.catalina.connector.Request;
import org.apache.catalina.connector.Response;
import org.apache.catalina.valves.AccessLogValve;
/**
* This log valve filters out non-PUT requests and trims the logged URI. This is only useful for providing an easy to parse access log
* to generate statistics on the clients that upload via PUT. The
* correct way to use this value is to set the logger pattern to:
*
* %r %{User-Agent}i
*
* Other patterns either will not work or may generate an exception.
*
* @author Stu Thompson
*
*/
public class LogAbreviatedPutValve extends AccessLogValve {
@Override
public void log(Request request, Response response, long time) {
if (request.getMethod() != null && request.getMethod().equals("PUT"))
super.log(new AbreviatedRequest(request), response, time);
super.invoke(arg0, arg1)
}
}
/**
* This class provides a trimmed down request URI which is only useful for providing an easy to parse access log
* to generate statistics on the clients that upload via PUT
*
* @author Stu Thompson
* @since v2.12.5
*/
class AbreviatedRequest extends Request {
private final Request _request;
AbreviatedRequest(Request request) {
_request = request;
}
/**
* Returns a trimemd RequestURI that does not contain any characters from the position of the last /, and hence the file name.
*
* @see org.apache.catalina.connector.Request#getProtocol()
*/
@Override
public String getRequestURI() {
String uri = _request.getRequestURI();
if (uri != null && uri.contains("/")) {
int pos;
// remove file
pos = uri.lastIndexOf("/");
uri = uri.substring(0, pos);
return uri;
}
}
return _request.getRequestURI();
}
/**
* The logged URI includes the method. Since we are already filtering only for PUT, return an empty string.
*
* @see org.apache.catalina.connector.Request#getMethod()
* @returns empty string
*/
@Override
public String getMethod() {
return "";
}
/**
* The logged URI includes the HTTP protocol. We have no interest in this for our special log. Return an empty string.
*
* @see org.apache.catalina.connector.Request#getProtocol()
* @returns empty string
*/
@Override
public String getProtocol() {
return "";
}
/**
* The logged URI includes the query string. We have no interest in this for our special log. Return null no matter what.
*
* @see org.apache.catalina.connector.Request#getQueryString()
* @returns null
*/
@Override
public String getQueryString() {
return null;
}
@Override
public String getHeader(String name) {
return _request.getHeader(name);
}
@Override
public Enumeration
The code: server.xml snippet
<Valve className="ch.geekomatic.catalina.LogAbreviatedPutValve"
directory="logs/access"
prefix="put_log." suffix=".txt" pattern="%r %{User-Agent}i"
fileDateFormat="yyyy-MM" resolveHosts="false"/>
Duplicate Session Errors on BlazeDS
Solution #2: Get off of Tomcat 7.0.0
A while back we discovered a reason behind "duplicate session errors" in BlazeDS, and I posted what fixed it for us. But the pesky little problem remained on a single server. Until recently, the cause eluded us. What, prey tell, was the culprit?
The problem server had only one significant difference from the happily running servers: Apache Tomcat 7.0.0, from two years ago. Once we rolled the Tomcat version back to the most recent Tomcat 6 version, 6.0.35, everything worked perfectly.
That's not the first bizarre problem I've had with Tomcat 7 that was solved by staying with Tomcat 6. It might be a while longer before trying Tomcat 7 again.
Duplicate Session Errors on BlazeDS
Solution: Use only one channel!
For the past week I've been coding with the BlazeDS 4 framework, in conjunction with my Flex-coding counterpart. Things were moving along quickly...until we brought in another Flex coder, Alex, who works on Windows. (My other colleague, Andy, and I work on OSX.) Alex was unable to get the app to work because he was receiving a strange error message:
Detected duplicate HTTP-based FlexSessions, generally due to the remote host disabling session cookies. Session cookies must be enabled to manage the client connection correctly.Very quickly on Teh Google we found other people experiencing the error two. (E.g.: "Duplicate session errors in LCDS/BlazeDS") But those solutions were either not helpful ("ensure a session is established from the start by a single call") or outright destructive hackiness, like "patch BlazeDS and remove the duplicate session detection." That last suggestion breaks any session-based functionality, like authorization.
In order to debug what was happening with the sessions a small tweak to the Apache Tomcat log valve was in order.
<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs/access"
prefix="access_m15_berea_log." suffix=".txt" pattern="%h %S %t '%r' %s %B %D %{Cookie}i %{Set-Cookie}o"
fileDateFormat="yyyy-MM" resolveHosts="false" />
This specialized logs records:
- %h: client IP address
- %S: Tomcat's session ID for the request
- %t: time
- %r: first like of the HTTP request; verb, URI and version
- %s: HTTP status code
- %B: number of non-header bytes
- %D: time taken to process request, in milliseconds
- %{Cookie}i: The value of the "Cookie" header in the request
- %{Set-Cookie}o: The value of the "Set-Cooking" header in the request
After restarting Tomcat so that the log changes could take effect, we had our client hit the server. The client, on start, does a couple of things. It subscribes to three topics on the messaging service and makes two "shared object" RPC calls. We can see this activity in the logs files below. (Edited for readability.)
"Five different sessions, for ten requests, over 2 seconds? That's crazy!"
I thought. And then I noticed that the session IDs a) seemed to have an affinity for the channel (streamingamf or amf), and b) that the session seemed to change after a switch from one channel to the other. Andy and I were stumped. How on earth could we work around this?
3.2.5.1 C7E0E [12:19:24] 'POST /ws/amf3/amf HTTP/1.1' 200 245 89 - C7E0E
3.2.5.1 C7E0E [12:19:24] 'POST /ws/amf3/amf;jsessionid=C7E0E HTTP/1.1' 200 138 205 C7E0E -
3.2.5.1 A6B02 [12:19:24] 'POST /ws/amf3/streamingamf HTTP/1.1' 200 969 40 - A6B02
3.2.5.1 A6B02 [12:19:24] 'POST /ws/amf3/streamingamf;jsessionid=A6B02 HTTP/1.1' 200 539 6 A6B02
3.2.5.1 A6B02 [12:19:25] 'POST /ws/amf3/streamingamf;jsessionid=A6B02 HTTP/1.1' 200 468 99 A6B02
3.2.5.1 A6B02 [12:19:25] 'POST /ws/amf3/streamingamf;jsessionid=A6B02 HTTP/1.1' 200 468 98 A6B02
3.2.5.1 A6B02 [12:19:25] 'POST /ws/amf3/streamingamf;jsessionid=A6B02 HTTP/1.1' 200 468 10 A6B02
3.2.5.1 D2E25 [12:19:25] 'POST /ws/amf3/amf;jsessionid=C7E0E HTTP/1.1' 200 1331 39 - D2E25
3.2.5.1 - [12:19:24] 'POST /ws/amf3/streamingamf;jsessionid=A6B02?command=open&version=1 HTTP/1.1'
200 916 444 - -
3.2.5.1 EC695 [12:19:25] 'POST /ws/amf3/streamingamf;jsessionid=A6B02?command=close&streamId=6877DFC8-030A&version=1 HTTP/1.1'
200 0 3 - EC695
And that's when I had a Eureka! moment. "What if we used the streaming AMF channel to make the shared object calls? And hence use only one channel?" It worked.
We, like I suspect a lot of BlazeDS users made some assumptions based upon the standard BlazeDS examples, like:
- "shared object calls are only made to the my-amf end point"
- "messaging must be done separately via the my-streaming-amf end point"
- "clients then maintain session affinity by using only one NetConnection object, even if they have multiple endpoints"
To that end, here is my simple sample BlazeDS configuration file. It replaces the four configuration files that come with BlazeDS. I've also added important notes on key properties so one doesn't need to run off to Adobe's documentation site.
If you've gotten this far you are probably thinking either "OMG, what a fucking idiot. Of course it works this way." But I bet quite a few of you are thinking "OMG, thanks for saving my ass." Either way I'd love to hear about it in the comments.
Raspberry Pi Image Download Page
Swiss-based Images with 4 Gbps Connectivity
I've got access to mad bandwidth, thanks to my employer xtendx AG. We'd like to be using Raspberry PI for some projects, so hosting this site is a great way to share the love.
Raspberry Pi Image Download Page