Connections in the FIN_WAIT_2 state and Apache

What is the FIN_WAIT_2 state?

netstat

The FIN_WAIT_2 state is somewhat unusual in that there is no timeout defined in the standard for it. This means that on many operating systems, a connection in the FIN_WAIT_2 state will stay around until the system is rebooted. If the system does not have a timeout and too many FIN_WAIT_2 connections build up, it can fill up the space allocated for storing information about the connections and crash the kernel. The connections in FIN_WAIT_2 do not tie up an httpd process.

But why does it happen?

Buggy clients and persistent connections

persistent connections

KeepAliveTimeout

The client opens a new connection to the same or a different site, which causes it to fully close the older connection on that socket.
The user exits the client, which on some (most?) clients causes the OS to fully shutdown the connection.
The FIN_WAIT_2 times out, on servers that have a timeout for this state.

If you are lucky, this means that the buggy client will fully close the connection and release the resources on your server. However, there are some cases where the socket is never fully closed, such as a dialup client disconnecting from their provider before closing the client. In addition, a client might sit idle for days without making another connection, and thus may hold its end of the socket open for days even though it has no further use for it. This is a bug in the browser or in its operating system's TCP implementation.

The clients on which this problem has been verified to exist:

Mozilla/3.01 (X11; I; FreeBSD 2.1.5-RELEASE i386)
Mozilla/2.02 (X11; I; FreeBSD 2.1.5-RELEASE i386)
Mozilla/3.01Gold (X11; I; SunOS 5.5 sun4m)
MSIE 3.01 on the Macintosh
MSIE 3.01 on Windows 95

This does not appear to be a problem on:

Mozilla/3.01 (Win95; I)

It is expected that many other clients have the same problem. What a client should do is periodically check its open socket(s) to see if they have been closed by the server, and close their side of the connection if the server has closed. This check need only occur once every few seconds, and may even be detected by a OS signal on some systems (e.g., Win95 and NT clients have this capability, but they seem to be ignoring it).

Apache cannot avoid these FIN_WAIT_2 states unless it disables persistent connections for the buggy clients, just like we recommend doing for Navigator 2.x clients due to other bugs. However, non-persistent connections increase the total number of connections needed per client and slow retrieval of an image-laden web page. Since non-persistent connections have their own resource consumptions and a short waiting period after each closure, a busy server may need persistence in order to best serve its clients.

As far as we know, the client-caused FIN_WAIT_2 problem is present for all servers that support persistent connections, including Apache 1.1.x and 1.2.

Something in Apache may be broken

One possible (and most likely) source for additional FIN_WAIT_2 states is a function called lingering_close() which was added between 1.1 and 1.2. This function is necessary for the proper handling of persistent connections and any request which includes content in the message body (e.g., PUTs and POSTs). What it does is read any data sent by the client for a certain time after the server closes the connection. The exact reasons for doing this are somewhat complicated, but involve what happens if the client is making a request at the same time the server sends a response and closes the connection. Without lingering, the client might be forced to reset its TCP input buffer before it has a chance to read the server's response, and thus understand why the connection has closed. See the appendix for more details.

We have not yet tracked down the exact reason why lingering_close() causes problems. Its code has been thoroughly reviewed and extensively updated in 1.2b6. It is possible that there is some problem in the BSD TCP stack which is causing the observed problems. It is also possible that we fixed it in 1.2b6. Unfortunately, we have not been able to replicate the problem on our test servers.

What can I do about it?

Add a timeout for FIN_WAIT_2

FreeBSD versions starting at 2.0 or possibly earlier.
NetBSD version 1.2(?)
OpenBSD all versions(?)
BSD/OS 2.1, with the K210-027 patch installed.
Solaris as of around version 2.2. The timeout can be tuned by using ndd to modify tcp_fin_wait_2_flush_interval, but the default should be appropriate for most servers and improper tuning can have negative impacts.
SCO TCP/IP Release 1.2.1 can be modified to have a timeout by following SCO's instructions.
Linux 2.0.x and earlier(?)
HP-UX 10.x defaults to terminating connections in the FIN_WAIT_2 state after the normal keepalive timeouts. This does not refer to the persistent connection or HTTP keepalive timeouts, but the SO_LINGER socket option which is enabled by Apache. This parameter can be adjusted by using nettune to modify parameters such as tcp_keepstart and tcp_keepstop. In later revisions, there is an explicit timer for connections in FIN_WAIT_2 that can be modified; contact HP support for details.
SGI IRIX 5.3, 6.2 and 6.3 (and a patch for 6.4 after release) will add a timeout for connections in FIN_WAIT_2 in a forthcoming (as of 97/01) rollup patch. Contact SGI for details.
NCR's MP RAS Unix 2.xx and 3.xx both have FIN_WAIT_2 timeouts. In 2.xx it is non-tunable at 600 seconds, while in 3.xx it defaults to 600 seconds and is calculated based on the tunable "max keep alive probes" (default of 8) multiplied by the "keep alive interval" (default 75 seconds).
Squent's ptx/TCP/IP for DYNIX/ptx has had a FIN_WAIT_2 timeout since around release 4.1 in mid-1994.

The following systems are known to not have a timeout:

SunOS 4.x does not and almost certainly never will have one because it as at the very end of its development cycle for Sun. If you have kernel source should be easy to patch.

There is a patch available for adding a timeout to the FIN_WAIT_2 state; it was originally intended for BSD/OS, but should be adaptable to most systems using BSD networking code. You need kernel source code to be able to use it. If you do adapt it to work for any other systems, please drop me a note at marc@apache.org.

Compile without using `lingering_close()`

lingering_close()

To compile without the lingering_close() function, add -DNO_LINGCLOSE to the end of the EXTRA_CFLAGS line in your Configuration file, rerun Configure and rebuild the server.

Use `SO_LINGER` as an alternative to `lingering_close()`

SO_LINGER

setsockopt(2)

lingering_close()

lingering_close

To try it, add -DUSE_SO_LINGER -DNO_LINGCLOSE to the end of the EXTRA_CFLAGS line in your Configuration file, rerun Configure and rebuild the server.

NOTE: Attempting to use SO_LINGER and lingering_close() at the same time is very likely to do very bad things, so don't.

Increase the amount of memory used for storing connection state

BSD based networking code:

BSD stores network data, such as connection states, in something called an mbuf. When you get so many connections that the kernel does not have enough mbufs to put them all in, your kernel will likely crash. You can reduce the effects of the problem by increasing the number of mbufs that are available; this will not prevent the problem, it will just make the server go longer before crashing.

The exact way to increase them may depend on your OS; look for some reference to the number of "mbufs" or "mbuf clusters". On many systems, this can be done by adding the line NMBCLUSTERS="n", where n is the number of mbuf clusters you want to your kernel config file and rebuilding your kernel.

Feedback

marc@apache.org

Appendix

Below is a message from Roy Fielding, one of the authors of HTTP/1.1.

Why the lingering close functionality is necessary with HTTP

The need for a server to linger on a socket after a close is noted a couple times in the HTTP specs, but not explained. This explanation is based on discussions between myself, Henrik Frystyk, Robert S. Thau, Dave Raggett, and John C. Mallery in the hallways of MIT while I was at W3C.

If a server closes the input side of the connection while the client is sending data (or is planning to send data), then the server's TCP stack will signal an RST (reset) back to the client. Upon receipt of the RST, the client will flush its own incoming TCP buffer back to the un-ACKed packet indicated by the RST packet argument. If the server has sent a message, usually an error response, to the client just before the close, and the client receives the RST packet before its application code has read the error message from its incoming TCP buffer and before the server has received the ACK sent by the client upon receipt of that buffer, then the RST will flush the error message before the client application has a chance to see it. The result is that the client is left thinking that the connection failed for no apparent reason.

There are two conditions under which this is likely to occur:

sending POST or PUT data without proper authorization
sending multiple requests before each response (pipelining) and one of the middle requests resulting in an error or other break-the-connection result.

The solution in all cases is to send the response, close only the write half of the connection (what shutdown is supposed to do), and continue reading on the socket until it is either closed by the client (signifying it has finally read the response) or a timeout occurs. That is what the kernel is supposed to do if SO_LINGER is set. Unfortunately, SO_LINGER has no effect on some systems; on some other systems, it does not have its own timeout and thus the TCP memory segments just pile-up until the next reboot (planned or not).

Please note that simply removing the linger code will not solve the problem -- it only moves it to a different and much harder one to detect.