Handle idle disconnects more cleanly to try to reduce number of dropped logs #39

jwreford99 · 2023-04-25T15:21:33Z

Issue #38 reported many re-connections caused by idle periods for the plugin followed by activity. This is a particular problem because logs written to a closed connection can, in some cases, be lost rather than retried. This is the nature of TCP sockets: feedback is not instant, so the plugin writes logs to the socket and returns as soon as the write has been accepted locally. If the connection is no longer usable but has not yet been marked as such, the write succeeds locally but the logs never reach the remote.
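As a rough illustration of why a successful write says nothing about delivery, the sketch below sends to a local peer that never accepts or reads. It is not the plugin's code (`write_without_reader` is a hypothetical helper) and it does not reproduce a dead connection; it only shows that `send` returns once the bytes are handed to the local kernel.

```python
import socket


def write_without_reader(payload: bytes) -> int:
    """Send payload to a peer that never reads it; return bytes accepted locally."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    try:
        # The peer never calls accept() or recv(), yet send() still succeeds:
        # it only copies the bytes into the local kernel send buffer.
        return client.send(payload)
    finally:
        client.close()
        listener.close()


accepted = write_without_reader(b"<14>example log line\n")
print(f"{accepted} bytes accepted locally, none read by the peer")
```

The same applies to a connection the remote has already closed: the first write after the close can still be accepted locally, which is how logs end up lost rather than retried.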

The Papertrail documentation says idle connection timeouts should be handled by the client, i.e. this plugin (emphasis mine):

Papertrail automatically closes idle TCP connections after 15 minutes of inactivity. Log senders should already handle this as part of a full implementation, since connections are routinely affected by Internet reachability changes. Senders can elect to immediately reconnect, or leave the connection closed until new log messages are available.

Additionally, when speaking to Papertrail support I received the following message when I asked about why connections might be terminated:

There may be some network equipment dropping idle TCP connections after some minutes. Adding the TCP keepalives probably solved that problem.

However, Papertrail itself will also drop idle syslog connections after 59 minutes. TCP keepalives don’t help there, because the timer only resets for actual application data. To solve this problem, you can log a periodic “MARK” message. The message can literally be anything, and should probably be on a 30 minute timer. You can even add a filter for it in Papertrail, so it doesn’t show up in the logs.

There may also be some network equipment dropping TCP RST packets. This isn’t entirely uncommon, but it really messes up error recovery as you’ve observed. Instead of immediately reconnecting, the sender will instead send down a dead connection until it times out.

This PR seeks to address points 1 & 2 from Papertrail support so that this plugin can handle periods of idleness better. Because this is a Papertrail-specific plugin, I have enabled some sensible defaults which allow the plugin to work better out of the box.

The logic behind the two changes is as follows:

The keepalives deal better with network equipment between the sender and Papertrail, keeping connections alive through short periods of idleness. This is an out-of-the-box TCP feature which is useful in this case. In addition to the keepalives, I have configured the TCP_USER_TIMEOUT option, as a similar issue was seen here in a different syslog plugin when the connection to the host was lost without an RST. This does not mean that no logs are lost when a connection is dropped, but it limits how long a dead connection is used before it is closed. Specifically, the standard UNIX setting seems to be 15 minutes, which would lead to a lot of lost logs if a period of inactivity is followed by heavy logging (a scenario we experienced with new deployments of apps). Shortening this to 10 seconds reduces the window of lost logs when the connection is reset by Papertrail. Reading for this: https://tech.instacart.com/the-vanishing-thread-and-postgresql-tcp-connection-parameters-93afc0e1208c, https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
If Papertrail closes any idle syslog connection which has not sent application data for 59 minutes, then the plugin can assume that a socket which has not been used for 59 minutes is dead and should be re-created. As suggested in the email from technical support, though, it is probably sensible to have some leeway on this, so I have set the default to 30 minutes of idle application time, after which the plugin re-creates the socket before even trying to send the message. I felt this was a neater solution than sending a MARK message or similar on a timer.
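The keepalive and TCP_USER_TIMEOUT settings from the first change can be sketched in Python as below. This is a minimal illustration, not the plugin's implementation: `configure_keepalive` is a hypothetical helper, and the keepalive idle, interval, and probe values are placeholder assumptions. Only the 10 second user timeout comes from the description above. Options the platform does not expose are skipped.

```python
import socket


def configure_keepalive(sock, idle_s=300, interval_s=60, probes=5,
                        user_timeout_ms=10_000):
    """Enable TCP keepalives and a user timeout; return the options applied."""
    applied = {}
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied["SO_KEEPALIVE"] = 1
    tcp_options = (
        ("TCP_KEEPIDLE", idle_s),        # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),   # seconds between probes
        ("TCP_KEEPCNT", probes),         # unanswered probes before giving up
        # Milliseconds that sent data may stay unacknowledged before the
        # kernel closes the connection (10 s, as in the description above).
        ("TCP_USER_TIMEOUT", user_timeout_ms),
    )
    for name, value in tcp_options:
        option = getattr(socket, name, None)
        if option is None:
            continue  # constant not available on this platform
        sock.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = value
    return applied


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied = configure_keepalive(sock)
print(applied)
sock.close()
```

Note that TCP_USER_TIMEOUT is expressed in milliseconds, while the keepalive options are in seconds or probe counts.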

I have made these settings configurable because, despite the defaults being sensible, someone may want to tune the values. I have also added these details to the readme so that they are easily accessible for anyone using the plugin.
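The idle re-creation timeout described above, exposed as a tunable value, might look roughly like the following sketch. The class and field names (`IdleRecreatingSender`, `recreate_after_s`, `connect`) are hypothetical and are not the plugin's configuration options; the 30 minute default is the one stated above.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class IdleRecreatingSender:
    """Re-create the socket before sending if it has been idle for too long."""

    connect: Callable[[], Any]                  # factory returning a socket-like object
    recreate_after_s: float = 30 * 60           # 30 minutes of idle application time
    clock: Callable[[], float] = time.monotonic
    _sock: Optional[Any] = None
    _last_used: float = 0.0

    def send(self, data: bytes) -> None:
        now = self.clock()
        idle_for = now - self._last_used
        if self._sock is not None and idle_for >= self.recreate_after_s:
            # Assume the remote has already dropped the idle connection, so
            # discard it rather than writing into a dead socket.
            self._sock.close()
            self._sock = None
        if self._sock is None:
            self._sock = self.connect()
        self._sock.sendall(data)
        self._last_used = now
```

A caller would pass a factory such as `lambda: socket.create_connection((host, port))`, with `host` and `port` supplied by the user, and could lower `recreate_after_s` if intermediate network equipment drops idle connections sooner.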

jwreford99 added 3 commits April 25, 2023 15:39

Add configuration parameters that can enable TCP sockets to use keep alive messages. This also includes some sensible configuration based on experimenting with Papertrail itself

887b16a

Add a socket re-creation timeout.

5082db4

This timeout means that if a socket has not had any logs through it for that number of seconds, then we should create it afresh before trying to send any more logs through it

Add details of new configuration options to readme

ff00fdc
