Handle idle disconnects more cleanly to try to reduce number of dropped logs #39

Open
wants to merge 3 commits into
base: master
from

jwreford99

Issue #38 reported many reconnections caused by idle periods for the plugin followed by activity. This is specifically problematic because logs written to a closed connection can in some cases be lost rather than retried. This is the nature of TCP sockets: feedback is not instant, so the plugin writes logs to the socket and returns as soon as the local write has succeeded. If the connection is no longer usable but has not yet been marked as such, the write succeeds locally but the logs never make it to the remote.

The Papertrail documentation says idle connection timeouts should be handled by the client, i.e. this plugin (emphasis mine):

Papertrail automatically closes idle TCP connections after 15 minutes of inactivity. Log senders should already handle this as part of a full implementation, since connections are routinely affected by Internet reachability changes. Senders can elect to immediately reconnect, or leave the connection closed until new log messages are available.

Additionally, when I asked Papertrail support why connections might be terminated, I received the following message:

  1. There may be some network equipment dropping idle TCP connections after some minutes. Adding the TCP keepalives probably solved that problem.

  2. However, Papertrail itself will also drop idle syslog connections after 59 minutes. TCP keepalives don’t help there, because the timer only resets for actual application data. To solve this problem, you can log a periodic “MARK” message. The message can literally be anything, and should probably be on a 30 minute timer. You can even add a filter for it in Papertrail, so it doesn’t show up in the logs.

  3. There may also be some network equipment dropping TCP RST packets. This isn’t entirely uncommon, but it really messes up error recovery as you’ve observed. Instead of immediately reconnecting, the sender will instead send down a dead connection until it times out.

This PR seeks to address points 1 & 2 from Papertrail support so that this plugin handles periods of idleness better. Because this is a Papertrail-specific plugin, I have enabled some sensible defaults so that it works better out of the box.

The logic behind the two changes is as follows:

  • TCP keepalives help with network equipment between the sender and Papertrail, keeping connections alive through short periods of idleness. This is a standard TCP feature that is useful in this case. In addition to the keepalives, I have configured the TCP_USER_TIMEOUT option, as a similar issue was seen here in a different syslog plugin when the connection to the host was lost without an RST. This does not guarantee that no logs are dropped when a connection is dropped, but it limits how long a dead connection is used before it is closed. The standard UNIX setting seems to be 15 minutes, which would lose a lot of logs if a period of inactivity is followed by heavy logging (a scenario we experienced with new deployments of apps). Shortening this to 10 seconds reduces the window of lost logs when the connection is reset by Papertrail. Reading for this: https://tech.instacart.com/the-vanishing-thread-and-postgresql-tcp-connection-parameters-93afc0e1208c, https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
  • If Papertrail closes idle syslog connections that have not sent application data for 59 minutes, then the plugin should treat any socket that has been unused for that long as dead and re-create it. As suggested in the email from technical support, it is probably sensible to have some leeway on this, so I have set the default to 30 minutes of application-level idleness, after which the plugin re-creates the socket before even trying to send the message. I felt this was a neater solution than sending a MARK message or similar on a timer.
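
As a rough illustration of the first point, the socket options can be set as in the Python sketch below. This is not the plugin's code: the function name `configure_keepalive` and the keepalive values (`idle_s`, `interval_s`, `probes`) are assumptions made for the example. Only the 10 second `TCP_USER_TIMEOUT` comes from the description above. The tuning options are Linux-specific, so the sketch skips any that the platform does not expose.

```python
import socket


def configure_keepalive(sock, idle_s=300, interval_s=30, probes=3,
                        user_timeout_ms=10_000):
    """Enable TCP keepalives and, where available, TCP_USER_TIMEOUT.

    All defaults except the 10 s user timeout are illustrative assumptions.
    """
    # Ask the kernel to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Tuning knobs that exist on Linux; skipped on platforms lacking them.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),        # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),   # seconds between probes
        ("TCP_KEEPCNT", probes),         # failed probes before giving up
        # Milliseconds that sent data may stay unacknowledged before the
        # kernel closes the connection.
        ("TCP_USER_TIMEOUT", user_timeout_ms),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

The options are applied to the socket before it is used for sending, so a dead connection is detected within the user timeout rather than the much longer system default.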

I have made these settings configurable: although the defaults are sensible, someone may want to tune them to different values. I have also added these details to the readme so that they are easily accessible to anyone using the plugin.
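
The idle-timeout behaviour with a configurable limit could look roughly like the Python sketch below. It is an assumption-laden illustration, not the plugin's implementation: the class `IdleAwareSender`, its `connect` callback, and the injectable `clock` are hypothetical names chosen for the example. Only the 30 minute default is taken from the description above.

```python
import time


class IdleAwareSender:
    """Re-create the socket if it has been idle longer than max_idle_s."""

    def __init__(self, connect, max_idle_s=30 * 60, clock=time.monotonic):
        self._connect = connect        # callable returning a connected socket
        self._max_idle_s = max_idle_s  # configurable idle limit in seconds
        self._clock = clock            # injectable so the logic can be tested
        self._sock = None
        self._last_used = None

    def send(self, payload: bytes) -> None:
        now = self._clock()
        # Treat a socket that has been idle for too long as dead and drop it
        # before attempting to write, instead of finding out via a lost log.
        if self._sock is not None and now - self._last_used >= self._max_idle_s:
            self._sock.close()
            self._sock = None
        if self._sock is None:
            self._sock = self._connect()
        self._sock.sendall(payload)
        self._last_used = self._clock()
```

A caller would pass something like `lambda: socket.create_connection((host, port))` as `connect`, where `host` and `port` stand in for the Papertrail endpoint. Checking idleness at send time avoids the extra timer that a periodic MARK message would need.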

…alive messages. This also includes some sensible configuration based on experimenting with Papertrail itself
This timeout means that if a socket has not had any logs through it for that number of seconds, then we should create it afresh before trying to send any more logs through it.