Handle idle disconnects more cleanly to try to reduce number of dropped logs #39
Issue #38 reported a large number of reconnections caused by idle periods for the plugin followed by activity. This is particularly problematic because logs written to a closed connection can, in some cases, be lost rather than retried. This is the nature of TCP sockets: they do not give instant feedback. The plugin tries to put logs down the socket and returns as soon as it has written to the file. So if the connection is no longer usable but has not yet been marked as such, the logs are written to the file successfully but never make it to the remote.
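To illustrate why a successful write does not imply delivery, here is a minimal Python sketch. It is not the plugin's code: the function name `send_after_peer_close` is hypothetical, and it uses a loopback connection whose remote end has closed gracefully, which is a simpler case than an idle timeout without RST. The point it shows is the same: the local write reports success even though nobody will read the data.

```python
import socket


def send_after_peer_close(payload: bytes) -> int:
    """Write to a TCP socket whose remote end is already closed.

    Returns the number of bytes the kernel accepted. Hypothetical helper,
    for illustration only.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    # The "remote" goes away; the client has not noticed yet.
    server_side.close()
    listener.close()

    try:
        # The payload is copied into the local send buffer and the call
        # returns without error, although it will never be read.
        return client.send(payload)
    finally:
        client.close()


if __name__ == "__main__":
    print(send_after_peer_close(b"a log line that will be lost\n"))
```

Only a later write, made after the kernel has learned that the peer is gone, would raise an error, which is why the first logs after an idle period are the ones at risk.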
The Papertrail documentation says idle connection timeouts should be handled by the client, i.e. this plugin (emphasis mine):
Additionally, when speaking to Papertrail support, I received the following message when I asked why connections might be terminated:
This PR seeks to address points 1 & 2 from Papertrail support so that this plugin can handle periods of idleness better. Because this is a Papertrail-specific plugin, I have enabled some sensible defaults that allow the plugin to work better out of the box.
The logic behind the two changes is as follows:
TCP_USER_TIMEOUT option, as a similar issue was seen in a different syslog plugin when the connection to the host was lost without an RST. This does not mean that no logs are lost when a connection is dropped, but it limits how long a dead connection keeps being used before it is closed. The standard UNIX setting seems to be 15 minutes, which would lead to a lot of lost logs if a period of inactivity is followed by heavy logging (a scenario we experienced with new deployments of apps). Shortening this to 10 seconds reduces the window of lost logs when the connection is reset by PT. Reading for this: https://tech.instacart.com/the-vanishing-thread-and-postgresql-tcp-connection-parameters-93afc0e1208c, https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
MARK message or similar on a timer
I have made these settings configurable because, although the defaults are sensible, someone may want to tune the values for their own environment. I have also added these details to the readme so that they are easily accessible to anyone using the plugin.
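As a rough illustration of the TCP_USER_TIMEOUT change described above, the following Python sketch sets the option on a socket. It is not the plugin's implementation: the helper name `apply_user_timeout` and the constant `DEFAULT_USER_TIMEOUT_MS` are hypothetical, and the 10-second value simply mirrors the default mentioned in this PR. The option is Linux-specific and takes a value in milliseconds, so the sketch skips it on platforms where Python does not expose it.

```python
import socket

# Hypothetical default, mirroring the 10-second window described in this PR.
DEFAULT_USER_TIMEOUT_MS = 10_000


def apply_user_timeout(sock: socket.socket,
                       timeout_ms: int = DEFAULT_USER_TIMEOUT_MS) -> bool:
    """Set TCP_USER_TIMEOUT on `sock` where the platform supports it.

    Returns True if the option was applied, False if it is unavailable.
    """
    option = getattr(socket, "TCP_USER_TIMEOUT", None)
    if option is None:
        # Not exposed on this platform (the option is Linux-specific).
        return False
    # Maximum time, in milliseconds, that transmitted data may stay
    # unacknowledged before the kernel closes the connection.
    sock.setsockopt(socket.IPPROTO_TCP, option, timeout_ms)
    return True


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("applied:", apply_user_timeout(s))
    s.close()
```

Taking the timeout as a parameter with a default reflects the choice made here to ship a sensible value while leaving it tunable.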