Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix (known) false positives in connection warning #14095

Merged
merged 6 commits into from
Jan 30, 2025

Conversation

danxuliu
Copy link
Member

@danxuliu danxuliu commented Jan 9, 2025

Fixes (non-UI part of) #14019

There are some known false positives that cause the connection warning to be shown when the connection is fine:

  • Stat timestamp may not be updated when the stats stall: sometimes the same number of packets is reported two times in a row. The code currently keeps the quality calculation on hold if the same number of packets is reported twice in a row until a third value is received, and then updates the quality based on that third value. In the past it was observed that when the stats stalled the timestamps were updated. However, it seems that it can also happen that the timestamp is also repeated, which messes with the calculations and causes the connection to be reported as very bad even if the same number of packets is reported twice in a row (which was expected to be correctly handled already, but due to this new finding turns out that it is not).
    • Fixed by evenly distributing the timestamps when they stall
    • This seems to be exclusive of Firefox, as Chromium does not provide a packetsReceived value in remote-inbound-rtp.
    • It is unclear whether the stats can stall more than two times in a row. It was assumed that the connection warning was caused by that, but it is most likely that the stalled timestamp was the actual problem and that it was mistakenly identified as the stats stalling three times in a row. For now it will be assumed that if the same stats are repeated three times in a row there is a connection problem, and it will be adjusted in a follow up if it is found out that it is a false positive.
  • Videos with mostly static images and low information (like a blurred background with the video quality reduced due to being in a call with several participants), which may send even less packets per second than the current threshold (10) used to mark the connection as very bad (and therefore show the connection warning)
    • Workarounded by no longer showing a connection warning to the user when the packets per second are low; further investigation would be needed, but in any case it seemed to cause more false positives to be reported than actual positives
  • The reported packet count may "go back in time" to a previous value (it seems to happen when enabling the video, but it is not reproducible at will), which messes with the quality calculation.
    • After endless tests it was not reproducible once... It was just found in provided logs, so a "synthetic" test was added that reproduces what was seen in those logs and the issue was fixed by resetting the stats when the received packet count regresses.
    • Although the issue from the logs could not be reproduced a similar (but not fully matching) issue was also found: due to a bug in Firefox and/or Janus (the stats overflowing is a bug in Firefox and even the spec, but in theory it should not happen unless it received invalid values from Janus) the received packet count may overflow. This typically happens in the stats of the lower quality stream of a simulcast video.

Follow ups

  • Fix distribution of staged stats (as right now they can be distributed when the values are staged three times due to a wrong if (packets > 0) { condition)
  • Maybe log stats returned by getStats() and not only the processed values in case of strange scenarios (like the packet count regressing) to try to better understand it

@danxuliu danxuliu added this to the 🖤 Next Major (31) milestone Jan 9, 2025
@danxuliu danxuliu force-pushed the fix-known-false-positives-in-connection-warning branch 2 times, most recently from 51ed92b to 0f449de Compare January 23, 2025 10:45
@nickvergessen
Copy link
Member

Reviewable soon?

@danxuliu danxuliu force-pushed the fix-known-false-positives-in-connection-warning branch from 0f449de to 8c34ada Compare January 28, 2025 10:57
Chromium does not provide "packetsReceived" in "remote-inbound-rtp"

Signed-off-by: Daniel Calviño Sánchez <[email protected]>
When the initial stats reports are added the first value of cumulative
stats is used as a base when converting the rest of values to relative.
Therefore, that initial value is not meaningful and should not be used
in calculations, as otherwise the reported quality would be off. Due to
that now no quality is reported until the values of the first report
added were shifted.

Signed-off-by: Daniel Calviño Sánchez <[email protected]>
In the past it was observed that the timestamp and round trip time were
updated when the stats stalled (so only the received packets actually
stalled). However, nowadays it seems that the timestamp and round trip
time can also stall. This caused the stalled timestamp to be computed as
0, and that in turn messed with the rest of calculations (for example,
generating a NaN for the number of packets per second) and causing a
"very bad quality" to be wrongly reported. To solve that now the
timestamps are evenly distributed when they unstall.

Signed-off-by: Daniel Calviño Sánchez <[email protected]>
In the past it was observed that, even for small videos, 10 packets per
second was a reasonable threshold to detect connection issues even if
there were no lost packets. However, nowadays it seems that it can
sometimes trigger a false positive (typically when the background blur
is enabled and the video quality is reduced due to being in a call with
several participants), so for now the connection problem is no longer
reported to the user but just logged.

Signed-off-by: Daniel Calviño Sánchez <[email protected]>
Due to a bug in Firefox and/or Janus, when using simulcast video the
remote report for one of the streams (typically the lowest quality one)
may start with garbage values. This causes the received packet count to
be reported the maximum integer value minus the packet lost count (which
is usually around a few thousands). When newer stats arrive the received
packets start to increase from that extremely high value, and eventually
it overflows, causing the received packet count to go back from
~4294967295 to ~0.

In other cases it was seen that the received packet count can regress a
few values, although it is not clear when or why (this was much rarer
and not reproducible, unlike the scenario described above).

To prevent the regressed value from distorting the analysis due to the
packet count being < 0, and as in both cases once the packet count
regressed the received packet count in all following stat reports
increase from the regressed value, now the stats are reset when a lower
packet count is found.

Signed-off-by: Daniel Calviño Sánchez <[email protected]>
@danxuliu danxuliu force-pushed the fix-known-false-positives-in-connection-warning branch from 8c34ada to 048c505 Compare January 29, 2025 14:47
@danxuliu
Copy link
Member Author

/backport to stable31

@danxuliu
Copy link
Member Author

/backport to stable30

@danxuliu
Copy link
Member Author

/backport to stable29

@danxuliu danxuliu marked this pull request as ready for review January 29, 2025 15:16
Copy link
Contributor

@Antreesy Antreesy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for detailed comments!
Smoke tested in local call with blurred bg and screenshares, haven't received any warnings, as was noticed before for these cases

@danxuliu danxuliu merged commit 0171ecd into main Jan 30, 2025
56 checks passed
@danxuliu danxuliu deleted the fix-known-false-positives-in-connection-warning branch January 30, 2025 01:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants