Calculate and report quartiles in performance results #60950
Conversation
Size Change: +54 B (0%) Total Size: 1.75 MB
		( sum( array ) / array.length ) ** 2
	);
}

export function median( array ) {
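The diff excerpt above is truncated. Purely for context, here is a minimal sketch of what such statistics helpers could look like; the `sum`, `average`, and `standardDeviation` names and bodies are assumptions, not the PR's actual code:

```js
// Minimal statistics helpers over arrays of numbers (sketch, not the PR's code).
function sum( array ) {
	return array.reduce( ( total, value ) => total + value, 0 );
}

function average( array ) {
	return sum( array ) / array.length;
}

// Population standard deviation: sqrt( E[x^2] - ( E[x] )^2 ), which is where a
// term like `( sum( array ) / array.length ) ** 2` would appear.
function standardDeviation( array ) {
	return Math.sqrt(
		sum( array.map( ( x ) => x ** 2 ) ) / array.length -
			( sum( array ) / array.length ) ** 2
	);
}

// Median: middle element of the sorted array, or the average of the two
// middle elements when the length is even.
export function median( array ) {
	const sorted = [ ...array ].sort( ( a, b ) => a - b );
	const middle = Math.floor( sorted.length / 2 );
	return sorted.length % 2
		? sorted[ middle ]
		: ( sorted[ middle - 1 ] + sorted[ middle ] ) / 2;
}
```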
In some cases we also seem to be calculating the median, for others the average 🤔
@@ -72,26 +92,46 @@ export function curateResults(
	results: WPRawPerformanceResults
): WPPerformanceResults {
	const output = {
		timeToFirstByte: median( results.timeToFirstByte ),
This will change the reporting from `median` to `average`, is that intentional?
In some cases we also seem to be calculating the median, for others the average
It seems to me like a pointless inconsistency. The metrics measured don't look fundamentally different from each other. That's why I'm trying to unify the stats to use only one method.
Are you worried that changing it would invalidate the stats history? I think in our case the medians and averages should be approximately the same; any difference is just random noise. I'll check whether one of them is systematically higher than the other.
I'm just saying this would be reverting #54157, but maybe that's ok
If we remove that, maybe we should also remove the `median` function etc.
this would be reverting #54157
That's true, we'll have to deal with that. The reason why the median is needed to "stabilize the metric" is that there are outliers, and the median is less sensitive to them than the average.
This is a real-world example of an `inserterHover` measurement with 20 data points:
[2.60, 3.09, 2.57, 2.62, 2.21, 2.98, 2.34, 3.34, 2.37, 3.00, 2.14, 2.16, 2.07, 2.22, 2.12, 2.27, 2.66, 2.20, 6.89, 2.99]
Almost all measurements are nicely grouped around 2.5, in a ±0.5 range, but there is one outlier at 6.89:

The presence of this outlier moves the median from 2.37 to 2.47, i.e., by 0.1. But the average is more sensitive: it moves from 2.52 to 2.74, i.e., by 0.22.
When there is more than one outlier, the stats are distorted even more.
We can solve this by detecting and removing the outliers, arguing that when a certain result is too far from all the others, the measurement instrument broke down and produced a garbage result. Without outliers, neither the median nor the average is distorted, and they are closer to each other.
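As an illustration of what such outlier removal could look like (this is not the PR's implementation; the 1.5 multiplier is the common Tukey fence and is an assumption):

```js
// Sketch: drop samples that lie further than 1.5 × IQR outside the quartiles.
function quantile( array, q ) {
	const sorted = [ ...array ].sort( ( a, b ) => a - b );
	const pos = ( sorted.length - 1 ) * q;
	const lo = Math.floor( pos );
	const hi = Math.ceil( pos );
	return sorted[ lo ] + ( pos - lo ) * ( sorted[ hi ] - sorted[ lo ] );
}

function removeOutliers( array ) {
	const q1 = quantile( array, 0.25 );
	const q3 = quantile( array, 0.75 );
	const fence = 1.5 * ( q3 - q1 );
	return array.filter( ( x ) => x >= q1 - fence && x <= q3 + fence );
}

// With the inserterHover samples above, the 6.89 outlier is filtered out,
// and the median and average move much closer to each other.
```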
My ultimate goal here is to run a significance test for each metric and produce a result like:
on trunk the inserterHover metric is 2.5±0.5ms
on this branch the inserterHover metric is 1.5±0.5ms
with 95% confidence this is a real improvement, not a random noise
This enables automated reporting in PRs where the tool can confidently say "out of 100 metrics I checked, your PR improves this one but another two are significantly worse".
All the common significance tests accept averages and variances as inputs, not medians and quartiles. That's why I'm trying to move to averages.
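As a hedged sketch of what such a check could look like, assuming a Welch's t-test computed from means, sample variances, and sample counts (the ~2.1 critical value approximates 95% confidence for sample sizes around 15–20; none of this is the PR's code):

```js
// Mean and sample variance of an array of numbers.
const mean = ( a ) => a.reduce( ( s, x ) => s + x, 0 ) / a.length;
const variance = ( a ) => {
	const m = mean( a );
	return a.reduce( ( s, x ) => s + ( x - m ) ** 2, 0 ) / ( a.length - 1 );
};

// Welch's t statistic for two independent samples.
function welchT( a, b ) {
	return (
		( mean( a ) - mean( b ) ) /
		Math.sqrt( variance( a ) / a.length + variance( b ) / b.length )
	);
}

// Rough check: |t| above ~2.1 suggests the difference between trunk and the
// branch is unlikely to be just random noise, at roughly 95% confidence.
function isSignificantChange( trunk, branch ) {
	return Math.abs( welchT( trunk, branch ) ) > 2.1;
}
```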
So, in the end this PR does two things:
The result table now looks like this:
It reports the median (the 50% quartile), how many percent larger or smaller the 75% and 25% quartiles are relative to the median, and the number of measurements (16 in the example). The main value of this is that we can compare the "% change" against another branch with the variability of the measurements on one branch. So, if the reported change is -1.29% while the measurements themselves vary by 5-9%, we can see that the reported change is probably not real; it's noise. A nice next step would be if we could report the quartiles also to codevitals.
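A minimal sketch of how such a row could be computed; the field names and the linear-interpolation quantile method are assumptions, not necessarily what the PR does:

```js
// Summarize one metric: median, quartile spread relative to the median, and
// the number of samples.
function summarize( samples ) {
	const sorted = [ ...samples ].sort( ( a, b ) => a - b );
	const quantile = ( q ) => {
		const pos = ( sorted.length - 1 ) * q;
		const lo = Math.floor( pos );
		const hi = Math.ceil( pos );
		return sorted[ lo ] + ( pos - lo ) * ( sorted[ hi ] - sorted[ lo ] );
	};
	const q50 = quantile( 0.5 );
	return {
		median: q50,
		q25: `-${ ( ( ( q50 - quantile( 0.25 ) ) / q50 ) * 100 ).toFixed( 1 ) }%`,
		q75: `+${ ( ( ( quantile( 0.75 ) - q50 ) / q50 ) * 100 ).toFixed( 1 ) }%`,
		samples: sorted.length,
	};
}
```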
I wanted to check the summary but it seems the performance job is failing in the last run.
I probably broke it when resolving the conflicts with #61450. Both PRs touch the same code that prints the results table. I'm working on fixing it.
It's temporarily helpful, as I wanted to confirm I'm using all the samples from all test rounds. And it's useful to record the number of samples permanently in the codevitals database, because statistical significance depends on the number of samples: the precision of the measurement grows proportionally to the square root of the number of samples.
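For intuition about that last point (standard statistics, not something spelled out in this thread): the standard error of a mean shrinks with the square root of the sample count, so quadrupling the number of samples halves the error.

```js
// Standard error of the mean: sd / sqrt(n). Quadrupling n halves the error.
const standardError = ( sd, n ) => sd / Math.sqrt( n );
console.log( standardError( 0.5, 16 ) ); // 0.125
console.log( standardError( 0.5, 64 ) ); // 0.0625
```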
Thanks
@youknowriad it's all green now, you can check what the summary looks like. When logging the data to codevitals'
@jsnajdr Looks good. For codevitals, I haven't thought about it much, but I have nothing against it. Feel free to open an issue or actually work on it: https://github.com/youknowriad/codevitals. I'm just thinking that we should make it optional, as not all projects might have it...
What is
@jeryj We're only additionally reporting how precisely we know the median values in the respective branches. The median values are calculated from multiple measurements, each of them giving a different number. These % values report how different these measurements are from each other.
Inspired by #60509, I'm adding a calculation of the 25% and 75% quartiles of the measurements. Each measurement is a set of 10+ samples, so let's extract more data from it than a single number like the average or the median.
Knowing the quartiles gives us some idea about the variance, i.e., the error of the measurements. Knowing that is important for determining whether a change in a metric in a given PR is significant or not.
This PR also uses medians consistently everywhere instead of averages. Medians are more stable in the presence of outliers, i.e., measurements where the process fails in a big way and the error is larger than Gaussian noise.
Let's see how it works.