feat/re-allow multiple workers #36134
base: main
Conversation
I'll be OOO for the next few weeks, so I won't be able to review until then.
Thanks for the PR @bmiguel-teixeira! The contributions sound solid, but I'm a bit concerned that we're doing two separate things in one single PR here. Generally that's not a good practice, because if we have to revert one thing, we end up reverting more than what's needed, not to mention that it makes the PR bigger and a bit harder to review.
Could you choose only one new functionality for this PR and open another one for the other?
Regarding the changes, could you add tests for the telemetry you've added? Send PRW requests to a mock server and assert that the metric increments as you expect.
For allowing multiple workers, it would be nice if we added extra documentation making it clear that out-of-order ingestion needs to be enabled in Prometheus for it to work :)
Sure. I will open a dedicated PR with the additional telemetry and keep the queue changes in this one, which already has context.
Hi @ArthurSens, I removed the additional telemetry, to be added in a secondary PR. Also added a bit of docs to explain the toggle and use case. Please take a look. Cheers
Well, the code is simple so it does LGTM, but I'm struggling to test this.
Do you have any examples of apps/configs that will generate out of order datapoints? All attempts I've tried provide things in order so I can't be sure if this is working as expected 😅
(There's a linting failure.)
Hi @ArthurSens, just submitted your recommendation to fix the spelling issues. Regarding testing and simulating the out-of-order issues locally:
Prometheus Config
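(The actual config block is collapsed in this thread; purely as an illustration, a minimal vanilla setup could look like the sketch below, assuming Prometheus is started with `--web.enable-remote-write-receiver` so the collector can push to it.)

```yaml
# prometheus.yml - vanilla setup, no out-of-order window (illustrative sketch, not the original config)
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```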
Otel Config
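(Likewise a sketch rather than the original collapsed config: the collector scrapes something, e.g. its own metrics on port 8888, and forwards them via the prometheusremotewrite exporter with the sending queue set to 5 consumers, to reproduce the multi-worker scenarios below.)

```yaml
# Illustrative collector config (sketch): prometheus receiver -> remote-write exporter with 5 queue consumers
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "otel-collector"
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8888"]

exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    remote_write_queue:
      enabled: true
      queue_size: 10000
      num_consumers: 5

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```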
Scenario 1 (Vanilla):
Outcome: Metrics are re-ingested in order with a single worker, ALL GOOD.

Scenario 2 (PrometheusRemoteWrite 5 Consumers + Vanilla Prometheus):
Outcome: Out-of-order errors in Prometheus after boot-up, since samples will be retried in no specific order.

Scenario 3 (PrometheusRemoteWrite 5 Consumers + Prometheus with OutOfOrder Window, see the snippet below):
Outcome: No out-of-order issues, because even though we have multiple workers, Prometheus can accept mixed/old samples and update the TSDB up to 10 minutes "ago".

Let me know if you need any more info. Cheers.
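(For scenario 3, the out-of-order window is a TSDB storage option in prometheus.yml; a sketch of the relevant fragment, assuming the 10-minute window mentioned above:)

```yaml
# Accept samples up to 10 minutes older than the newest in-order sample
storage:
  tsdb:
    out_of_order_time_window: 10m
```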
Perfect; thank you for the instructions! I just tested it, and it works perfectly. There's just one CI failure that we need to fix before approving here.
Could you run `make generate` and commit the changes?
@ArthurSens All done!
Hey @bmiguel-teixeira, once #36601 is merged, I think we should be safe to proceed with multiple workers again :) Could you rebase your PR on top of main once that happens?
While reading the changes in the README, I'm confused about the difference between `num_consumers` and `max_batch_request_parallelism`.
Both mention that they configure the number of workers... so do they fight each other to decide the number of workers? Do they complement each other?
By the way, thanks for the patience here, a lot was done to make this possible :)
The README is still a bit confusing to me, but I'm fine merging if the other codeowners are ok with it
Hi @ArthurSens, no worries about the delays. `num_consumers` controls how many workers consume the sending queue. Then we have the feature gate for controlling whether that parallelism is actually allowed: if the gate is absent, we keep the default behaviour with a single worker thread, and `max_batch_request_parallelism` only bounds how many requests an oversized batch is split into. Hopefully this helps
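(To make the distinction concrete, here is a hedged sketch of how the two settings could sit side by side; the option names and defaults are taken from this PR's README diff, and the gate itself would be enabled on the collector command line, e.g. `--feature-gates=exporter.prometheusremotewritexporter.EnableMultipleWorkers`.)

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    # Upper bound on the size of one outgoing request; larger payloads are split.
    max_batch_size_bytes: 3000000
    # How many of those split requests may be sent in parallel for a single oversized batch.
    max_batch_request_parallelism: 5
    remote_write_queue:
      enabled: true
      # How many workers consume the sending queue concurrently.
      num_consumers: 5
```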
@@ -66,6 +66,8 @@ The following settings can be optionally configured:
- `max_batch_size_bytes` (default = `3000000` -> `~2.861 mb`): Maximum size of a batch of
  samples to be sent to the remote write endpoint. If the batch size is larger
  than this value, it will be split into multiple batches.
- `max_batch_request_parallelism` (default = `5`): Maximum parallelism allowed for a single request bigger than `max_batch_size_bytes`.
  This configuration is used only when feature gate `exporter.prometheusremotewritexporter.EnableMultipleWorkers` is enabled.
I think the implementation differs from this documentation. It seems like this config option will be used if set regardless of the state of the feature gate.
@Aneurysm9 is correct. Please remove this sentence
Correct. I didn't review the README after all the changes. Please take another look. If it LGTM, let me know so I can squash and rebase.
Description
This MR does the following:
Out-of-order ingestion is no longer an issue now that it is fully supported in Prometheus. Nonetheless, I am setting the default number of workers to 1 to avoid OoO errors with vanilla Prometheus settings.
With a single worker, a collector under heavy load can become blocked. Example: imagine a collector scraping lots of targets; with a slow Prometheus or an unstable network, a single worker can easily bottleneck the off-shipping of samples if retries are enabled.
Link to tracking issue
N/A
Testing
Documentation
Docs auto-updated. README.md is now correct in its explanation of `num_consumers`, since it's no longer hard-coded at 1. Additional docs added.