
feat/re-allow multiple workers #36134

Open · wants to merge 6 commits into main

Conversation

@bmiguel-teixeira (Contributor) commented Nov 1, 2024

Description

This PR does the following:

  • Re-adds the ability to run multiple workers in this exporter, because:
  1. Out-of-order ingestion is no longer an issue now that it is fully supported in Prometheus. Nonetheless, I am keeping the default at 1 worker to avoid out-of-order errors with vanilla Prometheus settings.
  2. With a single worker, a collector under a large load becomes "blocking". Example: imagine a collector scraping lots of targets; with a slow Prometheus or an unstable network, a single worker can easily bottleneck the off-shipping if retries are enabled. (See the example config sketch right after this list.)
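For illustration, a minimal sketch of the exporter configuration with more than one worker (field names are taken from this exporter's README and the test config later in this thread; the values are illustrative only):

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    remote_write_queue:
      enabled: true
      queue_size: 1000
      num_consumers: 5   # previously hard-coded to 1; the default stays 1 to avoid out-of-order errors on vanilla Prometheus
```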

Link to tracking issue

N/A

Testing

Documentation

Docs auto-updated. README.md is now correct in its explanation of `num_consumers`, since it is no longer hard-coded to 1. Additional docs added.

@dashpole (Contributor) commented Nov 1, 2024

cc @jmichalek132 @ArthurSens

@dashpole added the `enhancement` (New feature or request) label on Nov 1, 2024
@dashpole (Contributor) commented Nov 1, 2024

I'll be OOO for the next few weeks, so I won't be able to review until then.

@ArthurSens (Member) left a comment


Thanks for the PR @bmiguel-teixeira! The contributions sound solid, but I'm a bit concerned that we're doing two separate things in one single PR. Generally that's not good practice, because if we have to revert one thing we end up reverting more than what's needed, not to mention that it makes the PR bigger and a bit harder to review.

Could you choose only one new functionality for this PR and open another one for the other?


Regarding the changes, could you add tests for the telemetry added? Send PRW requests to a mock server and assert that the metric you've added increments as you expected.

For allowing multiple workers, it would be nice to add extra documentation making it clear that out-of-order ingestion needs to be enabled in Prometheus for it to work :)
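For reference, a minimal sketch of the Prometheus setting in question (the same `out_of_order_time_window` option that appears, commented out, in the test config later in this thread; the 10m window is illustrative):

```yaml
# prometheus.yml: enable out-of-order sample ingestion in the TSDB
storage:
  tsdb:
    out_of_order_time_window: 10m
```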

@bmiguel-teixeira (Contributor, Author)

Sure. I will open a dedicated PR with the additional telemetry and keep the queue changes in this one, which already has context.

@bmiguel-teixeira (Contributor, Author)

Hi @ArthurSens,

I removed the additional telemetry; it will be added in a follow-up PR. I also added a bit of documentation to explain the toggle and its use case. Please take a look.

Cheers

@bmiguel-teixeira changed the title from "feat/add outbound metrics and re-allow multiple workers" to "feat/re-allow multiple workers" on Nov 13, 2024
@ArthurSens (Member) left a comment


Well, the code is simple, so it LGTM, but I'm struggling to test this.

Do you have any examples of apps/configs that will generate out-of-order datapoints? All my attempts deliver data in order, so I can't be sure this is working as expected 😅

exporter/prometheusremotewriteexporter/README.md (review thread, outdated, resolved)
@ArthurSens (Member)

(there's a linting failure)

@bmiguel-teixeira (Contributor, Author)

Hi @ArthurSens,

I just applied your recommendation to fix the spelling issues.

Regarding testing and simulating the out-of-order issues locally, here is my setup.

Prometheus Config

```yaml
global:
  scrape_interval: 1s
  evaluation_interval: 1s

#storage:
#  tsdb:
#    out_of_order_time_window: 10m
```

Otel Config

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'node-1'
        scrape_interval: 1s
        static_configs:
          - targets: ['127.0.0.1:8081']

exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    remote_write_queue:
      enabled: true
      num_consumers: XXX
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 5s
      max_elapsed_time: 30s

service:
  telemetry:
    logs:
      level: "DEBUG"

  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [prometheusremotewrite]
```

Scenario 1 (Vanilla)

  • Prometheus with NO out-of-order window.
  • PrometheusRemoteWrite with 1 consumer (or the queue disabled)
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors
  4. Restart Prometheus (ensure you still have the old data)

Outcome: metrics are re-ingested, in order, with the single worker. ALL GOOD.

Scenario 2 (PrometheusRemoteWrite 5 consumers + Vanilla Prometheus)

  • Prometheus with NO out-of-order window.
  • PrometheusRemoteWrite with 5 consumers
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors (and for items to build up in the queue)
  4. Restart Prometheus (ensure you still have the old data)

Outcome: out-of-order errors in Prometheus after it boots back up, since samples are retried in no specific order.

Scenario 3 (PrometheusRemoteWrite 5 consumers + Prometheus with out-of-order window)

  • Prometheus WITH the out-of-order window set to 10m (i.e. with the storage block that is commented out in the config above enabled)
  • PrometheusRemoteWrite with 5 consumers
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors (and for items to build up in the queue)
  4. Restart Prometheus (ensure you still have the old data)

Outcome: no out-of-order issues, because even though we have multiple workers, Prometheus can accept mixed/old samples and update the TSDB up to 10 minutes "ago".

Let me know if you need any more info.

Cheers.

@ArthurSens (Member) left a comment


Perfect; thank you for the instructions! I just tested it, and it works perfectly. There's just one CI failure that we need to fix before approving here.

Could you run `make generate` and commit the changes?

@bmiguel-teixeira (Contributor, Author)

@ArthurSens All done!

@ArthurSens (Member)

@edma2 found a bug when batching time series with multiple workers. Here is a PR trying to fix it: #36524

Should we solve the bug before re-allowing multiple workers?

@ArthurSens (Member)

Hey @bmiguel-teixeira, once #36601 is merged, I think we should be safe to proceed with multiple workers again :)

Could you rebase your PR on top of main once that happens?

@ArthurSens (Member) left a comment


While reading the changes in the README, I'm confused about the difference between `num_consumers` and `max_batch_request_parallelism`.

Both mention that they configure the number of workers... so do they fight each other to decide the number of workers, or do they complement each other?


By the way, thanks for your patience here; a lot was done to make this possible :)

exporter/prometheusremotewriteexporter/README.md (review thread, outdated, resolved)
@github-actions (bot) removed the `Stale` label on Jan 4, 2025
@ArthurSens (Member) left a comment


The README is still a bit confusing to me, but I'm fine with merging if the other codeowners are OK with it.

@bmiguel-teixeira (Contributor, Author)

Hi @ArthurSens

No worries about the delays.
The plan, as discussed with @dashpole, is to start transitioning away from the hard-coded values and toward being able to configure workers and single-request parallelism (for large payloads split from a single request) individually.

If `max_batch_request_parallelism` is defined, it is the value used for single-request parallelism regardless of anything else. Users can start moving to this setting from this PR onward to keep the same behaviour.

Then we have a feature gate for controlling the parallelism.
If the MultipleWorkers feature gate is enabled, then `num_consumers` enables multiple workers/consumers reading from the metrics queue, and single-request parallelism can be set with `max_batch_request_parallelism`.

If the gate is absent, we keep the default behaviour: a single worker thread, with `num_consumers` controlling single-request parallelism.

Hopefully this helps; a rough config sketch follows below.
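A sketch of that split, based on the behaviour described above (field names come from this exporter's README, the values are illustrative, and the feature gate is assumed to be enabled, e.g. via the collector's `--feature-gates` flag):

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    # When set, caps the parallelism used to send a single request that was split
    # because it exceeded max_batch_size_bytes (applies regardless of the gate).
    max_batch_request_parallelism: 5
    remote_write_queue:
      enabled: true
      queue_size: 1000
      # With the MultipleWorkers feature gate enabled, this controls how many workers
      # consume from the sending queue; without the gate, a single worker is used.
      num_consumers: 3
```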

@@ -66,6 +66,8 @@ The following settings can be optionally configured:
- `max_batch_size_bytes` (default = `3000000` -> `~2.861 mb`): Maximum size of a batch of
samples to be sent to the remote write endpoint. If the batch size is larger
than this value, it will be split into multiple batches.
- `max_batch_request_parallelism` (default = `5`): Maximum parallelism allowed for a single request bigger than `max_batch_size_bytes`.
This configuration is used only when feature gate `exporter.prometheusremotewritexporter.EnableMultipleWorkers` is enabled.
A Member commented:

I think the implementation differs from this documentation. It seems like this config option will be used if set regardless of the state of the feature gate.

A Contributor commented:

@Aneurysm9 is correct. Please remove this sentence.

The Contributor Author commented:

Correct. I didn't review the README after all the changes. Please take another look. If it LGTM, let me know and I will squash and rebase.

5 participants