
feat/re-allow multiple workers #36134

Open · wants to merge 6 commits into main

Conversation

@bmiguel-teixeira (Contributor) commented Nov 1, 2024

Description

This PR does the following:

  • Re-adds the ability to run multiple workers in this exporter, because:
  1. Out-of-order ingestion is no longer an issue now that it is fully supported in Prometheus. Nonetheless, I am keeping the default at 1 worker to avoid out-of-order errors with vanilla Prometheus settings.
  2. With a single worker, a collector under a large load becomes "blocking". Example: imagine a collector scraping lots of targets; with a slow Prometheus or an unstable network, a single worker can easily bottleneck the off-shipping if retries are enabled. (See the example config sketch right after this list.)
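For illustration, a minimal sketch of the exporter configuration with more than one worker (field names are taken from this exporter's README and the test config later in this thread; the values are illustrative only):

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    remote_write_queue:
      enabled: true
      queue_size: 1000
      num_consumers: 5   # previously hard-coded to 1; the default stays 1 to avoid out-of-order errors on vanilla Prometheus
```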

Link to tracking issue

N/A

Testing

Documentation

Docs auto-updated. README.md is now correct in its explanation of `num_consumers`, since it is no longer hard-coded to 1. Additional docs added.

@dashpole (Contributor) commented Nov 1, 2024

cc @jmichalek132 @ArthurSens

@dashpole added the `enhancement` (New feature or request) label on Nov 1, 2024
@dashpole (Contributor) commented Nov 1, 2024

I'll be OOO for the next few weeks, so I won't be able to review until then.

@ArthurSens (Member) left a comment


Thanks for the PR @bmiguel-teixeira! The contributions sound solid, but I'm a bit concerned that we're doing two separate things in one single PR. Generally that's not good practice, because if we have to revert one thing we end up reverting more than what's needed, not to mention that it makes the PR bigger and a bit harder to review.

Could you choose only one new functionality for this PR and open another one for the other?


Regarding the changes, could you add tests for the telemetry added? Send PRW requests to a mock server and assert that the metric you've added increments as you expected.

For allowing multiple workers, it would be nice to add extra documentation making it clear that out-of-order ingestion needs to be enabled in Prometheus for it to work :)
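For reference, a minimal sketch of the Prometheus setting in question (the same `out_of_order_time_window` option that appears, commented out, in the test config later in this thread; the 10m window is illustrative):

```yaml
# prometheus.yml: enable out-of-order sample ingestion in the TSDB
storage:
  tsdb:
    out_of_order_time_window: 10m
```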

@bmiguel-teixeira (Contributor, Author)

Sure. I will open a dedicated PR with the additional telemetry and keep the queue changes in this one, which already has context.

@bmiguel-teixeira (Contributor, Author)

Hi @ArthurSens,

I removed the additional telemetry; it will be added in a follow-up PR. I also added a bit of documentation to explain the toggle and its use case. Please take a look.

Cheers

@bmiguel-teixeira changed the title from "feat/add outbound metrics and re-allow multiple workers" to "feat/re-allow multiple workers" on Nov 13, 2024
@ArthurSens (Member) left a comment


Well, the code is simple, so it LGTM, but I'm struggling to test this.

Do you have any examples of apps/configs that will generate out-of-order datapoints? All my attempts deliver data in order, so I can't be sure this is working as expected 😅

exporter/prometheusremotewriteexporter/README.md (review thread, outdated, resolved)
@ArthurSens (Member)

(there's a linting failure)

@bmiguel-teixeira (Contributor, Author)

Hi @ArthurSens,

I just applied your recommendation to fix the spelling issues.

Regarding testing and simulating the out-of-order issues locally, here is my setup.

Prometheus Config

```yaml
global:
  scrape_interval: 1s
  evaluation_interval: 1s

#storage:
#  tsdb:
#    out_of_order_time_window: 10m
```

Otel Config

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'node-1'
        scrape_interval: 1s
        static_configs:
          - targets: ['127.0.0.1:8081']

exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    remote_write_queue:
      enabled: true
      num_consumers: XXX
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 5s
      max_elapsed_time: 30s

service:
  telemetry:
    logs:
      level: "DEBUG"

  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [prometheusremotewrite]
```

Scenario 1 (Vanilla)

  • Prometheus with NO out-of-order window.
  • PrometheusRemoteWrite with 1 consumer (or the queue disabled)
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors
  4. Restart Prometheus (ensure you still have the old data)

Outcome: metrics are re-ingested, in order, with the single worker. ALL GOOD.

Scenario 2 (PrometheusRemoteWrite 5 consumers + Vanilla Prometheus)

  • Prometheus with NO out-of-order window.
  • PrometheusRemoteWrite with 5 consumers
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors (and for items to build up in the queue)
  4. Restart Prometheus (ensure you still have the old data)

Outcome: out-of-order errors in Prometheus after it boots back up, since samples are retried in no specific order.

Scenario 3 (PrometheusRemoteWrite 5 consumers + Prometheus with out-of-order window)

  • Prometheus WITH the out-of-order window set to 10m (i.e. with the storage block that is commented out in the config above enabled)
  • PrometheusRemoteWrite with 5 consumers
  1. Set up the environment (let it run for a couple of minutes)
  2. Stop Prometheus
  3. Wait a couple of seconds for timeout errors (and for items to build up in the queue)
  4. Restart Prometheus (ensure you still have the old data)

Outcome: no out-of-order issues, because even though we have multiple workers, Prometheus can accept mixed/old samples and update the TSDB up to 10 minutes "ago".

Let me know if you need any more info.

Cheers.

@ArthurSens (Member) left a comment


Perfect; thank you for the instructions! I just tested it, and it works perfectly. There's just one CI failure that we need to fix before approving here.

Could you run `make generate` and commit the changes?

@bmiguel-teixeira (Contributor, Author)

@ArthurSens All done!

@ArthurSens (Member)

@edma2 found a bug when batching time series with multiple workers. Here is a PR trying to fix it: #36524

Should we solve the bug before re-allowing multiple workers?

@ArthurSens (Member)

Hey @bmiguel-teixeira, once #36601 is merged, I think we should be safe to proceed with multiple workers again :)

Could you rebase your PR on top of main once that happens?

@ArthurSens (Member) left a comment


While reading the changes in the README, I'm confused about the difference between `num_consumers` and `max_batch_request_parallelism`.

Both mention that they configure the number of workers... so do they fight each other to decide the number of workers, or do they complement each other?


By the way, thanks for your patience here; a lot was done to make this possible :)

exporter/prometheusremotewriteexporter/README.md (review thread, outdated, resolved)
@github-actions (bot) removed the `Stale` label on Jan 4, 2025
@ArthurSens (Member) left a comment


The README is still a bit confusing to me, but I'm fine with merging if the other codeowners are OK with it.

@bmiguel-teixeira (Contributor, Author)

Hi @ArthurSens

No worries about the delays.
The plan, as discussed with @dashpole, is to start transitioning away from the hard-coded values and toward being able to configure workers and single-request parallelism (for large payloads split from a single request) individually.

If `max_batch_request_parallelism` is defined, it is the value used for single-request parallelism regardless of anything else. Users can start moving to this setting from this PR onward to keep the same behaviour.

Then we have a feature gate for controlling the parallelism.
If the MultipleWorkers feature gate is enabled, then `num_consumers` enables multiple workers/consumers reading from the metrics queue, and single-request parallelism can be set with `max_batch_request_parallelism`.

If the gate is absent, we keep the default behaviour: a single worker thread, with `num_consumers` controlling single-request parallelism.

Hopefully this helps; a rough config sketch follows below.
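A sketch of that split, based on the behaviour described above (field names come from this exporter's README, the values are illustrative, and the feature gate is assumed to be enabled, e.g. via the collector's `--feature-gates` flag):

```yaml
exporters:
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    # When set, caps the parallelism used to send a single request that was split
    # because it exceeded max_batch_size_bytes (applies regardless of the gate).
    max_batch_request_parallelism: 5
    remote_write_queue:
      enabled: true
      queue_size: 1000
      # With the MultipleWorkers feature gate enabled, this controls how many workers
      # consume from the sending queue; without the gate, a single worker is used.
      num_consumers: 3
```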

@@ -66,6 +66,8 @@ The following settings can be optionally configured:
- `max_batch_size_bytes` (default = `3000000` -> `~2.861 mb`): Maximum size of a batch of
samples to be sent to the remote write endpoint. If the batch size is larger
than this value, it will be split into multiple batches.
- `max_batch_request_parallelism` (default = `5`): Maximum parallelism allowed for a single request bigger than `max_batch_size_bytes`.
This configuration is used only when feature gate `exporter.prometheusremotewritexporter.EnableMultipleWorkers` is enabled.
A Member commented:

I think the implementation differs from this documentation. It seems like this config option will be used if set regardless of the state of the feature gate.

A Contributor commented:

@Aneurysm9 is correct. Please remove this sentence.

The Contributor Author commented:

Correct. I didn't review the README after all the changes. Please take another look. If it LGTM, let me know and I will squash and rebase.

5 participants