
Figure out how to performance test the aws-s3 input in SQS mode #76

Open
zmoog opened this issue Feb 3, 2024 · 15 comments


zmoog commented Feb 3, 2024

I want to run performance tests on the aws-s3 input using Elastic Agent 8.11.4 and 8.12.0 to ingest Cloudtrail logs.

The test will run on the following EC2 instance.

| Instance type | Architecture | vCPU | Memory (GiB) | Network Bandwidth (Gbps) |
|---------------|--------------|------|--------------|--------------------------|
| c7g.2xlarge   | arm64        | 8    | 16           | Up to 15                 |

The goals are:

  • Measure performance using the OOTB settings (on both input and output).
  • Tweak the input and output settings to improve performance.
  • Measure the impact of the cache credentials bug when using the role_arn option.

Authentication

We will use an EC2 instance profile only, so there will be no authentication-related options in the integration settings.

Dataset

As a test dataset, I will use an S3 bucket containing 1.2M objects of Cloudtrail logs. Each object is a gzip-compressed file containing 1 to n Cloudtrail events.

I will use a script to:

  • Download the list of S3 objects into a CSV file (once).
  • Read the list of S3 objects from the CSV file, create an S3 notification for each object, and send the notification to an SQS queue.

I will download the S3 object list once and reload it into the queue multiple times as needed. A minimal sketch of such a loader script follows.
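Below is a rough sketch of what such a loader script could look like; the queue URL, bucket name, and CSV column names are placeholders, not the actual values used in this test.

```python
# Hypothetical loader sketch: read S3 object keys from a CSV and push one
# synthetic S3 event notification per object to an SQS queue (boto3 assumed).
import csv
import json

import boto3

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder
BUCKET = "my-cloudtrail-bucket"                                          # placeholder

sqs = boto3.client("sqs")


def send_notifications(csv_path: str) -> None:
    """Send a minimal S3 ObjectCreated notification for every row in the CSV."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):  # assumes columns named "key" and "size"
            # Minimal notification payload; the real input may need more fields
            # (for example awsRegion).
            notification = {
                "Records": [
                    {
                        "eventVersion": "2.1",
                        "eventSource": "aws:s3",
                        "eventName": "ObjectCreated:Put",
                        "s3": {
                            "bucket": {"name": BUCKET, "arn": f"arn:aws:s3:::{BUCKET}"},
                            "object": {"key": row["key"], "size": int(row.get("size", 0))},
                        },
                    }
                ]
            }
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(notification))


if __name__ == "__main__":
    send_notifications("s3_objects_list.csv")
```

For 1.2M notifications, `send_message_batch` (up to 10 messages per call) would reduce the number of SQS API calls considerably.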

Test process

  • Delete the data stream in Elasticsearch.
  • Load 1.2M notifications into the SQS queue.
  • Start the Agent. (A rough sketch of this per-run reset follows.)
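As a sketch of the per-run reset (the data stream name and ES endpoint are assumptions; starting the Agent happens on the EC2 host itself):

```python
# Hypothetical per-run reset, reusing send_notifications() from the loader sketch above.
import requests

ES_URL = "http://localhost:9200"             # placeholder
DATA_STREAM = "logs-aws.cloudtrail-default"  # placeholder data stream name

# 1. Delete the data stream so every run starts from an empty data stream.
requests.delete(f"{ES_URL}/_data_stream/{DATA_STREAM}", auth=("elastic", "changeme"))

# 2. Load the 1.2M notifications into the SQS queue.
# send_notifications("s3_objects_list.csv")

# 3. Start (or restart) the Elastic Agent on the EC2 instance, e.g. via systemd.
```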
zmoog self-assigned this Feb 3, 2024
zmoog added the research label Feb 3, 2024
zmoog added this to Notes Feb 3, 2024
zmoog moved this to In Progress in Notes Feb 3, 2024

zmoog commented Feb 4, 2024

OOTB experience

The OOTB experience is the performance we get from the Agent without any customizations to the input or the output.

8.11.4

For the Agent 8.11.4, we will set:

  • Input
    • queue_url
  • Output
    • No custom settings

Note: the 8.11 default settings come from elastic/elastic-agent#3797

| Version | Maximum Concurrent SQS Messages | Batch size | Worker | Queue size | Flush timeout | CPU Usage | Output Events Total (events/s) |
|---------|---------------------------------|------------|--------|------------|---------------|-----------|--------------------------------|
| 8.11.4  | 5                               | 50         | 1      | 4096       | 1s            | 5%        | ~371                           |
bulk_max_size: 50
worker: 1
queue.mem.events: 4096
queue.mem.flush.min_events: 2048
queue.mem.flush.timeout: 1s
compression: 0
idle_timeout: 60
$ cat logs/elastic-agent-* | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3

2024-02-04T23:00:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     852     852     9800    9750    0
2024-02-04T23:00:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     910     910     10850   10850   0
2024-02-04T23:01:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1007    1007    11300   11300   0
2024-02-04T23:01:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     861     861     11250   11250   0
2024-02-04T23:02:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1127    1127    11400   11400   0
2024-02-04T23:02:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     996     996     11350   11350   0
2024-02-04T23:03:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1013    1013    11150   11150   0
2024-02-04T23:03:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     785     785     11250   11250   0
2024-02-04T23:04:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1069    1069    10850   10850   0
2024-02-04T23:04:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     956     956     11350   11350   0

8.12.0

For the Agent 8.12.0, we will set:

  • Input
    • queue_url
  • Output
    • Default settings

| Version | Maximum Concurrent SQS Messages | Batch size | Worker | Queue size | Flush timeout | CPU Usage | Preset   | Output Events Total (events/s) |
|---------|---------------------------------|------------|--------|------------|---------------|-----------|----------|--------------------------------|
| 8.12.0  | 5                               | 1600       | 1      | 3200       | 10s           | 2%        | balanced | ~90                            |
bulk_max_size: 1600
worker: 1
queue.mem.events: 3200
queue.mem.flush.min_events: 1600
queue.mem.flush.timeout: 10s
compression: 1
idle_timeout: 3
2024-02-04T23:20:01.321Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1254    1254    2225    2225    0
2024-02-04T23:20:31.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1053    1053    3465    3465    0
2024-02-04T23:21:01.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     836     836     2014    2014    0
2024-02-04T23:21:31.321Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1130    1130    3232    3232    0
2024-02-04T23:22:01.321Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1122    1122    2418    2418    0
2024-02-04T23:22:31.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1060    1060    3190    3190    0
2024-02-04T23:23:01.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1305    1305    2004    2004    0
2024-02-04T23:23:31.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1270    1270    3604    3604    0


zmoog commented Feb 4, 2024

The impact of the cache credentials bug

In progress


zmoog commented Feb 4, 2024

Tweak the input and output settings to improve performance

| Version | Maximum Concurrent SQS Messages | CPU Usage | Output Events Total (events/s) |
|---------|---------------------------------|-----------|--------------------------------|
| 8.12.0  | 74                              | 60%       | ~4100                          |
| 8.12.0  | 100                             | 68%       | ~2000-5000                     |
| 8.12.0  | 200                             |           |                                |

Output settings

bulk_max_size: 2048
worker: 8
queue.mem.events: 32768
queue.mem.flush.min_events: 2048
queue.mem.flush.timeout: 1s
idle_connection_timeout: 60s
compression_level: 1

Variants:

  • Reducing the workers from 8 to 4 decreases the throughput from ~4000-5000 EPS to ~2200 EPS.

Observations:

  • Over time, the EPS fluctuates a lot: it can vary between ~2000 and ~5000 EPS without any change to the input/output settings.

CleanShot 2024-02-05 at 09 29 11@2x


zmoog commented Feb 7, 2024

The hypothesis is that s3 objects with only one event significantly impact the input performance. Let's test this by comparing the performance of two batches of s3 objects, one made only of big objects and one only of small objects.

I'm extracting two datasets from the Cloudtrail dataset:

  1. "bigfiles" batch: the biggest 150K s3 objects (with event count in each s3 object ranging from 10102 to 101).
  2. "smallfiles" batch: the smallest 150K s3 objects (with only one event in each s3 object).

I'm loading them into ES using two SQS queues and two data streams.

Query the entire dataset we previously ingested:

GET logs-aws.cloudtrail-allfiles/_search
{
  "size": 0,
  "aggs": {
    "group_by_object_key": {
      "terms": {
        "field": "aws.s3.object.key",
        "size": 150000,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

I saved the two datasets in two CSV files:

  • s3_objects_list.smallfiles.csv
  • s3_objects_list.bigfiles.csv
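
The exact extraction method isn't shown; one possible way to produce these two CSVs from the aggregation above is sketched here. The endpoint, credentials, CSV column names, and the ascending-order query for the smallest objects are all assumptions (and the terms size assumes the cluster allows 150K buckets).

```python
# Hypothetical sketch: run the terms aggregation twice and dump the buckets to CSV.
import csv

import requests

ES_URL = "http://localhost:9200"        # placeholder
INDEX = "logs-aws.cloudtrail-allfiles"
AUTH = ("elastic", "changeme")          # placeholder


def object_key_buckets(order: str) -> list:
    """Return 150K (key, doc_count) buckets ordered by event count per S3 object."""
    query = {
        "size": 0,
        "aggs": {
            "group_by_object_key": {
                "terms": {
                    "field": "aws.s3.object.key",
                    "size": 150000,
                    # "desc" -> objects with the most events, "asc" -> objects with the fewest
                    "order": {"_count": order},
                }
            }
        },
    }
    resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["aggregations"]["group_by_object_key"]["buckets"]


def write_csv(path: str, buckets: list) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["key", "doc_count"])
        writer.writerows((b["key"], b["doc_count"]) for b in buckets)


write_csv("s3_objects_list.bigfiles.csv", object_key_buckets("desc"))
write_csv("s3_objects_list.smallfiles.csv", object_key_buckets("asc"))
```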

I loaded the content of these two files into two SQS queues using the script:

  • mbranca-cloudtrail-logs-smallfiles
  • mbranca-cloudtrail-logs-bigfiles


zmoog commented Feb 7, 2024

The agent is still processing the "bigfiles" batch, but the initial data seems to confirm that performance with big files is higher and stable.

The agent started pulling SQS messages, going straight from zero to 7K EPS, and is continuing at this rate with small fluctuations:

image


zmoog commented Feb 7, 2024

At ~60% of the "bigfiles" batch, the performance starts declining:

CleanShot 2024-02-07 at 09 23 03@2x

The declining trend may be due to the ordering of the s3 objects in the batch.

The ES query returned the list of s3 objects ordered by doc count. The batch contains 150k object keys, with doc counts from 10102 down to 101, ordered by doc count descending.


zmoog commented Feb 7, 2024

Here are the final metrics for the "bigfiles" batch.

image


zmoog commented Feb 7, 2024

I started the "smallfiles" batch.

The input started at ~60 EPS and has continued at this rate for the first few minutes.

image


zmoog commented Feb 7, 2024

The trend remains stable at 60 EPS.

image


zmoog commented Feb 7, 2024

I am trying to adjust the output settings to get better performance from the input.

By reducing the queue.mem.flush.min_events value from 2048 to 100, 10, and finally 1, I get better results: roughly 2x the EPS, from 60 EPS to 120 EPS (with queue.mem.flush.min_events: 1).

bulk_max_size: 2048
worker: 8
queue.mem.events: 32768
queue.mem.flush.min_events: 1
queue.mem.flush.timeout: 1s
idle_connection_timeout: 60s
compression_level: 1

CleanShot 2024-02-07 at 11 40 37@2x


zmoog commented Feb 7, 2024

EPS for the corresponding queue.mem.flush.min_events values:

CleanShot 2024-02-07 at 12 53 45@2x

@andrewkroh

> By reducing the queue.mem.flush.min_events value from 2048 to 100, 10, and finally 1, I get better results: roughly 2x the EPS, from 60 EPS to 120 EPS (with queue.mem.flush.min_events: 1).

Do you observe the mean sqs_message_processing_time metric going lower than your 1s queue.mem.flush.timeout when you do this?


zmoog commented Feb 7, 2024

Let me try to check the Filebeat metrics:

CleanShot 2024-02-07 at 14 54 28@2x

@andrewkroh

With that smaller min_events, it looks like it averages about 411ms per SQS message. That would be the total time for ReceiveMessage, GetObject, publishing to ES, and DeleteMessage. 250ms of that time is the GetObject part.

Screenshot 2024-02-07 at 09 30 56

Some super rough estimation -- with basic extrapolation that means 2.4 events per second if all this work were done in a single thread. So if you wanted to reach 10k EPS then you would need 4166 concurrent SQS messages. But I would imagine we would hit other bottlenecks if you tried to go that large on max_number_of_messages.
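
As a quick sanity check of this back-of-the-envelope math (numbers taken from the comment above; small rounding differences aside):

```python
# Reproduce the rough extrapolation: measured per-message latency -> required concurrency.
avg_sqs_message_time_s = 0.411    # mean processing time per SQS message (ReceiveMessage..DeleteMessage)
events_per_message = 1            # "smallfiles" batch: one event per S3 object
target_eps = 10_000

single_thread_eps = events_per_message / avg_sqs_message_time_s    # ~2.4 events/s in a single thread
required_messages = target_eps / single_thread_eps                 # ~4100-4200 concurrent SQS messages

print(f"{single_thread_eps:.1f} EPS single-threaded -> ~{required_messages:.0f} concurrent SQS messages for {target_eps} EPS")
```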

@cmacknz

cmacknz commented Feb 7, 2024

There are lots of interesting notes in https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

> Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes. The scaling, in the case of both read and write operations, happens gradually and is not instantaneous. While Amazon S3 is scaling to your new higher request rate, you may see some 503 (Slow Down) errors. These errors will dissipate when the scaling is complete. For more information about creating and using prefixes, see Organizing objects using prefixes.
>
> Other applications are sensitive to latency, such as social media messaging applications. These applications can achieve consistent small object latencies (and first-byte-out latencies for larger objects) of roughly 100–200 milliseconds.
>
> Other AWS services can also help accelerate performance for different application architectures. For example, if you want higher transfer rates over a single HTTP connection or single-digit millisecond latencies, use Amazon CloudFront or Amazon ElastiCache for caching with Amazon S3.

This suggests to me that ~200 ms is possibly as good as we will observe when getting a small object. It also suggests that the way around this is to greatly increase concurrency to trigger the auto-scaler.

If we take the maximum for a single GET as 200 ms, that is 5 requests/sec for a single client. At the suggested maximum of 5500 requests/sec within a single prefix, that would require 5500 / 5 = 1100 concurrent requests.

So perhaps making 1000s of concurrent requests is exactly what we should be doing.
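
The same arithmetic as a tiny sketch (the 200 ms latency and 5,500 requests/sec figures come from the AWS docs quoted above):

```python
# Concurrency needed to saturate a single S3 prefix with small-object GETs.
get_latency_s = 0.200               # assumed worst-case small-object GET latency
per_client_rps = 1 / get_latency_s  # = 5 GET requests/sec for one sequential client
prefix_limit_rps = 5500             # documented GET/HEAD requests/sec per S3 prefix

concurrent_requests = prefix_limit_rps / per_client_rps  # = 1100 concurrent requests
print(concurrent_requests)
```

With one event per small object, ~5,500 GETs/sec also caps ingestion at roughly 5,500 EPS if all objects shared a single prefix; per the quoted docs, going beyond that would mean spreading objects across multiple prefixes.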
