
Figure out how to performance test the aws-s3 input in SQS mode #76

Open
zmoog opened this issue Feb 3, 2024 · 15 comments


zmoog commented Feb 3, 2024

I want to run performance tests on the aws-s3 input using Elastic Agent 8.11.4 and 8.12.0 to ingest Cloudtrail logs.

The test will run on the following EC2 instance.

| Instance type | Architecture | vCPU | Memory (GiB) | Network Bandwidth (Gbps) |
|---------------|--------------|------|--------------|--------------------------|
| c7g.2xlarge   | arm64        | 8    | 16           | Up to 15                 |

The goals are:

  • Measure performance using the OOTB settings (on both input and output).
  • Tweak the input and output settings to improve performance.
  • Measure the impact of the cache credentials bug when using the role_arn option.

Authentication

We will use an EC2 instance profile only, so there will be no authentication-related options in the integration settings.

Dataset

As a test dataset, I will use an S3 bucket containing 1.2M objects of Cloudtrail logs. Each object is a gzip-compressed file containing 1 to n Cloudtrail events.

I will use a script to:

  • Download the list of S3 objects into a CSV file (once).
  • Read the list of S3 objects from the CSV file, create an S3 notification for each object, and send the notification to an SQS queue.

I will download the S3 object list once and reload it into the queue multiple times as needed. A minimal sketch of such a loader script follows.
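Below is a rough sketch of what such a loader script could look like; the queue URL, bucket name, and CSV column names are placeholders, not the actual values used in this test.

```python
# Hypothetical loader sketch: read S3 object keys from a CSV and push one
# synthetic S3 event notification per object to an SQS queue (boto3 assumed).
import csv
import json

import boto3

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder
BUCKET = "my-cloudtrail-bucket"                                          # placeholder

sqs = boto3.client("sqs")


def send_notifications(csv_path: str) -> None:
    """Send a minimal S3 ObjectCreated notification for every row in the CSV."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):  # assumes columns named "key" and "size"
            # Minimal notification payload; the real input may need more fields
            # (for example awsRegion).
            notification = {
                "Records": [
                    {
                        "eventVersion": "2.1",
                        "eventSource": "aws:s3",
                        "eventName": "ObjectCreated:Put",
                        "s3": {
                            "bucket": {"name": BUCKET, "arn": f"arn:aws:s3:::{BUCKET}"},
                            "object": {"key": row["key"], "size": int(row.get("size", 0))},
                        },
                    }
                ]
            }
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(notification))


if __name__ == "__main__":
    send_notifications("s3_objects_list.csv")
```

For 1.2M notifications, `send_message_batch` (up to 10 messages per call) would reduce the number of SQS API calls considerably.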

Test process

  • Delete the data stream in Elasticsearch.
  • Load 1.2M notifications into the SQS queue.
  • Start the Agent. (A rough sketch of this per-run reset follows.)
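As a sketch of the per-run reset (the data stream name and ES endpoint are assumptions; starting the Agent happens on the EC2 host itself):

```python
# Hypothetical per-run reset, reusing send_notifications() from the loader sketch above.
import requests

ES_URL = "http://localhost:9200"             # placeholder
DATA_STREAM = "logs-aws.cloudtrail-default"  # placeholder data stream name

# 1. Delete the data stream so every run starts from an empty data stream.
requests.delete(f"{ES_URL}/_data_stream/{DATA_STREAM}", auth=("elastic", "changeme"))

# 2. Load the 1.2M notifications into the SQS queue.
# send_notifications("s3_objects_list.csv")

# 3. Start (or restart) the Elastic Agent on the EC2 instance, e.g. via systemd.
```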
zmoog self-assigned this Feb 3, 2024
zmoog added the research label Feb 3, 2024
zmoog added this to Notes Feb 3, 2024
zmoog moved this to In Progress in Notes Feb 3, 2024

zmoog commented Feb 4, 2024

OOTB experience

The OOTB experience is the performance we get from the Agent without any customizations to the input or the output.

8.11.4

For the Agent 8.11.4, we will set:

  • Input
    • queue_url
  • Output
    • No custom settings

Note: the 8.11 default settings come from elastic/elastic-agent#3797

| Version | Maximum Concurrent SQS Messages | Batch size | Worker | Queue size | Flush timeout | CPU Usage | Output Events Total (events/s) |
|---------|---------------------------------|------------|--------|------------|---------------|-----------|--------------------------------|
| 8.11.4  | 5                               | 50         | 1      | 4096       | 1s            | 5%        | ~371                           |
bulk_max_size: 50
worker: 1
queue.mem.events: 4096
queue.mem.flush.min_events: 2048
queue.mem.flush.timeout: 1s
compression: 0
idle_timeout: 60
$ cat logs/elastic-agent-* | jq -r '[.["@timestamp"],.component.id,.monitoring.metrics.filebeat.events.active,.monitoring.metrics.libbeat.pipeline.events.active,.monitoring.metrics.libbeat.output.events.total,.monitoring.metrics.libbeat.output.events.acked,.monitoring.metrics.libbeat.output.events.failed//0] | @tsv' | sort | grep s3

2024-02-04T23:00:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     852     852     9800    9750    0
2024-02-04T23:00:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     910     910     10850   10850   0
2024-02-04T23:01:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1007    1007    11300   11300   0
2024-02-04T23:01:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     861     861     11250   11250   0
2024-02-04T23:02:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1127    1127    11400   11400   0
2024-02-04T23:02:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     996     996     11350   11350   0
2024-02-04T23:03:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1013    1013    11150   11150   0
2024-02-04T23:03:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     785     785     11250   11250   0
2024-02-04T23:04:22.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1069    1069    10850   10850   0
2024-02-04T23:04:52.949Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     956     956     11350   11350   0

8.12.0

For the Agent 8.12.0, we will set:

  • Input
    • queue_url
  • Output
    • Default settings

| Version | Maximum Concurrent SQS Messages | Batch size | Worker | Queue size | Flush timeout | CPU Usage | Preset   | Output Events Total (events/s) |
|---------|---------------------------------|------------|--------|------------|---------------|-----------|----------|--------------------------------|
| 8.12.0  | 5                               | 1600       | 1      | 3200       | 10s           | 2%        | balanced | ~90                            |
bulk_max_size: 1600
worker: 1
queue.mem.events: 3200
queue.mem.flush.min_events: 1600
queue.mem.flush.timeout: 10s
compression: 1
idle_timeout: 3
2024-02-04T23:20:01.321Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1254    1254    2225    2225    0
2024-02-04T23:20:31.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1053    1053    3465    3465    0
2024-02-04T23:21:01.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     836     836     2014    2014    0
2024-02-04T23:21:31.321Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1130    1130    3232    3232    0
2024-02-04T23:22:01.321Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1122    1122    2418    2418    0
2024-02-04T23:22:31.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1060    1060    3190    3190    0
2024-02-04T23:23:01.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1305    1305    2004    2004    0
2024-02-04T23:23:31.322Z        aws-s3-b3107b00-d538-11ed-bb66-095ca05d09b4     1270    1270    3604    3604    0


zmoog commented Feb 4, 2024

The impact of the cache credentials bug

In progress


zmoog commented Feb 4, 2024

Tweak the input and output settings to improve performance

| Version | Maximum Concurrent SQS Messages | CPU Usage | Output Events Total (events/s) |
|---------|---------------------------------|-----------|--------------------------------|
| 8.12.0  | 74                              | 60%       | ~4100                          |
| 8.12.0  | 100                             | 68%       | ~2000-5000                     |
| 8.12.0  | 200                             |           |                                |

Output settings

bulk_max_size: 2048
worker: 8
queue.mem.events: 32768
queue.mem.flush.min_events: 2048
queue.mem.flush.timeout: 1s
idle_connection_timeout: 60s
compression_level: 1

Variants:

  • Reducing the workers from 8 to 4 decreases the throughput from ~4000-5000 EPS to ~2200 EPS.

Observations:

  • Over time, the EPS fluctuates a lot: it can vary between ~2000 and ~5000 EPS without any change to the input/output settings.

CleanShot 2024-02-05 at 09 29 11@2x


zmoog commented Feb 7, 2024

The hypothesis is that s3 objects with only one event significantly impact the input performance. Let's test this by comparing the performance of two batches of s3 objects, one made only of big objects and one only of small objects.

I'm extracting two datasets from the Cloudtrail dataset:

  1. "bigfiles" batch: the biggest 150K s3 objects (with event count in each s3 object ranging from 10102 to 101).
  2. "smallfiles" batch: the smallest 150K s3 objects (with only one event in each s3 object).

I'm loading them into ES using two SQS queues and two data streams.

Query the entire dataset we previously ingested:

GET logs-aws.cloudtrail-allfiles/_search
{
  "size": 0,
  "aggs": {
    "group_by_object_key": {
      "terms": {
        "field": "aws.s3.object.key",
        "size": 150000,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

I saved the two datasets in two CSV files:

  • s3_objects_list.smallfiles.csv
  • s3_objects_list.bigfiles.csv
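
The exact extraction method isn't shown; one possible way to produce these two CSVs from the aggregation above is sketched here. The endpoint, credentials, CSV column names, and the ascending-order query for the smallest objects are all assumptions (and the terms size assumes the cluster allows 150K buckets).

```python
# Hypothetical sketch: run the terms aggregation twice and dump the buckets to CSV.
import csv

import requests

ES_URL = "http://localhost:9200"        # placeholder
INDEX = "logs-aws.cloudtrail-allfiles"
AUTH = ("elastic", "changeme")          # placeholder


def object_key_buckets(order: str) -> list:
    """Return 150K (key, doc_count) buckets ordered by event count per S3 object."""
    query = {
        "size": 0,
        "aggs": {
            "group_by_object_key": {
                "terms": {
                    "field": "aws.s3.object.key",
                    "size": 150000,
                    # "desc" -> objects with the most events, "asc" -> objects with the fewest
                    "order": {"_count": order},
                }
            }
        },
    }
    resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["aggregations"]["group_by_object_key"]["buckets"]


def write_csv(path: str, buckets: list) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["key", "doc_count"])
        writer.writerows((b["key"], b["doc_count"]) for b in buckets)


write_csv("s3_objects_list.bigfiles.csv", object_key_buckets("desc"))
write_csv("s3_objects_list.smallfiles.csv", object_key_buckets("asc"))
```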

I loaded the content of these two files into two SQS queues using the script:

  • mbranca-cloudtrail-logs-smallfiles
  • mbranca-cloudtrail-logs-bigfiles


zmoog commented Feb 7, 2024

The agent is still processing the "bigfiles" batch, but the initial data seems to confirm that performance with big files is higher and stable.

The agent started pulling SQS messages, going straight from zero to 7K EPS, and is continuing at this rate with small fluctuations:

image


zmoog commented Feb 7, 2024

At ~60% of the "bigfiles" batch, the performance starts declining:

CleanShot 2024-02-07 at 09 23 03@2x

The declining trend may be due to the ordering of the s3 objects in the batch.

The ES query returned the list of s3 objects ordered by doc count. The batch contains 150k object keys, with doc counts from 10102 down to 101, ordered by doc count descending.


zmoog commented Feb 7, 2024

Here are the final metrics for the "bigfiles" batch.

image


zmoog commented Feb 7, 2024

I started the "smallfiles" batch.

The input started at ~60 EPS and has continued at this rate for the first few minutes.

image


zmoog commented Feb 7, 2024

The trend remains stable at 60 EPS.

image


zmoog commented Feb 7, 2024

I am trying to adjust the output settings to get better performance from the input.

By reducing the queue.mem.flush.min_events value from 2048 to 100, 10, and finally 1, I get better results: roughly 2x the EPS, from 60 EPS to 120 EPS (with queue.mem.flush.min_events: 1).

bulk_max_size: 2048
worker: 8
queue.mem.events: 32768
queue.mem.flush.min_events: 1
queue.mem.flush.timeout: 1s
idle_connection_timeout: 60s
compression_level: 1

CleanShot 2024-02-07 at 11 40 37@2x


zmoog commented Feb 7, 2024

EPS for the corresponding queue.mem.flush.min_events values:

CleanShot 2024-02-07 at 12 53 45@2x

@andrewkroh

> By reducing the queue.mem.flush.min_events value from 2048 to 100, 10, and finally 1, I get better results: roughly 2x the EPS, from 60 EPS to 120 EPS (with queue.mem.flush.min_events: 1).

Do you observe the mean sqs_message_processing_time metric going lower than your 1s queue.mem.flush.timeout when you do this?


zmoog commented Feb 7, 2024

Let me try to check the Filebeat metrics:

CleanShot 2024-02-07 at 14 54 28@2x

@andrewkroh

With that smaller min_events, it looks like it averages about 411ms per SQS message. That would be the total time for ReceiveMessage, GetObject, publishing to ES, and DeleteMessage. 250ms of that time is the GetObject part.

Screenshot 2024-02-07 at 09 30 56

Some super rough estimation -- with basic extrapolation that means 2.4 events per second if all this work were done in a single thread. So if you wanted to reach 10k EPS then you would need 4166 concurrent SQS messages. But I would imagine we would hit other bottlenecks if you tried to go that large on max_number_of_messages.
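
As a quick sanity check of this back-of-the-envelope math (numbers taken from the comment above; small rounding differences aside):

```python
# Reproduce the rough extrapolation: measured per-message latency -> required concurrency.
avg_sqs_message_time_s = 0.411    # mean processing time per SQS message (ReceiveMessage..DeleteMessage)
events_per_message = 1            # "smallfiles" batch: one event per S3 object
target_eps = 10_000

single_thread_eps = events_per_message / avg_sqs_message_time_s    # ~2.4 events/s in a single thread
required_messages = target_eps / single_thread_eps                 # ~4100-4200 concurrent SQS messages

print(f"{single_thread_eps:.1f} EPS single-threaded -> ~{required_messages:.0f} concurrent SQS messages for {target_eps} EPS")
```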

@cmacknz

cmacknz commented Feb 7, 2024

There are lots of interesting notes in https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

> Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes. The scaling, in the case of both read and write operations, happens gradually and is not instantaneous. While Amazon S3 is scaling to your new higher request rate, you may see some 503 (Slow Down) errors. These errors will dissipate when the scaling is complete. For more information about creating and using prefixes, see Organizing objects using prefixes.
>
> Other applications are sensitive to latency, such as social media messaging applications. These applications can achieve consistent small object latencies (and first-byte-out latencies for larger objects) of roughly 100–200 milliseconds.
>
> Other AWS services can also help accelerate performance for different application architectures. For example, if you want higher transfer rates over a single HTTP connection or single-digit millisecond latencies, use Amazon CloudFront or Amazon ElastiCache for caching with Amazon S3.

This suggests to me that ~200 ms is possibly as good as we will observe when getting a small object. It also suggests that the way around this is to greatly increase concurrency to trigger the auto-scaler.

If we take the maximum for a single GET as 200 ms, that is 5 requests/sec for a single client. At the suggested maximum of 5500 requests/sec within a single prefix, that would require 5500 / 5 = 1100 concurrent requests.

So perhaps making 1000s of concurrent requests is exactly what we should be doing.
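
The same arithmetic as a tiny sketch (the 200 ms latency and 5,500 requests/sec figures come from the AWS docs quoted above):

```python
# Concurrency needed to saturate a single S3 prefix with small-object GETs.
get_latency_s = 0.200               # assumed worst-case small-object GET latency
per_client_rps = 1 / get_latency_s  # = 5 GET requests/sec for one sequential client
prefix_limit_rps = 5500             # documented GET/HEAD requests/sec per S3 prefix

concurrent_requests = prefix_limit_rps / per_client_rps  # = 1100 concurrent requests
print(concurrent_requests)
```

With one event per small object, ~5,500 GETs/sec also caps ingestion at roughly 5,500 EPS if all objects shared a single prefix; per the quoted docs, going beyond that would mean spreading objects across multiple prefixes.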
