Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Producer: Aggregate by Predicted Shard #57

Open
etspaceman opened this issue Feb 22, 2023 · 0 comments
Open

Producer: Aggregate by Predicted Shard #57

etspaceman opened this issue Feb 22, 2023 · 0 comments

Comments

@etspaceman
Copy link
Owner

etspaceman commented Feb 22, 2023

The KPL will attempt to leverage a shard-map and aggregate records that share a predicted shard. The logic for this is as follows:

  • If the shard map is not available for whatever reason (e.g. an error occurred), no aggregation is performed
  • If the shard map is stale, the KPL will produce a batch and detect if the predicted shard differed from the resulting shard. If it detects a difference, the shard map is refreshed and the records are resent.

The KCL will silently discard data in aggregated records if they do not have a calculated hash-key within the received shard's range. So re-sending the records, aggregated against the proper shard-id, is vital to receiving all records. This could result in duplicates though, as some records of the original request may have been successfully produced.

An example: consider a scale up operation, where new shards are created, and hash ranges for the new shards are smaller than the previous shard map. It is possible that some of the records of the original try could have successfully been consumed by the KCL, as the calculated hash key could have fallen into that smaller range. Others may not have. So this would mean a retry of the request would result in duplicates being consumed by the KCL.

This also seems like quite a bit of overhead to save a few requests. So the default today in kinesis4cats's Producer differs from the KPL, in that it only aggregates records that share a partition key. This means that the shard map can be stale or empty, and aggregation can still occur safely.

kinesis4cats should also offer the capabilities that the KPL offers, as an opt-in feature. This issue is for that feature implementation.

References:

https://github.com/awslabs/amazon-kinesis-producer/pull/277/files#diff-d1cbd3793299fd3c06ee3b0232d378f6341e9b5eb64eaa9b7b2094a9c911431fR211

https://github.com/awslabs/amazon-kinesis-producer/blob/master/aws/kinesis/core/aggregator.h

https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/retrieval/AggregatorUtil.java#L145-L150

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant