You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The KPL will attempt to leverage a shard-map and aggregate records that share a predicted shard. The logic for this is as follows:
If the shard map is not available for whatever reason (e.g. an error occurred), no aggregation is performed
If the shard map is stale, the KPL will produce a batch and detect if the predicted shard differed from the resulting shard. If it detects a difference, the shard map is refreshed and the records are resent.
The KCL will silently discard data in aggregated records if they do not have a calculated hash-key within the received shard's range. So re-sending the records, aggregated against the proper shard-id, is vital to receiving all records. This could result in duplicates though, as some records of the original request may have been successfully produced.
An example: consider a scale up operation, where new shards are created, and hash ranges for the new shards are smaller than the previous shard map. It is possible that some of the records of the original try could have successfully been consumed by the KCL, as the calculated hash key could have fallen into that smaller range. Others may not have. So this would mean a retry of the request would result in duplicates being consumed by the KCL.
This also seems like quite a bit of overhead to save a few requests. So the default today in kinesis4cats's Producer differs from the KPL, in that it only aggregates records that share a partition key. This means that the shard map can be stale or empty, and aggregation can still occur safely.
kinesis4cats should also offer the capabilities that the KPL offers, as an opt-in feature. This issue is for that feature implementation.
The KPL will attempt to leverage a shard-map and aggregate records that share a predicted shard. The logic for this is as follows:
The KCL will silently discard data in aggregated records if they do not have a calculated hash-key within the received shard's range. So re-sending the records, aggregated against the proper shard-id, is vital to receiving all records. This could result in duplicates though, as some records of the original request may have been successfully produced.
An example: consider a scale up operation, where new shards are created, and hash ranges for the new shards are smaller than the previous shard map. It is possible that some of the records of the original try could have successfully been consumed by the KCL, as the calculated hash key could have fallen into that smaller range. Others may not have. So this would mean a retry of the request would result in duplicates being consumed by the KCL.
This also seems like quite a bit of overhead to save a few requests. So the default today in kinesis4cats's Producer differs from the KPL, in that it only aggregates records that share a partition key. This means that the shard map can be stale or empty, and aggregation can still occur safely.
kinesis4cats should also offer the capabilities that the KPL offers, as an opt-in feature. This issue is for that feature implementation.
References:
https://github.com/awslabs/amazon-kinesis-producer/pull/277/files#diff-d1cbd3793299fd3c06ee3b0232d378f6341e9b5eb64eaa9b7b2094a9c911431fR211
https://github.com/awslabs/amazon-kinesis-producer/blob/master/aws/kinesis/core/aggregator.h
https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/retrieval/AggregatorUtil.java#L145-L150
The text was updated successfully, but these errors were encountered: