[SUPPORT] Hoodie Insert operation failing while appending to record index log file #12320

Open
dataproblems opened this issue Nov 23, 2024 · 4 comments
Labels
index metadata metadata table priority:critical production down; pipelines stalled; Need help asap.

Comments

@dataproblems

Describe the problem you faced

I'm creating a table using the INSERT operation with the record-level index enabled. The data and the partitions are written to S3, but the job then fails while appending records to the record index log file.

To Reproduce

Steps to reproduce the behavior:

  1. df.write.format("hudi").options(...).save("...")

Expected behavior

I should be able to create the record-level index.

Environment Description

  • Hudi version : 0.15.0

  • Spark version : 3.4

  • Hive version : N/A

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) :

Additional context

Hoodie options

  Map(
    DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
    HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME.key() -> "snappy",
    HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> "2147483648", // 2 GiB
    "hoodie.parquet.small.file.limit" -> "1073741824", // 1 GiB
    HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key() -> "true",
    HoodieIndexConfig.INDEX_TYPE.key() -> "RECORD_INDEX",
    "hoodie.metadata.enable" -> "true",
    "hoodie.datasource.write.hive_style_partitioning" -> "true",
    "hoodie.metadata.record.index.enable" -> "true",
    HoodieTableConfig.POPULATE_META_FIELDS.key() -> "true",
    HoodieWriteConfig.MARKERS_TYPE.key() -> "DIRECT",
    DataSourceWriteOptions.OPERATION.key() -> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
    "hoodie.metadata.record.index.max.filegroup.count" -> "100000",
    "hoodie.metadata.record.index.min.filegroup.count" -> "7500" // I have roughly 10 TB of data and am trying to keep each record index log file around 400 MB.
  )

Stacktrace

Caused by: org.apache.hudi.exception.HoodieAppendException: Failed while appending records to s3://SomeS3Path/.hoodie/metadata/record_index/.record-index-0195-0_00000000000000012.log.2_912-39-236765
	at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:466)
	at org.apache.hudi.io.HoodieAppendHandle.flushToDiskIfRequired(HoodieAppendHandle.java:599)
	at org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:428)
	at org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:90)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:337)
	... 29 more
Caused by: org.apache.hudi.exception.HoodieIOException: IOException serializing records
	at org.apache.hudi.common.util.HFileUtils.lambda$serializeRecordsToLogBlock$0(HFileUtils.java:219)
	at java.util.TreeMap.forEach(TreeMap.java:1005)
	at org.apache.hudi.common.util.HFileUtils.serializeRecordsToLogBlock(HFileUtils.java:213)
	at org.apache.hudi.common.table.log.block.HoodieHFileDataBlock.serializeRecords(HoodieHFileDataBlock.java:108)
	at org.apache.hudi.common.table.log.block.HoodieDataBlock.getContentBytes(HoodieDataBlock.java:117)
	at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:163)
	at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:458)
	... 33 more
Caused by: java.io.IOException: Added a key not lexically larger than previous.

@ad1happy2go
Collaborator

@dataproblems Can you try the upsert operation instead? With RLI, the index lookup phase should not add much cost anyway, so insert and upsert should perform similarly.
We will look further into why it fails with insert.
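
For reference, a minimal sketch of the suggested switch, written with plain string keys so it runs without Hudi on the classpath. The object name `OperationOptions` is hypothetical, and the map is a trimmed-down stand-in for the full options shown in the issue; `"hoodie.datasource.write.operation"` is assumed here to be the string form of `DataSourceWriteOptions.OPERATION.key()`.

```scala
object OperationOptions {
  // Trimmed-down stand-in for the options in the issue (assumption: string
  // keys are equivalent to the DataSourceWriteOptions constants used there).
  val insertOptions: Map[String, String] = Map(
    "hoodie.datasource.write.operation" -> "insert",
    "hoodie.index.type" -> "RECORD_INDEX",
    "hoodie.metadata.record.index.enable" -> "true"
  )

  // Same options with only the write operation changed to upsert.
  val upsertOptions: Map[String, String] =
    insertOptions + ("hoodie.datasource.write.operation" -> "upsert")
}
```

Only the operation entry differs between the two maps; the index settings are left unchanged.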

@ad1happy2go ad1happy2go added priority:critical production down; pipelines stalled; Need help asap. index metadata metadata table labels Nov 26, 2024
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Nov 26, 2024
@dataproblems
Author

@ad1happy2go - This seems to be related to the text in the record key field. When I removed that particular entry from my dataset, the operation went through.

Is there any documentation of the restrictions on the characters allowed in a String record key field?
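
One way to narrow down which entries are involved is to scan the record keys for non-ASCII characters. This is only a sketch: the object and method names are hypothetical, and it assumes (without confirmation from the maintainers) that non-ASCII characters are what distinguishes the failing keys.

```scala
object NonAsciiKeys {
  // True when every character of the key is in the 7-bit ASCII range.
  def isAscii(key: String): Boolean = key.forall(_ < 128)

  // Returns the keys that contain at least one non-ASCII character.
  def nonAscii(keys: Seq[String]): Seq[String] = keys.filterNot(isAscii)
}
```

On a Spark dataset the same predicate could be applied to the record key column to list candidate rows before writing.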

@ad1happy2go
Collaborator

Thanks @dataproblems for raising this. Can you share more details about the kinds of values that caused the issue?

@dataproblems
Author

@ad1happy2go - here are two examples: » and è.
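
The root cause is not established in this thread. The message "Added a key not lexically larger than previous" indicates that the writer received a key that did not sort after the one before it, and the stack trace shows the records being iterated from a `java.util.TreeMap`. One way to investigate is to check whether different orderings disagree on keys containing these characters. The sketch below compares Java string ordering with unsigned and signed byte ordering of the UTF-8 encoding; the object and method names are hypothetical, the sample keys are made up, and whether any of these orderings matches what Hudi 0.15.0 uses internally is an assumption to verify.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object KeyOrderSketch {
  // Lexicographic comparison of two byte arrays; `widen` decides whether
  // each byte is treated as signed or unsigned.
  private def compareBytes(a: Array[Byte], b: Array[Byte], widen: Byte => Int): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = Integer.compare(widen(a(i)), widen(b(i)))
      if (c != 0) return c
      i += 1
    }
    Integer.compare(a.length, b.length)
  }

  // Natural String ordering (String.compareTo, by UTF-16 code unit).
  def sortByString(keys: Seq[String]): Seq[String] = keys.sorted

  // UTF-8 bytes compared as unsigned values (0..255).
  def sortByUnsignedBytes(keys: Seq[String]): Seq[String] =
    keys.sortWith((x, y) => compareBytes(x.getBytes(UTF_8), y.getBytes(UTF_8), b => b & 0xff) < 0)

  // UTF-8 bytes compared as signed values (-128..127).
  def sortBySignedBytes(keys: Seq[String]): Seq[String] =
    keys.sortWith((x, y) => compareBytes(x.getBytes(UTF_8), y.getBytes(UTF_8), b => b.toInt) < 0)

  def main(args: Array[String]): Unit = {
    // \u00e8 is è and \u00bb is »; escapes avoid source-encoding issues.
    val keys = Seq("key-\u00e8", "key-z", "key-\u00bb")
    println(sortByString(keys))        // key-z, key-», key-è
    println(sortByUnsignedBytes(keys)) // key-z, key-», key-è
    println(sortBySignedBytes(keys))   // key-», key-è, key-z
  }
}
```

For these sample keys, string ordering and unsigned byte ordering agree, while signed byte ordering places the non-ASCII keys before the ASCII one. A mismatch of this kind between the ordering used to sort keys and the ordering used to validate them would produce an out-of-order error, but that remains a hypothesis here.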
