Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8518] Fix RLI and Secondary index with custom payload or merge mode #12525

Merged
merged 4 commits into from
Jan 9, 2025

Conversation

codope
Copy link
Member

@codope codope commented Dec 19, 2024

Change Logs

  • Ensure deletes via custom payload or merger gets synced to RLI (HUDI-8741)
  • Fix secondary index updates for event_time and custom merge modes (HUDI-8518)
  • Ensure secondary index updates are done in distributed manner without collected any record keys or maps in driver (HUDI-8740)

Impact

Index correctness with custom mergers, and optimization.

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 19, 2024
@codope codope changed the title [HUDI-8518][HUDI-8741] Fix RLI and Secondary index with custom payload or merge mode [HUDI-8518] [HUDI-8741] Fix RLI and Secondary index with custom payload or merge mode Dec 23, 2024
@codope codope changed the title [HUDI-8518] [HUDI-8741] Fix RLI and Secondary index with custom payload or merge mode [HUDI-8518] Fix RLI and Secondary index with custom payload or merge mode Dec 23, 2024
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed source code. will review tests once feedback is addressed.

@apache apache deleted a comment from hudi-bot Dec 25, 2024
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Jan 3, 2025
String fileId = writeStatsByFileIdEntry.getKey();
List<HoodieWriteStat> writeStats = writeStatsByFileIdEntry.getValue();
String partition = writeStats.get(0).getPartitionPath();
FileSlice previousFileSliceForFileId = fsView.getLatestFileSlice(partition, fileId).orElse(null);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still using filesystem view for now, instead of populating file slice in the writestats (through append handle). There could be some issue due to NBCC with CDC enabled. Trying to clarify it in #12582
Btw, here fsView call is destributed by partition, fileId. So, not making repeated calls.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to review tests. done w/ source code

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see more tests for the core method finding SI records.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see the tests that I asked for.

esply around preparing SI records for MDT given a file slice.

Cases I would like to see:

  1. adding 1 new log file w/ data block. Existing file slice has just base file.
    a. data block has updates. some of them updates the SI value, while some do not.
    b. data block contains some deletes (custom payload way)
    c. log file is a delete block.

  2. Same as 1, just that the existing file slice contains base file and 2 log files.

  3. Same as 1. just that existing file slice contains base file, 1 log file w/ data block. and a delete block.

  4. Same as 1. existing file slice contains base file and 1 committed log block. one another log file which is partially failed. SI record generation should ignore the partially committed log file.

  5. update a SI record in logs(lf1). and later delete it(lf2) and then re-add (lf3) across diff commits. SI record generation should hold water.

  6. Simulate, async compaction.
    last but one file slice contains base file and 3 log files. latest file slice contains 1 log file. and new commit is looking to add a new log file. SI record generation should work.

May be, you have already covered w/ testSecondaryIndexRecordGenerationForMOR.
can you please confirm.

codope added 2 commits January 8, 2025 17:07
…erge modes

[HUDI-8741] Ensure deletes via custom payload or merger gets synced to RLI

[HUDI-8740] Fix SI update logic to avoid collection on the  driver
@codope
Copy link
Member Author

codope commented Jan 8, 2025

@nsivabalan Yes, most of those tests are already covered in TestSecondaryIndex and TestSecondaryIndexPruning. I have also added more tests, for cases like partialy failed commit, revived keys etc in TestSecondaryIndexPruning and TestMetadataUtilRLIandSIRecordGeneration.testSecondaryIndexRecordGenerationForMOR.

@hudi-bot
Copy link

hudi-bot commented Jan 8, 2025

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit ef9ab24 into apache:master Jan 9, 2025
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:XL PR with lines of changes > 1000
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants