Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8663] Look up in col stats based on indexed cols #12575

Merged

Conversation

nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Jan 6, 2025

Change Logs

  • As of now, we lookup in col stats for all filter expressions. This patch checks for any non indexed cols and avoids col stats based pruning. The files from pruned files are used as candidate files if any non indexed cols are detected. If not, pruning happens based on col stats index in MDT and candidate files are deduced. We have also optimized the AND clause. If we have a leftExpr AND rightExpr, where leftExpr has non indexed cols, while rightExpr has all indexed cols, we can still afford to prune files based on col stats index. The expression to look up in col stats will omit leftExpr accordingly. Only if both leftExpr and rightExpr has non indexex cols, we can avoid looking up in col stats index. Incase of 'Or' clause, we can't do much, but have to resort to bypass col stats lookup completely.

lets walk through an example to understand this in detail

c1 = 619sdc or c2 = 100 and c3 = 200

Say this is the filter expression in the query issued.

case 1:
when all 3 cols are indexed w/ col stats:
expected col stats lookup expression is

(((('c1_minValue <= 619sdc) AND ('c1_maxValue >= 619sdc)) OR (('c2_minValue <= 100) AND ('c2_maxValue >= 100))) AND (('c3_minValue <= 200) AND ('c3_maxValue >= 200)))

i.e. all 3 cols will be looked up.

case 2:
only c1 and c2 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
since left of AND exp is indexed, we can ignore c3 which is not indexed (rigth expr of AND).
we translate to

((('c1_minValue <= 619sdc) AND ('c1_maxValue >= 619sdc)) OR (('c2_minValue <= 100) AND ('c2_maxValue >= 100)))

and lookup in col stats translated df.

case 3:
only c1 and c3 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Among c1 and c2, c2 is not indexed. So, left Expr of (c1 = 619sdc or c2 = 100) is deemed as not indexed. But the right expr of 'c3 = 200' is indexed. and so, we translate this to

(('c3_minValue <= 200) AND ('c3_maxValue >= 200))

and lookup in col stats translated df.

case 4:
only c2 and c3 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Same as case 3
So, we translate it to

(('c3_minValue <= 200) AND ('c3_maxValue >= 200))

and lookup in col stats translated df.

case 5:
only c2 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.

case 6:
only c3 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.

case 7:
only c1 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.

case 8:
only c2 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.

Whenever we deem as col stats should not be looked up, we return the files w/o pruning from the col stats lookup step.

Impact

Optimized lookup in col stats.

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Jan 6, 2025
}

@Test
def testTranslateIntoColumnStatsIndexFilterExpr(): Unit = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty good coverage! I have some more test cases:

  1. Concurrent translation to exercise AtomicBoolean
  2. Nested And/Or example
val dataFilter = And(
  Or(
    EqualTo(attribute("c1"), literal("619sdc")),
    And(EqualTo(attribute("c2"), literal("100")), EqualTo(attribute("c3"), literal("200")))
  ),
  EqualTo(attribute("c4"), literal("300"))
)
translateIntoColStatsExprAndValidate(
  dataFilter,
  Seq("c1", "c2", "c3", "c4"),
  expectedTransformedExpression, // Define the correct transformed expression
  false
)
  1. Index updates
val dataFilter = EqualTo(attribute("c1"), literal("619sdc"))
translateIntoColStatsExprAndValidate(dataFilter, Seq("c1"), expectedExpr, false)

// Simulate removing c1 from indexed columns
translateIntoColStatsExprAndValidate(dataFilter, Seq.empty, TrueLiteral, true)
  1. Unsupported filter types
val unsupportedFilter = SubqueryExpression(...) // Simulate a subquery-based filter
translateIntoColStatsExprAndValidate(unsupportedFilter, Seq("c1"), TrueLiteral, true)
  1. Too many filters
val largeFilter = (1 to 100).map(i => EqualTo(attribute(s"c$i"), literal("value"))).reduce(And)
val indexedColumns = (1 to 50).map(i => s"c$i")
translateIntoColStatsExprAndValidate(largeFilter, indexedColumns, expectedExpr, false)
  1. Nested schema
val schema = StructType(Seq(
  StructField("c1", StringType, true),
  StructField("nested", StructType(Seq(
    StructField("inner_c2", StringType, true),
    StructField("inner_c3", IntegerType, true)
  )), true)
))

val dataFilter = And(
  EqualTo(attribute("c1"), literal("val1")),
  EqualTo(attribute("nested.inner_c2"), literal("val2"))
)
translateIntoColStatsExprAndValidate(
  dataFilter,
  Seq("c1", "nested.inner_c2"),
  expectedExpr,
  false
)

Some of the above might not be handled even from before (e.g. I think case 3 was never handled). Feel free to track separately if it takes longer to fix. I just thought these would be good addition to the suite for future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good list of tests. I have added them all except 1. not sure why we need mutli-writer tests. col stats lookup is designed to be single threaded. So, we don't need to test for multi writers. I had to use AtomicBoolean since I need something to be used as communication b/w caller and callee. we can't use list, and hence used AtomicBoolean.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok. If that was the case, why couldn't we simply return a tuple of <Expression, Boolean> from translateIntoColumnStatsIndexFilterExpr? Wouldn't AtomicBoolean compare and swap be slightly worse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we could do that as well. I will give that a try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we don't even need the boolean. A return of "TrueLiteral" means expr is not parseable or has non indexed cols. So, have simplified based on that.

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I am good with the patch. If we can avoid AtomicBoolean, given that colstats lookup is single threaded only, that would be great.

@nsivabalan nsivabalan force-pushed the colStatsLookupOnlyForIndexedCols branch from 428bc73 to 14581e6 Compare January 8, 2025 07:26
@hudi-bot
Copy link

hudi-bot commented Jan 8, 2025

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit 3040aa5 into apache:master Jan 8, 2025
43 checks passed
Davis-Zhang-Onehouse pushed a commit to Davis-Zhang-Onehouse/hudi-oss that referenced this pull request Jan 10, 2025
- Look up in col stats index based on indexed cols. Non indexed cols are ignored while looking up if rest of the filter expr is eligible to be looked up.
Davis-Zhang-Onehouse pushed a commit to Davis-Zhang-Onehouse/hudi-oss that referenced this pull request Jan 10, 2025
- Look up in col stats index based on indexed cols. Non indexed cols are ignored while looking up if rest of the filter expr is eligible to be looked up.
Davis-Zhang-Onehouse pushed a commit to Davis-Zhang-Onehouse/hudi-oss that referenced this pull request Jan 10, 2025
- Look up in col stats index based on indexed cols. Non indexed cols are ignored while looking up if rest of the filter expr is eligible to be looked up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants