[HUDI-8663] Look up in col stats based on indexed cols #12575

nsivabalan · 2025-01-06T01:09:58Z

Change Logs

As of now, we lookup in col stats for all filter expressions. This patch checks for any non indexed cols and avoids col stats based pruning. The files from pruned files are used as candidate files if any non indexed cols are detected. If not, pruning happens based on col stats index in MDT and candidate files are deduced. We have also optimized the AND clause. If we have a leftExpr AND rightExpr, where leftExpr has non indexed cols, while rightExpr has all indexed cols, we can still afford to prune files based on col stats index. The expression to look up in col stats will omit leftExpr accordingly. Only if both leftExpr and rightExpr has non indexex cols, we can avoid looking up in col stats index. Incase of 'Or' clause, we can't do much, but have to resort to bypass col stats lookup completely.

lets walk through an example to understand this in detail

c1 = 619sdc or c2 = 100 and c3 = 200

Say this is the filter expression in the query issued.

case 1:
when all 3 cols are indexed w/ col stats:
expected col stats lookup expression is

(((('c1_minValue <= 619sdc) AND ('c1_maxValue >= 619sdc)) OR (('c2_minValue <= 100) AND ('c2_maxValue >= 100))) AND (('c3_minValue <= 200) AND ('c3_maxValue >= 200)))

i.e. all 3 cols will be looked up.

case 2:
only c1 and c2 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
since left of AND exp is indexed, we can ignore c3 which is not indexed (rigth expr of AND).
we translate to

((('c1_minValue <= 619sdc) AND ('c1_maxValue >= 619sdc)) OR (('c2_minValue <= 100) AND ('c2_maxValue >= 100)))

and lookup in col stats translated df.

case 3:
only c1 and c3 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Among c1 and c2, c2 is not indexed. So, left Expr of (c1 = 619sdc or c2 = 100) is deemed as not indexed. But the right expr of 'c3 = 200' is indexed. and so, we translate this to

(('c3_minValue <= 200) AND ('c3_maxValue >= 200))

and lookup in col stats translated df.

case 4:
only c2 and c3 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Same as case 3
So, we translate it to

(('c3_minValue <= 200) AND ('c3_maxValue >= 200))

and lookup in col stats translated df.

case 5:
only c2 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.

case 6:
only c3 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.

case 7:
only c1 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.

case 8:
only c2 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.

Whenever we deem as col stats should not be looked up, we return the files w/o pruning from the col stats lookup step.

Impact

Optimized lookup in col stats.

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

...datasource/hudi-spark-common/src/main/scala/org/apache/hudi/PartitionStatsIndexSupport.scala

...atasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala

...ource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala

...atasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala

codope · 2025-01-06T12:15:52Z

...ource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala

+  }
+
+  @Test
+  def testTranslateIntoColumnStatsIndexFilterExpr(): Unit = {


Pretty good coverage! I have some more test cases:

Concurrent translation to exercise AtomicBoolean

Nested And/Or example

val dataFilter = And( Or( EqualTo(attribute("c1"), literal("619sdc")), And(EqualTo(attribute("c2"), literal("100")), EqualTo(attribute("c3"), literal("200"))) ), EqualTo(attribute("c4"), literal("300")) ) translateIntoColStatsExprAndValidate( dataFilter, Seq("c1", "c2", "c3", "c4"), expectedTransformedExpression, // Define the correct transformed expression false )

Index updates

val dataFilter = EqualTo(attribute("c1"), literal("619sdc")) translateIntoColStatsExprAndValidate(dataFilter, Seq("c1"), expectedExpr, false) // Simulate removing c1 from indexed columns translateIntoColStatsExprAndValidate(dataFilter, Seq.empty, TrueLiteral, true)

Unsupported filter types

val unsupportedFilter = SubqueryExpression(...) // Simulate a subquery-based filter translateIntoColStatsExprAndValidate(unsupportedFilter, Seq("c1"), TrueLiteral, true)

Too many filters

val largeFilter = (1 to 100).map(i => EqualTo(attribute(s"c$i"), literal("value"))).reduce(And) val indexedColumns = (1 to 50).map(i => s"c$i") translateIntoColStatsExprAndValidate(largeFilter, indexedColumns, expectedExpr, false)

Nested schema

val schema = StructType(Seq( StructField("c1", StringType, true), StructField("nested", StructType(Seq( StructField("inner_c2", StringType, true), StructField("inner_c3", IntegerType, true) )), true) )) val dataFilter = And( EqualTo(attribute("c1"), literal("val1")), EqualTo(attribute("nested.inner_c2"), literal("val2")) ) translateIntoColStatsExprAndValidate( dataFilter, Seq("c1", "nested.inner_c2"), expectedExpr, false )

Some of the above might not be handled even from before (e.g. I think case 3 was never handled). Feel free to track separately if it takes longer to fix. I just thought these would be good addition to the suite for future.

good list of tests. I have added them all except 1. not sure why we need mutli-writer tests. col stats lookup is designed to be single threaded. So, we don't need to test for multi writers. I had to use AtomicBoolean since I need something to be used as communication b/w caller and callee. we can't use list, and hence used AtomicBoolean.

oh ok. If that was the case, why couldn't we simply return a tuple of <Expression, Boolean> from translateIntoColumnStatsIndexFilterExpr? Wouldn't AtomicBoolean compare and swap be slightly worse?

yes, we could do that as well. I will give that a try.

actually we don't even need the boolean. A return of "TrueLiteral" means expr is not parseable or has non indexed cols. So, have simplified based on that.

codope

Overall, I am good with the patch. If we can avoid AtomicBoolean, given that colstats lookup is single threaded only, that would be great.

hudi-bot · 2025-01-08T08:50:28Z

CI report:

14581e6 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

- Look up in col stats index based on indexed cols. Non indexed cols are ignored while looking up if rest of the filter expr is eligible to be looked up.

nsivabalan mentioned this pull request Jan 6, 2025

[HUDI-8663] Look up in col stats based on indexed cols. #12567

Closed

4 tasks

github-actions bot added the size:L PR with lines of changes in (300, 1000] label Jan 6, 2025

codope reviewed Jan 6, 2025

View reviewed changes

codope approved these changes Jan 7, 2025

View reviewed changes

nsivabalan added 6 commits January 7, 2025 23:26

Looking up in col stats index only for indexed cols

9a8989c

Fixing AND clause for non indexed cols pruning

b1a42a1

Fixing AND and Or clauses with col stats lookup

cce62fc

Fixing test failure

dcff304

Adding more tests

7968763

simplifying detecting non indexed cols

14581e6

nsivabalan force-pushed the colStatsLookupOnlyForIndexedCols branch from 428bc73 to 14581e6 Compare January 8, 2025 07:26

nsivabalan merged commit 3040aa5 into apache:master Jan 8, 2025
43 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HUDI-8663] Look up in col stats based on indexed cols #12575

[HUDI-8663] Look up in col stats based on indexed cols #12575

nsivabalan commented Jan 6, 2025 •

edited

Loading

codope Jan 6, 2025

nsivabalan Jan 6, 2025

codope Jan 7, 2025

nsivabalan Jan 7, 2025

nsivabalan Jan 8, 2025

codope left a comment

hudi-bot commented Jan 8, 2025

[HUDI-8663] Look up in col stats based on indexed cols #12575

[HUDI-8663] Look up in col stats based on indexed cols #12575

Conversation

nsivabalan commented Jan 6, 2025 • edited Loading

Change Logs

Impact

Risk level (write none, low medium or high below)

Documentation Update

Contributor's checklist

codope Jan 6, 2025

Choose a reason for hiding this comment

nsivabalan Jan 6, 2025

Choose a reason for hiding this comment

codope Jan 7, 2025

Choose a reason for hiding this comment

nsivabalan Jan 7, 2025

Choose a reason for hiding this comment

nsivabalan Jan 8, 2025

Choose a reason for hiding this comment

codope left a comment

Choose a reason for hiding this comment

hudi-bot commented Jan 8, 2025

CI report:

nsivabalan commented Jan 6, 2025 •

edited

Loading