-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-8663] Look up in col stats based on indexed cols #12575
[HUDI-8663] Look up in col stats based on indexed cols #12575
Conversation
...datasource/hudi-spark-common/src/main/scala/org/apache/hudi/PartitionStatsIndexSupport.scala
Outdated
Show resolved
Hide resolved
...atasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala
Outdated
Show resolved
Hide resolved
...atasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala
Outdated
Show resolved
Hide resolved
...atasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala
Outdated
Show resolved
Hide resolved
...ource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala
Show resolved
Hide resolved
...atasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala
Outdated
Show resolved
Hide resolved
} | ||
|
||
@Test | ||
def testTranslateIntoColumnStatsIndexFilterExpr(): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty good coverage! I have some more test cases:
- Concurrent translation to exercise AtomicBoolean
- Nested And/Or example
val dataFilter = And(
Or(
EqualTo(attribute("c1"), literal("619sdc")),
And(EqualTo(attribute("c2"), literal("100")), EqualTo(attribute("c3"), literal("200")))
),
EqualTo(attribute("c4"), literal("300"))
)
translateIntoColStatsExprAndValidate(
dataFilter,
Seq("c1", "c2", "c3", "c4"),
expectedTransformedExpression, // Define the correct transformed expression
false
)
- Index updates
val dataFilter = EqualTo(attribute("c1"), literal("619sdc"))
translateIntoColStatsExprAndValidate(dataFilter, Seq("c1"), expectedExpr, false)
// Simulate removing c1 from indexed columns
translateIntoColStatsExprAndValidate(dataFilter, Seq.empty, TrueLiteral, true)
- Unsupported filter types
val unsupportedFilter = SubqueryExpression(...) // Simulate a subquery-based filter
translateIntoColStatsExprAndValidate(unsupportedFilter, Seq("c1"), TrueLiteral, true)
- Too many filters
val largeFilter = (1 to 100).map(i => EqualTo(attribute(s"c$i"), literal("value"))).reduce(And)
val indexedColumns = (1 to 50).map(i => s"c$i")
translateIntoColStatsExprAndValidate(largeFilter, indexedColumns, expectedExpr, false)
- Nested schema
val schema = StructType(Seq(
StructField("c1", StringType, true),
StructField("nested", StructType(Seq(
StructField("inner_c2", StringType, true),
StructField("inner_c3", IntegerType, true)
)), true)
))
val dataFilter = And(
EqualTo(attribute("c1"), literal("val1")),
EqualTo(attribute("nested.inner_c2"), literal("val2"))
)
translateIntoColStatsExprAndValidate(
dataFilter,
Seq("c1", "nested.inner_c2"),
expectedExpr,
false
)
Some of the above might not be handled even from before (e.g. I think case 3 was never handled). Feel free to track separately if it takes longer to fix. I just thought these would be good addition to the suite for future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good list of tests. I have added them all except 1. not sure why we need mutli-writer tests. col stats lookup is designed to be single threaded. So, we don't need to test for multi writers. I had to use AtomicBoolean since I need something to be used as communication b/w caller and callee. we can't use list, and hence used AtomicBoolean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh ok. If that was the case, why couldn't we simply return a tuple of <Expression, Boolean> from translateIntoColumnStatsIndexFilterExpr
? Wouldn't AtomicBoolean compare and swap be slightly worse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we could do that as well. I will give that a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we don't even need the boolean. A return of "TrueLiteral" means expr is not parseable or has non indexed cols. So, have simplified based on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I am good with the patch. If we can avoid AtomicBoolean, given that colstats lookup is single threaded only, that would be great.
428bc73
to
14581e6
Compare
- Look up in col stats index based on indexed cols. Non indexed cols are ignored while looking up if rest of the filter expr is eligible to be looked up.
- Look up in col stats index based on indexed cols. Non indexed cols are ignored while looking up if rest of the filter expr is eligible to be looked up.
- Look up in col stats index based on indexed cols. Non indexed cols are ignored while looking up if rest of the filter expr is eligible to be looked up.
Change Logs
leftExpr AND rightExpr
, where leftExpr has non indexed cols, while rightExpr has all indexed cols, we can still afford to prune files based on col stats index. The expression to look up in col stats will omit leftExpr accordingly. Only if both leftExpr and rightExpr has non indexex cols, we can avoid looking up in col stats index. Incase of 'Or' clause, we can't do much, but have to resort to bypass col stats lookup completely.lets walk through an example to understand this in detail
c1 = 619sdc or c2 = 100 and c3 = 200
Say this is the filter expression in the query issued.
case 1:
when all 3 cols are indexed w/ col stats:
expected col stats lookup expression is
i.e. all 3 cols will be looked up.
case 2:
only c1 and c2 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
since left of AND exp is indexed, we can ignore c3 which is not indexed (rigth expr of AND).
we translate to
and lookup in col stats translated df.
case 3:
only c1 and c3 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Among c1 and c2, c2 is not indexed. So, left Expr of (c1 = 619sdc or c2 = 100) is deemed as not indexed. But the right expr of 'c3 = 200' is indexed. and so, we translate this to
and lookup in col stats translated df.
case 4:
only c2 and c3 are indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Same as case 3
So, we translate it to
and lookup in col stats translated df.
case 5:
only c2 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.
case 6:
only c3 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.
case 7:
only c1 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.
case 8:
only c2 is indexed for filter "c1 = 619sdc or c2 = 100 and c3 = 200":
Since either of Or clause is not indexed, we have to bypass looking up in col stats.
Whenever we deem as col stats should not be looked up, we return the files w/o pruning from the col stats lookup step.
Impact
Optimized lookup in col stats.
Risk level (write none, low medium or high below)
medium
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist