Measure the number of Parquet row groups filtered by predicate pushdown #17594

mhaseeb123 · 2024-12-13T21:25:21Z

Description

This PR adds a method to measure the number of remaining row groups after stats and bloom filtering during predicate pushdown.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-12-13T21:25:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

mhaseeb123 · 2025-01-16T03:59:29Z

cpp/src/io/parquet/reader_impl_helpers.cpp

  // if filter is not empty, then gather row groups to read after predicate pushdown
  if (filter.has_value()) {
-    filtered_row_group_indices = filter_row_groups(
-      sources, row_group_indices, output_dtypes, output_column_schemas, filter.value(), stream);
+    // Span of input row group indices for predicate pushdown


Moved this piece of code out of filter_row_groups() to make its signature consistent with apply_bloom_filters()

mhaseeb123 · 2025-01-16T04:00:16Z

cpp/src/io/parquet/predicate_pushdown.cpp

  host_span<data_type const> output_dtypes,
  host_span<int const> output_column_schemas,
  std::reference_wrapper<ast::expression const> filter,
  rmm::cuda_stream_view stream) const
 {
  auto mr = cudf::get_current_device_resource_ref();
-  // Create row group indices.


Moved this piece out into select_row_groups() so that this function takes input_row_group_indices as an input parameter.

mhaseeb123 · 2025-01-16T04:01:24Z

cpp/include/cudf/io/types.hpp

  std::map<std::string, std::string> user_data;  //!< Format-dependent metadata of the first input
                                                 //!< file as key-values pairs (deprecated)
  std::vector<std::unordered_map<std::string, std::string>>
    per_file_user_data;  //!< Per file format-dependent metadata as key-values pairs
+
+  // The following variables are currently only computed for Parquet reader
+  size_type num_input_row_groups;           //!< Number of input row groups across all data sources


Keeping all variables that track the number of row groups as size_type for consistency across the entire Parquet stack. Happy to change them to size_t everywhere in a separate PR if needed. @vuule @nvdbaranec

mhaseeb123 · 2025-01-16T04:09:32Z

cpp/src/io/parquet/bloom_filter_reader.cu

-    bloom_filter_spans, parquet_types, total_row_groups, equality_col_schemas.size()};
+  bloom_filter_caster const bloom_filter_col{bloom_filter_spans,
+                                             parquet_types,
+                                             static_cast<size_t>(total_row_groups),


Adding the static_cast simply caused this line to be reformatted.
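For context, a minimal sketch of the kind of conversion involved, assuming row group counts are kept as a signed 32-bit size_type while a downstream constructor expects size_t. The helper name to_size_t is hypothetical and is not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for cudf::size_type, which is a signed 32-bit integer.
using size_type = std::int32_t;

// Hypothetical helper: converts a signed count to size_t, checking the
// sign first so a negative value is not silently turned into a huge one.
inline std::size_t to_size_t(size_type count)
{
  assert(count >= 0);
  return static_cast<std::size_t>(count);
}
```

The explicit cast documents the signed-to-unsigned conversion at the call site instead of relying on an implicit one.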

mhaseeb123 · 2025-01-28T17:40:12Z

cpp/include/cudf/io/types.hpp

+
+  // The following variables are currently only computed for Parquet reader
+  size_type num_input_row_groups{0};  //!< Total number of input row groups across all data sources
+  std::optional<size_type>


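Edit: With the recent updates to this PR:

num_row_groups_after_stats_filter is std::nullopt if no filter is provided (not filter.has_value()); otherwise it equals either num_input_row_groups or the actual number of row groups remaining after stats filtering.

num_row_groups_after_bloom_filter is std::nullopt if no filter is provided (not filter.has_value()) or no bloom filters exist (not bloom_filter_exist); otherwise it equals either num_row_groups_after_stats_filter or the actual number of row groups remaining after bloom filtering.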
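The rules above can be sketched as a small standalone model. This is an illustration under stated assumptions, not the PR's implementation: row_group_counts and make_counts are hypothetical names, and an empty stats_survivors or bloom_survivors is assumed to mean that the stage produced no filtered result, in which case the previous stage's count is carried forward.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Stand-in for cudf::size_type, which is a signed 32-bit integer.
using size_type = std::int32_t;

// Hypothetical, simplified model of the three counters discussed above.
struct row_group_counts {
  size_type num_input_row_groups{0};
  std::optional<size_type> num_row_groups_after_stats_filter;
  std::optional<size_type> num_row_groups_after_bloom_filter;
};

// Hypothetical helper that fills the counters according to the stated rules.
inline row_group_counts make_counts(size_type num_input,
                                    bool has_filter,
                                    bool bloom_filters_exist,
                                    std::optional<size_type> stats_survivors,
                                    std::optional<size_type> bloom_survivors)
{
  assert(num_input >= 0);
  row_group_counts counts;
  counts.num_input_row_groups = num_input;

  // Without a filter, both optional counters stay std::nullopt.
  if (!has_filter) { return counts; }

  // Stats stage: the actual value if there is one, else the input count.
  counts.num_row_groups_after_stats_filter = stats_survivors.value_or(num_input);

  // Bloom stage: only reported when bloom filters exist; falls back to
  // the stats-stage count when there is no separate value.
  if (bloom_filters_exist) {
    counts.num_row_groups_after_bloom_filter =
      bloom_survivors.value_or(*counts.num_row_groups_after_stats_filter);
  }
  return counts;
}
```

With this model, a read without a filter leaves both optionals empty, while a filtered read over files that have no bloom filters reports only the stats-stage count.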

mhaseeb123 · 2025-01-28T18:33:53Z

cpp/src/io/parquet/reader_impl.cpp

+  // pushdown.
+  out_metadata.num_input_row_groups = _file_itm_data.num_input_row_groups;
+  // Copy the number surviving row groups from each predicate pushdown only if the filter has value.
+  if (_expr_conv.get_converted_expr().has_value()) {


Copy from _file_itm_data only if filter.has_value().
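Because the counts are copied only when a filter is present, a caller has to check each optional before reading it. A minimal sketch of such a consumer, assuming the counters are exposed as std::optional values; describe_stage is a hypothetical helper and not part of libcudf.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Stand-in for cudf::size_type, which is a signed 32-bit integer.
using size_type = std::int32_t;

// Hypothetical helper: formats one pushdown stage for logging, treating
// an empty optional as "this stage did not report a count".
inline std::string describe_stage(char const* stage,
                                  size_type num_input,
                                  std::optional<size_type> const& survivors)
{
  assert(stage != nullptr);
  if (!survivors.has_value()) { return std::string{stage} + ": not applied"; }
  return std::string{stage} + ": " + std::to_string(*survivors) + " of " +
         std::to_string(num_input) + " row groups remain";
}
```

For example, describe_stage("bloom", 10, 3) yields "bloom: 3 of 10 row groups remain", and passing std::nullopt yields "bloom: not applied".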

mhaseeb123 · 2025-01-28T22:48:48Z

cpp/src/io/parquet/predicate_pushdown.cpp

@@ -451,29 +428,55 @@ std::optional<std::vector<std::vector<size_type>>> aggregate_reader_metadata::fi
  // Converts AST to StatsAST with reference to min, max columns in above `stats_table`.
  stats_expression_converter const stats_expr{filter.get(),
                                              static_cast<size_type>(output_dtypes.size())};
-  auto stats_ast     = stats_expr.get_stats_expr();


Removed this stale code (already moved inside collect_filtered_row_group_indices)

mhaseeb123 · 2025-01-28T22:50:20Z

cpp/src/io/parquet/predicate_pushdown.cpp


  // Filter stats table with StatsAST expression and collect filtered row group indices
  auto const filtered_row_group_indices = collect_filtered_row_group_indices(
    stats_table, stats_expr.get_stats_expr(), input_row_group_indices, stream);

+  // Number of surviving row groups after applying stats filter
+  auto const num_stats_filtered_row_groups =


There isn't really a straightforward way here to check whether stats were unavailable, so this will be set to either total_row_groups or the filtered number of row groups.

So if there is a filter but no stats, we still report a num_stats_filtered_row_groups number, even though we didn't really do any stats-based filtering?

Yeah, unfortunately it's not that straightforward to check the availability of stats right now!

Created #17864 to handle this in a future PR.
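The fallback described in this thread can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: the filtered result is modeled as an optional list of row group indices per source (matching the return type visible in the diff header above), an empty optional is assumed to mean that nothing was filtered, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Stand-in for cudf::size_type, which is a signed 32-bit integer.
using size_type = std::int32_t;

// One list of surviving row group indices per data source.
using row_group_indices = std::vector<std::vector<size_type>>;

// Hypothetical helper: total number of row groups across all sources.
inline size_type count_row_groups(row_group_indices const& per_source)
{
  return std::accumulate(per_source.begin(),
                         per_source.end(),
                         size_type{0},
                         [](size_type sum, std::vector<size_type> const& rgs) {
                           return sum + static_cast<size_type>(rgs.size());
                         });
}

// Hypothetical helper: the filtered count if a filtered result exists,
// otherwise the total. As noted in the thread, the fallback cannot tell
// "stats were unavailable" apart from "stats pruned nothing".
inline size_type num_surviving_row_groups(std::optional<row_group_indices> const& filtered,
                                          size_type total_row_groups)
{
  assert(total_row_groups >= 0);
  return filtered.has_value() ? count_row_groups(*filtered) : total_row_groups;
}
```

The ambiguity in the fallback branch is the case that #17864 is meant to address.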

cpp/include/cudf/io/types.hpp

vuule · 2025-01-29T01:14:39Z

cpp/src/io/parquet/predicate_pushdown.cpp


  // Filter stats table with StatsAST expression and collect filtered row group indices
  auto const filtered_row_group_indices = collect_filtered_row_group_indices(
    stats_table, stats_expr.get_stats_expr(), input_row_group_indices, stream);

+  // Number of surviving row groups after applying stats filter
+  auto const num_stats_filtered_row_groups =


So if there is a filter but no stats, we still report a num_stats_filtered_row_groups number, even though we didn't really do any stats-based filtering?

vuule

Approving; my remaining comments are non-blocking

cpp/src/io/parquet/bloom_filter_reader.cu

cpp/src/io/parquet/predicate_pushdown.cpp

cpp/src/io/parquet/reader_impl_helpers.hpp

cpp/src/io/parquet/reader_impl_chunking.hpp

cpp/tests/io/parquet_reader_test.cpp

mythrocks

I'm 👍. A couple of nitpicks are mentioned, but needn't hold up this change.

mhaseeb123 · 2025-01-30T23:19:10Z

/merge

Add a method to measure predicate pushdown row group filtering

5466c7a

mhaseeb123 self-assigned this Dec 13, 2024

github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Dec 13, 2024

mhaseeb123 added the 2 - In Progress (Currently a work in progress), DO NOT MERGE (Hold off on merging; see PR for details), and improvement (Improvement / enhancement to an existing function) labels Dec 13, 2024

mhaseeb123 mentioned this pull request Dec 13, 2024

[FEA] Use bloom filters in Parquet reader to filter row groups with equality predicates #17164

Closed

mhaseeb123 added the cuIO (cuIO issue) and non-breaking (Non-breaking change) labels Dec 13, 2024

Merge branch 'branch-25.02' into fea/measure-predicate-pushdown-row-groups

57e1e1c

mhaseeb123 removed the DO NOT MERGE (Hold off on merging; see PR for details) label Jan 15, 2025

mhaseeb123 and others added 2 commits January 15, 2025 18:10

Merge branch 'branch-25.02' into fea/measure-predicate-pushdown-row-groups

c061a75

Implementation

17bfedd

mhaseeb123 changed the title ~~🚧 Add a method to measure the number of Parquet row groups filtered by predicate pushdown~~ Measure the number of Parquet row groups filtered by predicate pushdown Jan 16, 2025

mhaseeb123 added 2 commits January 16, 2025 03:52

Minor, remove periods in comments

2d00bca

Clean up docstrings

4f53906

mhaseeb123 commented Jan 16, 2025

mhaseeb123 added 2 commits January 16, 2025 04:07

Move max row group checks

c50bb43

Remove erroneous comments

c0c0483

mhaseeb123 marked this pull request as ready for review January 16, 2025 04:08

mhaseeb123 requested a review from a team as a code owner January 16, 2025 04:08

mhaseeb123 requested review from bdice, ttnghia and vuule January 16, 2025 04:08

mhaseeb123 commented Jan 16, 2025

mhaseeb123 added 2 commits January 16, 2025 04:12

Minor refactoring

fcf7f28

Minor refactoring

7811641

Fix scalar value for tests

d45aafd

mhaseeb123 commented Jan 28, 2025

mhaseeb123 added 4 commits January 28, 2025 18:36

Minor updates. Make struct name more readable.

da30d0b

Remove erroneous test

ae3ca5b

Apply suggestions from review

a1bce6b

Remove stale code

ae1bd1d

mhaseeb123 commented Jan 28, 2025

vuule reviewed Jan 29, 2025

vuule approved these changes Jan 29, 2025

Update docstring for variables

c336184

mhaseeb123 mentioned this pull request Jan 29, 2025

[FEA] Add a method to check if row group stats are available #17864

Open

mhaseeb123 added the feature request (New feature or request) label and removed the improvement (Improvement / enhancement to an existing function) label Jan 30, 2025