[data][Datasink] support passing write results to on_write_completes #49251

raulchen · 2024-12-13T07:33:34Z

Why are these changes needed?

A previous refactoring PR broke the ability to pass write results to Datasink.on_write_complete.
This PR restores that ability and refines the Datasink interface by decoupling the stats-handling code from the on_write_complete callback.

Closes #48933

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Hao Chen <[email protected]>

raulchen · 2024-12-13T07:35:50Z

python/ray/data/tests/test_datasink.py

@@ -26,6 +122,38 @@ def num_rows_per_write(self):
    )


+def test_custom_write_results(ray_start_regular_shared):


This is the only new test. The tests above were moved from test_formats.py.

raulchen · 2024-12-13T07:40:17Z

python/ray/data/datasource/datasink.py

@@ -67,44 +46,32 @@ def write(
        self,
        blocks: Iterable[Block],
        ctx: TaskContext,
-    ) -> None:
+    ) -> WriteResultType:


@bveeramani Unrelated to this PR, but one pitfall of the current Datasink interface is that the same Datasink object is used both on the driver (the on_xxx callbacks) and on the workers (this write function).
Users may mistakenly assume that an attribute updated in the write method will be visible in on_write_complete. It won't be, because each write task operates on its own copy of the object.

We should consider addressing this issue before making the Datasink API public. One solution is to introduce a separate DatasinkWriter class.
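
The pitfall can be reproduced without Ray. The sketch below is only an illustration: `CountingSink` is a hypothetical stand-in rather than a real `Datasink` subclass, and `copy.deepcopy` is assumed here as a rough substitute for the serialization that ships the sink to each write task.

```python
import copy


class CountingSink:
    """Hypothetical stand-in for a Datasink subclass (not Ray's class)."""

    def __init__(self):
        self.rows_written = 0

    def write(self, blocks):
        # Runs on a worker, against that worker's own copy of the sink.
        n = sum(len(block) for block in blocks)
        self.rows_written += n  # mutates the worker-side copy only
        return n  # the return value is what travels back to the driver

    def on_write_complete(self, write_returns):
        # Runs on the driver, against the original object.
        return self.rows_written, sum(write_returns)


driver_sink = CountingSink()

# Simulate two write tasks: each one receives an independent copy.
worker_copies = [copy.deepcopy(driver_sink) for _ in range(2)]
write_returns = [
    worker_copies[0].write([[1, 2, 3]]),
    worker_copies[1].write([[4, 5]]),
]

seen_on_driver, total_from_returns = driver_sink.on_write_complete(write_returns)
```

In this sketch the attribute read on the driver still holds its initial value, while the values returned from `write` carry the real totals, which is why passing write returns to the callback matters.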

raulchen · 2024-12-13T08:01:44Z

python/ray/data/datasource/file_datasink.py

-                write_row_to_path,
+                lambda row=row, write_path=write_path: write_row_to_path(
+                    row, write_path
+                ),


Unrelated to this PR, but this fixes the following lint error:
python/ray/data/datasource/file_datasink.py:190:46: B023 Function definition does not bind loop variable 'write_path'.
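
For context, B023 flags closures that read a loop variable when they are called rather than when they are defined. The sketch below shows the difference; `write_row_to_path` here is a hypothetical stand-in that only records its arguments, and the paths are made up for illustration.

```python
def demo():
    def write_row_to_path(row, write_path):
        # Hypothetical stand-in for the real writer: just record the call.
        return (row, write_path)

    late_bound = []
    early_bound = []
    for i, row in enumerate(["a", "b", "c"]):
        write_path = f"out/{i}.txt"
        # B023: the closure looks up row/write_path when it is called,
        # so every callable ends up seeing the values of the last iteration.
        late_bound.append(lambda: write_row_to_path(row, write_path))
        # Fix: default arguments are evaluated when the lambda is defined,
        # so each callable keeps the values of its own iteration.
        early_bound.append(
            lambda row=row, write_path=write_path: write_row_to_path(
                row, write_path
            )
        )

    return [f() for f in late_bound], [f() for f in early_bound]


late_results, early_results = demo()
```

If the callable is invoked immediately inside the loop body, both forms behave the same; the default-argument form simply makes the binding explicit and satisfies the linter.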

bveeramani · 2024-12-17T00:42:36Z

python/ray/data/datasource/datasink.py

+    # Total size in bytes of written data.
+    size_bytes: int
+    # Results of all `Datasink.write`.
+    write_task_results: List[WriteResultType]


Nit: To avoid confusion between the WriteResult dataclass and the object returned from write tasks, it might be clearer if we rename write_task_results to write_task_returns (and WriteResultType to WriteReturnType).

Suggested change

-    write_task_results: List[WriteResultType]
+    write_task_returns: List[WriteReturnType]
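
To make the naming distinction concrete, here is a rough sketch of a generic result container using the suggested names. It is not the merged code: the `num_rows` field is assumed from the stats shown elsewhere in this PR, and `committed_paths` is a hypothetical consumer.

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

# Type of the object returned by each write task (user-defined).
WriteReturnType = TypeVar("WriteReturnType")


@dataclass
class WriteResult(Generic[WriteReturnType]):
    # Total number of written rows (assumed field).
    num_rows: int
    # Total size in bytes of written data.
    size_bytes: int
    # One entry per write task, in whatever type the sink's write returns.
    write_task_returns: List[WriteReturnType]


def committed_paths(result: WriteResult[str]) -> List[str]:
    # Hypothetical consumer: a sink whose write tasks return file names.
    return sorted(result.write_task_returns)


result = WriteResult(
    num_rows=5, size_bytes=40, write_task_returns=["part-1", "part-0"]
)
paths = committed_paths(result)
```

With this split, "WriteResult" names the aggregate handed to the callback, and "write return" names what a single task produced.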

bveeramani · 2024-12-17T00:44:16Z

python/ray/data/datasource/datasink.py

@@ -67,44 +46,32 @@ def write(
        self,
        blocks: Iterable[Block],
        ctx: TaskContext,
-    ) -> None:
+    ) -> WriteResultType:


Yeah, I agree, that's janky.

> One solution is to introduce a separate DatasinkWriter class.

Sounds reasonable.

> We should consider addressing this issue before making the Datasink API public.

Makes sense. There's no urgency to make Datasink public.

bveeramani · 2024-12-17T00:52:26Z

python/ray/data/_internal/planner/plan_write_op.py

+            {
+                "num_rows": [total_num_rows],
+                "size_bytes": [total_size_bytes],
+                "write_task_result": [ctx.kwargs.get("_data_sink_custom_result", None)],


Rather than passing information through the TaskContext, can we directly yield the write returns and stats in generate_write_fn.fn?

    # Pseudocode for `generate_write_fn.fn`
    for block in blocks:
        write_return = datasink.write(block)
        yield Block({"write_return": write_return, "num_rows": block.num_rows()})

Wait, on second thought, do we even still need generate_collect_write_stats_fn?

I think we can return the write return and statistics from generate_write_fn.fn, and then aggregate the statistics and create WriteResult on the driver?

Having two separate TransformFns allows optimization rules to insert certain operations between them. To pass data between them, TaskContext is probably the best place.
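
The two-stage arrangement can be sketched as follows. This is a simplified illustration, not the planner code: `TaskContext` is a hypothetical minimal stand-in that models only a `kwargs` dict, blocks are plain lists, and the size accounting is omitted. The `_data_sink_custom_result` key is the one shown in the diff above.

```python
from typing import Any, Dict, Iterable, Iterator, List


class TaskContext:
    """Hypothetical minimal stand-in; only the kwargs dict is modeled."""

    def __init__(self) -> None:
        self.kwargs: Dict[str, Any] = {}


def write_fn(blocks: Iterable[List[int]], ctx: TaskContext) -> Iterator[List[int]]:
    # Stage 1: "write" the blocks and stash the custom return in the context.
    blocks = list(blocks)
    ctx.kwargs["_data_sink_custom_result"] = f"wrote {len(blocks)} blocks"
    # Pass the blocks through so a later stage can compute statistics.
    yield from blocks


def collect_stats_fn(
    blocks: Iterable[List[int]], ctx: TaskContext
) -> Iterator[Dict[str, list]]:
    # Stage 2: aggregate statistics, then pick up the stashed write return.
    total_num_rows = 0
    for block in blocks:
        total_num_rows += len(block)
    yield {
        "num_rows": [total_num_rows],
        "write_task_result": [ctx.kwargs.get("_data_sink_custom_result", None)],
    }


ctx = TaskContext()
stats = list(collect_stats_fn(write_fn([[1, 2], [3]], ctx), ctx))
```

Because the stages only share the context, another transform could be placed between them without changing either function's signature.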

bveeramani · 2024-12-17T00:55:01Z

python/ray/data/_internal/planner/plan_write_op.py

@@ -16,19 +18,38 @@
 from ray.data.datasource.datasource import Datasource


+def gen_data_sink_write_result(


Nit: Datasink is one word, so this seems more accurate?

Suggested change

-def gen_data_sink_write_result(
+def gen_datasink_write_result(

Jay-ju · 2024-12-23T08:49:35Z

I've also made a fix for this issue: https://github.com/ray-project/ray/pull/49214/files. Is this PR about to be merged? If so, I can base my fixes for writing to Lance on it. This scenario is quite urgent for us.

raulchen · 2024-12-26T23:34:25Z

@Jay-ju This PR should be merged soon. It was blocked by a doc issue.

…49251) A previous refactoring [PR](#47942) broke the ability to pass write results to `Datasink.on_write_completes`. This PR adds back the ability and refines the Datasink interface by decoupling the stats handling code with the `on_write_complete` callback. Closes #48933 --------- Signed-off-by: Hao Chen <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]>

raulchen added 6 commits December 13, 2024 11:37

- support passing custom results (87b04de)
- refactor (04814ce)
- comment (10ffd4d)
- refine (af508d2)
- add test (b34feee)
- lint (073972c)

raulchen requested a review from a team as a code owner December 13, 2024 07:33

raulchen mentioned this pull request Dec 13, 2024

pass write result to on_write_complete #49091 (closed)

raulchen commented Dec 13, 2024

raulchen added 2 commits December 13, 2024 15:57

- 'fix (d479c4e)
- lint (71d8ab4)

raulchen commented Dec 13, 2024

- annotation (71591ab)

chenkovsky mentioned this pull request Dec 13, 2024

fix: fix ray lance sink error lancedb/lance#3230 (merged)

bveeramani self-assigned this Dec 17, 2024

bveeramani reviewed Dec 17, 2024

raulchen added 3 commits December 17, 2024 19:07

- rename (e5e6bc6)
- rename (bede1b8)
- docstring (a94c4d6)

bveeramani approved these changes Dec 17, 2024

raulchen added 5 commits December 18, 2024 15:42

- test (d738409)
- fix (052f884)
- revert (974f9e5)
- automodule (7246c6b)
- whatever (20bf620)

bveeramani added 2 commits December 26, 2024 12:55

- Update 'input_output.rst' (3c654bc)
- Update conf.py (01f4b24)

bveeramani requested a review from a team as a code owner December 26, 2024 22:59

- lint (da5de5d)

raulchen added the "go" label (add ONLY when ready to merge, run all tests) Dec 26, 2024

- fix (aa1970c)

raulchen enabled auto-merge (squash) December 27, 2024 03:06

westonpace mentioned this pull request Dec 27, 2024

ray write lance error lancedb/lance#3229 (closed)

- lint (af8afe2)

github-actions bot disabled auto-merge December 27, 2024 18:18

- Merge branch 'master' into data-sink-write-res (8f6baf7)

raulchen enabled auto-merge (squash) December 28, 2024 00:24

raulchen merged commit d9f69fd into ray-project:master Dec 28, 2024
5 of 6 checks passed

raulchen deleted the data-sink-write-res branch December 29, 2024 20:53
