[SPARK-50735][CONNECT] Failure in ExecuteResponseObserver results in infinite reattaching requests #49370

Closed
wants to merge 15 commits into master from changgyoopark-db:SPARK-50735

Conversation

@changgyoopark-db (Contributor) commented Jan 6, 2025

What changes were proposed in this pull request?

The Spark Connect reattach request handler checks whether the associated ExecuteThreadRunner has completed, and returns an error if the runner failed to record an outcome.

Why are the changes needed?

ExecuteResponseObserver.{onError, onComplete} are fallible but are not retried; this can leave the ExecuteThreadRunner completed without ever having responded to the client, so the client keeps retrying by reattaching to the execution.

To be specific, if an ExecuteThreadRunner fails to record the completion of execution or an error on the observer and then simply disappears, the client will endlessly reattach, hoping that "someone" will eventually record "some data"; but since the ExecuteThreadRunner is gone, this is effectively a deadlock.
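
For illustration, a minimal, self-contained sketch of this failure mode, with simplified, hypothetical names rather than the actual Spark Connect classes:

trait ResponseObserverSketch {
  def onComplete(): Unit          // fallible, not retried
  def onError(t: Throwable): Unit // fallible, not retried
}

class RunnerSketch(observer: ResponseObserverSketch) {
  def run(): Unit = {
    try {
      // ... execute the plan and stream responses to the observer ...
      observer.onComplete() // may throw (e.g., OutOfMemoryError) and leave the stream open
    } catch {
      case t: Throwable =>
        observer.onError(t) // also fallible; if this throws, the runner just exits
    }
    // If neither call succeeded, no outcome is ever recorded on the observer,
    // and every subsequent ReattachExecute waits for data that will never arrive.
  }
}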

The fix is that when the client reattaches, the handler checks the status of the ExecuteThreadRunner and, if it finds that the execution cannot make any progress, returns an error.
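
A hedged sketch of that handler-side check; only isOrphan() mirrors the helper quoted later in this review thread, while the surrounding handler wiring is hypothetical:

object ReattachHandlerSketch {
  trait ExecuteHolderLike {
    // True iff the execution thread completed without recording a final response.
    def isOrphan(): Boolean
  }

  def handleReattach(holder: ExecuteHolderLike): Unit = {
    if (holder.isOrphan()) {
      // No thread is left to terminate the stream, so reattaching can never make
      // progress: fail fast instead of letting the client loop forever.
      throw new IllegalStateException(
        "Operation was orphaned: execution finished without a final response.")
    }
    // ... otherwise attach to the existing response stream as usual ...
  }
}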

Does this PR introduce any user-facing change?

No.

How was this patch tested?

testOnly org.apache.spark.sql.connect.service.SparkConnectServiceE2ESuite

Was this patch authored or co-authored using generative AI tooling?

No.

@changgyoopark-db force-pushed the SPARK-50735 branch 4 times, most recently from e495228 to d3391c7 on January 6, 2025 13:49
@changgyoopark-db (Contributor, Author) commented:

Hey, @juliuszsompolski , I hope you are doing well. Can you please review this change?
-> Short description: if ExecuteThreadRunner fails to record the completion/error on the observer (e.g., due to OOM), the client permanently tries to reattach to the execution.
-> The fix is to let the stream sender send an error if ExecuteThreadRunner is gone without having recorded anything.
-> This does not cover streaming queries (if there's any problem).

@juliuszsompolski (Contributor) left a comment:

Hi @changgyoopark-db
I think I need some more description to understand this change.

@changgyoopark-db marked this pull request as draft January 13, 2025 15:07
@changgyoopark-db marked this pull request as ready for review January 13, 2025 17:07
Comment on lines 133 to 139
def isOrphan(): Boolean = {
  // Check runner.completed() before the others: the acquire memory fence in that
  // method ensures that the current thread reads the last known state of
  // responseObserver correctly.
  runner.completed() &&
    !runner.shouldDelegateCompleteResponse(request) &&
    !responseObserver.completed()
}
@juliuszsompolski (Contributor) commented Jan 13, 2025:

I need to look at it fresh tomorrow because with the new version of the code I am again confused :-).
This basically checks if the ExecuteThreadRunner exited without sending onCompleted / onError.
onCompleted is sent at the end of executeInternal.
execute is wrapped in various try-catches.

How about checking there that executeInternal exited without closing the stream, and closing the stream with an onError from there? Then the RPC handler side should get this error via the usual route.

@changgyoopark-db (Contributor, Author) commented:

While self-reviewing the code, I was also confused (and I found a small logical hole involving an interrupt) :-( I'll resume working on it tomorrow. Also, closing the stream with an onError when reattaching seems like a good idea.

@changgyoopark-db (Contributor, Author) commented:

Just calling onError doesn't work nicely because the client will retry if a custom RetryPolicy is specified; it's translated into an UNKNOWN gRPC error.
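
For context, a hypothetical client-side retry predicate (not the actual Spark Connect client code) showing why an UNKNOWN status keeps the retry loop alive:

import io.grpc.{Status, StatusRuntimeException}

// Hedged sketch: a custom RetryPolicy that treats UNKNOWN as retryable. An
// exception surfaced through a bare onError arrives at the client as a
// StatusRuntimeException with code UNKNOWN, so such a policy keeps retrying.
object RetryPolicySketch {
  def shouldRetry(t: Throwable): Boolean = t match {
    case e: StatusRuntimeException =>
      e.getStatus.getCode match {
        case Status.Code.UNAVAILABLE | Status.Code.UNKNOWN => true
        case _ => false
      }
    case _ => false
  }
}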

@juliuszsompolski (Contributor) commented:

If you stick another ErrorUtils.handleError with a new IllegalStateException into this finally of execute, invoked when the stream is not closed (and not delegated) at that stage, wouldn't it get turned into an error that shouldn't be swallowed by the retry policy?

    } finally {
      // Make sure to transition to completed in order to prevent the thread from being interrupted
      // afterwards.
      var currentState = state.getAcquire()
      while (currentState == ThreadState.started ||
        currentState == ThreadState.startedInterrupted) {
        val interrupted = currentState == ThreadState.startedInterrupted
        val prevState = state.compareAndExchangeRelease(currentState, ThreadState.completed)
        if (prevState == currentState) {
          if (interrupted) {
            try {
              ErrorUtils.handleError(
                "execute",
                executeHolder.responseObserver,
                executeHolder.sessionHolder.userId,
                executeHolder.sessionHolder.sessionId,
                Some(executeHolder.eventsManager),
                true)(new SparkSQLException("OPERATION_CANCELED", Map.empty))
            } finally {
              executeHolder.cleanup()
            }
          }
          return
        }
        currentState = prevState
      }
    }
  }
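
A hedged sketch of that suggestion as a fragment continuing the snippet above; the accessors on executeHolder are assumptions for illustration, not actual methods in this PR:

// Hypothetical addition to the finally block above: if the runner is exiting
// without having closed or delegated the stream, report an explicit error so
// the client stops reattaching. Accessor names are assumed.
if (!executeHolder.responseObserver.completed() &&
    !executeHolder.shouldDelegateCompleteResponse()) {
  ErrorUtils.handleError(
    "execute",
    executeHolder.responseObserver,
    executeHolder.sessionHolder.userId,
    executeHolder.sessionHolder.sessionId,
    Some(executeHolder.eventsManager),
    true)(new IllegalStateException(
    "Execution completed without closing the response stream"))
}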

@changgyoopark-db (Contributor, Author) commented:

Putting something there (in the finally block of ExecuteThreadRunner) is not 100% safe, since creating an exception is itself a heap allocation and can therefore trigger an OOM error. That's why I would like to stick to handling the situation in the reattach handler.

  • OOM there (ExecuteThreadRunner) is unrecoverable.
  • Reattach->check will eventually succeed if the JVM manages to spare some memory.

@changgyoopark-db marked this pull request as draft January 14, 2025 08:35
@changgyoopark-db marked this pull request as ready for review January 14, 2025 09:10
@changgyoopark-db (Contributor, Author) commented:

@HyukjinKwon Hi, can you please merge this PR? Thanks!

@HyukjinKwon (Member) commented Jan 20, 2025:

Merged to master and branch-4.0.

HyukjinKwon pushed a commit that referenced this pull request Jan 21, 2025
[SPARK-50735][CONNECT] Failure in ExecuteResponseObserver results in infinite reattaching requests

Closes #49370 from changgyoopark-db/SPARK-50735.

Lead-authored-by: changgyoopark-db <[email protected]>
Co-authored-by: Changgyoo Park <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 07aa4ff)
Signed-off-by: Hyukjin Kwon <[email protected]>
@changgyoopark-db deleted the SPARK-50735 branch January 21, 2025 06:17