Description:
I kindly request the implementation of streaming support for model invocations in the Modus SDK. This feature is crucial for real-time applications that require incremental responses.
Expected Changes:
Introduce a stream: true option in the ChatModelInput configuration.
Provide a mechanism to handle tokens as they are received (a rough usage sketch follows below).
Benefits:
Reduces latency for real-time applications.
Enhances user experience by providing immediate feedback.
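For illustration, here is a rough sketch of how this might look from an AssemblyScript function. The `stream` flag and `onToken` callback are hypothetical and do not exist in the SDK today; the surrounding chat-model calls follow the current API as I understand it:

```typescript
import { models } from "@hypermode/modus-sdk-as";
import {
  OpenAIChatModel,
  SystemMessage,
  UserMessage,
} from "@hypermode/modus-sdk-as/models/openai/chat";

export function generateTextStreaming(prompt: string): string {
  const model = models.getModel<OpenAIChatModel>("text-generator");
  const input = model.createInput([
    new SystemMessage("You are a helpful assistant."),
    new UserMessage(prompt),
  ]);

  // Hypothetical additions requested here - not part of the current SDK:
  input.stream = true;
  input.onToken = (token: string): void => {
    // handle each token as it arrives, e.g. forward it to the caller
  };

  const output = model.invoke(input);
  return output.choices[0].message.content;
}
```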
Thank you for considering this request.
Hi. Thanks for the feature request. We've looked into this a bit already and do intend to add this capability. Though I can't give you a precise timeline for when it will be available, it is on our roadmap. I'll keep this issue open for now so you and others can provide feedback.
To provide some background information, there are multiple parts to supporting this feature:
Currently, outbound HTTP request bodies and the resulting HTTP response bodies are passed to and from WASM memory synchronously and in their entirety (as byte arrays). The entire request/response cycle happens in a single host function call. It would be a considerable effort to implement true streaming for our HTTP API in general. At some future point we'll be able to take advantage of WASI-HTTP, but that's not yet available in our upstream dependencies.
A more likely interim step is for us to provide a different HTTP API in our SDKs, designed specifically for Server-Sent Events (SSE) APIs such as the one OpenAI uses for its streaming results. It would take a function callback as a parameter, which would continuously receive individual SSE messages while the connection is open. That would let you run custom code in response to each event message in real time. It would also include a way to terminate the connection in response to an event, if desired.
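To make that concrete, here is one hypothetical shape such an API could take. None of these names exist in the SDK today; this is only a sketch of the callback-per-event idea described above:

```typescript
// Hypothetical SSE-oriented HTTP API - a sketch only, not an existing Modus API.
import { http } from "@hypermode/modus-sdk-as";

export function callStreamingEndpoint(): void {
  const request = new http.Request("https://api.openai.com/v1/chat/completions");

  // The callback would be invoked once per SSE message while the connection is open.
  // Returning false would terminate the connection early.
  http.fetchEvents(request, (event: string): bool => {
    if (event == "[DONE]") {
      return false; // stop listening
    }
    // custom handling of each event message in real time
    return true;
  });
}
```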
Modus currently offers only GraphQL endpoints, and exposes only the Query and Mutation root operation types. Thus, even if you had discrete events streamed from a model, we'd have no way to return any results until the entire response was ready.
One way we could handle streaming responses is to support the @defer and @stream GraphQL directives in queries. However, these are still experimental and not standardized in GraphQL (as far as I am aware).
A more readily available solution would be to implement support for GraphQL Subscription operations, using SSE as the transport mechanism. One would subscribe to a particular function, and that function would have a mechanism to emit events without exiting. The caller would then handle those events, received in a GraphQL-compliant SSE stream.
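Sketching that idea with entirely hypothetical names: a subscribed function could push events to the caller through some emitter provided by the SDK, and only return when the stream is complete:

```typescript
// Hypothetical GraphQL Subscription support - illustrative names only.
import { subscriptions } from "@hypermode/modus-sdk-as"; // does not exist today

export function watchGeneration(prompt: string): string {
  // Each emit would be delivered to subscribers as one event in a
  // GraphQL-compliant SSE stream, while this function keeps running.
  subscriptions.emit("progress", "started");
  // ... do incremental work, emitting as results become available ...
  subscriptions.emit("progress", "halfway there");
  return "final result"; // returning completes the subscription
}
```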
Once we have the above two items worked out, there'd still be some work to use these features in the models APIs of our SDK. This part shouldn't be too difficult though.
I'd be very interested in collecting some use cases. There are two I can think of:
Aborting an LLM response midway through generation, such as when the output is a hallucination or does not conform to the expected format.
Delivering the LLM's output as it is being generated, as with the typing animation one sees in ChatGPT and other online AI assistants.