-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Figure out how to combine efforts with xpartition #15
Comments
I agree that could be a good collaboration point. How do you see this going? I could see xarray-beam depending on xpartition, the shared functionality being up-streamed to xarray, or combining xpartition and xarray-beam into one package. One strength of xpartition is that it can easily be used with anything that supports iterating over a sequence (e.g. beam, slurm array jobs, k8s indexed jobs, etc).
I’d be interested to here @SpencerC’s thoughts.
|
I'd personally like to see a more modularized approach where xpartition and xarray-beam remain independent and shared functionality is up-streamed. I see distinct use cases for each (although I'm partial to beam workflows myself). |
I'm not opposed to either pushing components upstream or even adding xpartition as an xarray-beam dependency, but it's easier to develop in a single repository until we know what the right reusable abstractions are :) For now I've tried to keep the abstractions pretty light-weight, but even one of the few abstractions we have already ( The logic around partitioning xarray Datasets into smaller chunks does feel very generic, so in the long term I do like the idea of trying to split it out, assuming that there are actual concrete non-Beam use-cases. Xarray-Beam could just be a thin layer with |
Makes sense. I'd suggest working out of the xpartition repo since it has a more permissive license and a code coverage step in the CI, and add it as submodule here; unless you've got a google3/copybara workflow this would overcomplicate and slow you down along other project paths?
As amazing as beam is, I've encountered at least one instance in geodata processing where the right call was to break out of beam and roll our own distributed workflow system for one step. If you can get it upstreamed, there are also definitely benefits to maintaining functionality in a heavily used, actively managed project with lots of contributors (assuming it's well administrated). |
Hah, @SpencerC are you are (ex-)Googler or are we just notorious? :) I do have a convenient sync setup for xarray-beam/google3 -- and we have a handful of internal uses that serve as integration tests -- but that's not really a good reason not to do open development, if others want to get involved! It is easy enough for me to mirror other external projects. It does seem more plausible that xpartition could/should be the shared foundation rather than xarray-beam, which will require a Beam dependency. I would rather not push this all the way to xarray (yet), until we really figure out what we're doing. (IMO the differences between MIT/Apache licenses and a code-coverage step in CI are not really material factors)
I agree, there are for sure cases where Beam does not make sense. One obvious one is saving outputs from a numerical model. ML inference on specialized hardware might be another. That said -- what is the specific shared functionality? Some possibilities:
sure, but who do you think is going to be on hook for maintaining said upstream project? ;) |
I think we probably have enough experience that we could make a decent design, so I wouldn't be too worried about needing to track xpartition too closely as a dependency. In this case, the abstraction is basically set theory (e.g. union, intersection, partitions of sets, and ways to map between partitions). I always find designs work pretty well when the code abstractions line up with math abstractions. |
I'm also not convinced this simple functionality would be more maintainable in xarray, although it could replace the map_blocks function, which I assume is not particularly easy to maintain... |
@spencerkclark and @nbren12's xpartition looks like it solves some very similar problems.
It would be great to figure out how to work together, e.g., to share helper functions for working with dataset "blocks" without using dask: spencerkclark/xpartition#13
The text was updated successfully, but these errors were encountered: