Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speculative: should we return chunks of the object instead of chunks of indices? #25

Closed
MasonProtter opened this issue Feb 4, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@MasonProtter
Copy link
Member

I feel like chunks is a bit of a misnomer. If I do [x for x in chunks([:a, :b, :c, :d, :e]; n=2)], it seems much more natural and also more useful to me that the result would be [[:a, :b, :c], [:d, :e]] rather than [1:1:3, 4:1:5].

This way instead of writing

for idx in chunks(A; n)
    f(@view A[idx])
end

we could just simply write

for chunk in chunks(A; n)
    f(chunk)
end

The current behaviour could then be emulated by doing

for idx in chunks(eachindex(A); n)
    f(@view A[idx])
end

Any thoughts on this? @lmiq @carstenbauer ? I think doing it like this would potentially make it easier to do things like support Dict and Dictionary or other more exotic containers.

@MasonProtter MasonProtter added the enhancement New feature or request label Feb 4, 2024
@carstenbauer
Copy link
Member

I agree that it's a bit of a misnomer. I kind of like that it returns indices though. It's cheap and well defined. Maybe we should have both variants.

About returning chunks of data (i.e. elements), what would typeof(chunk) be for generic A? E.g. what if A is a CuArray? Also a CuArray?

@MasonProtter
Copy link
Member Author

MasonProtter commented Feb 4, 2024

I think that should be up to the implementer, but the default should be a view (so that it's just as cheap as the current version)

We could also have chunk_indices or index_chunks or something like that for the current behaviour

@lmiq
Copy link
Collaborator

lmiq commented Feb 5, 2024

I agree that the name is not very good, and appears to signify the data of the chunk, not its indices. I have actually implemented chunk_indices at some point, but took it back. Independently of the name, what I think is that the option that returns the indices is the easiest for the user, because otherwise, in

for chunk in chunks(A; n)
    f(chunk)
end

f is a function that operates on a collection, not in one element of the collection. My impression (and my own use of it) generally is that one has a function that operates on an element, and we want to apply it in parallel in a collection of such elements. Returning the indices makes everything very explicit for the user (and less prone to introducing allocations, instabilities, method errors).

That's why, after experimenting with some names, I ended up not changing the simplest name chunks to return something other than the indices, but I agree that the name is not ideal.

At the same time, except for the breaking aspect of the changes, I wouldn't mind that we had both behaviors in different functions (or just using eachindex).

@MasonProtter
Copy link
Member Author

MasonProtter commented Feb 5, 2024

I guess from my point of view, f is typically map or reduce or whatever, so I do think of it as operating on the (sub) collection.

@lmiq
Copy link
Collaborator

lmiq commented Feb 5, 2024

I do not disagree. I think the name is not good.

Also I think that most users of this package will end up using OhMyThreads instead, so in the long run having chunks(eachindex(),n) and chunk_indices here will be ok.

I'm just afraid of releasing another breaking change. From what I've checked none of the dependents (which are only ~10 by now) updated to the new interface.

@longemen3000
Copy link

a non-breaking change could be done by adding chunk_indices, and then mass-PR the packages? (i can help with that)

@lmiq
Copy link
Collaborator

lmiq commented Feb 7, 2024

We have to mass-PR packages for deprecating the old syntax anyway, so that's one alternative before releasing 3.0.

There are ~10 packages that depend on it, that's simple. I'm worried about the now various discourse threads that propose using this package, and the possible breaking of other non-public scripts.

@carstenbauer
Copy link
Member

carstenbauer commented Mar 1, 2024

Speaking of misnomer, Chunk should probably also be called ChunkIterator - similar to PartitionIterator that Iterators.partition returns - or, in light of this thread, maybe even ChunkIndicesIterator.

(Even if we don't like the "Iterator" part, it should at least be the plural Chunks).

@carstenbauer
Copy link
Member

carstenbauer commented Jun 17, 2024

To summarize, the current proposal is

  • Rename chunks to chunks_indices,
  • Introduce a new chunks that returns (views into) data rather than indices,
  • Rename Chunk type to ChunkIndicesIterator

@lmiq
Copy link
Collaborator

lmiq commented Jun 20, 2024

  • Chunk was an internal name up to 2.4.3, and I made it public just because OhMyThreads is using it, but it is not even documented. So that change is essentially non-breaking.
  • I would maybe also make getchunk and internal function. It is documented and exported now, but I do not see any use of it really outside the package, and having it exported just limits out implementation flexibility.

@carstenbauer carstenbauer mentioned this issue Sep 18, 2024
Closed
6 tasks
@carstenbauer
Copy link
Member

See 3.x release :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants