Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes for 3.0 #48

Merged
merged 40 commits into from
Sep 25, 2024
Merged

Changes for 3.0 #48

merged 40 commits into from
Sep 25, 2024

Conversation

carstenbauer
Copy link
Member

@carstenbauer carstenbauer commented Sep 18, 2024

Closes #47

Copied from there:
A summary of the new API / changes that need to be made (likely incomplete and maybe some points are still up for debate):

  • new function chunk(x; ...) that returns an iterator that delivers the chunks, i.e. views into x. (We choose chunk, singular, here to indicate the verb form to chunk. This is to be in line with enumerate, zip, and co.)
  • rename chunks(x; ...) to chunk_indices(x; ...) and make the latter return an iterator that delivers the indices (of x) for each chunk.
  • iterator names: Chunk (returned by chunk) and ChunkIndices (returned by chunk_indices). This is in line with Base.Iterators.Enumerate (returned by enumerate), Base.Pairs (returned by pairs), and Base.Iterators.Zip (returned by zip) I implemented a single iterator ChunksIterator that has a type parameter ReturnIndices or ReturnViews to indicate what will be eventually returned.
  • drop the legacy API that relies on positional arguments
  • drop the Symbol option for split. While it is slightly more convenient, it can be less efficient and it doesn't make much sense to have two ways to do the same thing.
  • make getchunk internal (not public API anymore) + rename it to getchunkindices

What I think might make sense but I'm unsure about / not sure it's worth it:

  • Should we rename BatchSplit to ContiguousSplit? The latter seems to communicate the critical point more clearly: the resulting chunks hold contiguous data.
  • I don't like that we use "chunk" on the one hand and "split" on the other hand. Why two terms for the same thing? Perhaps something like chunk(x; n=4, strategy=Contiguous()) or something less verbose makes more sense as it gets rid of "split" terminology entirely?

@carstenbauer
Copy link
Member Author

julia> using ChunkSplitters

julia> function f(x)
           s = zero(eltype(x))
           for xc in chunk(x; n=4)
               s += sum(xc)
           end
           return s
       end
f (generic function with 1 method)

julia> x=rand(10^3);

julia> f(x)  sum(x)
true

julia> function f_indices(x)
           s = zero(eltype(x))
           for idcs in chunk_indices(x; n=4)
               s += sum(@view(x[idcs]))
           end
           return s
       end
f_indices (generic function with 1 method)

julia> f_indices(x)  sum(x)
true

julia> @benchmark f($x)
BenchmarkTools.Trial: 10000 samples with 788 evaluations.
 Range (min … max):  160.057 ns … 195.801 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     160.426 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.774 ns ±   1.690 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██▄                                                          ▂
  ████▇▆▅▄▅▅▅▆▇██▆▅▅▄▄▅▅▄▃▅▅▆▄▄▅▄▅▄▄▅▆▆▇▇▇███▇▇▅▆▅▅▅▅▃▅▁▄▄▄▄▁▄▆ █
  160 ns        Histogram: log(frequency) by time        170 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f_indices($x)
BenchmarkTools.Trial: 10000 samples with 788 evaluations.
 Range (min … max):  159.688 ns … 181.313 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     159.898 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.350 ns ±   1.865 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▄▂      ▁                                                   ▁
  █████▆▆▆▆▇█▇▇▅▆▅▄▄▄▁▁▆▅▄▅▅▄▃▃▅▅▇▆▆▇██▇▆▅▆▆▅▄▃▅▄▄▅▅▅▄▅▆▆▃▄▆▃▄▇ █
  160 ns        Histogram: log(frequency) by time        171 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@carstenbauer
Copy link
Member Author

carstenbauer commented Sep 19, 2024

Preview of the revamped documentation: https://juliafolds2.github.io/ChunkSplitters.jl/previews/PR48/

@carstenbauer
Copy link
Member Author

carstenbauer commented Sep 20, 2024

On 1.10, chunk(x; size=10) seems to allocate whereas chunk_indices(x; size=10) does not. Why?

function f_chunk(x; n=nothing, size=nothing)
    s = zero(eltype(x))
    for xdata in chunk(x; n=n, size=size)
        s += sum(xdata)
    end
    return s
end
x = rand(10^3)
b = @benchmark $f_chunk($x; n=4) samples = 1 evals = 1
@test b.allocs == 0
b = @benchmark $f_chunk($x; size=10) samples = 1 evals = 1
if VERSION > v"1.10"
    @test b.allocs == 0
else
    @test_broken b.allocs == 0 # TODO: why does this allocate on 1.10?
end
julia> b = @benchmark $f_chunk($x; size=10) samples = 1 evals = 1
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 1.542 μs (0.00% GC) to evaluate,
 with a memory estimate of 6.19 KiB, over 99 allocations.

@carstenbauer
Copy link
Member Author

carstenbauer commented Sep 20, 2024

I noticed that the svg in the old/current docs is wrong.

Indicating the scattering by coloring chunks:

splitters_colors

Indicating the scattering by arrows:
splitters_arrows

I prefer the latter (and will use it).

@carstenbauer
Copy link
Member Author

Alright guys, as it turned out, I had the motivation to implement this right away. Please checkout the new (preview) docs and run some tests to see if things are stable.

What I did beyond the points mentioned in the OP is drop 1.6 support (1.10 is the new lower bound).

Also, if you have an idea about #48 (comment) that would be great as well.

@carstenbauer carstenbauer marked this pull request as ready for review September 20, 2024 10:30
@carstenbauer
Copy link
Member Author

carstenbauer commented Sep 20, 2024

@MasonProtter
Copy link
Member

I'll try to look at this tonight. I suspect there's an inference problem here on 1.10, but I'll dig in and see if we can figure out what went wrong.

@carstenbauer
Copy link
Member Author

carstenbauer commented Sep 21, 2024

I also like these changes, in any case.

I still think that it might make sense to consider renaming Batch → Consecutive and Scatter → RoundRobin. (I've just done this in a new commit).

As for renaming split to strategy (or something similar), there are two reasons I didn't do it right away (but I'm still open to be convinced):

  1. strategy feels long compared to split
  2. The package name ChunkSplitters.jl already establishes "chunk" and "split" as terminology.

@codecov-commenter
Copy link

codecov-commenter commented Sep 21, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 92.03540% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.03%. Comparing base (ceadbb6) to head (1d4cdf4).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/internals.jl 91.89% 9 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #48      +/-   ##
==========================================
- Coverage   96.33%   92.03%   -4.30%     
==========================================
  Files           2        2              
  Lines         300      113     -187     
==========================================
- Hits          289      104     -185     
+ Misses         11        9       -2     
Flag Coverage Δ
92.03% <92.03%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/internals.jl Outdated Show resolved Hide resolved
src/internals.jl Outdated Show resolved Hide resolved
@carstenbauer
Copy link
Member Author

carstenbauer commented Sep 24, 2024

I've implemented the suggestions by @fredrikekre (thanks for them!).

One remaining minor question is whether we should export the subtypes <:Split, i.e. Consecutive and RoundRobin (in any case, they are public API). On the one hand, it is nice that one can just do using ChunkSplitters and write chunks(x; n=10, split=RoundRobin()). OTOH, the terms are quite generic and might cause name clashes (e.g. https://juliahub.com/ui/Search?type=symbols&q=Consecutive&u=define).

If we don't export them one has to write

chunks(x; n=10, split=ChunkSplitters.RoundRobin())

# or, alternatively

using ChunkSplitters: RoundRobin
chunks(x; n=10, split=RoundRobin())

Again, I don't have a strong opinion but tend to slightly prefer not exporting them.

We could also name them RoundRobinSplit and ConsecutiveSplit, of course, at the cost of redundancy (split=RoundRobinSplit()).

Opinions?

Copy link
Contributor

@fredrikekre fredrikekre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One remaining minor question is whether we should export the subtypes <:Split

I prefer not exporting, in particular since the practice of always explicitly using symbols anyway is getting more common (https://github.com/ericphanson/ExplicitImports.jl).

On the one hand, it is nice that one can just do using ChunkSplitters and write chunks(x; n=10, split=RoundRobin()).

This export is actually slightly annoying because the perfect variable name for the return value is chunks, and obviously chunks = chunks(...) doesn't work.

With the current export situation I would simply use

using ChunkSplitters: ChunkSplitters

and always use the ChunkSplitters. prefix for clarity.

src/internals.jl Outdated Show resolved Hide resolved
src/internals.jl Outdated
size::Int
end

struct IndexChunksIterator{T,C<:Constraint,S<:Split} <: AbstractChunksIterator{T,C,S}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More bikeshedding, but since you can random access these, is the Iterator suffix misleading? Perhaps it is just me, but I would have guessed this could only be iterated from the name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bikeshedding indeed 😉 I'm fine dropping the Iterator suffix. Let me just note that

  1. the primary use case is to iterate over these, and
  2. this isn't public API anyways (so we can rename these types whenever we want)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer exporting the symbols, not bringing them to the namespace is always an option for the user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lmiq, wrong thread? I assume by "the symbols" you mean Consecutive and RoundRobin and aren't suggesting to make the "iterator" types public API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I've dropped the "Iterator" suffixes.

@carstenbauer
Copy link
Member Author

carstenbauer commented Sep 24, 2024

I prefer not exporting, in particular since the practice of always explicitly using symbols anyway is getting more common (ericphanson/ExplicitImports.jl).

On the one hand, it is nice that one can just do using ChunkSplitters and write chunks(x; n=10, split=RoundRobin()).

I like the trend of explicitly using the bits that one actually needs. However, for this case, it doesn't matter whether we export the <:Split types or not. The question is about what should be "imported" if one simply does using ChunkSplitters. Not exporting anything (are you suggesting that?) seems radical to me. The naive using ChunkSplitters is useful in interactive workflows where I don't want to explicitly list everything I need.

This export is actually slightly annoying because the perfect variable name for the return value is chunks, and obviously chunks = chunks(...) doesn't work.

With the current export situation I would simply use

using ChunkSplitters: ChunkSplitters

and always use the ChunkSplitters. prefix for clarity.

Once again, not exporting chunks, the main API of this package, seems bad/too radical to me. And while I see your point about the chunks name clash, the same could be said about size = size(...), length = length(...), and probably many more functions. Moreover, the typical use case of chunks(...) is in combination with for or map in which case you don't store it in a variable anyways. As you pointed out, you can always opt-out of the "import everything".

@fredrikekre
Copy link
Contributor

Yea I just meant that if anything chunks would be the "problematic" export, not the split types.

@lmiq
Copy link
Collaborator

lmiq commented Sep 24, 2024

Just to say that I think this PR is fine, but we need to find a solution to #48 (comment) .

(even if it means reporting an inference bug to 1.10.5 and see if the solution can be back-ported from 1.11)

@fredrikekre
Copy link
Contributor

#50 fixes #48 (comment)

@carstenbauer carstenbauer mentioned this pull request Sep 24, 2024
Closed
6 tasks
@carstenbauer
Copy link
Member Author

I think this is good to go (unless we want to make other breaking changes).

@lmiq
Copy link
Collaborator

lmiq commented Sep 25, 2024

Thanks a lot @carstenbauer, and others of course, great work!

@lmiq lmiq merged commit 9d18e48 into main Sep 25, 2024
24 checks passed
@lmiq
Copy link
Collaborator

lmiq commented Sep 25, 2024

Released 3.0 now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3.0
5 participants