SLEP005: Resampler API #15
Do we have tracks?
I think the main point is that it is an estimator that returns y right?
Something that is somewhat related to this SLEP is PLS. Our PLS needs some love but it can also return X and Y from transform. Is this also covered here? Does that need a different API? Is it considered out of scope of sklearn and should be deprecated?
More specifically, their method is different because it does not require correspondence between input and output samples at fit time.
I think "input" and "output" are not good words. I thought you meant X and y for a second ;) Is this the real difference? I guess that would be forbidden by the common tests right now. Maybe there are two main points: the number of samples returned can change, and it returns X and y?
I remember people talking about using Birch for replacing samples by cluster centers. Birch is also a transformer. The pipeline wouldn't know what to do when something is a transformer and a resampler, right? So they must be exclusive? That's also a new concept. We don't have any exclusive features, I think, i.e. we don't really have types of estimators, any estimator can have any method right now, but not after this SLEP.
I guess that's discussed somewhat below, but maybe it should be made more explicit?
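To make the two points above concrete, here is a minimal pure-Python sketch of the contract being proposed (the class and its behaviour are hypothetical illustrations, not the SLEP's implementation): a resampler's fit_resample may change the number of samples and returns both X and y, which is exactly what transform is never allowed to do.

```python
import random

class ToyUnderSampler:
    """Hypothetical resampler: fit_resample may change n_samples
    and returns (X, y), unlike a transformer's transform."""

    def __init__(self, seed=0):
        self.seed = seed

    def fit_resample(self, X, y):
        rng = random.Random(self.seed)
        # Group samples by class label.
        by_class = {}
        for xi, yi in zip(X, y):
            by_class.setdefault(yi, []).append(xi)
        # Keep as many samples of each class as the minority class has.
        n_min = min(len(rows) for rows in by_class.values())
        X_out, y_out = [], []
        for label, rows in by_class.items():
            for xi in rng.sample(rows, n_min):
                X_out.append(xi)
                y_out.append(label)
        return X_out, y_out

X = [[0], [1], [2], [3], [4]]
y = [0, 0, 0, 0, 1]
X_res, y_res = ToyUnderSampler().fit_resample(X, y)
# n_samples changed (5 -> 2): sample correspondence is intentionally broken.
```

Because the output rows no longer correspond one-to-one to the input rows, an estimator like this cannot also act as a transformer inside a FeatureUnion, which is the exclusivity concern raised above.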
Is outlier rejection for data cleaning actually a use-case? Can you point to a reference where someone used this? I have never heard of anyone using this as preprocessing and it strikes me as a bad idea.
Another use-case that I think is much more common is data augmentation. It's maybe not as common with the kind of data typical for sklearn but it's still a very common practice in ML.
You should mention sample reduction, e.g. representing a dataset by its k-means centroids.
I also note that the former two cases can be handled merely by the ability to modify sample_weight in a Pipeline.
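The point that some of these cases reduce to modifying sample_weight can be illustrated with a small sketch (the helper name is hypothetical): collapsing duplicate samples into one weighted sample preserves the dataset's information while shrinking n_samples.

```python
from collections import Counter

def reduce_by_weight(X, y):
    """Hypothetical helper: collapse duplicate (x, y) pairs into a single
    sample carrying a sample_weight equal to the number of duplicates.
    Illustrates resampling expressed as a sample_weight modification."""
    counts = Counter((tuple(xi), yi) for xi, yi in zip(X, y))
    X_out = [list(key[0]) for key in counts]
    y_out = [key[1] for key in counts]
    w_out = [counts[key] for key in counts]
    return X_out, y_out, w_out

X = [[0], [0], [1], [1], [1]]
y = [0, 0, 1, 1, 1]
X_r, y_r, w_r = reduce_by_weight(X, y)
# 5 samples collapse to 2; the weights [2, 3] preserve the original mass.
```

For this to work inside a Pipeline, steps would need a sanctioned way to emit modified sample weights, which is the gap discussed under the sample-weight alternative below.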
I think mentioning this in the alternative implementations is enough.
I think it should be noted that the required changes are mostly (exclusively?) to Pipeline. Is anything else affected? Some other meta-estimators?
And of course our API and contract changes. But the implementation is all contained within the pipeline and the resamplers?
Yes, changes are limited to new resamplers and the composition implementation (so either Pipeline or ResampledTrainer). I've also had to make several changes to existing estimator checks, since many of them assumed that the estimator implements fit.
It can't. The sample correspondence will be violated. It must raise an error, as does fit_predict.
There was some more discussion on this in #15 (comment). I agree that raising an error seems like the best option.
Most consistent is to define fit_transform as applying fit and then transform. Maybe it's just not entirely clear what "yes" in this table means. The samples output by fit_transform must correspond to the input; they cannot be resampled.
I dislike defining fit_transform as .fit(X, y).transform(X,y) because of the hidden overhead of going through the pipeline twice. Imagine you have a very slow pipeline, and now you add a resampler to the beginning. Your pipeline is now potentially much slower, and it's not clear why.
If calling fit_transform throws an error, perhaps the error message could say something like: "Calling fit_transform on a pipeline containing resamplers is not supported. You may want to call fit and transform separately."
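The raise-an-error option could look like the following sketch (class name and the contains_resampler flag are hypothetical; only the suggested error message comes from the comment above):

```python
class ResamplingPipelineSketch:
    """Hypothetical pipeline wrapper: refuses fit_transform when any
    step is a resampler, instead of silently running fit(X, y)
    followed by transform(X) (two passes through the pipeline)."""

    def __init__(self, contains_resampler):
        self.contains_resampler = contains_resampler

    def fit_transform(self, X, y=None):
        if self.contains_resampler:
            raise TypeError(
                "Calling fit_transform on a pipeline containing resamplers "
                "is not supported. You may want to call fit and transform "
                "separately."
            )
        return X  # placeholder for the normal (non-resampling) path

message = ""
try:
    ResamplingPipelineSketch(contains_resampler=True).fit_transform([[0], [1]])
except TypeError as exc:
    message = str(exc)
```

Raising keeps the cost of the two-pass semantics explicit: the user decides whether to pay for fit followed by transform.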
I think @jnothman's point is that it should be equivalent to .fit(X, y).transform(X) up to numerical issues, not that it should be implemented like that. Btw, you have a y in your transform that shouldn't be there ;)
Ok I'm pretty convinced now that this should be "not supported" and we should explain below why.
Can someone give me an example where a pipeline is used as a resampler? Can we just not have a pipeline have that and get rid of the fit_transform/fit.transform issue?
What's an example where we need a pipeline to provide fit_resample?
If a Pipeline provides fit_resample by virtue of including a resampler, and if "an estimator cannot implement both fit_resample(X, y) and fit_transform(X)/transform(X)", then a resampling Pipeline cannot provide transform. To avoid this contradiction, I am okay to allow an estimator to provide fit_resample and transform, but not fit_transform; and a pipeline containing this resampler should call fit_resample(X_train) and transform(X_test).
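A sketch of the fit_resample-plus-transform (but no fit_transform) combination proposed above; the OutlierTrimmer class and its clipping behaviour are hypothetical illustrations:

```python
class OutlierTrimmer:
    """Hypothetical estimator offering fit_resample (training path) and
    transform (test path), but deliberately no fit_transform."""

    def __init__(self, threshold=10.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def fit_resample(self, X, y):
        # Training data: drop outlying samples, so n_samples may shrink.
        kept = [(xi, yi) for xi, yi in zip(X, y)
                if abs(xi[0]) <= self.threshold]
        return [xi for xi, _ in kept], [yi for _, yi in kept]

    def transform(self, X):
        # Test data: sample correspondence preserved; clip instead of drop.
        return [[max(-self.threshold, min(self.threshold, xi[0]))]
                for xi in X]

trimmer = OutlierTrimmer(threshold=10.0)
# Train path: the outlier [99.0] is removed.
X_train, y_train = trimmer.fit_resample([[1.0], [2.0], [99.0]], [0, 1, 1])
# Test path: every input sample yields exactly one output sample.
X_test = trimmer.transform([[99.0], [3.0]])
```

The key property is that the train path may change n_samples while the test path never does, which is why a pipeline containing such a step can still score held-out data.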
Now I remember one of the reasons we cannot support transform: resampling cannot be part of a FeatureUnion or similar. I am okay with that constraint.
Wait, no, as long as .fit(X).transform(X) is used, it should be fine.
Sorry I lost you lol.
This is weird because we allow fit().predict() and this seems a bit arbitrary. Can you give an example of this breaking?
I don't understand this bullet and the second sentence doesn't parse.
Alternative 3: sample weights
not sure about this...
Yeah, we don't allow it being passed to predict right now, right? I think the main limitation is that estimators can't produce sample weights, so there is no way to actually implement this in a pipeline right now without adding a new method, right?
Also I think runtime is a major issue for highly imbalanced datasets. If I have 1000:1 imbalance there will be a big runtime difference between using balanced random forests and random forests with sample weights.
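For reference, the sample-weight alternative usually means the standard "balanced" heuristic, n_samples / (n_classes * class_count); a small sketch (the helper name is hypothetical):

```python
from collections import Counter

def balanced_sample_weight(y):
    """Hypothetical sketch of the sample-weight alternative: weight each
    sample by n_samples / (n_classes * count_of_its_class)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return [n / (k * counts[yi]) for yi in y]

# Imbalance in miniature: 4 negatives, 1 positive.
weights = balanced_sample_weight([0, 0, 0, 0, 1])
# The minority sample gets weight 2.5, each majority sample 0.625 --
# but every estimator downstream still iterates over all 5 samples,
# which is the runtime concern raised above for 1000:1 imbalance.
```

Unlike undersampling, this keeps every sample in the training set, so the fit cost does not shrink with the minority-class size.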
Like everything, a meta-estimator could be implemented providing this functionality. However, this is difficult to use, unintuitive, and would have to implement all methods and attributes appropriate to the contexts where it may be used (regression, decomposition, etc.).
It might be illustrative to show a common use-case written both with a meta-estimator and with the pipeline approach?
Are there use cases where the resampler is not the first step in the pipeline?
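On the second question: yes, e.g. resampling after a preprocessing step, so the resampling decision is made in the transformed space. A sketch of how a Pipeline.fit could dispatch on the proposed fit_resample mid-pipeline (all class and function names here are hypothetical toys, not the SLEP's implementation):

```python
def fit_pipeline(steps, X, y):
    """Hypothetical sketch of Pipeline.fit under the proposal: steps with
    fit_resample rewrite (X, y) during fit only; other intermediate steps
    fit-and-transform X as usual; the last step is the final estimator."""
    for step in steps[:-1]:
        if hasattr(step, "fit_resample"):
            X, y = step.fit_resample(X, y)
        else:
            X = step.fit(X, y).transform(X)
    steps[-1].fit(X, y)
    return steps

class Centerer:
    """Toy transformer: subtract the feature mean."""
    def fit(self, X, y=None):
        self.mean_ = sum(xi[0] for xi in X) / len(X)
        return self
    def transform(self, X):
        return [[xi[0] - self.mean_] for xi in X]

class DropNegatives:
    """Toy resampler placed *after* the transformer: keeps samples that
    are non-negative in the centered space."""
    def fit_resample(self, X, y):
        kept = [(xi, yi) for xi, yi in zip(X, y) if xi[0] >= 0]
        return [xi for xi, _ in kept], [yi for _, yi in kept]

class MajorityClassifier:
    """Toy final estimator: remembers the most frequent label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

steps = [Centerer(), DropNegatives(), MajorityClassifier()]
fit_pipeline(steps, [[1.0], [2.0], [3.0], [10.0]], [0, 0, 1, 1])
```

Here the resampler cannot be moved to the front: which samples it drops depends on the centering computed by the previous step.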
Add that this is an implementation of alternative 1
Shouldn't we link to imbalanced-learn? What are the differences between this proposal and imbalanced-learn, if any?
fix this?