
SLEP005: Resampler API #15

Status: Open
Wants to merge 27 commits into base: main

Commits (27)
0c49bfb  SLEP005: Outlier Rejection API (glemaitre, Mar 1, 2019)
4ecc51b  Update slep005/proposal.rst (orausch, Mar 2, 2019)
c855ffe  Update slep (glemaitre, Mar 2, 2019)
c16ef7b  Update slep005/proposal.rst (adrinjalali, Mar 5, 2019)
e2f6a70  Update proposal based on discussion (orausch, Jun 25, 2019)
8f8ebb6  Reword semisupervised usecase (orausch, Jun 25, 2019)
ae03400  Add description of first few pipeline methods (orausch, Jun 26, 2019)
10c85ff  Add code examples (orausch, Jun 26, 2019)
c39d615  formatting (orausch, Jun 26, 2019)
2de0d48  Formatting and cleanup (orausch, Jun 27, 2019)
387b338  even more formatting (orausch, Jun 27, 2019)
5ecfead  more formatting (orausch, Jun 27, 2019)
e7faa6e  try these headings (orausch, Jun 27, 2019)
a4019ed  last one (orausch, Jun 27, 2019)
87a1d5d  Some changes based on the discussion in the thread (#1) (glemaitre, Jul 3, 2019)
5ddc6f9  minor rephrasing (glemaitre, Jul 3, 2019)
cde164b  address comments (glemaitre, Jul 3, 2019)
e87fd7e  Apply suggestions from code review (glemaitre, Jul 3, 2019)
ad4e94f  Some text about resampling pipelines and their issues (jnothman, Jul 3, 2019)
b989562  Some text about resampling pipelines and their issues (#2) (jnothman, Jul 3, 2019)
ee197cb  minor changes (glemaitre, Jul 3, 2019)
35c140d  iter (glemaitre, Jul 3, 2019)
bc45d6a  Some comments on fit params (jnothman, Aug 26, 2019)
e306795  Merge branch 'slep005' of https://github.com/glemaitre/enhancement_pr… (jnothman, Aug 26, 2019)
79123fb  Slep005: Some comments on fit params (#3) (glemaitre, Sep 9, 2019)
8538e82  Merge branch 'master' into slep005 (glemaitre, Sep 23, 2019)
c64044d  Merge branch 'master' into slep005 (jnothman, Aug 17, 2020)
1 change: 1 addition & 0 deletions index.rst

@@ -26,6 +26,7 @@
 slep002/proposal
 slep003/proposal
 slep004/proposal
+slep005/proposal

 .. toctree::
    :maxdepth: 1
109 changes: 109 additions & 0 deletions slep005/proposal.rst

@@ -0,0 +1,109 @@
.. _slep_005:

=============
Resampler API
=============

:Author: Oliver Rausch ([email protected]),
         Christos Aridas ([email protected]),
         Guillaume Lemaitre ([email protected])
:Status: Draft
:Type: Standards Track

Review comment (Member): Do we have tracks?

:Created: 2019-03-01
:Resolution: <url>

Abstract
--------

We propose the inclusion of a new type of estimator: the resampler. A
resampler changes the samples in ``X`` and ``y``. In short:
* resamplers will reduce or augment the number of samples in ``X`` and
  ``y``;
* ``Pipeline`` should treat them as a separate type of estimator.

Review comment (Member): I think the main point is that it is an estimator that returns ``y``, right?

Something that is somewhat related to this SLEP is PLS. Our PLS needs some love, but it can also return X and Y from ``transform``. Is this also covered here? Does that need a different API? Is it considered out of scope for sklearn, and should it be deprecated?

Review comment (Member), suggested change on the first bullet: "resamplers are able to reduce or augment the number of samples in ``X`` and ``y``". More specifically, their method is different because it does not require correspondence between input and output samples at fit time.

Review comment (Member): I think "input" and "output" are not good words. I thought you meant X and y for a second ;)

Is this the real difference? I guess that would be forbidden by the common tests right now. Maybe there are two main points: the number of samples returned can change, and it returns X and y?

I remember people talking about using Birch for replacing samples by cluster centers. Birch is also a transformer. The pipeline wouldn't know what to do when something is both a transformer and a resampler, right? So they must be exclusive? That's also a new concept. We don't have any exclusive features, I think, i.e. we don't really have types of estimators; any estimator can have any method right now, but not after this SLEP. I guess that's discussed somewhat below, but maybe it should be made more explicit?
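
A minimal sketch of the contract proposed in this abstract may help. Only the ``fit_resample(X, y)`` method and its return convention come from this SLEP; the class below and its use of ``IsolationForest`` are illustrative assumptions, not part of the proposal:

.. code-block:: python

    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.ensemble import IsolationForest


    class OutlierRejector(BaseEstimator):
        """Hypothetical resampler that drops samples flagged as outliers."""

        def fit_resample(self, X, y):
            # IsolationForest.fit_predict returns +1 for inliers and -1 for
            # outliers; keep only the inliers in both X and y.
            inliers = IsolationForest(random_state=0).fit_predict(X) == 1
            return X[inliers], y[inliers]


    X = np.array([[0.0], [0.1], [0.2], [100.0]])
    y = np.array([0, 0, 1, 1])
    X_res, y_res = OutlierRejector().fit_resample(X, y)
    # X_res and y_res may have fewer rows than X and y, which no
    # transformer is currently allowed to do.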

Motivation
----------

Sample reduction and augmentation are common steps in machine-learning
pipelines. The current scikit-learn API does not offer support for such
use cases.

Two possible use cases are currently reported:

* sample rebalancing to correct bias toward classes with large cardinality;
* outlier rejection to fit a clean dataset.

Review comment (Member): Is outlier rejection for data cleaning actually a use case? Can you point to a reference where someone used this? I have never heard of anyone using this as preprocessing, and it strikes me as a bad idea.

Review comment (Member): Another use case that I think is much more common is data augmentation. It's maybe not as common with the kind of data typical for sklearn, but it's still a very common practice in ML.

Review comment (Member): You should mention sample reduction, e.g. representation of a dataset by its k-means centroids. I also note that the former two cases can be handled merely by the ability to modify ``sample_weight`` in a ``Pipeline``.

Review comment (Member): I think mentioning this in the alternative implementations is enough.

Implementation
--------------
Review comment (Member): I think it should be noted that the required changes are mostly (exclusively?) to ``Pipeline``. Is anything else affected? Some other meta-estimators? And of course our API and contract change. But the implementation is all contained within the pipeline and the resamplers?

Reply: Yes, changes are limited to new resamplers and the composition implementation (so either ``Pipeline`` or ``ResampledTrainer``). I've also had to make several changes to existing estimator checks, since many of them assumed that the estimator implements ``fit``.

To handle outlier rejectors in ``Pipeline``, we enforce the following:

Review comment (Member), suggested change: "To handle resampling in
``Pipeline``, we enforce the following:"

* an estimator cannot implement both ``fit_resample(X, y)`` and
``fit_transform(X)`` / ``transform(X)``. If both are implemented,
``Pipeline`` will not be able to know which of the two methods to
call.
* resamplers are only applied during ``fit``. Otherwise, scoring will
  be harder. Specifically, the pipeline will act as follows:

===================== ================================
Method                Resamplers applied
===================== ================================
``fit``               Yes
``fit_transform``     Yes
``fit_resample``      Yes
``transform``         No
``predict``           No
``score``             No
``fit_predict``       not supported
===================== ================================

Review thread on the ``fit_transform`` row:

Review comment (Member): It can't. The sample correspondence will be violated. It must raise an error, as ``fit_predict`` does.

Reply: There was some more discussion on this in #15 (comment). I agree that raising an error seems like the best option.

Review comment (Member): Most consistent is to define ``fit_transform`` as applying ``fit`` and then ``transform``. Maybe it's just not entirely clear what "yes" in this table means. The samples output by ``fit_transform`` must correspond to the input; they cannot be resampled.

Reply: I dislike defining ``fit_transform`` as ``.fit(X, y).transform(X, y)`` because of the hidden overhead of going through the pipeline twice. Imagine you have a very slow pipeline, and now you add a resampler to the beginning. Your pipeline is now potentially much slower, and it's not clear why. If calling ``fit_transform`` throws an error, perhaps the error message could say something like: "Calling fit_transform on a pipeline containing resamplers is not supported. You may want to call fit and transform separately."

Review comment (Member): I think @jnothman's point is that it should be equivalent to ``.fit(X, y).transform(X)`` up to numerical issues, not that it should be implemented like that. By the way, you have a ``y`` in your ``transform`` that shouldn't be there ;)

Review comment (Member): OK, I'm pretty convinced now that this should be "not supported" and we should explain below why.

Review thread on the ``fit_resample`` row:

Review comment (Member): Can someone give me an example where a pipeline is used as a resampler? Can we just not have a pipeline provide that, and get rid of the ``fit_transform``/``fit().transform()`` issue?

Review comment (Member): What's an example where we need a pipeline to provide ``fit_resample``?

Review thread on the ``transform`` row:

Review comment (Member): If a ``Pipeline`` provides ``fit_resample`` by virtue of including a resampler, and if "an estimator cannot implement both fit_resample(X, y) and fit_transform(X) / transform(X)", then a resampling ``Pipeline`` cannot provide ``transform``. To avoid this contradiction, I am okay to allow an estimator to provide ``fit_resample`` and ``transform``, but not ``fit_transform``; a pipeline containing this resampler should call ``fit_resample(X_train)`` and ``transform(X_test)``.

Review comment (Member): Now I remember one of the reasons we cannot support ``transform``: resampling cannot be part of a ``FeatureUnion`` or similar. I am okay with that constraint.

Review comment (Member): Wait, no, as long as ``.fit(X).transform(X)`` is used, it should be fine.

Review comment (Member): Sorry, I lost you lol.

* ``fit_predict(X)`` (i.e., clustering methods) should not be called
  if an outlier rejector is in the pipeline. The output will be of a
  different size than ``X``, breaking metric computation.

Review comment (Member), suggested change: "if a resampler is in the pipeline".

Review comment (Member): This is weird, because we allow ``fit().predict()`` and this seems a bit arbitrary. Can you give an example of this breaking?
* in a supervised scheme, resamplers will need to validate which type
  of target is passed. To our knowledge, supervised resamplers are used
  for binary and multiclass classification.

Review comment (Member): I don't understand this bullet, and the second sentence doesn't parse.
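
To make the behavior table above concrete, here is a rough sketch of how a resampler-aware pipeline could dispatch on ``fit_resample`` during ``fit`` while skipping resamplers at prediction time. This is hypothetical illustration only, not the actual implementation; the helper functions are assumptions:

.. code-block:: python

    def pipeline_fit(steps, X, y):
        # Hypothetical fit logic: every step except the last is either a
        # resampler or a transformer; the last step is the predictor.
        for est in steps[:-1]:
            if hasattr(est, "fit_resample"):
                # Resampling step: X and y are replaced, and the number
                # of samples may change (training data only).
                X, y = est.fit_resample(X, y)
            else:
                # Ordinary transformer: sample correspondence is kept.
                X = est.fit_transform(X, y)
        steps[-1].fit(X, y)
        return steps


    def pipeline_predict(steps, X):
        for est in steps[:-1]:
            if hasattr(est, "fit_resample"):
                continue  # resamplers are skipped outside of fit
            X = est.transform(X)
        return steps[-1].predict(X)

Under this dispatch, ``fit`` trains on resampled data, while ``predict`` and ``score`` leave the incoming samples untouched, matching the Yes/No column above; ``fit_predict`` has no consistent meaning here and is therefore not supported.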

Alternative implementation
..........................

Review comment (Member), suggested heading: "Alternative 3: sample weights".

Alternatively, ``sample_weight`` could be used as a placeholder to
perform resampling. However, the current limitations are:

* ``sample_weight`` is not available for all estimators;
* ``sample_weight`` would implement only sample reduction;
* ``sample_weight`` can be applied at both fit and predict time;

Review comment (Member): not sure about this...

Review comment (Member): Yeah, we don't allow it being passed to ``predict`` right now, right?

I think the main limitation is that estimators can't produce sample weights, so there is no way to actually implement this in a pipeline right now without adding a new method, right?

Also, I think runtime is a major issue for highly imbalanced datasets. If I have a 1000:1 imbalance, there will be a big runtime difference between using balanced random forests and random forests with sample weights.

* ``sample_weight`` needs to be passed and modified within a
  ``Pipeline``.
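
For contrast, a sketch of the closest workaround available today: with the current scikit-learn API, weights must be computed outside the pipeline and routed to a single named step via fit parameters, which illustrates the limitations listed above (the step names are arbitrary):

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils.class_weight import compute_sample_weight

    X = np.random.RandomState(0).randn(100, 3)
    y = np.array([0] * 90 + [1] * 10)  # 9:1 imbalanced target

    # Weights are computed outside the pipeline: no step can produce or
    # modify them, and they must be routed to one step by name.
    weights = compute_sample_weight("balanced", y)
    pipe = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression())])
    pipe.fit(X, y, clf__sample_weight=weights)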

Review comment (Member): Like everything, a meta-estimator could be implemented providing this functionality. However, this is difficult to use and unintuitive, and it would have to implement all methods and attributes appropriate to the contexts where it may be used (regression, decomposition, etc.).

Review comment (Member): It might be illustrative to give a common pipeline with a meta-estimator and a pipeline approach? Are there use cases where the resampler is not the first step in the pipeline?

Current implementation
......................

Review comment (Member): Add that this is an implementation of alternative 1.

* Outlier rejection is implemented in:
  https://github.com/scikit-learn/scikit-learn/pull/13269

Review comment (Member): Shouldn't we link to imbalanced-learn? What are the differences between this proposal and imbalanced-learn, if any?

Backward compatibility
----------------------

There are no backward incompatibilities with the current API.

Discussion
----------

* https://github.com/scikit-learn/scikit-learn/pull/13269

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

Review comment (Member): fix this?

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_