SLEP005: Resampler API #15

Status: Open. Wants to merge 27 commits into base: main.

Commits (27)

0c49bfb  SLEP005: Outlier Rejection API (glemaitre, Mar 1, 2019)
4ecc51b  Update slep005/proposal.rst (orausch, Mar 2, 2019)
c855ffe  Update slep (glemaitre, Mar 2, 2019)
c16ef7b  Update slep005/proposal.rst (adrinjalali, Mar 5, 2019)
e2f6a70  Update proposal based on discussion (orausch, Jun 25, 2019)
8f8ebb6  Reword semisupervised usecase (orausch, Jun 25, 2019)
ae03400  Add description of first few pipeline methods (orausch, Jun 26, 2019)
10c85ff  Add code examples (orausch, Jun 26, 2019)
c39d615  formatting (orausch, Jun 26, 2019)
2de0d48  Formatting and cleanup (orausch, Jun 27, 2019)
387b338  even more formatting (orausch, Jun 27, 2019)
5ecfead  more formatting (orausch, Jun 27, 2019)
e7faa6e  try these headings (orausch, Jun 27, 2019)
a4019ed  last one (orausch, Jun 27, 2019)
87a1d5d  Some changes based on the discussion in the thread (#1) (glemaitre, Jul 3, 2019)
5ddc6f9  minor rephrasing (glemaitre, Jul 3, 2019)
cde164b  address comments (glemaitre, Jul 3, 2019)
e87fd7e  Apply suggestions from code review (glemaitre, Jul 3, 2019)
ad4e94f  Some text about resampling pipelines and their issues (jnothman, Jul 3, 2019)
b989562  Some text about resampling pipelines and their issues (#2) (jnothman, Jul 3, 2019)
ee197cb  minor changes (glemaitre, Jul 3, 2019)
35c140d  iter (glemaitre, Jul 3, 2019)
bc45d6a  Some comments on fit params (jnothman, Aug 26, 2019)
e306795  Merge branch 'slep005' of https://github.com/glemaitre/enhancement_pr… (jnothman, Aug 26, 2019)
79123fb  Slep005: Some comments on fit params (#3) (glemaitre, Sep 9, 2019)
8538e82  Merge branch 'master' into slep005 (glemaitre, Sep 23, 2019)
c64044d  Merge branch 'master' into slep005 (jnothman, Aug 17, 2020)

index.rst (2 additions, 0 deletions)

@@ -9,6 +9,8 @@
   :maxdepth: 1
   :caption: Under review

   slep005/proposal
   SLEP009: Keyword-only arguments (voting until 11 Oct 2019) <slep009/proposal>
   slep007/proposal
   slep012/proposal
   slep013/proposal

slep005/proposal.rst (246 additions, 0 deletions)

@@ -0,0 +1,246 @@

.. _slep_005:

=============
Resampler API
=============

:Author: Oliver Rausch ([email protected]),
   Christos Aridas ([email protected]),
   Guillaume Lemaitre ([email protected])
:Status: Draft
:Created: 2019-03-01
:Resolution: <url>

Abstract
--------

We propose the inclusion of a new type of estimator: resampler. The
resampler will change the samples in ``X`` and ``y`` and return both
``Xt`` and ``yt``. In short:

* a new verb/method that all resamplers must implement is introduced:
  ``fit_resample``.
* resamplers are able to reduce and/or augment the number of samples in
  ``X`` and ``y`` during ``fit``, but will perform no changes during
  ``predict``.
Comment (member): I'm quite confused by what predict for a resampler could
mean. Are you actually referring to the composed estimators like
ResampledTrainer or pipeline?

Reply: Yes, this refers to the composed estimators.

Comment (member): It seems that the term resampler is used throughout the
SLEP to refer both to the meta estimator and to the actual resampler object.
This is quite confusing to me.

* to facilitate this behavior, a new meta-estimator (``ResampledTrainer``)
  that allows for the composition of resamplers and estimators is proposed.
  Alternatively, we propose changes to ``Pipeline`` that also enable similar
  compositions.


Motivation
----------

Sample reduction or augmentation are common parts of machine-learning
pipelines. The current scikit-learn API does not offer support for such
use cases.

Possible Usecases
.................

* Sample rebalancing to correct bias toward classes with large cardinality.
* Outlier rejection to fit on a clean dataset.
* Sample reduction, e.g. representing a dataset by its k-means centroids.
* Currently semi-supervised learning is not supported by scoring-based
  functions like ``cross_val_score``, ``GridSearchCV`` or ``validation_curve``
  since the scorers will regard "unlabeled" as a separate class. A resampler
  could add the unlabeled samples to the dataset during fit time to solve this
  (note that this could also be solved by a new cv splitter).
* ``NaNRejector`` (drop all samples that contain NaN; a sketch follows this
  list).
* Dataset augmentation (as is commonly done in deep learning).
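
To make the proposed verb concrete, here is a minimal sketch of the
``NaNRejector`` use case, assuming the simple two-value ``Xt, yt`` return
form (the class is hypothetical and not part of scikit-learn):

.. code-block:: python

    import numpy as np

    class NaNRejector:
        """Hypothetical resampler: drop every sample containing NaN."""

        def fit_resample(self, X, y):
            X, y = np.asarray(X, dtype=float), np.asarray(y)
            keep = ~np.isnan(X).any(axis=1)  # rows without any NaN
            return X[keep], y[keep]

    Xt, yt = NaNRejector().fit_resample(X, y)  # len(Xt) <= len(X)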

Implementation
--------------
Comment (member): I think it should be noted that the required changes are
mostly (exclusively?) to Pipeline. Is anything else affected? Some other
meta-estimators? And of course our API and contract changes. But the
implementation is all contained within the pipeline and the resamplers?

Reply: Yes, changes are limited to new resamplers and the composition
implementation (so either Pipeline or ResampledTrainer). I've also had to
make several changes to existing estimator checks since many of them assumed
that the estimator implements fit.

API and Constraints
...................

* Resamplers implement a method ``fit_resample(X, y, **kwargs)``, a pure
  function which returns ``Xt, yt, kwargs`` corresponding to the resampled
  dataset, where samples may have been added and/or removed (see the sketch
  after this list).
Comment (member): This is now a bit inconsistent with the idea of having a
props param below, FWIW.

Comment (member): This kwargs cannot be, IMO, the source for deriving kwargs
in the return value. I'm wondering whether we should disregard kwarg input
here, and discuss it separately below.

Comment (member): Indeed, we could decide to not implement kwarg resampling
for now, and just reserve a third output item for when we do (because we know
it should be possible to do, and to allow a resampler to generate sample
weights)...

Reply (author): so fit_resample(X, y, sample_weight=None)?

Comment (member): Well, even there it is ambiguous whether sample_weight
should be resampled and returned, or whether it should be used when fitting,
or both. If we make it clear that sample-aligned kwargs like sample_weight
are not to be resampled and returned by default, I think that's fine.

Comment (member): The SLEP does not explain why a new method name
(fit_resample) is necessary. Why not use fit_transform? Does resample alone
make sense?

Reply: The API of fit_transform guarantees that for every sample you input,
you get a transformed sample out. Since resamplers change the number of
samples in the output, we need a new verb.

Comment (member): OK. This should be added to the SLEP.

* An estimator may implement only one of ``fit_transform`` and
  ``fit_resample`` if support for resamplers in ``Pipeline`` is enabled
  (see the "Limitations" section).
* Resamplers may not change the order, meaning, dtype or format of features
  (this is left to transformers).
* Resamplers should also handle (e.g. resample, generate anew, etc.) any
  sample-aligned kwargs, such as ``sample_weight``.
Comment (member): what is generate anew? What is the link with kwargs?

Reply: Imagine you pass sample weights into a composed estimator containing a
resampler and a predictor. If the resampler adds or removes samples, the
resampler must now also add or remove rows in the sample weights so that the
shape matches.

Comment (amueller, Sep 24, 2019): how about adding "such as sample weights"?
Maybe "fit_params" or "sample aligned kwargs" or something like that would be
clearer as well?
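
To make these constraints concrete, here is a minimal sketch of a
class-balancing resampler, assuming the three-item ``Xt, yt, kwargs`` return
form discussed above (the logic is illustrative only and is not
imbalanced-learn's ``RandomUnderSampler``):

.. code-block:: python

    import numpy as np

    class RandomUnderSampler:
        """Hypothetical resampler: drop majority samples to balance classes."""

        def __init__(self, random_state=None):
            self.random_state = random_state

        def fit_resample(self, X, y, **kwargs):
            rng = np.random.default_rng(self.random_state)
            X, y = np.asarray(X), np.asarray(y)
            classes, counts = np.unique(y, return_counts=True)
            n_min = counts.min()
            # Keep n_min randomly chosen samples per class; the features
            # themselves are left untouched, as required.
            keep = np.concatenate([
                rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
                for c in classes
            ])
            # Sample-aligned kwargs (e.g. sample_weight) are resampled too.
            kwargs = {k: np.asarray(v)[keep] for k, v in kwargs.items()}
            return X[keep], y[keep], kwargs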

Composition
-----------

A key part of the proposal is the introduction of a way of composing resamplers
with predictors. We present two options: ``ResampledTrainer`` and modifications
to ``Pipeline``.

Alternative 1: ResampledTrainer
...............................
Comment (member): If the Resampler is only called during fit, why is it an
estimator and not a function? The resampler is stateless, right?

Reply: Yes, it is stateless. I guess this is mainly because we originally
wanted these resamplers to work with pipelines, and the most natural
realisation of this is that the resampler is an estimator.

Comment (member): Yes, but for Alternative 1 there's really no reason any
more and it just adds complexity, right?

Reply (orausch, Sep 24, 2019): Yes, with this proposal there is no reason.
Perhaps it's good to note that there may actually be useful stateful
resamplers (e.g. consider fitting SMOTE on your whole dataset, and then
oversampling batches of that dataset on the fly to save memory) but we
exclude these in this proposal due to the added complexity this would
introduce.


This meta-estimator composes a resampler and a predictor. It
behaves as follows:

* ``fit(X, y)``: resample ``X, y`` with the resampler, then fit the predictor
  on the resampled dataset.
* ``predict(X)``: simply predict on ``X`` with the predictor.
* ``score(X, y)``: simply score on ``X`` with the predictor.

See PR #13269 for an implementation.
Comment (member): Suggested change: "See PR
[#13269](https://github.com/scikit-learn/scikit-learn/pull/13269) for an
implementation."
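
A minimal sketch of what such a meta-estimator could look like, assuming the
two-value ``fit_resample`` return for brevity (attribute names and clone
semantics are assumptions, not necessarily what the PR implements):

.. code-block:: python

    from sklearn.base import BaseEstimator, clone

    class ResampledTrainer(BaseEstimator):
        """Hypothetical composition of a resampler and a predictor."""

        def __init__(self, resampler, predictor):
            self.resampler = resampler
            self.predictor = predictor

        def fit(self, X, y=None, **fit_params):
            # Resampling happens only at fit time.
            Xt, yt = clone(self.resampler).fit_resample(X, y)
            self.predictor_ = clone(self.predictor).fit(Xt, yt, **fit_params)
            return self

        def predict(self, X):
            # Prediction sees the input unchanged.
            return self.predictor_.predict(X)

        def score(self, X, y):
            return self.predictor_.score(X, y)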


One benefit of the ``ResampledTrainer`` is that it does not stop the resampler
having other methods, such as ``transform``, as it is clear that the
``ResampledTrainer`` will only call ``fit_resample``.
Comment (member): What is an example of a resampler that would need a
transform method? Why is it useful to have a transform method if only
fit_resample is used?

Comment (member): What does transform mean for a resampler?

Reply: In terms of API, resamplers do not have transform. What this means is
that you can have transformers that are also resamplers.

Comment (member): Can you give an example of an object where having both
transform and resample might be useful?

Reply: Well, depending on implementation, Pipeline might be an example: if we
have a pipeline containing both a resampler and a transformer, we would
probably like to have methods to do both of those things.

Comment (member): fair, can you add that to the document?

Reply (orausch, Sep 24, 2019):

    That's confusing

Agreed, I will have another look. I would implement fit_transform(X, y) as:

    onehotencoder.fit(myresampler.fit_resample(X, y)).transform(X, y)

EDIT: oops, that was fit_transform. I would implement fit_resample(X, y) as:

    Xt, yt = myresampler.fit_resample(X, y)
    return onehotencoder.fit_transform(Xt, yt), yt

Comment (NicolasHug, Sep 24, 2019): I thought fit_resample was only available
for the resampler objects? The meta-estimators (ResampledTrainer or Pipeline)
aren't resamplers, so why should they support fit_resample?

Reply (orausch, Sep 24, 2019):

    I thought fit_resample was only available for the resampler objects?

This is correct. The question is whether it is possible/useful/a good idea to
make the meta-estimators resamplers.

Comment (member): OK. I think the problem should be framed in those exact
terms in the SLEP then; that sounds clearer and more explicit to me.

Comment (member): I agree that should be clearer, and I don't see a reason to
do that.


There are complications around supporting ``fit_transform``, ``fit_predict``
and ``fit_resample`` methods in ``ResampledTrainer``. ``fit_transform``
support is only possible by implementing ``fit_transform(X, y)`` as ``fit(X,
y).transform(X)``, rather than calling ``fit_transform`` of the predictor:
``fit_transform`` must be equivalent to ``fit(...).transform(...)``, and
calling the predictor's ``fit_transform`` on the resampled data could change
the number of samples in the output, which transformers are not allowed to
do. ``fit_predict`` would have to behave similarly. Thus ``ResampledTrainer``
would not work with non-inductive estimators (TSNE, AgglomerativeClustering,
etc.) as their final step. If the predictor of a ``ResampledTrainer`` is
itself a resampler, it's unclear how ``ResampledTrainer.fit_resample`` should
behave. These caveats also apply to the Pipeline modification below.
Comment (member): Why is fit_resample an issue? It's precisely the proposal?

Comment (member): Why?
Comment (member):

    is only possible by implementing fit_transform(X, y) as
    fit(X, y).transform(X), rather than calling fit_transform of the
    predictor

Why? Can't we just define fit_transform() of the ResampledTrainer as

    def fit_transform(self, X, y):
        Xt, yt = self.resampler_.fit_resample(X, y)
        return self.estimator_.fit_transform(Xt, yt)

?

Reply (orausch, Sep 24, 2019): fit_transform should be equivalent to
fit(...).transform(...). However, calling this implementation of
fit_transform may change the number of samples in the output, which
transformers are not allowed to do.

Comment (member): Can you include this explanation in the slep?

Comment (member):

    may change the number of samples in the output, which transformers are
    not allowed to do.

I don't understand; the ResampledTrainer is not a transformer, it's a 2-step
pipeline-like object where the first step is a resampler and the last step is
a transformer. I don't see why it should have the same constraints as the
transformers?

Reply: I would argue that as soon as you implement transform, you should be
bound by the API constraints of a transformer. Anything else would:
a) be rather unintuitive to users;
b) make your object incompatible with the rest of sklearn.

Comment (member): Yes, I agree with @orausch. If it's not a transformer, what
else would it be if it implements transform?

Comment (member): Why can't we resample X, y with the resampler, then call
fit_transform in the predictor on the resampled dataset?
Comment (member): With this proposal, what's the recommended way to chain two
resamplers? Nest the ResampledTrainers? Then we run into defining
ResampledTrainer.fit_resample, right? Are there other ways to do it?

Comment (member): Ah, just saw below how it's done. Then there's no need to
define ResampledTrainer.fit_resample, is there?

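
To spell out the point in code: a hypothetical ``fit_transform`` for the
``ResampledTrainer`` sketched earlier would have to fit on the resampled data
but transform the original ``X``, so that the output keeps one row per input
sample:

.. code-block:: python

    def fit_transform(self, X, y=None, **fit_params):
        # Equivalent to fit(X, y).transform(X), as the transformer
        # contract requires.
        self.fit(X, y, **fit_params)
        return self.predictor_.transform(X)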

Example Usage:
~~~~~~~~~~~~~~

.. code-block:: python

    est = ResampledTrainer(RandomUnderSampler(), SVC())
    est = make_pipeline(
        StandardScaler(),
        ResampledTrainer(Birch(), make_pipeline(SelectKBest(), SVC()))
    )
    est = ResampledTrainer(
        RandomUnderSampler(),
        make_pipeline(StandardScaler(), SelectKBest(), SVC()),
    )
    clf = ResampledTrainer(
        NaNRejector(),  # removes samples containing NaN
        ResampledTrainer(RandomUnderSampler(),
                         make_pipeline(StandardScaler(), SGDClassifier()))
    )
Comment (member): How can Birch be the first step of a ResampledTrainer? What
does the target become?

Comment (member): I'm not sure how this is gonna work, as in, what does the
transform of this method actually do?


Alternative 2: Prediction Pipeline
..................................

As an alternative to ``ResampledTrainer``, ``Pipeline`` can be modified to
accommodate resamplers. The essence of the operation is this: one or more steps
of the pipeline may be a resampler. When fitting the Pipeline, ``fit_resample``
will be called on each resampler instead of ``fit_transform``, and the output
of ``fit_resample`` will be used in place of the original ``X``, ``y``, etc.,
to fit the subsequent step (and so on). When predicting in the Pipeline,
the resampler will act as a passthrough step.
Comment (member): To emphasize the difference with alternative 1, you might
explain that the pipeline solution would correspond to having implicitly both
a ResampledTrainer and a ResampledTransformer, and that the latter is
problematic.
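
A rough sketch of the dispatch this alternative implies inside the
``Pipeline`` fitting code (hypothetical; the real change would live inside
``Pipeline``'s fitting logic, and the three-item return form is assumed):

.. code-block:: python

    # Hypothetical resampler-aware fitting loop.
    Xt, yt = X, y
    for name, step in steps[:-1]:
        if hasattr(step, "fit_resample"):
            # Used during fit only; at predict time this step is passthrough.
            Xt, yt, _ = step.fit_resample(Xt, yt)
        else:
            Xt = step.fit_transform(Xt, yt)
    final_estimator.fit(Xt, yt)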


Limitations
~~~~~~~~~~~

.. rubric:: Prohibiting ``transform`` on resamplers

It may be problematic for a resampler to provide ``transform`` if ``Pipeline``s
support resampling:

1. It is unclear what to do at test time if a resampler has a transform
   method.
Comment (member): Maybe add "(as discussed in detail below)" (or reference
the next section in some other way), because while reading this sentence it's
somewhat unclear why this is the case.

Comment (member): I think just calling transform would be fine.

2. Adding ``fit_resample`` to the API of an existing transformer may
   drastically change its behaviour in a ``Pipeline``.
Comment (member): What is an example of a resampler that would need a
transform method? What is an example of a transformer that would need a
resample method? Can we just make them exclusive?

For this reason, it may be best to reject resamplers supporting ``transform``
from being used in a Pipeline.
Comment (member): also to reject it in common estimator checks?

Comment (member): If fit_transform becomes part of the official API, IMO it
should be enforced by estimator checks.

Comment (member): Honestly, I don't follow this any more after thinking about
it for some time. This is supported in several other frameworks without any
issue.


.. rubric:: Prohibiting ``transform`` on resampling Pipelines

Providing a ``transform`` method on a Pipeline that contains a resampler,
i.e. a resampling Pipeline whose last step is a transformer, presents several
problems:
Comment (member): You might highlight that this is a different issue than
"Prohibiting transform on resamplers".

Comment (member): what does it mean to provide a transform method on a
pipeline? Is it a pipeline where the last step is a transformer?

Reply (member): Yes.

Comment (member): So let's use the latter formulation please :)

Comment (member): Well, I guess the sentence should be "Providing a transform
method on a Pipeline that contains a resampler when the last step is a
transformer presents several problems".

1. A resampling ``Pipeline`` needs to use a special code path for
   ``fit_transform`` that would call ``fit(X, y, **kw).transform(X)`` on the
   ``Pipeline``. Ordinarily a ``Pipeline`` would pass the transformed data to
   ``fit_transform`` of the last step. If the ``Pipeline`` contains a
   resampler, it rather needs to fit the ``Pipeline`` excluding the last
   step, then transform the original training data up to the last step, then
   ``fit_transform`` the last step (as above, ``fit_transform`` must not
   change the number of samples, so the ``Pipeline`` cannot simply resample
   and call the last step's ``fit_transform``). This means special code paths
   for pipelines containing resamplers; the effect of the resampler is not
   localised in terms of code maintenance.
Comment (member): Sorry, I really don't understand this whole part. Where is
the resampler in the pipeline? Why should the last step be excluded?

Comment (member): remaining steps?

Comment (member): Suggested change: "``fit_transform`` of the left step" to
"``fit_transform`` of the last step". Is that what you mean?

Comment (member): I don't get why this is the case. Can you give an example?

Comment (member): Same explanation as above: calling Pipeline.fit_transform
is not allowed to change the number of samples, so we can't just resample and
call fit_transform. But that should be explained here.

2. As a result of issue 1, appending a step to the transformation
   ``Pipeline`` means that the transformer which was previously last, and
   previously trained on the full dataset, will now be trained on the
   resampled dataset.
3. As a result of issue 1, the last step cannot be ``'passthrough'`` as in
   other transformer pipelines.

For this reason, it may be best to disable ``fit_transform`` and ``transform``
on the Pipeline. A resampling ``Pipeline`` would therefore not be usable as a
transformation within a ``FeatureUnion`` or ``ColumnTransformer``. Thus the
``ResampledTrainer`` would be strictly more expressive than a resampling
``Pipeline``.
Comment (member): Why would you disable transform?

Comment (member): How would you use ResampledTrainer in this case?

Comment (member): ResampledTrainer still doesn't allow us to resample and do
TSNE the way it's described above, right?


.. rubric:: Handling ``fit`` parameters

Sample props or weights cannot be routed to steps downstream of a resampler in
a Pipeline, unless they too are resampled. To support this, a resampler
would need to be passed all props that are required downstream, and
``fit_resample`` should return resampled versions of them. Note that these
must be distinct from parameters that affect the resampler's fitting.
That is, consider the signature ``fit_resample(X, y=None, props=None, sample_weight=None)``.
Comment (author): So if I understand correctly, one could pass a set of
sample weights to drive the resampling and pass another set of weights which
will be resampled to later be used by the downstream estimators? Did you
encounter any use case for sample_weight in practice? I mean that introducing
this parameter might make the sampling API more difficult to grasp, and if
there is no use case maybe we are better off not allowing it. Note: I am not
against including this parameter, this is more by a lack of knowledge
regarding the use-case.

Comment (member): We may want to do the resampling w/o the sample_weight, and
yet have sample_weight available for the downstream analysis, which is kind
of a special case of having different sample weights for different steps.

Comment (member): I think we need to separate the two meanings as long as we
support doing learning with weights or similar with a resampler upstream.

Comment (member): This is exactly the same thing that cross-validation
already does with fit_params, right? So it's not really that new...

Comment (amueller, Feb 17, 2020): Btw, my argument is that sample_weights
should be resampled and returned. But that means the signature of
fit_resample is unclear. Also, adding props would be a backward-incompatible
change. We could return a named tuple or dict, though that's not very
idiomatic sklearn.

The ``sample_weight`` passed in should affect the resampling, but does not
itself need to be resampled. A Pipeline would pass ``props`` including the fit
parameters required downstream, which would be resampled and returned by
``fit_resample``.
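
Under that contract, a call could look like the following sketch (the prop
key and weight variables are hypothetical):

.. code-block:: python

    # sample_weight drives the resampling itself and is not returned;
    # props carries downstream fit params, resampled row-for-row with X, y.
    Xt, yt, props_t = resampler.fit_resample(
        X, y,
        props={"sgdclassifier__sample_weight": downstream_weight},
        sample_weight=resampling_weight,
    )
    # props_t["sgdclassifier__sample_weight"] has one entry per row of Xt.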

Example Usage:
~~~~~~~~~~~~~~

.. code-block:: python

    est = make_pipeline(RandomUnderSampler(), SVC())
    est = make_pipeline(StandardScaler(), Birch(), SelectKBest(), SVC())
    est = make_pipeline(
        RandomUnderSampler(), StandardScaler(), SelectKBest(), SVC()
    )
    est = make_pipeline(
        NaNRejector(), RandomUnderSampler(), StandardScaler(), SGDClassifier()
    )
    est.fit(X, y, sgdclassifier__sample_weight=my_weight)


Alternative implementation
..........................
Comment (member): Alternative 3: sample weights


Alternatively, ``sample_weight`` could be used as a placeholder to
perform resampling (a sketch of this approach follows the list below).
However, the current limitations are:

* ``sample_weight`` is not available for all estimators;
* ``sample_weight`` will implement only simple resampling (only when
  resampling uses original samples);
* ``sample_weight`` needs to be passed and modified within a
  ``Pipeline``, which isn't possible without something like resamplers.
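
As a sketch of what resampling via ``sample_weight`` would mean (illustrative
only): setting a sample's weight to zero emulates dropping it, which is why
only simple resampling of original samples is expressible this way:

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Hypothetical: emulate random under-sampling by zeroing weights.
    rng = np.random.default_rng(0)
    keep = rng.random(len(X)) < 0.5  # boolean mask over samples
    SGDClassifier().fit(X, y, sample_weight=keep.astype(float))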

Current implementation
......................

Comment (member): Add that this is an implementation of alternative 1.
https://github.com/scikit-learn/scikit-learn/pull/13269

Backward compatibility
----------------------

There are no backward incompatibilities with the current API.

Discussion
----------

* https://github.com/scikit-learn/scikit-learn/pull/13269

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.
Comment (member): fix this?

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_