
[BUG] Inconsistent output of imblearn's pipeline #904

Closed
tvdboom opened this issue May 30, 2022 · 3 comments · Fixed by #954

tvdboom commented May 30, 2022

Describe the bug

The output of imblearn's Pipeline is inconsistent between fit_transform and fit().transform() (see the example below). This happens because the transform method does not apply SMOTE (as expected), while fit_transform applies SMOTE during fitting and returns that resampled data.

Is this intended, and if so, why? It seems quite confusing for the user. If it is indeed a bug, I think the fix is quite straightforward, although it will make fit_transform slower, since you first have to fit the pipeline (which includes all transformations) and then transform everything again, excluding the samplers.

Steps/Code to Reproduce

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
print(X.shape)

s = Pipeline([("smote", SMOTE()), ("scaler", StandardScaler())])
print(s.fit_transform(X, y).shape)
print(s.fit(X, y).transform(X).shape)

Output:
(569, 30)
(714, 30)
(569, 30)

Expected Results

I expected the fit_transform method to return the data without balancing (the same as the transform method does).

Actual Results

Versions

System:
python: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
executable: C:\Users\Mavs\Documents\Python\pycaret\venv\Scripts\python.exe
machine: Windows-10-10.0.19044-SP0
Python dependencies:
sklearn: 1.1.1
pip: 22.0.4
setuptools: 57.0.0
numpy: 1.21.5
scipy: 1.7.3
Cython: 0.29.28
pandas: 1.4.1
matplotlib: 3.5.2
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\numpy.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
version: 0.3.17
threading_layer: pthreads
architecture: Zen
num_threads: 16
user_api: openmp
internal_api: openmp
prefix: vcomp
filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\sklearn.libs\vcomp140.dll
version: None
num_threads: 16
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\scipy.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
version: 0.3.17
threading_layer: pthreads
architecture: Zen
num_threads: 16
Windows-10-10.0.19044-SP0
Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
NumPy 1.21.5
SciPy 1.7.3
Scikit-Learn 1.1.1
Imbalanced-Learn 0.9.1

@haochunchang

Hi, after some investigation, I found that the behavior comes from here:

Xt, yt = self._fit(X, y, **fit_params_steps)
last_step = self._final_estimator
with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    if last_step == "passthrough":
        return Xt
    fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    if hasattr(last_step, "fit_transform"):
        return last_step.fit_transform(Xt, yt, **fit_params_last_step)

In pipeline.fit_transform, since StandardScaler has a fit_transform attribute, it calls fit_transform on the data already resampled by SMOTE (Xt).
In pipeline.fit().transform(), on the other hand, the transform method iterates through the steps of the pipeline, filtering out steps that have a fit_resample attribute (which SMOTE has):

if filter_resample:
    return filter(lambda x: not hasattr(x[-1], "fit_resample"), it)

I think if we want to skip samplers during fit_transform, we can let transform handle the skipping after fit (see the sketch below). Then fit_transform might not be slower than fit().transform(), because under the hood they would be doing the same thing.
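
For illustration, here is a minimal sketch of that idea (ConsistentPipeline is a hypothetical wrapper written just for this example, not necessarily how an actual fix would be implemented):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

class ConsistentPipeline(Pipeline):
    # Delegate fit_transform to fit().transform() so samplers are
    # skipped in both code paths.
    def fit_transform(self, X, y=None, **fit_params):
        return self.fit(X, y, **fit_params).transform(X)

X, y = load_breast_cancer(return_X_y=True)
p = ConsistentPipeline([("smote", SMOTE()), ("scaler", StandardScaler())])
print(p.fit_transform(X, y).shape)     # (569, 30)
print(p.fit(X, y).transform(X).shape)  # (569, 30)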

I can open a PR to address this if this is indeed a bug.


tvdboom commented Jun 27, 2022

@haochunchang The PR indeed solves the problem. Let's hope it gets merged soon.

@glemaitre (Member)

I think this is a case where having resampling is ambiguous compared to the usual behaviour.

The semantics of fit_resample is to apply resampling only during the fit stage, not during the transform or predict phase. Therefore, applying fit_resample when calling fit_transform on the pipeline makes sense.

When requesting transform, we expect not to call fit_resample since we are in the inference/decision phase. When calling fit().transform(), fit_resample is therefore only called during the fit() stage.
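
As a small illustration of that contract (using the same dataset as the reproduction above): a standalone sampler such as SMOTE only exposes fit_resample, which changes the number of samples at fit time, and has no transform method, which is why the pipeline skips it outside of fit.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
# fit_resample is a fit-time operation: it returns a resampled dataset.
X_res, y_res = SMOTE().fit_resample(X, y)
print(X.shape, X_res.shape)           # (569, 30) (714, 30)
# Samplers define no transform, so the pipeline skips them at inference time.
print(hasattr(SMOTE(), "transform"))  # False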

This surprising API is one of the reasons why we never adopted samplers in scikit-learn: it breaks the contract fit_transform == fit().transform(). Some of that discussion happened in scikit-learn/enhancement_proposals#12.

In the end, I would not consider it a bug, but we could improve the documentation to make this behaviour obvious.
