
[BUG] Inconsistent output of imblearn's pipeline #904

Closed
tvdboom opened this issue May 30, 2022 · 3 comments · Fixed by #954

tvdboom commented May 30, 2022

Describe the bug

The output of imblearn's Pipeline is inconsistent between fit_transform and fit().transform() (see the example below). This happens because the transform method does not apply SMOTE (as expected), while fit_transform applies SMOTE during fitting and returns that resampled data.

Is this intended, and if so, why? It seems quite confusing for the user. If it is indeed a bug, I think the fix is quite straightforward, although it will make fit_transform slower, since you first have to fit the pipeline (which includes all transformations) and then transform everything again, excluding the samplers.

Steps/Code to Reproduce

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
print(X.shape)

s = Pipeline([("smote", SMOTE()), ("scaler", StandardScaler())])
print(s.fit_transform(X, y).shape)
print(s.fit(X, y).transform(X).shape)

Output:
(569, 30)
(714, 30)
(569, 30)

Expected Results

I expected the fit_transform method to return the data without balancing (the same as the transform method does).

Actual Results

Versions

System:
python: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
executable: C:\Users\Mavs\Documents\Python\pycaret\venv\Scripts\python.exe
machine: Windows-10-10.0.19044-SP0
Python dependencies:
sklearn: 1.1.1
pip: 22.0.4
setuptools: 57.0.0
numpy: 1.21.5
scipy: 1.7.3
Cython: 0.29.28
pandas: 1.4.1
matplotlib: 3.5.2
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\numpy.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
version: 0.3.17
threading_layer: pthreads
architecture: Zen
num_threads: 16
user_api: openmp
internal_api: openmp
prefix: vcomp
filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\sklearn.libs\vcomp140.dll
version: None
num_threads: 16
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\scipy.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
version: 0.3.17
threading_layer: pthreads
architecture: Zen
num_threads: 16
Windows-10-10.0.19044-SP0
Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
NumPy 1.21.5
SciPy 1.7.3
Scikit-Learn 1.1.1
Imbalanced-Learn 0.9.1

@haochunchang

Hi, after some investigation, I found that the behavior comes from here:

Xt, yt = self._fit(X, y, **fit_params_steps)
last_step = self._final_estimator
with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    if last_step == "passthrough":
        return Xt
    fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    if hasattr(last_step, "fit_transform"):
        return last_step.fit_transform(Xt, yt, **fit_params_last_step)

In pipeline.fit_transform, since StandardScaler has a fit_transform attribute, it calls fit_transform on the data already resampled by SMOTE (Xt).
In pipeline.fit().transform(), on the other hand, the transform method iterates through the steps of the pipeline, filtering out steps that have a fit_resample attribute (which SMOTE has):

if filter_resample:
    return filter(lambda x: not hasattr(x[-1], "fit_resample"), it)

I think if we want to skip samplers during fit_transform, we can let transform handle the skipping after fit (see the sketch below). Then fit_transform might not be slower than fit().transform(), because under the hood they would be doing the same thing.
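
For illustration, here is a minimal sketch of that idea (ConsistentPipeline is a hypothetical wrapper written just for this example, not necessarily how an actual fix would be implemented):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

class ConsistentPipeline(Pipeline):
    # Delegate fit_transform to fit().transform() so samplers are
    # skipped in both code paths.
    def fit_transform(self, X, y=None, **fit_params):
        return self.fit(X, y, **fit_params).transform(X)

X, y = load_breast_cancer(return_X_y=True)
p = ConsistentPipeline([("smote", SMOTE()), ("scaler", StandardScaler())])
print(p.fit_transform(X, y).shape)     # (569, 30)
print(p.fit(X, y).transform(X).shape)  # (569, 30)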

I can open a PR to address this if this is indeed a bug.


tvdboom commented Jun 27, 2022

@haochunchang The PR indeed solves the problem. Let's hope it gets merged soon.

@glemaitre (Member)

I think this is a case where having resampling is ambiguous compared to the usual behaviour.

The semantics of fit_resample is to apply resampling only during the fit stage, not during the transform or predict phase. Therefore, applying fit_resample when calling fit_transform on the pipeline makes sense.

When requesting transform, we expect not to call fit_resample since we are in the inference/decision phase. When calling fit().transform(), fit_resample is therefore only called during the fit() stage.
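
As a small illustration of that contract (using the same dataset as the reproduction above): a standalone sampler such as SMOTE only exposes fit_resample, which changes the number of samples at fit time, and has no transform method, which is why the pipeline skips it outside of fit.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
# fit_resample is a fit-time operation: it returns a resampled dataset.
X_res, y_res = SMOTE().fit_resample(X, y)
print(X.shape, X_res.shape)           # (569, 30) (714, 30)
# Samplers define no transform, so the pipeline skips them at inference time.
print(hasattr(SMOTE(), "transform"))  # False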

This surprising API is one of the reasons why we never adopted samplers in scikit-learn: it breaks the contract fit_transform == fit().transform(). Some of that discussion happened in scikit-learn/enhancement_proposals#12.

In the end, I would not consider it a bug, but we could improve the documentation to make this behaviour obvious.
