Saturday, January 20, 2024

[FIXED] Custom Transformers in Sklearn Pipeline do not work as expected

Labels: pipeline, scikit-learn

Issue

I am working on an ML project using sklearn. I have written a few custom transformers:

DateTimeTransformer - To extract day, month, year, hour, minute, second (thereby getting 6 new columns), applied on Arrival Time

KBinTransformer - To turn a continuous variable into a category [n_bins=3, encode='ordinal', strategy='uniform'] (thereby getting 1 new column), applied on Age

I have a DataFrame like below:

Name (object), class (category), Age (int), Arrival Time (datetime)
-------------------------------------------------------------------
foo          | A               |  44       | 20/7/2023 4:15:2 
bar          | B               |  34       | 10/7/2023 2:10:5

df = pd.DataFrame() #  Contains above data in df

I have created a pipeline as below:

# Pipeline expects a list of (name, transformer) tuples, not a dict
steps = [
    ("date_time", DateTimeTransformer()),
    ("k_bin", KBinTransformer()),
]

pipe = Pipeline(steps=steps)

pipe.fit(X=df)
pipe.transform(X=df)

The issue arises when I put both steps (date_time and k_bin) in the pipeline and run it. DateTimeTransformer then produces 12 new columns (day, month, year, hour, minute, second, each twice) instead of the expected 6, while KBinTransformer produces 1 new column as expected.

I tried reversing the steps:

# Pipeline expects a list of (name, transformer) tuples, not a dict
steps = [
    ("k_bin", KBinTransformer()),
    ("date_time", DateTimeTransformer()),
]

Now KBinTransformer produces 2 new columns (age, age) instead of the expected 1, while DateTimeTransformer produces 6 new columns as expected.

What seems to happen is that, during fit(), the input to the next transformer is the output of the previous one (the newly created columns plus the old unused columns), and the subsequent call to transform() creates those columns again, so the final output contains duplicates.

But if I keep only one transformer in the pipeline and run it, the output is correct: DateTimeTransformer alone gives 6 new columns, and KBinTransformer alone gives 1 new column.

What am I missing in using the pipeline?

Solution

If you use your transformers as steps in a pipeline, they are applied one after the other, and each step receives the full output of the previous step rather than only the column it is meant for.
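
To make the chaining visible, here is a minimal sketch. AddColumn is a hypothetical toy transformer, not one of the transformers from the question: it appends a constant column and records which columns it was fitted on.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class AddColumn(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: appends one constant column."""

    def __init__(self, name="new"):
        self.name = name

    def fit(self, X, y=None):
        # Record the columns this step actually receives during fitting.
        self.seen_columns_ = list(X.columns)
        return self

    def transform(self, X):
        X = X.copy()  # work on a copy so the caller's DataFrame is untouched
        X[self.name] = 0
        return X


df = pd.DataFrame({"Age": [44, 34]})
pipe = Pipeline(steps=[("first", AddColumn("a")), ("second", AddColumn("b"))])
out = pipe.fit_transform(df)

first_seen = pipe.named_steps["first"].seen_columns_
second_seen = pipe.named_steps["second"].seen_columns_
out_columns = list(out.columns)
print(first_seen, second_seen, out_columns)
```

The second step is fitted on the output of the first, so it sees the extra column "a" in addition to "Age". A transformer that processes every column it receives will therefore also process columns created by earlier steps.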

You probably do not want your transformers as sequential steps, but inside a ColumnTransformer, so that each one transforms only the columns of a given dtype. You can use make_column_selector to select the columns you want:

from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np

ct = ColumnTransformer(
    transformers=[
        ('datetime', DateTimeTransformer(), make_column_selector(dtype_include=np.datetime64)), 
        ('kbin', KBinTransformer(), make_column_selector(dtype_include=np.number))
    ],
    remainder='passthrough')

df_transformed = ct.fit_transform(df)
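
Since the question does not show the custom transformers, the following is only a runnable sketch under assumptions: DateTimeTransformer below is a stand-in implementation written for this example, and sklearn's KBinsDiscretizer is used in place of KBinTransformer with the parameters quoted in the question. The two rows are the sample data from the question.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import KBinsDiscretizer

DATE_PARTS = ["day", "month", "year", "hour", "minute", "second"]


class DateTimeTransformer(BaseEstimator, TransformerMixin):
    """Stand-in (assumed implementation): six parts per datetime column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        for col in X.columns:
            for part in DATE_PARTS:
                out[f"{col}_{part}"] = getattr(X[col].dt, part)
        return out


df = pd.DataFrame({
    "Name": ["foo", "bar"],
    "class": pd.Categorical(["A", "B"]),
    "Age": [44, 34],
    "Arrival Time": pd.to_datetime(["2023-07-20 04:15:02", "2023-07-10 02:10:05"]),
})

ct = ColumnTransformer(
    transformers=[
        # Each transformer only receives the columns its selector matches.
        ("datetime", DateTimeTransformer(),
         make_column_selector(dtype_include=np.datetime64)),
        # KBinsDiscretizer stands in for the question's KBinTransformer.
        ("kbin", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform"),
         make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough",
)

df_transformed = ct.fit_transform(df)
print(df_transformed.shape)
```

With this setup the result has 6 datetime columns, 1 binned Age column, and the 2 passed-through columns (Name and class), with no duplicates. Note that the selected input columns are replaced by their transformed versions rather than kept alongside them, and the result is a NumPy array by default rather than a DataFrame.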

Answered By - DataJanitor

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
