Issue
I am working on an ML project using sklearn. I have written a few custom transformers, as below:
DateTimeTransformer - extracts day, month, year, hour, minute and second (giving 6 new columns); applied to Arrival Time
KBinTransformer - turns a continuous column into a categorical one [n_bins=3, encode='ordinal', strategy='uniform'] (giving 1 new column); applied to Age
I have a DataFrame like below:
Name (object), class (category), Age (int), Arrival Time (datetime)
-------------------------------------------------------------------
foo | A | 44 | 20/7/2023 4:15:2
bar | B | 34 | 10/7/2023 2:10:5
df = pd.DataFrame() # Contains above data in df
I have created a pipeline as below:
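from sklearn.pipeline import Pipeline

# Pipeline expects a list of (name, transformer) tuples, not a dict
steps = [
    ("date_time", DateTimeTransformer()),
    ("k_bin", KBinTransformer()),
]
pipe = Pipeline(steps=steps)
pipe.fit(X=df)
pipe.transform(X=df)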
The issue is that when I put both steps (date_time and k_bin) in the pipeline and run it, DateTimeTransformer gives 12 new columns (day, month, year, hour, minute, second, each appearing twice), which is wrong since I expected 6 new columns, and KBinTransformer gives 1 new column.
I tried reversing the steps:
steps = [
    ("k_bin", KBinTransformer()),
    ("date_time", DateTimeTransformer()),
]
Now KBinTransformer gives 2 new columns (age, age), which is wrong since I expected 1 new column, and DateTimeTransformer gives 6 new columns.
What seems to be happening is that, during fit(), the input to each transformer is the output of the previous one (the newly created columns plus the old unused columns), and the actual transform() call then creates those columns again, so the final output contains duplicates.
But if I keep only one transformer in the pipeline and run it, it gives the correct output.
With only DateTimeTransformer, I get 6 new columns.
With only KBinTransformer, I get 1 new column.
What am I missing in using a pipeline?
Solution
If you use your transformers as steps in a Pipeline, they are applied one after the other, and each step receives the full output of the previous step, not just the columns it is meant for.
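To illustrate how steps are chained, here is a minimal sketch. The `AddColumn` transformer is hypothetical (it is not one of the transformers from the question); it only appends a constant column and records which columns it was fitted on, so you can see what the second step actually receives.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class AddColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: appends one constant column."""

    def __init__(self, name):
        self.name = name

    def fit(self, X, y=None):
        # Remember the columns this step was fitted on.
        self.seen_columns_ = list(X.columns)
        return self

    def transform(self, X):
        # Return a copy with the extra column; the input is not mutated.
        return X.assign(**{self.name: 0})


df = pd.DataFrame({"Age": [44, 34]})
pipe = Pipeline(steps=[("first", AddColumn("a")), ("second", AddColumn("b"))])
out = pipe.fit_transform(df)

# The second step was fitted on the first step's output, new column included.
seen_by_second = pipe.named_steps["second"].seen_columns_
print(seen_by_second)
print(list(out.columns))
```

The second step sees the column created by the first one, which is why a transformer that should only touch one column needs its input restricted in some other way.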
I guess you do not want your transformers as pipeline steps, but inside a ColumnTransformer, so that each one transforms only the columns selected by dtype. You can use make_column_selector to select the columns you want:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np

ct = ColumnTransformer(
    transformers=[
        ('datetime', DateTimeTransformer(), make_column_selector(dtype_include=np.datetime64)),
        ('kbin', KBinTransformer(), make_column_selector(dtype_include=np.number)),
    ],
    remainder='passthrough')
df_transformed = ct.fit_transform(df)
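Since the question does not show the custom transformers, here is a self-contained sketch you can run to check the column counts. It rests on two assumptions: the `DateTimeTransformer` below is a hypothetical stand-in written for this example, and sklearn's own `KBinsDiscretizer` is used in place of `KBinTransformer`, with the parameters given in the question.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import KBinsDiscretizer


class DateTimeTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: expands every datetime column into 6 parts."""

    PARTS = ("day", "month", "year", "hour", "minute", "second")

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X contains only the columns picked by the column selector.
        parts = {
            f"{col}_{part}": getattr(X[col].dt, part)
            for col in X.columns
            for part in self.PARTS
        }
        return pd.DataFrame(parts, index=X.index)


df = pd.DataFrame({
    "Name": ["foo", "bar"],
    "class": pd.Categorical(["A", "B"]),
    "Age": [44, 34],
    "Arrival Time": pd.to_datetime(["2023-07-20 04:15:02", "2023-07-10 02:10:05"]),
})

ct = ColumnTransformer(
    transformers=[
        ("datetime", DateTimeTransformer(),
         make_column_selector(dtype_include=np.datetime64)),
        # Assumption: KBinsDiscretizer stands in for the custom KBinTransformer.
        ("kbin", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform"),
         make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough",
)
transformed = ct.fit_transform(df)

# 6 datetime parts + 1 binned Age + 2 passthrough columns (Name, class)
print(transformed.shape)
```

Each transformer only receives its selected columns, so neither one sees the other's output. Note that in this sketch the original Age and Arrival Time columns are replaced by the transformed ones rather than kept alongside them.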
Answered By - DataJanitor