Issue
I am trying to apply two different transformations from sklearn to two different columns, both of which are object dtype, inside my Pipeline. My DataFrame looks like this (I omit most rows just to illustrate the point):
email              country  label
[email protected]  NI       True
[email protected]  AR       False
[email protected]  CZ       True
Both email and country are object type. For email I created a set of functions that transform it into numeric representations, like:
import numpy as np

def email_length(email) -> np.ndarray:
    return np.array([len(e[0].split('@')[0]) for e in email]).reshape(-1, 1)

def domain_length(email) -> np.ndarray:
    return np.array([len(e[0].split('@')[-1]) for e in email]).reshape(-1, 1)

def number_of_vouls(email) -> np.ndarray:
    vouls = 'aeiouAEIOU'
    names = [e[0].split('@')[0] for e in email]
    return np.array([sum(1 for char in name if char in vouls) for name in names]).reshape(-1, 1)
To apply these functions to email in a sklearn Pipeline, I was using FunctionTransformer and FeatureUnion like this:
get_email_length = FunctionTransformer(email_length)
get_domain_length = FunctionTransformer(domain_length)
get_number_of_vouls = FunctionTransformer(number_of_vouls)

preproc = FeatureUnion([
    ('email_length', get_email_length),
    ('domain_length', get_domain_length),
    ('number_of_vouls', get_number_of_vouls)])

pipe = Pipeline([
    ('preproc', preproc),
    ('classifier', LGBMClassifier())
])
But I also want to apply a one-hot encoder to country inside my Pipeline. What would be the best way to do that given this Pipeline definition?
Solution
You could try ColumnTransformer:
1. With DataFrame input
def email_and_domain_length(df: pd.DataFrame) -> pd.DataFrame:
    return df["email"].str.split("@", expand=True).applymap(len)

def number_of_vouls(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df["email"]
        .str.split("@")
        .str[0]
        .str.lower()
        .apply(lambda x: sum(x.count(v) for v in "aeiou"))
        .to_frame()
    )
get_email_length = FunctionTransformer(email_and_domain_length)
get_number_of_vouls = FunctionTransformer(number_of_vouls)

preproc = ColumnTransformer(
    [
        ("lengths", get_email_length, ["email"]),
        ("vouls", get_number_of_vouls, ["email"]),
        ("countries", OneHotEncoder(), ["country"]),
    ]
)

preproc.fit_transform(df[["email", "country"]])
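Putting the DataFrame variant together, here is a minimal self-contained sketch. The sample addresses and DataFrame below are made up for illustration (the ones in the question are redacted), so only the shapes matter:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def email_and_domain_length(df: pd.DataFrame) -> pd.DataFrame:
    # lengths of the local part (before '@') and the domain (after '@')
    return df["email"].str.split("@", expand=True).applymap(len)

def number_of_vouls(df: pd.DataFrame) -> pd.DataFrame:
    # vowel count in the local part of each address
    return (
        df["email"]
        .str.split("@").str[0].str.lower()
        .apply(lambda x: sum(x.count(v) for v in "aeiou"))
        .to_frame()
    )

# made-up sample data standing in for the redacted addresses
df = pd.DataFrame(
    {
        "email": ["ana@mail.com", "roberto@mail.com", "petra@mail.com"],
        "country": ["NI", "AR", "CZ"],
    }
)

preproc = ColumnTransformer(
    [
        ("lengths", FunctionTransformer(email_and_domain_length), ["email"]),
        ("vouls", FunctionTransformer(number_of_vouls), ["email"]),
        ("countries", OneHotEncoder(), ["country"]),
    ]
)

out = preproc.fit_transform(df)
print(out.shape)  # 3 rows; 2 lengths + 1 vowel count + 3 country dummies = 6 columns
```

The transformed matrix can then feed any downstream estimator, exactly as in the original Pipeline.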
2. With ndarray input
Just add this to the code in your question; the original functions already work with ndarray input.
preproc = ColumnTransformer(
    [
        ("email_lengths", get_email_length, [0]),
        ("domain_lengths", get_domain_length, [0]),
        ("vouls", get_number_of_vouls, [0]),
        ("countries", OneHotEncoder(), [1]),
    ]
)

preproc.fit_transform(df[["email", "country"]].to_numpy())
Output:
array([[8., 9., 4., 0., 0., 1.],
       [8., 9., 3., 1., 0., 0.],
       [8., 9., 0., 0., 1., 0.]])
As an aside, one-hot encoding would cause more harm than good if country has high cardinality.
I've also tried to vectorize the preprocessing functions by using .str accessor methods instead of list comprehensions.
Answered By - Tim