Issue
I am trying to apply two different transformations from sklearn to two different columns, both of which are object dtype, inside my Pipeline. My DataFrame looks like this (I omit most rows just to illustrate the point):
email              country  label
[email protected]  NI       True
[email protected]  AR       False
[email protected]  CZ       True
Both email and country are object type. For email I created a set of functions that transform it into numeric representations, like:
import numpy as np

def email_length(email) -> np.ndarray:
    return np.array([len(e[0].split('@')[0]) for e in email]).reshape(-1, 1)

def domain_length(email) -> np.ndarray:
    return np.array([len(e[0].split('@')[-1]) for e in email]).reshape(-1, 1)

def number_of_vouls(email) -> np.ndarray:
    vouls = 'aeiouAEIOU'
    names = [e[0].split('@')[0] for e in email]
    return np.array([sum(1 for char in name if char in vouls) for name in names]).reshape(-1, 1)
To apply these functions to email in a sklearn Pipeline, I was using FunctionTransformer and FeatureUnion like this:
get_email_length = FunctionTransformer(email_length)
get_domain_length = FunctionTransformer(domain_length)
get_number_of_vouls = FunctionTransformer(number_of_vouls)

preproc = FeatureUnion([
    ('email_length', get_email_length),
    ('domain_length', get_domain_length),
    ('number_of_vouls', get_number_of_vouls)])

pipe = Pipeline([
    ('preproc', preproc),
    ('classifier', LGBMClassifier())
])
But I also want to apply a one-hot encoder to country inside my Pipeline. What would be the best way to do that given this Pipeline definition?
Solution
You could try ColumnTransformer:
1. With DataFrame input
def email_and_domain_length(df: pd.DataFrame) -> pd.DataFrame:
    return df["email"].str.split("@", expand=True).applymap(len)

def number_of_vouls(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df["email"]
        .str.split("@")
        .str[0]
        .str.lower()
        .apply(lambda x: sum(x.count(v) for v in "aeiou"))
        .to_frame()
    )
get_email_length = FunctionTransformer(email_and_domain_length)
get_number_of_vouls = FunctionTransformer(number_of_vouls)

preproc = ColumnTransformer(
    [
        ("lengths", get_email_length, ["email"]),
        ("vouls", get_number_of_vouls, ["email"]),
        ("countries", OneHotEncoder(), ["country"]),
    ]
)

preproc.fit_transform(df[["email", "country"]])
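Putting the DataFrame variant together, here is a minimal self-contained sketch. The sample addresses and DataFrame below are made up for illustration (the ones in the question are redacted), so only the shapes matter:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def email_and_domain_length(df: pd.DataFrame) -> pd.DataFrame:
    # lengths of the local part (before '@') and the domain (after '@')
    return df["email"].str.split("@", expand=True).applymap(len)

def number_of_vouls(df: pd.DataFrame) -> pd.DataFrame:
    # vowel count in the local part of each address
    return (
        df["email"]
        .str.split("@").str[0].str.lower()
        .apply(lambda x: sum(x.count(v) for v in "aeiou"))
        .to_frame()
    )

# made-up sample data standing in for the redacted addresses
df = pd.DataFrame(
    {
        "email": ["ana@mail.com", "roberto@mail.com", "petra@mail.com"],
        "country": ["NI", "AR", "CZ"],
    }
)

preproc = ColumnTransformer(
    [
        ("lengths", FunctionTransformer(email_and_domain_length), ["email"]),
        ("vouls", FunctionTransformer(number_of_vouls), ["email"]),
        ("countries", OneHotEncoder(), ["country"]),
    ]
)

out = preproc.fit_transform(df)
print(out.shape)  # 3 rows; 2 lengths + 1 vowel count + 3 country dummies = 6 columns
```

The transformed matrix can then feed any downstream estimator, exactly as in the original Pipeline.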
2. With ndarray input
Just add this to the code in your question; the original functions already work with ndarray input.
preproc = ColumnTransformer(
    [
        ("email_lengths", get_email_length, [0]),
        ("domain_lengths", get_domain_length, [0]),
        ("vouls", get_number_of_vouls, [0]),
        ("countries", OneHotEncoder(), [1]),
    ]
)

preproc.fit_transform(df[["email", "country"]].to_numpy())
Output:
array([[8., 9., 4., 0., 0., 1.],
       [8., 9., 3., 1., 0., 0.],
       [8., 9., 0., 0., 1., 0.]])
As an aside, one-hot encoding would cause more harm than good if country has high cardinality.
I've also tried to vectorize the preprocessing functions by using .str accessor methods instead of list comprehensions.
Answered By - Tim