Issue
Suppose that we have this data frame:
ID | CATEGORIES |
---|---|
0 | ['A'] |
1 | ['A', 'C'] |
2 | ['B', 'C'] |
I want to one-hot encode the CATEGORIES column. The result I want is:
ID | A | B | C |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 1 | 0 | 1 |
2 | 0 | 1 | 1 |
I know this can easily be coded by hand. I just want to know whether it is already implemented in some package, since writing it in pure Python will probably result in a rather slow function.
(I had to put the tables in code fields because Stack Overflow would not let me post them as tables.)
Solution
You can use str.join combined with str.get_dummies:
out = df[['ID']].join(df['CATEGORIES'].str.join('|').str.get_dummies())
Output:
   ID  A  B  C
0   0  1  0  0
1   1  1  0  1
2   2  0  1  1
Input used:
import pandas as pd
df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})
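To see why the str.join / str.get_dummies combination works, here is a minimal sketch that spells out the intermediate steps on the same input. The variable names joined and dummies are only illustrative. It also makes the separator explicit, which matters because a label that itself contains '|' would be split into two columns.

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# Step 1: collapse each list into one delimited string, e.g. ['A', 'C'] -> 'A|C'
joined = df['CATEGORIES'].str.join('|')

# Step 2: split each string on the same delimiter and build one
# indicator column per distinct label ('|' is also the default sep)
dummies = joined.str.get_dummies(sep='|')

# Step 3: attach the indicator columns to the ID column (aligned on the index)
out = df[['ID']].join(dummies)
print(out)
```

If the labels could contain '|', pick a different separator that does not occur in the data and pass it to both str.join and str.get_dummies.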
There are many other alternatives, using pivot, crosstab, etc.
One example:
df2 = df.explode('CATEGORIES')
out = pd.crosstab(df2['ID'], df2['CATEGORIES']).reset_index()
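One thing to keep in mind with the crosstab approach is that it counts occurrences rather than flagging presence. The sketch below uses a hypothetical input in which one list repeats a label (this is not the data from the question) and clips the counts to 1 to get back a 0/1 encoding. The names counts and indicators are illustrative.

```python
import pandas as pd

# Hypothetical input: the first row repeats the label 'A'
df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A', 'A'], ['A', 'C'], ['B', 'C']]})

# One row per (ID, label) pair
df2 = df.explode('CATEGORIES')

# crosstab counts how often each label occurs per ID,
# so the repeated 'A' yields a 2 rather than a 1
counts = pd.crosstab(df2['ID'], df2['CATEGORIES'])

# Cap the counts at 1 to turn them into 0/1 indicators
indicators = counts.clip(upper=1).reset_index()
print(indicators)
```

With the input from the question, where no list repeats a label, the clip step changes nothing.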
Answered By - mozway