Issue
My input is a dataframe :
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'D', 'D', 'D', 'E', 'E'],
'col2': ['x1', 'x2', 'x1', 'x1', 'x1', 'x2', 'x3', 'x1', 'x2'],
'col3': ['', 'mssg1', 'mssg2', '', 'mssg3', '', 'mssg4', '', '']})
col1 col2 col3
0 A x1
1 A x2 mssg1
2 B x1 mssg2
3 C x1
4 D x1 mssg3
5 D x2
6 D x3 mssg4
7 E x1
8 E x2
I need to loop over each group in the column col1
and see if it's a duplicate and if True, go check the column col3
and keep only the non empty values. If all values are empty, then keep them all.
For that I made the code below. at first sight, it works but not only it is slow on my real dataset but I also ignore if it misses something in the logic.
def clean(x):
if len(x) > 1 and x['col3'].notna().any():
return x.loc[x['col3'].notna()]
else:
return x
final = df.replace('', None).groupby('col1', as_index=False).apply(clean).droplevel(0).fillna('')
My expect output is this :
col1 col2 col3
1 A x2 mssg1
2 B x1 mssg2
3 C x1
4 D x1 mssg3
6 D x3 mssg4
7 E x1
8 E x2
Can you guys show me how to solve my problem ?
Solution
Use boolean indexing:
# is col3 not empty?
m1 = df['col3'].ne('')
# is m1 not all True per col1?
m2 = ~m1.groupby(df['col1']).transform('any')
out = df[m1|m2]
Output:
col1 col2 col3
1 A x2 mssg1
2 B x1 mssg2
3 C x1
4 D x1 mssg3
6 D x3 mssg4
7 E x1
8 E x2
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.