Issue
I have a large pandas dataframe with common sentences 'S', which I'd like to learn, and their counts 'C', like this:
df = pd.DataFrame({"S": ["Yes.", "Yes!", "Yes?",
"No?", "No.", "What?"],
"C": [100, 50, 40, 30, 10, 5]})
df
S C
0 Yes. 100
1 Yes! 50
2 Yes? 40
3 No? 30
4 No. 10
5 What? 5
If sentences are the same up to their last character, I'd like to collapse them into one, represented by the most common sentence, while adding up the counts across the group. The resulting dataframe should look like this:
S C
0 Yes. 190
1 No? 40
2 What? 5
In the same way, I have list of common words - some lowercase, some uppercase - and I'd like to add the counts across the lower/uppercase group while retaining only the most common case.
Solution
First create a column for 'S' until the last character in the string
df['s'] = df['S'].str[:-1]
Then you can use groupby() to sum up 'C' and get the most common occurrence in 'S'. Example
most_common = df[df['C'] == df.groupby('s')['C'].transform('max')][['S', 's']]
total = df.groupby('s')['C'].sum().reset_index()
most_common.merge(total).drop(columns=['s']).sort_values(by='C', ascending=False)
Same for uppercase/lowercase using str.lower()
Answered By - crx91
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.