Issue
I have a pandas DataFrame with four columns of note: ID, DATE, PRIMARY_INDICATOR, and PHONE.
Each ID can have multiple rows in the table. The rows are guaranteed to be sorted in descending date order within each ID subgroup. For example:
>>> df.head(10)
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 1 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
My goal is:
- If an ID group has a single PRIMARY_INDICATOR of 1, nothing needs to be done (df.loc[df['ID'] == 123] in the above table is an example of this group).
- If an ID group has no PRIMARY_INDICATOR of 1, nothing needs to be done (df.loc[df['ID'] == 765] in the above table is an example of this group).
- If an ID group has more than one PRIMARY_INDICATOR of 1, set only the one with the most recent date to 1 and the rest to 0. For the group formed by df.loc[df['ID'] == 999] in the above example, the result would look like:
>>> df.head(10)
ID DATE PRIMARY_INDICATOR PHONE
2 999 20230125 1 8074312
3 999 20230120 0 9087654
4 999 20230119 0 1235678
I have around 280,000 unique IDs.
I have tried adding the IDs to a set, then popping each one from the set, creating a subset of the dataframe via loc for that ID, and iterating over it with iterrows and a bool flag.
This approach works but is really slow. The only meaningfully slow part is creating the subset dataframe for each popped ID, idDataframe = df.loc[df['ID'] == currentId]. That takes something like 0.016475 seconds per ID, and the entire script took 80 minutes.
>>> import pandas as pd
>>>
>>> pd.set_option('display.max_columns', None)
>>> data = {'ID': [123, 123, 999, 999, 999, 765, 765],
... 'DATE': ['20230125', '20230124', '20230125', '20230120', '20230119', '20230125', '20230125'],
... 'PRIMARY_INDICATOR': [1, 0, 1, 1, 0, 0, 0],
... 'PHONE' : [8071234, 8079999, 8074312, 9087654, 1235678, 9990000, 9999999]}
>>> df = pd.DataFrame.from_dict(data)
>>> df.head(10)
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 1 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
import pandas as pd

data = {'ID': [123, 123, 999, 999, 999, 765, 765],
        'DATE': ['20230125', '20230124', '20230125', '20230120', '20230119', '20230125', '20230125'],
        'PRIMARY_INDICATOR': [1, 0, 1, 1, 0, 0, 0],
        'PHONE': [8071234, 8079999, 8074312, 9087654, 1235678, 9990000, 9999999]}
df = pd.DataFrame.from_dict(data)

idSet = set(df.ID.unique())
while idSet:
    # pop one id from the set
    currentId = idSet.pop()
    # get a subset of the original dataframe which only shows the popped id's records
    idDataframe = df.loc[df['ID'] == currentId].drop_duplicates()
    # create the output row from each row in the subframe
    primaryPhoneIndicated = False
    for index, row in idDataframe.iterrows():
        if not primaryPhoneIndicated and row['PRIMARY_INDICATOR'] == 1:
            PRIMARY_INDICATOR = 1
            primaryPhoneIndicated = True
        else:
            PRIMARY_INDICATOR = 0
        print([row['ID'], row['DATE'], PRIMARY_INDICATOR, row['PHONE']])
Is there a pandas-y way to do this, without the need to create a dataframe for each ID to apply the logic?
Solution
It can be done without groupby:
idx = (df.loc[df['PRIMARY_INDICATOR'].eq(1)]
.duplicated('ID')
.loc[lambda x: x].index)
df.loc[idx, 'PRIMARY_INDICATOR'] = 0
Output:
>>> df
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 0 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
Step by step:
# Select only rows where PRIMARY_INDICATOR==1
>>> out = df.loc[df['PRIMARY_INDICATOR'].eq(1)]
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
2 999 20230125 1 8074312
3 999 20230120 1 9087654
# Mark every occurrence of an ID after the first as a duplicate
# (these are the older dates, since rows are sorted by descending date within each ID)
>>> out = out.duplicated('ID')
0 False
2 False
3 True
dtype: bool
# Keep only the True rows and extract their index (the rows to reset to 0)
>>> idx = out.loc[lambda x: x].index
Index([3], dtype='int64')
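A minimal sketch of a variant (an addition here, not part of the original answer): duplicated('ID') keeps the first row per ID, which is the most recent one only because the question guarantees descending date order within each ID. If that guarantee did not hold, the filtered rows could be sorted by DATE first; DATE is a YYYYMMDD string here, so lexicographic order matches chronological order.
# Sketch of an order-independent variant (assumes DATE strings sort chronologically):
# sort the PRIMARY_INDICATOR==1 rows by DATE descending before marking duplicates,
# so the first row kept per ID is always the most recent one.
idx = (df.loc[df['PRIMARY_INDICATOR'].eq(1)]
         .sort_values('DATE', ascending=False, kind='stable')
         .duplicated('ID')
         .loc[lambda x: x].index)
df.loc[idx, 'PRIMARY_INDICATOR'] = 0
Sorting only the filtered subset leaves the row order of df untouched; the original index labels survive the sort, so df.loc[idx, ...] still points at the right rows.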
Answered By - Corralien