Issue
I have a pandas DataFrame with four columns of note: ID, DATE, PRIMARY_INDICATOR, and PHONE.
Each ID can have multiple rows in the table. The rows are guaranteed to be sorted in descending date order within each ID subgroup. For example:
>>> df.head(10)
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 1 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
My goal is:
- If an ID group has a single PRIMARY_INDICATOR of 1, nothing needs to be done (df.loc[df['ID'] == 123] in the above table is an example of this group).
- If an ID group has no PRIMARY_INDICATOR of 1, nothing needs to be done (df.loc[df['ID'] == 765] in the above table is an example of this group).
- If an ID group has more than one PRIMARY_INDICATOR of 1, set only the one with the most recent date to 1 and the rest to 0. For the group formed by df.loc[df['ID'] == 999] in the above example, the result would look like:
>>> df.head(10)
ID DATE PRIMARY_INDICATOR PHONE
2 999 20230125 1 8074312
3 999 20230120 0 9087654
4 999 20230119 0 1235678
I have around 280,000 unique IDs.
I have tried adding the IDs to a set, then popping each one from the set, creating a subset of the dataframe via loc for that ID, and iterating over it with iterrows and a bool flag.
This approach works but is really slow. The only meaningfully slow part is creating the subset dataframe for each popped ID, idDataframe = df.loc[df['ID'] == currentId]. That takes something like 0.016475 seconds per ID, and the entire script took 80 minutes.
>>> import pandas as pd
>>>
>>> pd.set_option('display.max_columns', None)
>>> data = {'ID': [123, 123, 999, 999, 999, 765, 765],
... 'DATE': ['20230125', '20230124', '20230125', '20230120', '20230119', '20230125', '20230125'],
... 'PRIMARY_INDICATOR': [1, 0, 1, 1, 0, 0, 0],
... 'PHONE' : [8071234, 8079999, 8074312, 9087654, 1235678, 9990000, 9999999]}
>>> df = pd.DataFrame.from_dict(data)
>>> df.head(10)
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 1 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
import pandas as pd

data = {'ID': [123, 123, 999, 999, 999, 765, 765],
        'DATE': ['20230125', '20230124', '20230125', '20230120', '20230119', '20230125', '20230125'],
        'PRIMARY_INDICATOR': [1, 0, 1, 1, 0, 0, 0],
        'PHONE': [8071234, 8079999, 8074312, 9087654, 1235678, 9990000, 9999999]}
df = pd.DataFrame.from_dict(data)

idSet = set(df.ID.unique())
while idSet:
    # pop one id from the set
    currentId = idSet.pop()
    # get a subset of the original dataframe which only shows the popped id's records
    idDataframe = df.loc[df['ID'] == currentId].drop_duplicates()
    # create the output row from each row in the subframe
    primaryPhoneIndicated = False
    for index, row in idDataframe.iterrows():
        if not primaryPhoneIndicated and row['PRIMARY_INDICATOR'] == 1:
            PRIMARY_INDICATOR = 1
            primaryPhoneIndicated = True
        else:
            PRIMARY_INDICATOR = 0
        print([row['ID'], row['DATE'], PRIMARY_INDICATOR, row['PHONE']])
Is there a pandas-y way to do this, without the need to create a dataframe for each ID to apply the logic?
Solution
It can be done without groupby:
idx = (df.loc[df['PRIMARY_INDICATOR'].eq(1)]
.duplicated('ID')
.loc[lambda x: x].index)
df.loc[idx, 'PRIMARY_INDICATOR'] = 0
Output:
>>> df
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 0 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
Step by step:
# Select only rows where PRIMARY_INDICATOR==1
>>> out = df.loc[df['PRIMARY_INDICATOR'].eq(1)]
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
2 999 20230125 1 8074312
3 999 20230120 1 9087654
# Mark every occurrence of an ID after the first as a duplicate
# (these are the older dates, since rows are sorted by descending date within each ID)
>>> out = out.duplicated('ID')
0 False
2 False
3 True
dtype: bool
# Keep only the True rows and extract their index (the rows to reset to 0)
>>> idx = out.loc[lambda x: x].index
Index([3], dtype='int64')
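A minimal sketch of a variant (an addition here, not part of the original answer): duplicated('ID') keeps the first row per ID, which is the most recent one only because the question guarantees descending date order within each ID. If that guarantee did not hold, the filtered rows could be sorted by DATE first; DATE is a YYYYMMDD string here, so lexicographic order matches chronological order.
# Sketch of an order-independent variant (assumes DATE strings sort chronologically):
# sort the PRIMARY_INDICATOR==1 rows by DATE descending before marking duplicates,
# so the first row kept per ID is always the most recent one.
idx = (df.loc[df['PRIMARY_INDICATOR'].eq(1)]
         .sort_values('DATE', ascending=False, kind='stable')
         .duplicated('ID')
         .loc[lambda x: x].index)
df.loc[idx, 'PRIMARY_INDICATOR'] = 0
Sorting only the filtered subset leaves the row order of df untouched; the original index labels survive the sort, so df.loc[idx, ...] still points at the right rows.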
Answered By - Corralien