Sunday, January 21, 2024

[FIXED] Replacing second row of each group with first row of another dataframe

Issue

I have two dataframes:

import pandas as pd 

df1 = pd.DataFrame(
    {
        'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
        'open': [99, 22, 34, 63, 75, 86, 1800, 82],
        'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
        'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
    }
)


df2 = pd.DataFrame(
    {
        'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
        'open': [77, 232, 434, 33, 55, 66, 1000],
        'high': [177, 11123, 1123, 343, 55, 3545, 21323],
        'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
    }
)

And this is the output that I want:

  sym  open   high   x
0   a    99   3987  gd
1   a    77    177  ed
2   a    34  46123  we
3   a    63   6643  vt
4   b    75     75  de
5   b   434   1123  sw
6   b  1800  72123  ee
7   c    82     74  et

These are the steps needed. Groups are defined by sym:

a) Select the first row of each group in df2

b) From that row, only open and high are needed.

c) Use these values to replace open and high in the second row of the matching group in df1.

So for example for group a:

a) df2: row 0 is selected

b) df2: open is 77 and high is 177

c) df1: in row 1, 22 and 41123 are replaced with 77 and 177.
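
The steps above can be sketched as a selection of source rows and target rows. This is only a minimal illustration on the same data; the names first_rows and is_second are introduced here for the example and do not appear in the original question:

```python
import pandas as pd

df1 = pd.DataFrame({
    'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'open': [99, 22, 34, 63, 75, 86, 1800, 82],
    'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
    'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
})
df2 = pd.DataFrame({
    'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'open': [77, 232, 434, 33, 55, 66, 1000],
    'high': [177, 11123, 1123, 343, 55, 3545, 21323],
    'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
})

# Steps a) and b): first row of each sym group in df2, keeping only open and high.
first_rows = df2.drop_duplicates('sym').set_index('sym')[['open', 'high']]

# Step c) targets: rows at position 1 (the second row) within each sym group of df1.
is_second = df1.groupby('sym').cumcount() == 1

print(first_rows)
print(df1[is_second])
```

Group c has a single row in df1, so it contributes no target row, and its values from df2 go unused.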

This is what I have tried. It raises an IndexError, and even without that error it does not feel like the right approach:

def replace_second_row(df):
    selected_sym = df.sym.iloc[0]
    row = df2.loc[df2.sym == selected_sym]
    row = row[['open', 'high']].iloc[0]
    df.iloc[1, df.columns.get_loc('open'): df.columns.get_loc('open') + 2] = row
    return df


output = df1.groupby('sym').apply(replace_second_row)

The traceback of the above IndexError:

Traceback (most recent call last):
  File "D:\python\py_files\example_df.py", line 1618, in <module>
    x = df1.groupby('sym').apply(replace_second_row)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
    res = f(group)
  File "D:\python\py_files\example_df.py", line 1614, in replace_second_row
    df.iloc[1, df.columns.get_loc('open'): df.columns.get_loc('open') + 2] = row
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 689, in __setitem__
    self._has_valid_setitem_indexer(key)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 1401, in _has_valid_setitem_indexer
    raise IndexError("iloc cannot enlarge its target object")
IndexError: iloc cannot enlarge its target object
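
The error comes from group c, which has only one row in df1, so df.iloc[1, ...] points past the end of that group. Below is a minimal sketch of a guarded version of the same function. It assumes that groups with fewer than two rows, or with no match in the second frame, should be returned unchanged. The source parameter and the explicit loop over groups are choices made for this example, not part of the original attempt:

```python
import pandas as pd

df1 = pd.DataFrame({
    'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'open': [99, 22, 34, 63, 75, 86, 1800, 82],
    'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
    'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
})
df2 = pd.DataFrame({
    'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'open': [77, 232, 434, 33, 55, 66, 1000],
    'high': [177, 11123, 1123, 343, 55, 3545, 21323],
    'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
})


def replace_second_row(group, source):
    """Return a copy of group whose second row takes open/high from source."""
    group = group.copy()
    selected_sym = group['sym'].iloc[0]
    match = source.loc[source['sym'] == selected_sym, ['open', 'high']]
    # Guard: nothing to replace if the group has no second row or no match.
    if len(group) < 2 or match.empty:
        return group
    cols = [group.columns.get_loc('open'), group.columns.get_loc('high')]
    group.iloc[1, cols] = match.iloc[0].to_numpy()
    return group


# Iterate over the groups explicitly instead of relying on groupby.apply.
output = pd.concat([replace_second_row(g, df2) for _, g in df1.groupby('sym')])
print(output)
```

Looping over the groups keeps the sym column available inside the function on any pandas version, at the cost of being slower than a vectorized approach on large frames.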

To clarify the process, I have uploaded an image. The highlighted rows are the ones that need to be selected or changed.

Solution

Code

For both df1 and df2, create a cc column holding the row position within each group, then set sym and cc as the index:

tmp1 = df1.assign(cc=df1.groupby('sym').cumcount()).set_index(['sym', 'cc'])
tmp2 = df2.groupby('sym').head(1).assign(cc=1).set_index(['sym', 'cc'])\
          .drop('x', axis=1)

tmp1:

        open    high    x
sym cc          
a   0   99      3987    gd
    1   22      41123   ed
    2   34      46123   we
    3   63      6643    vt
b   0   75      75      de
    1   86      3745    sw
    2   1800    72123   ee
c   0   82      74      et

tmp2 (only the first row of each df2 group, with cc set to 1 so that it lines up with the second row of df1, and with the x column removed):

        open    high
sym cc      
a   1   77      177
b   1   434     1123
c   1   55      55
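
To see why cc=1 marks the second row, it may help to look at what cumcount and head(1) return by themselves. The following is a small sketch on the same sym columns; the variable names cc and first_idx are introduced only for this illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c']})
df2 = pd.DataFrame({'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c']})

# cumcount numbers the rows of each group from 0, so 1 is the second row.
cc = df1.groupby('sym').cumcount()
print(cc.tolist())         # [0, 1, 2, 3, 0, 1, 2, 0]

# head(1) keeps the first row of each group and preserves the original index.
first_idx = df2.groupby('sym').head(1).index
print(first_idx.tolist())  # [0, 2, 4]
```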

Then use combine_first so that values from tmp2 take priority and tmp1 fills in everything else:

out = tmp2[tmp2.index.isin(tmp1.index)].combine_first(tmp1)\
          .reindex_like(tmp1).reset_index().drop('cc', axis=1)

out:

  sym   open    high    x
0   a   99      3987    gd
1   a   77      177     ed
2   a   34      46123   we
3   a   63      6643    vt
4   b   75      75      de
5   b   434     1123    sw
6   b   1800    72123   ee
7   c   82      74      et

The boolean indexing used to create out drops the row for c from tmp2, because group c has no second row in tmp1.
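
The effect of that filter can be checked directly. The sketch below rebuilds tmp1 and tmp2 as in the answer and prints the mask; the name mask is introduced here for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'open': [99, 22, 34, 63, 75, 86, 1800, 82],
    'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
    'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
})
df2 = pd.DataFrame({
    'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'open': [77, 232, 434, 33, 55, 66, 1000],
    'high': [177, 11123, 1123, 343, 55, 3545, 21323],
    'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
})

tmp1 = df1.assign(cc=df1.groupby('sym').cumcount()).set_index(['sym', 'cc'])
tmp2 = (df2.groupby('sym').head(1).assign(cc=1)
           .set_index(['sym', 'cc']).drop('x', axis=1))

# True where the (sym, 1) key of tmp2 also exists in tmp1.
mask = tmp2.index.isin(tmp1.index)
print(mask.tolist())  # [True, True, False]: ('c', 1) is not in tmp1

out = (tmp2[mask].combine_first(tmp1)
                 .reindex_like(tmp1).reset_index().drop('cc', axis=1))
print(out)
```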

Answered By - Panda Kim

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
