Issue
I have two dataframes:
import pandas as pd

df1 = pd.DataFrame(
    {
        'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
        'open': [99, 22, 34, 63, 75, 86, 1800, 82],
        'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
        'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
    }
)
df2 = pd.DataFrame(
    {
        'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
        'open': [77, 232, 434, 33, 55, 66, 1000],
        'high': [177, 11123, 1123, 343, 55, 3545, 21323],
        'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
    }
)
And this is the output that I want:
  sym  open   high   x
0   a    99   3987  gd
1   a    77    177  ed
2   a    34  46123  we
3   a    63   6643  vt
4   b    75     75  de
5   b   434   1123  sw
6   b  1800  72123  ee
7   c    82     74  et
These are the steps needed. Groups are defined by sym:
a) Select the first row of each group in df2.
b) Only open and high are needed from that row.
c) In df1, replace the open and high values in the second row of each group with the values selected from df2.
So, for example, for group a:
a) df2: row 0 is selected.
b) df2: open is 77 and high is 177.
c) In row 1 of df1, 22 and 41123 are replaced with 77 and 177.
This is what I have tried. It gives me an IndexError. But even if it did not give me that error, it feels like this is not the right way:
def replace_second_row(df):
    selected_sym = df.sym.iloc[0]
    row = df2.loc[df2.sym == selected_sym]
    row = row[['open', 'high']].iloc[0]
    df.iloc[1, df.columns.get_loc('open'): df.columns.get_loc('open') + 2] = row
    return df

output = df1.groupby('sym').apply(replace_second_row)
The traceback of the above IndexError:
Traceback (most recent call last):
  File "D:\python\py_files\example_df.py", line 1618, in <module>
    x = df1.groupby('sym').apply(replace_second_row)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
    res = f(group)
  File "D:\python\py_files\example_df.py", line 1614, in replace_second_row
    df.iloc[1, df.columns.get_loc('open'): df.columns.get_loc('open') + 2] = row
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 689, in __setitem__
    self._has_valid_setitem_indexer(key)
  File "C:\Users\AF\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\indexing.py", line 1401, in _has_valid_setitem_indexer
    raise IndexError("iloc cannot enlarge its target object")
IndexError: iloc cannot enlarge its target object
To clarify the process further, I have uploaded an image. The highlighted rows are the rows that need to be selected/changed.
Solution
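A likely cause of the IndexError in the question: group c has only one row in df1, so inside replace_second_row the assignment to df.iloc[1, ...] targets a row position that does not exist, and iloc refuses to add rows. The snippet below is a minimal reproduction under that assumption; the one-row frame one_row is a hypothetical stand-in for the c group.

```python
import pandas as pd

# Hypothetical stand-in for the single-row group 'c' of df1.
one_row = pd.DataFrame({'sym': ['c'], 'open': [82], 'high': [74], 'x': ['et']})

caught = None
try:
    # Row position 1 does not exist in a one-row frame, and iloc cannot append rows.
    one_row.iloc[1, one_row.columns.get_loc('open')] = 55
except IndexError as exc:
    caught = exc

print(type(caught).__name__, caught)
```

Any approach that writes to the second row therefore has to skip groups that have fewer than two rows.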
Code
Give df1 and df2 a cc column and set (sym, cc) as the index:
tmp1 = df1.assign(cc=df1.groupby('sym').cumcount()).set_index(['sym', 'cc'])
tmp2 = df2.groupby('sym').head(1).assign(cc=1).set_index(['sym', 'cc'])\
.drop('x', axis=1)
tmp1:
        open   high   x
sym cc
a   0     99   3987  gd
    1     22  41123  ed
    2     34  46123  we
    3     63   6643  vt
b   0     75     75  de
    1     86   3745  sw
    2   1800  72123  ee
c   0     82     74  et
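The cc column comes from groupby(...).cumcount(), which numbers the rows of each group starting at 0. A small sketch on just the sym column of df1 shows the numbering; the name cc is reused here only for illustration.

```python
import pandas as pd

# Only the grouping column is needed to see the per-group row numbering.
df1 = pd.DataFrame({'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c']})

# cumcount() restarts at 0 for every group, so cc == 1 marks each group's second row.
cc = df1.groupby('sym').cumcount()
print(cc.tolist())  # [0, 1, 2, 3, 0, 1, 2, 0]
```

Group c never reaches 1 because it has only one row.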
tmp2 (only the first row of each df2 group, with cc set to 1 so that it lines up with the second row of each tmp1 group, and with the x column removed):
        open  high
sym cc
a   1     77   177
b   1    434  1123
c   1     55    55
Use combine_first to overlay tmp2 on tmp1:
out = tmp2[tmp2.index.isin(tmp1.index)].combine_first(tmp1)\
.reindex_like(tmp1).reset_index().drop('cc', axis=1)
out:
  sym  open   high   x
0   a    99   3987  gd
1   a    77    177  ed
2   a    34  46123  we
3   a    63   6643  vt
4   b    75     75  de
5   b   434   1123  sw
6   b  1800  72123  ee
7   c    82     74  et
The boolean indexing used to create out keeps only the rows of tmp2 whose (sym, cc) key also exists in tmp1. Here it drops the c row of tmp2, because group c has no second row in tmp1.
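For comparison, here is a sketch of a different technique (a boolean mask with .loc assignment, not the combine_first method used in this answer). It assumes every sym that has a second row in df1 also appears in df2; otherwise the lookup raises a KeyError. The names cols, first_rows, is_second and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'sym': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'open': [99, 22, 34, 63, 75, 86, 1800, 82],
    'high': [3987, 41123, 46123, 6643, 75, 3745, 72123, 74],
    'x': ['gd', 'ed', 'we', 'vt', 'de', 'sw', 'ee', 'et'],
})
df2 = pd.DataFrame({
    'sym': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'open': [77, 232, 434, 33, 55, 66, 1000],
    'high': [177, 11123, 1123, 343, 55, 3545, 21323],
    'x': ['g', 'e', 'w', 'v', 'd', 's', 'g'],
})

cols = ['open', 'high']

# First row of each df2 group, indexed by sym for lookup.
first_rows = df2.drop_duplicates('sym').set_index('sym')[cols]

# True for the second row of each df1 group; groups with one row have none.
is_second = df1.groupby('sym').cumcount().eq(1)

result = df1.copy()
# Look up the df2 values by sym and write them in as a plain array,
# so the assignment is positional rather than index-aligned.
result.loc[is_second, cols] = first_rows.loc[result.loc[is_second, 'sym']].to_numpy()
print(result)
```

Because the mask is never True for a group with a single row, group c is left untouched without any extra filtering.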
Answered By - Panda Kim