Sunday, January 21, 2024

[FIXED] Removing duplicate rows based on values in row somewhere above

Tags: duplicates, pandas, python

Issue

Here is an example of my dataframe:

import pandas as pd

df = pd.DataFrame([['In', 'Age', 'Nat.'],
                   ['Jakub Kiwior', 22, 'Poland'],
                   ['Leandro Trossard', 28, 'Belgium'],
                   ['Jorginho', 31, 'Italy'],
                   ['Out', 'Age', 'Nat.'],
                   ['Jhon Durán', 19, 'Colombia'],
                   ['In', 'Age', 'Nat.'],
                   ['Jhon Durán', 19, 'Colombia'],
                   ['Álex Moreno', 29, 'Spain'],
                   ['Out', 'Age', 'Nat.'],
                   ['Leandro Trossard', 28, 'Belgium'],
                   ['Jorginho', 31, 'Italy'],
                   ['In', 'Age', 'Nat.'],
                   ['Out', 'Age', 'Nat.'],
                   ['In', 'Age', 'Nat.'],
                  ], columns=['Player', 'Age', 'Nat.'])

I want to remove the duplicated player rows that sit under an 'Out' header, that is, rows whose nearest header row above them (not necessarily directly above) has the value 'Out' in the 'Player' column.

For example, the desired output drops the first "Jhon Durán" row and the second "Leandro Trossard" and "Jorginho" rows, because the header above each of them is "Out" rather than "In".

Is this possible to achieve with pandas?

Solution

You could use the pandas shift method to add a column that holds the 'Player' value from the previous row:

df['previousPlayer'] = df['Player'].shift(1)
df

              Player  Age      Nat.    previousPlayer
0                 In  Age      Nat.               NaN
1       Jakub Kiwior   22    Poland                In
2   Leandro Trossard   28   Belgium      Jakub Kiwior
3           Jorginho   31     Italy  Leandro Trossard
4                Out  Age      Nat.          Jorginho
5         Jhon Durán   19  Colombia               Out
6                 In  Age      Nat.        Jhon Durán
7         Jhon Durán   19  Colombia                In
8        Álex Moreno   29     Spain        Jhon Durán
9                Out  Age      Nat.       Álex Moreno
10  Leandro Trossard   28   Belgium               Out
11          Jorginho   31     Italy  Leandro Trossard
12                In  Age      Nat.          Jorginho
13               Out  Age      Nat.                In
14                In  Age      Nat.               Out

Then drop every row whose value in the new column equals the word of your choice ('Out' here), and remove the helper column. Note that this checks only the row directly above:

df = df[df.previousPlayer != 'Out'].drop('previousPlayer', axis=1)
print(df)

              Player  Age      Nat.
0                 In  Age      Nat.
1       Jakub Kiwior   22    Poland
2   Leandro Trossard   28   Belgium
3           Jorginho   31     Italy
4                Out  Age      Nat.
6                 In  Age      Nat.
7         Jhon Durán   19  Colombia
8        Álex Moreno   29     Spain
9                Out  Age      Nat.
11          Jorginho   31     Italy
12                In  Age      Nat.
13               Out  Age      Nat.
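
Because shift(1) looks only at the row directly above, the output above still contains row 11 ("Jorginho", which sits two rows below an "Out" header) and loses row 14 (an "In" header that happens to follow "Out"). The following is a minimal sketch of one way to get closer to the output described in the question, not part of the original answer. It assumes that 'In' and 'Out' are the only header values and that "duplicate" means a player row that appears more than once anywhere in the frame. The names headers, is_header, section, is_duplicate and result are illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame([['In', 'Age', 'Nat.'],
                   ['Jakub Kiwior', 22, 'Poland'],
                   ['Leandro Trossard', 28, 'Belgium'],
                   ['Jorginho', 31, 'Italy'],
                   ['Out', 'Age', 'Nat.'],
                   ['Jhon Durán', 19, 'Colombia'],
                   ['In', 'Age', 'Nat.'],
                   ['Jhon Durán', 19, 'Colombia'],
                   ['Álex Moreno', 29, 'Spain'],
                   ['Out', 'Age', 'Nat.'],
                   ['Leandro Trossard', 28, 'Belgium'],
                   ['Jorginho', 31, 'Italy'],
                   ['In', 'Age', 'Nat.'],
                   ['Out', 'Age', 'Nat.'],
                   ['In', 'Age', 'Nat.'],
                  ], columns=['Player', 'Age', 'Nat.'])

# Assumption: these two values are the only section headers.
headers = ['In', 'Out']
is_header = df['Player'].isin(headers)

# Label every row with the nearest header at or above it:
# keep header values, blank out everything else, then forward-fill.
section = df['Player'].where(is_header).ffill()

# Player rows that occur more than once anywhere in the frame.
# Header rows repeat by design, so they are excluded.
is_duplicate = df.duplicated(keep=False) & ~is_header

# Drop only the duplicated player rows that belong to an 'Out' section.
result = df[~(is_duplicate & (section == 'Out'))]
print(result)
```

With the sample data this drops rows 5, 10 and 11 and keeps every header row. If all player rows under "Out" should go regardless of duplication, the is_duplicate mask could be replaced with ~is_header.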

Answered By - straka86

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
