Monday, February 28, 2022

[FIXED] filtering data from data frame with complex criteria in pandas python

February 28, 2022 pandas, python-3.x No comments

Issue

This is my dataframe

import pandas as pd

data=pd.DataFrame({'vehicle':['car','car','car','car','car','car','bus','bus','bus','bus','bus','bus','car','car','car','car','car','car','bus','bus','bus','bus','bus','bus'],
'expecteddate':['2/24/2022','2/24/2022','3/15/2022','3/15/2022','4/20/2022','4/20/2022','2/24/2022','2/24/2022','3/15/2022','3/15/2022','4/20/2022','4/20/2022','2/24/2022','2/24/2022','3/15/2022','3/15/2022','4/20/2022','4/20/2022','2/24/2022','2/24/2022','3/15/2022','3/15/2022','4/20/2022','4/20/2022'],'range':[240,240,240,240,240,240,300,300,300,300,300,300,240,240,240,240,240,240,300,300,300,300,300,300],'color':['blue','red','blue','red','blue','red','blue','red','blue','red','blue','red','blue','red','blue','red','blue','red','blue','red','blue','red','blue','red'],'discount':[70,80,90,60,40,50,120,110,130,140,80,90,60,40,50,30,70,45,130,100,140,120,90,30],'date':['2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/18/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022','2/17/2022']})
print(data)

data in dataframe:

   vehicle  expecteddate  range color  discount       date
0      car    2/24/2022    240  blue        70  2/18/2022
1      car    2/24/2022    240   red        80  2/18/2022
2      car    3/15/2022    240  blue        90  2/18/2022
3      car    3/15/2022    240   red        60  2/18/2022
4      car    4/20/2022    240  blue        40  2/18/2022
5      car    4/20/2022    240   red        50  2/18/2022
6      bus    2/24/2022    300  blue       120  2/18/2022
7      bus    2/24/2022    300   red       110  2/18/2022
8      bus    3/15/2022    300  blue       130  2/18/2022
9      bus    3/15/2022    300   red       140  2/18/2022
10     bus    4/20/2022    300  blue        80  2/18/2022
11     bus    4/20/2022    300   red        90  2/18/2022
12     car    2/24/2022    240  blue        60  2/17/2022
13     car    2/24/2022    240   red        40  2/17/2022
14     car    3/15/2022    240  blue        50  2/17/2022
15     car    3/15/2022    240   red        30  2/17/2022
16     car    4/20/2022    240  blue        70  2/17/2022
17     car    4/20/2022    240   red        45  2/17/2022
18     bus    2/24/2022    300  blue       130  2/17/2022
19     bus    2/24/2022    300   red       100  2/17/2022
20     bus    3/15/2022    300  blue       140  2/17/2022
21     bus    3/15/2022    300   red       120  2/17/2022
22     bus    4/20/2022    300  blue        90  2/17/2022
23     bus    4/20/2022    300   red        30  2/17/2022

from this dataframe we have two vehicles ,three expecteddates ,range,two colors ,discount and date. we have to find min value in discount and date at which we got min value in discount column ,into seperate two new columns that is mindisc column and mindate and that should be save in new column based on latest date ,this should filtered based on vehicle,expecteddate,range,color and date

we have to find min value in discount column in two dates(all dates as we have many dates not limited to two dates) 2/18/2022,2/17/2022 based on same color,range,expecteddate and vehicle

finally this min to added to mindisc column at latest date and corresponding date at which min date appeared to mindate column at latest date row

output should look like

 country expecteddate  range color  discount       date  mindisc    mindate
0      car    2/24/2022    240  blue        70  2/18/2022   60       2/17/2022
1      car    2/24/2022    240   red        80  2/18/2022   40       2/17/2022
2      car    3/15/2022    240  blue        90  2/18/2022   50       2/17/2022
3      car    3/15/2022    240   red        60  2/18/2022   30       2/17/2022
4      car    4/20/2022    240  blue        40  2/18/2022   40       2/18/2022
5      car    4/20/2022    240   red        50  2/18/2022   45       2/17/2022
6      bus    2/24/2022    300  blue       120  2/18/2022   120      2/18/2022
7      bus    2/24/2022    300   red       110  2/18/2022   100      2/17/2022
8      bus    3/15/2022    300  blue       130  2/18/2022   130      2/18/2022
9      bus    3/15/2022    300   red       140  2/18/2022   120      2/17/2022
10     bus    4/20/2022    300  blue        80  2/18/2022   80       2/18/2022
11     bus    4/20/2022    300   red        90  2/18/2022   30       2/17/2022
12     car    2/24/2022    240  blue        60  2/17/2022
13     car    2/24/2022    240   red        40  2/17/2022
14     car    3/15/2022    240  blue        50  2/17/2022
15     car    3/15/2022    240   red        30  2/17/2022
16     car    4/20/2022    240  blue        70  2/17/2022
17     car    4/20/2022    240   red        45  2/17/2022
18     bus    2/24/2022    300  blue       130  2/17/2022
19     bus    2/24/2022    300   red       100  2/17/2022
20     bus    3/15/2022    300  blue       140  2/17/2022
21     bus    3/15/2022    300   red       120  2/17/2022
22     bus    4/20/2022    300  blue        90  2/17/2022
23     bus    4/20/2022    300   red        30  2/17/2022

vehicles are not limited to two like car and bus ,it has many vehicles and data is not always have equal rows in vehicle and range columns and date is not limited to two dates

Solution

This is my approach. Let's begin by treating the "date" column as a datetime object:

df["date"] = pd.to_datetime(df["date"])

Now, we can group the data as you described to find the rows where the minimum discount is reached:

common_groupby = df.groupby(["color", "range", "expecteddate", "vehicle"])

Now, let's find the rows where the minimum discount and the maximum date happens for each group:

source_idx = common_groupby["discount"].idxmin()
target_idx = common_groupby["date"].idxmax()
# Use df.loc[idx] to see the rows where the minimum discount is reached

Now, we can add the values of the minimum discount and the respective date (from source_idx rows) to the correct rows (from target_src) by using .loc:

df.loc[target_idx, "mindisc"] = df.loc[source_idx, "discount"].values
df.loc[target_idx, "mindate"] = df.loc[source_idx, "date"].values

As you can see, we're only changing the values in the rows where the minimum discount was reached (idx). This is the output of these operations:

   vehicle expecteddate  range color  discount       date  mindisc    mindate
0      car    2/24/2022    240  blue        70 2022-02-18     60.0 2022-02-17
1      car    2/24/2022    240   red        80 2022-02-18     40.0 2022-02-17
2      car    3/15/2022    240  blue        90 2022-02-18     50.0 2022-02-17
3      car    3/15/2022    240   red        60 2022-02-18     30.0 2022-02-17
4      car    4/20/2022    240  blue        40 2022-02-18     40.0 2022-02-18
5      car    4/20/2022    240   red        50 2022-02-18     45.0 2022-02-17
6      bus    2/24/2022    300  blue       120 2022-02-18    120.0 2022-02-18
7      bus    2/24/2022    300   red       110 2022-02-18    100.0 2022-02-17
8      bus    3/15/2022    300  blue       130 2022-02-18    130.0 2022-02-18
9      bus    3/15/2022    300   red       140 2022-02-18    120.0 2022-02-17
10     bus    4/20/2022    300  blue        80 2022-02-18     80.0 2022-02-18
11     bus    4/20/2022    300   red        90 2022-02-18     30.0 2022-02-17
12     car    2/24/2022    240  blue        60 2022-02-17      NaN        NaT
13     car    2/24/2022    240   red        40 2022-02-17      NaN        NaT
14     car    3/15/2022    240  blue        50 2022-02-17      NaN        NaT
15     car    3/15/2022    240   red        30 2022-02-17      NaN        NaT
16     car    4/20/2022    240  blue        70 2022-02-17      NaN        NaT
17     car    4/20/2022    240   red        45 2022-02-17      NaN        NaT
18     bus    2/24/2022    300  blue       130 2022-02-17      NaN        NaT
19     bus    2/24/2022    300   red       100 2022-02-17      NaN        NaT
20     bus    3/15/2022    300  blue       140 2022-02-17      NaN        NaT
21     bus    3/15/2022    300   red       120 2022-02-17      NaN        NaT
22     bus    4/20/2022    300  blue        90 2022-02-17      NaN        NaT
23     bus    4/20/2022    300   red        30 2022-02-17      NaN        NaT

Another possible solution trying to improve perormance by sorting and then droping duplicates twice (inspired by this question: Select the max row per group - pandas performance issue):

# Min discount rows
source_df = df.sort_values(by=["discount"], ascending=True, kind='mergesort').drop_duplicates(["color", "range", "expecteddate", "vehicle"])
source_df = source_df.rename(columns={"date": "mindate", "discount": "mindisc"})

# Max date rows
target_df = df.reset_index().sort_values(by=["date"], ascending=False, kind="mergesort").drop_duplicates(["color", "range", "expecteddate", "vehicle"])

# Put min discount values into max date rows
df.loc[target_df["index"], ["mindisc", "mindate"]] = source_df[["mindisc", "mindate"]].values

Answered By - aaossa

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, February 28, 2022

[FIXED] filtering data from data frame with complex criteria in pandas python

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels