Saturday, December 30, 2023

[FIXED] Calculate sum based on another data frame

Labels: date, numpy, pandas, python-3.x

Issue

First data frame (df):

start	end
6:15:00	6:15:06
6:15:00	6:15:00
6:15:00	6:15:01
6:15:01	6:15:06
6:15:01	6:15:15
6:15:01	6:15:09
6:15:01	6:15:09
6:15:02	6:15:06
6:15:02	6:15:08
6:15:02	6:15:09

df2:

periods (1 sec timedelta)	total
6:15:00	3
6:15:01	6
6:15:02	8

How can I calculate total in df2 without using loops? For each period in df2, total should be the number of rows in df for which start <= period and end >= period. For example, for the period 6:15:02 in df2 (x in the code):

import pandas as pd
from datetime import datetime, timedelta

# The period to test against: 06:15:02
x = datetime(year=2023, month=10, day=6, hour=6, minute=15, second=2).time()

df = pd.DataFrame({'start': ['6:15:00', '6:15:00', '6:15:00', '6:15:01', '6:15:01', '6:15:01', '6:15:01', '6:15:02', '6:15:02', '6:15:02'],
              'end': ['6:15:06', '6:15:00', '6:15:01', '6:15:06', '6:15:15', '6:15:09', '6:15:09', '6:15:06', '6:15:08', '6:15:09']})

df['start'] = pd.to_datetime(df['start'], yearfirst=True).dt.time
df['end'] = pd.to_datetime(df['end'], yearfirst=True).dt.time

start = datetime(year=2023,month=10, day=6, hour=6, minute=15, second=0)
end = datetime(year=2023,month=10, day=6, hour=6, minute=15, second=2)
df2 = pd.DataFrame({"periods (1 sec timedelta)": pd.date_range(start=start, end=end, freq=timedelta(seconds=1)), "total": None})
df2["periods (1 sec timedelta)"] = pd.to_datetime(df2["periods (1 sec timedelta)"], yearfirst=True).dt.time

total = len(df[(df['start'] <= x) & (df['end'] >= x)])

Here total = 8. Computing this separately for each row in df2 takes a lot of time. Is there a more efficient way?
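For reference, the per-period count described above can be written as a small baseline. This is only a sketch under stated assumptions: `count_overlaps` is a hypothetical helper name, and the times are parsed with `pd.to_timedelta` rather than `datetime.time` for convenience. It scans all of df once per period, which is why it slows down as df2 grows.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'start': ['6:15:00', '6:15:00', '6:15:00', '6:15:01', '6:15:01',
              '6:15:01', '6:15:01', '6:15:02', '6:15:02', '6:15:02'],
    'end': ['6:15:06', '6:15:00', '6:15:01', '6:15:06', '6:15:15',
            '6:15:09', '6:15:09', '6:15:06', '6:15:08', '6:15:09'],
})

# Assumption: times are kept as timedeltas since midnight (not datetime.time).
start = pd.to_timedelta(df['start'])
end = pd.to_timedelta(df['end'])
periods = pd.Series(pd.to_timedelta(['6:15:00', '6:15:01', '6:15:02']))


def count_overlaps(period):
    """Hypothetical helper: count rows with start <= period <= end."""
    return int(((start <= period) & (end >= period)).sum())


# One full pass over df per period: O(len(df) * len(periods)) comparisons.
totals = [count_overlaps(p) for p in periods]
print(totals)
```

Each call is vectorized over df, but the outer loop over periods remains, so the cost still grows with the product of the two lengths.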

Solution

Parse the time-like columns into timedelta arrays, then call a function that returns, for each period, the number of rows satisfying the given condition. The trick here is to compile the function with numba to achieve C-like speed. Since your data frames are big, numba is a good fit for both speed and memory use.

from numba import njit

# Compiled generator: for each period, count the rows whose
# [start, end] interval contains it.
@njit
def func(period, start, end):
    for p in period:
        mask = (start <= p) & (end >= p)
        yield sum(mask)


def to_timedelta_arr(col):
    # datetime.time values -> strings -> timedelta64 NumPy array
    return pd.to_timedelta(col.astype(str)).to_numpy()


df2['total'] = list(func(to_timedelta_arr(df2['periods (1 sec timedelta)']),
                         to_timedelta_arr(df['start']), to_timedelta_arr(df['end'])))

  periods (1 sec timedelta)  total
0                  06:15:00      3
1                  06:15:01      6
2                  06:15:02      8
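
If numba is not available, the same counts can be sketched with plain NumPy. This is not part of the answer above, only an illustration under stated assumptions: times are reduced to whole seconds since midnight by a hypothetical `to_seconds` helper, and the second variant additionally assumes start <= end in every row. The first variant uses broadcasting. The second sorts each column once and uses binary search, relying on the identity: rows containing p = (rows with start <= p) - (rows with end < p).

```python
import numpy as np


def to_seconds(times):
    """Hypothetical helper: 'H:MM:SS' strings -> int64 seconds since midnight."""
    return np.array([int(h) * 3600 + int(m) * 60 + int(s)
                     for h, m, s in (t.split(':') for t in times)],
                    dtype=np.int64)


# Same sample data as in the question.
start = to_seconds(['6:15:00', '6:15:00', '6:15:00', '6:15:01', '6:15:01',
                    '6:15:01', '6:15:01', '6:15:02', '6:15:02', '6:15:02'])
end = to_seconds(['6:15:06', '6:15:00', '6:15:01', '6:15:06', '6:15:15',
                  '6:15:09', '6:15:09', '6:15:06', '6:15:08', '6:15:09'])
periods = to_seconds(['6:15:00', '6:15:01', '6:15:02'])

# Variant 1: broadcasting. Builds a (len(periods), len(start)) boolean matrix,
# so memory grows with the product of the two lengths.
mask = (start[None, :] <= periods[:, None]) & (end[None, :] >= periods[:, None])
totals_broadcast = mask.sum(axis=1)

# Variant 2: sort once, then binary search per period.
# Assumes start <= end in every row.
started = np.searchsorted(np.sort(start), periods, side='right')   # start <= p
finished = np.searchsorted(np.sort(end), periods, side='left')     # end < p
totals_sorted = started - finished

print(totals_broadcast.tolist())
print(totals_sorted.tolist())
```

The broadcasting variant is the simplest to read but can exhaust memory on large inputs. The sorted variant needs only O((n + m) log n) work and no intermediate matrix. With the data frames from the question, the arrays returned by to_timedelta_arr should be usable in place of the integer arrays, since NumPy can compare and sort timedelta64 values directly.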

Answered By - Shubham Sharma

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
