Issue
First data frame (df):
start | end |
---|---|
6:15:00 | 6:15:06 |
6:15:00 | 6:15:00 |
6:15:00 | 6:15:01 |
6:15:01 | 6:15:06 |
6:15:01 | 6:15:15 |
6:15:01 | 6:15:09 |
6:15:01 | 6:15:09 |
6:15:02 | 6:15:06 |
6:15:02 | 6:15:08 |
6:15:02 | 6:15:09 |
Second data frame (df2):
periods (1 sec timedelta) | total |
---|---|
6:15:00 | 3 |
6:15:01 | 6 |
6:15:02 | 8 |
How can total in df2 be calculated without using loops? For each period in df2, total is the number of rows in df for which start <= period and end >= period. For example, for the period 6:15:02 in df2 (x in the code below):
import pandas as pd
from datetime import datetime, timedelta

# Period to test: 6:15:02
x = datetime(year=2023, month=10, day=6, hour=6, minute=15, second=2).time()

df = pd.DataFrame({'start': ['6:15:00', '6:15:00', '6:15:00', '6:15:01', '6:15:01', '6:15:01', '6:15:01', '6:15:02', '6:15:02', '6:15:02'],
                   'end': ['6:15:06', '6:15:00', '6:15:01', '6:15:06', '6:15:15', '6:15:09', '6:15:09', '6:15:06', '6:15:08', '6:15:09']})
# Keep only the time-of-day part of each column
df['start'] = pd.to_datetime(df['start'], yearfirst=True).dt.time
df['end'] = pd.to_datetime(df['end'], yearfirst=True).dt.time

# df2 holds one row per second between start and end (inclusive)
start = datetime(year=2023, month=10, day=6, hour=6, minute=15, second=0)
end = datetime(year=2023, month=10, day=6, hour=6, minute=15, second=2)
df2 = pd.DataFrame({"periods (1 sec timedelta)": pd.date_range(start=start, end=end, freq=timedelta(seconds=1)), "total": None})
df2["periods (1 sec timedelta)"] = pd.to_datetime(df2["periods (1 sec timedelta)"], yearfirst=True).dt.time

# Count the rows of df whose [start, end] interval contains x
total = len(df[(df['start'] <= x) & (df['end'] >= x)])
This gives total = 8. Repeating this calculation for every row in df2 takes a long time. Is there a more efficient way?
Solution
Parse the time-like columns into timedelta arrays, then call a function that returns, for each period, the count of rows satisfying the given condition. The trick here is to compile the function with numba to achieve C-like speed. The loop builds only one boolean mask per period, so memory use stays proportional to the number of rows in df, which makes this approach a good fit for large data frames.
from numba import njit

@njit
def func(period, start, end):
    # For each period, count the rows with start <= period <= end
    for p in period:
        mask = (start <= p) & (end >= p)
        yield sum(mask)

def to_timedelta_arr(col):
    # Convert a column of datetime.time values to a NumPy timedelta64 array
    return pd.to_timedelta(col.astype(str)).to_numpy()

df2['total'] = list(func(to_timedelta_arr(df2['periods (1 sec timedelta)']),
                         to_timedelta_arr(df['start']), to_timedelta_arr(df['end'])))
periods (1 sec timedelta) total
0 06:15:00 3
1 06:15:01 6
2 06:15:02 8
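If numba is not available, the same counts can probably be obtained with plain NumPy. The sketch below is an alternative technique, not the method from the answer above: it sorts the start and end values and uses np.searchsorted, relying on the assumption that start <= end holds in every row, so that the count for a period p equals (rows with start <= p) minus (rows with end < p). The helper name count_active is hypothetical.

```python
import numpy as np
import pandas as pd

def count_active(periods, start, end):
    """For each period p, count rows with start <= p <= end.

    Hypothetical helper; assumes start <= end in every row.
    """
    starts = np.sort(start)
    ends = np.sort(end)
    # Rows that have started by p: start <= p
    begun = np.searchsorted(starts, periods, side="right")
    # Rows that ended strictly before p: end < p
    finished = np.searchsorted(ends, periods, side="left")
    return begun - finished

df = pd.DataFrame({
    "start": ["6:15:00", "6:15:00", "6:15:00", "6:15:01", "6:15:01",
              "6:15:01", "6:15:01", "6:15:02", "6:15:02", "6:15:02"],
    "end": ["6:15:06", "6:15:00", "6:15:01", "6:15:06", "6:15:15",
            "6:15:09", "6:15:09", "6:15:06", "6:15:08", "6:15:09"],
})

# Parse the "h:mm:ss" strings as timedelta64 arrays
start_td = pd.to_timedelta(df["start"]).to_numpy()
end_td = pd.to_timedelta(df["end"]).to_numpy()
periods = pd.to_timedelta(["6:15:00", "6:15:01", "6:15:02"]).to_numpy()

counts = count_active(periods, start_td, end_td)
print(counts.tolist())  # [3, 6, 8]
```

Sorting costs O(n log n) once, and each period is then answered by two binary searches, so no boolean mask per period is needed. If some row had end < start, the subtraction would no longer match the original condition.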
Answered By - Shubham Sharma