Issue
I have a dataframe:
sensor_id | timestamp | current | apparent_power |
---|---|---|---|
0 | 2024-02-01 00:00:00.944369920 | 1.550 | 101.5 |
1 | 2024-02-01 00:00:00.959425024 | 0.284 | 20.4 |
0 | 2024-02-01 00:00:01.945563136 | 1.549 | 101.6 |
1 | 2024-02-01 00:00:01.950393856 | 0.286 | 20.2 |
0 | 2024-02-01 00:00:02.944880896 | 1.547 | 100.6 |
1 | 2024-02-01 00:00:02.949997056 | 0.290 | 21.6 |
0 | 2024-02-01 00:00:03.944605184 | 1.547 | 100.9 |
1 | 2024-02-01 00:00:03.949342976 | 0.290 | 21.8 |
0 | 2024-02-01 00:00:04.944967936 | 1.548 | 100.9 |
1 | 2024-02-01 00:00:04.950715136 | 0.285 | 20.5 |
0 | 2024-02-01 00:00:05.944571904 | 1.547 | 100.7 |
1 | 2024-02-01 00:00:05.952841984 | 0.284 | 20.2 |
0 | 2024-02-01 00:00:06.945287936 | 1.548 | 100.8 |
1 | 2024-02-01 00:00:06.950289920 | 0.287 | 20.8 |
0 | 2024-02-01 00:00:07.944841984 | 1.547 | 100.7 |
1 | 2024-02-01 00:00:07.962969088 | 0.290 | 21.8 |
0 | 2024-02-01 00:00:08.945434112 | 1.550 | 103.1 |
1 | 2024-02-01 00:00:08.950361088 | 0.281 | 20.1 |
0 | 2024-02-01 00:00:09.944396032 | 1.551 | 103.8 |
1 | 2024-02-01 00:00:09.949807872 | 0.285 | 21.1 |
My end aim is to push each sensor 1 row into the matching sensor 0 row, summing the two readings to create a per-second total:
timestamp_0 | timestamp_1 | total_current | total_apparent_power |
---|---|---|---|
2024-02-01 00:00:00.944369920 | 2024-02-01 00:00:00.959425024 | 1.834 | 121.9 |
2024-02-01 00:00:01.945563136 | 2024-02-01 00:00:01.950393856 | 1.835 | 121.8 |
2024-02-01 00:00:02.944880896 | 2024-02-01 00:00:02.949997056 | xxx | xxx |
2024-02-01 00:00:03.944605184 | 2024-02-01 00:00:03.949342976 | xxx | xxx |
2024-02-01 00:00:04.944967936 | 2024-02-01 00:00:04.950715136 | xxx | xxx |
2024-02-01 00:00:05.944571904 | 2024-02-01 00:00:05.952841984 | xxx | xxx |
2024-02-01 00:00:06.945287936 | 2024-02-01 00:00:06.950289920 | xxx | xxx |
2024-02-01 00:00:07.944841984 | 2024-02-01 00:00:07.962969088 | xxx | xxx |
2024-02-01 00:00:08.945434112 | 2024-02-01 00:00:08.950361088 | xxx | xxx |
2024-02-01 00:00:09.944396032 | 2024-02-01 00:00:09.949807872 | xxx | xxx |
I have taken the timestamps and added second and hour columns to produce:
sensor_id | timestamp | current | apparent_power | seconds | hour |
---|---|---|---|---|---|
0 | 2024-02-01 00:00:00.944369920 | 1.550 | 101.5 | 0 | 0 |
1 | 2024-02-01 00:00:00.959425024 | 0.284 | 20.4 | 0 | 0 |
0 | 2024-02-01 00:00:01.945563136 | 1.549 | 101.6 | 1 | 0 |
1 | 2024-02-01 00:00:01.950393856 | 0.286 | 20.2 | 1 | 0 |
0 | 2024-02-01 00:00:02.944880896 | 1.547 | 100.6 | 2 | 0 |
1 | 2024-02-01 00:00:02.949997056 | 0.290 | 21.6 | 2 | 0 |
0 | 2024-02-01 00:00:03.944605184 | 1.547 | 100.9 | 3 | 0 |
1 | 2024-02-01 00:00:03.949342976 | 0.290 | 21.8 | 3 | 0 |
0 | 2024-02-01 00:00:04.944967936 | 1.548 | 100.9 | 4 | 0 |
1 | 2024-02-01 00:00:04.950715136 | 0.285 | 20.5 | 4 | 0 |
0 | 2024-02-01 00:00:05.944571904 | 1.547 | 100.7 | 5 | 0 |
1 | 2024-02-01 00:00:05.952841984 | 0.284 | 20.2 | 5 | 0 |
0 | 2024-02-01 00:00:06.945287936 | 1.548 | 100.8 | 6 | 0 |
1 | 2024-02-01 00:00:06.950289920 | 0.287 | 20.8 | 6 | 0 |
0 | 2024-02-01 00:00:07.944841984 | 1.547 | 100.7 | 7 | 0 |
1 | 2024-02-01 00:00:07.962969088 | 0.290 | 21.8 | 7 | 0 |
0 | 2024-02-01 00:00:08.945434112 | 1.550 | 103.1 | 8 | 0 |
1 | 2024-02-01 00:00:08.950361088 | 0.281 | 20.1 | 8 | 0 |
0 | 2024-02-01 00:00:09.944396032 | 1.551 | 103.8 | 9 | 0 |
1 | 2024-02-01 00:00:09.949807872 | 0.285 | 21.1 | 9 | 0 |
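For reference, those two helper columns can be produced straight from the timestamp with pandas' `.dt` accessor; a minimal sketch (the frame name `test_file_df` is taken from the code below):

```python
import pandas as pd

# Two sample rows standing in for the full file
test_file_df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-02-01 00:00:00.944369920",
        "2024-02-01 00:00:01.950393856",
    ])
})

# Extract the second-of-minute and hour-of-day components
ts = pd.to_datetime(test_file_df["timestamp"])
test_file_df["seconds"] = ts.dt.second
test_file_df["hour"] = ts.dt.hour
```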
I have something that starts to do it, but falls over if the data has a missing sensor reading:
second_tracker = 0
a_row = 0
b_row = 0
for i, row in test_file_df.iterrows():
    if row['seconds'] != second_tracker:
        second_tracker += 1
        # store totals from the previous second
        a_cur = a_row['current']
        b_cur = b_row['current']
        total_current = a_cur + b_cur
        a_app_power = a_row['apparent_power']
        b_app_power = b_row['apparent_power']
        total_app_power = a_app_power + b_app_power
        new_row = {'timestamp_0': a_row['timestamp'],
                   'timestamp_1': b_row['timestamp'],
                   'total_current': total_current,
                   'total_app_power': total_app_power}
        print(new_row)
        power_sum_df.loc[len(power_sum_df)] = new_row
        a_row = 0
        b_row = 0
    if row['sensor_id'] == 0:
        a_row = row
    else:
        b_row = row
It stops when there is no matching sensor row because a reading was lost - i.e. the row is sensor 0 and the next row is also sensor 0, since the intervening sensor 1 reading was dropped.
I have tried splitting the data into two DataFrames and summing row-by-row, but could not work out how to combine the two rows into a new DataFrame, summing only the two numeric columns while carrying both timestamps across.
Can anyone help - even with a better way of pulling the alternate rows into one row? I tried unstack, groupby and agg to no avail.
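One way to make the two-DataFrame idea robust to a lost reading is to key both halves on the whole second and outer-join them, so a missing sensor shows up as NaN instead of shifting the pairing. A hedged sketch with a made-up three-row sample (not the asker's code):

```python
import pandas as pd

# Made-up sample: second 0 has both sensors, second 1 lost its sensor 1 reading
df = pd.DataFrame({
    "sensor_id": [0, 1, 0],
    "timestamp": pd.to_datetime([
        "2024-02-01 00:00:00.944369920", "2024-02-01 00:00:00.959425024",
        "2024-02-01 00:00:01.945563136",
    ]),
    "current": [1.550, 0.284, 1.549],
    "apparent_power": [101.5, 20.4, 101.6],
})

# Key each half on the whole second, then outer-join the halves
key = df["timestamp"].dt.floor("s").rename("sec")
is0 = df["sensor_id"].eq(0)
s0 = df[is0].set_index(key[is0])
s1 = df[~is0].set_index(key[~is0])
merged = s0.join(s1, lsuffix="_0", rsuffix="_1", how="outer")

# Sums stay NaN wherever one sensor was missing, instead of mispairing rows
merged["total_current"] = merged["current_0"] + merged["current_1"]
merged["total_apparent_power"] = merged["apparent_power_0"] + merged["apparent_power_1"]
```

The NaN rows could then be filled from the previous second (e.g. with ffill on the sensor 1 columns), which is essentially what the accepted answer below does.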
Update: the data shows irregular records within a single second:
line no | sensor_id | timestamp | current | apparent_power |
---|---|---|---|---|
4 | 0 | 2024-01-25 00:00:02.977937920 | 1.502 | 370.9 |
5 | 1 | 2024-01-25 00:00:02.983158016 | 0.293 | 72.4 |
6 | 0 | 2024-01-25 00:00:03.977388032 | 1.503 | 371.1 |
7 | 1 | 2024-01-25 00:00:03.995464960 | 0.284 | 70.3 |
8 | 0 | 2024-01-25 00:00:04.978688000 | 1.500 | 370.6 |
9 | 1 | 2024-01-25 00:00:05.025767168 | 0.287 | 70.9 |
10 | 0 | 2024-01-25 00:00:05.980681984 | 1.495 | 369.0 |
11 | 1 | 2024-01-25 00:00:06.031785984 | 0.286 | 70.7 |
12 | 0 | 2024-01-25 00:00:06.977396992 | 1.497 | 369.7 |
13 | 1 | 2024-01-25 00:00:06.984870912 | 0.280 | 69.3 |
14 | 0 | 2024-01-25 00:00:07.979064832 | 1.495 | 369.1 |
15 | 1 | 2024-01-25 00:00:07.983921920 | 0.282 | 69.6 |
Desired result: where there are more than two records in a single second, always take one sensor 0 and one sensor 1 reading. If there is only a single sensor reading within the second, use the corresponding sensor value from the previous second for the sum.
lines joined | sensor_0 | sensor_1 | timestamp_0 | timestamp_1 | total_current | total_apparent_power | Comment |
---|---|---|---|---|---|---|---|
4 & 5 | 0 | 1 | 2024-01-25 00:00:02.977937920 | 2024-01-25 00:00:02.983158016 | 1.795 | 443.2 | Normal |
6 & 7 | 0 | 1 | 2024-01-25 00:00:03.977388032 | 2024-01-25 00:00:03.995464960 | 1.787 | 441.4 | Normal |
8 & 7 | 0 | 1 | 2024-01-25 00:00:04.978688000 | use line 7 sensor 1 as no sensor 1 in this second | 1.784 | 440.9 | use previous sensor 1 |
10 & 9 | 0 | 1 | 2024-01-25 00:00:05.980681984 | 2024-01-25 00:00:05.025767168 | 1.782 | 439.9 | swap so ts_0 is sensor_0 |
12 & 13 | 0 | 1 | 2024-01-25 00:00:06.977396992 | 2024-01-25 00:00:06.984870912 | 1.777 | 439.0 | use last sens0, and last sens1 of second |
14 & 15 | 0 | 1 | 2024-01-25 00:00:07.979064832 | 2024-01-25 00:00:07.983921920 | 1.777 | 438.7 | Normal |
I have also modified the groupby to use hour, minute and second to get the grouping I am after.
ts = pd.to_datetime(test_file_df["timestamp"])
power_agg_df = (test_file_df
    .groupby([ts.dt.hour, ts.dt.minute, ts.dt.second], sort=False)
    .agg(**{
        "sensor_0": ("sensor_id", "first"),
        "sensor_1": ("sensor_id", "last"),
        "timestamp_0": ("timestamp", "first"),
        "timestamp_1": ("timestamp", "last"),
        "total_current": ("current", "sum"),
        "total_apparent_power": ("apparent_power", "sum"),
    }))
Solution
Instead of a loop, you can use the (hour/second) as a grouper to agg the operations you need:
ts = pd.to_datetime(df["timestamp"])

grp = {"hour": ts.dt.hour, "second": ts.dt.second}  # + more?
# grp = {"time": ts.dt.floor("s").dt.time}  # all three components of the time

# 1) reduce the groups that have more than two readings from the same sensor
uniq = (df.assign(**grp).sort_values([*grp, "sensor_id"])
          .loc[lambda x: ~x.duplicated(subset=[*grp, "sensor_id"], keep="last")])

# 2) look for the previous sensors to complete the groups that have a single one
miss = df.groupby(list(grp.values()))["sensor_id"].transform("nunique").eq(1)
exte = (uniq.mask(miss | uniq["sensor_id"].eq(uniq["sensor_id"].shift(-1)))
            .ffill().loc[miss].convert_dtypes())  # .assign(timestamp=None))

# 3) swap the sensors in the groups that have a leading sensor 1
cln = (pd.concat([uniq, exte if not exte.empty else None])
         .sort_values(["sensor_id", *grp]))

out = (
    cln.astype({"line no": "str"})
       .groupby(list(grp.values()))
       .agg(**{
           # feel free to comment out (#) any agg
           "lines_joined": ("line no", " & ".join),
           "sensor_0": ("sensor_id", "first"),
           "sensor_1": ("sensor_id", "last"),
           "timestamp_0": ("timestamp", "first"),
           "timestamp_1": ("timestamp", "last"),
           "total_current": ("current", "sum"),
           "total_apparent_power": ("apparent_power", "sum"),
       })
       .reset_index(names=[*grp])  # to show the group keys
    # .reset_index(drop=True)      # or just drop them if not needed
)
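To see what step #1 does in isolation, here is a minimal sketch with hypothetical data: within a second that has duplicate readings from the same sensor, only the last reading per (second, sensor_id) pair is kept.

```python
import pandas as pd

# Hypothetical second containing a duplicated sensor 0 reading
df = pd.DataFrame({
    "sensor_id": [0, 0, 1],
    "timestamp": pd.to_datetime([
        "2024-01-25 00:00:06.100", "2024-01-25 00:00:06.900",
        "2024-01-25 00:00:06.950",
    ]),
})
sec = df["timestamp"].dt.second

# Keep only the last reading per (second, sensor) pair, as in step #1
uniq = df.loc[~df.assign(second=sec)
                 .duplicated(subset=["second", "sensor_id"], keep="last")]
```

Here the first sensor 0 row is dropped and the later sensor 0 and sensor 1 rows survive.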
Answered By - Timeless