Issue
I have a dataframe:
sensor_id | timestamp | current | apparent_power |
---|---|---|---|
0 | 2024-02-01 00:00:00.944369920 | 1.550 | 101.5 |
1 | 2024-02-01 00:00:00.959425024 | 0.284 | 20.4 |
0 | 2024-02-01 00:00:01.945563136 | 1.549 | 101.6 |
1 | 2024-02-01 00:00:01.950393856 | 0.286 | 20.2 |
0 | 2024-02-01 00:00:02.944880896 | 1.547 | 100.6 |
1 | 2024-02-01 00:00:02.949997056 | 0.290 | 21.6 |
0 | 2024-02-01 00:00:03.944605184 | 1.547 | 100.9 |
1 | 2024-02-01 00:00:03.949342976 | 0.290 | 21.8 |
0 | 2024-02-01 00:00:04.944967936 | 1.548 | 100.9 |
1 | 2024-02-01 00:00:04.950715136 | 0.285 | 20.5 |
0 | 2024-02-01 00:00:05.944571904 | 1.547 | 100.7 |
1 | 2024-02-01 00:00:05.952841984 | 0.284 | 20.2 |
0 | 2024-02-01 00:00:06.945287936 | 1.548 | 100.8 |
1 | 2024-02-01 00:00:06.950289920 | 0.287 | 20.8 |
0 | 2024-02-01 00:00:07.944841984 | 1.547 | 100.7 |
1 | 2024-02-01 00:00:07.962969088 | 0.290 | 21.8 |
0 | 2024-02-01 00:00:08.945434112 | 1.550 | 103.1 |
1 | 2024-02-01 00:00:08.950361088 | 0.281 | 20.1 |
0 | 2024-02-01 00:00:09.944396032 | 1.551 | 103.8 |
1 | 2024-02-01 00:00:09.949807872 | 0.285 | 21.1 |
My end aim is to push each sensor 1 row into the matching sensor 0 row, summing the two readings to create a per-second total:
timestamp_0 | timestamp_1 | total_current | total_apparent_power |
---|---|---|---|
2024-02-01 00:00:00.944369920 | 2024-02-01 00:00:00.959425024 | 1.834 | 121.9 |
2024-02-01 00:00:01.945563136 | 2024-02-01 00:00:01.950393856 | 1.835 | 121.8 |
2024-02-01 00:00:02.944880896 | 2024-02-01 00:00:02.949997056 | xxx | xxx |
2024-02-01 00:00:03.944605184 | 2024-02-01 00:00:03.949342976 | xxx | xxx |
2024-02-01 00:00:04.944967936 | 2024-02-01 00:00:04.950715136 | xxx | xxx |
2024-02-01 00:00:05.944571904 | 2024-02-01 00:00:05.952841984 | xxx | xxx |
2024-02-01 00:00:06.945287936 | 2024-02-01 00:00:06.950289920 | xxx | xxx |
2024-02-01 00:00:07.944841984 | 2024-02-01 00:00:07.962969088 | xxx | xxx |
2024-02-01 00:00:08.945434112 | 2024-02-01 00:00:08.950361088 | xxx | xxx |
2024-02-01 00:00:09.944396032 | 2024-02-01 00:00:09.949807872 | xxx | xxx |
I have taken the timestamps and added second and hour columns to produce:
sensor_id | timestamp | current | apparent_power | seconds | hour |
---|---|---|---|---|---|
0 | 2024-02-01 00:00:00.944369920 | 1.550 | 101.5 | 0 | 0 |
1 | 2024-02-01 00:00:00.959425024 | 0.284 | 20.4 | 0 | 0 |
0 | 2024-02-01 00:00:01.945563136 | 1.549 | 101.6 | 1 | 0 |
1 | 2024-02-01 00:00:01.950393856 | 0.286 | 20.2 | 1 | 0 |
0 | 2024-02-01 00:00:02.944880896 | 1.547 | 100.6 | 2 | 0 |
1 | 2024-02-01 00:00:02.949997056 | 0.290 | 21.6 | 2 | 0 |
0 | 2024-02-01 00:00:03.944605184 | 1.547 | 100.9 | 3 | 0 |
1 | 2024-02-01 00:00:03.949342976 | 0.290 | 21.8 | 3 | 0 |
0 | 2024-02-01 00:00:04.944967936 | 1.548 | 100.9 | 4 | 0 |
1 | 2024-02-01 00:00:04.950715136 | 0.285 | 20.5 | 4 | 0 |
0 | 2024-02-01 00:00:05.944571904 | 1.547 | 100.7 | 5 | 0 |
1 | 2024-02-01 00:00:05.952841984 | 0.284 | 20.2 | 5 | 0 |
0 | 2024-02-01 00:00:06.945287936 | 1.548 | 100.8 | 6 | 0 |
1 | 2024-02-01 00:00:06.950289920 | 0.287 | 20.8 | 6 | 0 |
0 | 2024-02-01 00:00:07.944841984 | 1.547 | 100.7 | 7 | 0 |
1 | 2024-02-01 00:00:07.962969088 | 0.290 | 21.8 | 7 | 0 |
0 | 2024-02-01 00:00:08.945434112 | 1.550 | 103.1 | 8 | 0 |
1 | 2024-02-01 00:00:08.950361088 | 0.281 | 20.1 | 8 | 0 |
0 | 2024-02-01 00:00:09.944396032 | 1.551 | 103.8 | 9 | 0 |
1 | 2024-02-01 00:00:09.949807872 | 0.285 | 21.1 | 9 | 0 |
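For reference, those two helper columns can be produced straight from the timestamp with pandas' `.dt` accessor; a minimal sketch (the frame name `test_file_df` is taken from the code below):

```python
import pandas as pd

# Two sample rows standing in for the full file
test_file_df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-02-01 00:00:00.944369920",
        "2024-02-01 00:00:01.950393856",
    ])
})

# Extract the second-of-minute and hour-of-day components
ts = pd.to_datetime(test_file_df["timestamp"])
test_file_df["seconds"] = ts.dt.second
test_file_df["hour"] = ts.dt.hour
```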
I have something that starts to do it, but falls over if the data has a missing sensor reading:
second_tracker = 0
a_row = 0
b_row = 0
for i, row in test_file_df.iterrows():
    if row['seconds'] != second_tracker:
        second_tracker += 1
        # store totals from the previous second
        a_cur = a_row['current']
        b_cur = b_row['current']
        total_current = a_cur + b_cur
        a_app_power = a_row['apparent_power']
        b_app_power = b_row['apparent_power']
        total_app_power = a_app_power + b_app_power
        new_row = {'timestamp_0': a_row['timestamp'],
                   'timestamp_1': b_row['timestamp'],
                   'total_current': total_current,
                   'total_app_power': total_app_power}
        print(new_row)
        power_sum_df.loc[len(power_sum_df)] = new_row
        a_row = 0
        b_row = 0
    if row['sensor_id'] == 0:
        a_row = row
    else:
        b_row = row
It stops when there is no matching sensor row because a reading was lost - i.e. the row is sensor 0 and the next row is also sensor 0, since the intervening sensor 1 reading was dropped.
I have tried splitting the data into two DataFrames and summing row-by-row, but could not work out how to combine the two rows into a new DataFrame, summing only the two numeric columns while carrying both timestamps across.
Can anyone help - even with a better way of pulling the alternate rows into one row? I tried unstack, groupby and agg to no avail.
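One way to make the two-DataFrame idea robust to a lost reading is to key both halves on the whole second and outer-join them, so a missing sensor shows up as NaN instead of shifting the pairing. A hedged sketch with a made-up three-row sample (not the asker's code):

```python
import pandas as pd

# Made-up sample: second 0 has both sensors, second 1 lost its sensor 1 reading
df = pd.DataFrame({
    "sensor_id": [0, 1, 0],
    "timestamp": pd.to_datetime([
        "2024-02-01 00:00:00.944369920", "2024-02-01 00:00:00.959425024",
        "2024-02-01 00:00:01.945563136",
    ]),
    "current": [1.550, 0.284, 1.549],
    "apparent_power": [101.5, 20.4, 101.6],
})

# Key each half on the whole second, then outer-join the halves
key = df["timestamp"].dt.floor("s").rename("sec")
is0 = df["sensor_id"].eq(0)
s0 = df[is0].set_index(key[is0])
s1 = df[~is0].set_index(key[~is0])
merged = s0.join(s1, lsuffix="_0", rsuffix="_1", how="outer")

# Sums stay NaN wherever one sensor was missing, instead of mispairing rows
merged["total_current"] = merged["current_0"] + merged["current_1"]
merged["total_apparent_power"] = merged["apparent_power_0"] + merged["apparent_power_1"]
```

The NaN rows could then be filled from the previous second (e.g. with ffill on the sensor 1 columns), which is essentially what the accepted answer below does.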
Update: the data shows irregular records within a single second:
line no | sensor_id | timestamp | current | apparent_power |
---|---|---|---|---|
4 | 0 | 2024-01-25 00:00:02.977937920 | 1.502 | 370.9 |
5 | 1 | 2024-01-25 00:00:02.983158016 | 0.293 | 72.4 |
6 | 0 | 2024-01-25 00:00:03.977388032 | 1.503 | 371.1 |
7 | 1 | 2024-01-25 00:00:03.995464960 | 0.284 | 70.3 |
8 | 0 | 2024-01-25 00:00:04.978688000 | 1.500 | 370.6 |
9 | 1 | 2024-01-25 00:00:05.025767168 | 0.287 | 70.9 |
10 | 0 | 2024-01-25 00:00:05.980681984 | 1.495 | 369.0 |
11 | 1 | 2024-01-25 00:00:06.031785984 | 0.286 | 70.7 |
12 | 0 | 2024-01-25 00:00:06.977396992 | 1.497 | 369.7 |
13 | 1 | 2024-01-25 00:00:06.984870912 | 0.280 | 69.3 |
14 | 0 | 2024-01-25 00:00:07.979064832 | 1.495 | 369.1 |
15 | 1 | 2024-01-25 00:00:07.983921920 | 0.282 | 69.6 |
Desired result: where there are more than two records in a single second, always take one sensor 0 and one sensor 1 reading. If there is only a single sensor reading within the second, use the corresponding sensor value from the previous second for the sum.
lines joined | sensor_0 | sensor_1 | timestamp_0 | timestamp_1 | total_current | total_apparent_power | Comment |
---|---|---|---|---|---|---|---|
4 & 5 | 0 | 1 | 2024-01-25 00:00:02.977937920 | 2024-01-25 00:00:02.983158016 | 1.795 | 443.2 | Normal |
6 & 7 | 0 | 1 | 2024-01-25 00:00:03.977388032 | 2024-01-25 00:00:03.995464960 | 1.787 | 441.4 | Normal |
8 & 7 | 0 | 1 | 2024-01-25 00:00:04.978688000 | use line 7 sensor 1 as no sensor 1 in this second | 1.784 | 440.9 | use previous sensor 1 |
10 & 9 | 0 | 1 | 2024-01-25 00:00:05.980681984 | 2024-01-25 00:00:05.025767168 | 1.782 | 439.9 | swap so ts_0 is sensor_0 |
12 & 13 | 0 | 1 | 2024-01-25 00:00:06.977396992 | 2024-01-25 00:00:06.984870912 | 1.777 | 439.0 | use last sens0, and last sens1 of second |
14 & 15 | 0 | 1 | 2024-01-25 00:00:07.979064832 | 2024-01-25 00:00:07.983921920 | 1.777 | 438.7 | Normal |
I have also modified the groupby to use hour, minute and second to get the grouping I am after.
ts = pd.to_datetime(test_file_df["timestamp"])
power_agg_df = (test_file_df
    .groupby([ts.dt.hour, ts.dt.minute, ts.dt.second], sort=False)
    .agg(**{
        "sensor_0": ("sensor_id", "first"),
        "sensor_1": ("sensor_id", "last"),
        "timestamp_0": ("timestamp", "first"),
        "timestamp_1": ("timestamp", "last"),
        "total_current": ("current", "sum"),
        "total_apparent_power": ("apparent_power", "sum"),
    }))
Solution
Instead of a loop, you can use the (hour/second) as a grouper to agg the operations you need:
ts = pd.to_datetime(df["timestamp"])

grp = {"hour": ts.dt.hour, "second": ts.dt.second}  # + more?
# grp = {"time": ts.dt.floor("s").dt.time}  # all three components of the time

# 1) reduce the groups that have more than two readings from the same sensor
uniq = (df.assign(**grp).sort_values([*grp, "sensor_id"])
          .loc[lambda x: ~x.duplicated(subset=[*grp, "sensor_id"], keep="last")])

# 2) look for the previous sensors to complete the groups that have a single one
miss = df.groupby(list(grp.values()))["sensor_id"].transform("nunique").eq(1)
exte = (uniq.mask(miss | uniq["sensor_id"].eq(uniq["sensor_id"].shift(-1)))
            .ffill().loc[miss].convert_dtypes())  # .assign(timestamp=None))

# 3) swap the sensors in the groups that have a leading sensor 1
cln = (pd.concat([uniq, exte if not exte.empty else None])
         .sort_values(["sensor_id", *grp]))

out = (
    cln.astype({"line no": "str"})
       .groupby(list(grp.values()))
       .agg(**{
           # feel free to comment out (#) any agg
           "lines_joined": ("line no", " & ".join),
           "sensor_0": ("sensor_id", "first"),
           "sensor_1": ("sensor_id", "last"),
           "timestamp_0": ("timestamp", "first"),
           "timestamp_1": ("timestamp", "last"),
           "total_current": ("current", "sum"),
           "total_apparent_power": ("apparent_power", "sum"),
       })
       .reset_index(names=[*grp])  # to show the group keys
    # .reset_index(drop=True)      # or just drop them if not needed
)
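To see what step #1 does in isolation, here is a minimal sketch with hypothetical data: within a second that has duplicate readings from the same sensor, only the last reading per (second, sensor_id) pair is kept.

```python
import pandas as pd

# Hypothetical second containing a duplicated sensor 0 reading
df = pd.DataFrame({
    "sensor_id": [0, 0, 1],
    "timestamp": pd.to_datetime([
        "2024-01-25 00:00:06.100", "2024-01-25 00:00:06.900",
        "2024-01-25 00:00:06.950",
    ]),
})
sec = df["timestamp"].dt.second

# Keep only the last reading per (second, sensor) pair, as in step #1
uniq = df.loc[~df.assign(second=sec)
                 .duplicated(subset=["second", "sensor_id"], keep="last")]
```

Here the first sensor 0 row is dropped and the later sensor 0 and sensor 1 rows survive.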
Answered By - Timeless