Issue
I am now using Pandas to handle some data. After I used group by
in pandas, the simplified DataFrame's format is [MMSI(Vessel_ID), BaseTime, Location, Speed, Course,...].
I use
for MMSI, group in grouped_df:
print(MMSI)
print(group)
to print the data.
For example, one group of data is:
MMSI BaseDateTime LAT LON SOG COG
1507 538007509.0 2022-12-08T00:02:25 49.29104 -123.19135 0.0 9.6
1508 538007509.0 2022-12-08T00:05:25 49.29102 -123.19138 0.0 9.6
I want to add a column which is the time difference of two points.
Below is the Output I want
MMSI BaseDateTime LAT LON SOG COG Time-diff
1507 538007509.0 2022-12-08T00:02:25 49.29104 -123.19135 0.0 9.6 0.05(hours)
1508 538007509.0 2022-12-08T00:05:25 49.29102 -123.19138 0.0 9.6 Na
So I use the code below to try to get the result:
for MMSI, group in grouped_df:
group = group.sort_values(by='BaseDateTime')
group['new-time'] = group.shift(-1)['BaseDateTime']
group.dropna()
for x in group.index:
group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference
group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum()
# When Time-diff >= 2hours split them into different group
I can use print to show group['GROUP'] and group['time-diff']. The result is not shown after I tried to visit grouped_df again. There's a warning showing that my group
in grouped_df
is just a copy of a slice from a DataFrame and it recommend me using .loc[row_indexer,col_indexer] = value
instead. But in this case I don't know how to use .loc
to visit the specific [row,col].
At the very beginning, I tried to use
grouped_df['new-time'] = grouped_df.shift(-1)['BaseDateTime']
grouped_df.dropna()
But it shows
'DataFrameGroupBy' object does not support item assignment
Now my solution is create an empty_df and then concatenate the groups in grouped
_df
step by step like this:
df['time-diff'] = pd.Series(dtype='float64')
df['GROUP'] = pd.Series(dtype='int')
grouped_df = df.groupby('MMSI')
for MMSI, group in grouped_df:
# ... as the same as the code above
group = group.sort_values(by='BaseDateTime')
group['new-time'] = group.shift(-1)['BaseDateTime']
group.dropna()
for x in group.index:
group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference
group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum()
# ... as the same as the code above
frame = [empty_df, group]
empty_df = pd.concat(frames)
I am not satisfied with this solution but I didn't find a proper way to change the value in grouped_df
.
I'm now trying to use the solution from this question to handle the DataFrame before group by.
Can someone help me?
Solution
Don't use a loop, directly go with groupby.shift
:
s = pd.to_datetime(df['BaseDateTime'])
df['Time-diff'] = (s.groupby(df['MMSI']).shift(-1)
.sub(s).dt.total_seconds().div(3600)
)
Or groupby.diff
:
s = pd.to_datetime(df['BaseDateTime'])
df['Time-diff'] = (s.groupby(df['MMSI']).diff(-1)
.mul(-1).dt.total_seconds().div(3600)
)
Output:
MMSI BaseDateTime LAT LON SOG COG Time-diff
1507 538007509.0 2022-12-08T00:02:25 49.29104 -123.19135 0.0 9.6 0.05
1508 538007509.0 2022-12-08T00:05:25 49.29102 -123.19138 0.0 9.6 NaN
1509 538007510.0 2022-12-08T00:02:25 49.29104 -123.19135 0.0 9.6 0.05
1510 538007510.0 2022-12-08T00:05:25 49.29102 -123.19138 0.0 9.6 NaN
1511 538007511.0 2022-12-08T00:02:25 49.29104 -123.19135 0.0 9.6 0.05
1523 538007511.0 2022-12-08T00:05:25 49.29102 -123.19138 0.0 9.6 NaN
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.