Issue
In Python, I am using the following code to scale several band values grouped by year (extracted from the "date" field) and field_id:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def normalise_by_year(df, range):
    df['date'] = pd.to_datetime(df['date'], dayfirst=True)
    df['year'] = df['date'].dt.year
    normalised_df = df.copy()
    groups = df.groupby(['field_id', 'year'])
    scaler = MinMaxScaler()
    column_range = list(df.columns[1:range])  # band columns between 'date' and 'field_id'
    for column in column_range:
        column_values = groups[column].transform(lambda x: x.values)
        if column_values.notnull().any():  # skip columns with no non-null values
            normalised_values = scaler.fit_transform(column_values.values.reshape(-1, 1)).flatten()
            normalised_df[column] = normalised_values
    return normalised_df
This same function is applied to three dataframes coming from different satellites, with years ranging from 2017 to 2023 and hundreds of unique field_ids. I expected the values in each column to be rescaled independently to the range (0, 1) for each field_id and each year. So, for field_id AL145 in year 2022, where the minimum NDVI value is 0.569389457 and the maximum is 0.850761894, I expected these to become 0 and 1 respectively, with everything else scaled in between.
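That expectation is easy to check by hand, since min-max scaling is just (x - min) / (max - min); a quick sketch using the two extremes quoted above:

    # Min-max scaling maps the group minimum to 0 and the group maximum to 1
    lo, hi = 0.569389457, 0.850761894
    for x in (lo, 0.650033467, hi):
        print((x - lo) / (hi - lo))  # 0.0, ~0.2866, 1.0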
This is not what I am observing. My original data for exactly this case looks like this (NDVI is just one of the values to be normalised; each df has at least 14 columns of raw values):
date NDVI field_id
03/03/2022 0.650033467 AL145
12/03/2022 0.569389457 AL145
28/03/2022 0.602192985 AL145
04/04/2022 0.606251563 AL145
20/04/2022 0.78551144 AL145
22/05/2022 0.850363019 AL145
31/05/2022 0.850761894 AL145
16/06/2022 0.620308877 AL145
23/06/2022 0.651676257 AL145
02/07/2022 0.688994517 AL145
18/07/2022 0.656786687 AL145
25/07/2022 0.734503793 AL145
The scaled data looks like this:
date NDVI field_id
03/03/2022 0.341115446 AL145
12/03/2022 0.326905252 AL145
28/03/2022 0.332685526 AL145
04/04/2022 0.333400684 AL145
20/04/2022 0.364987875 AL145
22/05/2022 0.376415302 AL145
31/05/2022 0.376485588 AL145
16/06/2022 0.335877708 AL145
23/06/2022 0.341404921 AL145
02/07/2022 0.347980731 AL145
18/07/2022 0.342305424 AL145
25/07/2022 0.355999872 AL145
I am a beginner in Python and assume this comes from a poor understanding of the scaler, loops, grouping functions and the like. Is something in my function not grouping correctly, or are the columns in my range perhaps not being isolated from one another? Should the scaler be re-initialised every time I apply the function to a different column, even though it is defined within the function?
What could explain this output?
Solution
The issue did indeed have to do with how the columns were grouped for transformation. The lambda passed to transform simply returned each group's raw values unchanged, so the scaler was then fitted on the entire column at once, across all field_ids and years, rather than within each group.
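That no-op is easy to see in isolation. A minimal sketch with toy values (hypothetical names, not the real data):

    import pandas as pd

    df = pd.DataFrame({'field_id': ['A', 'A', 'B', 'B'],
                       'NDVI': [0.2, 0.4, 0.6, 0.8]})
    # transform(lambda x: x.values) hands each group back unchanged,
    # so the result is identical to the original column:
    print(df.groupby('field_id')['NDVI'].transform(lambda x: x.values).tolist())
    # [0.2, 0.4, 0.6, 0.8]

Fitting the scaler on that result therefore uses the global minimum and maximum of the whole column, which is why every group ends up squeezed into a narrow sub-range instead of spanning (0, 1). With that in mind, this is the corrected function: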
def normalise_columns_byYear(df, range):
    df['date'] = pd.to_datetime(df['date'], dayfirst=True)
    df['year'] = df['date'].dt.year
    normalised_df = df.copy()
    groups = df.groupby(['field_id', 'year'])
    scaler = MinMaxScaler()
    column_range = list(df.columns[1:range])
    for column in column_range:
        # fit_transform now runs inside transform, i.e. once per (field_id, year) group
        column_values = groups[column].transform(
            lambda x: scaler.fit_transform(x.values.reshape(-1, 1)).flatten()
            if x.notnull().any() else x
        )
        normalised_df[column] = column_values
    return normalised_df
Here the scaler is fitted and applied inside the transform call, so each column is rescaled within each (field_id, year) group rather than across the whole column. Re-initialising the scaler between columns is not necessary, because fit_transform refits it on every call.
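As a quick sanity check, here is a toy call (made-up values; it assumes the real layout of 'date' first and 'field_id' last, with the integer argument marking the end of the band columns):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({
        'date': ['03/03/2022', '31/05/2022', '03/03/2023', '31/05/2023'],
        'NDVI': [0.65, 0.85, 0.40, 0.90],
        'field_id': ['AL145'] * 4,
    })
    out = normalise_columns_byYear(df, 2)  # df.columns[1:2] -> ['NDVI']
    print(out.groupby(['field_id', 'year'])['NDVI'].agg(['min', 'max']))
    # min is 0.0 and max is 1.0 within every (field_id, year) group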
Answered By - Barbara Perez de Araújo