Issue
In Python, I am using the following code to scale several band values grouped by year (extracted from the "date" field) and field_id:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def normalise_by_year(df, range):
    df['date'] = pd.to_datetime(df['date'], dayfirst=True)
    df['year'] = df['date'].dt.year
    normalised_df = df.copy()
    groups = df.groupby(['field_id', 'year'])
    scaler = MinMaxScaler()
    column_range = list(df.columns[1:range])  # band columns between 'date' and 'field_id'
    for column in column_range:
        column_values = groups[column].transform(lambda x: x.values)
        if column_values.notnull().any():  # skip columns with no non-null values
            normalised_values = scaler.fit_transform(column_values.values.reshape(-1, 1)).flatten()
            normalised_df[column] = normalised_values
    return normalised_df
This same function is applied to three dataframes coming from different satellites, with years ranging from 2017 to 2023 and hundreds of unique field_ids. I expected the values in each column to be rescaled independently to the range (0, 1) for each field_id and each year. So, for field_id AL145 in year 2022, where the minimum NDVI value is 0.569389457 and the maximum is 0.850761894, I expected these to become 0 and 1 respectively, with everything else scaled in between.
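That expectation is easy to check by hand, since min-max scaling is just (x - min) / (max - min); a quick sketch using the two extremes quoted above:

    # Min-max scaling maps the group minimum to 0 and the group maximum to 1
    lo, hi = 0.569389457, 0.850761894
    for x in (lo, 0.650033467, hi):
        print((x - lo) / (hi - lo))  # 0.0, ~0.2866, 1.0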
This is not what I am observing. My original data for exactly this case looks like this (NDVI is just one of the values to be normalised; each df has at least 14 columns of raw values):
date NDVI field_id
03/03/2022 0.650033467 AL145
12/03/2022 0.569389457 AL145
28/03/2022 0.602192985 AL145
04/04/2022 0.606251563 AL145
20/04/2022 0.78551144 AL145
22/05/2022 0.850363019 AL145
31/05/2022 0.850761894 AL145
16/06/2022 0.620308877 AL145
23/06/2022 0.651676257 AL145
02/07/2022 0.688994517 AL145
18/07/2022 0.656786687 AL145
25/07/2022 0.734503793 AL145
The scaled data looks like this:
date NDVI field_id
03/03/2022 0.341115446 AL145
12/03/2022 0.326905252 AL145
28/03/2022 0.332685526 AL145
04/04/2022 0.333400684 AL145
20/04/2022 0.364987875 AL145
22/05/2022 0.376415302 AL145
31/05/2022 0.376485588 AL145
16/06/2022 0.335877708 AL145
23/06/2022 0.341404921 AL145
02/07/2022 0.347980731 AL145
18/07/2022 0.342305424 AL145
25/07/2022 0.355999872 AL145
I am a beginner in Python and assume this comes from a poor understanding of the scaler, loops, grouping functions and the like. Is something in my function not grouping correctly, or are the columns in my range perhaps not being isolated from one another? Should the scaler be re-initialised every time I apply the function to a different column, even though it is defined within the function?
What could explain this output?
Solution
The issue did indeed have to do with how the columns were grouped for transformation. The lambda passed to transform simply returned each group's raw values unchanged, so the scaler was then fitted on the entire column at once, across all field_ids and years, rather than within each group.
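That no-op is easy to see in isolation. A minimal sketch with toy values (hypothetical names, not the real data):

    import pandas as pd

    df = pd.DataFrame({'field_id': ['A', 'A', 'B', 'B'],
                       'NDVI': [0.2, 0.4, 0.6, 0.8]})
    # transform(lambda x: x.values) hands each group back unchanged,
    # so the result is identical to the original column:
    print(df.groupby('field_id')['NDVI'].transform(lambda x: x.values).tolist())
    # [0.2, 0.4, 0.6, 0.8]

Fitting the scaler on that result therefore uses the global minimum and maximum of the whole column, which is why every group ends up squeezed into a narrow sub-range instead of spanning (0, 1). With that in mind, this is the corrected function: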
def normalise_columns_byYear(df, range):
    df['date'] = pd.to_datetime(df['date'], dayfirst=True)
    df['year'] = df['date'].dt.year
    normalised_df = df.copy()
    groups = df.groupby(['field_id', 'year'])
    scaler = MinMaxScaler()
    column_range = list(df.columns[1:range])
    for column in column_range:
        # fit_transform now runs inside transform, i.e. once per (field_id, year) group
        column_values = groups[column].transform(
            lambda x: scaler.fit_transform(x.values.reshape(-1, 1)).flatten()
            if x.notnull().any() else x
        )
        normalised_df[column] = column_values
    return normalised_df
Here the scaler is fitted and applied inside the transform call, so each column is rescaled within each (field_id, year) group rather than across the whole column. Re-initialising the scaler between columns is not necessary, because fit_transform refits it on every call.
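As a quick sanity check, here is a toy call (made-up values; it assumes the real layout of 'date' first and 'field_id' last, with the integer argument marking the end of the band columns):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({
        'date': ['03/03/2022', '31/05/2022', '03/03/2023', '31/05/2023'],
        'NDVI': [0.65, 0.85, 0.40, 0.90],
        'field_id': ['AL145'] * 4,
    })
    out = normalise_columns_byYear(df, 2)  # df.columns[1:2] -> ['NDVI']
    print(out.groupby(['field_id', 'year'])['NDVI'].agg(['min', 'max']))
    # min is 0.0 and max is 1.0 within every (field_id, year) group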
Answered By - Barbara Perez de Araújo