Saturday, December 30, 2023

[FIXED] Data imputation using mean method does the same as data imputation with kNN

dataframe, imputation, pandas, python, scikit-learn

Issue

I am using the Parkinson's disease dataset at https://physionet.org/content/gaitpdb/1.0.0/ for a university project and wanted to demonstrate two data imputation options in Python: mean and kNN. My professor recommended the Impyute library, but I opted for a different approach because dataframes no longer have the to_matrix method (as far as I could find). Instead I used scikit-learn and pandas, resulting in this code:

import pandas as pd

df = pd.read_csv('dataset/demographics.txt', sep='\t')
imputed_data_mean = df.copy()
for column in df.columns:
    # check if column is numeric
    if imputed_data_mean[column].dtype != 'object':
        imputed_data_mean[column] = imputed_data_mean[column].fillna(imputed_data_mean[column].mean())
print(imputed_data_mean)
imputed_data_mean.to_csv('imputed_data_mean.csv', index=False)

from sklearn.impute import KNNImputer

# re-read the original file so the kNN run starts from untouched data
df = pd.read_csv('dataset/demographics.txt', sep='\t')
imputed_data_knn = df.copy()

imputer = KNNImputer(n_neighbors=3)
for column in df.columns:
    # check if column is numeric
    if df[column].dtype != 'object':
        imputed_data_knn[column] = imputer.fit_transform(imputed_data_knn[[column]])
print(imputed_data_knn)
imputed_data_knn.to_csv('imputed_data_knn.csv', index=False)

Now I strongly suspect that I am not doing this correctly, because both dataframes are exactly the same. I even re-imported the original dataframe in case it had been changed and I had overlooked it. Changing the number of neighbours does not change the result of imputed_data_knn either. Is the result the same just by chance, or am I missing something?

I expected kNN to return somewhat better results here, but now I am just confused.

Solution

The problem is here:

imputed_data_knn[column] = imputer.fit_transform(imputed_data_knn[[column]])

When you run fit_transform this way, you pass only one column and ask the imputer to fill the NaN values of that same column. For example, you are trying to impute the missing values in the "Weight" column using nothing but the "Weight" column. A row with a missing weight then has no other feature from which to measure its distance to the other rows, so KNNImputer falls back to the column mean. That is why both results are identical and why changing n_neighbors has no effect.

The key point is that you don't need to fit the imputer column by column. Fit it once on the whole dataframe, and it fills the NaN values in all columns.
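The difference between the per-column call and a single call on the whole dataframe can be reproduced on a tiny example. This is only a sketch: the toy values below are made up for illustration and are not taken from the PhysioNet data, and n_neighbors=2 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data, invented for illustration (not the real dataset).
toy = pd.DataFrame({
    "Height": [1.50, 1.52, 1.80, 1.82, 1.81],
    "Weight": [50.0, 52.0, 85.0, 90.0, np.nan],
})

# One column at a time: the row with the missing weight has no other
# feature to compute a distance from, so the column mean is used.
single = KNNImputer(n_neighbors=2).fit_transform(toy[["Weight"]])
weight_single = single[4, 0]

# All columns at once: neighbours are chosen by Height, and the missing
# weight becomes the average weight of the 2 closest rows.
full = KNNImputer(n_neighbors=2).fit_transform(toy)
weight_full = full[4, 1]

print("per-column kNN:", weight_single)
print("column mean:   ", toy["Weight"].mean())
print("whole-frame kNN:", weight_full)
```

In this toy frame the per-column result should match the column mean, while the whole-frame result follows the two rows with the most similar height.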

Following your example:

To make things easier, I reduced the dataframe to a few columns. This also avoids the categorical columns (Gender is probably categorical too, but that doesn't matter here):

df = df.iloc[:, 4:10]

   Gender  Age  Height  Weight  HoehnYahr  UPDRS
0       2   82    1.45    50.0        3.0   20.0
1       1   68    1.71     NaN        2.5   25.0
2       2   82    1.53    51.0        2.5   24.0
3       1   72    1.70    82.0        2.0   16.0
4       2   53    1.67    54.0        3.0   44.0

For mean imputation:

imputed_data_mean = df.copy()
column = "Weight"
imputed_data_mean[column] = imputed_data_mean[column].fillna(imputed_data_mean[column].mean())
imputed_data_mean.head()

   Gender  Age  Height     Weight  HoehnYahr  UPDRS
0       2   82    1.45  50.000000        3.0   20.0
1       1   68    1.71  72.217949        2.5   25.0
2       2   82    1.53  51.000000        2.5   24.0
3       1   72    1.70  82.000000        2.0   16.0
4       2   53    1.67  54.000000        3.0   44.0
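If every numeric column should be mean-imputed, the per-column loop can be replaced by a single fillna call. A minimal sketch on made-up data (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Age": [60.0, np.nan, 80.0],
    "Weight": [50.0, 70.0, np.nan],
})

# mean(numeric_only=True) returns one mean per numeric column; passing
# that Series to fillna fills each column with its own mean and leaves
# the non-numeric "Name" column untouched.
mean_filled = toy.fillna(toy.mean(numeric_only=True))
print(mean_filled)
```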

Now let's see the difference using KNNImputer:

imputed_data_knn = df.copy()
imputer = KNNImputer(n_neighbors=3)

values_imputed = imputer.fit_transform(imputed_data_knn)

df_updated = pd.DataFrame(values_imputed, columns=imputed_data_knn.columns)
df_updated.head()

   Gender   Age  Height     Weight  HoehnYahr  UPDRS
0     2.0  82.0    1.45  50.000000        3.0   20.0
1     1.0  68.0    1.71  74.666667        2.5   25.0
2     2.0  82.0    1.53  51.000000        2.5   24.0
3     1.0  72.0    1.70  82.000000        2.0   16.0
4     2.0  53.0    1.67  54.000000        3.0   44.0

The two methods now give different values for the missing weight in row 1: 72.22 with the mean method and 74.67 with the KNN imputer.
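To apply the same idea to the full dataframe without slicing columns by position, one option is to select the numeric columns by dtype and impute them together. The helper below is a hypothetical sketch (knn_impute_numeric is not part of pandas or scikit-learn), shown on made-up data. It assumes no numeric column is entirely NaN, since such a column may be dropped by the imputer and break the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute_numeric(frame, n_neighbors=3):
    """Return a copy of frame with its numeric columns KNN-imputed together.

    Hypothetical helper for illustration. Non-numeric columns and the
    index are left as they are.
    """
    result = frame.copy()
    numeric_cols = result.select_dtypes(include="number").columns
    imputer = KNNImputer(n_neighbors=n_neighbors)
    # Fit once on all numeric columns so neighbours are found across features.
    result[numeric_cols] = imputer.fit_transform(result[numeric_cols])
    return result


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Study": ["Ga", "Ga", "Ju", "Ju"],
    "Height": [1.50, 1.80, 1.52, 1.81],
    "Weight": [50.0, 85.0, np.nan, 88.0],
})

imputed = knn_impute_numeric(toy, n_neighbors=1)
print(imputed)
```

With n_neighbors=1, the missing weight is taken from the single row with the closest height, which makes the effect easy to check by eye.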

Answered By - Alex Serra Marrugat

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
