Issue
I am using the dataset https://physionet.org/content/gaitpdb/1.0.0/ on Parkinson's disease for a university project and wanted to demonstrate two data imputation options, mean and kNN, in Python. My professor recommended the Impyute library, but I opted for a different approach because dataframes no longer have the to_matrix method (as far as I could find). I used scikit-learn and pandas instead, resulting in this code:
imputed_data_mean = df.copy()
for column in df.columns:
    # check if column is numeric
    if imputed_data_mean[column].dtype != 'object':
        imputed_data_mean[column] = imputed_data_mean[column].fillna(imputed_data_mean[column].mean())
print(imputed_data_mean)
imputed_data_mean.to_csv('imputed_data_mean.csv', index=False)
df = pd.read_csv('dataset/demographics.txt', sep='\t')
from sklearn.impute import KNNImputer
imputed_data_knn = df.copy()
imputer = KNNImputer(n_neighbors=3)
for column in df.columns:
    # check if column is numeric
    if df[column].dtype != 'object':
        imputed_data_knn[column] = imputer.fit_transform(imputed_data_knn[[column]])
print(imputed_data_knn)
imputed_data_knn.to_csv('imputed_data_knn.csv', index=False)
Now I am very suspicious about whether I am doing this correctly, as both dataframes are exactly the same. I even re-imported the original dataframe in case it had been changed and I had somehow overlooked it. Even changing the number of neighbours does not change the result of imputed_data_knn. Is this just a coincidence, or am I missing something?
I expected kNN to return somewhat better results here, but now I am just confused.
Solution
The problem is here:
imputed_data_knn[column] = imputer.fit_transform(imputed_data_knn[[column]])
When you run fit_transform, you are passing only one column of data to impute the NaN values of that same column. For example, you are asking it to impute the NaN values in the "Weight" column using nothing but the "Weight" column. That makes no sense, right? With no other column to measure distances on, KNNImputer cannot find any neighbours and falls back to the column mean, which is why both results are identical and n_neighbors has no effect.
The key point is that you don't have to create an imputer for every column. You run it once on the whole dataset, and the imputer fills the NaN values of all columns.
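As a rough illustration of the point above, the following sketch compares a single-column KNNImputer call with plain mean filling, and then with a multi-column call. The `toy` dataframe and its values are made up for the demonstration (they are not read from the PhysioNet file), so the imputed numbers will not match the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data, NOT the PhysioNet demographics file.
toy = pd.DataFrame({
    "Age":    [82, 68, 82, 72, 53],
    "Height": [1.45, 1.71, 1.53, 1.70, 1.67],
    "Weight": [50.0, np.nan, 51.0, 82.0, 54.0],
})

# One column at a time: there is no other feature to compute a distance on,
# so the imputer falls back to the mean of the column.
single = KNNImputer(n_neighbors=3).fit_transform(toy[["Weight"]])[:, 0]
mean_filled = toy["Weight"].fillna(toy["Weight"].mean()).to_numpy()
print(np.allclose(single, mean_filled))  # True: identical to mean imputation

# All columns at once: neighbours are chosen using Age and Height,
# and the missing Weight is the average over those neighbours.
multi = KNNImputer(n_neighbors=3).fit_transform(toy)
print(multi[1, 2], "vs mean", toy["Weight"].mean())
```

With the single-column call, changing n_neighbors cannot change anything, which matches what the question describes.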
Following your example:
To make things easier, I reduced the dataframe. This also avoids the categorical columns (Gender is probably a category too, but we don't care for now):
df = df.iloc[:, 4:10]
   Gender  Age  Height  Weight  HoehnYahr  UPDRS
0       2   82    1.45    50.0        3.0   20.0
1       1   68    1.71     NaN        2.5   25.0
2       2   82    1.53    51.0        2.5   24.0
3       1   72    1.70    82.0        2.0   16.0
4       2   53    1.67    54.0        3.0   44.0
For mean imputation:
imputed_data_mean = df.copy()
column="Weight"
imputed_data_mean[column] = imputed_data_mean[column].fillna(imputed_data_mean[column].mean())
imputed_data_mean.head()
   Gender  Age  Height     Weight  HoehnYahr  UPDRS
0       2   82    1.45  50.000000        3.0   20.0
1       1   68    1.71  72.217949        2.5   25.0
2       2   82    1.53  51.000000        2.5   24.0
3       1   72    1.70  82.000000        2.0   16.0
4       2   53    1.67  54.000000        3.0   44.0
Now let's see the difference using KNNImputer:
imputed_data_knn = df.copy()
imputer = KNNImputer(n_neighbors=3)
values_imputed = imputer.fit_transform(imputed_data_knn)
df_updated = pd.DataFrame(values_imputed, columns=imputed_data_knn.columns)
df_updated.head()
   Gender   Age  Height     Weight  HoehnYahr  UPDRS
0     2.0  82.0    1.45  50.000000        3.0   20.0
1     1.0  68.0    1.71  74.666667        2.5   25.0
2     2.0  82.0    1.53  51.000000        2.5   24.0
3     1.0  72.0    1.70  82.000000        2.0   16.0
4     2.0  53.0    1.67  54.000000        3.0   44.0
We can see that the two methods give different values for the missing weight: 72.22 with the mean method and 74.67 with the KNN imputer.
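If the full dataframe still contains non-numeric columns, one possible way to apply the same idea is to impute all numeric columns in a single call and leave the rest untouched. This is only a sketch: the helper name `knn_impute_numeric` and the `toy` data are invented for illustration, and it assumes that no numeric column is entirely NaN (KNNImputer drops such columns by default, which would break the assignment back into the dataframe).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def knn_impute_numeric(frame, n_neighbors=3):
    """Return a copy of frame whose numeric columns are imputed together."""
    result = frame.copy()
    # Select every numeric column so they can act as features for each other.
    numeric_cols = result.select_dtypes(include="number").columns
    imputer = KNNImputer(n_neighbors=n_neighbors)
    # One fit_transform call over all numeric columns at once.
    result[numeric_cols] = imputer.fit_transform(result[numeric_cols])
    return result

# Hypothetical toy data with one non-numeric column.
toy = pd.DataFrame({
    "ID":     ["a", "b", "c", "d", "e"],
    "Age":    [82, 68, 82, 72, 53],
    "Height": [1.45, 1.71, 1.53, 1.70, 1.67],
    "Weight": [50.0, np.nan, 51.0, 82.0, 54.0],
})
filled = knn_impute_numeric(toy)
print(filled)
```

The non-numeric "ID" column is carried over unchanged, while the numeric columns come back as floats.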
Answered By - Alex Serra Marrugat