Issue
How can I transform the given data set so that it is mean centred and scaled to unit variance, using pandas, numpy, or any other appropriate Python module? The data also contain some missing values ("Nan") that should be dealt with before the modelling task. Please help.
thanks
Example data set:
GA_ID PN_ID PC_ID MBP_ID GR_ID AP_ID class
0.033 6.652 6.681 0.194 0.874 3.177 0
0.034 9.039 6.224 0.194 1.137 Nan 0
0.035 10.936 10.304 1.015 0.911 4.9 1
0.022 10.11 9.603 1.374 0.848 4.566 1
0.035 2.963 17.156 0.599 0.823 9.406 1
0.033 10.872 10.244 1.015 0.574 4.871 1
0.035 21.694 22.389 1.015 0.859 9.259 1
0.035 10.936 10.304 1.015 0.911 Nan 1
0.035 10.936 10.304 1.015 0.911 4.9 1
0.035 10.936 10.304 1.015 0.911 4.9 0
0.036 1.373 12.034 0.35 0.259 5.723 0
0.033 9.831 9.338 0.35 0.919 4.44 0
I have used:
from sklearn import preprocessing
import numpy as np
raw_data = open("/home/zebrafish/Desktop/scklearn/data.csv")
dataset = np.loadtxt(raw_data, delimiter=",")
X = dataset[:, 0:6]  # feature columns GA_ID .. AP_ID
y = dataset[:, 6]    # class column
X_pro = preprocessing.scale(X)
However, I am not sure whether this method is correct, and whether it will ignore the "Nan" values or handle them automatically. The original data contained no "Nan" values, but to understand the solution in case they do occur, I have inserted "Nan" manually at two positions.
thanks
Question Update
After some googling and playing around with the data, I suspect that this method may be normalizing the data on a row basis, whereas I want to normalize on a column basis.
What would be the appropriate method for column-wise normalization?
thanks
Solution
As you have already started to do, an easy way to accomplish this is via sklearn's preprocessing module.
You can start by imputing the NaN values:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)  # axis=0 imputes column-wise
cleaned_X = imp.fit_transform(X)
In this scenario, each 'Nan' value will be replaced by the mean of the remaining values in that column (here AP_ID), rather than dropping the rows completely (and losing data).
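Note that Imputer has been removed from newer versions of scikit-learn. If you are on a recent release, a minimal equivalent sketch (assuming X is a NumPy array in which the missing entries are np.nan, as np.loadtxt would produce) uses sklearn.impute.SimpleImputer, which always imputes column-wise:
import numpy as np
from sklearn.impute import SimpleImputer
# Each np.nan is replaced by the mean of the non-missing values in its column.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
cleaned_X = imp.fit_transform(X)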
Next, in order to normalize your data on a column basis, your method is actually correct:
scaled_X = preprocessing.scale(cleaned_X)
By default, sklearn scales your variables by feature (column); to scale by sample (row) you can pass axis=1 to the scale function, although I doubt you would ever want to do that.
For reference: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
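As a quick sanity check (a minimal sketch, assuming cleaned_X from the imputation step above), you can confirm that the scaling is indeed column-wise:
from sklearn import preprocessing
scaled_X = preprocessing.scale(cleaned_X)  # default axis=0: each column gets zero mean, unit variance
print(scaled_X.mean(axis=0))  # approximately all zeros
print(scaled_X.std(axis=0))   # approximately all ones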
One point worth noting: if your later statistical analysis (say, linear regression or similar) assumes no significant correlations across features, and you notice that your features are in fact highly correlated, then scaling each column independently (which is what preprocessing.scale does) will not be sufficient.
If that is indeed the case, I would suggest first using sklearn's PCA decomposition with whiten=True. This will effectively scale the data to unit variance and zero mean while removing linear correlations across features (by projecting onto orthogonal directions which explain most of the variability of your data).
For reference: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
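For completeness, a minimal sketch of that approach, again assuming cleaned_X from the imputation step (the number of components is left at the default here, i.e. all of them):
from sklearn.decomposition import PCA
# whiten=True rescales the projected components to unit variance;
# the projection itself removes linear correlations between features.
pca = PCA(whiten=True)
whitened_X = pca.fit_transform(cleaned_X)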
Hope this helps!
Answered By - Azmy Rajab