Issue
How can I transform the given data set so that it is mean centred and scaled to unit variance, using pandas, numpy, or any other appropriate Python module? The data also contain some missing values ("Nan") that should be dealt with before the modelling task. Please help.
thanks
Example data set:
GA_ID PN_ID PC_ID MBP_ID GR_ID AP_ID class
0.033 6.652 6.681 0.194 0.874 3.177 0
0.034 9.039 6.224 0.194 1.137 Nan 0
0.035 10.936 10.304 1.015 0.911 4.9 1
0.022 10.11 9.603 1.374 0.848 4.566 1
0.035 2.963 17.156 0.599 0.823 9.406 1
0.033 10.872 10.244 1.015 0.574 4.871 1
0.035 21.694 22.389 1.015 0.859 9.259 1
0.035 10.936 10.304 1.015 0.911 Nan 1
0.035 10.936 10.304 1.015 0.911 4.9 1
0.035 10.936 10.304 1.015 0.911 4.9 0
0.036 1.373 12.034 0.35 0.259 5.723 0
0.033 9.831 9.338 0.35 0.919 4.44 0
I have used:
from sklearn import preprocessing
import numpy as np
raw_data = open("/home/zebrafish/Desktop/scklearn/data.csv")
dataset = np.loadtxt(raw_data, delimiter=",")
X = dataset[:, 0:6]  # feature columns GA_ID .. AP_ID
y = dataset[:, 6]    # class column
X_pro = preprocessing.scale(X)
However, I am not sure whether this method is correct, and whether it will ignore the "Nan" values or handle them automatically. The original data contained no "Nan" values, but to understand the solution in case they do occur, I have inserted "Nan" manually at two positions.
thanks
Question Update
After some googling and playing around with the data, I suspect that this method may be normalizing the data on a row basis, whereas I want to normalize on a column basis.
What would be the appropriate method for column-wise normalization?
thanks
Solution
As you have already started to do, an easy way to accomplish this is via sklearn's preprocessing module.
You can start by imputing the NaN values:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)  # axis=0 imputes column-wise
cleaned_X = imp.fit_transform(X)
In this scenario, each 'Nan' value will be replaced by the mean of the remaining values in that column (here AP_ID), rather than dropping the rows completely (and losing data).
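Note that Imputer has been removed from newer versions of scikit-learn. If you are on a recent release, a minimal equivalent sketch (assuming X is a NumPy array in which the missing entries are np.nan, as np.loadtxt would produce) uses sklearn.impute.SimpleImputer, which always imputes column-wise:
import numpy as np
from sklearn.impute import SimpleImputer
# Each np.nan is replaced by the mean of the non-missing values in its column.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
cleaned_X = imp.fit_transform(X)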
Next, in order to normalize your data on a column basis, your method is actually correct:
scaled_X = preprocessing.scale(cleaned_X)
By default, sklearn scales your variables by feature (column); to scale by sample (row) you can pass axis=1 to the scale function, although I doubt you would ever want to do that.
For reference: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
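As a quick sanity check (a minimal sketch, assuming cleaned_X from the imputation step above), you can confirm that the scaling is indeed column-wise:
from sklearn import preprocessing
scaled_X = preprocessing.scale(cleaned_X)  # default axis=0: each column gets zero mean, unit variance
print(scaled_X.mean(axis=0))  # approximately all zeros
print(scaled_X.std(axis=0))   # approximately all ones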
One point worth noting: if your later statistical analysis (say, linear regression or similar) assumes no significant correlations across features, and you notice that your features are in fact highly correlated, then scaling each column independently (which is what preprocessing.scale does) will not be sufficient.
If that is indeed the case, I would suggest first using sklearn's PCA decomposition with whiten=True. This will effectively scale the data to unit variance and zero mean while removing linear correlations across features (by projecting onto orthogonal directions which explain most of the variability of your data).
For reference: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
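For completeness, a minimal sketch of that approach, again assuming cleaned_X from the imputation step (the number of components is left at the default here, i.e. all of them):
from sklearn.decomposition import PCA
# whiten=True rescales the projected components to unit variance;
# the projection itself removes linear correlations between features.
pca = PCA(whiten=True)
whitened_X = pca.fit_transform(cleaned_X)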
Hope this helps!
Answered By - Azmy Rajab