How can I transform the given data set so that it is mean-centred and scaled to unit variance, using pandas, numpy, or any other appropriate Python module? The data also contain some missing values, written as "Nan", which should be removed before the modelling task. Please help.
Example data set:
0.033 6.652 6.681 0.194 0.874 3.177 0
0.034 9.039 6.224 0.194 1.137 Nan 0
0.035 10.936 10.304 1.015 0.911 4.9 1
0.022 10.11 9.603 1.374 0.848 4.566 1
0.035 2.963 17.156 0.599 0.823 9.406 1
0.033 10.872 10.244 1.015 0.574 4.871 1
0.035 21.694 22.389 1.015 0.859 9.259 1
0.035 10.936 10.304 1.015 0.911 Nan 1
0.035 10.936 10.304 1.015 0.911 4.9 1
0.035 10.936 10.304 1.015 0.911 4.9 0
0.036 1.373 12.034 0.35 0.259 5.723 0
0.033 9.831 9.338 0.35 0.919 4.44 0
I have used:
from sklearn import preprocessing
import numpy as np
raw_data = open("/home/zebrafish/Desktop/scklearn/data.csv")
dataset = np.loadtxt(raw_data, delimiter=",")
X = dataset[:,0:6]  # columns 0-5 are the features; column 6 is the label
y = dataset[:,6]
X_pro = preprocessing.scale(X)
but I am not sure whether this method is correct, and whether it would ignore the "Nan" values or automatically handle them appropriately. The original data had no "Nan" values; I inserted "Nan" manually at two positions so that I would understand the solution if missing values do occur.
Question Update
After some googling and experimenting with the data, I found that this method may be normalizing the data on a row basis, whereas I want to normalize it on a column basis.
So what would be the appropriate method for column-wise normalization?
As you have already started, an easy way to accomplish this is via sklearn's preprocessing module.
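You can start by handling the NaN values. Rather than removing them, you can impute them:
import numpy as np
from sklearn.impute import SimpleImputer
# Older scikit-learn releases (before 0.20) used sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # imputes column by column
cleaned_X = imp.fit_transform(X)
In this scenario, each NaN value is replaced by the mean of the remaining values in its column, as opposed to dropping the rows completely (and losing data).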
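If you would rather drop the incomplete rows, as the question originally asked, a minimal NumPy sketch might look like the following. The small array and the names X_complete and y_complete are made up for illustration. Note that the same mask has to be applied to y so that features and labels stay aligned.

```python
import numpy as np

# Toy data for illustration only: 4 samples, 3 features, one missing value.
X = np.array([
    [0.033, 6.652, 3.177],
    [0.034, 9.039, np.nan],
    [0.035, 10.936, 4.900],
    [0.022, 10.110, 4.566],
])
y = np.array([0, 0, 1, 1])

# True for rows that contain no NaN in any column.
complete_rows = ~np.isnan(X).any(axis=1)

# Apply the same mask to features and labels so they stay aligned.
X_complete = X[complete_rows]
y_complete = y[complete_rows]

print(X_complete.shape)  # (3, 3)
print(y_complete)        # [0 1 1]
```

With only a handful of missing values this loses little data; with many, imputation is usually the safer choice.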
Next, in order to normalize your data on a column basis, your method is actually correct:
scaled_X = preprocessing.scale(cleaned_X)
By default, sklearn normalizes your variables by feature (column); to normalize by sample (row), you can pass axis=1 to the scale function. However, I doubt you would ever want to do that.
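One way to convince yourself that the scaling is per column is to reproduce it by hand in NumPy and check the column statistics. This is only a sketch on made-up data. To my understanding preprocessing.scale uses the population standard deviation (ddof=0), which is also what X.std(axis=0) computes by default, so the two should agree.

```python
import numpy as np

# Toy data for illustration only: 4 samples (rows), 2 features (columns).
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 80.0],
])

# Column-wise standardization: subtract each column's mean and divide by
# each column's standard deviation (population std, ddof=0).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_scaled = (X - col_mean) / col_std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

If the row statistics (axis=1) came out as 0 and 1 instead, the scaling would have been applied per sample rather than per feature.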
One point worth noting: if your later statistical analysis (say, linear regression) assumes that there are no significant correlations across features, and you notice that many of your features are correlated, then scaling each column independently (which is all that preprocessing.scale does) will not be sufficient.
If that is indeed the case, I would suggest first using sklearn's PCA decomposition with whiten=True. This effectively scales the data to zero mean and unit variance while removing linear correlations across features, by projecting the data onto orthogonal directions that explain most of its variability.
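To illustrate what whitening does, here is a NumPy-only sketch on synthetic data (the variable names and the random data are made up for illustration). As far as I understand, it should match what sklearn's PCA with whiten=True returns, up to the sign of each component, but treat that as an assumption rather than a guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up correlated data: the second feature depends strongly on the first.
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.5, size=200)
c = rng.normal(size=200)
X = np.column_stack([a, b, c])

# Center each column, then take the SVD of the centered data.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Whitened scores: projections onto the principal directions, each rescaled
# to unit (sample) variance.
n_samples = X.shape[0]
X_white = U * np.sqrt(n_samples - 1)

# The covariance of the whitened data is (numerically) the identity matrix,
# i.e. unit variance and no linear correlation between columns.
cov = np.cov(X_white, rowvar=False)
print(np.round(cov, 6))
```

Keep in mind that the whitened columns are principal components rather than the original features, so they are harder to interpret individually.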
Hope this helps!
