Issue
I have a dataset which I split 80/20 into a training set and a test set. On the training set I do k-fold cross-validation and take the mean of the accuracies. However, it is not clear to me how I should apply this result to my original test set.
#Splitting Training & Test dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
#Feature scaling (standardisation)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Training the SVM model on the Training set
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train, y_train)
#Making the Confusion Matrix of SVM model
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("SVM Model: ")
print(cm)
print('Accuracy of the test set:'+ str(accuracy_score(y_test, y_pred)))
#applying k-Fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=15)
print('Accuracy of K-Fold Validation: {:.2f} %'.format(accuracies.mean()*100))
print('Standard Deviation of K-Fold Validation: {:.2f} %'.format(accuracies.std()*100))
Solution
There are two issues here:
- If you're using cross-validation, do not scale the data first. Instead, add the scaler to a pipeline and do the cross-validation with the pipeline. This is because during cross-validation, the scaler must be fitted on each fold's training portion only, not on the entire dataset.
- In cross-validation, you do not actually train the model yourself. Instead, sklearn trains it for you inside the cross-validation loop. Once you have selected the model you're going to use, you then train that model on all your training data.
So here's one way to approach your task:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score, cross_val_predict, cross_validate
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
sc = StandardScaler()
clf = SVC(kernel='rbf', random_state=0)
# Make a pipeline model that scales and estimates.
pipe = make_pipeline(sc, clf)
# Make predictions during cross-validation.
y_pred = cross_val_predict(pipe, X_train, y_train, cv=10, n_jobs=-1)
cm = confusion_matrix(y_train, y_pred)
score = accuracy_score(y_train, y_pred)
print("SVM Model: ")
print(cm)
print('Validation accuracy on training set: {:.2f}%'.format(score*100))
# Alternatively, use cross_val_score.
accuracies = cross_val_score(estimator=pipe, X=X_train, y=y_train, cv=10)
print()
print('Accuracy of k-fold validation: {:.2f}%'.format(accuracies.mean()*100))
print('Standard deviation of k-fold validation: {:.2f}%'.format(accuracies.std()*100))
This results in:
SVM Model:
[[41 0]
[ 3 36]]
Validation accuracy on training set: 96.25%
Accuracy of k-fold validation: 96.25%
Standard deviation of k-fold validation: 5.73%
The key thing about cross_val_predict is that it makes predictions during cross-validation, when each fold of the data is the validation set. Then it puts all the predictions together to give you validation predictions for the full training set, so you can use sklearn's various scoring tools on them.
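For example, those out-of-fold predictions can be passed to any of sklearn's classification metrics. A minimal sketch using classification_report on the y_pred produced by cross_val_predict above:
from sklearn.metrics import classification_report
# y_pred holds the out-of-fold predictions for every training sample.
print(classification_report(y_train, y_pred))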
The nice thing about cross_val_score is that you see the scores from each fold. Notice that their mean matches the overall accuracy above: 96.25%. But now you can also get the variance, which is important because it tells you something about the prediction variance you can expect from the model on unseen data in the future.
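If you want more than one metric per fold, cross_validate (imported above but not used yet) returns a dictionary of per-fold scores. A minimal sketch, assuming accuracy and F1 are the metrics of interest:
# Score each fold on accuracy and F1; the result is a dict of per-fold arrays.
results = cross_validate(pipe, X_train, y_train, cv=10, scoring=['accuracy', 'f1'], n_jobs=-1)
print('Per-fold accuracy:', results['test_accuracy'])
print('Per-fold F1:', results['test_f1'])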
What about the test data?
You should not test your model against X_test and y_test until the very end of the model selection workflow. Only when you have set all the hyperparameters of the model, e.g. the C, kernel, and gamma hyperparameters in this particular algorithm, should you check how that model does on the test data. To put it another way: do not use the test set for model tuning, only for performance estimation.
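As a rough sketch of that final step, you could tune the pipeline with GridSearchCV on the training set, then touch the test set exactly once (the grid values here are placeholders, not recommendations):
from sklearn.model_selection import GridSearchCV
# Step names come from make_pipeline: the SVC step is called 'svc'.
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
# Only now, with the hyperparameters fixed, evaluate on the held-out test set.
y_test_pred = search.best_estimator_.predict(X_test)
print('Test accuracy: {:.2f}%'.format(accuracy_score(y_test, y_test_pred) * 100))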
Answered By - kwinkunks