Issue
I want to know whether it is possible to determine each variable's specific influence when a model scores a single test sample. The example below clarifies the question.
Given a dataset for predicting student scores:
    ID  Studies hours  Games hours  lectures hours  social Activities  Score
 0   1             20            5              15                  2     78
 1   2             15            6              13                  3     69
 2   3             31            2              16                  1     95
 3   4             22            2              15                  2     80
 4   5             19            7              15                  4     71
 5   6             10            8              10                  8     52
 6   7             13            7              11                  6     59
 7   8             34            1              16                  1     96
 8   9             25            6              15                  1     83
 9  10             22            3              16                  2     76
10  11             17            7              15                  1     66
11  12             28            2              14                  2     87
12  13             21            3              16                  3     77
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, RepeatedKFold, train_test_split
from xgboost import XGBRegressor, plot_importance
%matplotlib inline

data = pd.read_csv("student_score.csv")

def performance(data):
    X = data.iloc[:, :-1]   # every column except Score (note: this keeps ID as a feature)
    y = data.iloc[:, -1]    # Score
    model = XGBRegressor(booster='gbtree')  # the sklearn-style regressor (XGBModel is only its base class)
    # model = XGBRegressor(booster='gblinear')
    model.fit(X, y)
    # evaluate the model with repeated k-fold cross-validation
    # (cross_val_score refits fresh clones of the model on each fold)
    cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = np.absolute(scores)
    print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
    # save the model to disk
    filename = 'score.sav'
    pickle.dump(model, open(filename, 'wb'))
    # hold out a few rows and predict on them with the reloaded model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    loaded_model = pickle.load(open('score.sav', 'rb'))
    result = loaded_model.predict(X_test)
    print(result)
    plt.rcParams["figure.figsize"] = (20, 15)
    plot_importance(model)
    plt.show()

performance(data)
Feature importances:
[5.6058721e-04 6.7560148e-01 3.1960118e-01 4.2312010e-03 5.4962843e-06]
These are the general importances ranked by the model, one value per training column.
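To see which value belongs to which column, the importances can be paired with the training columns (a small sketch, assuming model and X from the code above are in scope):

# pair each importance with its training column; the order follows X.columns
for name, importance in zip(X.columns, model.feature_importances_):
    print("%s: %.6f" % (name, importance))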
What I need now is:
when I pick a sample test, say test = pd.DataFrame([{"Studies hours": 15, "Games hours": 6, "lectures hours": 13, "social Activities": 3}])
and predict with loaded_model.predict(test),
and I get a score like 68: which of the variables specifically (not the general importance) caused this particular sample to score 68 rather than 100?
For example, the model should tell me the study hours were bad, or were less than expected.
Can a machine learning model do that?
Solution
The topic you're describing is called model explainability or interpretability. Generally speaking, the more sophisticated the model, the more accurate it is, but the harder it is to explain. SHAP values are the most common way I see folks explain the effect of each feature on predictions in general, and the effect of each feature value on the prediction for a single observation. The most common visualization of SHAP values is the force plot, which shows how each feature value pushes an individual prediction above or below the model's average output.
The blog post Explain Any Models with the SHAP Values — Use the KernelExplainer explains how to build a force plot for any model.
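Below is a minimal sketch of a per-observation explanation with the shap package (pip install shap), assuming the trained loaded_model and the test row from the question, and that test has exactly the columns the model was trained on; TreeExplainer is shap's explainer for tree-based models such as XGBoost:

import shap

# TreeExplainer computes SHAP values efficiently for tree models like XGBoost
explainer = shap.TreeExplainer(loaded_model)
shap_values = explainer.shap_values(test)

# expected_value is the model's baseline (average) prediction; each SHAP value
# is how far that feature's value pushed this one prediction above (+) or
# below (-) the baseline
print("Baseline prediction:", explainer.expected_value)
for feature, contribution in zip(test.columns, shap_values[0]):
    print("%s: %+.2f" % (feature, contribution))

# force plot: the same decomposition drawn for this single sample
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0], test.iloc[0])

A strongly negative contribution on Studies hours, for example, is exactly the "study hours were less than expected" signal the question asks for.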
Answered By - K. Thorspear