Issue
I want to know whether it is possible to determine each variable's specific influence when a model scores a single test sample. The example below clarifies the question.
Given a dataset for predicting student scores:
    ID  Studies hours  Games hours  lectures hours  social Activities  Score
 0   1             20            5              15                  2     78
 1   2             15            6              13                  3     69
 2   3             31            2              16                  1     95
 3   4             22            2              15                  2     80
 4   5             19            7              15                  4     71
 5   6             10            8              10                  8     52
 6   7             13            7              11                  6     59
 7   8             34            1              16                  1     96
 8   9             25            6              15                  1     83
 9  10             22            3              16                  2     76
10  11             17            7              15                  1     66
11  12             28            2              14                  2     87
12  13             21            3              16                  3     77
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, RepeatedKFold, train_test_split
from xgboost import XGBRegressor, plot_importance
%matplotlib inline

data = pd.read_csv("student_score.csv")

def performance(data):
    X = data.iloc[:, :-1]   # every column except Score (note: this keeps ID as a feature)
    y = data.iloc[:, -1]    # Score
    model = XGBRegressor(booster='gbtree')  # the sklearn-style regressor (XGBModel is only its base class)
    # model = XGBRegressor(booster='gblinear')
    model.fit(X, y)
    # evaluate the model with repeated k-fold cross-validation
    # (cross_val_score refits fresh clones of the model on each fold)
    cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = np.absolute(scores)
    print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
    # save the model to disk
    filename = 'score.sav'
    pickle.dump(model, open(filename, 'wb'))
    # hold out a few rows and predict on them with the reloaded model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    loaded_model = pickle.load(open('score.sav', 'rb'))
    result = loaded_model.predict(X_test)
    print(result)
    plt.rcParams["figure.figsize"] = (20, 15)
    plot_importance(model)
    plt.show()

performance(data)
Feature importances:
[5.6058721e-04 6.7560148e-01 3.1960118e-01 4.2312010e-03 5.4962843e-06]
These are the general importances ranked by the model, one value per training column.
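To see which value belongs to which column, the importances can be paired with the training columns (a small sketch, assuming model and X from the code above are in scope):

# pair each importance with its training column; the order follows X.columns
for name, importance in zip(X.columns, model.feature_importances_):
    print("%s: %.6f" % (name, importance))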
What I need now is:
when I pick a sample test, say test = pd.DataFrame([{"Studies hours": 15, "Games hours": 6, "lectures hours": 13, "social Activities": 3}])
and predict with loaded_model.predict(test),
and I get a score like 68: which of the variables specifically (not the general importance) caused this particular sample to score 68 rather than 100?
For example, the model should tell me the study hours were bad, or were less than expected.
Can a machine learning model do that?
Solution
The topic you're describing is called model explainability or interpretability. Generally speaking, the more sophisticated the model, the more accurate it is, but the harder it is to explain. SHAP values are the most common way I see folks explain the effect of each feature on predictions in general, and the effect of each feature value on the prediction for a single observation. The most common visualization of SHAP values is the force plot, which shows how each feature value pushes an individual prediction above or below the model's average output.
The blog post Explain Any Models with the SHAP Values — Use the KernelExplainer explains how to build a force plot for any model.
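Below is a minimal sketch of a per-observation explanation with the shap package (pip install shap), assuming the trained loaded_model and the test row from the question, and that test has exactly the columns the model was trained on; TreeExplainer is shap's explainer for tree-based models such as XGBoost:

import shap

# TreeExplainer computes SHAP values efficiently for tree models like XGBoost
explainer = shap.TreeExplainer(loaded_model)
shap_values = explainer.shap_values(test)

# expected_value is the model's baseline (average) prediction; each SHAP value
# is how far that feature's value pushed this one prediction above (+) or
# below (-) the baseline
print("Baseline prediction:", explainer.expected_value)
for feature, contribution in zip(test.columns, shap_values[0]):
    print("%s: %+.2f" % (feature, contribution))

# force plot: the same decomposition drawn for this single sample
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0], test.iloc[0])

A strongly negative contribution on Studies hours, for example, is exactly the "study hours were less than expected" signal the question asks for.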
Answered By - K. Thorspear