Issue
I am struggling to find the best way to measure the performance of my model on a highly imbalanced dataset. It is a binary classification problem: predicting stroke cases, with 3364 negative cases and 202 positive cases.
Would the F1-score be the most important metric in this context? It always comes out extremely low. I am also computing the ROC curve, but I am not sure it is useful here. Note that when balancing the data I only resample the training set and leave the test set intact.
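For context, a quick sketch of why plain accuracy is misleading at this ratio (the class counts are taken from the question; everything else is illustrative):

# A model that always predicts "no stroke" is ~94% accurate here,
# which is why accuracy alone says little on imbalanced data.
negatives, positives = 3364, 202
print("Baseline accuracy:", negatives / (negatives + positives))   # ~0.943
print("Positive prevalence:", positives / (negatives + positives)) # ~0.057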
Here's the code:
Splitting the training and test data:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_base, y_base)
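With this much imbalance, a purely random split can leave the test set with very few positive cases. One hedged refinement (stratify is a standard train_test_split parameter; the random_state value is just an example):

from sklearn.model_selection import train_test_split

# Preserve the 3364:202 class ratio in both the train and test splits.
x_train, x_test, y_train, y_test = train_test_split(
    x_base, y_base, stratify=y_base, random_state=42
)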
Function that receives the resampled training set, fits a random forest, and prints the metrics:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

def reportSample(x_resampled, y_resampled, name):
    print(name)
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_classifier.fit(x_resampled, y_resampled)
    # Evaluate on the untouched test set.
    previsoes = rf_classifier.predict(x_test)
    report = classification_report(y_test, previsoes)
    # Probability of the positive class, for ranking metrics such as ROC AUC.
    probabilidades = rf_classifier.predict_proba(x_test)[:, 1]
    auc = roc_auc_score(y_test, probabilidades)
    print(report)
    print("AUC =", auc)
RandomOverSampler:
from imblearn.over_sampling import RandomOverSampler

# sampling_strategy=0.5 oversamples the minority class until it is
# half the size of the majority class.
over_sampler = RandomOverSampler(sampling_strategy=0.5)
x_resampled, y_resampled = over_sampler.fit_resample(x_train, y_train)
reportSample(x_resampled, y_resampled, "Random over sampler")
NearMiss:
from imblearn.under_sampling import NearMiss

# NearMiss-2 undersamples the majority class down to the minority size.
nearmiss = NearMiss(version=2, sampling_strategy='majority')
x_resampled, y_resampled = nearmiss.fit_resample(x_train, y_train)
reportSample(x_resampled, y_resampled, "NearMiss underSample")
SMOTE:
from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority samples until the classes are balanced.
sm = SMOTE(random_state=42)
x_resampled, y_resampled = sm.fit_resample(x_train, y_train)
reportSample(x_resampled, y_resampled, "Smote over sampling")
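Not part of the original post, but a useful baseline to compare against all three resamplers: random forests can reweight classes instead of resampling (class_weight is a standard RandomForestClassifier parameter):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 'balanced' weights each class inversely to its frequency, so the
# 202 positives carry as much total weight as the 3364 negatives.
weighted_rf = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
weighted_rf.fit(x_train, y_train)  # trained on the unmodified training set
print(classification_report(y_test, weighted_rf.predict(x_test)))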
Classification reports of all 3 methods:
[Nearmiss cr](https://i.stack.imgur.com/6M8FL.png)
[RandomCr](https://i.stack.imgur.com/yvZB8.png)
[SmoteCr](https://i.stack.imgur.com/lIDHz.png)
Solution
It's very difficult for anyone to give you a single correct answer here, because it depends on your specific needs. Ultimately, the answer will involve the following:

1. Figure out what you actually want your model to do. Do you care more about correct predictions for one of the classes? Do you care about minimising false positives? And so on.
2. Learn what information each metric actually provides. If you aren't sure whether a metric you're using is worth using in this scenario, you probably don't understand it well enough yet; read up on what it measures.
3. Use a variety of metrics in combination. Each metric tells you something different, and you'll likely end up balancing competing metrics.
4. If you like, combine the results of multiple metrics according to importance criteria you define; a sketch of this idea follows the list.
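The answer deliberately avoids prescribing a formula, but as one hedged illustration of points 3 and 4: sklearn's fbeta_score lets you weight recall over precision, and you can roll several metrics into one score with weights you pick (the weights below are arbitrary examples, and previsoes stands for the test-set predictions, as inside reportSample):

from sklearn.metrics import fbeta_score, precision_score, recall_score

# F2 counts recall twice as heavily as precision - plausible when missing
# a stroke case costs more than a false alarm.
f2 = fbeta_score(y_test, previsoes, beta=2)

# Example composite score with made-up weights favouring recall;
# choose weights that reflect your actual misclassification costs.
composite = 0.6 * recall_score(y_test, previsoes) + 0.4 * precision_score(y_test, previsoes)
print(f"F2 = {f2:.3f}, composite = {composite:.3f}")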
Answered By - Téo