Issue
I'm currently working on a multilabel text classification problem with 4 labels, represented as 4 dummy variables. I have tried several ways of transforming the data into a form suitable for multilabel classification.
Right now I'm using pipelines, but as far as I can see this doesn't fit one model that includes all the labels; instead it makes one model per label - do you agree?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how I can include all the labels in a single model that takes the different label combinations into account?
A subset of the data and my code are here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Solution
Code Analysis
The scikit-learn LogisticRegression classifier with OVR (one-vs-rest) predicts a single output/label at a time. Because the loop fits the pipeline on one label per iteration, you end up with one trained model per label: the algorithm is the same in every case, but each model is trained on a different target.
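Here is a minimal sketch of that effect; the models dict and the use of sklearn.base.clone are purely illustrative and not part of the original code:

from sklearn.base import clone

# One independently fitted copy of the pipeline per label.
models = {}
for category in categories:
    models[category] = clone(LogReg_pipeline).fit(X_train, train[category])

# Predicting then requires one call per label.
preds_per_label = {cat: m.predict(X_test) for cat, m in models.items()}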
Multi-Output Regressor
- Multi-output regressors can accept multiple independent labels and generate one prediction for each target.
- The output should be the same as what you have, but you only need to maintain a single model and train it once.
- To use this approach, wrap your LR model in a MultiOutputRegressor.
- Here is a good tutorial on multi-output regression models.
from sklearn.multioutput import MultiOutputRegressor

model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputRegressor(model))])  # fits one clone of `model` per label column
# Fit on the training label columns (aligned with X_train) and predict all labels at once
preds = pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)
combine_data() merges all data into a single DataFrame for convenience:
import pandas as pd

def combine_data(X, Y, y_cols):
    """X is a Series/DataFrame, Y is a NumPy array, y_cols is a list of column names."""
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()
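If you also want the per-label accuracy printout from the original loop, you can score each column of preds against the matching test column. This is only a sketch and assumes test still contains the label columns:

from sklearn.metrics import accuracy_score

# Per-label test accuracy; preds columns follow the order of `categories`.
for i, category in enumerate(categories):
    print('Test accuracy for {}: {:.3f}'.format(category, accuracy_score(test[category], preds[:, i])))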
Multinomial Logistic Regression
- To use a LogisticRegression classifier on all labels at once, set multi_class='multinomial'.
- The softmax function is used to find the predicted probability of a sample belonging to a class.
- You'll need to reverse the one-hot encoding on the labels to get back a single categorical variable (answer here); a sketch follows this list. If you still have the original label from before one-hot encoding, use that instead.
- Here is a good tutorial on multinomial logistic regression.
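A rough sketch of that reversal, using pandas' idxmax; the single frame and label column name are illustrative, and this only makes sense for rows with exactly one active label:

# Keep only single-label rows, then recover a categorical label column from the dummies.
single = df[df['label_sum'] == 1].copy()
single['label'] = single[categories].idxmax(axis=1)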
# Placeholder names: df_train and input_cols stand for the training frame and its
# feature columns, and "text_source" for the single categorical label column.
label_col = ["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col].values.ravel())

# Generate a table of probabilities, one column per class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=model.classes_)

# Predict the class for each sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)
Answered By - DV82XL