Issue
I'm currently working on a multilabel text classification problem with 4 labels, represented as 4 dummy variables. I have tried several ways of transforming the data into a form suitable for multilabel classification.
Right now I'm using pipelines, but as far as I can see this doesn't fit one model that includes all the labels; instead it makes one model per label - do you agree?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how I can include all the labels in a single model that takes the different label combinations into account?
A subset of the data and my code are here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Solution
Code Analysis
The scikit-learn LogisticRegression classifier with OVR (one-vs-rest) predicts a single output/label at a time. Because the loop fits the pipeline on one label per iteration, you end up with one trained model per label: the algorithm is the same in every case, but each model is trained on a different target.
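Here is a minimal sketch of that effect; the models dict and the use of sklearn.base.clone are purely illustrative and not part of the original code:

from sklearn.base import clone

# One independently fitted copy of the pipeline per label.
models = {}
for category in categories:
    models[category] = clone(LogReg_pipeline).fit(X_train, train[category])

# Predicting then requires one call per label.
preds_per_label = {cat: m.predict(X_test) for cat, m in models.items()}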
Multi-Output Regressor
- Multi-output regressors can accept multiple independent labels and generate one prediction for each target.
- The output should be the same as what you have, but you only need to maintain a single model and train it once.
- To use this approach, wrap your LR model in a MultiOutputRegressor.
- Here is a good tutorial on multi-output regression models.
from sklearn.multioutput import MultiOutputRegressor

model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', max_df=0.20)),
    ('clf', MultiOutputRegressor(model))])  # fits one clone of `model` per label column
# Fit on the training label columns (aligned with X_train) and predict all labels at once
preds = pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)
combine_data() merges all data into a single DataFrame for convenience:
import pandas as pd

def combine_data(X, Y, y_cols):
    """X is a Series/DataFrame, Y is a NumPy array, y_cols is a list of column names."""
    df_out = pd.DataFrame(Y, columns=y_cols)
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()
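If you also want the per-label accuracy printout from the original loop, you can score each column of preds against the matching test column. This is only a sketch and assumes test still contains the label columns:

from sklearn.metrics import accuracy_score

# Per-label test accuracy; preds columns follow the order of `categories`.
for i, category in enumerate(categories):
    print('Test accuracy for {}: {:.3f}'.format(category, accuracy_score(test[category], preds[:, i])))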
Multinomial Logistic Regression
- To use a LogisticRegression classifier on all labels at once, set multi_class='multinomial'.
- The softmax function is used to find the predicted probability of a sample belonging to a class.
- You'll need to reverse the one-hot encoding on the labels to get back a single categorical variable (answer here); a sketch follows this list. If you still have the original label from before one-hot encoding, use that instead.
- Here is a good tutorial on multinomial logistic regression.
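A rough sketch of that reversal, using pandas' idxmax; the single frame and label column name are illustrative, and this only makes sense for rows with exactly one active label:

# Keep only single-label rows, then recover a categorical label column from the dummies.
single = df[df['label_sum'] == 1].copy()
single['label'] = single[categories].idxmax(axis=1)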
# Placeholder names: df_train and input_cols stand for the training frame and its
# feature columns, and "text_source" for the single categorical label column.
label_col = ["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col].values.ravel())

# Generate a table of probabilities, one column per class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=model.classes_)

# Predict the class for each sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)
Answered By - DV82XL