Monday, November 20, 2023

[FIXED] A RandomForest in a Pipeline - sklearn


Issue

If I create a Pipeline in sklearn where the first step is a transformation (Imputer) and the second step fits a RandomForestClassifier with the keyword argument warm_start set to True, how do I call the RandomForestClassifier successively? Does warm_start do anything when embedded in a Pipeline?

http://scikit-learn.org/0.18/auto_examples/missing_values.html

Solution

Yes, it does, but using it inside a pipeline is slightly more involved.

warm_start is only useful if you increase n_estimators in the RandomForestClassifier between calls to fit().

Otherwise the scikit-learn source issues this warning:

warn("Warm-start fitting without increasing n_estimators does not fit new trees.")
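The behaviour behind that warning can be checked on a bare RandomForestClassifier, outside any pipeline. The following is a minimal sketch, assuming a recent scikit-learn release; the toy data and variable names are made up for illustration and are not part of the original answer.

```python
import warnings

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data, only used to have something to fit on.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
clf.fit(X, y)
trees_after_first_fit = len(clf.estimators_)

# Refit without touching n_estimators: no trees are added, a warning is raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf.fit(X, y)
trees_after_refit = len(clf.estimators_)
warning_messages = [str(w.message) for w in caught]

# Refit after raising n_estimators: only the missing trees are grown.
clf.n_estimators = 20
clf.fit(X, y)
trees_after_growth = len(clf.estimators_)

print(trees_after_first_fit, trees_after_refit, trees_after_growth)
print(warning_messages)
```

The tree count should stay the same across the second fit and grow only after n_estimators is raised.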

So you will need to increase the n_estimators of the RandomForestClassifier inside the pipeline.

For that, you first need to access the RandomForestClassifier estimator from the pipeline and then set n_estimators as required. But when you call fit() on the pipeline, the imputer step is still executed (it is simply refitted each time).

For example, consider the pipeline below (Imputer is the class from scikit-learn 0.18; later releases replace it with sklearn.impute.SimpleImputer):

pipe = Pipeline([('imputer', Imputer()), 
                 ('clf', RandomForestClassifier(warm_start=True))])

To make use of warm_start, as your question asks, you would do this:

# Fit the data initially
pipe.fit(X, y)

# Change n_estimators using either of the two lines below
# (the new value must exceed the number of trees already fitted)
pipe.set_params(clf__n_estimators=30)
# or, equivalently:
pipe.named_steps['clf'].n_estimators = 30

# Fit the same data again or new data
pipe.fit(X_new, y_new)
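Put together, the steps above can be sketched as a self-contained script. This is an illustration under stated assumptions rather than the original answer's code: it uses SimpleImputer (the replacement for Imputer in later scikit-learn releases), sets n_estimators=10 explicitly so that 30 is an increase regardless of the library default, and runs on made-up toy data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data with roughly 10% missing values.
rng = np.random.RandomState(0)
X = rng.rand(80, 3)
y = (X[:, 1] > 0.5).astype(int)
X[rng.rand(80, 3) < 0.1] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("clf", RandomForestClassifier(n_estimators=10, warm_start=True,
                                   random_state=0)),
])

# Fit the data initially: imputer and forest are both fitted.
pipe.fit(X, y)
first_trees = list(pipe.named_steps["clf"].estimators_)

# Raise n_estimators on the nested estimator, then fit again.
pipe.set_params(clf__n_estimators=30)
pipe.fit(X, y)
all_trees = pipe.named_steps["clf"].estimators_

# The first 10 trees are kept as-is; 20 new ones are appended.
kept = all(old is new for old, new in zip(first_trees, all_trees))
print(len(first_trees), len(all_trees), kept)
```

Comparing the tree objects before and after the second fit is one way to confirm that warm_start reuses the existing forest instead of rebuilding it.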

In the first call to pipe.fit(), the imputer is fitted on the given data (X, y). In the second call to fit(), one of two things happens, depending on the data:

If you pass the same data again, the imputer is still fitted again, which is unnecessary.
If the data is different, the imputer is fitted on the new data and forgets what it learned previously. The missing values in the new data are then imputed differently from those in the previous data. In my opinion, this is not what you want.
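The second case can be made visible by comparing the imputer's learned statistics_ before and after the second fit. The sketch below also shows one possible workaround that is not part of the original answer: fit the imputer once, keep it outside the pipeline, and warm-start the forest on the transformed data. It assumes SimpleImputer from a recent scikit-learn release; make_data and the shifted toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


def make_data(seed, shift):
    """Hypothetical toy data: 3 features offset by `shift`, ~10% missing."""
    rng = np.random.RandomState(seed)
    X = rng.rand(80, 3) + shift
    y = (X[:, 0] > shift + 0.5).astype(int)
    X[rng.rand(80, 3) < 0.1] = np.nan
    return X, y


X_old, y_old = make_data(seed=0, shift=0.0)
X_new, y_new = make_data(seed=1, shift=10.0)

# 1) Inside the pipeline: the imputer is refitted on every call to fit().
pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("clf", RandomForestClassifier(n_estimators=10, warm_start=True,
                                   random_state=0)),
])
pipe.fit(X_old, y_old)
means_first_fit = pipe.named_steps["imputer"].statistics_.copy()

pipe.set_params(clf__n_estimators=20)
pipe.fit(X_new, y_new)
means_second_fit = pipe.named_steps["imputer"].statistics_.copy()

# 2) Possible workaround: fit the imputer once and reuse it unchanged.
imputer = SimpleImputer().fit(X_old)
clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
clf.fit(imputer.transform(X_old), y_old)

clf.n_estimators = 20
clf.fit(imputer.transform(X_new), y_new)  # imputer keeps the old column means

print(means_first_fit, means_second_fit, imputer.statistics_)
```

Whether keeping the old imputation statistics is appropriate depends on how the new data relates to the old, so treat the second half as an option to consider, not a recommendation.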

Answered By - Vivek Kumar

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
