Monday, February 21, 2022

[FIXED] Filling a dataframe with a list to get the max_leaf_nodes with the lowest mean_absolute_error

Labels: dataframe, list, machine-learning, pandas, scikit-learn

Issue

I made a simple DecisionTreeRegressor and want to get the best max_leaf_nodes value.

Code:

import pandas as pd
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.model_selection import train_test_split as TTS
from sklearn.tree import DecisionTreeRegressor

#split the data in 2 parts: training data and validation data
train_X, val_X, train_y, val_y = TTS(X, y, random_state=0)

#Define and fit the model with the training data
model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

#predict
val_prediction = model.predict(val_X)
#check predictions
print(MAE(val_y, val_prediction))

#defining get_mae function
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = MAE(val_y, preds_val)  #mean_absolute_error is imported under the alias MAE
    return mae

#DataFrame and list
df_mae = pd.DataFrame(columns = ["MAE"])
li = []

#collecting mae's depending on max_leaf_nodes values
for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    li.append(mae)

How can I add the values of li to the "MAE" column of df_mae?

Is there a better way to find a good max_leaf_nodes value? (My laptop spent 25 minutes on that for loop.)

Solution

You could add each row to the dataframe directly inside the loop instead of building a list first. DataFrame.append was removed in pandas 2.0, so assign the new row with .loc:

df_mae.loc[len(df_mae)] = [mae]

However, if you prefer to add the list instead of individual values (outside the for loop):

df_mae["MAE"] = li

Be aware that you also need to store the max_leaf_nodes values; otherwise the resulting dataframe won't tell you which setting produced which error.

df_mae = pd.DataFrame(columns = ["MAE", "max_leaf_nodes"])
li = []
max_leaf_nodes_list = []

for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    li.append(mae)
    max_leaf_nodes_list.append(max_leaf_nodes)

df_mae = pd.DataFrame({'MAE': li, 'max_leaf_nodes': max_leaf_nodes_list})

or, appending the values into the dataframe directly:

df_mae = pd.DataFrame(columns = ["MAE", "max_leaf_nodes"])
for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    df_mae.loc[len(df_mae)] = [mae, max_leaf_nodes]  #values in column order
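Once the scores are in a dataframe, the max_leaf_nodes value with the lowest MAE can be read off with idxmin. The following is only a minimal sketch: the four (max_leaf_nodes, MAE) pairs are made-up stand-ins for the loop's real results, and the names records, best_row, best_max_leaf_nodes and best_mae are not from the original post.

```python
import pandas as pd

# Hypothetical (max_leaf_nodes, MAE) pairs standing in for the loop's results.
records = [
    {"max_leaf_nodes": 2, "MAE": 34.0},
    {"max_leaf_nodes": 4, "MAE": 29.5},
    {"max_leaf_nodes": 6, "MAE": 27.1},
    {"max_leaf_nodes": 8, "MAE": 28.3},
]

# Building the frame once from a list of dicts avoids growing it row by row.
df_mae = pd.DataFrame(records)

# idxmin() returns the index label of the smallest MAE; .loc fetches that row.
best_row = df_mae.loc[df_mae["MAE"].idxmin()]
best_max_leaf_nodes = int(best_row["max_leaf_nodes"])
best_mae = float(best_row["MAE"])

print(best_max_leaf_nodes, best_mae)
```

In the real loop, each iteration would append a dict like {"max_leaf_nodes": max_leaf_nodes, "MAE": mae} to records, and the dataframe would be created once after the loop finishes.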

To reduce the execution time of the loop, I would increase the step in the range call from 2 to a larger number. Once you find the interval that produces the lowest MAE, you can search only that interval with a smaller step to refine the result. In other words, exhaustively searching the entire hyperparameter grid is not the best approach.
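The coarse-then-fine idea could be sketched roughly as below. The helper coarse_to_fine_search, its parameters, and toy_score are hypothetical names introduced for illustration; toy_score is a stand-in so the sketch runs without any data. It also assumes the error curve has a single broad minimum, which is not guaranteed for a real model, so a coarse pass can miss a narrow dip.

```python
from typing import Callable, Tuple


def coarse_to_fine_search(
    score: Callable[[int], float],
    low: int,
    high: int,
    coarse_step: int,
    fine_step: int,
) -> Tuple[int, float]:
    """Return (best_value, best_score), minimising score over [low, high)."""
    # Pass 1: evaluate a sparse grid over the whole range.
    coarse = {n: score(n) for n in range(low, high, coarse_step)}
    best_coarse = min(coarse, key=coarse.get)

    # Pass 2: evaluate a dense grid only around the coarse winner.
    fine_low = max(low, best_coarse - coarse_step)
    fine_high = min(high, best_coarse + coarse_step + 1)
    fine = {n: score(n) for n in range(fine_low, fine_high, fine_step)}
    best_fine = min(fine, key=fine.get)
    return best_fine, fine[best_fine]


# Stand-in score with its minimum at 250, used only to make the sketch runnable.
# With the question's code you would pass something like
#   lambda n: get_mae(n, train_X, val_X, train_y, val_y)
def toy_score(n: int) -> float:
    return abs(n - 250) + 10.0


best_n, best_score = coarse_to_fine_search(toy_score, 2, 10000, 500, 2)
print(best_n, best_score)
```

In this toy run the two passes evaluate 271 candidates (20 coarse, 251 fine) instead of the 4,999 that range(2, 10000, 2) would visit.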

Alternatively, you could use other methods such as Hyperopt or Hyperopt-sklearn.

Answered By - Daniel Labbe

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
