Issue
I made a simple DecisionTreeRegressor and want to get the best max_leaf_nodes
value.
Code:
import pandas as pd
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.model_selection import train_test_split as TTS
from sklearn.tree import DecisionTreeRegressor

# Split the data into two parts: training data and validation data
train_X, val_X, train_y, val_y = TTS(X, y, random_state=0)
# Define and fit the model with the training data
model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)
# Predict
val_prediction = model.predict(val_X)
# Check predictions
print(MAE(val_y, val_prediction))
# Define the get_mae function
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    # mean_absolute_error is imported under the alias MAE above
    mae = MAE(val_y, preds_val)
    return mae
# DataFrame and list
df_mae = pd.DataFrame(columns=["MAE"])
li = []
# Collect the MAE for each max_leaf_nodes value
for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    li.append(mae)
How can I add the values of li to the "MAE" column of df_mae?
Is there a better way to find a good max_leaf_nodes value? (My laptop was working on that for loop for 25 minutes.)
Solution
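You could add each row directly to the DataFrame inside the loop instead of creating a list first. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so assign to a new index label with .loc:
df_mae.loc[len(df_mae)] = [mae]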
However, if you prefer to add the whole list at once instead of individual values (outside the for loop), build the DataFrame from it:
df_mae = pd.DataFrame({'MAE': li})
Be aware that you need to store the max_leaf_nodes values as well; otherwise the resulting DataFrame won't tell you which value produced which MAE.
li = []
max_leaf_nodes_list = []
for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    li.append(mae)
    max_leaf_nodes_list.append(max_leaf_nodes)
# Build the DataFrame once, after the loop (DataFrame.append was removed in pandas 2.0)
df_mae = pd.DataFrame({'MAE': li, 'max_leaf_nodes': max_leaf_nodes_list})
or, adding the values to the DataFrame directly:
df_mae = pd.DataFrame(columns=["MAE", "max_leaf_nodes"])
for max_leaf_nodes in range(2, 10000, 2):
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    # DataFrame.append was removed in pandas 2.0; .loc with a new label adds a row
    df_mae.loc[len(df_mae)] = [mae, max_leaf_nodes]
To reduce the execution time with this approach, I would increase the step in the range function from 2 to a larger number. Once you find the interval that produces the best values, you can search only that interval with a smaller step to find an even better value. In other words, exhaustively searching the entire hyperparameter grid is not the best approach.
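The coarse-then-fine idea could look roughly like the sketch below. The helper name coarse_to_fine_search, its parameters, and the toy_score function are made up for illustration and are not part of scikit-learn or pandas. It also assumes the error curve has a single broad minimum; a noisy validation MAE can have several local dips, in which case the coarse pass may pick the wrong interval.

```python
def coarse_to_fine_search(score_fn, low, high, coarse_step, fine_step=2):
    """Return (best_value, best_score) minimizing score_fn over range(low, high).

    Hypothetical helper: scans a sparse grid first, then a dense grid
    around the best sparse point.
    """
    # Pass 1: evaluate a sparse grid over the whole range.
    coarse_grid = range(low, high, coarse_step)
    best_coarse = min(coarse_grid, key=score_fn)

    # Pass 2: evaluate a dense grid one coarse step either side of the winner.
    fine_low = max(low, best_coarse - coarse_step)
    fine_high = min(high, best_coarse + coarse_step + 1)
    scores = {n: score_fn(n) for n in range(fine_low, fine_high, fine_step)}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy stand-in for get_mae so the sketch runs on its own:
# an invented error curve whose minimum is at n = 430.
def toy_score(n):
    return abs(n - 430) + 1000.0 / n


best_n, best_score = coarse_to_fine_search(toy_score, 2, 10000, coarse_step=500)
print(best_n, best_score)
```

With the code from the question, the scoring function would be something like lambda n: get_mae(n, train_X, val_X, train_y, val_y). In the toy run above the search evaluates about 520 candidates instead of the 4,999 in the full loop.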
Alternatively, you could use other methods such as Hyperopt or Hyperopt-sklearn.
Answered By - Daniel Labbe