Issue
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

titanic = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")
titanic_train_labels = titanic['Survived'].copy()
titanic = titanic.drop(columns='Survived')
**Pipeline**
titanic_num = ['Age', 'Fare']
titanic_cat = ['Sex', 'Embarked']
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("enc", OneHotEncoder(drop='if_binary')),
])

def full_pipeline(num_attribs, cat_attribs):
    return ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])
titanic_prepared = full_pipeline(titanic_num, titanic_cat)
titanic_clean = titanic_prepared.fit_transform(titanic)
**Here, I'm preparing the test data via the same pipeline**
titanic_test_num = titanic_num
titanic_test_cat = titanic_cat
titanic_test_prepared = full_pipeline(titanic_test_num, titanic_test_cat)
titanic_test_clean = titanic_test_prepared.fit_transform(titanic_test)
final_model.fit(titanic_clean, titanic_train_labels)
The line producing the error in the title:
final_model.predict(titanic_test_clean)
Printing useful info that may give hints about the problem:
titanic_clean[0] -> array([-0.56573646, -0.50244517, 1., 0., 0., 1., 0.])  # 7 items
titanic_test_clean[0] -> array([ 0.38623105, -0.49741333, 1., 0., 1., 0.])  # 6 items
From the output above, I assume the problem is the mismatched number of OneHotEncoder output columns. My suspicion was that the sets of categorical values differ between the train and test sets, but they are actually the same.
The link to the dataset: https://github.com/minsuk-heo/kaggle-titanic/blob/master/input/test.csv
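One way to reproduce the mismatch in isolation is to fit a fresh OneHotEncoder on each split of a toy column, as the code above effectively does. The column values below are invented for the sketch, but they mirror the real Embarked column, where train.csv has a couple of missing entries and test.csv has none:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the two splits: the "train" column contains a missing
# value, the "test" column does not. Each split gets its own freshly
# fitted encoder, just like calling fit_transform twice.
train_col = pd.DataFrame({"Embarked": ["S", "C", "Q", np.nan]})
test_col = pd.DataFrame({"Embarked": ["S", "C", "Q"]})

n_train = OneHotEncoder().fit_transform(train_col).shape[1]
n_test = OneHotEncoder().fit_transform(test_col).shape[1]
print(n_train, n_test)  # 4 3 on recent scikit-learn: NaN becomes its own category
```

So even when the visible category values match, a missing value in one split is enough to change the column count when the encoder is re-fitted.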
Solution
The error you're seeing is indeed caused by OneHotEncoder. Calling fit_transform on the test set re-fits the encoder, so it learns the test set's categories rather than reusing the training set's. Here the two differ: the training Embarked column contains missing values, which recent scikit-learn versions encode as a category of their own, hence 7 training columns versus 6 test columns.
However, I want to point out a more crucial issue: it is not good practice to wrap your pipeline in a function that builds a fresh ColumnTransformer on every call. Usually we assign the pipeline to a variable, call fit_transform on it once with the training data, and then only call transform on the test data:
# Define the pipelines for numerical and categorical attributes
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("enc", OneHotEncoder(drop='if_binary')),
])

# Combine pipelines in a ColumnTransformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, titanic_num),
    ("cat", cat_pipeline, titanic_cat),
])

# Fit and transform the training data
titanic_clean = full_pipeline.fit_transform(titanic)

# Transform the test data using the same (already fitted) transformations
titanic_test_clean = full_pipeline.transform(titanic_test)

# Model fitting and prediction
final_model.fit(titanic_clean, titanic_train_labels)
predictions = final_model.predict(titanic_test_clean)
This approach ensures that the same transformations are applied to both datasets, keeping the feature set consistent. The OneHotEncoder inside the ColumnTransformer learns its categories from the training data and applies that same encoding to the test data, resolving the feature mismatch.
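The fit-once/transform-later contract can be seen in a minimal, self-contained sketch (toy data, column name invented for the example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"port": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"port": ["S", "S"]})  # "C" and "Q" never appear here

enc = OneHotEncoder()
X_train = enc.fit_transform(train)  # learns categories C, Q, S from train
X_test = enc.transform(test)        # reuses them: same number of columns

print(X_train.shape[1], X_test.shape[1])  # 3 3
```

Because transform reuses the categories learned during fit, the test matrix keeps the training width even when some categories are absent from the test rows.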
Answered By - DataJanitor