Issue
I am learning machine learning (ML), and I decided to apply it to classifying email as spam or non-spam.
My example data consists of the email subject, importance, and sender, each stored as a string. I want to convert each value into a vector like [1,0,0] so that I can tell the values apart.
The error I am running into is that I cannot replace a value with its vector because the sizes do not match.
def vec(u_v):
    y = len(u_v)
    x = [0] * y
    for j in range(y):
        x[j] = 1
        u_v[j] = tuple(x.copy())
        x = [0] * y
    return u_v

def arrange(df):
    organized_df = df.copy()
    for i in df.columns:
        unique_values = df[i].unique()
        replacement_values = vec(unique_values)
        for j in range(len(unique_values)):
            organized_df[i] = organized_df[i].replace({unique_values[j]: replacement_values[j]})
    return organized_df
These are the two functions I am using to organize the dataframe. This is the error I receive:
ValueError: operands could not be broadcast together with shapes (1000,) (6,)
I was expecting something like this:
| Subject | Importance |
| -------- | -------- |
| [1,0,0] | [0,0,1] |
| [0,1,0] | [1,0,0] |
Solution
With pandas, you can achieve this using get_dummies:
Each variable is converted in as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.
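import pandas as pd

df = pd.DataFrame({
    'Subject': ['spam', 'not-spam', 'spam', 'not-spam'],
    'Importance': ['low', 'high', 'medium', 'low']
})
# dtype=int gives 0/1 columns; pandas 2.0 and later default to True/False
organized_df = pd.get_dummies(df, dtype=int)
print(organized_df)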
Output:
| Subject_not-spam | Subject_spam | Importance_high | Importance_low | Importance_medium |
| -------- | -------- | -------- | -------- | -------- |
| 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
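If you really need each cell to hold its own vector, as in the expected table from the question, something like the following might work. This is only a minimal sketch: the helper name one_hot_cells is made up for illustration, and it assumes every column is categorical. Note that most ML libraries expect the flat 0/1 columns shown above rather than lists stored inside cells.

```python
import pandas as pd

def one_hot_cells(df):
    """Hypothetical helper: replace each cell with a one-hot list.

    The vector positions follow the sorted unique values of each column,
    which is the column order that pd.get_dummies produces.
    """
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        # One 0/1 column per distinct value in this column
        dummies = pd.get_dummies(df[col], dtype=int)
        # Collapse each row of 0/1 values into a single Python list
        out[col] = pd.Series(dummies.to_numpy().tolist(), index=df.index)
    return out

df = pd.DataFrame({
    'Subject': ['spam', 'not-spam', 'spam', 'not-spam'],
    'Importance': ['low', 'high', 'medium', 'low']
})
vectors = one_hot_cells(df)
print(vectors)
```

Here 'spam' becomes [0, 1] and 'not-spam' becomes [1, 0], because get_dummies sorts the distinct values before assigning positions.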
Answered By - Hamza NABIL