Issue
I have a dataframe called 'df' that contains 3 columns: X, Y and Z.
X Y Z
NaN NaN ZVal1
NaN NaN ZVal2
XVal1 NaN NaN
NaN YVal1 NaN
Each column carries data of the specific type (X,Y,Z) so if there is data in column X for a particular row, there will be no data in columns Y/Z because it is not of type X.
If you combine all 3 columns, they "slide into each other" neatly and you will get values all the way down.
How can I combine these 3 columns into a 4th column so that NaN values are ignored and we just get whatever single value exists in the 3 columns for that row?
Expected output:
X Y Z XYZ
NaN NaN ZVal1 ZVal1
NaN NaN ZVal2 ZVal2
XVal1 NaN NaN XVal1
NaN YVal1 NaN YVal1
Dataframe code:
import pandas as pd
import numpy as np
# np.nan is used instead of np.NaN, since the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame(
    columns=['X', 'Y', 'Z'],
    data=[
        [np.nan, np.nan, 'ZVal1'],
        [np.nan, np.nan, 'ZVal2'],
        ['XVal1', np.nan, np.nan],
        [np.nan, 'YVal1', np.nan],
    ],
)
Right now I am trying to do something along the lines of:
df['XYZ'] = df['X'].astype(str) + df['Y'].astype(str) + df['Z'].astype(str)
but that combines the NaN values into one long string.
Solution
With stack:
df["XYZ"] = df.stack().values
to get
>>> df
X Y Z XYZ
0 NaN NaN ZVal1 ZVal1
1 NaN NaN ZVal2 ZVal2
2 XVal1 NaN NaN XVal1
3 NaN YVal1 NaN YVal1
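This works because you guarantee exactly one non-NaN value per row and stack drops NaNs by default, so the stacked values line up one-to-one with the rows. Note that it assumes df holds only the X, Y and Z columns when stack is called, and that it relies on the default NaN-dropping behaviour: with future_stack=True (available from pandas 2.1) NaNs are kept, so an explicit .dropna() would be needed.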
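If you would rather not depend on positional alignment or on stack's default NaN handling, here is a minimal sketch of an index-aligned variant. It is not part of the original answer. The names cols and stacked are only illustrative, and the sketch assumes each row has at most one non-NaN value among X, Y and Z.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=["X", "Y", "Z"],
    data=[
        [np.nan, np.nan, "ZVal1"],
        [np.nan, np.nan, "ZVal2"],
        ["XVal1", np.nan, np.nan],
        [np.nan, "YVal1", np.nan],
    ],
)

# Restrict to the source columns so re-running after XYZ exists is harmless.
cols = ["X", "Y", "Z"]

# Stack to a long Series indexed by (row label, column label).
# The explicit dropna() makes the result independent of whether
# stack itself drops NaNs.
stacked = df[cols].stack().dropna()

# Assumption check: no row contributes more than one value.
assert not stacked.index.get_level_values(0).duplicated().any()

# Drop the column-label level so the Series is indexed by row label only;
# the assignment then aligns on the index rather than on position.
df["XYZ"] = stacked.droplevel(1)

print(df)
```

Because the assignment aligns on the row index, a row with no value at all would simply end up with NaN in XYZ instead of causing a length mismatch.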
Another way with fancy indexing:
df["XYZ"] = df.to_numpy()[np.arange(len(df)),
                          df.columns.get_indexer(df.notna().idxmax(axis=1))]
which, for each row, finds the column position of the first non-NaN value and selects the value at that position.
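To make the one-liner easier to follow, here is the same fancy-indexing approach split into steps. This is only a sketch of the answer's logic, and the intermediate names (mask, first_col, col_pos, row_pos, values) are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=["X", "Y", "Z"],
    data=[
        [np.nan, np.nan, "ZVal1"],
        [np.nan, np.nan, "ZVal2"],
        ["XVal1", np.nan, np.nan],
        [np.nan, "YVal1", np.nan],
    ],
)

# Boolean frame: True where a cell holds a value.
mask = df.notna()

# For each row, the label of the first column that is True.
first_col = mask.idxmax(axis=1)

# Translate column labels into integer positions.
col_pos = df.columns.get_indexer(first_col)

# One row position per row: 0, 1, 2, ...
row_pos = np.arange(len(df))

# Pair up (row, column) positions and pick one cell per row.
values = df.to_numpy()[row_pos, col_pos]

df["XYZ"] = values
print(df)
```

For the sample data, col_pos comes out as [2, 2, 0, 1], i.e. Z, Z, X, Y. A row with no value at all would have an all-False mask, so idxmax would fall back to the first column and the selected cell would be NaN.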
Answered By - Mustafa Aydın