Issue
I am rusty with Pandas, please be gentle!
I have a dataframe that is (349, 17) of various water sample values (pH, salinity, temperature, etc). I am using the PyCO2SYS toolbox to calculate chemical outputs. I've created a function that should use the dataframe row index, pull variables from the specified column associated with the row, and return the variable (using PyCO2SYS) I want.
Here's the function:
def pCO2_column(i):
    # input variables
    PAR1 = df['TA (umol/kg)'][i]  # ALK
    PAR2 = df['pH'][i]  # pH
    SAL = df['Sal psu'][i]  # Salinity
    TEMPIN = df['Temp C'][i]  # Temperature (input)
    TEMPOUT = TEMPIN  # Temperature (output)
    PRESIN = df['Pressure psi a'][i]  # Pressure (input)
    PRESOUT = PRESIN  # Pressure (output)
    # Result I want to add into column
    pCO2_out = pyco2.sys(PAR1, PAR2, PAR1TYPE, PAR2TYPE, SAL, TEMPIN, TEMPOUT, PRESIN, PRESOUT, pHSCALEIN, K1K2CONSTANTS, KSO4CONSTANTS)["pCO2_out"]
    return pCO2_out
Note: the other parameters were globally defined; the ones in the function are the ones that will change with each row
I want to use this function for every row index, to create a column of those values I want. I have been able to do it in a clunky way but I want to optimize it. One way I did it was to apply my function to each row based on that index:
df['pCO2_out (μatm)'] = df.apply(lambda row: pCO2_column(df.index), axis=0)
HOWEVER, when I first run it, it gives me the following error:
ValueError: Wrong number of items passed 17, placement implies 1
If I change it to axis=1, each row contains EVERY value calculated for all the rows, in an array.
(https://i.stack.imgur.com/5QaMH.png)
If I change it back to axis=0, it populates correctly, with a single unique value in each row.
(https://i.stack.imgur.com/qkaAM.png)
I know I could also loop through each row, fill an array with the values, then insert that array as a new column...
This seems incredibly simple but I don't know where I've gone wrong. Any advice?
Solution
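Your lambda is structured incorrectly: it ignores its row argument and passes the whole df.index to the function, so every call computes the values for all rows at once. Check the DataFrame.apply documentation or see some examples online.
Specifically, you don't need to iterate by index, because with axis=1 apply already passes each row to your function. To fix your code, see the minimal example below: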
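import numpy as np
import pandas as pd

df_p = pd.DataFrame({'pH': np.random.random(10)})

def pCO2_column(row):
    # input variables
    PAR2 = row['pH']  # pH value for this row only
    return PAR2

# axis=1 passes each row (as a Series) to the function
df_p.apply(pCO2_column, axis=1)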
Notice that the function doesn't need the row index: with axis=1, each row is passed in as a Series, so row['pH'] selects that row's value directly.
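To show how the same pattern might carry over to the function in the question, here is a rough sketch using the question's column names. The sample values are made up, and fake_pco2 is a hypothetical placeholder standing in for the pyco2.sys(...)["pCO2_out"] call; it does no real chemistry and is only there so the example runs without PyCO2SYS.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real (349, 17) dataframe; the values are made up.
df = pd.DataFrame({
    "TA (umol/kg)": [2300.0, 2310.0, 2295.0],
    "pH": [8.05, 8.10, 7.98],
    "Sal psu": [35.0, 34.8, 35.2],
    "Temp C": [25.0, 24.5, 26.0],
    "Pressure psi a": [14.7, 14.7, 14.7],
})

def fake_pco2(alk, ph, sal, temp, pres):
    """Hypothetical placeholder for the PyCO2SYS call; NOT real chemistry."""
    return alk / ph

def pCO2_column(row):
    # Each argument is a scalar taken from the current row (a Series).
    par1 = row["TA (umol/kg)"]     # ALK
    par2 = row["pH"]               # pH
    sal = row["Sal psu"]           # Salinity
    tempin = row["Temp C"]         # Temperature
    presin = row["Pressure psi a"] # Pressure
    # In the real code, the pyco2.sys(...)["pCO2_out"] call would go here.
    return fake_pco2(par1, par2, sal, tempin, presin)

out_col = "pCO2_out (μatm)"
# Pass the function itself; apply calls it once per row when axis=1.
df[out_col] = df.apply(pCO2_column, axis=1)
print(df[out_col])
```

Because the function receives one row at a time, each cell of the new column holds a single value rather than an array of all results.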
Answered By - Suraj Shourie