Issue
I have a dataframe with columns as below -
u'wellthie_issuer_identifier', u'issuer_name', u'service_area_identifier', u'hios_plan_identifier', u'plan_year', u'type'
I need to validate values in each column and finally have a dataframe which is valid.
For example, I need to check if plan_year
column satisfies below validation
presence: true, numericality: true, length: { is: 4 }
hios_plan_identifier
column satisfies below regex.
format: /\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\z/,
presence: true, length: { minimum: 10 },
type
column contains,
in: ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan']
There are lot of columns which I need to validate. I have tried to give an example data.
I am able to check regex with str.contains('\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z', regex=True)
Similary I can check other validation as well individually. I am confused as to how to put all the validation together. Should I put all in a if
loop with and
conditions. Is there a easy way to validate the dataframe columns ? Need help here
Solution
There are multiple pandas functions you could use of. Basically the syntax you could use to filter your dataframe by content is:
df = df[(condition1) & (condition2) & ...] # filter the df and assign to the same df
Specifically for your case, you could replace condition
with following functions(expressions):
df[some_column] == some_value
df[some_column].isin(some_list_of_values) # This check whether the value of the column is one of the values in the list
df[some_column].str.contains() # You can use it the same as str.contains()
df[some_column].str.isdigit() # Same usage as str.isdigit(), check whether string is all digits, need to make sure column type is string in advance
df[some_column].str.len() == 4 # Filter string with length of 4
Finally, if you want to reset the index, you could use df = df.reset_index(drop=True)
to reset your output df index to 0,1,2,...
Edit: To check for NaN, NaT, None values you could use
df[some_column].isnull()
For multiple columns, you could use
df[[col1, col2]].isin(valuelist).all(axis=1)
Answered By - Kevin Fang
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.