How to drop rows of Pandas DataFrame whose value in a certain column is NaN
To get rid of rows with NaNs in a specified column:
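A minimal sketch of that one-liner; the DataFrame and the column names price and qty are purely illustrative:

```python
import pandas as pd
import numpy as np

# Toy DataFrame; 'price' and 'qty' are made-up column names
df = pd.DataFrame({
    "price": [10.0, np.nan, 25.0, 7.5],
    "qty": [1, 2, np.nan, 4],
})

# The one-liner: keep only the rows where 'price' is not NaN
df = df.dropna(subset=["price"])
```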
While this one-liner essentially solves the issue, nuances lurk beneath the surface. Let's uncover some gems that deal with NaN values, cater to special situations, and adopt best practices for DataFrame sanitization.
Drop NaN with Different Scopes and Conditions
Selective removal, or how I learned to not kill all the NaNs in sight
Weed out NaN values directly from one or multiple columns:
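For example, assuming a DataFrame df with the illustrative columns from the snippet above:

```python
# A NaN in one specific column removes the row
df.dropna(subset=["price"])

# A NaN in any of the listed columns removes the row
df.dropna(subset=["price", "qty"])
```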
Dropping NaN in-place; because who likes re-assignments?
Save the result back into df without an additional assignment:
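A sketch of the in-place variant, reusing the toy df:

```python
# Mutates df directly; dropna returns None when inplace=True
df.dropna(subset=["price"], inplace=True)
```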
Cut-off Thresholds; Data Cleaning Meets the High Jump
Remove rows that don't meet a certain count of non-NaN values (the threshold):
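For instance, with a two-column DataFrame like the toy df, a threshold of 2 keeps only rows where both values are present:

```python
# Keep rows that have at least 2 non-NaN values; drop the rest
df.dropna(thresh=2)
```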
All or nothing, the NaN version
A stricter condition for removal: discard only the rows that have NaN values in all columns:
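A sketch:

```python
# Drop only the rows in which every single column is NaN
df.dropna(how="all")
```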
The Boolean Mask, not a new Superhero
Create a mask for rows with valid values and apply it:
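One way to do it, again assuming the illustrative 'price' column:

```python
# True where 'price' holds a valid (non-NaN) value
mask = df["price"].notna()

# Keep only the rows flagged True by the mask
df = df[mask]
```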
The Whole Shebang: Dropping rows/columns, thresholds and masks
Sweep, don't weep
Get rid of any row that contains at least one NaN value with df.dropna(how='any'):
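For example:

```python
# Remove every row that has a NaN in any column
df = df.dropna(how="any")
```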
X-rays for your DataFrame
Before deleting, it helps to know where the NaNs are. With isna().any(axis=1) you can see the affected rows:
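A quick sketch of that diagnostic step:

```python
# Rows that contain at least one NaN, for inspection before dropping
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan)
```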
Null-cypher: Dropping columns with NaNs
What if columns rather than rows are jammed with NaN values? You deal with them by simply switching the axis in dropna:
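For example:

```python
# Drop every column that contains at least one NaN
df = df.dropna(axis=1)

# The string alias works too
df = df.dropna(axis="columns")
```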
Understand NaNs: Sometimes Absence Makes the Data Grow Fonder
NaNs aren't always a problem; they can simply signify missing information. Sometimes it is more beneficial to impute them with a statistical measure (mean, median) or a constant value than to drop the rows. The pattern of NaNs can also provide insights into data quality or bias.
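A sketch of the imputation alternative, using the same illustrative columns (the fill values are arbitrary):

```python
# Replace NaNs in 'price' with the column mean
df["price"] = df["price"].fillna(df["price"].mean())

# Replace NaNs in 'qty' with a constant sentinel value
df["qty"] = df["qty"].fillna(0)
```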