14. Data Cleaning in Pandas
Duplicacy and Missing Values
When working with data, you'll often encounter duplicate records or missing values. This guide shows how to find and handle them using pandas.
Sample dataset (duplicates)
0
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
1
C002
Riya Sharma
riya.sharma@gmail.com
9123456780
Delhi
2024-01-07
2
C003
Arjun Verma
arjun.verma@gmail.com
9988776655
Mumbai
2024-01-10
3
C004
Neha Agarwal
neha.agrawal@gmail.com
9090909090
Pune
2024-01-11
4
C005
Sunil Mehra
sunil.mehra@gmail.com
9876501234
Kanpur
2024-01-12
5
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
6
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
7
C002
Riya Sharma
riya.sharma@gmail.com
9123456780
Delhi
2024-01-07
8
C003
Arjun Verma
arjun.verma@gmail.com
9988776600
Mumbai
2024-01-10
9
C004
Neha Agarwal
neha.a@gmail.com
9090909090
Pune
2024-01-11
10
C006
Sunil Kumar
sunil.mehra@gmail.com
9876511111
Kanpur
2024-01-12
11
C007
Meera Singh
meera.singh@gmail.com
9876543210
Delhi
2024-01-15
12
C008
Rohan Singh
rohan@gmail.com
9876501234
Kanpur
2024-01-20
13
C009
Sunita Sharma
sunil.mehra@gmail.com
9999999999
Mumbai
2024-01-21
14
C010
Amit Tandon
amit@gmail.com
9123456780
Pune
2024-01-22
How to find exact duplicate rows
Use duplicated() with loc to filter duplicate rows:
Output:
5
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
6
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
7
C002
Riya Sharma
riya.sharma@gmail.com
9123456780
Delhi
2024-01-07
Note: The keep parameter controls which duplicates are considered duplicates.
subset parameter
subset helps find duplicates considering only selected column(s):
Example output (duplicates by Email):
0
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
1
C002
Riya Sharma
riya.sharma@gmail.com
9123456780
Delhi
2024-01-07
2
C003
Arjun Verma
arjun.verma@gmail.com
9988776655
Mumbai
2024-01-10
4
C005
Sunil Mehra
sunil.mehra@gmail.com
9876501234
Kanpur
2024-01-12
5
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
6
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
7
C002
Riya Sharma
riya.sharma@gmail.com
9123456780
Delhi
2024-01-07
8
C003
Arjun Verma
arjun.verma@gmail.com
9988776600
Mumbai
2024-01-10
10
C006
Sunil Kumar
sunil.mehra@gmail.com
9876511111
Kanpur
2024-01-12
13
C009
Sunita Sharma
sunil.mehra@gmail.com
9999999999
Mumbai
2024-01-21
Example: find records where Name and City are same
drop_duplicates()
Drop duplicates with drop_duplicates():
Output (after dropping exact duplicate rows):
0
C001
Abhishek Mishra
abhishek@gmail.com
9876543210
Kanpur
2024-01-05
1
C002
Riya Sharma
riya.sharma@gmail.com
9123456780
Delhi
2024-01-07
2
C003
Arjun Verma
arjun.verma@gmail.com
9988776655
Mumbai
2024-01-10
3
C004
Neha Agarwal
neha.agrawal@gmail.com
9090909090
Pune
2024-01-11
4
C005
Sunil Mehra
sunil.mehra@gmail.com
9876501234
Kanpur
2024-01-12
8
C003
Arjun Verma
arjun.verma@gmail.com
9988776600
Mumbai
2024-01-10
9
C004
Neha Agarwal
neha.a@gmail.com
9090909090
Pune
2024-01-11
10
C006
Sunil Kumar
sunil.mehra@gmail.com
9876511111
Kanpur
2024-01-12
11
C007
Meera Singh
meera.singh@gmail.com
9876543210
Delhi
2024-01-15
12
C008
Rohan Singh
rohan@gmail.com
9876501234
Kanpur
2024-01-20
13
C009
Sunita Sharma
sunil.mehra@gmail.com
9999999999
Mumbai
2024-01-21
14
C010
Amit Tandon
amit@gmail.com
9123456780
Pune
2024-01-22
Example: delete rows where Phone is same (possible fraud case):
Example: delete rows where Phone and Email are same:
Combining steps to clean the data (example workflow provided in original content):
Re-read CSV
drop exact duplicates
drop duplicates by Email or Phone as needed
(See original code examples for exact sequences and outputs.)
Missing Value in Pandas
Missing values must be handled because they can break calculations, introduce bias, or cause errors in analysis and models. The examples below use a sample CSV called missing file.csv.
0
c001
Rahul Sharma
28.0
Delhi
rahul@example.com
1200.0
1
c002
Priya Singh
NaN
Mumbai
priya@example.com
1500.0
2
NaN
NaN
NaN
NaN
NaN
NaN
3
c003
Amit Verma
35.0
NaN
amit@example.com
NaN
4
c004
NaN
42.0
Pune
NaN
2000.0
5
c005
Neha Rao
30.0
Bangalore
neha@example.com
1800.0
6
NaN
NaN
NaN
NaN
NaN
NaN
7
c006
Ravi Kumar
NaN
Chennai
ravi@example.com
NaN
8
c007
NaN
25.0
NaN
NaN
NaN
9
c008
Sonia Mehra
29.0
Delhi
sonia@example.com
1600.0
10
NaN
NaN
NaN
NaN
NaN
NaN
isna() / isnull()
isna()/isnull() indicate missing values (True if missing):
Output (boolean mask) — see original output table.
isna().sum() / isnull().sum()
Count missing per column:
Example output:
CustomerID 3 Name 5 Age 5 City 5 Email 5 PurchaseAmount 6 dtype: int64
Count missing per row (axis=1)
The missing column shows the number of NaNs per row (see original table).
Filter rows with missing in a specific column
notna() / notnull()
notna()/notnull() indicate non-missing values (True if present).
Example output:
CustomerID 8 Name 6 Age 6 City 6 Email 6 PurchaseAmount 5 dtype: int64
Count non-missing per row:
Filter rows where Age is not missing:
fillna()
fillna() is a Pandas function used to replace missing values (NaN) in a DataFrame or Series. It helps ensure the dataset is complete and avoids errors during analysis or calculations.
When to use fillna():
fillna():When your dataset contains missing values that need to be handled.
When you want to replace missing entries with:
A fixed value →
df.fillna(0)Statistical values like mean/median/mode →
df['col'].fillna(df['col'].mean())
Before performing visualizations, aggregations, or machine learning where NaN cannot be used.
fillna() helps maintain data consistency and prepares the dataset for further processing.
Example using a dictionary to fill different columns differently:
Fill a single column with its mean:
dropna()
dropna() is a Pandas function used to remove rows or columns that contain missing values (NaN). It helps clean the dataset by eliminating incomplete data points that may affect analysis.
When to use dropna():
dropna():When missing values are few and removing them will not affect the dataset significantly.
When you want only fully complete records for accurate analysis.
dropna() — how Parameter
how ParameterThe how parameter tells Pandas when a row or column should be dropped based on missing values.
Very Simple Explanation:
how='any' → Drop the row/column if it has even one NaN
how='all' → Drop the row/column only if all values are NaN
Examples:
df.dropna(how='any')→ Remove rows that contain at least one missing value.df.dropna(how='all')→ Remove rows only if every value in that row is missing.
Simple idea:
'any' = even one NaN → gone
'all' = all NaN → gone
subset parameter
dropna() — subset Parameter
subset ParameterThe subset parameter tells Pandas which specific columns to check for missing values when deciding to drop a row.
Simple Meaning:
Instead of checking the whole row, Pandas will only look at the columns you mention in subset.
Why use it?
Because sometimes you only care about missing values in important columns, not all columns.
Example:
df.dropna(subset=['Email', 'Phone'])
This means:
Drop the row only if Email or Phone is missing
Ignore missing values in other columns
Simple idea:
subset = “check missing values only in these columns”
Check only specific columns when deciding to drop rows:
This drops rows where Email OR Phone is missing (depending on how).
Examples:
thresh parameter
dropna() — thresh Parameter
thresh ParameterThe thresh parameter tells Pandas how many non-missing values (not NaN) a row must have to be kept.
Simple Meaning:
Keep the row only if it has at least a certain number of valid (non-NaN) values.
If it has fewer valid values than the number you give → the row is dropped.
Example:
df.dropna(thresh=3)
This means:
A row must have at least 3 non-null values to stay.
If it has 2 or fewer non-null values → drop it.
Simple idea:
thresh = minimum number of real (non-NaN) values required to keep the row
or restrict to subset:
2. dropna() Practice
Drop all rows that contain any missing value.
Drop all rows that contain only missing values.
Drop rows where the Email column is missing.
Drop rows where Name OR City is missing.
Drop rows where Name AND Email are missing using
subset.Keep only those rows that have at least 3 non-missing values (use
thresh).
(Use df.dropna() with how, subset, and thresh as shown above.)
3. fillna() Practice
Fill missing values in City with
"Unknown".Fill missing numeric values in "Age" with the mean.
Fill missing numeric values in "Salary" with the median.
Fill missing values of "Phone" with
"Not Provided".Fill missing Email values with
"test@example.com".
(Use df['City'] = df['City'].fillna('Unknown'), df['Age'].fillna(df['Age'].mean()), etc.)
5. Custom Practical Tasks
Create a new DataFrame keeping only rows where at least 2 columns have values.
Fill missing values in "City" with
"NA".Replace missing "Age" values with 0.
Replace all missing values in the entire DataFrame with
"Missing".Fill missing values only in numeric columns using their mean.
Last updated