14. Data Cleaning in Pandas

Duplicates and Missing Values

When working with data, you'll often encounter duplicate records or missing values. This guide shows how to find and handle them using pandas.

import pandas as pd

Sample dataset (duplicates)

df = pd.read_csv('duplicate dataset.csv')
df

| CustomerID | CustomerName | Email | Phone | City | SignupDate |
| --- | --- | --- | --- | --- | --- |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C002 | Riya Sharma | riya.sharma@gmail.com | 9123456780 | Delhi | 2024-01-07 |
| C003 | Arjun Verma | arjun.verma@gmail.com | 9988776655 | Mumbai | 2024-01-10 |
| C004 | Neha Agarwal | neha.agrawal@gmail.com | 9090909090 | Pune | 2024-01-11 |
| C005 | Sunil Mehra | sunil.mehra@gmail.com | 9876501234 | Kanpur | 2024-01-12 |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C002 | Riya Sharma | riya.sharma@gmail.com | 9123456780 | Delhi | 2024-01-07 |
| C003 | Arjun Verma | arjun.verma@gmail.com | 9988776600 | Mumbai | 2024-01-10 |
| C004 | Neha Agarwal | neha.a@gmail.com | 9090909090 | Pune | 2024-01-11 |
| C006 | Sunil Kumar | sunil.mehra@gmail.com | 9876511111 | Kanpur | 2024-01-12 |
| C007 | Meera Singh | meera.singh@gmail.com | 9876543210 | Delhi | 2024-01-15 |
| C008 | Rohan Singh | rohan@gmail.com | 9876501234 | Kanpur | 2024-01-20 |
| C009 | Sunita Sharma | sunil.mehra@gmail.com | 9999999999 | Mumbai | 2024-01-21 |
| C010 | Amit Tandon | amit@gmail.com | 9123456780 | Pune | 2024-01-22 |

How to find exact duplicate rows

duplicated() returns a boolean Series that is True for every row that repeats an earlier row. Pass it to loc to keep only those rows:

df.loc[df.duplicated()]

Output:

| CustomerID | CustomerName | Email | Phone | City | SignupDate |
| --- | --- | --- | --- | --- | --- |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C002 | Riya Sharma | riya.sharma@gmail.com | 9123456780 | Delhi | 2024-01-07 |

Note: The keep parameter controls which occurrence in each group of identical rows is left unmarked; all other occurrences are flagged as duplicates.

keep parameter — options

keep='first' (default): marks every occurrence except the first as a duplicate
keep='last': marks every occurrence except the last as a duplicate
keep=False: marks all occurrences, including the first and the last, as duplicates

Examples:

df.loc[df.duplicated(keep='first')]

df.loc[df.duplicated(keep='last')]

df.loc[df.duplicated(keep=False)]
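
To make the three keep options concrete, here is a minimal sketch on a tiny hand-made DataFrame. The name demo and its values are assumptions for illustration, not rows read from 'duplicate dataset.csv'.

```python
import pandas as pd

# Hypothetical mini-dataset: the C001 row appears three times.
demo = pd.DataFrame({
    "CustomerID": ["C001", "C002", "C001", "C001"],
    "City": ["Kanpur", "Delhi", "Kanpur", "Kanpur"],
})

first_mask = demo.duplicated(keep="first")  # first occurrence stays unmarked
last_mask = demo.duplicated(keep="last")    # last occurrence stays unmarked
all_mask = demo.duplicated(keep=False)      # every repeated row is marked

# Show the three masks side by side, one column per keep option.
print(pd.DataFrame({"first": first_mask, "last": last_mask, "all": all_mask}))
```

Row 1 (C002) is unique, so it is False under every option; only the C001 rows change between the masks.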

subset parameter

subset restricts the comparison to the selected column(s), so rows count as duplicates when they match on those columns alone:

df.loc[df.duplicated(subset='Email', keep=False)]

Example output (duplicates by Email):

| CustomerID | CustomerName | Email | Phone | City | SignupDate |
| --- | --- | --- | --- | --- | --- |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C002 | Riya Sharma | riya.sharma@gmail.com | 9123456780 | Delhi | 2024-01-07 |
| C003 | Arjun Verma | arjun.verma@gmail.com | 9988776655 | Mumbai | 2024-01-10 |
| C005 | Sunil Mehra | sunil.mehra@gmail.com | 9876501234 | Kanpur | 2024-01-12 |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C002 | Riya Sharma | riya.sharma@gmail.com | 9123456780 | Delhi | 2024-01-07 |
| C003 | Arjun Verma | arjun.verma@gmail.com | 9988776600 | Mumbai | 2024-01-10 |
| C006 | Sunil Kumar | sunil.mehra@gmail.com | 9876511111 | Kanpur | 2024-01-12 |
| C009 | Sunita Sharma | sunil.mehra@gmail.com | 9999999999 | Mumbai | 2024-01-21 |

Example: find records where CustomerName and City are the same

df.loc[df.duplicated(subset=['CustomerName','City'], keep=False)]
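
The effect of subset is easiest to see when the same rows are checked three ways. The sketch below uses a small made-up DataFrame (demo is a hypothetical name) in which two customers share an email but differ in name.

```python
import pandas as pd

# Hypothetical rows: two different names share one email address.
demo = pd.DataFrame({
    "CustomerName": ["Sunil Mehra", "Sunil Kumar", "Riya Sharma"],
    "Email": ["sunil.mehra@gmail.com", "sunil.mehra@gmail.com", "riya.sharma@gmail.com"],
    "City": ["Kanpur", "Kanpur", "Delhi"],
})

exact = demo.duplicated(keep=False)                        # compare every column
by_email = demo.duplicated(subset=["Email"], keep=False)   # compare Email only
by_name_city = demo.duplicated(subset=["CustomerName", "City"], keep=False)

print(demo.loc[by_email])  # the two rows that share an email
```

No row is an exact duplicate here, and no pair matches on both CustomerName and City, yet the Email check still flags the first two rows.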

drop_duplicates()

Drop duplicates with drop_duplicates():

df = df.drop_duplicates()
df

Output (after dropping exact duplicate rows):

| CustomerID | CustomerName | Email | Phone | City | SignupDate |
| --- | --- | --- | --- | --- | --- |
| C001 | Abhishek Mishra | abhishek@gmail.com | 9876543210 | Kanpur | 2024-01-05 |
| C002 | Riya Sharma | riya.sharma@gmail.com | 9123456780 | Delhi | 2024-01-07 |
| C003 | Arjun Verma | arjun.verma@gmail.com | 9988776655 | Mumbai | 2024-01-10 |
| C004 | Neha Agarwal | neha.agrawal@gmail.com | 9090909090 | Pune | 2024-01-11 |
| C005 | Sunil Mehra | sunil.mehra@gmail.com | 9876501234 | Kanpur | 2024-01-12 |
| C003 | Arjun Verma | arjun.verma@gmail.com | 9988776600 | Mumbai | 2024-01-10 |
| C004 | Neha Agarwal | neha.a@gmail.com | 9090909090 | Pune | 2024-01-11 |
| C006 | Sunil Kumar | sunil.mehra@gmail.com | 9876511111 | Kanpur | 2024-01-12 |
| C007 | Meera Singh | meera.singh@gmail.com | 9876543210 | Delhi | 2024-01-15 |
| C008 | Rohan Singh | rohan@gmail.com | 9876501234 | Kanpur | 2024-01-20 |
| C009 | Sunita Sharma | sunil.mehra@gmail.com | 9999999999 | Mumbai | 2024-01-21 |
| C010 | Amit Tandon | amit@gmail.com | 9123456780 | Pune | 2024-01-22 |

Example: delete every row whose Phone also appears in another row (possible fraud case). With keep=False, no copy of a repeated phone number is kept:

df = df.drop_duplicates(subset='Phone', keep=False)
df

Example: delete every row whose Email and Phone combination appears more than once:

df = df.drop_duplicates(subset=['Email','Phone'], keep=False)
df
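
The difference between keeping one copy and keeping none is worth checking on a small case. This is a sketch on made-up rows (demo is a hypothetical name), comparing the default keep='first' with keep=False for the Phone column.

```python
import pandas as pd

# Hypothetical rows: C001 and C007 share a phone number.
demo = pd.DataFrame({
    "CustomerID": ["C001", "C007", "C003"],
    "Phone": ["9876543210", "9876543210", "9988776655"],
})

# Default keep='first': one row per phone number survives.
kept_first = demo.drop_duplicates(subset=["Phone"])

# keep=False: every row with a repeated phone number is removed.
kept_none = demo.drop_duplicates(subset=["Phone"], keep=False)

print(kept_first["CustomerID"].tolist())
print(kept_none["CustomerID"].tolist())
```

Use keep=False only when a repeated value makes all of its rows suspect; otherwise the default keeps one representative row.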

Combining the steps to clean the data:

Re-read the CSV to start from the raw data
Drop exact duplicate rows
Drop duplicates by Email or Phone as needed
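
The steps above can be sketched as one small function. This is only one possible sequence: the helper name clean_customers, the inline CSV text standing in for 'duplicate dataset.csv', and the choice to keep the first row per Email are all assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for 'duplicate dataset.csv' so the sketch runs without the file.
csv_text = """CustomerID,CustomerName,Email,Phone
C001,Abhishek Mishra,abhishek@gmail.com,9876543210
C001,Abhishek Mishra,abhishek@gmail.com,9876543210
C005,Sunil Mehra,sunil.mehra@gmail.com,9876501234
C006,Sunil Kumar,sunil.mehra@gmail.com,9876511111
C010,Amit Tandon,amit@gmail.com,9123456780
"""

def clean_customers(raw):
    """Hypothetical helper: drop exact duplicates, then repeated emails."""
    no_exact = raw.drop_duplicates()  # step 2: identical rows
    # Step 3: keeping the first row per Email is an assumed policy.
    no_email_dupes = no_exact.drop_duplicates(subset=["Email"], keep="first")
    return no_email_dupes.reset_index(drop=True)

raw = pd.read_csv(io.StringIO(csv_text), dtype=str)  # step 1: re-read the raw data
cleaned = clean_customers(raw)
print(len(raw), "rows before,", len(cleaned), "rows after")
```

Running the exact-duplicate pass first keeps the later subset pass from having to choose between identical rows.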

Missing Values in Pandas

Missing values must be handled because they can break calculations, introduce bias, or cause errors in analysis and models. The examples below use a sample CSV called missing file.csv.

df = pd.read_csv('missing file.csv')
df

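| CustomerID | Name | Age | City | Email | PurchaseAmount |
| --- | --- | --- | --- | --- | --- |
| c001 | Rahul Sharma | 28.0 | Delhi | rahul@example.com | 1200.0 |
| c002 | Priya Singh | NaN | Mumbai | priya@example.com | 1500.0 |
| NaN | NaN | NaN | NaN | NaN | NaN |
| c003 | Amit Verma | 35.0 | NaN | amit@example.com | NaN |
| c004 | NaN | 42.0 | Pune | NaN | 2000.0 |
| c005 | Neha Rao | 30.0 | Bangalore | neha@example.com | 1800.0 |
| NaN | NaN | NaN | NaN | NaN | NaN |
| c006 | Ravi Kumar | NaN | Chennai | ravi@example.com | NaN |
| c007 | NaN | 25.0 | NaN | NaN | NaN |
| c008 | Sonia Mehra | 29.0 | Delhi | sonia@example.com | 1600.0 |
| NaN | NaN | NaN | NaN | NaN | NaN |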

isna() / isnull()

isna() and its alias isnull() flag missing values: each cell becomes True if the value is missing and False otherwise.

df.isna()

Output: a boolean DataFrame with the same shape as df, where True marks a missing value.

isna().sum() / isnull().sum()

Count missing per column:

df.isnull().sum()

Example output:

CustomerID        3
Name              5
Age               5
City              5
Email             5
PurchaseAmount    6
dtype: int64

Count missing per row (axis=1)

df['missing'] = df.isnull().sum(axis=1)
df

The missing column shows the number of NaN values in each row.

Filter rows with missing in a specific column

df.loc[df['Age'].isnull()]
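
The three checks above (per column, per row, and filtering on one column) can be tried together on a small hand-made DataFrame. The name demo and its values are assumptions for illustration, not the contents of 'missing file.csv'.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, row 2 lacks only Age.
demo = pd.DataFrame({
    "Name": ["Rahul Sharma", None, "Amit Verma"],
    "Age": [28.0, np.nan, np.nan],
    "City": ["Delhi", None, "Pune"],
})

per_column = demo.isna().sum()         # missing values in each column
per_row = demo.isna().sum(axis=1)      # missing values in each row
missing_age = demo.loc[demo["Age"].isna()]  # rows where Age is missing

print(per_column)
print(per_row)
print(missing_age)
```

Both None and np.nan are treated as missing by isna(), so text and numeric columns can be checked the same way.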

notna() / notnull()

notna() and its alias notnull() are the inverse of isna(): each cell becomes True if the value is present and False if it is missing.

df = pd.read_csv('missing file.csv')
df.notna().sum()

Example output:

CustomerID        8
Name              6
Age               6
City              6
Email             6
PurchaseAmount    5
dtype: int64

Count non-missing per row:

df['not missing'] = df.notnull().sum(axis=1)
df

Filter rows where Age is not missing:

df.loc[df['Age'].notnull()]
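
A quick sanity check is that the present and missing counts for each column add up to the number of rows. The sketch below shows this on made-up data (demo is a hypothetical name) and also compares notna().sum() with DataFrame.count(), which counts non-missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
demo = pd.DataFrame({
    "Age": [28.0, np.nan, 35.0],
    "City": ["Delhi", "Mumbai", None],
})

present = demo.notna().sum()  # non-missing values per column
missing = demo.isna().sum()   # missing values per column

# Every cell is either present or missing, so the two counts sum to len(demo).
totals_match = bool(((present + missing) == len(demo)).all())

# count() gives the same per-column numbers as notna().sum().
same_as_count = bool((present == demo.count()).all())

print(present)
print(totals_match, same_as_count)
```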

fillna()

fillna() is a Pandas function used to replace missing values (NaN) in a DataFrame or Series. It helps ensure the dataset is complete and avoids errors during analysis or calculations.

When to use `fillna()`:

When your dataset contains missing values that need to be handled.
When you want to replace missing entries with:
- A fixed value → df.fillna(0)
- Statistical values like mean/median/mode → df['col'].fillna(df['col'].mean())
Before performing visualizations, aggregations, or machine learning where NaN cannot be used.

fillna() helps maintain data consistency and prepares the dataset for further processing.

df = df.fillna(0)
df

Example using a dictionary to fill different columns differently:

df = pd.read_csv('missing file.csv')
d = {
  'CustomerID': 'User',
  'Name': 'user',
  'Age': df['Age'].mean(),
  'City': 'no city',
  'Email': 'no email',
}
df = df.fillna(d)
df

Fill a single column with its mean:

df['PurchaseAmount'] = df['PurchaseAmount'].fillna(df['PurchaseAmount'].mean())
df
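
As a runnable sketch of the two approaches above, the example below fills each column with its own rule and then, separately, fills only the numeric columns with their means. The DataFrame demo and its values are made up, and using the median for PurchaseAmount is just one reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column has at least one missing value.
demo = pd.DataFrame({
    "Age": [28.0, np.nan, 35.0, 42.0],
    "PurchaseAmount": [1200.0, 1500.0, np.nan, 2000.0],
    "City": ["Delhi", None, "Pune", None],
})

# One rule per column, passed as a dictionary.
filled = demo.fillna({
    "Age": demo["Age"].mean(),                          # mean of 28, 35, 42
    "PurchaseAmount": demo["PurchaseAmount"].median(),  # median of the 3 present values
    "City": "no city",
})

# Fill only the numeric columns with their means; text columns are untouched.
numeric_cols = demo.select_dtypes(include="number").columns
numeric_filled = demo.copy()
numeric_filled[numeric_cols] = demo[numeric_cols].fillna(demo[numeric_cols].mean())

print(filled)
print(numeric_filled)
```

mean() and median() skip NaN by default, so the fill value is computed from the values that are present.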

dropna()

dropna() is a Pandas function used to remove rows or columns that contain missing values (NaN). It helps clean the dataset by eliminating incomplete data points that may affect analysis.

When to use `dropna()`:

When missing values are few and removing them will not affect the dataset significantly.
When you want only fully complete records for accurate analysis.

dropna() — `how` Parameter

The how parameter tells Pandas when a row or column should be dropped based on missing values.

Very Simple Explanation:

how='any' → Drop the row/column if it has even one NaN
how='all' → Drop the row/column only if all values are NaN

Examples:

df.dropna(how='any') → Remove rows that contain at least one missing value.
df.dropna(how='all') → Remove rows only if every value in that row is missing.

Simple idea:

'any' = even one NaN → gone
'all' = all NaN → gone

# delete rows if any value is missing
df = df.dropna(how='any')

# delete rows if all values are missing
df = df.dropna(how='all')
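
Applied to the same data, the two settings give different results. This minimal sketch uses a made-up DataFrame (demo is a hypothetical name) with one complete row, one partial row, and one empty row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 complete, row 1 partly missing, row 2 fully missing.
demo = pd.DataFrame({
    "Name": ["Rahul Sharma", None, None],
    "Age": [28.0, 42.0, np.nan],
})

any_dropped = demo.dropna(how="any")  # removes rows 1 and 2
all_dropped = demo.dropna(how="all")  # removes only row 2

print(any_dropped)
print(all_dropped)
```

Neither call changes demo itself; each returns a new DataFrame, which is why the examples in this guide assign the result back to df.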

dropna() — `subset` Parameter

The subset parameter tells Pandas which specific columns to check for missing values when deciding to drop a row.

Simple Meaning:

Instead of checking the whole row, Pandas will only look at the columns you mention in subset.

Why use it?

Because sometimes you only care about missing values in important columns, not all columns.

Example:

df.dropna(subset=['Email', 'Phone'])

This means:

Drop the row only if Email or Phone is missing
Ignore missing values in other columns

Simple idea:

subset = “check missing values only in these columns”

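Whether one missing value is enough depends on how: the default how='any' drops the row if Email or Phone is missing, while how='all' drops it only if both are missing. Note that the sample missing file.csv has no Phone column, so on that data this call raises a KeyError; use columns that exist in your DataFrame.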

Examples:

# delete rows where Age is missing
# (a list is used because it works across pandas versions)
df = df.dropna(subset=['Age'])

# delete rows where Name or City is missing
# (default how='any' removes a row if either column is missing)
df = df.dropna(subset=['Name', 'City'])

# delete rows only where Name, Age, and City are all missing
df = df.dropna(subset=['Name', 'Age', 'City'], how='all')
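
To see how subset interacts with how, the sketch below runs both variants on a small made-up DataFrame. The name demo and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 lacks Name, row 2 lacks everything, row 3 lacks Age.
demo = pd.DataFrame({
    "Name": ["Rahul Sharma", None, None, "Neha Rao"],
    "Age": [28.0, 42.0, np.nan, np.nan],
    "City": ["Delhi", "Pune", None, "Bangalore"],
})

# Default how='any': drop a row if Name or City is missing; Age is ignored.
either_missing = demo.dropna(subset=["Name", "City"])

# how='all': drop a row only if Name, Age, and City are all missing.
all_missing = demo.dropna(subset=["Name", "Age", "City"], how="all")

print(either_missing)
print(all_missing)
```

Row 3 survives the first call even though its Age is missing, because Age is not in the subset.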

dropna() — `thresh` Parameter

The thresh parameter tells Pandas how many non-missing values (not NaN) a row must have to be kept.

Simple Meaning:

Keep the row only if it has at least a certain number of valid (non-NaN) values.
If it has fewer valid values than the number you give → the row is dropped.

Example:

df.dropna(thresh=3)

This means:

A row must have at least 3 non-null values to stay.
If it has 2 or fewer non-null values → drop it.

Simple idea:

thresh = minimum number of real (non-NaN) values required to keep the row

# keep rows with at least 3 non-null values
df = df.dropna(thresh=3)

You can also restrict the count to a subset of columns:

# keep rows where at least 2 of Name,Age,City are present
df = df.dropna(subset=['Name','Age','City'], thresh=2)
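
The counting behind thresh can be checked on a small case. In this sketch the DataFrame demo is made up so that its rows have 4, 2, and 1 non-missing values.

```python
import pandas as pd

# Hypothetical data: non-missing values per row are 4, 2, and 1.
demo = pd.DataFrame({
    "Name": ["Rahul Sharma", None, None],
    "Age": [28.0, 42.0, 25.0],
    "City": ["Delhi", "Pune", None],
    "Email": ["rahul@example.com", None, None],
})

# Keep rows with at least 3 non-missing values across all columns.
at_least_3 = demo.dropna(thresh=3)

# Count only Name, Age, and City; keep rows with at least 2 of them present.
two_of_three = demo.dropna(subset=["Name", "Age", "City"], thresh=2)

print(at_least_3)
print(two_of_three)
```

Row 1 fails the first check with 2 values out of 4 but passes the second, because Age and City are both present within the subset.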

Assignments — Practical Questions on Handling Missing Values

1. Missing Value Checks

Write code to check how many missing values each column has.
Add a new column "missing" that counts missing values row-wise.

(Use df.isnull().sum() and df['missing'] = df.isnull().sum(axis=1).)

2. dropna() Practice

Drop all rows that contain any missing value.
Drop all rows that contain only missing values.
Drop rows where the Email column is missing.
Drop rows where Name OR City is missing.
Drop rows where Name AND Email are missing using subset.
Keep only those rows that have at least 3 non-missing values (use thresh).

(Use df.dropna() with how, subset, and thresh as shown above.)

3. fillna() Practice

Fill missing values in City with "Unknown".
Fill missing numeric values in "Age" with the mean.
Fill missing numeric values in "Salary" with the median.
Fill missing values of "Phone" with "Not Provided".
Fill missing Email values with "test@example.com".

(Use df['City'] = df['City'].fillna('Unknown'), df['Age'].fillna(df['Age'].mean()), etc.)

4. subset Parameter Practice

Drop rows where Email and Phone are missing using subset.
Drop rows where SignupDate is missing.

(Use df.dropna(subset=['Email','Phone'], how='all') to drop rows where both are missing, or omit how to drop rows where either is missing, and df.dropna(subset=['SignupDate']).)

5. Custom Practical Tasks

Create a new DataFrame keeping only rows where at least 2 columns have values.
Fill missing values in "City" with "NA".
Replace missing "Age" values with 0.
Replace all missing values in the entire DataFrame with "Missing".
Fill missing values only in numeric columns using their mean.
