14. Data Cleaning in Pandas

file-download
385B

Duplicacy and Missing Values

When working with data, you'll often encounter duplicate records or missing values. This guide shows how to find and handle them using pandas.

Sample dataset (duplicates)

CustomerID
CustomerName
Email
Phone
City
SignupDate

0

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

1

C002

Riya Sharma

riya.sharma@gmail.com

9123456780

Delhi

2024-01-07

2

C003

Arjun Verma

arjun.verma@gmail.com

9988776655

Mumbai

2024-01-10

3

C004

Neha Agarwal

neha.agrawal@gmail.com

9090909090

Pune

2024-01-11

4

C005

Sunil Mehra

sunil.mehra@gmail.com

9876501234

Kanpur

2024-01-12

5

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

6

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

7

C002

Riya Sharma

riya.sharma@gmail.com

9123456780

Delhi

2024-01-07

8

C003

Arjun Verma

arjun.verma@gmail.com

9988776600

Mumbai

2024-01-10

9

C004

Neha Agarwal

neha.a@gmail.com

9090909090

Pune

2024-01-11

10

C006

Sunil Kumar

sunil.mehra@gmail.com

9876511111

Kanpur

2024-01-12

11

C007

Meera Singh

meera.singh@gmail.com

9876543210

Delhi

2024-01-15

12

C008

Rohan Singh

rohan@gmail.com

9876501234

Kanpur

2024-01-20

13

C009

Sunita Sharma

sunil.mehra@gmail.com

9999999999

Mumbai

2024-01-21

14

C010

Amit Tandon

amit@gmail.com

9123456780

Pune

2024-01-22

How to find exact duplicate rows

Use duplicated() with loc to filter duplicate rows:

Output:

CustomerID
CustomerName
Email
Phone
City
SignupDate

5

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

6

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

7

C002

Riya Sharma

riya.sharma@gmail.com

9123456780

Delhi

2024-01-07

Note: The keep parameter controls which duplicates are considered duplicates.

1

keep parameter — options

  • keep='first' — does not consider the first occurrence as duplicate

  • keep='last' — does not consider the last occurrence as duplicate

  • keep=False — considers all duplicate rows as duplicates

Examples:

subset parameter

subset helps find duplicates considering only selected column(s):

Example output (duplicates by Email):

CustomerID
CustomerName
Email
Phone
City
SignupDate

0

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

1

C002

Riya Sharma

riya.sharma@gmail.com

9123456780

Delhi

2024-01-07

2

C003

Arjun Verma

arjun.verma@gmail.com

9988776655

Mumbai

2024-01-10

4

C005

Sunil Mehra

sunil.mehra@gmail.com

9876501234

Kanpur

2024-01-12

5

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

6

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

7

C002

Riya Sharma

riya.sharma@gmail.com

9123456780

Delhi

2024-01-07

8

C003

Arjun Verma

arjun.verma@gmail.com

9988776600

Mumbai

2024-01-10

10

C006

Sunil Kumar

sunil.mehra@gmail.com

9876511111

Kanpur

2024-01-12

13

C009

Sunita Sharma

sunil.mehra@gmail.com

9999999999

Mumbai

2024-01-21

Example: find records where Name and City are same

drop_duplicates()

Drop duplicates with drop_duplicates():

Output (after dropping exact duplicate rows):

CustomerID
CustomerName
Email
Phone
City
SignupDate

0

C001

Abhishek Mishra

abhishek@gmail.com

9876543210

Kanpur

2024-01-05

1

C002

Riya Sharma

riya.sharma@gmail.com

9123456780

Delhi

2024-01-07

2

C003

Arjun Verma

arjun.verma@gmail.com

9988776655

Mumbai

2024-01-10

3

C004

Neha Agarwal

neha.agrawal@gmail.com

9090909090

Pune

2024-01-11

4

C005

Sunil Mehra

sunil.mehra@gmail.com

9876501234

Kanpur

2024-01-12

8

C003

Arjun Verma

arjun.verma@gmail.com

9988776600

Mumbai

2024-01-10

9

C004

Neha Agarwal

neha.a@gmail.com

9090909090

Pune

2024-01-11

10

C006

Sunil Kumar

sunil.mehra@gmail.com

9876511111

Kanpur

2024-01-12

11

C007

Meera Singh

meera.singh@gmail.com

9876543210

Delhi

2024-01-15

12

C008

Rohan Singh

rohan@gmail.com

9876501234

Kanpur

2024-01-20

13

C009

Sunita Sharma

sunil.mehra@gmail.com

9999999999

Mumbai

2024-01-21

14

C010

Amit Tandon

amit@gmail.com

9123456780

Pune

2024-01-22

Example: delete rows where Phone is same (possible fraud case):

Example: delete rows where Phone and Email are same:

Combining steps to clean the data (example workflow provided in original content):

  • Re-read CSV

  • drop exact duplicates

  • drop duplicates by Email or Phone as needed

(See original code examples for exact sequences and outputs.)


Missing Value in Pandas

Missing values must be handled because they can break calculations, introduce bias, or cause errors in analysis and models. The examples below use a sample CSV called missing file.csv.

CustomerID
Name
Age
City
Email
PurchaseAmount

0

c001

Rahul Sharma

28.0

Delhi

rahul@example.com

1200.0

1

c002

Priya Singh

NaN

Mumbai

priya@example.com

1500.0

2

NaN

NaN

NaN

NaN

NaN

NaN

3

c003

Amit Verma

35.0

NaN

amit@example.com

NaN

4

c004

NaN

42.0

Pune

NaN

2000.0

5

c005

Neha Rao

30.0

Bangalore

neha@example.com

1800.0

6

NaN

NaN

NaN

NaN

NaN

NaN

7

c006

Ravi Kumar

NaN

Chennai

ravi@example.com

NaN

8

c007

NaN

25.0

NaN

NaN

NaN

9

c008

Sonia Mehra

29.0

Delhi

sonia@example.com

1600.0

10

NaN

NaN

NaN

NaN

NaN

NaN

isna() / isnull()

isna()/isnull() indicate missing values (True if missing):

Output (boolean mask) — see original output table.

isna().sum() / isnull().sum()

Count missing per column:

Example output:

CustomerID 3 Name 5 Age 5 City 5 Email 5 PurchaseAmount 6 dtype: int64

Count missing per row (axis=1)

The missing column shows the number of NaNs per row (see original table).

Filter rows with missing in a specific column

notna() / notnull()

notna()/notnull() indicate non-missing values (True if present).

Example output:

CustomerID 8 Name 6 Age 6 City 6 Email 6 PurchaseAmount 5 dtype: int64

Count non-missing per row:

Filter rows where Age is not missing:

fillna()

fillna() is a Pandas function used to replace missing values (NaN) in a DataFrame or Series. It helps ensure the dataset is complete and avoids errors during analysis or calculations.

When to use fillna():

  • When your dataset contains missing values that need to be handled.

  • When you want to replace missing entries with:

    • A fixed value → df.fillna(0)

    • Statistical values like mean/median/mode → df['col'].fillna(df['col'].mean())

  • Before performing visualizations, aggregations, or machine learning where NaN cannot be used.

fillna() helps maintain data consistency and prepares the dataset for further processing.

Example using a dictionary to fill different columns differently:

Fill a single column with its mean:

dropna()

dropna() is a Pandas function used to remove rows or columns that contain missing values (NaN). It helps clean the dataset by eliminating incomplete data points that may affect analysis.

When to use dropna():

  • When missing values are few and removing them will not affect the dataset significantly.

  • When you want only fully complete records for accurate analysis.

dropna() — how Parameter

The how parameter tells Pandas when a row or column should be dropped based on missing values.

Very Simple Explanation:

  • how='any' → Drop the row/column if it has even one NaN

  • how='all' → Drop the row/column only if all values are NaN

Examples:

  • df.dropna(how='any') → Remove rows that contain at least one missing value.

  • df.dropna(how='all') → Remove rows only if every value in that row is missing.

Simple idea:

  • 'any' = even one NaN → gone

  • 'all' = all NaN → gone

subset parameter

dropna() — subset Parameter

The subset parameter tells Pandas which specific columns to check for missing values when deciding to drop a row.

Simple Meaning:

Instead of checking the whole row, Pandas will only look at the columns you mention in subset.

Why use it?

Because sometimes you only care about missing values in important columns, not all columns.

Example:

df.dropna(subset=['Email', 'Phone'])

This means:

  • Drop the row only if Email or Phone is missing

  • Ignore missing values in other columns

Simple idea:

subset = “check missing values only in these columns”

Check only specific columns when deciding to drop rows:

This drops rows where Email OR Phone is missing (depending on how).

Examples:

thresh parameter

dropna() — thresh Parameter

The thresh parameter tells Pandas how many non-missing values (not NaN) a row must have to be kept.

Simple Meaning:

  • Keep the row only if it has at least a certain number of valid (non-NaN) values.

  • If it has fewer valid values than the number you give → the row is dropped.

Example:

df.dropna(thresh=3)

This means:

  • A row must have at least 3 non-null values to stay.

  • If it has 2 or fewer non-null values → drop it.

Simple idea:

thresh = minimum number of real (non-NaN) values required to keep the row

or restrict to subset:


1

Assignments — Practical Questions on Handling Missing Values

1. Missing Value Checks

  • Write code to check how many missing values each column has.

  • Add a new column "missing" that counts missing values row-wise.

(Use df.isnull().sum() and df['missing'] = df.isnull().sum(axis=1).)

2

2. dropna() Practice

  • Drop all rows that contain any missing value.

  • Drop all rows that contain only missing values.

  • Drop rows where the Email column is missing.

  • Drop rows where Name OR City is missing.

  • Drop rows where Name AND Email are missing using subset.

  • Keep only those rows that have at least 3 non-missing values (use thresh).

(Use df.dropna() with how, subset, and thresh as shown above.)

3

3. fillna() Practice

  • Fill missing values in City with "Unknown".

  • Fill missing numeric values in "Age" with the mean.

  • Fill missing numeric values in "Salary" with the median.

  • Fill missing values of "Phone" with "Not Provided".

  • Fill missing Email values with "test@example.com".

(Use df['City'] = df['City'].fillna('Unknown'), df['Age'].fillna(df['Age'].mean()), etc.)

4

4. subset Parameter Practice

  • Drop rows where Email and Phone are missing using subset.

  • Drop rows where SignupDate is missing.

(Use df.dropna(subset=['Email','Phone']) and df.dropna(subset=['SignupDate']).)

5

5. Custom Practical Tasks

  • Create a new DataFrame keeping only rows where at least 2 columns have values.

  • Fill missing values in "City" with "NA".

  • Replace missing "Age" values with 0.

  • Replace all missing values in the entire DataFrame with "Missing".

  • Fill missing values only in numeric columns using their mean.

Last updated