15. Outliers in pandas

What is an Outlier?

An outlier is a data point that differs significantly from the rest of the values in a dataset. It lies far above or far below the majority of the observations.

In simple words:

  • Outliers are values that don’t fit the general pattern of the data.

  • They can occur due to data entry errors, measurement mistakes, or genuine rare events.

  • Outliers can distort averages, affect model performance, and mislead analysis, so they need special attention.

Example Scenario

Suppose you are analyzing the daily temperatures of a city, and most days fall between 20°C and 35°C. If one day shows a recorded temperature of 70°C, that value is an outlier.

Why Outliers Matter

  • They shift the mean, inflate the standard deviation, and stretch the scale of plots such as histograms and box plots.

  • They can indicate system errors, sensor malfunctions, or special cases.


Simple Scenario

1. Salary example: baseline

salary1 = [100, 200, 500]

2. Salary example: with a possible outlier

salary = [100, 200, 500, 10000]

In this scenario, the outlier is the salary of 10000, which is far above the other three values. For some analyses we need to detect such outliers and remove them.
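To see why a single extreme value matters, here is a minimal sketch that compares the mean and median of the two salary lists. The variable names mirror the ones above; holding the lists in pandas Series is an assumption made for illustration.

```python
import pandas as pd

salary1 = pd.Series([100, 200, 500])           # baseline
salary = pd.Series([100, 200, 500, 10000])     # with the extreme value

# The mean jumps by an order of magnitude; the median barely moves.
print(salary1.mean(), salary1.median())   # about 266.67 and 200.0
print(salary.mean(), salary.median())     # 2700.0 and 350.0
```

The mean moves from roughly 267 to 2700 because of one value, while the median only moves from 200 to 350.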


Measure of Central Tendency

The measure of central tendency finds the center or typical value in a set of data.

Three main types:

  • Mean (Average) – Add up all numbers and divide by how many there are. Example: (2 + 4 + 6) / 3 = 4

  • Median – The middle value when numbers are arranged in order. Example: In [3, 5, 8], the median is 5.

  • Mode – The value that appears most often. Example: In [2, 3, 3, 5], the mode is 3.
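The three measures above can be computed with pandas as in the following sketch. It reuses the small examples from the list; the variable names are illustrative only.

```python
import pandas as pd

mean_value = pd.Series([2, 4, 6]).mean()         # (2 + 4 + 6) / 3
median_value = pd.Series([3, 5, 8]).median()     # middle value of the sorted data
mode_values = pd.Series([2, 3, 3, 5]).mode()     # a Series, because ties are possible

print(mean_value)             # 4.0
print(median_value)           # 5.0
print(mode_values.tolist())   # [3]
```

Note that `mode()` returns a Series rather than a single number, since a dataset can have more than one most frequent value.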


Measure of Spread

  • Range = Max - Min. Example: for (10, 20, 30, 40), Range = 40 - 10 = 30.

Example code (illustrative):
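A minimal sketch of the range calculation, assuming the four values from the example are held in a pandas Series:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])

# Range is the distance between the largest and smallest value.
value_range = values.max() - values.min()
print(value_range)   # 30
```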


Variance and Standard Deviation

  • Variance: the average squared distance of the values from the mean.

  • Standard Deviation (STD): the square root of the variance. We generally prefer STD because it is expressed in the same units as the original data, which makes it easier to interpret.
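As an illustration, the sketch below computes both quantities with pandas on a small made-up Series. One detail worth knowing: pandas `var()` and `std()` default to the sample formula (`ddof=1`, dividing by n - 1); passing `ddof=0` gives the population formula (dividing by n).

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])   # illustrative values

sample_var = values.var()         # ddof=1 by default: 500 / 3, about 166.67
sample_std = values.std()         # square root of sample_var, about 12.91
pop_var = values.var(ddof=0)      # population variance: 500 / 4 = 125.0
pop_std = values.std(ddof=0)      # about 11.18

print(sample_var, sample_std)
print(pop_var, pop_std)
```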

The ranges within 1, 2, and 3 standard deviations of the mean (STD1, STD2, STD3) help classify data points as normal, unusual, or outliers.

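Empirical Rule (68-95-99.7), for approximately normally distributed data:

  • STD1: Mean ± 1 STD → ~68% of data (typical values)

  • STD2: Mean ± 2 STD → ~95% of data (values between 1 and 2 STD are less common)

  • STD3: Mean ± 3 STD → ~99.7% of data (values between 2 and 3 STD are rare)

  • Beyond STD3 → Outliers (~0.3% of data)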

Example (Mean = 50, STD = 10):

  • STD1 range: 40 to 60

  • STD2 range: 30 to 70

  • STD3 range: 20 to 80

  • Outliers: values < 20 or > 80
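The ranges above can be turned into a small classifier. This is only a sketch: the function name `std_band` and the band labels are hypothetical, and the defaults are the Mean = 50, STD = 10 values from the example.

```python
def std_band(value, mean=50.0, std=10.0):
    """Return which standard-deviation band a value falls in."""
    distance = abs(value - mean) / std   # distance from the mean, in STD units
    if distance <= 1:
        return "within 1 STD"
    if distance <= 2:
        return "within 2 STD"
    if distance <= 3:
        return "within 3 STD"
    return "outlier"

for v in [55, 68, 79, 85, 15]:
    print(v, std_band(v))
# 55 within 1 STD
# 68 within 2 STD
# 79 within 3 STD
# 85 outlier
# 15 outlier
```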


Normal Distribution

A Way to Find Outliers (worked example with code)

The examples below use a dataset loaded from "Adidas US Sales Datasets.xlsx". The column total is computed as Price per Unit * Units Sold.

Step-by-step process (mean/std and IQR approaches):

1. Load data and compute total
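A sketch of this step follows. The helper name `add_total` is hypothetical, and because the workbook is not bundled here, the runnable part uses a one-row stand-in built from the first row of the dataset; the `pd.read_excel` call is shown as a comment, and any extra arguments the real file needs are not known from this text.

```python
import pandas as pd

def add_total(df):
    """Return a copy of df with total = Price per Unit * Units Sold."""
    out = df.copy()
    out["total"] = out["Price per Unit"] * out["Units Sold"]
    return out

# With the real file, the frame would come from something like:
#   df = pd.read_excel("Adidas US Sales Datasets.xlsx")
# Stand-in frame (assumption): only the columns needed for the calculation.
df = pd.DataFrame({
    "Retailer": ["Foot Locker"],
    "Price per Unit": [50.0],
    "Units Sold": [1200],
})

df = add_total(df)
print(df)
```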

Example of first rows (truncated):

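| Retailer | Invoice Date | Region | State | City | Product | Price per Unit | Units Sold | Sales Method | total |
|----------|--------------|--------|-------|------|---------|----------------|------------|--------------|-------|
| Foot Locker | 2020-01-01 | Northeast | New York | New York | Men's Street Footwear | 50.0 | 1200 | In-store | 60000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |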

2. Method 1: Mean and Standard Deviation (STD) approach

Compute mean and std:

1-STD range:

3-STD range:

Filter rows within ±3 STD:
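The four steps above (mean and std, 1-STD range, 3-STD range, filter) can be sketched together as follows. The `total` values here are a hypothetical stand-in, 19 typical values plus one extreme value, not rows from the Adidas dataset; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the `total` column.
df = pd.DataFrame({"total": list(range(90, 109)) + [1000]})

mean = df["total"].mean()
std = df["total"].std()            # sample standard deviation (ddof=1)

# 1-STD and 3-STD ranges around the mean
lower_1, upper_1 = mean - std, mean + std
lower_3, upper_3 = mean - 3 * std, mean + 3 * std

# Keep rows within ±3 STD; everything else is treated as an outlier.
inside = df["total"].between(lower_3, upper_3)
within_3std = df[inside]
outliers = df[~inside]

print(len(within_3std), outliers["total"].tolist())   # 19 [1000]
```

With very small samples this filter may never flag anything, because a single extreme value also inflates the std used to build the range.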

Note: the mean/std approach is itself sensitive to extreme outliers, because very high values inflate both the mean and the std and so widen the ±3 STD range.

3. Percentiles and Quartiles (concept)

  • The p-th percentile is the value below which p% of the observations fall. Example: the 50th percentile is the median.

  • Quartiles: 25% (Q1), 50% (Q2), 75% (Q3). The middle 50% of the data lies between Q1 and Q3.

Example usage:
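A minimal sketch using pandas `Series.quantile`, on made-up values chosen so the quartiles land exactly on data points (pandas interpolates linearly between points by default):

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])   # illustrative values

q1 = values.quantile(0.25)   # 25th percentile
q2 = values.quantile(0.50)   # 50th percentile, same as the median
q3 = values.quantile(0.75)   # 75th percentile

print(q1, q2, q3)   # 20.0 30.0 40.0
```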

4. Method 2: IQR (Interquartile Range) method

Compute Q1 and Q3, then IQR and fences:

Filter using IQR fences:
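The two steps above can be sketched as follows. As before, the `total` values are a hypothetical stand-in (19 typical values plus one extreme value) rather than the real dataset, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the `total` column.
df = pd.DataFrame({"total": list(range(90, 109)) + [1000]})

q1 = df["total"].quantile(0.25)
q3 = df["total"].quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep rows inside the fences.
inside = df["total"].between(lower_fence, upper_fence)
no_outliers = df[inside]
outliers = df[~inside]

print(q1, q3, iqr)                # 94.75 104.25 9.5
print(lower_fence, upper_fence)   # 80.5 118.5
print(outliers["total"].tolist()) # [1000]
```

Unlike the mean and std, Q1 and Q3 are barely affected by the single extreme value, so the fences stay close to the bulk of the data.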

Notes:

  • IQR is robust to outliers and often preferred to find outliers in skewed data.

  • Observations outside [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] are commonly considered outliers.


Box plot illustration


Box plot IQR example (small numeric example)

Data:

Find Q1 and Q3:

IQR:

Fences:
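The original numbers for this example are not reproduced here, so the sketch below walks through the same four steps on a hypothetical dataset. The helper name `iqr_fences` is also an assumption. Quartiles use the pandas default (linear interpolation), so hand calculations with a different textbook convention may give slightly different values.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (q1, q3, iqr, lower_fence, upper_fence) for a sequence of numbers."""
    s = pd.Series(values)
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr

data = [2, 4, 6, 8, 10, 12, 14, 16, 18]   # hypothetical values
q1, q3, iqr, lower, upper = iqr_fences(data)
print(q1, q3, iqr, lower, upper)   # 6.0 14.0 8.0 -6.0 26.0

# Values outside the fences would be flagged; here there are none.
flagged = [x for x in data if x < lower or x > upper]
print(flagged)   # []
```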

Note: Not every dataset will contain outliers; fences indicate where outliers would lie.


Summary: Which method to pick?

  • Use IQR when you want a robust method less affected by extreme values (good for skewed distributions).
