15. Outlier in pandas
What is an Outlier?
An outlier is a data point that differs significantly from the rest of the values in a dataset. It lies far above or far below the majority of the observations.
In simple words:
Outliers are values that don’t fit the general pattern of the data.
They can occur due to data entry errors, measurement mistakes, or genuine rare events.
Outliers can distort averages, affect model performance, and mislead analysis, so they need special attention.
Example Scenario
Suppose you are analyzing the daily temperatures of a city, and most days fall between 20°C and 35°C. If one day shows a recorded temperature of 70°C, that value is an outlier.
Why Outliers Matter
They shift the mean, inflate the standard deviation, and distort the shape of plots such as histograms and box plots.
They can indicate system errors, sensor malfunctions, or special cases.
Simple Scenario
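As a minimal sketch of the temperature scenario, the snippet below compares summary statistics with and without a single extreme reading. The temperature values are made up for illustration; only the 70°C reading comes from the scenario above.

```python
import pandas as pd

# Hypothetical daily temperatures in °C (made-up values for illustration)
temps = pd.Series([22, 25, 27, 30, 31, 33, 35])

# The same data with one extreme reading of 70 °C appended
temps_with_outlier = pd.concat([temps, pd.Series([70])], ignore_index=True)

# Compare mean, median and standard deviation side by side
summary = pd.DataFrame({
    "without_outlier": temps.agg(["mean", "median", "std"]),
    "with_outlier": temps_with_outlier.agg(["mean", "median", "std"]),
})
print(summary.round(2))
```

In this sketch the mean moves from 29.0 to 34.125 while the median only moves from 30.0 to 30.5, which shows why the median is often described as more robust to outliers.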
Measure of Central Tendency
The measure of central tendency finds the center or typical value in a set of data.
Three main types:
Mean (Average) – Add up all numbers and divide by how many there are. Example: (2 + 4 + 6) / 3 = 4
Median – The middle value when numbers are arranged in order. Example: In [3, 5, 8], the median is 5.
Mode – The value that appears most often. Example: In [2, 3, 3, 5], the mode is 3.
Measure of Spread
Range = Max - Min. Example: for (10, 20, 30, 40), Range = 40 - 10 = 30
Example code (illustrative):
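A small sketch of how these measures could be computed with pandas, reusing the numbers from the examples above. The variable names are illustrative only.

```python
import pandas as pd

# Mean: (2 + 4 + 6) / 3
mean_value = pd.Series([2, 4, 6]).mean()

# Median: middle value of the sorted data
median_value = pd.Series([3, 5, 8]).median()

# Mode: mode() returns a Series because several values can tie
mode_values = pd.Series([2, 3, 3, 5]).mode().tolist()

# Range: pandas has no built-in range method, so use max - min
spread = pd.Series([10, 20, 30, 40])
value_range = spread.max() - spread.min()

print(mean_value, median_value, mode_values, value_range)
```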
Variance and Standard Deviation
Variance: how spread out numbers are from the mean.
Standard Deviation (STD): square root of variance. We generally prefer STD because it is expressed in the same units as the original data, which makes it easier to interpret.
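The following sketch computes both quantities with pandas on a small made-up series. One detail worth knowing: pandas uses the sample formulas (dividing by n - 1) by default, and passing ddof=0 switches to the population formulas (dividing by n).

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40])  # made-up values, mean = 25

# Default: sample variance and standard deviation (ddof=1, divide by n - 1)
sample_var = data.var()
sample_std = data.std()

# ddof=0: population variance and standard deviation (divide by n)
population_var = data.var(ddof=0)
population_std = data.std(ddof=0)

print(sample_var, sample_std)
print(population_var, population_std)
```

Here the squared deviations from the mean sum to 500, so the population variance is 500 / 4 = 125 and the sample variance is 500 / 3.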
The bands within 1, 2, and 3 standard deviations of the mean (STD1, STD2, STD3) help classify data points as normal, unusual, or outliers.
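Empirical Rule (68-95-99.7), which holds for approximately normally distributed data:
STD1: Mean ± 1 STD → ~68% of data (typical values)
STD2: Mean ± 2 STD → ~95% of data (values between 1 and 2 STD from the mean are less common)
STD3: Mean ± 3 STD → ~99.7% of data (values between 2 and 3 STD from the mean are rare)
Beyond STD3 → Outliers (~0.3% of data)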
Example (Mean = 50, STD = 10):
STD1 range: 40 to 60
STD2 range: 30 to 70
STD3 range: 20 to 80
Outliers: values < 20 or > 80
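The bands above can be sketched as a small classifier. The function name classify, the label strings, and the sample values are hypothetical; the mean and STD come from the example above.

```python
import pandas as pd

MEAN, STD = 50, 10  # values from the example above

def classify(value, mean=MEAN, std=STD):
    """Label a value by how many standard deviations it lies from the mean."""
    distance = abs(value - mean) / std
    if distance <= 1:
        return "within 1 STD"
    if distance <= 2:
        return "within 2 STD"
    if distance <= 3:
        return "within 3 STD"
    return "outlier"

values = pd.Series([55, 38, 72, 15, 85])  # made-up values
labels = values.apply(classify)
print(pd.DataFrame({"value": values, "label": labels}))
```

Values exactly on a boundary (for example 20 or 80) are treated as inside the band, matching the rule that only values below 20 or above 80 count as outliers.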
Normal Distribution


A Way to Find Outliers (worked example with code)
The examples below use a dataset loaded from "Adidas US Sales Datasets.xlsx". The "total" column is computed as Price per Unit * Units Sold.
Step-by-step process (mean/std and IQR approaches):
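A minimal sketch of both approaches follows. It runs on a small stand-in DataFrame with made-up numbers rather than the real file, and the function name flag_outliers and the flag column names are hypothetical. With the actual dataset, the frame would presumably come from pd.read_excel, whose header layout is not shown here.

```python
import pandas as pd

def flag_outliers(df, column="total"):
    """Add boolean outlier flags for `column` using the mean/std and IQR rules."""
    out = df.copy()
    values = out[column]

    # Mean/std rule: more than 3 standard deviations away from the mean
    mean, std = values.mean(), values.std()
    out["outlier_std"] = (values < mean - 3 * std) | (values > mean + 3 * std)

    # IQR rule: below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    out["outlier_iqr"] = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return out

# Stand-in rows with made-up numbers (not taken from the real dataset)
sales = pd.DataFrame({
    "Price per Unit": [50, 45, 40, 55, 60, 50, 45, 65, 50, 55, 40, 110],
    "Units Sold": [100, 120, 110, 90, 95, 105, 115, 100, 98, 102, 108, 1200],
})
sales["total"] = sales["Price per Unit"] * sales["Units Sold"]

flagged = flag_outliers(sales)
print(flagged[flagged["outlier_std"] | flagged["outlier_iqr"]])
```

In this stand-in data both rules flag only the last row. Note that an extreme value also inflates the standard deviation it is measured against, which is one reason the IQR rule is usually considered the more robust of the two.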
Box plot illustration

Box plot IQR example (small numeric example)
Data:
Find Q1 and Q3:
IQR:
Fences:
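The steps above can be sketched on a small made-up dataset (the original numbers are not reproduced here). The quartiles use the default linear interpolation of pandas quantile; other textbook conventions for quartiles can give slightly different values.

```python
import pandas as pd

data = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 50])  # made-up values

# Find Q1 and Q3
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)

# IQR is the width of the middle 50% of the data
iqr = q3 - q1

# Fences: anything outside them is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence)
print(outliers.tolist())
```

For this data Q1 = 14 and Q3 = 20, so IQR = 6 and the fences are 5 and 29; the value 50 is the only one outside them.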
Note: Not every dataset will contain outliers; fences indicate where outliers would lie.
Summary: Which method to pick?
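Use the mean/STD rule (beyond 3 standard deviations) when the data is approximately normally distributed.
Use IQR when you want a robust method that is less affected by extreme values (good for skewed distributions).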