5. Filtering
In this session, we are going to cover filtering. Filtering is to get the data that you want from your dataset.
Types of Filtering
There are two types of filtering.
iloc (index location) - This is filtering with the help of index.
loc (location) - This is filtering with the help of values.
iloc
First we are going to filter with the help of index.
In [22]:
import pandas as pd
df = pd.read_csv('retail_sales_dataset.csv')
dfOut[22]:
0
CUST001
Male
34
Beauty
3
50
1
CUST002
Female
26
Clothing
2
500
2
CUST003
Male
50
Electronics
1
30
3
CUST004
Male
37
Clothing
1
500
4
CUST005
Male
30
Beauty
2
50
...
...
...
...
...
...
...
995
CUST996
Male
62
Clothing
1
50
996
CUST997
Male
52
Beauty
3
30
997
CUST998
Female
23
Beauty
4
25
998
CUST999
Female
36
Electronics
3
50
999
CUST1000
Male
47
Electronics
4
30
1000 rows × 6 columns
Getting any row in DataFrame
To get any row in dataframe, you should just specify the row index.
In [23]:
df.iloc[0] # this gets the first rowOut[23]:
Customer ID CUST001 Gender Male Age 34 Product Category Beauty Quantity 3 Price per Unit 50 Name: 0, dtype: object
Getting multiple rows in DataFrame
To get multiple rows in dataframe, you should just specify list of index in iloc.
In [24]:
df.iloc[[0, 3, 5]]Out[24]:
0
CUST001
Male
34
Beauty
3
50
3
CUST004
Male
37
Clothing
1
500
5
CUST006
Female
45
Beauty
1
30
Slicing multiple rows in DataFrame
To slice multiple rows in dataframe, you should just use slicing of index like we used to do it in Python.
In [25]:
df.iloc[2:5]Out[25]:
2
CUST003
Male
50
Electronics
1
30
3
CUST004
Male
37
Clothing
1
500
4
CUST005
Male
30
Beauty
2
50
Getting single value in DataFrame
A single value has two indexes. To get a single value, you need to specify row and column index.
In [26]:
df.iloc[0, 0]Out[26]:
'CUST001'
Getting random rows and random columns in DataFrame
To get random rows and random columns from dataframe, you should specify the list of rows and list of columns.
In [27]:
df.iloc[[0, 3, 5], [0, 2, 4]]Out[27]:
0
CUST001
34
3
3
CUST004
37
1
5
CUST006
45
1
Getting slicing of rows and columns in DataFrame
You can use slicing of rows and columns in dataframe.
In [28]:
df.iloc[0:3, 2:4]Out[28]:
0
34
Beauty
1
26
Clothing
2
50
Electronics
How would you get a single column?
You can get single column by slicing index of all rows and single column.
In [29]:
df.iloc[:, 1]Out[29]:
0 Male 1 Female 2 Male 3 Male 4 Male ... 995 Male 996 Male 997 Female 998 Female 999 Male Name: Gender, Length: 1000, dtype: object
Note: iloc is used to get data from index.
loc
loc is used to filter data from values. The syntax of loc is df.loc[Condition].
If condition is True, Data will reflect.
If condition is False, Data will not reflect.
Filter data where Gender is Male
In [30]:
mdf = df.loc[df['Gender'] == 'Male']
mdf.head()Out[30]:
0
CUST001
Male
34
Beauty
3
50
2
CUST003
Male
50
Electronics
1
30
3
CUST004
Male
37
Clothing
1
500
4
CUST005
Male
30
Beauty
2
50
6
CUST007
Male
46
Clothing
2
25
Filter data where Age is greater than 50
In [31]:
fdf = df.loc[df['Age'] > 50]
fdf.head()Out[31]:
8
CUST009
Male
63
Electronics
2
300
9
CUST010
Female
52
Clothing
4
50
13
CUST014
Male
64
Clothing
4
30
18
CUST019
Female
62
Clothing
2
25
24
CUST025
Female
64
Beauty
1
50
Multiple conditions in filtering
If your task involves multiple filtering, you need to use logical operators. In filtering, we do not use Python logical operators (and, or, not).
& - Filter data where Gender is Male and Age is greater than 50
In [32]:
mdf = df.loc[(df['Gender'] == 'Male') & (df['Age'] > 50)]
mdfOut[32]:
8
CUST009
Male
63
Electronics
2
300
13
CUST014
Male
64
Clothing
4
30
35
CUST036
Male
52
Beauty
3
300
47
CUST048
Male
54
Electronics
3
300
58
CUST059
Male
62
Clothing
1
50
...
...
...
...
...
...
...
969
CUST970
Male
59
Electronics
4
500
972
CUST973
Male
60
Clothing
1
50
983
CUST984
Male
56
Clothing
1
500
995
CUST996
Male
62
Clothing
1
50
996
CUST997
Male
52
Beauty
3
30
155 rows × 6 columns
& - Filter data where Age is between 20 to 30
In [33]:
mdf = df.loc[(df['Age'] > 20) & (df['Age'] < 30)]
mdf.head()Out[33]:
1
CUST002
Female
26
Clothing
2
500
10
CUST011
Male
23
Clothing
2
50
12
CUST013
Male
22
Electronics
3
500
16
CUST017
Female
27
Clothing
4
25
19
CUST020
Male
22
Clothing
3
300
| - Filter data where Category is Clothing or Electronics
In [34]:
mdf = df.loc[(df['Product Category'] == 'Clothing') | (df['Product Category'] == 'Electronics')]
mdf.head()Out[34]:
1
CUST002
Female
26
Clothing
2
500
2
CUST003
Male
50
Electronics
1
30
3
CUST004
Male
37
Clothing
1
500
6
CUST007
Male
46
Clothing
2
25
7
CUST008
Male
30
Electronics
4
25
!= - Filter where data is not equal to Electronics
In [35]:
mdf = df.loc[df['Product Category'] != 'Electronics']
mdf.head()Out[35]:
0
CUST001
Male
34
Beauty
3
50
1
CUST002
Female
26
Clothing
2
500
3
CUST004
Male
37
Clothing
1
500
4
CUST005
Male
30
Beauty
2
50
5
CUST006
Female
45
Beauty
1
30
~ - Filter where data is not equal to Electronics
~ does everything opposite. It gives the result as opposed to the condition.
In [36]:
mdf = df.loc[~(df['Product Category'] != 'Electronics')]
mdf.head()Out[36]:
2
CUST003
Male
50
Electronics
1
30
7
CUST008
Male
30
Electronics
4
25
8
CUST009
Male
63
Electronics
2
300
12
CUST013
Male
22
Electronics
3
500
14
CUST015
Female
42
Electronics
4
500
~ - Filter data where Gender is not male and Category is not electronics
In [37]:
mdf = df.loc[~((df['Gender'] == 'Male') & (df['Product Category'] == 'Electronics'))]
mdf.head()Out[37]:
0
CUST001
Male
34
Beauty
3
50
1
CUST002
Female
26
Clothing
2
500
3
CUST004
Male
37
Clothing
1
500
4
CUST005
Male
30
Beauty
2
50
5
CUST006
Female
45
Beauty
1
30
isin()
isin() function in pandas can filter multiple values.
In [38]:
mdf = df.loc[df['Product Category'].isin(['Electronics', 'Clothing'])]
mdf.head()Out[38]:
1
CUST002
Female
26
Clothing
2
500
2
CUST003
Male
50
Electronics
1
30
3
CUST004
Male
37
Clothing
1
500
6
CUST007
Male
46
Clothing
2
25
7
CUST008
Male
30
Electronics
4
25
between()
between() function filters the data between two values.
In [39]:
mdf = df.loc[df['Age'].between(20, 30)]
mdfOut[39]:
1
CUST002
Female
26
Clothing
2
500
4
CUST005
Male
30
Beauty
2
50
7
CUST008
Male
30
Electronics
4
25
10
CUST011
Male
23
Clothing
2
50
12
CUST013
Male
22
Electronics
3
500
...
...
...
...
...
...
...
970
CUST971
Female
27
Electronics
4
50
980
CUST981
Female
30
Electronics
2
30
982
CUST983
Female
29
Clothing
1
300
986
CUST987
Female
30
Clothing
3
300
997
CUST998
Female
23
Beauty
4
25
231 rows × 6 columns
reset_index()
When you do filtering, index lose their serial order. You can use reset_index() to reset the index in dataframe.
reset_index() - it resets the index.
reset_index(drop=True) - it resets the index and deletes the previous index from the dataframe.
In [40]:
mdf = mdf.reset_index(drop=True)
mdfOut[40]:
0
CUST002
Female
26
Clothing
2
500
1
CUST005
Male
30
Beauty
2
50
2
CUST008
Male
30
Electronics
4
25
3
CUST011
Male
23
Clothing
2
50
4
CUST013
Male
22
Electronics
3
500
...
...
...
...
...
...
...
226
CUST971
Female
27
Electronics
4
50
227
CUST981
Female
30
Electronics
2
30
228
CUST983
Female
29
Clothing
1
300
229
CUST987
Female
30
Clothing
3
300
230
CUST998
Female
23
Beauty
4
25
231 rows × 6 columns
Update records
Suppose you want to change price of Beauty from 50 to 100. You can use .loc property to update the value.
In [41]:
df.loc[df['Product Category'] == 'Beauty', 'Price per Unit'] = 100
df.head()Out[41]:
0
CUST001
Male
34
Beauty
3
100
1
CUST002
Female
26
Clothing
2
500
2
CUST003
Male
50
Electronics
1
30
3
CUST004
Male
37
Clothing
1
500
4
CUST005
Male
30
Beauty
2
100
You can also use multiple conditions. Suppose you want to change Male Clothing product price from 500 to 1000.
In [42]:
df.loc[(df['Gender'] == 'Male') & (df['Product Category'] == 'Clothing'), 'Price per Unit'] = 1000
df.head()Out[42]:
0
CUST001
Male
34
Beauty
3
100
1
CUST002
Female
26
Clothing
2
500
2
CUST003
Male
50
Electronics
1
30
3
CUST004
Male
37
Clothing
1
1000
4
CUST005
Male
30
Beauty
2
100
Assignments:
50 Pandas Filtering & Update Questions
🔸 Basic Filtering
Display all rows where
Regionis"Northeast".Filter rows where
Units Soldis greater than 1000.Show records where
Price per Unitis equal to$50.00.Get all rows where
Sales Methodis"In-store".Select records where
Retaileris"Foot Locker"andCityis"New York".Filter rows where
Units Soldis not equal to 1000.Show records where
Units Soldis greater than or equal to 900.Filter rows where
Price per Unitis less than$50.00.Display all rows where
Units Soldis between 900 and 1200.Show all records where
Regionis"Northeast"andStateis"New York".
🔸 Multiple Conditions
Display rows where
Units Sold> 1000 andPrice per Unit>$45.00.Show rows where
Units Sold< 1000 orPrice per Unit<$45.00.Filter rows where
Units Sold> 800 andUnits Sold< 1200.Select rows where
Price per Unitequals$50.00andSales Methodis"In-store".Show records where
Units Soldis either 850 or 900.Get rows where
Units Soldis not in[850, 900, 1250].Select all rows where
Productis in a list of specific product types (e.g.["Men's Apparel", "Women's Apparel"]).Display all rows where
Units Sold<= 1000 andPrice per Unit>=$45.00.Filter all rows where both
CityandStateare"New York".Show rows where
Price per Unitis$45.00or$50.00.
🔸 Using .isin() and .between()
.isin() and .between()Filter rows where
Regionis in["Northeast", "Southwest"].Display all rows where
Units Soldis between 1000 and 1200 (inclusive).Select all rows where
Price per Unitis between$45.00and$60.00.Get rows where
Sales Methodis in["In-store", "Online"].Exclude all rows where
Regionis in["Northwest", "Midwest"]using~.Show records where
Units Soldis not between 900 and 1100.Get rows where both
RegionandSales Methodare in given lists.Select rows where
Price per Unitis between$40.00and$50.00andUnits Sold> 900.Display rows where
Units Soldis in[850, 1000, 1200, 1250].Filter out rows where
Units Soldis in[850, 900].
🔸 Using .iloc for positional filtering
.iloc for positional filteringDisplay the first 10 rows using
.iloc.Show rows from index position 100 to 120.
Display every 5th row using
.iloc.Show the last 10 rows using negative indices.
Display alternate rows from the dataset (even index rows).
Select rows between position 50 and 60 where
Units Sold> 1000.Using
.iloc, get the first 3 rows and specific columns (Retailer,Units Sold).Get rows from index 10 to 20 for columns
ProductandUnits Sold.Fetch rows using a list of specific index positions (e.g.,
[0, 2, 5, 7]).Select the middle 10% of records using
.iloc.
🔸 Using .loc for conditional filtering
.loc for conditional filteringUse
.locto select rows wherePrice per Unit>$45.00.Use
.locto select rows whereUnits Sold< 1000 andProductis"Men's Street Footwear".Using
.loc, filter rows whereSales Methodis"In-store"andRegionis"Northeast".Using
.loc, select rows whereUnits Sold> 1000 orProductis"Women's Apparel".With
.loc, exclude rows wherePrice per Unitequals$50.00.
🔸 Updating Records using .loc
.locUpdate
Sales Methodto"Online"whereUnits Sold> 1200.Change
Price per Unitto$55.00for all"Men's Apparel"products.Set
Regionto"East"for all rows whereStateis"New York".Increase
Units Soldby 10% for all rows whereProductcontains"Footwear".Update multiple columns — set
Price per Unit = "$60.00"andSales Method = "Online"whereUnits Sold> 1000.
Last updated