5. Filtering

In this session, we are going to cover filtering. Filtering is to get the data that you want from your dataset.

Types of Filtering

There are two types of filtering.

  • iloc (index location) - This is filtering with the help of index.

  • loc (location) - This is filtering with the help of values.

iloc

First we are going to filter with the help of index.

In [22]:

import pandas as pd
df = pd.read_csv('retail_sales_dataset.csv')
df

Out[22]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST001

Male

34

Beauty

3

50

1

CUST002

Female

26

Clothing

2

500

2

CUST003

Male

50

Electronics

1

30

3

CUST004

Male

37

Clothing

1

500

4

CUST005

Male

30

Beauty

2

50

...

...

...

...

...

...

...

995

CUST996

Male

62

Clothing

1

50

996

CUST997

Male

52

Beauty

3

30

997

CUST998

Female

23

Beauty

4

25

998

CUST999

Female

36

Electronics

3

50

999

CUST1000

Male

47

Electronics

4

30

1000 rows × 6 columns

Getting any row in DataFrame

To get any row in dataframe, you should just specify the row index.

In [23]:

df.iloc[0]  # this gets the first row

Out[23]:

Customer ID CUST001 Gender Male Age 34 Product Category Beauty Quantity 3 Price per Unit 50 Name: 0, dtype: object

Getting multiple rows in DataFrame

To get multiple rows in dataframe, you should just specify list of index in iloc.

In [24]:

df.iloc[[0, 3, 5]]

Out[24]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST001

Male

34

Beauty

3

50

3

CUST004

Male

37

Clothing

1

500

5

CUST006

Female

45

Beauty

1

30

Slicing multiple rows in DataFrame

To slice multiple rows in dataframe, you should just use slicing of index like we used to do it in Python.

In [25]:

df.iloc[2:5]

Out[25]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

2

CUST003

Male

50

Electronics

1

30

3

CUST004

Male

37

Clothing

1

500

4

CUST005

Male

30

Beauty

2

50

Getting single value in DataFrame

A single value has two indexes. To get a single value, you need to specify row and column index.

In [26]:

df.iloc[0, 0]

Out[26]:

'CUST001'

Getting random rows and random columns in DataFrame

To get random rows and random columns from dataframe, you should specify the list of rows and list of columns.

In [27]:

df.iloc[[0, 3, 5], [0, 2, 4]]

Out[27]:

Customer ID
Age
Quantity

0

CUST001

34

3

3

CUST004

37

1

5

CUST006

45

1

Getting slicing of rows and columns in DataFrame

You can use slicing of rows and columns in dataframe.

In [28]:

df.iloc[0:3, 2:4]

Out[28]:

Age
Product Category

0

34

Beauty

1

26

Clothing

2

50

Electronics

How would you get a single column?

You can get single column by slicing index of all rows and single column.

In [29]:

df.iloc[:, 1]

Out[29]:

0 Male 1 Female 2 Male 3 Male 4 Male ... 995 Male 996 Male 997 Female 998 Female 999 Male Name: Gender, Length: 1000, dtype: object

Note: iloc is used to get data from index.

loc

loc is used to filter data from values. The syntax of loc is df.loc[Condition].

  • If condition is True, Data will reflect.

  • If condition is False, Data will not reflect.

Filter data where Gender is Male

In [30]:

mdf = df.loc[df['Gender'] == 'Male']
mdf.head()

Out[30]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST001

Male

34

Beauty

3

50

2

CUST003

Male

50

Electronics

1

30

3

CUST004

Male

37

Clothing

1

500

4

CUST005

Male

30

Beauty

2

50

6

CUST007

Male

46

Clothing

2

25

Filter data where Age is greater than 50

In [31]:

fdf = df.loc[df['Age'] > 50]
fdf.head()

Out[31]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

8

CUST009

Male

63

Electronics

2

300

9

CUST010

Female

52

Clothing

4

50

13

CUST014

Male

64

Clothing

4

30

18

CUST019

Female

62

Clothing

2

25

24

CUST025

Female

64

Beauty

1

50

Multiple conditions in filtering

If your task involves multiple filtering, you need to use logical operators. In filtering, we do not use Python logical operators (and, or, not).

1

& (ampersand)

It is used for and operator — when both conditions must be true.

Example:

mdf = df.loc[(df['Gender'] == 'Male') & (df['Age'] > 50)]
mdf
2

| (pipe)

It is used for or — when any one condition must be true.

Example:

mdf = df.loc[(df['Product Category'] == 'Clothing') | (df['Product Category'] == 'Electronics')]
mdf.head()
3

!= (not equal)

It is used for not equal.

Example:

mdf = df.loc[df['Product Category'] != 'Electronics']
mdf.head()
4

~ (tilde)

It does the opposite of the condition (logical NOT). Note: Always write condition in parentheses () when there are multiple conditions.

Example:

mdf = df.loc[~((df['Gender'] == 'Male') & (df['Product Category'] == 'Electronics'))]
mdf.head()

& - Filter data where Gender is Male and Age is greater than 50

In [32]:

mdf = df.loc[(df['Gender'] == 'Male') & (df['Age'] > 50)]
mdf

Out[32]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

8

CUST009

Male

63

Electronics

2

300

13

CUST014

Male

64

Clothing

4

30

35

CUST036

Male

52

Beauty

3

300

47

CUST048

Male

54

Electronics

3

300

58

CUST059

Male

62

Clothing

1

50

...

...

...

...

...

...

...

969

CUST970

Male

59

Electronics

4

500

972

CUST973

Male

60

Clothing

1

50

983

CUST984

Male

56

Clothing

1

500

995

CUST996

Male

62

Clothing

1

50

996

CUST997

Male

52

Beauty

3

30

155 rows × 6 columns

& - Filter data where Age is between 20 to 30

In [33]:

mdf = df.loc[(df['Age'] > 20) & (df['Age'] < 30)]
mdf.head()

Out[33]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

1

CUST002

Female

26

Clothing

2

500

10

CUST011

Male

23

Clothing

2

50

12

CUST013

Male

22

Electronics

3

500

16

CUST017

Female

27

Clothing

4

25

19

CUST020

Male

22

Clothing

3

300

| - Filter data where Category is Clothing or Electronics

In [34]:

mdf = df.loc[(df['Product Category'] == 'Clothing') | (df['Product Category'] == 'Electronics')]
mdf.head()

Out[34]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

1

CUST002

Female

26

Clothing

2

500

2

CUST003

Male

50

Electronics

1

30

3

CUST004

Male

37

Clothing

1

500

6

CUST007

Male

46

Clothing

2

25

7

CUST008

Male

30

Electronics

4

25

!= - Filter where data is not equal to Electronics

In [35]:

mdf = df.loc[df['Product Category'] != 'Electronics']
mdf.head()

Out[35]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST001

Male

34

Beauty

3

50

1

CUST002

Female

26

Clothing

2

500

3

CUST004

Male

37

Clothing

1

500

4

CUST005

Male

30

Beauty

2

50

5

CUST006

Female

45

Beauty

1

30

~ - Filter where data is not equal to Electronics

~ does everything opposite. It gives the result as opposed to the condition.

In [36]:

mdf = df.loc[~(df['Product Category'] != 'Electronics')]
mdf.head()

Out[36]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

2

CUST003

Male

50

Electronics

1

30

7

CUST008

Male

30

Electronics

4

25

8

CUST009

Male

63

Electronics

2

300

12

CUST013

Male

22

Electronics

3

500

14

CUST015

Female

42

Electronics

4

500

~ - Filter data where Gender is not male and Category is not electronics

In [37]:

mdf = df.loc[~((df['Gender'] == 'Male') & (df['Product Category'] == 'Electronics'))]
mdf.head()

Out[37]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST001

Male

34

Beauty

3

50

1

CUST002

Female

26

Clothing

2

500

3

CUST004

Male

37

Clothing

1

500

4

CUST005

Male

30

Beauty

2

50

5

CUST006

Female

45

Beauty

1

30

isin()

isin() function in pandas can filter multiple values.

In [38]:

mdf = df.loc[df['Product Category'].isin(['Electronics', 'Clothing'])]
mdf.head()

Out[38]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

1

CUST002

Female

26

Clothing

2

500

2

CUST003

Male

50

Electronics

1

30

3

CUST004

Male

37

Clothing

1

500

6

CUST007

Male

46

Clothing

2

25

7

CUST008

Male

30

Electronics

4

25

between()

between() function filters the data between two values.

In [39]:

mdf = df.loc[df['Age'].between(20, 30)]
mdf

Out[39]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

1

CUST002

Female

26

Clothing

2

500

4

CUST005

Male

30

Beauty

2

50

7

CUST008

Male

30

Electronics

4

25

10

CUST011

Male

23

Clothing

2

50

12

CUST013

Male

22

Electronics

3

500

...

...

...

...

...

...

...

970

CUST971

Female

27

Electronics

4

50

980

CUST981

Female

30

Electronics

2

30

982

CUST983

Female

29

Clothing

1

300

986

CUST987

Female

30

Clothing

3

300

997

CUST998

Female

23

Beauty

4

25

231 rows × 6 columns

reset_index()

When you do filtering, index lose their serial order. You can use reset_index() to reset the index in dataframe.

  • reset_index() - it resets the index.

  • reset_index(drop=True) - it resets the index and deletes the previous index from the dataframe.

In [40]:

mdf = mdf.reset_index(drop=True)
mdf

Out[40]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST002

Female

26

Clothing

2

500

1

CUST005

Male

30

Beauty

2

50

2

CUST008

Male

30

Electronics

4

25

3

CUST011

Male

23

Clothing

2

50

4

CUST013

Male

22

Electronics

3

500

...

...

...

...

...

...

...

226

CUST971

Female

27

Electronics

4

50

227

CUST981

Female

30

Electronics

2

30

228

CUST983

Female

29

Clothing

1

300

229

CUST987

Female

30

Clothing

3

300

230

CUST998

Female

23

Beauty

4

25

231 rows × 6 columns

Update records

Suppose you want to change price of Beauty from 50 to 100. You can use .loc property to update the value.

In [41]:

df.loc[df['Product Category'] == 'Beauty', 'Price per Unit'] = 100
df.head()

Out[41]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST001

Male

34

Beauty

3

100

1

CUST002

Female

26

Clothing

2

500

2

CUST003

Male

50

Electronics

1

30

3

CUST004

Male

37

Clothing

1

500

4

CUST005

Male

30

Beauty

2

100

You can also use multiple conditions. Suppose you want to change Male Clothing product price from 500 to 1000.

In [42]:

df.loc[(df['Gender'] == 'Male') & (df['Product Category'] == 'Clothing'), 'Price per Unit'] = 1000
df.head()

Out[42]:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST001

Male

34

Beauty

3

100

1

CUST002

Female

26

Clothing

2

500

2

CUST003

Male

50

Electronics

1

30

3

CUST004

Male

37

Clothing

1

1000

4

CUST005

Male

30

Beauty

2

100

Assignments:

50 Pandas Filtering & Update Questions

🔸 Basic Filtering

  1. Display all rows where Region is "Northeast".

  2. Filter rows where Units Sold is greater than 1000.

  3. Show records where Price per Unit is equal to $50.00.

  4. Get all rows where Sales Method is "In-store".

  5. Select records where Retailer is "Foot Locker" and City is "New York".

  6. Filter rows where Units Sold is not equal to 1000.

  7. Show records where Units Sold is greater than or equal to 900.

  8. Filter rows where Price per Unit is less than $50.00.

  9. Display all rows where Units Sold is between 900 and 1200.

  10. Show all records where Region is "Northeast" and State is "New York".


🔸 Multiple Conditions

  1. Display rows where Units Sold > 1000 and Price per Unit > $45.00.

  2. Show rows where Units Sold < 1000 or Price per Unit < $45.00.

  3. Filter rows where Units Sold > 800 and Units Sold < 1200.

  4. Select rows where Price per Unit equals $50.00 and Sales Method is "In-store".

  5. Show records where Units Sold is either 850 or 900.

  6. Get rows where Units Sold is not in [850, 900, 1250].

  7. Select all rows where Product is in a list of specific product types (e.g. ["Men's Apparel", "Women's Apparel"]).

  8. Display all rows where Units Sold <= 1000 and Price per Unit >= $45.00.

  9. Filter all rows where both City and State are "New York".

  10. Show rows where Price per Unit is $45.00 or $50.00.


🔸 Using .isin() and .between()

  1. Filter rows where Region is in ["Northeast", "Southwest"].

  2. Display all rows where Units Sold is between 1000 and 1200 (inclusive).

  3. Select all rows where Price per Unit is between $45.00 and $60.00.

  4. Get rows where Sales Method is in ["In-store", "Online"].

  5. Exclude all rows where Region is in ["Northwest", "Midwest"] using ~.

  6. Show records where Units Sold is not between 900 and 1100.

  7. Get rows where both Region and Sales Method are in given lists.

  8. Select rows where Price per Unit is between $40.00 and $50.00 and Units Sold > 900.

  9. Display rows where Units Sold is in [850, 1000, 1200, 1250].

  10. Filter out rows where Units Sold is in [850, 900].


🔸 Using .iloc for positional filtering

  1. Display the first 10 rows using .iloc.

  2. Show rows from index position 100 to 120.

  3. Display every 5th row using .iloc.

  4. Show the last 10 rows using negative indices.

  5. Display alternate rows from the dataset (even index rows).

  6. Select rows between position 50 and 60 where Units Sold > 1000.

  7. Using .iloc, get the first 3 rows and specific columns (Retailer, Units Sold).

  8. Get rows from index 10 to 20 for columns Product and Units Sold.

  9. Fetch rows using a list of specific index positions (e.g., [0, 2, 5, 7]).

  10. Select the middle 10% of records using .iloc.


🔸 Using .loc for conditional filtering

  1. Use .loc to select rows where Price per Unit > $45.00.

  2. Use .loc to select rows where Units Sold < 1000 and Product is "Men's Street Footwear".

  3. Using .loc, filter rows where Sales Method is "In-store" and Region is "Northeast".

  4. Using .loc, select rows where Units Sold > 1000 or Product is "Women's Apparel".

  5. With .loc, exclude rows where Price per Unit equals $50.00.


🔸 Updating Records using .loc

  1. Update Sales Method to "Online" where Units Sold > 1200.

  2. Change Price per Unit to $55.00 for all "Men's Apparel" products.

  3. Set Region to "East" for all rows where State is "New York".

  4. Increase Units Sold by 10% for all rows where Product contains "Footwear".

  5. Update multiple columns — set Price per Unit = "$60.00" and Sales Method = "Online" where Units Sold > 1000.

Last updated