7. Aggregate Functions

Aggregate Function in Pandas

Welcome to the most important part in pandas. When you are working on data, you need to perform calculations. Aggregate functions (sum, max, min, mean, median, mode, count) help you perform calculations on a dataset.

In [2]:

import pandas as pd

In [3]:

df = pd.read_csv('score.csv')
df

Out[3]:

Name

Math

Science

English

History

Art

Alice

Bob

Charlie

David

Eva

Frank

Grace

Axis parameter explanation:

axis = 0 (columns)

This is the default. Aggregation is performed column-wise.

axis = 1 (rows)

Aggregation is performed row-wise.

sum

sum function is used to add values.

sum on columns

In [4]:

df['Math'].sum()

Out[4]:

np.int64(585)

In [5]:

df[['Math','Science']].sum()

Out[5]:

Math       585
Science    585
dtype: int64

In [6]:

df.iloc[:, 1:6].sum()

Out[6]:

Math       585
Science    585
English    543
History    524
Art        607
dtype: int64

sum on rows

To perform operation on rows, pass axis=1.

In [7]:

df['total'] = df.iloc[:, 1:6].sum(axis=1)
df

Out[7]:

Name

Math

Science

English

History

Art

total

Alice

393

Bob

405

Charlie

454

David

322

Eva

423

Frank

405

Grace

442

max

max function is used to find maximum values in rows and columns.

max for columns

In [8]:

df['Math'].max()

Out[8]:

np.int64(92)

In [9]:

df[['Math','Science']].max()

Out[9]:

Math       92
Science    95
dtype: int64

In [10]:

df.iloc[:, 1:6].sum()

Out[10]:

Math       585
Science    585
English    543
History    524
Art        607
dtype: int64

max for rows

In [11]:

df['maximum score'] = df.iloc[:, 1:6].max(axis=1)
df

Out[11]:

Name

Math

Science

English

History

Art

total

maximum score

Alice

393

Bob

405

Charlie

454

David

322

Eva

423

Frank

405

Grace

442

min

min function is used to find minimum values.

min for columns

In [12]:

df['Math'].min()

Out[12]:

np.int64(65)

In [13]:

df[['Math','Science']].min()

Out[13]:

Math       65
Science    70
dtype: int64

In [14]:

df.iloc[:, 1:6].min()

Out[14]:

Math       65
Science    70
English    60
History    55
Art        72
dtype: int64

min for rows

In [15]:

df['min score'] = df.iloc[:, 1:6].min(axis=1)
df

Out[15]:

Name

Math

Science

English

History

Art

total

maximum score

min score

Alice

393

Bob

405

Charlie

454

David

322

Eva

423

Frank

405

Grace

442

count

count function is used to count values.

count for columns

In [16]:

df['Name'].count()

Out[16]:

np.int64(7)

In [17]:

df[['Name','Math']].count()

Out[17]:

Name    7
Math    7
dtype: int64

In [18]:

df.iloc[:, 1:6].count()

Out[18]:

Math       7
Science    7
English    7
History    7
Art        7
dtype: int64

count for rows

In [19]:

df['total exams'] = df.iloc[:, 1:6].count(axis=1)
df

Out[19]:

Name

Math

Science

English

History

Art

total

maximum score

min score

total exams

Alice

393

Bob

405

Charlie

454

David

322

Eva

423

Frank

405

Grace

442

mean

mean function is used to find the average of values.

mean for columns

In [20]:

df['Math'].mean()

Out[20]:

np.float64(83.57142857142857)

In [21]:

df[['Math','Science']].mean()

Out[21]:

Math       83.571429
Science    83.571429
dtype: float64

In [22]:

df.iloc[:, 1:6].mean()

Out[22]:

Math       83.571429
Science    83.571429
English    77.571429
History    74.857143
Art        86.714286
dtype: float64

mean for rows

In [23]:

df['avg mark'] = df.iloc[:, 1:6].mean(axis=1)
df

Out[23]:

Name

Math

Science

English

History

Art

total

maximum score

min score

total exams

avg mark

Alice

393

78.6

Bob

405

81.0

Charlie

454

90.8

David

322

64.4

Eva

423

84.6

Frank

405

81.0

Grace

442

88.4

median

median is used to find the center value.

In [24]:

df['Math'].median()

Out[24]:

np.float64(85.0)

median for rows

In [25]:

df['middle value'] = df.iloc[:, 1:6].median(axis=1)
df

Out[25]:

Name

Math

Science

English

History

Art

total

maximum score

min score

total exams

avg mark

middle value

Alice

393

78.6

78.0

Bob

405

81.0

80.0

Charlie

454

90.8

92.0

David

322

64.4

65.0

Eva

423

84.6

85.0

Frank

405

81.0

80.0

Grace

442

88.4

90.0

mode

mode is used to find the most occurred value.

In [26]:

df['Math'].mode()

Out[26]:

0    85
1    92
Name: Math, dtype: int64

Measure of Central Tendency

The measure of central tendency is a way to find the center or typical value in a set of data — the point where most numbers cluster around.

Three main types:

Mean (Average) – Add up all the numbers and divide by how many there are. Example: (2 + 4 + 6) / 3 = 4
Median – The middle value when all numbers are arranged in order. Example: In [3, 5, 8], the median is 5.
Mode – The value that appears most often. Example: In [2, 3, 3, 5], the mode is 3.

In short — these measures tell you what’s typical or central about your data.

A Scenario to understand the difference between mean, median and mode

Imagine you work for a recruitment company and you need to find out which country pays better: India or USA. You collected samples of working professionals from cities in India and the USA.

In [27]:

idf = pd.read_csv('india_software_engineer.csv')
idf

Out[27]:

(Truncated table; full rows shown in original content)

Employee_ID

Country

Salary

India

48000

...

India

250000

India

400000

India

600000

India

800000

India

1000000

In [28]:

udf = pd.read_csv('usa_software_engineer.csv')
udf

Out[28]:

(Truncated table; full rows shown in original content)

Employee_ID

Country

Salary

USA

115000

...

USA

119000

In [29]:

avg_indian_salary = idf['Salary'].mean()
avg_usa_salary = udf['Salary'].mean()
avg_indian_salary, avg_usa_salary

Out[29]:

(np.float64(101127.27272727272), np.float64(119775.0))

Now, as you can see, there is not much difference between average salary in India and USA. Something is not right — this is because there is a huge salary difference for some individuals (outliers). Mean can be misleading when outliers exist.

In such a scenario, we use median to understand the typical salary people get in both countries.

In [30]:

median_india_salary = idf['Salary'].median()
median_usa_salary = udf['Salary'].median()
median_india_salary, median_usa_salary

Out[30]:

(np.float64(50500.0), np.float64(120000.0))

Now 50% of Indian salaries are less than 50,500, whereas 50% of USA salaries are less than 120,000. So here the median indicates the typical salary is much higher in the USA.

Conclusion

Avg Salary in India : 1 Lakh

Median Salary in India : 50 K

Avg Salary In US : 1.2 Lakh

Median Salary in US : 1.2 Lakh

So , 50% of Indian software engineers are getting less than 50000 , while In US , the figure is 1 Lakh.

USA Employees are getting paid almost twice than Indian Employees.

In India , the difference between mean and median is also very high, which means some software engineers are getting paid a lot while others don't. While In USA , Payscale doesn't vary much.

mode

Mode shows the most frequent salary.

In [31]:

idf['Salary'].mode()

Out[31]:

0    49500
Name: Salary, dtype: int64

In [32]:

udf['Salary'].mode()

Out[32]:

0    120000
Name: Salary, dtype: int64

Assignments

A. Column-wise (axis=0)

Download store sales.csv file from Required Files

Treat each column separately — analyze sales across all stores per quarter.

Find the total sales per quarter using .sum(axis=0).
Find the maximum sales value per quarter using .max(axis=0).
Find the minimum sales value per quarter using .min(axis=0).
Compute the average (mean) sales per quarter using .mean(axis=0).
Compute the median sales per quarter using .median(axis=0).
Find the mode of sales per quarter using .mode(axis=0).
Find how many stores reported data for each quarter using .count(axis=0).
Identify which quarter had the highest total sales overall.
Calculate the range (max − min) for each quarter.

B. Row-wise (axis=1)

Treat each row as one store — analyze how its quarterly sales performed.

Calculate the total yearly sales per store using .sum(axis=1).
Find the quarter with the highest sales for each store using .max(axis=1).
Find the quarter with the lowest sales per store using .min(axis=1).
Compute the average (mean) quarterly sales for each store using .mean(axis=1).
Compute the median sales per store using .median(axis=1).
Count how many quarters each store reported (use .count(axis=1)).
Find which store had the highest yearly total sales.
Find which store had the lowest average quarterly sales.

Download Air Quality index file from Required Files.

Which Country has better air quality index USA or India ? Find Mean and Median and also find Variance ?

Previous6. Sorting Value in Pandas Next8. Practice - air quality usa vs india

Last updated 12 days ago

Good morning

Aggregate Function in Pandas

axis = 0 (columns)

axis = 1 (rows)

sum

sum on columns

sum on rows

max

max for columns

max for rows

min

min for columns

min for rows

count

count for columns

count for rows

mean

mean for columns

mean for rows

median

median for rows

mode

Measure of Central Tendency

A Scenario to understand the difference between mean, median and mode

Conclusion

mode

Assignments

A. Column-wise (axis=0)

B. Row-wise (axis=1)