7. Aggregate Functions
Aggregate Function in Pandas
Welcome to the most important part in pandas. When you are working on data, you need to perform calculations. Aggregate functions (sum, max, min, mean, median, mode, count) help you perform calculations on a dataset.
In [2]:
import pandas as pdIn [3]:
df = pd.read_csv('score.csv')
dfOut[3]:
0
Alice
78
82
75
68
90
1
Bob
85
79
80
74
87
2
Charlie
92
95
85
89
93
3
David
65
70
60
55
72
4
Eva
88
90
78
82
85
5
Frank
85
79
80
74
87
6
Grace
92
90
85
82
93
Axis parameter explanation:
sum
sum function is used to add values.
sum on columns
In [4]:
df['Math'].sum()Out[4]:
np.int64(585)In [5]:
df[['Math','Science']].sum()Out[5]:
Math 585
Science 585
dtype: int64In [6]:
df.iloc[:, 1:6].sum()Out[6]:
Math 585
Science 585
English 543
History 524
Art 607
dtype: int64sum on rows
To perform operation on rows, pass axis=1.
In [7]:
df['total'] = df.iloc[:, 1:6].sum(axis=1)
dfOut[7]:
0
Alice
78
82
75
68
90
393
1
Bob
85
79
80
74
87
405
2
Charlie
92
95
85
89
93
454
3
David
65
70
60
55
72
322
4
Eva
88
90
78
82
85
423
5
Frank
85
79
80
74
87
405
6
Grace
92
90
85
82
93
442
max
max function is used to find maximum values in rows and columns.
max for columns
In [8]:
df['Math'].max()Out[8]:
np.int64(92)In [9]:
df[['Math','Science']].max()Out[9]:
Math 92
Science 95
dtype: int64In [10]:
df.iloc[:, 1:6].sum()Out[10]:
Math 585
Science 585
English 543
History 524
Art 607
dtype: int64max for rows
In [11]:
df['maximum score'] = df.iloc[:, 1:6].max(axis=1)
dfOut[11]:
0
Alice
78
82
75
68
90
393
90
1
Bob
85
79
80
74
87
405
87
2
Charlie
92
95
85
89
93
454
95
3
David
65
70
60
55
72
322
72
4
Eva
88
90
78
82
85
423
90
5
Frank
85
79
80
74
87
405
87
6
Grace
92
90
85
82
93
442
93
min
min function is used to find minimum values.
min for columns
In [12]:
df['Math'].min()Out[12]:
np.int64(65)In [13]:
df[['Math','Science']].min()Out[13]:
Math 65
Science 70
dtype: int64In [14]:
df.iloc[:, 1:6].min()Out[14]:
Math 65
Science 70
English 60
History 55
Art 72
dtype: int64min for rows
In [15]:
df['min score'] = df.iloc[:, 1:6].min(axis=1)
dfOut[15]:
0
Alice
78
82
75
68
90
393
90
68
1
Bob
85
79
80
74
87
405
87
74
2
Charlie
92
95
85
89
93
454
95
85
3
David
65
70
60
55
72
322
72
55
4
Eva
88
90
78
82
85
423
90
78
5
Frank
85
79
80
74
87
405
87
74
6
Grace
92
90
85
82
93
442
93
82
count
count function is used to count values.
count for columns
In [16]:
df['Name'].count()Out[16]:
np.int64(7)In [17]:
df[['Name','Math']].count()Out[17]:
Name 7
Math 7
dtype: int64In [18]:
df.iloc[:, 1:6].count()Out[18]:
Math 7
Science 7
English 7
History 7
Art 7
dtype: int64count for rows
In [19]:
df['total exams'] = df.iloc[:, 1:6].count(axis=1)
dfOut[19]:
0
Alice
78
82
75
68
90
393
90
68
5
1
Bob
85
79
80
74
87
405
87
74
5
2
Charlie
92
95
85
89
93
454
95
85
5
3
David
65
70
60
55
72
322
72
55
5
4
Eva
88
90
78
82
85
423
90
78
5
5
Frank
85
79
80
74
87
405
87
74
5
6
Grace
92
90
85
82
93
442
93
82
5
mean
mean function is used to find the average of values.
mean for columns
In [20]:
df['Math'].mean()Out[20]:
np.float64(83.57142857142857)In [21]:
df[['Math','Science']].mean()Out[21]:
Math 83.571429
Science 83.571429
dtype: float64In [22]:
df.iloc[:, 1:6].mean()Out[22]:
Math 83.571429
Science 83.571429
English 77.571429
History 74.857143
Art 86.714286
dtype: float64mean for rows
In [23]:
df['avg mark'] = df.iloc[:, 1:6].mean(axis=1)
dfOut[23]:
0
Alice
78
82
75
68
90
393
90
68
5
78.6
1
Bob
85
79
80
74
87
405
87
74
5
81.0
2
Charlie
92
95
85
89
93
454
95
85
5
90.8
3
David
65
70
60
55
72
322
72
55
5
64.4
4
Eva
88
90
78
82
85
423
90
78
5
84.6
5
Frank
85
79
80
74
87
405
87
74
5
81.0
6
Grace
92
90
85
82
93
442
93
82
5
88.4
median
median is used to find the center value.
In [24]:
df['Math'].median()Out[24]:
np.float64(85.0)median for rows
In [25]:
df['middle value'] = df.iloc[:, 1:6].median(axis=1)
dfOut[25]:
0
Alice
78
82
75
68
90
393
90
68
5
78.6
78.0
1
Bob
85
79
80
74
87
405
87
74
5
81.0
80.0
2
Charlie
92
95
85
89
93
454
95
85
5
90.8
92.0
3
David
65
70
60
55
72
322
72
55
5
64.4
65.0
4
Eva
88
90
78
82
85
423
90
78
5
84.6
85.0
5
Frank
85
79
80
74
87
405
87
74
5
81.0
80.0
6
Grace
92
90
85
82
93
442
93
82
5
88.4
90.0
mode
mode is used to find the most occurred value.
In [26]:
df['Math'].mode()Out[26]:
0 85
1 92
Name: Math, dtype: int64Measure of Central Tendency
The measure of central tendency is a way to find the center or typical value in a set of data — the point where most numbers cluster around.
Three main types:
Mean (Average) – Add up all the numbers and divide by how many there are. Example: (2 + 4 + 6) / 3 = 4
Median – The middle value when all numbers are arranged in order. Example: In [3, 5, 8], the median is 5.
Mode – The value that appears most often. Example: In [2, 3, 3, 5], the mode is 3.
In short — these measures tell you what’s typical or central about your data.
A Scenario to understand the difference between mean, median and mode
Imagine you work for a recruitment company and you need to find out which country pays better: India or USA. You collected samples of working professionals from cities in India and the USA.
In [27]:
idf = pd.read_csv('india_software_engineer.csv')
idfOut[27]:
(Truncated table; full rows shown in original content)
0
1
India
48000
...
...
...
...
50
51
India
250000
51
52
India
400000
52
53
India
600000
53
54
India
800000
54
55
India
1000000
In [28]:
udf = pd.read_csv('usa_software_engineer.csv')
udfOut[28]:
(Truncated table; full rows shown in original content)
0
1
USA
115000
...
...
...
...
39
40
USA
119000
In [29]:
avg_indian_salary = idf['Salary'].mean()
avg_usa_salary = udf['Salary'].mean()
avg_indian_salary, avg_usa_salaryOut[29]:
(np.float64(101127.27272727272), np.float64(119775.0))Now, as you can see, there is not much difference between average salary in India and USA. Something is not right — this is because there is a huge salary difference for some individuals (outliers). Mean can be misleading when outliers exist.
In such a scenario, we use median to understand the typical salary people get in both countries.
In [30]:
median_india_salary = idf['Salary'].median()
median_usa_salary = udf['Salary'].median()
median_india_salary, median_usa_salaryOut[30]:
(np.float64(50500.0), np.float64(120000.0))Now 50% of Indian salaries are less than 50,500, whereas 50% of USA salaries are less than 120,000. So here the median indicates the typical salary is much higher in the USA.
Conclusion
Avg Salary in India : 1 Lakh
Median Salary in India : 50 K
Avg Salary In US : 1.2 Lakh
Median Salary in US : 1.2 Lakh
So , 50% of Indian software engineers are getting less than 50000 , while In US , the figure is 1 Lakh.
USA Employees are getting paid almost twice than Indian Employees.
In India , the difference between mean and median is also very high, which means some software engineers are getting paid a lot while others don't. While In USA , Payscale doesn't vary much.
mode
Mode shows the most frequent salary.
In [31]:
idf['Salary'].mode()Out[31]:
0 49500
Name: Salary, dtype: int64In [32]:
udf['Salary'].mode()Out[32]:
0 120000
Name: Salary, dtype: int64Assignments
A. Column-wise (axis=0)
Download store sales.csv file from Required Files
Treat each column separately — analyze sales across all stores per quarter.
Find the total sales per quarter using
.sum(axis=0).Find the maximum sales value per quarter using
.max(axis=0).Find the minimum sales value per quarter using
.min(axis=0).Compute the average (mean) sales per quarter using
.mean(axis=0).Compute the median sales per quarter using
.median(axis=0).Find the mode of sales per quarter using
.mode(axis=0).Find how many stores reported data for each quarter using
.count(axis=0).Identify which quarter had the highest total sales overall.
Calculate the range (max − min) for each quarter.
B. Row-wise (axis=1)
Treat each row as one store — analyze how its quarterly sales performed.
Calculate the total yearly sales per store using
.sum(axis=1).Find the quarter with the highest sales for each store using
.max(axis=1).Find the quarter with the lowest sales per store using
.min(axis=1).Compute the average (mean) quarterly sales for each store using
.mean(axis=1).Compute the median sales per store using
.median(axis=1).Count how many quarters each store reported (use
.count(axis=1)).Find which store had the highest yearly total sales.
Find which store had the lowest average quarterly sales.
Download Air Quality index file from Required Files.
Which Country has better air quality index USA or India ? Find Mean and Median and also find Variance ?
Last updated