7. Aggregate Functions

Aggregate Function in Pandas

Welcome to the most important part in pandas. When you are working on data, you need to perform calculations. Aggregate functions (sum, max, min, mean, median, mode, count) help you perform calculations on a dataset.

In [2]:

import pandas as pd

In [3]:

df = pd.read_csv('score.csv')
df

Out[3]:

Name
Math
Science
English
History
Art

0

Alice

78

82

75

68

90

1

Bob

85

79

80

74

87

2

Charlie

92

95

85

89

93

3

David

65

70

60

55

72

4

Eva

88

90

78

82

85

5

Frank

85

79

80

74

87

6

Grace

92

90

85

82

93

Axis parameter explanation:

1

axis = 0 (columns)

This is the default. Aggregation is performed column-wise.

2

axis = 1 (rows)

Aggregation is performed row-wise.


sum

sum function is used to add values.

sum on columns

In [4]:

df['Math'].sum()

Out[4]:

np.int64(585)

In [5]:

df[['Math','Science']].sum()

Out[5]:

Math       585
Science    585
dtype: int64

In [6]:

df.iloc[:, 1:6].sum()

Out[6]:

Math       585
Science    585
English    543
History    524
Art        607
dtype: int64

sum on rows

To perform operation on rows, pass axis=1.

In [7]:

df['total'] = df.iloc[:, 1:6].sum(axis=1)
df

Out[7]:

Name
Math
Science
English
History
Art
total

0

Alice

78

82

75

68

90

393

1

Bob

85

79

80

74

87

405

2

Charlie

92

95

85

89

93

454

3

David

65

70

60

55

72

322

4

Eva

88

90

78

82

85

423

5

Frank

85

79

80

74

87

405

6

Grace

92

90

85

82

93

442


max

max function is used to find maximum values in rows and columns.

max for columns

In [8]:

df['Math'].max()

Out[8]:

np.int64(92)

In [9]:

df[['Math','Science']].max()

Out[9]:

Math       92
Science    95
dtype: int64

In [10]:

df.iloc[:, 1:6].sum()

Out[10]:

Math       585
Science    585
English    543
History    524
Art        607
dtype: int64

max for rows

In [11]:

df['maximum score'] = df.iloc[:, 1:6].max(axis=1)
df

Out[11]:

Name
Math
Science
English
History
Art
total
maximum score

0

Alice

78

82

75

68

90

393

90

1

Bob

85

79

80

74

87

405

87

2

Charlie

92

95

85

89

93

454

95

3

David

65

70

60

55

72

322

72

4

Eva

88

90

78

82

85

423

90

5

Frank

85

79

80

74

87

405

87

6

Grace

92

90

85

82

93

442

93


min

min function is used to find minimum values.

min for columns

In [12]:

df['Math'].min()

Out[12]:

np.int64(65)

In [13]:

df[['Math','Science']].min()

Out[13]:

Math       65
Science    70
dtype: int64

In [14]:

df.iloc[:, 1:6].min()

Out[14]:

Math       65
Science    70
English    60
History    55
Art        72
dtype: int64

min for rows

In [15]:

df['min score'] = df.iloc[:, 1:6].min(axis=1)
df

Out[15]:

Name
Math
Science
English
History
Art
total
maximum score
min score

0

Alice

78

82

75

68

90

393

90

68

1

Bob

85

79

80

74

87

405

87

74

2

Charlie

92

95

85

89

93

454

95

85

3

David

65

70

60

55

72

322

72

55

4

Eva

88

90

78

82

85

423

90

78

5

Frank

85

79

80

74

87

405

87

74

6

Grace

92

90

85

82

93

442

93

82


count

count function is used to count values.

count for columns

In [16]:

df['Name'].count()

Out[16]:

np.int64(7)

In [17]:

df[['Name','Math']].count()

Out[17]:

Name    7
Math    7
dtype: int64

In [18]:

df.iloc[:, 1:6].count()

Out[18]:

Math       7
Science    7
English    7
History    7
Art        7
dtype: int64

count for rows

In [19]:

df['total exams'] = df.iloc[:, 1:6].count(axis=1)
df

Out[19]:

Name
Math
Science
English
History
Art
total
maximum score
min score
total exams

0

Alice

78

82

75

68

90

393

90

68

5

1

Bob

85

79

80

74

87

405

87

74

5

2

Charlie

92

95

85

89

93

454

95

85

5

3

David

65

70

60

55

72

322

72

55

5

4

Eva

88

90

78

82

85

423

90

78

5

5

Frank

85

79

80

74

87

405

87

74

5

6

Grace

92

90

85

82

93

442

93

82

5


mean

mean function is used to find the average of values.

mean for columns

In [20]:

df['Math'].mean()

Out[20]:

np.float64(83.57142857142857)

In [21]:

df[['Math','Science']].mean()

Out[21]:

Math       83.571429
Science    83.571429
dtype: float64

In [22]:

df.iloc[:, 1:6].mean()

Out[22]:

Math       83.571429
Science    83.571429
English    77.571429
History    74.857143
Art        86.714286
dtype: float64

mean for rows

In [23]:

df['avg mark'] = df.iloc[:, 1:6].mean(axis=1)
df

Out[23]:

Name
Math
Science
English
History
Art
total
maximum score
min score
total exams
avg mark

0

Alice

78

82

75

68

90

393

90

68

5

78.6

1

Bob

85

79

80

74

87

405

87

74

5

81.0

2

Charlie

92

95

85

89

93

454

95

85

5

90.8

3

David

65

70

60

55

72

322

72

55

5

64.4

4

Eva

88

90

78

82

85

423

90

78

5

84.6

5

Frank

85

79

80

74

87

405

87

74

5

81.0

6

Grace

92

90

85

82

93

442

93

82

5

88.4


median

median is used to find the center value.

In [24]:

df['Math'].median()

Out[24]:

np.float64(85.0)

median for rows

In [25]:

df['middle value'] = df.iloc[:, 1:6].median(axis=1)
df

Out[25]:

Name
Math
Science
English
History
Art
total
maximum score
min score
total exams
avg mark
middle value

0

Alice

78

82

75

68

90

393

90

68

5

78.6

78.0

1

Bob

85

79

80

74

87

405

87

74

5

81.0

80.0

2

Charlie

92

95

85

89

93

454

95

85

5

90.8

92.0

3

David

65

70

60

55

72

322

72

55

5

64.4

65.0

4

Eva

88

90

78

82

85

423

90

78

5

84.6

85.0

5

Frank

85

79

80

74

87

405

87

74

5

81.0

80.0

6

Grace

92

90

85

82

93

442

93

82

5

88.4

90.0


mode

mode is used to find the most occurred value.

In [26]:

df['Math'].mode()

Out[26]:

0    85
1    92
Name: Math, dtype: int64

Measure of Central Tendency

The measure of central tendency is a way to find the center or typical value in a set of data — the point where most numbers cluster around.

Three main types:

  • Mean (Average) – Add up all the numbers and divide by how many there are. Example: (2 + 4 + 6) / 3 = 4

  • Median – The middle value when all numbers are arranged in order. Example: In [3, 5, 8], the median is 5.

  • Mode – The value that appears most often. Example: In [2, 3, 3, 5], the mode is 3.

In short — these measures tell you what’s typical or central about your data.


A Scenario to understand the difference between mean, median and mode

Imagine you work for a recruitment company and you need to find out which country pays better: India or USA. You collected samples of working professionals from cities in India and the USA.

In [27]:

idf = pd.read_csv('india_software_engineer.csv')
idf

Out[27]:

(Truncated table; full rows shown in original content)

Employee_ID
Country
Salary

0

1

India

48000

...

...

...

...

50

51

India

250000

51

52

India

400000

52

53

India

600000

53

54

India

800000

54

55

India

1000000

In [28]:

udf = pd.read_csv('usa_software_engineer.csv')
udf

Out[28]:

(Truncated table; full rows shown in original content)

Employee_ID
Country
Salary

0

1

USA

115000

...

...

...

...

39

40

USA

119000

In [29]:

avg_indian_salary = idf['Salary'].mean()
avg_usa_salary = udf['Salary'].mean()
avg_indian_salary, avg_usa_salary

Out[29]:

(np.float64(101127.27272727272), np.float64(119775.0))

Now, as you can see, there is not much difference between average salary in India and USA. Something is not right — this is because there is a huge salary difference for some individuals (outliers). Mean can be misleading when outliers exist.

In such a scenario, we use median to understand the typical salary people get in both countries.

In [30]:

median_india_salary = idf['Salary'].median()
median_usa_salary = udf['Salary'].median()
median_india_salary, median_usa_salary

Out[30]:

(np.float64(50500.0), np.float64(120000.0))

Now 50% of Indian salaries are less than 50,500, whereas 50% of USA salaries are less than 120,000. So here the median indicates the typical salary is much higher in the USA.

Conclusion

Avg Salary in India : 1 Lakh

Median Salary in India : 50 K

Avg Salary In US : 1.2 Lakh

Median Salary in US : 1.2 Lakh

So , 50% of Indian software engineers are getting less than 50000 , while In US , the figure is 1 Lakh.

USA Employees are getting paid almost twice than Indian Employees.

In India , the difference between mean and median is also very high, which means some software engineers are getting paid a lot while others don't. While In USA , Payscale doesn't vary much.

mode

Mode shows the most frequent salary.

In [31]:

idf['Salary'].mode()

Out[31]:

0    49500
Name: Salary, dtype: int64

In [32]:

udf['Salary'].mode()

Out[32]:

0    120000
Name: Salary, dtype: int64

Assignments

A. Column-wise (axis=0)

Download store sales.csv file from Required Files

Treat each column separately — analyze sales across all stores per quarter.

  1. Find the total sales per quarter using .sum(axis=0).

  2. Find the maximum sales value per quarter using .max(axis=0).

  3. Find the minimum sales value per quarter using .min(axis=0).

  4. Compute the average (mean) sales per quarter using .mean(axis=0).

  5. Compute the median sales per quarter using .median(axis=0).

  6. Find the mode of sales per quarter using .mode(axis=0).

  7. Find how many stores reported data for each quarter using .count(axis=0).

  8. Identify which quarter had the highest total sales overall.

  9. Calculate the range (max − min) for each quarter.


B. Row-wise (axis=1)

Treat each row as one store — analyze how its quarterly sales performed.

  1. Calculate the total yearly sales per store using .sum(axis=1).

  2. Find the quarter with the highest sales for each store using .max(axis=1).

  3. Find the quarter with the lowest sales per store using .min(axis=1).

  4. Compute the average (mean) quarterly sales for each store using .mean(axis=1).

  5. Compute the median sales per store using .median(axis=1).

  6. Count how many quarters each store reported (use .count(axis=1)).

  7. Find which store had the highest yearly total sales.

  8. Find which store had the lowest average quarterly sales.

Download Air Quality index file from Required Files.

  1. Which Country has better air quality index USA or India ? Find Mean and Median and also find Variance ?

Last updated