9. value_counts() and groupby()

Download Dataset:

579KB

value_counts() and groupby() in pandas

value_counts() and groupby() in pandas are used to summarize the whole data to find meaningful insights.

Suppose you work for bmw and you are given a dataset to find meaningful information. You may use value_counts() and groupby() to find meaningful information out of it.

# read dataset
import pandas as pd
df = pd.read_csv('bmw.csv')
df.head()

Out:

model

year

price

transmission

mileage

fuelType

tax

mpg

engineSize

2 Series

2018

16250

Manual

10401

Petrol

145

52.3

1.5

5 Series

2014

11200

Automatic

67068

Diesel

125

57.6

2.0

6 Series

2018

27000

Automatic

14827

Petrol

145

42.8

2.0

5 Series

2016

16000

Automatic

62794

Diesel

160

51.4

3.0

1 Series

2017

12750

Automatic

26676

Diesel

145

72.4

1.5

value_counts()

value_counts() is used to count the unique values in a column.

transmission

How many types of transmission did BMW introduce?

df['transmission'].value_counts()

Out:

transmission Semi-Auto 4666 Automatic 3588 Manual 2527 Name: count, dtype: int64

Insight: Semi-Auto (4666) is the highest and Manual (2527) is the lowest.

Note 1:

value_counts() always gives values in descending order.

Note 2:

When you value_counts() a column, the values become the index.

normalize = True

Convert counts to proportions:

df['transmission'].value_counts(normalize=True)

Out:

transmission Semi-Auto 0.432798 Automatic 0.332808 Manual 0.234394 Name: proportion, dtype: float64

Insight: BMW manufactured ~0.43 Semi-Auto cars of all cars.

Convert to percentage

df['transmission'].value_counts(normalize=True) * 100

Out:

transmission Semi-Auto 43.279844 Automatic 33.280772 Manual 23.439384 Name: proportion, dtype: float64

Insight: BMW manufactured 43% Semi-Auto cars.

Different Scenarios of value_counts

sort = False

df['transmission'].value_counts(sort=False)

Out:

transmission Manual 2527 Automatic 3588 Semi-Auto 4666 Name: count, dtype: int64

sort_index()

df['transmission'].value_counts().sort_index()

Out:

transmission Automatic 3588 Manual 2527 Semi-Auto 4666 Name: count, dtype: int64

reset_index()

df['transmission'].value_counts().reset_index()

Out:

transmission

count

Semi-Auto

4666

Automatic

3588

Manual

2527

Counting Multiple Values with value_counts()

year and transmission

df[['year','transmission']].value_counts()

Out (truncated):

year transmission 2019 Semi-Auto 2005 Automatic 997 2016 Automatic 736 2017 Semi-Auto 672 ... Name: count, dtype: int64

Insight: 2019 has the highest number of Semi-Auto cars.

normalize = True (proportions)

df[['year','transmission']].value_counts(normalize=True)

Out (truncated; proportions):

year transmission 2019 Semi-Auto 0.185975 Automatic 0.092478 2016 Automatic 0.068268 ... Name: proportion, dtype: float64

percentage

df[['year','transmission']].value_counts(normalize=True) * 100

Out (truncated; percentages):

year transmission 2019 Semi-Auto 18.597533 Automatic 9.247751 2016 Automatic 6.826825 ... Name: proportion, dtype: float64

Insight: 18% of Semi-Auto cars were produced in 2019 — the highest.

sort_index()

You can sort by both indexes:

mdf = df[['year','transmission']].value_counts(normalize=True) * 100
mdf = mdf.sort_index()
mdf

Out (truncated):

year transmission 1996 Automatic 0.009276 1997 Manual 0.009276 1998 Manual 0.009276 1999 Automatic 0.009276 1999 Manual 0.027827 ... 2020 Semi-Auto 5.509693 Name: proportion, dtype: float64

reset_index()

mdf = df[['year','transmission']].value_counts(normalize=True) * 100
mdf = mdf.reset_index()
mdf

Out (truncated):

year

transmission

proportion

2019

Semi-Auto

18.597533

2019

Automatic

9.247751

2016

Automatic

6.826825

2017

Semi-Auto

6.233188

2017

Automatic

5.973472

...

sort by year and proportion

mdf = mdf.sort_values(['year','transmission'], ascending=[True, False])
mdf

Out (truncated, sorted):

year

transmission

proportion

1996

Automatic

0.009276

1997

Manual

0.009276

1998

Manual

0.009276

1999

Manual

0.027827

...

2020

Automatic

0.704944

Insight: Shows top manufactured transmission cars in each year.

Groupby in Pandas

Groupby is used to group column(s) and perform aggregate functions.

Syntax:

df.groupby(groupby_col).agg( variable = (operation_col, operation) )

Single-column groupby

Average mileage per transmission

mdf = df.groupby('transmission').agg(avg_mileage = ('mileage','mean'))
mdf

Out:

transmission

avg_mileage

Automatic

31830.679487

Manual

34720.485160

Semi-Auto

15631.347621

Note: groupby makes the grouping column the index. Use reset_index() to get it back as a column.

mdf = df.groupby('transmission').agg(avg_mileage = ('mileage','mean')).reset_index()
mdf

Out:

transmission

avg_mileage

Automatic

31830.679487

Manual

34720.485160

Semi-Auto

15631.347621

sort avg_mileage to find best mileage transmission

mdf = mdf.sort_values('avg_mileage', ascending=False)
mdf

Out:

transmission

avg_mileage

Manual

34720.485160

Automatic

31830.679487

Semi-Auto

15631.347621

Insight: Manual cars have the highest average mileage.

model count, average, median per transmission

mdf = df.groupby('transmission').agg(
    avg_mileage = ('mileage','mean'),
    median_mileage = ('mileage','median'),
    car_count = ('model','count')
)
mdf = mdf.sort_values('median_mileage', ascending=False)
mdf

Out:

transmission

avg_mileage

median_mileage

car_count

Manual

34720.48516

30000.0

2527

Automatic

31830.67949

26193.5

3588

Semi-Auto

15631.34762

9036.5

4666

Insight: Median mileage for Manual is ~30000; Semi-Auto ~9000.

multiple aggregates per transmission

mdf = df.groupby('transmission').agg(
    model_count = ('model','count'),
    total_price = ('price','sum'),
    avg_price = ('price','mean'),
    median_price = ('price','median'),
    max_price = ('price','max'),
    min_price = ('price','min'),
    avg_engine_size = ('engineSize','mean'),
)
mdf

Out:

transmission

model_count

total_price

avg_price

median_price

max_price

min_price

avg_engine_size

Automatic

3588

80439458

22419.023969

19490.0

99950

1200

2.258250

Manual

2527

36988839

14637.451128

13500.0

46000

1445

1.837594

Semi-Auto

4666

127660584

27359.747964

24990.0

123456

5299

2.277004

Best fuel type for mileage

mdf = df.groupby('fuelType').agg(
    avg_mileage = ('mileage','mean'),
    median_mileage = ('mileage','median'),
).sort_values('avg_mileage', ascending=False)
mdf

Out:

fuelType

avg_mileage

median_mileage

Other

44760.638889

44351.5

Diesel

29852.804042

24600.0

Hybrid

23400.530201

16900.0

Electric

19925.333333

20321.0

Petrol

16524.097454

9097.0

Insight: Excluding "Other", Diesel has the highest avg mileage (~29852) and Petrol the lowest.

Multiple-columns groupby

transmission + fuelType — average price

mdf = df.groupby(['transmission','fuelType']).agg(avg_price = ('price','mean'))
mdf

Out:

transmission

fuelType

avg_price

Automatic

Diesel

21216.705504

Automatic

Electric

18466.000000

Automatic

Hybrid

25543.463542

Automatic

Other

18193.861111

Automatic

Petrol

25364.465438

Manual

Diesel

13722.741104

Manual

Petrol

16299.633222

Semi-Auto

Diesel

26776.627579

Semi-Auto

Hybrid

30115.377358

Semi-Auto

Petrol

28209.395278

To find the highest avg price per transmission type, reset and sort:

mdf = mdf.reset_index().sort_values(['transmission','fuelType'], ascending=False)
mdf

Out (sorted):

transmission

fuelType

avg_price

Semi-Auto

Petrol

28209.395278

Semi-Auto

Hybrid

30115.377358

Semi-Auto

Diesel

26776.627579

Manual

Petrol

16299.633222

Manual

Diesel

13722.741104

Automatic

Petrol

25364.465438

Automatic

Other

18193.861111

Automatic

Hybrid

25543.463542

Automatic

Electric

18466.000000

Automatic

Diesel

21216.705504

transmission + fuelType — avg price, total price, avg mileage

mdf = df.groupby(['transmission','fuelType']).agg(
    avg_price = ('price','mean'),
    total_price = ('price','sum'),
    avg_mileage = ('mileage','mean'),
)
mdf

Out (truncated):

transmission

fuelType

avg_price

total_price

avg_mileage

Automatic

Diesel

21216.705504

52808380

36286.725191

Automatic

Electric

18466.000000

55398

19925.333333

...

Semi-Auto

Petrol

28209.395278

46601921

11222.515738

Assignments (expandable)

Assignments: questions to answer using value_counts() and groupby()

VALUE COUNTS QUESTIONS

How many cars of each model are there?
How many cars of each fuel type are there?
How many cars of each transmission type are there?
How many cars of each year are there?
How many cars of each engine size are there?
How many cars of each model + fuel type combination are there?
How many cars of each model + transmission combination are there?

How many cars are there for each model + fuel type combination?
How many cars are there for each model + transmission combination?
How many cars are there for each fuel type + transmission combination?
How many cars are there for each model + year combination?

GROUP-BY QUESTIONS

What is the average price for each model?
What is the average mileage for each model?
What is the average mpg for each model?
What is the average price for each fuel type?
What is the average price for each transmission type?
What is the average price for each year?
What is the highest price for each model?
What is the lowest price for each model?
What is the average mileage for each fuel type?
What is the average price for each engine size?
What is the average price for each model + fuel type combination?
What is the average mileage for each model + year combination?
How many cars are listed in each model group?
How many cars are listed in each fuel type group?
What is the average mpg for each fuel type?
What is the average mileage for each transmission type?
What is the average price for cars with low vs high mileage?
What is the average price for each mpg range?
What is the average engine size for each model?
What is the average price for each model and year?
What is the average price for each year and fuel type?
What is the average mileage for each year?
What is the highest mileage for each model?
What is the lowest mpg for each model?
What is the average price grouped by number of cylinders (engineSize) categories?
What is the average price of Petrol, Diesel, and Hybrid cars within each model?
What is the average mileage for each model and transmission type?
How many cars are in each model + engine size group?
What is the average price for cars with different mpg ranges?
What is the average price for cars grouped by (fuel type + transmission)?

(End of notebook content)

Previous8. Practice - air quality usa vs india Next10. datetime in pandas

Last updated 7 days ago

Good morning

Download Dataset:

value_counts() and groupby() in pandas

value_counts()

transmission

normalize = True

Convert to percentage

Different Scenarios of value_counts

sort = False

sort_index()

reset_index()

Counting Multiple Values with value_counts()

year and transmission

normalize = True (proportions)

percentage

sort_index()

reset_index()

sort by year and proportion

Groupby in Pandas

Single-column groupby

Average mileage per transmission

sort avg_mileage to find best mileage transmission

model count, average, median per transmission

multiple aggregates per transmission

Best fuel type for mileage

Multiple-columns groupby

transmission + fuelType — average price

transmission + fuelType — avg price, total price, avg mileage

Assignments (expandable)

VALUE COUNTS QUESTIONS

GROUP-BY QUESTIONS