9. value_counts() and groupby()
Download Dataset:
value_counts() and groupby() in pandas
value_counts() and groupby() in pandas are used to summarize the whole data to find meaningful insights.
Suppose you work for bmw and you are given a dataset to find meaningful information. You may use value_counts() and groupby() to find meaningful information out of it.
# read dataset
import pandas as pd
df = pd.read_csv('bmw.csv')
df.head()Out:
0
2 Series
2018
16250
Manual
10401
Petrol
145
52.3
1.5
1
5 Series
2014
11200
Automatic
67068
Diesel
125
57.6
2.0
2
6 Series
2018
27000
Automatic
14827
Petrol
145
42.8
2.0
3
5 Series
2016
16000
Automatic
62794
Diesel
160
51.4
3.0
4
1 Series
2017
12750
Automatic
26676
Diesel
145
72.4
1.5
value_counts()
value_counts() is used to count the unique values in a column.
transmission
How many types of transmission did BMW introduce?
df['transmission'].value_counts()Out:
transmission Semi-Auto 4666 Automatic 3588 Manual 2527 Name: count, dtype: int64
Insight: Semi-Auto (4666) is the highest and Manual (2527) is the lowest.
Note 1:
value_counts() always gives values in descending order.
Note 2:
When you value_counts() a column, the values become the index.
normalize = True
Convert counts to proportions:
df['transmission'].value_counts(normalize=True)Out:
transmission Semi-Auto 0.432798 Automatic 0.332808 Manual 0.234394 Name: proportion, dtype: float64
Insight: BMW manufactured ~0.43 Semi-Auto cars of all cars.
Convert to percentage
df['transmission'].value_counts(normalize=True) * 100Out:
transmission Semi-Auto 43.279844 Automatic 33.280772 Manual 23.439384 Name: proportion, dtype: float64
Insight: BMW manufactured 43% Semi-Auto cars.
Different Scenarios of value_counts
sort = False
df['transmission'].value_counts(sort=False)Out:
transmission Manual 2527 Automatic 3588 Semi-Auto 4666 Name: count, dtype: int64
sort_index()
df['transmission'].value_counts().sort_index()Out:
transmission Automatic 3588 Manual 2527 Semi-Auto 4666 Name: count, dtype: int64
reset_index()
df['transmission'].value_counts().reset_index()Out:
0
Semi-Auto
4666
1
Automatic
3588
2
Manual
2527
Counting Multiple Values with value_counts()
year and transmission
df[['year','transmission']].value_counts()Out (truncated):
year transmission 2019 Semi-Auto 2005 Automatic 997 2016 Automatic 736 2017 Semi-Auto 672 ... Name: count, dtype: int64
Insight: 2019 has the highest number of Semi-Auto cars.
normalize = True (proportions)
df[['year','transmission']].value_counts(normalize=True)Out (truncated; proportions):
year transmission 2019 Semi-Auto 0.185975 Automatic 0.092478 2016 Automatic 0.068268 ... Name: proportion, dtype: float64
percentage
df[['year','transmission']].value_counts(normalize=True) * 100Out (truncated; percentages):
year transmission 2019 Semi-Auto 18.597533 Automatic 9.247751 2016 Automatic 6.826825 ... Name: proportion, dtype: float64
Insight: 18% of Semi-Auto cars were produced in 2019 — the highest.
sort_index()
You can sort by both indexes:
mdf = df[['year','transmission']].value_counts(normalize=True) * 100
mdf = mdf.sort_index()
mdfOut (truncated):
year transmission 1996 Automatic 0.009276 1997 Manual 0.009276 1998 Manual 0.009276 1999 Automatic 0.009276 1999 Manual 0.027827 ... 2020 Semi-Auto 5.509693 Name: proportion, dtype: float64
reset_index()
mdf = df[['year','transmission']].value_counts(normalize=True) * 100
mdf = mdf.reset_index()
mdfOut (truncated):
0
2019
Semi-Auto
18.597533
1
2019
Automatic
9.247751
2
2016
Automatic
6.826825
3
2017
Semi-Auto
6.233188
4
2017
Automatic
5.973472
...
sort by year and proportion
mdf = mdf.sort_values(['year','transmission'], ascending=[True, False])
mdfOut (truncated, sorted):
49
1996
Automatic
0.009276
52
1997
Manual
0.009276
51
1998
Manual
0.009276
40
1999
Manual
0.027827
...
21
2020
Automatic
0.704944
Insight: Shows top manufactured transmission cars in each year.
Groupby in Pandas
Groupby is used to group column(s) and perform aggregate functions.
Syntax:
df.groupby(groupby_col).agg( variable = (operation_col, operation) )Single-column groupby
Average mileage per transmission
mdf = df.groupby('transmission').agg(avg_mileage = ('mileage','mean'))
mdfOut:
Automatic
31830.679487
Manual
34720.485160
Semi-Auto
15631.347621
Note: groupby makes the grouping column the index. Use reset_index() to get it back as a column.
mdf = df.groupby('transmission').agg(avg_mileage = ('mileage','mean')).reset_index()
mdfOut:
0
Automatic
31830.679487
1
Manual
34720.485160
2
Semi-Auto
15631.347621
sort avg_mileage to find best mileage transmission
mdf = mdf.sort_values('avg_mileage', ascending=False)
mdfOut:
Manual
34720.485160
Automatic
31830.679487
Semi-Auto
15631.347621
Insight: Manual cars have the highest average mileage.
model count, average, median per transmission
mdf = df.groupby('transmission').agg(
avg_mileage = ('mileage','mean'),
median_mileage = ('mileage','median'),
car_count = ('model','count')
)
mdf = mdf.sort_values('median_mileage', ascending=False)
mdfOut:
Manual
34720.48516
30000.0
2527
Automatic
31830.67949
26193.5
3588
Semi-Auto
15631.34762
9036.5
4666
Insight: Median mileage for Manual is ~30000; Semi-Auto ~9000.
multiple aggregates per transmission
mdf = df.groupby('transmission').agg(
model_count = ('model','count'),
total_price = ('price','sum'),
avg_price = ('price','mean'),
median_price = ('price','median'),
max_price = ('price','max'),
min_price = ('price','min'),
avg_engine_size = ('engineSize','mean'),
)
mdfOut:
Automatic
3588
80439458
22419.023969
19490.0
99950
1200
2.258250
Manual
2527
36988839
14637.451128
13500.0
46000
1445
1.837594
Semi-Auto
4666
127660584
27359.747964
24990.0
123456
5299
2.277004
Best fuel type for mileage
mdf = df.groupby('fuelType').agg(
avg_mileage = ('mileage','mean'),
median_mileage = ('mileage','median'),
).sort_values('avg_mileage', ascending=False)
mdfOut:
Other
44760.638889
44351.5
Diesel
29852.804042
24600.0
Hybrid
23400.530201
16900.0
Electric
19925.333333
20321.0
Petrol
16524.097454
9097.0
Insight: Excluding "Other", Diesel has the highest avg mileage (~29852) and Petrol the lowest.
Multiple-columns groupby
transmission + fuelType — average price
mdf = df.groupby(['transmission','fuelType']).agg(avg_price = ('price','mean'))
mdfOut:
Automatic
Diesel
21216.705504
Automatic
Electric
18466.000000
Automatic
Hybrid
25543.463542
Automatic
Other
18193.861111
Automatic
Petrol
25364.465438
Manual
Diesel
13722.741104
Manual
Petrol
16299.633222
Semi-Auto
Diesel
26776.627579
Semi-Auto
Hybrid
30115.377358
Semi-Auto
Petrol
28209.395278
To find the highest avg price per transmission type, reset and sort:
mdf = mdf.reset_index().sort_values(['transmission','fuelType'], ascending=False)
mdfOut (sorted):
Semi-Auto
Petrol
28209.395278
Semi-Auto
Hybrid
30115.377358
Semi-Auto
Diesel
26776.627579
Manual
Petrol
16299.633222
Manual
Diesel
13722.741104
Automatic
Petrol
25364.465438
Automatic
Other
18193.861111
Automatic
Hybrid
25543.463542
Automatic
Electric
18466.000000
Automatic
Diesel
21216.705504
transmission + fuelType — avg price, total price, avg mileage
mdf = df.groupby(['transmission','fuelType']).agg(
avg_price = ('price','mean'),
total_price = ('price','sum'),
avg_mileage = ('mileage','mean'),
)
mdfOut (truncated):
Automatic
Diesel
21216.705504
52808380
36286.725191
Automatic
Electric
18466.000000
55398
19925.333333
...
...
...
...
...
Semi-Auto
Petrol
28209.395278
46601921
11222.515738
Assignments (expandable)
(End of notebook content)
Last updated