3. DataFrame and Series
Download Dataset
Two DataTypes in Pandas - DataFrame and Series
When working in pandas, you mainly deal with two data structures:
DataFrame
Series
What is DataFrame
A DataFrame is like a table (an Excel sheet): it has rows and columns. You can also create your own DataFrame from a dictionary, where each key becomes a column name and each list becomes that column's values.
import pandas as pd
data = {
    'product': ['iphone', 'samsung', 'vivo', 'blackberry'],
    'price': [200, 300, 100, 50]
}
df = pd.DataFrame(data)
df
Out:
      product  price
0      iphone    200
1     samsung    300
2        vivo    100
3  blackberry     50
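Series is listed above as the second data type but is not shown in an example. As a minimal sketch, assuming the same product/price dictionary, selecting a single column from a DataFrame gives back a Series (the variable name prices is only illustrative):

```python
import pandas as pd

data = {
    'product': ['iphone', 'samsung', 'vivo', 'blackberry'],
    'price': [200, 300, 100, 50],
}
df = pd.DataFrame(data)

# Selecting one column with square brackets returns a Series
prices = df['price']
print(type(df))      # the whole table is a DataFrame
print(type(prices))  # a single column is a Series

# A Series keeps the column name and the row index of its DataFrame
print(prices.name)
print(prices.tolist())
```

In other words, a DataFrame can be thought of as a collection of Series that share the same row index.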
When you load an Excel or CSV file with pandas, the result is returned as a DataFrame. For example:
df = pd.read_csv('retail_sales_dataset.csv')
df
Out (example):
     Customer ID  Gender  Age  Product Category  Quantity  Price per Unit
0        CUST001    Male   34            Beauty         3              50
1        CUST002  Female   26          Clothing         2             500
2        CUST003    Male   50       Electronics         1              30
3        CUST004    Male   37          Clothing         1             500
4        CUST005    Male   30            Beauty         2              50
...          ...     ...  ...               ...       ...             ...
995      CUST996    Male   62          Clothing         1              50
996      CUST997    Male   52            Beauty         3              30
997      CUST998  Female   23            Beauty         4              25
998      CUST999  Female   36       Electronics         3              50
999     CUST1000    Male   47       Electronics         4              30
1000 rows × 6 columns
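If the dataset file is not at hand, the same behaviour can be tried on an in-memory CSV. The sketch below is only an illustration: csv_text is a hypothetical stand-in holding the first three rows shown above, and io.StringIO lets read_csv treat that string like a file.

```python
import io
import pandas as pd

# Hypothetical stand-in for retail_sales_dataset.csv (first three rows only)
csv_text = """Customer ID,Gender,Age,Product Category,Quantity,Price per Unit
CUST001,Male,34,Beauty,3,50
CUST002,Female,26,Clothing,2,500
CUST003,Male,50,Electronics,1,30
"""

# read_csv accepts a file path or a file-like object and returns a DataFrame
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample.shape)  # (number of rows, number of columns)
```

The first line of the CSV becomes the column names, and pandas infers a numeric type for the Age, Quantity and Price per Unit columns.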
4 Basic Properties of DataFrame
A DataFrame has many properties; the four basic ones are shape, index, values and columns. The values property is shown below with an example.
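As a rough sketch, the four basic properties (shape, index, columns and values) can be tried on a small DataFrame. The name df_small is only illustrative; note that properties are written without parentheses, unlike functions such as head().

```python
import pandas as pd

df_small = pd.DataFrame({
    'product': ['iphone', 'samsung', 'vivo', 'blackberry'],
    'price': [200, 300, 100, 50],
})

print(df_small.shape)    # tuple: (number of rows, number of columns)
print(df_small.index)    # row labels
print(df_small.columns)  # column labels
print(df_small.values)   # all values as a NumPy array

# The first element of shape is the number of rows
n_rows = df_small.shape[0]
print(n_rows)
```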
values
Gives the array of all values.
df.values
Out (example):
array([['CUST001', 'Male', 34, 'Beauty', 3, 50],
       ['CUST002', 'Female', 26, 'Clothing', 2, 500],
       ['CUST003', 'Male', 50, 'Electronics', 1, 30],
       ...,
       ['CUST998', 'Female', 23, 'Beauty', 4, 25],
       ['CUST999', 'Female', 36, 'Electronics', 3, 50],
       ['CUST1000', 'Male', 47, 'Electronics', 4, 30]], shape=(1000, 6), dtype=object)
4 Basic Functions of DataFrame
The four functions you will use most often with a DataFrame are head(), tail(), info() and describe(). Each is explained below with an example:
head()
Returns the top n rows in the DataFrame. By default it returns 5 rows.
df.head()   # default 5 rows
df.head(7)  # first 7 rows
Out (example for df.head()):
     Customer ID  Gender  Age  Product Category  Quantity  Price per Unit
0        CUST001    Male   34            Beauty         3              50
1        CUST002  Female   26          Clothing         2             500
2        CUST003    Male   50       Electronics         1              30
3        CUST004    Male   37          Clothing         1             500
4        CUST005    Male   30            Beauty         2              50
tail()
Returns the bottom n rows in the DataFrame. By default it returns 5 rows.
df.tail()   # default 5 rows
df.tail(3)  # last 3 rows
Out (example for df.tail()):
     Customer ID  Gender  Age  Product Category  Quantity  Price per Unit
995      CUST996    Male   62          Clothing         1              50
996      CUST997    Male   52            Beauty         3              30
997      CUST998  Female   23            Beauty         4              25
998      CUST999  Female   36       Electronics         3              50
999     CUST1000    Male   47       Electronics         4              30
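To see how the n argument of head() and tail() behaves, here is a small sketch on a made-up ten-row DataFrame (the name numbers and its single column n are assumptions for illustration only):

```python
import pandas as pd

# Made-up DataFrame with ten rows: n = 0, 1, ..., 9
numbers = pd.DataFrame({'n': range(10)})

first_default = numbers.head()   # no argument: first 5 rows
first_three = numbers.head(3)    # first 3 rows
last_two = numbers.tail(2)       # last 2 rows

print(len(first_default))
print(first_three['n'].tolist())
print(last_two.index.tolist())   # tail() keeps the original row labels
```

Both functions return a new DataFrame, so the original is left unchanged. The rows returned by tail() keep their original index labels, which is why the output above starts at 995 rather than 0.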
info()
Gives an overview of the DataFrame (dtypes, non-null counts, memory usage).
df.info()
Out (example):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Customer ID       1000 non-null   object
 1   Gender            1000 non-null   object
 2   Age               1000 non-null   int64
 3   Product Category  1000 non-null   object
 4   Quantity          1000 non-null   int64
 5   Price per Unit    1000 non-null   int64
dtypes: int64(3), object(3)
memory usage: 47.0+ KB
describe()
Gives descriptive statistics for the numerical columns: count, mean, standard deviation, and the 5-number summary (min, 25%, 50%, 75%, max).
df.describe()
Out (example):
              Age    Quantity  Price per Unit
count  1000.00000  1000.00000     1000.000000
mean     41.39200     2.51400      179.890000
std      13.68143     1.13273      189.681356
min      18.00000     1.00000       25.000000
25%      29.00000     1.00000       30.000000
50%      42.00000     3.00000       50.000000
75%      53.00000     4.00000      300.000000
max      64.00000     4.00000      500.000000
Explanation for the Age column:
count : 1000 age records
mean : average age ≈ 41.39
min : youngest age = 18
max : oldest age = 64
25% / 50% / 75% : the 25th, 50th (median) and 75th percentiles = 29, 42 and 53; together with min and max they form the 5-number summary
Note: std is the standard deviation (covered later).
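The result of describe() is itself a DataFrame, so single statistics can be picked out of it. The sketch below is an illustration only: people is a made-up five-row DataFrame using the Gender and Age values of the first five customers shown earlier.

```python
import pandas as pd

# Made-up sample: Gender and Age of the first five customers
people = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Male'],
    'Age': [34, 26, 50, 37, 30],
})

# By default only numerical columns are summarised
summary = people.describe()
print(summary)

# Rows of the summary are labelled by statistic, columns by column name
median_age = summary.loc['50%', 'Age']
mean_age = summary.loc['mean', 'Age']
print(median_age, mean_age)

# include='all' also summarises non-numerical columns such as Gender
summary_all = people.describe(include='all')
print(summary_all)
```

With include='all', text columns get count, unique, top (most frequent value) and freq, while the numerical statistics are shown as NaN for them.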
Assignments:
Load the Adidas dataset from the required files.
Get the basic properties (shape, index, values, columns).
Get only the number of rows in the dataset.
Get the top 20 rows.
Get the last 5 rows.
Get the basic information of the DataFrame.
Find the basic statistical summary of the DataFrame.
Find the basic statistical summary of the DataFrame with all columns.
Save the statistical summary as an Excel file and as a CSV file on your desktop.