3. DataFrame and Series


Two Data Types in Pandas - DataFrame and Series

When working in pandas, you mainly deal with two data types:

  • DataFrame

  • Series

What is a DataFrame

A DataFrame is like a table (for example, an Excel sheet): it has rows and columns. You can also create your own DataFrame from a dictionary, where each key becomes a column name and each list becomes that column's values.

import pandas as pd

data = {
  'product': ['iphone', 'samsung', 'vivo', 'blackberry'],
  'price'  : [200, 300, 100, 50]
}
df = pd.DataFrame(data)
df

Out:

      product  price
0      iphone    200
1     samsung    300
2        vivo    100
3  blackberry     50
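The section title also names Series, the second data type. As a minimal sketch that reuses the product dictionary from above, selecting a single column of a DataFrame gives a Series: a one-dimensional labelled array that shares the DataFrame's row index. The `stock` Series and its numbers are made up purely for illustration.

```python
import pandas as pd

data = {
    'product': ['iphone', 'samsung', 'vivo', 'blackberry'],
    'price': [200, 300, 100, 50],
}
df = pd.DataFrame(data)

# Selecting one column with [] returns a Series, not a DataFrame
prices = df['price']
print(type(prices))
print(prices)

# A Series can also be created directly from a list
# ('stock' and its values are hypothetical example data)
stock = pd.Series([10, 0, 25, 3], name='stock')
print(stock)
```

In other words, a DataFrame can be thought of as a collection of Series that share one index.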

Whenever you load an Excel or CSV file with pandas, the result is a DataFrame by default. For example:

df = pd.read_csv('retail_sales_dataset.csv')
df

Out (example):

     Customer ID  Gender  Age  Product Category  Quantity  Price per Unit
0        CUST001    Male   34            Beauty         3              50
1        CUST002  Female   26          Clothing         2             500
2        CUST003    Male   50       Electronics         1              30
3        CUST004    Male   37          Clothing         1             500
4        CUST005    Male   30            Beauty         2              50
...          ...     ...  ...               ...       ...             ...
995      CUST996    Male   62          Clothing         1              50
996      CUST997    Male   52            Beauty         3              30
997      CUST998  Female   23            Beauty         4              25
998      CUST999  Female   36       Electronics         3              50
999     CUST1000    Male   47       Electronics         4              30

1000 rows × 6 columns
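To try read_csv without having the dataset file at hand, here is a small sketch that reads CSV text from an in-memory buffer instead of a file. The three rows are copied from the output above; using `io.StringIO` as a stand-in for the file is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# A few rows copied from the output above, written as CSV text
csv_text = """Customer ID,Gender,Age,Product Category,Quantity,Price per Unit
CUST001,Male,34,Beauty,3,50
CUST002,Female,26,Clothing,2,500
CUST003,Male,50,Electronics,1,30
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample)
```

Excel files are read in the same way with pd.read_excel, which needs an Excel engine such as openpyxl to be installed.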

4 Basic Properties of DataFrame

A DataFrame has many attributes; the four basic properties you will use most often are shown below with examples.

1. shape

Gives the number of rows and columns.

df.shape

Out:

(1000, 6)

2. index

Gives the row index.

df.index

Out:

RangeIndex(start=0, stop=1000, step=1)

3. columns

Gives all the column names.

df.columns

Out:

Index(['Customer ID', 'Gender', 'Age', 'Product Category', 'Quantity', 'Price per Unit'], dtype='object')

4. values

Gives the array of all values.

df.values

Out (example):

array([['CUST001', 'Male', 34, 'Beauty', 3, 50],
       ['CUST002', 'Female', 26, 'Clothing', 2, 500],
       ['CUST003', 'Male', 50, 'Electronics', 1, 30],
       ...,
       ['CUST998', 'Female', 23, 'Beauty', 4, 25],
       ['CUST999', 'Female', 36, 'Electronics', 3, 50],
       ['CUST1000', 'Male', 47, 'Electronics', 4, 30]], shape=(1000, 6), dtype=object)
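The four properties can be tried on a small DataFrame as well. The sketch below uses the product DataFrame from earlier in this section and also shows two ways to get only the number of rows.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['iphone', 'samsung', 'vivo', 'blackberry'],
    'price': [200, 300, 100, 50],
})

rows, cols = df.shape        # shape is a (rows, columns) tuple
print(rows, cols)            # 4 2
print(len(df))               # 4, another way to count the rows

print(list(df.index))        # [0, 1, 2, 3]
print(list(df.columns))      # ['product', 'price']

values = df.values           # NumPy array holding every cell value
print(values.shape)          # (4, 2)
print(values[0])             # first row of the array
```

Note that shape, index, columns and values are attributes, so they are written without parentheses. For new code, the pandas documentation recommends df.to_numpy() over df.values; both return a NumPy array.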

Basic Functions of DataFrame

The functions you will use most often with a DataFrame are described below, each with an example:

1. head()

Returns the first n rows of the DataFrame. By default it returns 5 rows.

df.head()       # default 5 rows
df.head(7)      # first 7 rows

Out (example for df.head()):

   Customer ID  Gender  Age  Product Category  Quantity  Price per Unit
0      CUST001    Male   34            Beauty         3              50
1      CUST002  Female   26          Clothing         2             500
2      CUST003    Male   50       Electronics         1              30
3      CUST004    Male   37          Clothing         1             500
4      CUST005    Male   30            Beauty         2              50

2. tail()

Returns the last n rows of the DataFrame. By default it returns 5 rows.

df.tail()       # default 5 rows
df.tail(3)      # last 3 rows

Out (example for df.tail()):

     Customer ID  Gender  Age  Product Category  Quantity  Price per Unit
995      CUST996    Male   62          Clothing         1              50
996      CUST997    Male   52            Beauty         3              30
997      CUST998  Female   23            Beauty         4              25
998      CUST999  Female   36       Electronics         3              50
999     CUST1000    Male   47       Electronics         4              30
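To see head() and tail() side by side, here is a small sketch on a ten-row DataFrame. The `order` column is a hypothetical example, not part of the retail dataset.

```python
import pandas as pd

# Ten rows with index labels 0 to 9 ('order' is a made-up column)
df = pd.DataFrame({'order': range(1, 11)})

print(df.head())       # first 5 rows (index 0 to 4)
print(df.head(7))      # first 7 rows (index 0 to 6)
print(df.tail())       # last 5 rows (index 5 to 9)
print(df.tail(3))      # last 3 rows (index 7 to 9)

# Asking for more rows than exist simply returns all of them
print(len(df.head(20)))
```

Both functions return a new DataFrame and keep the original index labels, which is why the tail() output above starts at 995 rather than 0.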

3. info()

Gives an overview of the DataFrame (dtypes, non-null counts, memory usage).

df.info()

Out (example):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Customer ID       1000 non-null   object
 1   Gender            1000 non-null   object
 2   Age               1000 non-null   int64
 3   Product Category  1000 non-null   object
 4   Quantity          1000 non-null   int64
 5   Price per Unit    1000 non-null   int64
dtypes: int64(3), object(3)
memory usage: 47.0+ KB
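The Non-Null Count column is a quick way to spot missing values. The sketch below is an illustration on made-up data with one missing price. It also shows that info() prints its overview and returns None, so the text has to be captured through the buf parameter if you want to keep it.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'product': ['iphone', 'samsung', 'vivo'],
    'price': [200.0, None, 100.0],     # one missing price (example data)
})

result = df.info()         # prints the overview and returns None
print(result)

buffer = io.StringIO()
df.info(buf=buffer)        # write the overview into a text buffer instead
summary = buffer.getvalue()
print(summary)
```

Here the price column reports 2 non-null values out of 3 entries, which tells you one value is missing.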

4. describe()

Gives descriptive statistics for the numerical columns: count, mean and standard deviation, plus the 5-number summary (min, 25%, 50%, 75%, max).

df.describe()

Out (example):

              Age    Quantity  Price per Unit
count  1000.00000  1000.00000     1000.000000
mean     41.39200     2.51400      179.890000
std      13.68143     1.13273      189.681356
min      18.00000     1.00000       25.000000
25%      29.00000     1.00000       30.000000
50%      42.00000     3.00000       50.000000
75%      53.00000     4.00000      300.000000
max      64.00000     4.00000      500.000000

Explanation for the Age column:

  • count : 1000 age records

  • mean : average age ≈ 41.39

  • min : youngest age = 18

  • max : oldest age = 64

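  • 25% / 50% / 75% : percentiles. About a quarter of the customers are aged 29 or younger, half are 42 or younger (the median), and three quarters are 53 or younger. Together with min and max they form the 5-number summary.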

Note: std is the standard deviation (covered later).
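By default describe() summarises only the numerical columns. As a minimal sketch on the small product DataFrame from the start of this section, passing include='all' also covers text columns, which adds the rows unique, top and freq.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['iphone', 'samsung', 'vivo', 'blackberry'],
    'price': [200, 300, 100, 50],
})

numeric_summary = df.describe()              # numerical columns only
print(numeric_summary)

full_summary = df.describe(include='all')    # text columns as well
print(full_summary)

# The result is itself a DataFrame, so single statistics
# can be looked up by row label and column name
print(numeric_summary.loc['mean', 'price'])  # 162.5
print(numeric_summary.loc['50%', 'price'])   # 150.0
```

In the full summary, statistics that do not apply to a column (for example the mean of product) appear as NaN.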

5. to_excel() and to_csv()

Save a DataFrame to an Excel or CSV file. Use index=False if you don't want the index in the output file.

df.to_excel('Adidas.xlsx', index=False)
df.to_csv('Adidas.csv', index=False)
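The following sketch saves a small DataFrame and reads it back to check the result. Writing into a temporary folder is an assumption made here so that the example leaves no files behind; the file names are hypothetical. It also saves a describe() summary, where the index is kept on purpose because it holds the statistic names (count, mean, ...).

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    'product': ['iphone', 'samsung', 'vivo', 'blackberry'],
    'price': [200, 300, 100, 50],
})

with tempfile.TemporaryDirectory() as folder:
    # Plain data: drop the 0, 1, 2, ... index from the file
    data_path = os.path.join(folder, 'products.csv')
    df.to_csv(data_path, index=False)
    reloaded = pd.read_csv(data_path)

    # Statistical summary: keep the index, it names each statistic
    summary_path = os.path.join(folder, 'summary.csv')
    df.describe().to_csv(summary_path)
    summary = pd.read_csv(summary_path, index_col=0)

print(reloaded)
print(summary)
```

to_excel() is called in the same way, but it needs an Excel engine such as openpyxl to be installed.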

Assignments:

  1. Load the Adidas dataset from the required files.

  2. Get the basic properties (shape, index, values, columns).

  3. Get only the number of rows in the dataset.

  4. Get the top 20 rows.

  5. Get the last 5 rows.

  6. Get the basic information of the DataFrame.

  7. Find the basic statistical summary of the DataFrame.

  8. Find the basic statistical summary of the DataFrame with all columns.

  9. Save the statistical summary as an Excel file and as a CSV file on your desktop.
