11. UDF

235B
Open
1KB
Open

UDF (user-defined function) is a function written by user to apply custom logic on your data.

It is used when there is no built-in function and you need to create a function.

We often use it to create :

  1. Custom Logic

  2. Complex Business Rules

  3. Applying Modules

Example — load data and compute a total column:

import pandas as pd

df = pd.read_csv('score.csv')
df['total'] = df.iloc[:, 1:6].sum(axis=1)
df.head()

Sample output:

Name
Math
Science
English
History
Art
total

0

Alice

78

82

75

68

90

393

1

Bob

85

79

80

74

87

405

2

Charlie

92

95

85

89

93

454

3

David

65

70

60

55

72

322

4

Eva

88

90

78

82

85

423


Create a Grade column

Rules:

  • A : if mark is greater than or equal to 450

  • B : if mark is greater than or equal to 350

  • C : if mark is less than 350

Create a function and apply it to the total column:

def myf(v):
    if v >= 450:
        return 'A'
    elif v >= 350:
        return 'B'
    else:
        return 'C'

df['Grade'] = df['total'].apply(myf)
df.head()

Sample output:

Name
Math
Science
English
History
Art
total
Grade

0

Alice

78

82

75

68

90

393

B

1

Bob

85

79

80

74

87

405

B

2

Charlie

92

95

85

89

93

454

A

3

David

65

70

60

55

72

322

C

4

Eva

88

90

78

82

85

423

B


Use case 1 — Indian Quarter

Question: In which quarter did we have the most expenses?

In India quarter mapping used here:

us india

  • 2 Apr - Jun = 1

  • 3 Jul - Sep = 2

  • 4 Oct - Dec = 3

  • 1 Jan - Mar = 4

Load expense data and convert dates:

df = pd.read_csv('expense.csv')
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df.head()

Sample input:

date
amount
category

0

03/01/2024

520

Groceries

1

07/01/2024

150

Transport

2

15/01/2024

980

Electronics

3

28/01/2024

220

Dining

4

05/02/2024

430

Rent

Get pandas quarter and then map to Indian quarter:

df['uq'] = df['date'].dt.quarter

Convert to Indian quarter:

def iq(v):
    if v == 1:
        return 4
    elif v == 2:
        return 1
    elif v == 3:
        return 2
    else:
        return 3

df['india_q'] = df['uq'].apply(iq)

Aggregate and sort to find the quarter with most expenses:

mdf = df.groupby('india_q').agg(total=('amount', 'sum'))
mdf = mdf.sort_values('total', ascending=False)
mdf

Sample aggregated output:

india_q
total

4

7190

2

6170

1

4345

3

3020


Use case 2 — Age Group

Question: Which age group purchases from us the most?

Load retail sales data:

df = pd.read_csv('retail_sales_dataset.csv')
df.head()

Sample input:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit

0

CUST001

Male

34

Beauty

3

50

1

CUST002

Female

26

Clothing

2

500

2

CUST003

Male

50

Electronics

1

30

3

CUST004

Male

37

Clothing

1

500

4

CUST005

Male

30

Beauty

2

50

Create a function to map ages to age groups and apply:

def myf(v):
    if v >= 0 and v < 10:
        return '0-10'
    elif v >= 10 and v < 20:
        return '10-20'
    elif v >= 20 and v < 30:
        return '20-30'
    elif v >= 30 and v < 40:
        return '30-40'
    elif v >= 40 and v < 50:
        return '40-50'
    elif v >= 50 and v < 60:
        return '50-60'
    elif v >= 60 and v < 70:
        return '60-70'
    elif v >= 70 and v < 80:
        return '70-80'
    else:
        return '80+'

df['age group'] = df['Age'].apply(myf)
df.head()

Sample output:

Customer ID
Gender
Age
Product Category
Quantity
Price per Unit
age group

0

CUST001

Male

34

Beauty

3

50

30-40

1

CUST002

Female

26

Clothing

2

500

20-30

2

CUST003

Male

50

Electronics

1

30

50-60

3

CUST004

Male

37

Clothing

1

500

30-40

4

CUST005

Male

30

Beauty

2

50

30-40

Count age groups:

df['age group'].value_counts()

Sample counts:

age group

  • 40-50 222

  • 50-60 221

  • 20-30 209

  • 30-40 191

  • 60-70 115

  • 10-20 42


Applying External Modules with UDF

1. Review - Language Conversion

Sometimes we need an external module applied to a whole column. In that scenario , we can use function to apply that module on whole column values.

For an example , we have a library deep_translator that can translate any language.

Install: pip install deep_translator

Load review data:

df = pd.read_csv('review in language.csv')
df

Sample input:

id
review
rating

0

1

This product is very good and useful

5

1

2

यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।

4

2

3

I am not satisfied with the quality

2

3

4

C'est un excellent produit j'adore

5

4

5

इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।

3

Simple example of conversion with GoogleTranslator:

from deep_translator import GoogleTranslator

review = "this is good product."
translation = GoogleTranslator(source='auto', target='hi').translate(review)
print(translation)
# Output: यह अच्छा उत्पाद है.

Apply translation to the review column:

from deep_translator import GoogleTranslator

df = pd.read_csv('review in language.csv')

def trans(v):
    translation = GoogleTranslator(source='auto', target='en').translate(v)
    return translation

df['eng_review'] = df['review'].apply(trans)
df

Sample output:

id
review
rating
eng_review

0

1

This product is very good and useful

5

This product is very good and useful

1

2

यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।

4

This product is great and affordable too.

2

3

I am not satisfied with the quality

2

I am not satisfied with the quality

3

4

C'est un excellent produit j'adore

5

It's a great product I love it

4

5

इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।

3

Its packing was bad but the product was fine.


2. Review Sentiment

Sometimes, the logic we want to apply on a Pandas column cannot be done using built-in Pandas or NumPy functions. One such case is sentiment analysis, which requires a Natural Language Processing (NLP) model. Pandas does not provide sentiment detection on its own, so we use an external library such as TextBlob, and apply it through a User Defined Function (UDF).

What is Sentiment Analysis?

Sentiment analysis is the process of identifying whether a text expresses a positive, negative, or neutral emotion. For example:

  1. “Great product! Loved it.” → Positive

  2. “Terrible experience.” → Negative

  3. “It is okay.” → Neutral

First you need to install textblob by pip install textblob

For Example :

from textblob import TextBlob

review = 'this is best product'
TextBlob(review).sentiment.polarity
# Output: 1.0

Apply TextBlob on the translated review column:

from textblob import TextBlob

def myf(v):
    return TextBlob(v).sentiment.polarity

df['sentiment'] = df['eng_review'].apply(myf)
df

Sample output:

id
review
rating
eng_review
sentiment

0

1

This product is very good and useful

5

This product is very good and useful

0.605000

1

2

यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।

4

This product is great and affordable too.

0.800000

2

3

I am not satisfied with the quality

2

I am not satisfied with the quality

-0.250000

3

4

C'est un excellent produit j'adore

5

It's a great product I love it

0.650000

4

5

इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।

3

Its packing was bad but the product was fine.

-0.141667


Complex Business Logic Using UDF

We also use UDF to apply complex business logic as well. Suppose you have a data of car insurance claim and you want to categorize it for claim settlement.

Category Rules

  1. "stolen", "theft" → Category: Theft

  2. "hail", "storm", "rain", "flood" → Category: Natural Damage

  3. "engine", "overheating", "mechanical" → Category: Mechanical Issue

  4. "collision", "rear-ended", "crash", "accident" → Category: Collision

  5. Otherwise → Other

Load claims Data:

df = pd.read_csv('claim.csv')
Policy_ID
Claim_Description

P001

Car was rear-ended on highway - bumper broken

P002

Windshield cracked due to hail storm

P003

Vehicle stolen last night

P004

Engine failure while driving- overheating

P005

Skid accident due to rain- dent on left side

P006

Front airbags deployed after collision

P007

Minor scratch while parking

P008

Car stolen from office parking

P009

Damage due to flash flood

P010

Engine overheating and mechanical breakdown

Creating Business Logic With UDF

def categorize_claim(description):

    desc = description.lower()

    if "stolen" in desc or "theft" in desc:
        return "Theft"

    elif ("hail" in desc or
          "storm" in desc or
          "rain" in desc or
          "flood" in desc):
        return "Natural Damage"

    elif ("engine" in desc or
          "overheat" in desc or
          "mechanical" in desc):
        return "Mechanical Issue"

    elif ("collision" in desc or
          "rear-ended" in desc or
          "crash" in desc or
          "accident" in desc):
        return "Collision"

    else:

          return "Other"

Applying Business Logic With UDF

df['category'] = df['Claim_Description'].apply(categorize_claim)
df.head()

Policy_ID

Claim_Description

category

P001

Car was rear-ended on highway - bumper broken

Collision

P002

Windshield cracked due to hail storm

Natural Damage

P003

Vehicle stolen last night

Theft

P004

Engine failure while driving- overheating

Mechanical Issue

P005

Skid accident due to rain- dent on left side

Natural Damage

Passing the Whole Row to a UDF

When applying User Defined Functions (UDFs) in Pandas, you can choose to pass:

  1. A single column value (df['col'].apply(func))

  2. entire row (df.apply(func, axis=1))

Passing the whole row is extremely useful when your logic requires multiple columns at the same time, such as business rules, scoring systems, or complex calculations.

Risk Score Calculation - Row Example

In Data Analysis , sometimes we need to come up with a composite score.

What is Composite Risk Score?

A Composite Risk Score is a single numeric value that combines multiple attributes or factors into one score. Instead of analyzing many separate columns, a composite score gives a quick and meaningful assessment of overall:

  • Severity

  • Priority

  • Risk

  • Financial impact

  • Operational alertness

This score helps organizations rank, classify, or prioritize cases efficiently.

Examples from Different Industries

  • Insurance – Claim severity score

  • Healthcare – Patient emergency score

  • Banking – Credit risk score

  • E-commerce – Fraud probability score

  • Marketing – Customer lead score

Composite scoring is widely used because it turns complex data into a simple actionable number that supports decision-making, reporting, and automation.

Risk Score Calculation - Row Example

In Data Analysis , sometimes we need to come up with a composite score.

What is Composite Risk Score?

A Composite Risk Score is a single numeric value that combines multiple attributes or factors into one score. Instead of analyzing many separate columns, a composite score gives a quick and meaningful assessment of overall:

  • Severity

  • Priority

  • Risk

  • Financial impact

  • Operational alertness

This score helps organizations rank, classify, or prioritize cases efficiently.

Examples from Different Industries

  • Insurance – Claim severity score

  • Healthcare – Patient emergency score

  • Banking – Credit risk score

  • E-commerce – Fraud probability score

  • Marketing – Customer lead score

Composite scoring is widely used because it turns complex data into a simple actionable number that supports decision-making, reporting, and automation.

Risk Score Calculation – Business Rules

The calculate_risk() function assigns a Risk Score to each policy based on multiple factors in the dataset. Below are the rules used for scoring each component.


1️⃣ Driver Age Risk

Risk points based on the customer’s age:

Customer Age
Points

Less than 25 years

+30

25 to 40 years

+20

Above 40 years

+10


2️⃣ Vehicle Age Risk

Older vehicles are considered riskier:

Vehicle Age
Points

More than 10 years

+25

5 to 10 years

+15

Less than 5 years

+5


3️⃣ Claim History (Years Since Last Claim)

Claim History (yrs)
Points

More than 5 years

+25

Between 1 and 5 years

+15

No claim history

0 additional points


4️⃣ Location-Based Risk

Certain areas may have higher theft or accident rates:

Location Risk
Points

High

+30

Medium

+15

Low

+5


5️⃣ Accidents in Last 3 Years

Each accident adds additional risk:

Points added = Number of accidents × 10

Example: 3 accidents → 30 points added.


Final Output

The Risk Score is calculated by summing points from:

  • Driver Age

  • Vehicle Age

  • Claim History

  • Location Risk

  • Recent Accidents

A higher score indicates a higher insurance risk, helping in:

  • Premium calculation

  • Claim prioritization

  • Customer segmentation

  • Risk forecasting

Load insurance data:

df = pd.read_csv("insurance.csv")
df

Sample input:

Customer_Age
Vehicle_Age
Claim_History (yrs)
Location_Risk
Accidents_Last_3Y

0

22

3

0

High

1

1

35

6

2

Medium

0

2

45

12

7

Low

2

3

28

4

0

High

0

4

51

9

3

Medium

1

5

19

11

6

High

3

6

33

2

0

Low

0

7

42

13

10

Medium

2

8

24

7

1

Low

0

9

38

5

6

High

1

Define scoring function and apply row-wise:

def calculate_risk(row):
    score = 0
    # Driver Age Risk
    if row['Customer_Age'] < 25:
        score += 30
    elif row['Customer_Age'] <= 40:
        score += 20
    else:
        score += 10

    # Vehicle Age Risk
    if row['Vehicle_Age'] > 10:
        score += 25
    elif row['Vehicle_Age'] >= 5:
        score += 15
    else:
        score += 5

    # Claim History Risk
    if row['Claim_History (yrs)'] > 5:
        score += 25
    elif row['Claim_History (yrs)'] > 0:
        score += 15

    # Location Risk
    if row['Location_Risk'] == 'High':
        score += 30
    elif row['Location_Risk'] == 'Medium':
        score += 15
    else:
        score += 5

    # Accidents in last 3 years
    score += row['Accidents_Last_3Y'] * 10

    return score

df['risk score'] = df.apply(calculate_risk, axis=1)
df

Sample output:

Customer_Age
Vehicle_Age
Claim_History (yrs)
Location_Risk
Accidents_Last_3Y
risk score

22

3

0

High

1

75

35

6

2

Medium

0

65

45

12

7

Low

2

85

28

4

0

High

0

55

51

9

3

Medium

1

65

19

11

6

High

3

140

33

2

0

Low

0

30

42

13

10

Medium

2

95

24

7

1

Low

0

65

38

5

6

High

1

100

UDF Assignments

On Retail Sales Dataset

Level 1 – Basic

  1. Create a UDF to calculate Total Amount as Quantity × Price per Unit.

  2. Create a UDF that categorizes Age into:

    • Young (<25)

    • Adult (25–45)

    • Senior (>45)

  3. Create a UDF that converts Product Category into single-letter codes:

    • Beauty → B

    • Clothing → C

    • Electronics → E

  4. Create a UDF that converts Gender:

    • Male → 1

    • Female → 0


Level 2 – Medium

  1. Create a UDF that classifies orders:

    • Quantity ≥ 3 → High Order

    • Otherwise → Low Order

  2. Create a UDF that creates price bands:

    • <100 → Budget

    • 100–300 → Mid Range

    • 300 → Premium

  3. Create a UDF that extracts the numeric part from Customer ID (CUST014 → 14) and returns the number of digits.


Level 3 – Row Based

  1. Create a UDF to calculate a Business Score:

    • Score = (Age / 2) + (Quantity × 5)

  2. Create a UDF that assigns product appeal:

  • Clothing and Age < 30 → Youth Appeal

  • Beauty and Age > 45 → Mature Appeal

  • Otherwise → General Appeal


Level 4 – String Processing

  1. Create a UDF to extract only the numeric portion from Customer ID (CUST009 → 9).

  2. Create a UDF to mask Customer ID so only the first two characters are visible (CUST009 → CU****).


Level 5 – Business Logic

  1. Based on Total Amount, create a UDF to assign Purchase Tier:

  • <500 → Bronze

  • 500–1500 → Silver

  • 1500 → Gold

  1. Create a UDF to assign discount percentage:

  • Age < 25 → 10%

  • Price per Unit > 300 → 15%

  • Otherwise → 5%

  1. Create a UDF to calculate Final Bill:

  • Final Bill = Total – (Total × Discount%)

On Insurance Data

  1. Create a UDF that assigns Age Category based on Customer_Age:

    • <25 → Young

    • 25–45 → Adult

    • 45 → Senior

  2. Create a UDF that assigns Vehicle Category:

    • Vehicle_Age < 5 → New

    • 5–10 → Mid Age

    • 10 → Old

  3. Create a UDF that converts Location_Risk into a score:

    • High → 3

    • Medium → 2

    • Low → 1


Assignment Level 2 – Medium

  1. Create a UDF that checks Accident Risk:

    • Accidents_Last_3Y = 0 → Safe Driver

    • 1–2 → Moderate Risk

    • 2 → High Risk

  2. Create a UDF that classifies claim history:

    • Claim_History (yrs) = 0 → No Claim History

    • 1–5 → Recent Claims

    • 5 → Long Claim History

  3. Create a UDF that assigns a flag:

    • Vehicle Age > 10 AND Location_Risk = High → Critical Vehicle

    • Otherwise → Normal


Assignment Level 3 – Row-Based Logic

  1. Create a UDF to calculate a “Customer Stability Score”: Formula example:

    • Score = Customer_Age – (Accidents_Last_3Y × 5)

  2. Create a UDF to calculate “Risk Index” considering:

    • Higher age adds stability

    • More vehicle age increases risk

    • More accidents increase risk

  3. Create a UDF that assigns:

    • High Priority Claim if:

      • Vehicle_Age > 10 OR Accidents_Last_3Y > 2

    • Otherwise Regular


Assignment Level 4 – Business Logic

  1. Create a UDF to calculate claim settlement amount factor:

  • Young + High location risk → factor 1.25

  • Old + Low risk → factor 0.85

  • Others → factor 1.0

  1. Create a UDF that determines premium category:

  • High risk accident + high location → Premium+

  • Medium combinations → Standard

  • Otherwise → Basic

  1. Create a UDF to decide fraud suspicion:

  • If Claim_History (yrs) > 7 AND Accidents_Last_3Y > 2 → Suspicious

  • Otherwise Not Suspicious


Assignment Level 5 – Scoring & Computation

  1. Create a UDF that generates a composite Insurance Score: Scoring must use:

  • Customer_Age

  • Vehicle_Age

  • Claim_History (yrs)

  • Location_Risk

  • Accidents_Last_3Y

  1. Create a UDF that assigns Final Grade based on score:

  • 80 → A

  • 60–80 → B

  • <60 → C

  1. Create a UDF to calculate Renewal Probability:

  • Higher risk → lower probability

  • Lower risk → higher probability


Assignment Level 6 – Special Cases

  1. Create a UDF to identify:

  • High Mileage Case if vehicle age > 12 AND customer age < 35

  1. Create a UDF to detect:

  • Age-Vehicle Conflict:

    • Customer > 60 AND vehicle age < 3 → Flag “Unusual Case”

    • Otherwise Normal

  1. Create a UDF that returns a text summary sentence for each row, like: “Driver is 45 years old with 2 accidents in last 3 years and lives in a Low risk area.”

On review Data

  • Create a UDF that classifies each review as Positive, Neutral, or Negative based on the rating.

  • Write a UDF that counts the number of words in each review.

  • Create a UDF to detect the primary language of the review (English / Hindi / French).

  • Write a UDF that extracts the first 20 characters of every review as a short summary.

  • Create a UDF that checks whether the word "good" (or its translation) appears in a review (True/False).

  • Write a UDF that assigns a sentiment score to the review based on certain keywords (e.g., good=+1, bad=–1).

  • Create a UDF to calculate the character length of each review (excluding spaces).

  • Write a UDF that converts every review to lowercase.

  • Create a UDF that returns the last word of each review.

  • Write a UDF that returns True if the review contains more than 10 words, else False.

  • Create a UDF that removes punctuation from every review.

  • Write a UDF that counts how many vowels each review contains.

  • Create a UDF to generate a simple text-based star rating visualization (e.g., 5 → "*").

  • Write a UDF that checks whether the review contains any numeric digits.

  • Create a UDF to classify reviews based on length: Short (<50 chars), Medium (50–100), Long (>100).

Last updated