11. UDF

235B

insurance.csv

Open

32KB

retail_sales_dataset.csv

Open

372B

review in language.csv

Open

1KB

expense.csv

Open

UDF (user-defined function) is a function written by user to apply custom logic on your data.

It is used when there is no built-in function and you need to create a function.

We often use it to create :

Custom Logic
Complex Business Rules
Applying Modules

Example — load data and compute a total column:

import pandas as pd

df = pd.read_csv('score.csv')
df['total'] = df.iloc[:, 1:6].sum(axis=1)
df.head()

Sample output:

Name

Math

Science

English

History

Art

total

Alice

393

Bob

405

Charlie

454

David

322

Eva

423

Create a Grade column

Rules:

A : if mark is greater than or equal to 450
B : if mark is greater than or equal to 350
C : if mark is less than 350

Create a function and apply it to the total column:

def myf(v):
    if v >= 450:
        return 'A'
    elif v >= 350:
        return 'B'
    else:
        return 'C'

df['Grade'] = df['total'].apply(myf)
df.head()

Sample output:

Name

Math

Science

English

History

Art

total

Grade

Alice

393

Bob

405

Charlie

454

David

322

Eva

423

Use case 1 — Indian Quarter

Question: In which quarter did we have the most expenses?

In India quarter mapping used here:

us india

2 Apr - Jun = 1
3 Jul - Sep = 2
4 Oct - Dec = 3
1 Jan - Mar = 4

Load expense data and convert dates:

df = pd.read_csv('expense.csv')
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df.head()

Sample input:

date

amount

Use case 2 — Age Group

Question: Which age group purchases from us the most?

Load retail sales data:

df = pd.read_csv('retail_sales_dataset.csv')
df.head()

Sample input:

Customer ID

Gender

Age

Product Category

Quantity

Price per Unit

CUST001

Male

Beauty

CUST002

Female

Clothing

500

CUST003

Male

Electronics

CUST004

Male

Clothing

500

CUST005

Male

Beauty

Create a function to map ages to age groups and apply:

def myf(v):
    if v >= 0 and v < 10:
        return '0-10'
    elif v >= 10 and v < 20:
        return '10-20'
    elif v >= 20 and v < 30:
        return '20-30'
    elif v >= 30 and v < 40:
        return '30-40'
    elif v >= 40 and v < 50:
        return '40-50'
    elif v >= 50 and v < 60:
        return '50-60'
    elif v >= 60 and v < 70:
        return '60-70'
    elif v >= 70 and v < 80:
        return '70-80'
    else:
        return '80+'

df['age group'] = df['Age'].apply(myf)
df.head()

Sample output:

Customer ID

Gender

Age

Product Category

Quantity

Price per Unit

age group

CUST001

Male

Beauty

30-40

CUST002

Female

Clothing

500

20-30

CUST003

Male

Electronics

50-60

CUST004

Male

Clothing

500

30-40

CUST005

Male

Beauty

30-40

Count age groups:

df['age group'].value_counts()

Sample counts:

age group

40-50 222
50-60 221
20-30 209
30-40 191
60-70 115
10-20 42

Applying External Modules with UDF

1. Review - Language Conversion

Sometimes we need an external module applied to a whole column. In that scenario , we can use function to apply that module on whole column values.

For an example , we have a library deep_translator that can translate any language.

Install: pip install deep_translator

Load review data:

df = pd.read_csv('review in language.csv')
df

Sample input:

review

rating

This product is very good and useful

यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।

I am not satisfied with the quality

C'est un excellent produit j'adore

इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।

Simple example of conversion with GoogleTranslator:

from deep_translator import GoogleTranslator

review = "this is good product."
translation = GoogleTranslator(source='auto', target='hi').translate(review)
print(translation)
# Output: यह अच्छा उत्पाद है.

Apply translation to the review column:

from deep_translator import GoogleTranslator

df = pd.read_csv('review in language.csv')

def trans(v):
    translation = GoogleTranslator(source='auto', target='en').translate(v)
    return translation

df['eng_review'] = df['review'].apply(trans)
df

Sample output:

review

rating

eng_review

This product is very good and useful

यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।

This product is great and affordable too.

I am not satisfied with the quality

C'est un excellent produit j'adore

It's a great product I love it

इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।

Its packing was bad but the product was fine.

2. Review Sentiment

Sometimes, the logic we want to apply on a Pandas column cannot be done using built-in Pandas or NumPy functions. One such case is sentiment analysis, which requires a Natural Language Processing (NLP) model. Pandas does not provide sentiment detection on its own, so we use an external library such as TextBlob, and apply it through a User Defined Function (UDF).

What is Sentiment Analysis?

Sentiment analysis is the process of identifying whether a text expresses a positive, negative, or neutral emotion. For example:

“Great product! Loved it.” → Positive
“Terrible experience.” → Negative
“It is okay.” → Neutral

First you need to install textblob by pip install textblob

For Example :

from textblob import TextBlob

review = 'this is best product'
TextBlob(review).sentiment.polarity
# Output: 1.0

Apply TextBlob on the translated review column:

from textblob import TextBlob

def myf(v):
    return TextBlob(v).sentiment.polarity

df['sentiment'] = df['eng_review'].apply(myf)
df

Sample output:

review

rating

eng_review

sentiment

This product is very good and useful

0.605000

यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।

This product is great and affordable too.

0.800000

I am not satisfied with the quality

-0.250000

C'est un excellent produit j'adore

It's a great product I love it

0.650000

इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।

Its packing was bad but the product was fine.

-0.141667

Complex Business Logic Using UDF

We also use UDF to apply complex business logic as well. Suppose you have a data of car insurance claim and you want to categorize it for claim settlement.

Category Rules

"stolen", "theft" → Category: Theft
"hail", "storm", "rain", "flood" → Category: Natural Damage
"engine", "overheating", "mechanical" → Category: Mechanical Issue
"collision", "rear-ended", "crash", "accident" → Category: Collision
Otherwise → Other

Load claims Data:

df = pd.read_csv('claim.csv')

Policy_ID

Claim_Description

P001

Car was rear-ended on highway - bumper broken

P002

Windshield cracked due to hail storm

P003

Vehicle stolen last night

P004

Engine failure while driving- overheating

P005

Skid accident due to rain- dent on left side

P006

Front airbags deployed after collision

P007

Minor scratch while parking

P008

Car stolen from office parking

P009

Damage due to flash flood

P010

Engine overheating and mechanical breakdown

Creating Business Logic With UDF

def categorize_claim(description):

    desc = description.lower()

    if "stolen" in desc or "theft" in desc:
        return "Theft"

    elif ("hail" in desc or
          "storm" in desc or
          "rain" in desc or
          "flood" in desc):
        return "Natural Damage"

    elif ("engine" in desc or
          "overheat" in desc or
          "mechanical" in desc):
        return "Mechanical Issue"

    elif ("collision" in desc or
          "rear-ended" in desc or
          "crash" in desc or
          "accident" in desc):
        return "Collision"

    else:

          return "Other"

Applying Business Logic With UDF

df['category'] = df['Claim_Description'].apply(categorize_claim)
df.head()

Policy_ID

Claim_Description

Passing the Whole Row to a UDF

When applying User Defined Functions (UDFs) in Pandas, you can choose to pass:

A single column value (df['col'].apply(func))
entire row (df.apply(func, axis=1))

Passing the whole row is extremely useful when your logic requires multiple columns at the same time, such as business rules, scoring systems, or complex calculations.

Risk Score Calculation - Row Example

In Data Analysis , sometimes we need to come up with a composite score.

What is Composite Risk Score?

A Composite Risk Score is a single numeric value that combines multiple attributes or factors into one score. Instead of analyzing many separate columns, a composite score gives a quick and meaningful assessment of overall:

Severity
Priority
Risk
Financial impact
Operational alertness

This score helps organizations rank, classify, or prioritize cases efficiently.

Examples from Different Industries

Insurance – Claim severity score
Healthcare – Patient emergency score
Banking – Credit risk score
E-commerce – Fraud probability score
Marketing – Customer lead score

Composite scoring is widely used because it turns complex data into a simple actionable number that supports decision-making, reporting, and automation.

Risk Score Calculation - Row Example

In Data Analysis , sometimes we need to come up with a composite score.

What is Composite Risk Score?

Severity
Priority
Risk
Financial impact
Operational alertness

This score helps organizations rank, classify, or prioritize cases efficiently.

Examples from Different Industries

Insurance – Claim severity score
Healthcare – Patient emergency score
Banking – Credit risk score
E-commerce – Fraud probability score
Marketing – Customer lead score

Composite scoring is widely used because it turns complex data into a simple actionable number that supports decision-making, reporting, and automation.

Risk Score Calculation – Business Rules

The calculate_risk() function assigns a Risk Score to each policy based on multiple factors in the dataset. Below are the rules used for scoring each component.

1️⃣ Driver Age Risk

Risk points based on the customer’s age:

Customer Age

Points

Less than 25 years

+30

25 to 40 years

+20

Above 40 years

+10

2️⃣ Vehicle Age Risk

Older vehicles are considered riskier:

Vehicle Age

Points

More than 10 years

+25

5 to 10 years

+15

Less than 5 years

3️⃣ Claim History (Years Since Last Claim)

Claim History (yrs)

Points

More than 5 years

+25

Between 1 and 5 years

+15

No claim history

0 additional points

4️⃣ Location-Based Risk

Certain areas may have higher theft or accident rates:

Location Risk

Points

High

+30

Medium

+15

Low

5️⃣ Accidents in Last 3 Years

Each accident adds additional risk:

Points added = Number of accidents × 10

Example: 3 accidents → 30 points added.

Final Output

The Risk Score is calculated by summing points from:

Driver Age
Vehicle Age
Claim History
Location Risk
Recent Accidents

A higher score indicates a higher insurance risk, helping in:

Premium calculation
Claim prioritization
Customer segmentation
Risk forecasting

Load insurance data:

df = pd.read_csv("insurance.csv")
df

Sample input:

Customer_Age

Vehicle_Age

Claim_History (yrs)

Location_Risk

Accidents_Last_3Y

High

Medium

Low

High

Medium

High

Low

Medium

Low

High

Define scoring function and apply row-wise:

def calculate_risk(row):
    score = 0
    # Driver Age Risk
    if row['Customer_Age'] < 25:
        score += 30
    elif row['Customer_Age'] <= 40:
        score += 20
    else:
        score += 10

    # Vehicle Age Risk
    if row['Vehicle_Age'] > 10:
        score += 25
    elif row['Vehicle_Age'] >= 5:
        score += 15
    else:
        score += 5

    # Claim History Risk
    if row['Claim_History (yrs)'] > 5:
        score += 25
    elif row['Claim_History (yrs)'] > 0:
        score += 15

    # Location Risk
    if row['Location_Risk'] == 'High':
        score += 30
    elif row['Location_Risk'] == 'Medium':
        score += 15
    else:
        score += 5

    # Accidents in last 3 years
    score += row['Accidents_Last_3Y'] * 10

    return score

df['risk score'] = df.apply(calculate_risk, axis=1)
df

Sample output:

Customer_Age

Vehicle_Age

Claim_History (yrs)

Location_Risk

Accidents_Last_3Y

risk score

High

Medium

Low

High

Medium

High

140

Low

Medium

Low

High

100

UDF Assignments

On Retail Sales Dataset

Level 1 – Basic

Create a UDF to calculate Total Amount as Quantity × Price per Unit.
Create a UDF that categorizes Age into:
- Young (<25)
- Adult (25–45)
- Senior (>45)
Create a UDF that converts Product Category into single-letter codes:
- Beauty → B
- Clothing → C
- Electronics → E
Create a UDF that converts Gender:
- Male → 1
- Female → 0

Level 2 – Medium

Create a UDF that classifies orders:
- Quantity ≥ 3 → High Order
- Otherwise → Low Order
Create a UDF that creates price bands:
- <100 → Budget
- 100–300 → Mid Range
- 300 → Premium
Create a UDF that extracts the numeric part from Customer ID (CUST014 → 14) and returns the number of digits.

Level 3 – Row Based

Create a UDF to calculate a Business Score:
- Score = (Age / 2) + (Quantity × 5)
Create a UDF that assigns product appeal:

Clothing and Age < 30 → Youth Appeal
Beauty and Age > 45 → Mature Appeal
Otherwise → General Appeal

Level 4 – String Processing

Create a UDF to extract only the numeric portion from Customer ID (CUST009 → 9).
Create a UDF to mask Customer ID so only the first two characters are visible (CUST009 → CU****).

Level 5 – Business Logic

Based on Total Amount, create a UDF to assign Purchase Tier:

<500 → Bronze
500–1500 → Silver
1500 → Gold

Create a UDF to assign discount percentage:

Age < 25 → 10%
Price per Unit > 300 → 15%
Otherwise → 5%

Create a UDF to calculate Final Bill:

Final Bill = Total – (Total × Discount%)

On Insurance Data

Create a UDF that assigns Age Category based on Customer_Age:
- <25 → Young
- 25–45 → Adult
- 45 → Senior
Create a UDF that assigns Vehicle Category:
- Vehicle_Age < 5 → New
- 5–10 → Mid Age
- 10 → Old
Create a UDF that converts Location_Risk into a score:
- High → 3
- Medium → 2
- Low → 1

Assignment Level 2 – Medium

Create a UDF that checks Accident Risk:
- Accidents_Last_3Y = 0 → Safe Driver
- 1–2 → Moderate Risk
- 2 → High Risk
Create a UDF that classifies claim history:
- Claim_History (yrs) = 0 → No Claim History
- 1–5 → Recent Claims
- 5 → Long Claim History
Create a UDF that assigns a flag:
- Vehicle Age > 10 AND Location_Risk = High → Critical Vehicle
- Otherwise → Normal

Assignment Level 3 – Row-Based Logic

Create a UDF to calculate a “Customer Stability Score”: Formula example:
- Score = Customer_Age – (Accidents_Last_3Y × 5)
Create a UDF to calculate “Risk Index” considering:
- Higher age adds stability
- More vehicle age increases risk
- More accidents increase risk
Create a UDF that assigns:
- High Priority Claim if:
  - Vehicle_Age > 10 OR Accidents_Last_3Y > 2
- Otherwise Regular

Assignment Level 4 – Business Logic

Create a UDF to calculate claim settlement amount factor:

Young + High location risk → factor 1.25
Old + Low risk → factor 0.85
Others → factor 1.0

Create a UDF that determines premium category:

High risk accident + high location → Premium+
Medium combinations → Standard
Otherwise → Basic

Create a UDF to decide fraud suspicion:

If Claim_History (yrs) > 7 AND Accidents_Last_3Y > 2 → Suspicious
Otherwise Not Suspicious

Assignment Level 5 – Scoring & Computation

Create a UDF that generates a composite Insurance Score: Scoring must use:

Customer_Age
Vehicle_Age
Claim_History (yrs)
Location_Risk
Accidents_Last_3Y

Create a UDF that assigns Final Grade based on score:

80 → A
60–80 → B
<60 → C

Create a UDF to calculate Renewal Probability:

Higher risk → lower probability
Lower risk → higher probability

Assignment Level 6 – Special Cases

Create a UDF to identify:

High Mileage Case if vehicle age > 12 AND customer age < 35

Create a UDF to detect:

Age-Vehicle Conflict:
- Customer > 60 AND vehicle age < 3 → Flag “Unusual Case”
- Otherwise Normal

Create a UDF that returns a text summary sentence for each row, like: “Driver is 45 years old with 2 accidents in last 3 years and lives in a Low risk area.”

On review Data

Create a UDF that classifies each review as Positive, Neutral, or Negative based on the rating.
Write a UDF that counts the number of words in each review.
Create a UDF to detect the primary language of the review (English / Hindi / French).
Write a UDF that extracts the first 20 characters of every review as a short summary.
Create a UDF that checks whether the word "good" (or its translation) appears in a review (True/False).
Write a UDF that assigns a sentiment score to the review based on certain keywords (e.g., good=+1, bad=–1).
Create a UDF to calculate the character length of each review (excluding spaces).
Write a UDF that converts every review to lowercase.
Create a UDF that returns the last word of each review.
Write a UDF that returns True if the review contains more than 10 words, else False.
Create a UDF that removes punctuation from every review.
Write a UDF that counts how many vowels each review contains.
Create a UDF to generate a simple text-based star rating visualization (e.g., 5 → "*").
Write a UDF that checks whether the review contains any numeric digits.
Create a UDF to classify reviews based on length: Short (<50 chars), Medium (50–100), Long (>100).

Previous10. datetime in pandas Nextecommerce

Last updated 4 hours ago

Good morning

Create a Grade column

Use case 1 — Indian Quarter

Use case 2 — Age Group

Applying External Modules with UDF

1. Review - Language Conversion

Simple example of conversion with GoogleTranslator:

Apply translation to the review column:

2. Review Sentiment

Apply TextBlob on the translated review column:

Complex Business Logic Using UDF

Load claims Data:

Creating Business Logic With UDF

Applying Business Logic With UDF

Passing the Whole Row to a UDF

Examples from Different Industries

Examples from Different Industries

Risk Score Calculation – Business Rules

1️⃣ Driver Age Risk

2️⃣ Vehicle Age Risk

3️⃣ Claim History (Years Since Last Claim)

4️⃣ Location-Based Risk

5️⃣ Accidents in Last 3 Years

Final Output

UDF Assignments

On Retail Sales Dataset

Level 1 – Basic

Level 2 – Medium

Level 3 – Row Based

Level 4 – String Processing

Level 5 – Business Logic

On Insurance Data

Assignment Level 2 – Medium

Assignment Level 3 – Row-Based Logic

Assignment Level 4 – Business Logic

Assignment Level 5 – Scoring & Computation

Assignment Level 6 – Special Cases

On review Data