11. UDF
UDF (user-defined function) is a function written by user to apply custom logic on your data.
It is used when there is no built-in function and you need to create a function.
We often use it to create :
Custom Logic
Complex Business Rules
Applying Modules
Example — load data and compute a total column:
import pandas as pd
df = pd.read_csv('score.csv')
df['total'] = df.iloc[:, 1:6].sum(axis=1)
df.head()Sample output:
0
Alice
78
82
75
68
90
393
1
Bob
85
79
80
74
87
405
2
Charlie
92
95
85
89
93
454
3
David
65
70
60
55
72
322
4
Eva
88
90
78
82
85
423
Create a Grade column
Rules:
A : if mark is greater than or equal to 450
B : if mark is greater than or equal to 350
C : if mark is less than 350
Create a function and apply it to the total column:
def myf(v):
if v >= 450:
return 'A'
elif v >= 350:
return 'B'
else:
return 'C'
df['Grade'] = df['total'].apply(myf)
df.head()Sample output:
0
Alice
78
82
75
68
90
393
B
1
Bob
85
79
80
74
87
405
B
2
Charlie
92
95
85
89
93
454
A
3
David
65
70
60
55
72
322
C
4
Eva
88
90
78
82
85
423
B
Use case 1 — Indian Quarter
Question: In which quarter did we have the most expenses?
In India quarter mapping used here:
us india
2 Apr - Jun = 1
3 Jul - Sep = 2
4 Oct - Dec = 3
1 Jan - Mar = 4
Load expense data and convert dates:
df = pd.read_csv('expense.csv')
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df.head()Sample input:
0
03/01/2024
520
Groceries
1
07/01/2024
150
Transport
2
15/01/2024
980
Electronics
3
28/01/2024
220
Dining
4
05/02/2024
430
Rent
Get pandas quarter and then map to Indian quarter:
df['uq'] = df['date'].dt.quarterConvert to Indian quarter:
def iq(v):
if v == 1:
return 4
elif v == 2:
return 1
elif v == 3:
return 2
else:
return 3
df['india_q'] = df['uq'].apply(iq)Aggregate and sort to find the quarter with most expenses:
mdf = df.groupby('india_q').agg(total=('amount', 'sum'))
mdf = mdf.sort_values('total', ascending=False)
mdfSample aggregated output:
4
7190
2
6170
1
4345
3
3020
Use case 2 — Age Group
Question: Which age group purchases from us the most?
Load retail sales data:
df = pd.read_csv('retail_sales_dataset.csv')
df.head()Sample input:
0
CUST001
Male
34
Beauty
3
50
1
CUST002
Female
26
Clothing
2
500
2
CUST003
Male
50
Electronics
1
30
3
CUST004
Male
37
Clothing
1
500
4
CUST005
Male
30
Beauty
2
50
Create a function to map ages to age groups and apply:
def myf(v):
if v >= 0 and v < 10:
return '0-10'
elif v >= 10 and v < 20:
return '10-20'
elif v >= 20 and v < 30:
return '20-30'
elif v >= 30 and v < 40:
return '30-40'
elif v >= 40 and v < 50:
return '40-50'
elif v >= 50 and v < 60:
return '50-60'
elif v >= 60 and v < 70:
return '60-70'
elif v >= 70 and v < 80:
return '70-80'
else:
return '80+'
df['age group'] = df['Age'].apply(myf)
df.head()Sample output:
0
CUST001
Male
34
Beauty
3
50
30-40
1
CUST002
Female
26
Clothing
2
500
20-30
2
CUST003
Male
50
Electronics
1
30
50-60
3
CUST004
Male
37
Clothing
1
500
30-40
4
CUST005
Male
30
Beauty
2
50
30-40
Count age groups:
df['age group'].value_counts()Sample counts:
age group
40-50 222
50-60 221
20-30 209
30-40 191
60-70 115
10-20 42
Applying External Modules with UDF
1. Review - Language Conversion
Sometimes we need an external module applied to a whole column. In that scenario , we can use function to apply that module on whole column values.
For an example , we have a library deep_translator that can translate any language.
Install: pip install deep_translator
Load review data:
df = pd.read_csv('review in language.csv')
dfSample input:
0
1
This product is very good and useful
5
1
2
यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।
4
2
3
I am not satisfied with the quality
2
3
4
C'est un excellent produit j'adore
5
4
5
इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।
3
Simple example of conversion with GoogleTranslator:
from deep_translator import GoogleTranslator
review = "this is good product."
translation = GoogleTranslator(source='auto', target='hi').translate(review)
print(translation)
# Output: यह अच्छा उत्पाद है.Apply translation to the review column:
from deep_translator import GoogleTranslator
df = pd.read_csv('review in language.csv')
def trans(v):
translation = GoogleTranslator(source='auto', target='en').translate(v)
return translation
df['eng_review'] = df['review'].apply(trans)
dfSample output:
0
1
This product is very good and useful
5
This product is very good and useful
1
2
यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।
4
This product is great and affordable too.
2
3
I am not satisfied with the quality
2
I am not satisfied with the quality
3
4
C'est un excellent produit j'adore
5
It's a great product I love it
4
5
इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।
3
Its packing was bad but the product was fine.
2. Review Sentiment
Sometimes, the logic we want to apply on a Pandas column cannot be done using built-in Pandas or NumPy functions. One such case is sentiment analysis, which requires a Natural Language Processing (NLP) model. Pandas does not provide sentiment detection on its own, so we use an external library such as TextBlob, and apply it through a User Defined Function (UDF).
What is Sentiment Analysis?
Sentiment analysis is the process of identifying whether a text expresses a positive, negative, or neutral emotion. For example:
“Great product! Loved it.” → Positive
“Terrible experience.” → Negative
“It is okay.” → Neutral
First you need to install textblob by pip install textblob
For Example :
from textblob import TextBlob
review = 'this is best product'
TextBlob(review).sentiment.polarity
# Output: 1.0Apply TextBlob on the translated review column:
from textblob import TextBlob
def myf(v):
return TextBlob(v).sentiment.polarity
df['sentiment'] = df['eng_review'].apply(myf)
dfSample output:
0
1
This product is very good and useful
5
This product is very good and useful
0.605000
1
2
यह प्रोडक्ट बहुत बढ़िया है और किफायती भी।
4
This product is great and affordable too.
0.800000
2
3
I am not satisfied with the quality
2
I am not satisfied with the quality
-0.250000
3
4
C'est un excellent produit j'adore
5
It's a great product I love it
0.650000
4
5
इसकी पैकिंग खराब थी लेकिन प्रोडक्ट ठीक था।
3
Its packing was bad but the product was fine.
-0.141667
Complex Business Logic Using UDF
We also use UDF to apply complex business logic as well. Suppose you have a data of car insurance claim and you want to categorize it for claim settlement.
Category Rules
"stolen", "theft" → Category: Theft
"hail", "storm", "rain", "flood" → Category: Natural Damage
"engine", "overheating", "mechanical" → Category: Mechanical Issue
"collision", "rear-ended", "crash", "accident" → Category: Collision
Otherwise → Other
Load claims Data:
df = pd.read_csv('claim.csv')P001
Car was rear-ended on highway - bumper broken
P002
Windshield cracked due to hail storm
P003
Vehicle stolen last night
P004
Engine failure while driving- overheating
P005
Skid accident due to rain- dent on left side
P006
Front airbags deployed after collision
P007
Minor scratch while parking
P008
Car stolen from office parking
P009
Damage due to flash flood
P010
Engine overheating and mechanical breakdown
Creating Business Logic With UDF
def categorize_claim(description):
desc = description.lower()
if "stolen" in desc or "theft" in desc:
return "Theft"
elif ("hail" in desc or
"storm" in desc or
"rain" in desc or
"flood" in desc):
return "Natural Damage"
elif ("engine" in desc or
"overheat" in desc or
"mechanical" in desc):
return "Mechanical Issue"
elif ("collision" in desc or
"rear-ended" in desc or
"crash" in desc or
"accident" in desc):
return "Collision"
else:
return "Other"Applying Business Logic With UDF
df['category'] = df['Claim_Description'].apply(categorize_claim)
df.head()Policy_ID
Claim_Description
category
P001
Car was rear-ended on highway - bumper broken
Collision
P002
Windshield cracked due to hail storm
Natural Damage
P003
Vehicle stolen last night
Theft
P004
Engine failure while driving- overheating
Mechanical Issue
P005
Skid accident due to rain- dent on left side
Natural Damage
Passing the Whole Row to a UDF
When applying User Defined Functions (UDFs) in Pandas, you can choose to pass:
A single column value (df['col'].apply(func))
entire row (df.apply(func, axis=1))
Passing the whole row is extremely useful when your logic requires multiple columns at the same time, such as business rules, scoring systems, or complex calculations.
Risk Score Calculation - Row Example
In Data Analysis , sometimes we need to come up with a composite score.
What is Composite Risk Score?
A Composite Risk Score is a single numeric value that combines multiple attributes or factors into one score. Instead of analyzing many separate columns, a composite score gives a quick and meaningful assessment of overall:
Severity
Priority
Risk
Financial impact
Operational alertness
This score helps organizations rank, classify, or prioritize cases efficiently.
Examples from Different Industries
Insurance – Claim severity score
Healthcare – Patient emergency score
Banking – Credit risk score
E-commerce – Fraud probability score
Marketing – Customer lead score
Composite scoring is widely used because it turns complex data into a simple actionable number that supports decision-making, reporting, and automation.
Risk Score Calculation - Row Example
In Data Analysis , sometimes we need to come up with a composite score.
What is Composite Risk Score?
A Composite Risk Score is a single numeric value that combines multiple attributes or factors into one score. Instead of analyzing many separate columns, a composite score gives a quick and meaningful assessment of overall:
Severity
Priority
Risk
Financial impact
Operational alertness
This score helps organizations rank, classify, or prioritize cases efficiently.
Examples from Different Industries
Insurance – Claim severity score
Healthcare – Patient emergency score
Banking – Credit risk score
E-commerce – Fraud probability score
Marketing – Customer lead score
Composite scoring is widely used because it turns complex data into a simple actionable number that supports decision-making, reporting, and automation.
Risk Score Calculation – Business Rules
The calculate_risk() function assigns a Risk Score to each policy based on multiple factors in the dataset. Below are the rules used for scoring each component.
1️⃣ Driver Age Risk
Risk points based on the customer’s age:
Less than 25 years
+30
25 to 40 years
+20
Above 40 years
+10
2️⃣ Vehicle Age Risk
Older vehicles are considered riskier:
More than 10 years
+25
5 to 10 years
+15
Less than 5 years
+5
3️⃣ Claim History (Years Since Last Claim)
More than 5 years
+25
Between 1 and 5 years
+15
No claim history
0 additional points
4️⃣ Location-Based Risk
Certain areas may have higher theft or accident rates:
High
+30
Medium
+15
Low
+5
5️⃣ Accidents in Last 3 Years
Each accident adds additional risk:
Points added = Number of accidents × 10Example: 3 accidents → 30 points added.
Final Output
The Risk Score is calculated by summing points from:
Driver Age
Vehicle Age
Claim History
Location Risk
Recent Accidents
A higher score indicates a higher insurance risk, helping in:
Premium calculation
Claim prioritization
Customer segmentation
Risk forecasting
Load insurance data:
df = pd.read_csv("insurance.csv")
dfSample input:
0
22
3
0
High
1
1
35
6
2
Medium
0
2
45
12
7
Low
2
3
28
4
0
High
0
4
51
9
3
Medium
1
5
19
11
6
High
3
6
33
2
0
Low
0
7
42
13
10
Medium
2
8
24
7
1
Low
0
9
38
5
6
High
1
Define scoring function and apply row-wise:
def calculate_risk(row):
score = 0
# Driver Age Risk
if row['Customer_Age'] < 25:
score += 30
elif row['Customer_Age'] <= 40:
score += 20
else:
score += 10
# Vehicle Age Risk
if row['Vehicle_Age'] > 10:
score += 25
elif row['Vehicle_Age'] >= 5:
score += 15
else:
score += 5
# Claim History Risk
if row['Claim_History (yrs)'] > 5:
score += 25
elif row['Claim_History (yrs)'] > 0:
score += 15
# Location Risk
if row['Location_Risk'] == 'High':
score += 30
elif row['Location_Risk'] == 'Medium':
score += 15
else:
score += 5
# Accidents in last 3 years
score += row['Accidents_Last_3Y'] * 10
return score
df['risk score'] = df.apply(calculate_risk, axis=1)
dfSample output:
22
3
0
High
1
75
35
6
2
Medium
0
65
45
12
7
Low
2
85
28
4
0
High
0
55
51
9
3
Medium
1
65
19
11
6
High
3
140
33
2
0
Low
0
30
42
13
10
Medium
2
95
24
7
1
Low
0
65
38
5
6
High
1
100
UDF Assignments
On Retail Sales Dataset
Level 1 – Basic
Create a UDF to calculate Total Amount as Quantity × Price per Unit.
Create a UDF that categorizes Age into:
Young (<25)
Adult (25–45)
Senior (>45)
Create a UDF that converts Product Category into single-letter codes:
Beauty → B
Clothing → C
Electronics → E
Create a UDF that converts Gender:
Male → 1
Female → 0
Level 2 – Medium
Create a UDF that classifies orders:
Quantity ≥ 3 → High Order
Otherwise → Low Order
Create a UDF that creates price bands:
<100 → Budget
100–300 → Mid Range
300 → Premium
Create a UDF that extracts the numeric part from Customer ID (CUST014 → 14) and returns the number of digits.
Level 3 – Row Based
Create a UDF to calculate a Business Score:
Score = (Age / 2) + (Quantity × 5)
Create a UDF that assigns product appeal:
Clothing and Age < 30 → Youth Appeal
Beauty and Age > 45 → Mature Appeal
Otherwise → General Appeal
Level 4 – String Processing
Create a UDF to extract only the numeric portion from Customer ID (CUST009 → 9).
Create a UDF to mask Customer ID so only the first two characters are visible (CUST009 → CU****).
Level 5 – Business Logic
Based on Total Amount, create a UDF to assign Purchase Tier:
<500 → Bronze
500–1500 → Silver
1500 → Gold
Create a UDF to assign discount percentage:
Age < 25 → 10%
Price per Unit > 300 → 15%
Otherwise → 5%
Create a UDF to calculate Final Bill:
Final Bill = Total – (Total × Discount%)
On Insurance Data
Create a UDF that assigns Age Category based on Customer_Age:
<25 → Young
25–45 → Adult
45 → Senior
Create a UDF that assigns Vehicle Category:
Vehicle_Age < 5 → New
5–10 → Mid Age
10 → Old
Create a UDF that converts Location_Risk into a score:
High → 3
Medium → 2
Low → 1
Assignment Level 2 – Medium
Create a UDF that checks Accident Risk:
Accidents_Last_3Y = 0 → Safe Driver
1–2 → Moderate Risk
2 → High Risk
Create a UDF that classifies claim history:
Claim_History (yrs) = 0 → No Claim History
1–5 → Recent Claims
5 → Long Claim History
Create a UDF that assigns a flag:
Vehicle Age > 10 AND Location_Risk = High → Critical Vehicle
Otherwise → Normal
Assignment Level 3 – Row-Based Logic
Create a UDF to calculate a “Customer Stability Score”: Formula example:
Score = Customer_Age – (Accidents_Last_3Y × 5)
Create a UDF to calculate “Risk Index” considering:
Higher age adds stability
More vehicle age increases risk
More accidents increase risk
Create a UDF that assigns:
High Priority Claim if:
Vehicle_Age > 10 OR Accidents_Last_3Y > 2
Otherwise Regular
Assignment Level 4 – Business Logic
Create a UDF to calculate claim settlement amount factor:
Young + High location risk → factor 1.25
Old + Low risk → factor 0.85
Others → factor 1.0
Create a UDF that determines premium category:
High risk accident + high location → Premium+
Medium combinations → Standard
Otherwise → Basic
Create a UDF to decide fraud suspicion:
If Claim_History (yrs) > 7 AND Accidents_Last_3Y > 2 → Suspicious
Otherwise Not Suspicious
Assignment Level 5 – Scoring & Computation
Create a UDF that generates a composite Insurance Score: Scoring must use:
Customer_Age
Vehicle_Age
Claim_History (yrs)
Location_Risk
Accidents_Last_3Y
Create a UDF that assigns Final Grade based on score:
80 → A
60–80 → B
<60 → C
Create a UDF to calculate Renewal Probability:
Higher risk → lower probability
Lower risk → higher probability
Assignment Level 6 – Special Cases
Create a UDF to identify:
High Mileage Case if vehicle age > 12 AND customer age < 35
Create a UDF to detect:
Age-Vehicle Conflict:
Customer > 60 AND vehicle age < 3 → Flag “Unusual Case”
Otherwise Normal
Create a UDF that returns a text summary sentence for each row, like: “Driver is 45 years old with 2 accidents in last 3 years and lives in a Low risk area.”
On review Data
Create a UDF that classifies each review as Positive, Neutral, or Negative based on the rating.
Write a UDF that counts the number of words in each review.
Create a UDF to detect the primary language of the review (English / Hindi / French).
Write a UDF that extracts the first 20 characters of every review as a short summary.
Create a UDF that checks whether the word "good" (or its translation) appears in a review (True/False).
Write a UDF that assigns a sentiment score to the review based on certain keywords (e.g., good=+1, bad=–1).
Create a UDF to calculate the character length of each review (excluding spaces).
Write a UDF that converts every review to lowercase.
Create a UDF that returns the last word of each review.
Write a UDF that returns True if the review contains more than 10 words, else False.
Create a UDF that removes punctuation from every review.
Write a UDF that counts how many vowels each review contains.
Create a UDF to generate a simple text-based star rating visualization (e.g., 5 → "*").
Write a UDF that checks whether the review contains any numeric digits.
Create a UDF to classify reviews based on length: Short (<50 chars), Medium (50–100), Long (>100).
Last updated