13. validation in pandas

file-download
224B
file-download
3MB

Validation in pandas is used to check if we are following the rules. Having a wrong phone number or wrong Aadhaar card are some examples. We need to detect such invalid values and remove them to keep only correct data.

1. Wrong Phone Number

CustomerID
Phone

0

101

9876543210

1

102

12345

2

103

9988776655

3

104

abcdefghij

4

105

987654321

5

106

9876543210

6

107

98765432101

7

108

9123456789

8

109

98765abcde

9

110

9988776655

Rules for phone numbers:

1

Rule

Phone number should have only 10 digits.

2

Rule

It should contain only numeric characters (no letters or other symbols).

CustomerID
Phone
length

0

101

9876543210

10

1

102

12345

5

2

103

9988776655

10

3

104

abcdefghij

10

4

105

987654321

9

5

106

9876543210

10

6

107

98765432101

11

7

108

9123456789

10

8

109

98765abcde

10

9

110

9988776655

10

CustomerID
Phone
length
numbers

0

101

9876543210

10

True

1

102

12345

5

True

2

103

9988776655

10

True

3

104

abcdefghij

10

False

4

105

987654321

9

True

5

106

9876543210

10

True

6

107

98765432101

11

True

7

108

9123456789

10

True

8

109

98765abcde

10

False

9

110

9988776655

10

True

Filter out the right Phone Numbers

CustomerID
Phone
length
numbers

0

101

9876543210

10

True

2

103

9988776655

10

True

5

106

9876543210

10

True

7

108

9123456789

10

True

9

110

9988776655

10

True

You can drop the helper columns:

CustomerID
Phone

0

101

9876543210

2

103

9988776655

5

106

9876543210

7

108

9123456789

9

110

9988776655

2. Wrong PAN Cards

PAN Format:

1

Total length

10 characters

2

First 5

First 5: letters (A-Z)

3

Next 4

Next 4: digits (0-9)

4

Last 1

Last 1: letter (A-Z)

CustomerID
PAN

0

101

ABCDE1234F

1

102

AB1234CDE5

2

103

KLMNO9012P

3

104

ABCDE12345

4

105

UVWXY7890R

5

106

ABCDE12F4G

6

107

ZABCD2345S

7

108

1234ABCDE5

8

109

OPQRS4567V

9

110

ABCDE12345

10

111

STUVW8901W

11

112

ABCDE12G4F

12

113

XYABC2345X

CustomerID
PAN
characters
First 5 Letter
Next 4 digit
Last 1 Letter

0

101

ABCDE1234F

10

ABCDE

1234

F

1

102

AB1234CDE5

10

AB123

4CDE

5

2

103

KLMNO9012P

10

KLMNO

9012

P

3

104

ABCDE12345

10

ABCDE

1234

5

4

105

UVWXY7890R

10

UVWXY

7890

R

5

106

ABCDE12F4G

10

ABCDE

12F4

G

6

107

ZABCD2345S

10

ZABCD

2345

S

7

108

1234ABCDE5

10

1234A

BCDE

5

8

109

OPQRS4567V

10

OPQRS

4567

V

9

110

ABCDE12345

10

ABCDE

1234

5

10

111

STUVW8901W

10

STUVW

8901

W

11

112

ABCDE12G4F

10

ABCDE

12G4

F

12

113

XYABC2345X

10

XYABC

2345

X

Filter out right PAN Cards

CustomerID
PAN
characters
First 5 Letter
Next 4 digit
Last 1 Letter

0

101

ABCDE1234F

10

ABCDE

1234

F

2

103

KLMNO9012P

10

KLMNO

9012

P

4

105

UVWXY7890R

10

UVWXY

7890

R

6

107

ZABCD2345S

10

ZABCD

2345

S

8

109

OPQRS4567V

10

OPQRS

4567

V

10

111

STUVW8901W

10

STUVW

8901

W

12

113

XYABC2345X

10

XYABC

2345

X

CustomerID
PAN

0

101

ABCDE1234F

2

103

KLMNO9012P

4

105

UVWXY7890R

6

107

ZABCD2345S

8

109

OPQRS4567V

10

111

STUVW8901W

12

113

XYABC2345X

How to use regex to do string manipulation

Regex stands for Regular Expression. It’s a pattern-matching tool that helps you find, check, or extract text based on a pattern.

Examples:

  • Find digits: \d → matches any number (0-9)

  • Find letters: [a-zA-Z] → matches any uppercase or lowercase letter

  • Check email format: \w+@\w+.\w+ → matches something like abc@gmail.com

  • Start of a line: ^Hello → matches lines starting with “Hello”

  • End of a line: world$ → matches lines ending with “world”

If you have a text "My number is 1234" and you use \d+, it will find "1234". Regex describes patterns in text to search, extract, or validate them.

Wrong Phone Numbers (using regex)

CustomerID
Phone

0

101

9876543210

2

103

9988776655

5

106

9876543210

7

108

9123456789

9

110

9988776655

Wrong PAN Cards (using regex)

CustomerID
PAN

0

101

ABCDE1234F

2

103

KLMNO9012P

4

105

UVWXY7890R

6

107

ZABCD2345S

8

109

OPQRS4567V

10

111

STUVW8901W

12

113

XYABC2345X

Last updated