Perl Regular Expressions: Pattern Matching Mastery

Aman Saurav
Dec 25, 2024
#perl #regex #pattern-matching #text-processing

Perl is renowned for its powerful regular expression (regex) capabilities. In fact, many modern regex implementations are based on Perl’s syntax. This guide covers everything from basics to advanced pattern matching.

Why Perl for Regex?

Perl’s regex engine is:

  • Powerful: Supports advanced features like lookaheads, lookbehinds, and atomic groups
  • Fast: Highly optimized for text processing
  • Expressive: Concise syntax for complex patterns
  • Influential: PCRE (Perl Compatible Regular Expressions), modeled on Perl’s syntax, is used by many languages and tools

Basic Pattern Matching

The Match Operator (m//)

#!/usr/bin/perl
use strict;
use warnings;

my $text = "Hello, World!";

# Basic match
if ($text =~ /World/) {
    print "Found 'World'\n";
}

# Case-insensitive match
if ($text =~ /world/i) {
    print "Found 'world' (case-insensitive)\n";
}

# Negated match
if ($text !~ /Goodbye/) {
    print "Didn't find 'Goodbye'\n";
}

Match Modifiers

my $text = "Hello\nWorld\nPerl";

# i - case insensitive
$text =~ /HELLO/i;  # Matches

# m - multiline (^ and $ match line boundaries)
$text =~ /^World/m;  # Matches

# s - single line (. matches newline)
$text =~ /Hello.World/s;  # Matches

# x - extended (allows whitespace and comments)
$text =~ /
    Hello   # Match Hello
    \s+     # One or more whitespace
    World   # Match World
/x;

# g - global (find all matches)
while ($text =~ /\w+/g) {
    print "Found: $&\n";
}
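
In list context, /g returns every match at once, which is often tidier than looping over $&. The following is a minimal sketch of that idiom; the names @words and @initials are illustrative and not taken from the examples above.

```perl
use strict;
use warnings;

my $text = "Hello\nWorld\nPerl";

# List context: /g returns all matches at once
my @words = $text =~ /\w+/g;
print "Words: @words\n";

# /m plus /g: with a capture group, the captures are returned instead
my @initials = $text =~ /^(\w)/mg;
print "Initials: @initials\n";

# Scalar context: pos() gives the offset where the last /g match ended
while ($text =~ /(\w+)/g) {
    printf "%-5s ends at offset %d\n", $1, pos($text);
}
```

Using $1 with an explicit capture group also avoids $&, which carried a performance penalty on older Perl versions.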

Character Classes

# Predefined character classes
/\d/   # Digit ([0-9] for ASCII text)
/\D/   # Non-digit
/\w/   # Word character ([a-zA-Z0-9_] for ASCII text)
/\W/   # Non-word character
/\s/   # Whitespace [ \t\n\r\f] (plus vertical tab since Perl 5.18)
/\S/   # Non-whitespace

# Custom character classes
/[aeiou]/      # Vowels
/[^aeiou]/     # Not vowels
/[a-z]/        # Lowercase letters
/[A-Z]/        # Uppercase letters
/[0-9]/        # Digits
/[a-zA-Z0-9]/  # Alphanumeric

# Examples
my $text = "abc123XYZ";
$text =~ /\d+/;     # Matches "123"
$text =~ /[A-Z]+/;  # Matches "XYZ"
$text =~ /\w+/;     # Matches entire string

Quantifiers

# Greedy quantifiers
*      # 0 or more
+      # 1 or more
?      # 0 or 1
{n}    # Exactly n
{n,}   # n or more
{n,m}  # Between n and m

# Examples
my $text = "aaabbbccc";
$text =~ /a+/;      # Matches "aaa"
$text =~ /b{3}/;    # Matches "bbb"
$text =~ /c{2,}/;   # Matches "ccc"

# Non-greedy (lazy) quantifiers
*?     # 0 or more (non-greedy)
+?     # 1 or more (non-greedy)
??     # 0 or 1 (non-greedy)
{n,}?  # n or more (non-greedy)
{n,m}? # Between n and m (non-greedy)

# Greedy vs Non-greedy
my $html = "<b>bold</b> text <b>more</b>";
$html =~ /<b>.*<\/b>/;   # Matches entire string (greedy)
$html =~ /<b>.*?<\/b>/;  # Matches "<b>bold</b>" (non-greedy)
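
When several tags sit on one line, a lazy quantifier or a negated character class both keep each match inside its own tag. A small sketch, assuming the tag contents hold no nested tags:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> text <b>more</b>";

# Lazy quantifier: stop at the first closing tag
my @lazy = $html =~ /<b>(.*?)<\/b>/g;

# Negated class: the capture cannot run past a '<'
my @negated = $html =~ /<b>([^<]*)<\/b>/g;

# Greedy, for comparison: one capture spanning both tags
my ($greedy) = $html =~ /<b>(.*)<\/b>/;

print "lazy:    @lazy\n";
print "negated: @negated\n";
print "greedy:  $greedy\n";
```

The negated class is often the safer choice, because it states directly what the content may not contain instead of relying on backtracking.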

Anchors and Boundaries

# Position anchors
^      # Start of string (or line with /m)
$      # End of string, or just before a trailing newline (end of line with /m)
\A     # Start of string (always)
\Z     # End of string, or just before a trailing newline
\z     # Very end of string (always)
\b     # Word boundary
\B     # Not word boundary

# Examples
my $text = "Hello World";
$text =~ /^Hello/;    # Matches (starts with Hello)
$text =~ /World$/;    # Matches (ends with World)
$text =~ /\bWorld\b/; # Matches (whole word)
$text =~ /\Bor\B/;    # Matches "or" in "World"

# Email validation
my $email = 'user@example.com';  # Single quotes: "@example" would be interpolated as an array
if ($email =~ /^[\w.-]+@[\w.-]+\.\w+$/) {
    print "Valid email format\n";
}
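
The end-of-string anchors only differ when the string ends in a newline, as an unchomped input line does. A short sketch comparing $, \Z and \z (the %result hash and its keys are illustrative names):

```perl
use strict;
use warnings;

my $line = "World\n";   # e.g. a line read from a file without chomp

my %result = (
    dollar  => ($line =~ /World$/  ? 1 : 0),   # matches before the trailing newline
    big_Z   => ($line =~ /World\Z/ ? 1 : 0),   # same behavior as $ without /m
    small_z => ($line =~ /World\z/ ? 1 : 0),   # fails: the newline is still there
);

printf "%-8s %s\n", $_, $result{$_} ? "match" : "no match" for sort keys %result;
```

For validation, \A and \z are usually the stricter choice, since they do not tolerate a trailing newline.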

Capturing Groups

# Basic capturing
my $text = "John Doe";
if ($text =~ /(\w+)\s+(\w+)/) {
    print "First: $1\n";   # John
    print "Last: $2\n";    # Doe
}

# Non-capturing groups
$text =~ /(?:Mr|Mrs|Ms)\s+(\w+)/;  # Don't capture title

# Named captures
$text = "2024-12-25";
if ($text =~ /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/) {
    print "Year: $+{year}\n";    # 2024
    print "Month: $+{month}\n";  # 12
    print "Day: $+{day}\n";      # 25
}

# Backreferences
$text = "hello hello";
$text =~ /(\w+)\s+\1/;  # Matches repeated word

# Extract all matches
my @words = $text =~ /\b(\w+)\b/g;
print "Words: @words\n";
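
Captures can also be assigned straight into variables by matching in list context, which avoids reading $1 and $2 after the fact. A minimal sketch; parse_date is a hypothetical helper written for this illustration, not part of any module.

```perl
use strict;
use warnings;

# parse_date is a hypothetical helper for this example
sub parse_date {
    my ($s) = @_;
    # List assignment: the captures land directly in lexicals.
    # A failed match assigns nothing, so "or return" fires.
    my ($year, $month, $day) = $s =~ /\A(\d{4})-(\d{2})-(\d{2})\z/
        or return;
    return { year => $year, month => $month, day => $day };
}

my $date = parse_date("2024-12-25");
print "Month: $date->{month}\n" if $date;
print "No match\n" unless parse_date("25/12/2024");
```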

Substitution (s///)

# Basic substitution
my $text = "Hello World";
$text =~ s/World/Perl/;  # "Hello Perl"

# Global substitution
$text = "foo bar foo baz";
$text =~ s/foo/qux/g;    # "qux bar qux baz"

# Case-insensitive substitution
$text = "Hello WORLD";
$text =~ s/world/Perl/i; # "Hello Perl"

# Using captured groups
$text = "John Doe";
$text =~ s/(\w+)\s+(\w+)/$2, $1/;  # "Doe, John"

# Substitution with code
$text = "price: 100";
$text =~ s/(\d+)/$1 * 1.1/e;  # "price: 110" (e modifier executes code)

# Delete pattern
$text = "Hello World";
$text =~ s/World//;  # "Hello "

# Transliteration (tr///) maps characters one-to-one; it does not use regex syntax
$text = "Hello World";
$text =~ tr/a-z/A-Z/;  # "HELLO WORLD"
$text =~ tr/AEIOU//d;  # "HLL WRLD" (delete vowels; tr is case-sensitive)
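
Two further details of s/// are worth a sketch: the /r modifier (Perl 5.14 and later) returns a modified copy instead of editing in place, and s/// without /r returns the number of substitutions made. The variable names below are illustrative.

```perl
use strict;
use warnings;

my $original = "foo bar foo baz";

# /r returns the modified copy and leaves $original untouched
my $copy = $original =~ s/foo/qux/gr;

# Without /r, s/// edits in place and returns the substitution count
my $edited = $original;
my $count  = ($edited =~ s/foo/qux/g);

# /e with sprintf controls the formatting of a computed replacement
my $label = "price: 100";
my $price = $label =~ s/(\d+)/sprintf("%.2f", $1 * 1.1)/er;

print "$original -> $copy ($count substitutions)\n";
print "$price\n";
```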

Advanced Patterns

Lookahead and Lookbehind

# Positive lookahead (?=...)
my $text = "password123";
$text =~ /[a-z]+(?=\d)/;  # Matches "password" (letters followed by a digit)

# Negative lookahead (?!...)
$text =~ /\w+(?!\d)/;  # Matches "password123" (the match may not stop just before a digit)

# Positive lookbehind (?<=...)
$text = 'price: $100';  # Single quotes keep "$100" literal
$text =~ /(?<=\$)\d+/;  # Matches "100" (preceded by $)

# Negative lookbehind (?<!...)
$text =~ /(?<!\$)\d+/;  # Matches "00" (the "1" is preceded by $, so the match starts one digit later)

# Password validation (lookaheads)
sub validate_password {
    my $pwd = shift;
    return $pwd =~ /
        ^                 # Start
        (?=.*[a-z])       # At least one lowercase
        (?=.*[A-Z])       # At least one uppercase
        (?=.*\d)          # At least one digit
        (?=.*[@#\$%])     # At least one special char (dollar sign escaped to avoid interpolation)
        .{8,}             # At least 8 characters
        $                 # End
    /x;
}
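
A quick way to check a lookahead-based validator is to run it over a handful of candidates. The sketch below restates the validator with every special character escaped, because an unescaped $ followed by % inside a pattern is read by Perl as the special variable $%; the sample passwords are made up for illustration.

```perl
use strict;
use warnings;

sub validate_password {
    my ($pwd) = @_;
    return $pwd =~ /
        \A                  # Start of string
        (?=.*[a-z])         # At least one lowercase
        (?=.*[A-Z])         # At least one uppercase
        (?=.*\d)            # At least one digit
        (?=.*[\@\#\$%])     # At least one special char, escaped to avoid interpolation
        .{8,}               # At least 8 characters
        \z                  # End of string
    /x ? 1 : 0;
}

# Hypothetical sample inputs
for my $candidate ('Passw0rd#', 'password', 'Passw0rdX', 'Pw0#') {
    printf "%-10s %s\n", $candidate, validate_password($candidate) ? "ok" : "rejected";
}
```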

Alternation and Grouping

# Alternation (|)
my $text = "cat";
$text =~ /cat|dog|bird/;  # Matches "cat"

# Grouping
$text = "catfish";
$text =~ /(cat|dog)fish/;  # Matches "catfish"

# Non-capturing group
$text =~ /(?:cat|dog)fish/;  # Matches but doesn't capture

# Atomic grouping (?>...)
$text = "foobar";
$text =~ /(?>foo|foobar)/;  # Matches "foo" (doesn't backtrack)

Recursive Patterns

# Match balanced parentheses
my $text = "((a)(b))";
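my $balanced = qr/
    (?<group>           # Named group to recurse into
        \(              # Opening paren
        (?:
            [^()]       # Non-parens
            |
            (?&group)   # Recurse into the group only; (?R) would also re-run the outer anchors
        )*
        \)              # Closing paren
    )
/x;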

if ($text =~ /^$balanced$/) {
    print "Balanced!\n";
}
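
Because (?R) recurses into the whole pattern, any anchors wrapped around an interpolated qr// object are re-run on each recursion. One way to sidestep that is to recurse into a named group. The sketch below takes that approach; the group name and the is_balanced helper are hypothetical, and it requires Perl 5.10 or later.

```perl
use strict;
use warnings;

my $balanced = qr/
    (?<group>
        \(                  # Opening paren
        (?:
            [^()]++         # Run of non-parens (possessive)
            |
            (?&group)       # Recurse into the named group only
        )*
        \)                  # Closing paren
    )
/x;

# is_balanced is a hypothetical helper for this example
sub is_balanced {
    my ($s) = @_;
    return $s =~ /\A$balanced\z/ ? 1 : 0;
}

for my $candidate ('((a)(b))', '()', '((a)(b)', '(a))') {
    printf "%-10s %s\n", $candidate, is_balanced($candidate) ? "balanced" : "unbalanced";
}
```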

Practical Examples

Email Validation

sub validate_email {
    my $email = shift;
    return $email =~ /^
        [a-zA-Z0-9._%+-]+     # Local part
        @                     # @
        [a-zA-Z0-9.-]+        # Domain
        \.                    # Dot
        [a-zA-Z]{2,}          # TLD
    $/x;
}

print validate_email('user@example.com') ? "Valid\n" : "Invalid\n";  # Single quotes: "@example" would be interpolated

URL Parsing

my $url = "https://www.example.com:8080/path/to/page?query=value#fragment";

if ($url =~ m{
    ^
    (?<protocol>https?)://     # Protocol
    (?<host>[\w.-]+)           # Hostname
    (?::(?<port>\d+))?         # Optional port
    (?<path>/[^?#]*)?          # Path
    (?:\?(?<query>[^#]*))?     # Query string
    (?:\#(?<fragment>.*))?     # Fragment
    $
}x) {
    print "Protocol: $+{protocol}\n";
    print "Host: $+{host}\n";
    print "Port: ", $+{port} // "default", "\n";
    print "Path: ", $+{path} // "/", "\n";
    print "Query: ", $+{query} // "none", "\n";
    print "Fragment: ", $+{fragment} // "none", "\n";
}

Log File Parsing

#!/usr/bin/perl
use strict;
use warnings;

# Parse Apache log format
my $log = '192.168.1.1 - - [25/Dec/2024:10:30:45 +0000] "GET /index.html HTTP/1.1" 200 1234';

if ($log =~ /
    ^
    (?<ip>[\d.]+)                    # IP address
    \s+-\s+-\s+
    \[(?<date>[^\]]+)\]              # Date
    \s+"
    (?<method>\w+)                   # HTTP method
    \s+
    (?<path>\S+)                     # Path
    \s+
    HTTP\/[\d.]+                     # HTTP version
    "\s+
    (?<status>\d+)                   # Status code
    \s+
    (?<size>\d+)                     # Response size
    $
/x) {
    print "IP: $+{ip}\n";
    print "Date: $+{date}\n";
    print "Method: $+{method}\n";
    print "Path: $+{path}\n";
    print "Status: $+{status}\n";
    print "Size: $+{size} bytes\n";
}
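
In practice a log pattern is applied to many lines, so it helps to compile it once with qr// and aggregate the named captures. A minimal sketch: count_statuses is a hypothetical helper, and the sample lines are invented for illustration.

```perl
use strict;
use warnings;

my $log_re = qr/
    ^ (?<ip>[\d.]+) \s+ - \s+ - \s+
    \[ (?<date>[^\]]+) \] \s+
    " (?<method>\w+) \s+ (?<path>\S+) \s+ HTTP\/[\d.]+ " \s+
    (?<status>\d+) \s+ (?<size>\d+) $
/x;

# count_statuses is a hypothetical helper: status code => number of lines
sub count_statuses {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        next unless $line =~ $log_re;   # Skip lines that do not parse
        $count{ $+{status} }++;
    }
    return \%count;
}

# Invented sample data
my @sample = (
    '192.168.1.1 - - [25/Dec/2024:10:30:45 +0000] "GET /index.html HTTP/1.1" 200 1234',
    '192.168.1.2 - - [25/Dec/2024:10:31:02 +0000] "GET /missing HTTP/1.1" 404 512',
    '192.168.1.1 - - [25/Dec/2024:10:31:10 +0000] "POST /login HTTP/1.1" 200 87',
    'not a log line',
);

my $counts = count_statuses(@sample);
print "$_: $counts->{$_}\n" for sort keys %$counts;
```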

Data Extraction

# Extract phone numbers
my $text = "Call me at 555-123-4567 or (555) 567-8900";  # The pattern expects 3-3-4 digit groups
my @phones = $text =~ /\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g;
print "Found phones: @phones\n";

# Extract email addresses
$text = 'Contact: john@example.com or jane.doe@company.org';  # Single quotes prevent @ interpolation
my @emails = $text =~ /[\w.-]+@[\w.-]+\.\w+/g;
print "Found emails: @emails\n";

# Extract hashtags
$text = "Check out #perl #regex #programming";
my @hashtags = $text =~ /#(\w+)/g;
print "Hashtags: @hashtags\n";

Text Cleaning

my $text = "  Hello   World  \n\n";

# Remove leading/trailing whitespace
$text =~ s/^\s+|\s+$//g;

# Collapse multiple spaces
$text =~ s/\s+/ /g;

# Remove HTML tags
my $html = "<p>Hello <b>World</b></p>";
$html =~ s/<[^>]+>//g;  # "Hello World"

# Remove comments
my $code = "code(); // comment\nmore code";
$code =~ s/\/\/.*$//gm;  # Remove // comments

Performance Tips

1. Compile Regex Once

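# ❌ Slower: the pattern string is interpolated and re-checked on every iteration
my $user_input = 'pattern';
for my $line (@lines) {
    if ($line =~ /$user_input/) {
        # ...
    }
}

# ✅ Faster: Compile once with qr//
# (a literal /pattern/ is already compiled once, when the program is compiled)
my $pattern = qr/$user_input/;
for my $line (@lines) {
    if ($line =~ $pattern) {
        # ...
    }
}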
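
qr// is most useful when the pattern is assembled at runtime. The sketch below builds one alternation from a keyword list and reuses the compiled object; the keywords and sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical keyword list; quotemeta escapes the metacharacters in "C++" and "a.b"
my @keywords    = ('error', 'warn', 'C++', 'a.b');
my $alternation = join '|', map { quotemeta($_) } @keywords;
my $keyword_re  = qr/(?:$alternation)/;   # Compiled once, reused below

my @lines = ('disk error', 'all good', 'uses C++ here', 'axb');
my @hits  = grep { $_ =~ $keyword_re } @lines;

print "Matched: $_\n" for @hits;
```

Without quotemeta, "a.b" would also match "axb" and "C++" would be read as quantifiers instead of literal plus signs.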

2. Use Non-Capturing Groups

# ❌ Slower: Captures unnecessarily
/(foo|bar|baz)/

# ✅ Faster: Non-capturing
/(?:foo|bar|baz)/

3. Anchor Patterns

# ❌ Slower: Searches entire string
/pattern/

# ✅ Faster: Anchored (only when the text must sit at that position, since anchoring changes what matches)
/^pattern/  # If at start
/pattern$/  # If at end

4. Don't Rely on study

my $text = "large text...";
study $text;  # No-op since Perl 5.16; older versions built an index to speed up repeated matches
$text =~ /pattern1/;
$text =~ /pattern2/;
$text =~ /pattern3/;

Common Pitfalls

1. Greedy vs Non-Greedy

my $html = "<b>bold</b> text <b>more</b>";

# ❌ Greedy: Matches too much
$html =~ /<b>.*<\/b>/;  # Matches entire string

# ✅ Non-greedy: Matches correctly
$html =~ /<b>.*?<\/b>/;  # Matches "<b>bold</b>"

2. Forgetting to Escape Metacharacters

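my $text = 'price: $100';  # Single quotes: in double quotes, $100 is interpolated as a variable

# ❌ Wrong: $100 is interpolated as capture variable $100, not matched literally
$text =~ /$100/;  # Does not look for a literal "$100"

# ✅ Correct: Escape $
$text =~ /\$100/;  # Matches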
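
The same pitfall appears when the text to find arrives in a variable. A short sketch using \Q...\E, which escapes metacharacters in the interpolated value; $needle is a hypothetical runtime search term.

```perl
use strict;
use warnings;

my $text   = 'price: $100';
my $needle = '$100';   # Hypothetical search term that arrives at runtime

# Unquoted: the regex engine sees an end anchor followed by "100"
my $raw = $text =~ /$needle/ ? 1 : 0;

# \Q...\E (or quotemeta) escapes metacharacters in the interpolated value
my $quoted = $text =~ /\Q$needle\E/ ? 1 : 0;

print "raw: $raw, quoted: $quoted\n";
```

For a plain substring search with no pattern features, index($text, $needle) is a simpler alternative.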

3. Not Using /x for Complex Patterns

# ❌ Hard to read
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/

# ✅ Readable with /x
/^
    (?=.*[a-z])   # Lowercase
    (?=.*[A-Z])   # Uppercase
    (?=.*\d)      # Digit
    .{8,}         # Min 8 chars
$/x

Testing Regex

#!/usr/bin/perl
use strict;
use warnings;
use Test::More;  # No fixed plan needed: done_testing() below closes the test run

# Test email validation
my $email_regex = qr/^[\w.-]+@[\w.-]+\.\w+$/;

# Single quotes keep "@example" and "@domain" from being interpolated as arrays
ok('user@example.com' =~ $email_regex, "Valid email");
ok('invalid.email' !~ $email_regex, "Invalid email");
ok('user@domain.co.uk' =~ $email_regex, "Email with multiple dots");

done_testing();

Resources

  • Perl Regex Tutorial: perldoc perlretut
  • Regex Reference: perldoc perlre
  • Regex Tester: regex101.com (select Perl flavor)
  • Book: “Mastering Regular Expressions” by Jeffrey Friedl

Conclusion

Perl’s regex engine remains one of the most capable available. Key takeaways:

  1. ✅ Use /x modifier for complex patterns
  2. ✅ Compile patterns once with qr//
  3. ✅ Use non-capturing groups when possible
  4. ✅ Understand greedy vs non-greedy
  5. ✅ Use named captures for clarity
  6. ✅ Test patterns thoroughly

Master Perl regex, and you’ll have a powerful tool for text processing!