Python Regular Expression


Regular Expressions in Python

A regular expression (regex or regexp) is a pattern that describes a set of strings. Python's built-in re module uses such patterns to search, match, and manipulate text, which lets you express complex searches and edits concisely.

Key Features of Regular Expressions

  1. Pattern Matching: Regular expressions use a combination of literal characters and special symbols (called metacharacters) to define search patterns.

  2. Flexible Search: They allow searching for specific patterns within text, such as words, numbers, or characters in any order or structure.

  3. Text Manipulation: Beyond searching, regular expressions can be used to replace, extract, or split parts of text.

  4. Validation: Regular expressions can be used to validate formats, such as email addresses, phone numbers, or postal codes.

Commonly Used Metacharacters in Regular Expressions

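  • .: Matches any character except a newline (unless the re.DOTALL flag is set).
  • ^: Matches the start of a string (and the start of each line when re.MULTILINE is set).
  • $: Matches the end of a string or just before a trailing newline (and the end of each line when re.MULTILINE is set).
  • *: Matches 0 or more repetitions of the preceding element (a character, set, or group).
  • +: Matches 1 or more repetitions of the preceding element.
  • ?: Matches 0 or 1 occurrence of the preceding element.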
  • []: Denotes a set of characters to match.
  • |: Acts as an OR operator.
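  • \d: Matches any decimal digit (0-9; for str patterns this also includes other Unicode digits unless re.ASCII is set).
  • \D: Matches any non-digit character.
  • \w: Matches any word character (letters, digits, and the underscore).
  • \W: Matches any non-word character.
  • \s: Matches any whitespace character (such as space, tab, or newline).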
  • \S: Matches any non-whitespace character.
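
The metacharacters listed above can be tried out on small inputs. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of any API.

```python
import re

# "." matches any single character except a newline.
dot = re.findall(r"c.t", "cat cut c\nt")             # ['cat', 'cut']

# "^" and "$" anchor the pattern to the start and end of the string.
anchored = bool(re.search(r"^abc$", "abc"))          # True

# "*", "+", and "?" control how often the preceding element may repeat.
star = re.findall(r"ab*", "a ab abb")                # ['a', 'ab', 'abb']
plus = re.findall(r"ab+", "a ab abb")                # ['ab', 'abb']
optional = re.findall(r"colou?r", "color colour")    # ['color', 'colour']

# "[]" defines a character set and "|" separates alternatives.
char_set = re.findall(r"[aeiou]", "regex")           # ['e', 'e']
either = re.findall(r"cat|dog", "cat, dog, bird")    # ['cat', 'dog']

# Character classes: digits, word characters, whitespace.
digits = re.findall(r"\d+", "Room 101, floor 3")     # ['101', '3']
words = re.findall(r"\w+", "snake_case, two words")  # ['snake_case', 'two', 'words']
pieces = re.split(r"\s+", "a  b\tc\nd")              # ['a', 'b', 'c', 'd']
```

Patterns are written as raw strings (r"...") so that backslashes such as \d reach the re module unchanged.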

Basic Functions in Python's re Module

  1. re.search(): Searches for the first occurrence of the pattern in the string.
  2. re.match(): Checks whether the pattern matches at the beginning of the string.
  3. re.findall(): Returns a list of all non-overlapping matches of the pattern in the string.
  4. re.sub(): Replaces occurrences of a pattern with a specified string and returns the new string.
  5. re.split(): Splits a string based on the occurrences of a pattern.
  6. re.compile(): Compiles a regular expression pattern for reuse.
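
The difference between re.match() and re.search(), and the use of re.compile(), is easiest to see side by side. This is a small sketch; the sample text and variable names are chosen only for illustration.

```python
import re

text = "The rain in Spain"

# re.match() succeeds only if the pattern matches at the very start of the string.
match_at_start = re.match(r"The", text)       # a match object
match_elsewhere = re.match(r"rain", text)     # None: "rain" is not at position 0

# re.search() scans the whole string and returns the first match anywhere.
found = re.search(r"rain", text)
found_span = found.span()                     # (4, 8)

# re.compile() builds a pattern object whose methods mirror the module functions,
# which is convenient when the same pattern is used many times.
ends_in_ain = re.compile(r"\b\w*ain\b")
all_matches = ends_in_ain.findall(text)       # ['rain', 'Spain']
```

Both re.match() and re.search() return None when nothing matches, so the result should be checked before calling methods such as group() on it.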

Example of Regular Expressions in Python

import re

# Sample text
text = "The rain in Spain falls mainly on the plain."

# 1. Search for the word "rain" in the text
search_result = re.search(r"rain", text)
if search_result:
    print(f"Found: {search_result.group()}")  # Output: Found: rain

# 2. Find all words that start with "S" or "s"
find_all_result = re.findall(r"\b[Ss]\w+", text)
print(find_all_result)  # Output: ['Spain']

# 3. Replace the word "rain" with "snow"
replace_result = re.sub(r"rain", "snow", text)
print(replace_result)  # Output: The snow in Spain falls mainly on the plain.

# 4. Split the string at each whitespace
split_result = re.split(r"\s", text)
print(split_result)  # Output: ['The', 'rain', 'in', 'Spain', 'falls', 'mainly', 'on', 'the', 'plain.']

Explanation of the Example

  1. re.search(): This searches for the first occurrence of the word "rain" in the string. If found, it prints the match.

  2. re.findall(): This finds all words that start with "S" or "s" in the text, where \b represents a word boundary and \w+ matches a sequence of word characters.

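  3. re.sub(): This replaces every occurrence of the substring "rain" with "snow" and returns the new string. The pattern has no word boundaries, so it would also match "rain" inside a longer word such as "rainy".

  4. re.split(): This splits the string at each whitespace character, returning a list of words. Because \s matches a single character, consecutive whitespace would produce empty strings in the result.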
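
Two details of the functions explained above are worth seeing on a different input: how re.split() treats runs of whitespace, and how re.sub() can be limited to a number of replacements. This is a small sketch with made-up sample strings.

```python
import re

messy = "The  rain\tin Spain"   # two spaces, then a tab

# r"\s" splits at every single whitespace character, so the double space
# leaves an empty string in the result.
single = re.split(r"\s", messy)     # ['The', '', 'rain', 'in', 'Spain']

# r"\s+" treats each run of whitespace as one separator.
runs = re.split(r"\s+", messy)      # ['The', 'rain', 'in', 'Spain']

# re.sub() replaces every match by default; count limits how many are replaced.
first_only = re.sub(r"ain", "AIN", "rain in Spain", count=1)   # 'rAIN in Spain'
every = re.sub(r"ain", "AIN", "rain in Spain")                 # 'rAIN in SpAIN'
```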

Common Use Cases of Regular Expressions

  1. Validating User Input:

    • Example: Checking that an email address has a plausible format (a simplified pattern, not a full validation of every legal address).
    import re

    pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    email = "example@example.com"
    if re.match(pattern, email):
        print("Valid email address")
    else:
        print("Invalid email address")

  2. Finding Specific Patterns:

    • Example: Extracting all phone numbers from a document.
    import re

    text = "Contact me at 123-456-7890 or 987-654-3210"
    phone_numbers = re.findall(r"\d{3}-\d{3}-\d{4}", text)
    print(phone_numbers)  # Output: ['123-456-7890', '987-654-3210']

  3. Replacing Text:

    • Example: Rewriting all dates from MM/DD/YYYY to DD-MM-YYYY. Each parenthesized group captures one part of the date, and \1, \2, and \3 in the replacement refer back to those groups.
    import re

    text = "Today's date is 12/25/2021."
    new_text = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\2-\1-\3", text)
    print(new_text)  # Output: Today's date is 25-12-2021.
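
For the validation use case, one subtlety is that re.match() combined with $ still accepts a string that ends in a newline. A stricter check can use re.fullmatch(), which requires the whole string to match. The sketch below reuses the phone-number format from the use cases above; the helper name is_phone_number is hypothetical and chosen only for illustration.

```python
import re

# Compiled once and reused; same simple format as the phone-number example.
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

def is_phone_number(value: str) -> bool:
    """Return True only if the entire string is a phone number (hypothetical helper)."""
    return PHONE.fullmatch(value) is not None

strict_ok = is_phone_number("123-456-7890")          # True
strict_extra = is_phone_number("123-456-7890 ext")   # False: trailing text
strict_newline = is_phone_number("123-456-7890\n")   # False: trailing newline

# By contrast, match() with "$" accepts the trailing newline,
# because "$" also matches just before a newline at the end of the string.
lenient_newline = re.match(r"^\d{3}-\d{3}-\d{4}$", "123-456-7890\n") is not None  # True
```

Whether the lenient or the strict behavior is wanted depends on where the input comes from; for user-supplied form fields the strict check is usually the safer default.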

Summary

  • Regular expressions are used for pattern matching and text manipulation.
  • Python's re module provides functions like search(), match(), findall(), sub(), and split() to work with regular expressions.
  • Metacharacters such as ., *, +, and [] are used to define search patterns.
  • Regular expressions are widely used in tasks like data validation, string searching, text extraction, and text manipulation.
