Regular Expressions in Python

Regular Expressions in Python

>>> import re
>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> print('Phone number found: ' + mo.group())
Phone number found: 415-555-4242

Import regex module with import re
Create Regex object with the re.compile() function
Pass the string you want to search into the Regex object’s search() method
Call the Match object’s group() method to return the string of matched text

Basic Regular Expression Syntax #

Specifier	Meaning - (This matches …)
`?`	0 or 1 of preceding group
`*`	0 or more of preceding group
`+`	1 or more of preceding group
`{n}`	exactly n of preceding group
`{n,}`	n or more of preceding group
`{,m}`	0 to m of preceding group
`{n,m}`	n to m of preceding group
`{n,m}?` or `*?` or `+?`	non-greedy match of preceding group
`^`	beginning of string
`$`	end of string
`.`	any character (except newlines)
`\d`	any digit
`\w`	any alphanumeric `[0-9a-zA-Z]`
`\s`	any whitespace `[ \t\n\r\f\v]`
`\D`	any non-digit
`\W`	any non-alphanumeric
`\S`	any non-whitespace
`[abc]`	any character between the `[ ]`
`[^abc]`	any character that isn’t between the `[ ]`
`\|`	either or, like logical or
`()`	creates a capture group and indicates precedence

Escaping specifiers #

If you want to detect any of the special characters in the table above, you need to escape them. For example:

\. \^ \$ \* \+ \? \{ \} \[ \] \\ \| \( \)

Groupings #

Use group() and groups() (methods?) to grab specific groups of matches. Like so:

i>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
i>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
i>>> mo.group(1)
'415'
i>>> mo.group(2)
'555-4242'
i>>> mo.group(0)
'415-555-4242'
i>>> mo.group()
'415-555-4242'
i>>> mo.groups()
('415', '555-4242')
i>>> areaCode, mainNumber = mo.groups()
i>>> print(areaCode, mainNumber)
415 555-4242

findall() #

In Python regex search() will return the first result. But, findall() returns all results.

Case-Insensitive Matching #

Pass re.IGNORECASE or re.I as a second argument to re.compile() to search case-insensitively.

>>> robocop = re.compile(r'robocop', re.I)
>>> robocop.search('RoboCop is part man, part machine, all cop.').group()
'RoboCop'

Matching Newlines #

Allow .* to match everything including newlines by passing re.DOTALL

Substitution #

Use the sub() method to substitute strings that match a regex.

i>>> namesRegex = re.compile(r'Agent \w+')
i>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'

Using groupings to include match in substitution.

i>>> agentNamesRegex = re.compile(r'Agent (\w)\w*')
i>>> agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'

Managing Complex Regexes #

Tell re.compile() to ignore whitespace and comments with verbose mode by passing re.VERBOSE.

>>> phoneRegex = re.compile(r'''(
...     (\d{3}|\(\d{3}\))?            # area code
...     (\s|-|\.)?                    # separator
...     \d{3}                         # first 3 digits
...     (\s|-|\.)                     # separator
...     \d{4}                         # last 4 digits
...     (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
...     )''', re.VERBOSE)