Skip to content

Python Regular Expression

Basics

Python's regular expression is based on extended Regular Expression

The basic usage:

import re
pattern = r"abc(\w)efg"
in_string = r"abcdefg"
m = re.search(pattern, in_string)
if (m):
    print(m.group(0)) # prints "abcdefg"
    print(m.group(1)) # prints "d"

Greedy vs non-greedy

+, *, ?, and {m,n} are greedy in the sense that it tries to match as much text as possible.

+?, *?, ??, and {m,n}? are non-greedy in the sense that it tries to match as little text as possible.

Example 1:

import re

s1 = 'aaa'
p11 = r'(a?)a*'
p12 = r'(a??)a*'

# m11.group(1) = 'a' 
# RE tries to be greedy with 'a?' and found a match
m11 = re.search(p11, s1) 

# m12.group(1) = ''  
# RE tries to be non-greedy with 'a??' and found a match
m12 = re.search(p12, s1) 

s2 = 'abb'
p21 = r'(a?)b*'
p22 = r'(a??)b*'
p23 = r'(a??)b+'

# m21.group(1) = 'a'
# RE tries to be greedy with 'a?' and found a match
m21 = re.search(p21, s2) 

# m22.group(1) = ''  
# RE tries to be non-greedy with 'a??' and found a match: 
# it can match 0 'a' and 0 'b'
m22 = re.search(p22, s2) 

# m23.group(1) = 'a'
# RE tries to be non-greedy with 'a??' but found no match. 
# Then it tries to consume 1 'a' and found a match
m23 = re.search(p23, s2) 

Example 2:

my_string = '<body>Something something</body>'
pattern1 = r'<.+>'
pattern2 = r'<.+?>'

# m1.group(1) = '<body>Something something</body>'
m1 = re.search(pattern1, my_string) 

# m2.group(1) = '<body>'
m2 = re.search(pattern2, my_string) 

Lookahead and lookbehind assertions

(?=...): Match abc if it is followed by def (lookahead assertion)

pattern = r"abc(?=def)"
in1 = "abcdef"
in2 = "abcxyz"
m1 = re.search(pattern, in1)
if(m1):
    # m1.group(0) is 'abc'. The lookahead pattern (def) is not consumed.
    print(m1.group(0)) 

m2 = re.search(pattern, in2)
if(m2):
    # no match for in2
    print(m2.group(0)) 

(?!...): Match abc if it is NOT followed by def (negative lookahead assertion)

pattern = r"abc(?!def)

(?<=...): Match def if it is preceded by abc (lookbehind assertion)

pattern = r"(?<=abc)def"

(?<!...): Match def if it is NOT preceded by abc (negative lookbehind assertion)

pattern = r"(?<!abc)def"

Named groups

(?P<XYZ>): Define a named group whose name is XYZ.

(?P=XYZ): Backreference to the named group XYZ in the same pattern

pattern = r'(?P<SomeWord>\w+) \w+ (?P=SomeWord)'
sentence = 'day by day'
m = re.search(pattern, sentence)
if (m):
    some_word = m.group('SomeWord')

Non-capturing groups

(?:...): Define a non-capturing group.

Reference