1:mod:`re` --- Regular expression operations
2===========================================
3
4.. module:: re
5   :synopsis: Regular expression operations.
6
.. moduleauthor:: Fredrik Lundh
.. sectionauthor:: Andrew M. Kuchling
9
10**Source code:** :source:`Lib/re/`
11
12--------------
13
14This module provides regular expression matching operations similar to
15those found in Perl.
16
17Both patterns and strings to be searched can be Unicode strings (:class:`str`)
18as well as 8-bit strings (:class:`bytes`).
19However, Unicode strings and 8-bit strings cannot be mixed:
20that is, you cannot match a Unicode string with a byte pattern or
21vice-versa; similarly, when asking for a substitution, the replacement
22string must be of the same type as both the pattern and the search string.
23
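As a minimal sketch of the type rule above (the sample text ``'order 66'`` is only an illustration), a ``str`` pattern works on ``str`` data, a ``bytes`` pattern works on ``bytes`` data, and mixing them raises :exc:`TypeError`:

```python
import re

# A str pattern searches str data; a bytes pattern searches bytes data.
text_match = re.search(r'\d+', 'order 66')
bytes_match = re.search(rb'\d+', b'order 66')

# Mixing the two types is rejected with TypeError.
try:
    re.search(rb'\d+', 'order 66')
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
```
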
24Regular expressions use the backslash character (``'\'``) to indicate
25special forms or to allow special characters to be used without invoking
26their special meaning.  This collides with Python's usage of the same
27character for the same purpose in string literals; for example, to match
28a literal backslash, one might have to write ``'\\\\'`` as the pattern
29string, because the regular expression must be ``\\``, and each
30backslash must be expressed as ``\\`` inside a regular Python string
literal.  Also note that any escape sequence that is invalid in a Python
string literal now generates a :exc:`DeprecationWarning` and will become a
:exc:`SyntaxError` in the future.  This happens even if the sequence is a
valid escape sequence for a regular expression.
35
36The solution is to use Python's raw string notation for regular expression
37patterns; backslashes are not handled in any special way in a string literal
38prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
39``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
40newline.  Usually patterns will be expressed in Python code using this raw
41string notation.
42
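The following short sketch illustrates the difference; the Windows-style path is just sample data chosen for the example:

```python
import re

raw_len = len(r"\n")       # two characters: a backslash and 'n'
cooked_len = len("\n")     # one character: a newline

# Both spellings produce the same two-character pattern string.
same_pattern = (r"\\" == "\\\\")

# The raw-string pattern matches one literal backslash in the text.
m = re.search(r"\\", r"C:\temp")
```

Here ``m.start()`` is the index of the backslash in the sample path.
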
Most regular expression operations are available both as module-level
functions and as methods on
:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.
48
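One such fine-tuning parameter is the start position accepted by the ``search()`` method of a compiled pattern. A small sketch, with ``'spam eggs'`` as sample input:

```python
import re

word = re.compile(r'\w+')

# The module-level shortcut always scans from the start of the string.
first = re.search(r'\w+', 'spam eggs')

# The compiled pattern's method also accepts pos and endpos arguments.
second = word.search('spam eggs', 4)
bounded = word.search('spam eggs', 5, 7)
```
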
49.. seealso::
50
51   The third-party `regex <https://pypi.org/project/regex/>`_ module,
52   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.
54
55
56.. _re-syntax:
57
58Regular Expression Syntax
59-------------------------
60
61A regular expression (or RE) specifies a set of strings that matches it; the
62functions in this module let you check if a particular string matches a given
63regular expression (or if a given regular expression matches a particular
64string, which comes down to the same thing).
65
66Regular expressions can be concatenated to form new regular expressions; if *A*
67and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contains
low-precedence operations, there are boundary conditions between *A* and *B*,
or either has numbered group references.  Thus, complex expressions can easily be constructed from simpler
72primitive expressions like the ones described here.  For details of the theory
73and implementation of regular expressions, consult the Friedl book [Frie09]_,
74or almost any textbook about compiler construction.
75
76A brief explanation of the format of regular expressions follows.  For further
77information and a gentler presentation, consult the :ref:`regex-howto`.
78
79Regular expressions can contain both special and ordinary characters. Most
80ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
81expressions; they simply match themselves.  You can concatenate ordinary
82characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
84strings to be matched ``'in single quotes'``.)
85
86Some characters, like ``'|'`` or ``'('``, are special. Special
87characters either stand for classes of ordinary characters, or affect
88how the regular expressions around them are interpreted.
89
Repetition operators or quantifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
91directly nested. This avoids ambiguity with the non-greedy modifier suffix
92``?``, and with other modifiers in other implementations. To apply a second
93repetition to an inner repetition, parentheses may be used. For example,
94the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
95
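A brief sketch of this rule: wrapping the inner repetition in a non-capturing group is accepted, while a directly nested quantifier is rejected at compile time.

```python
import re

six_a = re.compile(r'(?:a{6})*')
twelve = six_a.fullmatch('a' * 12)   # two groups of six
seven = six_a.fullmatch('a' * 7)     # not a multiple of six

# Nesting the quantifiers directly is a compile-time error.
try:
    re.compile(r'a{6}*')
    nested_error = None
except re.error as exc:
    nested_error = exc
```
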
96
97The special characters are:
98
99.. index:: single: . (dot); in regular expressions
100
101``.``
102   (Dot.)  In the default mode, this matches any character except a newline.  If
103   the :const:`DOTALL` flag has been specified, this matches any character
104   including a newline.
105
106.. index:: single: ^ (caret); in regular expressions
107
108``^``
109   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
110   matches immediately after each newline.
111
112.. index:: single: $ (dollar); in regular expressions
113
114``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
   matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
   matches only ``'foo'``.  More interestingly, searching for ``foo.$`` in
   ``'foo1\nfoo2\n'`` matches ``'foo2'`` normally, but ``'foo1'`` in
   :const:`MULTILINE` mode; searching for a single ``$`` in ``'foo\n'`` will find
   two (empty) matches: one just before the newline, and one at the end of the
   string.
122
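The ``$`` cases described above can be checked with a few calls; this is only an illustrative sketch using the same sample strings:

```python
import re

text = 'foo1\nfoo2\n'

default_hit = re.search(r'foo.$', text).group()
multiline_hit = re.search(r'foo.$', text, re.MULTILINE).group()

# A lone $ matches twice in 'foo\n': before the newline and at the very end.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
```
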
123.. index:: single: * (asterisk); in regular expressions
124
125``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible.  ``ab*`` will match ``'a'``, ``'ab'``, or
   ``'a'`` followed by any number of ``'b'``\ s.
129
130.. index:: single: + (plus); in regular expressions
131
132``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'``\ s; it
   will not match just ``'a'``.
136
137.. index:: single: ? (question mark); in regular expressions
138
139``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either ``'a'`` or ``'ab'``.
142
143.. index::
144   single: *?; in regular expressions
145   single: +?; in regular expressions
146   single: ??; in regular expressions
147
148``*?``, ``+?``, ``??``
149   The ``'*'``, ``'+'``, and ``'?'`` quantifiers are all :dfn:`greedy`; they match
150   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
151   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
152   string, and not just ``'<a>'``.  Adding ``?`` after the quantifier makes it
153   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
154   characters as possible will be matched.  Using the RE ``<.*?>`` will match
155   only ``'<a>'``.
156
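A minimal sketch of the greedy and non-greedy behaviour described above, using the same sample string:

```python
import re

text = '<a> b <c>'

greedy = re.match(r'<.*>', text).group()    # extends to the last '>'
minimal = re.match(r'<.*?>', text).group()  # stops at the first '>'
```
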
157.. index::
158   single: *+; in regular expressions
159   single: ++; in regular expressions
160   single: ?+; in regular expressions
161
162``*+``, ``++``, ``?+``
   Like the ``'*'``, ``'+'``, and ``'?'`` quantifiers, those where ``'+'`` is
   appended also match as many times as possible.
   However, unlike the true greedy quantifiers, these do not allow
   backtracking when the expression following them fails to match.
   These are known as :dfn:`possessive` quantifiers.
   For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
   all 4 ``'a'``\ s, but, when the final ``'a'`` is encountered, the
   expression is backtracked so that in the end the ``a*`` ends up matching
   3 ``'a'``\ s total, and the fourth ``'a'`` is matched by the final ``'a'``.
   However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
   match all 4 ``'a'``\ s, but when the final ``'a'`` fails to find any more
   characters to match, the expression cannot be backtracked and will thus
   fail to match.
   ``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
   and ``(?>x?)`` respectively.
178
179   .. versionadded:: 3.11
180
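The ``'aaaa'`` example can be sketched as follows. The version check is an addition for this illustration only, since the possessive and atomic syntax is rejected by interpreters older than 3.11:

```python
import re
import sys

# Greedy: a* first takes all four characters, then gives one back.
greedy = re.fullmatch(r'a*a', 'aaaa')

# Possessive and atomic forms need Python 3.11+; older versions reject them.
if sys.version_info >= (3, 11):
    possessive = re.fullmatch(r'a*+a', 'aaaa')   # a*+ never gives back
    atomic = re.fullmatch(r'(?>a*)a', 'aaaa')    # equivalent spelling
else:
    possessive = atomic = None
```
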
181.. index::
182   single: {} (curly brackets); in regular expressions
183
184``{m}``
185   Specifies that exactly *m* copies of the previous RE should be matched; fewer
186   matches cause the entire RE not to match.  For example, ``a{6}`` will match
187   exactly six ``'a'`` characters, but not five.
188
189``{m,n}``
190   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
191   RE, attempting to match as many repetitions as possible.  For example,
192   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
193   lower bound of zero,  and omitting *n* specifies an infinite upper bound.  As an
194   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
195   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
196   modifier would be confused with the previously described form.
197
198``{m,n}?``
199   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
200   RE, attempting to match as *few* repetitions as possible.  This is the
201   non-greedy version of the previous quantifier.  For example, on the
202   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
203   while ``a{3,5}?`` will only match 3 characters.
204
205``{m,n}+``
206   Causes the resulting RE to match from *m* to *n* repetitions of the
207   preceding RE, attempting to match as many repetitions as possible
208   *without* establishing any backtracking points.
209   This is the possessive version of the quantifier above.
   For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}+aa``
   attempts to match 5 ``'a'`` characters, then, requiring 2 more ``'a'``\ s,
   needs more characters than are available and thus fails, while
   ``a{3,5}aa`` will match with ``a{3,5}`` capturing 5, then 4 ``'a'``\ s
   by backtracking, after which the final 2 ``'a'``\ s are matched by the final
   ``aa`` in the pattern.
216   ``x{m,n}+`` is equivalent to ``(?>x{m,n})``.
217
218   .. versionadded:: 3.11
219
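As a small sketch of the bounded quantifiers (only the greedy and non-greedy forms are shown, so it does not depend on the 3.11 additions):

```python
import re

six = 'aaaaaa'

greedy = re.match(r'a{3,5}', six).group()    # as many as allowed
lazy = re.match(r'a{3,5}?', six).group()     # as few as allowed

# An omitted upper bound means "no limit".
four_plus = re.fullmatch(r'a{4,}b', 'aaaab')
too_few = re.fullmatch(r'a{4,}b', 'aaab')
thousand = re.fullmatch(r'a{4,}b', 'a' * 1000 + 'b')
```
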
220.. index:: single: \ (backslash); in regular expressions
221
222``\``
223   Either escapes special characters (permitting you to match characters like
224   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
225   sequences are discussed below.
226
   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string.  However, if Python would
   recognize the resulting sequence, the backslash should be doubled.  This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.
234
235.. index::
236   single: [] (square brackets); in regular expressions
237
238``[]``
239   Used to indicate a set of characters.  In a set:
240
241   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
242     ``'m'``, or ``'k'``.
243
244   .. index:: single: - (minus); in regular expressions
245
246   * Ranges of characters can be indicated by giving two characters and separating
247     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
249     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
250     ``[a\-z]``) or if it's placed as the first or last character
251     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.
252
253   * Special characters lose their special meaning inside sets.  For example,
254     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
255     ``'*'``, or ``')'``.
256
257   .. index:: single: \ (backslash); in regular expressions
258
259   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
261     :const:`ASCII` or :const:`LOCALE` mode is in force.
262
263   .. index:: single: ^ (caret); in regular expressions
264
265   * Characters that are not within a range can be matched by :dfn:`complementing`
266     the set.  If the first character of the set is ``'^'``, all the characters
267     that are *not* in the set will be matched.  For example, ``[^5]`` will match
268     any character except ``'5'``, and ``[^^]`` will match any character except
269     ``'^'``.  ``^`` has no special meaning if it's not the first character in
270     the set.
271
272   * To match a literal ``']'`` inside a set, precede it with a backslash, or
273     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
274     ``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
275     and parentheses.
276
277   .. .. index:: single: --; in regular expressions
278   .. .. index:: single: &&; in regular expressions
279   .. .. index:: single: ~~; in regular expressions
280   .. .. index:: single: ||; in regular expressions
281
   * Support for nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future.  This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``.  To
     avoid a warning, escape them with a backslash.
289
290   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
291
292   .. versionchanged:: 3.7
293      :exc:`FutureWarning` is raised if a character set contains constructs
294      that will change semantically in the future.
295
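The set rules listed above can be tried out with a few calls. The input strings below are arbitrary samples chosen for this sketch:

```python
import re

# Ranges: two-digit numbers from 00 to 59.
minutes = re.findall(r'[0-5][0-9]', '07 59 60 99')

# Special characters are literal inside a set.
operators = re.findall(r'[(+*)]', '(1+2)*3')

# An escaped '-' is a literal hyphen rather than a range operator.
hyphens = re.findall(r'[a\-z]', 'a-m-z')

# A leading '^' complements the set.
not_five = re.findall(r'[^5]', '1525')

# An escaped ']' is a literal right bracket.
brackets = re.findall(r'[()[\]{}]', 'f(x)[0]')
```
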
296.. index:: single: | (vertical bar); in regular expressions
297
298``|``
299   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
300   will match either *A* or *B*.  An arbitrary number of REs can be separated by the
301   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
302   the target string is scanned, REs separated by ``'|'`` are tried from left to
303   right. When one pattern completely matches, that branch is accepted. This means
304   that once *A* matches, *B* will not be tested further, even if it would
305   produce a longer overall match.  In other words, the ``'|'`` operator is never
306   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
307   character class, as in ``[|]``.
308
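A short sketch of the left-to-right behaviour of ``'|'``: the first alternative that matches wins, even when a later one would match more text.

```python
import re

# 'foo' is tried first and succeeds, so 'foobar' is never attempted.
first_wins = re.search(r'foo|foobar', 'foobar').group()

# Listing the longer alternative first changes the result.
longer_first = re.search(r'foobar|foo', 'foobar').group()

# Two ways to match a literal vertical bar.
escaped_bar = re.search(r'\|', 'a|b').group()
class_bar = re.search(r'[|]', 'a|b').group()
```
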
309.. index::
310   single: () (parentheses); in regular expressions
311
312``(...)``
313   Matches whatever regular expression is inside the parentheses, and indicates the
314   start and end of a group; the contents of a group can be retrieved after a match
315   has been performed, and can be matched later in the string with the ``\number``
316   special sequence, described below.  To match the literals ``'('`` or ``')'``,
317   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
318
319.. index:: single: (?; in regular expressions
320
321``(?...)``
322   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
323   otherwise).  The first character after the ``'?'`` determines what the meaning
324   and further syntax of the construct is. Extensions usually do not create a new
325   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
326   currently supported extensions.
327
328``(?aiLmsux)``
329   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
330   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
331   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
332   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
333   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
334   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
335   for the entire regular expression.
336   (The flags are described in :ref:`contents-of-module-re`.)
337   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flags* argument to the
339   :func:`re.compile` function.  Flags should be used first in the
340   expression string.
341
342   .. versionchanged:: 3.11
343      This construction can only be used at the start of the expression.
344
345.. index:: single: (?:; in regular expressions
346
347``(?:...)``
348   A non-capturing version of regular parentheses.  Matches whatever regular
349   expression is inside the parentheses, but the substring matched by the group
350   *cannot* be retrieved after performing a match or referenced later in the
351   pattern.
352
353``(?aiLmsux-imsx:...)``
354   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
355   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
356   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
357   The letters set or remove the corresponding flags:
358   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
359   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
360   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for that part of the expression.
362   (The flags are described in :ref:`contents-of-module-re`.)
363
364   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
365   as inline flags, so they can't be combined or follow ``'-'``.  Instead,
366   when one of them appears in an inline group, it overrides the matching mode
367   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
368   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default).  In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
371   This override is only in effect for the narrow inline group, and the
372   original matching mode is restored outside of the group.
373
374   .. versionadded:: 3.6
375
376   .. versionchanged:: 3.7
377      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
378
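The difference between a global inline flag and a flag scoped to a group can be sketched like this (the sample text is arbitrary):

```python
import re

text = 'ab Ab aB AB'

# (?i) at the start applies to the whole pattern.
global_flag = re.findall(r'(?i)ab', text)

# (?i:...) applies only inside the group, so 'b' stays case-sensitive.
scoped = re.findall(r'(?i:a)b', text)

# (?-i:...) switches the flag off again for part of the pattern.
switched_off = re.findall(r'(?i)a(?-i:b)', text)
```
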
379``(?>...)``
   Attempts to match ``...`` as if it were a separate regular expression, and
381   if successful, continues to match the rest of the pattern following it.
382   If the subsequent pattern fails to match, the stack can only be unwound
383   to a point *before* the ``(?>...)`` because once exited, the expression,
384   known as an :dfn:`atomic group`, has thrown away all stack points within
385   itself.
386   Thus, ``(?>.*).`` would never match anything because first the ``.*``
387   would match all characters possible, then, having nothing left to match,
388   the final ``.`` would fail to match.
   Since there are no stack points saved in the atomic group, and there is
390   no stack point before it, the entire expression would thus fail to match.
391
392   .. versionadded:: 3.11
393
394.. index:: single: (?P<; in regular expressions
395
396``(?P<name>...)``
397   Similar to regular parentheses, but the substring matched by the group is
398   accessible via the symbolic group name *name*.  Group names must be valid
399   Python identifiers, and each group name must be defined only once within a
400   regular expression.  A symbolic group is also a numbered group, just as if
401   the group were not named.
402
403   Named groups can be referenced in three contexts.  If the pattern is
404   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
405   single or double quotes):
406
407   +---------------------------------------+----------------------------------+
408   | Context of reference to group "quote" | Ways to reference it             |
409   +=======================================+==================================+
410   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
411   |                                       | * ``\1``                         |
412   +---------------------------------------+----------------------------------+
413   | when processing match object *m*      | * ``m.group('quote')``           |
414   |                                       | * ``m.end('quote')`` (etc.)      |
415   +---------------------------------------+----------------------------------+
416   | in a string passed to the *repl*      | * ``\g<quote>``                  |
417   | argument of ``re.sub()``              | * ``\g<1>``                      |
418   |                                       | * ``\1``                         |
419   +---------------------------------------+----------------------------------+
420
421   .. deprecated:: 3.11
422      Group *name* containing characters outside the ASCII range
423      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` patterns.
424
425.. index:: single: (?P=; in regular expressions
426
427``(?P=name)``
428   A backreference to a named group; it matches whatever text was matched by the
429   earlier group named *name*.
430
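The three reference contexts from the table above can be sketched with the same ``quote`` pattern. The sample sentence and the ``...`` replacement text are made up for this illustration:

```python
import re

quoted = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')
text = '''say "it's fine" now'''

# In the pattern itself, (?P=quote) must repeat the opening quote character.
m = quoted.search(text)
whole = m.group()

# On the match object, the group is available by name or by number.
by_name = m.group('quote')
by_number = m.group(1)
quote_end = m.end('quote')

# In a replacement string, \g<quote> refers to the same group.
redacted = quoted.sub(r'\g<quote>...\g<quote>', text)
```
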
431.. index:: single: (?#; in regular expressions
432
433``(?#...)``
434   A comment; the contents of the parentheses are simply ignored.
435
436.. index:: single: (?=; in regular expressions
437
438``(?=...)``
439   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
440   called a :dfn:`lookahead assertion`.  For example, ``Isaac (?=Asimov)`` will match
441   ``'Isaac '`` only if it's followed by ``'Asimov'``.
442
443.. index:: single: (?!; in regular expressions
444
445``(?!...)``
446   Matches if ``...`` doesn't match next.  This is a :dfn:`negative lookahead assertion`.
447   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
448   followed by ``'Asimov'``.
449
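A minimal sketch of both lookahead forms, using the examples above plus one extra sample name:

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not consumed.
pos = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')

# Negative lookahead: the match succeeds only if 'Asimov' does not follow.
neg_blocked = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
neg_allowed = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
```
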
450.. index:: single: (?<=; in regular expressions
451
452``(?<=...)``
453   Matches if the current position in the string is preceded by a match for ``...``
454   that ends at the current position.  This is called a :dfn:`positive lookbehind
455   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
456   lookbehind will back up 3 characters and check if the contained pattern matches.
457   The contained pattern must only match strings of some fixed length, meaning that
458   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
459   patterns which start with positive lookbehind assertions will not match at the
460   beginning of the string being searched; you will most likely want to use the
461   :func:`search` function rather than the :func:`match` function:
462
463      >>> import re
464      >>> m = re.search('(?<=abc)def', 'abcdef')
465      >>> m.group(0)
466      'def'
467
468   This example looks for a word following a hyphen:
469
470      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
471      >>> m.group(0)
472      'egg'
473
474   .. versionchanged:: 3.5
475      Added support for group references of fixed length.
476
477.. index:: single: (?<!; in regular expressions
478
479``(?<!...)``
480   Matches if the current position in the string is not preceded by a match for
481   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
482   positive lookbehind assertions, the contained pattern must only match strings of
483   some fixed length.  Patterns which start with negative lookbehind assertions may
484   match at the beginning of the string being searched.
485
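As an illustrative sketch, a negative lookbehind can skip words that directly follow a hyphen, and unlike the positive form it can succeed at the very start of the string:

```python
import re

# Words at a word boundary that are not preceded by '-'.
words = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')

# Nothing precedes position 0, so the assertion holds there.
at_start = re.match(r'(?<!-)spam', 'spam-egg')
```
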
486.. _re-conditional-expression:
487.. index:: single: (?(; in regular expressions
488
489``(?(id/name)yes-pattern|no-pattern)``
490   Will try to match with ``yes-pattern`` if the group with given *id* or
491   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
492   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
496
497   .. deprecated:: 3.11
498      Group *id* containing anything except ASCII digits.
499      Group *name* containing characters outside the ASCII range
500      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` replacement strings.
501
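The conditional pattern above can be exercised as follows. The address ``user@host.com`` is a placeholder chosen for this sketch:

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

# Group 1 matched '<', so the yes-pattern '>' is required.
bracketed = email.match('<user@host.com>')

# Group 1 did not match, so the no-pattern '$' is required.
bare = email.match('user@host.com')

# Unbalanced brackets satisfy neither branch.
missing_close = email.match('<user@host.com')
stray_close = email.match('user@host.com>')
```
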
502
503The special sequences consist of ``'\'`` and a character from the list below.
504If the ordinary character is not an ASCII digit or an ASCII letter, then the
505resulting RE will match the second character.  For example, ``\$`` matches the
506character ``'$'``.
507
508.. index:: single: \ (backslash); in regular expressions
509
510``\number``
511   Matches the contents of the group of the same number.  Groups are numbered
512   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
513   but not ``'thethe'`` (note the space after the group).  This special sequence
514   can only be used to match one of the first 99 groups.  If the first digit of
515   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
516   a group match, but as the character with octal value *number*. Inside the
517   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
518   characters.
519
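A short sketch of a numbered backreference, using the sample strings mentioned above:

```python
import re

doubled = re.compile(r'(.+) \1')

words = doubled.search('the the')
digits = doubled.search('55 55')
no_space = doubled.search('thethe')   # the pattern requires the space
```
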
520.. index:: single: \A; in regular expressions
521
522``\A``
523   Matches only at the start of the string.
524
525.. index:: single: \b; in regular expressions
526
527``\b``
528   Matches the empty string, but only at the beginning or end of a word.
529   A word is defined as a sequence of word characters.  Note that formally,
530   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
531   (or vice versa), or between ``\w`` and the beginning/end of the string.
532   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
533   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
534
   In Unicode patterns, word characters are by default Unicode alphanumerics
   and the underscore, but this can be changed by using the :const:`ASCII`
   flag.  Word boundaries are determined by the current locale if the
   :const:`LOCALE` flag is used.
   Inside a character class, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.
540
541.. index:: single: \B; in regular expressions
542
543``\B``
544   Matches the empty string, but only when it is *not* at the beginning or end
545   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
546   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
547   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
548   patterns are Unicode alphanumerics or the underscore, although this can
549   be changed by using the :const:`ASCII` flag.  Word boundaries are
550   determined by the current locale if the :const:`LOCALE` flag is used.
551
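The ``\b`` and ``\B`` examples above can be checked in bulk; this sketch simply records which sample strings contain a match:

```python
import re

foo_samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
foo_hits = [bool(re.search(r'\bfoo\b', s)) for s in foo_samples]

py_samples = ['python', 'py3', 'py2', 'py', 'py.', 'py!']
py_hits = [bool(re.search(r'py\B', s)) for s in py_samples]

# Inside a character class, \b is the backspace character instead.
backspace = re.fullmatch(r'[\b]', '\x08')
```
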
552.. index:: single: \d; in regular expressions
553
554``\d``
555   For Unicode (str) patterns:
556      Matches any Unicode decimal digit (that is, any character in
557      Unicode character category [Nd]).  This includes ``[0-9]``, and
      also many other digit characters.  If the :const:`ASCII` flag is
      used, only ``[0-9]`` is matched.
560
561   For 8-bit (bytes) patterns:
562      Matches any decimal digit; this is equivalent to ``[0-9]``.
563
564.. index:: single: \D; in regular expressions
565
566``\D``
567   Matches any character which is not a decimal digit. This is
568   the opposite of ``\d``. If the :const:`ASCII` flag is used this
569   becomes the equivalent of ``[^0-9]``.
570
571.. index:: single: \s; in regular expressions
572
573``\s``
574   For Unicode (str) patterns:
575      Matches Unicode whitespace characters (which includes
576      ``[ \t\n\r\f\v]``, and also many other characters, for example the
577      non-breaking spaces mandated by typography rules in many
578      languages). If the :const:`ASCII` flag is used, only
579      ``[ \t\n\r\f\v]`` is matched.
580
581   For 8-bit (bytes) patterns:
582      Matches characters considered whitespace in the ASCII character set;
583      this is equivalent to ``[ \t\n\r\f\v]``.
584
585.. index:: single: \S; in regular expressions
586
587``\S``
588   Matches any character which is not a whitespace character. This is
589   the opposite of ``\s``. If the :const:`ASCII` flag is used this
590   becomes the equivalent of ``[^ \t\n\r\f\v]``.
591
592.. index:: single: \w; in regular expressions
593
594``\w``
595   For Unicode (str) patterns:
      Matches Unicode word characters; this includes alphanumeric characters
      (as defined by :meth:`str.isalnum`) as well as the underscore (``_``).
      If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
599
600   For 8-bit (bytes) patterns:
601      Matches characters considered alphanumeric in the ASCII character set;
602      this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
603      used, matches characters considered alphanumeric in the current locale
604      and the underscore.
605
606.. index:: single: \W; in regular expressions
607
608``\W``
609   Matches any character which is not a word character. This is
610   the opposite of ``\w``. If the :const:`ASCII` flag is used this
611   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
612   used, matches characters which are neither alphanumeric in the current locale
613   nor the underscore.
614
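The effect of the :const:`ASCII` flag on ``\d``, ``\s`` and ``\w`` in Unicode patterns can be sketched as follows. The sample characters (an Arabic-Indic digit, a fullwidth digit, a no-break space and accented letters) are written as escapes and were picked only for illustration:

```python
import re

digits = '1\u0662\uff13'          # ASCII, Arabic-Indic and fullwidth digits
unicode_digits = re.findall(r'\d', digits)
ascii_digits = re.findall(r'\d', digits, re.ASCII)

spaced = 'a\u00a0b'               # no-break space between the letters
unicode_split = re.split(r'\s+', spaced)
ascii_split = re.split(r'\s+', spaced, flags=re.ASCII)

words = 'na\u00efve caf\u00e9'
unicode_words = re.findall(r'\w+', words)
ascii_words = re.findall(r'\w+', words, re.ASCII)
```
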
615.. index:: single: \Z; in regular expressions
616
617``\Z``
618   Matches only at the end of the string.
619
620.. index::
621   single: \a; in regular expressions
622   single: \b; in regular expressions
623   single: \f; in regular expressions
624   single: \n; in regular expressions
625   single: \N; in regular expressions
626   single: \r; in regular expressions
627   single: \t; in regular expressions
628   single: \u; in regular expressions
629   single: \U; in regular expressions
630   single: \v; in regular expressions
631   single: \x; in regular expressions
632   single: \\; in regular expressions
633
634Most of the standard escapes supported by Python string literals are also
635accepted by the regular expression parser::
636
637   \a      \b      \f      \n
638   \N      \r      \t      \u
639   \U      \v      \x      \\
640
641(Note that ``\b`` is used to represent word boundaries, and means "backspace"
642only inside character classes.)
643
644``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
645patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
646letters are reserved for future use and treated as errors.
647
648Octal escapes are included in a limited form.  If the first digit is a 0, or if
649there are three octal digits, it is considered an octal escape. Otherwise, it is
650a group reference.  As for string literals, octal escapes are always at most
651three digits in length.
652
653.. versionchanged:: 3.3
654   The ``'\u'`` and ``'\U'`` escape sequences have been added.
655
656.. versionchanged:: 3.6
657   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.
658
659.. versionchanged:: 3.8
660   The ``'\N{name}'`` escape sequence has been added. As in string literals,
661   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).
662
663
664.. _contents-of-module-re:
665
666Module Contents
667---------------
668
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled
form.
673
674
675Flags
676^^^^^
677
678.. versionchanged:: 3.6
679   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
680   :class:`enum.IntFlag`.
681
682
683.. class:: RegexFlag
684
685   An :class:`enum.IntFlag` class containing the regex options listed below.
686
   .. versionadded:: 3.11
      Added to ``__all__``.
688
689.. data:: A
690          ASCII
691
692   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
693   perform ASCII-only matching instead of full Unicode matching.  This is only
694   meaningful for Unicode patterns, and is ignored for byte patterns.
695   Corresponds to the inline flag ``(?a)``.
696
697   Note that for backward compatibility, the :const:`re.U` flag still
698   exists (as well as its synonym :const:`re.UNICODE` and its embedded
699   counterpart ``(?u)``), but these are redundant in Python 3 since
700   matches are Unicode by default for strings (and Unicode matching
701   isn't allowed for bytes).
702
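   A minimal sketch of the difference between Unicode and ASCII-only
   matching, using a sample string chosen only for illustration::

      import re

      text = 'caf\u00e9 na\u00efve'   # 'café naïve'

      # Default for str patterns: \w follows Unicode word characters.
      unicode_words = re.findall(r'\w+', text)

      # With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented
      # letters split the words.
      ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

      print(unicode_words)   # ['café', 'naïve']
      print(ascii_words)     # ['caf', 'na', 've']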
703
704.. data:: DEBUG
705
   Display debug information about the compiled expression.
707   No corresponding inline flag.
708
709
710.. data:: I
711          IGNORECASE
712
713   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
714   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
715   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
716   non-ASCII matches.  The current locale does not change the effect of this
717   flag unless the :const:`re.LOCALE` flag is also used.
718   Corresponds to the inline flag ``(?i)``.
719
720   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
721   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
722   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
723   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
724   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
725   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
726   and 'A' to 'Z' are matched.
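
   The following sketch, with variable names chosen only for illustration,
   shows one of the non-ASCII matches described above and how
   :const:`ASCII` suppresses it::

      import re

      kelvin = '\u212a'   # KELVIN SIGN, which lowercases to 'k'

      # Unicode case-insensitive matching: the Kelvin sign matches [a-z].
      unicode_match = re.fullmatch(r'[a-z]', kelvin, flags=re.IGNORECASE)

      # With re.ASCII only 'a'-'z' and 'A'-'Z' can match.
      ascii_match = re.fullmatch(r'[a-z]', kelvin,
                                 flags=re.IGNORECASE | re.ASCII)

      # A non-ASCII literal also relies on Unicode matching to ignore case.
      umlaut_match = re.fullmatch('\u00fc', '\u00dc', flags=re.IGNORECASE)

      print(unicode_match is not None)   # True
      print(ascii_match is not None)     # False
      print(umlaut_match is not None)    # True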
727
728.. data:: L
729          LOCALE
730
731   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
732   dependent on the current locale.  This flag can be used only with bytes
   patterns.  The use of this flag is discouraged because the locale mechanism
   is very unreliable: it handles only one "culture" at a time and works only
   with 8-bit locales.  Unicode matching is already enabled by default
736   in Python 3 for Unicode (str) patterns, and it is able to handle different
737   locales/languages.
738   Corresponds to the inline flag ``(?L)``.
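
   As a small sketch of the restrictions on this flag (the variable names are
   illustrative), the flag is accepted for a bytes pattern, while combining
   it with a str pattern or with :const:`ASCII` raises :exc:`ValueError`::

      import re

      # Accepted: LOCALE with a bytes pattern.
      locale_pattern = re.compile(rb'\w+', re.LOCALE)

      rejected = []
      for pattern, flags in [(r'\w+', re.LOCALE),
                             (rb'\w+', re.LOCALE | re.ASCII)]:
          try:
              re.compile(pattern, flags)
          except ValueError as exc:
              rejected.append(str(exc))

      print(len(rejected))   # 2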
739
740   .. versionchanged:: 3.6
741      :const:`re.LOCALE` can be used only with bytes patterns and is
742      not compatible with :const:`re.ASCII`.
743
744   .. versionchanged:: 3.7
745      Compiled regular expression objects with the :const:`re.LOCALE` flag no
746      longer depend on the locale at compile time.  Only the locale at
747      matching time affects the result of matching.
748
749
750.. data:: M
751          MULTILINE
752
753   When specified, the pattern character ``'^'`` matches at the beginning of the
754   string and at the beginning of each line (immediately following each newline);
755   and the pattern character ``'$'`` matches at the end of the string and at the
756   end of each line (immediately preceding each newline).  By default, ``'^'``
757   matches only at the beginning of the string, and ``'$'`` only at the end of the
758   string and immediately before the newline (if any) at the end of the string.
759   Corresponds to the inline flag ``(?m)``.
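
   A brief sketch of the difference, using a made-up three-line string::

      import re

      text = 'one\ntwo\nthree'

      # Default: '^' and '$' anchor only at the ends of the whole string.
      first_default = re.findall(r'^\w+', text)
      last_default = re.findall(r'\w+$', text)

      # MULTILINE: they also anchor at the start and end of every line.
      first_multiline = re.findall(r'^\w+', text, flags=re.MULTILINE)
      last_multiline = re.findall(r'\w+$', text, flags=re.MULTILINE)

      print(first_default, last_default)   # ['one'] ['three']
      print(first_multiline)               # ['one', 'two', 'three']
      print(last_multiline)                # ['one', 'two', 'three']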
760
761.. data:: NOFLAG
762
   Indicates that no flag is applied; the value is ``0``.  This flag may be used
764   as a default value for a function keyword argument or as a base value that
765   will be conditionally ORed with other flags.  Example of use as a default
766   value::
767
      def myfunc(pattern, text, flag=re.NOFLAG):
          return re.match(pattern, text, flag)
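
   The "base value" use might look like the following sketch, in which the
   function and parameter names are hypothetical::

      import re

      def find_word(word, text, ignore_case=False, multiline=False):
          # Start from "no flags" and OR in only the requested ones.
          flags = re.NOFLAG
          if ignore_case:
              flags |= re.IGNORECASE
          if multiline:
              flags |= re.MULTILINE
          return re.search(rf'^{re.escape(word)}\b', text, flags)

      hit = find_word('spam', 'eggs\nSPAM and ham',
                      ignore_case=True, multiline=True)
      miss = find_word('spam', 'eggs\nSPAM and ham')
      print(hit is not None, miss is None)   # True True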
770
771   .. versionadded:: 3.11
772
773.. data:: S
774          DOTALL
775
776   Make the ``'.'`` special character match any character at all, including a
777   newline; without this flag, ``'.'`` will match anything *except* a newline.
778   Corresponds to the inline flag ``(?s)``.
779
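   For instance, in this small sketch with an invented two-line string::

      import re

      text = 'first\nsecond'

      # Without DOTALL, '.' stops at the newline.
      default_match = re.match(r'.+', text)

      # With DOTALL, '.' also consumes the newline.
      dotall_match = re.match(r'.+', text, flags=re.DOTALL)

      print(repr(default_match.group()))   # 'first'
      print(repr(dotall_match.group()))    # 'first\nsecond'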
780
781.. data:: X
782          VERBOSE
783
784   .. index:: single: # (hash); in regular expressions
785
786   This flag allows you to write regular expressions that look nicer and are
787   more readable by allowing you to visually separate logical sections of the
788   pattern and add comments. Whitespace within the pattern is ignored, except
789   when in a character class, or when preceded by an unescaped backslash,
790   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``. For example, ``(? :``
791   and ``* ?`` are not allowed.
792   When a line contains a ``#`` that is not in a character class and is not
793   preceded by an unescaped backslash, all characters from the leftmost such
794   ``#`` through the end of the line are ignored.
795
796   This means that the two following regular expression objects that match a
797   decimal number are functionally equal::
798
799      a = re.compile(r"""\d +  # the integral part
800                         \.    # the decimal point
801                         \d *  # some fractional digits""", re.X)
802      b = re.compile(r"\d+\.\d*")
803
804   Corresponds to the inline flag ``(?x)``.
805
806
807Functions
808^^^^^^^^^
809
810.. function:: compile(pattern, flags=0)
811
812   Compile a regular expression pattern into a :ref:`regular expression object
813   <re-objects>`, which can be used for matching using its
814   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
815   below.
816
   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR (the
   ``|`` operator).
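
   For example, this sketch (with an invented sample string) combines two
   flags so that the match is both case-insensitive and anchored per line::

      import re

      prog = re.compile(r'^total:\s*(\d+)$', re.IGNORECASE | re.MULTILINE)
      result = prog.search('items: 3\nTOTAL: 42\n')
      print(result.group(1))   # 42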
820
821   The sequence ::
822
823      prog = re.compile(pattern)
824      result = prog.match(string)
825
826   is equivalent to ::
827
828      result = re.match(pattern, string)
829
830   but using :func:`re.compile` and saving the resulting regular expression
831   object for reuse is more efficient when the expression will be used several
832   times in a single program.
833
834   .. note::
835
836      The compiled versions of the most recent patterns passed to
837      :func:`re.compile` and the module-level matching functions are cached, so
838      programs that use only a few regular expressions at a time needn't worry
839      about compiling regular expressions.
840
841
842.. function:: search(pattern, string, flags=0)
843
844   Scan through *string* looking for the first location where the regular expression
845   *pattern* produces a match, and return a corresponding :ref:`match object
846   <match-objects>`.  Return ``None`` if no position in the string matches the
847   pattern; note that this is different from finding a zero-length match at some
848   point in the string.
849
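   The distinction between "no match" and a zero-length match can be seen in
   this sketch (the sample string is arbitrary)::

      import re

      # 'x*' can match the empty string, so a match object is returned.
      empty_match = re.search(r'x*', 'abc')

      # 'x+' needs at least one 'x', so there is no match at all.
      no_match = re.search(r'x+', 'abc')

      print(empty_match.span())   # (0, 0)
      print(no_match)             # None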
850
851.. function:: match(pattern, string, flags=0)
852
853   If zero or more characters at the beginning of *string* match the regular
854   expression *pattern*, return a corresponding :ref:`match object
855   <match-objects>`.  Return ``None`` if the string does not match the pattern;
856   note that this is different from a zero-length match.
857
858   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
859   at the beginning of the string and not at the beginning of each line.
860
861   If you want to locate a match anywhere in *string*, use :func:`search`
862   instead (see also :ref:`search-vs-match`).
863
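   As a sketch of the :const:`MULTILINE` caveat, with an invented two-line
   string::

      import re

      text = 'alpha\nbeta'

      # match() is tried only at index 0, even in MULTILINE mode.
      at_start = re.match(r'^beta', text, flags=re.MULTILINE)

      # search() may succeed at the beginning of the second line.
      anywhere = re.search(r'^beta', text, flags=re.MULTILINE)

      print(at_start)          # None
      print(anywhere.span())   # (6, 10)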
864
865.. function:: fullmatch(pattern, string, flags=0)
866
867   If the whole *string* matches the regular expression *pattern*, return a
868   corresponding :ref:`match object <match-objects>`.  Return ``None`` if the
869   string does not match the pattern; note that this is different from a
870   zero-length match.
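
   A short sketch contrasting this with :func:`match`, using arbitrary
   sample strings::

      import re

      # match() only needs a match at the beginning of the string ...
      prefix_match = re.match(r'\d+', '123abc')

      # ... whereas fullmatch() requires the pattern to cover all of it.
      partial = re.fullmatch(r'\d+', '123abc')
      whole = re.fullmatch(r'\d+', '123')

      print(prefix_match.group())   # 123
      print(partial)                # None
      print(whole.group())          # 123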
871
872   .. versionadded:: 3.4
873
874
875.. function:: split(pattern, string, maxsplit=0, flags=0)
876
877   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
879   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
880   splits occur, and the remainder of the string is returned as the final element
881   of the list. ::
882
883      >>> re.split(r'\W+', 'Words, words, words.')
884      ['Words', 'words', 'words', '']
885      >>> re.split(r'(\W+)', 'Words, words, words.')
886      ['Words', ', ', 'words', ', ', 'words', '.', '']
887      >>> re.split(r'\W+', 'Words, words, words.', 1)
888      ['Words', 'words, words.']
889      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
890      ['0', '3', '9']
891
892   If there are capturing groups in the separator and it matches at the start of
893   the string, the result will start with an empty string.  The same holds for
894   the end of the string::
895
896      >>> re.split(r'(\W+)', '...words, words...')
897      ['', '...', 'words', ', ', 'words', '...', '']
898
899   That way, separator components are always found at the same relative
900   indices within the result list.
901
902   Empty matches for the pattern split the string only when not adjacent
903   to a previous empty match.
904
905      >>> re.split(r'\b', 'Words, words, words.')
906      ['', 'Words', ', ', 'words', ', ', 'words', '.']
907      >>> re.split(r'\W*', '...words...')
908      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
909      >>> re.split(r'(\W*)', '...words...')
910      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
911
912   .. versionchanged:: 3.1
913      Added the optional flags argument.
914
915   .. versionchanged:: 3.7
916      Added support of splitting on a pattern that could match an empty string.
917
918
919.. function:: findall(pattern, string, flags=0)
920
921   Return all non-overlapping matches of *pattern* in *string*, as a list of
922   strings or tuples.  The *string* is scanned left-to-right, and matches
923   are returned in the order found.  Empty matches are included in the result.
924
925   The result depends on the number of capturing groups in the pattern.
926   If there are no groups, return a list of strings matching the whole
927   pattern.  If there is exactly one group, return a list of strings
928   matching that group.  If multiple groups are present, return a list
929   of tuples of strings matching the groups.  Non-capturing groups do not
930   affect the form of the result.
931
932      >>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
933      ['foot', 'fell', 'fastest']
934      >>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
935      [('width', '20'), ('height', '10')]
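
   The single-group and non-capturing cases can be sketched as follows,
   reusing the sample string from the example above::

      import re

      text = 'set width=20 and height=10'

      # Exactly one capturing group: a list of that group's strings.
      names = re.findall(r'(\w+)=\d+', text)

      # A non-capturing group does not change the form of the result.
      pairs = re.findall(r'(?:\w+)=\d+', text)

      print(names)   # ['width', 'height']
      print(pairs)   # ['width=20', 'height=10']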
936
937   .. versionchanged:: 3.7
938      Non-empty matches can now start just after a previous empty match.
939
940
941.. function:: finditer(pattern, string, flags=0)
942
943   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
944   all non-overlapping matches for the RE *pattern* in *string*.  The *string*
945   is scanned left-to-right, and matches are returned in the order found.  Empty
946   matches are included in the result.
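
   Because match objects are yielded, positions are available as well as the
   matched text, as in this sketch with an arbitrary sample string::

      import re

      found = [(m.start(), m.group())
               for m in re.finditer(r'\d+', 'a1b22c333')]
      print(found)   # [(1, '1'), (3, '22'), (6, '333')]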
947
948   .. versionchanged:: 3.7
949      Non-empty matches can now start just after a previous empty match.
950
951
952.. function:: sub(pattern, repl, string, count=0, flags=0)
953
954   Return the string obtained by replacing the leftmost non-overlapping occurrences
955   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
956   *string* is returned unchanged.  *repl* can be a string or a function; if it is
957   a string, any backslash escapes in it are processed.  That is, ``\n`` is
958   converted to a single newline character, ``\r`` is converted to a carriage return, and
959   so forth.  Unknown escapes of ASCII letters are reserved for future use and
960   treated as errors.  Other unknown escapes such as ``\&`` are left alone.
961   Backreferences, such
962   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
963   For example::
964
965      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
966      ...        r'static PyObject*\npy_\1(void)\n{',
967      ...        'def myfunc():')
968      'static PyObject*\npy_myfunc(void)\n{'
969
970   If *repl* is a function, it is called for every non-overlapping occurrence of
971   *pattern*.  The function takes a single :ref:`match object <match-objects>`
972   argument, and returns the replacement string.  For example::
973
974      >>> def dashrepl(matchobj):
975      ...     if matchobj.group(0) == '-': return ' '
976      ...     else: return '-'
977      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
978      'pro--gram files'
979      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
980      'Baked Beans & Spam'
981
982   The pattern may be a string or a :ref:`pattern object <re-objects>`.
983
984   The optional argument *count* is the maximum number of pattern occurrences to be
985   replaced; *count* must be a non-negative integer.  If omitted or zero, all
986   occurrences will be replaced. Empty matches for the pattern are replaced only
987   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
988   ``'-a-b--d-'``.
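
   A sketch of both behaviours, with arbitrary sample strings::

      import re

      # count=2 replaces only the first two occurrences.
      limited = re.sub(r'\d', '#', 'a1b2c3', count=2)

      # Empty matches are replaced too, except right after another
      # empty match.
      with_empty = re.sub('x*', '-', 'abxd')

      print(limited)      # a#b#c3
      print(with_empty)   # -a-b--d-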
989
990   .. index:: single: \g; in regular expressions
991
992   In string-type *repl* arguments, in addition to the character escapes and
993   backreferences described above,
994   ``\g<name>`` will use the substring matched by the group named ``name``, as
995   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
996   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
997   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
998   reference to group 20, not a reference to group 2 followed by the literal
999   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
1000   substring matched by the RE.
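
   These forms might be used as in the following sketch; the group names and
   sample strings are made up for illustration::

      import re

      # Named groups referenced with \g<name>.
      swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)',
                       r'\g<last>, \g<first>', 'Isaac Newton')

      # \g<1>0 is group 1 followed by a literal '0'.
      times_ten = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

      # \g<0> inserts the whole match.
      bracketed = re.sub(r'\d+', r'[\g<0>]', 'room 101')

      print(swapped)     # Newton, Isaac
      print(times_ten)   # a10b20
      print(bracketed)   # room [101]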
1001
1002   .. versionchanged:: 3.1
1003      Added the optional flags argument.
1004
1005   .. versionchanged:: 3.5
1006      Unmatched groups are replaced with an empty string.
1007
1008   .. versionchanged:: 3.6
1009      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
1010      now are errors.
1011
1012   .. versionchanged:: 3.7
1013      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
1014      now are errors.
1015
1016   .. versionchanged:: 3.7
1017      Empty matches for the pattern are replaced when adjacent to a previous
1018      non-empty match.
1019
1020   .. deprecated:: 3.11
1021      Group *id* containing anything except ASCII digits.
1022      Group *name* containing characters outside the ASCII range
1023      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` replacement strings.
1024
1025
1026.. function:: subn(pattern, repl, string, count=0, flags=0)
1027
1028   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
1029   number_of_subs_made)``.
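
   For example, in this sketch with an arbitrary sample string::

      import re

      new_string, number_of_subs_made = re.subn(r'\d', '#', 'a1b2c3')
      print(new_string, number_of_subs_made)   # a#b#c# 3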
1030
1031   .. versionchanged:: 3.1
1032      Added the optional flags argument.
1033
1034   .. versionchanged:: 3.5
1035      Unmatched groups are replaced with an empty string.
1036
1037
1038.. function:: escape(pattern)
1039
1040   Escape special characters in *pattern*.
1041   This is useful if you want to match an arbitrary literal string that may
1042   have regular expression metacharacters in it.  For example::
1043
1044      >>> print(re.escape('https://www.python.org'))
1045      https://www\.python\.org
1046
1047      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
1048      >>> print('[%s]+' % re.escape(legal_chars))
1049      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
1050
1051      >>> operators = ['+', '-', '*', '/', '**']
1052      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
1053      /|\-|\+|\*\*|\*
1054
   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped there.  For example::
1057
1058      >>> digits_re = r'\d+'
1059      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
1060      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
1061      /usr/sbin/sendmail - \d+ errors, \d+ warnings
1062
1063   .. versionchanged:: 3.3
1064      The ``'_'`` character is no longer escaped.
1065
1066   .. versionchanged:: 3.7
1067      Only characters that can have special meaning in a regular expression
1068      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
1069      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
1070      ``"`"`` are no longer escaped.
1071
1072
1073.. function:: purge()
1074
1075   Clear the regular expression cache.
1076
1077
1078Exceptions
1079^^^^^^^^^^
1080
1081.. exception:: error(msg, pattern=None, pos=None)
1082
1083   Exception raised when a string passed to one of the functions here is not a
1084   valid regular expression (for example, it might contain unmatched parentheses)
1085   or when some other error occurs during compilation or matching.  It is never an
1086   error if a string contains no match for a pattern.  The error instance has
1087   the following additional attributes:
1088
1089   .. attribute:: msg
1090
1091      The unformatted error message.
1092
1093   .. attribute:: pattern
1094
1095      The regular expression pattern.
1096
1097   .. attribute:: pos
1098
1099      The index in *pattern* where compilation failed (may be ``None``).
1100
1101   .. attribute:: lineno
1102
1103      The line corresponding to *pos* (may be ``None``).
1104
1105   .. attribute:: colno
1106
1107      The column corresponding to *pos* (may be ``None``).
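
   A sketch of inspecting these attributes, using a deliberately broken
   pattern chosen for illustration::

      import re

      try:
          re.compile('(unclosed')
      except re.error as exc:
          caught = exc

      # The opening parenthesis at index 0 is never closed.
      print(caught.msg)
      print(caught.pattern)                            # (unclosed
      print(caught.pos, caught.lineno, caught.colno)   # 0 1 1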
1108
1109   .. versionchanged:: 3.5
1110      Added additional attributes.
1111
1112.. _re-objects:
1113
1114Regular Expression Objects
1115--------------------------
1116
1117Compiled regular expression objects support the following methods and
1118attributes:
1119
1120.. method:: Pattern.search(string[, pos[, endpos]])
1121
1122   Scan through *string* looking for the first location where this regular
1123   expression produces a match, and return a corresponding :ref:`match object
1124   <match-objects>`.  Return ``None`` if no position in the string matches the
1125   pattern; note that this is different from finding a zero-length match at some
1126   point in the string.
1127
1128   The optional second parameter *pos* gives an index in the string where the
1129   search is to start; it defaults to ``0``.  This is not completely equivalent to
1130   slicing the string; the ``'^'`` pattern character matches at the real beginning
1131   of the string and at positions just after a newline, but not necessarily at the
1132   index where the search is to start.
1133
1134   The optional parameter *endpos* limits how far the string will be searched; it
1135   will be as if the string is *endpos* characters long, so only the characters
1136   from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
1137   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
1138   expression object, ``rx.search(string, 0, 50)`` is equivalent to
1139   ``rx.search(string[:50], 0)``. ::
1140
1141      >>> pattern = re.compile("d")
1142      >>> pattern.search("dog")     # Match at index 0
1143      <re.Match object; span=(0, 1), match='d'>
1144      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
1145
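   The difference between *pos* and slicing can be sketched as follows (the
   pattern and string are arbitrary)::

      import re

      pattern = re.compile(r'^b')

      # Index 1 is not the real beginning of the string, so '^' fails there.
      with_pos = pattern.search('ab', 1)

      # After slicing, 'b' really is at the beginning of the new string.
      with_slice = pattern.search('ab'[1:])

      print(with_pos)            # None
      print(with_slice.span())   # (0, 1)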
1146
1147.. method:: Pattern.match(string[, pos[, endpos]])
1148
1149   If zero or more characters at the *beginning* of *string* match this regular
1150   expression, return a corresponding :ref:`match object <match-objects>`.
1151   Return ``None`` if the string does not match the pattern; note that this is
1152   different from a zero-length match.
1153
1154   The optional *pos* and *endpos* parameters have the same meaning as for the
1155   :meth:`~Pattern.search` method. ::
1156
1157      >>> pattern = re.compile("o")
1158      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
1159      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
1160      <re.Match object; span=(1, 2), match='o'>
1161
1162   If you want to locate a match anywhere in *string*, use
1163   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
1164
1165
1166.. method:: Pattern.fullmatch(string[, pos[, endpos]])
1167
1168   If the whole *string* matches this regular expression, return a corresponding
1169   :ref:`match object <match-objects>`.  Return ``None`` if the string does not
1170   match the pattern; note that this is different from a zero-length match.
1171
1172   The optional *pos* and *endpos* parameters have the same meaning as for the
1173   :meth:`~Pattern.search` method. ::
1174
1175      >>> pattern = re.compile("o[gh]")
1176      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
1177      >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
1178      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
1179      <re.Match object; span=(1, 3), match='og'>
1180
1181   .. versionadded:: 3.4
1182
1183
1184.. method:: Pattern.split(string, maxsplit=0)
1185
1186   Identical to the :func:`split` function, using the compiled pattern.
1187
1188
1189.. method:: Pattern.findall(string[, pos[, endpos]])
1190
1191   Similar to the :func:`findall` function, using the compiled pattern, but
1192   also accepts optional *pos* and *endpos* parameters that limit the search
   region, as for :meth:`~Pattern.search`.
1194
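   For instance, in this sketch with an arbitrary sample string::

      import re

      digit = re.compile(r'\d')

      everything = digit.findall('1a2b3c')
      # Only the characters at indices 2, 3 and 4 ('2b3') are searched.
      limited = digit.findall('1a2b3c', 2, 5)

      print(everything)   # ['1', '2', '3']
      print(limited)      # ['2', '3']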
1195
1196.. method:: Pattern.finditer(string[, pos[, endpos]])
1197
1198   Similar to the :func:`finditer` function, using the compiled pattern, but
1199   also accepts optional *pos* and *endpos* parameters that limit the search
   region, as for :meth:`~Pattern.search`.
1201
1202
1203.. method:: Pattern.sub(repl, string, count=0)
1204
1205   Identical to the :func:`sub` function, using the compiled pattern.
1206
1207
1208.. method:: Pattern.subn(repl, string, count=0)
1209
1210   Identical to the :func:`subn` function, using the compiled pattern.
1211
1212
1213.. attribute:: Pattern.flags
1214
1215   The regex matching flags.  This is a combination of the flags given to
1216   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
1217   flags such as :data:`UNICODE` if the pattern is a Unicode string.
1218
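   A sketch of the three sources of flags (the patterns are arbitrary)::

      import re

      explicit = re.compile('a', re.IGNORECASE)
      inline = re.compile('(?m)a')
      bytes_pattern = re.compile(b'a')

      print(bool(explicit.flags & re.IGNORECASE))    # True, passed to compile()
      print(bool(inline.flags & re.MULTILINE))       # True, from the inline flag
      print(bool(explicit.flags & re.UNICODE))       # True, implicit for str
      print(bool(bytes_pattern.flags & re.UNICODE))  # False for bytes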
1219
1220.. attribute:: Pattern.groups
1221
1222   The number of capturing groups in the pattern.
1223
1224
1225.. attribute:: Pattern.groupindex
1226
   A dictionary mapping any symbolic group names defined by ``(?P<id>...)`` to group
1228   numbers.  The dictionary is empty if no symbolic groups were used in the
1229   pattern.
1230
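   For example, in this sketch with a made-up date pattern, two of the three
   capturing groups are named::

      import re

      date = re.compile(r'(?P<year>\d{4})-(\d{2})-(?P<day>\d{2})')

      print(date.groups)             # 3
      print(dict(date.groupindex))   # {'year': 1, 'day': 3}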
1231
1232.. attribute:: Pattern.pattern
1233
1234   The pattern string from which the pattern object was compiled.
1235
1236
1237.. versionchanged:: 3.7
1238   Added support of :func:`copy.copy` and :func:`copy.deepcopy`.  Compiled
1239   regular expression objects are considered atomic.
1240
1241
1242.. _match-objects:
1243
1244Match Objects
1245-------------
1246
1247Match objects always have a boolean value of ``True``.
1248Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
1249when there is no match, you can test whether there was a match with a simple
1250``if`` statement::
1251
1252   match = re.search(pattern, string)
1253   if match:
1254       process(match)
1255
1256Match objects support the following methods and attributes:
1257
1258
1259.. method:: Match.expand(template)
1260
1261   Return the string obtained by doing backslash substitution on the template
1262   string *template*, as done by the :meth:`~Pattern.sub` method.
1263   Escapes such as ``\n`` are converted to the appropriate characters,
1264   and numeric backreferences (``\1``, ``\2``) and named backreferences
1265   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
1266   corresponding group.
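
   For example, in this sketch with a made-up group name::

      import re

      m = re.match(r'(?P<first>\w+) (\w+)', 'Isaac Newton')
      expanded = m.expand(r'\2, \g<first>\n')
      print(repr(expanded))   # 'Newton, Isaac\n'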
1267
1268   .. versionchanged:: 3.5
1269      Unmatched groups are replaced with an empty string.
1270
1271.. method:: Match.group([group1, ...])
1272
1273   Returns one or more subgroups of the match.  If there is a single argument, the
1274   result is a single string; if there are multiple arguments, the result is a
1275   tuple with one item per argument. Without arguments, *group1* defaults to zero
1276   (the whole match is returned). If a *groupN* argument is zero, the corresponding
1277   return value is the entire matching string; if it is in the inclusive range
1278   [1..99], it is the string matching the corresponding parenthesized group.  If a
1279   group number is negative or larger than the number of groups defined in the
1280   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
1281   part of the pattern that did not match, the corresponding result is ``None``.
1282   If a group is contained in a part of the pattern that matched multiple times,
1283   the last match is returned. ::
1284
1285      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
1286      >>> m.group(0)       # The entire match
1287      'Isaac Newton'
1288      >>> m.group(1)       # The first parenthesized subgroup.
1289      'Isaac'
1290      >>> m.group(2)       # The second parenthesized subgroup.
1291      'Newton'
1292      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
1293      ('Isaac', 'Newton')
1294
1295   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
1296   arguments may also be strings identifying groups by their group name.  If a
1297   string argument is not used as a group name in the pattern, an :exc:`IndexError`
1298   exception is raised.
1299
1300   A moderately complicated example::
1301
1302      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
1303      >>> m.group('first_name')
1304      'Malcolm'
1305      >>> m.group('last_name')
1306      'Reynolds'
1307
1308   Named groups can also be referred to by their index::
1309
1310      >>> m.group(1)
1311      'Malcolm'
1312      >>> m.group(2)
1313      'Reynolds'
1314
1315   If a group matches multiple times, only the last match is accessible::
1316
1317      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
1318      >>> m.group(1)                        # Returns only the last match.
1319      'c3'
1320
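   The ``None`` and :exc:`IndexError` cases can be sketched as follows (the
   pattern is arbitrary)::

      import re

      m = re.match(r'(a)|(b)', 'a')

      # Group 2 exists but sits in the alternative that did not match.
      unmatched = m.group(2)
      print(unmatched)   # None

      # Numbers beyond the defined groups, and unknown names, raise IndexError.
      failures = 0
      for bad in (3, 'missing'):
          try:
              m.group(bad)
          except IndexError:
              failures += 1
      print(failures)   # 2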
1321
1322.. method:: Match.__getitem__(g)
1323
1324   This is identical to ``m.group(g)``.  This allows easier access to
1325   an individual group from a match::
1326
1327      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
1328      >>> m[0]       # The entire match
1329      'Isaac Newton'
1330      >>> m[1]       # The first parenthesized subgroup.
1331      'Isaac'
1332      >>> m[2]       # The second parenthesized subgroup.
1333      'Newton'
1334
1335   Named groups are supported as well::
1336
1337      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton")
1338      >>> m['first_name']
1339      'Isaac'
1340      >>> m['last_name']
1341      'Newton'
1342
1343   .. versionadded:: 3.6
1344
1345
1346.. method:: Match.groups(default=None)
1347
1348   Return a tuple containing all the subgroups of the match, from 1 up to however
1349   many groups are in the pattern.  The *default* argument is used for groups that
1350   did not participate in the match; it defaults to ``None``.
1351
1352   For example::
1353
1354      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
1355      >>> m.groups()
1356      ('24', '1632')
1357
1358   If we make the decimal place and everything after it optional, not all groups
1359   might participate in the match.  These groups will default to ``None`` unless
1360   the *default* argument is given::
1361
1362      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
1363      >>> m.groups()      # Second group defaults to None.
1364      ('24', None)
1365      >>> m.groups('0')   # Now, the second group defaults to '0'.
1366      ('24', '0')
1367
1368
1369.. method:: Match.groupdict(default=None)
1370
1371   Return a dictionary containing all the *named* subgroups of the match, keyed by
1372   the subgroup name.  The *default* argument is used for groups that did not
1373   participate in the match; it defaults to ``None``.  For example::
1374
1375      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
1376      >>> m.groupdict()
1377      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
1378
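   The *default* argument can be sketched with an optional named group (the
   group names here are invented)::

      import re

      m = re.match(r'(?P<whole>\d+)(?:\.(?P<frac>\d+))?', '24')

      without_default = m.groupdict()
      with_default = m.groupdict('0')
      print(without_default)   # {'whole': '24', 'frac': None}
      print(with_default)      # {'whole': '24', 'frac': '0'}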
1379
1380.. method:: Match.start([group])
1381            Match.end([group])
1382
1383   Return the indices of the start and end of the substring matched by *group*;
1384   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
1385   *group* exists but did not contribute to the match.  For a match object *m*, and
1386   a group *g* that did contribute to the match, the substring matched by group *g*
1387   (equivalent to ``m.group(g)``) is ::
1388
1389      m.string[m.start(g):m.end(g)]
1390
1391   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
1392   null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
1393   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
1394   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
1395
1396   An example that will remove *remove_this* from email addresses::
1397
1398      >>> email = "tony@tiremove_thisger.net"
1399      >>> m = re.search("remove_this", email)
1400      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'
1402
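   The index values quoted above, and the ``-1`` case, can be checked with a
   sketch such as this one (the second pattern is arbitrary)::

      import re

      m = re.search('b(c?)', 'cba')
      print(m.start(0), m.end(0))   # 1 2
      print(m.start(1), m.end(1))   # 2 2 (group 1 matched the empty string)

      # Group 2 exists in this pattern but takes no part in the match.
      unused = re.match(r'(a)|(b)', 'a')
      print(unused.start(2), unused.end(2))   # -1 -1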
1403
1404.. method:: Match.span([group])
1405
1406   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
1407   that if *group* did not contribute to the match, this is ``(-1, -1)``.
1408   *group* defaults to zero, the entire match.
1409
1410
1411.. attribute:: Match.pos
1412
1413   The value of *pos* which was passed to the :meth:`~Pattern.search` or
1414   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
1415   the index into the string at which the RE engine started looking for a match.
1416
1417
1418.. attribute:: Match.endpos
1419
1420   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
1421   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
1422   the index into the string beyond which the RE engine will not go.
1423
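   For example, in this sketch with an arbitrary string,
   :attr:`~Match.pos` and :attr:`~Match.endpos` report the arguments that
   were passed, or the defaults when none were given::

      import re

      digits = re.compile(r'\d+')

      # Only indices 2 and 3 ('12') are available to the engine.
      limited = digits.search('ab123cd', 2, 4)
      print(limited.group(), limited.pos, limited.endpos)   # 12 2 4

      # Defaults: start at 0 and stop at the end of the string.
      default = digits.search('ab123cd')
      print(default.group(), default.pos, default.endpos)   # 123 0 7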
1424
1425.. attribute:: Match.lastindex
1426
1427   The integer index of the last matched capturing group, or ``None`` if no group
1428   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
1429   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
1430   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
1431   string.
1432
1433
1434.. attribute:: Match.lastgroup
1435
1436   The name of the last matched capturing group, or ``None`` if the group didn't
1437   have a name, or if no group was matched at all.
1438
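   :attr:`~Match.lastindex` and :attr:`~Match.lastgroup` can be compared in
   a sketch like this one (the named group is invented for illustration)::

      import re

      single = re.match(r'(a)b', 'ab')
      double = re.match(r'(a)(b)', 'ab')
      ungrouped = re.match(r'ab', 'ab')
      print(single.lastindex, double.lastindex, ungrouped.lastindex)
      # 1 2 None

      # The last matched group is named here, so lastgroup reports its name.
      named = re.match(r'(\w+)=(?P<value>\d+)', 'width=20')
      print(named.lastindex, named.lastgroup)   # 2 value

      # The last matched group of ``single`` (group 1) has no name.
      print(single.lastgroup)                   # None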
1439
1440.. attribute:: Match.re
1441
1442   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
1443   :meth:`~Pattern.search` method produced this match instance.
1444
1445
1446.. attribute:: Match.string
1447
1448   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
1449
1450
1451.. versionchanged:: 3.7
   Added support for :func:`copy.copy` and :func:`copy.deepcopy`.  Match objects
1453   are considered atomic.
1454
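The match object methods and attributes described above can be inspected
together on a single match.  The following is a minimal sketch rather than part
of the reference itself; the pattern, the sample strings, and the group names
``key`` and ``value`` are chosen purely for illustration:

```python
import re

# Hypothetical pattern and input, chosen only to exercise the attributes.
pattern = re.compile(r"(?P<key>\w+)=(?P<value>\d+)?")
text = "xx width=42 yy"

# Restrict the search to text[3:11], i.e. "width=42".
m = pattern.search(text, 3, 11)

print(m.span())          # (3, 11)
print(m.span("key"))     # (3, 8)
print(m.pos, m.endpos)   # 3 11
print(m.lastindex)       # 2
print(m.lastgroup)       # value
print(m.re is pattern)   # True
print(m.string is text)  # True

# An optional group that did not take part in the match reports (-1, -1).
m2 = pattern.search("height=")
print(m2.span("value"))  # (-1, -1)
print(m2.lastgroup)      # key
```

Note that ``lastgroup`` falls back to ``key`` in the second search because the
optional ``value`` group never matched.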
1455
1456.. _re-examples:
1457
1458Regular Expression Examples
1459---------------------------
1460
1461
1462Checking for a Pair
1463^^^^^^^^^^^^^^^^^^^
1464
1465In this example, we'll use the following helper function to display match
1466objects a little more gracefully::
1467
1468   def displaymatch(match):
1469       if match is None:
1470           return None
1471       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
1472
1473Suppose you are writing a poker program where a player's hand is represented as
1474a 5-character string with each character representing a card, "a" for ace, "k"
1475for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
1476representing the card with that value.
1477
1478To see if a given string is a valid hand, one could do the following::
1479
1480   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
1481   >>> displaymatch(valid.match("akt5q"))  # Valid.
1482   "<Match: 'akt5q', groups=()>"
1483   >>> displaymatch(valid.match("akt5e"))  # Invalid.
1484   >>> displaymatch(valid.match("akt"))    # Invalid.
1485   >>> displaymatch(valid.match("727ak"))  # Valid.
1486   "<Match: '727ak', groups=()>"
1487
That last hand, ``"727ak"``, contained a pair, that is, two cards of the same value.
To match this with a regular expression, one could use a backreference as follows::
1490
1491   >>> pair = re.compile(r".*(.).*\1")
1492   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
1493   "<Match: '717', groups=('7',)>"
1494   >>> displaymatch(pair.match("718ak"))     # No pairs.
1495   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
1496   "<Match: '354aa', groups=('a',)>"
1497
1498To find out what card the pair consists of, one could use the
1499:meth:`~Match.group` method of the match object in the following manner::
1500
1501   >>> pair = re.compile(r".*(.).*\1")
1502   >>> pair.match("717ak").group(1)
1503   '7'
1504
   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
1510   AttributeError: 'NoneType' object has no attribute 'group'
1511
1512   >>> pair.match("354aa").group(1)
1513   'a'
1514
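Because a failed match returns ``None``, code that needs the paired card
usually checks the result before calling :meth:`~Match.group`.  A minimal
sketch, in which the helper name ``pair_card`` is hypothetical:

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that appears at least twice in *hand*, or None."""
    m = pair.match(hand)
    # match() returns None when no card repeats, so test before using it.
    return m.group(1) if m is not None else None

print(pair_card("717ak"))  # 7
print(pair_card("718ak"))  # None
print(pair_card("354aa"))  # a
```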
1515
1516Simulating scanf()
1517^^^^^^^^^^^^^^^^^^
1518
1519.. index:: single: scanf()
1520
1521Python does not currently have an equivalent to :c:func:`scanf`.  Regular
1522expressions are generally more powerful, though also more verbose, than
1523:c:func:`scanf` format strings.  The table below offers some more-or-less
1524equivalent mappings between :c:func:`scanf` format tokens and regular
1525expressions.
1526
1527+--------------------------------+---------------------------------------------+
1528| :c:func:`scanf` Token          | Regular Expression                          |
1529+================================+=============================================+
1530| ``%c``                         | ``.``                                       |
1531+--------------------------------+---------------------------------------------+
1532| ``%5c``                        | ``.{5}``                                    |
1533+--------------------------------+---------------------------------------------+
1534| ``%d``                         | ``[-+]?\d+``                                |
1535+--------------------------------+---------------------------------------------+
1536| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
1537+--------------------------------+---------------------------------------------+
1538| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
1539+--------------------------------+---------------------------------------------+
1540| ``%o``                         | ``[-+]?[0-7]+``                             |
1541+--------------------------------+---------------------------------------------+
1542| ``%s``                         | ``\S+``                                     |
1543+--------------------------------+---------------------------------------------+
1544| ``%u``                         | ``\d+``                                     |
1545+--------------------------------+---------------------------------------------+
1546| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
1547+--------------------------------+---------------------------------------------+
1548
1549To extract the filename and numbers from a string like ::
1550
1551   /usr/sbin/sendmail - 0 errors, 4 warnings
1552
1553you would use a :c:func:`scanf` format like ::
1554
1555   %s - %d errors, %d warnings
1556
1557The equivalent regular expression would be ::
1558
1559   (\S+) - (\d+) errors, (\d+) warnings
1560
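Applied from Python, the expression above might be used as in the following
sketch.  The variable names are illustrative, and :func:`re.fullmatch` is just
one reasonable choice for requiring that the whole line fit the format:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.fullmatch(r"(\S+) - (\d+) errors, (\d+) warnings", line)
if m is not None:
    filename = m.group(1)
    # Unlike scanf(), the groups are strings; convert them explicitly.
    n_errors, n_warnings = int(m.group(2)), int(m.group(3))
    print(filename, n_errors, n_warnings)  # /usr/sbin/sendmail 0 4
```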
1561
1562.. _search-vs-match:
1563
1564search() vs. match()
1565^^^^^^^^^^^^^^^^^^^^
1566
1567.. sectionauthor:: Fred L. Drake, Jr. <[email protected]>
1568
1569Python offers different primitive operations based on regular expressions:
1570
1571+ :func:`re.match` checks for a match only at the beginning of the string
1572+ :func:`re.search` checks for a match anywhere in the string
1573  (this is what Perl does by default)
+ :func:`re.fullmatch` checks whether the entire string matches
1575
1576
1577For example::
1578
1579   >>> re.match("c", "abcdef")    # No match
1580   >>> re.search("c", "abcdef")   # Match
1581   <re.Match object; span=(2, 3), match='c'>
1582   >>> re.fullmatch("p.*n", "python") # Match
1583   <re.Match object; span=(0, 6), match='python'>
1584   >>> re.fullmatch("r.*n", "python") # No match
1585
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::
1588
1589   >>> re.match("c", "abcdef")    # No match
1590   >>> re.search("^c", "abcdef")  # No match
1591   >>> re.search("^a", "abcdef")  # Match
1592   <re.Match object; span=(0, 1), match='a'>
1593
1594Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
1595beginning of the string, whereas using :func:`search` with a regular expression
1596beginning with ``'^'`` will match at the beginning of each line. ::
1597
1598   >>> re.match("X", "A\nB\nX", re.MULTILINE)  # No match
1599   >>> re.search("^X", "A\nB\nX", re.MULTILINE)  # Match
1600   <re.Match object; span=(4, 5), match='X'>
1601
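The three functions differ only in where a match is allowed to start and end.
A small sketch, assuming a hypothetical helper ``where`` that reports which of
them succeed for a given pattern and string:

```python
import re

def where(pattern, string):
    """Return the names of the functions that find a match."""
    funcs = {"match": re.match, "search": re.search, "fullmatch": re.fullmatch}
    # Match objects are always truthy, so a plain truth test is enough.
    return [name for name, func in funcs.items() if func(pattern, string)]

print(where("abc", "abcdef"))     # ['match', 'search']
print(where("def", "abcdef"))     # ['search']
print(where("abcdef", "abcdef"))  # ['match', 'search', 'fullmatch']
print(where("xyz", "abcdef"))     # []
```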
1602
1603Making a Phonebook
1604^^^^^^^^^^^^^^^^^^
1605
:func:`split` splits a string into a list, using the passed pattern as the
delimiter.  The function is invaluable for converting textual data into data
structures that can be easily read and modified by Python, as demonstrated in
the following example that creates a phonebook.
1610
First, here is the input.  Normally it would come from a file; here we use
triple-quoted string syntax:
1613
1614.. doctest::
1615
1616   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
1617   ...
1618   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1619   ... Frank Burger: 925.541.7625 662 South Dogwood Way
1620   ...
1621   ...
1622   ... Heather Albrecht: 548.326.4584 919 Park Place"""
1623
1624The entries are separated by one or more newlines. Now we convert the string
1625into a list with each nonempty line having its own entry:
1626
1627.. doctest::
1628   :options: +NORMALIZE_WHITESPACE
1629
1630   >>> entries = re.split("\n+", text)
1631   >>> entries
1632   ['Ross McFluff: 834.345.1254 155 Elm Street',
1633   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1634   'Frank Burger: 925.541.7625 662 South Dogwood Way',
1635   'Heather Albrecht: 548.326.4584 919 Park Place']
1636
1637Finally, split each entry into a list with first name, last name, telephone
1638number, and address.  We use the ``maxsplit`` parameter of :func:`split`
1639because the address has spaces, our splitting pattern, in it:
1640
1641.. doctest::
1642   :options: +NORMALIZE_WHITESPACE
1643
   >>> [re.split(":? ", entry, maxsplit=3) for entry in entries]
1645   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
1646   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
1647   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
1648   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
1649
1650The ``:?`` pattern matches the colon after the last name, so that it does not
1651occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
1652house number from the street name:
1653
1654.. doctest::
1655   :options: +NORMALIZE_WHITESPACE
1656
   >>> [re.split(":? ", entry, maxsplit=4) for entry in entries]
1658   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
1659   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
1660   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
1661   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
1662
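Once the fields are split, they can be stored in whatever structure suits the
program.  The following sketch builds a dictionary keyed by full name; the
dictionary layout and the variable name ``phonebook`` are illustrative
assumptions, not part of the example above:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # Limiting the split to three cuts keeps the address in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[f"{first} {last}"] = (phone, address)

print(phonebook["Frank Burger"])
# ('925.541.7625', '662 South Dogwood Way')
```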
1663
1664Text Munging
1665^^^^^^^^^^^^
1666
1667:func:`sub` replaces every occurrence of a pattern with a string or the
1668result of a function.  This example demonstrates using :func:`sub` with
1669a function to "munge" text, or randomize the order of all the characters
1670in each word of a sentence except for the first and last characters::
1671
   >>> import random
   >>> def repl(m):
1673   ...     inner_word = list(m.group(2))
1674   ...     random.shuffle(inner_word)
1675   ...     return m.group(1) + "".join(inner_word) + m.group(3)
1676   >>> text = "Professor Abdolmalek, please report your absences promptly."
1677   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
1678   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
1679   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
1680   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
1681
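Since the shuffle is random, the output differs from run to run.  When
repeatable output is wanted, for instance in a test, one option is to shuffle
with a seeded generator.  The sketch below assumes a hypothetical wrapper
``munge`` with a ``seed`` parameter; neither name comes from the :mod:`re`
module:

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of every word of three or more letters."""
    rng = random.Random(seed)  # private generator; a seed makes runs repeatable

    def repl(m):
        inner = list(m.group(2))
        rng.shuffle(inner)
        return m.group(1) + "".join(inner) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
print(munge(text, seed=0) == munge(text, seed=0))  # True
```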
1682
1683Finding all Adverbs
1684^^^^^^^^^^^^^^^^^^^
1685
1686:func:`findall` matches *all* occurrences of a pattern, not just the first
1687one as :func:`search` does.  For example, if a writer wanted to
1688find all of the adverbs in some text, they might use :func:`findall` in
1689the following manner::
1690
1691   >>> text = "He was carefully disguised but captured quickly by police."
1692   >>> re.findall(r"\w+ly\b", text)
1693   ['carefully', 'quickly']
1694
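The shape of the result depends on how many capturing groups the pattern
contains.  As a short sketch using the same sentence (the variable names are
illustrative):

```python
import re

text = "He was carefully disguised but captured quickly by police."

# No groups: a list of whole matches.
whole = re.findall(r"\w+ly\b", text)
print(whole)  # ['carefully', 'quickly']

# One group: a list of that group's text.
stems = re.findall(r"(\w+)ly\b", text)
print(stems)  # ['careful', 'quick']

# Several groups: a list of tuples, one item per group.
pairs = re.findall(r"(\w+)(ly)\b", text)
print(pairs)  # [('careful', 'ly'), ('quick', 'ly')]
```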
1695
1696Finding all Adverbs and their Positions
1697^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1698
1699If one wants more information about all matches of a pattern than the matched
1700text, :func:`finditer` is useful as it provides :ref:`match objects
1701<match-objects>` instead of strings.  Continuing with the previous example, if
1702a writer wanted to find all of the adverbs *and their positions* in
1703some text, they would use :func:`finditer` in the following manner::
1704
1705   >>> text = "He was carefully disguised but captured quickly by police."
1706   >>> for m in re.finditer(r"\w+ly\b", text):
1707   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
1708   07-16: carefully
1709   40-47: quickly
1710
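The positions reported by each match object can also drive edits to the
original text.  The following sketch marks every adverb with asterisks by
slicing around each span; the marking scheme and variable names are made up for
illustration:

```python
import re

text = "He was carefully disguised but captured quickly by police."

pieces = []
last = 0
for m in re.finditer(r"\w+ly\b", text):
    start, end = m.span()
    pieces.append(text[last:start])       # text before the adverb
    pieces.append("*" + m.group() + "*")  # the adverb, marked
    last = end
pieces.append(text[last:])                # remaining tail
marked = "".join(pieces)
print(marked)
# He was *carefully* disguised but captured *quickly* by police.
```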
1711
1712Raw String Notation
1713^^^^^^^^^^^^^^^^^^^
1714
1715Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
1716every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the following two lines of code are
functionally identical::
1719
1720   >>> re.match(r"\W(.)\1\W", " ff ")
1721   <re.Match object; span=(0, 4), match=' ff '>
1722   >>> re.match("\\W(.)\\1\\W", " ff ")
1723   <re.Match object; span=(0, 4), match=' ff '>
1724
1725When one wants to match a literal backslash, it must be escaped in the regular
1726expression.  With raw string notation, this means ``r"\\"``.  Without raw string
1727notation, one must use ``"\\\\"``, making the following lines of code
1728functionally identical::
1729
1730   >>> re.match(r"\\", r"\\")
1731   <re.Match object; span=(0, 1), match='\\'>
1732   >>> re.match("\\\\", r"\\")
1733   <re.Match object; span=(0, 1), match='\\'>
1734
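The equivalence can be checked directly by comparing the string objects that
each literal produces.  A brief sketch; the Windows-style ``path`` value is only
an illustrative input:

```python
import re

# A raw literal keeps the backslash; a regular literal interprets it.
print(len(r"\n"), len("\n"))  # 2 1

# Both spellings produce the same two-character pattern: backslash, backslash.
print(r"\\" == "\\\\")        # True

# That pattern matches a single literal backslash.
path = r"C:\temp"
parts = re.split(r"\\", path)
print(parts)                  # ['C:', 'temp']
```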
1735
1736Writing a Tokenizer
1737^^^^^^^^^^^^^^^^^^^
1738
1739A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
1740analyzes a string to categorize groups of characters.  This is a useful first
1741step in writing a compiler or interpreter.
1742
1743The text categories are specified with regular expressions.  The technique is
1744to combine those into a single master regular expression and to loop over
1745successive matches::
1746
1747    from typing import NamedTuple
1748    import re
1749
1750    class Token(NamedTuple):
1751        type: str
1752        value: str
1753        line: int
1754        column: int
1755
1756    def tokenize(code):
1757        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
1758        token_specification = [
1759            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
1760            ('ASSIGN',   r':='),           # Assignment operator
1761            ('END',      r';'),            # Statement terminator
1762            ('ID',       r'[A-Za-z]+'),    # Identifiers
1763            ('OP',       r'[+\-*/]'),      # Arithmetic operators
1764            ('NEWLINE',  r'\n'),           # Line endings
1765            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
1766            ('MISMATCH', r'.'),            # Any other character
1767        ]
1768        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
1769        line_num = 1
1770        line_start = 0
1771        for mo in re.finditer(tok_regex, code):
1772            kind = mo.lastgroup
1773            value = mo.group()
1774            column = mo.start() - line_start
1775            if kind == 'NUMBER':
1776                value = float(value) if '.' in value else int(value)
1777            elif kind == 'ID' and value in keywords:
1778                kind = value
1779            elif kind == 'NEWLINE':
1780                line_start = mo.end()
1781                line_num += 1
1782                continue
1783            elif kind == 'SKIP':
1784                continue
1785            elif kind == 'MISMATCH':
1786                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
1787            yield Token(kind, value, line_num, column)
1788
1789    statements = '''
1790        IF quantity THEN
1791            total := total + price * quantity;
1792            tax := price * 0.05;
1793        ENDIF;
1794    '''
1795
1796    for token in tokenize(statements):
1797        print(token)
1798
1799The tokenizer produces the following output::
1800
1801    Token(type='IF', value='IF', line=2, column=4)
1802    Token(type='ID', value='quantity', line=2, column=7)
1803    Token(type='THEN', value='THEN', line=2, column=16)
1804    Token(type='ID', value='total', line=3, column=8)
1805    Token(type='ASSIGN', value=':=', line=3, column=14)
1806    Token(type='ID', value='total', line=3, column=17)
1807    Token(type='OP', value='+', line=3, column=23)
1808    Token(type='ID', value='price', line=3, column=25)
1809    Token(type='OP', value='*', line=3, column=31)
1810    Token(type='ID', value='quantity', line=3, column=33)
1811    Token(type='END', value=';', line=3, column=41)
1812    Token(type='ID', value='tax', line=4, column=8)
1813    Token(type='ASSIGN', value=':=', line=4, column=12)
1814    Token(type='ID', value='price', line=4, column=15)
1815    Token(type='OP', value='*', line=4, column=21)
1816    Token(type='NUMBER', value=0.05, line=4, column=23)
1817    Token(type='END', value=';', line=4, column=27)
1818    Token(type='ENDIF', value='ENDIF', line=5, column=4)
1819    Token(type='END', value=';', line=5, column=9)
1820
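Because the alternatives in the combined expression are tried from left to
right, the order of the entries in ``token_specification`` matters.  The
reduced sketch below, whose helper ``kinds`` and two-entry specifications are
hypothetical, shows a catch-all placed too early shadowing a longer token:

```python
import re

def kinds(spec, code):
    """Return (group name, text) pairs for successive matches of *spec*."""
    regex = "|".join("(?P<%s>%s)" % pair for pair in spec)
    return [(mo.lastgroup, mo.group()) for mo in re.finditer(regex, code)]

good = [("ASSIGN", r":="), ("MISMATCH", r".")]
bad = [("MISMATCH", r"."), ("ASSIGN", r":=")]

print(kinds(good, ":="))  # [('ASSIGN', ':=')]
print(kinds(bad, ":="))   # [('MISMATCH', ':'), ('MISMATCH', '=')]
```

This is why the tokenizer above lists ``MISMATCH`` last.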
1821
1822.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
1823   Media, 2009. The third edition of the book no longer covers Python at all,
1824   but the first edition covered writing good regular expression patterns in
1825   great detail.
1826