:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <[email protected]>
.. sectionauthor:: Andrew M. Kuchling <[email protected]>

**Source code:** :source:`Lib/re/`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning.  This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal. Also, please note that any invalid escape sequences in Python's
usage of the backslash in string literals now generate a :exc:`DeprecationWarning`
and in the future this will become a :exc:`SyntaxError`. This behaviour
will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline.
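
As a small illustration of the point above, the following sketch compares the
escaped and raw spellings of a pattern that matches one literal backslash. The
variable names and the sample string are illustrative only.

```python
import re

# The same pattern (one literal backslash) spelled two ways.
escaped = '\\\\'   # ordinary literal: four backslashes in the source
raw = r'\\'        # raw literal: two backslashes in the source

# Both literals produce the identical two-character pattern string.
same_pattern = (escaped == raw)

# The pattern matches the single backslash in the subject string.
match = re.search(raw, r'C:\temp')
matched_text = match.group(0)

# r"\n" keeps the backslash and the 'n'; "\n" is one newline character.
lengths = (len(r"\n"), len("\n"))
```

The raw form is usually easier to read because each backslash in the source
corresponds to exactly one backslash seen by the regular expression parser.
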
Usually patterns will be expressed in Python code using this raw
string notation.

It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
that don't require you to compile a regex object first, but miss some
fine-tuning parameters.

.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references.  Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here.  For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows.  For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest
regular expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition operators or quantifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.)  In the default mode, this matches any character except a newline.  If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'.
   More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` quantifiers are all :dfn:`greedy`; they match
   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``.  Adding ``?`` after the quantifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched.  Using the RE ``<.*?>`` will match
   only ``'<a>'``.

.. index::
   single: *+; in regular expressions
   single: ++; in regular expressions
   single: ?+; in regular expressions

``*+``, ``++``, ``?+``
   Like the ``'*'``, ``'+'``, and ``'?'`` quantifiers, those where ``'+'`` is
   appended also match as many times as possible.
   However, unlike the true greedy quantifiers, these do not allow
   back-tracking when the expression following them fails to match.
   These are known as :dfn:`possessive` quantifiers.
   For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
   all 4 ``'a'``\ s, but, when the final ``'a'`` is encountered, the
   expression is backtracked so that in the end the ``a*`` ends up matching
   3 ``'a'``\ s total, and the fourth ``'a'`` is matched by the final ``'a'``.
   However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
   match all 4 ``'a'``\ s, but when the final ``'a'`` fails to find any more
   characters to match, the expression cannot be backtracked and will thus
   fail to match.
   ``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
   and ``(?>x?)`` correspondingly.

   .. versionadded:: 3.11

.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match.  For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible.  For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound.  As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible.  This is the
   non-greedy version of the previous quantifier.
   For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}`` will match
   5 ``'a'`` characters, while ``a{3,5}?`` will only match 3 characters.

``{m,n}+``
   Causes the resulting RE to match from *m* to *n* repetitions of the
   preceding RE, attempting to match as many repetitions as possible
   *without* establishing any backtracking points.
   This is the possessive version of the quantifier above.
   For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}+aa``
   attempts to match 5 ``'a'`` characters, then, requiring 2 more ``'a'``\ s,
   will need more characters than available and thus fail, while
   ``a{3,5}aa`` will match with ``a{3,5}`` capturing 5, then 4 ``'a'``\ s
   by backtracking and then the final 2 ``'a'``\ s are matched by the final
   ``aa`` in the pattern.
   ``x{m,n}+`` is equivalent to ``(?>x{m,n})``.

   .. versionadded:: 3.11

.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string.  However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice.  This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters.  In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets.  For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set.  If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched.  For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``.  ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
     and parentheses.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future.  This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``.  To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.

.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*.  An arbitrary number of REs can be separated by the
   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match.  In other words, the ``'|'`` operator is never
   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below.
   To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise).  The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
   :func:`re.compile` function.  Flags should be used first in the
   expression string.

   .. versionchanged:: 3.11
      This construction can only be used at the start of the expression.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses.  Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.
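
The difference between capturing and non-capturing parentheses can be sketched
as follows. This is a minimal illustration; the pattern and the variable names
are chosen for the example and are not part of the module.

```python
import re

# Capturing parentheses create group 1; (?:...) groups without capturing.
capturing = re.match(r'(ab)+c', 'ababc')
non_capturing = re.match(r'(?:ab)+c', 'ababc')

# With a repeated capturing group, only the last repetition is kept.
capturing_groups = capturing.groups()

# The non-capturing form matches the same text but defines no groups.
non_capturing_groups = non_capturing.groups()
whole_match = non_capturing.group(0)
```

Both patterns match the same overall text; only the set of retrievable groups
differs.
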

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for the part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``.  Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default).  In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

``(?>...)``
   Attempts to match ``...`` as if it was a separate regular expression, and
   if successful, continues to match the rest of the pattern following it.
   If the subsequent pattern fails to match, the stack can only be unwound
   to a point *before* the ``(?>...)`` because once exited, the expression,
   known as an :dfn:`atomic group`, has thrown away all stack points within
   itself.
   Thus, ``(?>.*).`` would never match anything because first the ``.*``
   would match all characters possible, then, having nothing left to match,
   the final ``.`` would fail to match.
   Since there are no stack points saved in the atomic group, and there is
   no stack point before it, the entire expression would thus fail to match.

   .. versionadded:: 3.11

.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*.  Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression.  A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts.  If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

   .. deprecated:: 3.11
      Group *name* containing characters outside the ASCII range
      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` patterns.

.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

.. index:: single: (?=; in regular expressions

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
   called a :dfn:`lookahead assertion`.  For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next.  This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position.  This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length.  Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

.. _re-conditional-expression:
.. index:: single: (?(; in regular expressions

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.

   .. deprecated:: 3.11
      Group *id* containing anything except ASCII digits.
      Group *name* containing characters outside the ASCII range
      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` replacement strings.


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character.  For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number.  Groups are numbered
   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group).  This special sequence
   can only be used to match one of the first 99 groups.
   If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters.  Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag.  Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

.. index:: single: \B; in regular expressions

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag.  Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.

.. index:: single: \d; in regular expressions

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]).  This includes ``[0-9]``, and
      also many other digit characters.  If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.

.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]``.

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes alphanumeric characters
      (as defined by :meth:`str.isalnum`) as well as the underscore (``_``).
      If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``.
      If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.

.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current locale
   nor the underscore.

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form.  If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference.  As for string literals, octal escapes are always at most
three digits in length.

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.

.. versionchanged:: 3.8
   The ``'\N{name}'`` escape sequence has been added. As in string literals,
   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions.  Most non-trivial applications always use the compiled
form.


Flags
^^^^^

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.


.. class:: RegexFlag

   An :class:`enum.IntFlag` class containing the regex options listed below.

   .. versionadded:: 3.11 - added to ``__all__``

.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching.  This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
   Corresponds to the inline flag ``(?a)``.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).


.. data:: DEBUG

   Display debug information about the compiled expression.
   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters.
Full Unicode matching (such as ``Ü`` matching 715 ``ü``) also works unless the :const:`re.ASCII` flag is used to disable 716 non-ASCII matches. The current locale does not change the effect of this 717 flag unless the :const:`re.LOCALE` flag is also used. 718 Corresponds to the inline flag ``(?i)``. 719 720 Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in 721 combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII 722 letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital 723 letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 724 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign). 725 If the :const:`ASCII` flag is used, only letters 'a' to 'z' 726 and 'A' to 'Z' are matched. 727 728.. data:: L 729 LOCALE 730 731 Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching 732 dependent on the current locale. This flag can be used only with bytes 733 patterns. The use of this flag is discouraged as the locale mechanism 734 is very unreliable, it only handles one "culture" at a time, and it only 735 works with 8-bit locales. Unicode matching is already enabled by default 736 in Python 3 for Unicode (str) patterns, and it is able to handle different 737 locales/languages. 738 Corresponds to the inline flag ``(?L)``. 739 740 .. versionchanged:: 3.6 741 :const:`re.LOCALE` can be used only with bytes patterns and is 742 not compatible with :const:`re.ASCII`. 743 744 .. versionchanged:: 3.7 745 Compiled regular expression objects with the :const:`re.LOCALE` flag no 746 longer depend on the locale at compile time. Only the locale at 747 matching time affects the result of matching. 748 749 750.. 
data:: M 751 MULTILINE 752 753 When specified, the pattern character ``'^'`` matches at the beginning of the 754 string and at the beginning of each line (immediately following each newline); 755 and the pattern character ``'$'`` matches at the end of the string and at the 756 end of each line (immediately preceding each newline). By default, ``'^'`` 757 matches only at the beginning of the string, and ``'$'`` only at the end of the 758 string and immediately before the newline (if any) at the end of the string. 759 Corresponds to the inline flag ``(?m)``. 760 761.. data:: NOFLAG 762 763 Indicates no flag being applied, the value is ``0``. This flag may be used 764 as a default value for a function keyword argument or as a base value that 765 will be conditionally ORed with other flags. Example of use as a default 766 value:: 767 768 def myfunc(text, flag=re.NOFLAG): 769 return re.match(text, flag) 770 771 .. versionadded:: 3.11 772 773.. data:: S 774 DOTALL 775 776 Make the ``'.'`` special character match any character at all, including a 777 newline; without this flag, ``'.'`` will match anything *except* a newline. 778 Corresponds to the inline flag ``(?s)``. 779 780 781.. data:: X 782 VERBOSE 783 784 .. index:: single: # (hash); in regular expressions 785 786 This flag allows you to write regular expressions that look nicer and are 787 more readable by allowing you to visually separate logical sections of the 788 pattern and add comments. Whitespace within the pattern is ignored, except 789 when in a character class, or when preceded by an unescaped backslash, 790 or within tokens like ``*?``, ``(?:`` or ``(?P<...>``. For example, ``(? :`` 791 and ``* ?`` are not allowed. 792 When a line contains a ``#`` that is not in a character class and is not 793 preceded by an unescaped backslash, all characters from the leftmost such 794 ``#`` through the end of the line are ignored. 
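As a small illustration of these whitespace and comment rules, here is a minimal sketch. The pattern, the group names ``key`` and ``num``, and the sample string are made up for this example and are not part of the module:

```python
import re

# Hypothetical pattern: a word, then the literal text " # ", then digits.
verbose = re.compile(r"""
    (?P<key> \w+ )   # whitespace between tokens is ignored
    [ ]              # a space inside a character class is kept
    \#               # an escaped hash is a literal, not a comment
    \                # a backslash-escaped space is also kept
    (?P<num> \d+ )   # trailing digits
""", re.VERBOSE)

m = verbose.fullmatch("width # 20")
print(m.group("key"), m.group("num"))
```

Without the character class and the escaped space, the two literal spaces in the sample string would have nothing to match, because bare whitespace in a verbose pattern is discarded.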

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")

   Corresponds to the inline flag ``(?x)``.


Functions
^^^^^^^^^

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if the string does not match the pattern;
   note that this is different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: fullmatch(pattern, string, flags=0)

   If the whole *string* matches the regular expression *pattern*, return a
   corresponding :ref:`match object <match-objects>`. Return ``None`` if the
   string does not match the pattern; note that this is different from a
   zero-length match.

   .. versionadded:: 3.4


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string::

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list.

   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.

      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
      >>> re.split(r'(\W*)', '...words...')
      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.7
      Added support for splitting on a pattern that could match an empty string.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings or tuples. The *string* is scanned left-to-right, and matches
   are returned in the order found. Empty matches are included in the result.

   The result depends on the number of capturing groups in the pattern.
   If there are no groups, return a list of strings matching the whole
   pattern. If there is exactly one group, return a list of strings
   matching that group. If multiple groups are present, return a list
   of tuples of strings matching the groups. Non-capturing groups do not
   affect the form of the result.

      >>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
      ['foot', 'fell', 'fastest']
      >>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
      [('width', '20'), ('height', '10')]

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
   all non-overlapping matches for the RE *pattern* in *string*. The *string*
   is scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes of ASCII letters are reserved for future use and
   treated as errors. Other unknown escapes such as ``\&`` are left alone.
   Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single :ref:`match object <match-objects>`
   argument, and returns the replacement string. For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or a :ref:`pattern object <re-objects>`.
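The two ways of supplying the pattern can be compared side by side. This is a minimal sketch; the names ``word``, ``text``, ``from_string`` and ``from_object`` and the sample sentence are illustrative only:

```python
import re

# Hypothetical example data.
word = re.compile(r"\bspam\b", re.IGNORECASE)
text = "Spam, spam and more SPAM"

# Passing the pattern as a string, with flags given to sub() itself...
from_string = re.sub(r"\bspam\b", "eggs", text, flags=re.IGNORECASE)
# ...or passing a precompiled pattern object, which carries its own flags.
from_object = re.sub(word, "eggs", text)

print(from_string)
print(from_object == from_string)
```

In CPython, combining a pattern object with a nonzero *flags* argument raises :exc:`ValueError`, so flags belong in the :func:`re.compile` call when a pattern object is used.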

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
   ``'-a-b--d-'``.

   .. index:: single: \g; in regular expressions

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      now are errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      now are errors.

   .. versionchanged:: 3.7
      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.

   .. deprecated:: 3.11
      Group *id* containing anything except ASCII digits.
      Group *name* containing characters outside the ASCII range
      (``b'\x00'``-``b'\x7f'``) in :class:`bytes` replacement strings.


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.


.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it. For example::

      >>> print(re.escape('https://www.python.org'))
      https://www\.python\.org

      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`, only backslashes should be escaped. For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.

   .. versionchanged:: 3.7
      Only characters that can have special meaning in a regular expression
      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
      ``"`"`` are no longer escaped.


.. function:: purge()

   Clear the regular expression cache.


Exceptions
^^^^^^^^^^

.. exception:: error(msg, pattern=None, pos=None)

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern. The error instance has
   the following additional attributes:

   .. attribute:: msg

      The unformatted error message.

   .. attribute:: pattern

      The regular expression pattern.

   .. attribute:: pos

      The index in *pattern* where compilation failed (may be ``None``).

   .. attribute:: lineno

      The line corresponding to *pos* (may be ``None``).

   .. attribute:: colno

      The column corresponding to *pos* (may be ``None``).

   .. versionchanged:: 3.5
      Added additional attributes.

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:

.. method:: Pattern.search(string[, pos[, endpos]])

   Scan through *string* looking for the first location where this regular
   expression produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``. This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.
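The difference between *pos* and slicing can be seen in a short sketch. The names ``anchored``, ``multiline`` and ``text`` are illustrative only:

```python
import re

anchored = re.compile(r"^b")
text = "abc"

# Starting at index 1 does not move the real beginning of the string,
# so '^' cannot match at the 'b'.
print(anchored.search(text, 1))

# Slicing builds a new string whose real beginning is the 'b'.
print(anchored.search(text[1:]))

# In MULTILINE mode '^' also matches just after a newline, so a search
# that starts at such a position can succeed.
multiline = re.compile(r"^b", re.MULTILINE)
print(multiline.search("a\nbc", 2))
```

The first call prints ``None``, while the other two return match objects, which is why *pos* is the safer choice when anchors must keep referring to the original string.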

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.search(string, 0, 50)`` is equivalent to
   ``rx.search(string[:50], 0)``. ::

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <re.Match object; span=(0, 1), match='d'>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


.. method:: Pattern.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :ref:`match object <match-objects>`.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :ref:`match object <match-objects>`. Return ``None`` if the string does not
   match the pattern; note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4


.. method:: Pattern.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: Pattern.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region like for :meth:`search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region like for :meth:`search`.


.. method:: Pattern.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: Pattern.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.


.. attribute:: Pattern.flags

   The regex matching flags. This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :data:`UNICODE` if the pattern is a Unicode string.


.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
   regular expression objects are considered atomic.


.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

.. method:: Match.group([group1, ...])

   Returns one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``. This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   Named groups are supported as well::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton")
      >>> m['first_name']
      'Isaac'
      >>> m['last_name']
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match. These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``. For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.


.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.


.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

   >>> pair = re.compile(r".*(.).*\1")
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <[email protected]>

Python offers different primitive operations based on regular expressions:

+ :func:`re.match` checks for a match only at the beginning of the string
+ :func:`re.search` checks for a match anywhere in the string
  (this is what Perl does by default)
+ :func:`re.fullmatch` checks whether the entire string is a match


For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>
   >>> re.fullmatch("p.*n", "python") # Match
   <re.Match object; span=(0, 6), match='python'>
   >>> re.fullmatch("r.*n", "python") # No match

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match("X", "A\nB\nX", re.MULTILINE)  # No match
   >>> re.search("^X", "A\nB\nX", re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern. The
method is invaluable for converting textual data into data structures that can be
easily read and modified by Python as demonstrated in the following example that
creates a phonebook.

First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax:

.. doctest::

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? 
", entry, 4) for entry in entries] 1658 [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'], 1659 ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'], 1660 ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'], 1661 ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']] 1662 1663 1664Text Munging 1665^^^^^^^^^^^^ 1666 1667:func:`sub` replaces every occurrence of a pattern with a string or the 1668result of a function. This example demonstrates using :func:`sub` with 1669a function to "munge" text, or randomize the order of all the characters 1670in each word of a sentence except for the first and last characters:: 1671 1672 >>> def repl(m): 1673 ... inner_word = list(m.group(2)) 1674 ... random.shuffle(inner_word) 1675 ... return m.group(1) + "".join(inner_word) + m.group(3) 1676 >>> text = "Professor Abdolmalek, please report your absences promptly." 1677 >>> re.sub(r"(\w)(\w+)(\w)", repl, text) 1678 'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.' 1679 >>> re.sub(r"(\w)(\w+)(\w)", repl, text) 1680 'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.' 1681 1682 1683Finding all Adverbs 1684^^^^^^^^^^^^^^^^^^^ 1685 1686:func:`findall` matches *all* occurrences of a pattern, not just the first 1687one as :func:`search` does. For example, if a writer wanted to 1688find all of the adverbs in some text, they might use :func:`findall` in 1689the following manner:: 1690 1691 >>> text = "He was carefully disguised but captured quickly by police." 1692 >>> re.findall(r"\w+ly\b", text) 1693 ['carefully', 'quickly'] 1694 1695 1696Finding all Adverbs and their Positions 1697^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1698 1699If one wants more information about all matches of a pattern than the matched 1700text, :func:`finditer` is useful as it provides :ref:`match objects 1701<match-objects>` instead of strings. 
Continuing with the previous example, if 1702a writer wanted to find all of the adverbs *and their positions* in 1703some text, they would use :func:`finditer` in the following manner:: 1704 1705 >>> text = "He was carefully disguised but captured quickly by police." 1706 >>> for m in re.finditer(r"\w+ly\b", text): 1707 ... print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0))) 1708 07-16: carefully 1709 40-47: quickly 1710 1711 1712Raw String Notation 1713^^^^^^^^^^^^^^^^^^^ 1714 1715Raw string notation (``r"text"``) keeps regular expressions sane. Without it, 1716every backslash (``'\'``) in a regular expression would have to be prefixed with 1717another one to escape it. For example, the two following lines of code are 1718functionally identical:: 1719 1720 >>> re.match(r"\W(.)\1\W", " ff ") 1721 <re.Match object; span=(0, 4), match=' ff '> 1722 >>> re.match("\\W(.)\\1\\W", " ff ") 1723 <re.Match object; span=(0, 4), match=' ff '> 1724 1725When one wants to match a literal backslash, it must be escaped in the regular 1726expression. With raw string notation, this means ``r"\\"``. Without raw string 1727notation, one must use ``"\\\\"``, making the following lines of code 1728functionally identical:: 1729 1730 >>> re.match(r"\\", r"\\") 1731 <re.Match object; span=(0, 1), match='\\'> 1732 >>> re.match("\\\\", r"\\") 1733 <re.Match object; span=(0, 1), match='\\'> 1734 1735 1736Writing a Tokenizer 1737^^^^^^^^^^^^^^^^^^^ 1738 1739A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_ 1740analyzes a string to categorize groups of characters. This is a useful first 1741step in writing a compiler or interpreter. 1742 1743The text categories are specified with regular expressions. 
The technique is 1744to combine those into a single master regular expression and to loop over 1745successive matches:: 1746 1747 from typing import NamedTuple 1748 import re 1749 1750 class Token(NamedTuple): 1751 type: str 1752 value: str 1753 line: int 1754 column: int 1755 1756 def tokenize(code): 1757 keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'} 1758 token_specification = [ 1759 ('NUMBER', r'\d+(\.\d*)?'), # Integer or decimal number 1760 ('ASSIGN', r':='), # Assignment operator 1761 ('END', r';'), # Statement terminator 1762 ('ID', r'[A-Za-z]+'), # Identifiers 1763 ('OP', r'[+\-*/]'), # Arithmetic operators 1764 ('NEWLINE', r'\n'), # Line endings 1765 ('SKIP', r'[ \t]+'), # Skip over spaces and tabs 1766 ('MISMATCH', r'.'), # Any other character 1767 ] 1768 tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification) 1769 line_num = 1 1770 line_start = 0 1771 for mo in re.finditer(tok_regex, code): 1772 kind = mo.lastgroup 1773 value = mo.group() 1774 column = mo.start() - line_start 1775 if kind == 'NUMBER': 1776 value = float(value) if '.' 
in value else int(value) 1777 elif kind == 'ID' and value in keywords: 1778 kind = value 1779 elif kind == 'NEWLINE': 1780 line_start = mo.end() 1781 line_num += 1 1782 continue 1783 elif kind == 'SKIP': 1784 continue 1785 elif kind == 'MISMATCH': 1786 raise RuntimeError(f'{value!r} unexpected on line {line_num}') 1787 yield Token(kind, value, line_num, column) 1788 1789 statements = ''' 1790 IF quantity THEN 1791 total := total + price * quantity; 1792 tax := price * 0.05; 1793 ENDIF; 1794 ''' 1795 1796 for token in tokenize(statements): 1797 print(token) 1798 1799The tokenizer produces the following output:: 1800 1801 Token(type='IF', value='IF', line=2, column=4) 1802 Token(type='ID', value='quantity', line=2, column=7) 1803 Token(type='THEN', value='THEN', line=2, column=16) 1804 Token(type='ID', value='total', line=3, column=8) 1805 Token(type='ASSIGN', value=':=', line=3, column=14) 1806 Token(type='ID', value='total', line=3, column=17) 1807 Token(type='OP', value='+', line=3, column=23) 1808 Token(type='ID', value='price', line=3, column=25) 1809 Token(type='OP', value='*', line=3, column=31) 1810 Token(type='ID', value='quantity', line=3, column=33) 1811 Token(type='END', value=';', line=3, column=41) 1812 Token(type='ID', value='tax', line=4, column=8) 1813 Token(type='ASSIGN', value=':=', line=4, column=12) 1814 Token(type='ID', value='price', line=4, column=15) 1815 Token(type='OP', value='*', line=4, column=21) 1816 Token(type='NUMBER', value=0.05, line=4, column=23) 1817 Token(type='END', value=';', line=4, column=27) 1818 Token(type='ENDIF', value='ENDIF', line=5, column=4) 1819 Token(type='END', value=';', line=5, column=9) 1820 1821 1822.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly 1823 Media, 2009. The third edition of the book no longer covers Python at all, 1824 but the first edition covered writing good regular expression patterns in 1825 great detail. 1826