1.. highlight:: c
2
3.. _unicodeobjects:
4
5Unicode Objects and Codecs
6--------------------------
7
8.. sectionauthor:: Marc-André Lemburg <[email protected]>
9.. sectionauthor:: Georg Brandl <[email protected]>
10
11Unicode Objects
12^^^^^^^^^^^^^^^
13
14Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
15use a variety of representations, in order to allow handling the complete range
16of Unicode characters while staying memory efficient.  There are special cases
17for strings where all code points are below 128, 256, or 65536; otherwise, code
18points must be below 1114112 (which is the full Unicode range).
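
As a rough illustration of these thresholds, the following sketch maps the
largest code point of a string to the number of bytes each code point would
occupy.  The helper name ``bytes_per_code_point`` is hypothetical and not part
of the C API; it only mirrors the ranges listed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the C API): return the number of bytes
   needed per code point when the largest code point is max_char. */
static int
bytes_per_code_point(uint32_t max_char)
{
    assert(max_char < 1114112);   /* must lie within the Unicode range */
    if (max_char < 256) {
        return 1;   /* covers both the below-128 and below-256 cases */
    }
    if (max_char < 65536) {
        return 2;
    }
    return 4;
}
```

The below-128 and below-256 cases both use one byte per code point, which is
why the sketch does not distinguish them.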
19
20:c:expr:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
21in the Unicode object.  The :c:expr:`Py_UNICODE*` representation is deprecated
22and inefficient.
23
24Due to the transition between the old APIs and the new APIs, Unicode objects
25can internally be in two states depending on how they were created:
26
27* "canonical" Unicode objects are all objects created by a non-deprecated
28  Unicode API.  They use the most efficient representation allowed by the
29  implementation.
30
31* "legacy" Unicode objects have been created through one of the deprecated
32  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
33  :c:expr:`Py_UNICODE*` representation; you will have to call
34  :c:func:`PyUnicode_READY` on them before calling any other API.
35
36.. note::
   The "legacy" Unicode object will be removed in Python 3.12 together with
   the deprecated APIs.  All Unicode objects will be "canonical" from then on.
   See :pep:`623` for more information.
40
41
42Unicode Type
43""""""""""""
44
45These are the basic Unicode object types used for the Unicode implementation in
46Python:
47
48.. c:type:: Py_UCS4
49            Py_UCS2
50            Py_UCS1
51
52   These types are typedefs for unsigned integer types wide enough to contain
53   characters of 32 bits, 16 bits and 8 bits, respectively.  When dealing with
54   single Unicode characters, use :c:type:`Py_UCS4`.
55
56   .. versionadded:: 3.3
57
58
59.. c:type:: Py_UNICODE
60
61   This is a typedef of :c:expr:`wchar_t`, which is a 16-bit type or 32-bit type
62   depending on the platform.
63
64   .. versionchanged:: 3.3
65      In previous versions, this was a 16-bit type or a 32-bit type depending on
66      whether you selected a "narrow" or "wide" Unicode version of Python at
67      build time.
68
69
70.. c:type:: PyASCIIObject
71            PyCompactUnicodeObject
72            PyUnicodeObject
73
74   These subtypes of :c:type:`PyObject` represent a Python Unicode object.  In
75   almost all cases, they shouldn't be used directly, since all API functions
76   that deal with Unicode objects take and return :c:type:`PyObject` pointers.
77
78   .. versionadded:: 3.3
79
80
81.. c:var:: PyTypeObject PyUnicode_Type
82
83   This instance of :c:type:`PyTypeObject` represents the Python Unicode type.  It
84   is exposed to Python code as ``str``.
85
86
The following APIs are C macros and static inline functions for fast checks and
access to internal read-only data of Unicode objects:
89
90.. c:function:: int PyUnicode_Check(PyObject *o)
91
92   Return true if the object *o* is a Unicode object or an instance of a Unicode
93   subtype.  This function always succeeds.
94
95
96.. c:function:: int PyUnicode_CheckExact(PyObject *o)
97
98   Return true if the object *o* is a Unicode object, but not an instance of a
99   subtype.  This function always succeeds.
100
101
102.. c:function:: int PyUnicode_READY(PyObject *o)
103
104   Ensure the string object *o* is in the "canonical" representation.  This is
105   required before using any of the access macros described below.
106
107   .. XXX expand on when it is not required
108
109   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
110   particular happens if memory allocation fails.
111
112   .. versionadded:: 3.3
113
114   .. deprecated-removed:: 3.10 3.12
115      This API will be removed with :c:func:`PyUnicode_FromUnicode`.
116
117
118.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)
119
120   Return the length of the Unicode string, in code points.  *o* has to be a
121   Unicode object in the "canonical" representation (not checked).
122
123   .. versionadded:: 3.3
124
125
126.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
127                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
128                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)
129
   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access.  No check is performed that the
   canonical representation actually has the matching character size; use
   :c:func:`PyUnicode_KIND` to select the right macro.  Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.
135
136   .. versionadded:: 3.3
137
138
139.. c:macro:: PyUnicode_WCHAR_KIND
140             PyUnicode_1BYTE_KIND
141             PyUnicode_2BYTE_KIND
142             PyUnicode_4BYTE_KIND
143
144   Return values of the :c:func:`PyUnicode_KIND` macro.
145
146   .. versionadded:: 3.3
147
148   .. deprecated-removed:: 3.10 3.12
149      ``PyUnicode_WCHAR_KIND`` is deprecated.
150
151
152.. c:function:: int PyUnicode_KIND(PyObject *o)
153
154   Return one of the PyUnicode kind constants (see above) that indicate how many
155   bytes per character this Unicode object uses to store its data.  *o* has to
156   be a Unicode object in the "canonical" representation (not checked).
157
158   .. XXX document "0" return value?
159
160   .. versionadded:: 3.3
161
162
163.. c:function:: void* PyUnicode_DATA(PyObject *o)
164
165   Return a void pointer to the raw Unicode buffer.  *o* has to be a Unicode
166   object in the "canonical" representation (not checked).
167
168   .. versionadded:: 3.3
169
170
171.. c:function:: void PyUnicode_WRITE(int kind, void *data, \
172                                     Py_ssize_t index, Py_UCS4 value)
173
174   Write into a canonical representation *data* (as obtained with
175   :c:func:`PyUnicode_DATA`).  This function performs no sanity checks, and is
176   intended for usage in loops.  The caller should cache the *kind* value and
177   *data* pointer as obtained from other calls.  *index* is the index in
178   the string (starts at 0) and *value* is the new code point value which should
179   be written to that location.
180
181   .. versionadded:: 3.3
182
183
184.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, \
185                                       Py_ssize_t index)
186
187   Read a code point from a canonical representation *data* (as obtained with
188   :c:func:`PyUnicode_DATA`).  No checks or ready calls are performed.
189
190   .. versionadded:: 3.3
191
192
193.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)
194
195   Read a character from a Unicode object *o*, which must be in the "canonical"
196   representation.  This is less efficient than :c:func:`PyUnicode_READ` if you
197   do multiple consecutive reads.
198
199   .. versionadded:: 3.3
200
201
202.. c:function:: Py_UCS4 PyUnicode_MAX_CHAR_VALUE(PyObject *o)
203
204   Return the maximum code point that is suitable for creating another string
205   based on *o*, which must be in the "canonical" representation.  This is
206   always an approximation but more efficient than iterating over the string.
207
208   .. versionadded:: 3.3
209
210
211.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
212
213   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
214   code units (this includes surrogate pairs as 2 units).  *o* has to be a
215   Unicode object (not checked).
216
217   .. deprecated-removed:: 3.3 3.12
218      Part of the old-style Unicode API, please migrate to using
219      :c:func:`PyUnicode_GET_LENGTH`.
220
221
222.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
223
224   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
225   bytes.  *o* has to be a Unicode object (not checked).
226
227   .. deprecated-removed:: 3.3 3.12
228      Part of the old-style Unicode API, please migrate to using
229      :c:func:`PyUnicode_GET_LENGTH`.
230
231
232.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
233                const char* PyUnicode_AS_DATA(PyObject *o)
234
235   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.  The
236   returned buffer is always terminated with an extra null code point.  It
237   may also contain embedded null code points, which would cause the string
238   to be truncated when used in most C functions.  The ``AS_DATA`` form
239   casts the pointer to :c:expr:`const char *`.  The *o* argument has to be
240   a Unicode object (not checked).
241
242   .. versionchanged:: 3.3
243      This function is now inefficient -- because in many cases the
244      :c:type:`Py_UNICODE` representation does not exist and needs to be created
245      -- and can fail (return ``NULL`` with an exception set).  Try to port the
246      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
247      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.
248
249   .. deprecated-removed:: 3.3 3.12
250      Part of the old-style Unicode API, please migrate to using the
251      :c:func:`PyUnicode_nBYTE_DATA` family of macros.
252
253
254.. c:function:: int PyUnicode_IsIdentifier(PyObject *o)
255
256   Return ``1`` if the string is a valid identifier according to the language
257   definition, section :ref:`identifiers`. Return ``0`` otherwise.
258
259   .. versionchanged:: 3.9
260      The function does not call :c:func:`Py_FatalError` anymore if the string
261      is not ready.
262
263
264Unicode Character Properties
265""""""""""""""""""""""""""""
266
267Unicode provides many different character properties. The most often needed ones
268are available through these macros which are mapped to C functions depending on
269the Python configuration.
270
271
272.. c:function:: int Py_UNICODE_ISSPACE(Py_UCS4 ch)
273
274   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.
275
276
277.. c:function:: int Py_UNICODE_ISLOWER(Py_UCS4 ch)
278
279   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.
280
281
282.. c:function:: int Py_UNICODE_ISUPPER(Py_UCS4 ch)
283
284   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.
285
286
287.. c:function:: int Py_UNICODE_ISTITLE(Py_UCS4 ch)
288
289   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.
290
291
292.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UCS4 ch)
293
294   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.
295
296
297.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UCS4 ch)
298
299   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.
300
301
302.. c:function:: int Py_UNICODE_ISDIGIT(Py_UCS4 ch)
303
304   Return ``1`` or ``0`` depending on whether *ch* is a digit character.
305
306
307.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UCS4 ch)
308
309   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.
310
311
312.. c:function:: int Py_UNICODE_ISALPHA(Py_UCS4 ch)
313
314   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.
315
316
317.. c:function:: int Py_UNICODE_ISALNUM(Py_UCS4 ch)
318
319   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.
320
321
322.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UCS4 ch)
323
324   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
325   Nonprintable characters are those characters defined in the Unicode character
326   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
327   considered printable.  (Note that printable characters in this context are
328   those which should not be escaped when :func:`repr` is invoked on a string.
329   It has no bearing on the handling of strings written to :data:`sys.stdout` or
330   :data:`sys.stderr`.)
331
332
333These APIs can be used for fast direct character conversions:
334
335
336.. c:function:: Py_UCS4 Py_UNICODE_TOLOWER(Py_UCS4 ch)
337
338   Return the character *ch* converted to lower case.
339
340   .. deprecated:: 3.3
341      This function uses simple case mappings.
342
343
344.. c:function:: Py_UCS4 Py_UNICODE_TOUPPER(Py_UCS4 ch)
345
346   Return the character *ch* converted to upper case.
347
348   .. deprecated:: 3.3
349      This function uses simple case mappings.
350
351
352.. c:function:: Py_UCS4 Py_UNICODE_TOTITLE(Py_UCS4 ch)
353
354   Return the character *ch* converted to title case.
355
356   .. deprecated:: 3.3
357      This function uses simple case mappings.
358
359
360.. c:function:: int Py_UNICODE_TODECIMAL(Py_UCS4 ch)
361
   Return the character *ch* converted to a non-negative decimal integer.  Return
   ``-1`` if this is not possible.  This macro does not raise exceptions.
364
365
366.. c:function:: int Py_UNICODE_TODIGIT(Py_UCS4 ch)
367
368   Return the character *ch* converted to a single digit integer. Return ``-1`` if
369   this is not possible.  This macro does not raise exceptions.
370
371
372.. c:function:: double Py_UNICODE_TONUMERIC(Py_UCS4 ch)
373
374   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
375   possible.  This macro does not raise exceptions.
376
377
378These APIs can be used to work with surrogates:
379
380.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)
381
382   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).
383
384.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)
385
386   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).
387
388.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)
389
390   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).
391
392.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)
393
394   Join two surrogate characters and return a single Py_UCS4 value.
395   *high* and *low* are respectively the leading and trailing surrogates in a
396   surrogate pair.
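
The arithmetic behind joining a surrogate pair can be sketched in plain C.
The function ``join_surrogates`` below is a hypothetical stand-in written for
illustration; it follows the standard UTF-16 decoding rule and is not the
macro's actual definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in showing how a surrogate pair maps to a code point. */
static uint32_t
join_surrogates(uint32_t high, uint32_t low)
{
    assert(0xD800 <= high && high <= 0xDBFF);   /* leading surrogate */
    assert(0xDC00 <= low && low <= 0xDFFF);     /* trailing surrogate */
    /* Each surrogate carries 10 payload bits; the result is offset by
       0x10000, the first code point outside the Basic Multilingual Plane. */
    return 0x10000 + (((high & 0x03FF) << 10) | (low & 0x03FF));
}
```

For example, the pair ``0xD83D``, ``0xDE00`` yields ``0x1F600``.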
397
398
399Creating and accessing Unicode strings
400""""""""""""""""""""""""""""""""""""""
401
402To create Unicode objects and access their basic sequence properties, use these
403APIs:
404
405.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
406
407   Create a new Unicode object.  *maxchar* should be the true maximum code point
408   to be placed in the string.  As an approximation, it can be rounded up to the
409   nearest value in the sequence 127, 255, 65535, 1114111.
410
411   This is the recommended way to allocate a new Unicode object.  Objects
412   created using this function are not resizable.
413
414   .. versionadded:: 3.3
415
416
417.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
418                                                    Py_ssize_t size)
419
420   Create a new Unicode object with the given *kind* (possible values are
421   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
422   :c:func:`PyUnicode_KIND`).  The *buffer* must point to an array of *size*
423   units of 1, 2 or 4 bytes per character, as given by the kind.
424
425   If necessary, the input *buffer* is copied and transformed into the
426   canonical representation.  For example, if the *buffer* is a UCS4 string
427   (:c:macro:`PyUnicode_4BYTE_KIND`) and it consists only of codepoints in
428   the UCS1 range, it will be transformed into UCS1
429   (:c:macro:`PyUnicode_1BYTE_KIND`).
430
431   .. versionadded:: 3.3
432
433
434.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
435
436   Create a Unicode object from the char buffer *u*.  The bytes will be
437   interpreted as being UTF-8 encoded.  The buffer is copied into the new
438   object. If the buffer is not ``NULL``, the return value might be a shared
439   object, i.e. modification of the data is not allowed.
440
441   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
442   with the buffer set to ``NULL``.  This usage is deprecated in favor of
443   :c:func:`PyUnicode_New`, and will be removed in Python 3.12.
444
445
446.. c:function:: PyObject *PyUnicode_FromString(const char *u)
447
448   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
449   *u*.
450
451
452.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)
453
454   Take a C :c:func:`printf`\ -style *format* string and a variable number of
455   arguments, calculate the size of the resulting Python Unicode string and return
456   a string with the values formatted into it.  The variable arguments must be C
457   types and must correspond exactly to the format characters in the *format*
458   ASCII-encoded string. The following format characters are allowed:
459
460   .. % This should be exactly the same as the table in PyErr_Format.
461   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
462   .. % because not all compilers support the %z width modifier -- we fake it
463   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
464   .. % Similar comments apply to the %ll width modifier and
465
466   .. tabularcolumns:: |l|l|L|
467
468   +-------------------+---------------------+----------------------------------+
469   | Format Characters | Type                | Comment                          |
470   +===================+=====================+==================================+
471   | :attr:`%%`        | *n/a*               | The literal % character.         |
472   +-------------------+---------------------+----------------------------------+
473   | :attr:`%c`        | int                 | A single character,              |
474   |                   |                     | represented as a C int.          |
475   +-------------------+---------------------+----------------------------------+
476   | :attr:`%d`        | int                 | Equivalent to                    |
477   |                   |                     | ``printf("%d")``. [1]_           |
478   +-------------------+---------------------+----------------------------------+
479   | :attr:`%u`        | unsigned int        | Equivalent to                    |
480   |                   |                     | ``printf("%u")``. [1]_           |
481   +-------------------+---------------------+----------------------------------+
482   | :attr:`%ld`       | long                | Equivalent to                    |
483   |                   |                     | ``printf("%ld")``. [1]_          |
484   +-------------------+---------------------+----------------------------------+
485   | :attr:`%li`       | long                | Equivalent to                    |
486   |                   |                     | ``printf("%li")``. [1]_          |
487   +-------------------+---------------------+----------------------------------+
488   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
489   |                   |                     | ``printf("%lu")``. [1]_          |
490   +-------------------+---------------------+----------------------------------+
491   | :attr:`%lld`      | long long           | Equivalent to                    |
492   |                   |                     | ``printf("%lld")``. [1]_         |
493   +-------------------+---------------------+----------------------------------+
494   | :attr:`%lli`      | long long           | Equivalent to                    |
495   |                   |                     | ``printf("%lli")``. [1]_         |
496   +-------------------+---------------------+----------------------------------+
497   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
498   |                   |                     | ``printf("%llu")``. [1]_         |
499   +-------------------+---------------------+----------------------------------+
500   | :attr:`%zd`       | :c:type:`\          | Equivalent to                    |
501   |                   | Py_ssize_t`         | ``printf("%zd")``. [1]_          |
502   +-------------------+---------------------+----------------------------------+
503   | :attr:`%zi`       | :c:type:`\          | Equivalent to                    |
504   |                   | Py_ssize_t`         | ``printf("%zi")``. [1]_          |
505   +-------------------+---------------------+----------------------------------+
506   | :attr:`%zu`       | size_t              | Equivalent to                    |
507   |                   |                     | ``printf("%zu")``. [1]_          |
508   +-------------------+---------------------+----------------------------------+
509   | :attr:`%i`        | int                 | Equivalent to                    |
510   |                   |                     | ``printf("%i")``. [1]_           |
511   +-------------------+---------------------+----------------------------------+
512   | :attr:`%x`        | int                 | Equivalent to                    |
513   |                   |                     | ``printf("%x")``. [1]_           |
514   +-------------------+---------------------+----------------------------------+
515   | :attr:`%s`        | const char\*        | A null-terminated C character    |
516   |                   |                     | array.                           |
517   +-------------------+---------------------+----------------------------------+
518   | :attr:`%p`        | const void\*        | The hex representation of a C    |
519   |                   |                     | pointer. Mostly equivalent to    |
520   |                   |                     | ``printf("%p")`` except that     |
521   |                   |                     | it is guaranteed to start with   |
522   |                   |                     | the literal ``0x`` regardless    |
523   |                   |                     | of what the platform's           |
524   |                   |                     | ``printf`` yields.               |
525   +-------------------+---------------------+----------------------------------+
526   | :attr:`%A`        | PyObject\*          | The result of calling            |
527   |                   |                     | :func:`ascii`.                   |
528   +-------------------+---------------------+----------------------------------+
529   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
530   +-------------------+---------------------+----------------------------------+
531   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
532   |                   | const char\*        | ``NULL``) and a null-terminated  |
533   |                   |                     | C character array as a second    |
534   |                   |                     | parameter (which will be used,   |
535   |                   |                     | if the first parameter is        |
536   |                   |                     | ``NULL``).                       |
537   +-------------------+---------------------+----------------------------------+
538   | :attr:`%S`        | PyObject\*          | The result of calling            |
539   |                   |                     | :c:func:`PyObject_Str`.          |
540   +-------------------+---------------------+----------------------------------+
541   | :attr:`%R`        | PyObject\*          | The result of calling            |
542   |                   |                     | :c:func:`PyObject_Repr`.         |
543   +-------------------+---------------------+----------------------------------+
544
545   An unrecognized format character causes all the rest of the format string to be
546   copied as-is to the result string, and any extra arguments discarded.
547
548   .. note::
549      The width formatter unit is number of characters rather than bytes.
550      The precision formatter unit is number of bytes for ``"%s"`` and
551      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and a number of
552      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
553      (if the ``PyObject*`` argument is not ``NULL``).
554
555   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
556      zu, i, x): the 0-conversion flag has effect even when a precision is given.
557
558   .. versionchanged:: 3.2
559      Support for ``"%lld"`` and ``"%llu"`` added.
560
561   .. versionchanged:: 3.3
562      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.
563
564   .. versionchanged:: 3.4
565      Support width and precision formatter for ``"%s"``, ``"%A"``, ``"%U"``,
566      ``"%V"``, ``"%S"``, ``"%R"`` added.
567
568
569.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
570
571   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
572   arguments.
573
574
575.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)
576
577   Copy an instance of a Unicode subtype to a new true Unicode object if
578   necessary. If *obj* is already a true Unicode object (not a subtype),
579   return the reference with incremented refcount.
580
581   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.
582
583
584.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
585                               const char *encoding, const char *errors)
586
587   Decode an encoded object *obj* to a Unicode object.
588
589   :class:`bytes`, :class:`bytearray` and other
590   :term:`bytes-like objects <bytes-like object>`
591   are decoded according to the given *encoding* and using the error handling
592   defined by *errors*. Both can be ``NULL`` to have the interface use the default
593   values (see :ref:`builtincodecs` for details).
594
595   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
596   set.
597
598   The API returns ``NULL`` if there was an error.  The caller is responsible for
599   decref'ing the returned objects.
600
601
602.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)
603
604   Return the length of the Unicode object, in code points.
605
606   .. versionadded:: 3.3
607
608
609.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
610                                                    Py_ssize_t to_start, \
611                                                    PyObject *from, \
612                                                    Py_ssize_t from_start, \
613                                                    Py_ssize_t how_many)
614
   Copy characters from one Unicode object into another.  This function performs
   character conversion when necessary and uses :c:func:`memcpy` when
   possible.  Returns ``-1`` and sets an exception on error, otherwise returns
   the number of copied characters.
619
620   .. versionadded:: 3.3
621
622
623.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
624                        Py_ssize_t length, Py_UCS4 fill_char)
625
626   Fill a string with a character: write *fill_char* into
627   ``unicode[start:start+length]``.
628
629   Fail if *fill_char* is bigger than the string maximum character, or if the
630   string has more than 1 reference.
631
   Return the number of written characters, or return ``-1`` and raise an
   exception on error.
634
635   .. versionadded:: 3.3
636
637
638.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
639                                        Py_UCS4 character)
640
641   Write a character to a string.  The string must have been created through
642   :c:func:`PyUnicode_New`.  Since Unicode strings are supposed to be immutable,
643   the string must not be shared, or have been hashed yet.
644
   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).
648
649   .. versionadded:: 3.3
650
651
652.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)
653
654   Read a character from a string.  This function checks that *unicode* is a
655   Unicode object and the index is not out of bounds, in contrast to
656   :c:func:`PyUnicode_READ_CHAR`, which performs no error checking.
657
658   .. versionadded:: 3.3
659
660
661.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
662                                              Py_ssize_t end)
663
664   Return a substring of *str*, from character index *start* (included) to
665   character index *end* (excluded).  Negative indices are not supported.
666
667   .. versionadded:: 3.3
668
669
670.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
671                                          Py_ssize_t buflen, int copy_null)
672
673   Copy the string *u* into a UCS4 buffer, including a null character, if
674   *copy_null* is set.  Returns ``NULL`` and sets an exception on error (in
675   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
676   *u*).  *buffer* is returned on success.
677
678   .. versionadded:: 3.3
679
680
681.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)
682
683   Copy the string *u* into a new UCS4 buffer that is allocated using
684   :c:func:`PyMem_Malloc`.  If this fails, ``NULL`` is returned with a
685   :exc:`MemoryError` set.  The returned buffer always has an extra
686   null code point appended.
687
688   .. versionadded:: 3.3
689
690
691Deprecated Py_UNICODE APIs
692""""""""""""""""""""""""""
693
694.. deprecated-removed:: 3.3 3.12
695
These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them until their removal in Python 3.12,
but need to be aware that their use can now cause performance and memory hits.
699
700
701.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
702
703   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
704   may be ``NULL`` which causes the contents to be undefined. It is the user's
705   responsibility to fill in the needed data.  The buffer is copied into the new
706   object.
707
708   If the buffer is not ``NULL``, the return value might be a shared object.
709   Therefore, modification of the resulting Unicode object is only allowed when
710   *u* is ``NULL``.
711
712   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
713   string content has been filled before using any of the access macros such as
714   :c:func:`PyUnicode_KIND`.
715
716   .. deprecated-removed:: 3.3 3.12
717      Part of the old-style Unicode API, please migrate to using
718      :c:func:`PyUnicode_FromKindAndData`, :c:func:`PyUnicode_FromWideChar`, or
719      :c:func:`PyUnicode_New`.
720
721
722.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
723
724   Return a read-only pointer to the Unicode object's internal
725   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
726   :c:expr:`Py_UNICODE*` representation of the object if it is not yet
727   available. The buffer is always terminated with an extra null code point.
728   Note that the resulting :c:type:`Py_UNICODE` string may also contain
729   embedded null code points, which would cause the string to be truncated when
730   used in most C functions.
731
732   .. deprecated-removed:: 3.3 3.12
733      Part of the old-style Unicode API, please migrate to using
734      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
735      :c:func:`PyUnicode_ReadChar` or similar new APIs.
736
737
738.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)
739
   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
742   Note that the resulting :c:expr:`Py_UNICODE*` string
743   may contain embedded null code points, which would cause the string to be
744   truncated when used in most C functions.
745
746   .. versionadded:: 3.3
747
748   .. deprecated-removed:: 3.3 3.12
749      Part of the old-style Unicode API, please migrate to using
750      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
751      :c:func:`PyUnicode_ReadChar` or similar new APIs.
752
753
754.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
755
756   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
757   code units (this includes surrogate pairs as 2 units).
758
759   .. deprecated-removed:: 3.3 3.12
760      Part of the old-style Unicode API, please migrate to using
761      :c:func:`PyUnicode_GET_LENGTH`.
762
763
764Locale Encoding
765"""""""""""""""
766
767The current locale encoding can be used to decode text from the operating
768system.
769
770.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
771                                                        Py_ssize_t len, \
772                                                        const char *errors)
773
774   Decode a string from UTF-8 on Android and VxWorks, or from the current
775   locale encoding on other platforms. The supported
776   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses the ``"strict"`` error handler if
778   *errors* is ``NULL``.  *str* must end with a null character but
779   cannot contain embedded null characters.
780
781   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
782   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
783   Python startup).
784
785   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.
786
787   .. seealso::
788
789      The :c:func:`Py_DecodeLocale` function.
790
791   .. versionadded:: 3.3
792
793   .. versionchanged:: 3.7
794      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_DecodeLocale` was used for the ``surrogateescape`` error
      handler, and the current locale encoding was used for ``strict``.
798
799
800.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)
801
802   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
803   length using :c:func:`strlen`.
804
805   .. versionadded:: 3.3
806
807
808.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)
809
810   Encode a Unicode object to UTF-8 on Android and VxWorks, or to the current
811   locale encoding on other platforms. The
812   supported error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The encoder uses the ``"strict"`` error handler if
814   *errors* is ``NULL``. Return a :class:`bytes` object. *unicode* cannot
815   contain embedded null characters.
816
817   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
818   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
819   Python startup).
820
821   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.
822
823   .. seealso::
824
825      The :c:func:`Py_EncodeLocale` function.
826
827   .. versionadded:: 3.3
828
829   .. versionchanged:: 3.7
830      The function now also uses the current locale encoding for the
831      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_EncodeLocale` was used for the ``surrogateescape`` error
      handler, and the current locale encoding was used for ``strict``.
835
836
837File System Encoding
838""""""""""""""""""""
839
840To encode and decode file names and other environment strings,
841:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
842:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
843(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
844argument parsing, the ``"O&"`` converter should be used, passing
845:c:func:`PyUnicode_FSConverter` as the conversion function:
846
847.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)
848
849   ParseTuple converter: encode :class:`str` objects -- obtained directly or
850   through the :class:`os.PathLike` interface -- to :class:`bytes` using
851   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
   *result* must be the address of a :c:expr:`PyBytesObject*` variable; on
   success it receives a new reference, which must be released when it is no
   longer used.
854
855   .. versionadded:: 3.1
856
857   .. versionchanged:: 3.6
858      Accepts a :term:`path-like object`.
859
860To decode file names to :class:`str` during argument parsing, the ``"O&"``
861converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
862conversion function:
863
864.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)
865
866   ParseTuple converter: decode :class:`bytes` objects -- obtained either
867   directly or indirectly through the :class:`os.PathLike` interface -- to
868   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
   objects are output as-is. *result* must be the address of a
   :c:expr:`PyUnicodeObject*` variable; on success it receives a new
   reference, which must be released when it is no longer used.
871
872   .. versionadded:: 3.2
873
874   .. versionchanged:: 3.6
875      Accepts a :term:`path-like object`.
876
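Both converters are called the same way through ``"O&"``.  As a minimal
sketch, a function that accepts a path argument might look like the
following.  ``path_byte_length`` is a hypothetical extension function used
only for illustration; it is not part of the C API.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical extension function: return the number of bytes in the
      file-system encoding of the path passed as the only argument. */
   static PyObject *
   path_byte_length(PyObject *self, PyObject *args)
   {
       PyObject *encoded = NULL;   /* receives a new reference (bytes) */

       if (!PyArg_ParseTuple(args, "O&", PyUnicode_FSConverter, &encoded)) {
           return NULL;            /* the converter has set an exception */
       }
       Py_ssize_t size = PyBytes_GET_SIZE(encoded);
       Py_DECREF(encoded);         /* release the converted object */
       return PyLong_FromSsize_t(size);
   }

Because the converter stores a new reference in the variable, the sketch
releases it with :c:func:`Py_DECREF` as soon as the size has been read.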
877
878.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
879
880   Decode a string from the :term:`filesystem encoding and error handler`.
881
882   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
883   locale encoding.
884
885   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
886   locale encoding and cannot be modified later. If you need to decode a string
887   from the current locale encoding, use
888   :c:func:`PyUnicode_DecodeLocaleAndSize`.
889
890   .. seealso::
891
892      The :c:func:`Py_DecodeLocale` function.
893
894   .. versionchanged:: 3.6
895      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
896
897
898.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
899
900   Decode a null-terminated string from the :term:`filesystem encoding and
901   error handler`.
902
903   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
904   locale encoding.
905
906   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.
907
908   .. versionchanged:: 3.6
909      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
910
911
912.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)
913
914   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
915   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
916   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
917   null bytes.
918
919   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
920   locale encoding.
921
922   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
923   locale encoding and cannot be modified later. If you need to encode a string
924   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.
925
926   .. seealso::
927
928      The :c:func:`Py_EncodeLocale` function.
929
930   .. versionadded:: 3.2
931
932   .. versionchanged:: 3.6
933      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
934
935wchar_t Support
936"""""""""""""""
937
938:c:expr:`wchar_t` support for platforms which support it:
939
940.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
941
942   Create a Unicode object from the :c:expr:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute the length,
   using ``wcslen()``.
945   Return ``NULL`` on failure.
946
947
948.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)
949
950   Copy the Unicode object contents into the :c:expr:`wchar_t` buffer *w*.  At most
951   *size* :c:expr:`wchar_t` characters are copied (excluding a possibly trailing
952   null termination character).  Return the number of :c:expr:`wchar_t` characters
953   copied or ``-1`` in case of an error.  Note that the resulting :c:expr:`wchar_t*`
954   string may or may not be null-terminated.  It is the responsibility of the caller
955   to make sure that the :c:expr:`wchar_t*` string is null-terminated in case this is
956   required by the application. Also, note that the :c:expr:`wchar_t*` string
957   might contain null characters, which would cause the string to be truncated
958   when used with most C functions.
959
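   Since termination is left to the caller, one possible approach is to
   reserve the last element of the buffer for the terminator.  The following
   is only a sketch; ``copy_to_wide_buffer`` and its parameters are
   hypothetical names, not part of the C API.

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>
      #include <wchar.h>

      /* Hypothetical helper: copy *unicode* into buf, which has room for
         buflen wchar_t elements (buflen >= 1), and always null-terminate it.
         Return the number of characters copied, or -1 with an exception set. */
      static Py_ssize_t
      copy_to_wide_buffer(PyObject *unicode, wchar_t *buf, Py_ssize_t buflen)
      {
          /* Ask for at most buflen - 1 characters to leave room for L'\0'. */
          Py_ssize_t n = PyUnicode_AsWideChar(unicode, buf, buflen - 1);
          if (n < 0) {
              return -1;
          }
          buf[n] = L'\0';
          return n;
      }

   A string longer than the buffer is silently truncated by this sketch; code
   that must not truncate can use :c:func:`PyUnicode_AsWideCharString` instead.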
960
961.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
962
963   Convert the Unicode object to a wide character string. The output string
964   always ends with a null character. If *size* is not ``NULL``, write the number
965   of wide characters (excluding the trailing null termination character) into
966   *\*size*. Note that the resulting :c:expr:`wchar_t` string might contain
967   null characters, which would cause the string to be truncated when used with
   most C functions. If *size* is ``NULL`` and the :c:expr:`wchar_t*` string
   contains null characters, a :exc:`ValueError` is raised.

   Returns a buffer allocated by :c:macro:`PyMem_New` (use
   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.
975
976   .. versionadded:: 3.2
977
978   .. versionchanged:: 3.7
979      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:expr:`wchar_t*`
980      string contains null characters.
981
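   A minimal sketch of the allocate, use and free pattern described above.
   ``wide_length`` is a hypothetical helper name chosen for illustration.

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>
      #include <wchar.h>

      /* Hypothetical helper: return the number of wide characters needed to
         represent *unicode*, or -1 with an exception set. */
      static Py_ssize_t
      wide_length(PyObject *unicode)
      {
          Py_ssize_t size;
          wchar_t *wstr = PyUnicode_AsWideCharString(unicode, &size);
          if (wstr == NULL) {
              return -1;              /* *size is undefined here */
          }
          /* ... wstr could be passed to a wide-character API here ... */
          PyMem_Free(wstr);           /* the caller owns the buffer */
          return size;
      }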
982
983.. _builtincodecs:
984
985Built-in Codecs
986^^^^^^^^^^^^^^^
987
988Python provides a set of built-in codecs which are written in C for speed. All of
989these codecs are directly usable via the following functions.
990
Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the parameters of the same name of the built-in
:func:`str` string object constructor.

Setting *encoding* to ``NULL`` causes the default encoding, UTF-8, to be used.
File system calls should use :c:func:`PyUnicode_FSConverter` for encoding file
names. This uses the variable :c:data:`Py_FileSystemDefaultEncoding`
internally. This variable should be treated as read-only: on some systems, it
will be a pointer to a static string, on others, it will change at run-time
(such as when the application invokes ``setlocale``).

Error handling is set by *errors*, which may also be ``NULL``, meaning that the
default handling defined for the codec is used.  Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).
1006
The codecs all use a similar interface.  For simplicity, only deviations from
the following generic interface are documented.
1009
1010
1011Generic Codecs
1012""""""""""""""
1013
1014These are the generic codec APIs:
1015
1016
1017.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
1018                              const char *encoding, const char *errors)
1019
1020   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
1021   *encoding* and *errors* have the same meaning as the parameters of the same name
1022   in the :func:`str` built-in function.  The codec to be used is looked up
1023   using the Python codec registry.  Return ``NULL`` if an exception was raised by
1024   the codec.
1025
1026
1027.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
1028                              const char *encoding, const char *errors)
1029
   Encode a Unicode object and return the result as a Python bytes object.
1031   *encoding* and *errors* have the same meaning as the parameters of the same
1032   name in the Unicode :meth:`~str.encode` method. The codec to be used is looked up
1033   using the Python codec registry. Return ``NULL`` if an exception was raised by
1034   the codec.
1035
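The two generic functions can be combined to re-encode data from one codec to
another.  The following is a minimal sketch; ``transcode`` is a hypothetical
helper, and the codec names passed to it are looked up in the codec registry
as described above.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical helper: re-encode *size* bytes of *data* from one codec
      to another.  Return a new bytes object, or NULL with an exception set. */
   static PyObject *
   transcode(const char *data, Py_ssize_t size,
             const char *from_encoding, const char *to_encoding)
   {
       /* errors == NULL selects the codec's default ("strict") handling. */
       PyObject *text = PyUnicode_Decode(data, size, from_encoding, NULL);
       if (text == NULL) {
           return NULL;
       }
       PyObject *encoded = PyUnicode_AsEncodedString(text, to_encoding, NULL);
       Py_DECREF(text);
       return encoded;
   }

For example, ``transcode("\xe9", 1, "latin-1", "utf-8")`` decodes a single
Latin-1 byte and encodes the resulting character as two UTF-8 bytes.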
1036
1037UTF-8 Codecs
1038""""""""""""
1039
1040These are the UTF-8 codec APIs:
1041
1042
1043.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
1044
1045   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
1046   *s*. Return ``NULL`` if an exception was raised by the codec.
1047
1048
1049.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
1050                              const char *errors, Py_ssize_t *consumed)
1051
1052   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
1053   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
1054   treated as an error. Those bytes will not be decoded and the number of bytes
1055   that have been decoded will be stored in *consumed*.
1056
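   This is useful when input arrives in chunks that may split a multi-byte
   sequence.  A minimal sketch, assuming the caller keeps the undecoded bytes
   and prepends them to the next chunk (``decode_utf8_chunk`` is a
   hypothetical helper):

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: decode as much of *chunk* as possible.  On
         success, *leftover is set to the number of trailing bytes that were
         not decoded.  Return NULL with an exception set on error. */
      static PyObject *
      decode_utf8_chunk(const char *chunk, Py_ssize_t size, Py_ssize_t *leftover)
      {
          Py_ssize_t consumed = 0;
          PyObject *text = PyUnicode_DecodeUTF8Stateful(chunk, size, NULL,
                                                        &consumed);
          if (text == NULL) {
              return NULL;
          }
          *leftover = size - consumed;
          return text;
      }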
1057
1058.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
1059
   Encode a Unicode object using UTF-8 and return the result as a Python bytes
1061   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1062   raised by the codec.
1063
1064
1065.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
1066
1067   Return a pointer to the UTF-8 encoding of the Unicode object, and
1068   store the size of the encoded representation (in bytes) in *size*.  The
1069   *size* argument can be ``NULL``; in this case no size will be stored.  The
1070   returned buffer always has an extra null byte appended (not included in
1071   *size*), regardless of whether there are any other null code points.
1072
1073   In the case of an error, ``NULL`` is returned with an exception set and no
1074   *size* is stored.
1075
1076   This caches the UTF-8 representation of the string in the Unicode object, and
1077   subsequent calls will return a pointer to the same buffer.  The caller is not
1078   responsible for deallocating the buffer. The buffer is deallocated and
1079   pointers to it become invalid when the Unicode object is garbage collected.
1080
1081   .. versionadded:: 3.3
1082
1083   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.
1085
1086   .. versionchanged:: 3.10
1087      This function is a part of the :ref:`limited API <stable>`.
1088
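   Because the returned buffer belongs to the Unicode object, code that needs
   the bytes for longer than the object lives has to copy them.  A minimal
   sketch (``copy_utf8`` is a hypothetical helper, and the use of
   :c:func:`PyMem_Malloc` for the copy is an assumption of this example):

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>
      #include <string.h>

      /* Hypothetical helper: return a null-terminated copy of the UTF-8 form
         of *unicode* and store its length in *size.  The caller frees the
         copy with PyMem_Free().  Return NULL with an exception set on error. */
      static char *
      copy_utf8(PyObject *unicode, Py_ssize_t *size)
      {
          Py_ssize_t len;
          const char *utf8 = PyUnicode_AsUTF8AndSize(unicode, &len);
          if (utf8 == NULL) {
              return NULL;
          }
          char *copy = PyMem_Malloc((size_t)len + 1);
          if (copy == NULL) {
              PyErr_NoMemory();
              return NULL;
          }
          /* Use the reported size, not strlen(): the string may contain
             embedded null bytes.  The extra byte is the trailing null. */
          memcpy(copy, utf8, (size_t)len + 1);
          *size = len;
          return copy;
      }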
1089
1090.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)
1091
1092   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
1093
1094   .. versionadded:: 3.3
1095
1096   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.
1098
1099
1100UTF-32 Codecs
1101"""""""""""""
1102
1103These are the UTF-32 codec APIs:
1104
1105
1106.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
1107                              const char *errors, int *byteorder)
1108
1109   Decode *size* bytes from a UTF-32 encoded buffer string and return the
1110   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
1111   handling. It defaults to "strict".
1112
1113   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
1114   order::
1115
1116      *byteorder == -1: little endian
1117      *byteorder == 0:  native order
1118      *byteorder == 1:  big endian
1119
1120   If ``*byteorder`` is zero, and the first four bytes of the input data are a
1121   byte order mark (BOM), the decoder switches to this byte order and the BOM is
1122   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
1123   ``1``, any byte order mark is copied to the output.
1124
   After completion, ``*byteorder`` is set to the current byte order at the end
1126   of input data.
1127
1128   If *byteorder* is ``NULL``, the codec starts in native order mode.
1129
1130   Return ``NULL`` if an exception was raised by the codec.
1131
1132
1133.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
1134                              const char *errors, int *byteorder, Py_ssize_t *consumed)
1135
1136   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
1137   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
1138   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
1139   by four) as an error. Those bytes will not be decoded and the number of bytes
1140   that have been decoded will be stored in *consumed*.
1141
1142
1143.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
1144
   Return a Python bytes object using the UTF-32 encoding in native byte
   order. The string always starts with a BOM.  Error handling is "strict".
1147   Return ``NULL`` if an exception was raised by the codec.
1148
1149
1150UTF-16 Codecs
1151"""""""""""""
1152
1153These are the UTF-16 codec APIs:
1154
1155
1156.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
1157                              const char *errors, int *byteorder)
1158
1159   Decode *size* bytes from a UTF-16 encoded buffer string and return the
1160   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
1161   handling. It defaults to "strict".
1162
1163   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
1164   order::
1165
1166      *byteorder == -1: little endian
1167      *byteorder == 0:  native order
1168      *byteorder == 1:  big endian
1169
1170   If ``*byteorder`` is zero, and the first two bytes of the input data are a
1171   byte order mark (BOM), the decoder switches to this byte order and the BOM is
1172   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
1173   ``1``, any byte order mark is copied to the output (where it will result in
1174   either a ``\ufeff`` or a ``\ufffe`` character).
1175
1176   After completion, ``*byteorder`` is set to the current byte order at the end
1177   of input data.
1178
1179   If *byteorder* is ``NULL``, the codec starts in native order mode.
1180
1181   Return ``NULL`` if an exception was raised by the codec.
1182
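   A minimal sketch of letting a leading BOM select the byte order and then
   reading back the value stored by the decoder.  ``decode_utf16_auto`` is a
   hypothetical helper name.

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: decode UTF-16 data, starting in native order so
         that a BOM, if present, is honored and stripped.  On success,
         *detected receives the byte order reported by the decoder. */
      static PyObject *
      decode_utf16_auto(const char *data, Py_ssize_t size, int *detected)
      {
          int byteorder = 0;
          PyObject *text = PyUnicode_DecodeUTF16(data, size, NULL, &byteorder);
          if (text != NULL) {
              *detected = byteorder;
          }
          return text;
      }

   With the four input bytes ``FF FE 61 00`` (a little-endian BOM followed by
   the letter ``a``), the sketch returns a one-character string and stores
   ``-1`` in ``*detected``.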
1183
1184.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
1185                              const char *errors, int *byteorder, Py_ssize_t *consumed)
1186
1187   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
1188   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
1189   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
1190   split surrogate pair) as an error. Those bytes will not be decoded and the
1191   number of bytes that have been decoded will be stored in *consumed*.
1192
1193
1194.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
1195
   Return a Python bytes object using the UTF-16 encoding in native byte
   order. The string always starts with a BOM.  Error handling is "strict".
1198   Return ``NULL`` if an exception was raised by the codec.
1199
1200
1201UTF-7 Codecs
1202""""""""""""
1203
1204These are the UTF-7 codec APIs:
1205
1206
1207.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
1208
1209   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
1210   *s*.  Return ``NULL`` if an exception was raised by the codec.
1211
1212
1213.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
1214                              const char *errors, Py_ssize_t *consumed)
1215
1216   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`.  If
1217   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
1218   be treated as an error.  Those bytes will not be decoded and the number of
1219   bytes that have been decoded will be stored in *consumed*.
1220
1221
1222Unicode-Escape Codecs
1223"""""""""""""""""""""
1224
1225These are the "Unicode Escape" codec APIs:
1226
1227
1228.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
1229                              Py_ssize_t size, const char *errors)
1230
1231   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
1232   string *s*.  Return ``NULL`` if an exception was raised by the codec.
1233
1234
1235.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
1236
1237   Encode a Unicode object using Unicode-Escape and return the result as a
1238   bytes object.  Error handling is "strict".  Return ``NULL`` if an exception was
1239   raised by the codec.
1240
1241
1242Raw-Unicode-Escape Codecs
1243"""""""""""""""""""""""""
1244
1245These are the "Raw Unicode Escape" codec APIs:
1246
1247
1248.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
1249                              Py_ssize_t size, const char *errors)
1250
1251   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
1252   encoded string *s*.  Return ``NULL`` if an exception was raised by the codec.
1253
1254
1255.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
1256
1257   Encode a Unicode object using Raw-Unicode-Escape and return the result as
1258   a bytes object.  Error handling is "strict".  Return ``NULL`` if an exception
1259   was raised by the codec.
1260
1261
1262Latin-1 Codecs
1263""""""""""""""
1264
1265These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
1266ordinals and only these are accepted by the codecs during encoding.
1267
1268
1269.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
1270
1271   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
1272   *s*.  Return ``NULL`` if an exception was raised by the codec.
1273
1274
1275.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
1276
   Encode a Unicode object using Latin-1 and return the result as a Python bytes
1278   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1279   raised by the codec.
1280
1281
1282ASCII Codecs
1283""""""""""""
1284
1285These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
1286codes generate errors.
1287
1288
1289.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
1290
1291   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
1292   *s*.  Return ``NULL`` if an exception was raised by the codec.
1293
1294
1295.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
1296
   Encode a Unicode object using ASCII and return the result as a Python bytes
1298   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1299   raised by the codec.
1300
1301
1302Character Map Codecs
1303""""""""""""""""""""
1304
1305This codec is special in that it can be used to implement many different codecs
1306(and this is in fact what was done to obtain most of the standard codecs
1307included in the :mod:`encodings` package). The codec uses mappings to encode and
1308decode characters.  The mapping objects provided must support the
1309:meth:`__getitem__` mapping interface; dictionaries and sequences work well.
1310
1311These are the mapping codec APIs:
1312
1313.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
1314                              PyObject *mapping, const char *errors)
1315
   Create a Unicode object by decoding *size* bytes of the encoded string *data*
1317   using the given *mapping* object.  Return ``NULL`` if an exception was raised
1318   by the codec.
1319
   If *mapping* is ``NULL``, Latin-1 decoding will be applied.  Otherwise,
   *mapping* must map bytes ordinals (integers in the range from 0 to 255)
   to Unicode strings, integers (which are then interpreted as Unicode
   ordinals) or ``None``.  Unmapped data bytes (ones which cause a
   :exc:`LookupError`, as well as ones which get mapped to ``None``,
   ``0xFFFE`` or ``'\ufffe'``) are treated as undefined mappings and cause
   an error.
1327
1328
1329.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
1330
1331   Encode a Unicode object using the given *mapping* object and return the
1332   result as a bytes object.  Error handling is "strict".  Return ``NULL`` if an
1333   exception was raised by the codec.
1334
1335   The *mapping* object must map Unicode ordinal integers to bytes objects,
1336   integers in the range from 0 to 255 or ``None``.  Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as those mapped to
   ``None`` are treated as "undefined mapping" and cause an error.
1339
1340
The following codec API is special in that it maps Unicode to Unicode.
1342
1343.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
1344
1345   Translate a string by applying a character mapping table to it and return the
1346   resulting Unicode object. Return ``NULL`` if an exception was raised by the
1347   codec.
1348
1349   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
1350   or ``None`` (causing deletion of the character).
1351
1352   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
1353   and sequences work well.  Unmapped character ordinals (ones which cause a
1354   :exc:`LookupError`) are left untouched and are copied as-is.
1355
   *errors* has the usual meaning for codecs. It may be ``NULL``, which indicates
   that the default error handling should be used.
1358
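   As a minimal sketch, a dictionary that maps one ordinal to ``None`` can be
   used to delete a character.  ``delete_char`` is a hypothetical helper, not
   part of the C API.

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return a copy of *str* with every occurrence of
         the code point *ch* removed, or NULL with an exception set. */
      static PyObject *
      delete_char(PyObject *str, Py_UCS4 ch)
      {
          PyObject *table = PyDict_New();
          if (table == NULL) {
              return NULL;
          }
          PyObject *key = PyLong_FromUnsignedLong(ch);
          /* Mapping an ordinal to None deletes the character. */
          if (key == NULL || PyDict_SetItem(table, key, Py_None) < 0) {
              Py_XDECREF(key);
              Py_DECREF(table);
              return NULL;
          }
          Py_DECREF(key);
          /* Ordinals missing from the table are copied unchanged. */
          PyObject *result = PyUnicode_Translate(str, table, NULL);
          Py_DECREF(table);
          return result;
      }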
1359
1360MBCS codecs for Windows
1361"""""""""""""""""""""""
1362
1363These are the MBCS codec APIs. They are currently only available on Windows and
1364use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
1365DBCS) is a class of encodings, not just one.  The target encoding is defined by
1366the user settings on the machine running the codec.
1367
1368.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
1369
1370   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
1371   Return ``NULL`` if an exception was raised by the codec.
1372
1373
1374.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \
1375                              const char *errors, Py_ssize_t *consumed)
1376
1377   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.
1381
1382
1383.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
1384
   Encode a Unicode object using MBCS and return the result as a Python bytes
1386   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1387   raised by the codec.
1388
1389
1390.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors)
1391
1392   Encode the Unicode object using the specified code page and return a Python
1393   bytes object.  Return ``NULL`` if an exception was raised by the codec. Use
1394   :c:data:`CP_ACP` code page to get the MBCS encoder.
1395
1396   .. versionadded:: 3.3
1397
1398
1399Methods & Slots
1400"""""""""""""""
1401
1402
1403.. _unicodemethodsandslots:
1404
1405Methods and Slot Functions
1406^^^^^^^^^^^^^^^^^^^^^^^^^^
1407
1408The following APIs are capable of handling Unicode objects and strings on input
1409(we refer to them as strings in the descriptions) and return Unicode objects or
1410integers as appropriate.
1411
1412They all return ``NULL`` or ``-1`` if an exception occurs.
1413
1414
1415.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
1416
   Concatenate two strings, giving a new Unicode string.
1418
1419
1420.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
1421
   Split a string, giving a list of Unicode strings.  If *sep* is ``NULL``, splitting
1423   will be done at all whitespace substrings.  Otherwise, splits occur at the given
1424   separator.  At most *maxsplit* splits will be done.  If negative, no limit is
1425   set.  Separators are not included in the resulting list.
1426
1427
1428.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
1429
1430   Split a Unicode string at line breaks, returning a list of Unicode strings.
1431   CRLF is considered to be one line break.  If *keepend* is ``0``, the line break
1432   characters are not included in the resulting strings.
1433
1434
1435.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
1436
1437   Join a sequence of strings using the given *separator* and return the resulting
1438   Unicode string.
1439
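   Together with :c:func:`PyUnicode_Split`, this can be used to normalize
   whitespace.  The following is only a sketch; ``normalize_whitespace`` is a
   hypothetical helper name.

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: collapse every run of whitespace in *s* into a
         single space.  Return a new string, or NULL with an exception set. */
      static PyObject *
      normalize_whitespace(PyObject *s)
      {
          /* sep == NULL splits at whitespace; maxsplit < 0 means no limit. */
          PyObject *words = PyUnicode_Split(s, NULL, -1);
          if (words == NULL) {
              return NULL;
          }
          PyObject *space = PyUnicode_FromString(" ");
          if (space == NULL) {
              Py_DECREF(words);
              return NULL;
          }
          PyObject *result = PyUnicode_Join(space, words);
          Py_DECREF(space);
          Py_DECREF(words);
          return result;
      }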
1440
1441.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
1442                        Py_ssize_t start, Py_ssize_t end, int direction)
1443
1444   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
1445   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
1446   ``0`` otherwise. Return ``-1`` if an error occurred.
1447
1448
1449.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \
1450                               Py_ssize_t start, Py_ssize_t end, int direction)
1451
1452   Return the first position of *substr* in ``str[start:end]`` using the given
1453   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
1454   backward search).  The return value is the index of the first match; a value of
1455   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
1456   occurred and an exception has been set.
1457
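   Since ``-1`` means "not found" here, callers have to test for ``-2`` to
   detect an error.  A minimal sketch (``contains_substring`` is a
   hypothetical helper):

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return 1 if *substr* occurs in *str*, 0 if it
         does not, and -1 with an exception set on error. */
      static int
      contains_substring(PyObject *str, PyObject *substr)
      {
          /* As with slicing, an end past the string length is clipped. */
          Py_ssize_t pos = PyUnicode_Find(str, substr, 0, PY_SSIZE_T_MAX, 1);
          if (pos == -2) {
              return -1;          /* error: an exception has been set */
          }
          return pos >= 0;        /* -1 means "no match" */
      }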
1458
1459.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
1460                               Py_ssize_t start, Py_ssize_t end, int direction)
1461
1462   Return the first position of the character *ch* in ``str[start:end]`` using
1463   the given *direction* (*direction* == ``1`` means to do a forward search,
1464   *direction* == ``-1`` a backward search).  The return value is the index of the
1465   first match; a value of ``-1`` indicates that no match was found, and ``-2``
1466   indicates that an error occurred and an exception has been set.
1467
1468   .. versionadded:: 3.3
1469
1470   .. versionchanged:: 3.7
1471      *start* and *end* are now adjusted to behave like ``str[start:end]``.
1472
1473
1474.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
1475                               Py_ssize_t start, Py_ssize_t end)
1476
1477   Return the number of non-overlapping occurrences of *substr* in
1478   ``str[start:end]``.  Return ``-1`` if an error occurred.
1479
1480
1481.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
1482                              PyObject *replstr, Py_ssize_t maxcount)
1483
1484   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
1485   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
1486   occurrences.
1487
1488
1489.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)
1490
   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal,
   and greater than, respectively.

   This function also returns ``-1`` upon failure.  Because ``-1`` is a valid
   comparison result as well, one should call :c:func:`PyErr_Occurred` to
   check for errors.
1496
1497
1498.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)
1499
1500   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
1501   than, equal, and greater than, respectively. It is best to pass only
1502   ASCII-encoded strings, but the function interprets the input string as
1503   ISO-8859-1 if it contains non-ASCII characters.
1504
1505   This function does not raise exceptions.
1506
1507
.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
1509
1510   Rich compare two Unicode strings and return one of the following:
1511
1512   * ``NULL`` in case an exception was raised
1513   * :const:`Py_True` or :const:`Py_False` for successful comparisons
1514   * :const:`Py_NotImplemented` in case the type combination is unknown
1515
1516   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
1517   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
1518
1519
1520.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
1521
1522   Return a new string object from *format* and *args*; this is analogous to
1523   ``format % args``.
1524
1525
1526.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)
1527
   Check whether *element* is contained in *container*.  Return ``1`` if it
   is, ``0`` if it is not, and ``-1`` with an exception set if there was an
   error.

   *element* must be a Unicode string.  It is searched for as a substring, as
   with ``element in container``, so it may be longer than one character.
1533
1534
1535.. c:function:: void PyUnicode_InternInPlace(PyObject **string)
1536
   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.  If there is an
   existing interned string that is the same as *\*string*, this function sets
   *\*string* to it (decrementing the reference count of the old string object and
   incrementing the reference count of the interned string object); otherwise, it
   leaves *\*string* alone and interns it (incrementing its reference count).
   Despite these reference count adjustments, think of this function as
   reference-count-neutral: you own the object after the call if and only if you
   owned it before the call.
1546
1547
1548.. c:function:: PyObject* PyUnicode_InternFromString(const char *v)
1549
1550   A combination of :c:func:`PyUnicode_FromString` and
1551   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
1552   object that has been interned, or a new ("owned") reference to an earlier
1553   interned string object with the same value.
1554