EBNF | Validity / Comments | |
---|---|---|
unicode_language_id |
|
"root" is treated as a special unicode_language_subtag |
unicode_language_subtag |
= alpha{2,3} | alpha{5,8}; |
validity latest-data |
unicode_script_subtag |
= alpha{4} ; |
validity latest-data |
unicode_region_subtag
| = (alpha{2} | digit{3}) ; |
validity latest-data |
unicode_variant_subtag
| = (alphanum{5,8} | digit alphanum{3}) ; |
validity latest-data |
sep | = [-_] ; | |
digit | = [0-9] ; | |
alpha | = [A-Z a-z] ; | |
alphanum | = [0-9 A-Z a-z] ; |
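
The productions above can be checked mechanically. The following is a minimal sketch, assuming Python regular expressions are an acceptable stand-in for the EBNF; the names `PRODUCTIONS` and `classify_subtag` are illustrative and not part of this specification. The second alternative for variants (a digit followed by three alphanumerics) follows the published grammar. The sketch tests well-formedness only; validity still depends on the latest data.

```python
import re

# Regular-expression renderings of the subtag productions (syntax only).
PRODUCTIONS = {
    "language": re.compile(r"[A-Za-z]{2,3}|[A-Za-z]{5,8}"),
    "script": re.compile(r"[A-Za-z]{4}"),
    "region": re.compile(r"[A-Za-z]{2}|[0-9]{3}"),
    "variant": re.compile(r"[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3}"),
}


def classify_subtag(subtag):
    """Return the names of all productions that the subtag matches in full."""
    return [name for name, pattern in PRODUCTIONS.items()
            if pattern.fullmatch(subtag)]


if __name__ == "__main__":
    for sample in ("uz", "Latn", "419", "1996", "posix"):
        print(sample, classify_subtag(sample))
```

A subtag can match more than one production (for example, a two-letter subtag is syntactically both a language and a region), so the position of the subtag within the identifier is still needed to interpret it.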
Qaag |
Zawgyi | Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration. | |
Qaai |
Inherited | deprecated: the canonicalized form is Zinh | |
Zinh |
Inherited | ||
Zsye |
Emoji Style | Prefer emoji style for characters that have both text and emoji styles available. | |
Zsym |
Text Style | Prefer text style for characters that have both text and emoji styles available. | |
Zxxx |
Unwritten | Indicates spoken or otherwise unwritten content. For example: | |
Sample(s) | Description | ||
---|---|---|---|
uz | either written or spoken content | ||
uz-Latn or uz-Arab | written-only content (particular script) | ||
uz-Zyyy | written-only content (unspecified script) | ||
uz-Zxxx | spoken-only content | ||
uz-Latn, uz-Zxxx | both specific written and spoken content (using a language list) | ||
Zyyy |
Common | ||
Zzzz |
Unknown |
key (old key name) | key description | example type (old type name) | type description |
---|---|---|---|
A Unicode Calendar Identifier
defines a type of calendar. The valid values are those name attribute values in the type elements of key name="ca"
in bcp47/calendar.xml. This selects calendar-specific data within a locale used for formatting and parsing, such as date/time symbols and patterns; it also selects supplemental calendarData used for calendrical calculations. The value can affect the computation of the first day of the week: see First Day Overrides. | |||
"ca" (calendar) |
Calendar algorithm (For information on the calendar algorithms associated with the data used with these, see [Calendars].) |
"buddhist" | Thai Buddhist calendar (same as Gregorian except for the year) |
"chinese" | Traditional Chinese calendar | ||
… | |||
"gregory" (gregorian) |
Gregorian calendar | ||
… | |||
"islamic" | Islamic calendar | ||
"islamic-civil" | Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] - civil epoch) | ||
"islamic-umalqura" | Islamic calendar, Umm al-Qura | ||
… | |||
Note: Some calendar types are represented by two subtags. In such cases, the first subtag specifies a generic calendar type and the second subtag specifies a calendar algorithm variant. The CLDR uses generic calendar types (single subtag types) for tagging data when calendar algorithm variations within a generic calendar type are irrelevant. For example, type "islamic" is used for specifying Islamic calendar formatting data for all Islamic calendar types, including "islamic-civil" and "islamic-umalqura". | |||
A Unicode Currency Format Identifier
defines a style for currency formatting. The valid values are those name attribute values in the type elements of key name="cf" in
bcp47/currency.xml. This selects the specific type of currency formatting pattern within a locale. | |||
"cf" | Currency Format style | "standard" | Negative numbers use the minusSign symbol (the default). |
"account" | Negative numbers use parentheses or equivalent. | ||
A Unicode Collation Identifier defines a type of collation (sort order). The valid values are those name attribute values in the type elements of bcp47/collation.xml. | |||
For information on each collation setting parameter, from ka to vt, see Setting Options | |||
"co" (collation) |
Collation type | "standard" | The default ordering for each language. For root it is based on the [DUCET] (Default Unicode Collation Element Table): see Root Collation. Each other locale is based on that, except for appropriate modifications to certain characters for that language. |
"search" | A special collation type dedicated for string search—it is not used to determine the relative order of two strings, but only to determine whether they should be considered equivalent for the specified strength, using the string search matching rules appropriate for the language. Compared to the normal collator for the language, this may add or remove primary equivalences, may make additional characters ignorable or change secondary equivalences, and may modify contractions to allow matching within them, depending on the desired behavior. For example, in Czech, the distinction between ‘a’ and ‘á’ is secondary for normal collation, but primary for search; a search for ‘a’ should never match ‘á’ and vice versa. A search collator is normally used with strength set to PRIMARY or SECONDARY (should be SECONDARY if using “asymmetric” search as described in the [UCA] section Asymmetric Search). The search collator in root supplies matching rules that are appropriate for most languages (and which are different than the root collation behavior); language-specific search collators may be provided to override the matching rules for a given language as necessary. | ||
Other keywords provide additional choices for certain locales; they only have effect in certain locales. | |||
… | |||
"phonetic" | Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use. | ||
"pinyin" | Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese) | ||
"searchjl" | Special collation type for a modified string search in which a pattern consisting of a sequence of Hangul initial consonants (jamo lead consonants) will match a sequence of Hangul syllable characters whose initial consonants match the pattern. The jamo lead consonants can be represented using conjoining or compatibility jamo. This search collator is best used at SECONDARY strength with an "asymmetric" search as described in the [UCA] section Asymmetric Search and obtained, for example, using ICU4C's usearch facility with attribute USEARCH_ELEMENT_COMPARISON set to value USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that a full Hangul syllable in the search pattern will only match the same syllable in the searched text (instead of matching any syllable with the same initial consonant), while a Hangul initial consonant in the search pattern will match any Hangul syllable in the searched text with the same initial consonant. | ||
… | |||
A Unicode Currency Identifier defines a type of currency. The valid values are those name attribute values in the type elements of key name="cu" in bcp47/currency.xml. | |||
"cu" (currency) |
Currency type | ISO 4217 code, plus others in common use |
Codes consisting of 3 ASCII letters that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. The list of countries and time periods associated with each currency value is available in Supplemental Currency Data, plus the default number of decimals. The XXX code is given a broader interpretation as Unknown or Invalid Currency. |
A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break
(for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of
key name="dx" in bcp47/segmentation.xml. This affects break iteration regardless of locale. | |||
"dx" | Dictionary break script exclusions | unicode_script_subtag values |
|
A Unicode Emoji Presentation Style Identifier specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example <html lang="sr-Latn-u-em-emoji"> . The valid values are those name attribute values in the type elements of key name="em" in bcp47/variant.xml. | |||
"em" | Emoji presentation style | "emoji" | Use an emoji presentation for emoji characters if possible. |
"text" | Use a text presentation for emoji characters if possible. | ||
"default" | Use the default presentation for emoji characters as specified in UTR #51 Presentation Style. | ||
A Unicode First Day Identifier defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental week data for the region (see Part 4 Dates, Week Data). The valid values are those name attribute values in the type elements of key name="fw" in bcp47/calendar.xml. The value can affect the computation of the first day of the week: see First Day Overrides. | |||
"fw" | First day of week | "sun" | Sunday |
"mon" | Monday | ||
… | |||
"sat" | Saturday | ||
A Unicode Hour Cycle Identifier defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data for the region (see Part 4 Dates, Time Data). The valid values are those name attribute values in the type elements of key name="hc" in bcp47/calendar.xml. | |||
"hc" | Hour cycle | "h12" | Hour system using 1–12; corresponds to 'h' in patterns |
"h23" | Hour system using 0–23; corresponds to 'H' in patterns | ||
"h11" | Hour system using 0–11; corresponds to 'K' in patterns | ||
"h24" | Hour system using 1–24; corresponds to 'k' in patterns | ||
A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option. Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml. | |||
"lb" | Line break style | "strict" | CSS level 3 line-break=strict, e.g. treat CJ as NS |
"normal" | CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh | ||
"loose" | CSS level 3 line-break=loose | ||
A Unicode Line Break Word Identifier defines preferred line break word handling behavior corresponding to the CSS level 3 word-break option. Specifying "lw" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "keepall"). The valid values are those name attribute values in the type elements of key name="lw" in bcp47/segmentation.xml. | |||
"lw" | Line break word handling | "normal" | CSS level 3 word-break=normal, normal script/language behavior for midword breaks |
"breakall" | CSS level 3 word-break=break-all, allow midword breaks unless forbidden by lb setting | ||
"keepall" | CSS level 3 word-break=keep-all, prohibit midword breaks except for dictionary breaks | ||
"phrase" | Prioritize keeping natural phrases (of multiple words) together when breaking, used in short text like title and headline | ||
A Unicode Measurement System Identifier defines a preferred measurement system. Specifying "ms" in a locale identifier overrides the default value specified by supplemental measurement system data for the region (see Part 2 General, Measurement System Data). The valid values are those name attribute values in the type elements of key name="ms" in bcp47/measure.xml. The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences. For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences. | |||
"ms" | Measurement system | "metric" | Metric System |
"ussystem" | US System of measurement: feet, pints, etc.; pints are 16oz | ||
"uksystem" | UK System of measurement: feet, pints, etc.; pints are 20oz | ||
A Measurement Unit Preference Override defines an override for measurement unit preference. The valid values are those name attribute values in the type elements of key name="mu" in bcp47/measure.xml. For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences. | |||
"mu" | Measurement unit override | "celsius" | Celsius as temperature unit |
"kelvin" | Kelvin as temperature unit | ||
"fahrenhe" | Fahrenheit as temperature unit | ||
A Unicode Number System Identifier defines a type of number system. The valid values are those name attribute values in the type elements of bcp47/number.xml. | |||
"nu" (numbers) |
Numbering system | Unicode script subtag | Four-letter types indicating the primary numbering system for the corresponding script represented in Unicode. Unless otherwise specified, it is a decimal numbering system using digits [:GeneralCategory=Nd:]. For example, "latn" refers to the ASCII / Western digits 0-9, while "taml" is an algorithmic (non-decimal) numbering system. (The code "tamldec" indicates the "modern Tamil decimal digits".) For more information, see Numbering Systems. |
"arabext" | Extended Arabic-Indic digits ("arab" means the base Arabic-Indic digits) | ||
"armnlow" | Armenian lowercase numerals | ||
… | |||
"roman" | Roman numerals | ||
"romanlow" | Roman lowercase numerals | ||
"tamldec" | Modern Tamil decimal digits | ||
A Region Override specifies an alternate region to use for obtaining certain region-specific default values (those specified by the <rgScope> element), instead of using the region specified by the unicode_region_subtag in the Unicode Language Identifier (or inferred from the unicode_language_subtag). | |||
"rg" | Region Override | "uszzzz" | The value is a unicode_subdivision_id of type “unknown” or “regular”; this consists of a unicode_region_subtag for a regular region (not a macroregion), suffixed either by “zzzz” (case is not significant) to designate the region as a whole, or by a unicode_subdivision_suffix to provide more specificity. For example, “en-GB-u-rg-uszzzz” represents a locale for British English but with region-specific defaults set to US for items such as default currency, default calendar and week data, default time cycle, and default measurement system and unit preferences. The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences. The value can affect the computation of the first day of the week: see First Day Overrides. For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences. |
… | |||
A Unicode Subdivision Identifier defines a regional subdivision used for locales. The valid values are based on the subdivisionContainment element as described in Section 3.6.5 Subdivision Codes. | |||
"sd" | Regional Subdivision | "gbsct" | A unicode_subdivision_id, which is a unicode_region_subtag concatenated with a unicode_subdivision_suffix. For example, gbsct is “gb”+“sct” (where sct represents the subdivision code for Scotland). Thus “en-GB-u-sd-gbsct” represents the language variant “English as used in Scotland”. And both “en-u-sd-usca” and “en-US-u-sd-usca” represent “English as used in California”. See 3.6.5 Subdivision Codes. The value can affect the computation of the first day of the week: see First Day Overrides. |
… | |||
A Unicode Sentence Break Suppressions Identifier defines a set of data to be used for suppressing certain sentence breaks that would otherwise be found by UAX #29 rules. The valid values are those name attribute values in the type elements of key name="ss" in bcp47/segmentation.xml. | |||
"ss" | Sentence break suppressions | "none" | Don’t use sentence break suppressions data (the default). |
"standard" | Use sentence break suppressions data of type "standard" | ||
A Unicode Timezone Identifier defines a timezone. The valid values are those name attribute values in the type elements of bcp47/timezone.xml. | |||
"tz" (timezone) |
Time zone | Unicode short time zone IDs | Short identifiers defined in terms of a TZ time zone database [Olson] identifier in the common/bcp47/timezone.xml file, plus a few extra values. For more information, see Time Zone Identifiers. CLDR provides data for normalizing timezone codes. |
A Unicode Variant Identifier defines a special variant used for locales. The valid values are those name attribute values in the type elements of bcp47/variant.xml. | |||
"va" | Common variant type | "posix" | POSIX style locale variant. About handling of the "POSIX" variant see Legacy Variants. |
On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his little house, No. 19 Königstrasse, one of the oldest streets in the oldest portion of the city of Hamburg… | Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock, revint précipitamment vers sa petite maison située au numéro 19 de Königstrasse, l’une des plus anciennes rues du vieux quartier de Hambourg… |
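
The key and type pairs listed in the key/type table above (for example “en-GB-u-rg-uszzzz” or “sr-Latn-u-em-emoji”) can be pulled out of a locale identifier with a few lines of code. The following is a minimal sketch, not a conformant parser: the function names are hypothetical, attributes and private-use subtags are not handled, and the subdivision suffix is assumed here to be one to four alphanumerics.

```python
import re


def parse_u_keywords(locale_id):
    """Return {key: type} for the -u- extension of a locale identifier."""
    subtags = re.split(r"[-_]", locale_id.lower())
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break                          # another singleton ends the -u- extension
        if len(subtag) == 2:
            key = subtag                   # a two-character subtag starts a new key
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)   # longer subtags extend the current type
    return {k: "-".join(v) for k, v in keywords.items()}


def split_subdivision_id(value):
    """Split a value such as 'gbsct' or 'uszzzz' into (region, suffix)."""
    match = re.fullmatch(r"([a-z]{2}|[0-9]{3})([0-9a-z]{1,4})", value.lower())
    if match is None:
        raise ValueError(f"not a subdivision id: {value!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_u_keywords("ar-u-ca-islamic-civil-nu-arabext"))
    print(split_subdivision_id("gbsct"))
```

Types made of two subtags, such as "islamic-civil", come back joined with a hyphen, which matches how they are written in the table.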
Lookup Type | Example | Comments |
---|---|---|
Resource bundle lookup |
se-FI → se → default‑locale* → root |
* The default-locale may have its own inheritance chain; for example, it may be "en-GB → en". In that case, the chain is expanded by inserting the chain, resulting in:
se-FI → se → en-GB → en → root |
Inherited item lookup |
se-FI+key → se+key → root_alias*+key → root+key |
* If there is a root_alias to another key or locale, then insert that entire chain. For example, suppose that months for another calendar system have a root alias to Gregorian months. In that case, the root alias would change the key, and retry from se-FI downward. This can happen multiple times.
se-FI+key → |
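
The resource bundle lookup chain described above can be sketched as follows. This is an illustration only, assuming simple truncation of subtags and a caller-supplied default locale; the function names are hypothetical, and explicit parent data and root aliases are deliberately left out.

```python
def truncation_chain(locale):
    """Yield the locale and each shorter form obtained by dropping the last subtag."""
    parts = locale.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def bundle_lookup_chain(locale, default_locale):
    """Requested locale chain, then the default locale's chain, then root."""
    chain = truncation_chain(locale)
    for candidate in truncation_chain(default_locale):
        if candidate not in chain:       # avoid visiting the same bundle twice
            chain.append(candidate)
    chain.append("root")
    return chain


if __name__ == "__main__":
    print(" → ".join(bundle_lookup_chain("se-FI", "en-GB")))
```

With a default locale of "en-GB", the chain for "se-FI" visits the default locale's own chain before falling back to root.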
User Input | Lookup in Locale | For | Comment |
---|---|---|---|
de_CH no keyword | de_CH | default collation type | finds "B" |
de_CH | collation type=B | not found | |
de | collation type=B | found | |
de no keyword | de | default collation type | not found |
root | default collation type | finds "standard" | |
de | collation type=standard | not found | |
root | collation type=standard | found | |
de_u_co_A | de | collation type=A | found |
de_u_co_standard | de | collation type=standard | not found |
root | collation type=standard | found | |
de_u_co_foobar | de | collation type=foobar | not found |
root | collation type=foobar | not found, starts looking for default | |
de | default collation type | not found | |
root | default collation type | finds "standard" | |
de | collation type=standard | not found | |
root | collation type=standard | found |
User Input | Lookup in Locale | For | Comment |
---|---|---|---|
de_CH_u_co_search | de_CH | collation type=search | not found |
de | collation type=search | found | |
en_US_u_co_search | en_US | collation type=search | not found |
en | collation type=search | not found | |
root | collation type=search | found |
User Input | Lookup in Locale | For | Comment |
---|---|---|---|
zh_Hant no keyword | zh_Hant | default collation type | finds "stroke" |
zh_Hant | collation type=stroke | not found | |
zh | collation type=stroke | found | |
zh_Hant_HK_u_co_pinyin | zh_Hant_HK | collation type=pinyin | not found |
zh_Hant | collation type=pinyin | not found | |
zh | collation type=pinyin | found | |
zh no keyword | zh | default collation type | finds "pinyin" |
zh | collation type=pinyin | found |
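
The three lookup tables above follow one procedure: look for the requested type along the locale chain; if it is not found (or none was requested), find the default collation type along the chain and then look for that type, again starting from the requested locale. The following is a minimal sketch of that procedure. The `DATA` dictionary is a toy stand-in that only mirrors the rows of the tables; it is not real CLDR data, and the function names are hypothetical.

```python
ROOT = "root"

# Toy data mirroring the tables above; not actual CLDR content.
DATA = {
    "root": {"default": "standard", "types": {"standard", "search"}},
    "de": {"types": {"A", "B", "search"}},
    "de_CH": {"default": "B", "types": set()},
    "zh": {"default": "pinyin", "types": {"pinyin", "stroke"}},
    "zh_Hant": {"default": "stroke", "types": set()},
}


def chain(locale):
    """Yield locale, its truncated parents, and finally root."""
    while locale != ROOT:
        yield locale
        head, sep, _ = locale.rpartition("_")
        locale = head if sep else ROOT
    yield ROOT


def find_type(data, locale, ctype):
    """Return the first locale in the chain that defines the collation type."""
    for loc in chain(locale):
        if ctype in data.get(loc, {}).get("types", ()):
            return loc
    return None


def find_default(data, locale):
    """Return the first default collation type found along the chain."""
    for loc in chain(locale):
        default = data.get(loc, {}).get("default")
        if default is not None:
            return default
    return None


def resolve_collation(data, locale, requested=None):
    """Return (locale that supplies the rules, collation type used)."""
    if requested is not None:
        found = find_type(data, locale, requested)
        if found is not None:
            return found, requested
    default = find_default(data, locale)   # fall back to the default type
    return find_type(data, locale, default), default


if __name__ == "__main__":
    print(resolve_collation(DATA, "de_CH"))
    print(resolve_collation(DATA, "de", "foobar"))
```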
Inheritance | Part of the internal mechanism used by CLDR to organize and manage locale data. This is used to share common resources, and ease maintenance, and provide the best fallback behavior in the absence of data. Should not be used for locale matching or likely subtags. | |
---|---|---|
Example: | parent(en_AU) ⇒ en_001 parent(en_001) ⇒ en parent(en) ⇒ root | |
Data: | supplementalData.xml <parentLocale> | |
Spec: | Section 4.2 Inheritance and Validity | |
DefaultContent | Part of the internal mechanism used by CLDR to manage locale data. A particular sublocale is designated the defaultContent for a parent, so that the parent exhibits consistent behavior. Should not be used for locale matching or likely subtags. | |
Example: | addLikelySubtags(sr-ME) ⇒ sr-Latn-ME, minimize(de-Latn-DE) ⇒ de | |
Data: | supplementalMetadata.xml <defaultContent> | |
Spec: | Part 6: Section 9.3 Default Content | |
LikelySubtags | Provides most likely full subtag (script and region) in the absence of other information. A core component of LocaleMatching. | |
Example: | addLikelySubtags(zh) ⇒ zh-Hans-CN addLikelySubtags(zh-TW) ⇒ zh-Hant-TW addLikelySubtags(zh-Hant) ⇒ zh-Hant-TW minimize(zh-Hans-CN, favorRegion|favorScript) ⇒ zh minimize(zh-Hant-TW, favorRegion) ⇒ zh-TW minimize(zh-Hant-TW, favorScript) ⇒ zh-Hant | |
Data: | likelySubtags.xml <likelySubtags> | |
Spec: | Section 4.3 Likely Subtags | |
LocaleMatching | Provides the best match for the user’s language(s) among an application’s supported languages. | |
Example: | bestLocale(userLangs=<en, fr>, appLangs=<fr-CA, ru>) ⇒ fr-CA | |
Data: | languageInfo.xml <languageMatching> | |
Spec: | Section 4.4 Language Matching |
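
The Inheritance example above (en_AU ⇒ en_001 ⇒ en ⇒ root) shows that a parent is not always obtained by truncation. The following is a small sketch of how an explicit parent mapping might be consulted before truncating. The `PARENT_LOCALES` dictionary contains only the one entry taken from the example and stands in for the `<parentLocale>` data; the function names are hypothetical.

```python
# Illustrative subset only: the single explicit parent from the example above.
PARENT_LOCALES = {"en_AU": "en_001"}


def parent(locale):
    """Return the parent locale, preferring explicit data over truncation."""
    if locale == "root":
        return None
    if locale in PARENT_LOCALES:
        return PARENT_LOCALES[locale]
    head, sep, _ = locale.rpartition("_")
    return head if sep else "root"


def inheritance_chain(locale):
    """List the locale followed by all of its ancestors up to root."""
    result = []
    while locale is not None:
        result.append(locale)
        locale = parent(locale)
    return result


if __name__ == "__main__":
    print(" ⇒ ".join(inheritance_chain("en_AU")))
```

As the table notes, this chain is an internal data-organization mechanism and is not a substitute for locale matching or likely subtags.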
Root | de | Resolved |
---|---|---|
```xml
|
```xml
|
```xml
|
```xml
|
```xml
|
```xml
|

| Nonterminal | Production | Examples |
| --- | --- | --- |
| `unicodeSet` |= prop<br/>\| '\[' '^'? s '-'? s seq\* \[\\$ \\-\]? s '\]'<br/>\| var| \\p\{x=y\} |
| `seq` |= unicodeSet \(s \[\\&\\\-\] s unicodeSet\)\* s<br/>\| range s| \[abc\]\-\[cde\], a |
| `range` |= element \('\-' element\)? | a, a\-c, \{abc\}, a\-\{z\}<br/>_note: in ranges, elements must resolve to exactly one code point._ |
| `element` |= char \| string \| var| %, b, \{hello\}, \{\}, \\x\{61 62\} |
| `prop` |= '\\' \[pP\] '\{' propName \(\[≠=\] s pValuePerl\+\)? '\}'<br/>\| '\[:' '^'? propName \(\[≠=\] s pValuePosix\+\)? ':\]'| \\p\{x=y\}, \[:x=y:\] |
| `propName` |= s \[A\-Za\-z0\-9\] \[A\-Za\-z0\-9\_\\x20\]\* s| General\_Category,<br/>General Category |
| `pValuePerl` |= \[^\\\}\]<br/>\| '\\' quoted| Lm,<br/>\\n,<br/>\\\} |
| `pValuePosix` |= \[^:\]<br/>\| '\\' quoted| Lm,<br/>\\n,<br/>\\: |
| `string` |= '\{' \(s charInString\)\* s '\}'| \{hello\} |
| `char` |= \[^ \\^ \\& \\\- \\\[ \\\] \\\\ \\\{ \\$ \[:Pat_WS:\]\]<br/>\| '\\' quoted| a, b, c, \\n, \\\{, \\$ |
| `charInString` |= \[^ \\\\ \\\} \[:Pat_WS:\]\]<br/>\| '\\' quoted| a, b, c, \\n, \{, $ |
| `quoted` |= 'u' \(hex\{4\} \| bracketedHex\)<br/>\| 'x' \(hex\{2\} \| bracketedHex\)<br/>\| 'U00' \('0' hex\{5\} \| '10' hex\{4\}\)<br/>\| 'N\{' charName '\}'<br/>\| \[\[\\u0000\-\\U00010FFFF\]\-\[uxUN\]\]| n, U0000FFFE, \{, $, \]<br/>_note: lengths are exact_ |
| `charName` |= s \[A\-Za\-z0\-9\] \[\-A\-Za\-z0\-9\_\\x20\]\* s| TIBETAN LETTER \-A |
| `bracketedHex` |= '\{' s hexCodePoint \(sRequired hexCodePoint\)\* s '\}'| \{61 2019 62\}, \{61\} |
| `hexCodePoint` |= hex\{1,5\} \| '10' hex\{4\}| |
| `hex` |= \[0\-9A\-Fa\-f\]| |
| `var` |= '$' \[:XID_Start:\] \[:XID_Continue:\]\*| $a, $elt5 (optional support) |
| `s` |= \[:Pattern_White_Space:\]\*| optional whitespace |
| `sRequired` |= \[:Pattern_White_Space:\]\+| required whitespace |

The following are additional well-formedness and validity constraints:

1. [ wfc: Ranges (**X**-**Y**) are only well-formed in the case that elements **X** and **Y** resolve to single code points. That is, **\[a-b\]** and **\[\{a\}-\{b\}\]** are well-formed because single-codepoint-strings are equivalent to that code point, while **\[a-{bz}\]** and **\[\{ax\}-\{bz\}\]** are ill-formed. ]
2. [ vc: Property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)]. ]

Note also that:

1. Escapes that use multiple code points are equivalent to their flattened representation, i.e., `\x{61 62}` is equivalent to `\x{61}\x{62}`. These can also occur in strings, so **\[\{\\x\{ 061 62 0063\}\}\]** is equivalent to **\[\{abc\}\]**.
2. If **\[…\]** starts with \[:, then it begins a prop, and must also terminate with :\]. Thus **\[:di:\]** is a valid property expression, **\[di:\]** is a 3 code-point set, and **\[:di\]** raises an error.
3. Whitespace is significant when initiating/terminating a POSIX property expression, so **\[ :\]** is syntactically valid and equivalent to **\[\\:\]**.
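
The flattening of multi-code-point escapes noted above can be illustrated with a short sketch. It assumes plain (unescaped) UnicodeSet syntax as input, approximates Pattern_White_Space with the regular-expression class `\s`, and uses the hypothetical function name `flatten_hex_escapes`; each item is checked against the `hexCodePoint` production from the grammar.

```python
import re

# \x{...} with hex digits and whitespace inside the braces.
_BRACKETED = re.compile(r"\\x\{([0-9A-Fa-f\s]*)\}")
# hexCodePoint = hex{1,5} | '10' hex{4}
_HEX_CODE_POINT = re.compile(r"[0-9A-Fa-f]{1,5}|10[0-9A-Fa-f]{4}")


def flatten_hex_escapes(pattern):
    """Rewrite each \\x{h h ...} escape as a sequence of single-code-point escapes."""
    def expand(match):
        code_points = match.group(1).split()
        if not code_points:
            raise ValueError("empty bracketed escape")
        for cp in code_points:
            if not _HEX_CODE_POINT.fullmatch(cp):
                raise ValueError(f"not a hexCodePoint: {cp!r}")
        return "".join(f"\\x{{{cp}}}" for cp in code_points)

    return _BRACKETED.sub(expand, pattern)


if __name__ == "__main__":
    print(flatten_hex_escapes(r"\x{61 62}"))
    print(flatten_hex_escapes(r"[{\x{ 061 62 0063}}]"))
```

The sketch keeps the result in escaped form on purpose, since code points produced by escapes are literals and should not be reinterpreted as syntax characters.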

The syntax characters are listed in the table below:

| Char | Hex | Name | Usage |
| ---- | ------ | -------------------- | ------------------------------------------ |
| $ | U+0024 | DOLLAR SIGN | Equivalent to \\uFFFF when followed by '\]', initiator for variable identifiers otherwise |
| & | U+0026 | AMPERSAND | Intersecting UnicodeSets |
| - | U+002D | HYPHEN-MINUS | Ranges of characters; also set difference. |
| : | U+003A | COLON | POSIX-style property syntax |
| [ | U+005B | LEFT SQUARE BRACKET | Grouping; POSIX property syntax |
| ] | U+005D | RIGHT SQUARE BRACKET | Grouping; POSIX property syntax |
| \\ | U+005C | REVERSE SOLIDUS | Escaping |
| ^ | U+005E | CIRCUMFLEX ACCENT | Posix negation syntax |
| { | U+007B | LEFT CURLY BRACKET | Strings in set; Perl property syntax |
| } | U+007D | RIGHT CURLY BRACKET | Strings in set; Perl property syntax |
| | U+0020 U+0009..U+000D U+0085<br/>U+200E U+200F<br/>U+2028 U+2029 | ASCII whitespace,<br/>LRM, RLM,<br/>LINE/PARAGRAPH SEPARATOR | Ignored except when escaped |

Note that some syntax characters only have a special meaning in a certain context. In particular:

* Of all the syntax characters above, only \\, \}, and whitespace have a special meaning inside strings (**\[\{\[a-z\]\}\]** is the set of the string '\[a-z\]', **\[\{\$blah\}\]** is the set of the string '\$blah').
* \$ is equivalent to \uFFFF when appearing at the very end of a set with or without trailing whitespace (**[a-z\$]**, **[a-z\$ ]**), and used as starting indicator for a variable reference elsewhere, in which case the variable name will be the longest match on the `var` nonterminal (such as **[\$my_set]**).
* \- is equivalent to the literal character \\- when occurring at the very beginning of a set, after a \^ at the beginning of a set, or at the very end of a set, in all cases with or without whitespace (**[-abc]**, **[ ^ -abc]**, **[abc-]**), and used as the set difference or range operator elsewhere (**[[abc]-[bc]]**, **[a-z]**).
* \: initiates a POSIX property set when directly after a \[ without whitespace in between (**[:L:]**), ends a POSIX property set when directly before a \] without whitespace in between (**[:L:]**), and is equivalent to the literal character \\\: in any other place (**[ \:]**, **[L\:]**).
* \} ends a string when occurring inside a string (**[{hello}]**), and is equivalent to the literal character \\\} in any other place (**[}a]**).

###### Syntax Special Case Examples

The following table gives examples, including common sources of confusion concerning the UnicodeSet syntax:

| Expression | Contained Elements | Syntax Errors |
| - | - | - |
| **\[^a\]** | All Unicode code points except 'a' | **\[ ^a\]**, **\[a^\]** |
| **\[\\^a\]** | 'a' and '^' | |
| **\[:L:\]** | All code points with Unicode property 'General_Category' equal to 'Letter' | **\[:L\]**, **\[:\]** |
| **\[ :\]** | ':' | |
| **\[L:\]** | 'L' and ':' | |
| **\[-\]** | '-' | |
| **\[ - \]** | '-' | |
| **\[a-\]**, **\[-a\]** | 'a' and '-' | |
| **\[a -b\]** | All code points between 'a' and 'b' (inclusive) | |
| **\[\[a-b\] -\[b\]\]**, **\[\[a\]-\[b\]-\[c\]\]** | 'a' | **\[a-b-c\]** |
| **\[^ - \]** | All Unicode code points except '-' | **\[ ^ - \]** |
| **\[\$\]**, **\[ \$ \]** | U+FFFF | |
| **\[\$a\]** | The value of the variable '\$a' | **\[\$ a\]**, **\[\$und\]** |
| **\[\$a\$\]** | U+FFFF and the value of the variable '\$a' | |
| **\[a\$\]** | 'a' and U+FFFF | |
| **\[\}\]** | '\}' | **\[\{\]** |
| **\[\{\}\]** | the empty string, '' | |
| **\[\{\}\}\]** | '\}' and the empty string, '' | |
| **\[\{\{\}\]** | '\{' | |
| **\[\{\$var\}\]** | the string '\$var' | |
| **\[\{\[a-z\}\]**, **\[\{ \[ a - z\}\]** | the string '\[a-z' | |
| **\[\\x\{10FFFF 1\}\]** | U+10FFFF and U+1 | **\[\\x\{10FFFF1\}\]** |
| **\[\\x\{61\}-d\]** | 'a', 'b', 'c', and 'd' | **\[\\x\{61 63\}-d\]**, **\[\\x\{61 63\}-\\x\{62 64\}\]** |

*Note: the above assumes that variables are supported, \$a is defined as a full UnicodeSet, a string, or a char, and \$und is not defined at all.*

##### Lists of Code Points

Lists are a sequence of strings that may include ranges, which are indicated by a '-' between two code points, as in "a-z". The sequence _start-end_ specifies the range of all code points from the start to end, inclusive, in Unicode order. For example, **[a c d-f m]** is equivalent to **[a c d e f m]**. Whitespace can be freely used for clarity, as **[a c d-f m]** means the same as **[acd-fm]**.

A string with multiple code points is represented in a list by being surrounded by curly braces, such as in **[a-z \{ch}]**. It can be used with the range notation, with the restriction that each string contains exactly one code point. Thus **\[\{ab\}-\{c\}\]**, **\[\{ax\}-\{bz\}\]**, and **\[\{ab\}-c\]** are invalid. A string consisting of a single code point is equivalent to that code point, that is, **[\{a}-c]** is valid and equivalent to **[a b c]**.

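
The list and range rules above can be exercised with a small sketch. It handles only flat lists of single code points, brace-delimited strings, and ranges, written in plain (unescaped) syntax; escapes, properties, nested sets, negation, variables, and a literal leading or trailing '-' are deliberately not supported. The function name `expand_list` is hypothetical.

```python
def expand_list(pattern):
    """Expand a flat list such as "[a c d-f m]" into a set of strings."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a pattern of the form [...]")
    body = pattern[1:-1]

    # Tokenize into elements (code points or {strings}) and '-' operators.
    tokens = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1                                   # whitespace is ignored
        elif ch == "{":
            end = body.index("}", i)                 # raises ValueError if unclosed
            tokens.append(("elem", "".join(body[i + 1:end].split())))
            i = end + 1
        elif ch == "-":
            tokens.append(("dash", ch))
            i += 1
        else:
            tokens.append(("elem", ch))
            i += 1

    # Combine "start - end" triples into ranges; everything else is a member.
    result = set()
    j = 0
    while j < len(tokens):
        kind, value = tokens[j]
        if kind == "dash":
            raise ValueError("unexpected '-'")
        if j + 1 < len(tokens) and tokens[j + 1][0] == "dash":
            if j + 2 >= len(tokens) or tokens[j + 2][0] != "elem":
                raise ValueError("incomplete range")
            end_value = tokens[j + 2][1]
            if len(value) != 1 or len(end_value) != 1:
                raise ValueError("range endpoints must be single code points")
            result.update(chr(c) for c in range(ord(value), ord(end_value) + 1))
            j += 3
        else:
            result.add(value)
            j += 1
    return result


if __name__ == "__main__":
    print(sorted(expand_list("[a c d-f m]")))
    print(sorted(expand_list("[a-c {ch}]")))
```

A single-code-point string such as `{a}` is accepted as a range endpoint, while a multi-code-point string in a range is rejected, matching the restriction stated above.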
##### Backslash Escapes

Certain backslashed code point sequences can be used to quote code points:

| Sequence | Code point |
| --------------- | ------------------------------------ |
| \\x\{h...h}<br/>\\u\{h...h} | list of 1-6 hex digits ([0-9A-Fa-f]), separated by spaces |
| \\xhh | 2 hex digits |
| \\uhhhh | Exactly 4 hex digits |
| \\Uhhhhhhhh | Exactly 8 hex digits |
| \\a | U+0007 (BEL / ALERT) |
| \\b | U+0008 (BACKSPACE) |
| \\t | U+0009 (TAB / CHARACTER TABULATION) |
| \\n | U+000A (LINE FEED) |
| \\v | U+000B (LINE TABULATION) |
| \\f | U+000C (FORM FEED) |
| \\r | U+000D (CARRIAGE RETURN) |
| \\\\ | U+005C (BACKSLASH / REVERSE SOLIDUS) |
| \\N\{name} | The Unicode code point named "name". |
| \\p\{…},\\P\{…} | Unicode property (see below) |

Anything else following a backslash is mapped to itself, except the property syntax described below, or in an environment where it is defined to have some special meaning.

Any code point formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \\x, \\u and \\U escapes create literal code points. (In contrast, Java treats Unicode escapes as just a way to represent arbitrary code points in an ASCII source file, and any resulting code points are _**not**_ tagged as literals.)

Unicode property sets are defined as described in _UTS #18: Unicode Regular Expressions_ [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)], Level 1 and RL2.5, including the syntax where given. For an example of a concrete implementation of this, see [[ICUUnicodeSet](#ICUUnicodeSet)].

##### Unicode Properties

Briefly, Unicode property sets are specified by any Unicode property and a value of that property, such as **[:General_Category=Letter:]** for Unicode letters or **\\p\{uppercase}** for the set of upper case letters in Unicode. The property names are defined by the PropertyAliases.txt file and the property values by the PropertyValueAliases.txt file. For more information, see [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)].

The syntax for specifying the property sets is an extension of either POSIX or Perl syntax, by the addition of `"="`.
For example, you can match letters by using the POSIX-style syntax: **[:General_Category=Letter:]** or by using the Perl-style syntax **\\p\{General_Category=Letter}**.

Property names and values are case-insensitive, and whitespace, "-", and "\_" are ignored. The property name can be omitted for the **General_Category** and **Script** properties, but is required for other properties. If the property value is omitted, it is assumed to represent a boolean property with the value "true". Thus **[:Letter:]** is equivalent to **[:General_Category=Letter:]**, and **[:Wh-ite-s pa_ce:]** is equivalent to **[:Whitespace=true:]**.

The table below shows the two kinds of syntax: POSIX and Perl style. Also, the table shows the "Negative" version, which is a property that excludes all code points of a given kind. For example, **[:^Letter:]** matches all code points that are not **[:Letter:]**.

| | Positive | Negative |
| ------------------ | ---------------- | ----------------- |
| POSIX-style Syntax | [:type=value:] | [:^type=value:] |
| Perl-style Syntax | \\p\{type=value} | \\P\{type=value} |

##### Boolean Operations

The low-level lists or properties then can be freely combined with the normal set operations (union, inverse, difference, and intersection):

* To union two sets, simply concatenate them. For example, **[[:letter:] [:number:]]**
* To intersect two sets, use the '&' operator. For example, **[[:letter:] & [a-z]]**
* To take the set-difference of two sets, use the '-' operator. For example, **[[:letter:] - [a-z]]**
* To invert a set, place a '\^' immediately after the opening '['. For example, **[\^a-z]**. In any other location, the '\^' does not have a special meaning. The inversion [\^X] is equivalent to [[\\x{0}-\\x{10FFFF}]-[X]]. Thus multi-code point strings are discarded.
* Symmetric difference (~) is not supported.

The binary operators '&', '-', and the implicit union have equal precedence and bind left-to-right.
Thus **[[:letter:]-[a-z]-[\\u0100-\\u01FF]]** is equal to **[[[:letter:]-[a-z]]-[\\u0100-\\u01FF]]**. Another example is the set **[[ace][bdf] - [abc][def]]**, which is not the empty set, but instead equal to **[[[[ace] [bdf]] - [abc]] [def]]**, which equals **[[[abcdef] - [abc]] [def]]**, which equals **[[def] [def]]**, which equals **[def]**.

**One caution:** the '&' and '-' operators operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern **[[:Lu:]-A]** is illegal, since it is interpreted as the set **[:Lu:]** followed by the incomplete range **-A**. To specify the set of upper case letters except for 'A', enclose the 'A' in brackets: **[[:Lu:]-[A]]**.

##### Variables in UnicodeSets

Support for variable identifiers (var) is optional. They are used in certain contexts such as in [Transforms](tr35-general.md#Transforms). When they are used, they are defined as follows:

UnicodeSets may contain variables (`$my_char`, `$the_set`, ...) in place of full UnicodeSets and strings/characters. If variable support is enabled, variables must be defined (out-of-scope for UnicodeSets). In particular, referring to undefined variables is an error.

Not all variable maps are valid for a given expression in UnicodeSet syntax. For instance, consider **[$a-$b]**; this may be a range of characters if both **$a** and **$b** are characters, or a difference of sets if they are both sets; but given the map `{ a => '0', b => [:L:] }`, it is invalid.

**Note:** In particular, the variable map is needed not just to compute the actual set of characters and strings represented by the UnicodeSet, but also to parse the UnicodeSet syntax: if **$a** and **$b** were unknown, the parsing of **[$a-$b]** would be ambiguous.

Variables are replaced by value, that is, **[a \$minus z]** with a variable map `{ minus => '-' }` is equivalent to **[-az]**, not **[a-z]** (i.e., cardinality of 3 instead of 26).
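
To illustrate replacement by value, here is a minimal sketch that expands a flat, pre-tokenized set body. The tokenization, the `expand` function, and the tagging of substituted values as literals are hypothetical simplifications made for this example; this is not a UnicodeSet parser, and it assumes every token and variable value is a single character.

```python
def expand(tokens, variables):
    """Expand a flat token list into a set of characters.

    Tokens are single characters, the range operator "-", or "$name"
    references. A substituted value is tagged as a literal, so a variable
    holding "-" never acts as the range operator.
    """
    items = []
    for tok in tokens:
        if tok.startswith("$"):
            name = tok[1:]
            if name not in variables:
                # Referring to an undefined variable is an error.
                raise KeyError(f"undefined variable: ${name}")
            items.append(("literal", variables[name]))
        elif tok == "-":
            items.append(("operator", tok))
        else:
            items.append(("literal", tok))

    result = set()
    i = 0
    while i < len(items):
        kind, value = items[i]
        if kind == "operator":
            raise ValueError("range operator without a start character")
        is_range = (
            i + 2 < len(items)
            and items[i + 1][0] == "operator"
            and items[i + 2][0] == "literal"
        )
        if is_range:
            start, end = ord(value), ord(items[i + 2][1])
            result.update(chr(cp) for cp in range(start, end + 1))
            i += 3
        else:
            result.add(value)
            i += 1
    return result


by_value = expand(["a", "$minus", "z"], {"minus": "-"})
as_range = expand(["a", "-", "z"], {})
print(len(by_value), len(as_range))  # 3 26
```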
The full `var` nonterminal is replaced, i.e., the variable name together with the prefixed \$.

The variable syntax implements UAX31-R1-2 with XID_Start and XID_Continue. For more information, see [[UAX31](https://www.unicode.org/reports/tr41/#UAX31)]. Variables are equivalent normalized identifiers with Normalization Form C, implementing UAX31-R4. Furthermore, variables are case-sensitive.

Notes:

1. The 'type' of a variable value is not specified syntactically. Thus \[\$a\-\$b\] can be resolved whether \$a and \$b are chars/strings (e.g., \$a=δ, \$b=θ) or full UnicodeSets (e.g., \$a=\\p\{script=greek\}, \$b=\\p\{general_category=letter\}). The only restriction is that the result be syntactically valid; thus (\$a=w, \$b=xy) would raise an error.
2. Variable substitution is currently disallowed inside of property expressions. Thus \\p{gc=\$blah} raises an error.
3. '\$' when followed by '\]' is interpreted as \\uFFFF, and is used to match before the start of a string or after the end. Thus \[ab\$\] matches the string "xaby" in the locations (marked with '()'): "()xaby", "x(a)by", "xa(b)y", "xaby()".
4. If an unescaped '\$' is followed by neither a character of type \[:XID_Start:\] nor a '\]', it is a syntax error.

**Backwards compatibility**: In prior versions of this document, the character \$ was a valid element of the `char` nonterminal with the special meaning of `\uFFFF`. In current versions, the \$ character may only appear by itself at the end of a UnicodeSet, e.g., **[a-z\$]**, where it keeps that interpretation. In any other location, \$ is allowed only as the prefix of a variable. The previous behavior of allowing \$ in the `char` nonterminal is considered obsolete and must be avoided by new implementations.

##### UnicodeSet Examples

The following table summarizes the syntax that can be used.

| Example | Description |
| -------------------- | ----------- |
| [a] | The set containing 'a' alone |
| [a-z] | The set containing 'a' through 'z' and all letters in between, in Unicode order. Thus it is the same as [\\u0061-\\u007A]. |
| [^a-z] | The set containing all code points but 'a' through 'z'. Thus it is the same as [\\u0000-\\u0060 \\u007B-\\x{10FFFF}]. |
| [[pat1][pat2]] | The union of sets specified by pat1 and pat2 |
| [[pat1]&[pat2]] | The intersection of sets specified by pat1 and pat2 |
| [[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2 |
| [a \{ab} \{ac}] | The code point 'a' and the multi-code point strings "ab" and "ac" |
| [x\\u\{61 2019 62}y] | Equivalent to [x\\u0061\\u2019\\u0062y] (= [xa’by]) |
| [:Lu:] | The set of code points with a given property value, as defined by PropertyValueAliases.txt. In this case, these are the Unicode upper case letters. The long form for this is **[:General_Category=Uppercase_Letter:]**. |
| [:L:] | The set of code points belonging to all Unicode categories starting with 'L', that is, **[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]**. The long form for this is **[:General_Category=Letter:]**. |

#### String Range

A String Range is a compact format for specifying a list of strings.

**Syntax:**

> X _sep_ Y

The separator and the format of strings X, Y may vary depending on the domain. For example,

* for the validity files the separator is ~,
* for UnicodeSet the separator is -, and any multi-codepoint string is enclosed in {…}.

**Validity:**

> A string range X _sep_ Y is valid iff len(X) ≥ len(Y) > 0, where len(X) is the length of X in code points.
>
> _There may be additional, domain-specific requirements for validity of the expansion of the string range._

**Interpretation:**

1. Break X into P and S, where len(S) = len(Y)
   * Note that P will be an empty string if the lengths of X and Y are equal.
2. Form the combinations of all P+(s₀..y₀)+(s₁..y₁)+...(sₙ..yₙ)
   * s₀ is the first code point in S, etc.

**Examples:**

* ab-ad → ab ac ad
* ab-d → ab ac ad
* ab-cd → ab ac ad bb bc bd cb cc cd
* 👦🏻-👦🏿 → 👦🏻 👦🏼 👦🏽 👦🏾 👦🏿
* 👦🏻-🏿 → 👦🏻 👦🏼 👦🏽 👦🏾 👦🏿

### Identity Elements

```xml
```

The `identity` element contains information identifying the target locale for this data, and general information about the version of this data.

```xml
```

The `version` element provides, in an attribute, the version of this file. The contents of the element can contain textual notes about the changes between this version and the last. For example:

> ```xml
> Various notes and changes in version 1.1
> ```
>
> This is not to be confused with the `version` attribute on the `ldml` element, which tracks the DTD version.

```xml
```

The `generation` element is now deprecated. It was used to contain the last modified date for the data. This could be in two formats: ISO 8601 format, or CVS format (illustrated by the example above).

```xml
```

The language code is the primary part of the specification of the locale id, with values as described above.

```xml
```

The script code may be used in the identification of written languages, with values described above.

```xml
```

The territory code is a common part of the specification of the locale id, with values as described above.

```xml
```

The variant code is the tertiary part of the specification of the locale id, with values as described above.

When combined according to the rules described in _[Unicode Language and Locale Identifiers](#Unicode_Language_and_Locale_Identifiers)_, the `language` element, along with any of the optional `script`, `territory`, and `variant` elements, must identify a known, stable locale identifier. Otherwise, it is an error.

### Valid Attribute Values

The [DTD Annotations](#DTD_Annotations) are used to determine whether elements, attributes, or attribute values are valid (or deprecated).

### Canonical Form

The following are restrictions on the format of LDML files to allow for easier parsing and comparison of files.

Peer elements have consistent order. That is, if the DTD or this specification requires the following order in an element `foo`:

```xml
```

It can never require the reverse order in a different element `bar`.
```xml
```

Note that there was one case that had to be corrected in order to make this true. For that reason, pattern occurs twice under currency:

```xml
```

[XML](https://www.w3.org/TR/REC-xml/) files can have a wide variation in textual form, while representing precisely the same data. Putting the LDML files in the repository into a canonical form allows us to use the simple diff tools used widely (and in CVS) to detect differences when vetting changes, without those tools being confused. This is not a requirement on other uses of LDML; it is simply a way to manage repository data more easily.

#### Content

1. All start elements are on their own line, indented by _depth_ tabs.
2. All end elements (except for leaf nodes) are on their own line, indented by _depth_ tabs.
3. Any leaf node with empty content is in the form ` `.
4. There are no blank lines except within comments or content.
5. Spaces are used within a start element. There are no extra spaces within elements.
   * ` `, not ` `
   * ``, not ``
6. All attribute values use double quote ("), not single (').
7. There are no CDATA sections, and no escapes except those absolutely required.
   * no `&apos;` since it is not necessary
   * no `'&#x61;'`, it would be just `'a'`
8. All attributes with defaulted values are suppressed.
9. The draft and `alt="proposed.*"` attributes are only on leaf elements.
10. The tzid are canonicalized in the following way:
    * All tzids as of CLDR 1.1 (2004.06.08) in zone.tab are canonical.
    * After that point, the first time a tzid is introduced, that is the canonical form.

    That is, new IDs are added, but existing ones keep the original form. The _TZ_ timezone database keeps a set of equivalences in the "backward" file. These are used to map other tzids to the canonical form. For example, when `America/Argentina/Catamarca` was introduced as the new name for the previous `America/Catamarca`, a link was added in the backward file.
    `Link America/Argentina/Catamarca America/Catamarca`

_Example:_

```xml
```

#### Ordering

An element is ordered first by the element name, and then if the element names are identical, by the sorted set of attribute-value pairs. For the latter, compare the first pair in each (in sorted order by attribute pair). If not identical, go to the second pair, and so on.

Elements and attributes are ordered according to their order in the respective DTDs. Attribute value comparison is a bit more complicated, and may depend on the attribute and type. This is currently done with specific ordering tables.

Any future additions to the DTD must be structured so as to allow compatibility with this ordering. See also [Valid Attribute Values.](#Valid_Attribute_Values)

#### Comments

1. Comments are of the form ``.
2. They are logically attached to a node. There are 4 kinds:
   1. Inline comments always appear after a leaf node, on the same line at the end. These are a single line.
   2. Preblock comments always precede the attachment node, and are indented on the same level.
   3. Postblock comments always follow the attachment node, and are indented on the same level.
   4. Final comment, after ` ¤#,##0.00;(¤#,##0.00)
{ja} ⊇ {} | success, und = {} |
{hepburn, heploc} ⊇ {hepburn} | success |
{ja} ⊇ {} | success, und = {} |
{hepburn} ⊉ {hepburn, heploc} | failure |