## Unicode Technical Standard #35

# Unicode Locale Data Markup Language (LDML)<br/>Part 5: Collation

|Version|45              |
|-------|----------------|
|Editors|Markus Scherer and <a href="tr35.md#Acknowledgments">other CLDR committee members</a>|

For the full header, summary, and status, see [Part 1: Core](tr35.md).

### _Summary_

This document describes parts of an XML format (_vocabulary_) for the exchange of structured locale data. This format is used in the [Unicode Common Locale Data Repository](https://www.unicode.org/cldr/).

This is a partial document, describing only those parts of the LDML that are relevant for collation (sorting, searching & grouping). For the other parts of the LDML see the [main LDML document](tr35.md) and the links above.

_Note:_
Some links may lead to in-development or older
versions of the data files.
See <https://cldr.unicode.org> for up-to-date CLDR release data.

### _Status_

<!-- _This is a draft document which may be updated, replaced, or superseded by other documents at any time.
Publication does not imply endorsement by the Unicode Consortium.
This is not a stable document; it is inappropriate to cite this document as other than a work in progress._ -->

_This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium.
This is a stable document and may be used as reference material or cited as a normative reference by other specifications._

> _**A Unicode Technical Standard (UTS)** is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS._

_Please submit corrigenda and other comments with the CLDR bug reporting form [[Bugs](tr35.md#Bugs)]. Related information that is useful in understanding this document is found in the [References](tr35.md#References). For the latest version of the Unicode Standard see [[Unicode](tr35.md#Unicode)]. For a list of current Unicode Technical Reports see [[Reports](tr35.md#Reports)].
For more information about versions of the Unicode Standard, see [[Versions](tr35.md#Versions)]._

## <a name="Parts" href="#Parts">Parts</a>

The LDML specification is divided into the following parts:

* Part 1: [Core](tr35.md#Contents) (languages, locales, basic structure)
* Part 2: [General](tr35-general.md#Contents) (display names & transforms, etc.)
* Part 3: [Numbers](tr35-numbers.md#Contents) (number & currency formatting)
* Part 4: [Dates](tr35-dates.md#Contents) (date, time, time zone formatting)
* Part 5: [Collation](tr35-collation.md#Contents) (sorting, searching, grouping)
* Part 6: [Supplemental](tr35-info.md#Contents) (supplemental data)
* Part 7: [Keyboards](tr35-keyboards.md#Contents) (keyboard mappings)
* Part 8: [Person Names](tr35-personNames.md#Contents) (person names)
* Part 9: [MessageFormat](tr35-messageFormat.md#Contents) (message format)

## <a name="Contents" href="#Contents">Contents of Part 5, Collation</a>

* [CLDR Collation](#CLDR_Collation)
  * [CLDR Collation Algorithm](#CLDR_Collation_Algorithm)
    * [U+FFFE](#Algorithm_FFFE)
    * [Context-Sensitive Mappings](#Context_Sensitive_Mappings)
    * [Case Handling](#Algorithm_Case)
    * [Reordering Groups](#Algorithm_Reordering_Groups)
    * [Combining Rules](#Combining_Rules)
* [Root Collation](#Root_Collation)
  * [Grouping classes of characters](#grouping_classes_of_characters)
  * [Non-variable symbols](#non_variable_symbols)
  * [Additional contractions for Tibetan](#tibetan_contractions)
  * [Tailored noncharacter weights](#tailored_noncharacter_weights)
  * [Root Collation Data Files](#Root_Data_Files)
  * [Root Collation Data File Formats](#Root_Data_File_Formats)
    * [allkeys_CLDR.txt](#File_Format_allkeys_CLDR_txt)
    * [FractionalUCA.txt](#File_Format_FractionalUCA_txt)
    * [UCA_Rules.txt](#File_Format_UCA_Rules_txt)
* [Collation Tailorings](#Collation_Tailorings)
  * [Collation Types](#Collation_Types)
    * [Collation Type Fallback](#Collation_Type_Fallback)
      * Table: [Sample requested and actual collation locales and types](#Sample_requested_and_actual_collation_locales_and_types)
  * [Version](#Collation_Version)
  * [Collation Element](#Collation_Element)
  * [Setting Options](#Setting_Options)
    * Table: [Collation Settings](#Collation_Settings)
    * [Common settings combinations](#Common_Settings)
    * [Notes on the normalization setting](#Normalization_Setting)
    * [Notes on variable top settings](#Variable_Top_Settings)
  * [Collation Rule Syntax](#Rules)
  * [Orderings](#Orderings)
    * Table: [Specifying Collation Ordering](#Specifying_Collation_Ordering)
    * Table: [Abbreviating Ordering Specifications](#Abbreviating_Ordering_Specifications)
  * [Contractions](#Contractions)
    * Table: [Specifying Contractions](#Specifying_Contractions)
  * [Expansions](#Expansions)
  * [Context Before](#Context_Before)
    * Table: [Specifying Previous Context](#Specifying_Previous_Context)
  * [Placing Characters Before Others](#Placing_Characters_Before_Others)
  * [Logical Reset Positions](#Logical_Reset_Positions)
    * Table: [Specifying Logical Positions](#Specifying_Logical_Positions)
  * [Special-Purpose Commands](#Special_Purpose_Commands)
    * Table: [Special-Purpose Elements](#Special_Purpose_Elements)
  * [Collation Reordering](#Script_Reordering)
    * [Interpretation of a reordering list](#Interpretation_reordering)
    * [Reordering Groups for allkeys.txt](#Reordering_Groups_allkeys)
  * [Case Parameters](#Case_Parameters)
    * [Untailored Characters](#Case_Untailored)
    * [Compute Modified Collation Elements](#Case_Weights)
    * [Tailored Strings](#Case_Tailored)
  * [Visibility](#Visibility)
  * [Collation Indexes](#Collation_Indexes)
    * [Index Characters](#Index_Characters)
    * [CJK Index Markers](#CJK_Index_Markers)

## <a name="CLDR_Collation" href="#CLDR_Collation">CLDR Collation</a>

Collation is the general term for the process and function of determining the sorting order of strings of characters, for example for lists of strings presented to users, or in databases for sorting and selecting records.

Collation varies by language, by application (some languages use special phonebook sorting), and other criteria (for example, phonetic vs. visual).

CLDR provides collation data for many languages and styles. The data supports not only sorting but also language-sensitive searching and grouping under index headers. All CLDR collations are based on the [[UCA](https://www.unicode.org/reports/tr41/#UTS10)] default order, with common modifications applied in the CLDR root collation, and further tailored for language and style as needed.

### <a name="CLDR_Collation_Algorithm" href="#CLDR_Collation_Algorithm">CLDR Collation Algorithm</a>

The CLDR collation algorithm is an extension of the [Unicode Collation Algorithm](https://www.unicode.org/reports/tr10/#Main_Algorithm).

#### <a name="Algorithm_FFFE" href="#Algorithm_FFFE">U+FFFE</a>

U+FFFE maps to a CE with a minimal, unique primary weight. Its primary weight is not "variable": U+FFFE must not become ignorable in alternate handling. On the identical level, a minimal, unique “weight” must be emitted for U+FFFE as well. This allows for [Merging Sort Keys](https://www.unicode.org/reports/tr10/#Merging_Sort_Keys) within code point space.

For example, when sorting names in a database, a sortable string can be formed with _last_name_ + '\\uFFFE' + _first_name_. These strings would sort properly, without ever comparing the last part of a last name with the first part of another first name.

For backwards secondary level sorting, text _segments_ separated by U+FFFE are processed in forward segment order, and _within_ each segment the secondary weights are compared backwards. This is so that such combined strings are processed consistently with merging their sort keys (for example, by concatenating them level by level with a low separator).
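
The effect of a low separator can be sketched without a real collator. The following Java snippet is a minimal illustration, assuming plain UTF-16 code unit order as a stand-in for real collation weights. The names `LowSeparatorSketch`, `LOW_SEPARATOR_ORDER`, and `key` are hypothetical and are not part of ICU or CLDR. It models only the idea that U+FFFE sorts below every other character, not the level-by-level behavior of actual sort keys.

```java
import java.util.Comparator;

/** Toy model: U+FFFE gets a weight below every other character. Not a real collator. */
final class LowSeparatorSketch {
    static final char SEP = '\uFFFE';

    // Weight of one UTF-16 code unit; the separator is forced to the minimum.
    static int weight(char c) {
        return c == SEP ? -1 : c;
    }

    /** Compares code unit by code unit, treating U+FFFE as the lowest possible unit. */
    static final Comparator<String> LOW_SEPARATOR_ORDER = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int diff = weight(a.charAt(i)) - weight(b.charAt(i));
            if (diff != 0) {
                return diff;
            }
        }
        return a.length() - b.length();
    };

    /** Builds the combined sortable string for one record. */
    static String key(String lastName, String firstName) {
        return lastName + SEP + firstName;
    }

    public static void main(String[] args) {
        // Naive concatenation compares the 'd' of a first name with the 'c' of a last name.
        System.out.println("Abd".compareTo("AbcA") > 0);
        // With the separator, last name "Ab" is decided before any first name is examined.
        System.out.println(LOW_SEPARATOR_ORDER.compare(key("Ab", "d"), key("Abc", "A")) < 0);
    }
}
```

In this sketch the record ("Ab", "d") sorts after ("Abc", "A") under naive concatenation, but before it once the low separator is inserted, which is the ordering by last name that a reader would expect.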

> **Note**: With unique, low weights on _all_ levels it is possible to achieve `sortkey(str1 + "\uFFFE" + str2) == mergeSortkeys(sortkey(str1), sortkey(str2))`. When that is not necessary, then code can be a little simpler (no special handling for U+FFFE except for backwards-secondary), sort keys can be a little shorter (when using compressible common non-primary weights for U+FFFE), and another low weight can be used in tailorings.

#### <a name="Context_Sensitive_Mappings" href="#Context_Sensitive_Mappings">Context-Sensitive Mappings</a>

Contraction matching, as in the UCA, starts from the first character of the contraction string. It slows down processing of that first character even when none of its contractions matches. In some cases, it is preferable to change such contractions to mappings with a prefix (context before a character), so that complex processing is done only when the less-frequently occurring trailing character is encountered.

For example, the DUCET contains contractions for several variants of L· (L followed by middle dot). Collating ASCII text is slowed down by contraction matching starting with L/l. In the CLDR root collation, these contractions are replaced by prefix mappings (L|·) which are triggered only when the middle dot is encountered. CLDR also uses prefix rules in the Japanese tailoring, for processing of Hiragana/Katakana length and iteration marks.

The mapping is conditional on the prefix match but does not change the mappings for the preceding text.
As a result, a contraction mapping for "px" can be replaced by a prefix rule "p|x" only if px maps to the collation elements for p followed by the collation elements for "x if after p". In the DUCET, L· maps to CE(L) followed by a special secondary CE (which differs from CE(·) when · is not preceded by L). In the CLDR root collation, L has no context-sensitive mappings, but · maps to that special secondary CE if preceded by L.

A prefix mapping for p|x behaves mostly like the contraction px, except when there is a contraction that overlaps with the prefix, for example one for "op". A contraction matches only new text (and consumes it), while a prefix matches only already-consumed text.

* With mappings for "op" and "px", only the first contraction matches in text "opx". (It consumes the "op" characters, and there is no context-sensitive mapping for x.)
* With mappings for "op" and "p|x", both the contraction and the prefix rule match in text "opx". (The prefix always matches already-consumed characters, regardless of whether they mapped as part of contractions.)

> **Note**: Matching of discontiguous contractions should be implemented without rewriting the text (unlike in the [[UCA](https://www.unicode.org/reports/tr41/#UTS10)] algorithm specification), so that prefix matching is predictable. (It should also help with contraction matching performance.) An implementation that does rewrite the text, as in the UCA, will get different results for some (unusual) combinations of contractions, prefix rules, and input text.
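
The difference between the two "opx" cases above can be sketched with a small matcher. This is a toy model under stated assumptions, not the ICU or UCA implementation: it works on UTF-16 code units, handles contiguous matches only, ignores normalization, and uses placeholder strings instead of real collation elements. The class name `PrefixMatchSketch` and the rule notation (`"str"` for a contraction, `"prefix|str"` for a prefix rule) are hypothetical. As a further assumption, the sketch prefers the longest matching prefix and then the longest matching string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy matcher for contractions and prefix rules; contiguous matching only. */
final class PrefixMatchSketch {
    // Rule notation (hypothetical): "str" is a contraction, "prefix|str" is a prefix rule.
    private final Map<String, String> rules = new LinkedHashMap<>();

    PrefixMatchSketch add(String rule, String ce) {
        rules.put(rule, ce);
        return this;
    }

    List<String> map(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String bestCe = null;
            int bestPrefixLen = -1;
            int bestLen = 0;
            for (Map.Entry<String, String> e : rules.entrySet()) {
                int bar = e.getKey().indexOf('|');
                String prefix = bar < 0 ? "" : e.getKey().substring(0, bar);
                String str = e.getKey().substring(bar + 1);
                // A prefix looks backwards at already-consumed text ending at i.
                boolean prefixOk = i >= prefix.length()
                        && text.startsWith(prefix, i - prefix.length());
                // The mapped string looks forwards at new text starting at i.
                if (str.isEmpty() || !prefixOk || !text.startsWith(str, i)) {
                    continue;
                }
                boolean better = prefix.length() > bestPrefixLen
                        || (prefix.length() == bestPrefixLen && str.length() > bestLen);
                if (better) {
                    bestCe = e.getValue();
                    bestPrefixLen = prefix.length();
                    bestLen = str.length();
                }
            }
            if (bestCe == null) {
                // No rule applies: use a placeholder CE for the single character.
                bestCe = "CE(" + text.charAt(i) + ")";
                bestLen = 1;
            }
            out.add(bestCe);
            i += bestLen; // only the mapped string is consumed, never the prefix
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(new PrefixMatchSketch()
                .add("op", "CE(op)").add("px", "CE(px)").map("opx"));
        System.out.println(new PrefixMatchSketch()
                .add("op", "CE(op)").add("p|x", "CE(p|x)").map("opx"));
    }
}
```

With the contraction pair, "op" consumes the "p" and the remaining "x" gets its plain mapping. With the prefix rule, "p|x" still applies after "op" has been consumed, because the prefix is checked against text that was already processed.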

Prefix matching uses a simple longest-match algorithm (op|c wins over p|c). It is recommended that prefix rules be limited to mappings where both the prefix string and the mapped string begin with an NFC boundary (that is, with a normalization starter that does not combine backwards). (In op|ch both o and c should be starters (ccc=0) and NFC_QC=Yes.) Otherwise, prefix matching would be affected by canonical reordering and discontiguous matching, like contractions. Prefix matching is thus always contiguous.

A character can have mappings with both prefixes (context before) and contraction suffixes. Prefixes are matched first. This is to keep them reasonably implementable: When there is a mapping with both a prefix and a contraction suffix (like in Japanese: ぐ|ゞ), then the matching needs to go in both directions. The contraction might involve discontiguous matching, which needs complex text iteration and handling of skipped combining marks, and will consume the matching suffix. Prefix matching should be first because, regardless of whether there is a match, the implementation will always return to the original text index (right after the prefix) from where it will start to look at all of the contractions for that prefix.

If there is a match for a prefix but no match for any of the suffixes for that prefix, then fall back to mappings with the next-longest matching prefix, and so on, ultimately to mappings with no prefix. (Otherwise mappings with longer prefixes would “hide” mappings with shorter prefixes.)

Consider the following mappings.

1. p → CE(p)
2. h → CE(h)
3. c → CE(c)
4. ch → CE(d)
5. p|c → CE(u)
6. p|ci → CE(v)
7. p|ĉ → CE(w)
8. op|ck → CE(x)

With these, text collates like this:

* pc → CE(p)CE(u)
* pci → CE(p)CE(v)
* pch → CE(p)CE(u)CE(h)
* pĉ → CE(p)CE(w)
* pĉ̣ → CE(p)CE(w)CE(U+0323) // discontiguous
* opck → CE(o)CE(p)CE(x)
* opch → CE(o)CE(p)CE(u)CE(h)

However, if the mapping p|c → CE(u) is missing, then text "pch" maps to CE(p)CE(d), "opch" maps to CE(o)CE(p)CE(d), and "pĉ̣" maps to CE(p)CE(c)CE(U+0323)CE(U+0302) (because discontiguous contraction matching extends _an existing match_ by one non-starter at a time).

#### <a name="Algorithm_Case" href="#Algorithm_Case">Case Handling</a>

CLDR specifies how to sort lowercase or uppercase first, as a stronger distinction than other tertiary variants (**caseFirst**) or while completely ignoring all other tertiary distinctions (**caseLevel**).
See _[Setting Options](#Setting_Options)_ and _[Case Parameters](#Case_Parameters)_.

#### <a name="Algorithm_Reordering_Groups" href="#Algorithm_Reordering_Groups">Reordering Groups</a>

CLDR specifies how to do parametric reordering of groups of scripts (e.g., “native script first”) as well as special groups (e.g., “digits after letters”), and provides data for the effective implementation of such reordering.

#### <a name="Combining_Rules" href="#Combining_Rules">Combining Rules</a>

Rules from different sources can be combined, with the later rules overriding the earlier ones. The following is an example of how this can be useful.

There is a root collation for "emoji" in CLDR. So use of "-u-co-emoji" in a Unicode locale identifier will access that ordering.

Example, using ICU:

```java
collator = Collator.getInstance(ULocale.forLanguageTag("en-u-co-emoji"));
```

However, use of the emoji will supplant the language's customizations.
So the above is the equivalent of:

```java
collator = Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji"));
```

The same structure will not work for a language that does require customization, like Danish. That is, the following will fail.

```java
collator = Collator.getInstance(ULocale.forLanguageTag("da-u-co-emoji"));
```

For that, a slightly more cumbersome method needs to be employed, which is to take the rules for Danish, and explicitly add the rules for emoji.

```java
RuleBasedCollator collator = new RuleBasedCollator(
    ((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag("da"))).getRules() +
    ((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji")))
        .getRules());
```

The following table shows the differences. When emoji ordering is supported, the two faces will be adjacent. When Danish ordering is supported, the ü is after the y.

<!-- HTML: no header row, jagged -->
<table><tbody>
<tr><td>code point order</td><td>,</td><td>Z</td><td>a</td><td>y</td><td>ü</td><td>☹️</td><td>✈️️</td><td>글</td><td></td></tr>
<tr><td>en</td><td>,</td><td>☹️</td><td>✈️️</td><td></td><td>a</td><td>ü</td><td>y</td><td>Z</td><td>글</td></tr>
<tr><td>en-u-co-emoji</td><td>,</td><td></td><td>☹️</td><td>✈️️</td><td>a</td><td>ü</td><td>y</td><td>Z</td><td>글</td></tr>
<tr><td>da</td><td>,</td><td>☹️</td><td>✈️️</td><td></td><td>a</td><td>y</td><td><strong><u>ü</u></strong></td><td>Z</td><td>글</td></tr>
<tr><td>da-u-co-emoji</td><td>,</td><td></td><td>☹️</td><td>✈️️</td><td>a</td><td><strong><u>ü</u></strong></td><td>y</td><td>Z</td><td>글</td></tr>
<tr><td>combined rules</td><td>,</td><td></td><td>☹️</td><td>✈️️</td><td>a</td><td>y</td><td><strong><u>ü</u></strong></td><td>Z</td><td>글</td></tr>
</tbody></table>

## <a name="Root_Collation" href="#Root_Collation">Root Collation</a>

The CLDR root collation order is based on the [Default Unicode Collation Element Table (DUCET)](https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table) defined in _UTS #10: Unicode Collation Algorithm_ [[UCA](https://www.unicode.org/reports/tr41/#UTS10)]. It is used by all other locales by default, or as the base for their tailorings. (For a chart view of the UCA, see Collation Chart [[UCAChart](tr35.md#UCAChart)].)

Starting with CLDR 1.9, CLDR uses modified tables for the root collation order. The root locale ordering is tailored in the following ways:

### <a name="grouping_classes_of_characters" href="#grouping_classes_of_characters">Grouping classes of characters</a>

As of Version 6.1.0, the DUCET puts characters into the following ordering:

* First "common characters": whitespace, punctuation, general symbols, some numbers, currency symbols, and other numbers.
* Then "script characters": Latin, Greek, and the rest of the scripts.

(There are a few exceptions to this general ordering.)

The CLDR root locale modifies the DUCET tailoring by ordering the common characters more strictly by category:

* whitespace, punctuation, general symbols, currency symbols, and numbers.

What the regrouping allows is for users to parametrically reorder the groups. For example, users can reorder numbers after all scripts, or reorder Greek before Latin.

The relative order within each of these groups still matches the DUCET. Symbols, punctuation, and numbers that are grouped with a particular script stay with that script.
The differences between CLDR and the DUCET order are:

1. CLDR groups the numbers together after currency symbols, instead of splitting them with some before and some after. Thus the following are put _after_ currencies and just before all the other numbers.

   U+09F4 ( ৴ ) [No] BENGALI CURRENCY NUMERATOR ONE
   ...
   U+1D371 ( 𝍱 ) [No] COUNTING ROD TENS DIGIT NINE

2. CLDR handles a few other characters differently:
   1. U+10A7F ( 𐩿 ) [Po] OLD SOUTH ARABIAN NUMERIC INDICATOR is put with punctuation, not symbols
   2. U+20A8 ( ₨ ) [Sc] RUPEE SIGN and U+FDFC ( ﷼ ) [Sc] RIAL SIGN are put with currency signs, not with R and REH.

### <a name="non_variable_symbols" href="#non_variable_symbols">Non-variable symbols</a>

There are multiple [Variable-Weighting](https://www.unicode.org/reports/tr10/#Variable_Weighting) options in the UCA for symbols and punctuation, including _non-ignorable_ and _shifted_. With the _shifted_ option, almost all symbols and punctuation are ignored, except at a fourth level. The CLDR root locale ordering is modified so that symbols are not affected by the _shifted_ option. That is, by default, symbols are not “variable” in CLDR. So _shifted_ only causes whitespace and punctuation to be ignored, but not symbols (like ♥).
The DUCET behavior can be specified with a locale ID using the "kv" keyword, to set the Variable section to include all of the symbols below it, or be set parametrically where implementations allow access.

See also:

* _[Setting Options](#Setting_Options)_
* [https://www.unicode.org/charts/collation/](https://www.unicode.org/charts/collation/)

### <a name="tibetan_contractions" href="#tibetan_contractions">Additional contractions for Tibetan</a>

Ten contractions are added for Tibetan: Two to fulfill [well-formedness condition 5](https://www.unicode.org/reports/tr10/#WF5), and eight more to preserve the default order for Tibetan. For details see _UTS #10, Section 3.8.2, [Well-Formedness of the DUCET](https://www.unicode.org/reports/tr10/#Well_Formed_DUCET)_.

### <a name="tailored_noncharacter_weights" href="#tailored_noncharacter_weights">Tailored noncharacter weights</a>

U+FFFE and U+FFFF have special tailorings:

> **U+FFFF:** This code point is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\\uFFFF”, to include all strings starting with "sch" or equivalent.
275*912701f9SAndroid Build Coastguard Worker> 276*912701f9SAndroid Build Coastguard Worker> **U+FFFE:** This code point produces a CE with minimal, unique weights on primary and identical levels. For details see the _[CLDR Collation Algorithm](#Algorithm_FFFE)_ above. 277*912701f9SAndroid Build Coastguard Worker 278*912701f9SAndroid Build Coastguard WorkerUCA (beginning with version 6.3) also maps **U+FFFD** to a special collation element with a very high primary weight, so that it is reliably non-[variable](https://www.unicode.org/reports/tr10/#Variable_Weighting), for use with [ill-formed code unit sequences](https://www.unicode.org/reports/tr10/#Handling_Illformed). 279*912701f9SAndroid Build Coastguard Worker 280*912701f9SAndroid Build Coastguard WorkerIn CLDR, so as to maintain the special collation elements, **U+FFFD..U+FFFF** are not further tailorable, and nothing can tailor to them. That is, neither can occur in a collation rule. For example, the following rules are illegal: 281*912701f9SAndroid Build Coastguard Worker 282*912701f9SAndroid Build Coastguard Worker``` 283*912701f9SAndroid Build Coastguard Worker&\uFFFF < x 284*912701f9SAndroid Build Coastguard Worker``` 285*912701f9SAndroid Build Coastguard Worker 286*912701f9SAndroid Build Coastguard Worker``` 287*912701f9SAndroid Build Coastguard Worker&x <\uFFFF 288*912701f9SAndroid Build Coastguard Worker``` 289*912701f9SAndroid Build Coastguard Worker 290*912701f9SAndroid Build Coastguard Worker> **Note**: Java uses an early version of this collation syntax, but has not been updated recently. It does not support any of the syntax marked with [...], and its default table is not the DUCET nor the CLDR root collation. 
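The restriction that U+FFFD..U+FFFF cannot occur in a collation rule can be checked before a rule string is handed to a builder. The following is a minimal sketch, not part of any CLDR or ICU API: the function name `find_reserved_code_points` is hypothetical, and the sketch assumes that only literal characters and four-digit `\uXXXX` escapes need to be recognized (quoting and other escape forms of the rule syntax are ignored).

```python
import re

# Code points that CLDR does not allow in tailoring rules (U+FFFD..U+FFFF).
RESERVED = {0xFFFD, 0xFFFE, 0xFFFF}

# Assumption: only four-digit \uXXXX escapes are recognized in this sketch.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def find_reserved_code_points(rule):
    """Return the sorted reserved code points that appear in a rule string."""
    unescaped = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), rule)
    return sorted({ord(ch) for ch in unescaped if ord(ch) in RESERVED})


if __name__ == "__main__":
    for rule in (r"&\uFFFF < x", r"&x <\uFFFF", "&a < b"):
        hits = find_reserved_code_points(rule)
        status = "illegal" if hits else "ok"
        print(rule, status, [f"U+{cp:04X}" for cp in hits])
```

A real rule parser would perform this check while tokenizing, so that quoted text and longer escapes are handled consistently with the rest of the syntax.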

### <a name="Root_Data_Files" href="#Root_Data_Files">Root Collation Data Files</a>

The CLDR root collation data files are in the CLDR repository and release, under the path [common/uca/](https://github.com/unicode-org/cldr/blob/main/common/uca/).

For most data files there are **\_SHORT** versions available. They contain the same data but only minimal comments, to reduce the file sizes.

Comments with DUCET-style weights in files other than allkeys_CLDR.txt and allkeys_DUCET.txt use the weights defined in allkeys_CLDR.txt.

* **allkeys_CLDR** - A file that provides a remapping of UCA DUCET weights for use with CLDR.
* **allkeys_DUCET** - The same as DUCET allkeys.txt, but in alternate=non-ignorable sort order, for easier comparison with allkeys_CLDR.txt.
* **FractionalUCA** - A file that provides a remapping of UCA DUCET weights for use with CLDR. The weight values are modified:
    * The weights have variable length, with 1..4 bytes each. Each secondary or tertiary weight currently uses at most 2 bytes.
    * There are tailoring gaps between adjacent weights, so that a number of characters can be tailored to sort between any two root collation elements.
    * There are collation elements with primary weights at the boundaries between reordering groups and Unicode scripts, so that tailoring around the first or last primary of a group/script results in new collation elements that sort and reorder together with that group or script. These boundary weights also define the primary weight ranges for parametric group and script reordering.

  An implementation may modify the weights further to fit the needs of its data structures.

* **UCA_Rules** - A file that specifies the root collation order in the form of [tailoring rules](#Collation_Tailorings). This is only an approximation of the FractionalUCA data, since the rule syntax cannot express every detail of the collation elements. For example, in the DUCET and in FractionalUCA, tertiary differences are usually expressed with special tertiary weights on all collation elements of an expansion, while a typical from-rules builder will modify the tertiary weight of only one of the collation elements.
* **CollationTest_CLDR** - The CLDR versions of the CollationTest files, which use the tailorings for CLDR. For information on the format, see [CollationTest.html](https://www.unicode.org/Public/UCA/latest/CollationTest.html) in the [UCA data directory](https://www.unicode.org/reports/tr10/#Data10).
    * CollationTest_CLDR_NON_IGNORABLE.txt
    * CollationTest_CLDR_SHIFTED.txt

### <a name="Root_Data_File_Formats" href="#Root_Data_File_Formats">Root Collation Data File Formats</a>

The file formats may change between versions of CLDR. The formats for CLDR 23 and beyond are as follows. As usual, text after a # is a comment.

#### <a name="File_Format_allkeys_CLDR_txt" href="#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a>

This file defines CLDR’s tailoring of the DUCET, as described in _[Root Collation](#Root_Collation)_.

The format is similar to that of [allkeys.txt](https://www.unicode.org/reports/tr10/#File_Format), although there may be some differences in whitespace.

#### <a name="File_Format_FractionalUCA_txt" href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a>

The format is illustrated by the following sample lines, with commentary afterwards.

```
[UCA version = 6.0.0]
```

Provides the version number of the UCA table.

```
[Unified_Ideograph 4E00..9FCC FA0E..FA0F FA11 FA13..FA14 FA1F FA21 FA23..FA24 FA27..FA29 3400..4DB5 20000..2A6D6 2A700..2B734 2B740..2B81D]
```

Lists the ranges of Unified_Ideograph characters in collation order. (New in CLDR 24.) They map to collation elements with [implicit (constructed) primary weights](https://www.unicode.org/reports/tr10/#Implicit_Weights).

```
[radical 6=⼅亅:亅了-亇予㐧-争亊-事㐨-]
[radical 210=⿑齊:齊齋䶒䶓齌齍-齎齏-]
[radical 210'=⻬齐:齐齑]
[radical end]
```

Data for Unihan radical-stroke order. (New in CLDR 26.) Following the [Unified_Ideograph] line, a section of `[radical ...]` lines defines a radical-stroke order of the Unified_Ideograph characters.

For Han characters, an implementation may choose either to implement the order defined in the UCA and the [Unified_Ideograph] data, or to implement the order defined by the `[radical ...]` lines. Beginning with CLDR 26, the CJK type="unihan" tailorings assume that the root collation order sorts Han characters in Unihan radical-stroke order according to the `[radical ...]` data. The CollationTest_CLDR files only contain Han characters that are in the same relative order using implicit weights or the radical-stroke order.

The root collation radical-stroke order is derived from the first (normative) values of the [Unihan kRSUnicode](https://www.unicode.org/reports/tr38/#kRSUnicode) field for each Han character. Han characters are ordered by radical, with traditional forms sorting before simplified ones. Characters with the same radical are ordered by residual stroke count. Characters with the same radical-stroke values are ordered by block and code point, as for [UCA implicit weights](https://www.unicode.org/reports/tr10/#Implicit_Weights).

There is one `[radical ...]` line per radical, in the order of radical numbers. Each line shows the radical number and the representative characters from the [UCD file CJKRadicals.txt](https://www.unicode.org/reports/tr44/#UCD_Files_Table), followed by a colon (“:”) and the Han characters with that radical in the order as described above. A range like `万-丌` indicates that the code points in that range sort in code point order.

The radical number and characters are informational. The sort order is established only by the order of the `[radical ...]` lines, and within each line by the characters and ranges between the colon (“:”) and the bracket (“]”).

Each Unified_Ideograph occurs exactly once. Only Unified_Ideograph characters are listed on `[radical ...]` lines.

This section is terminated with one `[radical end]` line.
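
As an illustration of the `[radical ...]` line format described above, the following sketch expands one line into its ordered list of characters. The function name `expand_radical_line` is hypothetical, and the input line in the usage example is made up for illustration (it reuses the `万-丌` range mentioned above); it is not copied from FractionalUCA.txt.

```python
def expand_radical_line(line):
    """Expand one '[radical N=reps:members]' line.

    Returns (label, representatives, characters), or None for '[radical end]'.
    Ranges such as X-Y are expanded in code point order.
    """
    text = line.strip()
    if text == "[radical end]":
        return None
    if not (text.startswith("[radical ") and text.endswith("]")):
        raise ValueError("not a [radical ...] line: %r" % line)
    body = text[len("[radical "):-1]
    label, rest = body.split("=", 1)          # label may carry an apostrophe, e.g. 210'
    representatives, members = rest.split(":", 1)

    characters = []
    i = 0
    while i < len(members):
        start = members[i]
        if i + 2 < len(members) and members[i + 1] == "-":
            end = members[i + 2]
            characters.extend(chr(cp) for cp in range(ord(start), ord(end) + 1))
            i += 3
        else:
            characters.append(start)
            i += 1
    return label, representatives, characters


if __name__ == "__main__":
    # Hypothetical line: U+4E00, then the range U+4E07..U+4E0C.
    sample = "[radical 1=\u2f00\u4e00:\u4e00\u4e07-\u4e0c]"
    label, reps, chars = expand_radical_line(sample)
    print(label, [f"U+{ord(c):04X}" for c in chars])
```

Because the radical number and representative characters are informational, a consumer of this data would typically keep only the concatenated character lists, in file order, to establish the Han sort order.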

```
0000; [,,] # Zyyy Cc [0000.0000.0000] * <NULL>
```

Provides a weight line. The first field (before the ";") is a hex code point sequence. The second field is a sequence of collation elements. Each collation element has 3 parts separated by commas: the primary weight, secondary weight, and tertiary weight. The tertiary weight actually consists of two components: the top two bits (0xC0) are used for the _case level_, and should be masked off where a case level is not used; the remaining bits carry the tertiary weight proper.

A weight is either empty (meaning a zero or ignorable weight) or is a sequence of one or more bytes. The bytes are interpreted as a "fraction", meaning that the ordering is 04 < 05 05 < 06. The weights are constructed so that no weight is an initial subsequence of another: that is, having both the weights 05 and 05 05 is illegal. The above line consists of all ignorable weights.

The vertical bar (“|”) character is used to indicate context, as in:

```
006C | 00B7; [, DB A9, 05]
```

This example indicates that if U+00B7 appears immediately after U+006C, it is given the corresponding collation element instead. This syntax is roughly equivalent to the following contraction, but is more efficient. For details see the specification of _[Context-Sensitive Mappings](#Context_Sensitive_Mappings)_ above.

```
006C 00B7; CE(006C) [, DB A9, 05]
```

Single-byte primary weights are given to particularly frequent characters, such as space, digits, and a-z. Other relatively frequent characters are given two-byte weights, while relatively infrequent characters are given three-byte weights. For example:

```
...
0009; [03 05, 05, 05] # Zyyy Cc [0100.0020.0002] * <CHARACTER TABULATION>
...
1B60; [06 14 0C, 05, 05] # Bali Po [0111.0020.0002] * BALINESE PAMENENG
...
0031; [14, 05, 05] # Zyyy Nd [149B.0020.0002] * DIGIT ONE
```

The assignment of 2 vs 3 bytes does not reflect importance, or exact frequency.
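
The weight-line format and the "fraction" interpretation of weights can be sketched as follows. This is only an illustration: the function names are hypothetical, and the parser deliberately ignores context lines that use “|” as well as collation elements written as references to other characters. Python compares `bytes` objects lexicographically, which matches the fractional ordering 04 < 05 05 < 06.

```python
import re

# Matches one bracketed collation element such as "[03 05, 05, 05]" or "[,,]".
_CE_PATTERN = re.compile(r"\[([^\[\]]*)\]")


def parse_weight(text):
    """Convert space-separated hex bytes ('05 05') to bytes; empty text is an ignorable weight."""
    return bytes(int(part, 16) for part in text.split())


def parse_weight_line(line):
    """Return (code_points, collation_elements) for a plain weight line."""
    data = line.split("#", 1)[0]            # text after '#' is a comment
    cp_field, ce_field = data.split(";", 1)
    code_points = [int(cp, 16) for cp in cp_field.split()]
    elements = []
    for match in _CE_PATTERN.finditer(ce_field):
        primary, secondary, tertiary = (parse_weight(w) for w in match.group(1).split(","))
        elements.append((primary, secondary, tertiary))
    return code_points, elements


def is_prefix_free(weights):
    """Check that no non-empty weight is an initial subsequence of another."""
    ordered = sorted({w for w in weights if w})
    # In sorted order, a prefix is always immediately followed by a string it prefixes.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def split_case_bits(tertiary):
    """Split a tertiary weight into (case bits, weight with the top two bits masked off)."""
    if not tertiary:
        return 0, tertiary
    lead = tertiary[0]
    return lead & 0xC0, bytes([lead & 0x3F]) + tertiary[1:]


if __name__ == "__main__":
    line = "0009; [03 05, 05, 05] # Zyyy Cc [0100.0020.0002] * <CHARACTER TABULATION>"
    print(parse_weight_line(line))
    print(parse_weight("04") < parse_weight("05 05") < parse_weight("06"))
    print(is_prefix_free([parse_weight("05"), parse_weight("05 05")]))
```

Storing weights as byte strings keeps comparison trivial; an implementation that packs weights into fixed-width integers would instead left-align the bytes and pad with zeros.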

```
3041; [76 06, 05, 03] # Hira Lo [3888.0020.000D] * HIRAGANA LETTER SMALL A
3042; [76 06, 05, 85] # Hira Lo [3888.0020.000E] * HIRAGANA LETTER A
30A1; [76 06, 05, 10] # Kana Lo [3888.0020.000F] * KATAKANA LETTER SMALL A
30A2; [76 06, 05, 9E] # Kana Lo [3888.0020.0011] * KATAKANA LETTER A
```

Beginning with CLDR 27, some primary or secondary collation elements may have below-common tertiary weights (e.g., `03`), in particular to allow normal Hiragana letters to have common tertiary weights.

```
# SPECIAL MAX/MIN COLLATION ELEMENTS
FFFE; [02, 05, 05] # Special LOWEST primary, for merge/interleaving
FFFF; [EF FE, 05, 05] # Special HIGHEST primary, for ranges
```

The two tailored noncharacters have their own primary weights.

```
F967; [U+4E0D] # Hani Lo [FB40.0020.0002][CE0D.0000.0000] * CJK COMPATIBILITY IDEOGRAPH-F967
2F02; [U+4E36, 10] # Hani So [FB40.0020.0004][CE36.0000.0000] * KANGXI RADICAL DOT
2E80; [U+4E36, 70, 20] # Hani So [FB40.0020.0004][CE36.0000.0000][0000.00FC.0004] * CJK RADICAL REPEAT
```

Some collation elements are specified by reference to other mappings. This is particularly useful for Han characters which are given implicit/constructed primary weights; the reference to a Unified_Ideograph makes these mappings independent of implementation details. This technique may also be used in other mappings to show the relationship of character variants.

The referenced character must have a mapping listed earlier in the file, or the mapping must have been defined via the [Unified_Ideograph] data line. The referenced character must map to exactly one collation element.

`[U+4E0D]` copies U+4E0D’s entire collation element. `[U+4E36, 10]` copies U+4E36’s primary and secondary weights and specifies a different tertiary weight. `[U+4E36, 70, 20]` only copies U+4E36’s primary weight and specifies other secondary and tertiary weights.

FractionalUCA.txt does not have any explicit mappings for implicit weights. Therefore, an implementation is free to choose an algorithm for computing implicit weights according to the principles specified in the UCA.
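
The three reference forms described above can be resolved with a small lookup. The sketch below assumes a hypothetical `table` that maps a code point to its single collation element as a `(primary, secondary, tertiary)` tuple of byte strings; the primary weight used for U+4E36 in the usage example is a placeholder, because implicit weights are chosen by the implementation.

```python
def resolve_reference(ce_text, table):
    """Resolve '[U+XXXX]', '[U+XXXX, tt]' or '[U+XXXX, ss, tt]' to (primary, secondary, tertiary).

    `table` maps code point -> (primary, secondary, tertiary) for characters
    that map to exactly one collation element.
    """
    fields = [field.strip() for field in ce_text.strip()[1:-1].split(",")]
    if not fields[0].startswith("U+"):
        raise ValueError("not a reference: %r" % ce_text)
    primary, secondary, tertiary = table[int(fields[0][2:], 16)]

    overrides = [bytes.fromhex(field.replace(" ", "")) for field in fields[1:]]
    if len(overrides) == 1:        # copy primary and secondary, new tertiary
        tertiary = overrides[0]
    elif len(overrides) == 2:      # copy primary only, new secondary and tertiary
        secondary, tertiary = overrides
    elif overrides:
        raise ValueError("too many weights in %r" % ce_text)
    return primary, secondary, tertiary


if __name__ == "__main__":
    # Placeholder implicit primary for U+4E36; real values are implementation-defined.
    table = {0x4E36: (b"\xe0\x10\x20", b"\x05", b"\x05")}
    for text in ("[U+4E36]", "[U+4E36, 10]", "[U+4E36, 70, 20]"):
        print(text, resolve_reference(text, table))
```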

```
FDD1 20AC; [0D 20 02, 05, 05] # CURRENCY first primary
FDD1 0034; [0E 02 02, 05, 05] # DIGIT first primary starts new lead byte
FDD0 FF21; [26 02 02, 05, 05] # REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte
FDD1 004C; [28 02 02, 05, 05] # LATIN first primary starts new lead byte
FDD0 FF3A; [5D 02 02, 05, 05] # REORDER_RESERVED_AFTER_LATIN first primary starts new lead byte
FDD1 03A9; [5F 04 02, 05, 05] # GREEK first primary starts new lead byte (compressible)
FDD1 03E2; [5F 60 02, 05, 05] # COPTIC first primary (compressible)
```

These are special mappings with primaries at the boundaries of scripts and reordering groups. They serve as tailoring boundaries, so that tailoring near the first or last character of a script or group places the tailored item into the same group. Beginning with CLDR 24, each of these is a contraction of U+FDD1 with a character of the corresponding script (or of the General_Category [Z, P, S, Sc, Nd] corresponding to a special reordering group), mapping to the first possible primary weight per script or group. They can be enumerated for implementations of [Collation Indexes](#Collation_Indexes). (Earlier versions mapped contractions with U+FDD0 to the last primary weights of each group but not each script.)

Beginning with CLDR 27, these mappings alone define the boundaries for reordering single scripts.
(There are no mappings for Hrkt, Hans, or Hant because they are not fully distinct scripts; they share primary weights with other scripts: Hrkt=Hira=Kana & Hans=Hant=Hani.) There are some reserved ranges, beginning at boundaries marked with U+FDD0 plus following characters as shown above. The reserved ranges are not used for collation elements and are not available for tailoring.

Some primary lead bytes must be reserved so that reordering of scripts along partial-lead-byte boundaries can “split” the primary lead byte and use up a reserved byte. This is for implementations that write sort keys, which must reorder primary weights by offsetting them by whole lead bytes. There are reorder-reserved ranges before and after Latin, so that reordering scripts with few primary lead bytes relative to Latin can move those scripts into the reserved ranges without changing the primary weights of any other script. Each of these boundaries begins with a new two-byte primary; that is, no two groups/scripts/ranges share the top 16 bits of their primary weights.

```
FDD0 0034; [11, 05, 05] # lead byte for numeric sorting
```

This mapping specifies the lead byte for numeric sorting. It must be different from the lead byte of any other primary weight, otherwise numeric sorting would generate ill-formed collation elements. Therefore, this mapping itself must be excluded from the set of regular mappings. This value can be ignored by implementations that do not support numeric sorting. (Other contractions with U+FDD0 can normally be ignored altogether.)
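
The statement that no two groups, scripts, or reserved ranges share the top 16 bits of their primary weights can be checked mechanically against the boundary mappings. The sketch below runs on the sample boundary lines shown earlier in this section; the function names are hypothetical, and the check covers only these sample lines, not the full data file.

```python
SAMPLE_BOUNDARIES = """\
FDD1 20AC; [0D 20 02, 05, 05] # CURRENCY first primary
FDD1 0034; [0E 02 02, 05, 05] # DIGIT first primary starts new lead byte
FDD0 FF21; [26 02 02, 05, 05] # REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte
FDD1 004C; [28 02 02, 05, 05] # LATIN first primary starts new lead byte
FDD0 FF3A; [5D 02 02, 05, 05] # REORDER_RESERVED_AFTER_LATIN first primary starts new lead byte
FDD1 03A9; [5F 04 02, 05, 05] # GREEK first primary starts new lead byte (compressible)
FDD1 03E2; [5F 60 02, 05, 05] # COPTIC first primary (compressible)
"""


def boundary_primaries(text):
    """Collect (code_points, primary_bytes) for contractions starting with U+FDD0 or U+FDD1."""
    entries = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop the comment
        if not data:
            continue
        cp_field, ce_field = data.split(";", 1)
        code_points = [int(cp, 16) for cp in cp_field.split()]
        if code_points[0] not in (0xFDD0, 0xFDD1):
            continue
        primary_text = ce_field.strip()[1:-1].split(",")[0]
        entries.append((code_points, bytes.fromhex(primary_text.replace(" ", ""))))
    return entries


def top16_are_distinct(entries):
    """True if no two boundary primaries share their first two bytes."""
    tops = [primary[:2] for _, primary in entries]
    return len(tops) == len(set(tops))


if __name__ == "__main__":
    entries = boundary_primaries(SAMPLE_BOUNDARIES)
    print(len(entries), top16_are_distinct(entries))
```

Note that the numeric-sorting mapping (`FDD0 0034`) has a one-byte primary and, as stated above, must be kept out of the set of regular mappings; a parser built on this sketch would need to filter it separately.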

```
# HOMELESS COLLATION ELEMENTS
FDD0 0063; [, 97, 3D] # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F] * U+01C6 LATIN SMALL LETTER DZ WITH CARON
FDD0 0064; [, A7, 09] # [15D1.0020.0004] [0000.0056.0004] * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA
FDD0 0065; [, B1, 09] # [1644.0020.0004] [0000.0061.0004] * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE
```

The DUCET has some weights that don't correspond directly to a character. To allow for implementations to have a mapping for each collation element (necessary for certain implementations of tailoring), this requires the construction of special sequences for those weights. These collation elements can normally be ignored.

Next, a number of tables are defined. The function of each of the tables is summarized afterwards.

```
# VALUES BASED ON UCA
...
[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT
[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032
[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED
[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED
[first trailing [E5, 05, 05]] # CONSTRUCTED
[last trailing [E5, 05, 05]] # CONSTRUCTED
...
```

This table summarizes ranges of important groups of characters for implementations.

```
# Top Byte => Reordering Tokens
[top_byte 00 TERMINATOR ] # [0] TERMINATOR=1
[top_byte 01 LEVEL-SEPARATOR ] # [0] LEVEL-SEPARATOR=1
[top_byte 02 FIELD-SEPARATOR ] # [0] FIELD-SEPARATOR=1
[top_byte 03 SPACE ] # [9] SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1
...
```

This table defines the reordering groups, for script reordering. The table maps from the first bytes of the fractional weights to a reordering token. The format is "[top_byte " byte-value reordering-token "COMPRESS"? "]". The "COMPRESS" value is present when there is only one byte in the reordering token, and primary-weight compression can be applied.
Most reordering tokens are script values; others are special-purpose values, such as PUNCTUATION. Beginning with CLDR 24, this table precedes the regular mappings, so that parsers can use this information while processing and optimizing mappings. Beginning with CLDR 27, most of this data is irrelevant because single scripts can be reordered. Only the "COMPRESS" data is still useful.

```
# Reordering Tokens => Top Bytes
[reorderingTokens Arab 61=910 62=910 ]
[reorderingTokens Armi 7A=22 ]
[reorderingTokens Armn 5F=82 ]
[reorderingTokens Avst 7A=54 ]
...
```

This table is an inverse mapping from reordering token to top byte(s). In terms like "61=910", the first value is the top byte, while the second is informational, indicating the number of primaries assigned with that top byte.

```
# General Categories => Top Byte
[categories Cc 03{SPACE}=6 ]
[categories Cf 77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ]
[categories Lm 0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...
```

This table is informational, providing the top bytes, scripts, and primaries associated with each general category value.

```
# FIXED VALUES
[fixed first implicit byte E0]
[fixed last implicit byte E4]
[fixed first trail byte E5]
[fixed last trail byte EF]
[fixed first special byte F0]
[fixed last special byte FF]

[fixed secondary common byte 05]
[fixed last secondary common byte 45]
[fixed first ignorable secondary byte 80]

[fixed tertiary common byte 05]
[fixed first ignorable tertiary byte 3C]
```

The final table gives certain hard-coded byte values. The "trail" area is provided for implementation of the "trailing weights" as described in the UCA.

> **Note**: The particular primary lead bytes for Hani vs. IMPLICIT vs. TRAILING are only an example. An implementation is free to move them if it also moves the explicit TRAILING weights.
This affects only a small number of explicit mappings in FractionalUCA.txt, such as for U+FFFD, U+FFFF, and the “unassigned first primary”. It is possible to use no SPECIAL bytes at all, and to use only the one primary lead byte FF for TRAILING weights.

#### <a name="File_Format_UCA_Rules_txt" href="#File_Format_UCA_Rules_txt">UCA_Rules.txt</a>

The format for this file uses the CLDR collation syntax; see _[Collation Tailorings](#Collation_Tailorings)_.

## <a name="Collation_Tailorings" href="#Collation_Tailorings">Collation Tailorings</a>

```xml
<!ELEMENT collations (alias | (defaultCollation?, collation*, special*)) >

<!ELEMENT defaultCollation ( #PCDATA ) >
```

This element of the LDML format contains one or more `collation` elements, distinguished by type. Each `collation` contains elements with parametric settings, or rules that specify a certain sort order, as a tailoring of the root order, or both.

> **Note**: CLDR collation tailoring data should follow the [CLDR Collation Guidelines](https://cldr.unicode.org/index/cldr-spec/collation-guidelines).

### <a name="Collation_Types" href="#Collation_Types">Collation Types</a>

Each locale may have multiple sort orders (types). The `defaultCollation` element defines the default tailoring for a locale and its sublocales. For example:

* root.xml: `<defaultCollation>standard</defaultCollation>`
* zh.xml: `<defaultCollation>pinyin</defaultCollation>`
* zh_Hant.xml: `<defaultCollation>stroke</defaultCollation>`

To allow implementations in reduced memory environments to use CJK sorting, there are also short forms of each of these collation sequences. These provide for the most common characters in common use, and are marked with `alt="short"`.

A collation type name that starts with "private-", for example, "private-kana", indicates an incomplete tailoring that is only intended for import into one or more other tailorings (usually for sharing common rules). It does not establish a complete sort order. An implementation should not build data tables for a private collation type, and should not include a private collation type in a list of available types.

> **Note**: There is an on-line demonstration of collation at [[LocaleExplorer](tr35.md#LocaleExplorer)] that uses the same rule syntax. (Pick the locale and scroll to "Collation Rules", near the end.)

> **Note**: In CLDR 23 and before, LDML collation files used an XML format. Starting with CLDR 24, the XML collation syntax is deprecated and no longer used. See the _[CLDR 23 version of this document](https://www.unicode.org/reports/tr35/tr35-31/tr35-collation.html#Collation_Tailorings)_ for details about the XML collation syntax.

#### <a name="Collation_Type_Fallback" href="#Collation_Type_Fallback">Collation Type Fallback</a>

When loading a requested tailoring from its data file and the parent file chain, use the following type fallback to find the tailoring.

1. Determine the default type from the `<defaultCollation>` element; map the default type to its alias if one is defined. If there is no `<defaultCollation>` element, then use "standard" as the default type.
2. If the request language tag specifies the collation type (keyword "co"), then map it to its alias if one is defined (e.g., "-co-phonebk" → "phonebook"). If the language tag does not specify the type, then use the default type.
3. Use the `<collation>` element with this type.
4. If it does not exist, and the type starts with "search" but is longer, then set the type to "search" and use that `<collation>` element. (For example, "searchjl" → "search".)
5. If it does not exist, and the type is not the default type, then set the type to the default type and use that `<collation>` element.
6. If it does not exist, and the type is not "standard", then set the type to "standard" and use that `<collation>` element.
7. If it does not exist, then use the CLDR root collation.

> **Note**: The CLDR collation/root.xml contains `<defaultCollation>standard</defaultCollation>`, `<collation type="standard">` (with an empty tailoring, so this is the same as the CLDR root collation), and `<collation type="search">`.

For example, assume that we have collation data for the following tailorings. ("da/search" is shorthand for "da-u-co-search".)

* root/defaultCollation=standard
* root/standard (this is the same as “the CLDR root collator”)
* root/search
* da/standard
* da/search
* el/standard
* ko/standard
* ko/search
* ko/searchjl
* zh/defaultCollation=pinyin
* zh/pinyin
* zh/stroke
* zh-Hant/defaultCollation=stroke

###### Table: <a name="Sample_requested_and_actual_collation_locales_and_types" href="#Sample_requested_and_actual_collation_locales_and_types">Sample requested and actual collation locales and types</a>

| requested         | actual        | comment |
| ----------------- | ------------- | ------- |
| da/phonebook      | da/standard   | default type for Danish |
| zh                | zh/pinyin     | default type for zh |
| zh/standard       | root/standard | no "standard" tailoring for zh, falls back to root |
| zh/phonebook      | zh/pinyin     | default type for zh |
| zh-Hant/phonebook | zh/stroke     | default type for zh-Hant is "stroke" |
| da/searchjl       | da/search     | "search.+" falls back to "search" |
| el/search         | root/search   | no "search" tailoring for Greek |
| el/searchjl       | root/search   | "search.+" falls back to "search", found in root |
| ko/searchjl       | ko/searchjl   | requested data is actually available |

### <a name="Collation_Version" href="#Collation_Version">Version</a>

The `version` attribute is used in case a specific version of the UCA is to be specified. It is optional, and is specified if the results are to be identical on different systems. If it is not supplied, then the version is assumed to be the same as the Unicode version for the system as a whole.

> **Note**: For version 3.1.1 of the UCA, the version of Unicode must also be specified with any versioning information; an example would be "3.1.1/3.2" for version 3.1.1 of the UCA, for version 3.2 of Unicode.
This was changed by decision of the UTC, so that dual versions were no longer necessary. So for UCA 4.0 and beyond, the version just has a single number.

### <a name="Collation_Element" href="#Collation_Element">Collation Element</a>

```xml
<!ELEMENT collation (alias | (cr*, special*)) >
```

The tailoring syntax is designed to be independent of the actual weights used in any particular UCA table. That way the same rules can be applied to UCA versions over time, even if the underlying weights change. The following illustrates the overall structure of a collation:

```xml
<collation type="phonebook">
    <cr><![CDATA[
        [caseLevel on]
        &c < k
    ]]></cr>
</collation>
```

### <a name="Setting_Options" href="#Setting_Options">Setting Options</a>

Parametric settings can be specified in language tags or in rule syntax (in the form `[keyword value]` ). For example, `-ks-level2` or `[strength 2]` will only compare strings based on their primary and secondary weights.
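
The correspondence between the two forms can be sketched in a few lines of Python. This is only an illustration: the table `KS_TO_RULE` and the function `strength_rule` are hypothetical names, the values are copied from the `ks` rows of the Collation Settings table in this section, and the tag parsing is deliberately simplified (it only looks at keyword/value pairs after the `u` singleton).

```python
from typing import Optional

# Hypothetical lookup table: BCP47 "ks" values -> rule syntax,
# copied from the "ks" rows of the Collation Settings table.
KS_TO_RULE = {
    "level1": "[strength 1]",
    "level2": "[strength 2]",
    "level3": "[strength 3]",
    "level4": "[strength 4]",
    "identic": "[strength I]",
}


def strength_rule(language_tag: str) -> Optional[str]:
    """Return the rule-syntax strength setting requested by a language tag.

    Simplified sketch: finds the "ks" keyword after the "u" singleton and
    ignores every other part of the tag. Returns None if "ks" is absent.
    """
    subtags = language_tag.lower().split("-")
    if "u" not in subtags:
        return None
    extension = subtags[subtags.index("u") + 1:]
    for key, value in zip(extension, extension[1:]):
        if key == "ks":
            return KS_TO_RULE.get(value)
    return None


print(strength_rule("en-u-ks-level2"))  # [strength 2]
```

Looking only after the `u` singleton matters because `ks` is also a valid language subtag (Kashmiri), so a naive search for `ks` anywhere in the tag would misfire.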

If a setting is not present, the CLDR default (or the default for the locale, if there is one) is used. That default is listed in bold italics. Where there is a UCA default that is different, it is listed in bold with (**UCA default**). Note that the default value for a locale may be different than the normal default value for the setting.

###### Table: <a name="Collation_Settings" href="#Collation_Settings">Collation Settings</a>

<table><tbody>
<tr><th>BCP47 Key</th><th>BCP47 Value</th><th>Rule Syntax</th><th>Description</th></tr>

<tr><td rowspan="5">ks</td><td>level1</td><td><code>[strength 1]</code><br/>(primary)</td>
    <td rowspan="5">Sets the default strength for comparison, as described in the [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
<i>Note that a strength setting of greater than 3 may have the same effect as <b>identical</b>, depending on the locale and implementation.</i></td></tr>
<tr><td>level2</td><td><code>[strength 2]</code><br/>(secondary)</td></tr>
<tr><td>level3</td><td><i><b><code>[strength 3]</code><br/>(tertiary)</b></i></td></tr>
<tr><td>level4</td><td><code>[strength 4]</code><br/>(quaternary)</td></tr>
<tr><td>identic</td><td><code>[strength I]</code><br/>(identical)</td></tr>

<tr><td rowspan="3">ka</td><td>noignore</td><td><i><b><code>[alternate non-ignorable]</code></b></i><br/></td>
    <td rowspan="3">Sets alternate handling for variable weights, as described in [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], where "shifted" causes certain characters to be ignored in comparison. <i>The default for LDML is different than it is in the UCA. In LDML, the default for alternate handling is <b>non-ignorable</b>, while in UCA it is <b>shifted</b>.
In addition, in LDML only whitespace and punctuation are variable by default.</i></td></tr>
<tr><td>shifted</td><td><b><code>[alternate shifted]</code><br/>(UCA default)</b></td></tr>
<tr><td><i>n/a</i></td><td><i>n/a</i><br/>(blanked)</td></tr>

<tr><td rowspan="2">kb</td><td>true</td><td><code>[backwards 2]</code></td>
    <td rowspan="2">Sets the comparison for the second level to be <b>backwards</b>, as described in [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td></tr>
<tr><td>false</td><td><i><b>n/a</b></i></td></tr>

<tr><td rowspan="2">kk</td><td>true</td><td><b><code>[normalization on]</code><br/>(UCA default)</b></td>
    <td rowspan="2">If <b>on</b>, then the normal [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] algorithm is used. If <b>off</b>, then most strings should still sort correctly despite not normalizing to NFD first.<br/><i>Note that the default for CLDR locales may be different than in the UCA.
The rules for particular locales have it set to <b>on</b>: those locales whose exemplar characters (in forms commonly interchanged) would be affected by normalization.</i></td></tr>
<tr><td>false</td><td><i><b><code>[normalization off]</code></b></i></td></tr>

<tr><td rowspan="2">kc</td><td>true</td><td><code>[caseLevel on]</code></td>
    <td rowspan="2">If set to <b>on</b><i>,</i> a level consisting only of case characteristics will be inserted in front of tertiary level, as a "Level 2.5". To ignore accents but take case into account, set strength to <b>primary</b> and case level to <b>on</b>. For details, see <i><a href="#Case_Parameters">Case Parameters</a></i> .</td></tr>
<tr><td>false</td><td><i><b><code>[caseLevel off]</code></b></i></td></tr>

<tr><td rowspan="3">kf</td><td>upper</td><td><code>[caseFirst upper]</code></td>
    <td rowspan="3">If set to <b>upper</b>, causes upper case to sort before lower case. If set to <b>lower</b>, causes lower case to sort before upper case. Useful for locales that have already supported ordering but require different order of cases. Affects case and tertiary levels.
For details, see <i><a href="#Case_Parameters">Case Parameters</a></i> .</td></tr>
<tr><td>lower</td><td><code>[caseFirst lower]</code></td></tr>
<tr><td>false</td><td><i><b><code>[caseFirst off]</code></b></i></td></tr>

<tr><td rowspan="2">kh</td><td>true<br/><i><b>Deprecated:</b></i> Use rules with quaternary relations instead.</td><td><code>[hiraganaQ on]</code></td>
    <td rowspan="2">Controls special treatment of Hiragana code points on quaternary level. If turned <b>on</b>, Hiragana code points will get lower values than all the other non-variable code points in <b>shifted</b>. That is, the normal Level 4 value for a regular collation element is FFFF, as described in [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], <i><a href="https://www.unicode.org/reports/tr10/#Variable_Weighting">Variable Weighting</a></i> . This is changed to FFFE for [:script=Hiragana:] characters. The strength must be greater than or equal to quaternary if this attribute is to have any effect.</td></tr>
<tr><td>false</td><td><i><b><code>[hiraganaQ off]</code></b></i></td></tr>

<tr><td rowspan="2">kn</td><td>true</td><td><code>[numericOrdering on]</code></td>
    <td rowspan="2">If set to <b>on</b>, any sequence of Decimal Digits (General_Category = Nd in the [<a href="https://www.unicode.org/reports/tr41/#UAX44">UAX44</a>]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123". The computed primary weights are all at the start of the <b>digit</b> reordering group.
Thus with an untailored UCA table, "a$" < "a0" < "a2" < "a12" < "a⓪" < "aa".</td></tr>
<tr><td>false</td><td><i><b><code>[numericOrdering off]</code></b></i></td></tr>

<tr><td>kr</td><td>a sequence of one or more reorder codes: <b>space, punct, symbol, currency, digit</b>, or any BCP47 script ID</td><td><code>[reorder Grek digit]</code></td>
    <td>Specifies a reordering of scripts or other significant blocks of characters such as symbols, punctuation, and digits. For the precise meaning and usage of the reorder codes, see <i><a href="#Script_Reordering">Collation Reordering</a>.</i></td></tr>

<tr><td rowspan="4">kv</td><td>space</td><td><code>[maxVariable space]</code></td>
    <td rowspan="4">Sets the variable top to the top of the specified reordering group. All code points with primary weights less than or equal to the variable top will be considered variable, and thus affected by the alternate handling.
Variables are ignorable by default in [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but not in CLDR.</td></tr>
<tr><td>punct</td><td><i><b><code>[maxVariable punct]</code></b></i></td></tr>
<tr><td>symbol</td><td><b><code>[maxVariable symbol]</code><br/>(UCA default)</b></td></tr>
<tr><td>currency</td><td><code>[maxVariable currency]</code></td></tr>
<tr><td>vt</td><td>See <i>Part 1 <a href="tr35.md#Unicode_Locale_Extension_Data_Files">U Extension Data Files</a></i>.<br/><i><b>Deprecated:</b></i> Use maxVariable instead.</td><td><code>&\u00XX\uYYYY < [variable top]</code><br/><br/>(the default is set to the highest punctuation, thus including spaces and punctuation, but not symbols)</td>
    <td>The BCP47 value is described in <i>Appendix Q: <a href="tr35.md#Locale_Extension_Key_and_Type_Data">Locale Extension Keys and Types</a>.</i><br/><br/>Sets the string value for the variable top. All the code points with primary weights less than or equal to the variable top will be considered variable, and thus affected by the alternate handling.<br/>An implementation that supports the variableTop setting should also support the maxVariable setting, and it should "pin" ("round up") the variableTop to the top of the containing reordering group.<br/>Variables are ignorable by default in [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but not in CLDR.
See below for more information.</td></tr>

<tr><td><i>n/a</i></td><td><i>n/a</i></td><td><i>n/a</i></td>
    <td>match-boundaries: <i><b>none</b></i> | whole-character | whole-word<br/>Defined by <i><a href="https://www.unicode.org/reports/tr10/#Searching">Searching and Matching</a></i> of [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td></tr>
<tr><td><i>n/a</i></td><td><i>n/a</i></td><td><i>n/a</i></td>
    <td>match-style: <i><b>minimal</b></i> | medial | maximal<br/>Defined by <i><a href="https://www.unicode.org/reports/tr10/#Searching">Searching and Matching</a></i> of [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td></tr>
</tbody></table>

#### <a name="Common_Settings" href="#Common_Settings">Common settings combinations</a>

Some commonly used parametric collation settings are available via combinations of LDML settings attributes:

* “Ignore accents”: **strength=primary**
* “Ignore accents” but take case into account: **strength=primary caseLevel=on**
* “Ignore case”: **strength=secondary**
* “Ignore punctuation” (completely): **strength=tertiary alternate=shifted**
* “Ignore punctuation” but distinguish among punctuation marks: **strength=quaternary alternate=shifted**

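As a rough sketch, the combinations listed above can be turned into Unicode locale extension keywords using the BCP47 keys and values from the Collation Settings table. The function `collation_keywords` and the three lookup tables are hypothetical helpers written for this illustration. Keys are emitted in alphabetical order and `true` is spelled out, which is one reasonable choice rather than a requirement stated here.

```python
# Hypothetical lookup tables, copied from the Collation Settings table.
STRENGTH_TO_KS = {
    "primary": "level1",
    "secondary": "level2",
    "tertiary": "level3",
    "quaternary": "level4",
    "identical": "identic",
}
ALTERNATE_TO_KA = {"non-ignorable": "noignore", "shifted": "shifted"}
ON_OFF = {"on": "true", "off": "false"}


def collation_keywords(strength=None, alternate=None, case_level=None) -> str:
    """Build the keyword part of a "-u-" extension for the given settings."""
    keywords = {}
    if strength is not None:
        keywords["ks"] = STRENGTH_TO_KS[strength]
    if alternate is not None:
        keywords["ka"] = ALTERNATE_TO_KA[alternate]
    if case_level is not None:
        keywords["kc"] = ON_OFF[case_level]
    return "-".join(f"{key}-{value}" for key, value in sorted(keywords.items()))


COMMON = {
    "ignore accents": collation_keywords(strength="primary"),
    "ignore accents, keep case": collation_keywords(strength="primary", case_level="on"),
    "ignore case": collation_keywords(strength="secondary"),
    "ignore punctuation": collation_keywords(strength="tertiary", alternate="shifted"),
    "ignore punctuation, distinguish marks": collation_keywords(
        strength="quaternary", alternate="shifted"
    ),
}

for name, keywords in COMMON.items():
    print(f"{name}: en-u-{keywords}")
```

For example, the second entry yields `en-u-kc-true-ks-level1`.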
#### <a name="Normalization_Setting" href="#Normalization_Setting">Notes on the normalization setting</a>

The UCA always normalizes input strings into NFD form before the rest of the algorithm. However, this results in poor performance.

With **normalization=off**, strings that are in [[FCD](tr35.md#FCD)] and do not contain Tibetan precomposed vowels (U+0F73, U+0F75, U+0F81) should sort correctly. With **normalization=on**, an implementation that does not normalize to NFD must at least perform an incremental FCD check and normalize substrings as necessary. It should also always decompose the Tibetan precomposed vowels. (Otherwise discontiguous contractions across their leading components cannot be handled correctly.)

Another complication for an implementation that does not always use NFD arises when contraction mappings overlap with canonical Decomposition_Mapping strings. For example, the Danish contraction “aa” overlaps with the decompositions of ‘ä’, ‘å’, and other characters. In the root collation (and in the DUCET), Cyrillic ‘ӛ’ maps to a single collation element, which means that its decomposition “ә+◌̈” forms a contraction, and its second character (U+0308) is the same as the first character in the Decomposition_Mapping of U+0344 ‘◌̈́’=“◌̈+◌́”.
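
The canonical decompositions mentioned above can be inspected with Python's standard `unicodedata` module. This is only a way to look at the character data, not a collation implementation; the helper name `nfd_code_points` is made up for this sketch.

```python
import unicodedata


def nfd_code_points(text: str) -> list:
    """Return the NFD form of `text` as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


# Tibetan precomposed vowels, Cyrillic schwa with diaeresis, and U+0344.
for ch in ("\u0F73", "\u0F75", "\u0F81", "\u04DB", "\u0344"):
    print(f"U+{ord(ch):04X} ->", " ".join(nfd_code_points(ch)))

# U+0F73 -> U+0F71 U+0F72
# U+0F75 -> U+0F71 U+0F74
# U+0F81 -> U+0F71 U+0F80
# U+04DB -> U+04D9 U+0308
# U+0344 -> U+0308 U+0301
```

The last two lines show the overlap: U+0308 ends the decomposition of U+04DB and begins the decomposition of U+0344.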

In order to handle strings with these characters (e.g., “aä” and “ӛ́” [which are in FCD]) exactly as with prior NFD normalization, an implementation needs to either add overlap contractions to its data (e.g., “a+ä” and “ә+◌̈́”), or it needs to decompose the relevant composites (e.g., ‘ä’ and ‘◌̈́’) as soon as they are encountered.

#### <a name="Variable_Top_Settings" href="#Variable_Top_Settings">Notes on variable top settings</a>

Users may want to include more or fewer characters as Variable. For example, someone could want to restrict the Variable characters to just include space marks. In that case, maxVariable would be set to "space". (In CLDR 24 and earlier, the now-deprecated variableTop would be set to U+1680, see the “Whitespace” [UCA collation chart](https://www.unicode.org/charts/collation/)). Alternatively, someone could want more of the Common characters in them, and include characters up to (but not including) '0', by setting maxVariable to "currency". (In CLDR 24 and earlier, the now-deprecated variableTop would be set to U+20BA, see the “Currency-Symbol” collation chart).

The effect of these settings is to customize which sets of characters are ignored when comparing strings. For example, the locale identifier "de-u-ka-shifted-kv-currency" is requesting settings appropriate for German, including German sorting conventions, and that currency symbols and characters sorting below them are ignored in sorting.
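
The relationship between maxVariable and the set of variable characters can be sketched as follows. The sketch assumes that the four reordering groups sort in the order in which the Collation Settings table lists them (space, punct, symbol, currency). `GROUPS` and `variable_groups` are hypothetical names.

```python
# Reordering groups that maxVariable can name, in assumed sort order.
GROUPS = ["space", "punct", "symbol", "currency"]


def variable_groups(max_variable: str = "punct") -> list:
    """Return the groups whose characters are variable for a maxVariable value.

    Everything up to and including the named group is variable. The default
    "punct" mirrors the CLDR default; the UCA default would be "symbol".
    """
    return GROUPS[: GROUPS.index(max_variable) + 1]


print(variable_groups("space"))     # ['space']
print(variable_groups())            # ['space', 'punct']
print(variable_groups("currency"))  # ['space', 'punct', 'symbol', 'currency']
```

With `alternate=shifted`, the characters in the returned groups are the ones affected by the alternate handling.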

### <a name="Rules" href="#Rules">Collation Rule Syntax</a>

```xml
<!ELEMENT cr #PCDATA >
```

The goal for the collation rule syntax is to have clearly expressed rules with a concise format. The CLDR rule syntax is a subset of the [[ICUCollation](tr35.md#ICUCollation)] syntax.

For the CLDR root collation, the FractionalUCA.txt file defines all mappings for all of Unicode directly, and it also provides information about script boundaries, reordering groups, and other details. For tailorings, this is neither necessary nor practical. In particular, while the root collation sort order rarely changes for existing characters, their numeric collation weights change with every version. If tailorings also specified numeric weights directly, then they would have to change with every version, parallel with the root collation. Instead, for tailorings, mappings are added and modified relative to the root collation. (There is no syntax to _remove_ mappings, except via [special \[suppressContractions \[...\]\]](#Special_Purpose_Commands) .)

The ASCII [:P:] and [:S:] characters are reserved for collation syntax: `[\u0021-\u002F \u003A-\u0040 \u005B-\u0060 \u007B-\u007E]`

Unicode Pattern_White_Space characters between tokens are ignored. Unquoted white space terminates reset and relation strings.
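
The reserved set is small enough to enumerate. The sketch below builds it from the four ranges above and adds the Pattern_White_Space code points. `needs_quoting` is a hypothetical helper that reports whether a literal string contains a character that would otherwise be read as syntax or as a token separator.

```python
import string

# The four ASCII ranges reserved for collation syntax (inclusive).
RESERVED_RANGES = [(0x21, 0x2F), (0x3A, 0x40), (0x5B, 0x60), (0x7B, 0x7E)]
RESERVED = {chr(cp) for lo, hi in RESERVED_RANGES for cp in range(lo, hi + 1)}

# Unicode Pattern_White_Space: U+0009..U+000D, U+0020, U+0085,
# U+200E, U+200F, U+2028, U+2029.
PATTERN_WHITE_SPACE = {chr(cp) for cp in range(0x09, 0x0E)} | {
    "\u0020", "\u0085", "\u200E", "\u200F", "\u2028", "\u2029",
}


def needs_quoting(text: str) -> bool:
    """True if `text` contains a syntax character or pattern white space."""
    return any(ch in RESERVED or ch in PATTERN_WHITE_SPACE for ch in text)


# The reserved set happens to coincide with Python's string.punctuation.
print(RESERVED == set(string.punctuation))  # True
print(needs_quoting("a b"), needs_quoting("æ"))  # True False
```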

A pair of ASCII apostrophes encloses quoted literal text. They are normally used to enclose a syntax character or white space, or a whole reset/relation string containing one or more such characters, so that those are parsed as part of the reset/relation strings rather than treated as syntax. A pair of immediately adjacent apostrophes is used to encode one apostrophe.

Code points can be escaped with `\uhhhh` and `\U00hhhhhh` escapes, as well as common escapes like `\t` and `\n` . (For details see the documentation of ICU `UnicodeString::unescape()`.) This is particularly useful for default-ignorable code points, combining marks, visually indistinct variants, hard-to-type characters, etc. These sequences are unescaped before the rules are parsed; this means that even escaped syntax and white space characters need to be enclosed in apostrophes. For example: `&'\u0020'='\u3000'`. Note: The unescaping is done by ICU tools (genrb) and demos before passing rule strings into the ICU library code. The ICU collation API does not unescape rule strings.

The ASCII double quote must be both escaped (so that the collation syntax can be enclosed in pairs of double quotes in programming environments such as ICU resource bundle .txt files) and quoted. For example: `&'\u0022'<<<x`

Comments are allowed at the beginning, and after any complete reset, relation, setting, or command. A comment begins with a `#` and extends to the end of the line (according to the Unicode Newline Guidelines).

The collation syntax is case-sensitive.
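
To see why escaped syntax and white space characters still need apostrophes, it helps to unescape a rule string by hand. The function below is a simplified stand-in written for this illustration. It handles only `\uhhhh` and `\U00hhhhhh`, not the other escapes that ICU `UnicodeString::unescape()` supports, and it raises `ValueError` for values above U+10FFFF.

```python
import re

# Matches \U followed by 8 hex digits, or \u followed by 4 hex digits.
_ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})")


def unescape(rule: str) -> str:
    """Replace \\uhhhh and \\Uhhhhhhhh escapes with the code points they name."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), rule)


escaped = r"&'\u0020'='\u3000'"
print(repr(unescape(escaped)))  # "&' '='\u3000'"
```

After unescaping, the rule parser sees a real U+0020 SPACE. Without the surrounding apostrophes it would be ignored as white space between tokens instead of being used as the reset string.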

### <a name="Orderings" href="#Orderings">Orderings</a>

The root collation mappings form the initial state. Mappings are added and removed via a sequence of rule chains. Each tailoring rule builds on the current state after all of the preceding rules (and is not affected by any following rules). Rule chains may alternate with comments, settings, and special commands.

A rule chain consists of a reset followed by one or more relations. The reset position is a string which maps to one or more collation elements according to the current state. A relation consists of an operator and a string; it maps the string to the current collation elements, modified according to the operator.

###### Table: <a name="Specifying_Collation_Ordering" href="#Specifying_Collation_Ordering">Specifying Collation Ordering</a>

| Relation Operator | Example | Description |
| ----------------- | ------- | ----------- |
| `&` | `& Z` | Map Z to collation elements according to the current state. These will be modified according to the following relation operators and then assigned to the corresponding relation strings. |
| `<` | `& a`<br/>`< b` | Make 'b' sort after 'a', as a _primary_ (base-character) difference |
| `<<` | `& a`<br/>`<< ä` | Make 'ä' sort after 'a' as a _secondary_ (accent) difference |
| `<<<` | `& a`<br/>`<<< A` | Make 'A' sort after 'a' as a _tertiary_ (case/variant) difference |
| `<<<<` | `& か`<br/>`<<<< カ` | Make 'カ' (Katakana Ka) sort after 'か' (Hiragana Ka) as a _quaternary_ difference |
| `=` | `& v`<br/>`= w` | Make 'w' sort _identically_ to 'v' |

The following shows the result of serially applying three rules.

|     | Rules       | Result                       | Comment |
| --- | ----------- | ---------------------------- | ------- |
| 1   | & a < g     | ... a **<₁ g** ...           | Put g after a. |
| 2   | & a < h < k | ... a **<₁ h <₁ k** <₁ g ... | Now put h and k after a (inserting before the g). |
| 3   | & h << g    | ... a <₁ h **<₂ g** <₁ k ... | Now put g after h, as a secondary difference (inserting before k). |

Notice that relation strings can occur multiple times, and thus override previous rules.

Each relation uses and modifies the collation elements of the immediately preceding reset position or relation.
A rule chain with two or more relations is equivalent to a sequence of “atomic rules” where each rule chain has exactly one relation, and each relation is followed by a reset to this same relation string.

_Example:_

| Rules                                          | Equivalent Atomic Rules |
| ---------------------------------------------- | ----------------------- |
| & b < q <<< Q<br/>& a < x <<< X << q <<< Q < z | & b < q<br/>& q <<< Q<br/>& a < x<br/>& x <<< X<br/>& X << q<br/>& q <<< Q<br/>& Q < z |

This is not always possible because prefix and extension strings can occur in a relation but not in a reset (see below).

The relation operator `=` maps its relation string to the current collation elements. Any other relation operator modifies the current collation elements as follows.

* Find the _last_ collation element whose strength is at least as great as the strength of the operator. For example, for `<<` find the last primary or secondary CE. This CE will be modified; all following CEs should be removed. If there is no such CE, then reset the collation elements to a single completely-ignorable CE.
* Increment the collation element weight corresponding to the strength of the operator. For example, for `<<` increment the secondary weight.
* The new weight must be less than the next weight for the same combination of higher-level weights of any collation element according to the current state.
* Weights must be allocated in accordance with the [UCA well-formedness conditions](https://www.unicode.org/reports/tr10/#Well-Formed).
* When incrementing any weight, lower-level weights should be reset to the “common” values, to help with sort key compression.

In all cases, even for `=` , the case bits are recomputed according to _[Case Parameters](#Case_Parameters)_. (This can be skipped if an implementation does not support the caseLevel or caseFirst settings.)

For example, `&ae<x` maps ‘x’ to two collation elements. The first one is the same as for ‘a’, and the second one has a primary weight between those for ‘e’ and ‘f’. As a result, ‘x’ sorts between “ae” and “af”. (If the primary of the first collation element was incremented instead, then ‘x’ would sort after “az”. Although ‘x’ would still sort primary-after “ae”, this would be surprising and sub-optimal.)

Some additional operators are provided to save space with large tailorings. The addition of a `*` to the relation operator indicates that each of the following single characters is to be handled as if it were a separate relation with the corresponding strength. Each of the following single characters must be NFD-inert, that is, it does not have a canonical decomposition and it does not reorder (ccc=0). This keeps abbreviated rules unambiguous.

A starred relation operator is followed by a sequence of characters with the same quoting/escaping rules as normal relation strings. Such a sequence can also be followed by one or more pairs of ‘-’ and another sequence of characters.
The single characters adjacent to the ‘-’ establish a code point order range. The same character cannot be both the end of a range and the start of another range. (For example, `<*a-d-g` is not allowed.)

###### Table: <a name="Abbreviating_Ordering_Specifications" href="#Abbreviating_Ordering_Specifications">Abbreviating Ordering Specifications</a>

| Relation Operator | Example                 | Equivalent |
| ----------------- | ----------------------- | ---------- |
| `<*`              | `& a`<br/>`<* bcd-gp-s` | `& a`<br/>`< b < c < d < e < f < g < p < q < r < s` |
| `<<*`             | `& a`<br/>`<<* æᶏɐ`     | `& a`<br/>`<< æ << ᶏ << ɐ` |
| `<<<*`            | `& p`<br/>`<<<* PpP`    | `& p`<br/>`<<< P <<< p <<< P` |
| `<<<<*`           | `& k`<br/>`<<<<* qQ`    | `& k`<br/>`<<<< q <<<< Q` |
| `=*`              | `& v`<br/>`=* VwW`      | `& v`<br/>`= V = w = W` |

### <a name="Contractions" href="#Contractions">Contractions</a>

A multi-character relation string defines a contraction.
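
By contrast, a starred operator never defines a contraction, because every character after it becomes its own single-character relation. The following is a minimal Python sketch of that expansion, not a reference implementation. The function names `expand_starred` and `to_rules` are hypothetical, quoting and escaping are ignored, the NFD-inert requirement is not checked, and rejecting a range whose end does not come after its start in code point order is an assumption of this sketch.

```python
def expand_starred(chars):
    """Expand the character sequence of a starred relation into single characters.

    A '-' between two characters denotes a code point order range.
    Simplification: no quoting/escaping and no NFD-inert check.
    """
    out = []
    prev_was_range_end = False
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch == "-":
            # A range needs a start character and an end character, and the
            # start must not itself be the end of a preceding range.
            if not out or i + 1 >= len(chars) or prev_was_range_end:
                raise ValueError("malformed range in starred relation")
            start, end = ord(out[-1]), ord(chars[i + 1])
            if end <= start:  # assumption of this sketch
                raise ValueError("range end must follow range start")
            out.extend(chr(cp) for cp in range(start + 1, end + 1))
            prev_was_range_end = True
            i += 2
        else:
            out.append(ch)
            prev_was_range_end = False
            i += 1
    return out


def to_rules(operator, chars):
    """Rewrite a starred relation as the equivalent chain of plain relations."""
    base = operator.rstrip("*")
    return " ".join(f"{base} {c}" for c in expand_starred(chars))
```

With these definitions, `to_rules("<*", "bcd-gp-s")` produces the same chain as the first row of the abbreviation table, and `expand_starred("a-d-g")` raises an error because ‘d’ would be both the end of one range and the start of another.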

###### Table: <a name="Specifying_Contractions" href="#Specifying_Contractions">Specifying Contractions</a>

| Example          | Description |
| ---------------- | ----------- |
| `& k`<br/>`< ch` | Make the sequence 'ch' sort after 'k', as a primary (base-character) difference |

### <a name="Expansions" href="#Expansions">Expansions</a>

A mapping to multiple collation elements defines an expansion. This is normally the result of a reset position (and/or preceding relation) that yields multiple collation elements, for example `&ae<x` or `&æ<y` .

A relation string can also be followed by `/` and an _extension string_. The extension string is mapped to collation elements according to the current state, and the relation string is mapped to the concatenation of the regular CEs and the extension CEs. The extension CEs are not modified, not even their case bits. The extension CEs are _not_ retained for following relations.

For example, `&a<z/e` maps ‘z’ to an expansion similar to `&ae<x` . However, the first CE of ‘z’ is primary-after that of ‘a’, and the second CE is exactly that of ‘e’, which yields the order ae < x < af < ag < ... < az < z < b.

The choice of reset-to-expansion vs. use of an extension string can be exploited to affect contextual mappings.
For example, `&L·=x` yields a second CE for ‘x’ equal to the context-sensitive middle-dot-after-L (which is a secondary CE in the root collation). On the other hand, `&L=x/·` yields a second CE of the middle dot by itself (which is a primary CE).

The two ways of specifying expansions also differ in how case bits are computed. When some of the CEs are copied verbatim from an extension string, then the relation string’s case bits are distributed over a smaller number of normal CEs. For example, `&aE=Ch` yields an uppercase CE and a lowercase CE, but `&a=Ch/E` yields a mixed-case CE (for ‘C’ and ‘h’ together) followed by an uppercase CE (copied from ‘E’).

In summary, there are two ways of specifying expansions which produce subtly different mappings. The use of extension strings is unusual but sometimes necessary.

### <a name="Context_Before" href="#Context_Before">Context Before</a>

A relation string can have a prefix (context before) which makes the mapping from the relation string to its tailored position conditional on the string occurring after that prefix. For details see the specification of _[Context-Sensitive Mappings](#Context_Sensitive_Mappings)_.

For example, suppose that "-" is sorted like the previous vowel. Then one could have rules that take "a-", "e-", and so on. However, that means that every time a very common character (a, e, ...) is encountered, a system will slow down as it looks for possible contractions.
An alternative is to indicate that when "-" is encountered, and it comes after an 'a', it sorts like an 'a', and so on.

###### Table: <a name="Specifying_Previous_Context" href="#Specifying_Previous_Context">Specifying Previous Context</a>

| Rules |
| ----- |
| `& a <<< a \| '-'`<br/>`& e <<< e \| '-'`<br/>`...` |

Both the prefix and extension strings can occur in a relation. For example, the following are allowed:

* `< abc | def / ghi`
* `< def / ghi`
* `< abc | def`

### <a name="Placing_Characters_Before_Others" href="#Placing_Characters_Before_Others">Placing Characters Before Others</a>

There are certain circumstances where characters need to be placed before a given character, rather than after. This is the case with Pinyin, for example, where certain accented letters are positioned before the base letter. That is accomplished with the following syntax.

`&[before 2] a << à`

The before-strength can be 1 (primary), 2 (secondary), or 3 (tertiary).
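
The before-strength also has to agree with the relation that immediately follows the reset, a constraint spelled out just below. As a rough illustration only, the following Python sketch checks that agreement for very simple rules. The function name `reset_before_matches` and the regular expression are hypothetical; quoting, starred operators, and logical reset positions are deliberately not handled.

```python
import re

# Strength implied by each plain relation operator usable after [before n].
_STRENGTH_OF_OPERATOR = {"<": 1, "<<": 2, "<<<": 3}

# Matches "&[before n] <reset string> <operator>" in a simplified rule.
_RESET_BEFORE = re.compile(r"&\s*\[before\s+([123])\]\s*[^<=]+?\s*(<{1,4}|=)")


def reset_before_matches(rule):
    """Return True if the first relation has the same strength as [before n]."""
    m = _RESET_BEFORE.match(rule.strip())
    if m is None:
        raise ValueError("not a reset-before rule this sketch understands")
    before_strength = int(m.group(1))
    operator = m.group(2)
    return _STRENGTH_OF_OPERATOR.get(operator) == before_strength
```

For instance, `reset_before_matches("&[before 2] a << à")` is true, while the same reset followed by `<` or `<<<` is reported as a mismatch.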

It is an error if the strength of the reset-before differs from the strength of the immediately following relation. Thus the following are errors.

* `&[before 2] a < à # error`
* `&[before 2] a <<< à # error`

### <a name="Logical_Reset_Positions" href="#Logical_Reset_Positions">Logical Reset Positions</a>

The CLDR table (based on UCA) has the following overall structure for weights, going from low to high.

###### Table: <a name="Specifying_Logical_Positions" href="#Specifying_Logical_Positions">Specifying Logical Positions</a>

| Name                                                           | Description      | UCA Examples |
| -------------------------------------------------------------- | ---------------- | ------------ |
| first tertiary ignorable<br/>...<br/>last tertiary ignorable   | p, s, t = ignore | Control Codes<br/>Format Characters<br/>Hebrew Points<br/>Tibetan Signs<br/>... |
| first secondary ignorable<br/>...<br/>last secondary ignorable | p, s = ignore    | None in UCA |
| first primary ignorable<br/>...<br/>last primary ignorable     | p = ignore       | Most combining marks |
| first variable<br/>...<br/>last variable                       | _**if** alternate = non-ignorable_<br/>p != ignore,<br/>_**if** alternate = shifted_<br/>p, s, t = ignore | Whitespace,<br/>Punctuation |
| first regular<br/>...<br/>last regular                         | p != ignore      | General Symbols<br/>Currency Symbols<br/>Numbers<br/>Latin<br/>Greek<br/>... |
| first implicit<br/>...<br/>last implicit                       | p != ignore, assigned automatically | CJK, CJK compatibility (those that are not decomposed)<br/>CJK Extension A, B, C, ...<br/>Unassigned |
| first trailing<br/>...<br/>last trailing                       | p != ignore,<br/>used for trailing syllable components | Jamo Trailing<br/>Jamo Leading<br/>U+FFFD<br/>U+FFFF |

Each of the above Names can be used with a reset to position characters relative to that logical position. That allows characters to be ordered before or after a _logical_ position rather than a specific character.

> **Note**: The reason for this is so that tailorings can be more stable. A future version of the UCA might add characters at any point in the above list. Suppose that you set character X to be after Y. It could be that you want X to come after Y, no matter what future characters are added; or it could be that you just want X to come after a given logical position, for example, after the last primary ignorable.

Each of these special reset positions always maps to a single collation element.

Here is an example of the syntax:

`& [first tertiary ignorable] << à`

For example, to make a character be a secondary ignorable, one can make it be immediately after (at a secondary level) a specific character (like a combining diaeresis), or one can make it be immediately after the last secondary ignorable.

Each special reset position adjusts to the effects of preceding rules, just like normal reset position strings. For example, if a tailoring rule creates a new collation element after `&[last variable]` (via explicit tailoring after that, or via tailoring after the relevant character), then this new CE becomes the new _last variable_ CE, and is used in following resets to `[last variable]` .

`[first variable]` and `[first regular]` and `[first trailing]` should be the first real such CEs (e.g., CE(U+0060 \`)), as adjusted according to the tailoring, not the boundary CEs (see the FractionalUCA.txt “first primary” mappings starting with U+FDD1).

`[last regular]` is not actually the last normal CE with a primary weight before implicit primaries. It is used to tailor large numbers of characters, usually CJK, into the script=Hani range between the last regular script and the first implicit CE. (The first group of implicit CEs is for Han characters.)
Therefore, `[last regular]` is set to the first Hani CE, the artificial script boundary CE at the beginning of this range. For example: `&[last regular]<*亜唖娃阿...`

The `[last trailing]` is the CE of U+FFFF. Tailoring to that is not allowed.

The `[last variable]` indicates the "highest" character that is treated as punctuation with alternate handling.

The value can be changed by using the maxVariable setting. This takes effect, however, after the rules have been built, and does not affect any characters that are reset relative to the `[last variable]` value when the rules are being built. The maxVariable setting might also be changed via a runtime parameter. That also does not affect the rules.
(In CLDR 24 and earlier, the variable top could also be set by using a tailoring rule with `[variable top]` in the place of a relation string.)

### <a name="Special_Purpose_Commands" href="#Special_Purpose_Commands">Special-Purpose Commands</a>

The import command imports rules from another collation. This allows for better maintenance and smaller rule sizes. The source is a BCP 47 language tag with an optional collation type but without other extensions. The collation type is the BCP 47 form of the collation type in the source; it defaults to "standard".

_Examples:_

* `[import de-u-co-phonebk]` (not "...-co-phonebook")
* `[import und-u-co-search]` (not "root-...")
* `[import ja-u-co-private-kana]` (language "ja" required even when this import itself is in another "ja" tailoring.)

###### Table: <a name="Special_Purpose_Elements" href="#Special_Purpose_Elements">Special-Purpose Elements</a>

| Rule Syntax |
| ----------- |
| `[suppressContractions [Љ-ґ]]` |
| `[optimize [Ά-ώ]]` |

The _suppress contractions_ tailoring command turns off any existing contractions that begin with those characters, as well as any prefixes for those characters. It is typically used to turn off the Cyrillic contractions in the UCA, since they are not used in many languages and have a considerable performance penalty. The argument is a [Unicode Set](tr35.md#Unicode_Sets).

The _suppress contractions_ command has immediate effect on the current set of mappings, including mappings added by preceding rules. Following rules are processed after removing any context-sensitive mappings originating from any of the characters in the set.

The _optimize_ tailoring command is purely for performance.
It indicates that those characters are sufficiently common in the target language for the tailoring that their performance should be enhanced.

The reason that these are not settings is so that their contents can be arbitrary characters.

* * *

_Example:_

The following is a simple example that combines portions of different tailorings for illustration. For more complete examples, see the actual locale data: [Japanese](https://github.com/unicode-org/cldr/blob/main/common/collation/ja.xml), [Chinese](https://github.com/unicode-org/cldr/blob/main/common/collation/zh.xml), [Swedish](https://github.com/unicode-org/cldr/blob/main/common/collation/sv.xml), and [German](https://github.com/unicode-org/cldr/blob/main/common/collation/de.xml) (type="phonebook") are particularly illustrative.

```xml
<collation>
  <cr><![CDATA[
    [caseLevel on]
    &Z
    < æ <<< Æ
    < å <<< Å <<< aa <<< aA <<< Aa <<< AA
    < ä <<< Ä
    < ö <<< Ö << ű <<< Ű
    < ő <<< Ő << ø <<< Ø
    &V <<<* wW
    &Y <<<* üÜ
    &[last non-ignorable]
    # The following is equivalent to <亜<唖<娃...
    <* 亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦
    <* 鯵梓圧斡扱
  ]]></cr>
</collation>
```

### <a name="Script_Reordering" href="#Script_Reordering">Collation Reordering</a>

Collation reordering allows scripts and certain other defined blocks of characters to be moved relative to each other parametrically, without changing the detailed rules for all the characters involved. This reordering is done on top of any specific ordering rules within the script or block currently in effect. Reordering can specify groups to be placed at the start and/or the end of the collation order.
For example, to reorder Greek characters before Latin characters, and digits afterwards (but before other scripts), the following can be used:

| Rule Syntax                 | Locale Identifier |
| --------------------------- | ----------------- |
| `[reorder Grek Latn digit]` | `en-u-kr-grek-latn-digit` |

In each case, a sequence of _**reorder_codes**_ is used, separated by spaces in the settings attribute and in rule syntax, and by hyphens in locale identifiers.

A **_reorder_code_** is any of the following special codes:

1. **space, punct, symbol, currency, digit** - core groups of characters below 'a'
2. **any script code** except **Common** and **Inherited**.
    * Some pairs of scripts sort primary-equal and always reorder together. For example, Katakana characters are always reordered with Hiragana.
3. **others** - where all codes not explicitly mentioned should be ordered. The script code **Zzzz** (Unknown Script) is a synonym for **others**.

It is an error if a code occurs multiple times.

It is an error if the sequence of reorder codes is empty in the XML attribute or in the locale identifier.
Some implementations may interpret an empty sequence in the `[reorder]` rule syntax as a reset to the DUCET ordering, synonymous with `[reorder others]` ; other implementations may forbid an empty sequence in the rule syntax as well.

Interaction with **alternate=shifted**: Whether a primary weight is “variable” is determined according to the “variable top”, before applying script reordering. Once that is determined, script reordering is applied to the primary weight regardless of whether it is “regular” (used in the primary level) or “shifted” (used in the quaternary level).

#### <a name="Interpretation_reordering" href="#Interpretation_reordering">Interpretation of a reordering list</a>

The reordering list is interpreted as if it were processed in the following way.

1. If any core code is not present, then it is inserted at the front of the list in the order given above.
2. If the **others** code is not present, then it is inserted at the end of the list.
3. The **others** code is replaced by the list of all script codes not explicitly mentioned, in DUCET order.
4. The reordering list is now complete, and used to reorder characters in collation accordingly.

The locale data may have a particular ordering. For example, the Czech locale data could put digits after all letters, with `[reorder others digit]` .
Any reordering codes specified on top of that (such as with a BCP 47 locale identifier) completely replace what was there. To specify a version of collation that completely resets any existing reordering to the DUCET ordering, the single code **Zzzz** or **others** can be used, as below.

_Examples:_

| Locale Identifier                 | Effect |
| --------------------------------- | ------ |
| `en-u-kr-latn-digit`              | Reorder digits after Latin characters (but before other scripts like Cyrillic). |
| `en-u-kr-others-digit`            | Reorder digits after all other characters. |
| `en-u-kr-arab-cyrl-others-symbol` | Reorder Arabic characters first, then Cyrillic, and put symbols at the end, after all other characters. |
| `en-u-kr-others`                  | Remove any locale-specific reordering, and use DUCET order for reordering blocks. |

The default reordering groups are defined by the FractionalUCA.txt file, based on the primary weights of associated collation elements. The file contains special mappings for the start of each group, script, and reorder-reserved range, see _[FractionalUCA.txt](#File_Format_FractionalUCA_txt)_.

There are some special cases:

* The **Hani** group includes implicit weights for _Han characters_ according to the UCA as well as any characters tailored relative to a Han character, or after `&[first Hani]`.
* Implicit weights for _unassigned code points_ according to the UCA reorder as the last weights in the **others** (**Zzzz**) group.
  There is no script code to explicitly reorder the unassigned-implicit weights into a particular position. (Unassigned-implicit weights are used for non-Hani code points without any mappings. For a given Unicode version they are the code points with General_Category values Cn, Co, Cs.)
* The TRAILING group, the FIELD-SEPARATOR (associated with U+FFFE), and collation elements with only zero primary weights are not reordered.
* The TERMINATOR, LEVEL-SEPARATOR, and SPECIAL groups are never associated with characters.

For example, `reorder="Hani Zzzz Grek"` sorts Hani, Latin, Cyrillic, ... (all other scripts) ..., unassigned, Greek, TRAILING.

Notes for implementations that write sort keys:

* Primaries must always be offset by one or more whole primary lead bytes. (Otherwise the number of bytes in a fractional weight may change, compressible scripts may span multiple lead bytes, or trailing primary bytes may collide with separators and primary-compression terminators.)
* When a script is reordered that does not start and end on whole-primary-lead-byte boundaries, then the lead byte needs to be “split”, and a reserved byte is used up. The data supports this via reorder-reserved ranges of primary weights that are not used for collation elements.
* Primary weights from different original lead bytes can be reordered to a shared lead byte, as long as they do not overlap. Primary compression ends when the target lead byte differs or when the original lead byte of the next primary is not compressible.
* Non-compressible groups and scripts begin or end on whole-primary-lead-byte boundaries (or both), so that reordering cannot surround a non-compressible script by two compressible ones within the same target lead byte. This is so that primary compression can be terminated reliably (choosing the low or high terminator byte) simply by comparing the previous and current primary weights. Otherwise it would have to also check for another condition (e.g., equal scripts).

#### <a name="Reordering_Groups_allkeys" href="#Reordering_Groups_allkeys">Reordering Groups for allkeys.txt</a>

For allkeys_CLDR.txt, the start of each reordering group can be determined from FractionalUCA.txt, by finding the first real mapping (after “xyz first primary”) of that group (e.g., `0060; [0D 07, 05, 05] # Zyyy Sk [0312.0020.0002] * GRAVE ACCENT` ), and looking for that mapping's character sequence ( `0060` ) in allkeys_CLDR.txt. The comment in FractionalUCA.txt ( `[0312.0020.0002]` ) also shows the allkeys_CLDR.txt collation elements.

The DUCET ordering of some characters is slightly different from the CLDR root collation order. The reordering groups for the DUCET are not specified. The following describes how reordering groups for the DUCET can be derived.

For allkeys_DUCET.txt, the start of each reordering group is normally the primary weight corresponding to the same character sequence as for allkeys_CLDR.txt. In a few cases this requires adjustment, especially for the special reordering groups, due to CLDR’s ordering the common characters more strictly by category than the DUCET (as described in _[Root Collation](#Root_Collation)_). The necessary adjustment would set the start of each allkeys_DUCET.txt reordering group to the primary weight of the first mapping for the relevant General_Category for a special reordering group (for characters that sort before ‘a’), or the primary weight of the first mapping for the first script (e.g., sc=Grek) of an “alphabetic” group (for characters that sort at or after ‘a’).

Note that the following only applies to primary weights greater than the one for U+FFFE and less than "trailing" weights.

The special reordering groups correspond to General_Category values as follows:

* punct: P
* symbol: Sk, Sm, So
* space: Z, Cc
* currency: Sc
* digit: Nd

In the DUCET, some characters that sort below ‘a’ and have other General_Category values not mentioned above (e.g., gc=Lm) are also grouped with symbols. Variants of numbers (gc=No or Nl) can be found among punctuation, symbols, and digits.
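
The General_Category correspondence listed above can be sketched as a small lookup. This is an illustrative Python sketch only: the function name `special_reorder_group` is hypothetical, it relies on the General_Category data of whatever Unicode version the running Python bundles, and it does not model the DUCET exceptions just mentioned (such as gc=Lm, No, or Nl characters) or the restriction to primary weights between U+FFFE and the trailing weights.

```python
import unicodedata

# (General_Category prefix, special reordering group), per the list above.
# "P" and "Z" are major categories and match all of their subcategories.
_SPECIAL_GROUP_BY_GC = [
    ("P", "punct"),
    ("Sk", "symbol"),
    ("Sm", "symbol"),
    ("So", "symbol"),
    ("Z", "space"),
    ("Cc", "space"),
    ("Sc", "currency"),
    ("Nd", "digit"),
]


def special_reorder_group(ch):
    """Return the special reordering group suggested by ch's General_Category.

    Returns None when the category is not in the list, e.g. for letters.
    """
    gc = unicodedata.category(ch)
    for prefix, group in _SPECIAL_GROUP_BY_GC:
        if gc.startswith(prefix):
            return group
    return None
```

For example, this maps U+0060 GRAVE ACCENT (gc=Sk) to `symbol` and U+0024 DOLLAR SIGN (gc=Sc) to `currency`, while a letter such as ‘a’ yields `None` because it belongs to a script group instead.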

Each collation element of an expansion may be in a different reordering group, for example for parenthesized characters.

### <a name="Case_Parameters" href="#Case_Parameters">Case Parameters</a>

The **case level** is an _optional_ intermediate level ("2.5") between Level 2 and Level 3 (or after Level 1, if there is no Level 2 due to strength settings). The case level is used to support two parametric features: ignoring non-case variants (Level 3 differences) except for case, and giving case differences a higher-level priority than other tertiary differences. Distinctions between small and large Kana characters are also included as case differences, to support Japanese collation.

The **case first** parameter controls whether to swap the order of upper and lowercase. It can be used with or without the case level.

Importantly, the case parameters have no effect in many instances. For example, they have no effect on the comparison of two non-ignorable characters with different primary weights, or with different secondary weights if the strength = **secondary** (or higher).

When either the **case level** or **case first** parameters are set, the following describes the derivation of the modified collation elements. It assumes the original levels for the code point are [p.s.t] (primary, secondary, tertiary). This derivation may change in future versions of LDML, to track the case characteristics more closely.

#### <a name="Case_Untailored" href="#Case_Untailored">Untailored Characters</a>

For untailored characters and strings, that is, for mappings in the root collation, the case value for each collation element is computed from the tertiary weight listed in allkeys_CLDR.txt. This is used to modify the collation element.

Look up a case value for the tertiary weight x of each collation element:

1. UPPER if x ∈ {08-0C, 0E, 11, 12, 1D}
2. UNCASED otherwise
3. FractionalUCA.txt encodes the case information in bits 6 and 7 of the first byte in each tertiary weight. The case bits are set to 00 for UNCASED and LOWERCASE, and 10 for UPPER. There is no MIXED case value (01) in the root collation.

#### <a name="Case_Weights" href="#Case_Weights">Compute Modified Collation Elements</a>

From a computed case value, set a weight **c** according to the following.

1. If **CaseFirst=UpperFirst**, set **c** = UPPER ? **1** : MIXED ? 2 : **3**
2. Otherwise set **c** = UPPER ? **3** : MIXED ? 2 : **1**

Compute a new collation element according to the following table.
The notation _xt_ means that the values are numerically combined into a single level, such that xt < yu whenever x < y. The fourth level (if it exists) is unaffected. Note that a secondary CE must have a secondary weight S which is greater than the secondary weight s of any primary CE; and a tertiary CE must have a tertiary weight T which is greater than the tertiary weight t of any primary or secondary CE ([[UCA](https://www.unicode.org/reports/tr41/#UTS10)] [WF2](https://www.unicode.org/reports/tr10/#WF2)).

<table><tbody>
<tr><th>Case Level</th><th>Strength</th><th>Original CE</th><th>Modified CE</th><th>Comment</th></tr>

<tr><td rowspan="5"><strong>on</strong></td><td rowspan="2"><strong>primary</strong></td><td><code>0.S.t</code></td><td><code>0.0</code></td><td rowspan="2">ignore case level weights of primary-ignorable CEs</td></tr>
<tr><td><code>p.s.t</code></td><td><code>p.c</code></td></tr>

<tr><td rowspan="3"><strong>secondary</strong><br> or higher</td><td><code>0.0.T</code></td> <td><code>0.0.0.T</code></td><td rowspan="3">ignore case level weights of secondary-ignorable CEs</td></tr>
    <tr><td><code>0.S.t</code></td><td><code>0.S.c.t</code></td></tr>
    <tr><td><code>p.s.t</code></td><td><code>p.s.c.t</code></td></tr>

<tr><td rowspan="4"><strong>off</strong></td><td rowspan="4">any</td><td><code>0.0.0</code></td><td><code>0.0.00</code></td><td rowspan="4">ignore case level weights of tertiary-ignorable CEs</td></tr>
    <tr><td><code>0.0.T</code></td><td><code>0.0.3T</code></td></tr>
    <tr><td><code>0.S.t</code></td><td><code>0.S.ct</code></td></tr>
    <tr><td><code>p.s.t</code></td><td><code>p.s.ct</code></td></tr>
</tbody></table>

For primary+case, which is used for “ignore accents but not case” collation, primary ignorables are ignored so that a = ä. For secondary+case, which would by analogy mean “ignore variants but not case”, secondary ignorables are ignored for equivalent behavior.

When using **caseFirst** but not **caseLevel**, the combined case+tertiary weight of a tertiary CE must be greater than the combined case+tertiary weight of any primary or secondary CE so that [[UCA](https://www.unicode.org/reports/tr41/#UTS10)] [well-formedness condition 2](https://www.unicode.org/reports/tr10/#WF2) is fulfilled. Since the tertiary CE’s tertiary weight T is already greater than any t of primary or secondary CEs, it is sufficient to set its case weight to UPPER=3. It must not be affected by **caseFirst=upper**. (The table uses the constant 3 in this case rather than the computed c.)

The case weight of a tertiary-ignorable CE must be 0 so that [[UCA](https://www.unicode.org/reports/tr41/#UTS10)] [well-formedness condition 1](https://www.unicode.org/reports/tr10/#WF1) is fulfilled.
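
The lookup, the weight **c**, and the table above can be sketched together as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function and parameter names (`case_value`, `case_weight`, `modified_ce`, `strength_primary`, `upper_first`) are hypothetical, collation elements are modeled as plain tuples of integers, and a combined weight such as _ct_ is modeled as the pair `(c, t)`, which Python compares lexicographically so that xt < yu whenever x < y. Rows that the table does not list (for example a secondary-ignorable CE at primary strength with the case level on) are assumed here to behave like the nearest listed row.

```python
# Tertiary weights from allkeys_CLDR.txt that indicate uppercase: 08-0C, 0E, 11, 12, 1D.
UPPER_TERTIARIES = set(range(0x08, 0x0D)) | {0x0E, 0x11, 0x12, 0x1D}


def case_value(t):
    """Case value for an untailored CE, derived from its tertiary weight t."""
    return "UPPER" if t in UPPER_TERTIARIES else "UNCASED"


def case_weight(case, upper_first=False):
    """Weight c for a case value; MIXED only occurs in tailorings."""
    if upper_first:
        return 1 if case == "UPPER" else 2 if case == "MIXED" else 3
    return 3 if case == "UPPER" else 2 if case == "MIXED" else 1


def modified_ce(p, s, t, *, case_level, strength_primary=False, upper_first=False):
    """Return the modified CE for an original CE p.s.t, following the table."""
    c = case_weight(case_value(t), upper_first)
    if case_level:
        if strength_primary:
            # p.s.t -> p.c ; primary-ignorable CEs -> 0.0 (assumed for all p == 0)
            return (p, c) if p != 0 else (0, 0)
        if p == 0 and s == 0:
            # 0.0.T -> 0.0.0.T : no case weight for secondary-ignorable CEs
            return (0, 0, 0, t)
        return (p, s, c, t)
    # Case level off: case and tertiary weights are combined into one level.
    if p == 0 and s == 0:
        # 0.0.0 -> 0.0.00 ; 0.0.T -> 0.0.3T (constant 3, unaffected by caseFirst)
        return (0, 0, (0, 0)) if t == 0 else (0, 0, (3, t))
    return (p, s, (c, t))
```

For example, `modified_ce(0x1234, 0x20, 0x08, case_level=True)` inserts the case weight between the secondary and tertiary weights, while `modified_ce(0, 0, 0x1F, case_level=False, upper_first=True)` keeps the constant case weight 3 for a tertiary CE.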

#### <a name="Case_Tailored" href="#Case_Tailored">Tailored Strings</a>

Characters and strings that are tailored have case values computed from their root collation case bits.

1. Look up the tailored string’s root CEs. (Ignore any prefix or extension strings.) N=number of primary root CEs.
2. Determine the number and type (primary vs. weaker) of CEs a tailored string maps to. M=number of primary tailored CEs.
3. If N<=M (no more root than tailoring primary CEs): Copy the root case bits for primary CEs 0..N-1.
    * If N<M (fewer root primary CEs): Clear the case bits of the remaining tailored primary CEs. (uncased/lowercase/small Kana)
4. If N>M (more root primary CEs): Copy the root case bits for primary CEs 0..M-2. Set the case bits for tailored primary CE M-1 according to the remaining root primary CEs M-1..N-1:
    * Set to uncased/lower if all remaining root primary CEs have uncased/lower.
    * Set to uppercase if all remaining root primary CEs have uppercase.
    * Otherwise, set to mixed.
5. Clear the case bits for secondary CEs 0.s.t.
6. Tertiary CEs 0.0.t must get uppercase bits.
7. Tertiary-ignorable CEs 0.0.0 must get ignorable-case=lowercase bits.
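
Steps 3 and 4 above, which distribute the root case bits over the tailored primary CEs, can be sketched as follows. This is an illustration only: the function name `tailored_primary_case_bits` is hypothetical, case bits are modeled as the strings `"LOWER"` (standing for uncased/lowercase/small Kana), `"MIXED"`, and `"UPPER"`, and the handling of secondary, tertiary, and tertiary-ignorable CEs (steps 5 to 7) is left out.

```python
def tailored_primary_case_bits(root_bits, m):
    """Case bits for the m tailored primary CEs of a tailored string.

    root_bits lists the case bits of the string's N primary root CEs,
    in order. "LOWER" stands for uncased/lowercase/small Kana.
    """
    n = len(root_bits)
    if m == 0:
        return []
    if n <= m:
        # Step 3: copy the N root values, then clear the remaining M-N.
        return list(root_bits) + ["LOWER"] * (m - n)
    # Step 4: copy root values 0..M-2, then summarize root CEs M-1..N-1
    # into the case bits of the last tailored primary CE.
    head = list(root_bits[: m - 1])
    rest = root_bits[m - 1 :]
    if all(bits == "LOWER" for bits in rest):
        last = "LOWER"
    elif all(bits == "UPPER" for bits in rest):
        last = "UPPER"
    else:
        last = "MIXED"
    return head + [last]
```

For instance, a string whose root mapping has an uppercase primary CE followed by a lowercase one, tailored to a single primary CE, yields `["MIXED"]` under this sketch.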

> **Note**: Almost all Cased characters have primary (non-ignorable) root collation CEs, except for U+0345 Combining Ypogegrammeni which is Lowercase. All Uppercase characters have primary root collation CEs.

### <a name="Visibility" href="#Visibility">Visibility</a>

Collations have external visibility by default, meaning that they can be displayed in a list of collation options for users to choose from. A collation whose type name starts with "private-" is internal and should not be shown in such a list. Collations are typically internal when they are partial sequences included in other collations. See _[Collation Types](#Collation_Types)_.

### <a name="Collation_Indexes" href="#Collation_Indexes">Collation Indexes</a>

#### <a name="Index_Characters" href="#Index_Characters">Index Characters</a>

The main data includes `<exemplarCharacters>` for collation indexes. See _Part 2 General, [Character Elements](tr35-general.md#Character_Elements)_, for general information about exemplar characters.

The index characters are a set of characters for use as a UI "index", that is, a list of clickable characters (or character sequences) that allow the user to see a segment of a larger "target" list. Each character corresponds to a bucket in the target list.
There can be different kinds of index lists: one that is relatively static, and another that produces roughly equally-sized buckets. While CLDR is mostly focused on the first, there is provision for supporting the second as well.

The index characters need to be used in conjunction with a collation for the locale, which will determine the order of the characters. It will also determine which index characters show up.

The static list would be presented as something like the following (either vertically or horizontally):

… A B C D E F G H CH I J K L M N O P Q R S T U V W X Y Z …

In the "A" bucket, you would find all items that are primary greater than or equal to "A" in collation order, and primary less than "B". The use of the list requires that the target list be sorted according to the locale that is used to create that list. Although we say "character" above, the index character could be a sequence, like "CH" above. The index exemplar characters must always be used with a collation appropriate for the locale. Any characters that do not have primary differences from others in the set should be removed.

Details:

1. The primary weight (according to the collation) is used to determine which bucket a string is in. There are special buckets for before the first character, between buckets of different scripts, and after the last bucket (and of a different script).
2. Characters in the _index characters_ do not need to have distinct primary weights. That is, the _index characters_ are adapted to the underlying collation: normally Ё is in the Е bucket for Russian, but if someone used a variant of Russian collation that distinguished them on a primary level, then Ё would show up as its own bucket.
3. If an _index character_ string ends with a single "\*" (U+002A), for example "Sch\*" and "St\*" in German, then there will be a separate bucket for the string minus the "\*", for example "Sch" and "St", even if that string does not sort distinctly.
4. An _index character_ can have multiple primary weights, for example "Æ" and "Sch". Names that have the same initial primary weights sort into this _index character_’s bucket. This can be achieved by using an upper-boundary string that is the concatenation of the _index character_ and U+FFFF, for example "Æ\\uFFFF" and "Sch\\uFFFF". Names that sort greater than this upper boundary but less than the next index character are redirected to the last preceding single-primary index character (A and S for the examples here).

For example, for index characters `[A Æ B R S {Sch*} {St*} T]` the following sample names are sorted into an index as shown.

* A — Adelbert, Afrika
* Æ — Æsculap, Aesthet
* B — Berlin
* R — Rilke
* S — Sacher, Seiler, Sultan
* Sch — Schiller
* St — Steiff
* T — Thomas

The … items are special: each is a bucket for everything else, either less or greater. They are inserted at the start and end of the index list, _and_ on script boundaries. Each script has its own range, except where scripts sort primary-equal (e.g., Hira & Kana). All characters that sort in one of the low reordering groups (whitespace, punctuation, symbols, currency symbols, digits) are treated as a single script for this purpose.

If you tailor a Greek character into the Cyrillic script, that Greek character will be bucketed (and sorted) among the Cyrillic ones.

Even in an implementation that reorders groups of scripts rather than single scripts, for example Hebrew together with Phoenician and Samaritan, the index boundaries are really script boundaries, _not_ multi-script-group boundaries. So if you had a collation that reordered Hebrew after Ethiopic, you would still get index boundaries between the following (and in that order):

1. Ethiopic
2. Hebrew
3. Phoenician _// included in the Hebrew reordering group_
4. Samaritan _// included in the Hebrew reordering group_
5. Devanagari

(Beginning with CLDR 27, single scripts can be reordered.)

In the UI, an index character could also be omitted or grayed out if its bucket is empty. For example, if there is nothing in the bucket for Q, then Q could be omitted. That would be up to the implementation. Additional buckets could be added if other characters are present. For example, we might see something like the following:

| Sample Greek Index | Contents |
| :---------------------------------------------------------: | -------- |
| Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω | With only content beginning with Greek letters |
| … Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω … | With some content before or after |
| … 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω … | With numbers, and nothing between 9 and Alpha |
| … 9 _A-Z_ Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω … | With numbers, some Latin |

Here is a sample of the XML structure:

```xml
<exemplarCharacters type="index">[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]</exemplarCharacters>
```

The display of the index characters can be modified with the Index labels elements, discussed in the _Part 2 General, [Index Labels](tr35-general.md#IndexLabels)_.

#### <a name="CJK_Index_Markers" href="#CJK_Index_Markers">CJK Index Markers</a>

Special index markers have been added to the CJK collations for stroke, pinyin, zhuyin, and unihan. These markers allow for effective and robust use of indexes for these collations.

The per-language index exemplar characters are not useful for collation indexes for CJK because for each such language there are multiple sort orders in use (for example, Chinese pinyin vs. stroke vs. unihan vs. zhuyin), and these sort orders use very different index characters. In addition, sometimes the boundary strings are different from the bucket label strings. For collations that contain index markers, the boundary strings and bucket labels should be derived from those index markers, ignoring the index exemplar characters.

For example, near the start of the pinyin tailoring there is the following:

```html
<p> A</p><!-- INDEX A -->
<pc>阿呵锕</pc><!-- ā -->
…
<pc>翶</pc><!-- ao -->
<p> B</p><!-- INDEX B -->
```

These indicate the boundaries of "buckets" that can be used for indexing. They are always two characters starting with the noncharacter U+FDD0, and thus will not occur in normal text. For pinyin the second character is A-Z; for unihan it is one of the radicals; and for stroke it is a character after U+2800 indicating the number of strokes, such as ⠁. For zhuyin the second character is one of the standard Bopomofo characters in the range U+3105 through U+3129.

The corresponding bucket label strings are the boundary strings with the leading U+FDD0 removed. For example, the Pinyin boundary string "\\uFDD0A" yields the label string "A".

However, for stroke order, the label string is the stroke count (second character minus U+2800) as a decimal-digit number followed by 劃 (U+5283). For example, the stroke order boundary string "\\uFDD0\\u2805" yields the label string "5劃".
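
The label derivation described above can be sketched as a small helper. The function name `bucket_label` is hypothetical, and the sketch assumes that a second character in the range U+2801 through U+28FF identifies a stroke-count marker; the text above only says "a character after U+2800", so the upper bound is an assumption.

```python
INDEX_MARKER_LEAD = "\uFDD0"  # noncharacter that starts every boundary string
STROKE_BASE = 0x2800          # stroke count = second character minus U+2800
STROKE_SUFFIX = "\u5283"      # 劃


def bucket_label(boundary):
    """Derive the bucket label from a two-character CJK index boundary string."""
    if len(boundary) != 2 or boundary[0] != INDEX_MARKER_LEAD:
        raise ValueError("not a CJK index boundary string")
    second = boundary[1]
    code_point = ord(second)
    # Assumed range for stroke markers: U+2801..U+28FF.
    if STROKE_BASE < code_point <= 0x28FF:
        return f"{code_point - STROKE_BASE}{STROKE_SUFFIX}"
    # Pinyin, zhuyin, and unihan: drop the leading U+FDD0.
    return second
```

With this sketch, `bucket_label("\uFDD0A")` gives `"A"` and `bucket_label("\uFDD0\u2805")` gives `"5劃"`, matching the two examples in the text.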

* * *

Copyright © 2001–2024 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode [Terms of Use](https://www.unicode.org/copyright.html) apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.