1## Unicode Technical Standard #35 Tech Preview
2
3# Unicode Locale Data Markup Language (LDML)<br/>Part 7: Keyboards
4
5|Version|45           |
6|-------|-------------|
7|Editors|Steven Loomis (<a href="mailto:[email protected]">[email protected]</a>) and <a href="tr35.md#Acknowledgments">other CLDR committee members</a>|
8
9For the full header, summary, and status, see [Part 1: Core](tr35.md).
10
11### _Summary_
12
13This document describes parts of an XML format (_vocabulary_) for the exchange of structured locale data. This format is used in the [Unicode Common Locale Data Repository](https://www.unicode.org/cldr/).
14
15This is a partial document, describing keyboards. For the other parts of the LDML see the [main LDML document](tr35.md) and the links above.
16
17_Note:_
18Some links may lead to in-development or older
19versions of the data files.
20See <https://cldr.unicode.org> for up-to-date CLDR release data.
21
22### _Status_
23
24<!-- _This is a draft document which may be updated, replaced, or superseded by other documents at any time.
25Publication does not imply endorsement by the Unicode Consortium.
26This is not a stable document; it is inappropriate to cite this document as other than a work in progress._ -->
27
28_This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium.
29This is a stable document and may be used as reference material or cited as a normative reference by other specifications._
30
31> _**A Unicode Technical Standard (UTS)** is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS._
32
33_Please submit corrigenda and other comments with the CLDR bug reporting form [[Bugs](tr35.md#Bugs)]. Related information that is useful in understanding this document is found in the [References](tr35.md#References). For the latest version of the Unicode Standard see [[Unicode](tr35.md#Unicode)]. For a list of current Unicode Technical Reports see [[Reports](tr35.md#Reports)]. For more information about versions of the Unicode Standard, see [[Versions](tr35.md#Versions)]._
34
35
36See also [Compatibility Notice](#compatibility-notice).
37
38## <a name="Parts" href="#Parts">Parts</a>
39
40The LDML specification is divided into the following parts:
41
42*   Part 1: [Core](tr35.md#Contents) (languages, locales, basic structure)
43*   Part 2: [General](tr35-general.md#Contents) (display names & transforms, etc.)
44*   Part 3: [Numbers](tr35-numbers.md#Contents) (number & currency formatting)
45*   Part 4: [Dates](tr35-dates.md#Contents) (date, time, time zone formatting)
46*   Part 5: [Collation](tr35-collation.md#Contents) (sorting, searching, grouping)
47*   Part 6: [Supplemental](tr35-info.md#Contents) (supplemental data)
48*   Part 7: [Keyboards](tr35-keyboards.md#Contents) (keyboard mappings)
49*   Part 8: [Person Names](tr35-personNames.md#Contents) (person names)
50*   Part 9: [MessageFormat](tr35-messageFormat.md#Contents) (message format)
51
52## <a name="Contents" href="#Contents">Contents of Part 7, Keyboards</a>
53
54* [Keyboards](#keyboards)
55* [Goals and Non-goals](#goals-and-non-goals)
56  * [Compatibility Notice](#compatibility-notice)
57  * [Accessibility](#accessibility)
58* [Definitions](#definitions)
59* [Notation](#notation)
60  * [Escaping](#escaping)
61  * [UnicodeSet Escaping](#unicodeset-escaping)
62  * [UTS18 Escaping](#uts18-escaping)
63* [File and Directory Structure](#file-and-directory-structure)
64  * [Extensibility](#extensibility)
65* [Normalization](#normalization)
66  * [Where Normalization Occurs](#where-normalization-occurs)
67  * [Normalization and Transform Matching](#normalization-and-transform-matching)
68  * [Normalization and Markers](#normalization-and-markers)
69    * [Rationale for 'gluing' markers](#rationale-for-gluing-markers)
70    * [Data Model: `Marker`](#data-model-marker)
71    * [Data Model: string](#data-model-string)
72    * [Data Model: `MarkerEntry`](#data-model-markerentry)
73    * [Marker Algorithm Overview](#marker-algorithm-overview)
74    * [Phase 1: Parsing/Removing Markers](#phase-1-parsingremoving-markers)
75    * [Phase 2: Plain Text Processing](#phase-2-plain-text-processing)
76    * [Phase 3: Adding Markers](#phase-3-adding-markers)
77    * [Example Normalization with Markers](#example-normalization-with-markers)
78  * [Normalization and Character Classes](#normalization-and-character-classes)
79  * [Normalization and Reorder elements](#normalization-and-reorder-elements)
80  * [Normalization-safe Segments](#normalization-safe-segments)
81  * [Normalization and Output](#normalization-and-output)
82  * [Disabling Normalization](#disabling-normalization)
83* [Element Hierarchy](#element-hierarchy)
84  * [Element: keyboard3](#element-keyboard3)
85  * [Element: import](#element-import)
86  * [Element: locales](#element-locales)
87  * [Element: locale](#element-locale)
88  * [Element: version](#element-version)
89  * [Element: info](#element-info)
90  * [Element: settings](#element-settings)
91  * [Element: displays](#element-displays)
92  * [Element: display](#element-display)
93    * [Non-spacing marks on keytops](#non-spacing-marks-on-keytops)
94  * [Element: displayOptions](#element-displayoptions)
95  * [Element: keys](#element-keys)
96  * [Element: key](#element-key)
97    * [Implied Keys](#implied-keys)
98  * [Element: flicks](#element-flicks)
99    * [Element: flick](#element-flick)
100    * [Element: flickSegment](#element-flicksegment)
101  * [Element: forms](#element-forms)
102  * [Element: form](#element-form)
103    * [Implied Form Values](#implied-form-values)
104  * [Element: scanCodes](#element-scancodes)
105  * [Element: layers](#element-layers)
106  * [Element: layer](#element-layer)
107    * [Layer Modifier Sets](#layer-modifier-sets)
108    * [Layer Modifier Components](#layer-modifier-components)
109    * [Modifier Left- and Right- keys](#modifier-left--and-right--keys)
110    * [Layer Modifier Matching](#layer-modifier-matching)
111  * [Element: row](#element-row)
112  * [Element: variables](#element-variables)
113  * [Element: string](#element-string)
114  * [Element: set](#element-set)
115  * [Element: uset](#element-uset)
116  * [Element: transforms](#element-transforms)
117    * [Markers](#markers)
118  * [Element: transformGroup](#element-transformgroup)
119    * [Example: `transformGroup` with `transform` elements](#example-transformgroup-with-transform-elements)
120    * [Example: `transformGroup` with `reorder` elements](#example-transformgroup-with-reorder-elements)
121  * [Element: transform](#element-transform)
122    * [Regex-like Syntax](#regex-like-syntax)
123    * [Additional Features](#additional-features)
124    * [Disallowed Regex Features](#disallowed-regex-features)
125    * [Replacement syntax](#replacement-syntax)
126  * [Element: reorder](#element-reorder)
127    * [Using `<import>` with `<reorder>` elements](#using-import-with-reorder-elements)
128    * [Example Post-reorder transforms](#example-post-reorder-transforms)
129    * [Reorder and Markers](#reorder-and-markers)
130  * [Backspace Transforms](#backspace-transforms)
131* [Invariants](#invariants)
132* [Keyboard IDs](#keyboard-ids)
133  * [Principles for Keyboard IDs](#principles-for-keyboard-ids)
134* [Platform Behaviors in Edge Cases](#platform-behaviors-in-edge-cases)
135
136## Keyboards
137
The Unicode Standard and related technologies such as CLDR have dramatically improved the path to language support. However, keyboard support remains platform- and vendor-specific, causing inconsistencies in both implementation and timelines.
139
140More and more language communities are determining that digitization is vital to their approach to language preservation and that engagement with Unicode is essential to becoming fully digitized. For many of these communities, however, getting new characters or a new script added to The Unicode Standard is not the end of their journey. The next, often more challenging stage is to get device makers, operating systems, apps and services to implement the script requirements that Unicode has just added to support their language.
141
142However, commensurate improvements to streamline new language support on the input side have been lacking. CLDR’s Keyboard specification has been updated in an attempt to address this gap.
143
144This document specifies an interchange format for the communication of keyboard mapping data independent of vendors and platforms. Keyboard authors can then create a single mapping file for their language, which implementations can use to provide that language’s keyboard mapping on their own platform.
145
146Additionally, the standardized identifier for keyboards can be used to communicate, internally or externally, a request for a particular keyboard mapping that is to be used to transform either text or keystrokes. The corresponding data can then be used to perform the requested actions.  For example, a remote screen-access application (such as used for customer service or server management) would be able to communicate and choose the same keyboard layout on the remote device as is used in front of the user, even if the two systems used different platforms.
147
148The data can also be used in analysis of the capabilities of different keyboards. It also allows better interoperability by making it easier for keyboard designers to see which characters are generally supported on keyboards for given languages.
149
150<!-- To illustrate this specification, here is an abridged layout representing the English US 101 keyboard on the macOS operating system (with an inserted long-press example). -->
151
152For complete examples, see the XML files in the CLDR source repository.
153
154Attribute values should be evaluated considering the DTD and [DTD Annotations](tr35.md#dtd-annotations).
155
156* * *
157
158## Goals and Non-goals
159
160Some goals of this format are:
161
1621. Physical and virtual keyboard layouts defined in a single file.
1632. Provide definitive platform-independent definitions for new keyboard layouts.
164    * For example, a new French standard keyboard layout would have a single definition which would be usable across all implementations.
1653. Allow platforms to be able to use CLDR keyboard data for the character-emitting keys (non-frame) aspects of keyboard layouts.
1664. Deprecate & archive existing LDML platform-specific layouts so they are not part of future releases.
167
168<!--
1691. Make the XML as readable as possible.
1702. Represent faithfully keyboard data from major platforms: it should be possible to create a functionally-equivalent data file (such that given any input, it can produce the same output).
1713. Make as much commonality in the data across platforms as possible to make comparison easy. -->
172
173Some non-goals (outside the scope of the format) currently are:
174
1751. Adaptation for screen scaling resolution. Instead, keyboards should define layouts based on physical size. Platforms may interpret physical size definitions and adapt for different physical screen sizes with different resolutions.
1762. Unification of platform-specific virtual key and scan code mapping tables.
1773. Unification of pre-existing platform layouts themselves (e.g. existing fr-azerty on platform a, b, c).
1784. Support for prior (pre 3.0) CLDR keyboard files. See [Compatibility Notice](#compatibility-notice).
1795. Run-time efficiency. [LDML is explicitly an interchange format](tr35.md#Introduction), and so it is expected that data will be transformed to a more compact format for use by a keystroke processing engine.
1806. Platform-specific frame keys such as Fn, Numpad, IME swap keys, and cursor keys are out of scope.
181   (This also means that in this specification, modifier (frame) keys cannot generate output, such as capslock producing backslash.)
182
183<!-- 1. Display names or symbols for keycaps (eg, the German name for "Return"). If that were added to LDML, it would be in a different structure, outside the scope of this section.
1842. Advanced IME features, handwriting recognition, etc.
1853. Roundtrip mappings—the ability to recover precisely the same format as an original platform's representation. In particular, the internal structure may have no relation to the internal structure of external keyboard source data, the only goal is functional equivalence. -->
186
187<!-- Note: During development of this section, it was considered whether the modifier RAlt (= AltGr) should be merged with Option. In the end, they were kept separate, but for comparison across platforms implementers may choose to unify them. -->
188
189Note that in parts of this document, the format `@x` is used to indicate the _attribute_ **x**.
190
191### Compatibility Notice
192
193> A major rewrite of this specification, called "Keyboard 3.0", was introduced in CLDR v45.
> The changes required were too extensive to maintain compatibility. For this reason, the `ldmlKeyboard3.dtd` DTD is _not_ compatible with DTDs from CLDR v43 and earlier.
195>
196> To process earlier XML files, use the data and specification from v43.1, found at <https://www.unicode.org/reports/tr35/tr35-69/tr35.html>
197>
> `ldmlKeyboard.dtd` continues to be made available in CLDR; however, it will not be updated.
199
200### Accessibility
201
202Keyboard use can be challenging for individuals with various types of disabilities. For this revision, features or architectural designs specifically for the purpose of improving accessibility are not yet included. However:
203
2041. Having an industry-wide standard format for keyboards will enable accessibility software to make use of keyboard data with a reduced dependence on platform-specific knowledge.
2052. Features which require certain levels of mobility or speed of entry should be considered for their impact on accessibility. This impact could be mitigated by means of additional, accessible methods of generating the same output.
2063. Public feedback is welcome on any aspects of this document which might hinder accessibility.
207
208## Definitions
209
210**Arrangement:** The relative position of the rectangles that represent keys, either physically or virtually. A hardware keyboard has a static arrangement while a touch keyboard may have a dynamic arrangement that changes per language and/or layer. While the arrangement of keys on a keyboard may be fixed, the mapping of those keys may vary.
211
212**Base character:** The character emitted by a particular key when no modifiers are active. In ISO 9995-1:2009 terms, this is Group 1, Level 1.
213
214**Core keys:** also known as “alphanumeric” section. The primary set of key values on a keyboard that are used for typing the target language of the keyboard. For example, the three rows of letters on a standard US QWERTY keyboard (QWERTYUIOP, ASDFGHJKL, ZXCVBNM) together with the most significant punctuation keys. Usually this equates to the minimal set of keys for a language as seen on mobile phone keyboards.
215Distinguished from the **frame keys**.
216
217**Dead keys:** These are keys which do not emit normal characters by themselves. They are so named because to the user, they may appear to be “dead,” i.e., non-functional. However, they do produce a change to the input context. For example, in many Latin keyboards hitting the `^` dead-key followed by the `e` key produces `ê`. The `^` by itself may be invisible or presented in a special way by the platform.
218
219**Frame keys:** These are keys which are outside of the area of the **core keys** and typically do not emit characters. These keys include **modifier** keys, such as Shift or Ctrl, but also include platform specific keys: Fn, IME and layout-switching keys, cursor keys, insert emoji keys etc.
220
221**Hardware keyboard:** an input device which has individual keys that are pressed. Each key has a unique identifier and the arrangement doesn't change, even if the mapping of those keys does. Also known as a physical keyboard.
222
223**Implementation:** see **Keyboard implementation**
224
225**Input Method Editor (IME):** a component or program that supports input of large character sets. Typically, IMEs employ contextual logic and candidate UI to identify the Unicode characters intended by the user.
226
227**Keyboard implementation:** Software which implements the present specification, such that keyboard XML files can be used to interpret keystrokes from a **Hardware keyboard** or an on-screen **Touch keyboard**.
228
229Keyboard implementations will typically consist of two parts:
230
2311. A _compile/build tool_ part used by **Keyboard authors** to parse the XML file and produce a compact runtime format, and
2322. A _runtime_ part which interprets the runtime format when the keyboard is selected by the end user, and delivers the output plain text to the platform or application.
233
234**Key:** A physical key on a hardware keyboard, or a virtual key on a touch keyboard.
235
236**Key code:** The integer code sent to the application on pressing a key.
237
238**Key map:** The basic mapping between hardware or on-screen positions and the output characters for each set of modifier combinations associated with a particular layout. There may be multiple key maps for each layout.
239
240**Keyboard:** A particular arrangement of keys for the inputting of text, such as a hardware keyboard or a touch keyboard.
241
242**Keyboard author:** The person or group of people designing and producing a particular keyboard layout designed to support one or more languages. In the context of this specification, that author may be editing the LDML XML file directly or by means of software tools.
243
244**Keyboard layout:** A layout is the overall keyboard configuration for a particular locale. Within a keyboard layout, there is a single base map, one or more key maps and zero or more transforms.
245
**Layer:** An arrangement of keys on a touch keyboard. A touch keyboard is made up of a set of layers. Each layer may have a different key layout, unlike with a hardware keyboard, and may not correspond directly to a hardware keyboard's modifier keys. A layer is accessed via a layer-switching key. See also touch keyboard and modifier.
247
**Long-press key:** also known as a “child key”. A secondary key that is invoked from a top level key on a touch keyboard. Secondary keys typically provide access to variants of the top level key, such as accented variants (a => á, à, ä, ã).
249
250**Modifier:** A key that is held to change the behavior of a hardware keyboard. For example, the "Shift" key allows access to upper-case characters on a US keyboard. Other modifier keys include but are not limited to: Ctrl, Alt, Option, Command and Caps Lock. On a touch keyboard, keys that appear to be modifier keys should be considered to be layer-switching keys.
251
252**Physical keyboard:** see **Hardware keyboard**
253
**Touch keyboard:** A keyboard that is rendered on a surface, typically a touch surface. It has a dynamic arrangement and contrasts with a hardware keyboard. This term has many synonyms: software keyboard, SIP (Software Input Panel), virtual keyboard. This contrasts with other uses of the term virtual keyboard as an on-screen keyboard for reference or accessibility data entry.
255
256**Transform:** A transform is an element that specifies a set of conversions from sequences of code points into one (or more) other code points. Transforms may reorder or replace text. They may be used to implement “dead key” behaviors, simple orthographic corrections, visual (typewriter) type input etc.
257
258**Virtual keyboard:** see **Touch keyboard**
259
260## Notation
261
262- Ellipses (`…`) in syntax examples are used to denote substituted parts.
263
264  For example, `id="…keyId"` denotes that `…keyId` (the part between double quotes) is to be replaced with something, in this case a key identifier. As another example, `\u{…usv}` denotes that the `…usv` is to be replaced with something, in this case a Unicode scalar value in hex.
265
266### Escaping
267
268When explicitly specified, attribute values can contain escaped characters. This specification uses two methods of escaping, the _UnicodeSet_ notation and the `\u{…usv}` notation.
269
270### UnicodeSet Escaping
271
272The _UnicodeSet_ notation is described in [UTS #35 section 5.3.3](tr35.md#Unicode_Sets) and allows for comprehensive character matching, including by character range, properties, names, or codepoints.
273
274Note that the `\u1234` and `\x{C1}` format escaping is not supported, only the `\u{…}` format (using `bracketedHex`).
275
276Currently, the following attribute values allow _UnicodeSet_ notation:
277
278* `from` or `before` on the `<transform>` element
279* `from` or `before` on the `<reorder>` element
280* `chars` on the [`<repertoire>`](#test-element-repertoire) test element.
281
282### UTS18 Escaping
283
284The `\u{…usv}` notation, a subset of hex notation, is described in [UTS #18 section 1.1](https://www.unicode.org/reports/tr18/#Hex_notation). It can refer to one or multiple individual codepoints. Currently, the following attribute values allow the `\u{…}` notation:
285
286* `output` on the `<key>` element
287* `from` or `to` on the `<transform>` element
288* `value` on the `<variable>` element
289* `output` and `display` on the `<display>` element
290* `baseCharacter` on the `<displayOptions>` element
291* Some attributes on [Keyboard Test Data](#keyboard-test-data) subelements
292
Characters of general category Mark (M), Control characters (Cc), Format characters (Cf), and whitespace other than space should be encoded using one of the notations above as appropriate.

Attribute values escaped in this manner are annotated with the `<!--@ALLOWS_UESC-->` DTD annotation; see [DTD Annotations](tr35.md#dtd-annotations).
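
As an illustration of the `\u{…usv}` notation, the following Python sketch expands the escapes in an attribute value. It is a minimal sketch, not part of this specification: the function name `unescape_usv` is hypothetical, and it assumes that multiple scalar values inside one pair of braces are separated by single spaces, as in the UTS #18 hex notation.

```python
import re

# Matches \u{…} holding one or more space-separated hex values.
# Assumption: single spaces separate multiple values inside the braces.
_UESC = re.compile(r"\\u\{([0-9A-Fa-f]+(?: [0-9A-Fa-f]+)*)\}")


def unescape_usv(value: str) -> str:
    """Replace each \\u{…} escape in an attribute value with its codepoints."""

    def expand(match: re.Match) -> str:
        chars = []
        for hex_digits in match.group(1).split(" "):
            cp = int(hex_digits, 16)
            # A Unicode scalar value excludes surrogates and values above U+10FFFF.
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError(f"not a Unicode scalar value: {hex_digits}")
            chars.append(chr(cp))
        return "".join(chars)

    return _UESC.sub(expand, value)


if __name__ == "__main__":
    # Text outside the escapes is passed through unchanged.
    assert unescape_usv(r"e\u{0320}\u{0300}") == "e\u0320\u0300"
```

Text that is not part of an escape is left untouched, so the function can be applied to a whole attribute value such as `output` on `<key>`.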
296
297* * *
298
299## File and Directory Structure
300
* In the future, new layouts will be included in the CLDR repository, as a way for new layouts to be distributed in a cross-platform manner. The process for this repository of layouts has not yet been defined; see the [CLDR Keyboard Workgroup Page][keyboard-workgroup] for up-to-date information.
302
* Layouts have version metadata to indicate their specification compliance version number, such as `45`. See [`cldrVersion`](tr35-info.md#version-information).
304
305```xml
306<keyboard3 xmlns="https://schemas.unicode.org/cldr/45/keyboard3" conformsTo="45"/>
307```
308
309> _Note_: Unlike other LDML files, layouts are designed to be used outside of the CLDR source tree.  As such, they do not contain DOCTYPE entries.
310>
311> DTD and Schema (.xsd) files are available for use in validating keyboard files.
312
313* The filename of a keyboard .xml file does not have to match the BCP47 primary locale ID, but it is recommended to do so. The CLDR repository may enforce filename consistency.
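
The `conformsTo` version metadata described above can be read with any XML parser. The following Python sketch uses the standard `xml.etree.ElementTree` module; the function name `conforms_to` is hypothetical, and the namespace URI is copied from the example above rather than taken from any normative list.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the <keyboard3> example above.
KEYBOARD3_NS = "https://schemas.unicode.org/cldr/45/keyboard3"


def conforms_to(xml_text: str) -> str:
    """Return the conformsTo value of a keyboard3 root element."""
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced tags as "{namespace}localname".
    if root.tag != f"{{{KEYBOARD3_NS}}}keyboard3":
        raise ValueError(f"unexpected root element: {root.tag}")
    version = root.get("conformsTo")
    if version is None:
        raise ValueError("missing conformsTo attribute")
    return version


SAMPLE = (
    '<keyboard3 xmlns="https://schemas.unicode.org/cldr/45/keyboard3" '
    'conformsTo="45"/>'
)

if __name__ == "__main__":
    print(conforms_to(SAMPLE))  # prints 45
```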
314
315### Extensibility
316
317For extensibility, the `<special>` element will be allowed at nearly every level.
318
319See [Element special](tr35.md#special) in Part 1.
320
321## Normalization
322
323Unicode Normalization, as described in [The Unicode Standard](https://www.unicode.org/reports/tr41/#Unicode/), is a process by which Unicode text is processed to eliminate unwanted distinctions.
324
This section discusses how conformant keyboards are affected by normalization, and the impact of normalization on keyboard authors and keyboard implementations.
326
Keyboard implementations will usually apply normalization as appropriate when matching transform rules and `<display>` values.
328Output from the keyboard, following application of all transform rules, will be normalized to the appropriate form by the keyboard implementation.
329
330> Note: There are many existing software libraries which perform Unicode Normalization, including [ICU](https://icu.unicode.org), [ICU4X](https://icu4x.unicode.org), and JavaScript's [String.prototype.normalize()](https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String/normalize).
331
332Keyboard authors will not typically need to perform normalization as part of the keyboard layout.  However, authors should be aware of areas where normalization affects keyboard operation so that they may achieve their desired results.
333
334### Where Normalization Occurs
335
336There are four stages where normalization must be performed by keyboard implementations.
337
3381. **From the keyboard source `.xml`**
339
340    Keyboard source .xml files may be in any normalization form.
341    However, in processing they are converted to NFD.
342
343    - From any form to NFD: full normalization (decompose+reorder)
344    - Markers must be processed as described [below](#marker-algorithm-overview).
345    - Regex patterns must be processed so that matching is performed in NFD.
346
347    Example: `<key output=`, and `<transform from= to=` attribute contents will be normalized to NFD.
348
3492. **From the input context**
350
351    The input context must be normalized for purposes of matching.
352
353    - From any form to NFD: full normalization (decompose+reorder)
354    - Markers in the cached context must be preserved.
355
356    Example: The input context contains U+00E8 (`è`).  The user clicks the cursor after the character, then presses a key which produces U+0320 (`<key output="\u{0320}"/>`).
357    The implementation must normalize the context buffer to `e\u{0320}\u{0300}` (`è̠`) before matching.
358
3593. **Before each `transformGroup`**
360
361    Text must be normalized before processing by the next `transformGroup`.
362
363    - To NFD: no decomposition should be needed, because all of the input text (including transform rules) was already in NFD form.
364    However, marker reordering may be needed if transforms insert segments out of order.
365    - Markers must be preserved.
366
367    Example: The input context contains U+00E8 (`è`).  The user clicks the cursor after this character, then presses a key producing `x`. A transform rule `<transform from='x' to='\u{0320}'/>` matches. The implementation must normalize the intermediate buffer to `e\u{0320}\u{0300}` (`è̠`) before proceeding to the next `transformGroup`.
368
3694. **Before output to the platform/application**
370
371    Text must be normalized into the output form requested by the platform or application. This will typically be NFC, but may not be.
372
373    - If normalizing to NFC, full normalization (reorder+composition) will be required.
374    - No markers are present in this text, they are removed prior to output but retained in the implementation's input context for subsequent keystrokes. See [markers](#markers).
375
376    Example: The result of keystrokes and transform processing produces the string `e\u{0300}`. The keyboard implementation normalizes this to a single NFC codepoint U+00E8 (`è`), which is returned to the application.
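
The examples for stages 2 and 4 above can be reproduced with an existing normalization library. The following Python sketch uses the standard `unicodedata` module. The function names are illustrative only, and the sketch handles plain text without markers, so it does not cover the marker processing that stages 1 to 3 require.

```python
import unicodedata


def normalize_context_for_matching(context: str) -> str:
    """Stage 2: bring the input context into NFD (decompose + reorder)."""
    return unicodedata.normalize("NFD", context)


def normalize_for_output(text: str, form: str = "NFC") -> str:
    """Stage 4: convert marker-free text into the form the platform requests."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    # Stage 2 example: context U+00E8, then a key producing U+0320.
    buffer = "\u00e8" + "\u0320"
    assert normalize_context_for_matching(buffer) == "e\u0320\u0300"

    # Stage 4 example: e + U+0300 composes to the single codepoint U+00E8.
    assert normalize_for_output("e\u0300") == "\u00e8"
```

U+0320 sorts before U+0300 in NFD because its canonical combining class (220) is lower than that of U+0300 (230).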
377
378### Normalization and Transform Matching
379
Regardless of the normalization form in the keyboard source file or in the edit buffer context, transform matching will be performed using **NFD**. For example, all of the following transforms will match the input string è̠, whether the input is U+00E8 U+0320, U+0065 U+0320 U+0300, or U+0065 U+0300 U+0320.
381
382```xml
383<transform from="e\u{0320}\u{0300}" /> <!-- NFD -->
384<transform from="\u{00E8}\u{0320}"  /> <!-- NFC: è + U+0320 -->
385<transform from="e\u{0300}\u{0320}" /> <!-- Unnormalized -->
386```
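
The equivalence of these three rules can be checked with a short Python sketch. It is a simplification under stated assumptions: the `from` values are treated as literal text rather than regex-like patterns, a rule is assumed to match at the end of the context, and the helper name `literal_rule_matches` is hypothetical.

```python
import unicodedata


def nfd(text: str) -> str:
    """Return the NFD form of a plain text string."""
    return unicodedata.normalize("NFD", text)


def literal_rule_matches(from_text: str, context: str) -> bool:
    """True if the NFD context ends with the NFD form of a literal 'from' value."""
    return nfd(context).endswith(nfd(from_text))


# The three 'from' values from the XML above, after unescaping.
RULES = ["e\u0320\u0300", "\u00e8\u0320", "e\u0300\u0320"]

# The three encodings of the input string listed in the text.
INPUTS = ["\u00e8\u0320", "\u0065\u0320\u0300", "\u0065\u0300\u0320"]

if __name__ == "__main__":
    # Every rule matches every input once both sides are in NFD.
    assert all(literal_rule_matches(r, i) for r in RULES for i in INPUTS)
```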
387
388### Normalization and Markers
389
390A special issue occurs when markers are involved.
391[Markers](#markers) are not text, and so not themselves modified or reordered by the Unicode Normalization Algorithm.
Existing Normalization APIs typically operate on plain text, and so those APIs cannot be used with content containing markers.
393
394However, the markers must be retained and processed by keyboard implementations in a manner which will be both consistent across implementations and predictable to keyboard authors.
Inconsistencies would result in different user experiences (specifically, different or incorrect text output) on some implementations but not others.
396Unpredictability would make it challenging for the keyboard author to create a keyboard with expected behavior.
397
398This section gives an algorithm for implementing normalization on a text stream including markers.
399
_Note:_ When the algorithm is performed on a plain text stream that doesn't include markers, implementations may skip the removing and re-adding phases (1 and 3) because no markers are involved.
401
402#### Rationale for 'gluing' markers
403
The processing described here is an extension to Unicode normalization to account for the desired behavior of markers.

The algorithm described considers markers 'glued' to (remaining with) the following character. If a context ends with a marker, that marker would be guaranteed to remain at the end after processing, consistently located with respect to the next keystroke to be input.
407
4081. Keyboard authors can keep a marker together with a character of interest by emitting the marker just previous to that character.
409
For example, given a key `output="\m{marker}X"`, the marker will precede `X` regardless of any normalization. (If `output="X\m{marker}"` were used, and `X` were to reorder with other characters, the marker would no longer be adjacent to the X.)
411
4122. Markers which are at the end of the input remain at the end of input during normalization.
413
414For example, given input context which ends with a marker, such as `...ABCDX\m{marker}`, the marker will remain at the end of the input context regardless of any normalization.
415
416The 'gluing' is only applicable during one particular processing step. It does not persist or affect further processing steps or future keystrokes.
417
418#### Data Model: `Marker`
419
420For purposes of this algorithm, a `Marker` is an opaque data type which has one property, its ID. See [Markers](#markers) for a discussion of the marker ID.
421
422#### Data Model: string
423
For purposes of this algorithm, a string is an array of elements, where each element is either a codepoint or a `Marker`. For example, a [`key`](#element-key) in the XML such as `<key id="sha" output="𐓯\m{mymarker}x" />` would produce a string with three elements:
425
4261. The codepoint U+104EF
4272. The `Marker` named `mymarker`
4283. The codepoint U+0078
429
If this string were output to an application, it would be converted to _plain text_ by removing all markers, which would yield the plain text string with only two codepoints: `𐓯x`.
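
This data model can be sketched in Python as a list whose elements are either single-codepoint strings or `Marker` objects. The names `parse_output` and `to_plain_text` are hypothetical, and the sketch assumes that a marker ID is any run of characters other than `}`; see [Markers](#markers) for the actual ID syntax.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Marker:
    """Opaque marker with a single property, its ID."""
    id: str


Element = Union[str, Marker]  # a one-codepoint str, or a Marker

# Assumption: a marker ID is any run of characters other than '}'.
_MARKER = re.compile(r"\\m\{([^}]+)\}")


def parse_output(value: str) -> List[Element]:
    """Split an attribute value into codepoints and Marker elements."""
    elements: List[Element] = []
    pos = 0
    for match in _MARKER.finditer(value):
        elements.extend(value[pos:match.start()])  # one element per codepoint
        elements.append(Marker(match.group(1)))
        pos = match.end()
    elements.extend(value[pos:])
    return elements


def to_plain_text(elements: List[Element]) -> str:
    """Remove all markers, keeping only the codepoints."""
    return "".join(e for e in elements if isinstance(e, str))


if __name__ == "__main__":
    parsed = parse_output("\U000104EF\\m{mymarker}x")
    assert parsed == ["\U000104EF", Marker("mymarker"), "x"]
    assert to_plain_text(parsed) == "\U000104EFx"
```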
431
432#### Data Model: `MarkerEntry`
433
434This algorithm uses a temporary data structure which is an ordered array of `MarkerEntry` elements.
435
436Each `MarkerEntry` element has the following properties:
- `glue` (a codepoint, or the special value `END_OF_SEGMENT`, written `END` in the steps below)
438- `divider?` (true/false)
439- `processed?` (true/false, defaults to false)
440- `marker` (the `Marker` object)
441
442#### Marker Algorithm Overview
443
444This algorithm has three main phases to it.
445
4461. **Parsing/Removing Markers**
447
    In this phase, the input string is analyzed to locate all markers. Metadata about each marker is stored in a temporary array of `MarkerEntry` elements.
    Markers are removed from the input string, leaving only plain text.
450
4512. **Plain Text Processing**
452
    In this phase, processing such as NFD normalization is performed on the plain text string.
454
4553. **Re-Adding Markers**
456
457    Finally, markers are re-added to the plain text string using the `MarkerEntry` metadata from step 1.
458    This phase results in a string which contains both codepoints and markers.
459
460#### Phase 1: Parsing/Removing Markers
461
Given an input string _s_:

1. Initialize an empty `MarkerEntry` array _e_
2. Initialize an empty `Marker` array _pending_
3. Loop through each element _i_ of the input _s_
    1. If _i_ is a `Marker`:
        1. add the marker _i_ to the end of _pending_
        2. remove the marker from the input string _s_
    2. else if _i_ is a codepoint:
        1. Decompose _i_ into NFD form, giving a plain text array of codepoints _d_
        2. Add an element with `glue=d[0]` (the first codepoint of _d_) and `divider? = true` to the end of _e_
        3. For every marker _m_ in _pending_:
            1. Add an element with `glue=d[0]` and `marker=m` and `divider? = false` to the end of _e_
        4. Clear the _pending_ array.
        5. Finally, for every codepoint _c_ in _d_ **following** the initial codepoint (`d[1]` onward):
            1. Add an element with `glue=c` and `divider? = true` to the end of _e_
4. At the end of text,
    1. Add an element with `glue=END` and `divider?=true` to the end of _e_
    2. For every marker _m_ in _pending_:
        1. Add an element with `glue=END` and `marker=m` and `divider? = false` to the end of _e_

The string _s_ is now plain text and can be processed by the next phase.

The array _e_ will be used in Phase 3.
486
487#### Phase 2: Plain Text Processing
488
489See [UAX #15](https://www.unicode.org/reports/tr15/#Description_Norm) for an overview of the process.  An existing Unicode-compliant API can be used here.
490
491#### Phase 3: Adding Markers
492
1. Initialize an empty output string _o_
2. Loop through the elements _p_ of the array _e_ from end to beginning (backwards)
    1. If _p_.glue isn't `END`:
        1. break out of the loop
    2. If _p_.divider? == false:
        1. Prepend marker _p_.marker to the output string _o_
    3. Set _p_.processed?=true (so we don't process this again)
3. Loop through each codepoint _i_ (in the plain text input string) from end to beginning (backwards)
    1. Prepend _i_ to output _o_
    2. Loop through the elements _p_ of the array _e_ from end to beginning (backwards)
        1. If _p_.processed? == true:
            1. Continue the inner loop  (was already processed)
        2. If _p_.glue isn't _i_
            1. Continue the inner loop  (wrong glue, not applicable)
        3. If _p_.divider? == true:
            1. Break out of the inner loop  (reached end of this 'glue' char)
        4. Prepend marker _p_.marker to the output string _o_
        5. Set _p_.processed?=true (so we don't process this again)
4. _o_ is now the output string including markers.
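
The three phases can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: a string is modeled as a list whose elements are either `int` codepoints or `Marker` objects, the function names (`remove_markers`, `add_markers`, `normalize_with_markers`) are hypothetical, and the standard `unicodedata` module stands in for the Unicode-compliant API of Phase 2. The sketch follows the steps literally and assumes it is applied to a single normalization-safe segment in which each codepoint occurs only once.

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

END = object()  # stands in for the special END glue value


@dataclass(frozen=True)
class Marker:
    id: str  # the only property of a Marker


@dataclass
class MarkerEntry:
    glue: object                     # a codepoint (int) or END
    divider: bool
    marker: Optional[Marker] = None
    processed: bool = False


Element = Union[int, Marker]


def remove_markers(s: List[Element]) -> Tuple[List[int], List[MarkerEntry]]:
    """Phase 1: split s into plain codepoints and MarkerEntry metadata."""
    entries: List[MarkerEntry] = []
    pending: List[Marker] = []
    plain: List[int] = []
    for el in s:
        if isinstance(el, Marker):
            pending.append(el)       # held until the next codepoint
            continue
        d = [ord(ch) for ch in unicodedata.normalize("NFD", chr(el))]
        entries.append(MarkerEntry(d[0], True))
        entries.extend(MarkerEntry(d[0], False, m) for m in pending)
        pending.clear()
        entries.extend(MarkerEntry(c, True) for c in d[1:])
        plain.append(el)
    entries.append(MarkerEntry(END, True))
    entries.extend(MarkerEntry(END, False, m) for m in pending)
    return plain, entries


def add_markers(plain: List[int], entries: List[MarkerEntry]) -> List[Element]:
    """Phase 3: re-insert markers in front of the codepoint they are glued to."""
    out: List[Element] = []
    for p in reversed(entries):      # trailing markers glued to END
        if p.glue is not END:
            break
        if not p.divider:
            out.insert(0, p.marker)
        p.processed = True
    for cp in reversed(plain):
        out.insert(0, cp)
        for p in reversed(entries):
            if p.processed or p.glue != cp:
                continue
            if p.divider:
                break                # reached the end of this glue character
            out.insert(0, p.marker)
            p.processed = True
    return out


def normalize_with_markers(s: List[Element]) -> List[Element]:
    plain, entries = remove_markers(s)
    text = unicodedata.normalize("NFD", "".join(chr(cp) for cp in plain))  # Phase 2
    return add_markers([ord(ch) for ch in text], entries)
```

Calling `normalize_with_markers([0x65, 0x300, Marker("marker"), 0x320])` walks through the same steps as the second example in the next section.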
512
513#### Example Normalization with Markers
514
515**Example 1**
516
517Consider this example, without markers:
518
519- `e\u{0300}\u{0320}` (input)
520- `e\u{0320}\u{0300}` (NFD)
521
522The combining marks are reordered.
523
524**Example 2**
525
526If we add markers:
527
528- `e\u{0300}\m{marker}\u{0320}` (input)
529- `e\m{marker}\u{0320}\u{0300}` (NFD)
530
531Note that the marker is 'glued' to the _following_ character. In the above example, `\m{marker}` was 'glued' to the `\u{0320}`.
532
**Example 3**

Another example, with markers at several positions:

- `e\m{marker0}\u{0300}\m{marker1}\u{0320}\m{marker2}` (input)
- `e\m{marker1}\u{0320}\m{marker0}\u{0300}\m{marker2}` (NFD)

Here `\m{marker2}` is 'glued' to the end of the string. However, if additional text is added such as by a subsequent keystroke (which may add an additional combining character, for example), this marker may be 'glued' to that following text.

Markers remain in the same normalization-safe segment during normalization. Consider:

**Example 4**
545
546- `e\u{0300}\m{marker1}\u{0320}a\u{0300}\m{marker2}\u{0320}` (original)
547- `e\m{marker1}\u{0320}\u{0300}a\m{marker2}\u{0320}\u{0300}` (NFD)
548
549There are two normalization-safe segments here:
550
5511. `e\u{0300}\m{marker1}\u{0320}`
5522. `a\u{0300}\m{marker2}\u{0320}`
553
554Normalization (and marker rearranging) effectively occurs within each segment.  While `\m{marker1}` is 'glued' to the `\u{0320}`, it is glued within the first segment and has no effect on the second segment.
555
556### Normalization and Character Classes
557
558If pre-composed (non-NFD) characters are used in [character classes](#regex-like-syntax), such as `[á-é]`, these may not match as keyboard authors expect, as the U+00E1 character (á) will not occur in NFD form. Thus this may be masking serious errors in the data.
559
560Tools that process keyboard data must reject the data when character classes include non-NFD characters.
561
562The above should be written instead as a regex `(á|â|ã|ä|å|æ|ç|è|é)`. Alternatively, it could be written as a set variable `<set id="Example" value="á â ã ä å æ ç è é"/>` and matched as `$[Example]`.
563
564There is another case where there is no explicit mention of a non-NFD character, but the character class could include non-NFD characters, such as the range `[\u{0020}-\u{01FF}]`. For these, the tools should raise a warning by default.
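
As a rough illustration of how a tool might detect both situations, the Python sketch below tests a single character for NFD stability and scans a codepoint range for members that are not in NFD. The helper names `is_nfd` and `non_nfd_in_range` are hypothetical, and the decision to reject or merely warn is left to the caller.

```python
import unicodedata
from typing import List


def is_nfd(ch: str) -> bool:
    """True if the character is unchanged by NFD normalization."""
    return unicodedata.normalize("NFD", ch) == ch


def non_nfd_in_range(lo: int, hi: int) -> List[int]:
    """Codepoints in the inclusive range lo..hi that are not in NFD form."""
    return [cp for cp in range(lo, hi + 1) if not is_nfd(chr(cp))]
```

A tool could call `is_nfd` on every character written explicitly in a character class, and `non_nfd_in_range` on each range to decide whether a warning is needed.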
565
566### Normalization and Reorder elements
567
568[`reorder`](#element-reorder) elements operate on NFD codepoints.
569
570### Normalization-safe Segments
571
For purposes of this algorithm, a "normalization-safe segment" is defined as a string of codepoints which

1. is already in [NFD](https://www.unicode.org/reports/tr15/#Norm_Forms), and
2. begins with a character with [Canonical Combining Class](https://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values) of `0`.
576
577See [UAX #15 Section 9.1: Stable Code Points](https://www.unicode.org/reports/tr15/#Stable_Code_Points) for related discussion.
578Text under consideration can be segmented by locating such characters.
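
A minimal sketch of such segmentation in Python, assuming the input is already in NFD. The function name `safe_segments` is hypothetical, and `unicodedata.combining` is used to read the Canonical Combining Class. Any combining marks at the very start of the text are collected into a leading chunk, which is an assumption of this sketch and not something defined above.

```python
import unicodedata
from typing import List


def safe_segments(nfd_text: str) -> List[str]:
    """Split NFD text so that each chunk starts at a character with ccc=0."""
    segments: List[str] = []
    for ch in nfd_text:
        if unicodedata.combining(ch) == 0 or not segments:
            segments.append(ch)      # a ccc=0 character opens a new segment
        else:
            segments[-1] += ch       # combining marks stay with their base
    return segments
```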
579
580### Normalization and Output
581
582On output, text will be normalized into a specified normalization form. That form will typically be NFC, but an implementation may allow a calling application to override the choice of normalization form.
583For example, many platforms may request NFC as the output format. In such a case, all text emitted via the keyboard will be transformed into NFC.
584
585Existing text in a document will only have normalization applied within a single normalization-safe segment from the caret.  The output will not contain any markers, thus any normalization is unaffected by any markers embedded within the segment.
586
587For example, the sequence `e\m{marker}\u{300}` would be output in NFC as `è`. The marker is removed and has no effect on the output.
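
The output step can be sketched in Python as below. This is only an illustration: the `Marker` class and the `emit` function are hypothetical names, a string is modeled as a list of `int` codepoints and `Marker` objects, and the normalization form is a parameter defaulting to NFC.

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Marker:
    id: str


def emit(s: List[Union[int, Marker]], form: str = "NFC") -> str:
    """Remove all markers, then normalize the remaining plain text."""
    plain = "".join(chr(el) for el in s if not isinstance(el, Marker))
    return unicodedata.normalize(form, plain)
```

With this sketch, `emit([0x65, Marker("marker"), 0x300])` corresponds to the `e\m{marker}\u{300}` case above.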
588
589### Disabling Normalization
590
591The attribute value `normalization="disabled"` can be used to indicate that no automatic normalization is to be applied in input, matching, or output. Using this setting should be done with caution:
592
593- When this attribute value is used, all matching and output uses only the exact codepoints provided by the keyboard author.
594- The input context from the application may not be normalized, which means that the keyboard author should consider all possible combinations, including NFC, NFD, and mixed normalization in `<transform from=` attributes.
595- See [`<settings>`](#element-settings) for further details.
596
597The majority of the above section only applies when `normalization="disabled"` is not used.
598
599* * *
600
601## Element Hierarchy
602
603This section describes the XML elements in a keyboard layout file, beginning with the top level element `<keyboard3>`.
604
605### Element: keyboard3
606
607This is the top level element. All other elements defined below are under this element.
608
609**Syntax**
610
611```xml
612<keyboard3 locale="…localeId">
613    <!-- …definition of the layout as described by the elements defined below -->
614</keyboard3>
615```
616
617> <small>
618>
619> Parents: _none_
620>
621> Children: [displays](#element-displays), [flicks](#element-flicks), [forms](#element-forms), [import](#element-import), [info](#element-info), [keys](#element-keys), [layers](#element-layers), [locales](#element-locales), [settings](#element-settings), [_special_](tr35.md#special), [transforms](#element-transforms), [variables](#element-variables), [version](#element-version)
622>
623> Occurrence: required, single
624>
625> </small>
626
627_Attribute:_ `conformsTo` (required)
628
629This attribute value distinguishes the keyboard from prior versions,
630and it also specifies the minimum CLDR major version required.
631
This attribute value must be a whole number of `45` or greater. See [`cldrVersion`](tr35-info.md#version-information).
633
634```xml
635<keyboard3 … conformsTo="45"/>
636```
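
A tool reading the file might check this attribute as in the following Python sketch. The function name `parse_conforms_to` is hypothetical, and raising `ValueError` is just one possible way to report the problem.

```python
def parse_conforms_to(value: str) -> int:
    """Return the conformsTo value as an int, or raise ValueError."""
    if not (value.isascii() and value.isdigit()):
        raise ValueError(f"conformsTo must be a whole number, got {value!r}")
    version = int(value)
    if version < 45:
        raise ValueError(f"conformsTo must be 45 or greater, got {version}")
    return version
```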
637
638_Attribute:_ `locale` (required)
639
640This attribute value contains the primary locale of the keyboard using BCP 47 [Unicode locale identifiers](tr35.md#Canonical_Unicode_Locale_Identifiers) - for example `"el"` for Greek. Sometimes, the locale may not specify the base language. For example, a Devanagari keyboard for many languages could be specified by BCP-47 code: `"und-Deva"`. However, it is better to list out the languages explicitly using the [`locales`](#element-locales) element.
641
642For further details about the choice of locale ID, see [Keyboard IDs](#keyboard-ids).
643
644**Example** (for illustrative purposes only, not indicative of the real data)
645
```xml
<keyboard3 locale="ka">
    …
</keyboard3>
```

```xml
<keyboard3 locale="fr-CH-t-k0-azerty">
    …
</keyboard3>
```
657* * *
658
659### Element: import
660
The `import` element references another XML file whose elements are imported into the current file. The use case is to be able to import a standard set of `transform`s and similar
from the CLDR repository, especially to be able to share common information relevant to a particular script.
The intent is for each single XML file to contain all that is needed for a keyboard layout, other than required standard import data from the CLDR repository.

`<import>` can be used as a child of a number of elements (see the _Parents_ section immediately below). Multiple `<import>` elements may be used; however, `<import>` elements must come before any other sibling elements.
If two identical elements are defined, the later element will take precedence, that is, override.
Imported elements may contain other `<import>` statements. Implementations must prevent recursion, that is, each imported file may only be included once.
669
670**Note:** imported files do not have any indication of their normalization mode. For this reason, the keyboard author must verify that the imported file is of a compatible normalization mode. See the [`settings` element](#element-settings) for further details.
671
**Syntax**

```xml
<import base="cldr" path="45/keys-Zyyy-punctuation.xml"/>
```

> <small>
>
> Parents: [displays](#element-displays), [flicks](#element-flicks), [forms](#element-forms), [keyboard3](#element-keyboard3), [keys](#element-keys), [layers](#element-layers), [transformGroup](#element-transformgroup), [transforms](#element-transforms), [variables](#element-variables)
>
> Children: _none_
680>
681> Occurrence: optional, multiple
682>
683> </small>
684
685_Attribute:_ `base`
686
687> The base may be omitted (indicating a local import) or have the value `"cldr"`.
688
689**Note:** `base="cldr"` is required for all `<import>` statements within keyboard files in the CLDR repository.
690
691_Attribute:_ `path` (required)
692
> If `base` is `cldr`, then the `path` must start with a CLDR major version (such as `45`) representing the CLDR version to pull imports from. The imports are located in the `keyboards/import` subdirectory of the CLDR source repository.
694> Implementations are not required to have all CLDR versions available to them.
695>
696> If `base` is omitted, then `path` is an absolute or relative file path.
697
698
699**Further Examples**
700
```xml
<!-- in a keyboard xml file-->
…
<transforms type="simple">
    <import base="cldr" path="45/transforms-example.xml"/>
    <transform from="` " to="`" />
    <transform from="^ " to="^" />
</transforms>
…


<!-- contents of transforms-example.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<transforms>
    <!-- begin imported part-->
    <transform from="`a" to="à" />
    <transform from="`e" to="è" />
    <transform from="`i" to="ì" />
    <transform from="`o" to="ò" />
    <transform from="`u" to="ù" />
    <!-- end imported part -->
</transforms>
```
724
725**Note:** The root element, here `transforms`, is the same as
726the _parent_ of the `<import/>` element. It is an error to import an XML file
727whose root element is different than the parent element of the `<import/>` element.
728
729After loading, the above example will be the equivalent of the following.
730
```xml
<transforms type="simple">
    <!-- begin imported part-->
    <transform from="`a" to="à" />
    <transform from="`e" to="è" />
    <transform from="`i" to="ì" />
    <transform from="`o" to="ò" />
    <transform from="`u" to="ù" />
    <!-- end imported part -->

    <!-- these lines are after the import -->
    <transform from="` " to="`" />
    <transform from="^ " to="^" />
</transforms>
```
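
The expansion of `<import>` elements can be sketched in Python with the standard `xml.etree.ElementTree` module. This is an illustration under several assumptions: `resolve_imports` is a hypothetical name, `load` is a caller-supplied function that returns the XML text for a given `base` and `path`, and the override rule for identical elements is not implemented here.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Set, Tuple

Loader = Callable[[Optional[str], str], str]  # (base, path) -> XML text


def resolve_imports(parent: ET.Element, load: Loader,
                    seen: Optional[Set[Tuple[Optional[str], str]]] = None) -> None:
    """Replace each <import> child of parent with the imported root's children."""
    seen = set() if seen is None else seen
    resolved = []
    for child in list(parent):
        if child.tag != "import":
            resolve_imports(child, load, seen)   # imports may appear deeper down
            resolved.append(child)
            continue
        key = (child.get("base"), child.get("path"))
        if key in seen:
            continue                             # each file is included only once
        seen.add(key)
        root = ET.fromstring(load(*key))
        if root.tag != parent.tag:
            raise ValueError(
                f"cannot import <{root.tag}> into <{parent.tag}>: root element differs")
        resolve_imports(root, load, seen)        # imported files may import too
        resolved.extend(root)
    parent[:] = resolved
```

Because the imported children are spliced in at the position of the `<import>` element, elements written locally after the import keep their place at the end, as in the example above.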
746
747* * *
748
749### Element: locales
750
751The optional `<locales>` element allows specifying additional or alternate locales.
752
753**Syntax**
754
755```xml
756<locales>
757    <locale id="…"/>
758    <locale id="…"/>
759</locales>
760```
761
762> <small>
763>
764> Parents: [keyboard3](#element-keyboard3)
765>
766> Children: [locale](#element-locale)
767>
768> Occurrence: optional, single
769>
770> </small>
771
772### Element: locale
773
The `<locale>` element specifies an additional or alternate locale. It denotes intentional support for an extra language, not just that a keyboard incidentally supports a language’s orthography.
775
776**Syntax**
777
778```xml
779<locale id="…id"/>
780```
781
782> <small>
783>
784> Parents: [locales](#element-locales)
785>
786> Children: _none_
787>
788> Occurrence: optional, multiple
789>
790> </small>
791
792_Attribute:_ `id` (required)
793
794> The [BCP 47](tr35.md#Canonical_Unicode_Locale_Identifiers) locale ID of an additional language supported by this keyboard.
795> Must _not_ include the `-k0-` subtag for this additional language.
796
797**Example**
798
799See [Principles for Keyboard IDs](#principles-for-keyboard-ids) for discussion and further examples.
800
801```xml
802<!-- Pan Nigerian Keyboard-->
803<keyboard3 locale="mul-Latn-NG-t-k0-panng">
804    <locales>
805        <locale id="ha"/>
806        <locale id="ig"/>
807        <!-- others … -->
808    </locales>
809</keyboard3>
810```
811
812* * *
813
814### Element: version
815
816Element used to keep track of the source data version.
817
818**Syntax**
819
```xml
<version number="…number"/>
```
823
824> <small>
825>
826> Parents: [keyboard3](#element-keyboard3)
827>
828> Children: _none_
829>
830> Occurrence: optional, single
831>
832> </small>
833
834_Attribute:_ `number` (required)
835
836> Must be a [[SEMVER](https://semver.org)] compatible version number, such as `1.0.0` or `38.0.0-beta.11`
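
A tool could check the `number` attribute with a pattern such as the one in this Python sketch. The pattern follows the grammar published at semver.org; the name `is_semver` is hypothetical.

```python
import re

_NUM = r"(?:0|[1-9]\d*)"                          # numeric identifier, no leading zeros
_PRE = r"(?:0|[1-9]\d*|\d*[A-Za-z-][0-9A-Za-z-]*)"  # pre-release identifier
_BUILD = r"[0-9A-Za-z-]+"                         # build metadata identifier

SEMVER = re.compile(
    rf"{_NUM}\.{_NUM}\.{_NUM}"
    rf"(?:-{_PRE}(?:\.{_PRE})*)?"
    rf"(?:\+{_BUILD}(?:\.{_BUILD})*)?",
    re.ASCII,
)


def is_semver(number: str) -> bool:
    """True if number is a complete MAJOR.MINOR.PATCH version string."""
    return SEMVER.fullmatch(number) is not None
```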
837
838_Attribute:_ `cldrVersion` (fixed by DTD)
839
840> The CLDR specification version that is associated with this data file. This value is fixed and is inherited from the [DTD file](https://github.com/unicode-org/cldr/tree/main/keyboards/dtd) and therefore does not show up directly in the XML file.
841
842**Example**
843
```xml
<keyboard3 locale="tok">
    …
    <version number="1.0.0"/>
    …
</keyboard3>
```
851
852* * *
853
854### Element: info
855
856Element containing informative properties about the layout, for displaying in user interfaces etc.
857
858**Syntax**
859
860```xml
861<info
862      name="…name"
863      author="…author"
864      layout="…hint of the layout"
865      indicator="…short identifier" />
866```
867
868> <small>
869>
870> Parents: [keyboard3](#element-keyboard3)
871>
872> Children: _none_
873>
874> Occurrence: required, single
875>
876> </small>
877
878_Attribute:_ `name` (required)
879
880> Note that this is the only required attribute for the `<info>` element.
881>
882> This attribute is an informative name for the keyboard.
883
```xml
<keyboard3 locale="bg-t-k0-phonetic-trad">
    …
    <info name="Bulgarian (Phonetic Traditional)" />
    …
</keyboard3>
```
891
894
895_Attribute:_ `author`
896
897> The `author` attribute value contains the name of the author of the layout file.
898
899_Attribute:_ `layout`
900
901> The `layout` attribute describes the layout pattern, such as QWERTY, DVORAK, INSCRIPT, etc. typically used to distinguish various layouts for the same language.
902>
903> This attribute is not localized, but is an informative identifier for implementation use.
904
905_Attribute:_ `indicator`
906
> The `indicator` attribute describes a short string to be used in the currently selected layout indicator, such as `US`, `SI9`, etc.
908> Typically, this is shown on a UI element that allows switching keyboard layouts and/or input languages.
909>
910> This attribute is not localized.
911
912* * *
913
914### Element: settings
915
916An element used to keep track of layout-specific settings by implementations. This element may or may not show up on a layout. These settings reflect the normal practice by the implementation. However, an implementation using the data may customize the behavior.
917
918**Syntax**
919
920```xml
921<settings normalization="disabled" />
922```
923
924> <small>
925>
926> Parents: [keyboard3](#element-keyboard3)
927>
928> Children: _none_
929>
930> Occurrence: optional, single
931>
932> </small>
933
934_Attribute:_ `normalization="disabled"`
935
936> The presence of this attribute indicates that normalization will not be applied to the input text, matching, or the output.
937> See [Normalization](#normalization) for additional details.
938>
939> **Note**: while this attribute is allowed by the specification, it should be used with caution.
940
941
942**Example**
943
```xml
<keyboard3 locale="bg">
    …
    <settings normalization="disabled" />
    …
</keyboard3>
```
951
952* * *
953
954### Element: displays
955
956The `displays` element consists of a list of [`display`](#element-display) subelements.
957
958**Syntax**
959
```xml
<displays>
    <display … />
    <display … />
    …
</displays>
```
967
968> <small>
969>
970> Parents: [keyboard3](#element-keyboard3)
971>
972> Children: [display](#element-display), [displayOptions](#element-displayoptions), [_special_](tr35.md#special)
973>
974> Occurrence: optional, single
975>
976> </small>
977
978* * *
979
980### Element: display
981
The `display` elements can be used to describe what is to be displayed on the keytops for various keys. For the most part, such explicit information is unnecessary since the `@output` attribute from the `keys/key` element will be used for keytop display. Explicit `display` elements are useful in the following cases:
983
984- Some characters, such as diacritics, do not display well on their own.
985- Another useful scenario is where there are doubled diacritics, or multiple characters with spacing issues.
986- Finally, the `display` element provides a way to specify the keytop for keys which do not otherwise produce output. Keys which switch layers using the `@layerId` attribute typically do not produce output.
987
988> Note: `displays` elements are designed to be shared across many different keyboard layout descriptions, and imported with `<import>` where needed.
989
990#### Non-spacing marks on keytops
991
992For non-spacing marks, U+25CC `◌` is used as a base. It is an error to use a nonspacing character without a base in the `display` attribute. For example, `display="\u{0303}"` would produce an error.
993
994A key which outputs a combining tilde (U+0303) could be represented as either of the following:
995
996```xml
997    <display output="\u{0303}" display="◌̃" />  <!-- \u{25CC} \u{0303}-->
998    <display output="\u{0303}" display="\u{25cc}\u{0303}" />  <!-- also acceptable -->
999```
1000
1001This way, a key which outputs a combining tilde (U+0303) will be represented as `◌̃` (a tilde on a dotted circle).
1002
1003Users of some scripts/languages may prefer a different base than U+25CC. See  [`<displayOptions baseCharacter=…/>`](#element-displayoptions).
1004
1005
1006**Syntax**
1007
1008```xml
1009<display output="…string" display="…string" />
1010```
1011
1012> <small>
1013>
1014> Parents: [displays](#element-displays)
1015>
1016> Children: _none_
1017>
1018> Occurrence: required, multiple
1019>
1020> </small>
1021
One of the `output` or `keyId` attributes is required.

**Note**: A custom display for a key without output (i.e. without an `output=` attribute) is specified using the `keyId` attribute. There is currently no way to indicate that such a key has a standardized identity (e.g. that a key should be identified as a “Shift”). This may be addressed in future versions of this standard.
1025
1026
1027_Attribute:_ `output` (optional)
1028
1029> Specifies the character or character sequence from the `keys/key` element that is to have a special display.
1030> This attribute may be escaped with `\u` notation, see [Escaping](#escaping).
1031> The `output` attribute may also contain the `\m{…}` syntax to reference a marker. See [Markers](#markers). Implementations may highlight a displayed marker, such as with a lighter text color, or a yellow highlight.
1032> String variables may be substituted. See [String variables](#element-string)
1033
1034_Attribute:_ `keyId` (optional)
1035
1036> Specifies the `key` id. This is useful for keys which do not produce any output (no `output=` value), such as a shift key.
1037>
1038> Must match `[A-Za-z0-9][A-Za-z0-9_-]*`
1039
1040_Attribute:_ `display` (required)
1041
> Specifies the character sequence that should be displayed on the keytop for any key that generates the `@output` sequence or has the `@keyId`. (It is an error if the value of the `display` attribute is the same as the value of the `output` attribute; this would be an extraneous entry.)
1043
1044> String variables may be substituted. See [String variables](#element-string)
1045
1046This attribute may be escaped with `\u` notation, see [Escaping](#escaping).
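
The attribute rules for `display` could be checked along the lines of this Python sketch. Attributes are passed as a plain dictionary, `check_display` is a hypothetical name, and "one of `output` or `keyId`" is read here as "at least one", which is an assumption.

```python
import re
from typing import Dict, List

KEY_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def check_display(attrs: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one <display> element's attributes."""
    problems = []
    if "display" not in attrs:
        problems.append("display attribute is required")
    if "output" not in attrs and "keyId" not in attrs:
        problems.append("one of output or keyId is required")
    if "keyId" in attrs and not KEY_ID.fullmatch(attrs["keyId"]):
        problems.append("keyId does not match [A-Za-z0-9][A-Za-z0-9_-]*")
    if "output" in attrs and attrs.get("display") == attrs["output"]:
        problems.append("display equals output (extraneous entry)")
    return problems
```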
1047
1048**Example**
1049
1050```xml
1051<keyboard3>
1052    <keys>
1053        <key id="grave" output="\u{0300}" /> <!-- combining grave -->
1054        <key id="marker" output="\m{acute}" /> <!-- generates a marker-->
1055        <key id="numeric" layerId="numeric" /> <!-- changes layers-->
1056    </keys>
1057    <displays>
1058        <display output="\u{0300}" display="ˋ" /> <!-- \u{02CB} -->
1059        <display keyId="numeric"  display="#" /> <!-- display the layer shift key as # -->
1060        <display output="\m{acute}" display="´" /> <!-- Display \m{acute} as ´ -->
1061    </displays>
1062</keyboard3>
1063```
1064
1065To allow `displays` elements to be shared across keyboards, there is no requirement that `@output` in a `display` element matches any `@output`/`@id` in any `keys/key` element in the keyboard description.
1066
1067* * *
1068
1069### Element: displayOptions
1070
The `displayOptions` element is an optional singleton element providing additional settings for its parent `displays` element.  It is structured so as to provide for future flexibility in such options.
1072
1073**Syntax**
1074
1075```xml
1076<displays>
1077    <display …/>
1078    <displayOptions baseCharacter="x"/>
1079</displays>
1080```
1081
1082> <small>
1083>
1084> Parents: [displays](#element-displays)
1085>
1086> Children: _none_
1087>
1088> Occurrence: optional, single
1089>
1090> </small>
1091
1092_Attribute:_ `baseCharacter` (optional)
1093
1094**Note:** At present, this is the only option settable in the `displayOptions`.
1095
1096> Some scripts/languages may prefer a different base than U+25CC.
1097> For Lao for example, `x` is often used as a base instead of `◌`.
1098> Setting `baseCharacter="x"` (for example) is a _hint_ to the implementation which
1099> requests U+25CC to be substituted with `x` on display.
1100> As a hint, the implementation may ignore this option.
1101>
1102> **Note** that not all base characters will be suitable as bases for combining marks.
1103
1104This attribute may be escaped with `\u` notation, see [Escaping](#escaping).
1105
1106* * *
1107
1108### Element: keys
1109
1110This element defines the properties of all possible keys via [`<key>` elements](#element-key) used in all layouts.
1111It is a “bag of keys” without specifying any ordering or relation between the keys.
1112There is only a single `<keys>` element in each layout.
1113
1114**Syntax**
1115
1116```xml
1117<keys>
1118    <key … />
1119    <key … />
1120    <key … />
1121</keys>
1122```
1123
> <small>
>
> Parents: [keyboard3](#element-keyboard3)
>
> Children: [key](#element-key)
>
> Occurrence: optional, single
>
> </small>
1131
1132
1133
1134* * *
1135
1136### Element: key
1137
1138This element defines a mapping between an abstract key and its output. This element must have the `keys` element as its parent. The `key` element is referenced by the `keys=` attribute of the [`row` element](#element-row).
1139
1140**Syntax**
1141
```xml
<key
 id="…keyId"
 flickId="…flickId"
 gap="true"
 output="…string"
 longPressKeyIds="…list of keyIds"
 longPressDefaultKeyId="…keyId"
 multiTapKeyIds="…list of keyIds"
 stretch="true"
 layerId="…layerId"
 width="…number"
 />
```
1156
1157> <small>
1158>
1159> Parents: [keys](#element-keys)
1160>
1161> Children: _none_
1162>
1163> Occurrence: optional, multiple
1164> </small>
1165
**Note**: The `id` attribute is required.

**Note**: _At least one of_ `layerId`, `gap`, or `output` is required.

_Attribute:_ `id`

> The `id` attribute uniquely identifies the key. It is an NMTOKEN. It can be (but need not be) the key name (a, b, c, A, B, C, …), or any other valid token (e-acute, alef, alif, alpha, …).
1173>
1174> In the future, this attribute’s definition is expected to be updated to align with [UAX#31](https://www.unicode.org/reports/tr31/).
1175
1176_Attribute:_ `flickId="…flickId"` (optional)
1177
1178> The `flickId` attribute indicates that this key makes use of a [`flick`](#element-flick) set with the specified id.
1179
1180_Attribute:_ `gap="true"` (optional)
1181
> The `gap` attribute indicates that this key does not have any appearance, but causes a "gap" in the row. It can be used with `width` to set the size of the gap in key widths.
> Such elements may not be referred to by `display` elements, nor may they have any of the following attributes:  `flickId`, `longPressKeyIds`, `longPressDefaultKeyId`, `multiTapKeyIds`, `layerId`, or `output`.
1184
1185```xml
1186<key id="mediumgap" gap="true" width="1.5"/>
1187```
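
The attribute constraints described so far for `key` can be expressed as a small check, sketched here in Python. Attributes are passed as a dictionary and `check_key` is a hypothetical name; the sketch covers only the `id`, "at least one of", and `gap` rules.

```python
from typing import Dict, List

GAP_FORBIDDEN = ("flickId", "longPressKeyIds", "longPressDefaultKeyId",
                 "multiTapKeyIds", "layerId", "output")


def check_key(attrs: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one <key> element's attributes."""
    problems = []
    if "id" not in attrs:
        problems.append("id is required")
    if not any(name in attrs for name in ("layerId", "gap", "output")):
        problems.append("at least one of layerId, gap, or output is required")
    if attrs.get("gap") == "true":
        for name in GAP_FORBIDDEN:
            if name in attrs:
                problems.append(f"gap key may not have {name}")
    return problems
```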
1188
1189_Attribute:_ `output`
1190
> The `output` attribute value contains the sequence of characters that is emitted when pressing this particular key. Control characters, whitespace (other than the regular space character) and combining marks in this attribute are escaped using the `\u{…}` notation. More than one key may produce the same output.
1192>
1193> The `output` attribute may also contain the `\m{…markerId}` syntax to insert a marker. See the definition of [markers](#markers).
1194
1195_Attribute:_ `longPressKeyIds="…list of keyIds"` (optional)
1196
> A space-separated ordered list of `key` element ids, identifying the keys which can be emitted by "long-pressing" this key. This feature is prominent in mobile devices.
1198>
1199> In a list of keys specified by `longPressKeyIds`, the key matching `longPressDefaultKeyId` attribute (if present) specifies the default long-press target, which could be different than the first element. It is an error if the `longPressDefaultKeyId` key is not in the `longPressKeyIds` list.
1200>
1201> Implementations shall ignore any gestures (such as flick, multiTap, longPress) defined on keys in the `longPressKeyIds` list.
1202>
1203> For example, if the default key is a key whose [display](#element-displays) value is `{`, an implementation might render the key as follows:
1204>
1205> ![keycap hint](images/keycapHint.png)
1206>
1207> _Example:_
1208> - pressing the `o` key will produce `o`
1209> - holding down the key will produce a list `ó`, `{` (where `{` is the default and produces a marker)
1210>
> ```xml
> <displays>
>    <display output="\m{marker}" display="{" />
> </displays>
>
> <keys>
>    <key id="o" output="o" longPressKeyIds="o-acute marker" longPressDefaultKeyId="marker"/>
>    <key id="o-acute" output="ó"/>
>    <key id="marker" output="\m{marker}" />
> </keys>
> ```
1223
1224_Attribute:_ `longPressDefaultKeyId="…keyId"` (optional)
1225
> Specifies the default key, by id, in a list of long-press keys. See the discussion of `longPressKeyIds`, above.
1227
1228_Attribute:_ `multiTapKeyIds` (optional)
1229
> A space-separated ordered list of `key` element ids, where each successive key in the list is produced by the corresponding number of quick taps.
1231> It is an error for a key to reference itself in the `multiTapKeyIds` list.
1232>
1233> Implementations shall ignore any gestures (such as flick, multiTap, longPress) defined on keys in the `multiTapKeyIds` list.
1234>
1235> _Example:_
1236> - first tap on the key will produce “a”
1237> - two taps will produce “bb”
1238> - three taps on the key will produce “c”
1239> - four taps on the key will produce “d”
1240>
> ```xml
> <keys>
>    <key id="a" output="a" multiTapKeyIds="bb c d"/>
>    <key id="bb" output="bb" />
>    <key id="c" output="c" />
>    <key id="d" output="d" />
> </keys>
> ```
1249
1250**Note**: Behavior past the end of the multiTap list is implementation specific.
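
The long-press and multi-tap rules above can be sketched in Python as follows. Both function names are hypothetical. Returning `None` past the end of the multi-tap list is only one possible choice, since that behavior is implementation specific.

```python
from typing import List, Optional


def check_gestures(key_id: str, long_press_ids: str = "",
                   long_press_default: Optional[str] = None,
                   multi_tap_ids: str = "") -> List[str]:
    """Report the two error conditions described for gesture lists."""
    problems = []
    if long_press_default is not None and long_press_default not in long_press_ids.split():
        problems.append("longPressDefaultKeyId is not in longPressKeyIds")
    if key_id in multi_tap_ids.split():
        problems.append("key references itself in multiTapKeyIds")
    return problems


def multi_tap_target(key_id: str, multi_tap_ids: str, taps: int) -> Optional[str]:
    """Key id selected by the given number of quick taps (1 = the key itself)."""
    sequence = [key_id] + multi_tap_ids.split()
    return sequence[taps - 1] if 1 <= taps <= len(sequence) else None
```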
1251
1252_Attribute:_ `stretch="true"` (optional)
1253
1254> The `stretch` attribute indicates that a touch layout may stretch this key to fill available horizontal space on the row.
1255> This is used, for example, on the spacebar. Note that `stretch=` is ignored for hardware layouts.
1256
1257_Attribute:_ `layerId="shift"` (optional)
1258
1259> The `layerId` attribute indicates that this key switches to another `layer` with the specified id (such as `<layer id="shift"/>` in this example).
> Note that a key may have both a `layerId=` and an `output=` attribute, indicating that the key outputs _prior_ to switching layers.
1261> Also note that `layerId=` is ignored for hardware layouts: their shifting is controlled via
1262> the modifier keys.
1263>
1264> This attribute is an NMTOKEN.
1265>
1266> In the future, this attribute’s definition is expected to be updated to align with [UAX#31](https://www.unicode.org/reports/tr31/).
1267
1268
1269_Attribute:_ `width="1.2"` (optional, default "1.0")
1270
1271> The `width` attribute indicates that this key has a different width than other keys, by the specified number of key widths.
1272
1273```xml
1274<key id="wide-a" output="a" width="1.2"/>
1275<key id="wide-gap" gap="true" width="2.5"/>
1276```
1277
1278##### Implied Keys
1279
1280Not all keys need to be listed explicitly.  The following two can be assumed to already exist:
1281
1282```xml
1283<key id="gap" gap="true" width="1"/>
1284<key id="space" output=" " stretch="true" width="1"/>
1285```
1286
In addition, these 62 keys, comprising 10 digit keys, 26 Latin lower-case keys, and 26 Latin upper-case keys, where the `id` is the same as the `output`, are assumed to exist:
1288
```xml
<key id="0" output="0"/>
<key id="1" output="1"/>
<key id="2" output="2"/>
…
<key id="A" output="A"/>
<key id="B" output="B"/>
<key id="C" output="C"/>
…
<key id="a" output="a"/>
<key id="b" output="b"/>
<key id="c" output="c"/>
…
```
1303
1304These implied keys are available in a data file named `keyboards/import/keys-Latn-implied.xml` in the CLDR distribution for the convenience of implementations.
1305
1306Thus, the implied keys behave as if the following import were present.
1307
1308```xml
1309<keyboard3>
1310    <keys>
1311        <import base="cldr" path="45/keys-Latn-implied.xml" />
1312    </keys>
1313</keyboard3>
1314```
1315
1316**Note:** All implied keys may be overridden, as with all other imported data items. See the [`import`](#element-import) element for more details.
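
For illustration, overriding an implied key might look like the fragment below. This is a sketch that assumes a key declared locally with the same `id` takes precedence over the implied definition, as described for the `import` element; the width value is arbitrary.

```xml
<keys>
    <!-- redefines the implied "space" key so that it is three key widths wide by default -->
    <key id="space" output=" " stretch="true" width="3" />
</keys>
```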
1317
1318* * *
1319
1320### Element: flicks
1321
1322The `flicks` element is a collection of `flick` elements.
1323
1324> <small>
1325>
1326> Parents: [keyboard3](#element-keyboard3)
1327>
1328> Children: [flick](#element-flick), [import](#element-import), [_special_](tr35.md#special)
1329>
1330> Occurrence: optional, single
1331> </small>
1332
1333* * *
1334
1335#### Element: flick
1336
1337The `flick` element is used to generate results from a "flick" of the finger on a mobile device.
1338
1339**Syntax**
1340
1341```xml
1342<keyboard3>
1343    <keys>
1344        <key id="a" flickId="a-flicks" output="a" />
1345    </keys>
1346    <flicks>
1347        <flick id="a-flicks">
1348            <flickSegment … />
1349            <flickSegment … />
1350            <flickSegment … />
1351        </flick>
1352    </flicks>
1353</keyboard3>
1354```
1355
1356> <small>
1357>
1358> Parents: [flicks](#element-flicks)
1359>
1360> Children: [flickSegment](#element-flicksegment), [_special_](tr35.md#special)
1361>
1362> Occurrence: optional, multiple
1363>
1364> </small>
1365
1366_Attribute:_ `id` (required)
1367
1368> The `id` attribute identifies the flicks. It can be any NMTOKEN.
1369>
> The `id` attribute on `flick` elements is distinct from the `id` attribute on `key` elements.
1371> For example, it is permissible to have both `<key id="a" />` and
1372> `<flick id="a" />` which are two unrelated elements.
1373>
1374> In the future, this attribute’s definition is expected to be updated to align with [UAX#31](https://www.unicode.org/reports/tr31/).
1375
1376* * *
1377
1378#### Element: flickSegment
1379
1380> <small>
1381>
1382> Parents: [flick](#element-flick)
1383>
1384> Children: _none_
1385>
1386> Occurrence: required, multiple
1387>
1388> </small>
1389
1390_Attribute:_ `directions` (required)
1391
> The `directions` attribute value is a space-delimited list of keywords that describes a path, currently restricted to the cardinal and intercardinal directions `{n e s w ne nw se sw}`.
1393
1394_Attribute:_ `keyId` (required)
1395
1396> The `keyId` attribute value is the result of (one or more) flicks.
1397>
1398> Implementations shall ignore any gestures (such as flick, multiTap, longPress) defined on the key specified by `keyId`.
1399
1400
1401**Example**
1402where a flick to the Northeast then South produces `Å`.
1403
1404```xml
1405<keys>
1406    <key id="something" flickId="a" output="Something" />
    <key id="A-ring" output="Å" />
1408</keys>
1409
1410<flicks>
1411    <flick id="a">
1412        <flickSegment directions="ne s" keyId="A-ring" />
1413    </flick>
1414</flicks>
1415```
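
A `flick` may carry several `flickSegment` elements, one per path. The sketch below shows three paths on one key. All key ids other than `a` are hypothetical, and the choice of directions is only an illustration.

```xml
<keys>
    <key id="a" flickId="a-flicks" output="a" />
    <!-- hypothetical target keys for the flick paths -->
    <key id="a-grave" output="à" />
    <key id="a-acute" output="á" />
    <key id="a-circ" output="â" />
</keys>

<flicks>
    <flick id="a-flicks">
        <!-- single-direction flicks -->
        <flickSegment directions="nw" keyId="a-grave" />
        <flickSegment directions="ne" keyId="a-acute" />
        <!-- a two-step path: North, then South -->
        <flickSegment directions="n s" keyId="a-circ" />
    </flick>
</flicks>
```

Any gestures defined on `a-grave`, `a-acute`, or `a-circ` themselves would be ignored when those keys are reached through a flick.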
1416
1417* * *
1418
1419### Element: forms
1420
This element contains a set of `form` elements, each of which defines the layout of a particular hardware form.
1422
1423
1424> <small>
1425>
1426> Parents: [keyboard3](#element-keyboard3)
1427>
1428> Children: [import](#element-import), [form](#element-form), [_special_](tr35.md#special)
1429>
1430> Occurrence: optional, single
1431>
1432> </small>
1433
1434***Syntax***
1435
1436```xml
1437<forms>
1438    <form id="iso">
1439        <!-- … -->
1440    </form>
1441    <form id="us">
1442        <!-- … -->
1443    </form>
1444</forms>
1445```
1446
1447* * *
1448
1449### Element: form
1450
This element defines the layout of a particular hardware form.
1452
1453> *Note:* Most keyboards will not need to use this element directly, and the CLDR repository will not accept keyboards which define a custom `form` element.  This element is provided for two reasons:
1454
14551. To formally specify the standard hardware arrangements used with CLDR for implementations. Implementations can verify the arrangement, and validate keyboards against the number of rows and the number of keys per row.
1456
14572. To allow a way to customize the scancode layout for keyboards not intended to be included in the common CLDR repository.
1458
1459See [Implied Form Values](#implied-form-values), below.
1460
1461> <small>
1462>
1463> Parents: [forms](#element-forms)
1464>
1465> Children: [scanCodes](#element-scancodes), [_special_](tr35.md#special)
1466>
1467> Occurrence: optional, multiple
1468>
1469> </small>
1470
1471_Attribute:_ `id` (required)
1472
1473> This attribute specifies the form id. The value may not be `touch`.
1474
1475> Must match `[A-Za-z0-9][A-Za-z0-9_-]*`
1476
1477
1478***Syntax***
1479
1480```xml
1481<form id="us">
1482    <scanCodes codes="00 01 02"/>
1483    <scanCodes codes="03 04 05"/>
1484</form>
1485```
1486
1487##### Implied Form Values
1488
There is an implied set of `<form>` elements corresponding to the default forms. Implementations must therefore behave as if the following import statement were present:
1490
1491```xml
1492<keyboard3>
1493    <forms>
1494        <import base="cldr" path="45/scanCodes-implied.xml" /> <!-- the version will match the current conformsTo of the file -->
1495    </forms>
1496</keyboard3>
1497```
1498
1499Here is a summary of the implied form elements. Keyboards included in the CLDR Repository must only use these `formId=` values and may not override the scanCodes.
1500
1501> - `touch` - Touch (non-hardware) layout.
1502> - `abnt2` - Brazilian 103 key ABNT2 layout (iso + extra key near right shift)
1503> - `iso` - European 102 key layout (extra key near left shift)
1504> - `jis` - Japanese 109 key layout
1505> - `us` - ANSI 101 key layout
1506> - `ks` - Korean KS layout
1507
1508* * *
1509
1510### Element: scanCodes
1511
1512This element contains a keyboard row, and defines the scan codes for the non-frame keys in that row.
1513
1514> <small>
1515>
1516> Parents: [form](#element-form)
1517>
1518> Children: none
1519>
1520> Occurrence: required, multiple
1521>
1522> </small>
1523
_Attribute:_ `codes` (required)
1525
1526> The `codes` attribute is a space-separated list of 2-digit hex bytes, each representing a scan code.
1527
1528**Syntax**
1529
1530```xml
1531<scanCodes codes="29 02 03 04 05 06 07 08 09 0A 0B 0C 0D" />
1532```
1533
1534* * *
1535
1536### Element: layers
1537
1538This element contains a set of `layer` elements with a specific physical form factor, whether
1539hardware or touch layout.
1540
1541> <small>
1542>
1543> Parents: [keyboard3](#element-keyboard3)
1544>
1545> Children: [import](#element-import), [layer](#element-layer), [_special_](tr35.md#special)
1546>
1547> Occurrence: required, multiple
1548>
1549> </small>
1550
1551- At least one `layers` element is required.
1552
1553_Attribute:_ `formId` (required)
1554
1555> This attribute specifies the physical layout of a hardware keyboard,
1556> or that the form is a `touch` layout.
1557>
1558> When using an on-screen touch keyboard, if the keyboard does not specify a `<layers formId="touch">`
> element, a `<layers formId="…formId">` element can be used as a fallback alternative.
1560> If there is no `hardware` form, the implementation may need
1561> to choose a different keyboard file, or use some other fallback behavior when using a
1562> hardware keyboard.
1563>
1564> Because a hardware keyboard facilitates non-trivial amounts of text input,
1565> and many touch devices can also be connected to a hardware keyboard, it
1566> is recommended to always have a hardware (non-touch) form.
1567>
1568> Multiple `<layers formId="touch">` elements are allowed with distinct `minDeviceWidth` values.
1569> At most one hardware (non-`formId="touch"`) `<layers>` element is allowed. If a different key arrangement is desired between, for example, `us` and `iso` formats, these should be separated into two different keyboards.
1570>
> The typical keyboard author will be designing a keyboard based on their circumstances and the hardware that they are using. So, for example, if they are in South East Asia, they will almost certainly be using a 101-key hardware keyboard with US key caps. So we want them to be able to reference that (`<layers formId="us">`) in their design, rather than having to work with an unfamiliar form.
1572>
> A mismatch between the hardware layout in the keyboard file and the actual hardware used by the user could result in some keys being inaccessible to the user if their hardware cannot generate the scancodes corresponding to the layout specified by the `formId=` attribute. Such keys could be accessed only via an on-screen keyboard utility. Conversely, if the user's hardware has keys that are not present in the specified `formId=`, those keys will have no function when pressed.
1574>
1575> The value of the `formId=` attribute may be `touch`, or correspond to a `form` element. See [`form`](#element-form).
1576>
1577
1578_Attribute:_ `minDeviceWidth`
1579
1580> This attribute specifies the minimum required width, in millimeters (mm), of the touch surface.  The `layers` entry with the greatest matching width will be selected. This attribute is intended for `formId="touch"`, but is supported for hardware forms.
1581>
1582> This must be a whole number between 1 and 999, inclusive.
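
The following sketch shows two touch `layers` elements distinguished by `minDeviceWidth`. The widths of 50 mm and 150 mm are arbitrary illustrative values, and only implied keys are used. Under the selection rule above, a 120 mm touch surface would use the first element, and a 200 mm surface would use the second.

```xml
<!-- selected for touch surfaces at least 50 mm wide but narrower than 150 mm -->
<layers formId="touch" minDeviceWidth="50">
    <layer id="base">
        <row keys="q w e r t y u i o p" />
        <row keys="a s d f g h j k l" />
        <row keys="z x c v b n m" />
        <row keys="space" />
    </layer>
</layers>

<!-- selected for touch surfaces at least 150 mm wide: adds a digit row -->
<layers formId="touch" minDeviceWidth="150">
    <layer id="base">
        <row keys="1 2 3 4 5 6 7 8 9 0" />
        <row keys="q w e r t y u i o p" />
        <row keys="a s d f g h j k l" />
        <row keys="z x c v b n m" />
        <row keys="space" />
    </layer>
</layers>
```
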
1583
1584### Element: layer
1585
1586A `layer` element describes the configuration of keys on a particular layer of a keyboard. It contains one or more `row` elements to describe which keys exist in each row.
1587
1588**Syntax**
1589
1590```xml
1591<layer id="…layerId" modifiers="…modifier modifier, …modifier modifier, …">
1592    <row …/>
1593    <row …/>
    …
</layer>
1596```
1597
1598> <small>
1599>
> Parents: [layers](#element-layers)
1601>
1602> Children: [row](#element-row), [_special_](tr35.md#special)
1603>
1604> Occurrence: optional, multiple
1605>
1606> </small>
1607
_Attribute:_ `id` (required for `touch`)
1609
> The `id` attribute identifies the layer for touch layouts.  This identifier specifies the layer as the target for layer switching, as specified by the `layerId=` attribute on the [`<key>`](#element-key) element.
1611> Touch layouts must have one `layer` with `id="base"` to serve as the base layer.
1612>
1613> Must match `[A-Za-z0-9][A-Za-z0-9_-]*`
1614
1615_Attribute:_ `modifiers` (required for `hardware`)
1616
> This has two roles. It acts as an identifier for the `layer` element for hardware keyboards (in the absence of the `id=` attribute) and also provides the linkage from the hardware modifiers into the correct `layer`.
1618>
1619> For hardware layouts, the use of `@modifiers` as an identifier for a layer is sufficient since it is always unique among the set of `layer` elements in each  `form`.
1620>
1621> This attribute value is a list of lists. It is a comma-separated (`,`) list of modifier sets, and each modifier set is a space-separated list of modifier components.
1622>
1623> Each modifier component must match `[A-Za-z0-9]+`. Extra whitespace is ignored.
1624>
1625> To indicate that no modifiers apply, the reserved name of `none` is used.
1626
1627**Syntax**
1628
1629```xml
1630<layer id="base"        modifiers="none">
1631    <row keys="a" />
1632</layer>
1633
1634<layer id="upper"       modifiers="shift">
1635    <row keys="A" />
1636</layer>
1637
1638<layer id="altgr"       modifiers="altR">
1639    <row keys="a-umlaut" />
1640</layer>
1641
1642<layer id="upper-altgr" modifiers="altR shift">
1643    <row keys="A-umlaut" />
1644</layer>
1645```
1646
1647#### Layer Modifier Sets
1648
1649The `@modifiers` attribute value contains one or more Layer Modifier Sets, separated by commas.
1650For example, in the element `<layer … modifiers="ctrlL altL, altR" …` the attribute value consists of two sets:
1651
1652- `ctrlL altL` (two components)
1653- `altR` (one component)
1654
The order of the sets and the order of the components within each set is not significant. However, for clarity in reading, the canonical order within a set is the order listed in Layer Modifier Components; the canonical order for the sets should be first by the cardinality of the sets (least first), then alphabetical.
1656
1657#### Layer Modifier Components
1658
1659Within a Layer Modifier Set, the following modifier components can be used, separated by spaces.
1660
1661 - `none` (no modifier)
1662 - `alt`
1663 - `altL`
1664 - `altR`
1665 - `caps`
1666 - `ctrl`
1667 - `ctrlL`
1668 - `ctrlR`
1669 - `shift`
1670 - `other` (matches if no other layers match)
1671
16721. `alt` in this specification is referred to on some platforms as "opt" or "option".
1673
16742. `none` and `other` may not be combined with any other components.
1675
1676#### Modifier Left- and Right- keys
1677
16781. `L` or `R` indicates a left- or right- side modifier only (such as `altL`)
1679 whereas `alt` indicates _either_ left or right alt key (that is, `altL` or `altR`). `ctrl` indicates either left or right ctrl key (that is, `ctrlL` or `ctrlR`).
1680
16812. Keyboard implementations must warn if a keyboard mixes `alt` with `altL`/`altR`, or `ctrl` with `ctrlL`/`ctrlR`.
1682
3. Left- and right- side modifiers may not be mixed together in a single `modifiers` attribute value, so neither `altL ctrlR` nor `altL altR` is allowed.
1684
16854. `shift` indicates either shift key. The left and right shift keys are not distinguishable in this specification.
1686
1687#### Layer Modifier Matching
1688
1689Layers are matched exactly based on the modifier keys which are down. For example:
1690
1691- `none` as a modifier will only match if *all* of the keys `caps`, `alt`, `ctrl` and `shift` are up.
1692
- `alt` as a modifier will only match if either `alt` key is down, *and* `caps`, `ctrl`, and `shift` are up.
1694
1695- `altL ctrl` as a modifier will only match if the left `alt` is down, either `ctrl` is down, *and* `shift` and `caps` are up.
1696
1697- `other` as a modifier will match if no other layers match.
1698
1699Multiple modifier sets are separated by commas.  For example, `none, shift caps` will match either no modifiers *or* shift and caps.  `ctrlL altL, altR` will match either  left-control and left-alt, *or* right-alt.
1700
Keystrokes must be ignored when no layer explicitly matches and there is no `other` layer. Example: If there is a `ctrl` layer and a `shift` layer, but neither a `ctrl shift` layer nor an `other` layer, no output will result from `ctrl shift X`.
1702
1703Layers are not allowed to overlap in their matching.  For example, the keyboard author will receive an error if one layer specifies `alt shift` and another layer specifies `altR shift`.
1704
1705There is one special case:  the `other` layer matches if and only if no other layer matches. Thus logically the `other` layer is matched after all other layers have been checked.
1706
1707Because there is no overlap allowed between layers, the order of `<layer>` elements is not significant.
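
The matching rules above can be illustrated with a small hardware fragment. This is a sketch only: the row contents are placeholders, and the comments describe the matching that would be expected under the exact-match rule, assuming the `us` form.

```xml
<layers formId="us">
    <!-- matches when no modifier is down, or when shift and caps are both down -->
    <layer modifiers="none, shift caps">
        <row keys="a" />
    </layer>

    <!-- matches caps alone, or shift alone; does not overlap with the layer above -->
    <layer modifiers="caps, shift">
        <row keys="A" />
    </layer>

    <!-- matches every remaining combination, such as ctrl shift or altR -->
    <layer modifiers="other">
        <row keys="a" />
    </layer>
</layers>
```

Without the `other` layer, a keystroke made with, for example, `ctrl` down would match no layer and would be ignored.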
1708
1709> Note: The modifier syntax may be enhanced in the future, but will remain backwards compatible with the syntax described here.
1710
1711* * *
1712
1713### Element: row
1714
1715A `row` element describes the keys that are present in the row of a keyboard.
1716
1717**Syntax**
1718
1719```xml
1720<row keys="…keyId …keyId …" />
1721```
1722
1723> <small>
1724>
1725> Parents: [layer](#element-layer)
1726>
1727> Children: _none_
1728>
1729> Occurrence: required, multiple
1730>
1731> </small>
1732
1733_Attribute:_ `keys` (required)
1734
1735> This is a string that lists the id of [`key` elements](#element-key) for each of the keys in a row, whether those are explicitly listed in the file or are implied.  See the `key` documentation for more detail.
1736>
> For non-`touch` forms, the number of keys in each row may not exceed the number of scan codes defined for that row, and the number of rows may not exceed the defined number of rows for that form. See [`scanCodes`](#element-scancodes).
1738
1739**Example**
1740
1741Here is an example of a `row` element:
1742
1743```xml
1744<row keys="a z e r t y u i o p caret dollar" />
1745```
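
To make the row constraint for non-`touch` forms concrete, the sketch below defines a hypothetical custom form named `tiny` with two rows of three scan codes each, and a layer that stays within those limits. The form id and scan code values are illustrative only; as noted under `form`, keyboards with custom forms are not accepted into the CLDR repository.

```xml
<forms>
    <!-- hypothetical form: two rows, three scan codes per row -->
    <form id="tiny">
        <scanCodes codes="00 01 02"/>
        <scanCodes codes="03 04 05"/>
    </form>
</forms>

<layers formId="tiny">
    <layer modifiers="none">
        <!-- three keys: allowed, since the first row defines three scan codes -->
        <row keys="a b c" />
        <!-- two keys: allowed, since a row may list fewer keys than scan codes -->
        <row keys="d e" />
        <!-- a third row, or a fourth key in either row, would be invalid for this form -->
    </layer>
</layers>
```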
1746
1747* * *
1748
1749### Element: variables
1750
1751> <small>
1752>
1753> Parents: [keyboard3](#element-keyboard3)
1754>
1755> Children: [import](#element-import), [_special_](tr35.md#special), [string](#element-string), [set](#element-set), [uset](#element-uset)
1756>
1757> Occurrence: optional, single
1758> </small>
1759
1760This is a container for variables to be used with [transform](#element-transform), [display](#element-display) and [key](#element-key) elements.
1761
1762Note that the `id=` attribute value must be unique across all children of the `variables` element.
1763
1764**Example**
1765
1766```xml
1767<variables>
1768    <string id="y" value="yes" /> <!-- a simple string-->
1769    <set id="upper" value="A B C D E FF" /> <!-- a set with 6 items -->
1770    <uset id="consonants" value="[कसतनमह]" /> <!-- a UnicodeSet -->
1771</variables>
1772```
1773
1774* * *
1775
1776### Element: string
1777
1778> <small>
1779>
1780> Parents: [variables](#element-variables)
1781>
1782> Children: _none_
1783>
1784> Occurrence: optional, multiple
1785> </small>
1786
1787> This element contains a single string which is used by the [transform](#element-transform) elements for string matching and substitution, as well as by the [key](#element-key) and [display](#element-display) elements.
1788
1789_Attribute:_ `id` (required)
1790
1791> Specifies the identifier (name) of this string.
1792> All ids must be unique across all types of variables.
1793>
1794> `id` must match `[0-9A-Za-z_]{1,32}`
1795
1796_Attribute:_ `value` (required)
1797
> Strings may contain whitespace. However, for clarity, it is recommended to escape spacing marks, even in strings.
1799> This attribute value may be escaped with `\u` notation, see [Escaping](#escaping).
1800> Variables may refer to other string variables if they have been previously defined, using `${string}` syntax.
1801> [Markers](#markers) may be included with the `\m{…}` notation.
1802
1803**Example**
1804
1805```xml
1806<variables>
1807    <string id="cluster_hi" value="हि" /> <!-- a string -->
1808    <string id="zwnj" value="\u{200C}"/> <!-- single codepoint -->
1809    <string id="acute" value="\m{acute}"/> <!-- refer to a marker -->
1810    <string id="backquote" value="`"/>
1811    <string id="zwnj_acute" value="${zwnj}${acute}"  /> <!-- Combine two variables -->
1812    <string id="zwnj_sp_acute" value="${zwnj}\u{0020}${acute}"  /> <!-- Combine two variables -->
1813</variables>
1814```
1815
1816These may be then used in multiple contexts:
1817
1818```xml
1819<!-- as part of a regex -->
1820<transform from="${cluster_hi}X" to="X" />
1821<transform from="Y" to="${cluster_hi}" />
…
<!-- as part of a key bag  -->
1824<key id="hi_key" output="${cluster_hi}" />
1825<key id="acute_key" output="${acute}" />
…
<!-- Display ` instead of the non-displayable marker -->
1828<display output="${acute}" display="${backquote}" />
1829```
1830
1831* * *
1832
1833### Element: set
1834
1835> <small>
1836>
1837> Parents: [variables](#element-variables)
1838>
1839> Children: _none_
1840>
1841> Occurrence: optional, multiple
1842> </small>
1843
1844> This element contains a set of strings used by the [transform](#element-transform) elements for string matching and substitution.
1845
1846_Attribute:_ `id` (required)
1847
1848> Specifies the identifier (name) of this set.
1849> All ids must be unique across all types of variables.
1850>
1851> `id` must match `[0-9A-Za-z_]{1,32}`
1852
1853_Attribute:_ `value` (required)
1854
1855> The `value` attribute value is always a set of strings separated by whitespace, even if there is only a single item in the set, such as `"A"`.
1856> Leading and trailing whitespace is ignored.
1857> This attribute value may be escaped with `\u` notation, see [Escaping](#escaping).
1858> Sets may refer to other string variables if they have been previously defined, using `${string}` syntax, or to other previously-defined sets using `$[set]` syntax.
1859> Set references must be separated by whitespace: `$[set1]$[set2]` is an error; instead use `$[set1] $[set2]`.
1860> [Markers](#markers) may be included with the `\m{…}` notation.
1861
1862**Examples**
1863
1864```xml
1865<variables>
1866    <set id="upper" value="A B CC D E FF " /> <!-- 6 items -->
1867    <set id="lower" value="a b c  d e  f " /> <!-- 6 items -->
1868    <set id="upper_or_lower" value="$[upper] $[lower]"  /> <!-- Concatenate two sets -->
1869    <set id="lower_or_upper" value="$[lower] $[upper]"  /> <!-- Concatenate two sets -->
1870    <set id="a" value="A"/> <!-- Just one element, an 'A'-->
    <set id="cluster_or_zwnj" value="${cluster_hi} ${zwnj}"/> <!-- 2 items: "हि \u{200C}" -->
1872</variables>
1873```
1874
Match "X" followed by any item in the `upper` set:
1876
1877```xml
1878<transform from="X$[upper]" to="…" />
1879```
1880
1881Map from upper to lower:
1882
1883```xml
1884<transform from="($[upper])" to="$[1:lower]" />
1885```
1886
1887See [transform](#element-transform) for further details and syntax.
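
As a worked illustration of the upper-to-lower mapping, the sketch below assumes that a captured item is mapped to the item at the same position in the target set, which is the behavior described for mapping in the `transform` section. The `type` value and the enclosing `transformGroup` are shown only to make the fragment complete.

```xml
<variables>
    <set id="upper" value="A B CC D E FF" /> <!-- 6 items -->
    <set id="lower" value="a b c d e f" />   <!-- 6 items, same positions -->
</variables>

<transforms type="simple">
    <transformGroup>
        <!-- "CC" is the 3rd item of upper, so it would map to the 3rd item of lower: "c" -->
        <!-- "FF" is the 6th item of upper, so it would map to the 6th item of lower: "f" -->
        <transform from="($[upper])" to="$[1:lower]" />
    </transformGroup>
</transforms>
```

Under this assumption the two sets need the same number of items for the mapping to be well defined.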
1888
1889* * *
1890
1891### Element: uset
1892
1893> <small>
1894>
1895> Parents: [variables](#element-variables)
1896>
1897> Children: _none_
1898>
1899> Occurrence: optional, multiple
1900> </small>
1901
1902> This element contains a set, using a subset of the [UnicodeSet](tr35.md#Unicode_Sets) format, used by the [`transform`](#element-transform) elements for string matching and substitution.
1903> Note important restrictions on the syntax below.
1904
1905_Attribute:_ `id` (required)
1906
1907> Specifies the identifier (name) of this uset.
1908> All ids must be unique across all types of variables.
1909>
1910> `id` must match `[0-9A-Za-z_]{1,32}`
1911
1912_Attribute:_ `value` (required)
1913
1914> String value in a subset of [UnicodeSet](tr35.md#Unicode_Sets) format.
1915> Leading and trailing whitespace is ignored.
1916> Variables may refer to other string variables if they have been previously defined, using `${string}` syntax, or to other previously-defined `uset` elements (not `set` elements) using `$[...usetId]` syntax.
1917
1918
1919- Warning: `uset` elements look superficially similar to regex character classes as used in [`transform`](#element-transform) elements, but they are different. `uset`s must be defined with a `uset` element, and referenced with the `$[...usetId]` notation in transforms. `uset`s cannot be specified inline in a transform, and can only be used indirectly by reference to the corresponding `uset` element.
1920- Multi-character strings (`{}`) are not supported, such as `[żġħ{ie}{għ}]`.
1921- UnicodeSet property notation (`\p{…}` or `[:…:]`) may **NOT** be used.
1922
1923> **Rationale**: allowing property notation would make keyboard implementations dependent on a particular version of Unicode. However, implementations and tools may wish to pre-calculate the value of a particular uset, and "freeze" it as explicit code points.  The example below of `$[KhmrMn]` matches nonspacing marks in the `Khmr` script.
1924
1925- `uset` elements may represent a very large number of codepoints. Keyboard implementations may set a limit on how many unique range entries may be matched.
1926- The `uset` element may not be used as the source or target for mapping operations (`$[1:variable]` syntax).
1927- The `uset` element may not be referenced by [`key`](#element-key) or [`display`](#element-display) elements.
1928
1929**Examples**
1930
1931```xml
1932<variables>
1933  <uset id="consonants" value="[कसतनमह]" /> <!-- unicode set range -->
1934  <uset id="range" value="[a-z D E F G \u{200A}]" /> <!-- a through z, plus a few others -->
1935  <uset id="newrange" value="[$[range]-[G]]" /> <!-- The above range, but not including G -->
  <uset id="KhmrMn" value="[\u{17B4}\u{17B5}\u{17B7}-\u{17BD}\u{17C6}\u{17C9}-\u{17D3}\u{17DD}]" /> <!--  [[:Khmr:]&[:Mn:]] as of Unicode 15.0-->
1937</variables>
1938```
1939
1940* * *
1941
1942### Element: transforms
1943
1944This element defines a group of one or more `transform` elements associated with this keyboard layout. This is used to support features such as dead-keys, character reordering, backspace behavior, etc. using a straightforward structure that works for all the keyboards tested, and that results in readable source data.
1945
1946There can be multiple `<transforms>` elements, but only one for each `type`.
1947
1948**Syntax**
1949
1950```xml
1951<transforms type="…type">
1952    <transformGroup …/>
1953    <transformGroup …/>
    …
</transforms>
1956```
1957
1958> <small>
1959>
1960> Parents: [keyboard3](#element-keyboard3)
1961>
1962> Children: [import](#element-import), [_special_](tr35.md#special), [transformGroup](#element-transformgroup)
1963>
1964> Occurrence: optional, multiple
1965>
1966> </small>
1967
1968_Attribute:_ `type` (required)
1969
1970> Values: `simple`, `backspace`
1971
There are other keying behaviors that are needed, particularly in handling complex orthographies from various parts of the world. The behaviors intended to be covered by the transforms are:
1973
1974* Reordering combining marks. The order required for underlying storage may differ considerably from the desired typing order. In addition, a keyboard may want to allow for different typing orders.
1975* Error indication. Sometimes a keyboard layout will want to specify to the application that a particular keying sequence in a context is in error and that the application should indicate that that particular keypress is erroneous.
1976* Backspace handling. There are various approaches to handling the backspace key. An application may treat it as an undo of the last key input, or it may simply delete the last character in the currently output text, or it may use transform rules to tell it how much to delete.
1977
1978#### Markers
1979
1980Markers are placeholders which record some state, but without producing normal visible text output.  They were designed particularly to support dead-keys.
1981
1982The marker ID is any valid `NMTOKEN`.
1983
1984Consider the following abbreviated example:
1985
1986```xml
    <display output="\m{circ_marker}" display="^" />
…
    <key id="circ_key" output="\m{circ_marker}" />
    <key id="e" output="e" />
…
    <transform from="\m{circ_marker}e" to="ê" />
1993```
1994
19951. The user presses the `circ_key` key. The key can be shown with the keycap `^` due to the `<display>` element.
1996
19972. The special marker, `circ_marker`, is added to the end of the input context.
1998
1999    The input context does not match any transforms.
2000
2001    The input context has:
2002
2003    - …
2004    - marker `circ_marker`
2005
3. Also due to the `<display>` element, implementations can opt to display a visible `^` (perhaps visually distinct from a plain `^` caret). Implementations may opt to display nothing and only store the marker in the input context.
2007
20084. The user now presses the `e` key, which is also added to the input context. The input context now has:
2009
2010    - …
    - marker `circ_marker`
    - character `e`
2013
20145. Now, the input context matches the transform.  The `e` and the marker are replaced with `ê`.
2015
2016    The input context now has:
2017
2018    - …
2019    - character `ê`
2020
2021**Using markers to inhibit other transforms**
2022
2023Sometimes it is desirable to prevent transforms from having an effect.
2024Perhaps two different keys output the same characters, with different key or modifier combinations, but only one of them is intended to participate in a transform.
2025
2026Consider the following case, where pressing the keys `X`, `e` results in `^e`, which is transformed into `ê`.
2027
2028```xml
2029<keys>
2030    <key id="X" output="^"/>
2031    <key id="e" output="e" />
2032</keys>
2033<transforms>
    <transform from="^e" to="ê"/>
2035</transforms>
2036```
2037
2038However, what if the user wanted to produce `^e` without the transform taking effect?
2039One strategy would be to use a marker, which won’t be visible in the output, but will inhibit the transform.
2040
2041```xml
2042<keys>
2043    <key id="caret" output="^\m{no_transform}"/>
2044    <key id="X" output="^" />
2045    <key id="e" output="e" />
2046</keys>
…
<transforms>
    <!-- this wouldn't match the key caret output because of the marker -->
    <transform from="^e" to="ê"/>
2051</transforms>
2052```
2053
2054Pressing `caret` `e` will result in `^e` (with an invisible _no_transform_ marker — note that any name could be used). The `^e` won’t have the transform applied, at least while the marker’s context remains valid.
2055
2056Another strategy might be to use a marker to indicate where transforms are desired, instead of where they aren't desired.
2057
2058```xml
2059<keys>
2060    <key id="caret" output="^"/>
2061    <key id="X" output="^\m{transform}"/>
2062    <key id="e" output="e" />
2063</keys>
…
<transforms …>
    <!-- Won't match ^e without marker. -->
    <transform from="^\m{transform}e" to="ê"/>
2068</transforms>
2069```
2070
2071In this way, only the `X`, `e` keys will produce `^e` with a _transform_ marker (again, any name could be used) which will cause the transform to be applied. One benefit is that navigating to an existing `^` in a document and adding an `e` will result in `^e`, and this output will not be affected by the transform, because there will be no marker present there (remember that markers are not stored with the document but only recorded in memory temporarily during text input).
2072
2073Please note important considerations for [Normalization and Markers](#normalization-and-markers).
2074
2075**Effect of markers on final text**
2076
2077All markers must be removed before text is returned to the application from the input context.
2078If the input context changes, such as if the cursor or mouse moves the insertion point somewhere else, all markers in the input context are removed.
2079
2080**Implementation Notes**
2081
2082Ideally, markers are implemented entirely out-of-band from the normal text stream. However, implementations _may_ choose to map each marker to a [Unicode private-use character](https://www.unicode.org/glossary/#private_use_character) for use only within the implementation’s processing and temporary storage in the input context.
2083
2084For example, the first marker encountered could be represented as U+E000, the second by U+E001 and so on.  If a regex processing engine were used, then those PUA characters could be processed through the existing regex processing engine.  `[^\u{E000}-\u{E009}]` could be used as an expression to match a character that is not a marker, and `[Ee]\u{E000}` could match `E` or `e` followed by the first marker.
2085
2086Such implementations must take care to remove all such markers (see prior section) from the resultant text. As well, implementations must take care to avoid conflicts if applications themselves are using PUA characters, such as is often done with not-yet-encoded scripts or characters.
2087
2088* * *
2089
2090### Element: transformGroup
2091
2092> <small>
2093>
2094> Parents: [transforms](#element-transforms)
2095>
2096> Children: [import](#element-import), [reorder](#element-reorder), [_special_](tr35.md#special), [transform](#element-transform)
2097>
2098> Occurrence: optional, multiple
2099> </small>
2100
2101A `transformGroup` contains a set of transform elements or reorder elements.
2102
2103Each `transformGroup` is processed entirely before proceeding to the next one.
2104
2105
2106Each `transformGroup` element, after imports are processed, must have either [reorder](#element-reorder) elements or [transform](#element-transform) elements, but not both. The `<transformGroup>` element may not be empty.
2107
2108**Examples**
2109
2110
2111#### Example: `transformGroup` with `transform` elements
2112
2113This is a `transformGroup` that consists of one or more [`transform`](#element-transform) elements, prefaced by one or more `import` elements. See the discussion of those elements for details. `import` elements in this group may not import `reorder` elements.
2114
2115
2116```xml
2117<transformGroup>
2118    <import path="…"/> <!-- optional import elements-->
2119    <transform />
2120    <!-- other <transform/> elements -->
2121</transformGroup>
2122```
2123
2124
2125#### Example: `transformGroup` with `reorder` elements
2126
This is a `transformGroup` that consists of one or more [`reorder`](#element-reorder) elements, optionally prefaced by one or more `import` elements that import `reorder` elements. See the discussion of those elements for details.
2128
2129`import` elements in this group may not import `transform` elements.
2130
2131```xml
2132<transformGroup>
2133    <import path="…"/> <!-- optional import elements-->
2134    <reorder … />
2135    <!-- other <reorder> elements -->
2136</transformGroup>
2137```
2138
2139* * *
2140
2141### Element: transform
2142
2143This element contains a single transform that may be performed using the keyboard layout. A transform is an element that specifies a set of conversions from sequences of code points into (one or more) other code points. For example, in most French keyboards hitting the `^` dead-key followed by the `e` key produces `ê`.
2144
2145Matches are processed against the "input context", a temporary buffer containing all relevant text up to the insertion point. If the user moves the insertion point, the input context is discarded and recreated from the application’s text buffer.  Implementations may discard the input context at any time.
2146
2147The input context may contain, besides regular text, any [Markers](#markers) as a result of keys or transforms, since the insertion point was moved.
2148
2149Using regular expression terminology, matches are done as if there was an implicit `$` (match end of buffer) at the end of each pattern. In other words, `<transform from="ke" …>` will not match an input context ending with `…keyboard`, but it will match the last two codepoints of an input context ending with `…awake`.
2150
2151All of the `transform` elements in a `transformGroup` are tested for a match, in order, until a match is found. Then, the matching element is processed, and then processing proceeds to the **next** `transformGroup`. If none of the `transform` elements match, processing proceeds without modification to the buffer to the **next** `transformGroup`.
2152
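The end-anchored matching and the first-match-per-group processing described above might be sketched like this. The sketch assumes that each rule has already been translated into Python `re` syntax (including the replacement, which uses Python template syntax rather than `$1`), and the function names `apply_group` and `apply_groups` are hypothetical.

```python
import re


def apply_group(context, transforms):
    """Apply the first matching (from, to) rule of one transformGroup.

    Each `from` pattern is treated as if it ended with an implicit
    end-of-buffer anchor, so it can only match text ending at the
    insertion point.
    """
    for pattern, replacement in transforms:
        match = re.search("(?:" + pattern + r")\Z", context)
        if match:
            # Replace the matched tail of the context, then stop testing rules.
            return context[:match.start()] + match.expand(replacement)
    return context  # no rule matched: the buffer is left unmodified


def apply_groups(context, groups):
    """Process each transformGroup in order, feeding the result to the next."""
    for transforms in groups:
        context = apply_group(context, transforms)
    return context
```
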
2153**Syntax**
2154
2155```xml
2156<transform from="…matching pattern" to="…output pattern"/>
2157```
2158
2159> <small>
2160>
2161> Parents: [transformGroup](#element-transformgroup)
2162> Children: _none_
2163> Occurrence: required, multiple
2164>
2165> </small>
2166
2167
2168_Attribute:_ `from` (required)
2169
2170> The `from` attribute value consists of an input rule for matching the input context.
2171>
> The `transform` rule and output pattern use a modified, mostly subsetted, regular expression syntax, with EcmaScript syntax (with the `u` Unicode flag) as its baseline reference (see [MDN-REGEX](https://developer.mozilla.org/docs/Web/JavaScript/Guide/Regular_Expressions)). Differences from regex implementations will be noted.
2173
2174#### Regex-like Syntax
2175
2176- **Simple matches**
2177
    `abc` `𐒵`
2179
2180- **Unicode codepoint escapes**
2181
2182    `\u{1234} \u{012A}`
2183    `\u{22} \u{012a} \u{1234A}`
2184
2185    The hex escaping is case insensitive. The value may not match a surrogate or illegal character, nor a marker character.
2186    The form `\u{…}` is preferred as it is the same regardless of codepoint length.
2187
2188- **Fixed character classes and escapes**
2189
2190    `\s \S \t \r \n \f \v \\ \$ \d \w \D \W \0`
2191
    The values of these classes do not change with Unicode versions.
2193
2194    `\s` for example is exactly `[\f\n\r\t\v\u{00a0}\u{1680}\u{2000}-\u{200a}\u{2028}\u{2029}\u{202f}\u{205f}\u{3000}\u{feff}]`
2195
2196    `\\` and `\$` evaluate to `\` and `$`, respectively.
2197
2198- **Character classes**
2199
2200    `[abc]` `[^def]` `[a-z]` `[ॲऄ-आइ-ऋ]` `[\u{093F}-\u{0944}\u{0962}\u{0963}]`
2201
2202    - supported
2203    - no Unicode properties such as `\p{…}`
    - Warning: Character classes look superficially similar to [`uset`](#element-uset) elements, but the two are distinct. A `uset` is referenced with the `$[...usetId]` notation in transforms; the `uset` notation cannot be embedded directly in a transform.
2205
2206- **Bounded quantifier**
2207
2208    `{x,y}`
2209
2210    `x` and `y` are required single digits representing the minimum and maximum number of occurrences.
2211    `x` must be ≥ 0, `y` must be ≥ x and ≥ 1
2212
2213- **Optional Specifier**
2214
2215    `?` - equivalent of `{0,1}`
2216
2217- **Numbered Capture Groups**
2218
2219    `([abc])([def])` (up to 9 groups)
2220
2221    These refer to groups captured as a set, and can be referenced with the `$1` through `$9` operators in the `to=` pattern. May not be nested.
2222
2223- **Non-capturing groups**
2224
2225    `(?:thismatches)`
2226
2227- **Nested capturing groups**
2228
2229    `(?:[abc]([def]))|(?:[ghi])`
2230
    Groups may be nested; however, only the innermost group may be a capturing group. The outer group must be a non-capturing group.
2232
2233- **Disjunctions**
2234
2235    `abc|def`
2236
2237    Match either `abc` or `def`.
2238
2239- **Match a single Unicode codepoint**
2240
2241    `.`
2242
2243    Matches a codepoint, not individual code units. (See the ’u’ option in EcmaScript262 regex.)
    For example, Osage `𐓏` is one match (`.`) not two.
2245    Does not match [markers](#markers). (See `\m{.}` and `\m{marker}`, below.)
2246
2247- **Match the start of the text context**
2248
2249    `^`
2250
2251    The start of the context could be the start of a line, a grid cell, or some other formatting boundary.
    See description at the top of [`transforms`](#element-transforms).
2253
2254#### Additional Features
2255
2256The following are additions to standard Regex syntax.
2257
2258- **Match a Marker**
2259
2260    `\m{Some_Marker}`
2261
2262    Matches the named marker.
2263    Also see [Markers](#markers).
2264
2265- **Match a single marker**
2266
2267    `\m{.}`
2268
2269    Matches any single marker.
2270    Also see [Markers](#markers).
2271
2272- **String Variables**
2273
2274    `${zwnj}`
2275
2276    In this usage, the variable with `id="zwnj"` will be substituted in at this point in the expression. The variable can contain a range, a character, or any other portion of a pattern. If `zwnj` is a simple string, the pattern will match that string at this point.
2277
2278- **`set` or `uset` variables**
2279
2280    `$[upper]`
2281
2282    Given a space-separated `set` or `uset` variable, this syntax will match _any_ of the substrings. This expression may be thought of  (and implemented) as if it were a _non-capturing group_. It may, however, be enclosed within a capturing group. For example, the following definition of `$[upper]` will match as if it were written `(?:A|B|CC|D|E|FF)`.
2283
2284    ```xml
2285    <variables>
2286        <set id="upper" value=" A B CC  D E  FF " />
2287    </variables>
2288    ```
2289
2290    This expression in a `from=` may be used to **insert a mapped variable**, see below under [Replacement syntax](#replacement-syntax).
2291
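The expansion of a `set` variable into a non-capturing group, as described above, can be sketched as follows. The helper name `set_to_group` is hypothetical, the output is in Python `re` syntax, and the sketch assumes the set value holds literal strings only (it does not decode `\u{…}` escapes or handle `uset` variables).

```python
import re


def set_to_group(value):
    """Turn a space-separated set value into a non-capturing alternation."""
    # str.split() with no arguments ignores leading, trailing, and repeated spaces.
    items = value.split()
    return "(?:" + "|".join(re.escape(item) for item in items) + ")"


upper = set_to_group(" A B CC  D E  FF ")
```
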
2292#### Disallowed Regex Features
2293
2294- **Matching an empty string**
2295
2296    Transforms may not match an empty string. For example, `<transform from=""/>` or `<transform from="X{0,1}"/>` are not allowed and must be flagged as an error to keyboard authors.
2297
2298- **Unicode properties**
2299
2300    `\p{property}` `\P{property}`
2301
2302    **Rationale:** The behavior of this feature varies by Unicode version, and so would not have predictable results.
2303
    Tooling may choose to suggest an expansion of properties, such as `\p{Mn}` to all nonspacing marks for a certain Unicode version. As well, a set of variables could be constructed in an `import`-able file matching particularly useful Unicode properties.
2305
2306    ```xml
2307    <uset id="Mn" value="[\u{034F}\u{0591}-\u{05AF}\u{05BD}\u{05C4}\u{05C5}\…]" /> <!-- 1,985 code points -->
2308    ```
2309
2310- **Backreferences**
2311
2312    `([abc])-\1` `\k<something>`
2313
2314    **Rationale:** Implementation and cognitive complexity.
2315
2316- **Unbounded Quantifiers**
2317
2318    `* + *? +? {1,} {0,}`
2319
2320    **Rationale:** Implementation and Computational complexity.
2321
2322- **Nested capture groups**
2323
2324    `((a|b|c)|(d|e|f))`
2325
2326    **Rationale:** Computational and cognitive complexity.
2327
2328- **Named capture groups**
2329
2330    `(?<something>)`
2331
2332    **Rationale:** Implementation complexity.
2333
2334- **Assertions** other than `^`
2335
2336    `\b` `\B` `(?<!…)` …
2337
2338    **Rationale:** Implementation complexity.
2339
2340- **End marker**
2341
2342    `$`
2343
2344    The end marker can be thought of as being implicitly at the end of every `from=` pattern, matching the insertion point. Transforms do not match past the insertion point.
2345
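One of the restrictions above, that a transform may not match an empty string, lends itself to a simple tooling check. The following is only a sketch: the function name `can_match_empty` is hypothetical, and it assumes the `from=` pattern has already been translated into Python `re` syntax.

```python
import re


def can_match_empty(pattern):
    """Return True if the (already translated) pattern can match the empty string."""
    return re.fullmatch(pattern, "") is not None
```

A tool could report an error to the keyboard author whenever this returns `True` for a `from=` pattern.
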
2346_Attribute:_ `to`
2347
2348> This attribute value represents the characters that are output from the transform.
2349>
> If this attribute is absent, it indicates that no characters are output, such as with a backspace transform.
2351>
2352> A final rule such as `<transform from=".*"/>` will remove all context which doesn’t match one of the prior rules.
2353
2354#### Replacement syntax
2355
Used in the `to=` attribute.
2357
2358- **Literals**
2359
2360    `$$ \$ \\` = `$ $ \`
2361
2362- **Entire matched substring**
2363
2364    `$0`
2365
2366- **Insert the specified capture group**
2367
2368    `$1 $2 $3 … $9`
2369
2370- **Insert an entire variable**
2371
2372    `${variable}`
2373
2374    The entire contents of the named variable will be inserted at this point.
2375
2376- **Insert a mapped set**
2377
2378    `$[1:variable]` (Where "1" is any numbered capture group from 1 to 9)
2379
2380    Maps capture group 1 to variable `variable`. The `from=` side must also contain a grouped variable. This expression may appear anywhere or multiple times in the `to=` pattern.
2381
2382    **Example**
2383
    ```xml
    <set id="upper" value="A B CC D E  FF       G" />
    <set id="lower" value="a b c  d e  \u{0192} g" />
    <!-- note that values may be spaced for ease of reading -->
    <transform from="($[upper])" to="$[1:lower]" />
    ```
2391
2392    - The capture group on the `from=` side **must** contain exactly one set variable.  `from="Q($[upper])X"` can be used (other context before or after the capture group), but `from="(Q$[upper])"` may not be used with a mapped variable and is flagged as an error.
2393
2394    - The `from=` and `to=` sides of the pattern must both be using `set` variables. There is no way to insert a set literal on either side and avoid using a variable.
2395
2396    - The two variables (here `upper` and `lower`) must have exactly the same number of whitespace-separated items. Leading and trailing space (such as at the end of `lower`) is ignored. A variable without any spaces is considered to be a set variable of exactly one item.
2397
2398    - As described in [Additional Features](#additional-features), the `upper` set variable as used here matches as if it is `((?:A|B|CC|D|E|FF|G))`, showing the enclosing capturing group. When text from the input context matches this expression, and all above conditions are met, the mapping proceeds as follows:
2399
2400    1. The portion of the input context, such as `CC`, is matched against the above calculated pattern.
2401
2402    2. The position within the `from=` variable (`upper`) is calculated. The regex match may not have this information, but the matched substring `CC` can be compared against the tokenized input variable: `A`, `B`, `CC`, `D`, … to find that the 3rd item matches exactly.
2403
2404    3. The same position within the `to=` variable (`lower`) is calculated. The 3rd item is `c`.
2405
2406    4. `CC` in the input context is replaced with `c`, and processing proceeds to the next `transformGroup`.
2407
2408- **Emit a marker**
2409
2410    `\m{Some_marker}`
2411
    Emits the named marker. Also see [Markers](#markers).
2413
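The four mapped-set steps listed above (match, find the position in the `from=` variable, find the same position in the `to=` variable, replace) can be sketched as follows. The function name `map_set` is hypothetical, the sketch handles only a rule of the form `from="($[upper])" to="$[1:lower]"` with no surrounding context, and the `\u{0192}` escape is written as a Python string escape.

```python
import re


def map_set(context, from_set, to_set):
    """Replace a trailing item of from_set with the same-position item of to_set."""
    from_items, to_items = from_set.split(), to_set.split()
    if len(from_items) != len(to_items):
        raise ValueError("set variables must have the same number of items")
    # Step 1: match the end of the context against the alternation of items.
    group = "(" + "|".join(re.escape(item) for item in from_items) + r")\Z"
    match = re.search(group, context)
    if match is None:
        return context
    # Step 2: find the position of the matched substring in the from= variable.
    position = from_items.index(match.group(1))
    # Steps 3 and 4: take the item at the same position in the to= variable
    # and substitute it for the matched text.
    return context[:match.start()] + to_items[position]


upper = "A B CC D E  FF       G"
lower = "a b c  d e  \u0192 g"
```
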
2414* * *
2415
2416### Element: reorder
2417
2418The reorder transform consists of a [`<transformGroup>`](#element-transformgroup) element containing `<reorder>` elements.  Multiple such `<transformGroup>` elements may be contained in an enclosing `<transforms>` element.
2419
2420One or more [`<import>`](#element-import) elements are allowed to precede the `<reorder>` elements.
2421
This transform has the job of reordering sequences of characters that have been typed, from their typed order to the desired output order. The primary concern in this transform is to sort combining marks into their correct relative order after a base, as described in this section. Because reorder transforms can be quite complex, keyboard layouts will almost always import them.
2423
2424The reordering algorithm consists of four parts:
2425
24261. Create a sort key for each character in the input string. A sort key has 4 parts (primary, index, tertiary, quaternary):
2427   * The **primary weight** is the primary order value.
2428   * The **secondary weight** is the index, a position in the input string, usually of the character itself, but it may be of a character earlier in the string.
2429   * The **tertiary weight** is a tertiary order value (defaulting to 0).
2430   * The **quaternary weight** is the index of the character in the string. This is solely to ensure a stable sort for sequences of characters with the same tertiary weight.
24312. Mark each character as to whether it is a prebase character, one that is typed before the base and logically stored after. Thus it will have a primary order > 0.
24323. Use the sort key and the prebase mark to identify runs. A run starts with a prefix that contains any prebase characters and a single base character whose primary and tertiary key is 0. The run extends until, but not including, the start of the prefix of the next run or end of the string.
2433   * `run := preBase* (primary=0 && tertiary=0) ((primary≠0 || tertiary≠0) && !preBase)*`
4. Sort the characters of each run by their sort keys.
2435
2436The primary order of a character with the Unicode property `Canonical_Combining_Class` (ccc) of 0 may well not be 0. In addition, a character may receive a different primary order dependent on context. For example, in the Devanagari sequence ka halant ka, the first ka would have a primary order 0 while the halant ka sequence would give both halant and the second ka a primary order > 0, for example 2. Note that “base” character in this discussion is not a Unicode base character. It is instead a character with primary=0.
2437
2438In order to get the characters into the correct relative order, it is necessary not only to order combining marks relative to the base character, but also to order some combining marks in a subsequence following another combining mark. For example in Devanagari, a nukta may follow a consonant character, but it may also follow a conjunct consisting of consonant, halant, consonant. Notice that the second consonant is not, in this model, the start of a new run because some characters may need to be reordered to before the first base, for example repha. The repha would get primary < 0, and be sorted before the character with order = 0, which is, in the case of Devanagari, the initial consonant of the orthographic syllable.
2439
2440The reorder transform consists of `<reorder>` elements encapsulated in a `<transformGroup>` element. Each element is a rule that matches against a string of characters with the action of setting the various ordering attributes (`primary`, `tertiary`, `tertiaryBase`, `preBase`) for the matched characters in the string.
2441
2442The relative ordering of `<reorder>` elements is not significant.
2443
2444**Syntax**
2445
2446```xml
2447<transformGroup>
2448    <!-- one or more <import/> elements are allowed at this point -->
2449    <reorder from="…combination of characters"
2450    before="…look-behind required match"
2451    order="…list of weights"
2452    tertiary="…list of weights"
2453    tertiaryBase="…list of true/false"
2454    preBase="…list of true/false" />
2455    <!-- other <reorder/> elements… -->
2456</transformGroup>
2457```
2458
2459> <small>
2460>
2461> Parents: [transformGroup](#element-transformgroup)
2462> Children: _none_
2463> Occurrence: optional, multiple
2464>
2465> </small>
2466
2467_Attribute:_ `from` (required)
2468
2469> This attribute value contains a string of elements. Each element matches one character and may consist of a codepoint or a UnicodeSet (both as defined in [UTS #35 Part One](tr35.md#Unicode_Sets)).
2470
2471_Attribute:_ `before`
2472
> This attribute value contains the element string that must match the string immediately preceding the start of the string that `@from` matches.
2474
2475_Attribute:_ `order`
2476
2477> This attribute value gives the primary order for the elements in the matched string in the `@from` attribute. The value is a simple integer between -128 and +127 inclusive, or a space separated list of such integers. For a single integer, it is applied to all the elements in the matched string. Details of such list type attributes are given after all the attributes are described. If missing, the order value of all the matched characters is 0. We consider the order value for a matched character in the string.
2478>
2479> * If the value is 0 and its tertiary value is 0, then the character is the base of a new run.
2480> * If the value is 0 and its tertiary value is non-zero, then it is a normal character in a run, with ordering semantics as described in the `@tertiary` attribute.
2481> * If the value is negative, then the character is a primary character and will reorder to be before the base of the run.
2482> * If the value is positive, then the character is a primary character and is sorted based on the order value as the primary key following a previous base character.
2483>
2484> A character with a zero tertiary value is a primary character and receives a sort key consisting of:
2485>
2486> * Primary weight is the order value
2487> * Secondary weight is the index of the character. This may be any value (character index, codepoint index) such that its value is greater than the character before it and less than the character after it.
2488> * Tertiary weight is 0.
2489> * Quaternary weight is the same as the secondary weight.
2490
2491_Attribute:_ `tertiary`
2492
2493> This attribute value gives the tertiary order value to the characters matched. The value is a simple integer between -128 and +127 inclusive, or a space separated list of such integers. If missing, the value for all the characters matched is 0. We consider the tertiary value for a matched character in the string.
2494>
2495> * If the value is 0 then the character is considered to have a primary order as specified in its order value and is a primary character.
2496> * If the value is non zero, then the order value must be zero otherwise it is an error. The character is considered as a tertiary character for the purposes of ordering.
2497>
2498> A tertiary character receives its primary order and index from a previous character, which it is intended to sort closely after. The sort key for a tertiary character consists of:
2499>
> * Primary weight is the primary weight of the primary character.
> * Secondary weight is the index of the primary character, not the tertiary character.
2502> * Tertiary weight is the tertiary value for the character.
2503> * Quaternary weight is the index of the tertiary character.
2504
2505_Attribute:_ `tertiaryBase`
2506
2507> This attribute value is a space separated list of `"true"` or `"false"` values corresponding to each character matched. It is illegal for a tertiary character to have a true `tertiaryBase` value. For a primary character it marks that this character may have tertiary characters moved after it. When calculating the secondary weight for a tertiary character, the most recently encountered primary character with a true `tertiaryBase` attribute value is used. Primary characters with an `@order` value of 0 automatically are treated as having `tertiaryBase` true regardless of what is specified for them.
2508
2509_Attribute:_ `preBase`
2510
2511> This attribute value gives the prebase attribute for each character matched. The value may be `"true"` or `"false"` or a space separated list of such values. If missing the value for all the characters matched is false. It is illegal for a tertiary character to have a true prebase value.
2512>
2513> If a primary character has a true prebase value then the character is marked as being typed before the base character of a run, even though it is intended to be stored after it. The primary order gives the intended position in the order after the base character, that the prebase character will end up. Thus `@order` shall not be 0. These characters are part of the run prefix. If such characters are typed then, in order to give the run a base character after which characters can be sorted, an appropriate base character, such as a dotted circle, is inserted into the output run, until a real base character has been typed. A value of `"false"` indicates that the character is not a prebase.
2514
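Putting the attribute descriptions together, the sort keys, run detection, and per-run sort might be sketched as below. This is an illustration under simplifying assumptions, not a reference implementation: the weights are taken as already assigned to each character, every primary character is treated as an eligible `tertiaryBase`, no dotted circle is inserted for a missing base, and the function names are hypothetical.

```python
def sort_keys(chars):
    """chars: list of (char, primary, tertiary, prebase). Returns one key per char."""
    keys = []
    base_primary, base_index = 0, 0
    for index, (_, primary, tertiary, _) in enumerate(chars):
        if tertiary == 0:
            # Primary character: (order, own index, 0, own index).
            base_primary, base_index = primary, index
            keys.append((primary, index, 0, index))
        else:
            # Tertiary character: primary weight and index come from the
            # most recent primary character (simplified tertiaryBase handling).
            keys.append((base_primary, base_index, tertiary, index))
    return keys


def split_runs(chars):
    """Group character indices into runs: prebase prefix, one base, then the rest."""
    runs, current, seen_base = [], [], False
    for index, (_, primary, tertiary, prebase) in enumerate(chars):
        is_base = primary == 0 and tertiary == 0
        # Once the current run has its base, a prebase or base starts a new run.
        if current and seen_base and (prebase or is_base):
            runs.append(current)
            current, seen_base = [], False
        current.append(index)
        seen_base = seen_base or is_base
    if current:
        runs.append(current)
    return runs


def reorder(chars):
    """Sort the characters of each run by sort key and return the result."""
    keys = sort_keys(chars)
    out = []
    for run in split_runs(chars):
        out.extend(chars[i][0] for i in sorted(run, key=lambda i: keys[i]))
    return "".join(out)
```

For instance, a prebase character with order 30 typed before a base with order 0 ends up stored after that base, because both fall in the same run and the base has the lower primary weight.
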
2515For `@from` attribute values with a match string length greater than 1, the sort key information (`@order`, `@tertiary`, `@tertiaryBase`, `@preBase`) may consist of a space-separated list of values, one for each element matched. The last value is repeated to fill out any missing values. Such a list may not contain more values than there are elements in the `@from` attribute:
2516
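```python
def fill_list(from_len, values):
    """Expand a list-valued attribute to one value per element matched by @from."""
    # `values` holds at least one item; a missing attribute uses its default instead.
    if from_len < len(values):
        raise ValueError("more values than elements in @from")
    # Repeat the last item until the list is as long as @from.
    return values + [values[-1]] * (from_len - len(values))
```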
2525
2526**Example**
2527
2528For example, consider the Northern Thai (`nod-Lana`, Tai Tham script) word: ᨡ᩠ᩅᩫ᩶ 'roasted'. This is ideally encoded as the following:
2529
2530| name | _kha_ | _sakot_ | _wa_ | _o_  | _t2_ |
2531|------|-------|---------|------|------|------|
2532| code | 1A21  | 1A60    | 1A45 | 1A6B | 1A76 |
2533| ccc  | 0     | 9       | 0    | 0    | 230  |
2534
2535(That sequence is already in NFC format.)
2536
2537Some users may type the upper component of the vowel first, and the tone before or after the lower component. Thus someone might type it as:
2538
2539| name | _kha_ | _o_  | _t2_ | _sakot_ | _wa_ |
2540|------|-------|------|------|---------|------|
2541| code | 1A21  | 1A6B | 1A76 | 1A60    | 1A45 |
2542| ccc  | 0     | 0    | 230  | 9       | 0    |
2543
2544The Unicode NFC format of that typed value reorders to:
2545
2546| name | _kha_ | _o_  | _sakot_ | _t2_ | _wa_ |
2547|------|-------|------|---------|------|------|
2548| code | 1A21  | 1A6B | 1A60    | 1A76 | 1A45 |
2549| ccc  | 0     | 0    | 9       | 230  | 0    |
2550
2551Finally, the user might also type in the sequence with the tone _after_ the lower component.
2552
2553| name | _kha_ | _o_  | _sakot_ | _wa_ | _t2_ |
2554|------|-------|------|---------|------|------|
2555| code | 1A21  | 1A6B | 1A60    | 1A45 | 1A76 |
2556| ccc  | 0     | 0    | 9       | 0    | 230  |
2557
2558(That sequence is already in NFC format.)
2559
2560We want all of these sequences to end up ordered as the first. To do this, we use the following rules:
2561
2562```xml
2563<reorder from="\u{1A60}" order="127" />      <!-- max possible order -->
2564<reorder from="\u{1A6B}" order="42" />
2565<reorder from="[\u{1A75}-\u{1A79}]" order="55" />
2566<reorder before="\u{1A6B}" from="\u{1A60}\u{1A45}" order="10" />
2567<reorder before="\u{1A6B}[\u{1A75}-\u{1A79}]" from="\u{1A60}\u{1A45}" order="10" />
2568<reorder before="\u{1A6B}" from="\u{1A60}[\u{1A75}-\u{1A79}]\u{1A45}" order="10 55 10" />
2569```
2570
2571The first reorder is the default ordering for the _sakot_ which allows for it to be placed anywhere in a sequence, but moves any non-consonants that may immediately follow it, back before it in the sequence. The next two rules give the orders for the top vowel component and tone marks respectively. The next three rules give the _sakot_ and _wa_ characters a primary order that places them before the _o_. Notice particularly the final reorder rule where the _sakot_+_wa_ is split by the tone mark. This rule is necessary in case someone types into the middle of previously normalized text.
2572
`<reorder>` elements are prioritized first by the length of the string that their `@from` attribute value matches, and then by the sum of the lengths of the strings that their `@before` attribute value matches.
2574
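The Tai Tham rules above can be exercised with a small sketch. Several things here are assumptions rather than statements of the specification: rules are tried at each position scanning left to right, longer `@from` and then longer `@before` matches are taken to win, and `tertiary` and `preBase` are left out entirely. All names (`RULES`, `assign_orders`, `reorder`) are hypothetical.

```python
TONES = {chr(c) for c in range(0x1A75, 0x1A7A)}
KHA, SAKOT, WA, O, T2 = "\u1A21", "\u1A60", "\u1A45", "\u1A6B", "\u1A76"

# (before, from, order); each element of before/from is a set of characters.
RULES = [
    ([], [{SAKOT}], [127]),
    ([], [{O}], [42]),
    ([], [TONES], [55]),
    ([{O}], [{SAKOT}, {WA}], [10]),
    ([{O}, TONES], [{SAKOT}, {WA}], [10]),
    ([{O}], [{SAKOT}, TONES, {WA}], [10, 55, 10]),
]


def matches(elements, text, start):
    """True if each element set matches the character at the same offset from start."""
    if start < 0 or start + len(elements) > len(text):
        return False
    return all(text[start + k] in element for k, element in enumerate(elements))


def assign_orders(text):
    """Give each character a primary order; unmatched characters get 0."""
    ranked = sorted(RULES, key=lambda r: (len(r[1]), len(r[0])), reverse=True)
    orders, i = [0] * len(text), 0
    while i < len(text):
        for before, frm, order in ranked:
            if matches(frm, text, i) and matches(before, text, i - len(before)):
                # The last order value is repeated to cover all matched elements.
                orders[i:i + len(frm)] = order + [order[-1]] * (len(frm) - len(order))
                i += len(frm)
                break
        else:
            i += 1
    return orders


def reorder(text):
    """Sort each run by (order, index); a character with order 0 starts a new run."""
    orders = assign_orders(text)
    out, run = [], []
    for i, ch in enumerate(text):
        if orders[i] == 0 and run:
            out.extend(sorted(run))
            run = []
        run.append((orders[i], i, ch))
    out.extend(sorted(run))
    return "".join(ch for _, _, ch in out)
```

With these rules, the three typed variants shown in the tables all come out in the ideal order, and the ideal order itself is left unchanged.
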
2575#### Using `<import>` with `<reorder>` elements
2576
2577This section describes the impact of using [`import`](#element-import) elements with `<reorder>` elements.
2578
The `@from` string in a `<reorder>` element describes a set of strings that it matches. This also holds for the `@before` attribute. The **intersection** of any two `<reorder>` elements consists of the intersections of their `@from` and `@before` string sets. Tooling should warn users if the intersection between any two `<reorder>` elements in the same `<transformGroup>` element is non-empty prior to processing imports.

If two `<reorder>` elements have a non-empty intersection, then they are split and merged. They are split such that where there were two `<reorder>` elements, there are, in effect (but not actuality), three elements consisting of:
2582
2583* `@from`, `@before` that match the intersection of the two rules. The other attribute values are merged, as described below.
2584* `@from`, `@before` that match the set of strings in the first rule not in the intersection with the other attribute values from the first rule.
2585* `@from`, `@before` that match the set of strings in the second rule not in the intersection, with the other attribute values from the second rule.
2586
2587When merging the other attributes, the second rule is taken to have priority (being an override of the earlier element). Where the second rule does not define the value for a character but the first does, the value is taken from the first rule, otherwise it is taken from the second rule.
2588
2589Notice that it is possible for two rules to match the same string, but for them not to merge because the distribution of the string across `@before` and `@from` is different. For example, the following would not merge:
2590
2591```xml
2592<reorder before="ab" from="cd" />
2593<reorder before="a" from="bcd" />
2594```
2595
2596After `<reorder>` elements merge, the resulting `reorder` elements are sorted into priority order for matching.
2597
2598Consider this fragment from a shared reordering for the Myanmar script:
2599
2600```xml
2601<!-- File: "myanmar-reordering.xml" -->
2602<transformGroup>
2603    <!-- medial-r -->
2604    <reorder from="\u{103C}" order="20" />
2605
2606    <!-- [medial-wa or shan-medial-wa] -->
2607    <reorder from="[\u{103D}\u{1082}]" order="25" />
2608
2609    <!-- [medial-ha or shan-medial-wa]+asat = Mon asat -->
2610    <reorder from="[\u{103E}\u{1082}]\u{103A}" order="27" />
2611
2612    <!-- [medial-ha or mon-medial-wa] -->
2613    <reorder from="[\u{103E}\u{1060}]" order="27" />
2614
2615    <!-- [e-vowel (U+1031) or shan-e-vowel (U+1084)] -->
2616    <reorder from="[\u{1031}\u{1084}]" order="30" />
2617
2618    <reorder from="[\u{102D}\u{102E}\u{1033}-\u{1035}\u{1071}-\u{1074}\u{1085}\u{109D}\u{A9E5}]" order="35" />
2619</transformGroup>
2620```
2621
2622A particular Myanmar keyboard layout can have these `reorder` elements:
2623
```xml
<transformGroup>
    <import path="myanmar-reordering.xml"/> <!-- import the above transformGroup -->
    <!-- Kinzi -->
    <reorder from="\u{1004}\u{103A}\u{1039}" order="-1" />

    <!-- e-vowel -->
    <reorder from="\u{1031}" preBase="true" />

    <!-- medial-r -->
    <reorder from="\u{103C}" preBase="true" />
</transformGroup>
```
2637
2638The effect of this is that the _e-vowel_ will be identified as a prebase and will have an order of 30. Likewise a _medial-r_ will be identified as a prebase and will have an order of 20. Notice that a _shan-e-vowel_ (`\u{1084}`) will not be identified as a prebase (even if it should be!). The _kinzi_ is described in the layout since it moves something across a run boundary. By separating such movements (prebase or moving to in front of a base) from the shared ordering rules, the shared ordering rules become a self-contained combining order description that can be used in other keyboards or even in other contexts than keyboarding.
2639
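The split-and-merge behaviour for the _e-vowel_ case above can be sketched as follows. This is a deliberately narrow illustration: it only handles rules whose `@from` is a single set of characters with no `@before`, it represents each rule as a `(set, attributes)` pair, and the function name `split_and_merge` is hypothetical.

```python
def split_and_merge(first, second):
    """Split two single-set rules into intersection and remainders, merging attributes.

    The second rule has priority: where both define an attribute the second
    wins, and attributes only the first defines are carried over.
    """
    (from1, attrs1), (from2, attrs2) = first, second
    both = from1 & from2
    result = []
    if both:
        result.append((both, {**attrs1, **attrs2}))
    if from1 - both:
        result.append((from1 - both, dict(attrs1)))
    if from2 - both:
        result.append((from2 - both, dict(attrs2)))
    return result


imported = ({"\u1031", "\u1084"}, {"order": 30})   # from the shared file
layout = ({"\u1031"}, {"preBase": True})           # from the keyboard layout
merged = split_and_merge(imported, layout)
```

Here the intersection rule for U+1031 carries both `order` 30 and the prebase flag, while U+1084 keeps only `order` 30, matching the description above.
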
2640#### Example Post-reorder transforms
2641
It may be desired to perform additional processing following reorder operations. This may be accomplished by adding an additional `<transformGroup>` element after the group containing `<reorder>` elements.
2643
2644First, a partial example from Khmer where split vowels are combined after reordering.
2645
```xml
<transformGroup>
    <reorder … />
    <reorder … />
    <reorder … />
</transformGroup>
<transformGroup>
    <transform from="\u{17C1}\u{17B8}" to="\u{17BE}" />
    <transform from="\u{17C1}\u{17B6}" to="\u{17C4}" />
</transformGroup>
```
2659
2660Another partial example allows a keyboard implementation to prevent people typing two lower vowels in a Burmese cluster:
2661
```xml
<transformGroup>
    <reorder … />
    <reorder … />
    <reorder … />
</transformGroup>
<transformGroup>
    <transform from="[\u{102F}\u{1030}\u{1048}\u{1059}][\u{102F}\u{1030}\u{1048}\u{1059}]" />
</transformGroup>
```
2674
2675#### Reorder and Markers
2676
2677Markers are not matched by `reorder` elements. However, if a character preceded by one or more markers is reordered due to a `reorder` element, those markers will be reordered with the characters, maintaining the same relative order.  This is a similar process to the algorithm used to normalize strings processed by `transform` elements.
2678
2679Keyboard implementations must process `reorder` elements using the following algorithm.
2680
2681Note that steps 1 and 3 are identical to the steps used for normalization using markers in the [Marker Algorithm Overview](#marker-algorithm-overview).
2682
2683Given an input string from context or from a previous `transformGroup`:
2684
26851. Parsing/Removing Markers
2686
26872. Perform reordering (as in this section)
2688
26893. Re-Adding Markers
2690
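The three steps above can be sketched as follows. This is an illustration only: it assumes markers are stored as private-use characters U+E000 through U+E009, it takes the reordering itself as a caller-supplied function returning the new order as a list of indices, and the names `MARKERS` and `reorder_keeping_markers` are hypothetical. The details of the referenced marker algorithm are not reproduced here.

```python
MARKERS = {chr(c) for c in range(0xE000, 0xE00A)}  # assumption: markers stored as PUA


def reorder_keeping_markers(text, reorder_indices):
    """Reorder text while keeping each marker run attached to the character after it.

    reorder_indices(chars) must return the new order of the marker-free
    string as a list of indices into chars.
    """
    # 1. Parse/remove markers: attach each run of markers to the following character.
    units, pending = [], ""
    for ch in text:
        if ch in MARKERS:
            pending += ch
        else:
            units.append((pending, ch))
            pending = ""
    # 2. Perform reordering on the plain characters only.
    order = reorder_indices("".join(ch for _, ch in units))
    # 3. Re-add markers: each character carries its preceding markers along;
    #    markers at the very end of the text stay at the end.
    return "".join(units[i][0] + units[i][1] for i in order) + pending
```
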
2691* * *
2692
2693### Backspace Transforms
2694
The `<transforms type="backspace">` element describes an optional transform that is not applied on input of normal characters, but is only used to perform extra backspace modifications to previously committed text.
2696
When the backspace key is pressed, the `<transforms type="backspace">` element (if present) is processed, and then the `<transforms type="simple">` element (if present) as with any other key.
2698
Keyboarding applications typically, but are not required to, work in one of two modes:
2700
2701**_text entry_**
2702
2703> text entry happens while a user is typing new text. A user typically wants the backspace key to undo whatever they last typed, whether or not they typed things in the 'right' order.
2704
2705**_text editing_**
2706
2707> text editing happens when a user moves the cursor into some previously entered text which may have been entered by someone else. As such, there is no way to know in which order things were typed, but a user will still want appropriate behaviour when they press backspace. This may involve deleting more than one character or replacing a sequence of characters with a different sequence.
2708
2709In text editing mode, different keyboard layouts may behave differently in the same textual context. The backspace transform allows the keyboard layout to specify the effect of pressing backspace in a particular textual context. This is done by specifying a set of backspace rules that match a string before the cursor and replace it with another string. The rules are expressed within a `transforms type="backspace"` element.
2710
2711
```xml
<transforms type="backspace">
    <transformGroup>
        <transform from="…match pattern" to="…output pattern" />
    </transformGroup>
</transforms>
```

**Example**

Consider deleting a Devanagari ksha क्श:

While this character is made up of three codepoints, the following rule causes all three to be deleted by a single press of the backspace key.

```xml
<transforms type="backspace">
    <transformGroup>
        <transform from="\u{0915}\u{094D}\u{0936}"/>
    </transformGroup>
</transforms>
```

Note that the optional attribute `@to` is omitted, since the whole string is being deleted. This is not uncommon in backspace transforms.

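As a rough illustration of how such a rule might be applied, the sketch below matches each rule against the text that ends at the cursor and replaces the match with the rule's output. The rule table, the function name, and the hand translation of the `from` pattern into a Python regular expression are assumptions made for this example; a real implementation would parse the keyboard file and follow the transform syntax defined in this specification. A return value of `None` is used here to signal that no explicit rule matched.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rule table: (Python regex for the `from` pattern, `to` output).
# An omitted `to` attribute is modelled as the empty string.
BACKSPACE_RULES: List[Tuple[str, str]] = [
    ("\u0915\u094D\u0936", ""),  # the three-codepoint sequence from the example
]


def apply_backspace_rules(context: str,
                          rules: List[Tuple[str, str]] = BACKSPACE_RULES
                          ) -> Optional[str]:
    """Return the new context if a rule matches text ending at the cursor."""
    for pattern, replacement in rules:
        # Anchor the pattern at the end of the context (the cursor position).
        match = re.search("(?:" + pattern + r")\Z", context)
        if match:
            return context[: match.start()] + replacement
    return None  # caller falls back to its default backspace behaviour
```
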
A more complex example comes from a Burmese visually ordered keyboard:

```xml
<transforms type="backspace">
    <transformGroup>
        <!-- Kinzi -->
        <transform from="[\u{1004}\u{101B}\u{105A}]\u{103A}\u{1039}" />

        <!-- subjoined consonant -->
        <transform from="\u{1039}[\u{1000}-\u{101C}\u{101E}\u{1020}\u{1021}\u{1050}\u{1051}\u{105A}-\u{105D}]" />

        <!-- tone mark -->
        <transform from="\u{102B}\u{103A}" />

        <!-- Handle prebases -->
        <!-- diacritics stored before e-vowel -->
        <transform from="[\u{103A}-\u{103F}\u{105E}-\u{1060}\u{1082}]\u{1031}" to="\u{1031}" />

        <!-- diacritics stored before medial r -->
        <transform from="[\u{103A}-\u{103B}\u{105E}-\u{105F}]\u{103C}" to="\u{103C}" />

        <!-- subjoined consonant before e-vowel -->
        <transform from="\u{1039}[\u{1000}-\u{101C}\u{101E}\u{1020}\u{1021}]\u{1031}" to="\u{1031}" />

        <!-- base consonant before e-vowel -->
        <transform from="[\u{1000}-\u{102A}\u{103F}-\u{1049}\u{104E}]\u{1031}" to="\m{prebase}\u{1031}" />

        <!-- subjoined consonant before medial r -->
        <transform from="\u{1039}[\u{1000}-\u{101C}\u{101E}\u{1020}\u{1021}]\u{103C}" to="\u{103C}" />

        <!-- base consonant before medial r -->
        <transform from="[\u{1000}-\u{102A}\u{103F}-\u{1049}\u{104E}]\u{103C}" to="\m{prebase}\u{103C}" />

        <!-- delete lone medial r or e-vowel -->
        <transform from="\m{prebase}[\u{1031}\u{103C}]" />
    </transformGroup>
</transforms>
```

The above example is simplified, and doesn't fully handle the interaction between medial-r and e-vowel.

> The character `\m{prebase}` does not represent a literal character, but is instead a special marker, used as a "filler string". When a keyboard implementation handles a user pressing a key that inserts a prebase character, it also has to insert a special filler string before the prebase to ensure that the prebase character does not combine with the previous cluster. See the reorder transform for details. See [markers](#markers) for the `\m` syntax.

The first three transforms above delete various ligatures with a single keypress. The other transforms handle prebase characters, of which there are two in this Burmese keyboard: the e-vowel and the medial r. These transforms delete the characters preceding the prebase character, up to the base, which is replaced with the prebase filler string representing a null base. Finally, the prebase filler string and the prebase character are deleted as a unit.

If no transform in any `transformGroup` under the `<transforms type="backspace">` element matches, a default is used instead: an implied final transform that simply deletes the codepoint at the end of the input context. This implied transform is effectively similar to the following code sample, even though the `*` operator is not actually allowed in `from=`. See the documentation for *Match a single Unicode codepoint* under [transform syntax](#regex-like-syntax) and [markers](#markers), above.

It is important that implementations do not by default delete more than one non-marker codepoint at a time, except in the case of emoji clusters. Note that implementations will vary in the emoji handling due to the iterative nature of successive Unicode releases. See [UTS#51 §2.4.2: Emoji Modifiers in Text](https://www.unicode.org/reports/tr51/#Emoji_Modifiers_in_Text).

```xml
<transforms type="backspace">
    <!-- Other explicit transforms -->

    <!-- Final implicit backspace transform: Delete the final codepoint. -->
    <transformGroup>
        <!-- (?:\m{.})*  - matches any number of contiguous markers -->
        <transform from="(?:\m{.})*.(?:\m{.})*" /> <!-- deletes any number of markers directly on either side of the final pre-caret codepoint -->
    </transformGroup>
</transforms>
```
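
The implied default can also be sketched procedurally. The version below is only an illustration: it assumes a hypothetical context representation in which each list item is either a single codepoint (a one-character string) or a `Marker` object, and it deliberately leaves out the emoji cluster exception mentioned above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Marker:
    """Hypothetical stand-in for a named marker in the context."""
    name: str


Token = Union[str, Marker]


def default_backspace(context: List[Token]) -> List[Token]:
    """Delete the final codepoint and any markers directly on either side of it."""
    # Skip over markers that follow the final codepoint.
    last = len(context) - 1
    while last >= 0 and isinstance(context[last], Marker):
        last -= 1
    if last < 0:
        return list(context)  # no codepoint to delete, leave context unchanged
    # Extend the deletion leftwards over markers directly before the codepoint.
    start = last
    while start > 0 and isinstance(context[start - 1], Marker):
        start -= 1
    return list(context[:start])
```

Like the pattern in the sample, this removes exactly one non-marker codepoint per keypress, together with its adjacent markers, and does nothing when the context holds no codepoint.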

* * *

## Invariants

Beyond what the DTD imposes, certain other restrictions are imposed on the data.
Please note the constraints given under each element section above.
DTD validation alone is not sufficient to verify a keyboard file.

* * *

## Keyboard IDs

A set of subtags helps identify keyboards. These subtags are used after the `t-k0` subtags. The first subtag appended is a mandatory platform tag, followed by zero or more subtags that help differentiate the keyboard from others with the same locale code.

### Principles for Keyboard IDs

The following are the design principles for the IDs.

1. BCP47 compliant.
   1. E.g., `en`, `sr-Cyrl`, or `en-t-k0-extended`.
2. Use the minimal language id based on `likelySubtags` (see [Part 1: Likely Subtags](tr35.md#Likely_Subtags)).
   1. E.g., instead of `fa-Arab`, use `fa`.
   2. The data is in <https://github.com/unicode-org/cldr/blob/main/common/supplemental/likelySubtags.xml>
3. Keyboard files should be platform-independent; however, if a platform id is included, it is the first subtag after `-t-k0-`. If a keyboard on the platform changes over time, both versions are dated, e.g. `bg-t-k0-chromeos-2011`. When selecting, an id with no date means the latest one.
4. Only keyboards that differ from the "standard for each language" are tagged. That is, for each language on a platform, there will be a keyboard with no subtags. Subtags with common semantics across languages and platforms are used, such as `-extended`, `-phonetic`, `-qwerty`, `-qwertz`, `-azerty`, …
5. To stay within 8 letters, abbreviations are reused that are already in [bcp47](https://github.com/unicode-org/cldr/blob/main/common/bcp47/) -u/-t extensions and in [language-subtag-registry](https://www.iana.org/assignments/language-subtag-registry) variants, e.g. for Traditional use `-trad` or `-traditio` (both exist in [bcp47](https://github.com/unicode-org/cldr/blob/main/common/bcp47/)).
6. Multiple languages cannot be indicated in the locale id, so the predominant target is used.
   1. For Finnish + Sami, use `fi-t-k0-smi` or `extended-smi`
   2. The [`<locales>`](#element-locales) element may be used to identify additional languages.
7. In some cases, there are multiple subtags, like `en-US-t-k0-chromeos-intl-altgr.xml`
8. Otherwise, platform names are used as a guide.

**Examples**

```xml
<!-- Serbian Latin -->
<keyboard3 locale="sr-Latn"/>
```

```xml
<!-- Serbian Cyrillic -->
<keyboard3 locale="sr-Cyrl"/>
```

```xml
<!-- Pan Nigerian Keyboard -->
<keyboard3 locale="mul-Latn-NG-t-k0-panng">
    <locales>
        <locale id="ha"/>
        <locale id="ig"/>
        <!-- others … -->
    </locales>
</keyboard3>
```

```xml
<!-- Finnish Keyboard including Skolt Sami -->
<keyboard3 locale="fi-t-k0-smi">
    <locales>
        <locale id="sms"/>
    </locales>
</keyboard3>
```
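
To make the ID structure described in the principles above concrete, the following sketch splits a keyboard locale id into its base locale and the subtags that follow `-t-k0-`. The function name is hypothetical and the parsing is deliberately naive: it assumes lowercase extension subtags and does not handle other `-t-` fields or `-u-` extensions that a full BCP47 parser would need to account for.

```python
from typing import List, Tuple


def split_keyboard_id(locale_id: str) -> Tuple[str, List[str]]:
    """Split a keyboard locale id into (base locale, subtags after -t-k0-)."""
    base, separator, rest = locale_id.partition("-t-k0-")
    if not separator:
        # No keyboard subtags: the id is just the language id.
        return locale_id, []
    return base, rest.split("-")
```

For the dated id mentioned in the principles, `split_keyboard_id("bg-t-k0-chromeos-2011")` yields the base `bg` with the subtags `chromeos` and `2011`, while an id such as `sr-Cyrl` comes back unchanged with an empty subtag list.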

* * *

## Platform Behaviors in Edge Cases

| Platform | No modifier combination match is available | No map match is available for key position | Transform fails (i.e. if \^d is pressed when that transform does not exist) |
|----------|--------------------------------------------|--------------------------------------------|---------------------------------------------------------------------------|
| Chrome OS | Fall back to base | Fall back to character in a keyMap with same "level" of modifier combination. If this character does not exist, fall back to (n-1) level. (This is handled data-generation-side.) <br/> In the specification: No output | No output at all |
| Mac OS X  | Fall back to base (unless combination is some sort of keyboard shortcut, e.g. cmd-c) | No output | Both keys are output separately |
| Windows  | No output | No output | Both keys are output separately |

* * *

Copyright © 2001–2024 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode [Terms of Use](https://www.unicode.org/copyright.html) apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


[keyboard-workgroup]: https://cldr.unicode.org/index/keyboard-workgroup
