## Unicode Technical Standard #35 Tech Preview
# Unicode Locale Data Markup Language (LDML)
Part 7: Keyboards
|Version|45 |
|-------|-------------|
|Editors|Steven Loomis (srloomis@unicode.org) and other CLDR committee members|
For the full header, summary, and status, see [Part 1: Core](tr35.md).
### _Summary_
This document describes parts of an XML format (_vocabulary_) for the exchange of structured locale data. This format is used in the [Unicode Common Locale Data Repository](https://www.unicode.org/cldr/).
This is a partial document, describing keyboards. For the other parts of the LDML see the [main LDML document](tr35.md) and the links above.
_Note:_
Some links may lead to in-development or older
versions of the data files.
See for up-to-date CLDR release data.
### _Status_
_This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium.
This is a stable document and may be used as reference material or cited as a normative reference by other specifications._
> _**A Unicode Technical Standard (UTS)** is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS._
_Please submit corrigenda and other comments with the CLDR bug reporting form [[Bugs](tr35.md#Bugs)]. Related information that is useful in understanding this document is found in the [References](tr35.md#References). For the latest version of the Unicode Standard see [[Unicode](tr35.md#Unicode)]. For a list of current Unicode Technical Reports see [[Reports](tr35.md#Reports)]. For more information about versions of the Unicode Standard, see [[Versions](tr35.md#Versions)]._
See also [Compatibility Notice](#compatibility-notice).
## Parts
The LDML specification is divided into the following parts:
* Part 1: [Core](tr35.md#Contents) (languages, locales, basic structure)
* Part 2: [General](tr35-general.md#Contents) (display names & transforms, etc.)
* Part 3: [Numbers](tr35-numbers.md#Contents) (number & currency formatting)
* Part 4: [Dates](tr35-dates.md#Contents) (date, time, time zone formatting)
* Part 5: [Collation](tr35-collation.md#Contents) (sorting, searching, grouping)
* Part 6: [Supplemental](tr35-info.md#Contents) (supplemental data)
* Part 7: [Keyboards](tr35-keyboards.md#Contents) (keyboard mappings)
* Part 8: [Person Names](tr35-personNames.md#Contents) (person names)
* Part 9: [MessageFormat](tr35-messageFormat.md#Contents) (message format)
## Contents of Part 7, Keyboards
* [Keyboards](#keyboards)
* [Goals and Non-goals](#goals-and-non-goals)
* [Compatibility Notice](#compatibility-notice)
* [Accessibility](#accessibility)
* [Definitions](#definitions)
* [Notation](#notation)
* [Escaping](#escaping)
* [UnicodeSet Escaping](#unicodeset-escaping)
* [UTS18 Escaping](#uts18-escaping)
* [File and Directory Structure](#file-and-directory-structure)
* [Extensibility](#extensibility)
* [Normalization](#normalization)
* [Where Normalization Occurs](#where-normalization-occurs)
* [Normalization and Transform Matching](#normalization-and-transform-matching)
* [Normalization and Markers](#normalization-and-markers)
* [Rationale for 'gluing' markers](#rationale-for-gluing-markers)
* [Data Model: `Marker`](#data-model-marker)
* [Data Model: string](#data-model-string)
* [Data Model: `MarkerEntry`](#data-model-markerentry)
* [Marker Algorithm Overview](#marker-algorithm-overview)
* [Phase 1: Parsing/Removing Markers](#phase-1-parsingremoving-markers)
* [Phase 2: Plain Text Processing](#phase-2-plain-text-processing)
* [Phase 3: Adding Markers](#phase-3-adding-markers)
* [Example Normalization with Markers](#example-normalization-with-markers)
* [Normalization and Character Classes](#normalization-and-character-classes)
* [Normalization and Reorder elements](#normalization-and-reorder-elements)
* [Normalization-safe Segments](#normalization-safe-segments)
* [Normalization and Output](#normalization-and-output)
* [Disabling Normalization](#disabling-normalization)
* [Element Hierarchy](#element-hierarchy)
* [Element: keyboard3](#element-keyboard3)
* [Element: import](#element-import)
* [Element: locales](#element-locales)
* [Element: locale](#element-locale)
* [Element: version](#element-version)
* [Element: info](#element-info)
* [Element: settings](#element-settings)
* [Element: displays](#element-displays)
* [Element: display](#element-display)
* [Non-spacing marks on keytops](#non-spacing-marks-on-keytops)
* [Element: displayOptions](#element-displayoptions)
* [Element: keys](#element-keys)
* [Element: key](#element-key)
* [Implied Keys](#implied-keys)
* [Element: flicks](#element-flicks)
* [Element: flick](#element-flick)
* [Element: flickSegment](#element-flicksegment)
* [Element: forms](#element-forms)
* [Element: form](#element-form)
* [Implied Form Values](#implied-form-values)
* [Element: scanCodes](#element-scancodes)
* [Element: layers](#element-layers)
* [Element: layer](#element-layer)
* [Layer Modifier Sets](#layer-modifier-sets)
* [Layer Modifier Components](#layer-modifier-components)
* [Modifier Left- and Right- keys](#modifier-left--and-right--keys)
* [Layer Modifier Matching](#layer-modifier-matching)
* [Element: row](#element-row)
* [Element: variables](#element-variables)
* [Element: string](#element-string)
* [Element: set](#element-set)
* [Element: uset](#element-uset)
* [Element: transforms](#element-transforms)
* [Markers](#markers)
* [Element: transformGroup](#element-transformgroup)
* [Example: `transformGroup` with `transform` elements](#example-transformgroup-with-transform-elements)
* [Example: `transformGroup` with `reorder` elements](#example-transformgroup-with-reorder-elements)
* [Element: transform](#element-transform)
* [Regex-like Syntax](#regex-like-syntax)
* [Additional Features](#additional-features)
* [Disallowed Regex Features](#disallowed-regex-features)
* [Replacement syntax](#replacement-syntax)
* [Element: reorder](#element-reorder)
* [Using `<import>` with `<reorder>` elements](#using-import-with-reorder-elements)
* [Example Post-reorder transforms](#example-post-reorder-transforms)
* [Reorder and Markers](#reorder-and-markers)
* [Backspace Transforms](#backspace-transforms)
* [Invariants](#invariants)
* [Keyboard IDs](#keyboard-ids)
* [Principles for Keyboard IDs](#principles-for-keyboard-ids)
* [Platform Behaviors in Edge Cases](#platform-behaviors-in-edge-cases)
## Keyboards
The Unicode Standard and related technologies such as CLDR have dramatically improved the path to language support. However, keyboard support remains platform and vendor specific, causing inconsistencies in implementation as well as timeline.
More and more language communities are determining that digitization is vital to their approach to language preservation and that engagement with Unicode is essential to becoming fully digitized. For many of these communities, however, getting new characters or a new script added to The Unicode Standard is not the end of their journey. The next, often more challenging stage is to get device makers, operating systems, apps and services to implement the script requirements that Unicode has just added to support their language.
However, commensurate improvements to streamline new language support on the input side have been lacking. CLDR’s Keyboard specification has been updated in an attempt to address this gap.
This document specifies an interchange format for the communication of keyboard mapping data independent of vendors and platforms. Keyboard authors can then create a single mapping file for their language, which implementations can use to provide that language’s keyboard mapping on their own platform.
Additionally, the standardized identifier for keyboards can be used to communicate, internally or externally, a request for a particular keyboard mapping that is to be used to transform either text or keystrokes. The corresponding data can then be used to perform the requested actions. For example, a remote screen-access application (such as used for customer service or server management) would be able to communicate and choose the same keyboard layout on the remote device as is used in front of the user, even if the two systems used different platforms.
The data can also be used in analysis of the capabilities of different keyboards. It also allows better interoperability by making it easier for keyboard designers to see which characters are generally supported on keyboards for given languages.
For complete examples, see the XML files in the CLDR source repository.
Attribute values should be evaluated considering the DTD and [DTD Annotations](tr35.md#dtd-annotations).
* * *
## Goals and Non-goals
Some goals of this format are:
1. Physical and virtual keyboard layouts defined in a single file.
2. Provide definitive platform-independent definitions for new keyboard layouts.
* For example, a new French standard keyboard layout would have a single definition which would be usable across all implementations.
3. Allow platforms to be able to use CLDR keyboard data for the character-emitting keys (non-frame) aspects of keyboard layouts.
4. Deprecate & archive existing LDML platform-specific layouts so they are not part of future releases.
Some non-goals (outside the scope of the format) currently are:
1. Adaptation for screen scaling resolution. Instead, keyboards should define layouts based on physical size. Platforms may interpret physical size definitions and adapt for different physical screen sizes with different resolutions.
2. Unification of platform-specific virtual key and scan code mapping tables.
3. Unification of pre-existing platform layouts themselves (e.g. existing fr-azerty on platform a, b, c).
4. Support for prior (pre 3.0) CLDR keyboard files. See [Compatibility Notice](#compatibility-notice).
5. Run-time efficiency. [LDML is explicitly an interchange format](tr35.md#Introduction), and so it is expected that data will be transformed to a more compact format for use by a keystroke processing engine.
6. Platform-specific frame keys such as Fn, Numpad, IME swap keys, and cursor keys are out of scope.
(This also means that in this specification, modifier (frame) keys cannot generate output, such as capslock producing backslash.)
Note that in parts of this document, the format `@x` is used to indicate the _attribute_ **x**.
### Compatibility Notice
> A major rewrite of this specification, called "Keyboard 3.0", was introduced in CLDR v45.
> The changes required were too extensive to maintain compatibility. For this reason, the `ldmlKeyboard3.dtd` DTD is _not_ compatible with DTDs from prior versions of CLDR, such as v43 and earlier.
>
> To process earlier XML files, use the data and specification from v43.1, found at
>
> `ldmlKeyboard.dtd` continues to be made available in CLDR; however, it will not be updated.
### Accessibility
Keyboard use can be challenging for individuals with various types of disabilities. For this revision, features or architectural designs specifically for the purpose of improving accessibility are not yet included. However:
1. Having an industry-wide standard format for keyboards will enable accessibility software to make use of keyboard data with a reduced dependence on platform-specific knowledge.
2. Features which require certain levels of mobility or speed of entry should be considered for their impact on accessibility. This impact could be mitigated by means of additional, accessible methods of generating the same output.
3. Public feedback is welcome on any aspects of this document which might hinder accessibility.
## Definitions
**Arrangement:** The relative position of the rectangles that represent keys, either physically or virtually. A hardware keyboard has a static arrangement while a touch keyboard may have a dynamic arrangement that changes per language and/or layer. While the arrangement of keys on a keyboard may be fixed, the mapping of those keys may vary.
**Base character:** The character emitted by a particular key when no modifiers are active. In ISO 9995-1:2009 terms, this is Group 1, Level 1.
**Core keys:** Also known as the “alphanumeric” section. The primary set of key values on a keyboard that are used for typing the target language of the keyboard. For example, the three rows of letters on a standard US QWERTY keyboard (QWERTYUIOP, ASDFGHJKL, ZXCVBNM) together with the most significant punctuation keys. Usually this equates to the minimal set of keys for a language as seen on mobile phone keyboards.
Distinguished from the **frame keys**.
**Dead keys:** These are keys which do not emit normal characters by themselves. They are so named because to the user, they may appear to be “dead,” i.e., non-functional. However, they do produce a change to the input context. For example, in many Latin keyboards hitting the `^` dead-key followed by the `e` key produces `ê`. The `^` by itself may be invisible or presented in a special way by the platform.
**Frame keys:** These are keys which are outside of the area of the **core keys** and typically do not emit characters. These keys include **modifier** keys, such as Shift or Ctrl, but also include platform specific keys: Fn, IME and layout-switching keys, cursor keys, insert emoji keys etc.
**Hardware keyboard:** an input device which has individual keys that are pressed. Each key has a unique identifier and the arrangement doesn't change, even if the mapping of those keys does. Also known as a physical keyboard.
**Implementation:** see **Keyboard implementation**
**Input Method Editor (IME):** a component or program that supports input of large character sets. Typically, IMEs employ contextual logic and candidate UI to identify the Unicode characters intended by the user.
**Keyboard implementation:** Software which implements the present specification, such that keyboard XML files can be used to interpret keystrokes from a **Hardware keyboard** or an on-screen **Touch keyboard**.
Keyboard implementations will typically consist of two parts:
1. A _compile/build tool_ part used by **Keyboard authors** to parse the XML file and produce a compact runtime format, and
2. A _runtime_ part which interprets the runtime format when the keyboard is selected by the end user, and delivers the output plain text to the platform or application.
**Key:** A physical key on a hardware keyboard, or a virtual key on a touch keyboard.
**Key code:** The integer code sent to the application on pressing a key.
**Key map:** The basic mapping between hardware or on-screen positions and the output characters for each set of modifier combinations associated with a particular layout. There may be multiple key maps for each layout.
**Keyboard:** A particular arrangement of keys for the inputting of text, such as a hardware keyboard or a touch keyboard.
**Keyboard author:** The person or group of people designing and producing a particular keyboard layout designed to support one or more languages. In the context of this specification, that author may be editing the LDML XML file directly or by means of software tools.
**Keyboard layout:** A layout is the overall keyboard configuration for a particular locale. Within a keyboard layout, there is a single base map, one or more key maps and zero or more transforms.
**Layer:** An arrangement of keys on a touch keyboard. A touch keyboard is made up of a set of layers. Each layer may have a different key layout, unlike with a hardware keyboard, and may not correspond directly to a hardware keyboard's modifier keys. A layer is accessed via a layer-switching key. See also **Touch keyboard** and **Modifier**.
**Long-press key:** Also known as a “child key”. A secondary key that is invoked from a top level key on a touch keyboard. Secondary keys typically provide access to variants of the top level key, such as accented variants (a => á, à, ä, ã).
**Modifier:** A key that is held to change the behavior of a hardware keyboard. For example, the "Shift" key allows access to upper-case characters on a US keyboard. Other modifier keys include but are not limited to: Ctrl, Alt, Option, Command and Caps Lock. On a touch keyboard, keys that appear to be modifier keys should be considered to be layer-switching keys.
**Physical keyboard:** see **Hardware keyboard**
**Touch keyboard:** A keyboard that is rendered on a surface, typically a touch surface. It has a dynamic arrangement and contrasts with a hardware keyboard. This term has many synonyms: software keyboard, SIP (Software Input Panel), virtual keyboard. This usage differs from other uses of the term virtual keyboard to mean an on-screen keyboard for reference or accessibility data entry.
**Transform:** A transform is an element that specifies a set of conversions from sequences of code points into one (or more) other code points. Transforms may reorder or replace text. They may be used to implement “dead key” behaviors, simple orthographic corrections, visual (typewriter) type input etc.
**Virtual keyboard:** see **Touch keyboard**
## Notation
- Ellipses (`…`) in syntax examples are used to denote substituted parts.
For example, `id="…keyId"` denotes that `…keyId` (the part between double quotes) is to be replaced with something, in this case a key identifier. As another example, `\u{…usv}` denotes that the `…usv` is to be replaced with something, in this case a Unicode scalar value in hex.
### Escaping
When explicitly specified, attribute values can contain escaped characters. This specification uses two methods of escaping, the _UnicodeSet_ notation and the `\u{…usv}` notation.
### UnicodeSet Escaping
The _UnicodeSet_ notation is described in [UTS #35 section 5.3.3](tr35.md#Unicode_Sets) and allows for comprehensive character matching, including by character range, properties, names, or codepoints.
Note that the `\u1234` and `\x{C1}` format escaping is not supported, only the `\u{…}` format (using `bracketedHex`).
Currently, the following attribute values allow _UnicodeSet_ notation:
* `from` or `before` on the `` element
* `from` or `before` on the `` element
* `chars` on the [`<repertoire>`](#test-element-repertoire) test element.
### UTS18 Escaping
The `\u{…usv}` notation, a subset of hex notation, is described in [UTS #18 section 1.1](https://www.unicode.org/reports/tr18/#Hex_notation). It can refer to one or multiple individual codepoints. Currently, the following attribute values allow the `\u{…}` notation:
* `output` on the `` element
* `from` or `to` on the `` element
* `value` on the `` element
* `output` and `display` on the `` element
* `baseCharacter` on the `` element
* Some attributes on [Keyboard Test Data](#keyboard-test-data) subelements
Characters of general category Mark (M), Control characters (Cc), Format characters (Cf), and whitespace other than space should be encoded using one of the notations above as appropriate.
Attribute values escaped in this manner are annotated with the `` DTD annotation, see [DTD Annotations](tr35.md#dtd-annotations)
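
The following is a minimal sketch, not part of this specification, of how a tool might expand the `\u{…usv}` notation in an attribute value. The function name `unescape_usv` is hypothetical, and the sketch assumes that multiple scalar values inside one pair of braces are separated by single spaces.

```python
import re

# Matches \u{...} containing one or more hex values separated by single spaces.
# (Assumption for this sketch: a single space is the only separator.)
_USV_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6}(?: [0-9A-Fa-f]{1,6})*)\}")


def unescape_usv(value):
    """Replace each \\u{...} escape in an attribute value with its codepoints."""

    def expand(match):
        chars = []
        for hex_value in match.group(1).split(" "):
            cp = int(hex_value, 16)
            # Surrogates are not Unicode scalar values.
            if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                raise ValueError(f"not a Unicode scalar value: {hex_value}")
            chars.append(chr(cp))
        return "".join(chars)

    return _USV_ESCAPE.sub(expand, value)
```

For example, `unescape_usv(r"e\u{0300}")` yields the letter `e` followed by U+0300. Text outside an escape is passed through unchanged.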
* * *
## File and Directory Structure
* In the future, new layouts will be included in the CLDR repository, as a way for new layouts to be distributed in a cross-platform manner. The process for this repository of layouts has not yet been defined, see the [CLDR Keyboard Workgroup Page][keyboard-workgroup] for up-to-date information.
* Layouts have version metadata to indicate their specification compliance version number, such as `45`. See [`cldrVersion`](tr35-info.md#version-information).
```xml
```
> _Note_: Unlike other LDML files, layouts are designed to be used outside of the CLDR source tree. As such, they do not contain DOCTYPE entries.
>
> DTD and Schema (.xsd) files are available for use in validating keyboard files.
* The filename of a keyboard .xml file does not have to match the BCP47 primary locale ID, but it is recommended to do so. The CLDR repository may enforce filename consistency.
### Extensibility
For extensibility, the `<special>` element will be allowed at nearly every level.
See [Element special](tr35.md#special) in Part 1.
## Normalization
Unicode Normalization, as described in [The Unicode Standard](https://www.unicode.org/reports/tr41/#Unicode/), is a process by which Unicode text is processed to eliminate unwanted distinctions.
This section discusses how conformant keyboards are affected by normalization, and the impact of normalization on keyboard authors and keyboard implementations.
Keyboard implementations will usually apply normalization as appropriate when matching transform rules and `` value matching.
Output from the keyboard, following application of all transform rules, will be normalized to the appropriate form by the keyboard implementation.
> Note: There are many existing software libraries which perform Unicode Normalization, including [ICU](https://icu.unicode.org), [ICU4X](https://icu4x.unicode.org), and JavaScript's [String.prototype.normalize()](https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/String/normalize).
Keyboard authors will not typically need to perform normalization as part of the keyboard layout. However, authors should be aware of areas where normalization affects keyboard operation so that they may achieve their desired results.
### Where Normalization Occurs
There are four stages where normalization must be performed by keyboard implementations.
1. **From the keyboard source `.xml`**
Keyboard source .xml files may be in any normalization form.
However, in processing they are converted to NFD.
- From any form to NFD: full normalization (decompose+reorder)
- Markers must be processed as described [below](#marker-algorithm-overview).
- Regex patterns must be processed so that matching is performed in NFD.
Example: ``).
The implementation must normalize the context buffer to `e\u{0320}\u{0300}` (`è̠`) before matching.
3. **Before each `transformGroup`**
Text must be normalized before processing by the next `transformGroup`.
- To NFD: no decomposition should be needed, because all of the input text (including transform rules) was already in NFD form.
However, marker reordering may be needed if transforms insert segments out of order.
- Markers must be preserved.
Example: The input context contains U+00E8 (`è`). The user clicks the cursor after this character, then presses a key producing `x`. A transform rule `` matches. The implementation must normalize the intermediate buffer to `e\u{0320}\u{0300}` (`è̠`) before proceeding to the next `transformGroup`.
4. **Before output to the platform/application**
Text must be normalized into the output form requested by the platform or application. This will typically be NFC, but may not be.
- If normalizing to NFC, full normalization (reorder+composition) will be required.
- No markers are present in this text; they are removed prior to output but retained in the implementation's input context for subsequent keystrokes. See [markers](#markers).
Example: The result of keystrokes and transform processing produces the string `e\u{0300}`. The keyboard implementation normalizes this to a single NFC codepoint U+00E8 (`è`), which is returned to the application.
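
As an illustration only, the two directions of normalization described above can be reproduced with Python's standard `unicodedata` module. The context string (U+00E8 followed by U+0320) is an assumed input chosen for this sketch, and `codepoints` is a hypothetical helper used for display.

```python
import unicodedata


def codepoints(text):
    """Format a string as a list of U+XXXX values for display."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


# Normalizing a context to NFD: decomposition followed by canonical reordering.
context = "\u00e8" + "\u0320"
nfd_context = unicodedata.normalize("NFD", context)
print(codepoints(nfd_context))  # U+0065 U+0320 U+0300

# Normalizing pending output to NFC, as a platform would typically request.
pending_output = "e\u0300"
nfc_output = unicodedata.normalize("NFC", pending_output)
print(codepoints(nfc_output))  # U+00E8
```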
### Normalization and Transform Matching
Regardless of the normalization form in the keyboard source file or in the edit buffer context, transform matching will be performed using **NFD**. For example, all of the following transforms will match the input string `è̠`, whether the input is U+00E8 U+0320, U+0065 U+0320 U+0300, or U+0065 U+0300 U+0320.
```xml
```
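
The equivalence of the three input forms can be checked with a short sketch. It covers only literal text comparison, not the regex-like transform syntax, and the helper name `nfd_equal` is hypothetical.

```python
import unicodedata


def nfd_equal(rule_text, context_text):
    """Compare a literal rule string and a context string after converting both to NFD."""
    return (unicodedata.normalize("NFD", rule_text)
            == unicodedata.normalize("NFD", context_text))


# The three input forms listed above.
inputs = [
    "\u00e8\u0320",    # U+00E8 U+0320
    "e\u0320\u0300",   # U+0065 U+0320 U+0300
    "e\u0300\u0320",   # U+0065 U+0300 U+0320
]

# Every rule form matches every input form once both sides are in NFD.
all_match = all(nfd_equal(rule, text) for rule in inputs for text in inputs)
```

Here `all_match` is `True`, because each of the three strings normalizes to U+0065 U+0320 U+0300.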
### Normalization and Markers
A special issue occurs when markers are involved.
[Markers](#markers) are not text, and so are not themselves modified or reordered by the Unicode Normalization Algorithm.
Existing Normalization APIs typically operate on plain text, and so those APIs cannot be used with content containing markers.
However, the markers must be retained and processed by keyboard implementations in a manner which will be both consistent across implementations and predictable to keyboard authors.
Inconsistencies would result in different user experiences (specifically, different or incorrect text output) on some implementations but not others.
Unpredictability would make it challenging for the keyboard author to create a keyboard with expected behavior.
This section gives an algorithm for implementing normalization on a text stream including markers.
_Note:_ When the algorithm is performed on a plain text stream that doesn't include markers, implementations may skip phases 1 and 3 (removing and re-adding markers) because no markers are involved.
#### Rationale for 'gluing' markers
The processing described here is an extension to Unicode normalization to account for the desired behavior of markers.
The algorithm described considers markers 'glued' to (remaining with) the following character. If a context ends with a marker, that marker would be guaranteed to remain at the end after processing, consistently located with respect to the next keystroke to be input.
1. Keyboard authors can keep a marker together with a character of interest by emitting the marker just previous to that character.
For example, given a key `output="\m{marker}X"`, the marker will precede `X` regardless of any normalization. (If `output="X\m{marker}"` were used, and `X` were to reorder with other characters, the marker would no longer be adjacent to the X.)
2. Markers which are at the end of the input remain at the end of input during normalization.
For example, given input context which ends with a marker, such as `...ABCDX\m{marker}`, the marker will remain at the end of the input context regardless of any normalization.
The 'gluing' is only applicable during one particular processing step. It does not persist or affect further processing steps or future keystrokes.
#### Data Model: `Marker`
For purposes of this algorithm, a `Marker` is an opaque data type which has one property, its ID. See [Markers](#markers) for a discussion of the marker ID.
#### Data Model: string
For purposes of this algorithm, a string is an array of elements, where each element is either a codepoint or a `Marker`. For example, a [`key`](#element-key) in the XML such as `` would produce a string with three elements:
1. The codepoint U+104EF
2. The `Marker` named `mymarker`
3. The codepoint U+0078
If this string were output to an application, it would be converted to _plain text_ by removing all markers, which would yield the plain text string with only two codepoints: `𐓯x`.
#### Data Model: `MarkerEntry`
This algorithm uses a temporary data structure which is an ordered array of `MarkerEntry` elements.
Each `MarkerEntry` element has the following properties:
- `glue` (a codepoint, or the special value `END_OF_SEGMENT`, written as `END` in the steps below)
- `divider?` (true/false)
- `processed?` (true/false, defaults to false)
- `marker` (the `Marker` object)
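
The data model above could be represented as follows. This is a sketch for illustration, not a required representation: the Python class layout, the `Element` alias, the `END_OF_SEGMENT` sentinel object, and the `to_plain_text` helper are all assumptions of this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Marker:
    """Opaque marker with a single property, its ID."""
    id: str


# A string element is either one codepoint (a one-character str) or a Marker.
Element = Union[str, Marker]

# Sentinel used as the glue value for markers at the end of the text.
END_OF_SEGMENT = object()


@dataclass
class MarkerEntry:
    glue: object                      # a one-character str, or END_OF_SEGMENT
    divider: bool                     # the `divider?` property
    marker: Optional[Marker] = None   # unset for divider entries
    processed: bool = False           # the `processed?` property


def to_plain_text(s: List[Element]) -> str:
    """Convert a string to plain text by removing all markers."""
    return "".join(e for e in s if not isinstance(e, Marker))


# The three-element string described above: U+104EF, a marker, U+0078.
example = ["\U000104EF", Marker("mymarker"), "x"]
plain = to_plain_text(example)
```

In this sketch `plain` contains two codepoints, U+104EF and U+0078, matching the plain text described above.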
#### Marker Algorithm Overview
This algorithm has three main phases.
1. **Parsing/Removing Markers**
    In this phase, the input string is analyzed to locate all markers. Metadata about each marker is stored in a temporary array of `MarkerEntry` elements.
    Markers are removed from the input string, leaving only plain text.
2. **Plain Text Processing**
    In this phase, processing such as NFD normalization is performed on the plain text string.
3. **Re-Adding Markers**
    Finally, markers are re-added to the plain text string using the `MarkerEntry` metadata from phase 1.
    This phase results in a string which contains both codepoints and markers.
#### Phase 1: Parsing/Removing Markers
Given an input string _s_
1. Initialize an empty `MarkerEntry` array _e_
2. Initialize an empty `Marker` array _pending_
3. Loop through each element _i_ of the input _s_
    1. If _i_ is a `Marker`:
        1. add the marker _i_ to the end of _pending_
        2. remove the marker from the input string _s_
    2. else if _i_ is a codepoint:
        1. Decompose _i_ into NFD form into a plain text string array of codepoints _d_
        2. Add an element with `glue=d[0]` (the first codepoint of _d_) and `divider? = true` to the end of _e_
        3. For every marker _m_ in _pending_:
            1. Add an element with `glue=d[0]` and `marker=m` and `divider? = false` to the end of _e_
        4. Clear the _pending_ array.
        5. Finally, for every codepoint _c_ in _d_ **following** the initial codepoint: (d[1]..):
            1. Add an element with `glue=c` and `divider? = true` to the end of _e_
4. At the end of text,
    1. Add an element with `glue=END` and `divider?=true` to the end of _e_
    2. For every marker _m_ in _pending_:
        1. Add an element with `glue=END` and `marker=m` and `divider? = false` to the end of _e_
The string _s_ is now plain text and can be processed by the next phase.
The array _e_ will be used in Phase 3.
#### Phase 2: Plain Text Processing
See [UAX #15](https://www.unicode.org/reports/tr15/#Description_Norm) for an overview of the process. An existing Unicode-compliant API can be used here.
#### Phase 3: Adding Markers
1. Initialize an empty output string _o_
2. Loop through the elements _p_ of the array _e_ from end to beginning (backwards)
    1. If _p_.glue isn't `END`:
        1. break out of the loop
    2. If _p_.divider? == false:
        1. Prepend marker _p_.marker to the output string _o_
    3. Set _p_.processed?=true (so we don't process this again)
3. Loop through each codepoint _i_ ( in the plain text input string ) from end to beginning (backwards)
    1. Prepend _i_ to output _o_
    2. Loop through the elements _p_ of the array _e_ from end to beginning (backwards)
        1. If _p_.processed? == true:
            1. Continue the inner loop (was already processed)
        2. If _p_.glue isn't _i_
            1. Continue the inner loop (wrong glue, not applicable)
        3. If _p_.divider? == true:
            1. Set _p_.processed?=true (so that an earlier occurrence of the same codepoint does not stop at this divider)
            2. Break out of the inner loop (reached end of this 'glue' char)
        4. Prepend marker _p_.marker to the output string _o_
        5. Set _p_.processed?=true (so we don't process this again)
4. _o_ is now the output string including markers.
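
The three phases can be sketched end to end as follows. This is an illustrative sketch rather than a reference implementation. All Python names are hypothetical, the output is built back to front and reversed at the end (equivalent to prepending), and the sketch marks a divider entry as processed when the inner loop stops at it. That last point is an assumption of this sketch: without it, an earlier occurrence of the same codepoint would stop at the same divider and its markers would be lost.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional

END = object()  # sentinel glue value for the end of the text


@dataclass(frozen=True)
class Marker:
    id: str


@dataclass
class MarkerEntry:
    glue: object                     # a one-character str, or END
    divider: bool
    marker: Optional[Marker] = None
    processed: bool = False


def remove_markers(s):
    """Phase 1: split s into plain text and an ordered list of MarkerEntry."""
    entries, pending, plain = [], [], []
    for item in s:
        if isinstance(item, Marker):
            pending.append(item)
            continue
        plain.append(item)
        d = unicodedata.normalize("NFD", item)
        entries.append(MarkerEntry(d[0], True))
        entries.extend(MarkerEntry(d[0], False, m) for m in pending)
        pending.clear()
        entries.extend(MarkerEntry(c, True) for c in d[1:])
    entries.append(MarkerEntry(END, True))
    entries.extend(MarkerEntry(END, False, m) for m in pending)
    return "".join(plain), entries


def add_markers(plain, entries):
    """Phase 3: re-insert markers before the codepoint each one is glued to."""
    out = []  # collected from the end of the text towards the start
    for p in reversed(entries):
        if p.glue is not END:
            break
        if not p.divider:
            out.append(p.marker)
        p.processed = True
    for ch in reversed(plain):
        out.append(ch)
        for p in reversed(entries):
            if p.processed or p.glue != ch:
                continue
            p.processed = True  # assumption: dividers are marked too
            if p.divider:
                break
            out.append(p.marker)
    out.reverse()
    return out


def normalize_with_markers(s):
    plain, entries = remove_markers(s)
    plain = unicodedata.normalize("NFD", plain)  # Phase 2
    return add_markers(plain, entries)
```

For instance, `normalize_with_markers(["e", "\u0300", Marker("marker"), "\u0320"])` returns the elements `e`, the marker, U+0320, U+0300, in that order.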
#### Example Normalization with Markers
**Example 1**
Consider this example, without markers:
- `e\u{0300}\u{0320}` (input)
- `e\u{0320}\u{0300}` (NFD)
The combining marks are reordered.
**Example 2**
If we add markers:
- `e\u{0300}\m{marker}\u{0320}` (input)
- `e\m{marker}\u{0320}\u{0300}` (NFD)
Note that the marker is 'glued' to the _following_ character. In the above example, `\m{marker}` was 'glued' to the `\u{0320}`.
**Example 3**
A second example with markers:
- `e\m{marker0}\u{0300}\m{marker1}\u{0320}\m{marker2}` (input)
- `e\m{marker1}\u{0320}\m{marker0}\u{0300}\m{marker2}` (NFD)
Here `\m{marker2}` is 'glued' to the end of the string. However, if additional text is added such as by a subsequent keystroke (which may add an additional combining character, for example), this marker may be 'glued' to that following text.
Markers remain in the same normalization-safe segment during normalization. Consider:
**Example 4**
- `e\u{0300}\m{marker1}\u{0320}a\u{0300}\m{marker2}\u{0320}` (original)
- `e\m{marker1}\u{0320}\u{0300}a\m{marker2}\u{0320}\u{0300}` (NFD)
There are two normalization-safe segments here:
1. `e\u{0300}\m{marker1}\u{0320}`
2. `a\u{0300}\m{marker2}\u{0320}`
Normalization (and marker rearranging) effectively occurs within each segment. While `\m{marker1}` is 'glued' to the `\u{0320}`, it is glued within the first segment and has no effect on the second segment.
### Normalization and Character Classes
If pre-composed (non-NFD) characters are used in [character classes](#regex-like-syntax), such as `[á-é]`, these may not match as keyboard authors expect, as the U+00E1 character (á) will not occur in NFD form. Thus this may be masking serious errors in the data.
Tools that process keyboard data must reject the data when character classes include non-NFD characters.
The above should be written instead as a regex `(á|â|ã|ä|å|æ|ç|è|é)`. Alternatively, it could be written as a set variable `` and matched as `$[Example]`.
There is another case where there is no explicit mention of a non-NFD character, but the character class could include non-NFD characters, such as the range `[\u{0020}-\u{01FF}]`. For these, the tools should raise a warning by default.
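
A tool could detect both cases with checks along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and the character class is assumed to have already been parsed into explicit characters and codepoint ranges.

```python
import unicodedata


def is_nfd(ch):
    """True if a single character is unchanged by NFD normalization."""
    return unicodedata.normalize("NFD", ch) == ch


def explicit_non_nfd(chars):
    """Explicitly listed characters that are not in NFD (data should be rejected)."""
    return [c for c in chars if not is_nfd(c)]


def range_has_non_nfd(first, last):
    """True if any codepoint in first..last (inclusive) is not in NFD (warning)."""
    return any(not is_nfd(chr(cp)) for cp in range(first, last + 1))
```

With these helpers, `explicit_non_nfd("\u00e1\u00e9")` reports both endpoints of `[á-é]`, and `range_has_non_nfd(0x20, 0x1FF)` is true because the range includes precomposed letters such as U+00C0.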
### Normalization and Reorder elements
[`reorder`](#element-reorder) elements operate on NFD codepoints.
### Normalization-safe Segments
For purposes of this algorithm, "normalization-safe segments" are defined as a string of codepoints which are
1. already in [NFD](https://www.unicode.org/reports/tr15/#Norm_Forms), and
2. begin with a character with [Canonical Combining Class](https://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values) of `0`.
See [UAX #15 Section 9.1: Stable Code Points](https://www.unicode.org/reports/tr15/#Stable_Code_Points) for related discussion.
Text under consideration can be segmented by locating such characters.
### Normalization and Output
On output, text will be normalized into a specified normalization form. That form will typically be NFC, but an implementation may allow a calling application to override the choice of normalization form.
For example, many platforms may request NFC as the output format. In such a case, all text emitted via the keyboard will be transformed into NFC.
Existing text in a document will only have normalization applied within a single normalization-safe segment from the caret. The output will not contain any markers, thus any normalization is unaffected by any markers embedded within the segment.
For example, the sequence `e\m{marker}\u{300}` would be output in NFC as `è`. The marker is removed and has no effect on the output.
### Disabling Normalization
The attribute value `normalization="disabled"` can be used to indicate that no automatic normalization is to be applied in input, matching, or output. Using this setting should be done with caution:
- When this attribute value is used, all matching and output uses only the exact codepoints provided by the keyboard author.
- The input context from the application may not be normalized, which means that the keyboard author should consider all possible combinations, including NFC, NFD, and mixed normalization. See [`<settings>`](#element-settings) for further details.
The majority of the above section only applies when `normalization="disabled"` is not used.
* * *
## Element Hierarchy
This section describes the XML elements in a keyboard layout file, beginning with the top level element `<keyboard3>`.
### Element: keyboard3
This is the top level element. All other elements defined below are under this element.
**Syntax**
```xml
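<!-- Illustrative sketch, not normative syntax: the locale value is a placeholder -->
<keyboard3 conformsTo="45" locale="{locale ID}">
    <!-- definition of the layout, using the elements described below -->
</keyboard3>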
```
>
>
> Parents: _none_
>
> Children: [displays](#element-displays), [flicks](#element-flicks), [forms](#element-forms), [import](#element-import), [info](#element-info), [keys](#element-keys), [layers](#element-layers), [locales](#element-locales), [settings](#element-settings), [_special_](tr35.md#special), [transforms](#element-transforms), [variables](#element-variables), [version](#element-version)
>
> Occurrence: required, single
>
>
_Attribute:_ `conformsTo` (required)
This attribute value distinguishes the keyboard from prior versions,
and it also specifies the minimum CLDR major version required.
This attribute value must be a whole number of `45` or greater. See [`cldrVersion`](tr35-info.md#version-information).
```xml
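<!-- Sketch only: a keyboard requiring CLDR 45 or later, with the other attributes elided -->
<keyboard3 … conformsTo="45"/>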
```
_Attribute:_ `locale` (required)
This attribute value contains the primary locale of the keyboard using BCP 47 [Unicode locale identifiers](tr35.md#Canonical_Unicode_Locale_Identifiers), for example `"el"` for Greek. Sometimes, the locale may not specify the base language. For example, a Devanagari keyboard for many languages could be specified by the BCP 47 code `"und-Deva"`. However, it is better to list out the languages explicitly using the [`locales`](#element-locales) element.
For further details about the choice of locale ID, see [Keyboard IDs](#keyboard-ids).
**Example** (for illustrative purposes only, not indicative of the real data)
```xml
<keyboard3 … locale="el">
    …
</keyboard3>
```
```xml
<keyboard3 … locale="und-Deva">
    …
</keyboard3>
```
* * *
### Element: import
The `import` element is used to reference another XML file so that elements are imported from
another file. The use case is to be able to import a standard set of `transform`s and similar
from the CLDR repository, especially to be able to share common information relevant to a particular script.
The intent is for each single XML file to contain all that is needed for a keyboard layout, other than required standard import data from the CLDR repository.
`<import>` can be used as a child of a number of elements (see the _Parents_ section immediately below). Multiple `<import>` elements may be used; however, `<import>` elements must come before any other sibling elements.
If two identical elements are defined, the later element takes precedence, overriding the earlier one.
Imported elements may contain other `<import>` statements. Implementations must prevent recursion, that is, each imported file may only be included once.
**Note:** imported files do not have any indication of their normalization mode. For this reason, the keyboard author must verify that the imported file is of a compatible normalization mode. See the [`settings` element](#element-settings) for further details.
**Syntax**
```xml
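<!-- Illustrative sketch: the file name in `path` is an assumed example -->
<import base="cldr" path="45/keys-Zyyy-punctuation.xml"/>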
```
>
>
> Parents: [displays](#element-displays), [flicks](#element-flicks), [forms](#element-forms), [keyboard3](#element-keyboard3), [keys](#element-keys), [layers](#element-layers), [transformGroup](#element-transformgroup), [transforms](#element-transforms), [variables](#element-variables)
> Children: _none_
>
> Occurrence: optional, multiple
>
>
_Attribute:_ `base`
> The base may be omitted (indicating a local import) or have the value `"cldr"`.
**Note:** `base="cldr"` is required for all `<import>` statements within keyboard files in the CLDR repository.
_Attribute:_ `path` (required)
> If `base` is `cldr`, then the `path` must start with a CLDR major version (such as `45`) representing the CLDR version to pull imports from. The imports are located in the `keyboards/import` subdirectory of the CLDR source repository.
> Implementations are not required to have all CLDR versions available to them.
>
> If `base` is omitted, then `path` is an absolute or relative file path.
**Further Examples**
```xml
…
…
```
**Note:** The root element, here `transforms`, is the same as
the _parent_ of the `<import>` element. It is an error to import an XML file
whose root element differs from the parent element of the `<import>` element.
After loading, the above example will be the equivalent of the following.
```xml
```
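
The merge can be sketched end to end as follows, assuming a hypothetical shared file `transforms-example.xml`. The file name and the transform rules are illustrative assumptions, not real CLDR data.

```xml
<!-- In the keyboard file: the import comes before any sibling elements -->
<transforms type="simple">
    <import base="cldr" path="45/transforms-example.xml"/>
    <transformGroup>
        <transform from="~n" to="ñ"/>
    </transformGroup>
</transforms>

<!-- Hypothetical contents of transforms-example.xml: its root matches the parent of the import -->
<transforms type="simple">
    <transformGroup>
        <transform from="`a" to="à"/>
    </transformGroup>
</transforms>

<!-- Effective result after loading: imported elements first, then local ones -->
<transforms type="simple">
    <transformGroup>
        <transform from="`a" to="à"/>
    </transformGroup>
    <transformGroup>
        <transform from="~n" to="ñ"/>
    </transformGroup>
</transforms>
```

Because imports come first, a local element identical to an imported one would be the later of the two and so would override it.
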
* * *
### Element: locales
The optional `<locales>` element allows specifying additional or alternate locales.
**Syntax**
```xml
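<!-- Sketch with a placeholder value: a list of additional locale elements -->
<locales>
    <locale id="{locale ID}"/>
</locales>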
```
>
>
> Parents: [keyboard3](#element-keyboard3)
>
> Children: [locale](#element-locale)
>
> Occurrence: optional, single
>
>
### Element: locale
The `<locale>` element specifies an additional or alternate locale. It denotes intentional support for an extra language, not just that a keyboard incidentally supports a language’s orthography.
**Syntax**
```xml
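<!-- Sketch with a placeholder value -->
<locale id="{locale ID}"/>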
```
>
>
> Parents: [locales](#element-locales)
>
> Children: _none_
>
> Occurrence: optional, multiple
>
>
_Attribute:_ `id` (required)
> The [BCP 47](tr35.md#Canonical_Unicode_Locale_Identifiers) locale ID of an additional language supported by this keyboard.
> Must _not_ include the `-k0-` subtag for this additional language.
**Example**
See [Principles for Keyboard IDs](#principles-for-keyboard-ids) for discussion and further examples.
```xml
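<!-- Illustrative sketch (all locale IDs are assumed examples): a keyboard intentionally supporting several languages -->
<keyboard3 … locale="mul-Latn-NG-t-k0-panng">
    <locales>
        <locale id="ha"/>
        <locale id="ig"/>
    </locales>
</keyboard3>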
```
* * *
### Element: version
Element used to keep track of the source data version.
**Syntax**
```xml
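<!-- Sketch with a placeholder value -->
<version number="{SEMVER version}"/>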
```
>
>
> Parents: [keyboard3](#element-keyboard3)
>
> Children: _none_
>
> Occurrence: optional, single
>
>
_Attribute:_ `number` (required)
> Must be a [[SEMVER](https://semver.org)] compatible version number, such as `1.0.0` or `38.0.0-beta.11`.
_Attribute:_ `cldrVersion` (fixed by DTD)
> The CLDR specification version that is associated with this data file. This value is fixed and is inherited from the [DTD file](https://github.com/unicode-org/cldr/tree/main/keyboards/dtd) and therefore does not show up directly in the XML file.
**Example**
```xml
<keyboard3 …>
    …
    <version number="1.0.0"/>
    …
</keyboard3>
```
* * *
### Element: info
Element containing informative properties about the layout, for displaying in user interfaces etc.
**Syntax**
```xml
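<!-- Illustrative sketch: all attribute values are assumed examples; only `name` is required -->
<info name="Example Keyboard" author="Example Author" layout="QWERTY" indicator="EX"/>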
```
>
>
> Parents: [keyboard3](#element-keyboard3)
>
> Children: _none_
>
> Occurrence: required, single
>
>
_Attribute:_ `name` (required)
> Note that this is the only required attribute for the `<info>` element.
>
> This attribute is an informative name for the keyboard.
```xml
<keyboard3 …>
    …
    <info name="…"/>
    …
</keyboard3>
```
* * *
_Attribute:_ `author`
> The `author` attribute value contains the name of the author of the layout file.
_Attribute:_ `layout`
> The `layout` attribute describes the layout pattern, such as QWERTY, DVORAK, or INSCRIPT. It is typically used to distinguish various layouts for the same language.
>
> This attribute is not localized, but is an informative identifier for implementation use.
_Attribute:_ `indicator`
> The `indicator` attribute describes a short string to be used in the currently selected layout indicator, such as `US` or `SI9`.
> Typically, this is shown on a UI element that allows switching keyboard layouts and/or input languages.
>
> This attribute is not localized.
* * *
### Element: settings
An element used to keep track of layout-specific settings by implementations. This element may or may not show up on a layout. These settings reflect the normal practice by the implementation. However, an implementation using the data may customize the behavior.
**Syntax**
```xml
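<!-- Sketch: uses the only attribute described in this section -->
<settings normalization="disabled"/>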
```
>
>
> Parents: [keyboard3](#element-keyboard3)
>
> Children: _none_
>
> Occurrence: optional, single
>
>
_Attribute:_ `normalization="disabled"`
> The presence of this attribute indicates that normalization will not be applied to the input text, matching, or the output.
> See [Normalization](#normalization) for additional details.
>
> **Note**: while this attribute is allowed by the specification, it should be used with caution.
**Example**
```xml
<keyboard3 …>
    …
    <settings normalization="disabled"/>
    …
</keyboard3>
```
* * *
### Element: displays
The `displays` element consists of a list of [`display`](#element-display) subelements.
**Syntax**
```xml
<displays>
    …
</displays>
```
>
>
> Parents: [keyboard3](#element-keyboard3)
>
> Children: [display](#element-display), [displayOptions](#element-displayoptions), [_special_](tr35.md#special)
>
> Occurrence: optional, single
>
>
* * *
### Element: display
The `display` elements can be used to describe what is to be displayed on the keytops for various keys. For the most part, such explicit information is unnecessary since the `@output` attribute from the `keys/key` element will be used for keytop display. It is useful in a few cases:
- Some characters, such as diacritics, do not display well on their own.
- Another useful scenario is where there are doubled diacritics, or multiple characters with spacing issues.
- Finally, the `display` element provides a way to specify the keytop for keys which do not otherwise produce output. Keys which switch layers using the `@layerId` attribute typically do not produce output.
> Note: `displays` elements are designed to be shared across many different keyboard layout descriptions, and imported with `<import>` where needed.
#### Non-spacing marks on keytops
For non-spacing marks, U+25CC `◌` is used as a base. It is an error to use a non-spacing character without a base in the `display` attribute. For example, `display="\u{0303}"` would produce an error.
A key which outputs a combining tilde (U+0303) could be represented as either of the following:
```xml
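<!-- Sketch of two equivalent forms: a literal dotted circle plus tilde, or the same sequence escaped -->
<display output="\u{0303}" display="◌̃"/>
<display output="\u{0303}" display="\u{25CC}\u{0303}"/>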
```
This way, a key which outputs a combining tilde (U+0303) will be represented as `◌̃` (a tilde on a dotted circle).
Users of some scripts/languages may prefer a different base than U+25CC. See [`<displayOptions>`](#element-displayoptions).
**Syntax**
```xml
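<!-- Sketch with placeholder values: `output` (or `keyId`) selects the key, and `display` gives the keytop text -->
<display output="{output to match}" display="{keytop text}"/>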
```
>
>
> Parents: [displays](#element-displays)
>
> Children: _none_
>
> Occurrence: required, multiple
>
>
One of the `output` or `keyId` attributes is required.
**Note**: There is currently no way to indicate a custom display for a key without output (i.e. without an `output=` attribute), nor is there a way to indicate that such a key has a standardized identity (e.g. that a key should be identified as a “Shift”). These may be addressed in future versions of this standard.
_Attribute:_ `output` (optional)
> Specifies the character or character sequence from the `keys/key` element that is to have a special display.
> This attribute may be escaped with `\u` notation, see [Escaping](#escaping).
> The `output` attribute may also contain the `\m{…}` syntax to reference a marker. See [Markers](#markers). Implementations may highlight a displayed marker, such as with a lighter text color, or a yellow highlight.
> String variables may be substituted. See [String variables](#element-string)
_Attribute:_ `keyId` (optional)
> Specifies the `key` id. This is useful for keys which do not produce any output (no `output=` value), such as a shift key.
>
> Must match `[A-Za-z0-9][A-Za-z0-9_-]*`
_Attribute:_ `display` (required)
> Specifies the character sequence that should be displayed on the keytop for any key that generates the `@output` sequence or has the `@keyId`. It is an error if the value of the `display` attribute is the same as the value of the `output` attribute, as this would be an extraneous entry.
> String variables may be substituted. See [String variables](#element-string)
This attribute may be escaped with `\u` notation, see [Escaping](#escaping).
**Example**
```xml
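<!-- Illustrative sketch: key ids, the marker name, and display strings are assumed examples -->
<displays>
    <display output="\u{0300}" display="ˋ"/>
    <display keyId="numeric" display="#"/>
    <display output="\m{acute}" display="´"/>
</displays>
<keys>
    <key id="grave" output="\u{0300}"/>
    <key id="marker" output="\m{acute}"/>
    <key id="numeric" layerId="numeric"/>
</keys>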
```
To allow `displays` elements to be shared across keyboards, there is no requirement that `@output` in a `display` element matches any `@output`/`@id` in any `keys/key` element in the keyboard description.
* * *
### Element: displayOptions
The `displayOptions` element is an optional singleton element providing additional settings for its parent `displays` element. It is structured so as to provide for future flexibility in such options.
**Syntax**
```xml
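<!-- Sketch: hints that `x` be shown instead of U+25CC as the base, as described for Lao below -->
<displayOptions baseCharacter="x"/>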
```
>
>
> Parents: [displays](#element-displays)
>
> Children: _none_
>
> Occurrence: optional, single
>
>
_Attribute:_ `baseCharacter` (optional)
**Note:** At present, this is the only option settable in the `displayOptions`.
> Some scripts/languages may prefer a different base than U+25CC.
> For Lao, for example, `x` is often used as a base instead of `◌`.
> Setting `baseCharacter="x"` (for example) is a _hint_ to the implementation which
> requests U+25CC to be substituted with `x` on display.
> As a hint, the implementation may ignore this option.
>
> **Note** that not all base characters will be suitable as bases for combining marks.
This attribute may be escaped with `\u` notation, see [Escaping](#escaping).
* * *
### Element: keys
This element defines the properties of all possible keys via [`<key>` elements](#element-key) used in all layouts.
It is a “bag of keys” without specifying any ordering or relation between the keys.
There is only a single `<keys>` element in each layout.
**Syntax**
```xml
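<!-- Sketch: a container for key elements; the key shown is an assumed example -->
<keys>
    <key id="a" output="a"/>
</keys>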
```
>
>
> Parents: [keyboard3](#element-keyboard3)
> Children: [key](#element-key)
> Occurrence: optional, single
>
>
* * *
### Element: key
This element defines a mapping between an abstract key and its output. This element must have the `keys` element as its parent. The `key` element is referenced by the `keys=` attribute of the [`row` element](#element-row).
**Syntax**
```xml
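<!-- Illustrative sketch, not a complete attribute list: ids and values are assumed examples -->
<key id="a" output="a" longPressKeyIds="a-grave a-acute" longPressDefaultKeyId="a-grave" width="1.0"/>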
```
>
>
> Parents: [keys](#element-keys)
>
> Children: _none_
>
> Occurrence: optional, multiple
>
**Note**: The `id` attribute is required.
**Note**: _at least one of_ `layerId`, `gap`, or `output` is required.
_Attribute:_ `id`
> The `id` attribute uniquely identifies the key. It is an NMTOKEN. It can be (but need not be) the key name (a, b, c, A, B, C, …), or any other valid token (e-acute, alef, alif, alpha, …).
>
> In the future, this attribute’s definition is expected to be updated to align with [UAX#31](https://www.unicode.org/reports/tr31/).
_Attribute:_ `flickId="…flickId"` (optional)
> The `flickId` attribute indicates that this key makes use of a [`flick`](#element-flick) set with the specified id.
_Attribute:_ `gap="true"` (optional)
> The `gap` attribute indicates that this key does not have any appearance, but causes a "gap" of the specified number of key widths. It can be used with `width` to set a width.
> Such elements may not be referred to by `display` elements, nor may they have any of the following attributes: `flickId`, `longPressKeyIds`, `longPressDefaultKeyId`, `multiTapKeyIds`, `layerId`, or `output`.
```xml
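<!-- Sketch with an assumed id: an invisible gap 1.5 key widths wide -->
<key id="mediumgap" gap="true" width="1.5"/>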
```
_Attribute:_ `output`
> The `output` attribute value contains the sequence of characters that is emitted when pressing this particular key. Control characters, whitespace (other than the regular space character), and combining marks in this attribute are escaped using the `\u{…}` notation. More than one key may produce the same output.
>
> The `output` attribute may also contain the `\m{…markerId}` syntax to insert a marker. See the definition of [markers](#markers).
_Attribute:_ `longPressKeyIds="…list of keyIds"` (optional)
> A space-separated ordered list of `key` element ids, identifying the keys which can be emitted by "long-pressing" this key. This feature is prominent in mobile devices.
>
> In a list of keys specified by `longPressKeyIds`, the key matching `longPressDefaultKeyId` attribute (if present) specifies the default long-press target, which could be different than the first element. It is an error if the `longPressDefaultKeyId` key is not in the `longPressKeyIds` list.
>
> Implementations shall ignore any gestures (such as flick, multiTap, longPress) defined on keys in the `longPressKeyIds` list.
>
> For example, if the default key is a key whose [display](#element-displays) value is `{`, an implementation might render the key as follows:
>
> 
>
> _Example:_
> - pressing the `o` key will produce `o`
> - holding down the key will produce a list `ó`, `{` (where `{` is the default and produces a marker)
>
> ```xml
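> <displays>
>     <display output="\m{marker}" display="{"/>
> </displays>
>
> <keys>
>     <key id="o" output="o" longPressKeyIds="o-acute marker" longPressDefaultKeyId="marker"/>
>     <key id="o-acute" output="ó"/>
>     <key id="marker" output="\m{marker}"/>
> </keys>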
> ```
_Attribute:_ `longPressDefaultKeyId="…keyId"` (optional)
> Specifies the default key, by id, in a list of long-press keys. See the discussion of `longPressKeyIds`, above.
_Attribute:_ `multiTapKeyIds` (optional)
> A space-separated ordered list of `key` element ids, where each successive key in the list is produced by the corresponding number of quick taps.
> It is an error for a key to reference itself in the `multiTapKeyIds` list.
>
> Implementations shall ignore any gestures (such as flick, multiTap, longPress) defined on keys in the `multiTapKeyIds` list.
>
> _Example:_
> - first tap on the key will produce “a”
> - two taps will produce “bb”
> - three taps on the key will produce “c”
> - four taps on the key will produce “d”
>
> ```xml
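> <keys>
>     <key id="a" output="a" multiTapKeyIds="bb c d"/>
>     <key id="bb" output="bb"/>
>     <key id="c" output="c"/>
>     <key id="d" output="d"/>
> </keys>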
> ```
**Note**: Behavior past the end of the multiTap list is implementation specific.
_Attribute:_ `stretch="true"` (optional)
> The `stretch` attribute indicates that a touch layout may stretch this key to fill available horizontal space on the row.
> This is used, for example, on the spacebar. Note that `stretch=` is ignored for hardware layouts.
_Attribute:_ `layerId="shift"` (optional)
> The `layerId` attribute indicates that this key switches to another `layer` with the specified id (such as `<layer id="shift">` in this example).
> Note that a key may have both a `layerId=` and an `output=` attribute, indicating that the key outputs _prior_ to switching layers.
> Also note that `layerId=` is ignored for hardware layouts: their shifting is controlled via
> the modifier keys.
>
> This attribute is an NMTOKEN.
>
> In the future, this attribute’s definition is expected to be updated to align with [UAX#31](https://www.unicode.org/reports/tr31/).
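
As a minimal sketch of a layer-switching key, assume a hypothetical key id `shift` and a layer whose id is also `shift`; both names are illustrative only.

```xml
<keys>
    <!-- No output: on a touch layout, this key only switches to the layer with id="shift" -->
    <key id="shift" layerId="shift"/>
</keys>
```
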
_Attribute:_ `width="1.2"` (optional, default "1.0")
> The `width` attribute indicates that this key has a different width than other keys, by the specified number of key widths.
```xml
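<!-- Sketch with an assumed id: a key 1.2 times as wide as a standard key -->
<key id="wide-a" output="a" width="1.2"/>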
```
##### Implied Keys
Not all keys need to be listed explicitly. The following two can be assumed to already exist:
```xml
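<!-- Sketch of the two implied keys, a gap key and a stretching space key; the ids and width values are assumptions -->
<key id="gap" gap="true" width="1"/>
<key id="space" output=" " stretch="true" width="1"/>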
```
In addition, these 62 keys, comprising 10 digit keys, 26 Latin lower-case keys, and 26 Latin upper-case keys, where the `id` is the same as the `output`, are assumed to exist:
```xml
<key id="0" output="0"/>
…
<key id="a" output="a"/>
…
<key id="A" output="A"/>
…
```
These implied keys are available in a data file named `keyboards/import/keys-Latn-implied.xml` in the CLDR distribution for the convenience of implementations.
Thus, the implied keys behave as if the following import were present.
```xml
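<!-- Sketch: the implied import, assuming CLDR major version 45 in the path -->
<keys>
    <import base="cldr" path="45/keys-Latn-implied.xml"/>
</keys>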
```
**Note:** All implied keys may be overridden, as with all other imported data items. See the [`import`](#element-import) element for more details.
* * *
### Element: flicks
The `flicks` element is a collection of `flick` elements.
>
>
> Parents: [keyboard3](#element-keyboard3)
>
> Children: [flick](#element-flick), [import](#element-import), [_special_](tr35.md#special)
>
> Occurrence: optional, single
>
* * *
#### Element: flick
The `flick` element is used to generate results from a "flick" of the finger on a mobile device.
**Syntax**
```xml
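<!-- Sketch with placeholder values: one flick set containing a single segment -->
<flick id="{flick id}">
    <flickSegment directions="{list of directions}" keyId="{key id}"/>
</flick>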
```
>
>
> Parents: [flicks](#element-flicks)
>
> Children: [flickSegment](#element-flicksegment), [_special_](tr35.md#special)
>
> Occurrence: optional, multiple
>
>
_Attribute:_ `id` (required)
> The `id` attribute identifies the flicks. It can be any NMTOKEN.
>
> The `id` attribute on `flick` elements is distinct from the `id` attribute on `key` elements.
> For example, it is permissible to have both `<key id="a" …/>` and
> `<flick id="a">`, which are two unrelated elements.
>
> In the future, this attribute’s definition is expected to be updated to align with [UAX#31](https://www.unicode.org/reports/tr31/).
* * *
#### Element: flickSegment
>
>
> Parents: [flick](#element-flick)
>
> Children: _none_
>
> Occurrence: required, multiple
>
>
_Attribute:_ `directions` (required)
> The `directions` attribute value is a space-delimited list of keywords that describe a path, currently restricted to the cardinal and intercardinal directions `{n e s w ne nw se sw}`.
_Attribute:_ `keyId` (required)
> The `keyId` attribute value is the result of (one or more) flicks.
>
> Implementations shall ignore any gestures (such as flick, multiTap, longPress) defined on the key specified by `keyId`.
**Example**
In this example, a flick to the northeast and then south produces `Å`.
```xml
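<!-- Illustrative sketch: the ids `a` and `A-ring` are assumed examples -->
<keys>
    <key id="a" flickId="a" output="a"/>
    <key id="A-ring" output="Å"/>
</keys>
<flicks>
    <flick id="a">
        <flickSegment directions="ne s" keyId="A-ring"/>
    </flick>
</flicks>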
```
* * *
### Element: forms
This element contains a set of `form` elements which define the layout of a particular hardware form.
>
>
> Parents: [keyboard3](#element-keyboard3)
>
> Children: [import](#element-import), [form](#element-form), [_special_](tr35.md#special)
>
> Occurrence: optional, single
>
>
***Syntax***
```xml
```
* * *
### Element: form
This element defines the layout of a particular hardware form.
> *Note:* Most keyboards will not need to use this element directly, and the CLDR repository will not accept keyboards which define a custom `form` element. This element is provided for two reasons:
1. To formally specify the standard hardware arrangements used with CLDR for implementations. Implementations can verify the arrangement, and validate keyboards against the number of rows and the number of keys per row.
2. To allow a way to customize the scancode layout for keyboards not intended to be included in the common CLDR repository.
See [Implied Form Values](#implied-form-values), below.
>
>
> Parents: [forms](#element-forms)
>
> Children: [scanCodes](#element-scancodes), [_special_](tr35.md#special)
>
> Occurrence: optional, multiple
>
>
_Attribute:_ `id` (required)
> This attribute specifies the form id. The value may not be `touch`.
> Must match `[A-Za-z0-9][A-Za-z0-9_-]*`
***Syntax***
```xml
```
##### Implied Form Values
There is an implied set of `