Name Date Size #Lines LOC

..--

cmake/H25-Apr-2025-2016

fuzz/H25-Apr-2025-3624

utf8_corpus_dir/H25-Apr-2025-515454

utf8_to_utf16/H25-Apr-2025-623483

Android.bpH A D25-Apr-2025693 3430

BUILD.bazelH A D25-Apr-20251.4 KiB6657

CMakeLists.txtH A D25-Apr-20252.4 KiB8570

LICENSEH A D25-Apr-20251.1 KiB2318

README.mdH A D25-Apr-202512.6 KiB265206

UTF-8-demo.txtH A D25-Apr-202513.7 KiB213162

ascii.cppH A D25-Apr-20255 KiB223163

boost.cppH A D25-Apr-2025387 1611

lemire-avx2.cH A D25-Apr-20258.7 KiB234165

lemire-neon.cH A D25-Apr-20257.3 KiB216144

lemire-sse.cH A D25-Apr-20257.3 KiB207139

lookup.cH A D25-Apr-20251.8 KiB4227

main.cH A D25-Apr-202511.1 KiB406349

naive.cH A D25-Apr-20253.9 KiB9344

range-avx2.cH A D25-Apr-202510.1 KiB278155

range-neon.cH A D25-Apr-20258.3 KiB229110

range-sse.cH A D25-Apr-20259.1 KiB256137

range2-neon.cH A D25-Apr-20255.5 KiB158114

range2-sse.cH A D25-Apr-20255.8 KiB171122

utf8_range.cH A D25-Apr-202517.5 KiB468200

utf8_range.hH A D25-Apr-2025540 2312

utf8_validity.ccH A D25-Apr-20251 KiB3712

utf8_validity.hH A D25-Apr-2025712 269

utf8_validity_test.ccH A D25-Apr-20253.5 KiB7753

README.md

1[![Build Status](https://travis-ci.com/cyb70289/utf8.svg?branch=master)](https://travis-ci.com/cyb70289/utf8)
2
3# Fast UTF-8 validation with Range algorithm (NEON+SSE4+AVX2)
4
5This is a brand new algorithm to leverage SIMD for fast UTF-8 string validation. Both **NEON**(armv8a) and **SSE4** versions are implemented. **AVX2** implementation contributed by [ioioioio](https://github.com/ioioioio).
6
7Four UTF-8 validation methods are compared on both x86 and Arm platforms. Benchmark result shows range base algorithm is the best solution on Arm, and achieves same performance as [Lemire's approach](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/) on x86.
8
9* Range based algorithm
10  * range-neon.c: NEON version
11  * range-sse.c: SSE4 version
12  * range-avx2.c: AVX2 version
13  * range2-neon.c, range2-sse.c: Process two blocks in one iteration
14* [Lemire's SIMD implementation](https://github.com/lemire/fastvalidate-utf-8)
15  * lemire-sse.c: SSE4 version
16  * lemire-avx2.c: AVX2 version
17  * lemire-neon.c: NEON porting
18* naive.c: Naive UTF-8 validation byte by byte
19* lookup.c: [Lookup-table method](http://bjoern.hoehrmann.de/utf-8/decoder/dfa/)
20
21## About the code
22
23* Run "make" to build. Built and tested with gcc-7.3.
24* Run "./utf8" to see all command line options.
25* Benchmark
26  * Run "./utf8 bench" to bechmark all algorithms with [default test file](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt).
27  * Run "./utf8 bench size NUM" to benchmark specified string size.
28* Run "./utf8 test" to test all algorithms with positive and negative test cases.
29* To benchmark or test specific algorithm, run something like "./utf8 bench range".
30
31## Benchmark result (MB/s)
32
33### Method
341. Generate UTF-8 test buffer per [test file](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) or buffer size.
351. Call validation sub-routines in a loop until 1G bytes are checked.
361. Calculate speed(MB/s) of validating UTF-8 strings.
37
38### NEON(armv8a)
39Test case | naive | lookup | lemire | range | range2
40:-------- | :---- | :----- | :----- | :---- | :-----
41[UTF-demo.txt](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) | 562.25 | 412.84 | 1198.50 | 1411.72 | **1579.85**
4232 bytes | 651.55 | 441.70 | 891.38 | 1003.95 | **1043.58**
4333 bytes | 660.00 | 446.78 | 588.77 | 1009.31 | **1048.12**
44129 bytes | 771.89 | 402.55 | 938.07 | 1283.77 | **1401.76**
451K bytes | 811.92 | 411.58 | 1188.96 | 1398.15 | **1560.23**
468K bytes | 812.25  | 412.74 | 1198.90 | 1412.18 | **1580.65**
4764K bytes | 817.35 | 412.24 | 1200.20 | 1415.11 | **1583.86**
481M bytes | 815.70  | 411.93 | 1200.93 | 1415.65 | **1585.40**
49
50### SSE4(E5-2650)
51Test case | naive | lookup | lemire | range | range2
52:-------- | :---- | :----- | :----- | :---- | :-----
53[UTF-demo.txt](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) | 753.70 | 310.41 | 3954.74 | 3945.60 | **3986.13**
5432 bytes | 1135.76 | 364.07 | **2890.52** | 2351.81 | 2173.02
5533 bytes | 1161.85 | 376.29 | 1352.95 | **2239.55** | 2041.43
56129 bytes | 1161.22 | 322.47 | 2742.49 | **3315.33** | 3249.35
571K bytes | 1310.95 | 310.72 | 3755.88 | 3781.23 | **3874.17**
588K bytes | 1348.32 | 307.93 | 3860.71 | 3922.81 | **3968.93**
5964K bytes | 1301.34 | 308.39 | 3935.15 | 3973.50 | **3983.44**
601M bytes | 1279.78 | 309.06 | 3923.51 | 3953.00 | **3960.49**
61
62## Range algorithm analysis
63
64Basic idea:
65* Load 16 bytes
66* Leverage SIMD to calculate value range for each byte efficiently
67* Validate 16 bytes at once
68
69### UTF-8 coding format
70
71http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf, page 94
72
73Table 3-7. Well-Formed UTF-8 Byte Sequences
74
75Code Points        | First Byte | Second Byte | Third Byte | Fourth Byte |
76:----------        | :--------- | :---------- | :--------- | :---------- |
77U+0000..U+007F     | 00..7F     |             |            |             |
78U+0080..U+07FF     | C2..DF     | 80..BF      |            |             |
79U+0800..U+0FFF     | E0         | ***A0***..BF| 80..BF     |             |
80U+1000..U+CFFF     | E1..EC     | 80..BF      | 80..BF     |             |
81U+D000..U+D7FF     | ED         | 80..***9F***| 80..BF     |             |
82U+E000..U+FFFF     | EE..EF     | 80..BF      | 80..BF     |             |
83U+10000..U+3FFFF   | F0         | ***90***..BF| 80..BF     | 80..BF      |
84U+40000..U+FFFFF   | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
85U+100000..U+10FFFF | F4         | 80..***8F***| 80..BF     | 80..BF      |
86
87To summarise UTF-8 encoding:
88* Depending on First Byte, one legal character can be 1, 2, 3, 4 bytes
89  * For First Byte within C0..DF, character length = 2
90  * For First Byte within E0..EF, character length = 3
91  * For First Byte within F0..F4, character length = 4
92* C0, C1, F5..FF are not allowed
93* Second,Third,Fourth Bytes must lie in 80..BF.
94* There are four **special cases** for Second Byte, shown ***bold italic*** in above table.
95
96### Range table
97
98Range table maps range index 0 ~ 15 to minimal and maximum values allowed. Our task is to observe input string, find the pattern and set correct range index for each byte, then validate input string.
99
100Index      | Min | Max | Byte type
101:----      | :-- | :-- | :--------
1020          | 00  | 7F  | First Byte, ASCII
1031,2,3      | 80  | BF  | Second, Third, Fourth Bytes
1044          | A0  | BF  | Second Byte after E0
1055          | 80  | 9F  | Second Byte after ED
1066          | 90  | BF  | Second Byte after F0
1077          | 80  | 8F  | Second Byte after F4
1088          | C2  | F4  | First Byte, non-ASCII
1099..15(NEON) | FF  | 00  | Illegal: unsigned char >= 255 && unsigned char <= 0
1109..15(SSE)  | 7F  | 80  | Illegal: signed char >= 127 && signed char <= -128
111
112### Calculate byte ranges (ignore special cases)
113
114Ignoring the four special cases(E0,ED,F0,F4), how should we set range index for each byte?
115
116* Set range index to 0(00..7F) for all bytes by default
117* Find non-ASCII First Byte (C0..FF), set their range index to 8(C2..F4)
118* For First Byte within C0..DF, set next byte's range index to 1(80..BF)
119* For First Byte within E0..EF, set next two byte's range index to 2,1(80..BF) in sequence
120* For First Byte within F0..FF, set next three byte's range index to 3,2,1(80..BF) in sequence
121
122To implement above operations efficiently with SIMD:
123* For 16 input bytes, use lookup table to map C0..DF to 1, E0..EF to 2, F0..FF to 3, others to 0. Save to first_len.
124* Map C0..FF to 8, we get range indices for First Byte.
125* Shift first_len one byte, we get range indices for Second Byte.
126* Saturate substract first_len by one(3->2, 2->1, 1->0, 0->0), then shift two bytes, we get range indices for Third Byte.
127* Saturate substract first_len by two(3->1, 2->0, 1->0, 0->0), then shift three bytes, we get range indices for Fourth Byte.
128
129Example(assume no previous data)
130
131Input       | F1 | 80 | 80 | 80 | 80 | C2 | 80 | 80 | ...
132:----       | :- | :- | :- | :- | :- | :- | :- | :- | :--
133*first_len* |*3* |*0* |*0* |*0* |*0* |*1* |*0* |*0* |*...*
134First Byte  | 8  | 0  | 0  | 0  | 0  | 8  | 0  | 0  | ...
135Second Byte | 0  | 3  | 0  | 0  | 0  | 0  | 1  | 0  | ...
136Third Byte  | 0  | 0  | 2  | 0  | 0  | 0  | 0  | 0  | ...
137Fourth Byte | 0  | 0  | 0  | 1  | 0  | 0  | 0  | 0  | ...
138Range index | 8  | 3  | 2  | 1  | 0  | 8  | 1  | 0  | ...
139
140```c
141Range_index = First_Byte | Second_Byte | Third_Byte | Fourth_Byte
142```
143
144#### Error handling
145
146* C0,C1,F5..FF are not included in range table and will always be detected.
147* Illegal 80..BF will have range index 0(00..7F) and be detected.
148* Based on First Byte, according Second, Third and Fourth Bytes will have range index 1/2/3, to make sure they must lie in 80..BF.
149* If non-ASCII First Byte overlaps, above algorithm will set range index of the latter First Byte to 9,10,11, which are illegal ranges. E.g, Input = F1 80 C2 90 --> Range index = 8 3 10 1, where 10 indicates error. See table below.
150
151Overlapped non-ASCII First Byte
152
153Input       | F1 | 80 | C2 | 90
154:----       | :- | :- | :- | :-
155*first_len* |*3* |*0* |*1* |*0*
156First Byte  | 8  | 0  | 8  | 0
157Second Byte | 0  | 3  | 0  | 1
158Third Byte  | 0  | 0  | 2  | 0
159Fourth Byte | 0  | 0  | 0  | 1
160Range index | 8  | 3  |***10***| 1
161
162### Adjust Second Byte range for special cases
163
164Range index adjustment for four special cases
165
166First Byte | Second Byte | Before adjustment | Correct index | Adjustment |
167:--------- | :---------- | :---------------- | :------------ | :---------
168E0         | A0..BF      | 2                 | 4             | **2**
169ED         | 80..9F      | 2                 | 5             | **3**
170F0         | 90..BF      | 3                 | 6             | **3**
171F4         | 80..8F      | 3                 | 7             | **4**
172
173Range index adjustment can be reduced to below problem:
174
175***Given 16 bytes, replace E0 with 2, ED with 3, F0 with 3, F4 with 4, others with 0.***
176
177A naive SIMD approach:
1781. Compare 16 bytes with E0, get the mask for eacy byte (FF if equal, 00 otherwise)
1791. And the mask with 2 to get adjustment for E0
1801. Repeat step 1,2 for ED,F0,F4
181
182At least **eight** operations are required for naive approach.
183
184Observing special bytes(E0,ED,F0,F4) are close to each other, we can do much better using lookup table.
185
186#### NEON
187
188NEON ```tbl``` instruction is very convenient for table lookup:
189* Table can be up to 16x4 bytes in size
190* Return zero if index is out of range
191
192Leverage these features, we can solve the problem with as few as **two** operations:
193* Precreate a 16x2 lookup table, where table[0]=2, table[13]=3, table[16]=3, table[20]=4, table[others]=0.
194* Substract input bytes with E0 (E0 -> 0, ED -> 13, F0 -> 16, F4 -> 20).
195* Use the substracted byte as index of lookup table and get range adjustment directly.
196  * For indices less than 32, we get zero or required adjustment value per input byte
197  * For out of bound indices, we get zero per ```tbl``` behaviour
198
199#### SSE
200
201SSE ```pshufb``` instruction is not as friendly as NEON ```tbl``` in this case:
202* Table can only be 16 bytes in size
203* Out of bound indices are handled this way:
204  * If 7-th bit of index is 0, least four bits are used as index (E.g, index 0x73 returns 3rd element)
205  * If 7-th bit of index is 1, return 0 (E.g, index 0x83 returns 0)
206
207We can still leverage these features to solve the problem in **five** operations:
208* Precreate two tables:
209  * table_df[1] = 2, table_df[14] = 3, table_df[others] = 0
210  * table_ef[1] = 3, table_ef[5] = 4, table_ef[others] = 0
211* Substract input bytes with EF (E0 -> 241, ED -> 254, F0 -> 1, F4 -> 5) to get the temporary indices
212* Get range index for E0,ED
213  * Saturate substract temporary indices with 240 (E0 -> 1, ED -> 14, all values below 240 becomes 0)
214  * Use substracted indices to look up table_df, get the correct adjustment
215* Get range index for F0,F4
216  * Saturate add temporary indices with 112(0x70) (F0 -> 0x71, F4 -> 0x75, all values above 16 will be larger than 128(7-th bit set))
217  * Use added indices to look up table_ef, get the correct adjustment (index 0x71,0x75 returns 1st,5th elements, per ```pshufb``` behaviour)
218
219#### Error handling
220
221* For overlapped non-ASCII First Byte, range index before adjustment is 9,10,11. After adjustment (adds 2,3,4 or 0), the range index will be 9 to 15, which is still illegal in range table. So the error will be detected.
222
223### Handling remaining bytes
224
225For remaining input less than 16 bytes, we will fallback to naive byte by byte approach to validate them, which is actually faster than SIMD processing.
226* Look back last 16 bytes buffer to find First Byte. At most three bytes need to look back. Otherwise we either happen to be at character boundray, or there are some errors we already detected.
227* Validate string byte by byte starting from the First Byte.
228
229## Tests
230
231It's necessary to design test cases to cover corner cases as more as possible.
232
233### Positive cases
234
2351. Prepare correct characters
2362. Validate correct characters
2373. Validate long strings
238   * Round concatenate characters starting from first character to 1024 bytes
239   * Validate 1024 bytes string
240   * Shift 1 byte, validate 1025 bytes string
241   * Shift 2 bytes, Validate 1026 bytes string
242   * ...
243   * Shift 16 bytes, validate 1040 bytes string
2444. Repeat step3, test buffer starting from second character
2455. Repeat step3, test buffer starting from third character
2466. ...
247
248### Negative cases
249
2501. Prepare bad characters and bad strings
251   * Bad character
252   * Bad character cross 16 bytes boundary
253   * Bad character cross last 16 bytes and remaining bytes boundary
2542. Test long strings
255   * Prepare correct long strings same as positive cases
256   * Append bad characters
257   * Shift one byte for each iteration
258   * Validate each shift
259
260## Code breakdown
261
262Below table shows how 16 bytes input are processed step by step. See [range-neon.c](range-neon.c) for according code.
263
264![Range based UTF-8 validation algorithm](https://raw.githubusercontent.com/cyb70289/utf8/master/range.png)
265