xref: /aosp_15_r20/external/cronet/net/tools/dafsa/make_dafsa.py (revision 6777b5387eb2ff775bb5750e3f5d96f37fb7352b)
1*6777b538SAndroid Build Coastguard Worker#!/usr/bin/env python3
2*6777b538SAndroid Build Coastguard Worker# Copyright 2014 The Chromium Authors
3*6777b538SAndroid Build Coastguard Worker# Use of this source code is governed by a BSD-style license that can be
4*6777b538SAndroid Build Coastguard Worker# found in the LICENSE file.
5*6777b538SAndroid Build Coastguard Worker
6*6777b538SAndroid Build Coastguard Worker"""
7*6777b538SAndroid Build Coastguard WorkerA Deterministic acyclic finite state automaton (DAFSA) is a compact
8*6777b538SAndroid Build Coastguard Workerrepresentation of an unordered word list (dictionary).
9*6777b538SAndroid Build Coastguard Worker
10*6777b538SAndroid Build Coastguard Workerhttps://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
11*6777b538SAndroid Build Coastguard Worker
12*6777b538SAndroid Build Coastguard WorkerThis python program converts a list of strings to a byte array in C++.
13*6777b538SAndroid Build Coastguard WorkerThis python program fetches strings and return values from a gperf file
14*6777b538SAndroid Build Coastguard Workerand generates a C++ file with a byte array representing graph that can be
15*6777b538SAndroid Build Coastguard Workerused as a memory efficient replacement for the perfect hash table.
16*6777b538SAndroid Build Coastguard Worker
17*6777b538SAndroid Build Coastguard WorkerThe input strings are assumed to consist of printable 7-bit ASCII characters
18*6777b538SAndroid Build Coastguard Workerand the return values are assumed to be one digit integers.
19*6777b538SAndroid Build Coastguard Worker
20*6777b538SAndroid Build Coastguard WorkerIn this program a DAFSA is a diamond shaped graph starting at a common
21*6777b538SAndroid Build Coastguard Workersource node and ending at a common sink node. All internal nodes contain
22*6777b538SAndroid Build Coastguard Workera label and each word is represented by the labels in one path from
23*6777b538SAndroid Build Coastguard Workerthe source node to the sink node.
24*6777b538SAndroid Build Coastguard Worker
25*6777b538SAndroid Build Coastguard WorkerThe following python represention is used for nodes:
26*6777b538SAndroid Build Coastguard Worker
27*6777b538SAndroid Build Coastguard Worker  Source node: [ children ]
28*6777b538SAndroid Build Coastguard Worker  Internal node: (label, [ children ])
29*6777b538SAndroid Build Coastguard Worker  Sink node: None
30*6777b538SAndroid Build Coastguard Worker
31*6777b538SAndroid Build Coastguard WorkerThe graph is first compressed by prefixes like a trie. In the next step
32*6777b538SAndroid Build Coastguard Workersuffixes are compressed so that the graph gets diamond shaped. Finally
33*6777b538SAndroid Build Coastguard Workerone to one linked nodes are replaced by nodes with the labels joined.
34*6777b538SAndroid Build Coastguard Worker
35*6777b538SAndroid Build Coastguard WorkerThe order of the operations is crucial since lookups will be performed
36*6777b538SAndroid Build Coastguard Workerstarting from the source with no backtracking. Thus a node must have at
37*6777b538SAndroid Build Coastguard Workermost one child with a label starting by the same character. The output
38*6777b538SAndroid Build Coastguard Workeris also arranged so that all jumps are to increasing addresses, thus forward
39*6777b538SAndroid Build Coastguard Workerin memory.
40*6777b538SAndroid Build Coastguard Worker
41*6777b538SAndroid Build Coastguard WorkerThe generated output has suffix free decoding so that the sign of leading
42*6777b538SAndroid Build Coastguard Workerbits in a link (a reference to a child node) indicate if it has a size of one,
43*6777b538SAndroid Build Coastguard Workertwo or three bytes and if it is the last outgoing link from the actual node.
44*6777b538SAndroid Build Coastguard WorkerA node label is terminated by a byte with the leading bit set.
45*6777b538SAndroid Build Coastguard Worker
46*6777b538SAndroid Build Coastguard WorkerThe generated byte array can described by the following BNF:
47*6777b538SAndroid Build Coastguard Worker
48*6777b538SAndroid Build Coastguard Worker<byte> ::= < 8-bit value in range [0x00-0xFF] >
49*6777b538SAndroid Build Coastguard Worker
50*6777b538SAndroid Build Coastguard Worker<char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
51*6777b538SAndroid Build Coastguard Worker<end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
52*6777b538SAndroid Build Coastguard Worker<return value> ::= < value + 0x80, byte in range [0x80-0x8F] >
53*6777b538SAndroid Build Coastguard Worker
54*6777b538SAndroid Build Coastguard Worker<offset1> ::= < byte in range [0x00-0x3F] >
55*6777b538SAndroid Build Coastguard Worker<offset2> ::= < byte in range [0x40-0x5F] >
56*6777b538SAndroid Build Coastguard Worker<offset3> ::= < byte in range [0x60-0x7F] >
57*6777b538SAndroid Build Coastguard Worker
58*6777b538SAndroid Build Coastguard Worker<end_offset1> ::= < byte in range [0x80-0xBF] >
59*6777b538SAndroid Build Coastguard Worker<end_offset2> ::= < byte in range [0xC0-0xDF] >
60*6777b538SAndroid Build Coastguard Worker<end_offset3> ::= < byte in range [0xE0-0xFF] >
61*6777b538SAndroid Build Coastguard Worker
62*6777b538SAndroid Build Coastguard Worker<prefix> ::= <char>
63*6777b538SAndroid Build Coastguard Worker
64*6777b538SAndroid Build Coastguard Worker<label> ::= <end_char>
65*6777b538SAndroid Build Coastguard Worker          | <char> <label>
66*6777b538SAndroid Build Coastguard Worker
67*6777b538SAndroid Build Coastguard Worker<end_label> ::= <return_value>
68*6777b538SAndroid Build Coastguard Worker          | <char> <end_label>
69*6777b538SAndroid Build Coastguard Worker
70*6777b538SAndroid Build Coastguard Worker<offset> ::= <offset1>
71*6777b538SAndroid Build Coastguard Worker           | <offset2> <byte>
72*6777b538SAndroid Build Coastguard Worker           | <offset3> <byte> <byte>
73*6777b538SAndroid Build Coastguard Worker
74*6777b538SAndroid Build Coastguard Worker<end_offset> ::= <end_offset1>
75*6777b538SAndroid Build Coastguard Worker               | <end_offset2> <byte>
76*6777b538SAndroid Build Coastguard Worker               | <end_offset3> <byte> <byte>
77*6777b538SAndroid Build Coastguard Worker
78*6777b538SAndroid Build Coastguard Worker<offsets> ::= <end_offset>
79*6777b538SAndroid Build Coastguard Worker            | <offset> <offsets>
80*6777b538SAndroid Build Coastguard Worker
81*6777b538SAndroid Build Coastguard Worker<source> ::= <offsets>
82*6777b538SAndroid Build Coastguard Worker
83*6777b538SAndroid Build Coastguard Worker<node> ::= <label> <offsets>
84*6777b538SAndroid Build Coastguard Worker         | <prefix> <node>
85*6777b538SAndroid Build Coastguard Worker         | <end_label>
86*6777b538SAndroid Build Coastguard Worker
87*6777b538SAndroid Build Coastguard Worker<dafsa> ::= <source>
88*6777b538SAndroid Build Coastguard Worker          | <dafsa> <node>
89*6777b538SAndroid Build Coastguard Worker
90*6777b538SAndroid Build Coastguard WorkerDecoding:
91*6777b538SAndroid Build Coastguard Worker
92*6777b538SAndroid Build Coastguard Worker<char> -> printable 7-bit ASCII character
93*6777b538SAndroid Build Coastguard Worker<end_char> & 0x7F -> printable 7-bit ASCII character
94*6777b538SAndroid Build Coastguard Worker<return value> & 0x0F -> integer
95*6777b538SAndroid Build Coastguard Worker<offset1 & 0x3F> -> integer
96*6777b538SAndroid Build Coastguard Worker((<offset2> & 0x1F>) << 8) + <byte> -> integer
97*6777b538SAndroid Build Coastguard Worker((<offset3> & 0x1F>) << 16) + (<byte> << 8) + <byte> -> integer
98*6777b538SAndroid Build Coastguard Worker
99*6777b538SAndroid Build Coastguard Workerend_offset1, end_offset2 and and_offset3 are decoded same as offset1,
100*6777b538SAndroid Build Coastguard Workeroffset2 and offset3 respectively.
101*6777b538SAndroid Build Coastguard Worker
102*6777b538SAndroid Build Coastguard WorkerThe first offset in a list of offsets is the distance in bytes between the
103*6777b538SAndroid Build Coastguard Workeroffset itself and the first child node. Subsequent offsets are the distance
104*6777b538SAndroid Build Coastguard Workerbetween previous child node and next child node. Thus each offset links a node
105*6777b538SAndroid Build Coastguard Workerto a child node. The distance is always counted between start addresses, i.e.
106*6777b538SAndroid Build Coastguard Workerfirst byte in decoded offset or first byte in child node.
107*6777b538SAndroid Build Coastguard Worker
108*6777b538SAndroid Build Coastguard WorkerExample 1:
109*6777b538SAndroid Build Coastguard Worker
110*6777b538SAndroid Build Coastguard Worker%%
111*6777b538SAndroid Build Coastguard Workeraa, 1
112*6777b538SAndroid Build Coastguard Workera, 2
113*6777b538SAndroid Build Coastguard Worker%%
114*6777b538SAndroid Build Coastguard Worker
115*6777b538SAndroid Build Coastguard WorkerThe input is first parsed to a list of words:
116*6777b538SAndroid Build Coastguard Worker["aa1", "a2"]
117*6777b538SAndroid Build Coastguard Worker
118*6777b538SAndroid Build Coastguard WorkerA fully expanded graph is created from the words:
119*6777b538SAndroid Build Coastguard Workersource = [node1, node4]
120*6777b538SAndroid Build Coastguard Workernode1 = ("a", [node2])
121*6777b538SAndroid Build Coastguard Workernode2 = ("a", [node3])
122*6777b538SAndroid Build Coastguard Workernode3 = ("\x01", [sink])
123*6777b538SAndroid Build Coastguard Workernode4 = ("a", [node5])
124*6777b538SAndroid Build Coastguard Workernode5 = ("\x02", [sink])
125*6777b538SAndroid Build Coastguard Workersink = None
126*6777b538SAndroid Build Coastguard Worker
127*6777b538SAndroid Build Coastguard WorkerCompression results in the following graph:
128*6777b538SAndroid Build Coastguard Workersource = [node1]
129*6777b538SAndroid Build Coastguard Workernode1 = ("a", [node2, node3])
130*6777b538SAndroid Build Coastguard Workernode2 = ("\x02", [sink])
131*6777b538SAndroid Build Coastguard Workernode3 = ("a\x01", [sink])
132*6777b538SAndroid Build Coastguard Workersink = None
133*6777b538SAndroid Build Coastguard Worker
134*6777b538SAndroid Build Coastguard WorkerA C++ representation of the compressed graph is generated:
135*6777b538SAndroid Build Coastguard Worker
136*6777b538SAndroid Build Coastguard Workerconst unsigned char dafsa[7] = {
137*6777b538SAndroid Build Coastguard Worker  0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
138*6777b538SAndroid Build Coastguard Worker};
139*6777b538SAndroid Build Coastguard Worker
140*6777b538SAndroid Build Coastguard WorkerThe bytes in the generated array has the following meaning:
141*6777b538SAndroid Build Coastguard Worker
142*6777b538SAndroid Build Coastguard Worker 0: 0x81 <end_offset1>  child at position 0 + (0x81 & 0x3F) -> jump to 1
143*6777b538SAndroid Build Coastguard Worker
144*6777b538SAndroid Build Coastguard Worker 1: 0xE1 <end_char>     label character (0xE1 & 0x7F) -> match "a"
145*6777b538SAndroid Build Coastguard Worker 2: 0x02 <offset1>      child at position 2 + (0x02 & 0x3F) -> jump to 4
146*6777b538SAndroid Build Coastguard Worker
147*6777b538SAndroid Build Coastguard Worker 3: 0x81 <end_offset1>  child at position 4 + (0x81 & 0x3F) -> jump to 5
148*6777b538SAndroid Build Coastguard Worker 4: 0x82 <return_value> 0x82 & 0x0F -> return 2
149*6777b538SAndroid Build Coastguard Worker
150*6777b538SAndroid Build Coastguard Worker 5: 0x61 <char>         label character 0x61 -> match "a"
151*6777b538SAndroid Build Coastguard Worker 6: 0x81 <return_value> 0x81 & 0x0F -> return 1
152*6777b538SAndroid Build Coastguard Worker
153*6777b538SAndroid Build Coastguard WorkerExample 2:
154*6777b538SAndroid Build Coastguard Worker
155*6777b538SAndroid Build Coastguard Worker%%
156*6777b538SAndroid Build Coastguard Workeraa, 1
157*6777b538SAndroid Build Coastguard Workerbbb, 2
158*6777b538SAndroid Build Coastguard Workerbaa, 1
159*6777b538SAndroid Build Coastguard Worker%%
160*6777b538SAndroid Build Coastguard Worker
161*6777b538SAndroid Build Coastguard WorkerThe input is first parsed to a list of words:
162*6777b538SAndroid Build Coastguard Worker["aa1", "bbb2", "baa1"]
163*6777b538SAndroid Build Coastguard Worker
164*6777b538SAndroid Build Coastguard WorkerCompression results in the following graph:
165*6777b538SAndroid Build Coastguard Workersource = [node1, node2]
166*6777b538SAndroid Build Coastguard Workernode1 = ("b", [node2, node3])
167*6777b538SAndroid Build Coastguard Workernode2 = ("aa\x01", [sink])
168*6777b538SAndroid Build Coastguard Workernode3 = ("bb\x02", [sink])
169*6777b538SAndroid Build Coastguard Workersink = None
170*6777b538SAndroid Build Coastguard Worker
171*6777b538SAndroid Build Coastguard WorkerA C++ representation of the compressed graph is generated:
172*6777b538SAndroid Build Coastguard Worker
173*6777b538SAndroid Build Coastguard Workerconst unsigned char dafsa[11] = {
174*6777b538SAndroid Build Coastguard Worker  0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
175*6777b538SAndroid Build Coastguard Worker};
176*6777b538SAndroid Build Coastguard Worker
177*6777b538SAndroid Build Coastguard WorkerThe bytes in the generated array has the following meaning:
178*6777b538SAndroid Build Coastguard Worker
179*6777b538SAndroid Build Coastguard Worker 0: 0x02 <offset1>      child at position 0 + (0x02 & 0x3F) -> jump to 2
180*6777b538SAndroid Build Coastguard Worker 1: 0x83 <end_offset1>  child at position 2 + (0x83 & 0x3F) -> jump to 5
181*6777b538SAndroid Build Coastguard Worker
182*6777b538SAndroid Build Coastguard Worker 2: 0xE2 <end_char>     label character (0xE2 & 0x7F) -> match "b"
183*6777b538SAndroid Build Coastguard Worker 3: 0x02 <offset1>      child at position 3 + (0x02 & 0x3F) -> jump to 5
184*6777b538SAndroid Build Coastguard Worker 4: 0x83 <end_offset1>  child at position 5 + (0x83 & 0x3F) -> jump to 8
185*6777b538SAndroid Build Coastguard Worker
186*6777b538SAndroid Build Coastguard Worker 5: 0x61 <char>         label character 0x61 -> match "a"
187*6777b538SAndroid Build Coastguard Worker 6: 0x61 <char>         label character 0x61 -> match "a"
188*6777b538SAndroid Build Coastguard Worker 7: 0x81 <return_value> 0x81 & 0x0F -> return 1
189*6777b538SAndroid Build Coastguard Worker
190*6777b538SAndroid Build Coastguard Worker 8: 0x62 <char>         label character 0x62 -> match "b"
191*6777b538SAndroid Build Coastguard Worker 9: 0x62 <char>         label character 0x62 -> match "b"
192*6777b538SAndroid Build Coastguard Worker10: 0x82 <return_value> 0x82 & 0x0F -> return 2
193*6777b538SAndroid Build Coastguard Worker"""
194*6777b538SAndroid Build Coastguard Worker
195*6777b538SAndroid Build Coastguard Workerimport argparse
196*6777b538SAndroid Build Coastguard Workerimport sys
197*6777b538SAndroid Build Coastguard Worker
198*6777b538SAndroid Build Coastguard Workerfrom typing import Any, Dict, FrozenSet, Iterable, List, MutableSequence, Sequence, Tuple, Union
199*6777b538SAndroid Build Coastguard Worker
200*6777b538SAndroid Build Coastguard Worker# Use of Any below is because mypy doesn't support recursive types.
201*6777b538SAndroid Build Coastguard WorkerSinkNode = Union[None, None]  # weird hack to get around lack of TypeAlias.
202*6777b538SAndroid Build Coastguard WorkerInteriorNode = Tuple[str, List[Any]]
203*6777b538SAndroid Build Coastguard WorkerSourceNode = List[Any]
204*6777b538SAndroid Build Coastguard Worker
205*6777b538SAndroid Build Coastguard WorkerNonSinkNode = Union[InteriorNode, SourceNode]
206*6777b538SAndroid Build Coastguard WorkerNode = Union[SinkNode, InteriorNode, SourceNode]
207*6777b538SAndroid Build Coastguard WorkerDAFSA = List[Node]
208*6777b538SAndroid Build Coastguard Worker
209*6777b538SAndroid Build Coastguard Worker
210*6777b538SAndroid Build Coastguard Workerclass InputError(Exception):
211*6777b538SAndroid Build Coastguard Worker  """Exception raised for errors in the input file."""
212*6777b538SAndroid Build Coastguard Worker
213*6777b538SAndroid Build Coastguard Worker
214*6777b538SAndroid Build Coastguard Workerdef to_dafsa(words: Iterable[str]) -> DAFSA:
215*6777b538SAndroid Build Coastguard Worker  """Generates a DAFSA from a word list and returns the source nodes.
216*6777b538SAndroid Build Coastguard Worker
217*6777b538SAndroid Build Coastguard Worker  Each word is split into characters so that each character is represented by
218*6777b538SAndroid Build Coastguard Worker  a unique node. It is assumed the word list is not empty.
219*6777b538SAndroid Build Coastguard Worker  """
220*6777b538SAndroid Build Coastguard Worker  if not words:
221*6777b538SAndroid Build Coastguard Worker    raise InputError('The domain list must not be empty')
222*6777b538SAndroid Build Coastguard Worker
223*6777b538SAndroid Build Coastguard Worker  def ToNodes(word: str) -> Node:
224*6777b538SAndroid Build Coastguard Worker    """Split words into characters"""
225*6777b538SAndroid Build Coastguard Worker    if not 0x1F < ord(word[0]) < 0x80:
226*6777b538SAndroid Build Coastguard Worker      raise InputError('Domain names must be printable 7-bit ASCII')
227*6777b538SAndroid Build Coastguard Worker    if len(word) == 1:
228*6777b538SAndroid Build Coastguard Worker      return chr(ord(word[0]) & 0x0F), [None]
229*6777b538SAndroid Build Coastguard Worker    return word[0], [ToNodes(word[1:])]
230*6777b538SAndroid Build Coastguard Worker  return [ToNodes(word) for word in words]
231*6777b538SAndroid Build Coastguard Worker
232*6777b538SAndroid Build Coastguard Worker
233*6777b538SAndroid Build Coastguard Workerdef to_words(node: Node) -> Iterable[str]:
234*6777b538SAndroid Build Coastguard Worker  """Generates a word list from all paths starting from an internal node."""
235*6777b538SAndroid Build Coastguard Worker  if not node:
236*6777b538SAndroid Build Coastguard Worker    return ['']
237*6777b538SAndroid Build Coastguard Worker  return [(node[0] + word) for child in node[1] for word in to_words(child)]
238*6777b538SAndroid Build Coastguard Worker
239*6777b538SAndroid Build Coastguard Worker
240*6777b538SAndroid Build Coastguard Workerdef reverse(dafsa: DAFSA) -> DAFSA:
241*6777b538SAndroid Build Coastguard Worker  """Generates a new DAFSA that is reversed, so that the old sink node becomes
242*6777b538SAndroid Build Coastguard Worker  the new source node.
243*6777b538SAndroid Build Coastguard Worker  """
244*6777b538SAndroid Build Coastguard Worker  sink: SourceNode = []
245*6777b538SAndroid Build Coastguard Worker  nodemap: Dict[int, InteriorNode] = {}
246*6777b538SAndroid Build Coastguard Worker
247*6777b538SAndroid Build Coastguard Worker  def dfs(node: Node, parent: Node) -> None:
248*6777b538SAndroid Build Coastguard Worker    """Creates reverse nodes.
249*6777b538SAndroid Build Coastguard Worker
250*6777b538SAndroid Build Coastguard Worker    A new reverse node will be created for each old node. The new node will
251*6777b538SAndroid Build Coastguard Worker    get a reversed label and the parents of the old node as children.
252*6777b538SAndroid Build Coastguard Worker    """
253*6777b538SAndroid Build Coastguard Worker    if not node:
254*6777b538SAndroid Build Coastguard Worker      sink.append(parent)
255*6777b538SAndroid Build Coastguard Worker    elif id(node) not in nodemap:
256*6777b538SAndroid Build Coastguard Worker      nodemap[id(node)] = (node[0][::-1], [parent])
257*6777b538SAndroid Build Coastguard Worker      for child in node[1]:
258*6777b538SAndroid Build Coastguard Worker        dfs(child, nodemap[id(node)])
259*6777b538SAndroid Build Coastguard Worker    else:
260*6777b538SAndroid Build Coastguard Worker      nodemap[id(node)][1].append(parent)
261*6777b538SAndroid Build Coastguard Worker
262*6777b538SAndroid Build Coastguard Worker  for node in dafsa:
263*6777b538SAndroid Build Coastguard Worker    dfs(node, None)
264*6777b538SAndroid Build Coastguard Worker  return sink
265*6777b538SAndroid Build Coastguard Worker
266*6777b538SAndroid Build Coastguard Worker
267*6777b538SAndroid Build Coastguard Workerdef join_labels(dafsa: DAFSA) -> DAFSA:
268*6777b538SAndroid Build Coastguard Worker  """Generates a new DAFSA where internal nodes are merged if there is a one to
269*6777b538SAndroid Build Coastguard Worker  one connection.
270*6777b538SAndroid Build Coastguard Worker  """
271*6777b538SAndroid Build Coastguard Worker  parentcount: Dict[int, int] = {id(None): 2}
272*6777b538SAndroid Build Coastguard Worker  nodemap: Dict[int, Node] = {id(None): None}
273*6777b538SAndroid Build Coastguard Worker
274*6777b538SAndroid Build Coastguard Worker  def count_parents(node: Node) -> None:
275*6777b538SAndroid Build Coastguard Worker    """Count incoming references"""
276*6777b538SAndroid Build Coastguard Worker    if id(node) in parentcount:
277*6777b538SAndroid Build Coastguard Worker      parentcount[id(node)] += 1
278*6777b538SAndroid Build Coastguard Worker    else:
279*6777b538SAndroid Build Coastguard Worker      assert node is not None  # parentcount statically contains `id(None)`
280*6777b538SAndroid Build Coastguard Worker      parentcount[id(node)] = 1
281*6777b538SAndroid Build Coastguard Worker      for child in node[1]:
282*6777b538SAndroid Build Coastguard Worker        count_parents(child)
283*6777b538SAndroid Build Coastguard Worker
284*6777b538SAndroid Build Coastguard Worker  def join(node: Node) -> Node:
285*6777b538SAndroid Build Coastguard Worker    """Create new nodes"""
286*6777b538SAndroid Build Coastguard Worker    if id(node) not in nodemap:
287*6777b538SAndroid Build Coastguard Worker      assert node is not None  # nodemap statically contains `id(None)`
288*6777b538SAndroid Build Coastguard Worker      children = [join(child) for child in node[1]]
289*6777b538SAndroid Build Coastguard Worker      if len(children) == 1 and parentcount[id(node[1][0])] == 1:
290*6777b538SAndroid Build Coastguard Worker        child = children[0]
291*6777b538SAndroid Build Coastguard Worker        # parentcount statically maps `id(None)` to 2, so this child cannot be
292*6777b538SAndroid Build Coastguard Worker        # the sink.
293*6777b538SAndroid Build Coastguard Worker        assert child is not None
294*6777b538SAndroid Build Coastguard Worker        nodemap[id(node)] = (node[0] + child[0], child[1])
295*6777b538SAndroid Build Coastguard Worker      else:
296*6777b538SAndroid Build Coastguard Worker        nodemap[id(node)] = (node[0], children)
297*6777b538SAndroid Build Coastguard Worker    return nodemap[id(node)]
298*6777b538SAndroid Build Coastguard Worker
299*6777b538SAndroid Build Coastguard Worker  for node in dafsa:
300*6777b538SAndroid Build Coastguard Worker    count_parents(node)
301*6777b538SAndroid Build Coastguard Worker  return [join(node) for node in dafsa]
302*6777b538SAndroid Build Coastguard Worker
303*6777b538SAndroid Build Coastguard Worker
304*6777b538SAndroid Build Coastguard Workerdef join_suffixes(dafsa: DAFSA) -> DAFSA:
305*6777b538SAndroid Build Coastguard Worker  """Generates a new DAFSA where nodes that represent the same word lists
306*6777b538SAndroid Build Coastguard Worker  towards the sink are merged.
307*6777b538SAndroid Build Coastguard Worker  """
308*6777b538SAndroid Build Coastguard Worker  nodemap: Dict[FrozenSet[str], Node] = {frozenset(('', )): None}
309*6777b538SAndroid Build Coastguard Worker
310*6777b538SAndroid Build Coastguard Worker  def join(node: Node) -> Node:
311*6777b538SAndroid Build Coastguard Worker    """Returns a matching node. A new node is created if no matching node
312*6777b538SAndroid Build Coastguard Worker    exists. The graph is accessed in dfs order.
313*6777b538SAndroid Build Coastguard Worker    """
314*6777b538SAndroid Build Coastguard Worker    suffixes = frozenset(to_words(node))
315*6777b538SAndroid Build Coastguard Worker    if suffixes not in nodemap:
316*6777b538SAndroid Build Coastguard Worker      # The only set of suffixes for the sink is {''}, which is statically
317*6777b538SAndroid Build Coastguard Worker      # contained in nodemap.
318*6777b538SAndroid Build Coastguard Worker      assert node is not None
319*6777b538SAndroid Build Coastguard Worker      nodemap[suffixes] = (node[0], [join(child) for child in node[1]])
320*6777b538SAndroid Build Coastguard Worker    return nodemap[suffixes]
321*6777b538SAndroid Build Coastguard Worker
322*6777b538SAndroid Build Coastguard Worker  return [join(node) for node in dafsa]
323*6777b538SAndroid Build Coastguard Worker
324*6777b538SAndroid Build Coastguard Worker
325*6777b538SAndroid Build Coastguard Workerdef top_sort(dafsa: DAFSA) -> Sequence[NonSinkNode]:
326*6777b538SAndroid Build Coastguard Worker  """Generates list of nodes in topological sort order."""
327*6777b538SAndroid Build Coastguard Worker  # `incoming` contains the in-degree of every node except the sink.
328*6777b538SAndroid Build Coastguard Worker  incoming: Dict[int, int] = {}
329*6777b538SAndroid Build Coastguard Worker
330*6777b538SAndroid Build Coastguard Worker  def count_incoming(node: Node) -> None:
331*6777b538SAndroid Build Coastguard Worker    """Counts incoming references."""
332*6777b538SAndroid Build Coastguard Worker    if node:
333*6777b538SAndroid Build Coastguard Worker      if id(node) not in incoming:
334*6777b538SAndroid Build Coastguard Worker        incoming[id(node)] = 1
335*6777b538SAndroid Build Coastguard Worker        for child in node[1]:
336*6777b538SAndroid Build Coastguard Worker          count_incoming(child)
337*6777b538SAndroid Build Coastguard Worker      else:
338*6777b538SAndroid Build Coastguard Worker        incoming[id(node)] += 1
339*6777b538SAndroid Build Coastguard Worker
340*6777b538SAndroid Build Coastguard Worker  for node in dafsa:
341*6777b538SAndroid Build Coastguard Worker    count_incoming(node)
342*6777b538SAndroid Build Coastguard Worker
343*6777b538SAndroid Build Coastguard Worker  for node in dafsa:
344*6777b538SAndroid Build Coastguard Worker    if node:
345*6777b538SAndroid Build Coastguard Worker      incoming[id(node)] -= 1
346*6777b538SAndroid Build Coastguard Worker
347*6777b538SAndroid Build Coastguard Worker  waiting: List[NonSinkNode] = [
348*6777b538SAndroid Build Coastguard Worker      node for node in dafsa if node and incoming[id(node)] == 0
349*6777b538SAndroid Build Coastguard Worker  ]
350*6777b538SAndroid Build Coastguard Worker  nodes: List[NonSinkNode] = []
351*6777b538SAndroid Build Coastguard Worker
352*6777b538SAndroid Build Coastguard Worker  while waiting:
353*6777b538SAndroid Build Coastguard Worker    node = waiting.pop()
354*6777b538SAndroid Build Coastguard Worker    assert incoming[id(node)] == 0
355*6777b538SAndroid Build Coastguard Worker    nodes.append(node)
356*6777b538SAndroid Build Coastguard Worker    for child in node[1]:
357*6777b538SAndroid Build Coastguard Worker      if child:
358*6777b538SAndroid Build Coastguard Worker        incoming[id(child)] -= 1
359*6777b538SAndroid Build Coastguard Worker        if incoming[id(child)] == 0:
360*6777b538SAndroid Build Coastguard Worker          waiting.append(child)
361*6777b538SAndroid Build Coastguard Worker  return nodes
362*6777b538SAndroid Build Coastguard Worker
363*6777b538SAndroid Build Coastguard Worker
364*6777b538SAndroid Build Coastguard Workerdef encode_links(children: Sequence[Node], offsets: Dict[int, int],
365*6777b538SAndroid Build Coastguard Worker                 current: int) -> Iterable[int]:
366*6777b538SAndroid Build Coastguard Worker  """Encodes a list of children as one, two or three byte offsets."""
367*6777b538SAndroid Build Coastguard Worker  if not children[0]:
368*6777b538SAndroid Build Coastguard Worker    # This is an <end_label> node and no links follow such nodes
369*6777b538SAndroid Build Coastguard Worker    assert len(children) == 1
370*6777b538SAndroid Build Coastguard Worker    return []
371*6777b538SAndroid Build Coastguard Worker  guess = 3 * len(children)
372*6777b538SAndroid Build Coastguard Worker  assert children
373*6777b538SAndroid Build Coastguard Worker  children = sorted(children, key = lambda x: -offsets[id(x)])
374*6777b538SAndroid Build Coastguard Worker  while True:
375*6777b538SAndroid Build Coastguard Worker    offset = current + guess
376*6777b538SAndroid Build Coastguard Worker    buf: List[int] = []
377*6777b538SAndroid Build Coastguard Worker    for child in children:
378*6777b538SAndroid Build Coastguard Worker      last = len(buf)
379*6777b538SAndroid Build Coastguard Worker      distance = offset - offsets[id(child)]
380*6777b538SAndroid Build Coastguard Worker      assert distance > 0 and distance < (1 << 21)
381*6777b538SAndroid Build Coastguard Worker
382*6777b538SAndroid Build Coastguard Worker      if distance < (1 << 6):
383*6777b538SAndroid Build Coastguard Worker        # A 6-bit offset: "s0xxxxxx"
384*6777b538SAndroid Build Coastguard Worker        buf.append(distance)
385*6777b538SAndroid Build Coastguard Worker      elif distance < (1 << 13):
386*6777b538SAndroid Build Coastguard Worker        # A 13-bit offset: "s10xxxxxxxxxxxxx"
387*6777b538SAndroid Build Coastguard Worker        buf.append(0x40 | (distance >> 8))
388*6777b538SAndroid Build Coastguard Worker        buf.append(distance & 0xFF)
389*6777b538SAndroid Build Coastguard Worker      else:
390*6777b538SAndroid Build Coastguard Worker        # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
391*6777b538SAndroid Build Coastguard Worker        buf.append(0x60 | (distance >> 16))
392*6777b538SAndroid Build Coastguard Worker        buf.append((distance >> 8) & 0xFF)
393*6777b538SAndroid Build Coastguard Worker        buf.append(distance & 0xFF)
394*6777b538SAndroid Build Coastguard Worker      # Distance in first link is relative to following record.
395*6777b538SAndroid Build Coastguard Worker      # Distance in other links are relative to previous link.
396*6777b538SAndroid Build Coastguard Worker      offset -= distance
397*6777b538SAndroid Build Coastguard Worker    if len(buf) == guess:
398*6777b538SAndroid Build Coastguard Worker      break
399*6777b538SAndroid Build Coastguard Worker    guess = len(buf)
400*6777b538SAndroid Build Coastguard Worker  # Set most significant bit to mark end of links in this node.
401*6777b538SAndroid Build Coastguard Worker  buf[last] |= (1 << 7)
402*6777b538SAndroid Build Coastguard Worker  buf.reverse()
403*6777b538SAndroid Build Coastguard Worker  return buf
404*6777b538SAndroid Build Coastguard Worker
405*6777b538SAndroid Build Coastguard Worker
406*6777b538SAndroid Build Coastguard Workerdef encode_prefix(label: str) -> MutableSequence[int]:
407*6777b538SAndroid Build Coastguard Worker  """Encodes a node label as a list of bytes without a trailing high byte.
408*6777b538SAndroid Build Coastguard Worker
409*6777b538SAndroid Build Coastguard Worker  This method encodes a node if there is exactly one child  and the
410*6777b538SAndroid Build Coastguard Worker  child follows immidiately after so that no jump is needed. This label
411*6777b538SAndroid Build Coastguard Worker  will then be a prefix to the label in the child node.
412*6777b538SAndroid Build Coastguard Worker  """
413*6777b538SAndroid Build Coastguard Worker  assert label
414*6777b538SAndroid Build Coastguard Worker  return [ord(c) for c in reversed(label)]
415*6777b538SAndroid Build Coastguard Worker
416*6777b538SAndroid Build Coastguard Worker
417*6777b538SAndroid Build Coastguard Workerdef encode_label(label: str) -> Iterable[int]:
418*6777b538SAndroid Build Coastguard Worker  """Encodes a node label as a list of bytes with a trailing high byte >0x80.
419*6777b538SAndroid Build Coastguard Worker  """
420*6777b538SAndroid Build Coastguard Worker  buf = encode_prefix(label)
421*6777b538SAndroid Build Coastguard Worker  # Set most significant bit to mark end of label in this node.
422*6777b538SAndroid Build Coastguard Worker  buf[0] |= (1 << 7)
423*6777b538SAndroid Build Coastguard Worker  return buf
424*6777b538SAndroid Build Coastguard Worker
425*6777b538SAndroid Build Coastguard Worker
426*6777b538SAndroid Build Coastguard Workerdef encode(dafsa: DAFSA) -> Sequence[int]:
427*6777b538SAndroid Build Coastguard Worker  """Encodes a DAFSA to a list of bytes"""
428*6777b538SAndroid Build Coastguard Worker  output: List[int] = []
429*6777b538SAndroid Build Coastguard Worker  offsets: Dict[int, int] = {}
430*6777b538SAndroid Build Coastguard Worker
431*6777b538SAndroid Build Coastguard Worker  for node in reversed(top_sort(dafsa)):
432*6777b538SAndroid Build Coastguard Worker    if (len(node[1]) == 1 and node[1][0] and
433*6777b538SAndroid Build Coastguard Worker        (offsets[id(node[1][0])] == len(output))):
434*6777b538SAndroid Build Coastguard Worker      output.extend(encode_prefix(node[0]))
435*6777b538SAndroid Build Coastguard Worker    else:
436*6777b538SAndroid Build Coastguard Worker      output.extend(encode_links(node[1], offsets, len(output)))
437*6777b538SAndroid Build Coastguard Worker      output.extend(encode_label(node[0]))
438*6777b538SAndroid Build Coastguard Worker    offsets[id(node)] = len(output)
439*6777b538SAndroid Build Coastguard Worker
440*6777b538SAndroid Build Coastguard Worker  output.extend(encode_links(dafsa, offsets, len(output)))
441*6777b538SAndroid Build Coastguard Worker  output.reverse()
442*6777b538SAndroid Build Coastguard Worker  return output
443*6777b538SAndroid Build Coastguard Worker
444*6777b538SAndroid Build Coastguard Worker
445*6777b538SAndroid Build Coastguard Workerdef to_cxx(data: Sequence[int]) -> str:
446*6777b538SAndroid Build Coastguard Worker  """Generates C++ code from a list of encoded bytes."""
447*6777b538SAndroid Build Coastguard Worker  text = '/* This file is generated. DO NOT EDIT!\n\n'
448*6777b538SAndroid Build Coastguard Worker  text += 'The byte array encodes effective tld names. See make_dafsa.py for'
449*6777b538SAndroid Build Coastguard Worker  text += ' documentation.'
450*6777b538SAndroid Build Coastguard Worker  text += '*/\n\n'
451*6777b538SAndroid Build Coastguard Worker  text += 'const unsigned char kDafsa[%s] = {\n' % len(data)
452*6777b538SAndroid Build Coastguard Worker  for i in range(0, len(data), 12):
453*6777b538SAndroid Build Coastguard Worker    text += '  '
454*6777b538SAndroid Build Coastguard Worker    text += ', '.join('0x%02x' % byte for byte in data[i:i + 12])
455*6777b538SAndroid Build Coastguard Worker    text += ',\n'
456*6777b538SAndroid Build Coastguard Worker  text += '};\n'
457*6777b538SAndroid Build Coastguard Worker  return text
458*6777b538SAndroid Build Coastguard Worker
459*6777b538SAndroid Build Coastguard Worker
460*6777b538SAndroid Build Coastguard Workerdef words_to_cxx(words: Iterable[str]) -> str:
461*6777b538SAndroid Build Coastguard Worker  """Generates C++ code from a word list"""
462*6777b538SAndroid Build Coastguard Worker  dafsa = to_dafsa(words)
463*6777b538SAndroid Build Coastguard Worker  for fun in (reverse, join_suffixes, reverse, join_suffixes, join_labels):
464*6777b538SAndroid Build Coastguard Worker    dafsa = fun(dafsa)
465*6777b538SAndroid Build Coastguard Worker  return to_cxx(encode(dafsa))
466*6777b538SAndroid Build Coastguard Worker
467*6777b538SAndroid Build Coastguard Worker
468*6777b538SAndroid Build Coastguard Workerdef parse_gperf(infile: Iterable[str], reverse: bool) -> Iterable[str]:
469*6777b538SAndroid Build Coastguard Worker  """Parses gperf file and extract strings and return code"""
470*6777b538SAndroid Build Coastguard Worker  lines = [line.strip() for line in infile]
471*6777b538SAndroid Build Coastguard Worker  # Extract strings after the first '%%' and before the second '%%'.
472*6777b538SAndroid Build Coastguard Worker  begin = lines.index('%%') + 1
473*6777b538SAndroid Build Coastguard Worker  end = lines.index('%%', begin)
474*6777b538SAndroid Build Coastguard Worker  lines = lines[begin:end]
475*6777b538SAndroid Build Coastguard Worker  for line in lines:
476*6777b538SAndroid Build Coastguard Worker    if line[-3:-1] != ', ':
477*6777b538SAndroid Build Coastguard Worker      raise InputError('Expected "domainname, <digit>", found "%s"' % line)
478*6777b538SAndroid Build Coastguard Worker    # Technically the DAFSA format can support return values in the range
479*6777b538SAndroid Build Coastguard Worker    # [0-31], but only the first three bits have any defined meaning.
480*6777b538SAndroid Build Coastguard Worker    if not line.endswith(('0', '1', '2', '3', '4', '5', '6', '7')):
481*6777b538SAndroid Build Coastguard Worker      raise InputError('Expected value to be in the range of 0-7, found "%s"' %
482*6777b538SAndroid Build Coastguard Worker                       line[-1])
483*6777b538SAndroid Build Coastguard Worker  if reverse:
484*6777b538SAndroid Build Coastguard Worker    return [line[-4::-1] + line[-1] for line in lines]
485*6777b538SAndroid Build Coastguard Worker  else:
486*6777b538SAndroid Build Coastguard Worker    return [line[:-3] + line[-1] for line in lines]
487*6777b538SAndroid Build Coastguard Worker
488*6777b538SAndroid Build Coastguard Worker
489*6777b538SAndroid Build Coastguard Workerdef main() -> int:
490*6777b538SAndroid Build Coastguard Worker  parser = argparse.ArgumentParser()
491*6777b538SAndroid Build Coastguard Worker  parser.add_argument('--reverse', action='store_const', const=True,
492*6777b538SAndroid Build Coastguard Worker                      default=False)
493*6777b538SAndroid Build Coastguard Worker  parser.add_argument('infile', nargs='?', type=argparse.FileType('r'),
494*6777b538SAndroid Build Coastguard Worker                      default=sys.stdin)
495*6777b538SAndroid Build Coastguard Worker  parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'),
496*6777b538SAndroid Build Coastguard Worker                      default=sys.stdout)
497*6777b538SAndroid Build Coastguard Worker  args = parser.parse_args()
498*6777b538SAndroid Build Coastguard Worker  args.outfile.write(words_to_cxx(parse_gperf(args.infile, args.reverse)))
499*6777b538SAndroid Build Coastguard Worker  return 0
500*6777b538SAndroid Build Coastguard Worker
501*6777b538SAndroid Build Coastguard Worker
502*6777b538SAndroid Build Coastguard Workerif __name__ == '__main__':
503*6777b538SAndroid Build Coastguard Worker  sys.exit(main())
504