1*6777b538SAndroid Build Coastguard Worker#!/usr/bin/env python3 2*6777b538SAndroid Build Coastguard Worker# Copyright 2014 The Chromium Authors 3*6777b538SAndroid Build Coastguard Worker# Use of this source code is governed by a BSD-style license that can be 4*6777b538SAndroid Build Coastguard Worker# found in the LICENSE file. 5*6777b538SAndroid Build Coastguard Worker 6*6777b538SAndroid Build Coastguard Worker""" 7*6777b538SAndroid Build Coastguard WorkerA Deterministic acyclic finite state automaton (DAFSA) is a compact 8*6777b538SAndroid Build Coastguard Workerrepresentation of an unordered word list (dictionary). 9*6777b538SAndroid Build Coastguard Worker 10*6777b538SAndroid Build Coastguard Workerhttps://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton 11*6777b538SAndroid Build Coastguard Worker 12*6777b538SAndroid Build Coastguard WorkerThis python program converts a list of strings to a byte array in C++. 13*6777b538SAndroid Build Coastguard WorkerThis python program fetches strings and return values from a gperf file 14*6777b538SAndroid Build Coastguard Workerand generates a C++ file with a byte array representing graph that can be 15*6777b538SAndroid Build Coastguard Workerused as a memory efficient replacement for the perfect hash table. 16*6777b538SAndroid Build Coastguard Worker 17*6777b538SAndroid Build Coastguard WorkerThe input strings are assumed to consist of printable 7-bit ASCII characters 18*6777b538SAndroid Build Coastguard Workerand the return values are assumed to be one digit integers. 19*6777b538SAndroid Build Coastguard Worker 20*6777b538SAndroid Build Coastguard WorkerIn this program a DAFSA is a diamond shaped graph starting at a common 21*6777b538SAndroid Build Coastguard Workersource node and ending at a common sink node. All internal nodes contain 22*6777b538SAndroid Build Coastguard Workera label and each word is represented by the labels in one path from 23*6777b538SAndroid Build Coastguard Workerthe source node to the sink node. 24*6777b538SAndroid Build Coastguard Worker 25*6777b538SAndroid Build Coastguard WorkerThe following python represention is used for nodes: 26*6777b538SAndroid Build Coastguard Worker 27*6777b538SAndroid Build Coastguard Worker Source node: [ children ] 28*6777b538SAndroid Build Coastguard Worker Internal node: (label, [ children ]) 29*6777b538SAndroid Build Coastguard Worker Sink node: None 30*6777b538SAndroid Build Coastguard Worker 31*6777b538SAndroid Build Coastguard WorkerThe graph is first compressed by prefixes like a trie. In the next step 32*6777b538SAndroid Build Coastguard Workersuffixes are compressed so that the graph gets diamond shaped. Finally 33*6777b538SAndroid Build Coastguard Workerone to one linked nodes are replaced by nodes with the labels joined. 34*6777b538SAndroid Build Coastguard Worker 35*6777b538SAndroid Build Coastguard WorkerThe order of the operations is crucial since lookups will be performed 36*6777b538SAndroid Build Coastguard Workerstarting from the source with no backtracking. Thus a node must have at 37*6777b538SAndroid Build Coastguard Workermost one child with a label starting by the same character. The output 38*6777b538SAndroid Build Coastguard Workeris also arranged so that all jumps are to increasing addresses, thus forward 39*6777b538SAndroid Build Coastguard Workerin memory. 40*6777b538SAndroid Build Coastguard Worker 41*6777b538SAndroid Build Coastguard WorkerThe generated output has suffix free decoding so that the sign of leading 42*6777b538SAndroid Build Coastguard Workerbits in a link (a reference to a child node) indicate if it has a size of one, 43*6777b538SAndroid Build Coastguard Workertwo or three bytes and if it is the last outgoing link from the actual node. 44*6777b538SAndroid Build Coastguard WorkerA node label is terminated by a byte with the leading bit set. 45*6777b538SAndroid Build Coastguard Worker 46*6777b538SAndroid Build Coastguard WorkerThe generated byte array can described by the following BNF: 47*6777b538SAndroid Build Coastguard Worker 48*6777b538SAndroid Build Coastguard Worker<byte> ::= < 8-bit value in range [0x00-0xFF] > 49*6777b538SAndroid Build Coastguard Worker 50*6777b538SAndroid Build Coastguard Worker<char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] > 51*6777b538SAndroid Build Coastguard Worker<end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] > 52*6777b538SAndroid Build Coastguard Worker<return value> ::= < value + 0x80, byte in range [0x80-0x8F] > 53*6777b538SAndroid Build Coastguard Worker 54*6777b538SAndroid Build Coastguard Worker<offset1> ::= < byte in range [0x00-0x3F] > 55*6777b538SAndroid Build Coastguard Worker<offset2> ::= < byte in range [0x40-0x5F] > 56*6777b538SAndroid Build Coastguard Worker<offset3> ::= < byte in range [0x60-0x7F] > 57*6777b538SAndroid Build Coastguard Worker 58*6777b538SAndroid Build Coastguard Worker<end_offset1> ::= < byte in range [0x80-0xBF] > 59*6777b538SAndroid Build Coastguard Worker<end_offset2> ::= < byte in range [0xC0-0xDF] > 60*6777b538SAndroid Build Coastguard Worker<end_offset3> ::= < byte in range [0xE0-0xFF] > 61*6777b538SAndroid Build Coastguard Worker 62*6777b538SAndroid Build Coastguard Worker<prefix> ::= <char> 63*6777b538SAndroid Build Coastguard Worker 64*6777b538SAndroid Build Coastguard Worker<label> ::= <end_char> 65*6777b538SAndroid Build Coastguard Worker | <char> <label> 66*6777b538SAndroid Build Coastguard Worker 67*6777b538SAndroid Build Coastguard Worker<end_label> ::= <return_value> 68*6777b538SAndroid Build Coastguard Worker | <char> <end_label> 69*6777b538SAndroid Build Coastguard Worker 70*6777b538SAndroid Build Coastguard Worker<offset> ::= <offset1> 71*6777b538SAndroid Build Coastguard Worker | <offset2> <byte> 72*6777b538SAndroid Build Coastguard Worker | <offset3> <byte> <byte> 73*6777b538SAndroid Build Coastguard Worker 74*6777b538SAndroid Build Coastguard Worker<end_offset> ::= <end_offset1> 75*6777b538SAndroid Build Coastguard Worker | <end_offset2> <byte> 76*6777b538SAndroid Build Coastguard Worker | <end_offset3> <byte> <byte> 77*6777b538SAndroid Build Coastguard Worker 78*6777b538SAndroid Build Coastguard Worker<offsets> ::= <end_offset> 79*6777b538SAndroid Build Coastguard Worker | <offset> <offsets> 80*6777b538SAndroid Build Coastguard Worker 81*6777b538SAndroid Build Coastguard Worker<source> ::= <offsets> 82*6777b538SAndroid Build Coastguard Worker 83*6777b538SAndroid Build Coastguard Worker<node> ::= <label> <offsets> 84*6777b538SAndroid Build Coastguard Worker | <prefix> <node> 85*6777b538SAndroid Build Coastguard Worker | <end_label> 86*6777b538SAndroid Build Coastguard Worker 87*6777b538SAndroid Build Coastguard Worker<dafsa> ::= <source> 88*6777b538SAndroid Build Coastguard Worker | <dafsa> <node> 89*6777b538SAndroid Build Coastguard Worker 90*6777b538SAndroid Build Coastguard WorkerDecoding: 91*6777b538SAndroid Build Coastguard Worker 92*6777b538SAndroid Build Coastguard Worker<char> -> printable 7-bit ASCII character 93*6777b538SAndroid Build Coastguard Worker<end_char> & 0x7F -> printable 7-bit ASCII character 94*6777b538SAndroid Build Coastguard Worker<return value> & 0x0F -> integer 95*6777b538SAndroid Build Coastguard Worker<offset1 & 0x3F> -> integer 96*6777b538SAndroid Build Coastguard Worker((<offset2> & 0x1F>) << 8) + <byte> -> integer 97*6777b538SAndroid Build Coastguard Worker((<offset3> & 0x1F>) << 16) + (<byte> << 8) + <byte> -> integer 98*6777b538SAndroid Build Coastguard Worker 99*6777b538SAndroid Build Coastguard Workerend_offset1, end_offset2 and and_offset3 are decoded same as offset1, 100*6777b538SAndroid Build Coastguard Workeroffset2 and offset3 respectively. 101*6777b538SAndroid Build Coastguard Worker 102*6777b538SAndroid Build Coastguard WorkerThe first offset in a list of offsets is the distance in bytes between the 103*6777b538SAndroid Build Coastguard Workeroffset itself and the first child node. Subsequent offsets are the distance 104*6777b538SAndroid Build Coastguard Workerbetween previous child node and next child node. Thus each offset links a node 105*6777b538SAndroid Build Coastguard Workerto a child node. The distance is always counted between start addresses, i.e. 106*6777b538SAndroid Build Coastguard Workerfirst byte in decoded offset or first byte in child node. 107*6777b538SAndroid Build Coastguard Worker 108*6777b538SAndroid Build Coastguard WorkerExample 1: 109*6777b538SAndroid Build Coastguard Worker 110*6777b538SAndroid Build Coastguard Worker%% 111*6777b538SAndroid Build Coastguard Workeraa, 1 112*6777b538SAndroid Build Coastguard Workera, 2 113*6777b538SAndroid Build Coastguard Worker%% 114*6777b538SAndroid Build Coastguard Worker 115*6777b538SAndroid Build Coastguard WorkerThe input is first parsed to a list of words: 116*6777b538SAndroid Build Coastguard Worker["aa1", "a2"] 117*6777b538SAndroid Build Coastguard Worker 118*6777b538SAndroid Build Coastguard WorkerA fully expanded graph is created from the words: 119*6777b538SAndroid Build Coastguard Workersource = [node1, node4] 120*6777b538SAndroid Build Coastguard Workernode1 = ("a", [node2]) 121*6777b538SAndroid Build Coastguard Workernode2 = ("a", [node3]) 122*6777b538SAndroid Build Coastguard Workernode3 = ("\x01", [sink]) 123*6777b538SAndroid Build Coastguard Workernode4 = ("a", [node5]) 124*6777b538SAndroid Build Coastguard Workernode5 = ("\x02", [sink]) 125*6777b538SAndroid Build Coastguard Workersink = None 126*6777b538SAndroid Build Coastguard Worker 127*6777b538SAndroid Build Coastguard WorkerCompression results in the following graph: 128*6777b538SAndroid Build Coastguard Workersource = [node1] 129*6777b538SAndroid Build Coastguard Workernode1 = ("a", [node2, node3]) 130*6777b538SAndroid Build Coastguard Workernode2 = ("\x02", [sink]) 131*6777b538SAndroid Build Coastguard Workernode3 = ("a\x01", [sink]) 132*6777b538SAndroid Build Coastguard Workersink = None 133*6777b538SAndroid Build Coastguard Worker 134*6777b538SAndroid Build Coastguard WorkerA C++ representation of the compressed graph is generated: 135*6777b538SAndroid Build Coastguard Worker 136*6777b538SAndroid Build Coastguard Workerconst unsigned char dafsa[7] = { 137*6777b538SAndroid Build Coastguard Worker 0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81, 138*6777b538SAndroid Build Coastguard Worker}; 139*6777b538SAndroid Build Coastguard Worker 140*6777b538SAndroid Build Coastguard WorkerThe bytes in the generated array has the following meaning: 141*6777b538SAndroid Build Coastguard Worker 142*6777b538SAndroid Build Coastguard Worker 0: 0x81 <end_offset1> child at position 0 + (0x81 & 0x3F) -> jump to 1 143*6777b538SAndroid Build Coastguard Worker 144*6777b538SAndroid Build Coastguard Worker 1: 0xE1 <end_char> label character (0xE1 & 0x7F) -> match "a" 145*6777b538SAndroid Build Coastguard Worker 2: 0x02 <offset1> child at position 2 + (0x02 & 0x3F) -> jump to 4 146*6777b538SAndroid Build Coastguard Worker 147*6777b538SAndroid Build Coastguard Worker 3: 0x81 <end_offset1> child at position 4 + (0x81 & 0x3F) -> jump to 5 148*6777b538SAndroid Build Coastguard Worker 4: 0x82 <return_value> 0x82 & 0x0F -> return 2 149*6777b538SAndroid Build Coastguard Worker 150*6777b538SAndroid Build Coastguard Worker 5: 0x61 <char> label character 0x61 -> match "a" 151*6777b538SAndroid Build Coastguard Worker 6: 0x81 <return_value> 0x81 & 0x0F -> return 1 152*6777b538SAndroid Build Coastguard Worker 153*6777b538SAndroid Build Coastguard WorkerExample 2: 154*6777b538SAndroid Build Coastguard Worker 155*6777b538SAndroid Build Coastguard Worker%% 156*6777b538SAndroid Build Coastguard Workeraa, 1 157*6777b538SAndroid Build Coastguard Workerbbb, 2 158*6777b538SAndroid Build Coastguard Workerbaa, 1 159*6777b538SAndroid Build Coastguard Worker%% 160*6777b538SAndroid Build Coastguard Worker 161*6777b538SAndroid Build Coastguard WorkerThe input is first parsed to a list of words: 162*6777b538SAndroid Build Coastguard Worker["aa1", "bbb2", "baa1"] 163*6777b538SAndroid Build Coastguard Worker 164*6777b538SAndroid Build Coastguard WorkerCompression results in the following graph: 165*6777b538SAndroid Build Coastguard Workersource = [node1, node2] 166*6777b538SAndroid Build Coastguard Workernode1 = ("b", [node2, node3]) 167*6777b538SAndroid Build Coastguard Workernode2 = ("aa\x01", [sink]) 168*6777b538SAndroid Build Coastguard Workernode3 = ("bb\x02", [sink]) 169*6777b538SAndroid Build Coastguard Workersink = None 170*6777b538SAndroid Build Coastguard Worker 171*6777b538SAndroid Build Coastguard WorkerA C++ representation of the compressed graph is generated: 172*6777b538SAndroid Build Coastguard Worker 173*6777b538SAndroid Build Coastguard Workerconst unsigned char dafsa[11] = { 174*6777b538SAndroid Build Coastguard Worker 0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82, 175*6777b538SAndroid Build Coastguard Worker}; 176*6777b538SAndroid Build Coastguard Worker 177*6777b538SAndroid Build Coastguard WorkerThe bytes in the generated array has the following meaning: 178*6777b538SAndroid Build Coastguard Worker 179*6777b538SAndroid Build Coastguard Worker 0: 0x02 <offset1> child at position 0 + (0x02 & 0x3F) -> jump to 2 180*6777b538SAndroid Build Coastguard Worker 1: 0x83 <end_offset1> child at position 2 + (0x83 & 0x3F) -> jump to 5 181*6777b538SAndroid Build Coastguard Worker 182*6777b538SAndroid Build Coastguard Worker 2: 0xE2 <end_char> label character (0xE2 & 0x7F) -> match "b" 183*6777b538SAndroid Build Coastguard Worker 3: 0x02 <offset1> child at position 3 + (0x02 & 0x3F) -> jump to 5 184*6777b538SAndroid Build Coastguard Worker 4: 0x83 <end_offset1> child at position 5 + (0x83 & 0x3F) -> jump to 8 185*6777b538SAndroid Build Coastguard Worker 186*6777b538SAndroid Build Coastguard Worker 5: 0x61 <char> label character 0x61 -> match "a" 187*6777b538SAndroid Build Coastguard Worker 6: 0x61 <char> label character 0x61 -> match "a" 188*6777b538SAndroid Build Coastguard Worker 7: 0x81 <return_value> 0x81 & 0x0F -> return 1 189*6777b538SAndroid Build Coastguard Worker 190*6777b538SAndroid Build Coastguard Worker 8: 0x62 <char> label character 0x62 -> match "b" 191*6777b538SAndroid Build Coastguard Worker 9: 0x62 <char> label character 0x62 -> match "b" 192*6777b538SAndroid Build Coastguard Worker10: 0x82 <return_value> 0x82 & 0x0F -> return 2 193*6777b538SAndroid Build Coastguard Worker""" 194*6777b538SAndroid Build Coastguard Worker 195*6777b538SAndroid Build Coastguard Workerimport argparse 196*6777b538SAndroid Build Coastguard Workerimport sys 197*6777b538SAndroid Build Coastguard Worker 198*6777b538SAndroid Build Coastguard Workerfrom typing import Any, Dict, FrozenSet, Iterable, List, MutableSequence, Sequence, Tuple, Union 199*6777b538SAndroid Build Coastguard Worker 200*6777b538SAndroid Build Coastguard Worker# Use of Any below is because mypy doesn't support recursive types. 201*6777b538SAndroid Build Coastguard WorkerSinkNode = Union[None, None] # weird hack to get around lack of TypeAlias. 202*6777b538SAndroid Build Coastguard WorkerInteriorNode = Tuple[str, List[Any]] 203*6777b538SAndroid Build Coastguard WorkerSourceNode = List[Any] 204*6777b538SAndroid Build Coastguard Worker 205*6777b538SAndroid Build Coastguard WorkerNonSinkNode = Union[InteriorNode, SourceNode] 206*6777b538SAndroid Build Coastguard WorkerNode = Union[SinkNode, InteriorNode, SourceNode] 207*6777b538SAndroid Build Coastguard WorkerDAFSA = List[Node] 208*6777b538SAndroid Build Coastguard Worker 209*6777b538SAndroid Build Coastguard Worker 210*6777b538SAndroid Build Coastguard Workerclass InputError(Exception): 211*6777b538SAndroid Build Coastguard Worker """Exception raised for errors in the input file.""" 212*6777b538SAndroid Build Coastguard Worker 213*6777b538SAndroid Build Coastguard Worker 214*6777b538SAndroid Build Coastguard Workerdef to_dafsa(words: Iterable[str]) -> DAFSA: 215*6777b538SAndroid Build Coastguard Worker """Generates a DAFSA from a word list and returns the source nodes. 216*6777b538SAndroid Build Coastguard Worker 217*6777b538SAndroid Build Coastguard Worker Each word is split into characters so that each character is represented by 218*6777b538SAndroid Build Coastguard Worker a unique node. It is assumed the word list is not empty. 219*6777b538SAndroid Build Coastguard Worker """ 220*6777b538SAndroid Build Coastguard Worker if not words: 221*6777b538SAndroid Build Coastguard Worker raise InputError('The domain list must not be empty') 222*6777b538SAndroid Build Coastguard Worker 223*6777b538SAndroid Build Coastguard Worker def ToNodes(word: str) -> Node: 224*6777b538SAndroid Build Coastguard Worker """Split words into characters""" 225*6777b538SAndroid Build Coastguard Worker if not 0x1F < ord(word[0]) < 0x80: 226*6777b538SAndroid Build Coastguard Worker raise InputError('Domain names must be printable 7-bit ASCII') 227*6777b538SAndroid Build Coastguard Worker if len(word) == 1: 228*6777b538SAndroid Build Coastguard Worker return chr(ord(word[0]) & 0x0F), [None] 229*6777b538SAndroid Build Coastguard Worker return word[0], [ToNodes(word[1:])] 230*6777b538SAndroid Build Coastguard Worker return [ToNodes(word) for word in words] 231*6777b538SAndroid Build Coastguard Worker 232*6777b538SAndroid Build Coastguard Worker 233*6777b538SAndroid Build Coastguard Workerdef to_words(node: Node) -> Iterable[str]: 234*6777b538SAndroid Build Coastguard Worker """Generates a word list from all paths starting from an internal node.""" 235*6777b538SAndroid Build Coastguard Worker if not node: 236*6777b538SAndroid Build Coastguard Worker return [''] 237*6777b538SAndroid Build Coastguard Worker return [(node[0] + word) for child in node[1] for word in to_words(child)] 238*6777b538SAndroid Build Coastguard Worker 239*6777b538SAndroid Build Coastguard Worker 240*6777b538SAndroid Build Coastguard Workerdef reverse(dafsa: DAFSA) -> DAFSA: 241*6777b538SAndroid Build Coastguard Worker """Generates a new DAFSA that is reversed, so that the old sink node becomes 242*6777b538SAndroid Build Coastguard Worker the new source node. 243*6777b538SAndroid Build Coastguard Worker """ 244*6777b538SAndroid Build Coastguard Worker sink: SourceNode = [] 245*6777b538SAndroid Build Coastguard Worker nodemap: Dict[int, InteriorNode] = {} 246*6777b538SAndroid Build Coastguard Worker 247*6777b538SAndroid Build Coastguard Worker def dfs(node: Node, parent: Node) -> None: 248*6777b538SAndroid Build Coastguard Worker """Creates reverse nodes. 249*6777b538SAndroid Build Coastguard Worker 250*6777b538SAndroid Build Coastguard Worker A new reverse node will be created for each old node. The new node will 251*6777b538SAndroid Build Coastguard Worker get a reversed label and the parents of the old node as children. 252*6777b538SAndroid Build Coastguard Worker """ 253*6777b538SAndroid Build Coastguard Worker if not node: 254*6777b538SAndroid Build Coastguard Worker sink.append(parent) 255*6777b538SAndroid Build Coastguard Worker elif id(node) not in nodemap: 256*6777b538SAndroid Build Coastguard Worker nodemap[id(node)] = (node[0][::-1], [parent]) 257*6777b538SAndroid Build Coastguard Worker for child in node[1]: 258*6777b538SAndroid Build Coastguard Worker dfs(child, nodemap[id(node)]) 259*6777b538SAndroid Build Coastguard Worker else: 260*6777b538SAndroid Build Coastguard Worker nodemap[id(node)][1].append(parent) 261*6777b538SAndroid Build Coastguard Worker 262*6777b538SAndroid Build Coastguard Worker for node in dafsa: 263*6777b538SAndroid Build Coastguard Worker dfs(node, None) 264*6777b538SAndroid Build Coastguard Worker return sink 265*6777b538SAndroid Build Coastguard Worker 266*6777b538SAndroid Build Coastguard Worker 267*6777b538SAndroid Build Coastguard Workerdef join_labels(dafsa: DAFSA) -> DAFSA: 268*6777b538SAndroid Build Coastguard Worker """Generates a new DAFSA where internal nodes are merged if there is a one to 269*6777b538SAndroid Build Coastguard Worker one connection. 270*6777b538SAndroid Build Coastguard Worker """ 271*6777b538SAndroid Build Coastguard Worker parentcount: Dict[int, int] = {id(None): 2} 272*6777b538SAndroid Build Coastguard Worker nodemap: Dict[int, Node] = {id(None): None} 273*6777b538SAndroid Build Coastguard Worker 274*6777b538SAndroid Build Coastguard Worker def count_parents(node: Node) -> None: 275*6777b538SAndroid Build Coastguard Worker """Count incoming references""" 276*6777b538SAndroid Build Coastguard Worker if id(node) in parentcount: 277*6777b538SAndroid Build Coastguard Worker parentcount[id(node)] += 1 278*6777b538SAndroid Build Coastguard Worker else: 279*6777b538SAndroid Build Coastguard Worker assert node is not None # parentcount statically contains `id(None)` 280*6777b538SAndroid Build Coastguard Worker parentcount[id(node)] = 1 281*6777b538SAndroid Build Coastguard Worker for child in node[1]: 282*6777b538SAndroid Build Coastguard Worker count_parents(child) 283*6777b538SAndroid Build Coastguard Worker 284*6777b538SAndroid Build Coastguard Worker def join(node: Node) -> Node: 285*6777b538SAndroid Build Coastguard Worker """Create new nodes""" 286*6777b538SAndroid Build Coastguard Worker if id(node) not in nodemap: 287*6777b538SAndroid Build Coastguard Worker assert node is not None # nodemap statically contains `id(None)` 288*6777b538SAndroid Build Coastguard Worker children = [join(child) for child in node[1]] 289*6777b538SAndroid Build Coastguard Worker if len(children) == 1 and parentcount[id(node[1][0])] == 1: 290*6777b538SAndroid Build Coastguard Worker child = children[0] 291*6777b538SAndroid Build Coastguard Worker # parentcount statically maps `id(None)` to 2, so this child cannot be 292*6777b538SAndroid Build Coastguard Worker # the sink. 293*6777b538SAndroid Build Coastguard Worker assert child is not None 294*6777b538SAndroid Build Coastguard Worker nodemap[id(node)] = (node[0] + child[0], child[1]) 295*6777b538SAndroid Build Coastguard Worker else: 296*6777b538SAndroid Build Coastguard Worker nodemap[id(node)] = (node[0], children) 297*6777b538SAndroid Build Coastguard Worker return nodemap[id(node)] 298*6777b538SAndroid Build Coastguard Worker 299*6777b538SAndroid Build Coastguard Worker for node in dafsa: 300*6777b538SAndroid Build Coastguard Worker count_parents(node) 301*6777b538SAndroid Build Coastguard Worker return [join(node) for node in dafsa] 302*6777b538SAndroid Build Coastguard Worker 303*6777b538SAndroid Build Coastguard Worker 304*6777b538SAndroid Build Coastguard Workerdef join_suffixes(dafsa: DAFSA) -> DAFSA: 305*6777b538SAndroid Build Coastguard Worker """Generates a new DAFSA where nodes that represent the same word lists 306*6777b538SAndroid Build Coastguard Worker towards the sink are merged. 307*6777b538SAndroid Build Coastguard Worker """ 308*6777b538SAndroid Build Coastguard Worker nodemap: Dict[FrozenSet[str], Node] = {frozenset(('', )): None} 309*6777b538SAndroid Build Coastguard Worker 310*6777b538SAndroid Build Coastguard Worker def join(node: Node) -> Node: 311*6777b538SAndroid Build Coastguard Worker """Returns a matching node. A new node is created if no matching node 312*6777b538SAndroid Build Coastguard Worker exists. The graph is accessed in dfs order. 313*6777b538SAndroid Build Coastguard Worker """ 314*6777b538SAndroid Build Coastguard Worker suffixes = frozenset(to_words(node)) 315*6777b538SAndroid Build Coastguard Worker if suffixes not in nodemap: 316*6777b538SAndroid Build Coastguard Worker # The only set of suffixes for the sink is {''}, which is statically 317*6777b538SAndroid Build Coastguard Worker # contained in nodemap. 318*6777b538SAndroid Build Coastguard Worker assert node is not None 319*6777b538SAndroid Build Coastguard Worker nodemap[suffixes] = (node[0], [join(child) for child in node[1]]) 320*6777b538SAndroid Build Coastguard Worker return nodemap[suffixes] 321*6777b538SAndroid Build Coastguard Worker 322*6777b538SAndroid Build Coastguard Worker return [join(node) for node in dafsa] 323*6777b538SAndroid Build Coastguard Worker 324*6777b538SAndroid Build Coastguard Worker 325*6777b538SAndroid Build Coastguard Workerdef top_sort(dafsa: DAFSA) -> Sequence[NonSinkNode]: 326*6777b538SAndroid Build Coastguard Worker """Generates list of nodes in topological sort order.""" 327*6777b538SAndroid Build Coastguard Worker # `incoming` contains the in-degree of every node except the sink. 328*6777b538SAndroid Build Coastguard Worker incoming: Dict[int, int] = {} 329*6777b538SAndroid Build Coastguard Worker 330*6777b538SAndroid Build Coastguard Worker def count_incoming(node: Node) -> None: 331*6777b538SAndroid Build Coastguard Worker """Counts incoming references.""" 332*6777b538SAndroid Build Coastguard Worker if node: 333*6777b538SAndroid Build Coastguard Worker if id(node) not in incoming: 334*6777b538SAndroid Build Coastguard Worker incoming[id(node)] = 1 335*6777b538SAndroid Build Coastguard Worker for child in node[1]: 336*6777b538SAndroid Build Coastguard Worker count_incoming(child) 337*6777b538SAndroid Build Coastguard Worker else: 338*6777b538SAndroid Build Coastguard Worker incoming[id(node)] += 1 339*6777b538SAndroid Build Coastguard Worker 340*6777b538SAndroid Build Coastguard Worker for node in dafsa: 341*6777b538SAndroid Build Coastguard Worker count_incoming(node) 342*6777b538SAndroid Build Coastguard Worker 343*6777b538SAndroid Build Coastguard Worker for node in dafsa: 344*6777b538SAndroid Build Coastguard Worker if node: 345*6777b538SAndroid Build Coastguard Worker incoming[id(node)] -= 1 346*6777b538SAndroid Build Coastguard Worker 347*6777b538SAndroid Build Coastguard Worker waiting: List[NonSinkNode] = [ 348*6777b538SAndroid Build Coastguard Worker node for node in dafsa if node and incoming[id(node)] == 0 349*6777b538SAndroid Build Coastguard Worker ] 350*6777b538SAndroid Build Coastguard Worker nodes: List[NonSinkNode] = [] 351*6777b538SAndroid Build Coastguard Worker 352*6777b538SAndroid Build Coastguard Worker while waiting: 353*6777b538SAndroid Build Coastguard Worker node = waiting.pop() 354*6777b538SAndroid Build Coastguard Worker assert incoming[id(node)] == 0 355*6777b538SAndroid Build Coastguard Worker nodes.append(node) 356*6777b538SAndroid Build Coastguard Worker for child in node[1]: 357*6777b538SAndroid Build Coastguard Worker if child: 358*6777b538SAndroid Build Coastguard Worker incoming[id(child)] -= 1 359*6777b538SAndroid Build Coastguard Worker if incoming[id(child)] == 0: 360*6777b538SAndroid Build Coastguard Worker waiting.append(child) 361*6777b538SAndroid Build Coastguard Worker return nodes 362*6777b538SAndroid Build Coastguard Worker 363*6777b538SAndroid Build Coastguard Worker 364*6777b538SAndroid Build Coastguard Workerdef encode_links(children: Sequence[Node], offsets: Dict[int, int], 365*6777b538SAndroid Build Coastguard Worker current: int) -> Iterable[int]: 366*6777b538SAndroid Build Coastguard Worker """Encodes a list of children as one, two or three byte offsets.""" 367*6777b538SAndroid Build Coastguard Worker if not children[0]: 368*6777b538SAndroid Build Coastguard Worker # This is an <end_label> node and no links follow such nodes 369*6777b538SAndroid Build Coastguard Worker assert len(children) == 1 370*6777b538SAndroid Build Coastguard Worker return [] 371*6777b538SAndroid Build Coastguard Worker guess = 3 * len(children) 372*6777b538SAndroid Build Coastguard Worker assert children 373*6777b538SAndroid Build Coastguard Worker children = sorted(children, key = lambda x: -offsets[id(x)]) 374*6777b538SAndroid Build Coastguard Worker while True: 375*6777b538SAndroid Build Coastguard Worker offset = current + guess 376*6777b538SAndroid Build Coastguard Worker buf: List[int] = [] 377*6777b538SAndroid Build Coastguard Worker for child in children: 378*6777b538SAndroid Build Coastguard Worker last = len(buf) 379*6777b538SAndroid Build Coastguard Worker distance = offset - offsets[id(child)] 380*6777b538SAndroid Build Coastguard Worker assert distance > 0 and distance < (1 << 21) 381*6777b538SAndroid Build Coastguard Worker 382*6777b538SAndroid Build Coastguard Worker if distance < (1 << 6): 383*6777b538SAndroid Build Coastguard Worker # A 6-bit offset: "s0xxxxxx" 384*6777b538SAndroid Build Coastguard Worker buf.append(distance) 385*6777b538SAndroid Build Coastguard Worker elif distance < (1 << 13): 386*6777b538SAndroid Build Coastguard Worker # A 13-bit offset: "s10xxxxxxxxxxxxx" 387*6777b538SAndroid Build Coastguard Worker buf.append(0x40 | (distance >> 8)) 388*6777b538SAndroid Build Coastguard Worker buf.append(distance & 0xFF) 389*6777b538SAndroid Build Coastguard Worker else: 390*6777b538SAndroid Build Coastguard Worker # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx" 391*6777b538SAndroid Build Coastguard Worker buf.append(0x60 | (distance >> 16)) 392*6777b538SAndroid Build Coastguard Worker buf.append((distance >> 8) & 0xFF) 393*6777b538SAndroid Build Coastguard Worker buf.append(distance & 0xFF) 394*6777b538SAndroid Build Coastguard Worker # Distance in first link is relative to following record. 395*6777b538SAndroid Build Coastguard Worker # Distance in other links are relative to previous link. 396*6777b538SAndroid Build Coastguard Worker offset -= distance 397*6777b538SAndroid Build Coastguard Worker if len(buf) == guess: 398*6777b538SAndroid Build Coastguard Worker break 399*6777b538SAndroid Build Coastguard Worker guess = len(buf) 400*6777b538SAndroid Build Coastguard Worker # Set most significant bit to mark end of links in this node. 401*6777b538SAndroid Build Coastguard Worker buf[last] |= (1 << 7) 402*6777b538SAndroid Build Coastguard Worker buf.reverse() 403*6777b538SAndroid Build Coastguard Worker return buf 404*6777b538SAndroid Build Coastguard Worker 405*6777b538SAndroid Build Coastguard Worker 406*6777b538SAndroid Build Coastguard Workerdef encode_prefix(label: str) -> MutableSequence[int]: 407*6777b538SAndroid Build Coastguard Worker """Encodes a node label as a list of bytes without a trailing high byte. 408*6777b538SAndroid Build Coastguard Worker 409*6777b538SAndroid Build Coastguard Worker This method encodes a node if there is exactly one child and the 410*6777b538SAndroid Build Coastguard Worker child follows immidiately after so that no jump is needed. This label 411*6777b538SAndroid Build Coastguard Worker will then be a prefix to the label in the child node. 412*6777b538SAndroid Build Coastguard Worker """ 413*6777b538SAndroid Build Coastguard Worker assert label 414*6777b538SAndroid Build Coastguard Worker return [ord(c) for c in reversed(label)] 415*6777b538SAndroid Build Coastguard Worker 416*6777b538SAndroid Build Coastguard Worker 417*6777b538SAndroid Build Coastguard Workerdef encode_label(label: str) -> Iterable[int]: 418*6777b538SAndroid Build Coastguard Worker """Encodes a node label as a list of bytes with a trailing high byte >0x80. 419*6777b538SAndroid Build Coastguard Worker """ 420*6777b538SAndroid Build Coastguard Worker buf = encode_prefix(label) 421*6777b538SAndroid Build Coastguard Worker # Set most significant bit to mark end of label in this node. 422*6777b538SAndroid Build Coastguard Worker buf[0] |= (1 << 7) 423*6777b538SAndroid Build Coastguard Worker return buf 424*6777b538SAndroid Build Coastguard Worker 425*6777b538SAndroid Build Coastguard Worker 426*6777b538SAndroid Build Coastguard Workerdef encode(dafsa: DAFSA) -> Sequence[int]: 427*6777b538SAndroid Build Coastguard Worker """Encodes a DAFSA to a list of bytes""" 428*6777b538SAndroid Build Coastguard Worker output: List[int] = [] 429*6777b538SAndroid Build Coastguard Worker offsets: Dict[int, int] = {} 430*6777b538SAndroid Build Coastguard Worker 431*6777b538SAndroid Build Coastguard Worker for node in reversed(top_sort(dafsa)): 432*6777b538SAndroid Build Coastguard Worker if (len(node[1]) == 1 and node[1][0] and 433*6777b538SAndroid Build Coastguard Worker (offsets[id(node[1][0])] == len(output))): 434*6777b538SAndroid Build Coastguard Worker output.extend(encode_prefix(node[0])) 435*6777b538SAndroid Build Coastguard Worker else: 436*6777b538SAndroid Build Coastguard Worker output.extend(encode_links(node[1], offsets, len(output))) 437*6777b538SAndroid Build Coastguard Worker output.extend(encode_label(node[0])) 438*6777b538SAndroid Build Coastguard Worker offsets[id(node)] = len(output) 439*6777b538SAndroid Build Coastguard Worker 440*6777b538SAndroid Build Coastguard Worker output.extend(encode_links(dafsa, offsets, len(output))) 441*6777b538SAndroid Build Coastguard Worker output.reverse() 442*6777b538SAndroid Build Coastguard Worker return output 443*6777b538SAndroid Build Coastguard Worker 444*6777b538SAndroid Build Coastguard Worker 445*6777b538SAndroid Build Coastguard Workerdef to_cxx(data: Sequence[int]) -> str: 446*6777b538SAndroid Build Coastguard Worker """Generates C++ code from a list of encoded bytes.""" 447*6777b538SAndroid Build Coastguard Worker text = '/* This file is generated. DO NOT EDIT!\n\n' 448*6777b538SAndroid Build Coastguard Worker text += 'The byte array encodes effective tld names. See make_dafsa.py for' 449*6777b538SAndroid Build Coastguard Worker text += ' documentation.' 450*6777b538SAndroid Build Coastguard Worker text += '*/\n\n' 451*6777b538SAndroid Build Coastguard Worker text += 'const unsigned char kDafsa[%s] = {\n' % len(data) 452*6777b538SAndroid Build Coastguard Worker for i in range(0, len(data), 12): 453*6777b538SAndroid Build Coastguard Worker text += ' ' 454*6777b538SAndroid Build Coastguard Worker text += ', '.join('0x%02x' % byte for byte in data[i:i + 12]) 455*6777b538SAndroid Build Coastguard Worker text += ',\n' 456*6777b538SAndroid Build Coastguard Worker text += '};\n' 457*6777b538SAndroid Build Coastguard Worker return text 458*6777b538SAndroid Build Coastguard Worker 459*6777b538SAndroid Build Coastguard Worker 460*6777b538SAndroid Build Coastguard Workerdef words_to_cxx(words: Iterable[str]) -> str: 461*6777b538SAndroid Build Coastguard Worker """Generates C++ code from a word list""" 462*6777b538SAndroid Build Coastguard Worker dafsa = to_dafsa(words) 463*6777b538SAndroid Build Coastguard Worker for fun in (reverse, join_suffixes, reverse, join_suffixes, join_labels): 464*6777b538SAndroid Build Coastguard Worker dafsa = fun(dafsa) 465*6777b538SAndroid Build Coastguard Worker return to_cxx(encode(dafsa)) 466*6777b538SAndroid Build Coastguard Worker 467*6777b538SAndroid Build Coastguard Worker 468*6777b538SAndroid Build Coastguard Workerdef parse_gperf(infile: Iterable[str], reverse: bool) -> Iterable[str]: 469*6777b538SAndroid Build Coastguard Worker """Parses gperf file and extract strings and return code""" 470*6777b538SAndroid Build Coastguard Worker lines = [line.strip() for line in infile] 471*6777b538SAndroid Build Coastguard Worker # Extract strings after the first '%%' and before the second '%%'. 472*6777b538SAndroid Build Coastguard Worker begin = lines.index('%%') + 1 473*6777b538SAndroid Build Coastguard Worker end = lines.index('%%', begin) 474*6777b538SAndroid Build Coastguard Worker lines = lines[begin:end] 475*6777b538SAndroid Build Coastguard Worker for line in lines: 476*6777b538SAndroid Build Coastguard Worker if line[-3:-1] != ', ': 477*6777b538SAndroid Build Coastguard Worker raise InputError('Expected "domainname, <digit>", found "%s"' % line) 478*6777b538SAndroid Build Coastguard Worker # Technically the DAFSA format can support return values in the range 479*6777b538SAndroid Build Coastguard Worker # [0-31], but only the first three bits have any defined meaning. 480*6777b538SAndroid Build Coastguard Worker if not line.endswith(('0', '1', '2', '3', '4', '5', '6', '7')): 481*6777b538SAndroid Build Coastguard Worker raise InputError('Expected value to be in the range of 0-7, found "%s"' % 482*6777b538SAndroid Build Coastguard Worker line[-1]) 483*6777b538SAndroid Build Coastguard Worker if reverse: 484*6777b538SAndroid Build Coastguard Worker return [line[-4::-1] + line[-1] for line in lines] 485*6777b538SAndroid Build Coastguard Worker else: 486*6777b538SAndroid Build Coastguard Worker return [line[:-3] + line[-1] for line in lines] 487*6777b538SAndroid Build Coastguard Worker 488*6777b538SAndroid Build Coastguard Worker 489*6777b538SAndroid Build Coastguard Workerdef main() -> int: 490*6777b538SAndroid Build Coastguard Worker parser = argparse.ArgumentParser() 491*6777b538SAndroid Build Coastguard Worker parser.add_argument('--reverse', action='store_const', const=True, 492*6777b538SAndroid Build Coastguard Worker default=False) 493*6777b538SAndroid Build Coastguard Worker parser.add_argument('infile', nargs='?', type=argparse.FileType('r'), 494*6777b538SAndroid Build Coastguard Worker default=sys.stdin) 495*6777b538SAndroid Build Coastguard Worker parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), 496*6777b538SAndroid Build Coastguard Worker default=sys.stdout) 497*6777b538SAndroid Build Coastguard Worker args = parser.parse_args() 498*6777b538SAndroid Build Coastguard Worker args.outfile.write(words_to_cxx(parse_gperf(args.infile, args.reverse))) 499*6777b538SAndroid Build Coastguard Worker return 0 500*6777b538SAndroid Build Coastguard Worker 501*6777b538SAndroid Build Coastguard Worker 502*6777b538SAndroid Build Coastguard Workerif __name__ == '__main__': 503*6777b538SAndroid Build Coastguard Worker sys.exit(main()) 504