1# SCC finder core algorithm 2 3When comparing types recursively, we explore a graph whose nodes are pairs of 4types. Nodes can be visited more than once, and there can even be cycles. 5 6The standard tool for exploring directed graphs is Depth First Search. Dealing 7with cycles in such a graph requires the determination of its Strongly-Connected 8Components. 9 10DFS code for graph traversal is trivial. SCC determination is not and we would 11like it to be provided by a reliable and efficient library component. 12 13## Core algorithm 14 15There are three commonly-studied asymptotically-optimal approaches to 16determining the Strongly-Connected Components of a directed graph. Each of these 17admits various optimisations and specialisations for different purposes. 18 19* [Kosaraju's algorithm](https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm) 20* [Tarjan's algorithm](https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm) 21* [The path-based algorithm](https://en.wikipedia.org/wiki/Path-based_strong_component_algorithm) 22 23Kosaraju's algorithm is unsuited to DFS-generated graphs (such as type 24comparison graphs) as it requires both forwards and reverse edges to be known 25ahead of time. 26 27Tarjan's algorithm can be massaged into the form where it can be separated into 28a plain DFS traversal and SCC-specific pieces but the resulting code is a bit 29messy and responsibility for SCC state management is rather scattered. 30 31The path-based algorithm is the best fit and can be put in a form where the DFS 32traversal and SCC state management are cleanly separated. The concept of "open" 33nodes carries directly over to the implementation used here and SCC state 34management occurs in two well-defined places. 35 36* node visit starts; repeat visits to open nodes are detected 37* node visit end; completed SCCs are detected 38 39The remainder of this document discusses the finer details of what the finder 40API should look like. 41 42## Choice of primitives 43 44The choice of primitives should be determined by the interface priorities: 45 46* simple - hard to misuse 47* efficient - does not preclude optimal implementation 48* powerful - no need to code around deficiencies 49 50Roughly speaking, the SCC finder "opens" and "closes" nodes and the user code 51will try to open and close nodes at the beginning and end of each node visit. 52 53### Close 54 55The logic to follow when the SCC finder reports a complete SCC (nodes closed) is 56simple: the user code must ensure it considers each node as definitively visited 57from this point and avoid asking the finder to consider it again. This condition 58means the SCC does not have to independently track "ever visited" status that 59may be duplicated elsewhere. 60 61### Open 62 63When reaching a node via DFS, the user of the SCC finder has to perform a 3-way 64classification: 65 661. never visited before - the node should immediately transition to open and 67 known to the SCC finder 682. open - the link just followed would create a cycle and the SCC finder 69 algorithm needs to do some state maintenance; the user code must not 70 recursively process the node 713. closed - the link just followed reaches a node already fully processed and 72 assigned to a SCC; the user code must not recursively process the node 73 74There are at least 3 different ways of structuring program logic to distinguish 75these paths. 76 77#### Populate user visited state on open 78 79Node lifecycle: 80 811. unvisited + not open 822. visited + open 833. visited + not open 84 85If a node has never been visited, it can be unconditionally opened. If it has 86been visited, we must still check if it's open. This is a bit odd in the context 87of the SCC algorithm as `really_open` and `is_open` become separate operations 88(-simplicity). It's possible that a user could know, for other reasons, whether 89a visited node is open or not and then omitting to call `is_open` could upset 90the SCC finder (which needs to update its internal state in the already-open 91case). This risk could be eliminated by duplicating the state update logic in 92both methods (-efficiency). 93 94```c++ 95if (visited) { 96 if (is_open(node)) { 97 // cycle-breaking back link 98 return; 99 } else { 100 // work-saving link to shared node 101 return; 102 } 103} 104mark_visited(); 105token = really_open(node); 106... 107// do work 108... 109nodes = close(token); 110if (!nodes.empty()) { 111 ... 112} 113``` 114 115#### Check SCC open / closed first 116 117Node lifecycle: 118 1191. not open + unvisited (never visited) 1202. open (being visited) 1213. not open + visited (closed) 122 123This scheme also requires separate `is_open` and `really_open` operations as 124nodes musn't be reopened (-simplicity, -efficiency). It does allow the user to 125mark nodes as visited any time between open and close (-simplicity, +power). 126 127```c++ 128if (is_open(node)) { 129 // cycle-breaking back link 130 return; 131} 132if (visited) { 133 // work-saving link to shared node 134 return; 135} 136token = really_open(node); 137... 138// do work 139... 140nodes = close(token); 141if (!nodes.empty()) { 142 // last chance to call mark_visited 143} 144``` 145 146#### Test user visited state before open, populate on close 147 148NOTE: This is the currently implemented approach. 149 150Node lifecycle: 151 1521. unvisited + not open 1532. unvisited + open 1543. visited + not open 155 156This is the purest form of the algorithm with the `open` and `close` operations 157clearly bracketing "real" work. `really_open` and `is_open` operations are 158folded together into `open` which will fail to open an already open node 159(+simplicity, +efficiency). 160 161The user value store cannot be something that gets initialised and updated 162between open and close or it will also need to duplicate the open/closed state 163and this correspondence will need to be maintained as an invariant (-power). 164 165```c++ 166if (visited) { 167 // work-saving link to shared node 168 return; 169} 170token = open(node); 171if (!token) { 172 // cycle-breaking back link 173 return; 174} 175... 176// do work 177... 178nodes = close(token.value()); 179if (!nodes.empty()) { 180 ... 181 mark_visited(); 182} 183``` 184 185Evaluating equality can be made to fit nicely into this category, as can using 186equality along with the return stack to build a diff tree with stubs for 187sharing- and cycle-breaking links. 188 189However, building a graph (say a copy of the traversal, or a diff graph) 190requires open node state to be squirrelled away somewhere. 191 192##### Enhancement 193 194The SCC finder data structure can be made to carry values associated with open 195nodes and hand them to the user on failure-to-open and closure. This allows us 196to retain purity and regain the ability to maintain simple state for open nodes 197separately from that for closed nodes, at the expense of a slightly 198heavier-touch interface (+power). 199 200In the simplest case, we'd want nothing stored at all (beyond the node identity) 201and actually supplying a second empty type would be an annoyance and an 202inefficiency (-simplicity, -power, -efficiency)). So the best thing to supply is 203the user's container's `value_type` and associated `value_compare` comparator. 204 205However, in this variation, it's painful to set up the SCC structures for 206efficient `open` as nodes need to exist in a map or set, independently of any 207payload. The approach could be revisited if there's a solution to this. 208 209```c++ 210if (visited) { 211 // work-saving link to shared node 212 return; 213} 214[&node_state, token] = open(node_state); 215if (!token) { 216 // cycle-breaking back link 217 return; 218} 219... 220// do work, update node_state if you like 221... 222node_states = close(token.value()) 223if (!node_states.empty()) { 224 ... 225 mark_visited(); 226} 227``` 228