xref: /aosp_15_r20/external/stg/doc/scc.md (revision 9e3b08ae94a55201065475453d799e8b1378bea6)
1# SCC finder core algorithm
2
3When comparing types recursively, we explore a graph whose nodes are pairs of
4types. Nodes can be visited more than once, and there can even be cycles.
5
6The standard tool for exploring directed graphs is Depth First Search. Dealing
7with cycles in such a graph requires the determination of its Strongly-Connected
8Components.
9
10DFS code for graph traversal is trivial. SCC determination is not and we would
11like it to be provided by a reliable and efficient library component.
12
13## Core algorithm
14
15There are three commonly-studied asymptotically-optimal approaches to
16determining the Strongly-Connected Components of a directed graph. Each of these
17admits various optimisations and specialisations for different purposes.
18
19*   [Kosaraju's algorithm](https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm)
20*   [Tarjan's algorithm](https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm)
21*   [The path-based algorithm](https://en.wikipedia.org/wiki/Path-based_strong_component_algorithm)
22
23Kosaraju's algorithm is unsuited to DFS-generated graphs (such as type
24comparison graphs) as it requires both forwards and reverse edges to be known
25ahead of time.
26
27Tarjan's algorithm can be massaged into the form where it can be separated into
28a plain DFS traversal and SCC-specific pieces but the resulting code is a bit
29messy and responsibility for SCC state management is rather scattered.
30
31The path-based algorithm is the best fit and can be put in a form where the DFS
32traversal and SCC state management are cleanly separated. The concept of "open"
33nodes carries directly over to the implementation used here and SCC state
34management occurs in two well-defined places.
35
36*   node visit starts; repeat visits to open nodes are detected
37*   node visit end; completed SCCs are detected
38
39The remainder of this document discusses the finer details of what the finder
40API should look like.
41
42## Choice of primitives
43
44The choice of primitives should be determined by the interface priorities:
45
46*   simple - hard to misuse
47*   efficient - does not preclude optimal implementation
48*   powerful - no need to code around deficiencies
49
50Roughly speaking, the SCC finder "opens" and "closes" nodes and the user code
51will try to open and close nodes at the beginning and end of each node visit.
52
53### Close
54
55The logic to follow when the SCC finder reports a complete SCC (nodes closed) is
56simple: the user code must ensure it considers each node as definitively visited
57from this point and avoid asking the finder to consider it again. This condition
58means the SCC does not have to independently track "ever visited" status that
59may be duplicated elsewhere.
60
61### Open
62
63When reaching a node via DFS, the user of the SCC finder has to perform a 3-way
64classification:
65
661.  never visited before - the node should immediately transition to open and
67    known to the SCC finder
682.  open - the link just followed would create a cycle and the SCC finder
69    algorithm needs to do some state maintenance; the user code must not
70    recursively process the node
713.  closed - the link just followed reaches a node already fully processed and
72    assigned to a SCC; the user code must not recursively process the node
73
74There are at least 3 different ways of structuring program logic to distinguish
75these paths.
76
77#### Populate user visited state on open
78
79Node lifecycle:
80
811.  unvisited + not open
822.  visited + open
833.  visited + not open
84
85If a node has never been visited, it can be unconditionally opened. If it has
86been visited, we must still check if it's open. This is a bit odd in the context
87of the SCC algorithm as `really_open` and `is_open` become separate operations
88(-simplicity). It's possible that a user could know, for other reasons, whether
89a visited node is open or not and then omitting to call `is_open` could upset
90the SCC finder (which needs to update its internal state in the already-open
91case). This risk could be eliminated by duplicating the state update logic in
92both methods (-efficiency).
93
94```c++
95if (visited) {
96  if (is_open(node)) {
97    // cycle-breaking back link
98    return;
99  } else {
100    // work-saving link to shared node
101    return;
102  }
103}
104mark_visited();
105token = really_open(node);
106...
107// do work
108...
109nodes = close(token);
110if (!nodes.empty()) {
111  ...
112}
113```
114
115#### Check SCC open / closed first
116
117Node lifecycle:
118
1191.  not open + unvisited (never visited)
1202.  open (being visited)
1213.  not open + visited (closed)
122
123This scheme also requires separate `is_open` and `really_open` operations as
124nodes musn't be reopened (-simplicity, -efficiency). It does allow the user to
125mark nodes as visited any time between open and close (-simplicity, +power).
126
127```c++
128if (is_open(node)) {
129  // cycle-breaking back link
130  return;
131}
132if (visited) {
133  // work-saving link to shared node
134  return;
135}
136token = really_open(node);
137...
138// do work
139...
140nodes = close(token);
141if (!nodes.empty()) {
142  // last chance to call mark_visited
143}
144```
145
146#### Test user visited state before open, populate on close
147
148NOTE: This is the currently implemented approach.
149
150Node lifecycle:
151
1521.  unvisited + not open
1532.  unvisited + open
1543.  visited + not open
155
156This is the purest form of the algorithm with the `open` and `close` operations
157clearly bracketing "real" work. `really_open` and `is_open` operations are
158folded together into `open` which will fail to open an already open node
159(+simplicity, +efficiency).
160
161The user value store cannot be something that gets initialised and updated
162between open and close or it will also need to duplicate the open/closed state
163and this correspondence will need to be maintained as an invariant (-power).
164
165```c++
166if (visited) {
167  // work-saving link to shared node
168  return;
169}
170token = open(node);
171if (!token) {
172  // cycle-breaking back link
173  return;
174}
175...
176// do work
177...
178nodes = close(token.value());
179if (!nodes.empty()) {
180  ...
181  mark_visited();
182}
183```
184
185Evaluating equality can be made to fit nicely into this category, as can using
186equality along with the return stack to build a diff tree with stubs for
187sharing- and cycle-breaking links.
188
189However, building a graph (say a copy of the traversal, or a diff graph)
190requires open node state to be squirrelled away somewhere.
191
192##### Enhancement
193
194The SCC finder data structure can be made to carry values associated with open
195nodes and hand them to the user on failure-to-open and closure. This allows us
196to retain purity and regain the ability to maintain simple state for open nodes
197separately from that for closed nodes, at the expense of a slightly
198heavier-touch interface (+power).
199
200In the simplest case, we'd want nothing stored at all (beyond the node identity)
201and actually supplying a second empty type would be an annoyance and an
202inefficiency (-simplicity, -power, -efficiency)). So the best thing to supply is
203the user's container's `value_type` and associated `value_compare` comparator.
204
205However, in this variation, it's painful to set up the SCC structures for
206efficient `open` as nodes need to exist in a map or set, independently of any
207payload. The approach could be revisited if there's a solution to this.
208
209```c++
210if (visited) {
211  // work-saving link to shared node
212  return;
213}
214[&node_state, token] = open(node_state);
215if (!token) {
216  // cycle-breaking back link
217  return;
218}
219...
220// do work, update node_state if you like
221...
222node_states = close(token.value())
223if (!node_states.empty()) {
224  ...
225  mark_visited();
226}
227```
228