stg/doc/comparison.md

# Comparison

This is implemented in `comparison.{h,cc}`.

Graph comparison is the basis of all STG difference reporting.

All the various STG node attributes (such as `size` or `name`) and edge
identifiers (whether explicit or not, such as function parameter index) can be
reduced to generic labels. This (over)simplification reduces the problem to
solve to:

*   given two rooted directed graphs, containing labelled nodes and edges
*   generate a graph that encapsulates all the differences found by following
    matching pairs of edges

Report generation from such a graph is the subject of [Reporting](reporting.md).

STG compares the graphs starting at the root nodes and recursing along edges
with matching labels. The recursion stops at incomparable nodes (such as a
`struct` and `int`). Otherwise a comparison specific to the kind of node is
performed.

Given that the graph will in general not be a tree and may contain cycles, STG

*   memoises comparisons with known results
*   detects comparison cycles and ensures correct termination and propagation of
    results

Overall, this ensures that any given pair of nodes (one from each graph) is
compared at most once. In the case of a small change to a typical ABI graph of
size *N*, *O(N)* node comparisons are performed.

## Ignoring certain kinds of differences

It has proved useful to add selective suppression of certain kinds of
differences, for distinct purposes.

| **ignore**               | **directionality** | **purpose**               |
| ------------------------ | ------------------ | ------------------------- |
| interface addition       | asymmetric         | compatibility checking    |
| type definition addition | asymmetric         | compatibility checking    |
| primitive type encoding  | symmetric          | cross comparison noise    |
:                          :                    : reduction                 :
| member size              | symmetric          | cross comparison noise    |
:                          :                    : reduction                 :
| enum underlying type     | symmetric          | cross comparison noise    |
:                          :                    : reduction                 :
| qualifier                | symmetric          | cross comparison noise    |
:                          :                    : reduction                 :
| symbol CRC               | symmetric          | cross comparison noise    |
:                          :                    : reduction                 :
| symbol type presence     | symmetric          | libabigail XML comparison |
:                          :                    : noise reduction           :
| type declaration status  | symmetric          | libabigail XML comparison |
:                          :                    : noise reduction           :

The first two options can be used to test whether one ABI is a subset of
another.

It can be useful to cross compare ABIs extracted in different ways to validate
the fidelity of one ABI source against another. Where the models or
implementations differ systematically, suppressing those differences will make
the remainder more obvious.

The libabigail versions used in Android's GKI project often generated ABIs with
spurious differences due to the disappearance (or reappearance) of type
definitions and (occasionally) symbol types. The corresponding ignore options
replicate the behaviour of libabigail's `abidiff`.

## Differences and difference graphs

When comparing a pair of nodes, each difference falls into one of the following
categories:

*   node label - a purely local difference
*   matching edge - a recursive difference found by following edges with
    matching labels
*   added or removed labelled edges, where edges are identified by label -
    modelled as a recursive difference with a node absent on one side

Each node in an STG difference graph is one of the following[^1]:

*   a node removal or addition, containing
    *   a reference to either a node in first graph or one in the second
*   a node change, containing
    *   a reference to two nodes, one in each of the two graphs
    *   a possibly-empty list of differences which can each be one of
        *   a node attribute difference in the form of some informative text
        *   an edge difference in the form of a link to a difference node

[^1]: STG models comparisons as pairs of nodes where either node can be absent.
    The absent-absent comparison is used to represent "no change". All such
    edges, except those representing the root of a "no change" graph comparison,
    are pruned during diff graph creation.

Note that STG's difference nodes are *unkinded*, there is only one kind of
difference node, unlike STG's data nodes where there is a separate kind of node
for each kind of C type etc.

## Matching and comparing collections of edges

While an algorithm based on generic label comparison will work, there are a
couple of issues:

*   collections of outgoing edges may not have an obvious label to assign
    (multiple anonymous members of a `struct`, for example)
*   edges are often ordered (parameters and members, for example) and we want to
    preserve this order when reporting differences

Comparing two pointer types is straightforward, just compare the pointed-to
types. However, symbol tables, function arguments and struct members all require
comparisons of multiple entities simultaneously.

STG compares edge aggregates as follows:

*   arrays (like function arguments): by index, comparing recursively items with
    the same index, reporting removals and additions of the remaining items
*   maps (like the symbol table): by key, comparing recursively items with
    matching keys, reporting removals and additions of unmatched items
*   otherwise synthesise a key for comparison, compare by key, report
    differences in the original order of items being compared (favouring the
    second list's order over the first's); the reordering is an *O(n²)*
    operation and it might be possible to adapt the more general Myer's diff
    algorithm to reduce this

## Implementation Details

Comparison is mostly done pair-wise recursively with a DFS, by the function
object `CompareWorker` and with the help of the [SCC finder](scc.md).

The algorithm divides responsibility between `operator()(Id, Id)` and various
`operator()(Node, Node)` methods. There are also various helpers in and outside
`CompareWorker`.

The `Result` type encapsulates the difference between two nodes being compared.
It contains both a list (`Diff`) of differences (`DiffDetail`) and a boolean
equality outcome. The latter is used to propagate inequality information in the
presence of cycles in the diff comparison graph.

### `operator()(Node, Node)`

Specialised for each `Node` type, these methods have the job of computing local
differences, matching edges and obtaining edge differences from recursive calls
to `operator()(Id, Id)` (or `Removed` and `Added`, if edge labels are
unmatched).

Local differences can easily be rendered as text, but edge differences need
recursive calls, the results of which are merged into the local differences
`Result` with helper methods.

These methods form the bulk of the comparison code, so in general we want them
to be as small as possible, containing no boilerplate and simply mirroring the
node data. The helper functions were therefore chosen for power, laziness and
concision.

### `Added` and `Removed`

These take care of comparisons where one side is absent.

There are several reasons for not folding this functionality into `operator(Id,
Id)` itself:

*   it would result in unnecessary extra work as its callers would need to pack
    and the function would need to unpack `optional<Id>` arguments
*   added and removed nodes have none of the other interesting features that it
    handles
*   `Added` and `Removed` don't need to decorate their return values with any
    difference information

### `Defined`

This takes care of comparisons of user-defined types which may be forward
declarations or full definitions.

### `Nodes`

These take care of comparisons of sequences of arbitrary nodes (or enumerators).

STG uses "matching keys" to reduce the problem of comparing sequences to
comparing maps. It attempts to preserve original sequence order as described in
`order.h`.

### `operator(Id, Id)`

This controls recursion and handles some special cases before delegating to some
`operator()(Node, Node)` in the "normal" case.

It takes care of the following:

*   revisited, completed comparisons
*   revisited, in-progress comparison
*   qualified types
*   typedefs
*   incomparable and comparable nodes
*   comparison cycles

Note that the non-trivial special cases relating to typedefs and qualified types
(and their current concrete representations) require non-parallel traversals of
the graphs being compared; this is the only place where the comparison is not
purely structural.

#### Revisited Nodes and Recursive Comparison

STG comparison relies on `SCC` to control recursion and behaviour in the face of
graphs containing arbitrary cycles. Any difference found affecting one node in a
comparison cycle affects all nodes in that cycle.

Excluding the two special cases documented in the following sections, the
comparison steps are approximately:

1.  if the comparison already has a known result then return this
2.  if the comparison already is in progress then return a potential difference
3.  start node visit, register the node with the SCC finder
    1.  (special handling of qualified types and typedefs)
    2.  incomparable nodes (such as a `struct` and an `int`) go to `Mismatch`
        which returns a difference; there is no further recursion
    3.  otherwise delegate node comparison to a node-specific function; with
        possible recursion
    4.  the comparison result here is tentative, due to potential cycles
4.  finish node visit, informing the SCC finder
5.  if an SCC was closed, we've just finished its root comparison
    1.  root compared equal? discard unwanted potential differences
    2.  difference found? record confirmed differences
    3.  record all its comparison results as final
6.  return result (which will be final if an SCC was closed)

#### Typedefs

This special handling is subject to change.

*   `typedef` foo bar ⇔ foo

Typedefs are named type aliases which cannot refer to themselves or later
defined types. The referred-to type is exactly identical to the typedef. So for
difference *finding*, typedefs should just be resolved and skipped over.
However, for *reporting*, it may be still useful to say where a difference came
from. This requires extra handling to collect the typedef names on each side of
the comparison, when there is something to report.

If `operator()(Id, Id)` sees a comparison involving a typedef, it resolves
typedef chains on both sides and keeps track of the names. Then it calls itself
recursively. If the result is no-diff, it returns no-diff, otherwise, it reports
the differences between the types at the end of the typedef chains.

An alternative would be to genuinely just follow the epsilons. Store the
typedefs in the diff tree but record the comparison of what they resolve to. The
presentation layer can decorate the comparison text with resolution chains.

Note that qualified typedefs present extra complications.

#### Qualified Types

This special handling is subject to change.

*   `const` → `volatile` → foo ⇔ `volatile` → `const` → foo

STG currently represents type qualifiers as separate, individual nodes. They are
relevant for finding differences but there may be no guarantee of the order in
which they will appear. For diff reporting, STG currently reports added and
removed qualifiers but also compares the underlying types.

This implies that when faced with a comparison involving a qualifier,
`operator()(Id, Id)` should collect and compare all qualifiers on both sides and
treat the types as compound objects consisting of their qualifiers and the
underlying types, either or both of which may have differences to report.
Comparing the underlying types requires recursive calls.

Note that qualified typedefs present extra complications.

#### Qualified typedefs

STG does not currently do anything special for qualified typedefs which can have
subtle and surprising behaviours. For example:

Before:

```c++
const int quux;
```

After 1:

```c++
typedef int foo;
const foo quux;
```

After 2:

```c++
typedef const int foo;
foo quux;
```

After 3:

```c++
typedef const int foo;
const foo quux;
```

In all cases above, the type of `quux` is unchanged. These examples strongly
suggest that a better model of C types would involve tracking qualification as a
decoration present on every type node, including typedefs.

Note that this behaviour implies C's type system is not purely constructive as
there is machinery to discard duplicate qualifiers which would be illegal
elsewhere.

For the moment, we can pretend that outer qualifications are always significant,
even though they may be absorbed by inner ones, and risk occasional false
positives.

A worse case is:

Before:

```c++
const int quux[];
```

After 1:

```c++
typedef int foo[];
const foo quux;
```

After 2:

```c++
typedef const int foo[];
foo quux;
```

After 3:

```c++
typedef const int foo[];
const foo quux;
```

All the `quux` are identically typed. There is an additional wart that what
would normally be illegal qualifiers on an array type instead decorate its
element type.

Finally, worst is:

Before:

```c++
const int quux();
```

After 1:

```c++
typedef int foo();
const foo quux;
```

After 2:

```c++
typedef const int foo();
foo quux;
```

After 3:

```c++
typedef const int foo();
const foo quux;
```

The two `const foo quux` cases invoke undefined behaviour. The consistently
crazy behaviour would have been to decorate the return type instead.

The "worstest" case is GCC allowing the specification of typedef alignment
different to the defining type. This abomination should not exist and should
never be used. Alignment is not currently modelled by STG.

### Diff helpers

These are mainly used by the `CompareWorker::operator()(Node, Node)` methods.

*   `MarkIncomparable` - nodes are just different
*   `AddNodeDiff` - add node difference, unconditionally
*   `AddEdgeDiff` - add edge difference (addition or removal), unconditionally
*   `MaybeAddNodeDiff` - add node difference (label change), conditionally
*   `MaybeAddEdgeDiff` - add matching edge recursive difference, conditionally

Variants are possible where text is generated lazily on a recursive diff being
found, as are ones where labels are compared and serialised only if different.

## Diff Presentation

In general, there are two problems to solve:

*   generating suitable text, for
    *   nodes and edges
    *   node and edge differences
*   building a report with some meaningful structure

Node and edge description and report structure are the responsibility of the
*reporting* code. See [Naming](naming.md) for more detailed notes on node
description, mainly C type name syntax.

Several report formats are supported and the simplest is (omitting various
complications) a rendering of a difference graph as a difference *tree* where
revisiting nodes is avoided by reporting 2 additional artificial kinds of
difference:

1.  already reported - to handle diff sharing
2.  being compared - to handle diff cycles

The various formats are not documented further here.

Finally, node and edge difference description is currently the responsibility of
the *comparison* code. This may change in the future, but might require a typed
difference graph.