xref: /aosp_15_r20/external/clang/docs/DataFlowSanitizerDesign.rst (revision 67e74705e28f6214e480b399dd47ea732279e315)
1*67e74705SXin LiDataFlowSanitizer Design Document
2*67e74705SXin Li=================================
3*67e74705SXin Li
4*67e74705SXin LiThis document sets out the design for DataFlowSanitizer, a general
5*67e74705SXin Lidynamic data flow analysis.  Unlike other Sanitizer tools, this tool is
6*67e74705SXin Linot designed to detect a specific class of bugs on its own. Instead,
7*67e74705SXin Liit provides a generic dynamic data flow analysis framework to be used
8*67e74705SXin Liby clients to help detect application-specific issues within their
9*67e74705SXin Liown code.
10*67e74705SXin Li
11*67e74705SXin LiDataFlowSanitizer is a program instrumentation which can associate
12*67e74705SXin Lia number of taint labels with any data stored in any memory region
13*67e74705SXin Liaccessible by the program. The analysis is dynamic, which means that
14*67e74705SXin Liit operates on a running program, and tracks how the labels propagate
15*67e74705SXin Lithrough that program. The tool shall support a large (>100) number
16*67e74705SXin Liof labels, such that programs which operate on large numbers of data
17*67e74705SXin Liitems may be analysed with each data item being tracked separately.
18*67e74705SXin Li
19*67e74705SXin LiUse Cases
20*67e74705SXin Li---------
21*67e74705SXin Li
22*67e74705SXin LiThis instrumentation can be used as a tool to help monitor how data
23*67e74705SXin Liflows from a program's inputs (sources) to its outputs (sinks).
24*67e74705SXin LiThis has applications from a privacy/security perspective in that
25*67e74705SXin Lione can audit how a sensitive data item is used within a program and
26*67e74705SXin Liensure it isn't exiting the program anywhere it shouldn't be.
27*67e74705SXin Li
28*67e74705SXin LiInterface
29*67e74705SXin Li---------
30*67e74705SXin Li
31*67e74705SXin LiA number of functions are provided which will create taint labels,
32*67e74705SXin Liattach labels to memory regions and extract the set of labels
33*67e74705SXin Liassociated with a specific memory region. These functions are declared
34*67e74705SXin Liin the header file ``sanitizer/dfsan_interface.h``.
35*67e74705SXin Li
36*67e74705SXin Li.. code-block:: c
37*67e74705SXin Li
38*67e74705SXin Li  /// Creates and returns a base label with the given description and user data.
39*67e74705SXin Li  dfsan_label dfsan_create_label(const char *desc, void *userdata);
40*67e74705SXin Li
41*67e74705SXin Li  /// Sets the label for each address in [addr,addr+size) to \c label.
42*67e74705SXin Li  void dfsan_set_label(dfsan_label label, void *addr, size_t size);
43*67e74705SXin Li
44*67e74705SXin Li  /// Sets the label for each address in [addr,addr+size) to the union of the
45*67e74705SXin Li  /// current label for that address and \c label.
46*67e74705SXin Li  void dfsan_add_label(dfsan_label label, void *addr, size_t size);
47*67e74705SXin Li
48*67e74705SXin Li  /// Retrieves the label associated with the given data.
49*67e74705SXin Li  ///
50*67e74705SXin Li  /// The type of 'data' is arbitrary.  The function accepts a value of any type,
51*67e74705SXin Li  /// which can be truncated or extended (implicitly or explicitly) as necessary.
52*67e74705SXin Li  /// The truncation/extension operations will preserve the label of the original
53*67e74705SXin Li  /// value.
54*67e74705SXin Li  dfsan_label dfsan_get_label(long data);
55*67e74705SXin Li
56*67e74705SXin Li  /// Retrieves a pointer to the dfsan_label_info struct for the given label.
57*67e74705SXin Li  const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label);
58*67e74705SXin Li
59*67e74705SXin Li  /// Returns whether the given label label contains the label elem.
60*67e74705SXin Li  int dfsan_has_label(dfsan_label label, dfsan_label elem);
61*67e74705SXin Li
62*67e74705SXin Li  /// If the given label label contains a label with the description desc, returns
63*67e74705SXin Li  /// that label, else returns 0.
64*67e74705SXin Li  dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc);
65*67e74705SXin Li
66*67e74705SXin LiTaint label representation
67*67e74705SXin Li--------------------------
68*67e74705SXin Li
69*67e74705SXin LiAs stated above, the tool must track a large number of taint
70*67e74705SXin Lilabels. This poses an implementation challenge, as most multiple-label
71*67e74705SXin Litainting systems assign one label per bit to shadow storage, and
72*67e74705SXin Liunion taint labels using a bitwise or operation. This will not scale
73*67e74705SXin Lito clients which use hundreds or thousands of taint labels, as the
74*67e74705SXin Lilabel union operation becomes O(n) in the number of supported labels,
75*67e74705SXin Liand data associated with it will quickly dominate the live variable
76*67e74705SXin Liset, causing register spills and hampering performance.
77*67e74705SXin Li
78*67e74705SXin LiInstead, a low overhead approach is proposed which is best-case O(log\
79*67e74705SXin Li:sub:`2` n) during execution. The underlying assumption is that
80*67e74705SXin Lithe required space of label unions is sparse, which is a reasonable
81*67e74705SXin Liassumption to make given that we are optimizing for the case where
82*67e74705SXin Liapplications mostly copy data from one place to another, without often
83*67e74705SXin Liinvoking the need for an actual union operation. The representation
84*67e74705SXin Liof a taint label is a 16-bit integer, and new labels are allocated
85*67e74705SXin Lisequentially from a pool. The label identifier 0 is special, and means
86*67e74705SXin Lithat the data item is unlabelled.
87*67e74705SXin Li
88*67e74705SXin LiWhen a label union operation is requested at a join point (any
89*67e74705SXin Liarithmetic or logical operation with two or more operands, such as
90*67e74705SXin Liaddition), the code checks whether a union is required, whether the
91*67e74705SXin Lisame union has been requested before, and whether one union label
92*67e74705SXin Lisubsumes the other. If so, it returns the previously allocated union
93*67e74705SXin Lilabel. If not, it allocates a new union label from the same pool used
94*67e74705SXin Lifor new labels.
95*67e74705SXin Li
96*67e74705SXin LiSpecifically, the instrumentation pass will insert code like this
97*67e74705SXin Lito decide the union label ``lu`` for a pair of labels ``l1``
98*67e74705SXin Liand ``l2``:
99*67e74705SXin Li
100*67e74705SXin Li.. code-block:: c
101*67e74705SXin Li
102*67e74705SXin Li  if (l1 == l2)
103*67e74705SXin Li    lu = l1;
104*67e74705SXin Li  else
105*67e74705SXin Li    lu = __dfsan_union(l1, l2);
106*67e74705SXin Li
107*67e74705SXin LiThe equality comparison is outlined, to provide an early exit in
108*67e74705SXin Lithe common cases where the program is processing unlabelled data, or
109*67e74705SXin Liwhere the two data items have the same label.  ``__dfsan_union`` is
110*67e74705SXin Lia runtime library function which performs all other union computation.
111*67e74705SXin Li
112*67e74705SXin LiFurther optimizations are possible, for example if ``l1`` is known
113*67e74705SXin Liat compile time to be zero (e.g. it is derived from a constant),
114*67e74705SXin Li``l2`` can be used for ``lu``, and vice versa.
115*67e74705SXin Li
116*67e74705SXin LiMemory layout and label management
117*67e74705SXin Li----------------------------------
118*67e74705SXin Li
119*67e74705SXin LiThe following is the current memory layout for Linux/x86\_64:
120*67e74705SXin Li
121*67e74705SXin Li+---------------+---------------+--------------------+
122*67e74705SXin Li|    Start      |    End        |        Use         |
123*67e74705SXin Li+===============+===============+====================+
124*67e74705SXin Li| 0x700000008000|0x800000000000 | application memory |
125*67e74705SXin Li+---------------+---------------+--------------------+
126*67e74705SXin Li| 0x200200000000|0x700000008000 |       unused       |
127*67e74705SXin Li+---------------+---------------+--------------------+
128*67e74705SXin Li| 0x200000000000|0x200200000000 |    union table     |
129*67e74705SXin Li+---------------+---------------+--------------------+
130*67e74705SXin Li| 0x000000010000|0x200000000000 |   shadow memory    |
131*67e74705SXin Li+---------------+---------------+--------------------+
132*67e74705SXin Li| 0x000000000000|0x000000010000 | reserved by kernel |
133*67e74705SXin Li+---------------+---------------+--------------------+
134*67e74705SXin Li
135*67e74705SXin LiEach byte of application memory corresponds to two bytes of shadow
136*67e74705SXin Limemory, which are used to store its taint label. As for LLVM SSA
137*67e74705SXin Liregisters, we have not found it necessary to associate a label with
138*67e74705SXin Lieach byte or bit of data, as some other tools do. Instead, labels are
139*67e74705SXin Liassociated directly with registers.  Loads will result in a union of
140*67e74705SXin Liall shadow labels corresponding to bytes loaded (which most of the
141*67e74705SXin Litime will be short circuited by the initial comparison) and stores will
142*67e74705SXin Liresult in a copy of the label to the shadow of all bytes stored to.
143*67e74705SXin Li
144*67e74705SXin LiPropagating labels through arguments
145*67e74705SXin Li------------------------------------
146*67e74705SXin Li
147*67e74705SXin LiIn order to propagate labels through function arguments and return values,
148*67e74705SXin LiDataFlowSanitizer changes the ABI of each function in the translation unit.
149*67e74705SXin LiThere are currently two supported ABIs:
150*67e74705SXin Li
151*67e74705SXin Li* Args -- Argument and return value labels are passed through additional
152*67e74705SXin Li  arguments and by modifying the return type.
153*67e74705SXin Li
154*67e74705SXin Li* TLS -- Argument and return value labels are passed through TLS variables
155*67e74705SXin Li  ``__dfsan_arg_tls`` and ``__dfsan_retval_tls``.
156*67e74705SXin Li
157*67e74705SXin LiThe main advantage of the TLS ABI is that it is more tolerant of ABI mismatches
158*67e74705SXin Li(TLS storage is not shared with any other form of storage, whereas extra
159*67e74705SXin Liarguments may be stored in registers which under the native ABI are not used
160*67e74705SXin Lifor parameter passing and thus could contain arbitrary values).  On the other
161*67e74705SXin Lihand the args ABI is more efficient and allows ABI mismatches to be more easily
162*67e74705SXin Liidentified by checking for nonzero labels in nominally unlabelled programs.
163*67e74705SXin Li
164*67e74705SXin LiImplementing the ABI list
165*67e74705SXin Li-------------------------
166*67e74705SXin Li
167*67e74705SXin LiThe `ABI list <DataFlowSanitizer.html#abi-list>`_ provides a list of functions
168*67e74705SXin Liwhich conform to the native ABI, each of which is callable from an instrumented
169*67e74705SXin Liprogram.  This is implemented by replacing each reference to a native ABI
170*67e74705SXin Lifunction with a reference to a function which uses the instrumented ABI.
171*67e74705SXin LiSuch functions are automatically-generated wrappers for the native functions.
172*67e74705SXin LiFor example, given the ABI list example provided in the user manual, the
173*67e74705SXin Lifollowing wrappers will be generated under the args ABI:
174*67e74705SXin Li
175*67e74705SXin Li.. code-block:: llvm
176*67e74705SXin Li
177*67e74705SXin Li    define linkonce_odr { i8*, i16 } @"dfsw$malloc"(i64 %0, i16 %1) {
178*67e74705SXin Li    entry:
179*67e74705SXin Li      %2 = call i8* @malloc(i64 %0)
180*67e74705SXin Li      %3 = insertvalue { i8*, i16 } undef, i8* %2, 0
181*67e74705SXin Li      %4 = insertvalue { i8*, i16 } %3, i16 0, 1
182*67e74705SXin Li      ret { i8*, i16 } %4
183*67e74705SXin Li    }
184*67e74705SXin Li
185*67e74705SXin Li    define linkonce_odr { i32, i16 } @"dfsw$tolower"(i32 %0, i16 %1) {
186*67e74705SXin Li    entry:
187*67e74705SXin Li      %2 = call i32 @tolower(i32 %0)
188*67e74705SXin Li      %3 = insertvalue { i32, i16 } undef, i32 %2, 0
189*67e74705SXin Li      %4 = insertvalue { i32, i16 } %3, i16 %1, 1
190*67e74705SXin Li      ret { i32, i16 } %4
191*67e74705SXin Li    }
192*67e74705SXin Li
193*67e74705SXin Li    define linkonce_odr { i8*, i16 } @"dfsw$memcpy"(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5) {
194*67e74705SXin Li    entry:
195*67e74705SXin Li      %labelreturn = alloca i16
196*67e74705SXin Li      %6 = call i8* @__dfsw_memcpy(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5, i16* %labelreturn)
197*67e74705SXin Li      %7 = load i16* %labelreturn
198*67e74705SXin Li      %8 = insertvalue { i8*, i16 } undef, i8* %6, 0
199*67e74705SXin Li      %9 = insertvalue { i8*, i16 } %8, i16 %7, 1
200*67e74705SXin Li      ret { i8*, i16 } %9
201*67e74705SXin Li    }
202*67e74705SXin Li
203*67e74705SXin LiAs an optimization, direct calls to native ABI functions will call the
204*67e74705SXin Linative ABI function directly and the pass will compute the appropriate label
205*67e74705SXin Liinternally.  This has the advantage of reducing the number of union operations
206*67e74705SXin Lirequired when the return value label is known to be zero (i.e. ``discard``
207*67e74705SXin Lifunctions, or ``functional`` functions with known unlabelled arguments).
208*67e74705SXin Li
209*67e74705SXin LiChecking ABI Consistency
210*67e74705SXin Li------------------------
211*67e74705SXin Li
212*67e74705SXin LiDFSan changes the ABI of each function in the module.  This makes it possible
213*67e74705SXin Lifor a function with the native ABI to be called with the instrumented ABI,
214*67e74705SXin Lior vice versa, thus possibly invoking undefined behavior.  A simple way
215*67e74705SXin Liof statically detecting instances of this problem is to prepend the prefix
216*67e74705SXin Li"dfs$" to the name of each instrumented-ABI function.
217*67e74705SXin Li
218*67e74705SXin LiThis will not catch every such problem; in particular function pointers passed
219*67e74705SXin Liacross the instrumented-native barrier cannot be used on the other side.
220*67e74705SXin LiThese problems could potentially be caught dynamically.
221