1*67e74705SXin LiDataFlowSanitizer Design Document 2*67e74705SXin Li================================= 3*67e74705SXin Li 4*67e74705SXin LiThis document sets out the design for DataFlowSanitizer, a general 5*67e74705SXin Lidynamic data flow analysis. Unlike other Sanitizer tools, this tool is 6*67e74705SXin Linot designed to detect a specific class of bugs on its own. Instead, 7*67e74705SXin Liit provides a generic dynamic data flow analysis framework to be used 8*67e74705SXin Liby clients to help detect application-specific issues within their 9*67e74705SXin Liown code. 10*67e74705SXin Li 11*67e74705SXin LiDataFlowSanitizer is a program instrumentation which can associate 12*67e74705SXin Lia number of taint labels with any data stored in any memory region 13*67e74705SXin Liaccessible by the program. The analysis is dynamic, which means that 14*67e74705SXin Liit operates on a running program, and tracks how the labels propagate 15*67e74705SXin Lithrough that program. The tool shall support a large (>100) number 16*67e74705SXin Liof labels, such that programs which operate on large numbers of data 17*67e74705SXin Liitems may be analysed with each data item being tracked separately. 18*67e74705SXin Li 19*67e74705SXin LiUse Cases 20*67e74705SXin Li--------- 21*67e74705SXin Li 22*67e74705SXin LiThis instrumentation can be used as a tool to help monitor how data 23*67e74705SXin Liflows from a program's inputs (sources) to its outputs (sinks). 24*67e74705SXin LiThis has applications from a privacy/security perspective in that 25*67e74705SXin Lione can audit how a sensitive data item is used within a program and 26*67e74705SXin Liensure it isn't exiting the program anywhere it shouldn't be. 27*67e74705SXin Li 28*67e74705SXin LiInterface 29*67e74705SXin Li--------- 30*67e74705SXin Li 31*67e74705SXin LiA number of functions are provided which will create taint labels, 32*67e74705SXin Liattach labels to memory regions and extract the set of labels 33*67e74705SXin Liassociated with a specific memory region. These functions are declared 34*67e74705SXin Liin the header file ``sanitizer/dfsan_interface.h``. 35*67e74705SXin Li 36*67e74705SXin Li.. code-block:: c 37*67e74705SXin Li 38*67e74705SXin Li /// Creates and returns a base label with the given description and user data. 39*67e74705SXin Li dfsan_label dfsan_create_label(const char *desc, void *userdata); 40*67e74705SXin Li 41*67e74705SXin Li /// Sets the label for each address in [addr,addr+size) to \c label. 42*67e74705SXin Li void dfsan_set_label(dfsan_label label, void *addr, size_t size); 43*67e74705SXin Li 44*67e74705SXin Li /// Sets the label for each address in [addr,addr+size) to the union of the 45*67e74705SXin Li /// current label for that address and \c label. 46*67e74705SXin Li void dfsan_add_label(dfsan_label label, void *addr, size_t size); 47*67e74705SXin Li 48*67e74705SXin Li /// Retrieves the label associated with the given data. 49*67e74705SXin Li /// 50*67e74705SXin Li /// The type of 'data' is arbitrary. The function accepts a value of any type, 51*67e74705SXin Li /// which can be truncated or extended (implicitly or explicitly) as necessary. 52*67e74705SXin Li /// The truncation/extension operations will preserve the label of the original 53*67e74705SXin Li /// value. 54*67e74705SXin Li dfsan_label dfsan_get_label(long data); 55*67e74705SXin Li 56*67e74705SXin Li /// Retrieves a pointer to the dfsan_label_info struct for the given label. 57*67e74705SXin Li const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label); 58*67e74705SXin Li 59*67e74705SXin Li /// Returns whether the given label label contains the label elem. 60*67e74705SXin Li int dfsan_has_label(dfsan_label label, dfsan_label elem); 61*67e74705SXin Li 62*67e74705SXin Li /// If the given label label contains a label with the description desc, returns 63*67e74705SXin Li /// that label, else returns 0. 64*67e74705SXin Li dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc); 65*67e74705SXin Li 66*67e74705SXin LiTaint label representation 67*67e74705SXin Li-------------------------- 68*67e74705SXin Li 69*67e74705SXin LiAs stated above, the tool must track a large number of taint 70*67e74705SXin Lilabels. This poses an implementation challenge, as most multiple-label 71*67e74705SXin Litainting systems assign one label per bit to shadow storage, and 72*67e74705SXin Liunion taint labels using a bitwise or operation. This will not scale 73*67e74705SXin Lito clients which use hundreds or thousands of taint labels, as the 74*67e74705SXin Lilabel union operation becomes O(n) in the number of supported labels, 75*67e74705SXin Liand data associated with it will quickly dominate the live variable 76*67e74705SXin Liset, causing register spills and hampering performance. 77*67e74705SXin Li 78*67e74705SXin LiInstead, a low overhead approach is proposed which is best-case O(log\ 79*67e74705SXin Li:sub:`2` n) during execution. The underlying assumption is that 80*67e74705SXin Lithe required space of label unions is sparse, which is a reasonable 81*67e74705SXin Liassumption to make given that we are optimizing for the case where 82*67e74705SXin Liapplications mostly copy data from one place to another, without often 83*67e74705SXin Liinvoking the need for an actual union operation. The representation 84*67e74705SXin Liof a taint label is a 16-bit integer, and new labels are allocated 85*67e74705SXin Lisequentially from a pool. The label identifier 0 is special, and means 86*67e74705SXin Lithat the data item is unlabelled. 87*67e74705SXin Li 88*67e74705SXin LiWhen a label union operation is requested at a join point (any 89*67e74705SXin Liarithmetic or logical operation with two or more operands, such as 90*67e74705SXin Liaddition), the code checks whether a union is required, whether the 91*67e74705SXin Lisame union has been requested before, and whether one union label 92*67e74705SXin Lisubsumes the other. If so, it returns the previously allocated union 93*67e74705SXin Lilabel. If not, it allocates a new union label from the same pool used 94*67e74705SXin Lifor new labels. 95*67e74705SXin Li 96*67e74705SXin LiSpecifically, the instrumentation pass will insert code like this 97*67e74705SXin Lito decide the union label ``lu`` for a pair of labels ``l1`` 98*67e74705SXin Liand ``l2``: 99*67e74705SXin Li 100*67e74705SXin Li.. code-block:: c 101*67e74705SXin Li 102*67e74705SXin Li if (l1 == l2) 103*67e74705SXin Li lu = l1; 104*67e74705SXin Li else 105*67e74705SXin Li lu = __dfsan_union(l1, l2); 106*67e74705SXin Li 107*67e74705SXin LiThe equality comparison is outlined, to provide an early exit in 108*67e74705SXin Lithe common cases where the program is processing unlabelled data, or 109*67e74705SXin Liwhere the two data items have the same label. ``__dfsan_union`` is 110*67e74705SXin Lia runtime library function which performs all other union computation. 111*67e74705SXin Li 112*67e74705SXin LiFurther optimizations are possible, for example if ``l1`` is known 113*67e74705SXin Liat compile time to be zero (e.g. it is derived from a constant), 114*67e74705SXin Li``l2`` can be used for ``lu``, and vice versa. 115*67e74705SXin Li 116*67e74705SXin LiMemory layout and label management 117*67e74705SXin Li---------------------------------- 118*67e74705SXin Li 119*67e74705SXin LiThe following is the current memory layout for Linux/x86\_64: 120*67e74705SXin Li 121*67e74705SXin Li+---------------+---------------+--------------------+ 122*67e74705SXin Li| Start | End | Use | 123*67e74705SXin Li+===============+===============+====================+ 124*67e74705SXin Li| 0x700000008000|0x800000000000 | application memory | 125*67e74705SXin Li+---------------+---------------+--------------------+ 126*67e74705SXin Li| 0x200200000000|0x700000008000 | unused | 127*67e74705SXin Li+---------------+---------------+--------------------+ 128*67e74705SXin Li| 0x200000000000|0x200200000000 | union table | 129*67e74705SXin Li+---------------+---------------+--------------------+ 130*67e74705SXin Li| 0x000000010000|0x200000000000 | shadow memory | 131*67e74705SXin Li+---------------+---------------+--------------------+ 132*67e74705SXin Li| 0x000000000000|0x000000010000 | reserved by kernel | 133*67e74705SXin Li+---------------+---------------+--------------------+ 134*67e74705SXin Li 135*67e74705SXin LiEach byte of application memory corresponds to two bytes of shadow 136*67e74705SXin Limemory, which are used to store its taint label. As for LLVM SSA 137*67e74705SXin Liregisters, we have not found it necessary to associate a label with 138*67e74705SXin Lieach byte or bit of data, as some other tools do. Instead, labels are 139*67e74705SXin Liassociated directly with registers. Loads will result in a union of 140*67e74705SXin Liall shadow labels corresponding to bytes loaded (which most of the 141*67e74705SXin Litime will be short circuited by the initial comparison) and stores will 142*67e74705SXin Liresult in a copy of the label to the shadow of all bytes stored to. 143*67e74705SXin Li 144*67e74705SXin LiPropagating labels through arguments 145*67e74705SXin Li------------------------------------ 146*67e74705SXin Li 147*67e74705SXin LiIn order to propagate labels through function arguments and return values, 148*67e74705SXin LiDataFlowSanitizer changes the ABI of each function in the translation unit. 149*67e74705SXin LiThere are currently two supported ABIs: 150*67e74705SXin Li 151*67e74705SXin Li* Args -- Argument and return value labels are passed through additional 152*67e74705SXin Li arguments and by modifying the return type. 153*67e74705SXin Li 154*67e74705SXin Li* TLS -- Argument and return value labels are passed through TLS variables 155*67e74705SXin Li ``__dfsan_arg_tls`` and ``__dfsan_retval_tls``. 156*67e74705SXin Li 157*67e74705SXin LiThe main advantage of the TLS ABI is that it is more tolerant of ABI mismatches 158*67e74705SXin Li(TLS storage is not shared with any other form of storage, whereas extra 159*67e74705SXin Liarguments may be stored in registers which under the native ABI are not used 160*67e74705SXin Lifor parameter passing and thus could contain arbitrary values). On the other 161*67e74705SXin Lihand the args ABI is more efficient and allows ABI mismatches to be more easily 162*67e74705SXin Liidentified by checking for nonzero labels in nominally unlabelled programs. 163*67e74705SXin Li 164*67e74705SXin LiImplementing the ABI list 165*67e74705SXin Li------------------------- 166*67e74705SXin Li 167*67e74705SXin LiThe `ABI list <DataFlowSanitizer.html#abi-list>`_ provides a list of functions 168*67e74705SXin Liwhich conform to the native ABI, each of which is callable from an instrumented 169*67e74705SXin Liprogram. This is implemented by replacing each reference to a native ABI 170*67e74705SXin Lifunction with a reference to a function which uses the instrumented ABI. 171*67e74705SXin LiSuch functions are automatically-generated wrappers for the native functions. 172*67e74705SXin LiFor example, given the ABI list example provided in the user manual, the 173*67e74705SXin Lifollowing wrappers will be generated under the args ABI: 174*67e74705SXin Li 175*67e74705SXin Li.. code-block:: llvm 176*67e74705SXin Li 177*67e74705SXin Li define linkonce_odr { i8*, i16 } @"dfsw$malloc"(i64 %0, i16 %1) { 178*67e74705SXin Li entry: 179*67e74705SXin Li %2 = call i8* @malloc(i64 %0) 180*67e74705SXin Li %3 = insertvalue { i8*, i16 } undef, i8* %2, 0 181*67e74705SXin Li %4 = insertvalue { i8*, i16 } %3, i16 0, 1 182*67e74705SXin Li ret { i8*, i16 } %4 183*67e74705SXin Li } 184*67e74705SXin Li 185*67e74705SXin Li define linkonce_odr { i32, i16 } @"dfsw$tolower"(i32 %0, i16 %1) { 186*67e74705SXin Li entry: 187*67e74705SXin Li %2 = call i32 @tolower(i32 %0) 188*67e74705SXin Li %3 = insertvalue { i32, i16 } undef, i32 %2, 0 189*67e74705SXin Li %4 = insertvalue { i32, i16 } %3, i16 %1, 1 190*67e74705SXin Li ret { i32, i16 } %4 191*67e74705SXin Li } 192*67e74705SXin Li 193*67e74705SXin Li define linkonce_odr { i8*, i16 } @"dfsw$memcpy"(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5) { 194*67e74705SXin Li entry: 195*67e74705SXin Li %labelreturn = alloca i16 196*67e74705SXin Li %6 = call i8* @__dfsw_memcpy(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5, i16* %labelreturn) 197*67e74705SXin Li %7 = load i16* %labelreturn 198*67e74705SXin Li %8 = insertvalue { i8*, i16 } undef, i8* %6, 0 199*67e74705SXin Li %9 = insertvalue { i8*, i16 } %8, i16 %7, 1 200*67e74705SXin Li ret { i8*, i16 } %9 201*67e74705SXin Li } 202*67e74705SXin Li 203*67e74705SXin LiAs an optimization, direct calls to native ABI functions will call the 204*67e74705SXin Linative ABI function directly and the pass will compute the appropriate label 205*67e74705SXin Liinternally. This has the advantage of reducing the number of union operations 206*67e74705SXin Lirequired when the return value label is known to be zero (i.e. ``discard`` 207*67e74705SXin Lifunctions, or ``functional`` functions with known unlabelled arguments). 208*67e74705SXin Li 209*67e74705SXin LiChecking ABI Consistency 210*67e74705SXin Li------------------------ 211*67e74705SXin Li 212*67e74705SXin LiDFSan changes the ABI of each function in the module. This makes it possible 213*67e74705SXin Lifor a function with the native ABI to be called with the instrumented ABI, 214*67e74705SXin Lior vice versa, thus possibly invoking undefined behavior. A simple way 215*67e74705SXin Liof statically detecting instances of this problem is to prepend the prefix 216*67e74705SXin Li"dfs$" to the name of each instrumented-ABI function. 217*67e74705SXin Li 218*67e74705SXin LiThis will not catch every such problem; in particular function pointers passed 219*67e74705SXin Liacross the instrumented-native barrier cannot be used on the other side. 220*67e74705SXin LiThese problems could potentially be caught dynamically. 221