1*67e74705SXin Li//===----------------------------------------------------------------------===// 2*67e74705SXin Li// Clang Static Analyzer 3*67e74705SXin Li//===----------------------------------------------------------------------===// 4*67e74705SXin Li 5*67e74705SXin Li= Library Structure = 6*67e74705SXin Li 7*67e74705SXin LiThe analyzer library has two layers: a (low-level) static analysis 8*67e74705SXin Liengine (GRExprEngine.cpp and friends), and some static checkers 9*67e74705SXin Li(*Checker.cpp). The latter are built on top of the former via the 10*67e74705SXin LiChecker and CheckerVisitor interfaces (Checker.h and 11*67e74705SXin LiCheckerVisitor.h). The Checker interface is designed to be minimal 12*67e74705SXin Liand simple for checker writers, and attempts to isolate them from much 13*67e74705SXin Liof the gore of the internal analysis engine. 14*67e74705SXin Li 15*67e74705SXin Li= How It Works = 16*67e74705SXin Li 17*67e74705SXin LiThe analyzer is inspired by several foundational research papers ([1], 18*67e74705SXin Li[2]). (FIXME: kremenek to add more links) 19*67e74705SXin Li 20*67e74705SXin LiIn a nutshell, the analyzer is basically a source code simulator that 21*67e74705SXin Litraces out possible paths of execution. The state of the program 22*67e74705SXin Li(values of variables and expressions) is encapsulated by the state 23*67e74705SXin Li(ProgramState). A location in the program is called a program point 24*67e74705SXin Li(ProgramPoint), and the combination of state and program point is a 25*67e74705SXin Linode in an exploded graph (ExplodedGraph). The term "exploded" comes 26*67e74705SXin Lifrom exploding the control-flow edges in the control-flow graph (CFG). 27*67e74705SXin Li 28*67e74705SXin LiConceptually the analyzer does a reachability analysis through the 29*67e74705SXin LiExplodedGraph. We start at a root node, which has the entry program 30*67e74705SXin Lipoint and initial state, and then simulate transitions by analyzing 31*67e74705SXin Liindividual expressions. The analysis of an expression can cause the 32*67e74705SXin Listate to change, resulting in a new node in the ExplodedGraph with an 33*67e74705SXin Liupdated program point and an updated state. A bug is found by hitting 34*67e74705SXin Lia node that satisfies some "bug condition" (basically a violation of a 35*67e74705SXin Lichecking invariant). 36*67e74705SXin Li 37*67e74705SXin LiThe analyzer traces out multiple paths by reasoning about branches and 38*67e74705SXin Lithen bifurcating the state: on the true branch the conditions of the 39*67e74705SXin Libranch are assumed to be true and on the false branch the conditions 40*67e74705SXin Liof the branch are assumed to be false. Such "assumptions" create 41*67e74705SXin Liconstraints on the values of the program, and those constraints are 42*67e74705SXin Lirecorded in the ProgramState object (and are manipulated by the 43*67e74705SXin LiConstraintManager). If assuming the conditions of a branch would 44*67e74705SXin Licause the constraints to be unsatisfiable, the branch is considered 45*67e74705SXin Liinfeasible and that path is not taken. This is how we get 46*67e74705SXin Lipath-sensitivity. We reduce exponential blow-up by caching nodes. If 47*67e74705SXin Lia new node with the same state and program point as an existing node 48*67e74705SXin Liwould get generated, the path "caches out" and we simply reuse the 49*67e74705SXin Liexisting node. Thus the ExplodedGraph is not a DAG; it can contain 50*67e74705SXin Licycles as paths loop back onto each other and cache out. 51*67e74705SXin Li 52*67e74705SXin LiProgramState and ExplodedNodes are basically immutable once created. Once 53*67e74705SXin Lione creates a ProgramState, you need to create a new one to get a new 54*67e74705SXin LiProgramState. This immutability is key since the ExplodedGraph represents 55*67e74705SXin Lithe behavior of the analyzed program from the entry point. To 56*67e74705SXin Lirepresent these efficiently, we use functional data structures (e.g., 57*67e74705SXin LiImmutableMaps) which share data between instances. 58*67e74705SXin Li 59*67e74705SXin LiFinally, individual Checkers work by also manipulating the analysis 60*67e74705SXin Listate. The analyzer engine talks to them via a visitor interface. 61*67e74705SXin LiFor example, the PreVisitCallExpr() method is called by GRExprEngine 62*67e74705SXin Lito tell the Checker that we are about to analyze a CallExpr, and the 63*67e74705SXin Lichecker is asked to check for any preconditions that might not be 64*67e74705SXin Lisatisfied. The checker can do nothing, or it can generate a new 65*67e74705SXin LiProgramState and ExplodedNode which contains updated checker state. If it 66*67e74705SXin Lifinds a bug, it can tell the BugReporter object about the bug, 67*67e74705SXin Liproviding it an ExplodedNode which is the last node in the path that 68*67e74705SXin Litriggered the problem. 69*67e74705SXin Li 70*67e74705SXin Li= Notes about C++ = 71*67e74705SXin Li 72*67e74705SXin LiSince now constructors are seen before the variable that is constructed 73*67e74705SXin Liin the CFG, we create a temporary object as the destination region that 74*67e74705SXin Liis constructed into. See ExprEngine::VisitCXXConstructExpr(). 75*67e74705SXin Li 76*67e74705SXin LiIn ExprEngine::processCallExit(), we always bind the object region to the 77*67e74705SXin Lievaluated CXXConstructExpr. Then in VisitDeclStmt(), we compute the 78*67e74705SXin Licorresponding lazy compound value if the variable is not a reference, and 79*67e74705SXin Libind the variable region to the lazy compound value. If the variable 80*67e74705SXin Liis a reference, just use the object region as the initilizer value. 81*67e74705SXin Li 82*67e74705SXin LiBefore entering a C++ method (or ctor/dtor), the 'this' region is bound 83*67e74705SXin Lito the object region. In ctors, we synthesize 'this' region with 84*67e74705SXin LiCXXRecordDecl*, which means we do not use type qualifiers. In methods, we 85*67e74705SXin Lisynthesize 'this' region with CXXMethodDecl*, which has getThisType() 86*67e74705SXin Litaking type qualifiers into account. It does not matter we use qualified 87*67e74705SXin Li'this' region in one method and unqualified 'this' region in another 88*67e74705SXin Limethod, because we only need to ensure the 'this' region is consistent 89*67e74705SXin Liwhen we synthesize it and create it directly from CXXThisExpr in a single 90*67e74705SXin Limethod call. 91*67e74705SXin Li 92*67e74705SXin Li= Working on the Analyzer = 93*67e74705SXin Li 94*67e74705SXin LiIf you are interested in bringing up support for C++ expressions, the 95*67e74705SXin Libest place to look is the visitation logic in GRExprEngine, which 96*67e74705SXin Lihandles the simulation of individual expressions. There are plenty of 97*67e74705SXin Liexamples there of how other expressions are handled. 98*67e74705SXin Li 99*67e74705SXin LiIf you are interested in writing checkers, look at the Checker and 100*67e74705SXin LiCheckerVisitor interfaces (Checker.h and CheckerVisitor.h). Also look 101*67e74705SXin Liat the files named *Checker.cpp for examples on how you can implement 102*67e74705SXin Lithese interfaces. 103*67e74705SXin Li 104*67e74705SXin Li= Debugging the Analyzer = 105*67e74705SXin Li 106*67e74705SXin LiThere are some useful command-line options for debugging. For example: 107*67e74705SXin Li 108*67e74705SXin Li$ clang -cc1 -help | grep analyze 109*67e74705SXin Li -analyze-function <value> 110*67e74705SXin Li -analyzer-display-progress 111*67e74705SXin Li -analyzer-viz-egraph-graphviz 112*67e74705SXin Li ... 113*67e74705SXin Li 114*67e74705SXin LiThe first allows you to specify only analyzing a specific function. 115*67e74705SXin LiThe second prints to the console what function is being analyzed. The 116*67e74705SXin Lithird generates a graphviz dot file of the ExplodedGraph. This is 117*67e74705SXin Liextremely useful when debugging the analyzer and viewing the 118*67e74705SXin Lisimulation results. 119*67e74705SXin Li 120*67e74705SXin LiOf course, viewing the CFG (Control-Flow Graph) is also useful: 121*67e74705SXin Li 122*67e74705SXin Li$ clang -cc1 -help | grep cfg 123*67e74705SXin Li -cfg-add-implicit-dtors Add C++ implicit destructors to CFGs for all analyses 124*67e74705SXin Li -cfg-add-initializers Add C++ initializers to CFGs for all analyses 125*67e74705SXin Li -cfg-dump Display Control-Flow Graphs 126*67e74705SXin Li -cfg-view View Control-Flow Graphs using GraphViz 127*67e74705SXin Li -unoptimized-cfg Generate unoptimized CFGs for all analyses 128*67e74705SXin Li 129*67e74705SXin Li-cfg-dump dumps a textual representation of the CFG to the console, 130*67e74705SXin Liand -cfg-view creates a GraphViz representation. 131*67e74705SXin Li 132*67e74705SXin Li= References = 133*67e74705SXin Li 134*67e74705SXin Li[1] Precise interprocedural dataflow analysis via graph reachability, 135*67e74705SXin Li T Reps, S Horwitz, and M Sagiv, POPL '95, 136*67e74705SXin Li http://portal.acm.org/citation.cfm?id=199462 137*67e74705SXin Li 138*67e74705SXin Li[2] A memory model for static analysis of C programs, Z Xu, T 139*67e74705SXin Li Kremenek, and J Zhang, http://lcs.ios.ac.cn/~xzx/memmodel.pdf 140