Tokenization is the act of breaking source text into language elements such as operators, variable names, numbers, etc. Parsing interprets the sequence of tokens and builds an abstract syntax tree (AST), which is a particular representation of the program. Tokenization and parsing are necessary for static analysis, but hardly interesting, just as anteing up is necessary for playing poker but is not the interesting part of the game.
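To make these two steps concrete, here is a minimal, hypothetical sketch in C++ (not taken from any real tool; the `Token` and `Node` types are illustrative only) of how the statement `x = a + b * 2;` might be tokenized and then parsed into an AST:

```cpp
// Illustrative only: tokenize and parse "x = a + b * 2;" by hand.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Number, Assign, Plus, Star, Semicolon };

struct Token {
    TokenKind kind;
    std::string text;   // the matched source text
};

struct Node {
    std::string label;                            // operator or operand
    std::vector<std::unique_ptr<Node>> children;  // sub-expressions
};

std::unique_ptr<Node> leaf(std::string s) {
    auto n = std::make_unique<Node>();
    n->label = std::move(s);
    return n;
}

std::unique_ptr<Node> op(std::string s, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = leaf(std::move(s));
    n->children.push_back(std::move(l));
    n->children.push_back(std::move(r));
    return n;
}

void print(const Node& n, int depth = 0) {
    std::cout << std::string(depth * 2, ' ') << n.label << '\n';
    for (const auto& c : n.children) print(*c, depth + 1);
}

int main() {
    // Tokenization: the source text becomes a flat sequence of tokens.
    std::vector<Token> tokens = {
        {TokenKind::Identifier, "x"}, {TokenKind::Assign, "="},
        {TokenKind::Identifier, "a"}, {TokenKind::Plus, "+"},
        {TokenKind::Identifier, "b"}, {TokenKind::Star, "*"},
        {TokenKind::Number, "2"},     {TokenKind::Semicolon, ";"}};
    (void)tokens;

    // Parsing: the token sequence becomes a tree; '*' binds tighter than '+'.
    auto ast = op("=", leaf("x"), op("+", leaf("a"), op("*", leaf("b"), leaf("2"))));
    print(*ast);
}
```

The flat token vector is what the tokenizer produces; the nested `Node` tree is what the parser produces.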
If you are building a static analyzer (you imply you expect to work on one implemented in C or C++), you will need fundamental knowledge of compilation: not so much about parsing (unless you are building the parser for the language to be analyzed), but certainly about program representations (ASTs, triples, control flow and data flow graphs, ...), type and property inference, and the limits on analysis accuracy (the reason analyses are conservative). Program representations are fundamental because they are the data structures most static analyzers actually process; it is too hard to extract useful facts directly from the program text. These concepts can be used to implement static analysis capabilities in any programming language, for tools that analyze any language; there is nothing special about implementing them in C or C++.
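As a rough illustration of what such representations look like, the sketch below (an assumed toy structure, not any particular infrastructure's API) models three-address code ("triples") grouped into basic blocks and linked into a control flow graph, which is the kind of structure a data flow analysis iterates over:

```cpp
// Illustrative analyzer data structures: triples grouped into basic blocks,
// linked into a control flow graph. Real infrastructures are far richer.
#include <string>
#include <vector>

struct Triple {                        // e.g. t1 = a < b
    std::string result, op, lhs, rhs;
};

struct BasicBlock {
    std::vector<Triple> code;          // straight-line code, no internal branches
    std::vector<int> successors;       // indices of blocks control may flow to
};

struct ControlFlowGraph {
    std::vector<BasicBlock> blocks;    // blocks[0] is the entry block
};

int main() {
    // "if (a < b) x = a; else x = b;" forms a diamond: 0 -> {1,2} -> 3.
    ControlFlowGraph cfg;
    cfg.blocks.resize(4);
    cfg.blocks[0].code = {{"t1", "<", "a", "b"}};
    cfg.blocks[0].successors = {1, 2};
    cfg.blocks[1].code = {{"x", "copy", "a", ""}};
    cfg.blocks[1].successors = {3};
    cfg.blocks[2].code = {{"x", "copy", "b", ""}};
    cfg.blocks[2].successors = {3};
    // Block 3 is the join; a data flow analysis (liveness, reaching
    // definitions, ...) propagates facts along successor edges to a fixed point.
}
```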
Run, don't walk, to your nearest compiler class for the first part of this. If you don't have one, you cannot do anything effective in tool building. The second part you are likely to find in a graduate computer science class.
If you get past this basic-knowledge issue, you will either decide to implement the analysis tool from scratch or build on existing analysis-tool infrastructure. Few decide to build from scratch; it takes a huge amount of work (years or decades) to build the robust parsers, flow analyzers, etc. needed as the foundation for any specific static analysis. Mostly people try to use some existing infrastructure.
There is a huge list of candidates: http://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
If you insist on processing C or C++ and building your own sophisticated analysis, you really need a tool that can handle real C and C++ code. There are, IMHO, a limited number of good candidates:
- GCC (and various graftings onto it, such as Starynkevitch's MELT, about which I know little)
- Clang (quite a spectacular set of tools; see the sketch after this list)
- DMS (and its C and C++ front ends) [my company's tool]
- Open64 compiler infrastructure
- ROSE compiler infrastructure (based on the EDG industrial front end)
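Of these, Clang is probably the easiest to experiment with first through its stable C interface, libclang. As a hedged starting point (assuming the libclang headers and library are installed on your system; build and link flags vary by platform), the sketch below parses a source file and dumps the kind and spelling of every AST cursor. It is not an analyzer, just a way to see the representation you would be analyzing:

```cpp
// Minimal libclang sketch: parse a C/C++ file and print its AST cursors.
// Build roughly as: c++ dump_ast.cpp -lclang -o dump_ast  (paths vary).
#include <clang-c/Index.h>
#include <cstdio>

static enum CXChildVisitResult visit(CXCursor cursor, CXCursor /*parent*/,
                                     CXClientData /*data*/) {
    CXString kind = clang_getCursorKindSpelling(clang_getCursorKind(cursor));
    CXString name = clang_getCursorSpelling(cursor);
    std::printf("%s: %s\n", clang_getCString(kind), clang_getCString(name));
    clang_disposeString(kind);
    clang_disposeString(name);
    return CXChildVisit_Recurse;   // walk the whole tree
}

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s file.cpp\n", argv[0]);
        return 1;
    }
    CXIndex index = clang_createIndex(0, 1);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, argv[1], nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    if (!tu) {
        std::fprintf(stderr, "failed to parse %s\n", argv[1]);
        return 1;
    }
    clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(index);
    return 0;
}
```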
Each of them is a large system and requires a large investment to understand and start using. Do not underestimate the learning curve.
There are many other tools out there that sort of process C and C++, but "sort of" is pretty useless for static analysis purposes.
If you intend simply to use a static analysis tool, you can avoid learning most of the issues of parsing and program representation; instead, you will need to learn as much as you can about the specific analysis tool you intend to use. You will do much better if you understand what the tool does, why it does it, and why it produces the answers it does (as a rule, it gives answers that are unsatisfying in many ways due to the conservative limits on analysis accuracy).
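As a hypothetical illustration of such a conservative answer: in the fragment below, `p` is dereferenced only when `flag` is true, and it is null only when `flag` is false, yet a path-insensitive tool that does not correlate the two branches may still report a possible null dereference:

```cpp
// Hypothetical false positive: 'p' is dereferenced only when 'flag' is true,
// and it is null only when 'flag' is false, but an analysis that does not
// correlate the two branches must conservatively warn about '*p'.
#include <cstdio>

void report(bool flag) {
    int value = 42;
    int* p = nullptr;
    if (flag) {
        p = &value;              // p is non-null exactly when flag is true
    }
    // ... other work ...
    if (flag) {
        std::printf("%d\n", *p); // never reached with p == nullptr,
                                 // yet a path-insensitive tool may warn here
    }
}

int main() {
    report(true);
    report(false);
}
```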
Finally, you should be clear that you understand the difference between static analysis and dynamic analysis (using data collected at runtime to determine program properties). Most end users do not care how you get information about their code, and each analysis approach has its own strengths and weaknesses.