Tokenization and AST

A fairly abstract question for all of you. I am looking at participating in a static code analysis project. It uses C and C++ as its development languages, so if any code in your answers could be in one of those languages, that would be great.

My question is: I need to understand some of the basic concepts/constructs used to process code for static analysis. I have heard that people use things like ASTs, tokenization, etc. I'm just wondering, can anyone explain how these things apply when creating a static analysis tool? What I'm mainly after is an explanation of tokenization, because I do not understand it very well. I understand it is a way of handling strings, but I'm not confident in that answer. In addition, I know that the project I'm looking at passes the code through the preprocessor before analyzing it. Can someone explain this? Surely, if this is static code analysis, the code does not need to be preprocessed?

Hope someone can clear this up.

Regards.

+4
3 answers

Tokenization is the act of breaking source text down into language elements, such as operators, variable names, numbers, etc. Parsing interprets sequences of tokens and builds abstract syntax trees (ASTs), which are a structural representation of the program. Tokenization and parsing are necessary for static analysis, but hardly interesting, just as the ante is necessary for playing poker but is not the interesting part of the game.

If you are building a static analyzer (you imply you expect to work on one implemented in C or C++), you will need fundamental knowledge of compilation (though not much about parsing, unless you have to build a parser for the language being analyzed), and certainly about program representations (ASTs, triples, control-flow and data-flow graphs, ...), type inference, and the properties and accuracy limits of analyses (the reason analyses are conservative). Program representations are fundamental because they are the data structures most static analyzers actually process; it is too hard to glean useful facts directly from the program text. These concepts can be used to implement static-analysis capabilities in any programming language, and such tools have been built in many of them; there is nothing special about implementing them in C or C++.

Run, don't walk, to the nearest compiler class for the first part of this. Without it, you cannot do anything effective in tool building. The second part you will likely find in a graduate computer science class.

If you get past this basic knowledge hurdle, you will either decide to implement an analysis tool from scratch, or build on an existing analysis-tool infrastructure. Few decide to build from scratch; building the robust parsers, flow analyzers, etc. needed as the foundation for any particular static analysis takes a huge amount of work (years or decades). Mostly people try to use some existing infrastructure.

There is a huge list of candidates: http://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis

If you insist on processing C or C++ and building your own sophisticated analyses, you really need a tool that can handle real C and C++ code. There are, IMHO, a limited number of good candidates:

  • GCC (and various offshoots, such as Starynkevitch's MELT, which I know little about)
  • Clang (quite a spectacular toolset)
  • DMS (and its C and C++ front ends) [my company's tool]
  • Open64 compiler infrastructure
  • ROSE compiler infrastructure (based on the EDG commercial front end)

Each of them is a large system and requires a large investment to understand and start using. Do not underestimate the learning curve.

There are many other tools that "sort of" process C and C++, but "sort of" is pretty useless for static analysis purposes.

If you intend simply to use a static analysis tool, you can avoid learning most of the parsing and program-representation issues; instead, you will need to learn as much as you can about the specific analysis tool you intend to use. You will still do much better if you understand what the tool does, why it does it, and why it produces the answers it does (typically it gives answers that are unsatisfying in various ways, due to the conservative accuracy limits of the analysis).

Finally, you should be clear on the difference between static analysis and dynamic analysis (using data collected at runtime to determine program properties). Most end users will not care how you obtain information about their code, and each analysis approach has its own strengths and weaknesses.

+4

If you are interested in static analysis, you should not spend much effort on parsing itself (i.e., lexing, parsing, building the abstract syntax tree), because it does not buy you much on its own.

Therefore, I suggest you build on some existing parsing infrastructure. This is especially true if you want to parse C++ source code, because just coding a C++ parser is a difficult task.

For example, you could build on the GCC compiler or on LLVM/Clang. Keep their free-software licenses in mind: GCC is under GPLv3, and Clang/LLVM is under a BSD-like license. GCC is capable of handling not only C and C++, but also Fortran, Go, Ada, and Objective-C.

Because of GCC's GPLv3 license, anything you develop on top of it must also be free software under GPLv3. On the other hand, this also means the GCC community is quite large. If you know OCaml and are interested only in static analysis of C, you might consider Frama-C (LGPL license).

Actually, I work on GCC, providing the MELT extension (MELT is also GPLv3 licensed). MELT is a high-level domain-specific language for extending GCC and should be an ideal choice for working on static analysis using GCC's powerful framework and internal representations (Gimple, Tree, ...).

(Of course, the LLVM people would be happy for you to use their work instead; and almost no one knows both LLVM and GCC well, because learning even one such large piece of software is already a challenge, so people have to choose where to put their time.)

So my suggestion is to join an existing static analysis or compiler effort, rather than reinvent the wheel by starting from scratch (because that means years of effort, just to get a good parser).

That way, you will take advantage of the existing infrastructure (so you will not have to worry about ASTs and parsing), and you will be improving an existing project.

I would be very happy if you became interested in MELT. FWIW, there is a MELT tutorial at the HiPEAC conference in Paris on January 24th, 2012.

+5

If you are looking into static analysis for C and C++, your first stop should be libclang, along with its associated documentation.

As for the use of the preprocessor: static analysis is meant (mostly) to catch errors and possibly unintended behavior in the code, so it should analyze what the compiler will compile, not what the user sees.

I'm sure @Ira Baxter will come along with a much more complete answer :)

+2

Source: https://habr.com/ru/post/1391908/

