Get an AST from a .NET assembly without source code (IL code)

I would like to analyze .NET assemblies in a way that is independent of the source language (C#, VB.NET or anything else).

I know about Roslyn and NRefactory, but they seem to work only at the C# source code level.
There is also the "Common Compiler Infrastructure: Code Model and AST API" project on CodePlex, which claims to "support a hierarchical object model that represents code blocks in a language-independent structured form". That sounds like exactly what I'm looking for.

However, I cannot find any useful documentation or code that actually does this.

Any advice on how to do this?

Can Mono.Cecil help here?

+5
4 answers

You can do this, and there is even an example of it (albeit a tiny one) in the ILSpy source.

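    // Assumes Mono.Cecil and the ICSharpCode.Decompiler library that ships with ILSpy.
    var assembly = AssemblyDefinition.ReadAssembly("path/to/assembly.dll");
    var astBuilder = new AstBuilder(new DecompilerContext(assembly.MainModule));
    astBuilder.AddAssembly(assembly); // called on the builder; the original snippet used an undefined "decompiler" variable
    astBuilder.SyntaxTree...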
+2

The CCI code model sits somewhere between an IL disassembler and a full C# decompiler: it gives your code some structure (such as if statements and expressions), but it also contains low-level stack operations such as push and pop.

CCI contains a sample that shows this: PeToText.

For example, to get the code model for the first method of type Program (in the global namespace), you can use this code:

    string fileName = "whatever.exe";
    using (var host = new PeReader.DefaultHost())
    {
        var module = (IModule)host.LoadUnitFrom(fileName);
        var type = (ITypeDefinition)module.UnitNamespaceRoot.Members
            .Single(m => m.Name.Value == "Program");
        var method = (IMethodDefinition)type.Members.First();
        var methodBody = new SourceMethodBody(method.Body, host, null, null);
    }

To demonstrate, if you compile the above code and then decompile it with PeToText, you get:

    Microsoft.Cci.ITypeDefinition local_3;
    Microsoft.Cci.ILToCodeModel.SourceMethodBody local_5;
    string local_0 = "C:\\code\\tmp\\nuget tmp 2015\\bin\\Debug\\nuget tmp 2015.exe";
    Microsoft.Cci.PeReader.DefaultHost local_1 = new Microsoft.Cci.PeReader.DefaultHost();
    try
    {
        push (Microsoft.Cci.IModule)local_1.LoadUnitFrom(local_0).UnitNamespaceRoot.Members;
        push Program.<>c.<>9__0_0;
        if (dup == default(System.Func<Microsoft.Cci.INamespaceMember, bool>))
        {
            pop;
            push Program.<>c.<>9.<Main0>b__0_0;
            Program.<>c.<>9__0_0 = dup;
        }
        local_3 = (Microsoft.Cci.ITypeDefinition)System.Linq.Enumerable.Single<Microsoft.Cci.INamespaceMember>(pop, pop);
        local_5 = new Microsoft.Cci.ILToCodeModel.SourceMethodBody((Microsoft.Cci.IMethodDefinition)System.Linq.Enumerable.First<Microsoft.Cci.ITypeDefinitionMember>(local_3.Members).Body, local_1, (Microsoft.Cci.ISourceLocationProvider)null, (Microsoft.Cci.ILocalScopeProvider)null, 0);
    }
    finally
    {
        if (local_1 != default(Microsoft.Cci.PeReader.DefaultHost))
        {
            local_1.Dispose();
        }
    }

Note all the push, pop and dup statements, as well as the lambda caching condition.
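To see why leftover push and pop statements appear at all, it may help to look at how a stack-based instruction sequence is folded into expressions in the first place. The following is a minimal sketch, not CCI's actual algorithm: the `StackFolder` class and the tiny `ldc`/`add`/`mul` instruction set are hypothetical and exist only for illustration. It simulates the evaluation stack symbolically, so each instruction pushes an expression string instead of a value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration: folds straight-line stack code into one expression.
public static class StackFolder
{
    public static string Fold(IEnumerable<string> instructions)
    {
        // The stack holds expression text instead of runtime values.
        var stack = new Stack<string>();
        foreach (string instruction in instructions)
        {
            string[] parts = instruction.Split(' ');
            switch (parts[0])
            {
                case "ldc": stack.Push(parts[1]); break;   // load constant
                case "add": Binary(stack, "+"); break;
                case "mul": Binary(stack, "*"); break;
                default: throw new NotSupportedException(instruction);
            }
        }
        return stack.Pop();
    }

    private static void Binary(Stack<string> stack, string op)
    {
        // The right operand was pushed last, so it is popped first.
        string right = stack.Pop();
        string left = stack.Pop();
        stack.Push("(" + left + " " + op + " " + right + ")");
    }
}
```

For example, `StackFolder.Fold(new[] { "ldc 1", "ldc 2", "add", "ldc 3", "mul" })` returns `((1 + 2) * 3)`. This simple folding only works inside straight-line code; when a value stays on the stack across a branch, as with the `dup` in the lambda cache check above, a decompiler either needs extra analysis or has to leave explicit stack operations in its output.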

+1

As far as I know, it is impossible to build an AST from a binary (without sources), since the AST is created by the parser as part of compiling the sources. Mono.Cecil will not help, because it only lets you modify opcodes and metadata, not analyze the assembly.

But since this is .NET, you can dump the IL code from a DLL using ildasm. You can then feed the generated source to any parser that has a CIL grammar and get the AST from that parser. The problem is that, as far as I know, there is only one public CIL grammar, so you do not really have a choice. And ECMA-335 is large enough that writing your own grammar is a bad idea. Therefore, I can offer only one solution:

  • Pass the assembly to ildasm.exe to get the CIL.
  • Then pass the CIL to an ANTLR v3 parser built from this CIL grammar (note that it is a little outdated: the grammar was created in 2004 and the latest CIL specification is from 2006, but CIL does not change much).
  • You can then freely work with the AST created by ANTLR.

Note that you will need ANTLR v3, not v4, since the grammar is written for the third version, and it is hardly possible to port it to v4 without a good knowledge of ANTLR syntax.
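As a rough picture of what the parser has to deal with after the first step: ildasm writes each instruction as a line of the form `IL_0001:  ldstr      "Hello"`. The sketch below is not a substitute for the ANTLR grammar, since it ignores directives, classes, methods and everything else in the file. It only pulls instruction lines out of the text with a regular expression; the `IldasmLines` class is a hypothetical name used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts (offset, opcode, operand) from ildasm text output.
public static class IldasmLines
{
    // Matches lines such as "    IL_0001:  ldstr      \"Hello\"".
    private static readonly Regex InstructionLine =
        new Regex(@"^\s*IL_([0-9a-fA-F]{4}):\s+(\S+)\s*(.*)$");

    public static List<(int Offset, string OpCode, string Operand)> Parse(IEnumerable<string> lines)
    {
        var result = new List<(int Offset, string OpCode, string Operand)>();
        foreach (string line in lines)
        {
            Match match = InstructionLine.Match(line);
            if (!match.Success)
                continue; // directives, labels-only lines, braces, etc. are skipped

            // The label encodes the instruction's byte offset in hexadecimal.
            int offset = int.Parse(match.Groups[1].Value,
                NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            result.Add((offset, match.Groups[2].Value, match.Groups[3].Value.Trim()));
        }
        return result;
    }
}
```

The result is still a flat instruction list, not a tree; building structure on top of it is the job of the grammar and the parser.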

You can also try exploring the sources of Microsoft's new RyuJIT compiler on GitHub (part of CoreCLR). I'm not sure whether this helps, but in theory it should contain a CIL grammar and parser implementation, since it works with CIL code. However, it is written in C++, has a huge code base, and lacks documentation because it is under active development, so it might be easier to stick with ANTLR.

0

If you treat the binary .NET file as a stream of bytes, you should be able to "parse" it just fine.

You just write a grammar whose tokens are essentially bytes. You can, of course, build a classic lexer/parser with just about any lexer/parser toolkit by configuring the lexer to read single bytes as tokens.
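A minimal sketch of the "bytes as tokens" idea, restricted to the IL of a single method body rather than the whole PE file: the `IlByteLexer` class below is a hypothetical name, and the opcode table is built by reflecting over the public fields of `System.Reflection.Emit.OpCodes`. It assumes the input is a well-formed IL byte array (such as the one returned by `MethodBody.GetILAsByteArray()`) and a little-endian machine, since `BitConverter` is used to read the `switch` case count.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical lexer: turns raw IL bytes into a list of opcode names.
public static class IlByteLexer
{
    private static readonly Dictionary<short, OpCode> Table = BuildTable();

    private static Dictionary<short, OpCode> BuildTable()
    {
        var table = new Dictionary<short, OpCode>();
        foreach (FieldInfo field in typeof(OpCodes).GetFields(BindingFlags.Public | BindingFlags.Static))
        {
            if (field.GetValue(null) is OpCode op)
                table[op.Value] = op;
        }
        return table;
    }

    // Number of operand bytes that follow the opcode at position pos.
    private static int OperandSize(OpCode op, byte[] il, int pos)
    {
        switch (op.OperandType)
        {
            case OperandType.InlineNone: return 0;
            case OperandType.ShortInlineBrTarget:
            case OperandType.ShortInlineI:
            case OperandType.ShortInlineVar: return 1;
            case OperandType.InlineVar: return 2;
            case OperandType.InlineI8:
            case OperandType.InlineR: return 8;
            case OperandType.InlineSwitch:
                return 4 + 4 * BitConverter.ToInt32(il, pos); // case count + jump table
            default: return 4; // tokens, int32, float32, long branch targets
        }
    }

    public static List<string> Tokenize(byte[] il)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < il.Length)
        {
            byte first = il[pos++];
            // Two-byte opcodes start with the 0xFE prefix byte.
            short key = first == 0xFE
                ? unchecked((short)(0xFE00 | il[pos++]))
                : first;
            OpCode op = Table[key];
            tokens.Add(op.Name);
            pos += OperandSize(op, il, pos); // skip the operand bytes
        }
        return tokens;
    }
}
```

Calling `IlByteLexer.Tokenize(new byte[] { 0x00, 0x17, 0x18, 0x58, 0x2A })` yields `nop`, `ldc.i4.1`, `ldc.i4.2`, `add`, `ret`. A real byte-level grammar would also have to cover the PE headers and metadata tables around the method bodies.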

You can then build the AST using the standard AST-building machinery of your parser generator (by hand with YACC, automatically with ANTLR4).

What you will find, of course, is that "parsing" is not enough; you still have to build symbol tables and perform control and data flow analysis if you intend to analyze the corresponding code seriously. See my essay on Life After Parsing.

You will also probably have to account for the "distinguished" functions that provide key runtime facilities for the particular programming language that actually generated the CIL code. These will make your analyzers language-dependent. You can still share the part of the analysis that works on generic CIL.

0

Source: https://habr.com/ru/post/1207399/

