Subprogramme

Question

Is there any paper describing an algorithm or technique for inferring subroutines from a compiled program? In other words: is there an algorithm for finding blocks of code that appear more than once in a program? The instructions in these blocks could be reordered (without changing the behavior of the program, of course) to make a match more likely.

This process can be seen as the opposite of the subroutine inlining that compilers perform to avoid calls at the cost of a larger binary.

It seems to me that this is a very difficult theoretical problem.

+6

assembly compiler-theory computation-theory information-theory

philix Dec 6 '11 at 21:39

3 answers

What you are looking for is called a "clone detector." You can do this in source code or object code. The basic idea is to decide which points of variability you want to accept.

You can read about our CloneDR detector, which finds duplicated code by comparing the syntax trees of source files, including inexact matches. It does this across many files, not just within a single source file. This is similar to "common subexpression" detection, but it works on both declarations and executable code. When a match is not exact, it can determine the parameters for the "subroutine" (abstraction).

See my article on Clone Detection Using Abstract Syntax Trees for a description of the algorithms.
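To make the tree-comparison idea concrete, here is a minimal sketch that buckets identical subtrees by a structural key. It is not CloneDR's algorithm: it uses Python's own `ast` module as a stand-in front end, finds exact clones only (no parameterized near-misses), and the names `find_exact_clones` and `min_nodes` are made up for illustration.

```python
import ast
from collections import defaultdict


def subtree_size(node):
    """Number of AST nodes in the subtree rooted at node."""
    return sum(1 for _ in ast.walk(node))


def find_exact_clones(source, min_nodes=8):
    """Group structurally identical subtrees of at least min_nodes nodes.

    Returns a list of line-number lists, one per group of clones.
    """
    tree = ast.parse(source)
    buckets = defaultdict(list)
    for node in ast.walk(tree):
        # Only statements and expressions carry a line number.
        if isinstance(node, (ast.stmt, ast.expr)) and subtree_size(node) >= min_nodes:
            # ast.dump omits line/column attributes by default, so two
            # subtrees get the same key exactly when their structure matches.
            buckets[ast.dump(node)].append(node.lineno)
    return [sorted(lines) for lines in buckets.values() if len(lines) > 1]
```

Note that sub-expressions of a cloned statement are reported as clones too; a real tool would keep only the largest match.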

CloneDR does this for many languages, using language-precise front-end parsers.

The site describes how CloneDR works and compares CloneDR with a number of other clone detection tools.

CloneDR does not handle instruction reordering. Less scalable methods that find duplicates by comparing program dependence graphs (PDGs) can do this. PDG comparison is fairly close to comparing data flow graphs, which could be useful for finding matches in machine code.
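As a rough illustration of why a data flow view tolerates reordering, the sketch below symbolically evaluates a straight-line block and compares the expression trees left in the live-out registers. It is a toy under strong assumptions: no memory accesses, flags, or side effects, no register renaming, and an invented instruction format `(dest, op, src...)`. The name `block_signature` is hypothetical, and this is not a PDG matcher.

```python
def block_signature(instructions, live_out):
    """Symbolically evaluate a straight-line block of (dest, op, src...) tuples.

    Returns the expression tree held by each live-out register at the end
    of the block. Blocks that differ only by a behavior-preserving
    reordering produce equal signatures.
    """
    env = {}
    for dest, op, *srcs in instructions:
        # A register not yet written in this block stands for its input value.
        args = tuple(env.get(s, ("in", s)) for s in srcs)
        env[dest] = (op, *args)
    return frozenset((reg, env.get(reg, ("in", reg))) for reg in live_out)


block_a = [("r1", "add", "r0", "r0"), ("r2", "mul", "r3", "r3"), ("r4", "sub", "r1", "r2")]
# Same computation with the two independent instructions swapped.
block_b = [("r2", "mul", "r3", "r3"), ("r1", "add", "r0", "r0"), ("r4", "sub", "r1", "r2")]
# Moving the dependent instruction first changes what r4 holds.
block_c = [("r4", "sub", "r1", "r2"), ("r1", "add", "r0", "r0"), ("r2", "mul", "r3", "r3")]
```

With `live_out={"r4"}`, `block_a` and `block_b` compare equal while `block_c` does not, because its `sub` reads the incoming values of `r1` and `r2`.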

+3

Ira Baxter Dec 7 '11 at 0:39

Perhaps this is stupid, but consider "diff". It essentially solves a restricted version of this problem.

-1

Yttrill Dec 23 '11 at 13:32

Mackie messer · Accepted Answer · 2011-12-06T22:02:59+0000

Well, this is an interesting problem, and people have indeed worked on it. A quick search returns these two:

Keith D. Cooper, Nathaniel McIntosh: Enhanced Code Compression for Embedded RISC Processors, PLDI 1999.
Christopher W. Fraser, Eugene W. Myers, Alan L. Wendt: Analyzing and Compressing Assembly Code, SIGPLAN Notices, June 1984.

But there are probably many more. You can use Google Scholar to find more recent papers that cite these older ones.
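For a feel of the simplest form of the problem, here is a naive sketch that indexes every fixed-length window of a linear instruction stream and reports the windows that occur more than once. It is not the algorithm from either paper above; instructions are assumed to be already normalized to comparable strings, and `repeated_sequences` and `min_len` are hypothetical names.

```python
from collections import defaultdict


def repeated_sequences(instrs, min_len=3):
    """Map each window of min_len instructions that occurs more than once
    to the list of positions where it starts."""
    seen = defaultdict(list)
    for i in range(len(instrs) - min_len + 1):
        seen[tuple(instrs[i:i + min_len])].append(i)
    return {seq: starts for seq, starts in seen.items() if len(starts) > 1}


prog = ["load", "add", "store", "jmp", "load", "add", "store", "ret"]
print(repeated_sequences(prog))  # {('load', 'add', 'store'): [0, 4]}
```

A practical tool would also extend matches to their maximal length, handle overlapping occurrences, and check that outlining a sequence actually saves space once the call and return are counted.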
