What is a suitable lexer generator that I can use to strip identifiers from source files in many different languages?

I am working on a group project for my university that will be used to detect plagiarism in computer science.

My group is primarily basing its approach on the hashing / fingerprinting method described in this paper: Winnowing: Local Algorithms for Document Fingerprinting. This is very similar to how the MOSS plagiarism detection system works.

We basically take k-gram hashes of the students' source code and look for matches in a database of hashes (along with a lot of optimization in how we determine which hashes are selected as a document's fingerprints).
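As a rough illustration of the fingerprint selection mentioned above, here is a minimal sketch of the winnowing step: hash every k-gram, then in each window of `w` consecutive hashes keep the minimum (the rightmost one on ties). The class name `WinnowingSketch`, the use of `String.hashCode()` as the k-gram hash, and the parameter names are assumptions made for this example, not our actual implementation; a real system would more likely use a rolling hash.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of winnowing-style fingerprint selection.
final class WinnowingSketch {

    /** Hashes every k-character substring (k-gram) of the text, in order. */
    static int[] kGramHashes(String text, int k) {
        int n = text.length() - k + 1;
        if (n <= 0) {
            return new int[0]; // text is shorter than one k-gram
        }
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            // Assumption: String.hashCode() stands in for a rolling hash.
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        return hashes;
    }

    /**
     * For each window of w consecutive hashes, selects the minimum hash
     * (rightmost on ties). Each selected position is recorded only once.
     * Returns {hash, position} pairs. Inputs shorter than w yield no
     * fingerprints in this simplified version.
     */
    static List<int[]> fingerprints(int[] hashes, int w) {
        List<int[]> selected = new ArrayList<>();
        int lastPos = -1;
        for (int start = 0; start + w <= hashes.length; start++) {
            int minPos = start;
            for (int i = start + 1; i < start + w; i++) {
                if (hashes[i] <= hashes[minPos]) {
                    minPos = i; // "<=" keeps the rightmost minimum
                }
            }
            if (minPos != lastPos) {
                selected.add(new int[] {hashes[minPos], minPos});
                lastPos = minPos;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        int[] hashes = kGramHashes("adorunrunrunadorunrun", 5);
        for (int[] fp : fingerprints(hashes, 4)) {
            System.out.println(fp[0] + " @ " + fp[1]);
        }
    }
}
```

The selected `{hash, position}` pairs are what would be stored in and looked up against the database; the position lets a match be mapped back to a location in the source file.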

The first aspect of our project is its "front-end" part, which will contain semantic knowledge about each file format that our detection system can process. This will allow us to strip out certain details of a document that we do not want to take into account when detecting plagiarism. Basically, we want to be able to rename all variables in different programming languages to a constant string or letter.
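To make the goal concrete, here is a minimal regex-based sketch of the renaming we have in mind. It is not a real lexer: the class name `IdentifierNormalizer`, the tiny keyword list, and the token patterns are all assumptions for illustration, and it cannot tell variables apart from function or type names, which is exactly why we are looking for a proper lexer generator.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: replace every non-keyword identifier with a constant.
final class IdentifierNormalizer {

    // Assumption: a deliberately incomplete keyword list mixing a few
    // Java, C++ and Python keywords; a real front end needs one per language.
    private static final Set<String> KEYWORDS = Set.of(
            "if", "else", "for", "while", "return", "int", "void",
            "class", "def", "import", "public", "static", "new");

    // Alternatives are tried in order: string literals and numeric literals
    // are matched first so that their contents are never renamed.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<str>\"(?:\\\\.|[^\"\\\\])*\")"
            + "|(?<num>[0-9][A-Za-z0-9_.]*)"
            + "|(?<id>[A-Za-z_][A-Za-z0-9_]*)");

    /** Returns the source with each non-keyword identifier replaced by placeholder. */
    static String normalize(String source, String placeholder) {
        Matcher m = TOKEN.matcher(source);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String id = m.group("id");
            boolean rename = id != null && !KEYWORDS.contains(id);
            String replacement = rename ? placeholder : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("int total = count + 1;", "V"));
    }
}
```

With this sketch, `int total = count + 1;` and `int sum = n + 1;` normalize to the same text, so their k-gram hashes match even though the variable names differ. Comments, character literals, and language-specific string syntax are not handled here.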

What is an easy solution (a lexer generator or something similar) that we can use to rename all variables in source files of different languages to constants?

Our project is written in Java.

, . , (java, ++, python ..).

+3
4

/ ANTLR. TXL, , . .

+3

ANTLR, , JFlex.

+1

, , , . , , , . Tcl , , (Lisp?).

0

acacia-lex lexer .

Lexer , , , "ident1" → "[a..d]", "ident2" → "[e..h]".

, , (), , "ident1" → "ident1" , "ident2" → "ident2" .

0

Source: https://habr.com/ru/post/1729619/

