I wrote SimiCheck, and you are welcome to use it. If you are interested in an API, I could write one very quickly. I wrote the original algorithm as part of the CrowdGrader tool, but then decided to make the comparison tools available independently. SimiCheck can process code, Word (.docx), HTML, PDF, text, ..., as well as .zip, .tar, .gz, .tgz, and some other formats, and it can handle variable renaming, moved code, code split across several files, etc.