This is a whole area of study.
The trouble with the approaches mentioned is that superficial changes, such as tab size or editor settings, affect the result. Most homework assignments even require the student's name at the top, which makes otherwise identical submissions look different.
I suggest running each submission through the preprocessor (to strip comments, for one thing) and through a (very strict) indenter (astyle, bcpp, cindent ...?) to remove any "superficial differences".
You might even want to consider ignoring case if you get too many false negatives. That could even catch a plagiarist with a taste for naming conventions (renaming findSpork() to FindSpork()?).
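To make the normalization idea concrete, here is a rough sketch in Python. It is only a stand-in for a real preprocessor plus indenter: the `normalize` function and its `ignore_case` flag are names I made up for illustration, and dropping all whitespace is much cruder than what astyle would do.

```python
import re

# Matches string/char literals (kept as-is) or // and /* */ comments (dropped).
# Literals are matched first so that "//" inside a string is not mistaken
# for a comment.
_LITERAL_OR_COMMENT = re.compile(
    r'"(?:\\.|[^"\\])*"'      # string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)


def normalize(source: str, ignore_case: bool = False) -> str:
    """Strip comments and all whitespace; optionally fold case.

    Hypothetical helper: an approximation of "preprocessor + strict
    indenter", not a replacement for them.
    """
    def keep_or_drop(match):
        text = match.group(0)
        # Comments start with '/', literals start with a quote.
        return " " if text.startswith("/") else text

    without_comments = _LITERAL_OR_COMMENT.sub(keep_or_drop, source)
    # Crude layout normalization: remove every whitespace character.
    squeezed = "".join(without_comments.split())
    return squeezed.lower() if ignore_case else squeezed


a = "// Alice\nint findSpork() {\n\treturn 1; /* one */\n}\n"
b = "// Bob\nint FindSpork(){ return 1; }\n"
print(normalize(a) == normalize(b))                                      # names differ in case
print(normalize(a, ignore_case=True) == normalize(b, ignore_case=True))  # case folded
```

With case folding on, the two snippets compare equal even though the name header, layout, comments and capitalization differ. Note that this also strips whitespace inside string literals, which is acceptable for a fingerprint but not for anything else.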
There are several more heuristics I could add, but this should get you on the right track.
Edit (PS): Of course, after all of that normalization, you can still run the result through a checksum. So, for example, you could do
cat submission.cpp | astyle -bj | cpp - | md5sum
to get a fingerprint that is much less sensitive to incidental/superficial changes (like comments or whitespace).
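Once every submission has a fingerprint, finding suspicious pairs is just a matter of grouping by digest. A minimal sketch, assuming you have already collected the normalized text of each submission (for example the output of the pipeline before the md5sum step) into a dictionary; `fingerprint` and `group_by_fingerprint` are hypothetical names, and the file names are invented:

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def fingerprint(normalized_source: str) -> str:
    """MD5 hex digest of already-normalized source text."""
    return hashlib.md5(normalized_source.encode("utf-8")).hexdigest()


def group_by_fingerprint(submissions: Dict[str, str]) -> List[List[str]]:
    """Return groups of submission names that share a fingerprint.

    `submissions` maps a file name to its normalized source text.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for name, text in submissions.items():
        groups[fingerprint(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Invented example data: two identical submissions and one different one.
submissions = {
    "alice.cpp": "int main(){return 0;}",
    "bob.cpp": "int main(){return 0;}",
    "carol.cpp": "int main(){return 1;}",
}
print(group_by_fingerprint(submissions))
```

An exact-match checksum only catches copies that become identical after normalization, so treat a matching group as a reason to look closer, not as proof.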