Can Git determine if two source files are essentially copies of each other?

Question

Sorry if this is off topic, but here is your chance to reduce the number of "homework" questions on this site :-)

I teach a C programming class in which students work with a small library of numerical routines in C. This year, the source files from several student groups contained a significant amount of duplicated code.

(Right down to identical debug printf statements. I mean, how careless can you be.)

I know that Git can detect that two source files are similar to each other beyond a certain threshold, but I have never managed to get it to work with two source files that are not in a Git repository.

Keep in mind that these are not particularly sophisticated students. They are unlikely to go to the trouble of changing variable or function names.

Is there a way I can use Git to detect significant, literal code duplication, a.k.a. plagiarism? Or is there some other tool that you could recommend for this?

+6

git c

lindelof Jan 21 '12 at 5:46

5 answers

Mankarse · Answer 1 · 2012-01-21T05:50:33+0000

Why use Git at all? A simple but effective method would be to compute the size of the diff between every pair of submissions, and then manually inspect the pairs with the smallest diffs.
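As a rough illustration of that approach, here is a minimal Python sketch using the standard `difflib` module. The helper names `diff_size` and `rank_pairs` and the sample submissions are hypothetical, and counting changed lines is only one possible way to measure the "size" of a diff.

```python
import difflib
from itertools import combinations


def diff_size(a_text: str, b_text: str) -> int:
    """Count the lines that would have to be removed or added to turn a into b."""
    a_lines = a_text.splitlines()
    b_lines = b_text.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


def rank_pairs(submissions: dict) -> list:
    """Return (diff size, name_a, name_b) for every pair, smallest diff first."""
    scored = [
        (diff_size(submissions[a], submissions[b]), a, b)
        for a, b in combinations(sorted(submissions), 2)
    ]
    return sorted(scored)


if __name__ == "__main__":
    # Hypothetical submissions; in practice, read each group's file from disk.
    demo = {
        "group1": "int a;\nint b;\nreturn a + b;\n",
        "group2": "int a;\nint b;\nreturn a + b;\n",
        "group3": "double x;\nreturn x * 2;\n",
    }
    for size, a, b in rank_pairs(demo):
        print(size, a, b)
```

The pairs at the top of the list are the ones worth reading by hand.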

Ravi · Answer 2 · 2012-01-21T05:54:46+0000

Moss is a tool developed by a Stanford CS professor. I think they use it there too. It is something like diff, but for source code.

Blender · Answer 3 · 2012-01-21T05:52:56+0000

You can use diff to check whether two files are similar:

 diff -iEZbwB -U 0 file1.cpp file2.cpp

These options tell diff to ignore case, whitespace changes, and blank lines, and to produce a Git-like (unified) diff with no context lines. Try this on two samples.

Brooks Moses · Answer 4 · 2012-01-21T05:59:12+0000

Adding to the other answers: you can use diff, but I don't think its output will be useful on its own. What you want is the number of lines that match, not counting blank lines, and to get that automatically you need to do a fair bit of magic with wc -l and grep: compute the sum of the lengths of the files, minus the length of the diff file, minus the number of blank lines that diff counted as matching. And even then you will miss some cases where diff decided that identical lines do not match because different things were inserted in front of them.
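For illustration, the same count can be approximated without the wc -l and grep arithmetic. This is a minimal Python sketch built on the standard `difflib` module rather than on diff itself; the function name `matching_line_ratio` and the choice to divide by the shorter file's length are assumptions made for the example.

```python
import difflib


def matching_line_ratio(a_text: str, b_text: str) -> float:
    """Fraction of non-blank lines (of the shorter file) that match in order."""
    # Drop blank lines and surrounding whitespace before comparing.
    a = [line.strip() for line in a_text.splitlines() if line.strip()]
    b = [line.strip() for line in b_text.splitlines() if line.strip()]
    if not a or not b:
        return 0.0
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    # The final matching block is a zero-length sentinel, so it adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))
```

Like diff, this compares lines in sequence, so it shares the limitation described above: reordered code will not be counted as matching.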

A much better option is one of the suggestions listed at https://stackoverflow.com/questions/5294447/how-can-i-find-source-code-copying (or at https://stackoverflow.com/questions/4131900/how-to-detect-plagiarized-code, although the answers there seem to be duplicates).

Sylvain Leroux · Answer 5 · 2015-12-30T12:37:58+0000

Using diff is absolutely not a good idea unless you want to venture into combinatorial hell:

If you have 2 submissions, you need 1 diff to check for plagiarism,
If you have 3 submissions, you need 3 diffs to check for plagiarism,
If you have 4 submissions, you need 6 diffs to check for plagiarism,
...
If you have n submissions, you need n(n-1)/2 diffs!
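The number of pairwise comparisons can be checked by simply enumerating the pairs. A minimal Python sketch, where `pair_count` is a hypothetical helper name and the class size of 30 is just an example:

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Count the unordered pairs among n submissions by enumerating them."""
    return sum(1 for _ in combinations(range(n), 2))


for n in (2, 3, 4, 30):
    # The enumerated count agrees with the closed form n * (n - 1) / 2.
    print(n, pair_count(n), n * (n - 1) // 2)
```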

On the other hand, Moss, already proposed in another answer, uses a completely different algorithm. Basically, it computes a set of fingerprints for the significant k-grams of each document. A fingerprint is essentially a hash used to classify documents, and possible plagiarism is detected when two documents end up sorted into the same bucket.
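To make the idea concrete, here is a simplified Python sketch of k-gram fingerprinting. It is not Moss's actual implementation: the normalization, the values `k=5` and `window=4`, the use of CRC32 as the hash, and the rule of keeping the smallest hash in each sliding window as the "significant" k-gram are all assumptions chosen for illustration.

```python
import zlib


def normalize(source: str) -> str:
    """Remove all whitespace and lowercase, so layout changes do not matter."""
    return "".join(source.split()).lower()


def kgram_hashes(text: str, k: int) -> list:
    """Hash every substring of length k (CRC32 is stable across runs)."""
    return [
        zlib.crc32(text[i:i + k].encode("utf-8"))
        for i in range(len(text) - k + 1)
    ]


def fingerprints(source: str, k: int = 5, window: int = 4) -> set:
    """Keep the smallest hash of each sliding window as a fingerprint."""
    hashes = kgram_hashes(normalize(source), k)
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {
        min(hashes[i:i + window])
        for i in range(len(hashes) - window + 1)
    }


def similarity(a: str, b: str, k: int = 5, window: int = 4) -> float:
    """Jaccard overlap of the two fingerprint sets, between 0.0 and 1.0."""
    fa = fingerprints(a, k, window)
    fb = fingerprints(b, k, window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because each document is reduced to a set of fingerprints once, documents can be matched by looking up shared fingerprints in an index instead of running a diff on every pair.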
