Search for copied homework

Sometimes my students submit identical files as their homework. If each student did the work independently, it should be practically impossible for any two files to be exactly the same.

The submitted homework is stored in folders laid out like this: /section/id/

Thus, each section of the course has its own folder, each student has their own folder within it, and all the files sit at this last level. The student files come in various formats.

  • How can I check whether any of the subfolders contain files that are exactly the same (ignoring file names)?
+4
source share
6 answers

The following two one-liners, a for loop with cksum and then an awk pass, will identify identical files among your students' submissions:

Step 1: for i in path/to/files; do cksum "$i"; done > cksum.txt
Step 2: awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt

Test:

Sample files, where student2 has submitted a file identical to student1's:

 [jaypal:~/Temp/homework] ls -lrt
 total 32
 -rw-r--r-- 1 jaypalsingh staff 10 17 Dec 17:58 student1
 -rw-r--r-- 1 jaypalsingh staff 10 17 Dec 17:58 student2
 -rw-r--r-- 1 jaypalsingh staff 10 17 Dec 17:58 student3
 -rw-r--r-- 1 jaypalsingh staff 10 17 Dec 17:58 student4
 [jaypal:~/Temp/homework] cat student1
 homework1
 [jaypal:~/Temp/homework] cat student2
 homework1
 [jaypal:~/Temp/homework] cat student3
 homework3
 [jaypal:~/Temp/homework] cat student4
 homework4

Step 1:

Create the cksum.txt file using the cksum utility

 [jaypal:~/Temp/homework] for i in *; do cksum "$i"; done > cksum.txt
 [jaypal:~/Temp/homework] cat cksum.txt
 4294967295 0 cksum.txt
 1271506813 10 student1
 1271506813 10 student2
 1215889011 10 student3
 1299429862 10 student4

Step 2:

The awk one-liner reads cksum.txt twice: the first pass records checksums that occur more than once, and the second pass prints every line whose checksum is in that set, i.e. all files with identical content:

 [jaypal:~/Temp/homework] awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt
 1271506813 10 student1
 1271506813 10 student2

Test 2:

 [jaypal:~/Temp/homework] for i in stu*; do cksum "$i"; done > cksum.txt
 [jaypal:~/Temp/homework] awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt
 1271506813 10 student1
 1271506813 10 student2
 1271506813 10 student5
 [jaypal:~/Temp/homework] cat student5
 homework1
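
Since the real submissions live in subfolders (section/id/), the same two steps can be made recursive with find. A sketch of that adaptation, using path/to/sections as a stand-in for the directory that contains the section folders, and writing the checksum list outside the tree so the list does not checksum itself:

 find path/to/sections -type f -exec cksum {} + > /tmp/cksum.txt
 awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' /tmp/cksum.txt /tmp/cksum.txt
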
+3
source

Compute an MD5 hash of every file and put the hashes into a dictionary keyed by hash; any hash that maps to more than one file points to a duplicate.
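
A minimal sketch of that idea, assuming bash 4+ (for associative arrays), GNU md5sum, and path/to/sections as a stand-in for the directory holding the section folders:

 #!/usr/bin/env bash
 declare -A seen                        # dictionary: hash -> first file seen with that hash
 while IFS= read -r -d '' f; do
     h=$(md5sum "$f" | cut -d' ' -f1)   # hash the file contents only, not the name
     if [[ -n ${seen[$h]:-} ]]; then
         echo "DUPLICATE: $f matches ${seen[$h]}"
     else
         seen[$h]=$f
     fi
 done < <(find path/to/sections -type f -print0)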

+3
source

List those files that have at least one duplicate:

 md5sum * | sort | uniq -w32 --all-repeated=separate | awk '{print $2}' 

Of course, this only finds files that are completely identical.

To handle files in subfolders, you'll want to change it to work with find.
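
For example, a find-based variant of the same pipeline might look like this (a sketch; path/to/sections stands in for the root of the submission tree, and the awk field extraction assumes paths without spaces):

 find path/to/sections -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate | awk '{print $2}'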

+3
source

This is a whole area of study:

The problem with the approaches mentioned so far is that changes in tab size/settings and the like will matter. Also, most homework assignments require the student's name at the top, which makes otherwise identical submissions look different.

I suggest running the submissions through a preprocessor (to strip comments, among other things) and through some (very strict) source indenter (astyle, bcpp, cindent...?) to remove any "surface differences".

You might even want to consider ignoring identifiers altogether, if you can live with some false positives. That could even catch a plagiarist with a taste for different naming conventions (renaming FindSpork() to something in their own style?).

There are more heuristics I could add, but this should get you on track.

Edit/PS: Of course, after all of this, you can still run the result through a checksum. So, for example, you could do

 cat submission.cpp | astyle -bj | cpp - | md5sum 

to get a fingerprint that is much less sensitive to incidental/superficial changes (like comments or whitespace).
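
Applied to a whole submission tree, that normalize-then-hash idea can be combined with the duplicate detection above. This is only a sketch, assuming astyle and cpp are installed, that the submissions are .cpp files, and using path/to/sections as a stand-in for the root of the tree:

 for f in path/to/sections/*/*/*.cpp; do
     # normalize formatting, strip comments via the preprocessor, then fingerprint
     printf '%s  %s\n' "$(astyle < "$f" | cpp -P - 2>/dev/null | md5sum | cut -d' ' -f1)" "$f"
 done | sort | uniq -w32 --all-repeated=separate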

+3
source

If you are really after duplicates, first group the files by size. For any group with more than one member, run md5sum on those files, and then sort | uniq -c to see whether there are duplicates.
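
A sketch of that two-stage approach, assuming GNU find and uniq, with path/to/sections standing in for the directory that holds the section folders:

 # sizes that occur more than once
 find path/to/sections -type f -printf '%s\n' | sort -n | uniq -d |
 while read -r size; do
     # hash only the files of a duplicated size
     find path/to/sections -type f -size "${size}c" -exec md5sum {} +
 done | sort | uniq -w32 --all-repeated=separate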

+2
source

fdupes works well for this task.
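
For instance, fdupes -r recurses into subdirectories and prints groups of identical files (path/to/sections again stands in for the root of the submission tree):

 fdupes -r path/to/sections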

+1
source
