Search for copied homework

Sometimes my students submit identical files as their homework. If each student did the work independently, it should be practically impossible for any two files to be exactly the same.

The submitted homework is stored in folders laid out like this: /section/id/

Thus, each section of the course has its own folder, each student has their own folder within it, and all the files sit at this last level. The student files come in various formats.

  • How can I check whether any of the subfolders contain files that are exactly the same (ignoring file names)?
+4
source share
6 answers

The following two one-liners, a for loop with cksum and then an awk pass, will identify identical files among your students' submissions:

Step 1: for i in path/to/files; do cksum "$i"; done > cksum.txt
Step 2: awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt

Test:

Sample files, where student2 has submitted a file identical to student1's:

 [jaypal:~/Temp/homework] ls -lrt
 total 32
 -rw-r--r-- 1 jaypalsingh staff 10 17 Dec 17:58 student1
 -rw-r--r-- 1 jaypalsingh staff 10 17 Dec 17:58 student2
 -rw-r--r-- 1 jaypalsingh staff 10 17 Dec 17:58 student3
 -rw-r--r-- 1 jaypalsingh staff 10 17 Dec 17:58 student4
 [jaypal:~/Temp/homework] cat student1
 homework1
 [jaypal:~/Temp/homework] cat student2
 homework1
 [jaypal:~/Temp/homework] cat student3
 homework3
 [jaypal:~/Temp/homework] cat student4
 homework4

Step 1:

Create the cksum.txt file using the cksum utility

 [jaypal:~/Temp/homework] for i in *; do cksum "$i"; done > cksum.txt
 [jaypal:~/Temp/homework] cat cksum.txt
 4294967295 0 cksum.txt
 1271506813 10 student1
 1271506813 10 student2
 1215889011 10 student3
 1299429862 10 student4

Step 2:

The awk one-liner reads cksum.txt twice: the first pass records checksums that occur more than once, and the second pass prints every line whose checksum is in that set, i.e. all files with identical content:

 [jaypal:~/Temp/homework] awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt
 1271506813 10 student1
 1271506813 10 student2

Test 2:

 [jaypal:~/Temp/homework] for i in stu*; do cksum "$i"; done > cksum.txt
 [jaypal:~/Temp/homework] awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt
 1271506813 10 student1
 1271506813 10 student2
 1271506813 10 student5
 [jaypal:~/Temp/homework] cat student5
 homework1
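
Since the real submissions live in subfolders (section/id/), the same two steps can be made recursive with find. A sketch of that adaptation, using path/to/sections as a stand-in for the directory that contains the section folders, and writing the checksum list outside the tree so the list does not checksum itself:

 find path/to/sections -type f -exec cksum {} + > /tmp/cksum.txt
 awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' /tmp/cksum.txt /tmp/cksum.txt
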
+3
source

Compute an MD5 hash of every file and put the hashes into a dictionary keyed by hash; any hash that maps to more than one file points to a duplicate.
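
A minimal sketch of that idea, assuming bash 4+ (for associative arrays), GNU md5sum, and path/to/sections as a stand-in for the directory holding the section folders:

 #!/usr/bin/env bash
 declare -A seen                        # dictionary: hash -> first file seen with that hash
 while IFS= read -r -d '' f; do
     h=$(md5sum "$f" | cut -d' ' -f1)   # hash the file contents only, not the name
     if [[ -n ${seen[$h]:-} ]]; then
         echo "DUPLICATE: $f matches ${seen[$h]}"
     else
         seen[$h]=$f
     fi
 done < <(find path/to/sections -type f -print0)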

+3
source

List those files that have at least one duplicate:

 md5sum * | sort | uniq -w32 --all-repeated=separate | awk '{print $2}' 

Of course, this only finds files that are completely identical.

To handle files in subfolders, you'll want to change it to work with find.
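
For example, a find-based variant of the same pipeline might look like this (a sketch; path/to/sections stands in for the root of the submission tree, and the awk field extraction assumes paths without spaces):

 find path/to/sections -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate | awk '{print $2}'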

+3
source

This is a whole area of study:

The problem with the approaches mentioned so far is that changes in tab size/settings and the like will matter. Also, most homework assignments require the student's name at the top, which makes otherwise identical submissions look different.

I suggest running the submissions through a preprocessor (to strip comments, among other things) and through some (very strict) source indenter (astyle, bcpp, cindent...?) to remove any "surface differences".

You might even want to consider ignoring identifiers altogether, if you can live with some false positives. That could even catch a plagiarist with a taste for different naming conventions (renaming FindSpork() to something in their own style?).

There are more heuristics I could add, but this should get you on track.

Edit/PS: Of course, after all of this, you can still run the result through a checksum. So, for example, you could do

 cat submission.cpp | astyle -bj | cpp - | md5sum 

to get a fingerprint that is much less sensitive to incidental/superficial changes (like comments or whitespace).
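
Applied to a whole submission tree, that normalize-then-hash idea can be combined with the duplicate detection above. This is only a sketch, assuming astyle and cpp are installed, that the submissions are .cpp files, and using path/to/sections as a stand-in for the root of the tree:

 for f in path/to/sections/*/*/*.cpp; do
     # normalize formatting, strip comments via the preprocessor, then fingerprint
     printf '%s  %s\n' "$(astyle < "$f" | cpp -P - 2>/dev/null | md5sum | cut -d' ' -f1)" "$f"
 done | sort | uniq -w32 --all-repeated=separate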

+3
source

If you are really after duplicates, first group the files by size. For any group with more than one member, run md5sum on those files, and then sort | uniq -c to see whether there are duplicates.
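
A sketch of that two-stage approach, assuming GNU find and uniq, with path/to/sections standing in for the directory that holds the section folders:

 # sizes that occur more than once
 find path/to/sections -type f -printf '%s\n' | sort -n | uniq -d |
 while read -r size; do
     # hash only the files of a duplicated size
     find path/to/sections -type f -size "${size}c" -exec md5sum {} +
 done | sort | uniq -w32 --all-repeated=separate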

+2
source

fdupes works well for this task.
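
For instance, fdupes -r recurses into subdirectories and prints groups of identical files (path/to/sections again stands in for the root of the submission tree):

 fdupes -r path/to/sections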

+1
source
