The Wikipedia article has a pretty good discussion of the algorithm and even covers how to implement the rolling hash function (see the section on using hashing for substring search). It also discusses ways to improve execution speed using a hash table or a Bloom filter.
You should also understand that the worst case is a pretty contrived scenario. The example given in the Wikipedia article is searching for a pattern of 10,000 "a"s followed by a "b" in a text of 10 million "a"s.
You should be able to implement a rolling hash using the methods described in that Wikipedia entry. If you are having trouble implementing it, post a more specific question about the part that is giving you trouble, showing what you tried.
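If it helps to see the shape of it, here is a minimal Rabin-Karp sketch in Python; the base and modulus are arbitrary choices for illustration, not anything prescribed by the algorithm:

```python
def rabin_karp(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    base, mod = 256, 1_000_000_007      # arbitrary illustrative choices
    high = pow(base, m - 1, mod)        # weight of the character that rolls out

    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    for i in range(n - m + 1):
        # Compare characters only when the hashes match, to guard against collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Roll the window in O(1): drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
```

The rolling update is the whole point: each new window costs O(1) to hash, and the character-by-character check only runs when the hashes agree.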
It is unlikely that you will encounter anything close to the worst case in real documents. And even if you do hit it, a rolling hash will not reduce the asymptotic complexity; it gives you a large practical speedup, but the worst case remains O(n*m). If you find that the worst case happens often, then you probably need a different algorithm.
Another thing to note is that although O(m*n) can be a problem, you should look at the scale. How big are the documents you are examining? You say you work with source code files. If you are looking at typical class projects, you are probably talking about maybe 2,000 lines of code. Such documents will not exhibit the worst case, and even if they did, n*m would not be a very large number (for example, comparing two 2,000-line files line by line is only 4 million pairs).
However, if you have 100 documents and you want to know whether any of them is a significant duplicate of another, your big problem is O(n^2), because you have to check every document against every other one. The number of document comparisons is (n*(n-1))/2, which is 4,950 for 100 documents. If you want to speed that process up, you need a different approach: ideally, something that gives you a "fingerprint" of each document. That way you can compute the fingerprint for each document once, and then compare the fingerprints for similarity.
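As a sketch of that structure, here is one way it might look in Python. The line-hash fingerprint below is just a stand-in to make the example runnable; shingling, winnowing, or any other scheme would slot into its place, and the 0.5 threshold is an arbitrary assumption:

```python
import itertools

def line_fingerprint(text: str) -> frozenset[int]:
    # Stand-in fingerprint: the set of hashes of non-blank, whitespace-stripped lines.
    return frozenset(hash(line.strip()) for line in text.splitlines() if line.strip())

def jaccard(a: frozenset, b: frozenset) -> float:
    # Set-overlap similarity: 1.0 means identical sets, 0.0 means no overlap.
    return len(a & b) / len(a | b) if (a or b) else 1.0

def flag_similar(docs: dict[str, str], threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Fingerprint each document once, then do the (n*(n-1))/2 pairwise comparisons."""
    fps = {name: line_fingerprint(text) for name, text in docs.items()}
    flagged = []
    for a, b in itertools.combinations(fps, 2):
        score = jaccard(fps[a], fps[b])
        if score >= threshold:
            flagged.append((a, b, score))
    return flagged
```

The key point is that the expensive per-document work happens once, and each pairwise comparison only touches the (much smaller) fingerprints.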
Document fingerprinting is a well-known problem; constructing a fingerprint that is actually useful for comparison is slightly less straightforward. You will want to look at a technique called shingling. I have also seen some research on using a small Bloom filter (256 bytes or so) to represent a document, which allows very fast comparisons against it.
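A character-level shingling sketch, which could replace the stand-in line_fingerprint above; k = 5 is an arbitrary choice here, and token-level shingles work the same way:

```python
def shingle_fingerprint(text: str, k: int = 5) -> frozenset[str]:
    """The set of all k-character substrings ("shingles") of the document."""
    # Collapsing whitespace first keeps trivially reformatted copies looking similar.
    normalized = " ".join(text.split())
    return frozenset(normalized[i:i + k] for i in range(len(normalized) - k + 1))
```

Two documents are then compared by the overlap of their shingle sets (the same Jaccard measure as above); near-duplicates share most of their shingles even when a few lines have been edited or reordered.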
All that said, I suspect that if you have a hundred or two source code files, each 1,000 or 2,000 lines long, a naive O(n^2) comparison using a good Rabin-Karp implementation will do what you want. It will take some time (you are going to make on the order of 5,000 separate document comparisons), but I do not think the speed of the Rabin-Karp implementation will be your limiting factor.