This is at least not trivial, because you have a lot of data on one side and even more on the other.
The simplest approach would be to build a Lucene index over the 7 million phrases and let the Hadoop job query that index. I am not quite sure whether you need a Solr server for that, or whether there are comparable implementations in Python.
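To make the idea concrete, here is a minimal, hypothetical sketch of what such an index lookup does: a tiny in-memory inverted index with TF-IDF scoring, standing in for a real Lucene index. The `PhraseIndex` class and its scoring are my own toy construction, not Lucene's API.

```python
import math
from collections import Counter, defaultdict

class PhraseIndex:
    """Toy in-memory inverted index with TF-IDF-style scoring.
    A stand-in for what a Lucene index would provide at scale."""

    def __init__(self, phrases):
        self.phrases = phrases
        self.n = len(phrases)
        self.postings = defaultdict(set)  # term -> ids of phrases containing it
        for pid, phrase in enumerate(phrases):
            for term in phrase.lower().split():
                self.postings[term].add(pid)

    def search(self, text, top_k=5):
        """Return (phrase_id, score) pairs for phrases sharing terms with text."""
        scores = Counter()
        for term in set(text.lower().split()):
            ids = self.postings.get(term)
            if not ids:
                continue
            idf = math.log(self.n / len(ids))  # rarer terms weigh more
            for pid in ids:
                scores[pid] += idf
        return scores.most_common(top_k)

index = PhraseIndex(["big data processing", "phrase matching", "data pipelines"])
print(index.search("processing big data"))
```

In the real setup you would build the index once over the 7 million phrases and have each map task query it, rather than rebuilding it per task.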
The mapper should write out the phrase id or line number (depending on what you need to identify it), or at least the phrase itself, together with the matching phrase. In the reduce step, you can then reduce on the phrase key and write out all related phrases with their score (or whatever you want).
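The map and reduce steps described above could be sketched roughly as follows. The `lookup` function here is a hypothetical placeholder for the actual index query (Lucene or Solr), with a trivial shared-word count as the score; the emitted key/value shape is the point, not the scoring.

```python
from collections import defaultdict

# Toy phrase table; in practice the lookup would hit the Lucene index.
PHRASES = {0: "big data", 1: "phrase matching", 2: "big data tools"}

def lookup(phrase):
    """Placeholder index query: score = number of shared words."""
    words = set(phrase.split())
    hits = [(pid, len(words & set(p.split()))) for pid, p in PHRASES.items()]
    return [(pid, s) for pid, s in hits if s > 0]

def mapper(record):
    """Map step: emit (input phrase, (matched id, score)) for every hit."""
    line_number, phrase = record
    for pid, score in lookup(phrase):
        yield phrase, (pid, score)

def reducer(pairs):
    """Reduce step: group on the phrase key, collect matches sorted by score."""
    grouped = defaultdict(list)
    for key, match in pairs:
        grouped[key].append(match)
    return {k: sorted(v, key=lambda m: -m[1]) for k, v in grouped.items()}

records = [(1, "big data"), (2, "matching phrase tools")]
pairs = [p for r in records for p in mapper(r)]
print(reducer(pairs))
```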
For similarity scoring, you can read here:
Apache Lucene Similarity
Apache Lucene