Suggestions needed to optimize an O(n^2) algorithm

I want to optimize a fairly simple algorithm that is currently O(n^2). I have a file of records, each of which needs to be compared to every other record in the same file. If two records are "the same" (the comparator function is quite complex), the matching records are output. Note that there may be several records that all match each other, and there is no notion of order: the only thing that matters is whether the match is true or false.

Pseudocode:

 For (outRec in sourceFile) {
     Get new filePointer for targetFile   // starting from the top of the file for the inner loop
     For (inRec in targetFile) {
         if (compare(outRec, inRec) == TRUE) {
             write outRec
             write inRec
         }
         increment some counters
     }
     increment some other counters
 }

The data is not sorted in any way and there is no preprocessing, but it would be possible to order the data.

Any ideas on how this can become anything less than O(n^2)? I am thinking of applying the MapReduce paradigm to the code, breaking up the outer and inner loops, possibly using a map-side function. I am fairly sure I could code this in Hadoop, but wanted to check alternatives before spending the time on it.

Suggestions appreciated!

Added: record layout. Basically, I need to match names/strings. The types of matches are shown in the example below.

 1,Joe Smith,Daniel Foster
 2,Nate Johnson,Drew Logan
 3,Nate Johnson,Jack Crank
 4,Joey Smyth,Daniel Jack Foster
 5,Joe Morgan Smith,Daniel Foster

Expected output:
Records 1, 4, 5 form a match set
End of output

Added: these files will be quite large. The largest file is expected to be around 200 million records.

+6
10 answers

Assuming the files are not ridiculously large, I would go through the file in its entirety, compute a hash for each line, and keep track of the hash / line number combinations (or file pointer positions). Then sort the list of hashes and identify those that appear more than once.
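
A minimal sketch of that idea in Java, grouping line numbers by each line's hash instead of explicitly sorting the hash list (equivalent for finding repeats); it assumes the records can be treated as exact-match strings, which the fuzzy comparator in the question may not allow:

 import java.util.*;

 public class HashDuplicateFinder {
     public static void main(String[] args) {
         // Stand-in for reading the file line by line.
         List<String> lines = List.of("Joe Smith", "Nate Johnson", "Joe Smith");

         // Map from line hash to the line numbers that produced it.
         Map<Integer, List<Integer>> byHash = new HashMap<>();
         for (int i = 0; i < lines.size(); i++) {
             byHash.computeIfAbsent(lines.get(i).hashCode(), k -> new ArrayList<>()).add(i);
         }

         // Hashes seen more than once point at candidate duplicate lines.
         byHash.values().stream()
               .filter(nums -> nums.size() > 1)
               .forEach(nums -> System.out.println("possible duplicates at lines " + nums));
     }
 }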

+2

I'm not sure about the properties of your comparator and data set, but assuming your comparator defines an equivalence relation on your rows, here goes:

  • Build a map over the input file and use your comparator function as the key comparator of the map. The map values are sequences/lists of lines, i.e. all lines that are "the same" get appended to the same map entry. This takes O(n log n) time. (A sketch of this approach follows the list.)
  • Go through the lines of the other file and check for each line whether it matches a key in the map. If it does, then because of the equivalence relation implied by your comparator, you know that this line is "the same" as all lines stored in that map entry's value. This takes O(n log n + C), where C depends on how many matches you have to print.
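
A minimal sketch of the map-based approach, assuming the comparator can be extended to a total order (which a TreeMap requires); compareRecords is only a hypothetical placeholder for the real, complex comparator:

 import java.util.*;

 public class MapGrouping {
     // Hypothetical stand-in for the real comparator: must return 0 for records
     // that are "the same" and a consistent ordering otherwise.
     static int compareRecords(String a, String b) {
         return a.compareToIgnoreCase(b);
     }

     public static void main(String[] args) {
         List<String> source = List.of("Joe Smith", "Nate Johnson", "Joe Smith");
         List<String> target = List.of("joe smith", "Jack Crank");

         // Build the map: all "equal" source lines collect in one entry.
         TreeMap<String, List<String>> groups = new TreeMap<>(MapGrouping::compareRecords);
         for (String rec : source) {
             groups.computeIfAbsent(rec, k -> new ArrayList<>()).add(rec);
         }

         // Probe with the target lines: a key hit means a match set.
         for (String rec : target) {
             List<String> same = groups.get(rec);   // O(log n) lookup
             if (same != null) {
                 System.out.println(rec + " matches " + same);
             }
         }
     }
 }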

Note that in the worst case, according to your description of the problem, you cannot do better than O(n^2), simply because there may be O(n^2) matching record pairs that you have to output!

+4

We need to know more about your comparison function. Is your comparison transitive? (That is, do A == B and B == C imply A == C?) Is it symmetric? (Does A == B imply B == A?)

If your comparison function is transitive and symmetric, and identical records are common, you can partition your records into groups by comparing each one against a single "representative sample" of each group. This could approach O(n) in the best case.
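
A minimal sketch of the representative-sample grouping, assuming the conditions above hold; same() is a hypothetical stand-in for the real comparator:

 import java.util.*;

 public class RepresentativeGrouping {
     // Hypothetical stand-in for the complex comparator from the question.
     static boolean same(String a, String b) {
         return a.equalsIgnoreCase(b);
     }

     // Each record is compared only against one representative per group,
     // so with few groups this approaches O(n) instead of O(n^2).
     static List<List<String>> group(List<String> records) {
         List<List<String>> groups = new ArrayList<>();
         for (String rec : records) {
             boolean placed = false;
             for (List<String> g : groups) {
                 if (same(g.get(0), rec)) {   // compare to the representative only
                     g.add(rec);
                     placed = true;
                     break;
                 }
             }
             if (!placed) {
                 groups.add(new ArrayList<>(List.of(rec)));   // rec becomes the representative
             }
         }
         return groups;
     }

     public static void main(String[] args) {
         System.out.println(group(List.of("Joe Smith", "joe smith", "Nate Johnson")));
     }
 }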

Note that hashing the records assumes hash(A) == hash(B) <=> compare(A, B) == true, but if compare(A, B) can be true even when bytes(A) != bytes(B), it may be difficult to devise a suitable hash algorithm.

+2

FYI, MapReduce will not improve the algorithmic complexity of the solution. It adds some overhead, but then parallelizes the work so that you can apply the necessary resources and finish in less wall-clock time.

To improve wall-clock time, the main thing you can do is find ways to avoid making comparisons. Any way of doing that is a win. And even if your comparison logic is complex, you can still use sorting to help.

For example, suppose you have some dimension along which the data is spread. Data that differs too much in this dimension is guaranteed not to compare equal, although being close in this dimension does not guarantee equality. Then what you can do is sort the data by this dimension and only run comparisons between elements that are close in that dimension. Voila! The bulk of the O(n*n) comparisons have now gone away.
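
A minimal sketch of this sort-then-compare-neighbours idea; the dimension() function (record length) and the tolerance value are purely illustrative assumptions, and same() again stands in for the real comparator:

 import java.util.*;

 public class SortedWindowCompare {
     // Hypothetical "dimension": here, simply the length of the record.
     static int dimension(String rec) { return rec.length(); }

     // Hypothetical stand-in for the complex comparator.
     static boolean same(String a, String b) { return a.equalsIgnoreCase(b); }

     public static void main(String[] args) {
         List<String> records = new ArrayList<>(List.of(
                 "Joe Smith", "joe smith", "Nate Johnson", "Jack Crank"));

         // Sort by the chosen dimension, then only compare records whose
         // dimension values are within a tolerance of each other.
         records.sort(Comparator.comparingInt(SortedWindowCompare::dimension));
         int tolerance = 2;   // assumption: equal records differ by at most this much

         for (int i = 0; i < records.size(); i++) {
             for (int j = i + 1; j < records.size()
                     && dimension(records.get(j)) - dimension(records.get(i)) <= tolerance; j++) {
                 if (same(records.get(i), records.get(j))) {
                     System.out.println(records.get(i) + " matches " + records.get(j));
                 }
             }
         }
     }
 }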

Let's make it harder. Suppose you can identify two such dimensions that are independent of each other. Sort your data along the first dimension. Divide the data into strips along that dimension. (Make the strips overlap by the maximum amount that records can differ in this dimension and still compare equal.) Now take each strip and sort it by the second dimension. Then run comparisons between pairs of elements that are reasonably close in that dimension, and include a pair in your answer if it compares equal and this is the first strip in which the pair could appear. (That check is needed because the overlap means a pair that compares equal can show up in several strips.) This is likely to be even better than the first approach, because you have managed to narrow things down so that each record is only compared against a small number of "nearby" records.

If you want to use fewer resources, you need to focus on ways to avoid actually making individual comparisons. Anything you come up with along those lines will help.

+2

Just go through each record in your file and insert it into a hash table. At each step, check whether the record is already in the hash table. If it is, print it. This can be done in O(n).
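
A minimal sketch of that single pass; note it only works if "the same" reduces to exact equality (or to equality of some canonical key), which the fuzzy matching in the question may not allow:

 import java.util.*;

 public class HashDedup {
     public static void main(String[] args) {
         List<String> records = List.of("Joe Smith", "Nate Johnson", "Joe Smith");

         // Records seen so far; a repeat lookup is O(1) on average.
         Set<String> seen = new HashSet<>();
         for (String rec : records) {
             if (!seen.add(rec)) {   // add() returns false if already present
                 System.out.println("duplicate: " + rec);
             }
         }
     }
 }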

+1

As was mentioned, you are unlikely to do better than O(n^2), but you can parallelize it.

I have a working solution that runs against HDFS; you could extend it to use the distributed cache.

 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStreamReader;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

 public class MatchImporter extends Mapper<LongWritable, Text, Text, Text> {

     FileSystem fs;
     private BufferedReader stream;

     @Override
     protected void setup(Context context) throws IOException, InterruptedException {
         fs = FileSystem.get(context.getConfiguration());
     }

     // Reopen the target file so every source record scans it from the top.
     private void resetFile() throws IOException {
         if (stream != null)
             stream.close();
         stream = new BufferedReader(new InputStreamReader(fs.open(new Path(
                 "files/imp/in/target.txt"))));
     }

     private boolean compare(Text in, String target) {
         return target.contains(in.toString());
     }

     enum Counter {
         PROGRESS
     }

     @Override
     protected void map(LongWritable key, Text value, Context context)
             throws IOException, InterruptedException {
         resetFile();
         String line = null;
         while ((line = stream.readLine()) != null) {
             // increment a counter so the task is not killed as unresponsive
             context.getCounter(Counter.PROGRESS).increment(1);
             context.progress();
             if (compare(value, line)) {
                 context.write(new Text(line), value);
             }
         }
     }

     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf);
         job.setMapperClass(MatchImporter.class);
         job.setReducerClass(Reducer.class);
         job.setJarByClass(MatchImporter.class);

         Path in = new Path("files/imp/in/source.txt");
         Path out = new Path("files/imp/out/");

         FileInputFormat.addInputPath(job, in);
         FileSystem fs = FileSystem.get(conf);

         if (fs.exists(out))
             fs.delete(out, true);

         SequenceFileOutputFormat.setOutputPath(job, out);
         job.setInputFormatClass(TextInputFormat.class);
         job.setOutputFormatClass(TextOutputFormat.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);

         job.waitForCompletion(true);
     }
 }

Using the following input in source.txt:

 thomas
 phil
 jen
 james
 christian
 stefan
 stephan
 john

and target.txt

 john meat
 jay hardly

will produce this reducer output:

 john meat    john

The trick is that you can split source.txt and run the comparisons in parallel. That gives you a speedup, but does not improve the big-O.

One note here: you have to report progress using a counter, because the comparison against the whole file can take a very long time. This prevents your task from being killed as unresponsive in a distributed environment.

A little tip: try to split your source.txt into 64 MB chunks and turn target.txt into a SequenceFile. That will gain a lot of speed, but you will then have to rewrite the reading code accordingly.

Wish you luck!

+1

You did not indicate what percentage of the input will match, or how often you get an exact match versus an inexact one. If you can do a little preprocessing to reduce the size of the problem, it could be a big help.

If you simply sort the input and run the comparison function on adjacent records, you may be able to weed out enough duplicates to make the remaining n^2 pass acceptable.
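
A minimal sketch of that preprocessing pass: sort, drop adjacent exact duplicates, and leave the expensive comparator for the reduced set; it assumes exact duplicates are worth weeding out before the fuzzy pass:

 import java.util.*;

 public class AdjacentDedup {
     public static void main(String[] args) {
         List<String> records = new ArrayList<>(List.of(
                 "Joe Smith", "Nate Johnson", "Joe Smith", "Jack Crank"));

         // Sort so that exact duplicates end up next to each other,
         // then keep only the first of each run of identical records.
         Collections.sort(records);
         List<String> reduced = new ArrayList<>();
         for (String rec : records) {
             if (reduced.isEmpty() || !reduced.get(reduced.size() - 1).equals(rec)) {
                 reduced.add(rec);
             }
         }
         // The expensive O(n^2) comparator pass then runs on 'reduced' only.
         System.out.println(reduced);
     }
 }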

0

If you really cannot do anything better than an opaque equivalence relation, then your worst case will always be O(n^2): for example, in the case where there are no matches, you need to compare every pair to make sure of that. (As people have mentioned, you can parallelize this, but it still won't be particularly tractable for hundreds of millions of records; it may not be worth the resources required.)

Usually, however, there is a better way to do this.

If you really have an equivalence relation (that is, if you have a logical guarantee that if match(a, b) = match(b, c) = true, then match(a, c) is also true), there may be some canonical form into which you can convert your records that lends itself to hashing and/or ordering.

In your example, you seem to be matching variants of "Joe Smith". If so, you may be able to augment your comparison criteria so as to pick one specific member of the equivalence class to represent the whole class. For example, pick "JOSEPH" to represent all names equivalent to "Joe", "SMITH" to represent all names equivalent to "Smythe", and so on.

Once you have done that conversion, you can use a hash table to reduce the operation to O(n) instead of O(n^2).
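
A minimal sketch of the canonical-form idea; the nickname table is a made-up illustration rather than real matching logic, and each record is simplified to a single name field for brevity:

 import java.util.*;

 public class CanonicalFormGrouping {
     // Hypothetical canonicalizer: maps each name part to a chosen representative.
     static final Map<String, String> CANONICAL = Map.of(
             "joe", "JOSEPH", "joey", "JOSEPH",
             "smith", "SMITH", "smyth", "SMITH");

     static String canonicalize(String fullName) {
         StringBuilder sb = new StringBuilder();
         for (String part : fullName.toLowerCase().split("\\s+")) {
             sb.append(CANONICAL.getOrDefault(part, part.toUpperCase())).append(' ');
         }
         return sb.toString().trim();
     }

     public static void main(String[] args) {
         List<String> records = List.of("1,Joe Smith", "4,Joey Smyth", "2,Nate Johnson");

         // Group record IDs by canonical form in a single O(n) pass.
         Map<String, List<String>> groups = new HashMap<>();
         for (String rec : records) {
             String[] fields = rec.split(",", 2);
             groups.computeIfAbsent(canonicalize(fields[1]), k -> new ArrayList<>())
                   .add(fields[0]);
         }
         groups.values().stream()
               .filter(ids -> ids.size() > 1)
               .forEach(ids -> System.out.println("match set: " + ids));
     }
 }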

0

Thanks for all the great suggestions.
After going through the options, it seems that the best approach, given my timeline, is to use the MapReduce infrastructure to parallelize the problem and throw more hardware at it. I realize this does not reduce the O(n^2) complexity.
The only other possible solution I can think of is to run some type of minHash on the data, partition the data into overlapping sections, and compare within the bands and overlaps. This should reduce the number of comparisons, but I'm not sure how expensive the hashing pass will be.
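
A rough, illustrative MinHash/LSH sketch of that idea; the shingle size, number of hash functions, mixing constants, and band width are arbitrary assumptions, and any candidate pairs it produces would still need the full comparison function:

 import java.util.*;

 public class MinHashSketch {
     static final int NUM_HASHES = 16;   // signature length (arbitrary)
     static final long[] SEEDS = new long[NUM_HASHES];
     static {
         Random rnd = new Random(42);
         for (int i = 0; i < NUM_HASHES; i++) SEEDS[i] = rnd.nextLong();
     }

     // Character 3-gram shingles of the lower-cased record.
     static Set<String> shingles(String rec) {
         String s = rec.toLowerCase();
         Set<String> out = new HashSet<>();
         for (int i = 0; i + 3 <= s.length(); i++) out.add(s.substring(i, i + 3));
         return out;
     }

     // MinHash signature: for each seed, the minimum hash over all shingles.
     static long[] signature(String rec) {
         long[] sig = new long[NUM_HASHES];
         Arrays.fill(sig, Long.MAX_VALUE);
         for (String sh : shingles(rec)) {
             for (int i = 0; i < NUM_HASHES; i++) {
                 long h = (sh.hashCode() * 0x9E3779B97F4A7C15L) ^ SEEDS[i];
                 h = Long.rotateLeft(h, 31) * 0xC2B2AE3D27D4EB4FL;   // cheap mixing
                 if (h < sig[i]) sig[i] = h;
             }
         }
         return sig;
     }

     public static void main(String[] args) {
         List<String> records = List.of(
                 "1,Joe Smith,Daniel Foster",
                 "4,Joey Smyth,Daniel Jack Foster",
                 "2,Nate Johnson,Drew Logan");

         // LSH banding: two signature values per band; records sharing any band
         // bucket become candidates for the full comparison pass.
         Map<String, List<String>> buckets = new HashMap<>();
         for (String rec : records) {
             long[] sig = signature(rec);
             for (int b = 0; b + 2 <= NUM_HASHES; b += 2) {
                 String band = b + ":" + sig[b] + ":" + sig[b + 1];
                 buckets.computeIfAbsent(band, k -> new ArrayList<>()).add(rec);
             }
         }
         // The same pair may show up in several buckets; a real pass would dedupe.
         buckets.values().stream()
                .filter(b -> b.size() > 1)
                .forEach(b -> System.out.println("candidate set: " + b));
     }
 }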

0

As btilly noted, you don't actually need transitivity in order to classify records. For English people's names, you could represent each name by two initials, and each record by a sorted list of initial pairs. Then you only need to run the full O(N^2) comparison between records in the same class. There is the additional issue that the same pair of records can appear in several classes, but that is easy to handle by maintaining a separate set of matched record pairs (identified by record indexes).

In this example, you would put record 1 in class "DF, JS", record 2 in class "DL, NJ", record 3 in class "JC, NJ", record 4 in classes "DJ, JS", "JF, JS" and "DF, JS", and record 5 in classes "DF, JM", "DF, JS" and "DF, MS". You get a total of 7 classes: "DF, JM", "DF, MS", "DF, JS", "DJ, JS", "DL, NJ", "JC, NJ", "JF, JS", of which only "DF, JS" contains more than one record, namely records 1, 4 and 5. So in this example you only need to run the full comparison function twice.
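
A sketch of this classing idea in Java; fullCompare is only a placeholder for the real comparator, and the record parsing assumes the "id,name1,name2" layout from the example above:

 import java.util.*;

 public class InitialsClassing {
     // Hypothetical stand-in for the real, complex comparison function.
     static boolean fullCompare(String recA, String recB) {
         return recA.equalsIgnoreCase(recB);
     }

     // All ordered pairs of word initials from one name,
     // e.g. "Daniel Jack Foster" -> DJ, DF, JF.
     static List<String> initialPairs(String name) {
         String[] words = name.trim().split("\\s+");
         List<String> pairs = new ArrayList<>();
         for (int i = 0; i < words.length; i++)
             for (int j = i + 1; j < words.length; j++)
                 pairs.add("" + Character.toUpperCase(words[i].charAt(0))
                              + Character.toUpperCase(words[j].charAt(0)));
         if (words.length == 1) pairs.add(words[0].substring(0, 1).toUpperCase());
         return pairs;
     }

     // Class keys for a record "id,name1,name2": one key per combination of an
     // initial pair from each name, sorted so the order of the names does not matter.
     static List<String> classKeys(String record) {
         String[] f = record.split(",");
         List<String> keys = new ArrayList<>();
         for (String p1 : initialPairs(f[1]))
             for (String p2 : initialPairs(f[2])) {
                 String[] k = {p1, p2};
                 Arrays.sort(k);
                 keys.add(k[0] + "," + k[1]);
             }
         return keys;
     }

     public static void main(String[] args) {
         List<String> records = List.of(
                 "1,Joe Smith,Daniel Foster",
                 "2,Nate Johnson,Drew Logan",
                 "3,Nate Johnson,Jack Crank",
                 "4,Joey Smyth,Daniel Jack Foster",
                 "5,Joe Morgan Smith,Daniel Foster");

         // Bucket record indexes by class key.
         Map<String, List<Integer>> classes = new HashMap<>();
         for (int i = 0; i < records.size(); i++)
             for (String key : classKeys(records.get(i)))
                 classes.computeIfAbsent(key, k -> new ArrayList<>()).add(i);

         // Full comparison only within a class; a seen-set avoids re-comparing
         // the same pair when it shows up in several classes.
         Set<String> seenPairs = new HashSet<>();
         for (List<Integer> members : classes.values())
             for (int a = 0; a < members.size(); a++)
                 for (int b = a + 1; b < members.size(); b++) {
                     String pair = members.get(a) + "-" + members.get(b);
                     if (seenPairs.add(pair)) {
                         boolean match = fullCompare(records.get(members.get(a)),
                                                     records.get(members.get(b)));
                         System.out.println("compared records " + (members.get(a) + 1)
                                 + " and " + (members.get(b) + 1) + ": " + match);
                     }
                 }
     }
 }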

On the other hand, there is the problem that people have strange names. This blog post on the subject is worth a look if you have not seen it before. Whatever you do, you will miss some matches.

0

Source: https://habr.com/ru/post/892576/

