Java algorithm: writing a list of pairs according to several criteria

Question

Java algorithm: writing a list of pairs according to several criteria

I am afraid this will be a difficult question. I have been thinking about the correct solution to this problem for a long time and I hope that a fresh piece of brains will better consider the problem - let it get to it:

Data

What we are working with is a csv file containing several columns for which the following issues are important:

User ID (integer, 3 to 8 digits, multiple entries with the same user ID) LIST USED THIS
Request (string)
Epoc (long time value)
clickurl (String)

Each record in the data that we work with here has zero values for these attributes.

Sample data:

SID,UID,query,rawdate,timestamp,timegap,epoc,lengthwords,lengthchars,rank,clickurl 5,142,westchester.gov,2006-03-20 03:55:57,Mon Mar 20 03:55:57 CET 2006,0,1142823357504,1,15,1,http://www.westchestergov.com 10,142,207 ad2d 530,2006-04-08 01:31:14,Sat Apr 08 01:31:14 CEST 2006,10000,1144452674507,3,12,1,http://www.courts.state.ny.us 11,142,vera.org,2006-04-08 08:38:42,Sat Apr 08 08:38:42 CEST 2006,11000,1144478322507,1,8,1,http://www.vera.org

Note. there are several entries that have the same meaning for "Epoc", this is due to the tools used to collect data

Note2 : the list has size ~ 700000, just fyi

Purpose : match pairs of records that have the same query

Scope: records that have the same user ID

Due to this anomaly, the following should be considered in the data collection process:

If two entries have the same value for "Query" and for "Epoc", the following items in the list should be checked for these criteria until the next entry has a different value for one of these attributes. A group of records that have the same Query and Epoc values should be treated as a “single record”, so to match a pair, another record must be found that matches the “Query” value. For lack of a better name, call a group that shares the same value as Query and Epoc a 'chain'

Now that it comes out, it gets a little easier, there are 3 types of paired tracks that we can get from this:

Recording and recording
Login and conversation
Chain and chain

Type 1 here means only two entries in the list that have the same value for “Request”, but not for “Epoc”.

So this sums up equal query pairs

There is also the case of Different pairs of queries , which can be described as follows:

After we match equal pairs of queries, it is likely that there are records that were not connected to other records because their query did not match - every record that was not mapped to another record because it is part of a set called "different requests"

The members of this set should be conjugated without any criteria, but the chains are still considered as a -one-input pair.

As for pair matching in general, there can be no redundant pairs - one record can be part of n many pairs, but two separate records can form only one pair.

Example:

The following entries must be paired.

 UID,Query,Epoc,Clickurl 772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html 772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html 772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html 772,raspberry pi,1141394164710,http://www.raspberrypi.org/ 772,stackoverflow,1141394274810,http://en.wikipedia.org/wiki/Buffer_overflow 772,stackoverflow,1141394274850,http://www.stackoverflow.com 772,tall women,1141394275921,http://www.tallwomen.org/ 772,raspberry pi,1141394277991,http://www.raspberrypi.org/ 772,Donuts,114139427999,http://de.wikipedia.org/wiki/Donut 772,stackoverflow,1141394279999,http://www.stackoverflow.com 772,something,1141399299991,http:/something.else/something/

In this example, donuts are a chain, so pairs (linenumbers without a header):

Equal query pairs: (1-3.9) (4.8) (5.6) (5.10) (6.10)
Different pairs of queries: (7,11)

My unsuccessful approach to the problem:

The algorithm that I developed for this works as follows:

Iterate through the list of entries until the value for UserID changes.

It is then applied to a separate list that contains only those iterations that have the same user ID:

  for (int i = 0; i < list.size(); i++) { Entry tempI = list.get(i); Boolean iMatched = false; //boolean to save whether or not c1 is set Boolean c1done = false; Boolean c2done = false; //Hashsets holding the clickurl values of the entries that form a pair HashSet<String> c1 = null; HashSet<String> c2 = null; for (int j = i + 1; j < list.size(); j++) { Entry tempJ = list.get(j); // Queries match if (tempI.getQuery().equals(tempJ.getQuery())) { // wheter or not Entry at position i has been matched or not if (!iMatched) { iMatched = true; } HashSet<String> e1 = new HashSet<String>(); HashSet<String> e2 = new HashSet<String>(); int k = 0; // Times match HashSet<String> chainset = new HashSet<String>(); if (tempI.getEpoc() == tempJ.getEpoc()) { chainset.add(tempI.getClickurl()); chainset.add(tempJ.getClickurl()); } else { e1.add(tempI.getClickurl()); if (c1 == null) { c1 = e1; c1done = true; } else { if (c2 == null) { c2 = e1; c2done = true; } } } //check how far the chain goes and get their entries if ((j + 1) < list.size()) { Entry tempjj = list.get(j + 1); if (tempjj.getEpoc() == tempJ.getEpoc()) { k = j + 1; //search for the end of the chain while ((k < list.size()) && (tempJ.getQuery().equals(list.get(k) .getQuery())) && (tempJ.getEpoc() == list.get(k).getEpoc())) { chainset.add(tempJ.getClickurl()); chainset.add(list.get(k).getClickurl()); k++; } j = k + 1; //continue the iteration at the end of the chain if (c1 == null) { c1 = chainset; c1done = true; } else { if (c2 == null) { c2 = chainset; c2done = true; } } // Times don't match } } else { e2.add(tempJ.getClickurl()); if (c1 == null) { c1 = e2; c1done = true; } else { if (c2 == null) { c2 = e2; c2done = true; } } } /** Block that compares the clicks in the Hashsets and computes the resulting data * left out for now to not make this any more complicated than it already is **/ // Queries don't match } else { if (!dq.contains(tempJ)) { //note: dq is an ArrayList holding the entries of the differen query set dq.add(tempJ); } } if (j == al.size() - 1) { if (!iMatched) { dq.add(tempI); } } } if (dq.size() >= 2) { for (int z = 0; z < dq.size() - 1; z++) { if (dq.get(z + 1) != null) { /** Filler, iterate dq just like the normal list with two loops * **/ } } } }

So, using an excessive number of loops, I try to match the pairs, which leads to an awfully long work, the end of which I have not seen until this point

Well, I hope that I have not forgotten anything important, I will add additional necessary information later If you have done this so far, thanks for reading - I hope you have an idea that can help me.

+4

java list algorithm pair

Rickyfox 30 sept '12 at 11:54

source share

2 answers

Dylan · Answer 1 · 2012-09-30T19:11:08+0000

Use SQL to import data into db, and then execute the queries. Your txt file is too large; no wonder it takes so long to get through this. :)

maniek · Answer 2 · 2012-09-30T18:10:11+0000

First, delete all entries except one from each chain. To do this, sort by (userid, query, epoch), delete duplicates.

Then scan the sorted list. take all the entries for the pair (userid, query). If there is only one, save it in the list for further processing, otherwise release all the pairs.

For all records for this user that you saved for further processing (these are types 2 and 3), emit pairs.

Java algorithm: writing a list of pairs according to several criteria

More articles: