I am afraid this will be a difficult question. I have been thinking about the correct solution to this problem for a long time and I hope that a fresh piece of brains will better consider the problem - let it get to it:
Data
What we are working with is a csv file containing several columns for which the following issues are important:
- User ID (integer, 3 to 8 digits, multiple entries with the same user ID) LIST USED THIS
- Request (string)
- Epoc (long time value)
- clickurl (String)
Each record in the data that we work with here has zero values for these attributes.
Sample data:
SID,UID,query,rawdate,timestamp,timegap,epoc,lengthwords,lengthchars,rank,clickurl 5,142,westchester.gov,2006-03-20 03:55:57,Mon Mar 20 03:55:57 CET 2006,0,1142823357504,1,15,1,http:
Note. there are several entries that have the same meaning for "Epoc", this is due to the tools used to collect data
Note2 : the list has size ~ 700000, just fyi
Purpose : match pairs of records that have the same query
Scope: records that have the same user ID
Due to this anomaly, the following should be considered in the data collection process:
If two entries have the same value for "Query" and for "Epoc", the following items in the list should be checked for these criteria until the next entry has a different value for one of these attributes. A group of records that have the same Query and Epoc values should be treated as a “single record”, so to match a pair, another record must be found that matches the “Query” value. For lack of a better name, call a group that shares the same value as Query and Epoc a 'chain'
Now that it comes out, it gets a little easier, there are 3 types of paired tracks that we can get from this:
- Recording and recording
- Login and conversation
- Chain and chain
Type 1 here means only two entries in the list that have the same value for “Request”, but not for “Epoc”.
So this sums up equal query pairs
There is also the case of Different pairs of queries , which can be described as follows:
After we match equal pairs of queries, it is likely that there are records that were not connected to other records because their query did not match - every record that was not mapped to another record because it is part of a set called "different requests"
The members of this set should be conjugated without any criteria, but the chains are still considered as a -one-input pair.
As for pair matching in general, there can be no redundant pairs - one record can be part of n many pairs, but two separate records can form only one pair.
Example:
The following entries must be paired.
UID,Query,Epoc,Clickurl 772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html 772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html 772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html 772,raspberry pi,1141394164710,http://www.raspberrypi.org/ 772,stackoverflow,1141394274810,http://en.wikipedia.org/wiki/Buffer_overflow 772,stackoverflow,1141394274850,http://www.stackoverflow.com 772,tall women,1141394275921,http://www.tallwomen.org/ 772,raspberry pi,1141394277991,http://www.raspberrypi.org/ 772,Donuts,114139427999,http://de.wikipedia.org/wiki/Donut 772,stackoverflow,1141394279999,http://www.stackoverflow.com 772,something,1141399299991,http:/something.else/something/
In this example, donuts are a chain, so pairs (linenumbers without a header):
- Equal query pairs: (1-3.9) (4.8) (5.6) (5.10) (6.10)
- Different pairs of queries: (7,11)
My unsuccessful approach to the problem:
The algorithm that I developed for this works as follows:
Iterate through the list of entries until the value for UserID changes.
It is then applied to a separate list that contains only those iterations that have the same user ID:
for (int i = 0; i < list.size(); i++) { Entry tempI = list.get(i); Boolean iMatched = false; //boolean to save whether or not c1 is set Boolean c1done = false; Boolean c2done = false; //Hashsets holding the clickurl values of the entries that form a pair HashSet<String> c1 = null; HashSet<String> c2 = null; for (int j = i + 1; j < list.size(); j++) { Entry tempJ = list.get(j); // Queries match if (tempI.getQuery().equals(tempJ.getQuery())) { // wheter or not Entry at position i has been matched or not if (!iMatched) { iMatched = true; } HashSet<String> e1 = new HashSet<String>(); HashSet<String> e2 = new HashSet<String>(); int k = 0; // Times match HashSet<String> chainset = new HashSet<String>(); if (tempI.getEpoc() == tempJ.getEpoc()) { chainset.add(tempI.getClickurl()); chainset.add(tempJ.getClickurl()); } else { e1.add(tempI.getClickurl()); if (c1 == null) { c1 = e1; c1done = true; } else { if (c2 == null) { c2 = e1; c2done = true; } } } //check how far the chain goes and get their entries if ((j + 1) < list.size()) { Entry tempjj = list.get(j + 1); if (tempjj.getEpoc() == tempJ.getEpoc()) { k = j + 1; //search for the end of the chain while ((k < list.size()) && (tempJ.getQuery().equals(list.get(k) .getQuery())) && (tempJ.getEpoc() == list.get(k).getEpoc())) { chainset.add(tempJ.getClickurl()); chainset.add(list.get(k).getClickurl()); k++; } j = k + 1; //continue the iteration at the end of the chain if (c1 == null) { c1 = chainset; c1done = true; } else { if (c2 == null) { c2 = chainset; c2done = true; } } // Times don't match } } else { e2.add(tempJ.getClickurl()); if (c1 == null) { c1 = e2; c1done = true; } else { if (c2 == null) { c2 = e2; c2done = true; } } } /** Block that compares the clicks in the Hashsets and computes the resulting data * left out for now to not make this any more complicated than it already is **/ // Queries don't match } else { if (!dq.contains(tempJ)) { //note: dq is an ArrayList holding the entries of the differen query set dq.add(tempJ); } } if (j == al.size() - 1) { if (!iMatched) { dq.add(tempI); } } } if (dq.size() >= 2) { for (int z = 0; z < dq.size() - 1; z++) { if (dq.get(z + 1) != null) { /** Filler, iterate dq just like the normal list with two loops * **/ } } } }
So, using an excessive number of loops, I try to match the pairs, which leads to an awfully long work, the end of which I have not seen until this point
Well, I hope that I have not forgotten anything important, I will add additional necessary information later If you have done this so far, thanks for reading - I hope you have an idea that can help me.