I have a system that tracks which documents users view. Each document has its own identifier and the cluster to which it belongs. My system keeps track of the session id and number of views. Now I would like to build an SQL query that will give me two columns - a session identifier and a classified cluster. The classification algorithm is simple:
1. select all sessions 2. for each session S I. prepare an accumulator ACC for clusters II. select the clusters of viewed documents for this session III. for each cluster C accumulate the cluster count ( ACC[C]++ ) IV. find the maximum in the ACC. That is the cluster that the session was classified to
The table structures are as follows: I am using MySQL 5.5.16:
Session
+-------+-----------+--------------------+ | ID | sessionID | classified_cluster | +-------+-----------+--------------------+
Sessiondocument
+-------+-----------+------------+ | ID | sessionID | documentID | +-------+-----------+------------+
cluster
+-------+-------+ | ID | label | +-------+-------+
Clusterdocument
+-------+-----------+------------+ | ID | clusterID | documentID | +-------+-----------+------------+
So, basically, I want to select clusters for each session, count the appearance of each cluster to view documents and find the maximum value. Then the identifier of the cluster itself, which was the most, is the result for the session, so the final set of results contains the session identifier and the most common cluster:
Result
+-----------+-----------------------+ | sessionID | classifiedIntoCluster | +-----------+-----------------------+
I managed to get clusters of viewed documents for each session (step 2 / II.) Using this query:
SELECT SD.session_id, CD.cluster_id FROM cluster_document AS CD INNER JOIN session_document AS SD ON CD.document_id = SD.document_id WHERE session_id IN (SELECT session_id FROM session)
It's hard for me to deal with the rest. Is this possible with nested SELECT queries? Should I use a cursor, and if so, can someone show an example using a cursor? Any help would be much appreciated.
EDIT # 1: added C # implementation, MySQL dump and expected result
C # implementation
private void ClassifyUsers() { int nClusters = Database.SelectClusterCount(); //get number of clusters DataSet sessions = Database.SelectSessions(); //get all sessions foreach (DataRow session in sessions.Tables[0].Rows) { //foreach session int[] acc = new int[nClusters]; //prepare an accumulator for each known cluster string s_id = session["session_id"].ToString(); DataSet sessionClusters = Database.SelectSessionClusters(s_id); //get clusters for this session foreach (DataRow cluster in sessionClusters.Tables[0].Rows) { //for each cluster int c = Convert.ToInt32(cluster["cluster_id"].ToString()) - 1; acc[c]++; //accumulate the cluster count } //find the maximum in the accumulator -> that is the most relevant cluster int max = 0; for (int j = 0; j < acc.Length; j++) { if (acc[j] >= acc[max]) max = j; } max++; Database.UpdateSessionCluster(s_id, max); //update the session with its new assigned cluster } }
Table structure, test data and expected result
Table structure and test data
Expected Result
EDIT # 2: added a smaller data set and further walkthrough of the algorithm
Below is a small data set:
Session
session id | cluster abc 0 def 0 ghi 0 jkl 0 mno 0
Cluster
cluster_id | label 1 A 2 B 3 C 4 D 5 E
SESSION_DOCUMENT
id | session_id | document_id 1 abc 1 2 def 5 3 jkl 3 4 ghi 4 5 mno 2 6 def 2 7 abc 5 8 ghi 3
CLUSTER_DOCUMENT
id | cluster_id | document_id 1 1 2 2 1 3 3 2 5 4 3 5 5 3 1 6 4 3 7 5 2 8 5 4
Algorithm detail
Step 1: Get Clusters for Documents Viewed by the Session
session_id | cluster_id | label | document_id abc 3 C 1 abc 2 B 5 abc 3 C 5
Step 2: Counting Clusters
session_id | cluster_id | label | occurrence abc 3 C 2 <--- MAX abc 2 B 1 ---- def 2 B 1 def 3 C 1 def 1 A 1 def 5 E 1 <--- MAX ---- ghi 5 E 1 ghi 1 A 1 ghi 4 D 1 <--- MAX ---- jkl 1 A 1 jkl 4 D 1 <--- MAX ---- mno 1 A 1 mno 5 E 1 <--- MAX
Step 3 (final result): find the maximum allowable cluster for each session (see above) and create the final result set (session_id, cluster_id):
session_id | cluster_id abc 3 def 5 ghi 4 jkl 4 mno 5
EDIT No. 3: Accepted clarification of the answer
Both of these answers are correct. They both provide a solution to the problem. I gave Mostacho Bridge the accepted answer because he first gave the solution and provided a different version of the solution using VIEW
. The mankuTimma solution has the same quality as Mosty Mostacho. Therefore, we have two equally good solutions, I just chose Mostacho Bridges, because he was the first.
Thanks to both of them for their contribution. ,