SQL classification

Question

SQL classification

I have a system that tracks which documents users view. Each document has its own identifier and the cluster to which it belongs. My system keeps track of the session id and number of views. Now I would like to build an SQL query that will give me two columns - a session identifier and a classified cluster. The classification algorithm is simple:

1. select all sessions 2. for each session S I. prepare an accumulator ACC for clusters II. select the clusters of viewed documents for this session III. for each cluster C accumulate the cluster count ( ACC[C]++ ) IV. find the maximum in the ACC. That is the cluster that the session was classified to

The table structures are as follows: I am using MySQL 5.5.16:

Session

 +-------+-----------+--------------------+ | ID | sessionID | classified_cluster | +-------+-----------+--------------------+

Sessiondocument

 +-------+-----------+------------+ | ID | sessionID | documentID | +-------+-----------+------------+

cluster

 +-------+-------+ | ID | label | +-------+-------+

Clusterdocument

 +-------+-----------+------------+ | ID | clusterID | documentID | +-------+-----------+------------+

So, basically, I want to select clusters for each session, count the appearance of each cluster to view documents and find the maximum value. Then the identifier of the cluster itself, which was the most, is the result for the session, so the final set of results contains the session identifier and the most common cluster:

Result

 +-----------+-----------------------+ | sessionID | classifiedIntoCluster | +-----------+-----------------------+

I managed to get clusters of viewed documents for each session (step 2 / II.) Using this query:

 SELECT SD.session_id, CD.cluster_id FROM cluster_document AS CD INNER JOIN session_document AS SD ON CD.document_id = SD.document_id WHERE session_id IN (SELECT session_id FROM session)

It's hard for me to deal with the rest. Is this possible with nested SELECT queries? Should I use a cursor, and if so, can someone show an example using a cursor? Any help would be much appreciated.

EDIT # 1: added C # implementation, MySQL dump and expected result

C # implementation

  private void ClassifyUsers() { int nClusters = Database.SelectClusterCount(); //get number of clusters DataSet sessions = Database.SelectSessions(); //get all sessions foreach (DataRow session in sessions.Tables[0].Rows) { //foreach session int[] acc = new int[nClusters]; //prepare an accumulator for each known cluster string s_id = session["session_id"].ToString(); DataSet sessionClusters = Database.SelectSessionClusters(s_id); //get clusters for this session foreach (DataRow cluster in sessionClusters.Tables[0].Rows) { //for each cluster int c = Convert.ToInt32(cluster["cluster_id"].ToString()) - 1; acc[c]++; //accumulate the cluster count } //find the maximum in the accumulator -> that is the most relevant cluster int max = 0; for (int j = 0; j < acc.Length; j++) { if (acc[j] >= acc[max]) max = j; } max++; Database.UpdateSessionCluster(s_id, max); //update the session with its new assigned cluster } }

Table structure, test data and expected result

Table structure and test data

Expected Result

EDIT # 2: added a smaller data set and further walkthrough of the algorithm

Below is a small data set:

Session

 session id | cluster abc 0 def 0 ghi 0 jkl 0 mno 0

Cluster

 cluster_id | label 1 A 2 B 3 C 4 D 5 E

SESSION_DOCUMENT

 id | session_id | document_id 1 abc 1 2 def 5 3 jkl 3 4 ghi 4 5 mno 2 6 def 2 7 abc 5 8 ghi 3

CLUSTER_DOCUMENT

 id | cluster_id | document_id 1 1 2 2 1 3 3 2 5 4 3 5 5 3 1 6 4 3 7 5 2 8 5 4

Algorithm detail

Step 1: Get Clusters for Documents Viewed by the Session

 session_id | cluster_id | label | document_id abc 3 C 1 abc 2 B 5 abc 3 C 5 ----- def 2 B 5 def 3 C 5 def 1 A 2 def 5 E 2 ---- ghi 5 E 4 ghi 1 A 3 ghi 4 D 3 ---- jkl 1 A 3 jkl 4 D 3 ---- mno 1 A 2 mno 5 E 2

Step 2: Counting Clusters

 session_id | cluster_id | label | occurrence abc 3 C 2 <--- MAX abc 2 B 1 ---- def 2 B 1 def 3 C 1 def 1 A 1 def 5 E 1 <--- MAX ---- ghi 5 E 1 ghi 1 A 1 ghi 4 D 1 <--- MAX ---- jkl 1 A 1 jkl 4 D 1 <--- MAX ---- mno 1 A 1 mno 5 E 1 <--- MAX

Step 3 (final result): find the maximum allowable cluster for each session (see above) and create the final result set (session_id, cluster_id):

 session_id | cluster_id abc 3 def 5 ghi 4 jkl 4 mno 5

EDIT No. 3: Accepted clarification of the answer

Both of these answers are correct. They both provide a solution to the problem. I gave Mostacho Bridge the accepted answer because he first gave the solution and provided a different version of the solution using VIEW . The mankuTimma solution has the same quality as Mosty Mostacho. Therefore, we have two equally good solutions, I just chose Mostacho Bridges, because he was the first.

Thanks to both of them for their contribution. ,

+6

algorithm sql mysql classification document-classification

brozo Feb 16 '12 at 9:01

source share

2 answers

If I understand your problem correctly, for each session you want the cluster to have the largest number of document types. So, here you go. The following query returns the maximum counter or number of occurrences of a specific cluster identifier for each session identifier.

 SELECT SESSION_ID,MAX(CNT) MAX_CNT FROM (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT FROM CLUSTER_DOCUMENT AS CD INNER JOIN SESSION_DOCUMENT AS SD ON CD.DOCUMENT_ID = SD.DOCUMENT_ID GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT1 GROUP BY SESSION_ID

Then attach the result with an additional query (where I calculate the score) again to get the cluster id of the maximum value. If there are two cluster identifiers with the same number of occurrences, I use the cluster identifier with the maximum value. I tested your data and it works. In addition, this query should now work in all databases.

 SELECT B.SESSION_ID, MAX(CNT2.CLUSTER_ID) FROM (SELECT SESSION_ID,MAX(CNT) MAX_CNT FROM (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT FROM CLUSTER_DOCUMENT AS CD INNER JOIN SESSION_DOCUMENT AS SD ON CD.DOCUMENT_ID = SD.DOCUMENT_ID GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT1 GROUP BY SESSION_ID) B JOIN (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT FROM CLUSTER_DOCUMENT AS CD INNER JOIN SESSION_DOCUMENT AS SD ON CD.DOCUMENT_ID = SD.DOCUMENT_ID GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT2 ON B.SESSION_ID = CNT2.SESSION_ID AND B.MAX_CNT = CNT2.CNT GROUP BY B.SESSION_ID

+2

mankuTimma Feb 16 '12 at 10:31

source share

Mosty mostacho · Accepted Answer · 2012-02-16T18:46:32+0000

Well, I have some doubts about how to select an event when there are many equal, but looking at the C # code, it seems that this choice is not deterministic.

Now, given sampla data, step 2 actually results in:

 +------------+------------+-------+------------+ | SESSION_ID | CLUSTER_ID | LABEL | OCCURRENCE | +------------+------------+-------+------------+ | abc | 3 | C | 2 | | def | 1 | A | 1 | | def | 2 | B | 1 | | def | 3 | C | 1 | | def | 5 | E | 1 | | ghi | 1 | A | 1 | | ghi | 4 | D | 1 | | ghi | 5 | E | 1 | | jkl | 1 | A | 1 | | jkl | 4 | D | 1 | | mno | 1 | A | 1 | | mno | 5 | E | 1 | +------------+------------+-------+------------+

So, continuing this data, I get session_id and max (cluster_id) for this session id, resulting in:

 +------------+------------+ | SESSION_ID | CLUSTER_ID | +------------+------------+ | abc | 3 | | def | 5 | | ghi | 5 | | jkl | 4 | | mno | 5 | +------------+------------+

The maximum (cluster_id) should just make this non-deterministic choice. This is the request:

 select s1.session_id, max(s1.cluster_id) as cluster_id from ( select sd.session_id, cd.cluster_id, count(*) as Occurrence from session_document sd join cluster_document cd on sd.document_id = cd.document_id join cluster c on c.cluster_id = cd.cluster_id group by sd.session_id, cd.cluster_id, c.label ) as s1 left join ( select sd.session_id, count(*) as Occurrence from session_document sd join cluster_document cd on sd.document_id = cd.document_id join cluster c on c.cluster_id = cd.cluster_id group by sd.session_id, cd.cluster_id, c.label ) as s2 on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence where s2.occurrence is null group by s1.session_id

Perhaps adding a view will increase performance (replacing the above request):

 create view MaxOccurrences as ( select sd.session_id, cd.cluster_id, count(*) as Occurrence from session_document sd join cluster_document cd on sd.document_id = cd.document_id join cluster c on c.cluster_id = cd.cluster_id group by sd.session_id, cd.cluster_id, c.label ); select s1.session_id, max(s1.cluster_id) as cluster_id from MaxOccurrences as s1 left join MaxOccurrences as s2 on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence where s2.occurrence is null group by s1.session_id

Let me know if it works.

SQL classification

More articles: