I am writing a CGI script that queries an SQLite database and builds a statistics table. Below are the source table definition and a snippet of the corresponding code. Everything works functionally, but the CGI is very slow because of the many nested SELECT COUNT(id) calls. I think my best bet for optimizing it is to ask the SO community, since my time with Google was relatively fruitless.
Table:
CREATE TABLE messages (
    id TEXT PRIMARY KEY ON CONFLICT REPLACE,
    date TEXT,
    hour INTEGER,
    sender TEXT,
    size INTEGER,
    origin TEXT,
    destination TEXT,
    relay TEXT,
    day TEXT
);
(Yes, I know the table is not normalized, but it is populated with excerpts from the mail journal; getting the extract-and-load step working was enough of a job without normalizing it as well. I don't think the table structure has much bearing on my question at the moment, but I could be wrong.)
Example line:
476793200A7|Jan 29 06:04:47|6|admin@mydomain.com|4656|web02.mydomain.pvt|user@example.com|mail01.mydomain.pvt|Jan 29
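For reference, each journal excerpt is just split on the pipes and loaded straight into the table. A rough sketch of that load step is below (the database path and the loader code itself are illustrative, not my actual import script):

import sqlite3

# Illustrative load step only -- the real import script is separate.
conn = sqlite3.connect('/path/to/maillog.db')   # placeholder path

line = '476793200A7|Jan 29 06:04:47|6|admin@mydomain.com|4656|web02.mydomain.pvt|user@example.com|mail01.mydomain.pvt|Jan 29'
fields = line.split('|')   # id, date, hour, sender, size, origin, destination, relay, day

# ON CONFLICT REPLACE on the primary key means re-importing a line just overwrites it.
conn.execute('INSERT INTO messages (id, date, hour, sender, size, origin, destination, relay, day) '
             'VALUES (?,?,?,?,?,?,?,?,?)', fields)
conn.commit()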
And the Python code that builds my tables (the connection setup and database path shown here are placeholders):
#!/usr/bin/python
print 'Content-type: text/html\n\n'

from datetime import date
import re
import sqlite3

# Connection setup -- the database path here is just a placeholder.
conn = sqlite3.connect('/path/to/maillog.db')
curs = conn.cursor()

p = re.compile('(\w+) (\d+)')
d_month = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,
           'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}
l_wkday = ['Mo','Tu','We','Th','Fr','Sa','Su']

# Build the list of distinct days, each labelled with its weekday.
days = []
curs.execute('SELECT DISTINCT(day) FROM messages ORDER BY day')
for day in curs.fetchall():
    m = p.match(day[0]).group(1)
    m = d_month[m]
    d = p.match(day[0]).group(2)
    days.append([day[0], "%s (%s)" % (day[0], l_wkday[date.weekday(date(2010, int(m), int(d)))])])

curs.execute('SELECT DISTINCT(sender) FROM messages')
senders = curs.fetchall()
for sender in senders:
    curs.execute('SELECT COUNT(id) FROM messages WHERE sender=?', (sender[0],))
    print ' <div id="'+sender[0]+'">'
    print ' <h1>Stats for Sender: '+sender[0]+'</h1>'
    print ' <table><caption>Total messages in database: %d</caption>' % curs.fetchone()[0]
    print ' <tr><td> </td><th colspan=24>Hour of Day</th></tr>'
    print ' <tr><td class="left">Day</td><th>%s</th></tr>' % '</th><th>'.join(map(str, range(24)))
    for day in days:
        print ' <tr><td>%s</td>' % day[1]
        # One COUNT() per sender/day/hour cell -- this is the slow part.
        for hour in range(24):
            sql = 'SELECT COUNT(id) FROM messages WHERE sender="%s" AND day="%s" AND hour="%s"' % (sender[0], day[0], str(hour))
            curs.execute(sql)
            d = curs.fetchone()[0]
            print ' <td>%s</td>' % (d > 0 and str(d) or '')
        print ' </tr>'
    print ' </table></div>'
print ' </body>\n</html>\n'
I'm not sure whether there is a way to combine some of the queries, or whether the data extraction should be approached from a different angle entirely. I have also thought about creating a second table of per-account counts and updating it whenever the original table is updated (a rough sketch of that idea is below). I have been staring at this for too long, so tomorrow I am going to attack it again, hopefully armed with some insight from the experts ;)
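Roughly, that second table would hold pre-aggregated per-sender/day/hour counts, kept in sync by an insert trigger. A sketch of the idea, which I have not actually tried (table and trigger names, and the database path, are placeholders):

import sqlite3

conn = sqlite3.connect('/path/to/maillog.db')   # placeholder path

# Sketch only: a pre-aggregated counts table maintained by an insert trigger.
conn.executescript("""
CREATE TABLE IF NOT EXISTS sender_stats (
    sender TEXT,
    day TEXT,
    hour INTEGER,
    msg_count INTEGER,
    PRIMARY KEY (sender, day, hour)
);

CREATE TRIGGER IF NOT EXISTS messages_count_ins AFTER INSERT ON messages
BEGIN
    INSERT OR IGNORE INTO sender_stats (sender, day, hour, msg_count)
        VALUES (NEW.sender, NEW.day, NEW.hour, 0);
    UPDATE sender_stats SET msg_count = msg_count + 1
        WHERE sender = NEW.sender AND day = NEW.day AND hour = NEW.hour;
END;
""")
conn.commit()

The CGI could then read counts straight out of sender_stats instead of counting rows in messages on every request.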
Edit: using the GROUP BY answer below, I was able to get all the data I needed from the database in a single query. I switched to Perl because its nested hash support worked better for the way I needed to approach this (building a set of HTML tables in a particular order). Here is a snippet of the reworked code (the connection setup and database path shown are placeholders):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Date::Format;   # time2str
use Date::Parse;    # str2time

# Connection setup -- the database path here is just a placeholder.
my $db = DBI->connect("dbi:SQLite:dbname=/path/to/maillog.db", "", "", { RaiseError => 1 });

# One aggregate query replaces all of the per-cell COUNT() calls.
my %data;
my $rows = $db->selectall_arrayref(
    "SELECT COUNT(id),sender,day,hour FROM messages GROUP BY sender,day,hour ORDER BY sender,day,hour");
for my $row (@$rows) {
    my ($ct, $se, $dy, $hr) = @$row;
    $data{$se}{$dy}{$hr} = $ct;
}

for my $se (keys %data) {
    print "Sender: $se\n";
    for my $dy (keys %{$data{$se}}) {
        print "Day: ", time2str('%a', str2time("$dy 2010")), " $dy\n";
        for my $hr (keys %{$data{$se}{$dy}}) {
            print "Hour: $hr = " . $data{$se}{$dy}{$hr} . "\n";
        }
    }
    print "\n";
}
What once took around 28.024 seconds now takes 0.415 seconds!
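For anyone who would rather stay in Python, the same single-query approach can be sketched with nested defaultdicts. This is just an illustration of the GROUP BY idea, not the code I actually run, and the database path is again a placeholder:

import sqlite3
from collections import defaultdict

conn = sqlite3.connect('/path/to/maillog.db')   # placeholder path
curs = conn.cursor()

# One aggregate query instead of a COUNT() per sender/day/hour cell.
curs.execute('SELECT COUNT(id), sender, day, hour FROM messages '
             'GROUP BY sender, day, hour ORDER BY sender, day, hour')

# data[sender][day][hour] = count
data = defaultdict(lambda: defaultdict(dict))
for ct, se, dy, hr in curs.fetchall():
    data[se][dy][hr] = ct

for se in sorted(data):
    print 'Sender: %s' % se
    for dy in sorted(data[se]):
        print 'Day: %s' % dy
        for hr in sorted(data[se][dy]):
            print 'Hour: %s = %s' % (hr, data[se][dy][hr])
    print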