How can I optimize multiple nested SELECTs in SQLite (w/ Python)?

I've written a CGI script that polls an SQLite database and builds a table of statistics. Below are the source table's schema and a snippet of the corresponding code. Everything works (functionally speaking), but the CGI itself is very slow, since I run several nested SELECT COUNT(id) queries. I believe my best bet for optimization is to ask the SO community, since my time with Google was relatively fruitless.

Table:

    CREATE TABLE messages (
        id TEXT PRIMARY KEY ON CONFLICT REPLACE,
        date TEXT,
        hour INTEGER,
        sender TEXT,
        size INTEGER,
        origin TEXT,
        destination TEXT,
        relay TEXT,
        day TEXT
    );

(Yes, I know the table isn't normalized, but it's populated with excerpts from a mail log ... I was happy just to get the extraction and loading working, let alone normalize anything. I don't think the table structure has much bearing on my question at the moment, but I could be wrong.)

Example line:

    476793200A7|Jan 29 06:04:47|6|admin@mydomain.com|4656|web02.mydomain.pvt|user@example.com|mail01.mydomain.pvt|Jan 29

And, the Python code that builds my tables:

    #!/usr/bin/python
    print 'Content-type: text/html\n\n'

    import re
    import sqlite3
    from datetime import date

    # 'messages.db' is a placeholder; the connection setup was elided
    # from the original snippet.
    conn = sqlite3.connect('messages.db')
    curs = conn.cursor()

    p = re.compile(r'(\w+) (\d+)')  # e.g. 'Jan 29' -> ('Jan', '29')
    d_month = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,
               'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}
    l_wkday = ['Mo','Tu','We','Th','Fr','Sa','Su']

    # Build [day, 'day (weekday)'] pairs for the table's row labels.
    days = []
    curs.execute('SELECT DISTINCT(day) FROM messages ORDER BY day')
    for day in curs.fetchall():
        m = d_month[p.match(day[0]).group(1)]
        d = p.match(day[0]).group(2)
        days.append([day[0], "%s (%s)" %
                     (day[0], l_wkday[date.weekday(date(2010, int(m), int(d)))])])

    curs.execute('SELECT DISTINCT(sender) FROM messages')
    senders = curs.fetchall()
    for sender in senders:
        # One COUNT query per sender for the caption...
        curs.execute('SELECT COUNT(id) FROM messages WHERE sender=?', (sender[0],))
        print ' <div id="'+sender[0]+'">'
        print ' <h1>Stats for Sender: '+sender[0]+'</h1>'
        print ' <table><caption>Total messages in database: %d</caption>' % curs.fetchone()[0]
        print ' <tr><td>&nbsp;</td><th colspan=24>Hour of Day</th></tr>'
        print ' <tr><td class="left">Day</td><th>%s</th></tr>' % '</th><th>'.join(map(str, range(24)))
        for day in days:
            print ' <tr><td>%s</td>' % day[1]
            for hour in range(24):
                # ...and one COUNT query per sender/day/hour cell --
                # this nested loop is where all the time goes.
                curs.execute('SELECT COUNT(id) FROM messages '
                             'WHERE sender=? AND day=? AND hour=?',
                             (sender[0], day[0], hour))
                ct = curs.fetchone()[0]
                print ' <td>%s</td>' % (ct > 0 and str(ct) or '')
            print ' </tr>'
        print ' </table></div>'
    print ' </body>\n</html>\n'

I'm not sure whether there's a way to combine some of the queries, or a different angle for extracting the data. I also thought about creating a second table holding the counts and simply updating it whenever the original table is updated (a sketch of that idea follows below). I've been staring at this for too long, so tomorrow I'm going to attack it again, hopefully with some expert insight ;)
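Something like this is what I have in mind for that second table: a rough, untested sketch where triggers maintain the counts (the table and trigger names and the 'messages.db' path are placeholders, and id's ON CONFLICT REPLACE would need extra care, since a replaced row gets counted twice here):

    import sqlite3

    # Sketch of the 'second table with counts' idea: keep
    # per-sender/day/hour totals in a summary table and let a
    # trigger maintain it as rows arrive.
    conn = sqlite3.connect('messages.db')
    conn.executescript('''
        CREATE TABLE IF NOT EXISTS msg_counts (
            sender TEXT,
            day    TEXT,
            hour   INTEGER,
            ct     INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (sender, day, hour)
        );
        CREATE TRIGGER IF NOT EXISTS msg_counts_ins AFTER INSERT ON messages
        BEGIN
            -- make sure the bucket exists, then bump it
            INSERT OR IGNORE INTO msg_counts (sender, day, hour)
                VALUES (NEW.sender, NEW.day, NEW.hour);
            UPDATE msg_counts SET ct = ct + 1
                WHERE sender = NEW.sender AND day = NEW.day AND hour = NEW.hour;
        END;
    ''')
    conn.commit()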


Edit: using the GROUP BY answer below, I was able to pull all the data I needed from the database in a single query. I switched to Perl, since nested hash support in Python just didn't work as well for the way I needed to approach this (building a set of HTML tables in a particular order). Here is a snippet of the reworked code:

    use DBI;
    use Date::Format qw(time2str);
    use Date::Parse qw(str2time);

    # $db is an open DBI handle to the SQLite database
    my %data;
    my $rows = $db->selectall_arrayref(
        'SELECT COUNT(id),sender,day,hour FROM messages ' .
        'GROUP BY sender,day,hour ORDER BY sender,day,hour');
    for my $row (@$rows) {
        my ($ct, $se, $dy, $hr) = @$row;
        $data{$se}{$dy}{$hr} = $ct;
    }
    for my $se (keys %data) {
        print "Sender: $se\n";
        for my $dy (keys %{$data{$se}}) {
            print "Day: ", time2str('%a', str2time("$dy 2010")), " $dy\n";
            for my $hr (keys %{$data{$se}{$dy}}) {
                print "Hour: $hr = " . $data{$se}{$dy}{$hr} . "\n";
            }
        }
        print "\n";
    }

What once ran for around 28.024s now takes 0.415s!

+4
2 answers

First of all, you can use a GROUP BY clause:

 select count(*), sender from messages group by sender; 

and with this you run one query for all senders instead of one query per sender. Another possibility might be:

 select count(*), sender, day, hour from messages group by sender, day, hour order by sender, day, hour; 

I have not tested it, but at least now you know the GROUP BY clause exists. It should cut the number of queries down, and I think that is the first big step toward better performance.
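As a rough, untested sketch of how you might consume that in Python (curs being your sqlite3 cursor), you could load the grouped counts into a nested dict once, so every table cell becomes a dictionary lookup instead of a query:

    counts = {}
    curs.execute('SELECT COUNT(id), sender, day, hour FROM messages '
                 'GROUP BY sender, day, hour')
    for ct, sender, day, hour in curs.fetchall():
        counts.setdefault(sender, {}).setdefault(day, {})[hour] = ct

    # An empty cell is just a missing key:
    # counts.get(some_sender, {}).get(some_day, {}).get(some_hour, 0)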

Second, create indexes on the columns you search on; in your case: sender, day, and hour.
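For example (the index name is arbitrary; the columns match the grouped query above):

    CREATE INDEX idx_messages_sender_day_hour ON messages (sender, day, hour);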

If that is not enough, use profiling tools to find where the time is actually being spent. You should also consider using fetchmany instead of fetchall to keep memory consumption low. And remember that the sqlite3 module is written in C, so push as much of the work into it as you can.
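A fetchmany loop would look something like this (untested sketch; the chunk size is arbitrary):

    curs.execute('SELECT COUNT(id), sender, day, hour FROM messages '
                 'GROUP BY sender, day, hour')
    while True:
        rows = curs.fetchmany(500)  # only 500 rows in memory at a time
        if not rows:
            break
        for ct, sender, day, hour in rows:
            pass  # accumulate into the nested dict as above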

+3

To get started, create an index:

CREATE INDEX messages_sender_by_day ON messages (sender, day);

(You probably do not need to include an “hour” there.)

If that doesn't help, or if you've already tried it, then please update your question a bit: give us the code that generates your test data and the SQL for all of the indexes on the table.

Maintaining a cache of the counts is pretty common, but I can't tell whether it's needed here.

+1

Source: https://habr.com/ru/post/1300346/

