Optimize this Python log-parsing code

The runtime of this code on my laptop for a 4.2 GB input file is 48 seconds. The input file is tab-delimited, with each value wrapped in quotation marks. Each record ends with a newline, for example: '"val1"\t"val2"\t"val3"\t..."valn"\n'

I tried multiprocessing with 10 threads: one to read lines, 8 to parse individual lines and populate an output queue, and one to reduce the output queue into the defaultdict shown below. That code took 300 seconds to run, more than 6 times longer than the following:

    from collections import defaultdict

    def get_users(log):
        users = defaultdict(int)
        f = open(log)
        # Read header line
        h = f.readline().strip().replace('"', '').split('\t')
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        # If either ix_* is the last field in h, it will include a newline.
        # That's fine for now.
        for (i, line) in enumerate(f):
            if i % 1000000 == 0:
                print "Line %d" % i  # progress notification
            l = line.split('\t')
            if l[ix_profile] != '"7"':  # "7" indicates a bad value
                # use list slicing to remove quotes
                users[l[ix_user][1:-1]] += 1
        f.close()
        return users

I verified that I am not I/O-bound by removing everything except the print statement from the for loop. That version ran in 9 seconds, which I take as a lower bound on how fast this code can go.

I have many of these ~5 GB files to process, so even a slight runtime improvement (I know, I can remove the print!) will help. The machine I'm running on has 4 cores, so I can't help wondering whether there is a way to make multithreaded/multiprocess code run faster than the code above.

UPDATE:

I rewrote the multiprocessing code as follows:

    from multiprocessing import Pool, cpu_count
    from collections import defaultdict

    def parse(line, ix_profile=10, ix_user=9):
        """ix_profile and ix_user predetermined; hard-coding for expedience."""
        l = line.split('\t')
        if l[ix_profile] != '"7"':
            return l[ix_user][1:-1]

    def get_users_mp():
        f = open('20110201.txt')
        h = f.readline()  # remove header line
        pool = Pool(processes=cpu_count())
        result_iter = pool.imap_unordered(parse, f, 100)
        users = defaultdict(int)
        for r in result_iter:
            if r is not None:
                users[r] += 1
        return users

It runs in 26 seconds, a 1.85x speedup. Not bad, but with 4 cores not as much as I had hoped for.

+6
7 answers

Use regular expressions.

Testing shows that the expensive part of the process is the call to str.split(). Having to construct a list and a bunch of string objects for every line is probably what makes it expensive.

First, build a regular expression that matches the line. Something like:

    expression = re.compile(r'("[^"]*")\t("[^"]*")\t')

If you call expression.match(line).groups(), you get the first two columns extracted as two string objects, and you can apply the logic to them directly.

Now, this assumes that the two columns of interest are the first two columns. If not, you just have to tweak the expression so it matches the correct columns. Your code checks the header to see where the columns are; you could generate the regex from that, but I'm going to guess the columns really are always in the same place. Just verify that they are still there, and use the regex on the lines.

EDIT

    from collections import defaultdict
    import re

    def get_users(log):
        f = open(log)
        # Read header line
        h = f.readline().strip().replace('"', '').split('\t')
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        assert ix_user < ix_profile

This code assumes that the user column comes before the profile column.

        keep_field = r'"([^"]*)"'

This regex captures a single column.

        skip_field = r'"[^"]*"'

This regex matches the column but does not capture the result. (Note the lack of parentheses.)

        fields = [skip_field] * len(h)
        fields[ix_profile] = keep_field
        fields[ix_user] = keep_field

Create a list of patterns for all the fields, capturing only the ones we need.

        del fields[max(ix_profile, ix_user)+1:]

Drop all the fields after the ones we care about (they take time to match, and we don't care about them).

        regex = re.compile(r"\t".join(fields))

Actually create a regex.

        users = defaultdict(int)
        for line in f:
            user, profile = regex.match(line).groups()

Pull out the two values and apply the logic.

  if profile != "7": # "7" indicates a bad value # use list slicing to remove quotes users[user] += 1 f.close() return users 
+4

If you're on Unix or Cygwin, the following little script will produce the frequency of user IDs where the profile != 7. Should be fast.

Updated with awk to count user IDs

    #!/bin/bash

    FILENAME="test.txt"

    IX_PROFILE=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.type | cut -f1`
    IX_USER=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.id | cut -f1`

    # Just the userids
    # sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2

    # userids counted:
    # sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | sort | uniq -c

    # Count using awk..?
    sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | awk '{ count[$1]++; } END { for (x in count) { print x "\t" count[x] } }'
+2

Since your log file is tab-delimited, you can use the csv module, with the dialect='excel-tab' argument, to gain both performance and readability. That is, of course, if you have to use Python instead of the faster console commands.
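A minimal sketch of what that could look like (my assumptions: the same header field names as in the question's code, and Python 2's csv module, which wants the file opened in binary mode; the reader strips the surrounding double quotes itself, so the bad value is compared as plain 7):

    import csv
    from collections import defaultdict

    def get_users_csv(log):
        # Hypothetical csv-based variant of get_users(); a sketch, not a benchmark.
        users = defaultdict(int)
        f = open(log, 'rb')
        reader = csv.reader(f, dialect='excel-tab')
        h = reader.next()  # header row; quotes already stripped by the reader
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        for row in reader:
            if row[ix_profile] != '7':  # "7" indicates a bad value
                users[row[ix_user]] += 1
        f.close()
        return users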

+1

If regular expressions can speed things up that much simply by ignoring the tail of the line that doesn't need to be split, perhaps an even simpler approach will help:

    [snip]

    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    maxsplits = max(ix_profile, ix_user) + 1  #### new statement ####
    # If either ix_* is the last field in h, it will include a newline.
    # That's fine for now.
    for (i, line) in enumerate(f):
        if i % 1000000 == 0:
            print "Line %d" % i  # progress notification
        l = line.split('\t', maxsplits)  #### changed line ####

    [snip]

Please give it a whirl on your data.

+1

Perhaps you can do

 users[l[ix_user]] += 1 

instead of

 users[l[ix_user][1:-1]] += 1 

and strip the quotes from the dict keys at the end. That should save some time.
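If I read that suggestion right, the final pass could be a one-liner like this (a sketch; it assumes every key carries exactly one pair of surrounding quotes, as in the sample data):

    # Strip the surrounding quotes once per distinct user instead of once per line.
    users = dict((k[1:-1], v) for k, v in users.iteritems())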

As for the multithreaded approach: try reading several thousand lines from the file at a time and handing those thousands of lines to a thread to process. Doing it one line at a time seems to carry too much overhead.
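A rough sketch of that batching idea, using processes rather than raw threads (assumptions on my part: the hard-coded column indices from the question's update, and a made-up chunk size; each worker reduces its chunk locally, so only one small dict per chunk crosses the process boundary):

    from multiprocessing import Pool, cpu_count
    from collections import defaultdict
    from itertools import islice

    def parse_chunk(lines, ix_profile=10, ix_user=9):
        # Count users within one chunk; returns a small per-chunk dict.
        counts = defaultdict(int)
        for line in lines:
            l = line.split('\t')
            if l[ix_profile] != '"7"':
                counts[l[ix_user][1:-1]] += 1
        return dict(counts)

    def get_users_chunked(log, chunk_size=50000):
        users = defaultdict(int)
        f = open(log)
        f.readline()  # skip header line
        pool = Pool(processes=cpu_count())

        def chunks(fh):
            # Yield lists of chunk_size lines until the file is exhausted.
            while True:
                chunk = list(islice(fh, chunk_size))
                if not chunk:
                    return
                yield chunk

        for partial in pool.imap_unordered(parse_chunk, chunks(f)):
            for user, n in partial.iteritems():
                users[user] += n
        pool.close()
        pool.join()
        f.close()
        return users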

Or read this article on a solution, as it seems to do something very similar to what you're trying to do.

0

This may be a bit beside the point, but Python has extremely strange behavior when dealing with multiple threads (and it's especially bad when the threads are not I/O-bound). More specifically, it sometimes runs much slower than single-threaded code. This is because the global interpreter lock (GIL) in Python ensures that no more than one thread can execute in the Python interpreter at any given time.

Because only one thread can actually use the interpreter at any given time, the fact that you have multiple cores will not help you. In fact, it can make things considerably worse, due to some pathological interactions between two threads contending for the GIL. If you want to stick with Python, you have one of two options:

  • Try Python 3.2 (or later; 3.0 will not do). It handles the GIL in a completely different way, which fixes the multithreaded slowdown in several cases. I assume you are not on the Python 3 series, since you are using the old print statement.
  • Use processes instead of threads. Since child processes share open file descriptors, you don't really need to pass any state between processes once you have started on the file (you could use pipes or message passing if you really need to). This increases the initial startup cost a little, because processes take longer to create than threads, but you avoid the GIL problem.

If you want more information about this wonderfully mysterious bit of Python, take a look at the GIL-related talks on this page: http://www.dabeaz.com/talks.html

0

It turns out I had almost the same idea as Winston Evert: build a regular expression.

But my regex:

  • handles the case ix_profile < ix_user as well as the case ix_profile > ix_user

  • captures only the user column: the profile column is matched by the subpattern '"(?!7")[^\t\r\n"]*"', which fails to match when "7" is in that column; hence we obtain the right user with a single capturing group (see the small demo after this list)

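A tiny demonstration of that lookahead behavior, on made-up two-field lines (profile first, then user):

    import re

    pat = re.compile(r'"(?!7")[^\t\r\n"]*"\t"([^\t\r\n"]*)"')

    print pat.match('"7"\t"brad"')              # None: a profile of "7" is rejected
    print pat.match('"42"\t"brad"').group(1)    # brad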

In addition, I tested several matching and extraction algorithms:

1) with re.finditer()

2) with re.match(), with a regex that matches all 40 fields

3) with re.match(), with a regex that matches only max(ix_profile, ix_user) + 1 fields

4) like 3), but with a plain dict instead of a defaultdict instance

To measure the times, my code creates a test file based on the information you gave about its contents.


I tested the following 4 functions, each in its own script:

1

    def get_users_short_1(log):
        users_short = defaultdict(int)
        f = open(log)
        # Read header line
        h = f.readline().strip().replace('"', '').split('\t')
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        # If either ix_* is the last field in h, it will include a newline.
        # That's fine for now.
        glo = 40*['[^\t]*']
        glo[ix_profile] = '"(?!7")[^\t"]+"'
        glo[ix_user] = '"([^\t"]*)"'
        glo[39] = '"[^\t\r\n]*"'
        regx = re.compile('^'+'\t'.join(glo), re.MULTILINE)
        content = f.read()
        for mat in regx.finditer(content):
            users_short[mat.group(1)] += 1
        f.close()
        return users_short

2

    def get_users_short_2(log):
        users_short = defaultdict(int)
        f = open(log)
        # Read header line
        h = f.readline().strip().replace('"', '').split('\t')
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        # If either ix_* is the last field in h, it will include a newline.
        # That's fine for now.
        glo = 40*['[^\t]*']
        glo[ix_profile] = '"(?!7")[^\t"]*"'
        glo[ix_user] = '"([^\t"]*)"'
        regx = re.compile('\t'.join(glo))
        for line in f:
            gugu = regx.match(line)
            if gugu:
                users_short[gugu.group(1)] += 1
        f.close()
        return users_short

3

    def get_users_short_3(log):
        users_short = defaultdict(int)
        f = open(log)
        # Read header line
        h = f.readline().strip().replace('"', '').split('\t')
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        # If either ix_* is the last field in h, it will include a newline.
        # That's fine for now.
        glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
        glo[ix_profile] = '"(?!7")[^\t"]*"'
        glo[ix_user] = '"([^\t"]*)"'
        regx = re.compile('\t'.join(glo))
        for line in f:
            gugu = regx.match(line)
            if gugu:
                users_short[gugu.group(1)] += 1
        f.close()
        return users_short

4

The full code of script 4, which seems to be the fastest:

    import re
    from random import choice, randint, sample
    import csv
    import random
    from time import clock

    choi = 1
    if choi:
        ntot = 1000
        chars = 'abcdefghijklmnopqrstuvwxyz0123456789'

        def ry(a=30, b=80, chars=chars, nom='abcdefghijklmnopqrstuvwxyz'):
            if a==30:
                return ''.join(choice(chars) for i in xrange(randint(30,80)))
            else:
                return ''.join(choice(nom) for i in xrange(randint(8,12)))

        num = sample(xrange(1000), 200)
        num.sort()
        print 'num==', num

        several = [e//3 for e in xrange(0,800,7) if e//3 not in num]
        print
        print 'several==', several

        with open('biggy.txt','w') as f:
            head = ('aaa','bbb','ccc','ddd','profile.id','fff','ggg','hhhh','profile.type','iiii',
                    'jjj','kkkk','lll','mmm','nnn','ooo','ppp','qq','rr','ss',
                    'tt','uu','vv','ww','xx','yy','zz','razr','fgh','ty',
                    'kfgh','zer','sdfs','fghf','dfdf','zerzre','jkljkl','vbcvb','kljlk','dhhdh')
            f.write('\t'.join(head)+'\n')
            for i in xrange(1000):
                li = [ ry(a=8).join('""') if n==4 else ry().join('""') for n in xrange(40) ]
                if i in num:
                    li[4] = '@#~&=*;'
                    li[8] = '"7"'
                if i in several:
                    li[4] = '"BRAD"'
                f.write('\t'.join(li)+'\n')

    from collections import defaultdict

    def get_users(log):
        users = defaultdict(int)
        f = open(log)
        # Read header line
        h = f.readline().strip().replace('"', '').split('\t')
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        # If either ix_* is the last field in h, it will include a newline.
        # That's fine for now.
        for (i, line) in enumerate(f):
            #if i % 1000000 == 0: print "Line %d" % i  # progress notification
            l = line.split('\t')
            if l[ix_profile] != '"7"':  # "7" indicates a bad value
                # use list slicing to remove quotes
                users[l[ix_user][1:-1]] += 1
        f.close()
        return users

    def get_users_short_4(log):
        users_short = {}
        f = open(log)
        # Read header line
        h = f.readline().strip().replace('"', '').split('\t')
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        # If either ix_* is the last field in h, it will include a newline.
        # That's fine for now.
        glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
        glo[ix_profile] = '"(?!7")[^\t"]*"'
        glo[ix_user] = '"([^\t"]*)"'
        regx = re.compile('\t'.join(glo))
        for line in f:
            gugu = regx.match(line)
            if gugu:
                gugugroup = gugu.group(1)
                if gugugroup in users_short:
                    users_short[gugugroup] += 1
                else:
                    users_short[gugugroup] = 1
        f.close()
        return users_short

    print '\n\n'

    te = clock()
    USERS = get_users('biggy.txt')
    t1 = clock()-te

    te = clock()
    USERS_short_4 = get_users_short_4('biggy.txt')
    t2 = clock()-te

    if choi:
        print '\nlen(num)==', len(num), ' : number of lines with ix_profile==\'"7"\''
        print "USERS['BRAD']==", USERS['BRAD']
        print 'then :'
        print str(ntot)+' lines - '+str(len(num))+' incorrect - '+str(len(several))+\
              ' identical + 1 user BRAD = '+str(ntot - len(num)-len(several)+1)

    print '\nlen(USERS)==', len(USERS)
    print 'len(USERS_short_4)==', len(USERS_short_4)
    print 'USERS == USERS_short_4 is', USERS == USERS_short_4
    print '\n----------------------------------------'
    print 'time of get_users() :\n', t1, '\n----------------------------------------'
    print 'time of get_users_short_4 :\n', t2, '\n----------------------------------------'
    print 'get_users_short_4() / get_users() = '+str(100*t2/t1)+' %'
    print '----------------------------------------'

One run of script 4 gives, for example:

    num== [2, 12, 16, 23, 26, 33, 38, 40, 43, 45, 51, 53, 84, 89, 93, 106, 116, 117, 123, 131, 132, 135, 136, 138, 146, 148, 152, 157, 164, 168, 173, 176, 179, 189, 191, 193, 195, 199, 200, 208, 216, 222, 224, 227, 233, 242, 244, 245, 247, 248, 251, 255, 256, 261, 262, 266, 276, 278, 291, 296, 298, 305, 307, 308, 310, 312, 314, 320, 324, 327, 335, 337, 340, 343, 350, 356, 362, 370, 375, 379, 382, 385, 387, 409, 413, 415, 419, 433, 441, 443, 444, 446, 459, 462, 474, 489, 492, 496, 505, 509, 511, 512, 518, 523, 541, 546, 548, 550, 552, 558, 565, 566, 572, 585, 586, 593, 595, 601, 609, 610, 615, 628, 632, 634, 638, 642, 645, 646, 651, 654, 657, 660, 662, 665, 670, 671, 680, 682, 687, 688, 690, 692, 695, 703, 708, 716, 717, 728, 729, 735, 739, 741, 742, 765, 769, 772, 778, 790, 792, 797, 801, 808, 815, 825, 828, 831, 839, 849, 858, 859, 862, 864, 872, 874, 890, 899, 904, 906, 913, 916, 920, 923, 928, 941, 946, 947, 953, 955, 958, 959, 961, 971, 975, 976, 979, 981, 985, 989, 990, 999]

    several== [0, 4, 7, 9, 11, 14, 18, 21, 25, 28, 30, 32, 35, 37, 39, 42, 44, 46, 49, 56, 58, 60, 63, 65, 67, 70, 72, 74, 77, 79, 81, 86, 88, 91, 95, 98, 100, 102, 105, 107, 109, 112, 114, 119, 121, 126, 128, 130, 133, 137, 140, 142, 144, 147, 149, 151, 154, 156, 158, 161, 163, 165, 170, 172, 175, 177, 182, 184, 186, 196, 198, 203, 205, 207, 210, 212, 214, 217, 219, 221, 226, 228, 231, 235, 238, 240, 249, 252, 254, 259, 263]

    len(num)== 200 : number of lines with ix_profile=='"7"'
    USERS['BRAD']== 91
    then :
    1000 lines - 200 incorrect - 91 identical + 1 user BRAD = 710

    len(USERS)== 710
    len(USERS_short_4)== 710
    USERS == USERS_short_4 is True

    ----------------------------------------
    time of get_users() :
    0.0788686830309
    ----------------------------------------
    time of get_users_short_4 :
    0.0462885646081
    ----------------------------------------
    get_users_short_4() / get_users() = 58.690677756 %
    ----------------------------------------

But the results vary a fair amount from run to run. I got:

    get_users_short_1() / get_users() = 82.957476637 %
    get_users_short_1() / get_users() = 82.3987686867 %
    get_users_short_1() / get_users() = 90.2949842932 %
    get_users_short_1() / get_users() = 78.8063007461 %
    get_users_short_1() / get_users() = 90.4743181768 %
    get_users_short_1() / get_users() = 81.9635560003 %
    get_users_short_1() / get_users() = 83.9418269406 %
    get_users_short_1() / get_users() = 89.4344442255 %

    get_users_short_2() / get_users() = 80.4891442088 %
    get_users_short_2() / get_users() = 69.921943776 %
    get_users_short_2() / get_users() = 81.8006709304 %
    get_users_short_2() / get_users() = 83.6270772928 %
    get_users_short_2() / get_users() = 97.9821084403 %
    get_users_short_2() / get_users() = 84.9307558629 %
    get_users_short_2() / get_users() = 75.9384820018 %
    get_users_short_2() / get_users() = 86.2964748485 %

    get_users_short_3() / get_users() = 69.4332754744 %
    get_users_short_3() / get_users() = 58.5814726668 %
    get_users_short_3() / get_users() = 61.8011476831 %
    get_users_short_3() / get_users() = 67.6925083362 %
    get_users_short_3() / get_users() = 65.1208124156 %
    get_users_short_3() / get_users() = 72.2621727569 %
    get_users_short_3() / get_users() = 70.6957501222 %
    get_users_short_3() / get_users() = 68.5310031226 %
    get_users_short_3() / get_users() = 71.6529128259 %
    get_users_short_3() / get_users() = 71.6153554073 %
    get_users_short_3() / get_users() = 64.7899044975 %
    get_users_short_3() / get_users() = 72.947531363 %
    get_users_short_3() / get_users() = 65.6691965629 %
    get_users_short_3() / get_users() = 61.5194374401 %
    get_users_short_3() / get_users() = 61.8396133666 %
    get_users_short_3() / get_users() = 71.5447862466 %
    get_users_short_3() / get_users() = 74.6710538858 %
    get_users_short_3() / get_users() = 72.9651233485 %

    get_users_short_4() / get_users() = 65.5224210767 %
    get_users_short_4() / get_users() = 65.9023813161 %
    get_users_short_4() / get_users() = 62.8055210129 %
    get_users_short_4() / get_users() = 64.9690049062 %
    get_users_short_4() / get_users() = 61.9050866134 %
    get_users_short_4() / get_users() = 65.8127125992 %
    get_users_short_4() / get_users() = 66.8112344201 %
    get_users_short_4() / get_users() = 57.865635278 %
    get_users_short_4() / get_users() = 62.7937713964 %
    get_users_short_4() / get_users() = 66.3440149528 %
    get_users_short_4() / get_users() = 66.4429530201 %
    get_users_short_4() / get_users() = 66.8692388625 %
    get_users_short_4() / get_users() = 66.5949137537 %
    get_users_short_4() / get_users() = 69.1708488794 %
    get_users_short_4() / get_users() = 59.7129743801 %
    get_users_short_4() / get_users() = 59.755297387 %
    get_users_short_4() / get_users() = 60.6436352185 %
    get_users_short_4() / get_users() = 64.5023727945 %
    get_users_short_4() / get_users() = 64.0153937511 %


I would like to know what results you get with my code on your real file, on your computer, which is certainly more powerful than mine. Please let me know.


EDIT 1

With

    def get_users_short_Machin(log):
        users_short = defaultdict(int)
        f = open(log)
        # Read header line
        h = f.readline().strip().replace('"', '').split('\t')
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        maxsplits = max(ix_profile, ix_user) + 1
        # If either ix_* is the last field in h, it will include a newline.
        # That's fine for now.
        for line in f:
            #if i % 1000000 == 0: print "Line %d" % i  # progress notification
            l = line.split('\t', maxsplits)
            if l[ix_profile] != '"7"':  # "7" indicates a bad value
                # use list slicing to remove quotes
                users_short[l[ix_user][1:-1]] += 1
        f.close()
        return users_short

I got:

    get_users_short_Machin() / get_users() = 60.6771821308 %
    get_users_short_Machin() / get_users() = 71.9300992989 %
    get_users_short_Machin() / get_users() = 85.1695214715 %
    get_users_short_Machin() / get_users() = 72.7722233685 %
    get_users_short_Machin() / get_users() = 73.6311173237 %
    get_users_short_Machin() / get_users() = 86.0848484053 %
    get_users_short_Machin() / get_users() = 75.1661981729 %
    get_users_short_Machin() / get_users() = 72.8888452474 %
    get_users_short_Machin() / get_users() = 76.7185685993 %
    get_users_short_Machin() / get_users() = 82.7007096958 %
    get_users_short_Machin() / get_users() = 71.1678957888 %
    get_users_short_Machin() / get_users() = 71.9845835126 %

Using a simple dict:

    users_short = {}
    .......
    for line in f:
        #if i % 1000000 == 0: print "Line %d" % i  # progress notification
        l = line.split('\t', maxsplits)
        if l[ix_profile] != '"7"':  # "7" indicates a bad value
            # use list slicing to remove quotes
            us = l[ix_user][1:-1]
            if us not in users_short:
                users_short[us] = 1
            else:
                users_short[us] += 1

slightly improves the runtime, but it remains slower than my script 4:

    get_users_short_Machin2() / get_users() = 71.5959919389 %
    get_users_short_Machin2() / get_users() = 71.6118864535 %
    get_users_short_Machin2() / get_users() = 66.3832514274 %
    get_users_short_Machin2() / get_users() = 68.0026407277 %
    get_users_short_Machin2() / get_users() = 67.9853921552 %
    get_users_short_Machin2() / get_users() = 69.8946203037 %
    get_users_short_Machin2() / get_users() = 71.8260030248 %
    get_users_short_Machin2() / get_users() = 78.4243267003 %
    get_users_short_Machin2() / get_users() = 65.7223734428 %
    get_users_short_Machin2() / get_users() = 69.5903935612 %


EDIT 2

The fastest:

    def get_users_short_CSV(log):
        users_short = {}
        f = open(log, 'rb')
        rid = csv.reader(f, delimiter='\t')
        # Read header line
        h = rid.next()
        ix_profile = h.index('profile.type')
        ix_user = h.index('profile.id')
        # If either ix_* is the last field in h, it will include a newline.
        # That's fine for now.
        glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
        glo[ix_profile] = '"(?!7")[^\t\r\n"]*"'
        glo[ix_user] = '"([^\t\r\n"]*)"'
        regx = re.compile('\t'.join(glo))
        for line in f:
            gugu = regx.match(line)
            if gugu:
                gugugroup = gugu.group(1)
                if gugugroup in users_short:
                    users_short[gugugroup] += 1
                else:
                    users_short[gugugroup] = 1
        f.close()
        return users_short

Results:

    get_users_short_CSV() / get_users() = 31.6443901114 %
    get_users_short_CSV() / get_users() = 44.3536176134 %
    get_users_short_CSV() / get_users() = 47.2295100511 %
    get_users_short_CSV() / get_users() = 45.4912200716 %
    get_users_short_CSV() / get_users() = 63.7997241038 %
    get_users_short_CSV() / get_users() = 43.5020255488 %
    get_users_short_CSV() / get_users() = 40.9188320386 %
    get_users_short_CSV() / get_users() = 43.3105062139 %
    get_users_short_CSV() / get_users() = 59.9184895288 %
    get_users_short_CSV() / get_users() = 40.22047881 %
    get_users_short_CSV() / get_users() = 48.3615872543 %
    get_users_short_CSV() / get_users() = 47.0374831251 %
    get_users_short_CSV() / get_users() = 44.5268626789 %
    get_users_short_CSV() / get_users() = 53.1690205938 %
    get_users_short_CSV() / get_users() = 43.4022458372 %


EDIT 3

I tested get_users_short_CSV() with 10,000 lines in the file instead of 1,000:

    len(num)== 2000 : number of lines with ix_profile=='"7"'
    USERS['BRAD']== 95
    then :
    10000 lines - 2000 incorrect - 95 identical + 1 user BRAD = 7906

    len(USERS)== 7906
    len(USERS_short_CSV)== 7906
    USERS == USERS_short_CSV is True

    ----------------------------------------
    time of get_users() :
    0.794919186656
    ----------------------------------------
    time of get_users_short_CSV :
    0.358942826532
    ----------------------------------------
    get_users_short_CSV() / get_users() = 41.5618307521 %
    get_users_short_CSV() / get_users() = 42.2769300584 %
    get_users_short_CSV() / get_users() = 45.154631132 %
    get_users_short_CSV() / get_users() = 44.1596819482 %
    get_users_short_CSV() / get_users() = 30.3192350266 %
    get_users_short_CSV() / get_users() = 34.4856637748 %
    get_users_short_CSV() / get_users() = 43.7461535628 %
    get_users_short_CSV() / get_users() = 41.7577246935 %
    get_users_short_CSV() / get_users() = 41.9092878608 %
    get_users_short_CSV() / get_users() = 44.6772360665 %
    get_users_short_CSV() / get_users() = 42.6770989413 %
0

Source: https://habr.com/ru/post/887398/

