What is the fastest way to check whether a line starts with a value from a list?

I have thousands of values (currently in a list, but I can convert it to a dictionary or something similar if that helps) and want to compare them against files with millions of lines. I want to keep only the lines in those files that start with one of the values in the list.

What is the fastest way to do this?

My slow code is:

  for line in source_file:
    # Go through all IDs
    for id in my_ids:
      if line.startswith(str(id) + "|"):
        # Replace commas with semicolons and pipes with commas
        target_file.write(line.replace(",",";").replace("|",","))
3 answers

If you are sure that every line starts with id + "|" and that "|" never appears inside an id, you can split each line on "|" and look up the first field directly. For example:

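# Use a set of strings: lookups are O(1) on average, and unlike a
# Python 3 map object it is not exhausted after the first "in" test.
my_id_strs = set(map(str, my_ids))
for line in source_file:
    # Split only at the first "|"; the rest of the line is not needed here.
    first_part = line.split("|", 1)[0]
    if first_part in my_id_strs:
        target_file.write(line.replace(",",";").replace("|",","))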

Hope this helps :)
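A minimal, self-contained sketch of the same idea in Python 3, for illustration only. It assumes the ids are integers and that every wanted line has the form id|rest. The helper name filter_lines and the sample data are made up for this example. It uses str.partition instead of split so that lines without any "|" can be skipped explicitly.

```python
def filter_lines(lines, ids):
    """Yield converted lines whose first '|'-delimited field is in ids."""
    # Build the set once; membership tests are O(1) on average.
    wanted = {str(i) for i in ids}
    for line in lines:
        # partition() splits at the first "|" only; sep is "" if none is found.
        prefix, sep, _rest = line.partition("|")
        if sep and prefix in wanted:
            # Replace commas with semicolons, then pipes with commas.
            yield line.replace(",", ";").replace("|", ",")


# Hypothetical sample data standing in for the real source file.
sample = ["12|a,b|c\n", "123|x|y\n", "7|q\n", "no delimiter\n"]
result = list(filter_lines(sample, [12, 7]))
print(result)  # ['12,a;b,c\n', '7,q\n']
```

With real files, the generator can be passed straight to target_file.writelines(filter_lines(source_file, my_ids)), so the lines are still processed one at a time.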


Use translate to do both replacements in a single pass, and break out of the inner loop as soon as an id matches.

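# Python 3: str.maketrans replaces the Python 2 string.maketrans function.
# The table replaces commas with semicolons and pipes with commas.
trantab = str.maketrans(",|", ";,")

# Build each "id|" prefix once instead of rebuilding it for every line.
ids = ['%d|' % id for id in my_ids]

for line in source_file:
    # Go through all IDs
    for id in ids:
        if line.startswith(id):
            target_file.write(line.translate(trantab))
            # Stop at the first match; the remaining IDs need not be tested.
            break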

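Alternatively, slice off the text before the first "|" and look it up in a set:

# Python 3: str.maketrans replaces the Python 2 string.maketrans function.
# The table replaces commas with semicolons and pipes with commas.
trantab = str.maketrans(",|", ";,")
# Store the IDs as strings so they compare equal to the sliced prefix.
idset = set(str(i) for i in my_ids)

for line in source_file:
    try:
        if line[:line.index('|')] in idset:
            target_file.write(line.translate(trantab))
    except ValueError:
        # No "|" in this line, so it cannot match; skip it.
        pass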
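To see what the translation table does, here is a small sketch written for Python 3, where the table is built with str.maketrans. The sample line is invented. Both substitutions happen in one pass, so a comma that was produced from a pipe is not turned into a semicolon afterwards.

```python
# Map "," -> ";" and "|" -> "," in a single pass over the string.
trantab = str.maketrans(",|", ";,")

line = "42|red,green|blue"  # invented sample line
translated = line.translate(trantab)
print(translated)  # 42,red;green,blue

# Chained replace gives the same result only because commas are
# replaced before pipes; reversing the order changes the output.
chained = line.replace(",", ";").replace("|", ",")
reversed_order = line.replace("|", ",").replace(",", ";")
print(translated == chained)         # True
print(translated == reversed_order)  # False
```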

Use a regular expression. Here is an implementation:

import re

def filterlines(prefixes, lines):
    pattern = "|".join([re.escape(p) for p in prefixes])
    regex = re.compile(pattern)
    for line in lines:
        if regex.match(line):
            yield line

First we build and compile the regular expression (expensive, but done only once); after that, each match is very fast.

Test code for the above:

with open("/usr/share/dict/words") as words:
    prefixes = [line.strip() for line in words]

lines = [
    "zoo this should match",
    "000 this shouldn't match",
]

print(list(filterlines(prefixes, lines)))
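One caveat with a plain prefix alternation: the prefix 12 also matches a line that starts with 123|. For the id| format from the question, one possible variation is to require the delimiter right after the id. This is only a sketch and not part of the answer above; the name filter_id_lines, the sep parameter and the sample data are hypothetical.

```python
import re


def filter_id_lines(ids, lines, sep="|"):
    """Yield lines that start with one of ids followed directly by sep."""
    # Group the alternatives so the separator applies to all of them.
    alternatives = "|".join(re.escape(str(i)) for i in ids)
    regex = re.compile("(?:%s)%s" % (alternatives, re.escape(sep)))
    for line in lines:
        # match() only matches at the beginning of the string.
        if regex.match(line):
            yield line


# Hypothetical sample data.
lines = ["12|keep me", "123|different id", "7|keep me too", "x12|no"]
matched = list(filter_id_lines([12, 7], lines))
print(matched)  # ['12|keep me', '7|keep me too']
```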

Source: https://habr.com/ru/post/1615171/

