How to find the byte position of a specific line in a file

What is the fastest way to find the byte position of a specific line in a file from the command line?

eg.

$ linepos myfile.txt 13
5283

I am writing a parser for CSV with a size of several GB, and if the analyzer stops, I would like to resume work from the last position. The parser is in Python, but even iterating over file.readlines()takes a lot of time since there are millions of lines in the file. I would just like to do it file.seek(int(command.getoutput("linepos myfile.txt %i" % lastrow))), but I cannot find a shell command to do this effectively.

Edit: Sorry for the confusion, but I'm looking for a solution other than Python. I already know how to do this with Python.

+4
source share
3 answers

@chepner :

position = 0  # or wherever you left off last time
try:
    with open('myfile.txt') as file:
        file.seek(position)  # zero in base case
        for line in file:
            position = file.tell() # current seek position in file
            # process the line
except:
    print 'exception occurred at position {}'.format(position)
    raise
+3

, . len -, . ( )

position = 0  # or wherever you left off last time
try:
    with open('myfile.txt') as file:  # don't you go correcting me on naming it file. we don't call file directly anyway!
        file.seek(position)  # zero in base case
        for line in file:
            position += len(line)
            # process the line
except:
    # yes, a naked exception. TWO faux pas in one answer?!?
    print 'exception occurred at position {}'.format(position)
    raise # re-raise to see traceback or what have you
+1

, ,

$ echo -e '#!/bin/bash\necho abracadabra' >/tmp/script
$ pattern=bash
$ sed -rn "0,/$pattern/ {s/^(.*)$pattern.*$/\1/p ;t exit; p; :exit }" /tmp/script \
    | wc -c 
8

, , 1.

NB 1: sed , , , , pattern, 7 ( β†’ #!/bin/), wc -c

$ sed -rn "0,/$pattern/ {s/^(.*)$pattern.*$/\1/p ;t exit; p; :exit }" /tmp/script \
   | hexdump -C
00000000  23 21 2f 62 69 6e 2f 0a                           |#!/bin/.|
00000008

, , EOF. , .

NB 2: , sed . , , .

NB 3: , pattern . pattern, .


Update. Ive .

$ grep -bo bash <<< '#!/bin/bash'
7:bash

GNU grep :

-b, --byte-offset
    Print the 0-based byte offset within the input file before  each  line  of
    output. If -o (--only-matching)  is specified, print the offset of the
    matching part itself.

Id grep, -F, .

$ grep -F '!@##$@#%%^%&*%^&*(^)((**%%^@#' <<<'!@##$@#%%^%&*%^&*(^)((**%%^@#' 
!@##$@#%%^%&*%^&*(^)((**%%^@#
0

Source: https://habr.com/ru/post/1525280/


All Articles