Removing hash comments that are not enclosed in quotation marks

Question

I am using Python to go through a file and remove any comments. A comment is defined as a hash and everything to its right, as long as the hash is not inside double quotes. I currently have a solution, but it seems suboptimal:

    filelines = []
    r = re.compile('(".*?")')
    for line in f:
        m = r.split(line)
        nline = ''
        for token in m:
            if token.find('#') != -1 and token[0] != '"':
                nline += token[:token.find('#')]
                break
            else:
                nline += token
        filelines.append(nline)

Is there a way to find the first hash not inside quotes without for loops (i.e. with regular expressions)?

Examples:

    ' "Phone #":"555-1234" '         -> ' "Phone #":"555-1234" '
    ' "Phone "#:"555-1234" '         -> ' "Phone "'
    '#"Phone #":"555-1234" '         -> ''
    ' "Phone #":"555-1234" #Comment' -> ' "Phone #":"555-1234" '

Edit: Here is a clean regex solution created by user2357112. I tested it and it works great:

    filelines = []
    r = re.compile('(?:"[^"]*"|[^"#])*(#)')
    for line in f:
        m = r.match(line)
        if m is not None:
            filelines.append(line[:m.start(1)])
        else:
            filelines.append(line)

See his answer for more details on how this regular expression works.

Edit2: Here is a version of user2357112's code that I modified to account for escape characters (\). It also eliminates the "if" by letting the capture group match the end of the line ($) as well:

    filelines = []
    r = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')
    for line in f:
        m = r.match(line)
        filelines.append(line[:m.start(1)])
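For reference, here is a minimal sketch that wraps the Edit2 pattern in a function and runs it over the examples from the question. The names COMMENT_RE and strip_comment are made up for illustration, and the fallback for a failed match is an added assumption: with an unbalanced quote the pattern may not match at all, so the line is returned unchanged in that case.

```python
import re

# Pattern from Edit2: skip quoted strings (backslash escapes allowed) and
# ordinary characters, then capture the first bare '#' or the end of the line.
COMMENT_RE = re.compile(r'(?:"(?:[^"\\]|\\.)*"|[^"#])*(#|$)')

def strip_comment(line):
    """Return line with an unquoted hash comment removed (hypothetical helper)."""
    m = COMMENT_RE.match(line)
    if m is None:
        # Assumption: leave lines with unbalanced quotes untouched.
        return line
    return line[:m.start(1)]

examples = [
    ' "Phone #":"555-1234" ',
    ' "Phone "#:"555-1234" ',
    '#"Phone #":"555-1234" ',
    ' "Phone #":"555-1234" #Comment',
    r' "say \"hi\" #1" # trailing comment',
]
for line in examples:
    print(repr(line), '->', repr(strip_comment(line)))
```

Note that when a comment is removed from a line read from a file, the trailing newline goes with it, because the newline sits to the right of the hash.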

+4

python comments regex quotes strip

RPGillespie Jul 22 '13 at 15:12

3 answers

You can remove comments using this script:

    import re
    print(re.sub(r'("(?:[^"]+|(?<=\\)")*")|#[^\n]*',
                 lambda m: m.group(1) or '',
                 '"Phone #"#:"555-1234"'))  # prints "Phone #"

The idea is to capture the parts in double quotes and replace them with themselves before looking for a hash:

    (               # open capture group 1
        "           # "
        (?:         # open a non-capturing group
            [^"]+       # all characters except "
          |             # OR
            (?<=\\)"    # an escaped quote
        )*          # repeat zero or more times
        "           # "
    )               # close capture group 1
    |               # OR
    #[^\n]*         # a hash followed by zero or more characters that are not a newline
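The same substitution idea can be sketched for a multi-line string. This is only an illustration under a few assumptions: the names PATTERN and strip_comments are invented here, and the quoted-string part of the pattern is swapped for the escape-aware sub-pattern from the question's second edit rather than the one shown above.

```python
import re

# Group 1 captures a double-quoted string (backslash escapes allowed);
# the second alternative matches a hash comment up to the end of the line.
PATTERN = re.compile(r'("(?:[^"\\]|\\.)*")|#[^\n]*')

def strip_comments(text):
    """Remove unquoted hash comments from text (hypothetical helper)."""
    # Quoted strings are put back unchanged; comments are replaced by ''.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

sample = 'name = "A #1" # first\n# whole-line comment\nphone = "555-1234"\n'
print(strip_comments(sample))
```

Because the comment alternative stops before the newline, line breaks are preserved and a whole-line comment leaves an empty line behind.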

0

Casimir et Hippolyte Jul 22 '13 at 15:40

This code was so ugly, I had to post it.

    def remove_comments(text):
        char_list = list(text)
        in_str = False
        deleting = False
        for i, c in enumerate(char_list):
            if deleting:
                if c == '\n':
                    deleting = False
                else:
                    char_list[i] = None
            elif c == '"':
                in_str = not in_str
            elif c == '#':
                if not in_str:
                    deleting = True
                    char_list[i] = None
        char_list = filter(lambda x: x is not None, char_list)
        return ''.join(char_list)

Seems to work, although I'm not sure how it handles the difference in newline characters between Windows and Linux.
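On the newline question: in the loop above only '\n' ends a comment, so with Windows line endings the '\r' is deleted along with the comment text. A possible variation, offered only as a sketch and not as the answer's own code, ends the comment at either '\r' or '\n' and builds the output list directly:

```python
def remove_comments(text):
    """Character-by-character variant that keeps both \\r\\n and \\n endings."""
    out = []
    in_str = False      # True while inside a double-quoted string
    deleting = False    # True while inside a comment
    for c in text:
        if deleting:
            if c in '\r\n':
                # Either line-ending character ends the comment and is kept.
                deleting = False
                out.append(c)
        elif c == '"':
            in_str = not in_str
            out.append(c)
        elif c == '#' and not in_str:
            deleting = True
        else:
            out.append(c)
    return ''.join(out)

print(repr(remove_comments('a = 1 # note\r\nb = "x#y"\n')))
```

Like the original, this does not handle backslash-escaped quotes, and the in-string state carries across lines if a quote is left unclosed.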

0

RussW Jul 22 '13 at 16:10

user2357112 · Accepted Answer · 2013-07-22T15:36:50+0000

    r'''(?:        # Non-capturing group
          "[^"]*"  # A quote, followed by not-quotes, followed by a quote
          |        # or
          [^"#]    # not a quote or a hash
        )          # end group
        *          # Match quoted strings and not-quote-not-hash characters until...
        (\#)       # the comment begins! (escaped, since re.VERBOSE treats a bare # as a comment)
     '''

This is a verbose-style regex designed to operate on a single line, so make sure you use the re.VERBOSE flag and feed it one line at a time. It captures the first unquoted hash as group 1, if there is one, so you can use match.start(1) to get its index. It does not handle backslash escapes, in case you want to be able to put a backslash-escaped quote inside a string. This is untested.
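A minimal sketch of how such a verbose pattern might be compiled and used with match.start(1). The names HASH_RE and comment_index are invented for illustration; the hash in the capture group is written as \# because under re.VERBOSE an unescaped # outside a character class starts a comment.

```python
import re

HASH_RE = re.compile(r'''
    (?:            # non-capturing group
        "[^"]*"    # a quoted string (no escape handling)
      |            # or
        [^"#]      # any character that is not a quote or a hash
    )*             # repeated as often as possible
    (\#)           # group 1: the first unquoted hash
''', re.VERBOSE)

def comment_index(line):
    """Index of the first unquoted '#', or None if there is none (hypothetical helper)."""
    m = HASH_RE.match(line)
    return m.start(1) if m else None

line = ' "Phone #":"555-1234" #Comment'
idx = comment_index(line)
# Compare against None explicitly, since an index of 0 is a valid result.
print(repr(line if idx is None else line[:idx]))
```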
