I am new to Python and I am writing a script to convert between some proprietary markup formats. I iterate line by line over the files and then do a large number (100-200) of replacements, which mainly fall into four categories:
line = line.replace("-","<EMDASH>")
line = line.replace("<\\@>","@")
line = line.replace("<\\n>","")
line = line.replace("\xe1","•")
The str.replace() calls seem fairly efficient (they show up low in the profiling output), but is there a better way to do this? I have seen re.sub() with a function as the replacement argument, but I am not sure whether that would be faster; I suppose it depends on what kind of optimizations Python does internally. I thought I would ask for advice before building a big dict that might not help much!
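For what it's worth, the dict idea can be sketched like this: build one compiled alternation over all the literal keys and do a single pass over the line, looking each match up in the dict. The table below is a hypothetical sample using the four replacements shown above, not the full 100-200 from the real script:

```python
import re

# Hypothetical sample of the replacement table (the real script would
# have 100-200 entries).
REPLACEMENTS = {
    "-": "<EMDASH>",
    "<\\@>": "@",
    "<\\n>": "",
    "\xe1": "•",
}

# One alternation over all keys; re.escape makes each literal regex-safe,
# and sorting longest-first keeps multi-character keys from being
# shadowed by shorter ones.
_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
)

def replace_all(line):
    # Single pass over the line; the callback looks up each match.
    return _pattern.sub(lambda m: REPLACEMENTS[m.group(0)], line)
```

Whether this beats a chain of str.replace() calls depends on the data, so it is worth profiling both on real input before committing to either.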
I also do some processing on tags (which look like HTML but are not HTML). I find the tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing parts of the matched tags), for example:
m = re.findall(r"(<[^>]+>)", line)
for tag in m:
    tag_new = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag_new = re.sub(r"\*p\([^\)]*\)", "", tag_new)
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1)
Any thoughts on efficiency here?
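One possible restructuring, assuming the cleanup rules are all regex deletions like the two shown: precompile the patterns once outside the loop, and let a single re.sub() with a callback clean each tag in place, which also removes the need for the findall/replace round trip. This is a sketch of the idea, not the full rule set:

```python
import re

# Compile once, outside the per-line loop.
TAG = re.compile(r"<[^>]+>")

# The two cleanup rules from the question merged into one alternation;
# the real script would extend this (or keep a list of compiled
# patterns) to cover the other ~100 rules.
CLEANUP = re.compile(r"\*[tp]\([^\)]*\)")

def clean_tags(line):
    # Apply the cleanup only inside tags, in one pass over the line.
    return TAG.sub(lambda m: CLEANUP.sub("", m.group(0)), line)
```

The callback approach naturally handles the "only replace if changed" check: unchanged tags are substituted with themselves at no extra cost.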
Thanks!