I have a semi-structured .txt file. The file is as follows:
<tags>
blabla<text>
I want this
</text>
blabla<text>
And this
</text>
bla<text>
and this
</text>blabla
</tags>
I want to get the text inside the <text> tags. I managed to do this using str.partition and str.replace, but I don’t think it is very efficient or elegant.
Here is my code:
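with open('collection.txt') as f:
    read_data = f.read()
# First block: everything between the first <text> and the first </text>
text1 = read_data.partition("<text>")[2].partition("</text>")[0]
# Remove the first block and its tag pair, then take the next block
temp1 = read_data.replace(text1,'').replace('<text>','',1).replace('</text>','',1)
text2 = temp1.partition("<text>")[2].partition("</text>")[0]
# Remove the second block and the first two tag pairs to reach the third block
temp2 = read_data.replace(text2,'').replace('<text>','',2).replace('</text>','',2)
text3 = temp2.partition("<text>")[2].partition("</text>")[0]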
BeautifulSoup, ElementTree, and other XML parsers did not work. Any suggestions for improving the code? I also tried compiling a regex, but to no avail.
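
For reference, here is a minimal sketch of what the regex route might look like. It assumes the <text> tags are never nested and are always closed. The names `sample` and `extract_text_blocks` are illustrative only and not part of the code above, and the sample string stands in for the contents of collection.txt.

```python
import re

# Stand-in for the file contents (assumption: same layout as the sample above)
sample = """<tags>
blabla<text>
I want this
</text>
blabla<text>
And this
</text>
bla<text>
and this
</text>blabla
</tags>"""


def extract_text_blocks(data):
    """Return the stripped contents of every <text>...</text> pair in data."""
    # (.*?) is non-greedy, so each match stops at the nearest </text>;
    # re.DOTALL lets "." match newlines, since the blocks span several lines.
    matches = re.findall(r"<text>(.*?)</text>", data, flags=re.DOTALL)
    return [m.strip() for m in matches]


print(extract_text_blocks(sample))
# → ['I want this', 'And this', 'and this']
```

Unlike the partition/replace chain, this does not need one variable per block, so it should handle any number of <text> sections.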