Extracting text inside tags in Python

Question

I have a semi-structured .txt file. The file is as follows:

<tags>
    blabla<text>
              I want this
         </text>
    blabla<text>
               And this
           </text>
        bla<text>
                 and this
            </text>blabla
</tags>

I want to extract the text inside each <text> tag. I managed to do this with str.partition and str.replace, but I don't think it is very efficient or elegant.

Here is my code:

with open('collection.txt') as f:
    read_data = f.read()

text1 = read_data.partition("<text>")[2].partition("</text>")[0]
temp1 = read_data.replace(text1,'').replace('<text>','',1).replace('</text>','',1)
text2 = temp1.partition("<text>")[2].partition("</text>")[0]
temp2 = read_data.replace(text2,'').replace('<text>','',2).replace('</text>','',2)
text3 = temp2.partition("<text>")[2].partition("</text>")[0]

BeautifulSoup, ElementTree, and other XML parsers did not work for me. Any suggestions for improving the code? I also tried a compiled regex, but to no avail.
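The partition approach above can be generalized so that it does not need one variable per tag. The following is only a sketch: the helper name extract_between and the inline sample string are illustrative and not part of the original code, and it assumes the tags are not nested.

```python
def extract_between(data, start="<text>", end="</text>"):
    """Collect every substring found between the start and end markers."""
    results = []
    rest = data
    while True:
        # Skip everything up to and including the next opening marker.
        _, found, rest = rest.partition(start)
        if not found:
            break
        # Take the text up to the closing marker and continue after it.
        body, found, rest = rest.partition(end)
        if not found:
            break  # unterminated tag: stop rather than guess
        results.append(body.strip())
    return results


# Inline copy of the sample from the question (hypothetical stand-in for the file).
sample = """<tags>
    blabla<text>
              I want this
         </text>
    blabla<text>
               And this
           </text>
        bla<text>
                 and this
            </text>blabla
</tags>"""

print(extract_between(sample))
# ['I want this', 'And this', 'and this']
```

Each pass consumes the string from left to right, so nothing has to be removed with replace and the number of tags does not need to be known in advance.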

+4

python beautifulsoup text-extraction

gd13 Mar 31 '18 at 8:15

4 answers

If the file is well-formed XML, you can use xml.etree (part of the standard library):

import xml.etree.ElementTree as ET
doc = ET.parse('collection.txt')
print([el.text.strip() for el in doc.findall('.//text')])
# output: ['I want this', 'And this', 'and this']
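The question mentions that XML parsers failed, which may mean the real file is not well-formed; that is an assumption, since the sample shown does parse. A minimal sketch that parses from a string and reports the parser error instead of crashing (text_elements and sample are illustrative names):

```python
import xml.etree.ElementTree as ET


def text_elements(xml_string):
    """Return the stripped text of every <text> element, or None on a parse error."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError as exc:
        # The message includes the line and column where parsing failed.
        print(f"not well-formed XML: {exc}")
        return None
    # el.text is None for an empty element, so fall back to "".
    return [(el.text or "").strip() for el in root.iter("text")]


sample = """<tags>
    blabla<text>
              I want this
         </text>
    blabla<text>
               And this
           </text>
</tags>"""

print(text_elements(sample))
# ['I want this', 'And this']
```

If the real file triggers the error branch, the reported position shows where it stops being valid XML, and one of the string-based approaches may be the better fit.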

+3

phihag Mar 31 '18 at 8:55

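Regex is your best friend!

import re

with open('collection.txt') as f:
    data_txt = f.read()

# Capture everything between <text> and </text>, up to the next '<'
p = re.compile(r'<text>([^<]*)</text>')
result = p.findall(data_txt)
result = [x.strip() for x in result]
print(result)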
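The character class in the pattern above stops at the first `<`, so an entry that itself contains that character is skipped. A possible alternative, sketched here with illustrative names (TEXT_RE, find_text_entries, sample), is a non-greedy match combined with re.DOTALL:

```python
import re

# Non-greedy match; re.DOTALL lets '.' also match the newlines inside an entry.
TEXT_RE = re.compile(r"<text>(.*?)</text>", re.DOTALL)


def find_text_entries(data):
    """Return the stripped contents of every <text>...</text> pair."""
    return [m.strip() for m in TEXT_RE.findall(data)]


# Hypothetical input; the second entry contains a '<' character.
sample = "<tags>a<text>\n  I want this\n </text>b<text>\n  x < y\n </text></tags>"

print(find_text_entries(sample))
# ['I want this', 'x < y']
```

Like any regex approach, this assumes the <text> tags are never nested.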

+1

Brett 7533 Mar 31 '18 at 9:03

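re.findall(r'<text>\s*(.*)\s*</text>', data)

Another solution for this. The capture group returns only the text between the tags; since `.` does not match newlines, each entry must fit on one line.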

+1

Someone Mar 31 '18 at 9:19

Martin Evans · Accepted Answer · 2018-03-31T09:18:32+0000

You can use BeautifulSoup as follows to get all text entries:

from bs4 import BeautifulSoup

with open('collection.txt') as f:
    read_data = f.read()

soup = BeautifulSoup(read_data, 'xml')

for text in soup.find_all('text'):
    print(text.get_text(strip=True))

This gives you:

I want this
And this
and this
