Python: reading part of a text file

Hi all

I am new to Python and to programming. I need to read fragments of a large text file whose format is as follows:

<word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head-"7" relation="ADV"/> 

I need form, lemma and postag. For example, for the line above, I need hibernis, hibernus1 and np---nb-.

How do I tell Python to read until it reaches form, then read ahead to the quotation mark " and take the text between the quotes, i.e. hibernis? I am really struggling with this.

My attempts so far have been to remove the punctuation, split the line, and then extract the information I need from the resulting list. I can get this to work for one line, but I have trouble getting Python to iterate over the whole file. My code is below:

    f=open('blank.txt','r')
    quotes=f.read()
    noquotes=quotes.replace('"','')
    f.close()

    rf=open('blank.txt','w')
    rf.write(noquotes)
    rf.close()

    f=open('blank.txt','r')
    finished = False
    postag=[]
    while not finished:
        line=f.readline()
        words=line.split()
        postag.append(words[4])
        postag.append(words[6])
        postag.append(words[8])
        finished=True

Thank you for your feedback / criticism

thanks

+4
9 answers

If it is XML, use ElementTree to parse it:

    from xml.etree import ElementTree

    line = '<word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head="7" relation="ADV"/>'
    element = ElementTree.fromstring(line)

For each XML element, you can easily extract the name and all attributes:

    >>> element.tag
    'word'
    >>> element.attrib
    {'head': '7', 'form': 'hibernis', 'postag': 'np---nb-', 'lemma': 'hibernus1', 'relation': 'ADV', 'id': '8'}

So, if you have a document with a bunch of word XML elements, something like this will select the information you need:

    from xml.etree import ElementTree

    XML = '''
    <words>
        <word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head="7" relation="ADV"/>
    </words>'''

    root = ElementTree.fromstring(XML)
    for element in root.findall('word'):
        # Each attribute is available by name in element.attrib
        form = element.attrib['form']
        lemma = element.attrib['lemma']
        postag = element.attrib['postag']
        print(form, lemma, postag)

Use parse() instead of fromstring() if you only have a file name.
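As a minimal sketch of the parse() route, the snippet below writes a small sample file and reads it back. The file name words.xml, the helper name read_words and the enclosing <words> root element are assumptions made for illustration; the real file may be structured differently.

```python
import os
import tempfile
from xml.etree import ElementTree

# Hypothetical sample data; the <words> wrapper is an assumption.
SAMPLE = '''<words>
<word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head="7" relation="ADV"/>
</words>'''


def read_words(path):
    """Return a list of (form, lemma, postag) tuples from an XML file."""
    root = ElementTree.parse(path).getroot()
    return [
        (w.get('form'), w.get('lemma'), w.get('postag'))
        for w in root.iter('word')
    ]


# Write the sample to a temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'words.xml')
with open(path, 'w') as f:
    f.write(SAMPLE)

rows = read_words(path)
print(rows)
```

With a real file, only the read_words(path) call would be needed.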

+5

I would suggest using the regex module, re.

Something like this, perhaps?

    #!/usr/bin/python
    import re

    if __name__ == '__main__':
        data = open('x').read()
        # One match per line: form, lemma and postag in that order
        RE = re.compile('.*form="(.*)" lemma="(.*)" postag="(.*?)"', re.M)
        matches = RE.findall(data)
        for m in matches:
            print(m)

This assumes that each <word ...> element sits on a single line, that the attributes appear in exactly that order, and that you do not need full XML parsing.
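If the attribute order cannot be relied on, one possible variation is to collect every name="value" pair on a line into a dictionary and then pick out the keys of interest. This is only a sketch: the helper name extract_attrs is made up for illustration, and it still assumes that values contain no escaped quotes.

```python
import re

# Matches name="value" pairs; assumes values contain no double quotes.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def extract_attrs(line):
    """Return a dict of all name="value" pairs found in one line."""
    return dict(ATTR_RE.findall(line))


# Same element as in the question, with lemma and form swapped on purpose.
line = '<word id="8" lemma="hibernus1" form="hibernis" postag="np---nb-" head="7" relation="ADV"/>'
attrs = extract_attrs(line)
print(attrs['form'], attrs['lemma'], attrs['postag'])
```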

+2

Is your file proper XML? If so, try a SAX parser:

    import xml.sax

    class Handler(xml.sax.ContentHandler):
        def startElement(self, tag, attrs):
            # Called once for every opening tag in the document
            if tag == 'word':
                print('form=', attrs['form'])
                print('lemma=', attrs['lemma'])
                print('postag=', attrs['postag'])

    ch = Handler()
    f = open('myfile')
    xml.sax.parse(f, ch)

(This is rough; it may not be entirely correct.)
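A variation on the same idea, sketched here under assumptions, collects the values in a list instead of printing them, which may be more convenient for later processing. The class name WordHandler and the <words> root element are hypothetical; xml.sax.parseString is used only so the example runs without a file on disk.

```python
import xml.sax


class WordHandler(xml.sax.ContentHandler):
    """Collect (form, lemma, postag) for every <word> element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.words = []

    def startElement(self, tag, attrs):
        if tag == 'word':
            self.words.append((attrs['form'], attrs['lemma'], attrs['postag']))


# Hypothetical sample document; the <words> root is an assumption.
SAMPLE = b'''<words>
<word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head="7" relation="ADV"/>
</words>'''

handler = WordHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.words)
```

Because SAX streams through the input rather than building a tree, this style tends to suit large files.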

+1

In addition to the usual regex answer: since this looks like some form of XML, you could try something like BeautifulSoup ( http://www.crummy.com/software/BeautifulSoup/ ).

It is very easy to use and finds tags and attributes in things like HTML and XML, even if they are not "well formed". It may be worth a look.

+1

Parsing XML by hand is usually the wrong thing to do. For one, your code will break if any of the attributes contains an escaped quote. Getting the attributes from an XML parser is probably cleaner and less error-prone.
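To illustrate the escaped-quote point, here is a small sketch comparing a hand-rolled regex with ElementTree. The element below is a made-up example, not taken from the real data.

```python
import re
from xml.etree import ElementTree

# Hypothetical element whose form attribute contains escaped quotes.
line = '<word id="1" form="&quot;x&quot;" lemma="x" postag="n"/>'

# Hand-rolled extraction returns the raw, still-escaped text.
raw = re.search(r'form="([^"]*)"', line).group(1)

# An XML parser decodes the entity for you.
parsed = ElementTree.fromstring(line).get('form')

print(raw)     # &quot;x&quot;
print(parsed)  # "x"
```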

The approach in the question may also run into problems when parsing the entire file if some lines do not match the format. You can handle this by creating a parse function, something like:

    def parse(line):
        try:
            # return parsed values here
            ...
        except Exception:
            # the line did not match the expected format
            return None

You can also simplify this with the filter and map functions:

    lines = filter(lambda line: parseable(line), f.readlines())
    values = map(parse, lines)
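The filter and map idea can be sketched end to end as follows. The bodies of parseable and parse are assumptions (the answer leaves them unspecified); here both rely on a regex that expects form, lemma and postag in that order.

```python
import re

# Assumed pattern: attributes appear in the order form, lemma, postag.
WORD_RE = re.compile(r'form="([^"]*)".*lemma="([^"]*)".*postag="([^"]*)"')


def parseable(line):
    """True if the line looks like a <word .../> element we can handle."""
    return WORD_RE.search(line) is not None


def parse(line):
    """Return (form, lemma, postag) for a line that parseable() accepted."""
    return WORD_RE.search(line).groups()


# Stand-in for f.readlines(): one good line and two that should be skipped.
lines = [
    '<word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head="7" relation="ADV"/>',
    '',
    'not a word line',
]
values = list(map(parse, filter(parseable, lines)))
print(values)
```

In Python 3, filter and map return lazy iterators, hence the list() call.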
0

Just to highlight your problem:

    finished = False
    counter = 0
    while not finished:
        counter += 1
        finished = True  # set unconditionally, so the loop body runs only once
    print(counter)
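One way the loop from the question might be restructured is to iterate over the file object directly, which ends on its own at end of file. This is a sketch under assumptions: the helper name read_tags is hypothetical, and quotes are replaced with spaces inside the loop (rather than by rewriting the file) so that the wanted values land at indices 4, 6 and 8.

```python
import os
import tempfile


def read_tags(path):
    """Return [form, lemma, postag, ...] gathered from every line of the file."""
    postag = []
    with open(path) as f:
        for line in f:  # runs until end of file; no 'finished' flag needed
            # Assumption: quotes are swapped for spaces, so the values land
            # at indices 4, 6 and 8 of the split line.
            words = line.replace('"', ' ').split()
            if len(words) < 9:
                continue  # skip blank or malformed lines
            postag.extend([words[4], words[6], words[8]])
    return postag


# Self-contained demo: the line from the question, written twice.
LINE = '<word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head-"7" relation="ADV"/>\n'
path = os.path.join(tempfile.mkdtemp(), 'blank.txt')
with open(path, 'w') as f:
    f.write(LINE * 2)

tags = read_tags(path)
print(tags)
```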
0

With regular expressions, this is the gist (you can do the file.readline() part yourself):

    import re

    line = '<word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head-"7" relation="ADV"/>'
    # [^"]* matches everything up to the closing quote of each attribute
    r = re.compile('form="([^"]*)".*lemma="([^"]*)".*postag="([^"]*)"')
    match = r.search(line)
    print(match.groups())

    # Output:
    # ('hibernis', 'hibernus1', 'np---nb-')
0

First, do not spend time rewriting your file; it is generally wasted effort. Cleaning up and parsing the tags is so fast that you will be perfectly happy working from the source file every time.

    source = open("blank.txt", "r")
    for line in source:
        # line has a tag-line structure
        # <word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head-"7" relation="ADV"/>
        # Assumption -- no spaces in the quoted strings.
        # Drop the closing "/>" so it does not end up in the last value.
        parts = line.replace("/>", "").split()
        # parts is ['<word', 'id="8"', 'form="hibernis"', ...]
        assert parts[0] == "<word"
        nameValueList = [part.partition('=') for part in parts[1:]]
        # nameValueList is [('id', '=', '"8"'), ('form', '=', '"hibernis"'), ...]
        # strip('"') removes the surrounding quotes without resorting to eval()
        attrs = dict((n, v.strip('"')) for n, _, v in nameValueList)
        # attrs is {'id': '8', 'form': 'hibernis', ...}
        print(attrs['form'], attrs['lemma'], attrs['postag'])
    source.close()
0

Wow, you guys are fast :) If you want all the attribute values as a list (and the order is known), you can use something like this:

    import re
    print(re.findall('"(.+?)"', INPUT))

INPUT is a string, for example:

 <word id="8" form="hibernis" lemma="hibernus1" postag="np---nb-" head="7" relation="ADV"/> 

and printed list:

 ['8', 'hibernis', 'hibernus1', 'np---nb-', '7', 'ADV'] 
0

Source: https://habr.com/ru/post/1285906/

