Extract field list from reStructuredText

Let's say I have the following reST:

    Some text ...

    :foo: bar

    Some text ...

I would like to get the field list out of it as a dict, like this:

 {"foo": "bar"} 

I tried using this:

 tree = docutils.core.publish_parts(text) 

It does parse the field list, but all I end up with is some pseudo-XML in tree["whole"]:

    <document source="<string>">
        <docinfo>
            <field>
                <field_name>
                    foo
                <field_body>
                    <paragraph>
                        bar

Since the tree dict does not contain any other useful information, and that value is just a string, I am not sure how to extract the field list from the reST document. How can I do it?

3 answers

You can try something like the following code. Instead of publish_parts I used publish_doctree, which returns the document tree itself rather than its pseudo-XML string rendering. I then convert that tree to an XML DOM and extract all field elements. Finally, I take the first field_name and field_body element of each field element.

    from docutils.core import publish_doctree

    source = """Some text ...

    :foo: bar

    Some text ...
    """

    # Parse reStructuredText input, returning the Docutils doctree as
    # an `xml.dom.minidom.Document` instance.
    doctree = publish_doctree(source).asdom()

    # Get all field elements in the document.
    fields = doctree.getElementsByTagName('field')

    d = {}

    for field in fields:
        # I am assuming that each field has exactly one field_name and
        # one field_body, so taking element [0] is enough.
        field_name = field.getElementsByTagName('field_name')[0]
        field_body = field.getElementsByTagName('field_body')[0]

        d[field_name.firstChild.nodeValue] = \
            " ".join(c.firstChild.nodeValue for c in field_body.childNodes)

    print(d)  # Prints {'foo': 'bar'}

The xml.dom module is not the easiest to work with (for example, why do I need .firstChild.nodeValue rather than just .nodeValue?), so you may prefer the xml.etree.ElementTree module, which I find much easier to use. If you use lxml, you can also use XPath notation to find all field, field_name and field_body elements.
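As a rough illustration of the XPath idea, here is a minimal sketch that uses only the limited XPath subset in the standard library's xml.etree.ElementTree, not lxml. The XML string is a hand-written stand-in for what publish_doctree(source).asdom().toxml() would produce, and fields_from_xml is a hypothetical helper name, so treat both as assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in (an assumption) for the XML that
# publish_doctree(source).asdom().toxml() would emit.
XML = """<document source="&lt;string&gt;">
  <paragraph>Some text ...</paragraph>
  <field_list>
    <field>
      <field_name>foo</field_name>
      <field_body><paragraph>bar</paragraph></field_body>
    </field>
  </field_list>
  <paragraph>Some text ...</paragraph>
</document>"""


def fields_from_xml(xml_text):
    """Collect every <field> element into a {name: body text} dict."""
    root = ET.fromstring(xml_text)
    result = {}
    # ".//field" matches <field> elements at any depth below the root.
    for field in root.findall(".//field"):
        name = field.findtext("field_name")
        body = field.find("field_body")
        # itertext() flattens the text of nested paragraphs.
        result[name] = "".join(body.itertext()).strip()
    return result


print(fields_from_xml(XML))
```

With lxml the same path expressions should work through its own xpath() method, but that is not shown here.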


I have an alternative solution that I find less burdensome, but perhaps more fragile. If you look at the implementation of the Node class at https://sourceforge.net/p/docutils/code/HEAD/tree/trunk/docutils/docutils/nodes.py you will see that it supports a walk method, which can be used to pull out the desired data without creating two different XML representations of your document. Here is what I am using in my prototype code:

https://github.com/h4ck3rm1k3/gcc-introspector/blob/master/peewee_adaptor.py#L33

    import pprint

    import docutils.nodes
    from docutils.core import publish_doctree

and then

    def walk_docstring(prop):
        doc = prop.__doc__
        doctree = publish_doctree(doc)

        class Walker:
            def __init__(self, doc):
                # walk() looks up the document on the visitor.
                self.document = doc
                self.fields = {}

            def dispatch_visit(self, x):
                # Called for every node; only field nodes are of interest.
                if isinstance(x, docutils.nodes.field):
                    field_name = x.children[0].rawsource
                    field_value = x.children[1].rawsource
                    self.fields[field_name] = field_value

        w = Walker(doctree)
        doctree.walk(w)
        # the collected fields I wanted
        pprint.pprint(w.fields)
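The same tree traversal can probably be written without a visitor class. The sketch below is a variation on the code above, not part of it: extract_fields is a hypothetical helper name, and it assumes a docutils version whose nodes offer findall() (falling back to the older traverse() otherwise). It reads the text with astext() instead of rawsource.

```python
from docutils import nodes
from docutils.core import publish_doctree


def extract_fields(source):
    """Return the fields found in a reST string as a {name: value} dict."""
    doctree = publish_doctree(source)
    # Assumption: newer docutils releases provide findall(); older ones
    # only have traverse(), so fall back to that if needed.
    find = getattr(doctree, "findall", None) or doctree.traverse
    fields = {}
    for field in find(nodes.field):
        # A field node has two children: field_name and field_body.
        name, body = field.children[0], field.children[1]
        fields[name.astext()] = body.astext()
    return fields


print(extract_fields("Some text ...\n\n:foo: bar\n\nSome text ...\n"))
```

Using astext() returns the rendered text of each node, which may differ from rawsource when the field body contains inline markup.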

Here is my ElementTree implementation:

    from docutils.core import publish_doctree
    from xml.etree.ElementTree import fromstring

    source = """Some text ...

    :foo: bar

    Some text ...
    """


    def gen_fields(source):
        dom = publish_doctree(source).asdom()
        tree = fromstring(dom.toxml())

        for field in tree.iter(tag='field'):
            name = next(field.iter(tag='field_name'))
            body = next(field.iter(tag='field_body'))
            yield {name.text: ''.join(body.itertext())}

Usage:

    >>> next(gen_fields(source))
    {'foo': 'bar'}
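Note that next() only returns the first field. If the document contains several fields, the one-entry dicts can be merged into a single dict. A minimal sketch: merge_fields is a hypothetical helper name, and the list literal below merely stands in for what a generator such as gen_fields might yield.

```python
def merge_fields(field_dicts):
    """Merge an iterable of one-entry dicts into a single dict."""
    merged = {}
    for entry in field_dicts:
        # A later field with the same name overwrites an earlier one.
        merged.update(entry)
    return merged


# Stand-in data (an assumption) for the output of a field generator.
sample = [{"foo": "bar"}, {"baz": "qux"}]
print(merge_fields(sample))
```

With the generator above, this would be called as merge_fields(gen_fields(source)).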

Source: https://habr.com/ru/post/916744/

