How to convert a coreNLP parse tree to the data.tree R package format

I would like to convert the parse tree generated by the coreNLP R package into the data.tree format. The parse tree is created using the following code:

options(java.parameters = "-Xmx2g")
library(NLP)
library(coreNLP)
#initCoreNLP() # change this if downloaded to non-standard location
initCoreNLP(annotators = "tokenize,ssplit,pos,lemma,parse")
## Some text.
s <- c("A rare black squirrel has become a regular visitor to a suburban garden.")
s <- as.String(s)


anno<-annotateString(s)
parse_tree <- getParse(anno)
parse_tree

The output parse tree is as follows:
> parse_tree
[1] "(ROOT\r\n  (S\r\n    (NP (DT A) (JJ rare) (JJ black) (NN squirrel))\r\n    (VP (VBZ has)\r\n      (VP (VBN become)\r\n        (NP (DT a) (JJ regular) (NN visitor))\r\n        (PP (TO to)\r\n          (NP (DT a) (JJ suburban) (NN garden)))))\r\n    (. .)))\r\n\r\n"

I found the post "Visualize the structure of the parse tree". It converts the parse tree created by the openNLP package into a tree plot. However, that parse tree differs from the one generated by coreNLP, and the solution does not produce the data.tree format that I want.

EDIT: By adding the two lines below, we can use the function provided in the post "Visualize the structure of the parse tree".

# this step modifies coreNLP parse tree to mimic openNLP parse tree
parse_tree <- gsub("[\r\n]", "", parse_tree)
parse_tree <- gsub("ROOT", "TOP", parse_tree)

library(igraph)
library(NLP)

# parse2graph() is the function defined in the linked post; it is not part of igraph or NLP
parse2graph(parse_tree,  # plus optional graphing parameters
            title = sprintf("'%s'", s), margin=-0.05,
            vertex.color=NA, vertex.frame.color=NA,
            vertex.label.font=2, vertex.label.cex=1.5, asp=0.5,
            edge.width=1.5, edge.color='black', edge.arrow.size=0)
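
To see what the two normalization lines do without starting coreNLP, here is a minimal sketch on a shortened, made-up parse string (the string `raw` is an assumption for illustration, not real coreNLP output). It uses an anchored `sub()` rather than the `gsub("ROOT", ...)` above, which is my own variation: it only touches the leading root label, so a sentence that happens to contain the word ROOT would be left alone.

```r
# Shortened, hypothetical coreNLP-style parse string (Windows line endings)
raw <- "(ROOT\r\n  (S\r\n    (NP (DT A) (NN squirrel))\r\n    (. .)))\r\n\r\n"

# Step 1: drop carriage returns and newlines so the tree is on one line
flat <- gsub("[\r\n]", "", raw)

# Step 2: rename only the leading root label to TOP, as openNLP would name it
flat <- sub("^\\(ROOT", "(TOP", flat)

print(flat)
```

The indentation spaces from the original layout remain in the string; as far as I can tell they are harmless, since the later tokenizing step ignores spaces.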

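However, this only produces an igraph plot. I would still like the result in data.tree format so that I can work with it using data.tree.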

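Once you have the edge list, converting it to data.tree is straightforward. Only the last lines of the parse2graph function need to change: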

parse2tree <- function(ptext) {
  stopifnot(require(NLP) && require(data.tree))  # Tree_parse() comes from NLP, FromDataFrameNetwork() from data.tree

  ## Replace words with unique versions
  ms <- gregexpr("[^() ]+", ptext)                                      # just ignoring spaces and brackets?
  words <- regmatches(ptext, ms)[[1]]                                   # just words
  regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words))))  # add id to words

  ## Going to construct an edgelist and pass that to data.tree
  ## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
  edgelist <- matrix('', nrow=length(words)-2, ncol=2)

  ## Function to fill in edgelist in place
  edgemaker <- (function() {
    i <- 0                                       # row counter
    g <- function(node) {                        # the recursive function
      if (inherits(node, "Tree")) {            # only recurse subtrees
        if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
          for (child in node$children) {
            childval <- if(inherits(child, "Tree")) child$value else child
            i <<- i+1
            edgelist[i,1:2] <<- c(val, childval)
          }
        }
        invisible(lapply(node$children, g))
      }
    }
  })()

  ## Create the edgelist from the parse tree
  edgemaker(Tree_parse(ptext))
  tree <- FromDataFrameNetwork(as.data.frame(edgelist))
  return (tree)
}
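
The "unique versions" step at the top of the function matters because the edge list identifies nodes by name, so repeated labels such as NP or DT would otherwise be indistinguishable. A minimal base R sketch of just that step, on a shortened, made-up parse string (`original` is an assumption for illustration):

```r
# Shortened, hypothetical parse string with repeated labels (NP, DT, NN, a)
original <- "(TOP (S (NP (DT a) (NN visitor)) (PP (TO to) (NP (DT a) (NN garden)))))"
ptext <- original

# Match every token that is not a bracket or a space (tags and words alike)
ms <- gregexpr("[^() ]+", ptext)
words <- regmatches(ptext, ms)[[1]]

# Append a running number to each token so every node name is unique
regmatches(ptext, ms) <- list(paste0(words, seq_along(words)))

print(words)   # tokens before numbering, with duplicates
print(ptext)   # same bracket structure, every token now carries an id
```

This is also why the function compares against 'TOP1' rather than 'TOP': the root label is the first token and so receives the id 1.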


parse_tree <- "(ROOT\r\n  (S\r\n    (NP (DT A) (JJ rare) (JJ black) (NN squirrel))\r\n    (VP (VBZ has)\r\n      (VP (VBN become)\r\n        (NP (DT a) (JJ regular) (NN visitor))\r\n        (PP (TO to)\r\n          (NP (DT a) (JJ suburban) (NN garden)))))\r\n    (. .)))\r\n\r\n"
parse_tree <- gsub("[\r\n]", "", parse_tree)
parse_tree <- gsub("ROOT", "TOP", parse_tree)

library(data.tree)

tree <- parse2tree(parse_tree)
tree
SetNodeStyle(tree, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow")
plot(tree)
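
Because the node names carry the numeric ids added inside parse2tree, you may want the plain labels back once the tree exists. The following is a minimal sketch, assuming a small hand-written edge list in the same "label + id" form (the `edges` data frame and the `label` attribute name are my own, chosen for illustration); it does not need coreNLP or NLP to run.

```r
library(data.tree)

# Hypothetical, shortened edge list shaped like the one parse2tree builds
edges <- data.frame(
  from = c("S2",  "S2",  "NP3", "NP3", "DT4", "NN6",       "VP8",  "VBZ9"),
  to   = c("NP3", "VP8", "DT4", "NN6", "A5",  "squirrel7", "VBZ9", "sleeps10"),
  stringsAsFactors = FALSE
)
tree <- FromDataFrameNetwork(edges)

# Keep the unique names, but store the label without its id as an attribute
tree$Do(function(node) node$label <- sub("[0-9]+$", "", node$name))

# Show the tree together with the cleaned labels
print(tree, "label")

# The leaves are the words of the sentence
tokens <- unname(tree$Get("label", filterFun = isLeaf))
print(tokens)
```

Keeping the ids in the node names and putting the readable text in a separate attribute avoids name clashes between siblings or repeated tags.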


Source: https://habr.com/ru/post/1629489/

