FAQ markup to an R data structure

I am reading the source of the R FAQ in Texinfo, and I think it would be easier to manage and extend if it were parsed as an R structure. There are several existing examples related to this:

  • the fortunes package

  • bibtex records

  • Rd files

with some desirable features.

In my opinion, FAQs are underused in the R community because they lack (i) easy access from the R command line (that is, through an R package); (ii) powerful search capabilities; (iii) cross-referencing; (iv) extensions for contributed packages. Drawing ideas from the bibtex and fortunes packages, we could introduce a new system where:

  • FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), faq() # surprise me!, faq(51), faq(package = "ggplot2") (a rough sketch of such an interface follows this list).

  • Packages can provide their own FAQ.rda, the format of which is not yet clear (see below).

  • Sweave / knitr drivers are provided to output beautifully formatted Markdown / LaTeX, etc.
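
As a rough illustration of the interface sketched in the first bullet (a minimal sketch only: the data frame, its columns, and this faq() signature are all hypothetical, not an existing package API):

 # Hypothetical FAQ store, one entry per row, mirroring the fortunes package.
 faqs <- data.frame(
   id      = 1:2,
   title   = c("Why does my lattice plot not print inside a loop?",
               "How do I install a package from source?"),
   entry   = c("Wrap the object in print(), e.g. print(xyplot(...)).",
               "Run R CMD INSTALL on the source tarball."),
   package = c("lattice", "base"),
   stringsAsFactors = FALSE
 )

 # Minimal faq() accessor: keyword search, lookup by id, per-package filter,
 # or a random entry when called with no arguments ("surprise me").
 faq <- function(query = NULL, package = NULL) {
   hits <- faqs
   if (!is.null(package)) hits <- hits[hits$package %in% package, ]
   if (is.numeric(query)) {
     hits <- hits[hits$id %in% query, ]
   } else if (is.character(query)) {
     hits <- hits[grepl(query, paste(hits$title, hits$entry), ignore.case = TRUE), ]
   } else if (is.null(query) && nrow(hits) > 0) {
     hits <- hits[sample(nrow(hits), 1), ]   # surprise me!
   }
   hits
 }

 faq("lattice")          # keyword search
 faq(1)                  # lookup by id
 faq(package = "base")   # FAQ entries of one package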

Question

I am not sure what the best input format would be, either for converting the existing FAQ or for adding new entries.

It is cumbersome to use R syntax with a tree of nested lists (or an ad hoc S3 / S4 / reference class, or structure()), e.g.

 list(title = "Something to be \\escaped",
      entry = "long text with quotes, links and broken characters",
      category = c("windows", "mac", "test"))

Rd, even though it is not an R structure per se (it is rather a subset of LaTeX with its own parser), may be a more appealing example of an input format. It also comes with a set of tools for parsing its structure in R. However, its current purpose is rather specific and different, focused on the general documentation of R functions rather than on FAQ entries. Its syntax is also not ideal; I think a more modern markup, something like Markdown, would be more readable.
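
For what it is worth, the tools package already parses installed Rd files into R objects that can be walked programmatically (a small illustration; "stats" and "lm.Rd" are just example names, any installed package would do):

 library(tools)
 # All Rd files of an installed package, parsed into "Rd" objects:
 db <- Rd_db("stats")
 rd <- db[["lm.Rd"]]          # one help page, as a tree of tagged elements
 class(rd)                    # "Rd"
 # Tags of the top-level components (\title, \description, \arguments, ...):
 sapply(rd, attr, "Rd_tag")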

Is there anything else out there, perhaps examples of parsing markup files into R structures? Any example of diverting Rd files from their intended purpose?

To summarize

I would like to come up with:

1. a good design for an R structure (a class, perhaps) that extends the fortunes package to more general entries, such as FAQ items

2. a more convenient format for entering new FAQs (rather than the current texinfo format)

3. a parser, written in R or in some other language (bison?), to convert the existing FAQ into the new structure (1) and/or the new input format (2) into the R structure

Update 2: in the last two days of the bounty period I received two answers, both interesting but quite different. Since the question is rather broad (and arguably ill-posed), neither answer gives a complete solution, so I will not (for now, anyway) accept an answer. As for the bounty, I will award it to the answer with the most votes before it expires, wishing there were a way to split it more evenly.

+42
markup r r-faq parsing markdown
May 26 '12 at 3:39
2 answers

(This addresses point 3.)

You can convert the texinfo file to XML

 wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
 makeinfo --xml R-FAQ.texi

and then read it with the XML package.

 library(XML)
 doc <- xmlParse("R-FAQ.xml")
 r <- xpathSApply(doc, "//node", function(u) {
   list(list(
     title    = xpathSApply(u, "nodename", xmlValue),
     contents = as(u, "character")
   ))
 })
 free(doc)
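
If the conversion worked, each element of r should pair a node title with its raw contents; a quick check (the exact output will vary with the makeinfo version used) might be:

 length(r)                       # number of <node> elements found
 r[[1]]$title                    # title of the first node
 substr(r[[1]]$contents, 1, 80)  # beginning of its XML contents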

But it’s much easier to convert it to text.

 makeinfo --plaintext R-FAQ.texi > R-FAQ.txt 

and analyze the result manually.

 doc <- readLines("R-FAQ.txt")

 # Split the document into questions,
 # i.e., around lines like ****** or ======.
 i <- grep("[*=]{5}", doc) - 1
 i <- c(1, i)
 j <- rep(seq_along(i)[-length(i)], diff(i))
 stopifnot(length(j) == length(doc))
 faq <- split(doc, j)

 # Clean the result: since the questions are in the
 # subsections, we can discard the sections.
 faq <- faq[sapply(faq, function(u) length(grep("[*]", u[2])) == 0)]

 # Use the result
 cat(faq[[sample(seq_along(faq), 1)]], sep = "\n")
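
Building on that, a crude keyword search over the resulting list is straightforward (a sketch only; this is plain pattern matching over the raw text, not the faq() interface the question asks for):

 # Return the FAQ entries whose text mentions a keyword:
 faq_search <- function(keyword, entries = faq) {
   hits <- sapply(entries, function(u) any(grepl(keyword, u, ignore.case = TRUE)))
   entries[hits]
 }
 found <- faq_search("lattice")
 if (length(found) > 0) cat(found[[1]], sep = "\n")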
+8
Jun 02 '12 at 10:30

I do not quite understand your goals. It seems you want all the documentation related to R converted into some format that R can manipulate, presumably so that you can write R routines to extract information from the documentation more effectively.

There are apparently three assumptions here.

1) That it is easy to convert these different document formats (texinfo, Rd files, etc.) into some standard form with (emphasis here) some implicit uniform structure and semantics. Because if you cannot map them all onto a single structure, you will have to write separate R tools for each type, and possibly for each individual document, and the post-conversion work will then outweigh the benefit.

2) That R is the right language in which to write such document-processing tools; I suspect you are a bit biased towards R because you work in R and do not want to contemplate "leaving" the development environment in order to get better at extracting information about working with R. I am not an R expert, but I believe R is mainly a numerical language and offers no special support for string handling, pattern recognition, natural-language parsing, or inference, all of which I would expect to play an important part in extracting information from the converted documents, which will contain mostly natural language. I am not proposing a specific alternative language (Prolog??), but you might be better off, once you manage to convert to a normal form (task 1), carefully choosing the target language for processing.

3) That you can extract useful information from these structures. Library science is what the 20th century tried to push; now we are all into "Information Retrieval" and "Data Fusion" methods. But in fact, reasoning about informal documents has defeated most attempts to do so. There are no obvious working systems that organize raw text and extract deep meaning from it (IBM's Jeopardy-winning Watson system is the notable exception, but even there it is not clear that Watson "knows" anything; would you want Watson to answer the question "Should the surgeon open you up with a knife?" no matter how much raw text you gave it?). The point is that you may be able to convert the data, but it is not clear what you can successfully do with it.

All that said, most markup systems applied to text have a markup structure plus the raw text. You can "parse" those into tree-like structures (or graph-like structures, if you assume certain things are reliable cross-references; texinfo certainly has those). XML is widely touted as a carrier for such parsed structures, and since it can represent arbitrary trees or graphs it is ... OK ... at capturing such trees or graphs. [People then push RDF or OWL or some other knowledge-encoding system that uses XML, but that does not change the problem; you pick a canonical target independent of R.] So what you really want is something that will read the various marked-up structures (texinfo, Rd files) and spit out XML or equivalent trees/graphs. Here I think you are doomed to building separate O(N) parsers to cover all N markup styles; how else would a tool know what counts as markup (and therefore what to parse)? (You can imagine a system that could read marked-up documents given a description of the markup, but even that is O(N): somebody still has to describe the markup.) Once this parsing into a uniform notation is done, you can then use an easily built R parser to read the XML (if one does not already exist), or, if R is not the right answer, parse it with whatever the right answer is.

There are tools that help you build parsers and parse trees for arbitrary languages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people that you might even accidentally find a texinfo parser somebody has already built. Our DMS Software Reengineering Toolkit is another; after parsing, DMS will export an XML document with the parse tree directly (but it will not necessarily be in the uniform representation you ideally want). These tools will likely make it easier to read the markup and represent it in XML.

But I think your real problem will be deciding what you want to extract and do, and then finding a way to do that. Unless you have a clear idea of how to do the latter, building all the up-front parsers just seems like a lot of work with a fuzzy payoff. Perhaps you have a simpler goal ("manage and extend", but those words can hide a lot) that is more doable.

+6
Jun 02 '12


