I would like to know the positions of the dictionary terms found in a set of short texts. The problem is in the last lines of the following code, which is roughly based on the question "From a list of strings, identify which are names of people and which are not":
library(tm)
pkd.names.quotes <- c(
  "Mr. Rick Deckard",
  "Do Androids Dream of Electric Sheep",
  "Roy Batty",
  "How much is an electric ostrich?",
  "My schedule for today lists a six-hour self-accusatory depression.",
  "Upon him the contempt of three planets descended.",
  "J.F. Sebastian",
  "Harry Bryant",
  "goat class",
  "Holden, Dave",
  "Leon Kowalski",
  "Dr. Eldon Tyrell"
)
firstnames <- c("Sebastian", "Dave", "Roy",
                "Harry", "Dave", "Leon",
                "Tyrell")
dict <- sort(unique(tolower(firstnames)))
corp <- VCorpus(VectorSource(pkd.names.quotes))
tdm <- TermDocumentMatrix(
  corp,
  control = list(tolower = TRUE, dictionary = dict)
)
inspect(corp)
inspect(tdm)
View(as.matrix(tdm))
data.frame(
  Name = rownames(tdm)[tdm$i],
  Segment = colnames(tdm)[tdm$j],
  Content = pkd.names.quotes[tdm$j],
  Postion = regexpr(
    pattern = rownames(tdm)[tdm$i],
    text = tolower(pkd.names.quotes[tdm$j])
  )
)
The output comes with a warning, and only the first row is correct:
       Name Segment          Content Postion
1       roy       3        Roy Batty       1
2 sebastian       7   J.F. Sebastian      -1
3     harry       8     Harry Bryant      -1
4      dave      10     Holden, Dave      -1
5      leon      11    Leon Kowalski      -1
6    tyrell      12 Dr. Eldon Tyrell      -1
Warning message:
In regexpr(pattern = rownames(tdm)[tdm$i], text = tolower(pkd.names.quotes[tdm$j])) :
argument 'pattern' has length > 1 and only the first element will be used
I know of a workaround using pattern = paste(vector, collapse = "|"), but my vector can be very long (all popular names).
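For reference, a minimal sketch of that workaround, using made-up stand-ins (`texts`, `dict`, `alt`) rather than the real corpus. It relies only on base R's `regexpr()` being vectorized over `text`, and it assumes the names contain no regex metacharacters:

```r
# Hypothetical small inputs standing in for the real data
texts <- tolower(c("Roy Batty", "J.F. Sebastian", "goat class"))
dict  <- c("roy", "sebastian")

# One alternation pattern built from the whole dictionary
alt <- paste(dict, collapse = "|")

# regexpr() is vectorized over `text`, so this yields one position per text
# (-1 where no dictionary term matches)
pos <- as.integer(regexpr(alt, texts))
pos
# [1]  1  6 -1
```

With a dictionary of all popular names, `alt` becomes a single very long pattern, which is what I would like to avoid.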
Is there a simple vectorized version of this command, or a solution that uses a new pattern for each row?
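To make the goal concrete, here is a rough sketch of the row-by-row behavior I have in mind, assuming base R's `mapply()` is an acceptable way to pair each pattern with its own text. The vectors `patterns` and `texts` are made-up stand-ins for `rownames(tdm)[tdm$i]` and `tolower(pkd.names.quotes[tdm$j])`; this may well not be the most efficient approach:

```r
# Hypothetical inputs: one pattern per text, paired by position
patterns <- c("roy", "sebastian", "harry")
texts    <- tolower(c("Roy Batty", "J.F. Sebastian", "Harry Bryant"))

# mapply() calls regexpr() once per (pattern, text) pair;
# fixed = TRUE treats each name as a literal string, not a regex
pos <- mapply(
  function(p, x) as.integer(regexpr(p, x, fixed = TRUE)),
  patterns, texts,
  USE.NAMES = FALSE
)
pos
# [1] 1 6 1
```

Each element of `pos` is the start of the match of the i-th pattern in the i-th text, so no combined pattern is needed.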