I would like to apply the qdap function polarity to a vector of documents, each of which can contain several sentences, and get the corresponding polarity for each document. For instance:
library(qdap)
polarity(DATA$state)$all$polarity
[1] -0.8165 -0.4082 0.0000 -0.8944 0.0000 0.0000 0.0000 -0.5774 0.0000
[10] 0.4082 0.0000
Warning message:
In polarity(DATA$state) :
Some rows contain double punctuation. Suggested use of `sentSplit` function.
This warning can't be ignored, because polarity appears to add up the polarity scores of the individual sentences in each document. As a result, document-level polarity values can fall outside the bounds [-1, 1].
I'm aware of the option to run sentSplit first and then average across sentences, perhaps weighting each sentence's polarity by its word count, but (1) this is inefficient (it takes roughly 4 times as long as running on the full documents with the warning), and (2) it's unclear how the sentences should be weighted. This option would look something like this:
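library(data.table)
# Tag each document with an id so sentences can be mapped back to it
DATA$id <- seq(nrow(DATA))
# Split each document into one row per sentence (the id column is carried along)
sentences <- sentSplit(DATA, "state")
# Sentence-level polarity and word counts
pol.dt <- data.table(polarity(sentences$state)$all)
pol.dt[, id := sentences$id]
# Word-count-weighted mean polarity per document
document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), by = "id"]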
I was hoping I could run polarity on a version of the vector with the periods removed, but sentSplit seems to do more than that. Removing periods works on DATA, but not on other sets of text (I'm not sure what the full set of sentence breaks is, other than periods).
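For concreteness, here is a minimal sketch of that period-stripping attempt, using only base R. The helper name strip_periods and the sample docs vector are hypothetical and only for illustration. It treats "." as the only sentence break, which is presumably the assumption that fails on other text:

```r
# Hypothetical helper (not part of qdap): remove periods so each document
# reads as one long sentence. Only "." is handled; other end marks are kept.
strip_periods <- function(x) {
  x <- gsub("\\.+", " ", x)   # replace runs of periods with a space
  x <- gsub("\\s+", " ", x)   # collapse repeated whitespace
  trimws(x)                   # drop leading/trailing spaces
}

# Made-up sample documents
docs <- c("I like it. It is good.", "Really? No way!")
cleaned <- strip_periods(docs)
print(cleaned)

# Intended use (requires qdap):
# polarity(strip_periods(DATA$state))$all$polarity
```

In the second sample document the "?" and "!" are left untouched, so anything that splits on those marks would still see two sentences there.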
So I suspect the best way to approach this is to make each element of the document vector look like one long sentence. How can I do this, or is there another way?