Assessing the polarity of a document using the R qdap package without sentSplit

I would like to apply qdap's polarity function to a vector of documents, each of which may contain several sentences, and get the corresponding polarity for each document. For instance:

library(qdap)
polarity(DATA$state)$all$polarity
# Results:
 [1] -0.8165 -0.4082  0.0000 -0.8944  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000
Warning message:
In polarity(DATA$state) :
  Some rows contain double punctuation.  Suggested use of `sentSplit` function.

This warning cannot be ignored, because polarity appears to add up the polarity scores of every sentence in the document. As a result, document-level polarity scores can fall outside the [-1, 1] bounds.

I am aware of the option to run sentSplit first and then average across sentences, perhaps weighting each sentence's polarity by its word count, but this is (1) inefficient (it takes roughly 4 times as long as running on the full documents with the warning) and (2) unclear about how the sentences should be weighted. That option would look something like this:

DATA$id <- seq(nrow(DATA)) # For identifying and aggregating documents 
sentences <- sentSplit(DATA, "state")
library(data.table) # For aggregation
pol.dt <- data.table(polarity(sentences$state)$all)
pol.dt[, id := sentences$id]
document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), "id"]
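
For reference, the word-count weighting in the last line is an ordinary weighted mean. A minimal base-R sketch of the same aggregation, using made-up sentence scores (the `sent` data frame and its values are hypothetical, not qdap output):

```r
# Hypothetical sentence-level scores: document 1 has two sentences, document 2 has one
sent <- data.frame(
  id       = c(1, 1, 2),
  polarity = c(-0.5, 0.5, 0.25),
  wc       = c(4, 2, 4)
)

# Word-count-weighted mean of the sentence polarities within each document
doc_polarity <- sapply(
  split(sent, sent$id),
  function(d) weighted.mean(d$polarity, d$wc)
)
doc_polarity
```

Because the weights are non-negative, each document score stays between the smallest and largest sentence score for that document.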

I was hoping I could run polarity on a version of the vector with the periods removed, but it seems that sentSplit does more than that. Removing periods works on DATA, but not on other sets of text (I am not sure of the full set of sentence breaks other than periods).

So I suspect the best approach is to make each element of the document vector look like one long sentence. How can I do this, or is there another way?

+3
2

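Max found a bug in this version of qdap (1.3.4): the word count was off, which affects the score because the denominator is sqrt(n), where n is the word count. As of 1.3.5 this has been corrected, which is why the two outputs do not match.

Here is the output: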

library(qdap)
counts(polarity(DATA$state))[, "polarity"]

## > counts(polarity(DATA$state))[, "polarity"]
##  [1] -0.8164966 -0.4472136  0.0000000 -1.0000000  0.0000000  0.0000000  0.0000000
##  [8] -0.5773503  0.0000000  0.4082483  0.0000000
## Warning message:
## In polarity(DATA$state) : 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.
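
The rows that differ between the two outputs (rows 2 and 4) are consistent with a word count that was one too high in the earlier run. A quick base-R check of the sqrt(n) arithmetic, as a sketch only: `simple_polarity` is a hypothetical helper, not part of qdap, and the word counts (5 and 4) and net polarized scores (-1 and -2) are my own reading of those two rows rather than values reported above.

```r
# Hypothetical helper: net polarized score divided by sqrt(word count)
simple_polarity <- function(score_sum, n_words) score_sum / sqrt(n_words)

# Row 2: assumed net score -1
row2_fixed <- round(simple_polarity(-1, 5), 4)  # -0.4472 (n = 5)
row2_old   <- round(simple_polarity(-1, 6), 4)  # -0.4082 (n one too high)

# Row 4: assumed net score -2
row4_fixed <- round(simple_polarity(-2, 4), 4)  # -1      (n = 4)
row4_old   <- round(simple_polarity(-2, 5), 4)  # -0.8944 (n one too high)
```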

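You can use strip to remove the punctuation. Keep in mind, though, that this also removes commas, and that can change the result: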

## > counts(polarity("Really, I hate it"))[, "polarity"]
## [1] -0.5
## > counts(polarity(strip("Really, I hate it")))[, "polarity"]
## [1] -0.9
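
One way to read these two numbers, as a hedged sketch: if "hate" scores -1, the sentence has 4 words, and "Really" acts as an amplifier with weight 0.8 only once the comma is gone, the arithmetic reproduces both results. The 0.8 weight and the role of the comma are assumptions here, and the variable names are illustrative.

```r
n_words          <- 4     # "Really, I hate it"
hate_score       <- -1    # assumed score of the negative word
amplifier_weight <- 0.8   # assumed amplifier weight

# With the comma: "Really" does not amplify "hate"
with_comma <- hate_score / sqrt(n_words)

# Without the comma: "Really" amplifies "hate"
without_comma <- hate_score * (1 + amplifier_weight) / sqrt(n_words)

c(with_comma = with_comma, without_comma = without_comma)
```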

+2

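Removing punctuation and other extras before calling polarity seems to make it treat each element as a single sentence: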

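library(tm) # Provides removePunctuation, removeNumbers and stripWhitespace
# Lower-case the text, collapse whitespace, then drop digits and punctuation
SimplifyText <- function(x) {
  return(removePunctuation(removeNumbers(stripWhitespace(tolower(x))))) 
}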
polarity(SimplifyText(DATA$state))$all$polarity
# Result (no warning)
 [1] -0.8165 -0.4472  0.0000 -1.0000  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000 
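
To see what this kind of cleanup does to a single document, here is a rough base-R approximation using regular expressions. `simplify_text_base` is a hypothetical name, and the patterns are my approximation of the default behaviour of the helpers used above, not a drop-in replacement.

```r
# Approximate the cleanup with base R only
simplify_text_base <- function(x) {
  x <- tolower(x)                    # lower-case
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  x <- gsub("[[:digit:]]+", "", x)   # drop digits
  gsub("[[:punct:]]+", "", x)        # drop punctuation, including apostrophes
}

simplified <- simplify_text_base("No it's not, it's dumb.")
simplified  # "no its not its dumb"
```

Note that apostrophes and commas are removed as well, so contractions and any comma-dependent behaviour are affected.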
0

Source: https://habr.com/ru/post/1661975/

