Creating a corpus with Spanish text in R

I am trying to build a corpus and a word cloud from Spanish text. I actually have 9 different .txt files, but I am posting just one sample here for reproducibility.

"Nos los representantes del pueblo de la Nación Argentina, reunidos en Congreso General Constituyente por voluntad y elección de las provincias que la componen, en cumplimiento de pactos preexistentes, con el objeto de constituir la unión nacional, afianzar la justicia, consolidar la paz interior, proveer a la defensa común, promover el bienestar general, y asegurar los beneficios de la libertad, para nosotros, para nuestra posteridad, y para todos los hombres del mundo que quieran habitar en el suelo argentino: invocando la protección de Dios, fuente de toda razón y justicia: ordenamos, decretamos y establecemos esta Constitución, para la Nación Argentina."

The text is saved as a .txt file. Below is my naive attempt to create a term-document matrix with the correct encoding. When I inspect the result, the terms do not match the source file (for example, "constitución" becomes "constitucif3n"). I am new to text mining, and I know the cause probably lies in several interdependent settings, so I decided to ask here instead of searching for 4 hours. Thanks in advance.

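# Generate Term-Document-Matrix
library(tm)

# Clean a corpus: strip punctuation and extra whitespace, lowercase,
# then drop Spanish stop words
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # Wrap base functions such as tolower in content_transformer()
  # so that tm_map() keeps the documents as corpus documents
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("spanish"))
  return(corpus.tmp)
}

# Read every file in `path` into a corpus, clean it, and build the matrix
generateTDM <- function(path) {
  # The encoding declared here has to match how the files were actually saved
  cor.tmp <- Corpus(DirSource(directory = path, encoding = "ISO8859-1"))
  cor.cl <- cleanCorpus(cor.tmp)
  tdm.tmp <- TermDocumentMatrix(cor.cl)
  tdm.s <- removeSparseTerms(tdm.tmp, 0.7)
  return(tdm.s)
}

# pathname: the directory that contains the .txt files
tdm <- generateTDM(pathname)
tdm.m <- as.matrix(tdm)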

Answer: Make sure the source text file is saved with UTF-8 encoding. To do this, I had to change the save settings in TextEdit on the Mac. After that, everything worked without problems.
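
If re-saving each file by hand in an editor is inconvenient, the same conversion can probably be done from R itself. The following is only a minimal sketch: `reencode_to_utf8` is a hypothetical helper (it is not part of tm or of the original post), and it assumes the files are currently Latin-1 (ISO-8859-1), which is what the stray "f3" in "constitucif3n" suggests, since 0xF3 is the Latin-1 byte for "ó". Adjust `from` if your files use a different encoding.

```r
# Hypothetical helper (not from the original post): rewrite a text file as UTF-8.
# Assumption: the file is currently encoded as Latin-1; change `from` otherwise.
reencode_to_utf8 <- function(infile, outfile = infile, from = "latin1") {
  # Read the lines as they are stored, without converting them
  lines <- readLines(infile, warn = FALSE)
  # Interpret the bytes as `from` and convert them to UTF-8
  utf8 <- iconv(lines, from = from, to = "UTF-8")
  # Write the UTF-8 bytes unchanged (useBytes = TRUE prevents re-encoding
  # into the native encoding; "wb" keeps plain "\n" line endings)
  con <- file(outfile, open = "wb")
  on.exit(close(con))
  writeLines(utf8, con, useBytes = TRUE)
  invisible(outfile)
}

# Small demonstration on a temporary file that holds Latin-1 bytes
demo_file <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("constituci"), as.raw(0xf3), charToRaw("n\n")), demo_file)
reencode_to_utf8(demo_file)
print(readLines(demo_file, encoding = "UTF-8"))
```

Once the files are UTF-8, the encoding declared when building the corpus should match, for example `DirSource(directory = path, encoding = "UTF-8")` in place of `"ISO8859-1"`.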

Source: https://habr.com/ru/post/1539357/

