Trying to do text mining and build a word cloud from Spanish text. I actually have 9 different .txt files, but I am posting just one here for reproducibility.
"Nos los Representantes del pueblo de la Nación ARGENTINA, reunidos en Congreso General Constituyente por voluntad y elección de las Provincias que la componen, en cumplimiento de pactos preexistentes, con el objeto de constituir la unión nacional, afianzar la justicia, promover el bienestar general, y asegurar los beneficios de la libertad, para nosotros, para nuestra posteridad, y para todos los hombres del mundo que quieran habitar en el suelo argentino: invocando la protección de Dios, fuente de toda razón y justicia: ordenamos, decretamos y establecemos esta Constitución, para la Nación ARGENTINA."
The file is saved as a .txt file. Below is my naive attempt to create a term-document matrix with the correct encoding. When I inspect the result, the terms do not match the source file (for example, "constitución" becomes "constitucif3n"). I am new to text mining, and I know the solution probably depends on several interdependent settings, so I decided to ask here instead of searching for 4 hours. Thanks in advance.
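library(tm)

# Basic cleaning: punctuation, extra whitespace, case, Spanish stopwords
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # tolower is a base R function, not a tm transformation, so wrap it
  # in content_transformer() to keep the documents intact (tm 0.6 and later)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("spanish"))
  return(corpus.tmp)
}

# Build a term-document matrix from every file in `path`
generateTDM <- function(path) {
  cor.tmp <- Corpus(DirSource(directory = path, encoding = "ISO8859-1"))
  cor.cl <- cleanCorpus(cor.tmp)
  tdm.tmp <- TermDocumentMatrix(cor.cl)
  tdm.s <- removeSparseTerms(tdm.tmp, 0.7)
  return(tdm.s)
}

# pathname is the directory that holds the .txt files
tdm <- generateTDM(pathname)
tdm.m <- as.matrix(tdm)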
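For what it's worth, the stray "f3" in "constitucif3n" happens to be the Latin-1 byte value of "ó" (0xF3), so my guess is that the accented character ends up as a raw byte somewhere along the way. Below is a minimal base R sketch of how I have been checking that, assuming the guess is right. The helper `is_utf8_file` is a hypothetical name of my own, not part of tm, and the temporary file only stands in for one of my .txt files.

```r
# "constitución" as Latin-1 bytes: the "ó" is the single byte 0xF3
latin1_bytes <- c(charToRaw("constituci"), as.raw(0xf3), charToRaw("n"))
word <- rawToChar(latin1_bytes)

# Declare what the bytes are, then convert to UTF-8
Encoding(word) <- "latin1"
word_utf8 <- enc2utf8(word)

# In UTF-8 the same "ó" is the two bytes 0xC3 0xB3
utf8_bytes <- charToRaw(word_utf8)

# Hypothetical helper: TRUE if every line of the file is valid UTF-8
is_utf8_file <- function(path) {
  all(validUTF8(readLines(path, warn = FALSE)))
}

# Stand-in for a real input file, written once in each encoding
latin1_file <- tempfile(fileext = ".txt")
utf8_file <- tempfile(fileext = ".txt")
writeBin(latin1_bytes, latin1_file)
writeBin(utf8_bytes, utf8_file)

latin1_ok <- is_utf8_file(latin1_file)
utf8_ok <- is_utf8_file(utf8_file)
```

If `is_utf8_file` returns TRUE for the real files, then the `encoding` argument passed to `DirSource` probably does not match them, but I have not confirmed that this is the whole story.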