I am working with Unicode text in R using the tm text-mining package. I would like the Unicode characters to survive being read into the program, but I cannot find the option I am missing. Here is an example of Unicode text that gets mangled as soon as it is read into a corpus:
library(tm)
u <- VectorSource("The great Chāṇakya (350–283 BC).", encoding = "UTF-8")
v <- Corpus(u)
inspect(v)
writeCorpus(v, 'test.txt')
I also tried UTF-16, with the same result. Is there any way to pass this text through tm without corrupting it?