Reading Unicode text with tm in R?

I am working with Unicode text in R using the tm text-mining package. I want the Unicode characters to survive being read into the program, but I cannot find the argument I am missing. Here is an example of Unicode text that is mangled as soon as it is read into a corpus:

library(tm)
u <- VectorSource("The great Chāṇakya (350–283 BC).", encoding = "UTF-8")
v <- Corpus(u)
inspect(v)
## [[1]]
## The great Chaṇakya (350–283 BC).  <-- the ā has been coerced to "a"

writeCorpus(v,'test.txt')
## yields: The great Cha<U+1E47>akya (350–283 BC).

I also tried UTF-16, with the same result. Is there any way to pass this text through tm without corrupting it?
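
To narrow down where the characters are lost, it may help to check the string outside tm first. The following is only a diagnostic sketch in base R, not a fix and not part of tm: it assumes the string is built from `\u` escapes (so the script file's own encoding cannot interfere), and the variable names `s`, `cp`, `nonascii`, `tmp` and `back` are made up for illustration. It lists the non-ASCII code points and checks that they survive a UTF-8 write and read.

```r
# Build the test string from \u escapes so it does not depend on how
# this script file itself is encoded.
s <- "The great Ch\u0101\u1e47akya (350\u2013283 BC)."

# Encoding R has marked on the string.
Encoding(s)

# Integer code points of the string; anything above 127 is non-ASCII.
cp <- utf8ToInt(s)
nonascii <- sprintf("U+%04X", cp[cp > 127])
nonascii

# Write the raw UTF-8 bytes to a temporary file and read them back.
# useBytes = TRUE asks writeLines not to re-encode to the native locale.
tmp <- tempfile(fileext = ".txt")
writeLines(enc2utf8(s), tmp, useBytes = TRUE)
back <- readLines(tmp, encoding = "UTF-8")

# TRUE means the code points survived the round trip unchanged.
identical(utf8ToInt(back), cp)
```

If the code points are intact here but the output of `inspect(v)` still looks wrong, the loss may be happening when the text is printed in the console's locale rather than when it is stored, but that is a guess to be tested, not a confirmed diagnosis.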


Source: https://habr.com/ru/post/1528020/
