Reading text in Unicode using tm in R?

Question

I work with Unicode text in R using the tm text-mining package. I want the Unicode characters to survive when the text is read in, but I cannot find the option that makes this work. Here is an example of Unicode text that is mangled immediately after being read in as a corpus:

library(tm)
u <- VectorSource("The great Chāṇakya (350–283 BC).", encoding = "UTF-8")
v <- Corpus(u)
inspect(v)
## [[1]]
## The great Chaṇakya (350–283 BC).  <--The ā has been coerced to "a"

writeCorpus(v,'test.txt')
## yields: The great Cha<U+1E47>akya (350–283 BC).

I also tried using UTF-16 with the same results. Is there any way to pass this text through tm without destroying it?
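
To narrow down where the damage happens, a check like the following might help. It is only a diagnostic sketch in base R, not a fix, and it assumes the problem could come from the session locale rather than from tm itself (I have not confirmed that). The variable names and the temporary output file are made up for illustration.

```r
# Build the test string from escapes so the source file's own encoding
# cannot interfere (U+0101 = a with macron, U+1E47 = n with dot below,
# U+2013 = en dash).
x <- "The great Ch\u0101\u1E47akya (350\u2013283 BC)."

# How R has marked the string internally.
Encoding(x)

# Code points actually stored in the string. If 257 and 7751 are both
# present, the string is intact and any damage is happening later
# (printing, conversion inside a package, or writing to disk).
cp <- utf8ToInt(enc2utf8(x))
c(257L, 7751L) %in% cp

# Whether the current session runs in a UTF-8 locale. Assumption: in a
# non-UTF-8 locale, characters may be transliterated or escaped as <U+XXXX>.
l10n_info()[["UTF-8"]]
Sys.getlocale("LC_CTYPE")

# Write the UTF-8 bytes unchanged, bypassing locale conversion, then read
# them back and compare code points.
out_file <- tempfile(fileext = ".txt")  # hypothetical output location
writeLines(enc2utf8(x), out_file, useBytes = TRUE)
back <- readLines(out_file, encoding = "UTF-8")
identical(utf8ToInt(back), cp)
```

If the code points survive here but not after the text goes through the corpus, the same utf8ToInt check on the text extracted from the corpus should show at which step the ā is lost.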

r unicode tm

Michael k Feb 21 '14 at 3:33

No one has answered this question yet.