Reading an Rdata File with Different Encoding

I have an .RData file that I need to read on my Linux computer (UTF-8), but I know the file is in Latin-1 because I created it myself on Windows. Unfortunately, I no longer have access to the source files or the Windows machine, and I need to read this file on my Linux machine.

To read an .Rdata file, the usual procedure is to run load("file.Rdata"). Functions like read.csv have an encoding argument that can be used to solve such problems, but load has no such argument. If I try load("file.Rdata", encoding = "latin1"), I just get this (expected) error:

 Error in load("file.Rdata", encoding = "latin1") : 
   unused argument (encoding = "latin1")

What else can I do? The loaded files contain text variables with accented characters that come out corrupted when read as UTF-8.

+5
3 answers

Thanks to 42's comments, I was able to write a function to transcode the file's contents:

 fix.encoding <- function(df, originalEncoding = "latin1") {
   numCols <- ncol(df)
   for (col in 1:numCols)
     Encoding(df[, col]) <- originalEncoding
   return(df)
 }

The key here is the Encoding(df[, col]) <- "latin1" command, which takes column col of dataframe df and declares it to be in latin1 encoding. Unfortunately, Encoding only accepts a character vector as input, so I had to write a function to loop over all the columns of the dataframe and apply the transformation to each one.

Of course, if your problem is confined to just a few columns, you are better off applying Encoding to those columns only, rather than to the entire dataframe (you can modify the function above to take a set of columns as input). Also, if you run into the opposite problem, i.e. reading on Windows an R object created on Linux or Mac OS, you should use originalEncoding = "UTF-8".
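As a usage sketch of the column-by-column approach (the dataframe and its column names here are made up for illustration), applying Encoding only to the affected columns looks like this:

```r
# Hypothetical example: a dataframe whose character columns carry
# latin1-encoded bytes, as they would after load()ing a Windows .Rdata file.
df <- data.frame(
  name = "Jos\xe9",        # "Jose" with an accented e, in latin1 bytes
  city = "S\xe3o Paulo",   # "Sao Paulo" with a tilde a, in latin1 bytes
  n    = 1,
  stringsAsFactors = FALSE
)

# Declare the encoding only for the columns that need it:
for (col in c("name", "city")) {
  Encoding(df[[col]]) <- "latin1"
}

Encoding(df$name)  # now "latin1"; R converts correctly on display in a UTF-8 locale
```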

+3

Thanks for posting this. I took the liberty of modifying your function to handle a dataframe with some character columns and some non-character columns. Otherwise, it throws an error:

 > fix.encoding(adress)
 Error in `Encoding<-`(`*tmp*`, value = "latin1") : 
   a character vector argument expected

So here is a modified function:

 fix.encoding <- function(df, originalEncoding = "latin1") {
   numCols <- ncol(df)
   for (col in 1:numCols)
     if (class(df[, col]) == "character") {
       Encoding(df[, col]) <- originalEncoding
     }
   return(df)
 }

However, this will not change the encoding of the level names in "factor" columns. Fortunately, I found that the following converts all the factors in your dataframe to character (which may not be the best approach, but in my case it is what I needed):

 i <- sapply(df, is.factor)
 df[i] <- lapply(df[i], as.character)
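A short sketch of the combined approach (the sample data is hypothetical): convert the factor columns to character first, then apply the same Encoding transcoding as in fix.encoding above.

```r
# Hypothetical sample: a text column that was read in as a factor,
# with latin1 bytes in its level name.
df <- data.frame(name = "Jos\xe9", stringsAsFactors = TRUE)

# Step 1: turn every factor column into a character column.
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)

# Step 2: declare latin1 on the character columns (the core of fix.encoding).
for (col in which(sapply(df, is.character))) {
  Encoding(df[, col]) <- "latin1"
}
```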
+1

Following up on the previous answers, here is a small update that makes the function work with factors and dplyr. Thanks for the inspiration.

 library(dplyr)  # provides as_data_frame()
 
 fix.encoding <- function(df, originalEncoding = "UTF-8") {
   numCols <- ncol(df)
   df <- data.frame(df)
   for (col in 1:numCols) {
     if (class(df[, col]) == "character") {
       Encoding(df[, col]) <- originalEncoding
     }
     if (class(df[, col]) == "factor") {
       Encoding(levels(df[, col])) <- originalEncoding
     }
   }
   return(as_data_frame(df))
 }
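To illustrate the factor branch in isolation (the sample data is made up), marking the level names as latin1 fixes a factor column without converting it to character:

```r
# Hypothetical sample: a factor column whose level name is latin1 bytes.
df <- data.frame(name = factor("Jos\xe9"))

# The factor branch of the function above, applied directly:
# declare the encoding of the level names, leaving the column a factor.
Encoding(levels(df$name)) <- "latin1"

Encoding(levels(df$name))  # "latin1"; levels print correctly in a UTF-8 locale
```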
+1

Source: https://habr.com/ru/post/1237175/
