Apply encoding to all data in a data.table

I have the following code that reads a file into a data.table. For instance:

raw <- fread("avito_train.tsv", nrows=1000) 

Then, if I change the encoding of a specific column and row as follows:

 Encoding(raw$title[2]) <- "UTF-8" 

It works great.

But how can I apply encoding to all columns and all rows?

I checked the fread documentation, but there does not seem to be an encoding option. I also tried Encoding(raw), but this gives me an error (it expects a character vector argument).

Edit: This article provides more information about foreign text in RStudio on Windows: http://quantifyingmemory.blogspot.com/2013/01/r-and-foreign-characters.html

+6
3 answers

I tried this:

 Encoding(raw$title) <- "UTF-8" 

This sets the encoding for the entire column, which is good enough for now. I am still open to other options that would do this automatically on import.
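For context, `Encoding<-` only changes the declared encoding mark on each string; it does not convert any bytes, and pure ASCII strings always stay "unknown". A minimal base R sketch, assuming the bytes are already UTF-8 (the `titles` vector is made up for illustration):

```r
# Hypothetical sample: the first element holds the UTF-8 bytes of a
# Cyrillic letter, the second element is plain ASCII.
titles <- c("\xd0\x9f", "abc")

# Declare (not convert) the encoding for the whole vector at once.
Encoding(titles) <- "UTF-8"

# Inspect the per-element encoding marks.
print(Encoding(titles))
```

If the underlying bytes were in a different encoding, the mark would simply be wrong, so this only helps when the file really is UTF-8.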

+4

Unfortunately, there is no way to do this on import with fread (for now).

Since you seem to have figured that out already, I will post a way to set the encoding for the whole data.table after import.

One way to do this is to loop over all character columns in the data.table:

 for (name in colnames(raw[, sapply(raw, is.character), with = FALSE])) {
   Encoding(raw[[name]]) <- "UTF-8"
 }

The colnames(...) part first selects the columns that are character (with = FALSE seems to be needed for a data.table) and then gets the names of those columns to iterate over. In short: this applies what you already found to work, but to all character columns.
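The same idea can also be written with data.table's `set()`, which assigns a column by reference and is intended for use inside loops. This is only a sketch: the toy table below stands in for the imported file, and the names `char_cols` and `values` are illustrative assumptions, not part of the original question.

```r
library(data.table)

# Hypothetical toy table standing in for the imported data.
raw <- data.table(
  id       = 1:2,
  title    = c("\xd0\x9f", "abc"),
  category = c("\xd1\x8f", "x")
)

# Names of all character columns.
char_cols <- names(raw)[vapply(raw, is.character, logical(1))]

for (col in char_cols) {
  values <- raw[[col]]
  Encoding(values) <- "UTF-8"        # mark the strings, no byte conversion
  set(raw, j = col, value = values)  # write the column back by reference
}

print(Encoding(raw$title))
```

Non-character columns such as `id` are left untouched, since `Encoding<-` only accepts character vectors.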

Now, since there is no guarantee that the column names themselves (including those of your integer, float, etc. columns) do not need the same treatment, the following takes care of them:

 Encoding(colnames(raw)) <- "UTF-8" # vectorized over all names, so no loop is needed

+3

This was recently implemented in the development version of data.table, v1.9.5. It will soon be released to CRAN (as v1.9.6). Could you give the development version a try to check whether it solves this for you?

fread() has gained an encoding argument, especially for Windows problems.

 require(data.table) # v1.9.5+
 fread("file.txt", encoding="UTF-8")

should solve the problem. I have no file to test it with. If this does not solve your problem, please file an issue on the project page with a reproducible example / file.
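To try this without the original file, a small self-contained check can write a UTF-8 file and read it back. This is a sketch that assumes a data.table version with the `encoding` argument (v1.9.6 or later) is installed; the temporary file and its contents are made up for illustration.

```r
library(data.table)  # assumes v1.9.6+ so that fread() accepts `encoding`

# Hypothetical tab-separated file containing UTF-8 bytes for Cyrillic text.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\ttitle", "1\t\xd0\x9f\xd1\x80"), tmp, useBytes = TRUE)

# Read it back, declaring the file's encoding at import time.
dt <- fread(tmp, sep = "\t", header = TRUE, encoding = "UTF-8")

# Check the encoding mark on the imported strings.
print(Encoding(dt$title))

unlink(tmp)
```

If the strings come back marked as UTF-8, no manual `Encoding<-` step should be needed after import.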

+3
