Unexpected output from read.csv()

I am trying to load a CSV file into R for some processing, but I ran into a strange problem while reading the data.

The CSV has no header row, and I use the following simple code to read the data:

newClick <- read.csv("test.csv", header = F) 

And the following is an example of a data set:

    10000011791441224671,V_Display,exit
    10000011951441812316,V_Display,exit
    10000013211441319797,V_Display,exit
    1000001331441725509,V_Display,exit
    10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit
    10000014031441295393,V_Display,exit

The output for this data is the expected data frame: 6 obs. of 18 variables.

Here is the tricky part. If I add another row to the dataset, for example:

    10000011791441224671,V_Display,exit
    10000011951441812316,V_Display,exit
    1000000191441228436,V_Display,exit
    10000013211441319797,V_Display,exit
    1000001331441725509,V_Display,exit
    10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit
    10000014031441295393,V_Display,exit

The result is a strange data frame: 12 obs. of 3 variables. On closer inspection, I realized that the long second-to-last row (18 fields) has been split into 6 rows of three columns each, which seems wrong.

Any thoughts on this?

+5
2 answers

As mentioned in the comments, this happens because the number of columns is determined from the first five lines of input. If you're in a bind, here is a possible workaround that I tested and that seems to work well. The trick is to supply a col.names vector whose length equals the number of columns in the data. We can get the number of columns with count.fields(). Substitute your file name for file.

    ## get the number of columns (the widest line in the file)
    ncols <- max(count.fields(file, sep = ","))
    ## read the data, naming all ncols columns up front
    df <- read.csv(file, header = FALSE, col.names = paste0("V", seq_len(ncols)))

Here is the tested code with your data:

    txt <- "10000011791441224671,V_Display,exit\n10000011951441812316,V_Display,exit\n1000000191441228436,V_Display,exit\n10000013211441319797,V_Display,exit\n1000001331441725509,V_Display,exit\n10000013681418242863,C_GoogleNonBrand,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,V_Display,exit\n10000014031441295393,V_Display,exit"
    ncols <- max(count.fields(textConnection(txt), sep = ","))
    df <- read.csv(text = txt, header = FALSE, col.names = paste0("V", seq_len(ncols)))
    dim(df)
    # [1]  7 18
+3

From the R documentation:

"The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary."

In the first example the wide row is among the first five lines, so read.csv allocates 18 columns and reads the data correctly. In the second example the wide row is the sixth line, so only 3 columns are allocated and the wide row is wrapped onto several 3-column rows.
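The effect can be reproduced on a much smaller input. The sketch below uses made-up data (the strings `late` and `early` and the helper `fields_per_line` are illustrative names, not part of the question) in which the only difference is whether the wide row falls inside the first five lines.

```r
# Hypothetical miniature data: the wide row (4 fields) is on line 6,
# outside the first five lines that read.csv() inspects.
late  <- "a,b\nc,d\ne,f\ng,h\ni,j\n1,2,3,4\nk,l"
# Same rows, but the wide row is moved up to line 5.
early <- "a,b\nc,d\ne,f\ng,h\n1,2,3,4\ni,j\nk,l"

# Illustrative helper: number of fields on each line of a string,
# closing the text connection when done.
fields_per_line <- function(txt) {
  con <- textConnection(txt)
  on.exit(close(con))
  count.fields(con, sep = ",")
}

fields_per_line(late)   # the 4-field line is the sixth entry
fields_per_line(early)  # the 4-field line is the fifth entry

# Only 2 columns are allocated, so the wide row wraps onto extra rows.
dim(read.csv(text = late, header = FALSE))

# The wide row is seen in time, so 4 columns are allocated.
dim(read.csv(text = early, header = FALSE))
# [1] 7 4
```

Running count.fields() over a file before reading it is a cheap way to check whether any line beyond the fifth is wider than the ones read.csv() will look at.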

To make sure this does not happen, either add column headers to the CSV or specify the correct number of columns with the col.names argument of read.csv.
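As a rough sketch of the header option, the example below prepends a generated header line to a small in-memory data set. The data and the names `txt`, `header_line`, and `with_header` are assumptions made up for illustration; with a real file you would add the header line to the file itself.

```r
# Hypothetical data: the widest row (4 fields) sits on line 6.
txt <- "a,b\nc,d\ne,f\ng,h\ni,j\n1,2,3,4\nk,l"

# Find the widest row, closing the connection afterwards.
con <- textConnection(txt)
ncols <- max(count.fields(con, sep = ","))
close(con)

# Build a header line with one name per column and prepend it to the data.
header_line <- paste(paste0("V", seq_len(ncols)), collapse = ",")
with_header <- paste(header_line, txt, sep = "\n")

# The header supplies the column count, so the wide row is not wrapped.
df <- read.csv(text = with_header, header = TRUE)
dim(df)
# [1] 7 4
```

The header only helps if it names every column of the widest row; a header with fewer names than that leaves the same problem in place.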

0

Source: https://habr.com/ru/post/1243190/

