Import poorly structured data into R

I was given the data in a text file that looks like this:

```
Measurement: mc
Loop:
var1=0, var2=-5, var3=1.8
values:
iteration data
0 1.203
1 1.206
2 2.206
3 1.201
4 1.204
5 1.204
6 1.204
statistics:
max 1.206
min 1.201
mean 1.204
stddev 0.001
avgdev 0.001
failedtimes 0
Measurement: mc
Loop:
var1=10, var2=-5, var3=1.8
values:
iteration data
0 1.203
1 1.206
2 2.206
3 1.201
statistics:
max 1.206
min 1.201
mean 1.204
stddev 0.001
avgdev 0.001
failedtimes 0
```

I want to get the data in a more normal format, for example:

```
var1, var2, var3, iteration, data
0, -5, 1.8, 0, 1.203
0, -5, 1.8, 1, 1.206
...
10, -5, 1.8, 0, 1.203
```

I'm having trouble even getting started with the analysis, and searching for a solution hasn't turned up anything helpful.


One way is to use readLines together with a few simple regular expressions to pull out the relevant lines.

Your data

```r
txt <- "Measurement: mc
Loop:
var1=0, var2=-5, var3=1.8
values:
iteration data
0 1.203
1 1.206
2 2.206
3 1.201
4 1.204
5 1.204
6 1.204
statistics:
max 1.206
min 1.201
mean 1.204
stddev 0.001
avgdev 0.001
failedtimes 0
Measurement: mc
Loop:
var1=10, var2=-5, var3=1.8
values:
iteration data
0 1.203
1 1.206
2 2.206
3 1.201
statistics:
max 1.206
min 1.201
mean 1.204
stddev 0.001
avgdev 0.001"
```


```r
# Read in: you can pass the file path instead of textConnection
r = readLines(textConnection(txt))

# Find indices of the relevant parts of the file that you want to keep
id1 = grep("var", r)
id2 = grep("iteration", r)
id3 = grep("statistics", r)

# Indices for the iteration data
m = mapply(seq, id2, id3 - 1)

# Use read.table to parse the relevant rows
lst <- lapply(seq_along(m), function(x)
  cbind(read.table(text = r[id1][x], sep = ","),          # var data
        read.table(text = r[m[[x]]], header = TRUE)))     # iteration data

dat <- do.call(rbind, lst)

# Remove the var= text and convert to numeric
dat[] <- lapply(dat, function(x) as.numeric(gsub("var\\d+=", "", x)))
dat
#    V1 V2  V3 iteration  data
# 1   0 -5 1.8         0 1.203
# 2   0 -5 1.8         1 1.206
# 3   0 -5 1.8         2 2.206
# 4   0 -5 1.8         3 1.201
# 5   0 -5 1.8         4 1.204
# 6   0 -5 1.8         5 1.204
# 7   0 -5 1.8         6 1.204
# 8  10 -5 1.8         0 1.203
# 9  10 -5 1.8         1 1.206
# 10 10 -5 1.8         2 2.206
# 11 10 -5 1.8         3 1.201
```

Actually, it might be a little clearer to split the data into sections first and then apply a parsing function to each section, i.e.

```r
sp <- split(r, cumsum(grepl("measure", r, ignore.case = TRUE)))

# Function to parse one section
fun <- function(x){
  id1 = grep("var", x)
  id2 = grep("iteration", x)
  id3 = grep("statistics", x)
  m = seq(id2, id3 - 1)
  cbind(read.table(text = x[id1], sep = ","),
        read.table(text = x[m], header = TRUE))
}

lst <- lapply(sp, fun)
```

Then continue as before
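As a minimal sketch of that continuation, here is the combine-and-clean step applied to a toy stand-in for lst (the real one comes from lapply(sp, fun) above; the column layout below mimics what the cbind of the two read.table results produces):

```r
# Toy stand-in for the list of per-section data frames built above
lst <- list(
  data.frame(V1 = "var1=0",  V2 = " var2=-5", V3 = " var3=1.8",
             iteration = 0:1, data = c(1.203, 1.206)),
  data.frame(V1 = "var1=10", V2 = " var2=-5", V3 = " var3=1.8",
             iteration = 0:1, data = c(1.203, 1.206))
)

# Stack the sections, then strip the varN= prefixes and convert to numeric
dat <- do.call(rbind, lst)
dat[] <- lapply(dat, function(x) as.numeric(gsub("var\\d+=", "", x)))
dat
```

The gsub pattern var\\d+= strips the variable-name prefixes so every column can be converted to numeric in a single pass.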


Here is a pipeline that reads in the data and processes it. It assumes the input lines are in L, as set up in the Note at the end; for a real file you would create L with L <- readLines("myfile.dat") .

First trim leading and trailing spaces with trimws - this step may not be necessary, but it does no harm if the data has spaces at the start of lines. Then use grep to keep only the lines that begin with a digit or contain var , and use chartr to translate each of the characters v , a , r and = to a space and each comma to a newline. This puts the text in a form that read.table can read into a two-column data frame, whose first column holds 1, 2, 3 (from var1, var2, var3) followed by the iteration numbers, and whose second column holds the values of var1 , var2 , var3 and then data , all repeated for each group. We form the grouping variable by identifying successive runs with the expression cumsum(...) %/% 2 . This assumes at least 2 iterations (0 and 1) per group. (The data above satisfies this, but if not, it can be handled with additional code, as shown below.) Finally, split on the grouping variable and reshape each such group into the desired data frame.
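As a small illustrative aside (not part of the answer's pipeline), the chartr step on a single var line behaves like this:

```r
# chartr translates each character in `old` to the character at the same
# position in `new`: v, a, r and = become spaces, the comma becomes a newline
x <- "var1=0, var2=-5, var3=1.8"
y <- chartr(old = "var=,", new = "    \n", x)
cat(y, "\n")
#    1 0
#     2 -5
#     3 1.8
tab <- read.table(text = y)  # a two-column data frame: 1/2/3 and 0/-5/1.8
```

Note that old and new must be the same length, which is why new is four spaces followed by a newline.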

```r
library(purrr)  # also provides the %>% pipe

L %>%
  trimws %>%
  grep(pattern = "^\\d|var", value = TRUE) %>%
  chartr(old = "var=,", new = "    \n") %>%
  read.table(text = .) %>%
  split(cumsum(c(FALSE, diff(.$V1) != 1)) %/% 2) %>%
  map_df(function(x) data.frame(var1 = x[1, 2], var2 = x[2, 2], var3 = x[3, 2],
                                iteration = x[-(1:3), 1], data = x[-(1:3), 2]))
```

giving:

```
   var1 var2 var3 iteration  data
1     0   -5  1.8         0 1.203
2     0   -5  1.8         1 1.206
3     0   -5  1.8         2 2.206
4     0   -5  1.8         3 1.201
5     0   -5  1.8         4 1.204
6     0   -5  1.8         5 1.204
7     0   -5  1.8         6 1.204
8    10   -5  1.8         0 1.203
9    10   -5  1.8         1 1.206
10   10   -5  1.8         2 2.206
11   10   -5  1.8         3 1.201
```

Variation. This version of the code also handles the case where there is only one iteration (iteration 0), and it simplifies the grouping calculation at the cost of a few extra lines of code. The two instances of -9999 can be any number that does not appear in the data.
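To make the sentinel idea concrete, here is a tiny made-up example of splitting on -9999 markers (illustrative only; the vector below is not from the data):

```r
# Each -9999 marks the start of a section, so cumsum(x == -9999)
# assigns a constant group id to every element of that section
x <- c(-9999, 1, 0, 2, -5, -9999, 1, 10, 2, -5)
g <- split(x, cumsum(x == -9999))
length(g)        # 2 groups, one per marker
g[["2"]][-1]     # second section without its sentinel: 1 10 2 -5
```

Dropping the first element of each group removes the sentinel, which is exactly what x[-1] does inside the map_df call below.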

```r
L %>%
  grep(pattern = "^\\s*\\d|var", value = TRUE) %>%
  sub(pattern = "var", replacement = "-9999 var") %>%
  gsub(pattern = "[^0-9.,-]", replacement = " ") %>%
  gsub(pattern = ",", replacement = "\n") %>%
  strsplit("\\s+") %>%
  unlist %>%
  as.numeric %>%
  split(cumsum(. == -9999)) %>%
  map_df(function(x) {
    x <- t(matrix(x[-1], 2))
    data.frame(var1 = x[1, 2], var2 = x[2, 2], var3 = x[3, 2],
               iteration = x[-(1:3), 1], data = x[-(1:3), 2])
  })
```

dplyr / tidyr. We could alternatively use the dplyr and tidyr packages. vars has three columns, var1 , var2 and var3 , and one row per group. values also has one row per group but a single list-column; each entry of that column is a nested two-column data frame holding the iteration and data values for that group.

```r
library(tidyr)
library(dplyr)

vars <- L %>%
  grep(pattern = "var", value = TRUE) %>%
  gsub(pattern = "[=,]", replacement = " ") %>%
  read.table(text = ., col.names = c(NA, "var1", NA, "var2", NA, "var3")) %>%
  select(var1, var2, var3)

values <- L %>%
  trimws %>%
  grep(pattern = "^\\d", value = TRUE) %>%
  read.table(text = ., col.names = c("iteration", "data")) %>%
  mutate(g = cumsum(iteration == 0)) %>%
  nest(-g) %>%
  select(-g)

cbind(vars, values) %>% unnest
```

Note:

```r
Lines <- "Measurement: mc
Loop:
var1=0, var2=-5, var3=1.8
values:
iteration data
0 1.203
1 1.206
2 2.206
3 1.201
4 1.204
5 1.204
6 1.204
statistics:
max 1.206
min 1.201
mean 1.204
stddev 0.001
avgdev 0.001
failedtimes 0
Measurement: mc
Loop:
var1=10, var2=-5, var3=1.8
values:
iteration data
0 1.203
1 1.206
2 2.206
3 1.201
statistics:
max 1.206
min 1.201
mean 1.204
stddev 0.001
avgdev 0.001
failedtimes 0"

L <- readLines(textConnection(Lines))
```

Source: https://habr.com/ru/post/1263481/

