Space-delimited file with spaces in the text, limited to two columns

For whatever reason, the data is provided in the following format:

0001 This is text for 0001
0002 This has spaces in between
0003 Yet this is only supposed to be two columns
0009 Why didn't they just comma delimit you may ask?
0010 Or even use quotations?
001  Who knows
0012 But now I'm here with his file
0013 And hoping someone has an elegant solution?

So it should be two columns. What I would like to have is one column for the leading identifiers, i.e. 0001, 0002, 0003, 0009, 0010, 001, 0012, 0013, and another column for everything else.

+4
4 answers

I would recommend the function input.file from the "iotools" package.

Usage will be something like this:

library(iotools)
input.file("yourfile.txt", formatter = dstrsplit, nsep = " ", col_types = "character")

Here is an example. (I just created a dummy temporary file in my workspace for the purpose of illustration).

x <- tempfile()
writeLines(c("0001 This is text for 0001",
             "0002 This has spaces in between",
             "0003 Yet this is only supposed to be two columns",
             "0009 Why didn't they just comma delimit you may ask?",
             "0010 Or even use quotations?",
             "001  Who knows",
             "0012 But now I'm here with his file",
             "0013 And hoping someone has an elegant solution?"), con = x)

library(iotools)
input.file(x, formatter = dstrsplit, nsep = " ", col_types = "character")
#   rowindex                                              V1
# 1     0001                           This is text for 0001
# 2     0002                      This has spaces in between
# 3     0003     Yet this is only supposed to be two columns
# 4     0009 Why didn't they just comma delimit you may ask?
# 5     0010                         Or even use quotations?
# 6      001                                       Who knows
# 7     0012                  But now I'm here with his file
# 8     0013     And hoping someone has an elegant solution?

Elegant enough? ;-)


Update 1

If the data is already in a data.frame (as in @Jaap's answer), "iotools" can still be used, without going through input.file.

In that case, try:

dstrsplit(as.character(mydf$V1), nsep = " ", col_types = "character")

Update 2

The approaches from Jaap and akrun are compared with the "iotools" approach in this Gist.

+3

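You can use separate from the tidyr package. With extra = "merge", the string is split only at the first separator and everything after it is kept together in the last column: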

library(tidyr)
separate(mydf, V1, c("nr","text"), sep = " ", extra = "merge")
# or:
mydf %>% separate(V1, c("nr","text"), sep = " ", extra = "merge")

which gives:

    nr                                           text
1 0001                          This is text for 0001
2 0002                     This has spaces in between
3 0003    Yet this is only supposed to be two columns
4 0009 Why didnt they just comma delimit you may ask?
5 0010                        Or even use quotations?
6  001                                      Who knows
7 0012                  But now Im here with his file
8 0013    And hoping someone has an elegant solution?

Data used:

mydf <- structure(list(V1 = structure(c(1L, 2L, 3L, 4L, 6L, 5L, 7L, 8L), 
                                      .Label = c("0001 This is text for 0001", "0002 This has spaces in between",
                                                 "0003 Yet this is only supposed to be two columns", "0009 Why didnt they just comma delimit you may ask?", 
                                                 "001  Who knows", "0010 Or even use quotations?", "0012 But now Im here with his file", "0013 And hoping someone has an elegant solution?"), class = "factor")), 
              .Names = "V1", class = "data.frame", row.names = c(NA,-8L))
+5

For a single line stored in x, the following works (for several lines, apply it with lapply):

unlist(strsplit(gsub("([0-9]{1,}) ","\\1~",x), "~" ))

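Explanation: gsub replaces the space that follows a run of digits with a tilde. The digits are captured as a group (the parentheses) and put back with \\1. [0-9] matches a single digit, and {1,} means one or more of them. Once the separator is a character that does not occur in the text (here a tilde; pick another one if the text may contain it), strsplit splits the string on it.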
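As a sketch of the same capture-group idea applied to a whole character vector, the base R snippet below uses sub rather than gsub, so only the first run of spaces is treated as the separator and digits later in the text are left alone. The vector lines, the pattern, and the column names nr and text are illustrative assumptions, not part of the original answer.

```r
# Hypothetical input: a few of the sample lines as a character vector.
lines <- c("0001 This is text for 0001",
           "001  Who knows",
           "0010 Or even use quotations?")

# Group 1: everything up to the first space.
# Group 2: everything after the first run of spaces.
pattern <- "^([^ ]+) +(.*)$"

# sub() is vectorised, so each call returns one value per input line.
res <- data.frame(nr   = sub(pattern, "\\1", lines),
                  text = sub(pattern, "\\2", lines),
                  stringsAsFactors = FALSE)
res
```

Because the pattern consumes the whole run of spaces, the line "001  Who knows" (two spaces) yields "Who knows" without a leading blank.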

0

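We can use tstrsplit from data.table. We convert the "data.frame" to a "data.table" (setDT(mydf)) and apply tstrsplit to the "V1" column, splitting on whitespace that is preceded by a digit (regex lookaround).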

library(data.table)
res <- setDT(mydf)[, tstrsplit(V1, "(?<=\\d)\\s+", perl=TRUE)]
res
#     V1                                             V2
#1: 0001                          This is text for 0001
#2: 0002                     This has spaces in between
#3: 0003    Yet this is only supposed to be two columns
#4: 0009 Why didnt they just comma delimit you may ask?
#5: 0010                        Or even use quotations?
#6:  001                                      Who knows
#7: 0012                  But now Im here with his file
#8: 0013    And hoping someone has an elegant solution?

The column names can be changed with setnames if necessary:

setnames(res, c("nr", "text"))
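To see what the lookbehind pattern does on its own, it can be tried with base R's strsplit. This is only a sketch: the vector lines is made up for illustration, and its third element is a hypothetical line that does not occur in the question's data.

```r
# Hypothetical input; the third line has a number inside the text.
lines <- c("0001 This is text for 0001",
           "001  Who knows",
           "0014 Room 12 is free")

# Split on whitespace that is preceded by a digit (lookbehind needs perl = TRUE).
parts <- strsplit(lines, "(?<=\\d)\\s+", perl = TRUE)

parts[[1]]  # "0001" "This is text for 0001"
parts[[2]]  # "001"  "Who knows"
parts[[3]]  # "0014" "Room 12" "is free"
```

The pattern splits after every number that is followed by whitespace, not only after the first one, so text containing numbers would produce more than two pieces. The sample data avoids this because its only other number sits at the very end of a line.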
0

Source: https://habr.com/ru/post/1624594/

