I need to read data from text files (many of them are very large), which usually look like this:
# 2013 # 3090050010 # CCOU # 01 # 022 # 1 # # N 03/03/2015
# 2013 # 3090050010 # CCOU # 01 # 023 # 1 ## 03/16/2015
# 2013 # 3090050010 # CCOU # # 005 02 # 1 # 1692528 # 03/16/2015
# 2013 # 3090430110 # CCOU # 15 # 504 # 2 # blablablablablablablablablablablablablab
labla # 10/10/2014
# 2013 # 3090430110 # CCOU # 15 # 505 # 2 ## 10/01/2014
So "#" is the separator, and a long record sometimes spans two lines. My current workaround is to ignore every line that doesn't start with "#", using grep:
x <- readLines("data.txt")
# Keep only lines that start with "#"; continuation lines are dropped here
y <- grep("^#", x)
app <- x[y]
NamesForCols <- c("..",...)
myDat <- read.table(text = app, header = FALSE, sep = "#", quote = "", col.names = NamesForCols,
                    colClasses = c("NULL", "factor", NA, NA, NA, NA, NA, "character", "NULL"),
                    fill = TRUE, blank.lines.skip = TRUE, comment.char = "", allowEscapes = TRUE)
But I am not happy with this solution, because it loses significant data: the second half of every wrapped record is thrown away. Is there a way to read data.txt so that each record contains exactly 8 "#" symbols, even when that means reading across two lines? Any other suggestions would be welcome. Thanks!
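To make the goal concrete, here is a rough sketch of the kind of thing I have in mind. It is not a tested solution: it assumes a continuation line never starts with "#" itself, and the sample vector `x` and the names `record_id`, `records` and `n_sep` are made up for illustration. The idea is to glue each continuation line back onto the record before it, then check the "#" count per record.

```r
# Made-up sample in the same format; the first record is wrapped onto two lines.
x <- c(
  "# 2013 # 3090430110 # CCOU # 15 # 504 # 2 # blablab",
  "labla # 10/10/2014",
  "# 2013 # 3090430110 # CCOU # 15 # 505 # 2 ## 10/01/2014"
)

# Assumption: a new record starts at every line beginning with "#".
# cumsum() then labels each line with the index of the record it belongs to.
record_id <- cumsum(grepl("^#", x))

# Paste the lines of each record back together (a space replaces the line break).
records <- unname(vapply(split(x, record_id), paste, character(1), collapse = " "))

# Count the "#" symbols in each rebuilt record; every record should have exactly 8.
n_sep <- lengths(regmatches(records, gregexpr("#", records, fixed = TRUE)))
stopifnot(all(n_sep == 8))

# The rebuilt records can then go through read.table as before.
myDat <- read.table(text = records, header = FALSE, sep = "#", quote = "",
                    fill = TRUE, comment.char = "")
```

With the real file, `x` would come from `readLines("data.txt")`. Records where `n_sep` is not 8 could be inspected separately instead of being dropped silently.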