Regular expression to convert raw text to data columns

I have raw output from a program that I want to convert to a data frame. The text file is unformatted, as shown below.

 10037    149439Special Event       11538.00       13542.59   2004.59
 10070     10071Weekday        8234.00        9244.87   1010.87
 10216     13463Weekend        145.00              0   -145.00

I can read the data in R using readLines() from the base package. How can I convert it to data that looks like this (the column names can be anything)?

 A        B         C              D              E          F
 10037    149439    Special Event  11538.00       13542.59   2004.59
 10070     10071    Weekday        8234.00         9244.87   1010.87
 10216     13463    Weekend        145.00                0   -145.00

What regular expression should I use to achieve this? I know this is a good fit for combining regexec() and regmatches(), but I can't come up with an expression that breaks each line into the right components.

+4
3 answers

One possibility:

raw <- readLines("filename.txt")
data.frame(do.call(rbind, strsplit(raw, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)))

#       X1     X2            X3       X4       X5      X6
# 1  10037 149439 Special Event 11538.00 13542.59 2004.59
# 2  10070  10071       Weekday  8234.00  9244.87 1010.87
# 3  10216  13463       Weekend   145.00        0 -145.00
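
The regular expression " {2,}|(?<=\\d)(?=[A-Z])" consists of two alternatives separated by "|" (logical or).

  • " {2,}" matches two or more consecutive spaces. Splitting there separates the columns, while a single space such as the one in "Special Event" is left alone.
  • "(?<=\\d)(?=[A-Z])" matches the position that is preceded by a digit and followed by an uppercase letter. It is a zero-width match, so no characters are removed. This splits the second number from the text fused to it.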
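
To see what each alternative contributes, the split can be tried on a single line with and without the zero-width part. This is a minimal sketch that inlines the sample lines from the question instead of reading a file. The names `by_spaces`, `by_both`, and `df` are illustrative, and the `trimws()` and `type.convert()` steps are assumptions added here, not part of the answer.

```r
# Sample lines from the question, inlined so the sketch runs without a file.
raw <- c(
  " 10037    149439Special Event       11538.00       13542.59   2004.59",
  " 10070     10071Weekday        8234.00        9244.87   1010.87",
  " 10216     13463Weekend        145.00              0   -145.00"
)

pattern <- " {2,}|(?<=\\d)(?=[A-Z])"

# Runs of spaces alone: the second number stays fused to the label.
by_spaces <- strsplit(trimws(raw[1]), " {2,}")[[1]]

# Both alternatives: the zero-width match splits the number from the label.
by_both <- strsplit(trimws(raw[1]), pattern, perl = TRUE)[[1]]

# Assumed extra steps: trim the leading space, then let type.convert()
# turn the number-like columns into integer or double vectors.
parts <- strsplit(trimws(raw), pattern, perl = TRUE)
df <- data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(df) <- LETTERS[seq_along(df)]
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

Comparing `by_spaces` with `by_both` shows that only the second alternative separates "149439" from "Special Event".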
+5
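
Here the data is saved in a file called "txt.txt". This approach extracts the numeric fields and the text column separately.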

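> read <- readLines("txt.txt")
> # Split on letters and whitespace, which leaves only the numeric fields
> S <- strsplit(read, "[A-Za-z]|\\s")
> W <- do.call(rbind, lapply(S, function(x) x[nzchar(x)]))
> # Text column (assumed definition): keep only the letters, which also drops the space
> col <- gsub("[^A-Za-z]", "", read)
> D <- data.frame(W[,1:2], col, W[,3:5])
> names(D) <- LETTERS[seq(D)]
> D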
##       A      B            C        D        E       F
## 1 10037 149439 SpecialEvent 11538.00 13542.59 2004.59
## 2 10070  10071      Weekday  8234.00  9244.87 1010.87
## 3 10216  13463      Weekend   145.00        0 -145.00

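Note that every column comes out as text, so the numeric columns still need to be converted.

PS. The space between "Special" and "Event" is lost, because whitespace is one of the split delimiters.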
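
The output above shows "SpecialEvent" without its space. One possible variation that keeps it is sketched below. It inlines the sample lines from the question rather than reading "txt.txt", and the names `label` and `nums` are hypothetical, not part of the answer.

```r
# Sample lines from the question, inlined so the sketch runs without a file.
read <- c(
  " 10037    149439Special Event       11538.00       13542.59   2004.59",
  " 10070     10071Weekday        8234.00        9244.87   1010.87",
  " 10216     13463Weekend        145.00              0   -145.00"
)

# Text column: keep letters and spaces, then trim the surrounding padding.
label <- trimws(gsub("[^A-Za-z ]", "", read))

# Numeric fields: blank out the letters, then split on runs of whitespace.
nums <- strsplit(trimws(gsub("[A-Za-z]", " ", read)), "\\s+")
W <- do.call(rbind, nums)

D <- data.frame(W[, 1:2], label, W[, 3:5], stringsAsFactors = FALSE)
names(D) <- LETTERS[seq_along(D)]
D
```

This relies on the label containing only letters and single spaces, which holds for the three sample lines but is an assumption about the rest of the file.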

+3

Something like this at least works on your example, but I don't know all of your corner cases ...

([0-9]+) +([0-9]+)(.+) ([0-9.-]+) +([0-9.-]+) +([0-9.-]+)

Capture groups 1 to 6 correspond to your columns A to F, respectively.
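
Since the question asks about regexec() and regmatches(), here is a minimal sketch of how this pattern could be applied with them. The sample lines are inlined from the question. The `perl = TRUE` argument, the `trimws()` call (group 3 keeps the padding after the label), and the `type.convert()` step are assumptions added for illustration.

```r
# Sample lines from the question, inlined so the sketch runs without a file.
raw <- c(
  " 10037    149439Special Event       11538.00       13542.59   2004.59",
  " 10070     10071Weekday        8234.00        9244.87   1010.87",
  " 10216     13463Weekend        145.00              0   -145.00"
)

pattern <- "([0-9]+) +([0-9]+)(.+) ([0-9.-]+) +([0-9.-]+) +([0-9.-]+)"

# regexec() locates the match and its capture groups; regmatches() extracts them.
m <- regmatches(raw, regexec(pattern, raw, perl = TRUE))

# Each element is the full match followed by the six groups; drop the full match.
fields <- do.call(rbind, lapply(m, function(x) x[-1]))

# Group 3 captures the padding after the label, so trim every field.
fields[] <- trimws(fields)

df <- data.frame(fields, stringsAsFactors = FALSE)
names(df) <- LETTERS[seq_along(df)]
df[] <- lapply(df, type.convert, as.is = TRUE)
df
```

A line that does not match produces an empty element in `m` and is silently dropped by `rbind()`, so checking `lengths(m) == 7` before binding is a reasonable safeguard on real input.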

0

Source: https://habr.com/ru/post/1538095/

