Is preprocessing the file with awk required, or can it be done directly in R?

I have been preprocessing a CSV file with awk. Here is my first script:

tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2} {if($2!=old){print $0; old=$2;}}' | less

This script looks for runs of repeated values in the second column (the value in row n is the same as in rows n + 1, n + 2, ...) and prints only the first occurrence of each run. For example, given the following input:

ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

the output will be:

1,0,0,1.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

EDIT: I made it a bit more complicated by adding a second script:

The second script does the same, but prints the last occurrence of each run of duplicates:

tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2; line=$0} {if($2==old){line=$0}else{print line; old=$2; line=$0}} END {print $0}' | less

The output will be:

22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

I believe R is a powerful language that should be able to handle tasks like this, but I only found questions about calling awk scripts from R and the like. How can I do this in R?


For the first part of your question, with the improvement suggested by @nicola:

Idx.first <- c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)])
##
R> tbl[Idx.first,]
#    ord orig pred as o.p
# 1    1    0    0  1   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0

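For the second part, which keeps the last occurrence instead of the first, move the TRUE to the end, again as suggested by @nicola: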

Idx.last <- c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE)
##
R> tbl[Idx.last,]
#    ord orig pred as o.p
# 22  22    0    0  0   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0

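tbl$orig[-1] != tbl$orig[-nrow(tbl)] compares elements 2 through n of column 2 with elements 1 through n-1 of column 2. The result is a logical vector in which TRUE marks a change between consecutive values. Since that vector has length n-1, it is padded with a TRUE at the beginning (case 1) or a TRUE at the end (case 2).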
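To see the shifted comparison on something smaller, here is a minimal sketch on a made-up vector x (the vector and the names changed, first and last are illustrative only, not part of the original data):

```r
# Hypothetical toy vector with runs of equal values
x <- c(0, 0, 4, 4, 4, 7, 0)
n <- length(x)

# Compare elements 2..n with elements 1..(n-1); TRUE marks a change
changed <- x[-1] != x[-n]

# Pad with TRUE at the front to keep the first element of each run
first <- c(TRUE, changed)
# Pad with TRUE at the back to keep the last element of each run
last <- c(changed, TRUE)

which(first)  # 1 3 6 7
which(last)   # 2 5 6 7
```

Both index vectors have length n again, so they can be used directly to subset the rows of a data frame.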


Data:

tbl <- read.table(text = "ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0",
header = TRUE,
sep = ",")
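If the goal is to skip the awk step entirely, the same idea can be applied to a file read with read.csv. The following is only a sketch: run_edges is a hypothetical helper name, and a small temporary file stands in for shifted_final.csv so that the code runs on its own.

```r
# Hypothetical helper: keep the first or last row of each run of equal
# consecutive values in one column.
run_edges <- function(df, col, keep = c("first", "last")) {
  keep <- match.arg(keep)
  v <- df[[col]]
  if (length(v) < 2) return(df)
  changed <- v[-1] != v[-length(v)]
  idx <- if (keep == "first") c(TRUE, changed) else c(changed, TRUE)
  df[idx, , drop = FALSE]
}

# Stand-in for shifted_final.csv (assumed to have the same header row)
path <- tempfile(fileext = ".csv")
writeLines(c("ord,orig,pred,as,o-p",
             "1,0,0,1.0,0",
             "2,0,0,1.0,0",
             "3,4,0,0.0,4",
             "4,4,0,0.0,4",
             "5,0,0,1.0,0"), path)

tbl <- read.csv(path)  # the header is consumed here, so no tail -n +2 is needed
run_edges(tbl, "orig", "first")$ord  # 1 3 5
run_edges(tbl, "orig", "last")$ord   # 2 4 5
```

Note that read.csv renames the o-p column to o.p by default, as in the outputs above.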

Another way, based on run lengths (thanks to @nrussell for the suggestion):

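idx <- c(1, head(cumsum(rle(tbl[,2])[[1]]), -1) + 1)
tbl[idx,]
#   ord orig pred as o.p
#1    1    0    0  1   0
#23  23    4    0  0   4
#24  24  402    0  1 402
#25  25    0    0  1   0

This takes the first row of each run of orig.

  • rle(tbl[,2])[[1]] computes the length of each run of consecutive equal values in orig
  • cumsum(...) turns those run lengths into the index of the last row of each run
  • finally, c(1, head(cumsum(...), -1) + 1) drops the end of the final run and adds 1 to the others, which gives the first row of every run after the first; the leading 1 is the start of the first run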
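As a small illustration of how run lengths map to row indices, here is a sketch on a made-up vector (x, runs, ends and starts are illustrative names). The run ends correspond to the "last occurrence" case of the second awk script, and the run starts to the "first occurrence" case.

```r
# Hypothetical toy vector; the second run has length 3
x <- c(0, 0, 4, 4, 4, 7, 0)

runs <- rle(x)
ends <- cumsum(runs$lengths)        # last index of each run
starts <- ends - runs$lengths + 1   # first index of each run

starts  # 1 3 6 7
ends    # 2 5 6 7
```

Using a run longer than one element after the first run makes the difference between the start and the end of a run visible, which the sample data above cannot show.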

Source: https://habr.com/ru/post/1616107/

