Select the first row of the matrix based on the value of the first, eighth and ninth columns using awk or sed

Question

I have many rows in which columns 1, 8, and 9 are identical. The file has more than 60K rows in total. I want to simplify it by keeping only the first row of each group in which the 1st, 8th, and 9th columns are the same.

Input file:

chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    861281  861490  3   101 117 1.29744744  762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    7870411 7870596 2   83  163 1.123385189 7796356 8921423 1.088752407
chr1    7879297 7879467 2   290 360 1.024742732 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr1    21013924    21014512    3   234 219 1.359224182 19536504    21054539    1.247494175
chr1    21016588    21016803    3   172 179 1.230421209 19536504    21054539    1.247494175
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20923634    20923828    3   188 201 1.206226522 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
chr14   20924787    20925701    2   314 306 1.305351797 20924141    21465086    1.088234038
chr14   20926636    20926836    2   134 136 1.206226522 20924141    21465086    1.088234038

Desired output:

chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038

I want to keep only one row for each group that shares the same column 1, column 8, and column 9. Ideally, I would just keep the first row each time these values change.

How can I achieve this in awk, sed or in R?

+4

grep awk r sed

vchris_ngs Apr 22 '15 at 14:34

2 answers

awk:

awk '!seen[$1,$8,$9]++' file

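This keeps track, in the array seen[], of every (field1, field8, field9) combination that has appeared so far. Once a combination has been seen, its value is 1 or greater, so !value is False and awk does not print the line.

The first time a combination occurs:

seen[$1,$8,$9] is 0 (strictly speaking, it is unset).
!0 is True, so the line is printed.
seen[$1,$8,$9] is then incremented.

On every later occurrence:

seen[$1,$8,$9] is 1 or greater.
!1 is False, so the line is not printed.
seen[$1,$8,$9] is incremented again.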
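
For readers who find the one-liner too terse, the same logic can be spelled out step by step. This is only a sketch: the function name first_per_key and the tiny inline sample are made up for illustration, and it assumes whitespace-separated input with at least nine columns.

```shell
# Hypothetical helper: keep the first line for each (col1, col8, col9) key.
# Reads the files given as arguments, or standard input if there are none.
first_per_key() {
    awk '
    {
        # Build the same composite index awk uses for seen[$1,$8,$9].
        key = $1 SUBSEP $8 SUBSEP $9
        if (!(key in seen)) {   # first occurrence of this key
            print
        }
        seen[key] = 1           # remember the key for later lines
    }' "$@"
}

# Usage: lines 1, 2 and 4 share the key (a, X, Y); line 3 has key (b, X, Y).
printf '%s\n' \
  'a 1 2 3 4 5 6 X Y first' \
  'a 1 2 3 4 5 6 X Y second' \
  'b 1 2 3 4 5 6 X Y third' \
  'a 9 9 9 9 9 9 X Y fourth' |
  first_per_key
```

With this sample, only the lines ending in "first" and "third" are printed, because the other two repeat a key that has already been seen.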

Test

$ awk '!seen[$1,$8,$9]++' a
chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
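
The question also describes the goal as keeping the first row "whenever there is a change". If rows with equal keys are always contiguous, as they are in the sample, a variant that only compares each key with the previous line's key gives the same result without holding every key in memory. The sketch below is an alternative to the seen[] approach, not part of the original answer. The function name first_on_change is hypothetical, and its behavior differs when a key reappears later in the file: it prints that row again.

```shell
# Hypothetical helper: print a line whenever its (col1, col8, col9) key
# differs from the key of the line directly before it.
first_on_change() {
    awk '
    {
        key = $1 FS $8 FS $9
        if (NR == 1 || key != prev) {   # first line, or the key just changed
            print
        }
        prev = key                      # remember the key for the next line
    }' "$@"
}

# Usage: the key (a, X, Y) comes back on the last line after (b, X, Y).
printf '%s\n' \
  'a 1 2 3 4 5 6 X Y first' \
  'a 1 2 3 4 5 6 X Y second' \
  'b 1 2 3 4 5 6 X Y third' \
  'a 1 2 3 4 5 6 X Y fourth' |
  first_on_change
```

Here the lines ending in "first", "third" and "fourth" are printed, whereas the seen[] one-liner would drop "fourth". Which behavior is wanted depends on whether repeated, non-adjacent keys should count as new groups.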

+5

fedorqui Apr 22 '15 at 14:35

Roland · Accepted Answer · 2015-04-22T14:46:30+0000

Import your data into R (for a real file, pass its path as the file argument instead of using text):

DF <- read.table(text = "chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    861281  861490  3   101 117 1.29744744  762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    7870411 7870596 2   83  163 1.123385189 7796356 8921423 1.088752407
chr1    7879297 7879467 2   290 360 1.024742732 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr1    21013924    21014512    3   234 219 1.359224182 19536504    21054539    1.247494175
chr1    21016588    21016803    3   172 179 1.230421209 19536504    21054539    1.247494175
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20923634    20923828    3   188 201 1.206226522 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
chr14   20924787    20925701    2   314 306 1.305351797 20924141    21465086    1.088234038
chr14   20926636    20926836    2   134 136 1.206226522 20924141    21465086    1.088234038", header = TRUE)

Then remove the rows that are duplicates with respect to columns 1, 8, and 9:

DF[!duplicated(DF[, c(1,8,9)]),]
#     chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start  CNV_end seg_mean
#1   chr1     762097   762270   3       821         717                       1.4566102    762097  6706109 1.297329
#3   chr1    7868860  7869039   2        78         119                       1.1233852   7796356  8921423 1.088752
#7   chr1   21012415 21012609   3        89         135                       1.2304212  19536504 21054539 1.247494
#11 chr14   20920169 20920704   3       211         214                       1.2542613  20840851 20923828 1.288877
#14 chr14   20924141 20924329   2       244         344                       0.9022995  20924141 21465086 1.088234
