Cut specific columns from multiple files and modify them with unix tools

Question

I have several hundred files in a folder. Each file is a tab-delimited text file with more than a million rows and 27 columns. From each file I want to extract only certain columns (for example, columns 1, 2, 11, 12 and 13); columns 3:10 and 14:27 can be ignored. I want to do this for all files in the folder (say 2300 files). The columns in each of the 2300 files look like this:

    Sample.ID     SNP.Name    col3  col10  Sample.Index  Allele1...Forward  Allele2...Forward  col14  ....col27
    1234567890_A  rs758676    -     -      1             T                  T                  -      ....col27
    1234567890_A  rs3916934   -     -      1             T                  T                  -      ....col27
    1234567890_A  rs2711935   -     -      1             T                  C                  -      ....col27
    1234567890_A  rs17126880  -     -      1             -                  -                  -      ....col27
    1234567890_A  rs12831433  -     -      1             T                  T                  -      ....col27
    1234567890_A  rs12797197  -     -      1             T                  C                  -      ....col27

The columns from the second file may look like this:

    Sample.ID     SNP.Name    col3  col10  Sample.Index  Allele1...Forward  Allele2...Forward  col14  ....col27
    1234567899_C  rs758676    -     -      100           T                  A                  -      ....col27
    1234567899_C  rs3916934   -     -      100           T                  T                  -      ....col27
    1234567899_C  rs2711935   -     -      100           T                  C                  -      ....col27
    1234567899_C  rs17126880  -     -      100           C                  G                  -      ....col27
    1234567899_C  rs12831433  -     -      100           T                  T                  -      ....col27
    1234567899_C  rs12797197  -     -      100           T                  C                  -      ....col27

The columns from the third file may look like this:

    Sample.ID     SNP.Name    col3  col10  Sample.Index  Allele1...Forward  Allele2...Forward  col14  ....col27
    1234567999_F  rs758676    -     -      256           A                  A                  -      ....col27
    1234567999_F  rs3916934   -     -      256           T                  T                  -      ....col27
    1234567999_F  rs2711935   -     -      256           T                  C                  -      ....col27
    1234567999_F  rs17126880  -     -      256           C                  G                  -      ....col27
    1234567999_F  rs12831433  -     -      256           T                  T                  -      ....col27
    1234567999_F  rs12797197  -     -      256           C                  C                  -      ....col27

The widths of Sample.ID and Sample.Index are the same within a file but can vary between files. The Sample.ID value is the same within a file but differs between files. Every file has the same values in the "SNP.Name" column. The Sample.Index value can sometimes be the same as in another file. The values of the other two columns (Allele1...Forward and Allele2...Forward) vary, and should be pasted together with a " " separator under each SNP.Name for each Sample.ID.
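For illustration only, the column-extraction step described above can be sketched with `cut`, which splits on tabs by default. This is a minimal sketch and not part of the original question: the function name `extract_cols` and the synthetic `c1..c13` row are assumptions.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): keep only columns
# 1, 2, 11, 12 and 13 of tab-delimited input read from stdin.
# cut uses TAB as its default field delimiter, so no -d option is needed.
extract_cols() {
  cut -f1,2,11,12,13
}

# Demonstration on one synthetic 13-column row.
printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tc12\tc13\n' | extract_cols
```

Applied per file, this might look like `for f in DNA*.txt; do extract_cols < "$f" > "${f%.txt}.cut"; done`, where the `DNA*.txt` glob comes from the answer below and the `.cut` suffix is an assumption. Extraction alone does not reshape the data; it only shrinks the input.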

Finally, I want to combine (tab-delimited) all the cut columns from the 2300 files into this format:

    Sample.Index  Sample.ID     rs758676  rs3916934  rs2711935  rs17126880  rs12831433  rs12797197
    1             1234567890_A  T T       T T        T C        0 0         T T         T C
    200           1234567899_C  T A       T T        T C        C G         T T         T C
    256           1234567999_F  A A       T T        T C        C G         T T         C C

In simple terms, I want to convert a long format to a wide format based on the Sample.ID column, similar to the reshape function in R. I tried this in R, but it is very slow and runs out of memory. Can anyone help with unix tools?

When reshape.sh was run on 20 files ... it produced a spurious row beginning with "Sample" in the output. Here are the first 4 fields.

    Sample.Index  Sample.ID  rs476542  rs7073746
    1234567891_A  11         C C       A G
    1234567892_A  191        T C       A G
    1234567893_A  204        T C       G G
    1234567894_A  15         T C       A G
    1234567895_A  158        T T       A A
    1234567896_A  208        T C       A A
    1234567897_A  111        T T       G G
    1234567898_A  137        T C       G G
    1234567899_A  216        T C       A G
    1234567900_A  113        T C       G G
    1234567901_A  152        T C       A G
    1234567902_A  178        C C       A A
    1234567903_A  135        C C       A A
    1234567904_A  125        T C       A A
    1234567905_A  194        C C       A A
    1234567906_A  110        C C       G G
    1234567907_A  126        C C       A A
    Sample        -
    1234567908_A  169        C C       G G
    1234567909_A  173        C C       G G
    1234567910_A  168        T C       A A

unix bash awk perl sed

user645600 Mar 18 '11 at 19:53

1 answer

Siegex · Accepted Answer · 2011-03-18T20:33:15+0000

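    #!/bin/bash

    awk '
    BEGIN {
      # Start column widths at the width of the header labels
      maxInd = length("Sample.Index")
      maxID  = length("Sample.ID")
    }
    # Skip the first 11 lines of every file and keep only SNP rows
    FNR>11 && $2 ~ "^rs" {
      SNP[$2]                    # set of all SNP names
      key[$11,$1]                # set of (Sample.Index, Sample.ID) pairs
      val[$2,$11,$1]=$12" "$13   # both alleles, separated by a space
      maxInd = (len=length($11)) > maxInd ? len : maxInd
      maxID  = (len=length($1))  > maxID  ? len : maxID
    }
    END {
      # Header row
      printf("%-*s\t%*s\t", maxInd, "Sample.Index", maxID, "Sample.ID")
      for (rs in SNP)
        printf("%s\t", rs)
      printf("\n")
      # One output row per sample
      for (pair in key) {
        split(pair,a,SUBSEP)
        printf("%-*s\t%*s\t", maxInd, a[1], maxID, a[2])
        for (rs in SNP) {
          ale = val[rs,a[1],a[2]]
          # Missing calls ("- -" or no record) are written as "0 0"
          out = ale == "- -" || ale == "" ? "0 0" : ale
          printf("%*s\t", length(rs), out)
        }
        printf("\n")
      }
    }' DNA*.txt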
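The `FNR>11` test in the script relies on `FNR` restarting at 1 for every input file, whereas `NR` keeps counting across files. A minimal sketch of that difference, using two throwaway two-line files; the function name `fnr_demo`, the file names, and the one-line header are assumptions made for the demo:

```shell
#!/bin/bash
# Hypothetical demo (names are assumptions): show that FNR resets per file.
fnr_demo() {
  local dir
  dir=$(mktemp -d) || return 1
  printf 'header A\nrow A\n' > "$dir/a.txt"
  printf 'header B\nrow B\n' > "$dir/b.txt"
  # FNR > 1 drops the first line of *each* file;
  # NR > 1 would drop only the first line of the first file.
  ( cd "$dir" && awk 'FNR > 1 { print FILENAME, "FNR=" FNR, "NR=" NR, $0 }' a.txt b.txt )
  rm -rf "$dir"
}

fnr_demo
```

Both data rows are printed with `FNR=2`, while `NR` is 2 for the first file and 4 for the second, which is why a per-file header skip has to use `FNR`.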

Proof of concept

    $ ./reshapeDNA
    Sample.Index  Sample.ID     rs2711935  rs10829026  rs3924674  rs2635442  rs715350  rs17126880  rs7037313  rs11983370  rs6424572  rs7055953  rs758676  rs7167305  rs12831433  rs2147587  rs12797197  rs3916934  rs11002902
    11            1234567890_A  T T        0 0         C C        0 0        0 0       T C         0 0        C C         T G        0 0        C C       0 0        T C         A G        T T         T C        G G
    111           1234567892_A  T T        T C         C C        0 0        0 0       C C         T C        C C         T T        0 0        C C       0 0        T T         A A        T T         T T        G G
    1             1234567894_A  T T        0 0         T C        C C        A G       C C         0 0        C C         0 0        T C        C C       T T        T T         A G        T T         C C        G G
    12            1234567893_A  T T        0 0         C C        T C        A A       T C         0 0        C C         0 0        T T        C C       T G        T C         A G        T T         T C        G G
    15            1234567891_A  T T        C C         C C        0 0        0 0       C C         C C        C C         T T        0 0        C C       0 0        T C         A G        T T         T T        G G
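The SNP columns and sample rows in the output above appear in no particular order because awk leaves the iteration order of `for (x in array)` unspecified. If a reproducible order matters, one option is to record names in the order they are first seen. The following is a sketch under stated assumptions, not the answer's script: it uses a reduced five-column layout (Sample.ID, SNP.Name, Sample.Index, Allele1, Allele2 in columns 1 to 5 instead of 1, 2, 11, 12, 13), drops the `FNR>11` header skip and the width padding, and the function name `reshape_ordered` is hypothetical.

```shell
#!/bin/bash
# Hypothetical variant for a reduced 5-column, tab-delimited layout:
#   1 = Sample.ID, 2 = SNP.Name, 3 = Sample.Index, 4 = Allele1, 5 = Allele2
# Reads the files given as arguments, or stdin when none are given.
reshape_ordered() {
  awk -F'\t' '
  $2 ~ /^rs/ {
    # Remember SNP names and (index, id) pairs in first-seen order
    if (!($2 in seen)) { seen[$2]; snp[++n] = $2 }
    pair = $3 SUBSEP $1
    if (!(pair in key)) { key[pair]; row[++m] = pair }
    val[$2, $3, $1] = $4 " " $5
  }
  END {
    printf "Sample.Index\tSample.ID"
    for (i = 1; i <= n; i++) printf "\t%s", snp[i]
    printf "\n"
    for (j = 1; j <= m; j++) {
      split(row[j], a, SUBSEP)
      printf "%s\t%s", a[1], a[2]
      for (i = 1; i <= n; i++) {
        ale = val[snp[i], a[1], a[2]]
        # Missing calls ("- -" or no record) are written as "0 0"
        out = (ale == "- -" || ale == "") ? "0 0" : ale
        printf "\t%s", out
      }
      printf "\n"
    }
  }' "$@"
}

# Demonstration on four synthetic rows covering two samples and two SNPs.
printf 'S1\trs1\t1\tT\tT\nS1\trs2\t1\t-\t-\nS2\trs1\t2\tA\tG\nS2\trs2\t2\tC\tC\n' |
  reshape_ordered
```

The extra `snp` and `row` arrays cost one more entry per SNP and per sample, which is small next to the `val` array that both versions keep in memory.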
