I have several hundred files in a folder. Each of these files is a tab-delimited text file containing more than a million rows and 27 columns. From each file I want to extract only certain columns (for example, columns 1, 2, 11, 12 and 13); columns 3 to 10 and 14 to 27 can be ignored. I want to do this for all files in a folder (say 2300 files). The columns in each of the 2300 files are as follows:
Sample.ID     SNP.Name    col3  col10  Sample.Index  Allele1...Forward  Allele2...Forward  col14  ....col27
1234567890_A  rs758676    -     -      1             T                  T                  -      ....col27
1234567890_A  rs3916934   -     -      1             T                  T                  -      ....col27
1234567890_A  rs2711935   -     -      1             T                  C                  -      ....col27
1234567890_A  rs17126880  -     -      1             -                  -                  -      ....col27
1234567890_A  rs12831433  -     -      1             T                  T                  -      ....col27
1234567890_A  rs12797197  -     -      1             T                  C                  -      ....col27
The columns from the second file may look like this:
Sample.ID     SNP.Name    col3  col10  Sample.Index  Allele1...Forward  Allele2...Forward  col14  ....col27
1234567899_C  rs758676    -     -      100           T                  A                  -      ....col27
1234567899_C  rs3916934   -     -      100           T                  T                  -      ....col27
1234567899_C  rs2711935   -     -      100           T                  C                  -      ....col27
1234567899_C  rs17126880  -     -      100           C                  G                  -      ....col27
1234567899_C  rs12831433  -     -      100           T                  T                  -      ....col27
1234567899_C  rs12797197  -     -      100           T                  C                  -      ....col27
The columns from the third file may look like this:
Sample.ID     SNP.Name    col3  col10  Sample.Index  Allele1...Forward  Allele2...Forward  col14  ....col27
1234567999_F  rs758676    -     -      256           A                  A                  -      ....col27
1234567999_F  rs3916934   -     -      256           T                  T                  -      ....col27
1234567999_F  rs2711935   -     -      256           T                  C                  -      ....col27
1234567999_F  rs17126880  -     -      256           C                  G                  -      ....col27
1234567999_F  rs12831433  -     -      256           T                  T                  -      ....col27
1234567999_F  rs12797197  -     -      256           C                  C                  -      ....col27
Within each file the values of Sample.ID and Sample.Index are constant, but they can differ between files. The Sample.ID value is the same throughout a file and different from that of every other file. Every file has the same values in the "SNP.Name" column. The Sample.Index value can sometimes be the same as in another file. The values of the other two columns (Allele1...Forward and Allele2...Forward) vary, and should be pasted together with "" as the separator under each SNP.Name for each Sample.ID.
Finally, I want to combine all the cut columns from the 2300 files into this tab-delimited format:
Sample.Index  Sample.ID     rs758676  rs3916934  rs2711935  rs17126880  rs12831433  rs12797197
1             1234567890_A  TT        TT         TC         0 0         TT          TC
200           1234567899_C  TA        TT         TC         CG          TT          TC
256           1234567999_F  AA        TT         TC         CG          TT          CC
In simple terms, I want to convert a long format to a wide format based on the Sample.ID column. This is similar to the reshape function in R. I tried this with R, but it ran out of memory and was very slow. Can anyone help with Unix tools?
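For illustration, the long-to-wide conversion described above can be sketched as a streaming pass over each file, so that only one file's genotypes are held in memory at a time. This is a Python sketch rather than a shell one, and it rests on several assumptions not stated in the question: every file starts with exactly one header line, the wanted fields sit at positions 1, 2, 11, 12 and 13, each file holds a single sample, and the SNP column order is taken from the first file. The function names `file_to_wide_row` and `reshape` are hypothetical. Missing alleles ("-") are concatenated as they are; mapping them to 0 as in the desired output is left out.

```python
# 0-based positions of columns 1, 2, 11, 12 and 13 from the question (assumed layout)
ID_COL, SNP_COL, INDEX_COL, A1_COL, A2_COL = 0, 1, 10, 11, 12


def file_to_wide_row(path):
    """Stream one long-format file; return (sample_index, sample_id, {snp: genotype})."""
    genotypes = {}
    sample_id = sample_index = None
    with open(path) as handle:
        next(handle, None)  # skip the single header line (assumption)
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) <= A2_COL:
                continue  # ignore blank or truncated lines
            sample_id, sample_index = fields[ID_COL], fields[INDEX_COL]
            # Paste the two allele columns together with no separator
            genotypes[fields[SNP_COL]] = fields[A1_COL] + fields[A2_COL]
    return sample_index, sample_id, genotypes


def reshape(paths, out):
    """Write one tab-delimited wide row per input file to the open file object `out`."""
    snp_order = None
    for path in paths:
        sample_index, sample_id, genotypes = file_to_wide_row(path)
        if sample_id is None:
            continue  # file contained no data rows
        if snp_order is None:
            # Column order comes from the first file that has data
            snp_order = list(genotypes)
            out.write("\t".join(["Sample.Index", "Sample.ID"] + snp_order) + "\n")
        row = [sample_index, sample_id] + [genotypes.get(snp, "") for snp in snp_order]
        out.write("\t".join(row) + "\n")
```

It could be called as `reshape(sorted(glob.glob("*.txt")), sys.stdout)` after importing `glob` and `sys`; the `*.txt` pattern is only a placeholder for wherever the files actually live.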
When reshape.sh was applied to 20 files ... it produced a spurious "Samples" line in the output. Here are the first 4 fields.
Sample.Index  Sample.ID  rs476542  rs7073746
1234567891_A  11         CC        AG
1234567892_A  191        TC        AG
1234567893_A  204        TC        GG
1234567894_A  15         TC        AG
1234567895_A  158        TT        AA
1234567896_A  208        TC        AA
1234567897_A  111        TT        GG
1234567898_A  137        TC        GG
1234567899_A  216        TC        AG
1234567900_A  113        TC        GG
1234567901_A  152        TC        AG
1234567902_A  178        CC        AA
1234567903_A  135        CC        AA
1234567904_A  125        TC        AA
1234567905_A  194        CC        AA
1234567906_A  110        CC        GG
1234567907_A  126        CC        AA
Sample        -
1234567908_A  169        CC        GG
1234567909_A  173        CC        GG
1234567910_A  168        TC        AA