How to use variables to specify a character range in a cut command

I have a file with 2 columns and I want to use the values from the second column to set the range in the cut command, in order to select a range of characters from another file. The range I want is the character at the position given in the second column plus the next 10 characters. An example follows.

My files are as follows:

A file with 2 columns and no empty lines between the rows (file1.txt):

    NAME1 10
    NAME2 25
    NAME3 48
    NAME4 66

The file I want to extract the character ranges from (file2.txt). The real file is a single very long line with no spaces and no bold; the spacing below only marks the ranges of interest:

GATCGAGCGG GATTCTTTTT TTTTA GGCGAGTCAG CTAGCATCAGCTA CGAGAGGCGA GGGCGGGC TATCACGACT ACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC

... or, more literally (for copy / paste for verification):

 GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC 

Desired result file, one sequence per line (result.txt):

    GATTCTTTTT
    GGCGAGTCAG
    CGAGAGGCGA
    TATCACGACT

As a result, the file will contain the characters at 10-20, 25-35, 48-58 and 66-76, each range on a new line. The length is always 10, but the start points differ, and those start points are the values in the second column of the first file.

I tried the command:

    for i in $(awk '{print $2}' file1.txt); do
        p1=$i;
        p2=`expr "$1" + 10`
        cut -c$p1-$2 file2.txt > result.txt;
    done

I get no output and no error message.

I also tried:

    while read line; do
        set $line
        p2=`expr "$2" + 10`
        cut -c$2-$p2 file2.txt > result.txt;
    done <file1.txt

This last command gives me an error message:

    cut: invalid range with no endpoint: -
    Try 'cut --help' for more information.
    expr: non-integer argument
4 answers

There is no need for cut; dd can seek to an offset in a file and read only the number of bytes required. (Note that status=none is a GNUism; on other platforms you may need to leave it off and redirect stderr instead if you want to suppress the informational output.)

    while read -r name index _; do
        dd if=file2.txt bs=1 skip="$index" count=10 status=none
        printf '\n'
    done <file1.txt >result.txt

This approach avoids the memory cost of reading all of file2.txt (which matters if the file is large), and its overhead is bounded: one dd process is started for each sequence to be extracted.
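Where dd does not accept status=none, the same loop can be sketched by discarding dd's stderr instead. This is a minimal sketch, assuming the sample data from the question; the scratch directory and the inline sample files are additions made only so the snippet runs on its own.

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clobber real data.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample inputs from the question. printf '%s' repeats its format for each
# argument, so the segments are concatenated without separators.
printf '%s\n' 'NAME1 10' 'NAME2 25' 'NAME3 48' 'NAME4 66' > file1.txt
printf '%s' GATCGAGCGG GATTCTTTTT TTTTA GGCGAGTCAG CTAGCATCAGCTA \
            CGAGAGGCGA GGGCGGGC TATCACGACT > file2.txt
printf '\n' >> file2.txt

while read -r name index _; do
    # With bs=1, skip= counts bytes, so skip=10 starts reading at the 11th byte.
    # dd writes its transfer statistics to stderr; 2>/dev/null hides them.
    dd if=file2.txt bs=1 skip="$index" count=10 2>/dev/null
    printf '\n'
done < file1.txt > result.txt

cat result.txt
```

Note that redirecting stderr also hides genuine dd errors, such as a missing input file, so it is a trade-off rather than an exact equivalent of status=none.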


Using awk

    $ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
    GATTCTTTTT
    GGCGAGTCAG
    CGAGAGGCGA
    TATCACGACT
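For readers less familiar with the FNR==NR idiom, the same awk program can be spelled out with comments. This is a sketch only: the scratch directory, the inline sample files, and the variable name sequence are assumptions for illustration, and the file names follow the question (file1.txt, file2.txt) rather than the command above.

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clobber real data.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample inputs from the question; the segments are concatenated by printf.
printf '%s\n' 'NAME1 10' 'NAME2 25' 'NAME3 48' 'NAME4 66' > file1.txt
printf '%s' GATCGAGCGG GATTCTTTTT TTTTA GGCGAGTCAG CTAGCATCAGCTA \
            CGAGAGGCGA GGGCGGGC TATCACGACT > file2.txt
printf '\n' >> file2.txt

awk '
    # FNR == NR is true only while the first file (file2.txt) is being read.
    # Store its single line and skip the second rule.
    FNR == NR { sequence = $0; next }

    # For each line of file1.txt, $2 is the zero-based start offset.
    # substr() counts from 1, hence the + 1.
    { print substr(sequence, $2 + 1, 10) }
' file2.txt file1.txt > result.txt

cat result.txt
```

Like the Bash substring approach, this holds the whole sequence in memory, but it needs only a single awk process for all ranges.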

If file2.txt is not too large, you can read it into memory and use Bash substring expansion to extract the desired ranges:

    data=$(<file2.txt)
    while read -r name index _; do
        echo "${data:$index:10}"
    done <file1.txt >result.txt

This will be much more efficient than running cut or another external process for each range.
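The key detail is that the offset in ${var:offset:length} is zero-based, so an index of 10 starts at the 11th character. A minimal illustration, using only the first 20 characters of the sequence from the question (the shortened string is an assumption made to keep the example small):

```shell
#!/bin/bash
# ${var:offset:length} is Bash substring expansion; it is not available in plain sh.
data='GATCGAGCGGGATTCTTTTT'   # first 20 characters of the sample sequence
index=10

echo "${data:$index:10}"      # 10 characters starting at zero-based offset 10
echo "${data:0:3}"            # the first three characters
```

With the sample string, the first echo prints GATTCTTTTT and the second prints GAT.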

(Thanks to @CharlesDuffy for the tips about reading the data without a useless cat and about the while loop.)


One way to solve it:

    #!/bin/bash
    while read line; do
        pos=$(echo "$line" | cut -f2 -d' ')
        x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
        echo "$x"
    done < file1.txt > result.txt

This is not the solution an experienced bash hacker would use, but it works well for someone who is not familiar with bash. It uses tools that are very versatile, although somewhat slow if you need high performance. Shell scripts are often written by people who script only occasionally, know a few commands, and just want to get the job done. That is why I include this solution, even if other answers are better for more experienced people.

The first line of the loop is pretty simple: it extracts the number from each line of file1.txt. The second line uses the very handy head and tail tools. They are usually used with lines rather than characters. Here, however, head prints the first pos + 10 characters, and the result is piped to tail, which prints the last 10 of them.
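The head and tail step can be tried in isolation on a short string. This is a small sketch, assuming a head and tail that support -c (byte counts) and using the first 25 characters of the sample sequence as stand-in input:

```shell
#!/bin/sh
pos=10

# head -c 20 keeps the first pos + 10 bytes; tail -c 10 then keeps the
# last 10 of those, i.e. bytes pos+1 through pos+10 of the input.
x=$(printf '%s\n' 'GATCGAGCGGGATTCTTTTTTTTTA' | head -c $((pos + 10)) | tail -c 10)
echo "$x"
```

For pos=10 this prints GATTCTTTTT, the first line of the desired result.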

Thanks to @CharlesDuffy for the improvement.


Source: https://habr.com/ru/post/1273182/

