Declaring a New Data Type for DNA

I work in biology, specifically with DNA, and the size of the data that comes out of genome sequencing is often a problem.

For those of you with no background in biology, here is a brief overview of a DNA sequence. DNA is written with four letters, A, T, G, and C, and their specific order determines what happens in the cell.

The main problem with DNA sequencing technology is the size of the data it produces (for a whole genome, often far more than gigabytes).

I know that the size of an int in C varies from computer to computer, but it still has far more room than is needed to store one of four options. Is there a way to define a type for a "base" that takes up only 2 or 3 bits? I looked into defining a struct, but I am afraid that is not what I am looking for. Thank you.

Also, would this work better in other languages (maybe a higher-level one like Java)?

+4
6 answers

Could you just stuff two ATGC sets into one byte? Like this:

0 1 0 1 1 0 0 1
A T G C A T G C

So this one byte will represent TC, AC?

+2

In Java, AFAIK, the smallest integer type is byte, which is 8 bits (-128 to 127).

, , , Java , , , , .

, , MySQL, , .

+1

, , 1,2,3,4 A, T, G, C. , , 5- 5 U. , . , /, Start Stop . , ..

dna, rna, , ++ Fortrain. ( 1,4 )

, , , ( N ).

, , , 64, . , -, .

+1
+1

char.

, C\++. , , 1 .

, , , , CPU .

0

, , .

- ints (, , -, , , ), /.

Each base then takes 2 bits of an int. Start with a pointer (a) into an array of ints:

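#include <ctype.h>
#include <limits.h>
#include <stdio.h>

#define MAX_WORDS 1024  /* placeholder for the original "large number" */

int main(void)
{
    unsigned int dna[MAX_WORDS];
    unsigned int *a = dna;
    /* bytes into bits, computed once instead of on every iteration */
    const int bits_per_int = (int)(sizeof(unsigned int) * CHAR_BIT);
    int i = bits_per_int;   /* unused bits left in the current int */
    int b;                  /* int, not char, so that EOF can be detected */

    /* 2-bit codes. The original gives only A and G; T and C are assumed. */
    const unsigned int da = ~(~0u << 2);        /* A = 11 */
    const unsigned int dg = ~(~0u << 1) << 1;   /* G = 10 */
    const unsigned int dt = 1u;                 /* T = 01 (assumed) */
    const unsigned int dc = 0u;                 /* C = 00 (assumed) */

    *a = 0;

    while ((b = getchar()) != EOF) {
        unsigned int code;

        if (isspace(b))
            continue;                   /* skip newlines in the input */
        else if (b == 'a' || b == 'A')
            code = da;
        else if (b == 't' || b == 'T')
            code = dt;
        else if (b == 'g' || b == 'G')
            code = dg;
        else if (b == 'c' || b == 'C')
            code = dc;
        else {
            fprintf(stderr, "unexpected character '%c'\n", b);
            return 1;
        }

        if (i == 0) {                   /* current int is full */
            if (a == dna + MAX_WORDS - 1) {
                fprintf(stderr, "sequence too long\n");
                return 1;
            }
            *++a = 0;                   /* advance to the next int */
            i = bits_per_int;
        }

        *a = (*a << 2) | code;          /* shift first, then add the new base */
        i -= 2;                         /* track the unused bits left in this int */
    }

    return 0;
}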

With a 32-bit int, each int holds 16 bases.

C. , , , , . , FORTRAN , , , - ( , ); : http://arstechnica.com/science/2014/05/scientific-computings-future-can-any-coding-language-top-a-1950s-behemoth/. , , , .

, , -: http://www.mathcs.emory.edu/~cheung/Courses/255/Syllabus/1-C-intro/bit-array.html

0

Source: https://habr.com/ru/post/1545932/

