How much memory is needed to store the human genome?

I am looking for the amount of memory in bytes (MB, GB, TB, etc.) needed to store one human genome. I read several Wikipedia articles about DNA, chromosomes, base pairs, genes, and I have an approximate guess, but before revealing anything, I would like to see how others approach this issue.

An alternative question is how many atoms are contained in human DNA, but this will be off topic for this site.

I understand that this will be an approximate value, so I am looking for the minimum value that could store the DNA of any person.

+69
bioinformatics storage genetics dna-sequence
Jan 21 '12 at 16:22
12 answers

If you trust such things, here is what Wikipedia claims (from http://en.wikipedia.org/wiki/Human_genome#Information_content ):

2.9 billion base pairs of the human haploid genome correspond to a maximum of about 725 megabytes of data, since each base pair can be encoded in 2 bits. Since individual genomes vary by less than 1% from each other, they can be compressed without loss to about 4 megabytes.
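As a rough check of the quoted figure, the arithmetic can be sketched as follows. The function name `genome_bytes` is made up for illustration, and the 2.9 billion count is simply the number from the quote, not an independent measurement.

```python
# Back-of-the-envelope check of the 2-bit estimate quoted above.
BASE_PAIRS = 2_900_000_000   # haploid genome size used in the quote
BITS_PER_BASE = 2            # 4 symbols (A, C, G, T) fit in 2 bits


def genome_bytes(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Return the number of bytes needed, rounding up to a whole byte."""
    total_bits = base_pairs * bits_per_base
    return (total_bits + 7) // 8


size = genome_bytes(BASE_PAIRS)
print(size)              # size in bytes
print(size / 1_000_000)  # size in decimal megabytes
```

This covers only the uncompressed 2-bit figure; the 4 MB figure depends on storing differences against a reference genome and is not modelled here.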

+52
Jan 21 '12 at 16:26

You do not store all the DNA in a single stream; most often it is stored per chromosome.

A large chromosome takes about 300 MB, and a small one about 50 MB.

Edit:

I think the main reason it is not stored at 2 bits per base pair is that this would get in the way of working with the data. Most people would not know how to convert it. And even when a conversion program is provided, many people in large companies or research institutes are not allowed to install programs, would have to ask permission, or do not know how ...

1 GB of storage costs nothing, even downloading 3 GB takes only 4 minutes at a speed of 100 Mbps, and most companies have higher speeds.
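The download estimate can be reproduced with a few lines. This is only a sketch: `download_seconds` is a made-up helper, and it assumes the link runs at its full nominal rate with no protocol overhead.

```python
def download_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealised transfer time: total bits divided by the link rate."""
    return size_bytes * 8 / link_bits_per_second


seconds = download_seconds(3e9, 100e6)  # 3 GB over a 100 Mbps link
print(seconds / 60)                     # transfer time in minutes
```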

Another thing is that the data is not as simple as they say.

For example, the sequencing method invented by Craig Venter was a major breakthrough, but it has its drawbacks. It cannot resolve long runs of the same base, so it is not always 100% clear whether there are 8 A's or 9 A's. These are things you have to take care of later ...

Another example is DNA methylation, because you cannot store this information in a 2-bit representation.

+25
Jan 21 '12 at 16:32

Basically, each base pair takes 2 bits (you can use 00, 01, 10, 11 for T, G, C, and A). Since there are about 2.9 billion base pairs in the human genome, that is (2 * 2.9 billion) bits ~= 691 megabytes, counting a megabyte as 2^20 bytes.
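The gap between this figure and the 725 MB quoted elsewhere on the page comes down to which megabyte is meant. A small sketch of the conversion, where `to_units` is a hypothetical helper:

```python
def to_units(n_bytes: int) -> tuple[float, float]:
    """Return the size as (decimal megabytes, binary mebibytes)."""
    decimal_mb = n_bytes / 10**6   # 1 MB  = 1,000,000 bytes
    binary_mib = n_bytes / 2**20   # 1 MiB = 1,048,576 bytes
    return decimal_mb, binary_mib


bits = 2 * 2_900_000_000           # 2 bits per base pair
decimal_mb, binary_mib = to_units(bits // 8)
print(decimal_mb, round(binary_mib, 1))
```

The same byte count reads as 725 in decimal megabytes and a little over 691 in binary megabytes.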

I am not an expert, however the Human Genome Wikipedia page reads as follows:

Raw MB:

  • Male (XY): 770 MB
  • Female (XX): 756 MB

I'm not sure where their deviation comes from, but I'm sure you can figure it out.

+11
Jan 21 '12 at 16:33

Yes, the minimum RAM required for all human DNA is about 770 MB. However, the 2-bit representation is impractical: it is hard to search or run calculations on it. Therefore, some mathematicians have developed more efficient ways to store these base sequences and use them in search and comparison algorithms such as GARLI (www.bio.utexas.edu/faculty/antisense/garli/garli.html). That application is running on my PC right now, so I can tell you that in practice it holds the DNA in about 1,563 MB.

+8
Jan 25 '14 at 21:20

The human genome contains 2.9 billion base pairs. So if you represented each base pair as a byte, it would take 2.9 billion bytes, or 2.9 GB. You could probably come up with a more creative way to store base pairs, since each one requires only 2 bits. That way you could store 4 base pairs per byte, bringing the total to less than a GB.

+3
Jan 21 '12 at 16:26

There are 4 nucleotide bases that make up our DNA: A, C, G and T. Each base therefore takes 2 bits. There are about 2.9 billion bases, so that is about 700 megabytes. Oddly enough, that would fill a normal data CD! Coincidence?!?

+3
Apr 24 2018-12-12T00:

I just worked this out too. The raw sequence is ~700 MB. If you use a fixed reference sequence (or a storage algorithm built around a fixed reference sequence), and take into account that the differences amount to about 1%, I calculate up to 120 MB using per-chromosome, sequence-offset, delta storage. That's it for storage.
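One way to picture the per-chromosome offset-and-delta idea is the sketch below. It is an illustration under strong assumptions, not the answerer's actual scheme: it handles single-base substitutions only (no insertions or deletions), and the names `Delta`, `make_deltas` and `apply_deltas` are made up.

```python
from typing import Dict, List, Tuple

# (chromosome name, 0-based offset, replacement base); hypothetical layout
Delta = Tuple[str, int, str]


def make_deltas(reference: Dict[str, str], sample: Dict[str, str]) -> List[Delta]:
    """Record every position where the sample differs from the reference."""
    deltas: List[Delta] = []
    for chrom, ref_seq in reference.items():
        for offset, (ref_base, sample_base) in enumerate(zip(ref_seq, sample[chrom])):
            if ref_base != sample_base:
                deltas.append((chrom, offset, sample_base))
    return deltas


def apply_deltas(reference: Dict[str, str], deltas: List[Delta]) -> Dict[str, str]:
    """Rebuild the sample by patching the reference with the stored deltas."""
    rebuilt = {chrom: list(seq) for chrom, seq in reference.items()}
    for chrom, offset, base in deltas:
        rebuilt[chrom][offset] = base
    return {chrom: "".join(bases) for chrom, bases in rebuilt.items()}


reference = {"chr1": "ACGTACGT", "chr2": "GGCC"}   # toy data, not real sequence
sample = {"chr1": "ACGTTCGT", "chr2": "GGCA"}
deltas = make_deltas(reference, sample)
print(deltas)
```

Only the delta list has to be stored per person, as long as everyone shares the same reference.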

+2
Mar 14 '14 at 2:03

Most of the answers, with the exception of those from users slayton, rauchen and Paul Amstrong, are completely incorrect when it comes to plain one-to-one storage without compression.

A human genome of 3 G nucleotides corresponds to 3 GB (one byte per nucleotide), not ~750 MB. The assembled haploid genome at NCBI currently has a size of 3,436,687 KB, or 3.436687 GB. Check here for yourself.

Haploid = a single copy of each chromosome. Diploid = two versions of the haploid set. Humans have 22 unique chromosomes x 2 = 44. In males the 23rd pair is X and Y, making 46 in total. In females the 23rd pair is X and X, also making 46 in total.

For males, that means storing 23 + 1 chromosomes on the hard drive, and for females 23 chromosomes, which explains the small differences mentioned in some of the answers. The male X chromosome is the same as the female X chromosome.

Loading the genome (23 + 1) into memory is therefore done in parts, via BLAST, using databases built from FASTA files. Zipped or not, nucleotide sequences hardly compress. In the early days, one of the tricks was to replace tandem repeats with a shorter encoding (GACGACGAC became, for example, "3GAC", going from 9 bytes to 4 bytes). The reason was to save hard disk space (this was the era of 500 MB to 2 GB hard disks with 7,200 rpm platters and SCSI connectors). For sequence searches, the same encoding was applied to the query.
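The tandem-repeat trick can be sketched roughly like this. The count-then-unit token format is inferred from the "3GAC" example alone; the function names are hypothetical and do not come from any particular tool.

```python
import re


def compress_if_repeat(seq: str) -> str:
    """Return count+unit if seq is a whole number of repeats of a shorter unit."""
    n = len(seq)
    for unit_len in range(1, n // 2 + 1):
        unit = seq[:unit_len]
        if n % unit_len == 0 and unit * (n // unit_len) == seq:
            return f"{n // unit_len}{unit}"
    return seq  # not a tandem repeat: leave unchanged


def decode_tandem(token: str) -> str:
    """Expand a token such as '3GAC' back into the full repeat."""
    match = re.fullmatch(r"(\d+)([ACGT]+)", token)
    if match is None:
        raise ValueError(f"not a repeat token: {token!r}")
    count, unit = match.groups()
    return unit * int(count)


token = compress_if_repeat("GACGACGAC")
print(token, len(token))
```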

If each nucleotide is encoded in 2 bits per letter, you get:

A = 00
C = 01
G = 10
T = 11

This is the only way to make full use of bit positions 1, 2, 3, 4, 5, 6, 7 and 8 in 1 byte of encoding. For example, the combination 00.01.11.10 corresponds to "ACTG". This is what reduces the file size by a factor of 4, as we see in the other answers. Thus, the 3.4 GB would shrink to 0.85917175 GB, or ~860 MB, including the required conversion program (23 KB-4 MB).
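A minimal sketch of this packing, using the A/C/G/T table from this answer. The `pack` and `unpack` names are made up, the sequence length has to be kept separately because the last byte may be padded, and symbols outside A, C, G, T (such as N) are not handled.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string equals the 2-bit code


def pack(seq: str) -> bytes:
    """Pack four bases per byte, first base in the highest two bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Reverse pack(); length is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

Seven bases fit in two bytes here, which is the factor-of-4 saving over one byte per letter.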

But ... in biology you want to be able to read the data, so gzip compression is more than enough. Once unzipped, you can still read it. If this byte packing were used, the data would become harder to read. This is why FASTA files are actually text files.
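The gzip round trip can be illustrated with Python's standard `gzip` module. The record below is a made-up, highly repetitive toy sequence, so it compresses far better than real genomic sequence would; the point is only that the text comes back unchanged and readable.

```python
import gzip

# Toy FASTA-style record: a header line followed by a sequence line.
fasta_text = ">example_record (made-up sequence)\n" + "GACGACGAC" * 200 + "\n"

raw = fasta_text.encode("ascii")
compressed = gzip.compress(raw)
restored = gzip.decompress(compressed).decode("ascii")

print(len(raw), len(compressed))  # sizes before and after compression
print(restored[:40])              # still plain, readable text
```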

+1
Mar 01 '18 at 10:30

Each person has one human genome, and according to the National Human Genome Research Institute, we have a total of 30,000 genes containing about 3 billion base pairs (two bases = one base pair). There are 4 different bases: adenine (A), guanine (G), cytosine (C) and thymine (T). We can encode A as 00 or as 01000001 (its usual character code). I will work out the answer both for a base pair stored as two bytes and for a base pair stored as two bits, although I think bytes would be the more realistic option, because the data would be easier to deal with.

I am going to assume that the data structure is such that each line is the base-pair sequence of one gene (for example, ATCG ...), read from bottom to top, since order is important, similar to the letters in a word. A newline is 1 byte on Linux and 2 bytes on Windows, but this will have only a slight effect on size.

eg

GENE1...
GENE2...

24,000 genes in the human genome require 24,000 newlines = 24 KB on Linux or 48 KB on Windows (negligible). If each base pair is 2 bytes, then since there are 3 billion of them, the file will be 6 GB. If each base pair is 2 bits, then the file size will be close to 6,000,000,000 bits, or 750 MB.
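The estimate can be written out as a short calculation. This sketch takes the answer's own assumptions at face value (3 billion base pairs, 24,000 gene lines, one newline per gene); `file_size_bytes` is a hypothetical helper.

```python
BASE_PAIRS = 3_000_000_000
GENES = 24_000  # one line, and so one newline, per gene


def file_size_bytes(bits_per_base_pair: int, newline_bytes: int) -> int:
    """Sequence payload plus one newline per gene line."""
    sequence_bytes = BASE_PAIRS * bits_per_base_pair // 8
    return sequence_bytes + GENES * newline_bytes


print(file_size_bytes(16, 1))  # 2 bytes per base pair, Linux newlines
print(file_size_bytes(2, 2))   # 2 bits per base pair, Windows newlines
```

In both cases the newline overhead is a few tens of kilobytes against hundreds of megabytes or more of sequence.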

Therefore, I would say that the human genome will occupy about 750 MB or 6 GB of space. Please correct me or improve this answer if I missed something.

0
Feb 02 '19 at 19:00

None of the answers take into account that nDNA is not the only DNA that makes up the human genome. mtDNA is also inherited and contributes an additional 16,500 base pairs to the human genome, which is more consistent with Wikipedia's figures of 770 MB for men and 756 MB for women.

This does not mean that the human genome can be easily stored on a 4 GB USB drive. Bits do not represent information per se; it is a combination of bits that represent information. Thus, in the case of nDNA and mtDNA, the bits are encoded (not to be confused with compressed) to represent proteins and enzymes, which themselves require a lot of MB of raw data to represent, especially in terms of functionality.

Food for Thought: 80% of the human genome is called "non-coding" DNA, so do you really believe that the entire human body and brain can be represented with only 151 to 154 MB of raw data?

0
Feb 17 '19 at 15:00

One base - T, C, A, G (in the base-4 number system: 0, 1, 2, 3) - is encoded as two bits (not one), so one base pair is encoded with four bits.

-1
Apr 29 '18 at 5:14

There are only 2 types of base pairs: cytosine can bind only to guanine, and adenine can bind only to thymine. Therefore, each base pair can be considered one bit. This means that a whole strand of human DNA, ~3 billion bits, would be about 350 megabytes.

-2
May 18 '17 at


