Using Django with MySQL to store and search large DNA microarray results

I am trying to set up a Django application that lets me store and search the results of a DNA microarray with about 500 thousand unique probes, for a large number of subjects.

The models I have so far are:

    from django.db import models

    class Subject(models.Model):
        name = models.CharField()

    class Chip(models.Model):
        chip_name = models.CharField()

    class Probe(models.Model):
        chips = models.ManyToManyField(Chip, related_name="probes")
        rs_name = models.CharField(unique=True)
        chromosome = models.IntegerField()
        location = models.IntegerField()

    class Genotype(models.Model):
        probe = models.ForeignKey(Probe, related_name='genotypes')
        subject = models.ForeignKey(Subject, related_name='genotypes')
        genotype = models.CharField()

Is there a better way to set this up? My concern is that every subject will add 500k rows to the Genotype table.

If I use a MySQL database, will it be able to handle a large number of subjects, each of which adds 500k rows to this table?

1 answer

Well, if you need a result (genotype) for each probe for each subject, then the standard many-to-many intermediary table (Genotype) will indeed be huge. With 1000 subjects, you would have 500 million records.

If you could store the genotype values encoded/serialized in one or more columns, this would drastically reduce the number of records. Storing 500k values encoded in a single column would be a problem, but if you can split them into groups, it should be workable. This reduces the number of records to the number of subjects. Another possibility is to introduce ProbeGroup objects that the probes belong to, which gives number of SubjectProbeResults = number of subjects * number of probe groups. The first option would look something like this:

    class SubjectProbeResults(models.Model):
        subject = models.ForeignKey(Subject, related_name='probe_results')
        pg_a_genotypes = models.TextField()
        # .. one TextField column per probe group ..
        pg_n_genotypes = models.TextField()

This, of course, makes searching/filtering the results harder, but it should not be too hard if the stored format is simple. You could use the following format in the genotype columns: "probe1_id|genotype1,probe2_id|genotype2,probe3_id|genotype3,..."
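A minimal sketch of how such a column value could be built and parsed in plain Python. The helper names `encode_genotypes` and `decode_genotypes` are hypothetical, and the sketch assumes integer probe IDs and genotype strings that never contain `|` or `,`:

```python
def encode_genotypes(pairs):
    """Serialize (probe_id, genotype) pairs as 'id|genotype,id|genotype,...'."""
    return ",".join("%s|%s" % (probe_id, genotype) for probe_id, genotype in pairs)


def decode_genotypes(text):
    """Parse a serialized column value back into a {probe_id: genotype} dict."""
    result = {}
    if not text:
        return result
    for entry in text.split(","):
        # Assumption: probe IDs are integers and '|' appears once per entry.
        probe_id, genotype = entry.split("|", 1)
        result[int(probe_id)] = genotype
    return result


column = encode_genotypes([(101, "AA"), (102, "AG"), (103, "GG")])
print(column)
print(decode_genotypes(column)[102])
```

Keeping the separators free of spaces matters, because the lookup relies on exact substrings such as `,102|AG,`.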

To get the subjects that have a specific probe + genotype combination:

a. Determine which group the probe belongs to, e.g. group C → pg_c_genotypes

b. Query the corresponding column for the probe_id + genotype combination.

    from django.db.models import Q

    qstring = "%s|%s" % (probe_id, genotype)
    subjects = Subject.objects.filter(
        # entry somewhere in the middle of the list
        Q(probe_results__pg_c_genotypes__contains=',%s,' % qstring) |
        # entry is the first in the list
        Q(probe_results__pg_c_genotypes__startswith='%s,' % qstring) |
        # entry is the last in the list
        Q(probe_results__pg_c_genotypes__endswith=',%s' % qstring))
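The reason for three delimiter-aware conditions rather than one plain substring test can be seen in a plain-Python sketch of the same matching logic. The function name `column_matches` is hypothetical, and the sketch only mirrors the string tests, not the database query itself:

```python
def column_matches(column_value, probe_id, genotype):
    """Mirror the three lookups: middle, first, or last entry of the list."""
    qstring = "%s|%s" % (probe_id, genotype)
    return (
        (",%s," % qstring) in column_value           # like __contains
        or column_value.startswith("%s," % qstring)  # like __startswith
        or column_value.endswith(",%s" % qstring)    # like __endswith
    )


column = "11|AA,12|AG,13|GG"

# A bare substring test wrongly matches probe 1 inside "11|AA".
print("1|AA" in column)                   # True (false positive)
print(column_matches(column, 1, "AA"))    # False
print(column_matches(column, 12, "AG"))   # True

# Note: a column holding exactly one entry is matched by none of the
# three patterns, so that case would need an additional exact comparison.
print(column_matches("7|AA", 7, "AA"))    # False
```

With groups of many probes the single-entry case is unlikely to come up, but it is worth knowing if groups can be small.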

The other option I mentioned is to have a ProbeGroup model, with each Probe having a ForeignKey to ProbeGroup. And then:

    class SubjectProbeResults(models.Model):
        subject = models.ForeignKey(Subject, related_name='probe_results')
        probe_group = models.ForeignKey(ProbeGroup, related_name='probe_results')
        genotypes = models.TextField()

You can query the genotypes field the same way, except that now you can filter on the group directly instead of working out which column to search. So if you have, e.g., 1000 probes per group → 500 groups. Then for 1000 subjects you will have 500K SubjectProbeResults rows, which is still a lot but certainly more manageable than 500M. You could also use fewer groups; you will need to test what works best.
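The row counts for the three layouts can be checked with a little arithmetic. This is only a sketch: `row_counts` and `group_index` are hypothetical helpers, and assigning probes to groups by their position is an assumption made here for illustration (any stable mapping from probe to group would do):

```python
def row_counts(num_subjects, num_probes, probes_per_group):
    """Number of rows each layout needs for the given data set size."""
    # Integer ceiling division, so a partly filled last group still counts.
    num_groups = (num_probes + probes_per_group - 1) // probes_per_group
    return {
        "one_row_per_genotype": num_subjects * num_probes,
        "one_row_per_subject": num_subjects,
        "one_row_per_subject_and_group": num_subjects * num_groups,
    }


def group_index(probe_position, probes_per_group):
    """Assumed grouping: consecutive probes share a group (0-based)."""
    return probe_position // probes_per_group


counts = row_counts(num_subjects=1000, num_probes=500000, probes_per_group=1000)
for layout, rows in counts.items():
    print(layout, rows)
```

Larger groups mean fewer rows but longer text fields to scan per row, which is why the group size needs testing against real queries.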


Source: https://habr.com/ru/post/1342286/

