How to remove duplicated SNPs using PLINK?

I work with PLINK to analyze genome data.

Does anyone know how to remove duplicated SNPs?

3 answers

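In PLINK 1.9, use --list-duplicate-vars suppress-first, which lists the duplicates (variants that share the same position and allele codes) but leaves the first variant of each group off the list. The flag only writes a report (plink.dupvar by default) and removes nothing by itself; add the ids-only modifier and pass the resulting list to --exclude to drop the listed duplicates while leaving the first copy intact. I know this answer is late.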
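
The two steps might look like the sketch below. It assumes PLINK 1.9 is installed as plink and that the data is a binary fileset; dedup_vars, mydata and mydata_nodup are names invented for this example. The ids-only modifier is added so that the report contains only variant IDs and can be read by --exclude.

```shell
# Hypothetical helper: $1 = input fileset prefix, $2 = output prefix.
dedup_vars() {
    # Step 1: write the IDs of the duplicates (all but the first of each group)
    # to "$2.dupvar". Nothing is removed at this point.
    plink --bfile "$1" --list-duplicate-vars ids-only suppress-first --out "$2" || return 1
    # Step 2: exclude the listed IDs and write a new binary fileset.
    plink --bfile "$1" --exclude "$2.dupvar" --make-bed --out "$2"
}

# Example call (needs real files mydata.bed, mydata.bim and mydata.fam):
# dedup_vars mydata mydata_nodup
```

Note that this only catches variants that PLINK itself considers duplicates (same position and allele codes); SNPs that merely share an ID need one of the other approaches.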

Instead of using --exclude, as Davy suggested, you can also use --extract, which keeps the SNPs on a list rather than discarding them. Here is an easy way to build such a list on any Unix-based system (assuming your data is in PED/MAP format and split by chromosome):

 for i in {1..22}; do awk '!seen[$1, $4]++ { print $2 }' yourfile_chr${i}.map > keepers_chr${i}.txt; done

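This creates one keepers_chr${i}.txt file per chromosome, containing the identifiers of the SNPs at unique positions. Then run PLINK on your original files with --extract keepers_chr${i}.txt, together with --make-bed --out unique_file (use a different --out prefix for each chromosome so the outputs do not overwrite each other).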
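
To make the keep-list idea concrete, here is a minimal sketch on a made-up three-SNP MAP file (demo_chr1.map and its contents are invented for illustration). It assumes the standard four-column MAP layout (chromosome, SNP ID, genetic distance, base-pair position) and treats two SNPs as duplicates when they share chromosome and position. The PLINK call is left as a comment because it needs real genotype files.

```shell
# Toy MAP file: rs2 and rs2_dup sit at the same position on chromosome 1.
printf '1\trs1\t0\t1000\n1\trs2\t0\t2000\n1\trs2_dup\t0\t2000\n' > demo_chr1.map

# Keep the ID (column 2) of the first SNP seen at each chromosome/position pair.
awk '!seen[$1, $4]++ { print $2 }' demo_chr1.map > keepers_chr1.txt

# Show the keep list: rs1 and rs2 are kept, rs2_dup is left out.
cat keepers_chr1.txt

# With real data the list would then be handed to PLINK, for example:
# plink --file yourfile_chr1 --extract keepers_chr1.txt --make-bed --out unique_file_chr1
```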


There is no command that I know of that does this automatically. The way I have done it in the past is to get a list of the duplicate SNPs, rename each duplicate (for example, rs1001 becomes rs1001.dup) by running --update-map with --update-name, then make a list of the renamed duplicates, so that every entry has .dup at the end of its name, and finally run --exclude duplicateSNPs.txt --make-bed --out yourfilename.dups.removed

Getting the list of duplicate SNPs should not be too difficult if you are familiar with R. Sorry that this is just a "learn X!!!" answer.
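
If R is not at hand, the renaming and the list can also be sketched with standard Unix tools. This is a swapped-in alternative, not the method described above: it edits the ID column of the MAP file directly instead of using --update-name. The example assumes duplicates are SNPs whose ID (column 2) appears more than once; demo.map, demo.renamed.map and their contents are made up, and with more than two copies of an ID the later copies would all receive the same .dup name.

```shell
# Toy MAP file in which rs1001 appears twice.
printf '1\trs1001\t0\t1000\n1\trs1002\t0\t2000\n1\trs1001\t0\t1000\n' > demo.map

# Append ".dup" to the second and later occurrences of each ID (column 2).
awk 'BEGIN { OFS = "\t" } seen[$2]++ { $2 = $2 ".dup" } { print }' demo.map > demo.renamed.map

# Collect the renamed IDs into the exclusion list.
awk '$2 ~ /\.dup$/ { print $2 }' demo.renamed.map > duplicateSNPs.txt

cat duplicateSNPs.txt
```

The renamed MAP file would replace the original one (alongside the unchanged PED file) before running PLINK with --exclude duplicateSNPs.txt.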


R is easier, although you need a TPED file (PLINK can write one with --recode transpose). Once you have the TPED file, just copy the following and paste it into the R console:

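 # Read the space-delimited TPED file as text (column V2 holds the SNP ID);
 # colClasses keeps "T" alleles from being converted to TRUE
 a = read.table("yourfile.TPED", sep = " ", header = FALSE, colClasses = "character")
 # Keep only the first row for each SNP ID
 b = a[!duplicated(a$V2), ]
 write.table(b, file = "newfile.TPED", sep = " ", quote = FALSE, col.names = FALSE, row.names = FALSE)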

newfile.TPED, without the duplicates, will be written to R's working directory. Tip: replace yourfile.TPED and newfile.TPED in the script with the actual names of your files.


Source: https://habr.com/ru/post/1403346/

