How to remove duplicated SNPs using PLINK?

Question

I work with PLINK to analyze genome data.

Does anyone know how to remove duplicated SNPs?

+4

bioinformatics

user1236418 Mar 25 '12 at 19:19

3 answers

Benjamatic · Answer 1 · 2016-03-23T17:26:47+0000

In PLINK 1.9, use --list-duplicate-vars suppress-first, which lists the duplicate variants but omits the first one in each group, so that excluding the listed variants leaves one copy intact. I know this is a late answer.
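
As a rough sketch of how this could look on the command line, assuming PLINK 1.9 is installed as plink and the data is a binary fileset; the names mydata and mydata_nodup are hypothetical. The ids-only modifier is added here (it is not part of the answer above) so that the .dupvar output is a plain ID list that --exclude can read:

```shell
# Sketch only: the fileset names passed in are hypothetical placeholders.
dedup_plink() {
    src="$1"   # input fileset prefix (.bed/.bim/.fam)
    dst="$2"   # output fileset prefix

    # Step 1: write the IDs of duplicate variants to ${dst}_dups.dupvar.
    # suppress-first leaves the first variant of each duplicate group off the list;
    # ids-only drops the header and the position/allele columns.
    plink --bfile "$src" --list-duplicate-vars ids-only suppress-first \
          --out "${dst}_dups" || return 1

    # Step 2: exclude the listed variants and write a new binary fileset.
    plink --bfile "$src" --exclude "${dst}_dups.dupvar" \
          --make-bed --out "$dst"
}

# Run only when PLINK and the input fileset are actually present.
if command -v plink >/dev/null 2>&1 && [ -f mydata.bed ]; then
    dedup_plink mydata mydata_nodup
fi
```

Note that --exclude works on variant IDs, so duplicates that share the same ID cannot be separated this way: excluding that ID would remove every copy.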

Instead of using --exclude, as Davy suggested, you can also use --extract, which keeps the SNPs in the list rather than discarding them. Here is an easy way to build such a list on any Unix-based system (assuming your data is in PED/MAP format and split by chromosome):

 for i in {1..22}; do awk '!seen[$1,$4]++ {print $2}' yourfile_chr${i}.map > keepers_chr${i}.txt; done

This creates one keepers_chr${i}.txt file per chromosome, containing the IDs of the SNPs at unique positions. Then run PLINK on the original file with --extract keepers_chr${i}.txt, together with --make-bed --out unique_file
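
To illustrate what a keepers list should contain, here is a small sketch on a toy MAP file. It uses awk to keep the first SNP seen at each chromosome/position pair, which is one possible way to build the list and not necessarily the exact pipeline intended above. The file names and SNP IDs are made up for the demonstration:

```shell
# Toy MAP file: chromosome, SNP ID, genetic distance, base-pair position.
# rs1002 and rs1003 (made-up IDs) sit at the same position.
printf '1\trs1001\t0\t1000\n1\trs1002\t0\t2000\n1\trs1003\t0\t2000\n1\trs1004\t0\t3000\n' > demo_chr1.map

# Print the ID (column 2) of the first SNP at each chromosome/position pair.
awk '!seen[$1, $4]++ { print $2 }' demo_chr1.map > demo_keepers_chr1.txt

cat demo_keepers_chr1.txt

# With real data, the list would then be passed to PLINK, for example:
# plink --file yourfile_chr1 --extract keepers_chr1.txt --make-bed --out unique_file_chr1
```

Here rs1003 is dropped because rs1002 already occupies position 2000, so the list holds rs1001, rs1002 and rs1004.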

Davy kavanagh · Answer 2 · 2012-06-22T09:13:21+0000

There is no command that does this automatically, as far as I know. The way I have done it in the past is to get a list of the duplicate SNPs, rename the duplicates to something like rs1001.dup using --update-map --update-name, then make a list of the renamed duplicates (so every entry ends in .dup), and then run --exclude duplicateSNPs.txt --make-bed --out yourfilename.dups.removed

Getting the list of duplicate SNPs should not be too hard if you are familiar with R. Sorry for the "just learn X!!!" answer.

user2765374 · Answer 3 · 2015-06-25T15:26:29+0000

R is easier, although you need a TPED file. Once you have the TPED file, just copy and paste the following into the R console:

 a = read.table("yourfile.TPED", sep = " ", header = FALSE)
 b = a[!duplicated(a$V2), ]
 write.table(b, file = "newfile.TPED", sep = " ", quote = FALSE, col.names = FALSE, row.names = FALSE)

newfile.TPED, without the duplicates, will be written to R's working directory. TIP: replace yourfile.TPED and newfile.TPED in the script with the actual names of your files.
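
For readers without R, the same first-occurrence filter on the SNP ID column could be sketched in shell with awk. This is an alternative, not part of the answer above; the file names and genotypes below are made up:

```shell
# Toy TPED file: chromosome, SNP ID, genetic distance, position, then genotypes.
# The rs1002 row (made-up ID) appears twice.
printf '1 rs1001 0 1000 A A G G\n1 rs1002 0 2000 C C C T\n1 rs1002 0 2000 C C C T\n1 rs1003 0 3000 G G G G\n' > demo.tped

# Keep only the first row for each SNP ID (column 2).
awk '!seen[$2]++' demo.tped > demo_nodup.tped

cat demo_nodup.tped
```

The matching TFAM file lists individuals rather than SNPs, so it is unaffected and only needs to be copied under the new file prefix.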
