How to split a string and count the frequency of the alphabet using dplyr pipe

Question

I have the following data frame:

    library(tidyverse)

    dat <- structure(list(fasta_header = c(">seq1", ">seq2"),
                          sequence = c("MPSRGTRPE", "VSSKYTFWNF")),
                     .Names = c("fasta_header", "sequence"),
                     row.names = c(NA, -2L),
                     class = c("tbl_df", "tbl", "data.frame"))
    dat
    #> # A tibble: 2 x 2
    #>   fasta_header sequence
    #>   <chr>        <chr>
    #> 1 >seq1        MPSRGTRPE
    #> 2 >seq2        VSSKYTFWNF

I want to calculate the amino acid frequency for each row, with one column per amino acid. The desired result (worked out by hand) is:

    fasta_header sequence   M P S R G T E V K Y F W N
    >seq1        MPSRGTRPE  1 2 1 2 1 1 1 0 0 0 0 0 0
    >seq2        VSSKYTFWNF 0 0 2 0 0 1 0 1 1 1 2 1 1

How can I do this using the dplyr piping method?

+5

r dplyr tidyverse

scamander Apr 4 '18 at 8:58

2 answers

Here you go

    library(tidyverse)
    library(stringr)
    library(dplyr)

    dat <- structure(list(fasta_header = c(">seq1", ">seq2"),
                          sequence = c("MPSRGTRPE", "VSSKYTFWNF")),
                     .Names = c("fasta_header", "sequence"),
                     row.names = c(NA, -2L),
                     class = c("tbl_df", "tbl", "data.frame"))

    # Vector of unique amino acids, in order of first appearance
    uniqueaa <- as.character(dat$`sequence`) %>%
      strsplit(split = "") %>%
      c() %>%
      unlist() %>%
      unique() %>%
      data.frame(stringsAsFactors = FALSE)
    colnames(uniqueaa) <- "uniqueaa"

    # Count occurrences of each amino acid in every sequence
    result <- apply(uniqueaa, 1, function(x) str_count(dat$sequence, x["uniqueaa"]))
    colnames(result) <- uniqueaa$uniqueaa
    rownames(result) <- dat$sequence
    result

               M P S R G T E V K Y F W N
    MPSRGTRPE  1 2 1 2 1 1 1 0 0 0 0 0 0
    VSSKYTFWNF 0 0 2 0 0 1 0 1 1 1 2 1 1
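The same counting idea can be sketched in base R with no packages loaded. This is only an illustration under the assumption that a plain count matrix is acceptable as output; the names `sequences`, `chars`, `letters_seen` and `counts` are made up for this example and do not come from the answer above.

```r
# Sequences from the question
sequences <- c("MPSRGTRPE", "VSSKYTFWNF")

# Split each sequence into a vector of single characters
chars <- strsplit(sequences, split = "")

# Letters in order of first appearance across all sequences
letters_seen <- unique(unlist(chars))

# For each sequence, count every letter; using a factor with fixed
# levels means letters that are absent are counted as 0
counts <- t(vapply(
  chars,
  function(x) tabulate(factor(x, levels = letters_seen),
                       nbins = length(letters_seen)),
  integer(length(letters_seen))
))
colnames(counts) <- letters_seen
rownames(counts) <- sequences
counts
```

Fixing the factor levels up front is what keeps the zero counts, so every row has the same set of columns regardless of which letters a given sequence contains.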

+1

gpier Apr 4 '18 at 9:40

Andrew Gustar · Accepted Answer · 2018-04-04T09:53:00+0000

The comments above are correct, but if you really want the tidyverse pipeline ...

    library(tidyverse) #uses dplyr, purrr, tidyr and stringr

    dat %>%
      mutate(split = map(sequence, ~unlist(str_split(., "")))) %>% #split into characters
      unnest(split) %>%                         #unnest into a new column, one row per character
      group_by(fasta_header, sequence) %>%      #group
      count(split) %>%                          #count letters for each group
      spread(key = split, value = n, fill = 0)  #convert to wide format

      fasta_header sequence       E     F     G     K     M     N     P     R     S     T     V     W     Y
      <chr>        <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 >seq1        MPSRGTRPE     1.    0.    1.    0.    1.    0.    2.    2.    1.    1.    0.    0.    0.
    2 >seq2        VSSKYTFWNF    0.    2.    0.    1.    0.    1.    0.    0.    2.    1.    1.    1.    1.
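In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`. The pipeline above could be written as follows; this is a sketch that assumes tidyr 1.0.0 or newer is installed, and the names `letter` and `freq` are chosen here for illustration only.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Data from the question
dat <- tibble(
  fasta_header = c(">seq1", ">seq2"),
  sequence     = c("MPSRGTRPE", "VSSKYTFWNF")
)

freq <- dat %>%
  mutate(letter = str_split(sequence, "")) %>%  # list column of single characters
  unnest(letter) %>%                            # one row per character
  count(fasta_header, sequence, letter) %>%     # occurrences per sequence and letter
  pivot_wider(names_from  = letter,             # one column per amino acid
              values_from = n,
              values_fill = list(n = 0L))       # letters absent from a sequence become 0
freq
```

The letter columns may come out in a different order than in the `spread()` result, and the counts stay integers rather than doubles.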
