How to create one hot coding for DNA sequences?

Question

I would like to create a one-hot encoding for a set of DNA sequences. For example, the sequence ACGTCCA can be represented as shown below, in transposed form. However, the code below generates the one-hot encoding horizontally, whereas I would prefer it in vertical form. Can anybody help me?

    ACGTCCA
    1000001 - A
    0100110 - C
    0010000 - G
    0001000 - T

Code example:

    from sklearn.preprocessing import OneHotEncoder
    import itertools

    # two example sequences
    seqs = ["ACGTCCA","CGGATTG"]

    # split sequences to tokens
    tokens_seqs = [seq.split("\\") for seq in seqs]

    # convert list of token-lists to one flat list of tokens
    # and then create a dictionary that maps word to id of word,
    # like {A: 1, B: 2} here
    all_tokens = itertools.chain.from_iterable(tokens_seqs)
    word_to_id = {token: idx for idx, token in enumerate(set(all_tokens))}

    # convert token lists to token-id lists, eg [[1, 2], [2, 2]] here
    token_ids = [[word_to_id[token] for token in tokens_seq] for tokens_seq in tokens_seqs]

    # convert list of token-id lists to one-hot representation
    vec = OneHotEncoder(n_values=len(word_to_id))
    X = vec.fit_transform(token_ids)

    print X.toarray()

However, the code gives me the output:

    [[ 0.  1.]
     [ 1.  0.]]

Expected result:

    [[1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
     [0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0.]]

python arrays scikit-learn itertools one-hot one-hot-encoding

Xiong89 Dec 14 '15 at 9:43

2 answers

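    import numpy as np

    def one_hot_encode(seq):
        # map each base to its column index: A=0, C=1, G=2, T=3
        mapping = dict(zip("ACGT", range(4)))
        seq2 = [mapping[i] for i in seq]
        # pick one row of the 4x4 identity matrix per base
        return np.eye(4)[seq2]

    one_hot_encode("AACGT")

    ## Output:
    array([[1., 0., 0., 0.],
           [1., 0., 0., 0.],
           [0., 1., 0., 0.],
           [0., 0., 1., 0.],
           [0., 0., 0., 1.]])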
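The function above returns one row per position, i.e. an array of shape `(len(seq), 4)`. To get the layout from the question, where each sequence becomes a single row made of an A block, a C block, a G block and a T block, the array can be transposed and flattened. A minimal sketch, assuming all sequences have the same length and contain only A, C, G and T; the names `one_hot_rows` and `encode_flat` are illustrative, not part of the answer:

```python
import numpy as np

BASES = "ACGT"  # assumed alphabet; other symbols (e.g. N) raise KeyError


def one_hot_rows(seq):
    """Return a (len(seq), 4) one-hot array, one row per position."""
    index = {base: i for i, base in enumerate(BASES)}
    return np.eye(len(BASES), dtype=np.uint8)[[index[b] for b in seq]]


def encode_flat(seq):
    """Transpose to (4, len(seq)) and flatten: A block, C block, G block, T block."""
    return one_hot_rows(seq).T.ravel()


seqs = ["ACGTCCA", "CGGATTG"]
# stacking requires equal-length sequences
X = np.stack([encode_flat(s) for s in seqs])
print(X)
```

For the two example sequences this yields a `(2, 28)` array whose rows match the expected result in the question.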

Dridk Jun 05 '19 at 17:11

John zwinck · Accepted Answer · 2015-12-14T10:19:12+0000

I suggest doing this a little more manually:

    import numpy as np

    seqs = ["ACGTCCA", "CGGATTG"]

    CHARS = 'ACGT'
    CHARS_COUNT = len(CHARS)

    maxlen = max(map(len, seqs))
    res = np.zeros((len(seqs), CHARS_COUNT * maxlen), dtype=np.uint8)

    for si, seq in enumerate(seqs):
        seqlen = len(seq)
        # split the sequence into an array of single characters
        arr = np.array(list(seq))
        for ii, char in enumerate(CHARS):
            # blocks are seqlen wide, so columns line up across rows
            # only when all sequences have the same length
            res[si][ii*seqlen:(ii+1)*seqlen][arr == char] = 1

    print(res)

This will give you the desired result:

    [[1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
     [0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0]]
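If all sequences have the same length, the same result can probably be obtained without explicit loops by comparing every position against all four characters at once using NumPy broadcasting. This is only a sketch under that equal-length assumption, and `encode_broadcast` is a hypothetical helper name:

```python
import numpy as np

CHARS = np.array(list("ACGT"))


def encode_broadcast(seqs):
    # (n_seqs, seq_len) array of single characters; needs equal lengths
    arr = np.array([list(s) for s in seqs])
    # compare each position with each base: shape (n_seqs, 4, seq_len)
    hot = arr[:, None, :] == CHARS[None, :, None]
    # flatten the (4, seq_len) blocks of each sequence into one row
    return hot.reshape(len(seqs), -1).astype(np.uint8)


res = encode_broadcast(["ACGTCCA", "CGGATTG"])
print(res)
```

The intermediate boolean array has one `(4, seq_len)` slab per sequence, so reshaping it row by row gives the A, C, G, T blocks in order.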
