Fast way to convert strings in a Pandas column to lists of ints?

I am trying to calculate the Hamming distance between all the rows in a column of a large data frame. There are over 100,000 rows in this column, so all pairwise combinations come to roughly 10 × 10^9 comparisons. The rows are short DNA sequences. I would like to quickly convert each row into a list of integers, where each character is represented by a unique integer. For example:

"ACGTACA" -> [0, 1, 2, 3, 1, 2, 1]

Then I would use scipy.spatial.distance.pdist to quickly and efficiently calculate the distances between them. Is there a fast way to do this in Pandas?
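For reference, a minimal sketch of how pdist would consume such integer-encoded sequences (assuming all sequences have equal length; the two example sequences here are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Each row is one encoded sequence; pdist's "hamming" metric returns the
# fraction of positions that differ, for every pair of rows.
encoded = np.array([
    [0, 1, 2, 3, 1, 2, 1],  # "ACGTACA"
    [0, 1, 2, 3, 1, 2, 3],  # "ACGTACT"
])
frac = pdist(encoded, metric="hamming")   # fraction of mismatching positions
counts = frac * encoded.shape[1]          # absolute Hamming distance
print(counts)                             # -> [1.]
```

Multiplying by the sequence length converts pdist's mismatch fraction back into an absolute Hamming count.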

I tried using apply, but it is rather slow:

import numpy as np

mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
df.apply(lambda x: np.array([mapping[char] for char in x]))

get_dummies and other categorical operations are not applicable because they operate at the row level, not within each string.

+4
4 answers

Since the Hamming distance doesn't care about the actual magnitudes, I get a 40-60% speedup on my data sets simply by changing df.apply(lambda x: np.array([mapping[char] for char in x])) to df.apply(lambda x: map(ord, x)).
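A sketch of that ord-based encoding (note that in Python 3, map returns an iterator, so it needs to be materialized, e.g. with np.fromiter; the example series is made up):

```python
import numpy as np
import pandas as pd

df = pd.Series(["ACGTACA", "ACGTACT", "TTGTACA"])

# ord() gives each base a distinct integer (A=65, C=67, G=71, T=84);
# Hamming distance only tests equality, so the actual values are irrelevant.
encoded = df.apply(lambda x: np.fromiter(map(ord, x), dtype=np.uint8, count=len(x)))
print(encoded[0])  # -> [65 67 71 84 65 67 65]
```

This skips the per-character dictionary lookup entirely, which is where the speedup comes from.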

+2

I have not tested the performance of this, but you could also try something like

atest = "ACGTACA"
alist = atest.replace('A', '3.').replace('C', '2.').replace('G', '1.').replace('T', '0.').split('.')
anumlist = [int(x) for x in alist if x.isdigit()]

leads to:

[3, 2, 1, 0, 3, 2, 3]

Edit: OK, so testing it with atest = "ACTACA" * 100000 takes quite some time :/ Maybe not such a good idea...

Edit 5: Another improvement:

import datetime
import numpy as np

class Test(object):
    def __init__(self):
        self.mapping = {'A' : 0, 'C' : 1, 'G' : 2, 'T' : 3}

    def char2num(self, astring):
        return [self.mapping[c] for c in astring]

def main():
    now = datetime.datetime.now()
    atest = "AGTCAGTCATG" * 10000000
    t = Test()
    alist = t.char2num(atest)
    testme = np.array(alist)
    print(testme, len(testme))
    print(datetime.datetime.now() - now)

if __name__ == "__main__":
    main()

It takes about 16 seconds for 110,000,000 characters, and it keeps your CPU busy rather than your RAM:

[0 2 3 ..., 0 3 2] 110000000
0:00:16.866659
+1

In [39]: pd.options.display.max_rows=12

In [40]: N = 100000

In [41]: chars = np.array(list('ABCDEF'))

In [42]: s = pd.Series(np.random.choice(chars, size=4 * np.prod(N)).view('S4'))

In [45]: s
Out[45]: 
0        BEBC
1        BEEC
2        FEFA
3        BBDA
4        CCBB
5        CABE
         ... 
99994    EEBC
99995    FFBD
99996    ACFB
99997    FDBE
99998    BDAB
99999    CCFD
dtype: object

Then pull out each character position and convert it to its categorical code:

In [43]: maxlen = s.str.len().max()

In [44]: result = pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)

In [47]: result
Out[47]: 
       0  1  2  3
0      1  4  1  2
1      1  4  4  2
2      5  4  5  0
3      1  1  3  0
4      2  2  1  1
5      2  0  1  4
...   .. .. .. ..
99994  4  4  1  2
99995  5  5  1  3
99996  0  2  5  1
99997  5  3  1  4
99998  1  3  0  1
99999  2  2  5  3

[100000 rows x 4 columns]

This is quite fast (and fully vectorized):

In [46]: %timeit pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
10 loops, best of 3: 118 ms per loop
+1

There is not much difference between using ord and a dictionary lookup that maps A -> 0, C -> 1, etc.:

import pandas as pd
import numpy as np

bases = ['A', 'C', 'T', 'G']

rowlen = 4
nrows = 1000000

dna = pd.Series(np.random.choice(bases, nrows * rowlen).view('S%i' % rowlen))

lookup = dict(zip(bases, range(4)))

%timeit dna.apply(lambda row: map(lookup.get, row))
# 1 loops, best of 3: 785 ms per loop

%timeit dna.apply(lambda row: map(ord, row))
# 1 loops, best of 3: 713 ms per loop

Jeff's solution is also in the same ballpark in terms of performance:

%timeit pd.concat([dna.str[i].astype('category', categories=bases).cat.codes for i in range(rowlen)], axis=1)
# 1 loops, best of 3: 1.03 s per loop

The main advantage of his approach over mapping the strings to lists of ints is that the category codes can then be viewed as a single (nrows, rowlen) uint8 array via the .values attribute, which can be passed directly to pdist.
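A minimal sketch of that final step on a small made-up series (using pd.CategoricalDtype, which replaces the deprecated categories= keyword in modern pandas):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

bases = ['A', 'C', 'T', 'G']
rowlen = 4
s = pd.Series(['ACGT', 'ACGA', 'TCGA'])

# Positional categorical codes, one column per character position.
codes = pd.concat(
    [s.str[i].astype(pd.CategoricalDtype(bases)).cat.codes for i in range(rowlen)],
    axis=1,
)

mat = codes.values.astype(np.uint8)                  # shape (nrows, rowlen)
dists = pdist(mat, metric='hamming') * mat.shape[1]  # absolute Hamming counts
print(dists)  # pairs in order (0, 1), (0, 2), (1, 2)
```

No per-row Python objects are ever built: the whole column goes to pdist as one contiguous uint8 matrix.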

0

Source: https://habr.com/ru/post/1613885/

