Let's say I have a huge pandas / NumPy dataset, where each element is a list of ordered values:
import numpy as np

sequences = np.array([[12431253, 123412531, 12341234, 12431253, 145345],
                      [5463456, 1244562, 23452],
                      [243524, 141234, 12431253, 456367],
                      [456345, 253451],
                      [75635, 14145, 12346, 12431253]], dtype=object)
or,
import pandas as pd

sequences = pd.DataFrame({'sequence': [[12431253, 123412531, 12341234, 12431253, 145345],
                                       [5463456, 1244562, 23452],
                                       [243524, 141234, 456367, 12431253],
                                       [456345, 253451],
                                       [75635, 14145, 12346, 12431253]]})
and I want to replace them with a different set of identifiers that start with 0, so I create this mapping:
from itertools import chain

# collect the unique values across all sequences and assign each a new id
unique_values = set(chain.from_iterable(sequences['sequence']))
mapping = pd.DataFrame({'v0': list(unique_values), 'v1': range(len(unique_values))})
......
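to make it concrete, the replacement step I have in mind is roughly this (a plain per-element dict lookup, which I suspect is far too slow for the real data):
id_map = dict(zip(mapping['v0'], mapping['v1']))  # original value -> new identifier
sequences['sequence'] = sequences['sequence'].apply(lambda seq: [id_map[v] for v in seq])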
so the result I was looking for:
sequences = np.array([[1, 2, 3, 1, 4], [5, 6, 7], [8, 9, 10, 1], [11, 12], [13, 14, 15, 1]], dtype=object)
How can I scale this to a huge data frame / number of sequences?
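One direction I was vaguely considering is flattening everything, running pd.factorize once, and splitting the codes back by the original list lengths, roughly like this (just a sketch, I'm not sure it's correct or the idiomatic way):
flat = np.concatenate(sequences['sequence'].to_list())   # all values as one flat 1-D array
codes, uniques = pd.factorize(flat)                      # codes are 0-based identifiers
lengths = sequences['sequence'].apply(len).to_numpy()
sequences['sequence'] = [list(chunk) for chunk in np.split(codes, np.cumsum(lengths)[:-1])]
Is something like that reasonable, or is there a better / more scalable way?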
Thanks so much for any guidance! Very grateful!