I have the following data frame:
import pandas as pd
df = pd.DataFrame({
    'gene': ["foo", "bar // lal", "qux", "woz"],
    'cell1': [5, 9, 1, 7],
    'cell2': [12, 90, 13, 87],
})
df = df[["gene","cell1","cell2"]]
df
It looks like this:
Out[6]:
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
What I want to do is split the "gene" column on " // " so that each gene gets its own row, with the other columns repeated:
gene  cell1  cell2
 foo      5     12
 bar      9     90
 lal      9     90
 qux      1     13
 woz      7     87
My current approach is this:
import pandas as pd
import timeit

def create():
    df = pd.DataFrame({
        'gene': ["foo", "bar // lal", "qux", "woz"],
        'cell1': [5, 9, 1, 7],
        'cell2': [12, 90, 13, 87],
    })
    df = df[["gene", "cell1", "cell2"]]
    # Split each gene string into pieces, expand them into columns, then stack into rows
    s = df["gene"].str.split(' // ').apply(pd.Series).stack()
    # Drop the inner index level so s lines up with the original row index
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    # join returns a new frame, so return it
    return df.join(s)

if __name__ == '__main__':
    print(timeit.timeit("create()", setup="from __main__ import create", number=100))
It is very slow. My real data has about 40 thousand rows to process.
What is a faster way to implement this?
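For reference, here is a minimal sketch of one approach that might be faster, assuming a pandas version that provides DataFrame.explode (0.25 or later). The helper name split_gene_rows and its parameters are made up for illustration, and I have not timed it on the full 40 thousand rows. It avoids building one pd.Series per row, which is probably where most of the time goes in the version above.

```python
import pandas as pd


def split_gene_rows(df, column="gene", sep=" // "):
    """Return a new frame with one row per separated value in `column`.

    Hypothetical helper; assumes pandas >= 0.25 for DataFrame.explode.
    """
    out = df.copy()
    # Turn each string into a list of pieces, e.g. "bar // lal" -> ["bar", "lal"]
    out[column] = out[column].str.split(sep)
    # explode gives every list element its own row and repeats the other columns
    return out.explode(column).reset_index(drop=True)


df = pd.DataFrame({
    "gene": ["foo", "bar // lal", "qux", "woz"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})
result = split_gene_rows(df)
print(result)
```

With the sample frame this gives five rows, where "bar" and "lal" both carry cell1 = 9 and cell2 = 90. The column keeps the name "gene" here rather than being renamed to "Genes".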