I am reading tab-delimited data into a pandas DataFrame using read_csv, but tabs also occur inside the column data, which means I cannot simply use \t as the delimiter. In particular, the last entries on each line are a set of optional tab-delimited tags that match the pattern [A-Za-z][A-Za-z0-9]:[A-Za-z]:.+ There are no guarantees as to how many tags there will be or which ones will be present, and different sets of tags may occur on different lines. An example of the data looks like this (all whitespace shown is a tab in my data):
C42TMACXX:5:2316:15161:76101 163 1 @<@DFFADDDF:DD NH:i:1 HI:i:1 AS:i:200 nM:i:0
C42TMACXX:5:2316:15161:76101 83 1 CCCCCACDDDCB@B NH:i:1 HI:i:1 nM:i:1
C42TMACXX:5:1305:26011:74469 163 1 CCCFFFFFHHHHGJ NH:i:1 HI:i:1 AS:i:200 nM:i:0
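To make the tag format concrete, here is a quick check of that tag pattern with Python's re module (a minimal sketch; the sample values are taken from the data above):

```python
import re

# Tag pattern from the description: two-char name, one-char type code, a value.
tag_re = re.compile(r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:.+")

print(bool(tag_re.fullmatch("NH:i:1")))    # True  - a tag from the data
print(bool(tag_re.fullmatch("AS:i:200")))  # True  - another tag
print(bool(tag_re.fullmatch("163")))       # False - an ordinary column value
```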
I would like to read all the tags into a single column, and I thought I could do this by passing a regular expression as the separator that excludes tabs occurring in the context of tags.
Following http://www.rexegg.com/regex-best-trick.html I wrote the following regular expression for this: [A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t). I tested it in an online regex tester and it appears to match exactly the tabs I want as delimiters.
But when I run

import pandas as pd

df = pd.read_csv("myfile.txt",
                 sep=r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t)",
                 header=None, engine="python")
print(df)
I get the following output for this data:
                              0   1    2   3  4   5               6   7    8  \
0  C42TMACXX:5:2316:15161:76101  \t  163  \t  1  \t  @<@DFFADDDF:DD  \t  NaN
1  C42TMACXX:5:2316:15161:76101  \t   83  \t  1  \t  CCCCCACDDDCB@B  \t  NaN
2  C42TMACXX:5:1305:26011:74469  \t  163  \t  1  \t  CCCFFFFFHHHHGJ  \t  NaN

     9   10  11      12   13    14
0  NaN  i:1  \t     NaN  NaN   i:0
1  NaN  i:1  \t  nM:i:1  NaN  None
2  NaN  i:1  \t     NaN  NaN   i:0
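For what it's worth, the splitting behavior can be reproduced outside pandas with re.split, which is (as far as I can tell) what the python engine does with a regex separator; the pieces and the capture-group values for each match end up as columns. A minimal sketch on the first line of my data:

```python
import re

sep = r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t)"
line = ("C42TMACXX:5:2316:15161:76101\t163\t1\t@<@DFFADDDF:DD"
        "\tNH:i:1\tHI:i:1\tAS:i:200\tnM:i:0")

# The left alternative consumes e.g. "NH:i:1<TAB>HI:" as one whole match,
# so the surrounding empty pieces and the unmatched capture group (None)
# leak into the result instead of clean fields.
print(re.split(sep, line))
# ['C42TMACXX:5:2316:15161:76101', '\t', '163', '\t', '1', '\t',
#  '@<@DFFADDDF:DD', '\t', '', None, 'i:1', '\t', '', None, 'i:0']
```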
I expected / want:
                              0    1  2               3                               4
0  C42TMACXX:5:2316:15161:76101  163  1  @<@DFFADDDF:DD  NH:i:1 HI:i:1 AS:i:200 nM:i:0
1  C42TMACXX:5:2316:15161:76101   83  1  CCCCCACDDDCB@B           NH:i:1 HI:i:1 nM:i:1
2  C42TMACXX:5:1305:26011:74469  163  1  CCCFFFFFHHHHGJ  NH:i:1 HI:i:1 AS:i:200 nM:i:0
How can I achieve this?
In case it matters: I am using pandas 0.17.1, and my real data files are on the order of 100+ million lines.