Limit the delimiter to only a few tabs when using pandas read_csv

I am reading tab-delimited data into a pandas DataFrame using read_csv, but some tabs occur inside the column data, which means I cannot simply use "\t" as the delimiter. In particular, the last entries on each line are a set of optional tab-delimited tags that match [A-Za-z][A-Za-z0-9]:[A-Za-z]:.+ There are no guarantees as to how many tags there will be or which ones will be present, and different sets of tags may occur on different lines. An example of the data looks like this (all whitespace is tabs in my data):

C42TMACXX:5:2316:15161:76101 163 1 @<@DFFADDDF:DD NH:i:1 HI:i:1 AS:i:200 nM:i:0
C42TMACXX:5:2316:15161:76101 83 1 CCCCCACDDDCB@B NH:i:1 HI:i:1 nM:i:1
C42TMACXX:5:1305:26011:74469 163 1 CCCFFFFFHHHHGJ NH:i:1 HI:i:1 AS:i:200 nM:i:0

I would like to read the tags into a single column, and I thought I could do this by passing in a regular expression for the separator that excludes tabs occurring between tags.

Following http://www.rexegg.com/regex-best-trick.html I wrote the following regular expression for this: [A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t). I tested it in an online regex tester and it seems to match only the tabs I want as delimiters.

But when I run

 import pandas as pd

 df = pd.read_csv("myfile.txt", sep=r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t)", header=None, engine="python")
 print(df)

I get the following output for this data:

                              0   1    2   3  4   5               6   7    8  \
 0  C42TMACXX:5:2316:15161:76101  \t  163  \t  1  \t  @<@DFFADDDF:DD  \t  NaN
 1  C42TMACXX:5:2316:15161:76101  \t   83  \t  1  \t  CCCCCACDDDCB@B  \t  NaN
 2  C42TMACXX:5:1305:26011:74469  \t  163  \t  1  \t  CCCFFFFFHHHHGJ  \t  NaN

      9   10  11      12   13    14
 0  NaN  i:1  \t     NaN  NaN   i:0
 1  NaN  i:1  \t  nM:i:1  NaN  None
 2  NaN  i:1  \t     NaN  NaN   i:0

I expected / want:

                              0    1  2               3                                4
 0  C42TMACXX:5:2316:15161:76101  163  1  @<@DFFADDDF:DD  NH:i:1 HI:i:1 AS:i:200 nM:i:0
 1  C42TMACXX:5:2316:15161:76101   83  1  CCCCCACDDDCB@B           NH:i:1 HI:i:1 nM:i:1
 2  C42TMACXX:5:1305:26011:74469  163  1  CCCFFFFFHHHHGJ  NH:i:1 HI:i:1 AS:i:200 nM:i:0

How can I achieve this?

In case it matters, I am using pandas 0.17.1, and my real data files are on the order of 100 million+ lines.

1 answer

I took a quick look at the pandas docs, and it seems that a regular expression used as a delimiter cannot contain capturing groups.

 C42TMACXX:5:2316:15161:76101 163 1 @<@DFFADDDF:DD NH:i:1 HI:i:1 AS:i:200 nM:i:0
 C42TMACXX:5:2316:15161:76101 83 1 CCCCCACDDDCB@B NH:i:1 HI:i:1 nM:i:1
 C42TMACXX:5:1305:26011:74469 163 1 CCCFFFFFHHHHGJ NH:i:1 HI:i:1 AS:i:200 nM:i:0
                             ^   ^ ^              ^

You only need to match the first 4 tabs on each line, but you cannot use groups to do so.
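The stray \t, NaN and None columns in the question's output are consistent with how Python's re.split treats capturing groups. The sketch below uses only the standard re module on a made-up string; that the pandas python engine splits each line in the same way is an assumption here, not something checked against the pandas source.

```python
import re

# Made-up line, not taken from the question's data.
line = "a\tb;c"

# No capturing group: only the fields are returned.
plain = re.split(r"\t", line)

# Capturing group: the matched delimiter text is returned as an extra item.
grouped = re.split(r"(\t)", line)

# Alternation where the group does not take part in one of the matches:
# re.split then emits None for that match.
mixed = re.split(r";|(\t)", line)

print(plain)    # ['a', 'b;c']
print(grouped)  # ['a', '\t', 'b;c']
print(mixed)    # ['a', '\t', 'b', None, 'c']
```

Every extra item becomes an extra column, which would explain why the frame in the question has 15 columns instead of 5.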

The solution is to isolate the desired \t with lookaheads and lookbehinds.

Here is a regex that should work:

(?<=\d)\t(?=\d)|\t(?=[A-Z@<:]{14})|(?<=[A-Z@<:]{14})\t

Explanation

(?<=\d)\t(?=\d) : a tab preceded by a digit (the lookbehind, (?<=...)) and followed by a digit (the lookahead, (?=...))

=> matches the first and second tabs

| OR

\t(?=[A-Z@<:]{14}) : a tab followed by 14 consecutive characters from the set: uppercase letter, @, < or :

=> matches the third tab

| OR

(?<=[A-Z@<:]{14})\t : a tab preceded by 14 consecutive characters from the same set

=> matches the fourth tab
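As a quick check of the explanation above, here is a minimal sketch that applies the pattern to the three sample lines with the standard re module. The names FIELD_SEP, sample and rows are made up for illustration, and the sample strings assume that every gap in the question's data is a single tab.

```python
import re

# The separator from the answer: lookarounds only, no capturing groups.
FIELD_SEP = re.compile(
    r"(?<=\d)\t(?=\d)|\t(?=[A-Z@<:]{14})|(?<=[A-Z@<:]{14})\t"
)

# The question's sample lines, with tabs written out explicitly.
sample = [
    "C42TMACXX:5:2316:15161:76101\t163\t1\t@<@DFFADDDF:DD\tNH:i:1\tHI:i:1\tAS:i:200\tnM:i:0",
    "C42TMACXX:5:2316:15161:76101\t83\t1\tCCCCCACDDDCB@B\tNH:i:1\tHI:i:1\tnM:i:1",
    "C42TMACXX:5:1305:26011:74469\t163\t1\tCCCFFFFFHHHHGJ\tNH:i:1\tHI:i:1\tAS:i:200\tnM:i:0",
]

# Split each line on the first four tabs only; the tags stay together.
rows = [FIELD_SEP.split(line) for line in sample]

for row in rows:
    print(len(row), repr(row[-1]))
```

Each line splits into five fields, with the tab-separated tags kept together in the last one. The same pattern string should be usable as sep in read_csv with engine="python", as in the question, though that was not run here.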


Note

If other characters can occur in the 14-character field, just add them to the character class.
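If the character set of that field is hard to pin down, a different approach (not the regex from this answer) is to split each line on at most four tabs with str.split and its maxsplit argument. This is a minimal sketch; split_record and n_fixed are hypothetical names, and it assumes the first four fields never contain a tab.

```python
def split_record(line, n_fixed=4):
    """Split on the first n_fixed tabs; the rest of the line stays intact."""
    return line.rstrip("\n").split("\t", n_fixed)


# One of the question's sample lines, with tabs written out explicitly.
record = "C42TMACXX:5:2316:15161:76101\t83\t1\tCCCCCACDDDCB@B\tNH:i:1\tHI:i:1\tnM:i:1\n"
fields = split_record(record)

# A hypothetical line with no tags yields only the four fixed fields.
no_tags = split_record("id\t1\t2\tQQQQ\n")

print(fields)
print(no_tags)
```

The resulting lists can be collected and passed to pd.DataFrame. For files with 100 million+ lines, that would probably have to be done in chunks rather than in one go.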


Source: https://habr.com/ru/post/1240098/

