Excel-like text import in Python: automatic parsing of fixed-width columns

In Excel, if you import text with uppercase letters, in which the columns do not line up perfectly and the data may be missing, for example

pH pKa/Em n(slope) 1000*chi2 vdw0 CYS-I0014_ >14.0 0.00 LYS+I0013_ 11.827 0.781 0.440 0.18 

you are given the option to treat it as fixed-width columns, and Excel can automatically determine the width of the column, usually with pretty good results. Is there a library in Python that can just as easily break poorly formed text of a fixed width?

EDIT: This is what importing fixed-width text looks like in Excel. In the first step, you simply check the “fixed width” switch, and then here, in the second step, Excel is already added to the column breaks automatically. The only time this cannot be done properly is when at least one white space character is broken in each line, overlapping each line.

fixed width text importing in excel

+6
source share
1 answer

Firstly, Excel (2003, at home) is not so smart. If your column 1000 * chi2 contains spaces, for example. 1000 * chi2, excel will guess incorrectly.

Trivial case: if your data was originally separated by tabs (not spaces), and multiple tabs were used to display empty columns, then, at least in TCL, it's easy to split each row into the contents of the tab, and I think the trivial is also in Python

But I guess your problem is that they used only whitespace characters. The biggest clue I see to solve this problem was to insert text into a notepad and select a fixed-size font. Everything builds neatly, and you can use the number of characters in each line as a measure of length.

So, IF you can rely on this function of your input, then you can use the sieve approach to determine where the column breaks. When you break lines in the first pass, pay attention to the "positions" along the line that are occupied by non-white space, excluding the position from your list if it is EVER occupied by non-white space. As you walk, you will quickly reach a number of positions that are NEVER occupied by white space. These are therefore your column delimiters. In your example, your "sieve" will be in positions 10-16, 23-24.32, 42-47, which have never been practiced without spaces (provided that I can count). Thus, the addition of this set is your set of column positions in which your data should lie. Thus, the foreach line, each block of unsafe space will fit into one of your columns from the set of positions (i.e., the set of additions) indicated above. I have never been encoded in Python, so the binding is a TCL script that will identify using the sieve approach where the column in the text is split and emit a new text file, and these spaces are replaced by one tab - i.e. 10-16 are replaced with one tab, 23-24 with another, etc. The resulting file is divided into a tab, i.e. The trivial case. I admit, I just tried this on YOUR little data copied to an ex.txt text file; the output goes to ex_.txt. I suspect this could also be a problem if the headers contain spaces.

Hope this helps!

 set fh [open ex.txt] set contents [read $fh];#ok for small-to-medium files. close $fh #first pass set occupied {} set lines [split $contents \n];#split contents at line breaks. foreach line $lines { set chrs [split $line {}];#split each line into chars. set pos 0 foreach chr $chrs { if {$chr ne " "} { lappend occupied $pos } incr pos } } #drop out with long list of occupied "positions": sort to create #our sieve. set datacols [lsort -unique -integer $occupied] puts "occupied: $datacols" #identify column boundaries. set colset {} set start [lindex $datacols 0];#first occupied pos might be > 0?? foreach index $datacols { if {$start < $index} { set end $index;incr end -1 lappend colset [list $start $end] puts "col break starts at $start, ends at $end";#some instro! set start $index } incr start } #Now convert input file to trivial case output file, replacing #sieved space chars with tab characters. set tesloc [lreverse $colset];#reverse the column list! set fh [open ex_.txt w] foreach line $lines { foreach ele $tesloc { set line [string replace $line [lindex $ele 0] [lindex $ele 1] "\t" ] } puts "newline is $line" puts $fh $line } close $fh 
+4
source

Source: https://habr.com/ru/post/890536/


All Articles