Processing text from a non-flat file (to extract information as if it were a flat file)

I have a longitudinal data set created by a computer simulator, which can be represented by the following tables (the "var" columns are variables):

 time  subject   var1  var2  var3
 t1    subjectA  ...
 t2    subjectB  ...

and

 subject   name
 subjectA  nameA
 subjectB  nameB

However, the simulator writes the data file in a format similar to the following:

 time t1
 description
 subjectA nameA
 var1 var2 var3
 subjectB nameB
 var1 var2 var3
 time t2
 description
 subjectA nameA
 var1 var2 var3
 subjectB nameB
 var1 var2 var3
 ...(and so on)

I use a (Python) script to convert this output into a flat text file so that I can import it into R, Python, or SQL, or use awk/grep to extract information. Here is an example of the kind of information I want from a single query (in SQL notation, after the data has been converted into a table):

 SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB' 

I wonder if there is a more efficient solution: each of these data files can be ~100 MB (and I have hundreds of them), and creating the flat text file takes a lot of time and eats up extra hard-disk space with redundant information. Ideally, I would interact directly with the original data set to extract the information I need, without creating the extra flat text file... Is there an awk/perl solution for such tasks that is simpler? I'm fairly proficient at text processing in Python, but my awk skills are rudimentary and I have no professional Perl experience; I wonder whether these or other domain-specific tools can provide a better solution.
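For concreteness, here is a minimal sketch (toy data inline; the exact field layout is an assumption based on the format shown above) of the kind of direct interaction I mean: stream records straight from the raw layout into an in-memory SQLite table, so the SQL query runs without ever writing a flat text file to disk.

```python
import io
import sqlite3

# Toy data mimicking the raw simulator layout described above.
RAW = """time t1
description one
subjectA nameA
1 2 3
subjectB nameB
4 5 6
time t2
description two
subjectA nameA
7 8 9
subjectB nameB
10 11 12
"""

def records(lines):
    """Yield (time, subject, var1, var2, var3) tuples from the raw layout."""
    it = iter(lines)
    t = None
    for line in it:
        words = line.split()
        if not words:
            continue
        if words[0] == 'time':
            t = words[1]
            next(it)                   # skip the description line
        else:
            subject = words[0]         # the "subjectX nameX" line
            vars_ = next(it).split()   # variables are on the next line
            yield (t, subject, *vars_)

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE datatable (time, subject, var1, var2, var3)')
con.executemany('INSERT INTO datatable VALUES (?, ?, ?, ?, ?)',
                records(io.StringIO(RAW)))
rows = con.execute(
    "SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'"
).fetchall()
print(rows)  # [('4', '5', '6'), ('10', '11', '12')]
```

In practice records() would take an open file handle instead of io.StringIO, and the table could be indexed on subject if many queries run against the same file.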

Thanks!

Postscript: Wow, thanks everyone! I'm sorry that I can't accept all the answers. @FM: thanks. My Python script is like your code minus the filtering step, but your organization is clean. @PP: I thought I already knew grep, but apparently not! This is very useful... though grepping becomes difficult when the "time" has to be mixed into the output (which I failed to include as a possible extraction scenario in my example! My bad). @ghostdog74: This is just fantastic... but changing the line to get "subjectA" was not easy (although I will read more on awk in the meantime and hopefully grok it later). @weismat: Well said. @S.Lott: This is extremely elegant and flexible. I wasn't asking for a Python(ic) solution, but it fits cleanly with the parse, filter, and output structure suggested by PP, and it is flexible enough to accommodate a range of requests for extracting different kinds of information from this hierarchical file.

Again, I am grateful to everyone - thank you very much.

+4
5 answers

Here's what Python generators are for.

 def read_as_flat( someFile ):
     line_iter = iter(someFile)
     time_header = None
     for line in line_iter:
         words = line.split()
         if words[0] == 'time':
             time_header = words[1:]                  # the "time" line
             description = next(line_iter).strip()    # description is on the next line
             time_header.append(description)
         elif words[0] in subjectNameSet:
             data = next(line_iter).split()           # the variables are on the next line
             yield time_header + data

You can use this like a standard Python iterator:

 for time, description, var1, var2, var3 in read_as_flat(someFile):
     ...  # etc.
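A nice property of this design (shown here as a self-contained sketch with toy data; the field layout is assumed from the question's format) is that restricting subjectNameSet is effectively the WHERE clause of the question's SQL query:

```python
import io

# Toy data mimicking the raw simulator layout from the question.
RAW = """time t1
description one
subjectA nameA
1 2 3
subjectB nameB
4 5 6
time t2
description two
subjectA nameA
7 8 9
subjectB nameB
10 11 12
"""

subjectNameSet = {'subjectB'}  # restricting this set acts as the WHERE clause

def read_as_flat(some_file):
    line_iter = iter(some_file)
    time_header = None
    for line in line_iter:
        words = line.split()
        if words[0] == 'time':
            time_header = words[1:]                      # the "time" line
            time_header.append(next(line_iter).strip())  # description line
        elif words[0] in subjectNameSet:
            yield time_header + next(line_iter).split()  # variable line

rows = [(v1, v2, v3)
        for _time, _desc, v1, v2, v3 in read_as_flat(io.StringIO(RAW))]
print(rows)  # [('4', '5', '6'), ('10', '11', '12')]
```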
+4

If all you need is var1, var2, var3 when matching a specific subject, you can try the following command:

 grep -A 1 'subjectB' 

The -A 1 command-line argument tells grep to print the matching line plus one line after it (in this case the variables land on the line after the subject).

You might want to use the -E option to make grep search for a regular expression and anchor the subject search to the beginning of the line (for example, grep -A 1 -E '^subjectB' ).

Finally, the output will consist of the subject line and the variable line you want. You can filter out the subject line:

 grep -A 1 'subjectB' | grep -v 'subjectB'

And you can massage the variable line into comma-separated form:

 grep -A 1 'subjectB' | grep -v 'subjectB' | perl -pe 's/ /,/g'
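If grep is unavailable, or this pipeline needs to live inside a larger program, the same match-and-take-next-line logic is only a few lines of Python (a sketch with toy data, not part of the original answer):

```python
import io

# Toy fragment of the raw layout: each subject line is followed by its variables.
RAW = "subjectA nameA\n1 2 3\nsubjectB nameB\n4 5 6\n"

out = []
lines = iter(io.StringIO(RAW))
for line in lines:
    if line.startswith('subjectB'):                # like grep -E '^subjectB'
        out.append(','.join(next(lines).split()))  # -A 1, -v, and s/ /,/g in one step
print(out)  # ['4,5,6']
```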
+2

A better option would be to modify the computer simulation to produce rectangular (flat) output. Assuming you can't do that, here's one approach:

To be able to use the data in R, SQL, etc., you need to convert them from hierarchical to rectangular, one way or another. If you already have a parser that can convert the entire file into a rectangular dataset, you are basically there. The next step is to add extra flexibility to your parser so that it can filter out unwanted data records. Instead of a file converter, you will have a utility for extracting data.

The following is an example in Perl, but you can do the same thing in Python. The general idea is to maintain a clean separation between (a) parsing, (b) filtering, and (c) output. That way you have a flexible environment that makes it easy to add different filtering or output methods, depending on your data-crunching needs. You can also set up the filtering methods to accept parameters (from the command line or a configuration file) for even more flexibility.

 use strict;
 use warnings;

 read_file($ARGV[0], \&check_record);

 sub read_file {
     my ($file_name, $check_record) = @_;
     open(my $file_handle, '<', $file_name) or die $!;

     # A data structure to hold an entire record.
     my $rec = {
         time => '',
         desc => '',
         subj => '',
         name => '',
         vars => [],
     };

     # A code reference to get the next line and do some cleanup.
     my $get_line = sub {
         my $line = <$file_handle>;
         return unless defined $line;
         chomp $line;
         $line =~ s/^\s+//;
         return $line;
     };

     # Start parsing the data file.
     while ( my $line = $get_line->() ){
         if ($line =~ /^time (\w+)/){
             $rec->{time} = $1;
             $rec->{desc} = $get_line->();
         }
         else {
             ($rec->{subj}, $rec->{name}) = $line =~ /(\w+) +(\w+)/;
             $rec->{vars} = [ split / +/, $get_line->() ];

             # OK, we have a complete record. Now invoke our filtering
             # code to decide whether to export record to rectangular format.
             $check_record->($rec);
         }
     }
 }

 sub check_record {
     my $rec = shift;
     # Just an illustration. You'll want to parameterize this, most likely.
     write_output($rec)
         if $rec->{subj} eq 'subjectB'
         and $rec->{time} eq 't1';
 }

 sub write_output {
     my $rec = shift;
     print join("\t",
         $rec->{time}, $rec->{subj}, $rec->{name},
         @{$rec->{vars}},
     ), "\n";
 }
+2

If you are lazy and have enough RAM, I would work on a RAM disk instead of the file system, as long as you only need the data temporarily.
I don't think Perl or awk will be faster than Python if you just recode your current algorithm into another language.

+1
 awk '/time/{f=0}/subjectB/{f=1;next}f' file 
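The one-liner keeps a flag f: any line matching /time/ clears it, a line matching /subjectB/ sets it and skips itself via next, and the trailing bare f prints every line while the flag is set. A line-by-line Python rendering of the same state machine (toy data assumed) also shows the one place to change in order to extract subjectA instead:

```python
import io

# Toy data mimicking the raw simulator layout from the question.
RAW = """time t1
description one
subjectA nameA
1 2 3
subjectB nameB
4 5 6
time t2
description two
subjectA nameA
7 8 9
subjectB nameB
10 11 12
"""

out = []
f = False
for line in io.StringIO(RAW):
    if 'time' in line:      # awk's /time/{f=0}
        f = False
    if 'subjectB' in line:  # awk's /subjectB/{f=1;next} -- swap in 'subjectA'
        f = True            # here to extract the other subject's lines
        continue
    if f:                   # awk's bare `f` pattern: print while the flag is set
        out.append(line.strip())
print(out)  # ['4 5 6', '10 11 12']
```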
+1
