I have a longitudinal data set created by a computer simulator, which can be represented by the following tables ("var" are variables):
time subject var1 var2 var3
t1 subjectA ...
t2 subjectB ...
and
subject name
subjectA nameA
subjectB nameB
However, the simulator writes the data file in a format similar to the following:
time t1
description
subjectA nameA
var1 var2 var3
subjectB nameB
var1 var2 var3
time t2
description
subjectA nameA
var1 var2 var3
subjectB nameB
var1 var2 var3
...(and so on)
I use a (Python) script to convert this output into a flat text file so that I can import it into R, Python, or SQL, or run awk/grep on it, to extract information. Here is an example of the kind of information I want from a single query (in SQL notation, after the data has been converted into a table):
SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'
I wonder if there is a more efficient solution, since these data files can be ~100 MB each (and I have hundreds of them), and creating a flat text file takes a lot of time and uses extra hard disk space for redundant information. Ideally, I would interact directly with the original data set to extract the information I need, without creating an additional flat text file. Is there a simpler awk/perl solution for such tasks? I'm fairly proficient at text processing in Python, but my awk skills are rudimentary and I have no working knowledge of perl; I wonder if these or other domain-specific tools can provide a better solution.
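To make the goal concrete, here is a minimal sketch of the kind of direct extraction I have in mind: it streams the original file once and yields only the rows for one subject, so no flat file is written. It is not my actual script. The names `Record`, `iter_records`, and `select` are hypothetical, the sample values are made up, and it assumes the layout is exactly as shown above (a `time` line, one description line, then alternating `subject name` and values lines).

```python
import io
from typing import Iterator, List, NamedTuple, TextIO, Tuple


class Record(NamedTuple):
    time: str
    subject: str
    name: str
    values: List[str]


def iter_records(lines: TextIO) -> Iterator[Record]:
    """Yield one Record per subject block, reading line by line."""
    time = None
    expect_description = False
    header = None  # (subject, name) still waiting for its values line
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if expect_description:
            # ASSUMPTION: exactly one description line follows each "time" line
            expect_description = False
        elif fields[0] == "time":
            time = fields[1]
            expect_description = True
        elif header is None:
            # ASSUMPTION: a subject line has the form "<subject> <name>"
            header = (fields[0], fields[1])
        else:
            yield Record(time, header[0], header[1], fields)
            header = None


def select(lines: TextIO, subject: str) -> Iterator[Tuple[str, List[str]]]:
    """Rough equivalent of: SELECT time, var1, var2, var3 WHERE subject=..."""
    for rec in iter_records(lines):
        if rec.subject == subject:
            yield rec.time, rec.values


# Made-up sample data in the assumed layout
SAMPLE = """\
time t1
description of t1
subjectA nameA
1 2 3
subjectB nameB
4 5 6
time t2
description of t2
subjectA nameA
7 8 9
subjectB nameB
10 11 12
"""

if __name__ == "__main__":
    for time, values in select(io.StringIO(SAMPLE), "subjectB"):
        print(time, *values)
```

With a real file, `open(path)` would replace `io.StringIO(SAMPLE)`; because the file object is iterated lazily, memory use should stay small even for a ~100 MB input.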
Thanks!
Postscript: Wow, thanks everyone! I'm sorry that I can't accept all the answers.
@FM: Thanks. My Python script is like your code without the filtering step, but your organization is cleaner.
@PP: I thought I already knew grep, but apparently not! This is very useful ... but I think grepping becomes difficult when "time" has to be mixed into the output (which I failed to include as a possible extraction scenario in my example! My bad).
@ghostdog74: This is just fantastic ... but changing the line to get "subjectA" was not easy ... (although I will read more on awk in the meantime and hopefully I will grok it later).
@weismat: Well said.
@S.Lott: This is extremely elegant and flexible. I didn't ask for a Python(ic) solution, but it fits in cleanly with the parse, filter, and output framework suggested by PP, and it is flexible enough to accommodate slightly different requests for extracting different types of information from this hierarchical file.
Again, I am grateful to everyone - thank you very much.