Loop optimization, passing parameters from an external file, naming array arguments in awk

I am new to awk. I am using GNU gawk on Windows, from UnxUtils.

I have two kinds of records in my files, sorted by date and time: 30-field Order records (starting with "O"), where the quantity is the 15th field, and 18-field Trade records (starting with "T"), where the quantity is the eighth field. The main research data is historical archive data from the Indian stock market, covering 15 days in April 2006 and about 1,000 firms, and it includes about 100 million individual order or trade records. My test data is 500 records for two dates and about 200 firms.

My goal at the moment is only to calculate, for each firm and each date, the cumulative order quantity and the cumulative trade quantity.

The source data is sorted by date and time, so the firms are naturally jumbled (much as voters do not usually turn up to vote in alphabetical order!). I also have two separate text files: one contains just the distinct firm symbols, and the other the distinct dates, one per line.

I want to do the calculations in a way that does not require me to go through all the records over and over again for each firm and date. The basic calculation for a given firm = FIRM_1 and date = DATE_1 is easy, for example something like:

    # For each Order record with firm symbol FIRM_1 and date DATE_1,
    # accumulate its order quantity ($15).
    /^O/ && $4 ~ /FIRM_1/ && $2 ~ /DATE_1/ {
        Order_Q["FIRM_1_DATE_1"] += $15
    }

    # For each Trade record with firm symbol FIRM_1 and date DATE_1,
    # accumulate its trade quantity ($8).
    /^T/ && $4 ~ /FIRM_1/ && $2 ~ /DATE_1/ {
        Trade_Q["FIRM_1_DATE_1"] += $8
    }

    END {
        print "FIRM_1", "DATE_1", Order_Q["FIRM_1_DATE_1"], Trade_Q["FIRM_1_DATE_1"]
    }

The question is how to build an intelligent loop over all firms and dates, given the size of the underlying data. There are several related questions.

  • I know that the name FIRM_1 does not have to be hardcoded inside the awk script but can be given as a command-line parameter. Can one go a step further and have awk take this name sequentially from a list of names in a separate file, one per line? (If that is possible, taking the dates from a list of dates should be possible too.)

  • I built the names of the array arguments (indices) that store the order quantity and the trade quantity by hand, knowing FIRM_1 and DATE_1. If the above can be solved, can the index names, such as FIRM_1_DATE_1, be built on the fly inside awk while it runs? Could string concatenation help to form the name?

  • I understand that I could use an editor macro or some similar method to combine my two keys, FIRM (1,000 values) and DATE (15 values), into one FIRM_DATE key (15,000 values) in a separate step before doing any of this. If the second item above is doable, I assume that step is unnecessary. Would it help anyway?

  • In principle, we expect to keep up to 1,000 firms × 15 days × 2 variables = 30,000 cell entries in the two arrays ORDER_Q and TRADE_Q. Is that a lot? I use a modest Windows desktop; I think it has 8 GB of RAM.

Any suggestion, link, or example that helps reduce the need to pass over the large original input several times will be very welcome. If it involves learning not only awk but also shell scripting, that is welcome too.

1 answer

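Use associative arrays. Assuming $2 contains the company name and $4 the date (the snippet in the question tests $4 for the firm and $2 for the date, so swap the two if that is your layout), then: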

    awk '
    /^O/ { order_qty[$2,$4] += $15 }
    /^T/ { trade_qty[$2,$4] += $8 }
    END {
        for (key in order_qty) { print key, "O", order_qty[key] }
        for (key in trade_qty) { print key, "T", trade_qty[key] }
    }'

This makes a single pass over the data, accumulating results for all companies and all dates in one go. It does not give you any particular order of companies or dates in the output, but there are techniques for that, for example:

    awk '
    {
        if (date[$4]++ == 0) date_list[d++] = $4   # Dates appear in order
        if (firm[$2]++ == 0) firm_list[f++] = $2   # Firms appear out of order
    }
    /^O/ { order_qty[$2,$4] += $15 }
    /^T/ { trade_qty[$2,$4] += $8 }
    END {
        for (i = 0; i < f; i++)
        {
            for (j = 0; j < d; j++)
            {
                if ((qty = order_qty[firm_list[i],date_list[j]]) > 0)
                    print firm_list[i], date_list[j], "O", qty
                if ((qty = trade_qty[firm_list[i],date_list[j]]) > 0)
                    print firm_list[i], date_list[j], "T", qty
            }
        }
    }'
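A note on the `[$2,$4]` subscripts used in both scripts: awk joins the two values into one string key, separated by the built-in variable SUBSEP, so `for (key in order_qty)` hands back the combined key. The following is only a sketch of splitting that key back into firm and date and letting the external `sort` command order the output. The sample records, the whitespace field separator, and the firm-in-$2 / date-in-$4 layout are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sample: whitespace-separated records with the firm in $2,
# the date in $4, the order quantity in $15 and the trade quantity in $8.
sample='O INFY x 20060403 a b c d e f g h i j 100
T INFY x 20060403 a b c 40
O ACC x 20060403 a b c d e f g h i j 7
O INFY x 20060404 a b c d e f g h i j 5
O INFY x 20060403 a b c d e f g h i j 50'

totals=$(printf '%s\n' "$sample" | awk '
/^O/ { order_qty[$2,$4] += $15 }
/^T/ { trade_qty[$2,$4] += $8 }
END {
    for (key in order_qty) {
        split(key, part, SUBSEP)      # part[1] = firm, part[2] = date
        print part[1], part[2], "O", order_qty[key]
    }
    for (key in trade_qty) {
        split(key, part, SUBSEP)
        print part[1], part[2], "T", trade_qty[key]
    }
}' | LC_ALL=C sort)

printf '%s\n' "$totals"
```

With about 1,000 firms and 15 dates, each array holds at most 15,000 keys, which is a small amount of memory for awk on a desktop machine.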

If you want the firms in a specific (for example, sorted) order, sort the list of firms before printing. GNU awk provides built-in sorting functions (such as asort); in other versions of awk you need to write a sort function yourself. (See Programming Pearls or More Programming Pearls, or both, for more information on writing sort functions in awk.)
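As a rough illustration of the hand-written route, here is a small insertion sort in plain awk applied to a list of firms collected the same way as `firm_list` above. The function name `isort` and the sample symbols are hypothetical, and insertion sort is chosen only because it is short; for about 1,000 firms its quadratic cost should not matter.

```shell
#!/bin/sh
# Hypothetical firm symbols in arrival order, one per line.
firms='INFY
ACC
TCS
BHEL'

sorted=$(printf '%s\n' "$firms" | awk '
# Insertion sort of arr[0..n-1] in place, comparing elements as strings.
function isort(arr, n,    i, j, tmp) {
    for (i = 1; i < n; i++) {
        tmp = arr[i]
        for (j = i - 1; j >= 0 && (arr[j] "") > (tmp ""); j--)
            arr[j + 1] = arr[j]
        arr[j + 1] = tmp
    }
}
{ if (seen[$1]++ == 0) firm_list[f++] = $1 }   # remember each firm once
END {
    isort(firm_list, f)
    for (i = 0; i < f; i++) print firm_list[i]
}')

printf '%s\n' "$sorted"
```

In the full script, the call to `isort(firm_list, f)` would go at the start of the END block, before the nested printing loops.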

Warning: unverified code.
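On the question of taking firm names from a separate file and building keys by concatenation: the single-pass scripts do not need the firm list at all, but if you want to restrict the totals to listed firms, a common awk idiom is to read the list as the first input file. The sketch below assumes hypothetical file names (`firms.txt`, `records.txt`), made-up sample records, and the same firm-in-$2 / date-in-$4 layout; note that the `NR == FNR` test misbehaves if the first file is empty.

```shell
#!/bin/sh
# Hypothetical file names and sample data, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' INFY ACC > "$tmpdir/firms.txt"
cat > "$tmpdir/records.txt" <<'EOF'
O INFY x 20060403 a b c d e f g h i j 100
O TCS x 20060403 a b c d e f g h i j 9
T INFY x 20060403 a b c 40
O ACC x 20060404 a b c d e f g h i j 7
EOF

report=$(awk '
NR == FNR { wanted[$1] = 1; next }       # first file: the list of firms
!($2 in wanted) { next }                 # skip firms that are not listed
/^O/ { order_qty[$2 "_" $4] += $15 }     # key built by string concatenation
/^T/ { trade_qty[$2 "_" $4] += $8 }
END {
    for (key in order_qty) print key, "O", order_qty[key]
    for (key in trade_qty) print key, "T", trade_qty[key]
}' "$tmpdir/firms.txt" "$tmpdir/records.txt" | LC_ALL=C sort)

printf '%s\n' "$report"
rm -r "$tmpdir"
```

Because the key is just a string such as `INFY_20060403`, there is no need to pre-combine the FIRM and DATE columns in a separate editing step.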


Source: https://habr.com/ru/post/1203660/

