Associative arrays in awk causing memory constraints

Question

Associative arrays in awk causing memory constraints

This is due to my recent message in Awk code with associative arrays - the array does not seem to be filled, but without errors , as well as until the optimization of the loop, passing parameters from an external file, defining array arguments to awk

My main problem here is simply to calculate from the detailed ancient archival data of the financial market, daily aggregates #transactions, #shares, value, BY DATE, FIRM-ID, EXCHANGE, etc. Learned how to use associative arrays in awk for this, and was thrilled to be able to process over 129 million lines in clock mode for less than 11 minutes. Just before I finished the coffee.

Became a little more ambitious and moved from 2 array indices to 4, and now I can not process more than 6500 rows at a time.

Receive form error messages:

K: \ User Folders \ KRISHNANM \ PAPERS \ FII_Transaction_Data> zcat RAW_DATA \ 2003_1.zip | gawk -f CODE \ FII_daily_aggregates_v2.awk> OUTPUT \ 2003_1.txt &
gawk: CODE \ FII_daily_aggregates_v2.awk: 33: (FILENAME = - FNR = 49300) fatal: more_no des: nextfree: cannot allocate memory (not enough space)

On some launches, the car informed me that it only needs 52 KB of memory. I have what I think of a std configuration with Win-7 and 8 MB of RAM.

(Economist by education, not a computer scientist). I understand that moving from 2 to 4 arrays makes the problem computationally more complex for a computer, but is there anything that can be done to improve memory management, at least a little. I tried to close everything else that I do. An error should always be performed only with memory, never with disk space or anything else.

INPUT example:

49290,C198962542782200306,6/30/2003,433581,F5811773991200306,S5405611832200306,B5086397478200306,NESTLE INDIA LTD.,INE239A01016,6/27/2003,1,E9035083824200306,REG_DL_STLD_02,591.13,5655,3342840.15,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00 49291,C198962542782200306,6/30/2003,433563,F6292896459200306,S6344227311200306,B6110521493200306,GRASIM INDUSTRIES LTD.,INE047A01013,6/27/2003,1,E9035083824200306,REG_DL_STLD_02,495.33,3700,1832721,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00 49292,C198962542782200306,6/30/2003,433681,F6513202607200306,S1724027402200306,B6372023178200306,HDFC BANK LTD,INE040A01018,6/26/2003,1,E745964372424200306,REG_DL_STLD_02,242,2600,629200,REG_DL_INSTR_EQ,REG_DL_DLAY_D,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00 49293,C7885768925200306,6/30/2003,48128,F4406661052200306,S7376401565200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,3,E912851176274200306,REG_DL_STLD_04,125,44600,5575000,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00 49294,C7885768925200306,6/30/2003,48129,F4500260787200306,S1312094035200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,4,E912851176274200306,REG_DL_STLD_04,125,445600,55700000,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00 49295,C7885768925200306,6/30/2003,48130,F6425024637200306,S2872499118200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,3,E912851176274200306,REG_DL_STLD_04,125,48000,6000000,REG_DL_INSTR_EU,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00

the code

 BEGIN { FS = "," } # For each array subscript variable -- DATE ($10), firm_ISIN ($9), EXCHANGE ($12), and FII_ID ($5), after checking for type = EQ, set up counts for each value, and number of unique values. ( $17~/_EQ\>/ ) { if (date[$10]++ == 0) date_list[d++] = $10; if (isin[$9]++ == 0) isin_list[i++] = $9; if (exch[$12]++ == 0) exch_list[e++] = $12; if (fii[$5]++ == 0) fii_list[f++] = $5; } # For cash-in, buy (B), or cash-out, sell (S) count NR = no of records, SH = no of shares, RV = rupee-value. (( $17~/_EQ\>/ ) && ( $11~/1|2|3|5|9|1[24]/ )) {{ ++BNR[$10,$9,$12,$5]} {BSH[$10,$9,$12,$5] += $15} {BRV[$10,$9,$12,$5] += $16} } (( $17~/_EQ\>/ ) && ( $11~/4|1[13]/ )) {{ ++SNR[$10,$9,$12,$5]} {SSH[$10,$9,$12,$5] += $15} {SRV[$10,$9,$12,$5] += $16} } END { { print NR, "records processed."} { print " " } { printf("%-11s\t%-13s\t%-20s\t%-19s\t%-7s\t%-7s\t%-14s\t%-14s\t%-18s\t%-18s\n", \ "DATE", "ISIN", "EXCH", "FII", "BNR", "SNR", "BSH", "SSH", "BRV", "SRV") } { for (u = 0; u < d; u++) { for (v = 0; v < i; v++) { for (w = 0; w < e; w++) { for (x = 0; x < f; x++) #check first below for records with zeroes, don't print them { if (BNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] + SNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] > 0) { BR = BNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] SR = SNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] BS = BSH[date_list[u],isin_list[v],exch_list[w],fii_list[x]] BV = BRV[date_list[u],isin_list[v],exch_list[w],fii_list[x]] SS = SSH[date_list[u],isin_list[v],exch_list[w],fii_list[x]] SV = SRV[date_list[u],isin_list[v],exch_list[w],fii_list[x]] { printf("%-11s\t%13s\t%20s\t%19s\t%7d\t%7d\t%14d\t%14d\t%18.2f\t%18.2f\n", \ date_list[u], isin_list[v], exch_list[w], fii_list[x], BR, SR, BS, SS, BV, SV) } } } } } } } }

Expected Result

  6 records processed. DATE ISIN EXCH FII BNR SNR BSH SSH BRV SRV 6/27/2003 INE239A01016 E9035083824200306 F5811773991200306 1 0 5655 0 3342840.15 0.00 6/27/2003 INE047A01013 E9035083824200306 F6292896459200306 1 0 3700 0 1832721.00 0.00 6/26/2003 INE040A01018 E745964372424200306 F6513202607200306 1 0 2600 0 629200.00 0.00 6/28/2003 INE585B01010 E912851176274200306 F4406661052200306 1 0 44600 0 5575000.00 0.00 6/28/2003 INE585B01010 E912851176274200306 F4500260787200306 0 1 0 445600 0.00 55700000.00

In this case, since the number of input records exceeds 6500, I am running out of memory problems. Only about 7 million records.

For a problem with the index of 2 arrays, although in a different data set, where 129 million copies were processed in hours in 11 minutes using the same GNU-AWK on one computer, see loop optimization, passing parameters from an external file, definition of arguments array in awk

Question: is it so that awk is not very smart with memory management, but that some other more modern tools (say, SQL) would perform this task with the same memory resources? Or is it just a characteristic of associative arrays that I found magical, allowing me to avoid many passes through the data, many SORT loops and procedures, but which may work well up to two array indices, and then the exponential cost of memory resources after that?

Afterword: The super-detailed, almost-idiotic tutorial, along with the code provided by Ed Morton in the comments below, makes a significant contribution, especially his GAWK script tst.awk. He taught me (a) to use SUBSEP wisely (b) to do an unnecessary loop, which is crucial in this problem, which has very rare arrays with different AWK constructs. Compared to the performance with my old code (only up to 6500 input lines accepted on one computer, the other couldn’t even go that far), the performance of Ed Morton tst.awk can be seen from the table below:

 **filename start end min in ln out lines 2008_1 12:08:40 AM 12:27:18 AM 0:18 391438 301160 2008_2 12:27:18 AM 12:52:04 AM 0:24 402016 314177 2009_1 12:52:05 AM 1:05:15 AM 0:13 302081 238204 2009_2 1:05:15 AM 1:22:15 AM 0:17 360072 276768 2010_1 "slept" 507496 397533 2010_2 3:10:26 AM 3:10:50 AM 0:00 76200 58228 2010_3 3:10:50 AM 3:11:18 AM 0:00 80988 61725 2010_4 3:11:18 AM 3:11:47 AM 0:00 86923 65885 2010_5 3:11:47 AM 3:12:15 AM 0:00 80670 63059**

Times were obtained simply from using% time% in the lines before and after running tst.awk, all of them are placed in a simple script package, "min" is the time taken for hours (for any rounding of EXCEL by default), "in ln" and "out lines" are the input and output lines, respectively. From processing all the data that we have, from January 2003 to January 2014 we find the theoretical maximum number of output records = # dates * # ISINs * # Exchanges * # FIIs = 2992 * 2955 * 567 * 82268, while the actual amount total output lines is only 5,261,942, which is only 1,275 * 10 ^ (- 8) of the theoretical maximum - very rare indeed . We were already aware that it was rare, but arrays can be so scarce - which is very important for memory management - we could not say until something really ended, for a real data set. Time seems to increase exponentially in input size, but within limits that do not create practical difficulties. Thanks, tone, Ed.

+5

memory awk gawk

Murgie Sep 29 '14 at 10:08

source share

1 answer

Ed morton · Accepted Answer · 2014-09-30T01:31:38+0000

There are no problems with associative arrays at all. In awk (with the exception of gawk for true 2D arrays), an associative array with 4 indices is identical to one of the two indices, since in reality it has only one index, which is a concatenation of each of the pseudo indices separated by SUBSEP.

Given what you say I am unable to process more than 6500 lines at a time. , the problem is much more likely in the way you wrote your code than any main awk problem, so if you need more help, put a little script with an input sample and expected output that demonstrates your problem and tries to decide if We have suggestions for improving memory usage.

Given your published script, I expect the problem is related to these nested loops in the END section. When you do:

 for (i=1; i<=maxI; i++) { for (j=1; j<=maxJ; j++) { if ( arr[i,j] != 0 ) { print arr[i,j] } } }

you CREATE arr [i, j] for every possible combination of i and j that did not exist before the loop, just testing for arr[i,j] != 0 . If you wrote instead:

 for (i=1; i<=maxI; i++) { for (j=1; j<=maxJ; j++) { if ( (i,j) in arr ) { print arr[i,j] } } }

then the loop itself will not create new entries in arr[] .

So change this block:

 if (BNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] + SNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] > 0) { BR = BNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] SR = SNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] BS = BSH[date_list[u],isin_list[v],exch_list[w],fii_list[x]] BV = BRV[date_list[u],isin_list[v],exch_list[w],fii_list[x]] SS = SSH[date_list[u],isin_list[v],exch_list[w],fii_list[x]] SV = SRV[date_list[u],isin_list[v],exch_list[w],fii_list[x]]

which probably unnecessarily turns each of BNR, SNR, BSH, BRV, SSH and SRV into huge, but very sparse arrays, something like this:

 idx = date_list[u] SUBSEP isin_list[v] SUBSEP exch_list[w] SUBSEP fii_list[x] BR = (idx in BNR ? BNR[idx] : 0) SR = (idx in SNR ? SNR[idx] : 0) if ( (BR + SR) > 0 ) { BS = (idx in BSH ? BSH[idx] : 0) BV = (idx in BRV ? BRV[idx] : 0) SS = (idx in SSH ? SSH[idx] : 0) SV = (idx in SRV ? SRV[idx] : 0)

and let us know if that helps. Also check your code for other places where you could do the same.

The reason you had this problem with 4 indexes when you're not with 2 is simply because you have 4 levels of nesting in loops, now creating much larger and more sparse arrays when you have only 2.

Finally, you have some weird syntax in the script, some of which @MarkSetchell pointed out in the comment, and your script is not as efficient as it can be, since you are not using else and therefore testing several conditions that cannot be true, and you repeatedly test the same condition, and this is not reliable, since you do not bind your REs (for example, you test /4|1[13]/ instead of /^(4|1[13])$/ ) so for example, your 4 will only match 14 or 41 , etc., not just 4 ), so change this whole script to:

 $ cat tst.awk BEGIN { FS = "," } # For each array subscript variable -- DATE ($10), firm_ISIN ($9), EXCHANGE ($12), and FII_ID ($5), after checking for type = EQ, set up counts for each value, and number of unique values. $17 ~ /_EQ\>/ { if (!seenDate[$10]++) date_list[++d] = $10 if (!seenIsin[$9]++) isin_list[++i] = $9 if (!seenExch[$12]++) exch_list[++e] = $12 if (!seenFii[$5]++) fii_list[++f] = $5 # For cash-in, buy (B), or cash-out, sell (S) count NR = no of records, SH = no of shares, RV = rupee-value. idx = $10 SUBSEP $9 SUBSEP $12 SUBSEP $5 if ( $11 ~ /^([12359]|1[24])$/ ) { ++BNR[idx]; BSH[idx] += $15; BRV[idx] += $16 } else if ( $11 ~ /^(4|1[13])$/ ) { ++SNR[idx]; SSH[idx] += $15; SRV[idx] += $16 } } END { print NR, "records processed." print " " printf "%-11s\t%-13s\t%-20s\t%-19s\t%-7s\t%-7s\t%-14s\t%-14s\t%-18s\t%-18s\n", "DATE", "ISIN", "EXCH", "FII", "BNR", "SNR", "BSH", "SSH", "BRV", "SRV" for (u = 1; u <= d; u++) { for (v = 1; v <= i; v++) { for (w = 1; w <= e; w++) { for (x = 1; x <= f; x++) { #check first below for records with zeroes, don't print them idx = date_list[u] SUBSEP isin_list[v] SUBSEP exch_list[w] SUBSEP fii_list[x] BR = (idx in BNR ? BNR[idx] : 0) SR = (idx in SNR ? SNR[idx] : 0) if ( (BR + SR) > 0 ) { BS = (idx in BSH ? BSH[idx] : 0) BV = (idx in BRV ? BRV[idx] : 0) SS = (idx in SSH ? SSH[idx] : 0) SV = (idx in SRV ? SRV[idx] : 0) printf "%-11s\t%13s\t%20s\t%19s\t%7d\t%7d\t%14d\t%14d\t%18.2f\t%18.2f\n", date_list[u], isin_list[v], exch_list[w], fii_list[x], BR, SR, BS, SS, BV, SV } } } } } }

I added seen before 4 array names just because for testing conditional arrays to pre-exist, values are usually called seen . In addition, when filling in SNR [] arrays, etc. I first created the idx variable, instead of repeatedly using field numbers each time to make it easier to change in the future, and mainly because string concatenation is relatively slow in awk, and what happens when you use multiple indexes in an array, so the best thing is to simply Concatenate strings explicitly. And I changed your array date_list [], etc., to start from 1 instead of zero, because all the awk arrays, rows and field numbers generated start with 1. You can create an array manually that starts with 0 or -357 or any other number you want, but it will save you legs someday if you always start them at 1.

I expect it could be made even more efficient by restricting nested loops to only the values that may exist for combinations of indices of enclosing loops (for example, not all u + v + w values are possible, so there will be times when you shouldn't bother looping on x). For instance:

 $ cat tst.awk BEGIN { FS = "," } # For each array subscript variable -- DATE ($10), firm_ISIN ($9), EXCHANGE ($12), and FII_ID ($5), after checking for type = EQ, set up counts for each value, and number of unique values. $17 ~ /_EQ\>/ { if (!seenDate[$10]++) date_list[++d] = $10 if (!seenIsin[$9]++) isin_list[++i] = $9 if (!seenExch[$12]++) exch_list[++e] = $12 if (!seenFii[$5]++) fii_list[++f] = $5 # For cash-in, buy (B), or cash-out, sell (S) count NR = no of records, SH = no of shares, RV = rupee-value. idx = $10 SUBSEP $9 SUBSEP $12 SUBSEP $5 if ( $11 ~ /^([12359]|1[24])$/ ) { seen[$10,$9] seen[$10,$9,$12] ++BNR[idx]; BSH[idx] += $15; BRV[idx] += $16 } else if ( $11 ~ /^(4|1[13])$/ ) { seen[$10,$9] seen[$10,$9,$12] ++SNR[idx]; SSH[idx] += $15; SRV[idx] += $16 } } END { printf "d = %d\n", d | "cat>&2" printf "i = %d\n", i | "cat>&2" printf "e = %d\n", e | "cat>&2" printf "f = %d\n", f | "cat>&2" print NR, "records processed." print " " printf "%-11s\t%-13s\t%-20s\t%-19s\t%-7s\t%-7s\t%-14s\t%-14s\t%-18s\t%-18s\n", "DATE", "ISIN", "EXCH", "FII", "BNR", "SNR", "BSH", "SSH", "BRV", "SRV" for (u = 1; u <= d; u++) { date = date_list[u] for (v = 1; v <= i; v++) { isin = isin_list[v] if ( (date,isin) in seen ) { for (w = 1; w <= e; w++) { exch = exch_list[w] if ( (date,isin,exch) in seen ) { for (x = 1; x <= f; x++) { fii = fii_list[x] #check first below for records with zeroes, don't print them idx = date SUBSEP isin SUBSEP exch SUBSEP fii if ( (idx in BNR) || (idx in SNR) ) { if (idx in BNR) { bnr = BNR[idx] bsh = BSH[idx] brv = BRV[idx] } else { bnr = bsh = brv = 0 } if (idx in SNR) { snr = SNR[idx] ssh = SSH[idx] srv = SRV[idx] } else { snr = ssh = srv = 0 } printf "%-11s\t%13s\t%20s\t%19s\t%7d\t%7d\t%14d\t%14d\t%18.2f\t%18.2f\n", date, isin, exch, fii, bnr, snr, bsh, ssh, brv, srv } } } } } } } }

Associative arrays in awk causing memory constraints

More articles: