Algorithm for selecting patterns of free values down from the list of sparse definitions

Question

Algorithm for selecting patterns of free values down from the list of sparse definitions

I have the following problem.

I am developing a stochastic simulator that selectively displays system configurations and stores statistics on how many times each configuration has been visited in certain time instances. Roughly the code works like this:

f[_Integer][{_Integer..}] :=0 ... someplace later in the code, eg, index = get index; c = get random configuration (ie a tuple of integers, say a pair {n1, n2}); f[index][c] = f[index][c] + 1; which tags that configuration c has occurred once more in the simulation at time instance index.

Once the code is finished, there is a list of definitions for f that looks something like this (I typed it manually to emphasize the most important parts)

 ?f f[1][{1, 2}] = 112 f[1][{3, 4}] = 114 f[2][{1, 6}] = 216 f[2][{2, 7}] = 227 ... f[index][someconfiguration] = some value ... f[_Integer][{_Integer..}] :=0

Please note that definitions that are free from templates that come first can be rather scarce. Also, you cannot find out which values and configurations will be selected.

The problem is to efficiently retrieve the values for the desired index, for example, output something like

 result = ExtractConfigurationsAndOccurences[f, 2]

which should provide a list with a structure

 result = {list1, list2}

Where

 list1 = {{1, 6}, {2, 7}} (* the list of configurations that occurred during the simulation*) list2 = {216, 227} (* how many times each of them occurred *)

The problem is that ExtractConfigurationsAndOccurences must be very fast. The only solution I could come up with was to use SubValues [f] (which gives a complete list) and filter it using the Cases statement. I understand that this procedure should be avoided at all costs, since there will be exponentially many configurations (definitions) for testing, which significantly slows down the code.

Is there a natural way in Mathematica to do this in a quick way?

I was hoping Mathematica would see f [2] as a single head with many down values, but using DownValues [f [2]] gives nothing. Also, using SubValues [f [2]] leads to an error.

+6

wolfram-mathematica

zorank Aug 31 '11 at 11:54

source share

4 answers

Leonid Shifrin · Answer 1 · 2011-08-31T12:22:30+0000

This is a complete revision of my previous answer. It turns out that in my previous attempts I overlooked a much simpler method based on a combination of packed arrays and sparse arrays, which is much faster and more efficient in terms of memory than all previous methods (at least in the range of sample sizes, where I tested it), although minimally SubValues original SubValues approach. Since the question was asked about the most efficient method, I will remove others from the answer (given that they are quite complex and take up a lot of space. Those who would like to see them can check past versions of this answer).

Original `SubValues`

Let's start by introducing the function to create test configurations for us. There he is:

 Clear[generateConfigurations]; generateConfigurations[maxIndex_Integer, maxConfX_Integer, maxConfY_Integer, nconfs_Integer] := Transpose[{ RandomInteger[{1, maxIndex}, nconfs], Transpose[{ RandomInteger[{1, maxConfX}, nconfs], RandomInteger[{1, maxConfY}, nconfs] }]}];

We can create a small sample to illustrate:

 In[3]:= sample = generateConfigurations[2,2,2,10] Out[3]= {{2,{2,1}},{2,{1,1}},{1,{2,1}},{1,{1,2}},{1,{1,2}}, {1,{2,1}},{2,{1,2}},{2,{2,2}},{1,{2,2}},{1,{2,1}}}

Here we have only 2 indexes and configurations, where both the numbers "x" and "y" vary from 1 to 2 - only 10 such configurations.

The following function will help us mimic the accumulation of frequencies for configurations as we increase the counters based on SubValues for repeating ones:

 Clear[testAccumulate]; testAccumulate[ff_Symbol, data_] := Module[{}, ClearAll[ff]; ff[_][_] = 0; Do[ doSomeStuff; ff[#1][#2]++ & @@ elem; doSomeMoreStaff; , {elem, data}]];

The symbols doSomeStuff and doSomeMoreStaff are shown here to indicate some code that may exclude or follow the count code. The data parameter should be a list of the form created by generateConfigurations . For instance:

 In[6]:= testAccumulate[ff,sample]; SubValues[ff] Out[7]= {HoldPattern[ff[1][{1,2}]]:>2,HoldPattern[ff[1][{2,1}]]:>3, HoldPattern[ff[1][{2,2}]]:>1,HoldPattern[ff[2][{1,1}]]:>1, HoldPattern[ff[2][{1,2}]]:>1,HoldPattern[ff[2][{2,1}]]:>1, HoldPattern[ff[2][{2,2}]]:>1,HoldPattern[ff[_][_]]:>0}

The following function extracts the resulting data (indices, configurations, and their frequencies) from the SubValues list:

 Clear[getResultingData]; getResultingData[f_Symbol] := Transpose[{#[[All, 1, 1, 0, 1]], #[[All, 1, 1, 1]], #[[All, 2]]}] &@ Most@SubValues [f, Sort -> False];

For instance:

 In[10]:= result = getResultingData[ff] Out[10]= {{2,{2,1},1},{2,{1,1},1},{1,{2,1},3},{1,{1,2},2},{2,{1,2},1}, {2,{2,2},1},{1,{2,2},1}}

To end the data processing cycle, there is a direct function to extract data for a fixed index based on Select :

 Clear[getResultsForFixedIndex]; getResultsForFixedIndex[data_, index_] := If[# === {}, {}, Transpose[#]] &[ Select[data, First@ # == index &][[All, {2, 3}]]];

In our test case

 In[13]:= getResultsForFixedIndex[result,1] Out[13]= {{{2,1},{1,2},{2,2}},{3,2,1}}

This seems to be close to what @zorank tried in code.

Faster solution based on arrays and sparse arrays

As @zorank noted, this becomes slow for a larger sample with more indexes and configurations. Now we will create a large sample to illustrate this ( note! This requires about 4-5 GB of RAM, so you may need to reduce the number of configurations if it exceeds the available RAM ):

 In[14]:= largeSample = generateConfigurations[20,500,500,5000000]; testAccumulate[ff,largeSample];//Timing Out[15]= {31.89,Null}

Now we will extract the full data from SubValues of ff :

 In[16]:= (largeres = getResultingData[ff]); // Timing Out[16]= {10.844, Null}

This takes some time, but you only need to do this once. But when we start retrieving data for a fixed index, we see that it is rather slow:

 In[24]:= getResultsForFixedIndex[largeres,10]//Short//Timing Out[24]= {2.687,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291}, {256,38},{352,469}},{<<1>>}}}

The main idea that we will use here to speed it up is to pack separate lists inside largeres , for indices, combinations and frequencies. Although the complete list cannot be packaged, these parts individually can:

 In[18]:= Timing[ subIndicesPacked = Developer`ToPackedArray[largeres[[All,1]]]; subCombsPacked = Developer`ToPackedArray[largeres[[All,2]]]; subFreqsPacked = Developer`ToPackedArray[largeres[[All,3]]]; ] Out[18]= {1.672,Null}

It also takes some time, but it is a one-time operation again.

The following functions will be used to more efficiently retrieve results for a fixed index:

 Clear[extractPositionFromSparseArray]; extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]] Clear[getCombinationsAndFrequenciesForIndex]; getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, packedFreqs_, index_Integer] := With[{positions = extractPositionFromSparseArray[ SparseArray[1 - Unitize[packedIndices - index]]]}, {Extract[packedCombs, positions],Extract[packedFreqs, positions]}];

Now we have:

 In[25]:= getCombinationsAndFrequenciesForIndex[subIndicesPacked,subCombsPacked,subFreqsPacked,10] //Short//Timing Out[25]= {0.094,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291}, {256,38},{352,469}},{<<1>>}}}

We get a 30x acceleration wrt naive Select approach.

Some complexity notes

Please note that the second solution is faster because it uses optimized data structures, but its complexity is the same as that of Select - one based, which is linear along the length of the general list of unique combinations for all indices. Therefore, theoretically discussed solutions based on a nested hash table, etc., can be asymptotically better. The problem is that in practice, we are likely to run into memory limitations long before that. For a sample of 10 million configurations, the above code was still 2-3 times faster than the fastest solution I posted earlier.

EDIT

The following modification:

 Clear[getCombinationsAndFrequenciesForIndex]; getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, packedFreqs_, index_Integer] := With[{positions = extractPositionFromSparseArray[ SparseArray[Unitize[packedIndices - index], Automatic, 1]]}, {Extract[packedCombs, positions], Extract[packedFreqs, positions]}];

makes code twice as fast. Moreover, for more sparse indices (for example, by invoking a sampling function with parameters such as generateConfigurations[2000, 500, 500, 5000000] ), the acceleration with respect to the Select function is about 100 times.

Sjoerd C. de Vries · Answer 2 · 2011-08-31T12:53:42+0000

I would probably use SparseArrays here ( see update below ), but if you insist on using functions and * Values to store and retrieve values, the approach should be to have the first part (f [2], etc. .) Replaced by the symbol you create on the fly, for example:

 Table[Symbol["f" <> IntegerString[i, 10, 3]], {i, 11}] (* ==> {f001, f002, f003, f004, f005, f006, f007, f008, f009, f010, f011} *) Symbol["f" <> IntegerString[56, 10, 3]] (* ==> f056 *) Symbol["f" <> IntegerString[56, 10, 3]][{3, 4}] = 12; Symbol["f" <> IntegerString[56, 10, 3]][{23, 18}] = 12; Symbol["f" <> IntegerString[56, 10, 3]] // Evaluate // DownValues (* ==> {HoldPattern[f056[{3, 4}]] :> 12, HoldPattern[f056[{23, 18}]] :> 12} *) f056 // DownValues (* ==> {HoldPattern[f056[{3, 4}]] :> 12, HoldPattern[f056[{23, 18}]] :> 12} *)

Personally, I prefer Leonid's solution, as it is much wider, but YMMV.

Update

In the OP request for using SparseArrays :
Large SparseArrays take up part of the size of standard nested lists. We can make f large (100,000 instances) a rare array of sparse arrays:

 f = SparseArray[{_} -> 0, 100000]; f // ByteCount (* ==> 672 *) (* initialize f with sparse arrays, takes a few seconds with f this large *) Do[ f[[i]] = SparseArray[{_} -> 0, {100, 110}], {i,100000}] // Timing//First (* ==> 18.923 *) (* this takes about 2.5% of the memory that a normal array would take: *) f // ByteCount (* ==> 108000040 *) ConstantArray[0, {100000, 100, 100}] // ByteCount (* ==> 4000000176 *) (* counting phase *) f[[1]][[1, 2]]++; f[[1]][[1, 2]]++; f[[1]][[42, 64]]++; f[[2]][[100, 11]]++; (* reporting phase *) f[[1]] // ArrayRules f[[2]] // ArrayRules f // ArrayRules (* ==>{{1, 2} -> 2, {42, 64} -> 1, {_, _} -> 0} ==>{{100, 11} -> 1, {_, _} -> 0} ==>{{1, 1, 2} -> 2, {1, 42, 64} -> 1, {2, 100, 11} -> 1, {_, _, _} -> 0} *)

As you can see, ArrayRules makes a good list with deposits and accounts. This can be done for each f [i] separately or for the whole ligament together (last line).

Dr. · Answer 3 · 2011-08-31T14:10:44+0000

In some scenarios (depending on the performance needed to generate the values), the following simple solution using an auxiliary list (f[i,0]) may be useful:

 f[_Integer][{_Integer ..}] := 0; f[_Integer, 0] := Sequence @@ {}; Table[ r = RandomInteger[1000, 2]; f[h = RandomInteger[100000]][r] = RandomInteger[10]; f[h, 0] = Union[f[h, 0], {r}]; , {i, 10^6}]; ExtractConfigurationsAndOccurences[f_, i_] := {f[i, 0], f[i][#] & /@ f[i, 0]}; Timing@ExtractConfigurationsAndOccurences [f, 10] Out[252]= {4.05231*10^-15, {{{172, 244}, {206, 115}, {277, 861}, {299, 862}, {316, 194}, {361, 164}, {362, 830}, {451, 306}, {614, 769}, {882, 159}}, {5, 2, 1, 5, 4, 10, 4, 4, 1, 8}}}

zorank · Answer 4 · 2011-09-02T11:28:35+0000

Many thanks to all for the help provided. I thought a lot about each input, and I believe that in the simulation the optimal solution:

 SetAttributes[linkedList, HoldAllComplete]; temporarySymbols = linkedList[]; SetAttributes[bookmarkSymbol, Listable]; bookmarkSymbol[symbol_]:= With[{old = temporarySymbols}, temporarySymbols= linkedList[old,symbol]]; registerConfiguration[index_]:=registerConfiguration[index]= Module[ { cs = linkedList[], bookmarkConfiguration, accumulator }, (* remember the symbols we generate so we can remove them later *) bookmarkSymbol[{cs,bookmarkConfiguration,accumulator}]; getCs[index] := List @@ Flatten[cs, Infinity, linkedList]; getCsAndFreqs[index] := {getCs[index],accumulator /@ getCs[index]}; accumulator[_]=0; bookmarkConfiguration[c_]:=bookmarkConfiguration[c]= With[{oldCs=cs}, cs = linkedList[oldCs, c]]; Function[c, bookmarkConfiguration[c]; accumulator[c]++; ] ] pattern = Verbatim[RuleDelayed][Verbatim[HoldPattern][HoldPattern[registerConfiguration [_Integer]]],_]; clearSimulationData := Block[{symbols}, DownValues[registerConfiguration]=DeleteCases[DownValues[registerConfiguration],pattern]; symbols = List @@ Flatten[temporarySymbols, Infinity, linkedList]; (*Print["symbols to purge: ", symbols];*) ClearAll /@ symbols; temporarySymbols = linkedList[]; ]

It is based on the Leonid solution from one of the previous posts, added by the belsairus proposal to enable additional indexing for processed configurations. The previous approaches are adapted in such a way that configurations can be naturally registered and retrieved using the same code more or less. This strikes two flies at once, since bookkeeping and search are both closely interconnected.

This approach will work better in a situation where you want to add simulation data gradually (all curves are usually noisy, so you need to add runs gradually to get good graphs). The sparse array method will work better when the data is generated at a time and then analyzed, but I don’t remember to be personally in a situation where I had to do this.

In addition, I was rather naive, thinking that data extraction and generation can be considered separately. In this particular case, it seems that you need to have both perspectives. I deeply apologize for the frank rejection of any previous proposals in this direction (there were several implicit ones).

There are some open / minor issues that I don't know how to handle, for example. when clearing characters, I cannot clear the headers such as the $ 164 battery, I can only clear its associated values. I do not know why. In addition, if With[{oldCs=cs}, cs = linkedList[oldCs, c]]; changed to something like cs = linkedList[cs, c]]; , configurations are not saved. I do not know why the second option does not work. But these minor issues are clearly defined satellite problems that can be resolved in the future. By and large, the problem is apparently resolved thanks to the generous help of all participants.

Many thanks for the help.

Relations Zoran

ps There are several timings, but to understand what is going on, I will add the code that is used for benchmarking. In short, the idea is to generate lists of configurations and simply pass them through them by calling registerConfiguration. This essentially mimics the data generation process. Here is the code used for testing:

 fillSimulationData[sampleArg_] :=MapIndexed[registerConfiguration[#2[[1]]][#1]&, sampleArg,{2}]; sampleForIndex[index_]:= Block[{nsamples,min,max}, min = Max[1,Floor[(9/10)maxSamplesPerIndex]]; max = maxSamplesPerIndex; nsamples = RandomInteger[{min, max}]; RandomInteger[{1,10},{nsamples,ntypes}] ]; generateSample := Table[sampleForIndex[index],{index, 1, nindexes}]; measureGetCsTime :=((First @ Timing[getCs[#]])& /@ Range[1, nindexes]) // Max measureGetCsAndFreqsTime:=((First @ Timing[getCsAndFreqs[#]])& /@ Range[1, nindexes]) // Max reportSampleLength[sampleArg_] := StringForm["Total number of confs = ``, smallest accumulator length ``, largest accumulator length = ``", Sequence@ @ {Total[#],Min[#],Max[#]}& [Length /@ sampleArg]]

The first example is relatively modest:

 clearSimulationData; nindexes=100;maxSamplesPerIndex = 1000; ntypes = 2; largeSample1 = generateSample; reportSampleLength[largeSample1]; Total number of confs = 94891, smallest accumulator length 900, largest accumulator length = 1000; First @ Timing @ fillSimulationData[largeSample1]

gives 1.375 seconds, which I think is fast.

 With[{times = Table[measureGetCsTime, {50}]}, ListPlot[times, Joined -> True, PlotRange -> {0, Max[times]}]]

gives a time of about 0.016 s, and

 With[{times = Table[measureGetCsAndFreqsTime, {50}]}, ListPlot[times, Joined -> True, PlotRange -> {0, Max[times]}]]

gives the same times. Now the real killer

 nindexes = 10; maxSamplesPerIndex = 100000; ntypes = 10; largeSample3 = generateSample; largeSample3 // Short {{{2,2,1,5,1,3,7,9,8,2},92061,{3,8,6,4,9,9,7,8,7,2}},8,{{4,10,1,5,9,8,8,10,8,6},95498,{3,8,8}}}

indicated as

 Total number of confs = 933590, smallest accumulator length 90760, largest accumulator length = 96876

gives a generation time of about 1.969-2.016 seconds, which is incredibly fast. I mean, this looks like a giant list of about a million elements and applying a function to each element.

The retrieval time for configs and {configs, freqs} is approximately 0.015 and 0.03 seconds.

For me, this is a mental speed that I never expected from Mathematica!

Algorithm for selecting patterns of free values ​​down from the list of sparse definitions

Original SubValues

Faster solution based on arrays and sparse arrays

Some complexity notes

More articles:

Algorithm for selecting patterns of free values down from the list of sparse definitions

Original `SubValues`