Indexing an array with a string (C)

I have an array of unsigned integers, each of which corresponds to a string of 12 characters drawn from 4 possible characters: "A", "B", "C", "D". The array therefore contains 4^12 = 16777216 elements. The order of the elements in the array is arbitrary; I can choose which element corresponds to each string. So far, I have implemented it simply like this:

    unsigned int my_array[16777216];
    char my_string[12];
    int index = string_to_index(my_string);
    my_array[index] = ...;

string_to_index() simply assigns 2 bits per character as follows: A → 00, B → 01, C → 10, D → 11. For example, ABCDABCDABCD corresponds to the index (000110110001101100011011)_2 = (1776411)_10.
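For reference, the 2-bits-per-character scheme described above could be written roughly as follows. This is only a sketch: the question does not show the body of string_to_index(), so the return type, the STRING_LEN constant, and the char_code() helper are assumptions made for illustration. It also assumes 'A' to 'D' have consecutive character codes, as in ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STRING_LEN 12  /* characters per string, as in the question */

/* Hypothetical helper: map 'A'..'D' to its 2-bit code (A=0, B=1, C=2, D=3). */
static uint32_t char_code(char c)
{
    assert(c >= 'A' && c <= 'D');
    return (uint32_t)(c - 'A');
}

/* Pack the characters into an index, first character in the highest bits. */
uint32_t string_to_index(const char *s)
{
    uint32_t index = 0;
    for (size_t i = 0; i < STRING_LEN; i++) {
        index = (index << 2) | char_code(s[i]);
    }
    return index;
}
```

With this sketch, string_to_index("ABCDABCDABCD") gives 1776411, matching the example above.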

However, I know that each string used to access the array is the previous string shifted one position to the left, with a new last character appended. For example, after accessing ABCDABCDABCD, the next access will use BCDABCDABCDA, BCDABCDABCDB, BCDABCDABCDC or BCDABCDABCDD.
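With the 2-bit encoding, this shift property means the next index can be derived from the current one without re-reading the whole string. A minimal sketch, assuming the same encoding as above (next_index() and INDEX_MASK are hypothetical names, not part of the question's code):

```c
#include <assert.h>
#include <stdint.h>

#define STRING_LEN 12
/* Keeps only the low 2 * STRING_LEN = 24 bits of an index. */
#define INDEX_MASK ((UINT32_C(1) << (2 * STRING_LEN)) - 1)

/* Hypothetical helper: index of the string obtained by dropping the first
 * character of the current string and appending next_char ('A'..'D'). */
uint32_t next_index(uint32_t index, char next_char)
{
    assert(next_char >= 'A' && next_char <= 'D');
    uint32_t code = (uint32_t)(next_char - 'A');
    /* Shift out the oldest character, shift in the new one. */
    return ((index << 2) | code) & INDEX_MASK;
}
```

Under this encoding the four possible successors of any string occupy four consecutive array elements, so whichever one is accessed next, they usually sit in the same cache line (assuming a 4-byte unsigned int and an array aligned to at least 16 bytes).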

So my question is: is there a better way to implement the string_to_index function that takes this last fact into account, so that elements accessed consecutively are closer together in the array? I hope to improve my cache performance by doing this.

edit: Perhaps I was not very clear: I am looking for a completely different string-to-index mapping scheme, one in which the indices of ABCDABCDABCD and BCDABCDABCDA are closer together.

2 answers

If the following assumptions hold for your problem, then your current solution is already the most appropriate one.

  • The rightmost (new) character of the next string is chosen at random, with equal probability for each valid character.
  • The beginning of the sequence is not always the same (it is random).

Reason: When I first read your question, I came up with the following tree (for simplicity I reduced your problem to strings three characters long with only 2 possible characters, A and B). Note that the left child of the root node (AAA in this case) is the root node itself (AAA), so I do not expand that branch further.

        AAA
       /   \
            AAB
           /   \
        ABA     ABB
       /  \    /  \
     BAA  BAB BBA  BBB

In this tree, the children of each node are the strings that can follow it in the access sequence. To improve cache behaviour, you traverse this tree breadth-first and store the elements in the array in that order. For the tree above, this gives the following assignment of indices.

  • AAA 0
  • AAB 1
  • ABA 2
  • ABB 3
  • BAA 4
  • BAB 5
  • BBA 6
  • BBB 7

Assuming value(A) = 0 and value(B) = 1, the index can be calculated as

 index = 2^0 * (value(string[2])) + 2^1 * (value(string[1])) + 2^2 * (value(string[0])) 

This is the same solution as yours. I wrote a Python script to check this for other combinations (e.g. a 4-character string with A, B, C as the possible characters). Script link
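The linked script is not reproduced here, but a check of this kind can be sketched in C as well. The function below is an illustration under stated assumptions, not the answerer's script: it represents each string by its base-k value, walks the successor graph breadth-first from the all-A string, and reports whether the breadth-first position of every string equals its positional (base-k) index. The name bfs_order_matches_positional() is made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if numbering the strings in breadth-first order, starting
 * from the all-'A' string, gives every string the same index as reading it
 * as a base-k number (first character most significant).
 * k = alphabet size, n = string length (n >= 1);
 * k^(n+1) must fit in an unsigned int. */
bool bfs_order_matches_positional(unsigned k, unsigned n)
{
    unsigned total = 1;
    for (unsigned i = 0; i < n; i++) {
        total *= k;                       /* total = k^n strings */
    }

    unsigned *bfs_index = malloc(total * sizeof *bfs_index);
    unsigned *queue = malloc(total * sizeof *queue);
    bool *seen = calloc(total, sizeof *seen);
    assert(bfs_index && queue && seen);

    unsigned head = 0, tail = 0;
    queue[tail++] = 0;                    /* root: the all-'A' string */
    seen[0] = true;

    while (head < tail) {
        unsigned v = queue[head];
        bfs_index[v] = head;              /* position of v in BFS order */
        head++;
        for (unsigned c = 0; c < k; c++) {
            /* Drop the first character, append character c. */
            unsigned child = (v * k) % total + c;
            if (!seen[child]) {
                seen[child] = true;
                queue[tail++] = child;
            }
        }
    }

    bool ok = (tail == total);            /* every string was reached */
    for (unsigned v = 0; ok && v < total; v++) {
        ok = (bfs_index[v] == v);
    }

    free(bfs_index);
    free(queue);
    free(seen);
    return ok;
}
```

For example, bfs_order_matches_positional(2, 3) covers the A/B tree above, and bfs_order_matches_positional(3, 4) covers the 4-character A, B, C case mentioned in the answer.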

So, as long as the two assumptions made at the beginning hold, your solution is already cache-optimized.


I think we should first define "closer."

For example, we could define a function F that takes a method for computing string indices. F would then go through every string and return a single value based on the distance between the index of each string and the indices of the strings that can follow it.

Then we can compare different index-calculation methods and find the best one. Of course, we could start by considering short strings.
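One possible way to make this concrete is sketched below. Everything here is an assumption chosen for illustration: the answer does not fix a particular F, so this version uses the mean absolute index distance between a string and each of its four successors, and the names index_fn, identity_index() and mean_successor_distance() are hypothetical. Candidate mappings take the plain 2-bit-per-character code of a string and return the array index to use for it.

```c
#include <assert.h>
#include <stdint.h>

/* A candidate mapping: takes the plain 2-bit-per-character code of a string
 * of length n and returns the array index to use for that string. */
typedef uint32_t (*index_fn)(uint32_t code, unsigned n);

/* The mapping from the question: the index is the code itself. */
uint32_t identity_index(uint32_t code, unsigned n)
{
    (void)n;  /* unused by this mapping */
    return code;
}

/* One possible F: the mean absolute index distance between each string and
 * each of its four successors. Lower values mean "closer". */
double mean_successor_distance(index_fn f, unsigned n)
{
    assert(n >= 1 && n <= 12);
    uint32_t total = UINT32_C(1) << (2 * n);  /* 4^n strings */
    uint32_t mask = total - 1;
    double sum = 0.0;

    for (uint32_t code = 0; code < total; code++) {
        uint32_t from = f(code, n);
        for (uint32_t c = 0; c < 4; c++) {
            /* Successor: drop the first character, append character c. */
            uint32_t to = f(((code << 2) | c) & mask, n);
            sum += (from > to) ? (double)(from - to) : (double)(to - from);
        }
    }
    return sum / (4.0 * (double)total);
}
```

Two mappings can then be compared by calling mean_successor_distance() on each with a small n first. A distance in elements is only a rough proxy for cache behaviour, so a measure based on cache-line or page boundaries may be a better fit for the real array.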


Source: https://habr.com/ru/post/969753/

