First, consider the memory taken by integers. You said that the range will be around 0-4000000. 24 bits is enough to represent 16777216 different values. If this is acceptable, you can use byte arrays for integers, with 3 bytes per integer and save 25%. You will need to index into an array something like this:
int getPackedInt(byte[] array, int index) { int i = index*3; return ((array[i] & 0xFF)<<16) + ((array[i+1] & 0xFF) <<8) + (array[i+2] & 0xFF); } int storePackedInt(byte[] array, int index, int value) { assert value >= 0 && value <= 0xFFFFFF; int i = index*3; array[i] = (byte)((value>>16) & 0xFF); array[i+1] = (byte)((value>>8) & 0xFF); array[i+2] = (byte)(value & 0xFF); }
Can you say something about the distribution of integers? If many of them will fit in 16 bits, you can use encoding with a variable number of bytes per number (something like UTF-8 to represent characters).
Next, consider whether it is possible to save memory in Strings. What are the characteristics of the strings? How long will they usually be? Will many lines split prefixes? A compression scheme adapted to the characteristics of your application can save a lot of space (as falsarella pointed out). OR, if many strings will use prefixes, storing them in some type of search may be more efficient. (There is a trie type called "patricia" that might be suitable for this application.) As a bonus, note that searching for strings in trie may be faster than searching for a hash map (although you will need to check to see if this is true in your application).
Will all the lines be ASCII? If so, 50% of the memory used for strings will be wasted, since Java char
is 16 bits. Again, in this case, you might consider using byte arrays.
If you only need to look at the lines and not iterate over the stored lines, you can also consider something quite unconventional: hash the lines and save only the hash. Since different lines can hash to the same value, it is likely that a line that has never been saved can still be found by searching. But if you use enough bits for a hash value (and a good hash function), you can make this probability so infinitely small that it will almost certainly never happen in the life expectancy of the universe.
Finally, there is memory for the structure itself, which contains strings and integers. I already suggested using trie, but if you decide not to do this, nothing will use less memory than parallel arrays - one sorted array of strings (to which you, as you said, can perform a binary search) and a parallel array of integer arrays. After you do a binary search to find the index in the String array, you can use the same index to access the array-of-integer array.
While you are building the structure, if you decide that trie search is a good choice, I would just use it directly. Otherwise, you could do 2 passes: one to create a set of strings (then put them in an array and sort them), and the second to add arrays of integers.