Which character table can I use to store ~ ​​50 mil strings with quick lookups without running out of heap space?

I have a file of ~ 50 million lines that I need to add to some character table at startup, and then search several times at a reasonable speed.

I tried to use DLB trie, as the search will be relatively fast, since all lines are <10 characters, but when filling DLB, I get either exceeding the upper limit of the GC limit or outofmemory - a heap space error. The same errors were detected using HashMap. This is for an assignment that will be compiled and launched by the grader, so I would prefer not just to allocate more heaps. Is there any other data structure that will have less memory usage, but have a reasonable search time?

+4
source share
2 answers

If you expect low prefix sharing, then trie might not be your best bet.

, , - " " , .

. , , , ArrayList. .

, 50 10 , :

10 character string:
    String: 12 byte header + 4 byte 'hash' + 4 byte 'value' ref = 24 bytes (aligned)
    char[]: 12 byte header + 4 byte 'length' + 10 * 2 byte 'char' = 40 bytes (aligned)
    Total: 24 + 40 = 64 bytes
Array of 50 million 10 character strings:
    String[]: 12 byte header + 4 byte 'length' + 50,000,000 * 4 byte 'String' ref = 200,000,016 bytes
    Values: 50,000,000 * 64 bytes = 3,200,000,000 bytes
    Total: 200,000,016 + 3,200,000,000 = 3,400,000,016 bytes = 3.2 GB

String[], ArrayList<String> String[]. Arrays.sort() 50% (~ 100 000 000 ) , ArrayList GC , .

, - ~ 3,5 , .

, , . , String 24 64 . char[].

, US-ASCII ISO-8859-1, char[] byte[], .

, 64 32 , 3,2 1,8 2 .


UPDATE

, , , . MCVE, , .

public class Test {
    public static void main(String[] args) {
        String[] wordsFromFile = { "appear", "attack", "cellar", "copper",
                                   "erratic", "grotesque", "guitar", "guttural",
                                   "kittens", "mean", "suit", "trick" };
        List<byte[]> wordList = new ArrayList<>();
        for (String word : wordsFromFile) // Simulating read from file
            wordList.add(word.getBytes(StandardCharsets.US_ASCII));
        byte[][] symbolTable = wordList.toArray(new byte[wordList.size()][]);

        test(symbolTable, "abc");
        test(symbolTable, "attack");
        test(symbolTable, "car");
        test(symbolTable, "kittens");
        test(symbolTable, "xyz");
    }
    private static void test(byte[][] symbolTable, String word) {
        int idx = Arrays.binarySearch(symbolTable,
                                      word.getBytes(StandardCharsets.US_ASCII),
                                      Test::compare);
        if (idx < 0)
            System.out.println("Not found: " + word);
        else
            System.out.println("Found    : " + word);
    }
    private static int compare(byte[] w1, byte[] w2) {
        for (int i = 0, cmp; i < w1.length && i < w2.length; i++)
            if ((cmp = Byte.compare(w1[i], w2[i])) != 0)
                return cmp;
        return Integer.compare(w1.length, w2.length);
    }
}

Not found: abc
Found    : attack
Not found: car
Found    : kittens
Not found: xyz
+3

char () . n - offset[n - 1] () offset[n] (). offset[-1] .

1GB (50M * 10 * 2) char 200MB (50M * 4) . .

, , . , .

trie, ​​ https://github.com/rklaehn/radixtree. , , , , . . . scala, java.

+2

Source: https://habr.com/ru/post/1655869/


All Articles