Regardless of any other problems or errors, an ArrayList can be very wasteful for this kind of storage: when its backing array runs out of space, it allocates a larger one (about 1.5 times the old capacity in the OpenJDK implementation) and copies the elements across. A large part of the backing array can therefore sit unused, and during a resize the old and new arrays are both live at the same time. If you can pre-size the storage array or ArrayList to the correct size, you can get significant savings.
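As a rough sketch of the pre-sizing advice above: the class and method names here are hypothetical, and the example assumes you either know the element count up front or can afford one final copy via trimToSize().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class PresizedListDemo {
    // Reserves capacity up front, so add() never has to reallocate
    // and copy the backing array while the list is being filled.
    static List<String> loadWords(String[] words) {
        ArrayList<String> list = new ArrayList<>(words.length);
        for (String w : words) {
            list.add(w);
        }
        return list;
    }

    // When the final size is not known in advance, trimToSize() shrinks
    // the backing array to the current size once loading has finished.
    static ArrayList<String> loadThenTrim(Iterable<String> words) {
        ArrayList<String> list = new ArrayList<>();
        for (String w : words) {
            list.add(w);
        }
        list.trimToSize();
        return list;
    }
}
```

Note that trimToSize() itself briefly needs both the old and the new array, so on a very tight heap pre-sizing is the safer of the two.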
Also (putting on my paranoid data-cleaning hat), make sure that your input files do not contain extra whitespace. You can call String.trim() on each word if necessary, or clean up the input files first. But I do not think this can be a serious problem, given the file sizes you mention.
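A minimal sketch of that clean-up step, assuming the words have already been read into a list of raw lines (the class and method names are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class TrimDemo {
    // Trims each raw line and skips lines that are empty after trimming.
    static List<String> cleanWords(List<String> rawLines) {
        List<String> words = new ArrayList<>(rawLines.size());
        for (String line : rawLines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```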
I would expect your input to take less than 2 MB to store the text itself (remember that Java uses UTF-16 internally, so it usually takes 2 bytes per character), plus maybe 1.5 MB of overhead for the references to the String objects, another 1.5 MB for the string lengths, and possibly the same again for each of the offset and hash code fields (see String.java). While a 24 MB heap still sounds a little excessive, it is not far off if you are hit by an unlucky ArrayList resize.
In fact, rather than speculate, how about a test? The following code, run with -Xmx24M, gets to about 560,000 six-character strings before stalling (on a 64-bit Java SE 7 JVM). It eventually creeps up to around 580,000 (with a lot of GC thrashing, I imagine).
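    import java.util.ArrayList;

    public class HeapTest {
        public static void main(String[] args) {
            ArrayList<String> list = new ArrayList<String>();
            int x = 0;
            while (true) {
                list.add(new String("123456")); // a distinct String object each time
                if (++x % 1000 == 0) System.out.println(x); // progress every 1000 adds
            }
        }
    }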
So I don't think there is a bug in your code. Storing a large number of small strings is simply not very efficient in Java: in the test above it takes roughly 7 bytes per character once all the overhead is counted (this can vary between 32-bit and 64-bit machines, by the way, and depends on the JVM settings too).
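The per-character figure follows from simple arithmetic on the test numbers. The sketch below uses a hypothetical helper and makes the simplifying assumption that the entire 24 MB heap is spent on the strings and the list, so it slightly overstates the true cost:

```java
// Hypothetical helper class, for illustration only.
class OverheadEstimate {
    // Heap bytes per stored character, treating the whole heap as string storage.
    static double bytesPerChar(long heapBytes, long stringCount, int charsPerString) {
        return (double) heapBytes / (stringCount * charsPerString);
    }

    public static void main(String[] args) {
        long heap = 24L * 1024 * 1024; // -Xmx24M
        System.out.println(bytesPerChar(heap, 580_000, 6));
    }
}
```

With 580,000 strings of 6 characters each, this comes to a little over 43 bytes per string, or about 7.2 bytes per character.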
You might get slightly better results by storing an array of byte arrays rather than an ArrayList of Strings. There are also more memory-efficient data structures for storing strings, such as a trie.
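One way the array-of-byte-arrays idea could look is sketched below. The class name is hypothetical, and UTF-8 is an assumption on my part: it takes one byte per character for ASCII text, but any single-byte encoding that covers your data would work equally well.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical container class, for illustration only.
class ByteWordStore {
    private final byte[][] words;

    // Encodes each word once; no String object is kept per entry.
    ByteWordStore(String[] source) {
        words = new byte[source.length][];
        for (int i = 0; i < source.length; i++) {
            words[i] = source[i].getBytes(StandardCharsets.UTF_8);
        }
    }

    // Decodes on demand, trading a little CPU time for a smaller heap footprint.
    String get(int index) {
        return new String(words[index], StandardCharsets.UTF_8);
    }

    int size() {
        return words.length;
    }
}
```

Each entry still pays for one array header, so the saving comes from dropping the String object and halving the character data, not from eliminating per-entry overhead.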