Our application is required to take client data supplied in XML format (several files) and parse it into our common XML format (a single file with a schema). For this purpose we use the Apache XMLBeans data-binding framework. The steps of this process are briefly described below.
First, we take raw java.io.File objects pointing to the client XML files on disk and load them into a collection. We then iterate over this collection, creating one org.apache.xmlbeans.XmlObject per file. Once all the files have been parsed into XmlObjects, we create 4 collections holding the individual objects from the XML documents that we are interested in (to be clear, these are not hand-made objects, but what I can best describe as proxy objects created by the Apache XMLBeans framework). As a final step, we iterate over these collections to create our own XML document (in memory) and then save it to disk.
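For clarity, here is a stripped-down sketch of what that pipeline looks like. The class name, file paths, XPath expressions and the buildOutputDocument step are made-up placeholders (and it is trimmed to two collections instead of our four); it is only meant to show the shape of the code, not our actual implementation:

```java
import org.apache.xmlbeans.XmlException;
import org.apache.xmlbeans.XmlObject;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ClientXmlPipeline {

    public static void main(String[] args) throws XmlException, IOException {
        // Step 1: raw File handles pointing at the client XML on disk
        // (paths are hypothetical placeholders).
        List<File> clientFiles = new ArrayList<>();
        clientFiles.add(new File("client-data-1.xml"));
        clientFiles.add(new File("client-data-2.xml"));

        // Step 2: parse each client file into one XmlObject.
        List<XmlObject> documents = new ArrayList<>();
        for (File f : clientFiles) {
            documents.add(XmlObject.Factory.parse(f));
        }

        // Step 3: pull the objects of interest out of the documents with
        // XPath queries. The expressions here are invented for illustration;
        // the real ones depend on the client schema (and may need namespace
        // declarations).
        List<XmlObject> customers = new ArrayList<>();
        List<XmlObject> orders = new ArrayList<>();
        for (XmlObject doc : documents) {
            for (XmlObject o : doc.selectPath("//Customer")) {
                customers.add(o);
            }
            for (XmlObject o : doc.selectPath("//Order")) {
                orders.add(o);
            }
        }

        // Step 4: build our own document in memory and write it to disk.
        // buildOutputDocument stands in for the real transformation logic.
        XmlObject output = buildOutputDocument(customers, orders);
        output.save(new File("output.xml"));
    }

    private static XmlObject buildOutputDocument(List<XmlObject> customers,
                                                 List<XmlObject> orders) {
        // Placeholder: the real code cross-references all the collections
        // while constructing the document in our target schema.
        return XmlObject.Factory.newInstance();
    }
}
```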
For most use cases this process works fine and runs comfortably in a JVM given a -Xmx1500m command-line argument. Problems arise, however, when we receive "large" data sets from a client. Large in this case means 123 MB of client XML spread across 7 files. Data sets like this cause our in-code collections to be populated with almost 40,000 of the aforementioned "proxy objects". In these cases the memory usage just goes through the roof. I do not get OutOfMemory exceptions; the program simply grinds to a crawl until a garbage collection frees up a small amount of memory, then the program continues, uses up this new space, and the cycle repeats. These parsing sessions currently take 4-5 hours. We are aiming to bring this down to within an hour.
It is important to note that the calculations required to transform the client XML into our XML need all of the XML data to be cross-referenced. Therefore we cannot implement a sequential parsing model or batch the process into smaller chunks.
What I have tried so far:
Instead of holding all 123 MB of client XML in memory, on each request for data I load the relevant files, find the data, and then release the references to those objects. This does seem to reduce the amount of memory consumed during the process but, as you can imagine, the time lost to the constant disk I/O cancels out the benefit of the reduced memory footprint.
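Roughly, that attempt looked like the sketch below (the class name, method name and XPath are placeholders, not our real code):

```java
import org.apache.xmlbeans.XmlException;
import org.apache.xmlbeans.XmlObject;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class OnDemandLookup {

    // Re-parse the relevant file for every lookup instead of keeping the
    // parsed documents around for the whole run.
    static List<XmlObject> findData(File clientFile, String xpath)
            throws XmlException, IOException {
        XmlObject doc = XmlObject.Factory.parse(clientFile);
        List<XmlObject> results = new ArrayList<>();
        for (XmlObject match : doc.selectPath(xpath)) {
            // copy() detaches each fragment into its own store, so once this
            // method returns nothing references the full parsed document and
            // it can be garbage collected...
            results.add(match.copy());
        }
        // ...but every call pays the full parse and disk I/O cost again.
        return results;
    }
}
```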
I suspected that the problem was that we hold both the XmlObject[] for the 123 MB of XML files and the collections of objects pulled out of those documents (via XPath queries). To remedy this, I altered the logic so that instead of querying these extracted collections, we query the documents directly. The idea is that instead of four massive lists with tens of thousands of objects, there is only the one large collection of XmlObjects. This does not seem to make any difference, and in some cases actually increases the memory footprint further.
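In outline, the change was along these lines (again, the XPath expression and method names are just for illustration):

```java
import org.apache.xmlbeans.XmlObject;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DirectQuerying {

    // Before: materialize one big list per object type up front and keep it
    // alongside the parsed documents.
    static List<XmlObject> extractUpFront(List<XmlObject> documents) {
        List<XmlObject> customers = new ArrayList<>();
        for (XmlObject doc : documents) {
            Collections.addAll(customers, doc.selectPath("//Customer"));
        }
        return customers;
    }

    // After: keep only the parsed documents and run the XPath at the point
    // of use, so the only long-lived collection is 'documents' itself.
    static void processDirectly(List<XmlObject> documents) {
        for (XmlObject doc : documents) {
            for (XmlObject customer : doc.selectPath("//Customer")) {
                // ... transformation logic that previously read from the
                // pre-built 'customers' list would go here ...
            }
        }
    }
}
```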
Clutching at straws now, I thought that the XmlObject we use to build our output XML in memory before writing it to disk might be growing too large to hold alongside all the client data. However, running some sizeOf queries on this object showed that, at its largest, it is less than 10 KB. After reading up on how XmlBeans manages large DOM objects, it seems to use some form of buffered writer and, as a result, handles this object quite well.
So now I am out of ideas. I cannot use a SAX approach instead of the memory-hungry DOM approach, since we need 100% of the client data available in our application at any given time; I cannot defer loading the data until we actually need it, because the conversion process goes over it in many passes and the disk I/O time is not worth the memory saved; and I cannot seem to structure our logic in a way that reduces the space taken up by the internal Java collections. Am I out of luck? Do I just have to accept that if I want to parse 123 MB of XML data into our XML format, I cannot do it within a 1500 MB memory allocation? Although 123 MB is a large data set in our domain, I cannot imagine that others have never had to do something similar with GBs of data at a time.
Other information that may be important:
- I used JProbe to try to see if it could tell me anything useful. While I am a profiling noob, I went through their tutorials on memory leaks and thread locks, understood them, and as far as I can tell there are no leaks or bottlenecks in our code. After starting the application with a large data set, we quickly see a "saw tooth" shape on the memory analysis screen (see the attached image), where the PS Eden Space is dwarfed by a massive green PS Old Gen block. This leads me to believe that the problem here is simply the sheer amount of space taken up by the collections of objects, rather than a leak holding on to unused memory.

- I am working on a 64-bit Windows 7 platform, but this will need to run in a 32-bit environment.