OrientDB 2.0.0 bulk loading using the Java API becomes CPU bound

I am using OrientDB 2.0.0 to test how it handles bulk data loading. For sample data, I am using the GDELT dataset from Google's GDELT Project (free download). I am loading a total of about 80M vertices, each with 8 properties, into class V of an empty graph database using the Java API.

The data is all in a single tab-delimited text file (US-ASCII), so I simply read the file from top to bottom. I declared the OIntentMassiveInsert() intent on the database and set the transaction size to 25,000 records per commit.

I am using an 8-core machine with 32 GB of RAM and an SSD, so hardware should not be a factor. I am running Windows 7 Pro with Java 8u31.

The first 20M (or so) records went in quite quickly, at under 2 seconds per batch of 25,000. I was very encouraged.

However, as the process continued, the insertion speed slowed significantly. The slowdown seems pretty linear. Here are some sample lines from my output log:

    Committed 25000 GDELT Event records to OrientDB in 4.09989189 seconds at a rate of 6097 records per second. Total = 31350000
    Committed 25000 GDELT Event records to OrientDB in 9.42005182 seconds at a rate of 2653 records per second. Total = 40000000
    Committed 25000 GDELT Event records to OrientDB in 15.883908716 seconds at a rate of 1573 records per second. Total = 45000000
    Committed 25000 GDELT Event records to OrientDB in 45.814514946 seconds at a rate of 545 records per second. Total = 50000000
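The rate in each log line is the batch size divided by the elapsed seconds, truncated to a whole number. As a quick check, the sketch below recomputes it from a log line. `BatchRate` and its regular expression are illustrative names, not part of the loader.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the original loader): recomputes the
// "records per second" figure from a log line's batch size and duration.
class BatchRate {
    private static final Pattern LINE =
        Pattern.compile("Committed (\\d+) .* in ([0-9.]+) seconds");

    // Returns the truncated records-per-second rate for one log line,
    // or -1 if the line does not have the expected shape.
    static long recordsPerSecond(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return -1;
        }
        long records = Long.parseLong(m.group(1));
        double seconds = Double.parseDouble(m.group(2));
        return (long) (records / seconds);
    }

    public static void main(String[] args) {
        String first = "Committed 25000 GDELT Event records to OrientDB in 4.09989189 seconds";
        String last = "Committed 25000 GDELT Event records to OrientDB in 45.814514946 seconds";
        System.out.println(recordsPerSecond(first)); // 6097
        System.out.println(recordsPerSecond(last));  // 545
    }
}
```

Between the first and last sampled lines, the time per 25,000-record batch grows from about 4.1 to about 45.8 seconds, roughly an 11-fold increase.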

As the operation progressed, memory usage stayed fairly constant, but OrientDB's CPU usage climbed steadily, in step with the batch durations. In the beginning, the OrientDB Java process used about 5% of the CPU. It is now at about 90%, with the load spread evenly across all 8 cores.

Should I split the load operation into several consecutive connections? Or is the slowdown really a function of how the vertex data is managed internally, so that it would make no difference if I stopped the process and resumed inserting from where I left off?

Thanks.

[Update] The process eventually died with the error: java.lang.OutOfMemoryError: GC overhead limit exceeded

All commits up to that point were processed successfully, and I ended up with a little over 51M records. I will look at restructuring the loader to break the one giant file into a number of smaller files (say, 1M records each) and treat each file as a separate load.
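A minimal sketch of that splitting step, using only the Java standard library. `FileSplitter`, the `chunk-00000.tsv` naming, and the output directory are assumptions made for illustration. It also assumes that every line is a data row, so a header row, if the file has one, would need separate handling.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one large line-oriented file into numbered
// chunk files holding at most linesPerChunk lines each.
class FileSplitter {
    static List<Path> split(Path input, Path outDir, int linesPerChunk) {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        List<Path> chunks = new ArrayList<>();
        // US-ASCII matches the encoding stated for the source file;
        // bytes outside that range would make the reader fail.
        try (BufferedReader reader =
                 Files.newBufferedReader(input, StandardCharsets.US_ASCII)) {
            Files.createDirectories(outDir);
            BufferedWriter writer = null;
            try {
                String line;
                long count = 0;
                while ((line = reader.readLine()) != null) {
                    // Start a new chunk file every linesPerChunk lines.
                    if (count % linesPerChunk == 0) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path chunk = outDir.resolve(
                            String.format("chunk-%05d.tsv", chunks.size()));
                        chunks.add(chunk);
                        writer = Files.newBufferedWriter(chunk, StandardCharsets.US_ASCII);
                    }
                    writer.write(line);
                    writer.newLine();
                    count++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }
}
```

Calling `FileSplitter.split(input, outDir, 1_000_000)` would produce the 1M-record files described above, with the last file holding whatever remains.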

Once that is done, I will try to take the flat list of vertices and add some edges. Any suggestions on how to do this in a bulk-insert context, where the vertex identifiers have not yet been assigned? Thanks.
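One generic way to approach the edge question, offered only as a sketch and not as something the OrientDB documentation or this post confirms, is a two-pass load: record each vertex's natural key next to the identifier reported for it at insert time, then resolve edge endpoints through that lookup. `EdgeResolver` and all of its method names are hypothetical, and the sketch assumes the lookup table fits in memory and that the loader can read back an identifier for each saved vertex.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-pass helper; none of these names come from OrientDB.
// ID stands for whatever identifier type the database hands back.
class EdgeResolver<ID> {
    private final Map<String, ID> idByKey = new HashMap<>();

    // Pass 1: call once per inserted vertex with its natural key from the
    // source file and the identifier reported for it after the insert.
    void remember(String naturalKey, ID assignedId) {
        idByKey.put(naturalKey, assignedId);
    }

    // Pass 2: turn (fromKey, toKey) pairs into (fromId, toId) pairs.
    // Pairs that mention an unknown key are skipped rather than guessed.
    List<List<ID>> resolve(List<String[]> keyPairs) {
        List<List<ID>> edges = new ArrayList<>();
        for (String[] pair : keyPairs) {
            ID from = idByKey.get(pair[0]);
            ID to = idByKey.get(pair[1]);
            if (from != null && to != null) {
                List<ID> edge = new ArrayList<>(2);
                edge.add(from);
                edge.add(to);
                edges.add(edge);
            }
        }
        return edges;
    }
}
```

The resolved pairs would then be handed to whatever edge-creation call the loader uses. With tens of millions of vertices, the in-memory map is the main cost of this approach.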

[Update 2] I am using the Graph API. Here is the code:

    // Open the OrientDB database instance
    OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
    factory.declareIntent(new OIntentMassiveInsert());
    OrientGraph txGraph = factory.getTx();

    // Iterate row by row over the file.
    while ((line = reader.readLine()) != null) {
        fields = line.split("\t");
        try {
            Vertex v = txGraph.addVertex(null); // 1st OPERATION: IMPLICITLY BEGIN A TRANSACTION
            for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
                v.setProperty(headerFieldsReduced[i], fields[i]);
            }

            // Commit every so often to balance performance and transaction size
            if (++counter % commitPoint == 0) {
                txGraph.commit();
            }
        } catch( Exception e ) {
            txGraph.rollback();
        }
    }

[Update 3 - 2015-02-08] The problem is resolved!

Had I read the documentation more carefully, I would have seen that using transactions for a bulk load is the wrong strategy. I switched to the "NoTx" graph and set each vertex's properties in bulk, and it worked like a champ, without slowing down over time or pegging the CPU.

I started with 52M vertices in the database and added 19M more in 22 minutes, at a rate of just over 14,000 vertices per second, with each vertex having 16 properties.

    Map<String,Object> props = new HashMap<String,Object>();

    // Open the OrientDB database instance
    OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
    factory.declareIntent(new OIntentMassiveInsert());
    graph = factory.getNoTx();

    OrientVertex v = graph.addVertex(null);
    for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
        props.put(headerFieldsReduced[i], fields[i]);
    }
    v.setProperties(props);
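The fragment above does not show where `props` is reset between rows. If one map were reused without clearing, a row with fewer fields could inherit values from the previous row. A minimal sketch of building a fresh map per row follows; `RowProps` and `toProps` are hypothetical names. It also passes `-1` to `split`, because `String.split("\t")` on its own drops trailing empty columns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: converts one tab-delimited line into a property map.
class RowProps {
    // Returns a new map for every row, so nothing carries over between rows.
    // Columns beyond the header length, and header names beyond the row
    // length, are ignored, mirroring the loop guard in the original code.
    static Map<String, Object> toProps(String[] header, String line) {
        // The -1 limit keeps trailing empty columns instead of dropping them.
        String[] fields = line.split("\t", -1);
        Map<String, Object> props = new LinkedHashMap<>();
        for (int i = 0; i < header.length && i < fields.length; i++) {
            props.put(header[i], fields[i]);
        }
        return props;
    }
}
```

The returned map can be passed to `v.setProperties(props)` in the same way as in the fragment.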

Source: https://habr.com/ru/post/982102/

