I created 1 million Neo4j nodes in batches of 10,000, each batch in its own transaction. Strangely, parallelizing this process across multiple threads did not improve performance. It is as if transactions in different threads were blocking each other.
Here is a Scala snippet that demonstrates this using parallel collections:
    import org.neo4j.kernel.EmbeddedGraphDatabase

    object Main extends App {
      val total = 1000000
      val batchSize = 10000

      val db = new EmbeddedGraphDatabase("neo4yay")
      // Shut the database down cleanly when the JVM exits.
      Runtime.getRuntime().addShutdownHook(
        new Thread() { override def run() = db.shutdown() }
      )

      // Remove .par to run the batches serially.
      (1 to total).grouped(batchSize).toSeq.par.foreach(batch => {
        println("thread %s, nodes from %d to %d"
          .format(Thread.currentThread().getId, batch.head, batch.last))
        // One transaction per batch.
        val transaction = db.beginTx()
        try {
          batch.foreach(db.createNode().setProperty("Number", _))
          // Note: transaction.success() is not called here; in Neo4j 1.x
          // finish() then rolls the batch back instead of committing it.
        } finally {
          transaction.finish()
        }
      })
    }
and here are the build.sbt lines needed to build and run it:
    scalaVersion := "2.9.2"

    libraryDependencies += "org.neo4j" % "neo4j-kernel" % "1.8.M07"

    fork in run := true
You can switch between parallel and serial modes by adding or removing the .par call before the outer foreach. The console output clearly shows that with .par the execution is indeed multithreaded.
To rule out concurrency problems in this code, I also tried an actor-based implementation, with approximately the same result (about 6 seconds for the serial version and 7 seconds for the parallel one).
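For reference, timings like these could be taken with a small helper along the following lines. This is only a minimal sketch, not the code actually used for the numbers above: the names `Timing`, `timed`, `batches` and `TimingDemo` are hypothetical, and the empty loop body is a stand-in for the per-batch Neo4j transaction so that the example runs without a database. It uses only `System.nanoTime` from the JDK.

```scala
object Timing {
  // Runs `body` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  // Splits 1..total into consecutive batches, using the same
  // grouping as the snippet in the question.
  def batches(total: Int, batchSize: Int): Seq[Seq[Int]] =
    (1 to total).grouped(batchSize).toSeq
}

object TimingDemo extends App {
  import Timing._

  val work = batches(1000000, 10000)

  // Stand-in workload (assumption): replace the inner loop with the
  // per-batch transaction, and add .par after `work` for the parallel run.
  val (_, serialMs) = timed {
    work.foreach(batch => batch.foreach(_ => ()))
  }
  println("serial: %d ms".format(serialMs))
}
```

Because the first run on a JVM includes class loading and JIT warm-up, it is probably worth repeating each measurement a few times before comparing the serial and parallel numbers.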
So the question is: did I do something wrong or is this a limitation of Neo4j? Thanks!