[Neo4j] Batch Inserter - db scaling issue (not index scaling issue)
Mark @ Gmail
markharwood at gmail.com
Fri Feb 18 08:07:41 CET 2011
Hi Johan and others
>>I am having a hard time to follow what the problems really are since conversation is split up in several thread
My fault, sorry. I was replying to a message posted before I subscribed to the list, so I didn't have the original poster's email.
>>as I understand it you are saying that it is the index lookups that are taking to long time?
In your current implementation, yes. In the indexing implementation I provide in that Google Code project there is no performance issue.
However, fixing the Lucene indexing issue only reveals that the *database* is now the bottleneck: it blows up after 30 million edge inserts. That is now the issue here.
See the test results here: http://code.google.com/p/graphdb-load-tester/wiki/TestResults
>>For example inserting 500M relationships
>>requiring 1B index lookups (one for each node) with an avg index
>>lookup time of 1ms is 11 days worth of index lookup time.
That is why, when Peter asked for help with indexing, I suggested that a Bloom filter helps "know what you don't know" and an LRU cache helps hang onto popular nodes. Both are in my implementation and both avoid reads.
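For anyone following along, here is a minimal sketch of how the two pieces can sit in front of a slow index. It is not the code from the Google Code project: the class name CachedNodeLookup, its methods, the two hash functions and the in-memory map standing in for the Lucene lookup are all hypothetical and only illustrate the idea.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a Bloom filter and an LRU cache in front of a slow key-to-node-id index. */
class CachedNodeLookup {
    private final BitSet bits;
    private final int numBits;
    private final Map<String, Long> lru;
    // Assumption: this map stands in for the real (slow) index lookup, e.g. Lucene.
    private final Map<String, Long> slowIndex = new HashMap<>();
    int slowReads = 0; // number of lookups that actually reached the slow index

    CachedNodeLookup(int numBits, final int cacheSize) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.lru = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > cacheSize;
            }
        };
    }

    // Two cheap bit positions derived from the key's hash code (illustrative only).
    private int h1(String key) {
        return Math.floorMod(key.hashCode(), numBits);
    }

    private int h2(String key) {
        int h = key.hashCode();
        return Math.floorMod(h * 0x9E3779B9 + (h >>> 16), numBits);
    }

    /** Registers a newly created node under its key. */
    void put(String key, long nodeId) {
        bits.set(h1(key));
        bits.set(h2(key));
        slowIndex.put(key, nodeId);
        lru.put(key, nodeId);
    }

    /** Returns the node id for the key, or null if the key is not known. */
    Long get(String key) {
        // Bloom filter: a clear bit proves the key was never added, so no read is needed.
        if (!bits.get(h1(key)) || !bits.get(h2(key))) {
            return null;
        }
        // LRU cache: popular nodes are answered from memory.
        Long cached = lru.get(key);
        if (cached != null) {
            return cached;
        }
        // Only now pay for the slow index read (this may still miss on a false positive).
        slowReads++;
        Long id = slowIndex.get(key);
        if (id != null) {
            lru.put(key, id);
        }
        return id;
    }
}
```

The Bloom filter answers "definitely not seen yet" for new keys and the cache answers repeat lookups for popular nodes, so only the remainder reaches the index. The sizes of the bit set and the cache would need tuning for a real load.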
Re your suggestion about avoiding indexes by inserting in batches: I can't see how that will help. You can sort the input data by from-node key or by to-node key, but the node pairs joined by an edge will not necessarily end up in the same batch, so you will still need an index service to add the edges. But as I say, this is fixed in my implementation and indexing is not the remaining issue; the database is.
I do encourage you to try running it.