[Neo4j] Neo4j performance with >400million nodes

Alican Gecyasar alican.gecyasar at openconcept.ch
Tue Nov 1 07:54:58 CET 2011


Hello David,
Thank you for the quick reply! I appreciate it very much.



On 01.11.2011 01:01, David Montag wrote:
> Hi Alican,
>
> On Mon, Oct 31, 2011 at 6:26 AM, algecya <alican.gecyasar at openconcept.ch> wrote:
>
>> Hello everyone,
>>
>> We are relatively new to Neo4j and are evaluating some test scenarios in
>> order to decide whether to use Neo4j in production systems. We used the
>> latest stable release, 1.4.2.
>>
>> I wrote an import script and generated some random data with the given tree
>> structure:
>>
>> http://neo4j-community-discussions.438527.n3.nabble.com/file/n3467806/neo4j_nodes.png
>>
>> Nodes Summary:
>> Nodes with Type A: 1
>> Nodes with Type B: 100
>> Nodes with Type C: 50'000 (100x500)
>> Nodes with Type D: 500'000 (50'000x10)
>> Nodes with Type E: 25'000'000 (500'000x50)
>> Nodes with Type F: 375'000'000 (25'000'000x15)
>>
>> This all worked quite well; the import took approx. 30 hours using the
>> batch import.
>> We have multiple indexes, but we also have one index where all nodes are
>> indexed.
>>
>> My first question would be, does it make sense to index all nodes with the
>> same index?
>>
> It depends on how you intend to access the data. If you always know the
> type, then it would be beneficial to use different indices. Otherwise you
> might want to put it all in a single index. Do remember that the index will
> consume some disk space as well.
OK, we decided to create a "type node" for each type and let the nodes
relate to it (instead of having the type as an attribute on each node).
I guess I was thinking too much in terms of relational database schemas.
We will therefore have one index per type.
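
A minimal sketch of that layout, written as a plain in-memory Python model and not against the Neo4j API. Everything named here (the `Graph` class, the `IS_A` relationship type, the index dictionaries) is hypothetical and only meant to show the shape of the idea: one type node per type, a relationship from each node to its type node, and a separate index per type.

```python
from collections import defaultdict

class Graph:
    """Tiny in-memory stand-in for a graph store; NOT the Neo4j API."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.rels = []                  # (start id, relationship type, end id)
        self.type_nodes = {}            # type name -> id of its type node
        self.index = defaultdict(dict)  # type name -> {name -> node id}
        self._next_id = 0

    def _create(self, props):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = props
        return node_id

    def add(self, type_name, name):
        # Create the type node lazily, once per type.
        if type_name not in self.type_nodes:
            self.type_nodes[type_name] = self._create({"name": type_name})
        node_id = self._create({"name": name})  # no "type" property on the node
        # "IS_A" is a hypothetical relationship type name.
        self.rels.append((node_id, "IS_A", self.type_nodes[type_name]))
        self.index[type_name][name] = node_id   # one index per type
        return node_id

    def members(self, type_name):
        """All nodes that point at the type node for type_name."""
        type_id = self.type_nodes[type_name]
        return [s for s, rel, e in self.rels if rel == "IS_A" and e == type_id]

g = Graph()
e1 = g.add("E", "abc")
e2 = g.add("E", "def")
f1 = g.add("F", "abc")

print(g.members("E"))        # ids of all nodes related to the type node for E
print(g.index["E"]["abc"])   # lookup touches the E index only
```

The same name ("abc") can exist under type E and type F without clashing, because each type has its own index.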



>> If I list all nodes with property "type":"type E", it is quite
>> slow the first time (~270s).
>> The second time it is fast (~1/2s). I know this is normal and most likely
>> fixed in the current milestone version. But I am not sure how long the query
>> will be cached in memory. Are there any configurations I should be concerned
>> about?
>>
> The difference there is all about disk access time. Will "give me all 25
> million E's" be a common operation?
We will need to find nodes of type E with common attributes, which may
return approx. 1 million results. But each search will be for
different values.
E.g., nodes of type E have a "date created" attribute and a "name"
attribute. I will need to find all nodes created at a given date (say,
the year 2011) with a given name ("abc").
The second search will be for date (2011) and name ("def"). If some time
passes and memory is used for other searches, I am afraid my first
search (2011, abc) will be evicted from memory and will take a
long time again the next time I run it.
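
To illustrate the concern, here is a toy least-recently-used (LRU) cache in Python. It is not how Neo4j caches data internally; that is an assumption made purely for illustration, and the class name, capacity, and query keys are hypothetical. It only shows how the result for (2011, "abc") can drop out of memory once enough other searches have run.

```python
from collections import OrderedDict

class LruResultCache:
    """Toy LRU cache keyed by query parameters; NOT Neo4j's actual cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None  # miss: the caller would have to go back to disk
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used

cache = LruResultCache(capacity=2)  # hypothetical capacity
cache.put((2011, "abc"), "result for abc")
cache.put((2011, "def"), "result for def")
cache.put((2011, "ghi"), "result for ghi")  # evicts (2011, "abc")

print(cache.get((2011, "abc")))  # miss, would be slow again
print(cache.get((2011, "def")))  # still cached
```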



>> We also took the hardware sizing calculator. See the result here:
>>
>> http://neo4j-community-discussions.438527.n3.nabble.com/file/n3467806/neo4j_hardware.png
>>
>> Are these result values realistic? I guess 128GB RAM and 12TB of SSD drives
>> might be a bit cost-intensive.
>>
> The reason that the disk usage is 12TB is because you specified that each
> node on average has 10kB of data, and each relationship on average has 1kB
> of data. What kind of data are you storing on the nodes and relationships?
> These are pretty rough estimates not taking into account the number of
> properties nor the type of them. Also, if you decrease the property data by
> a factor of 100 (100B/node, 10B/rel), then your database will only consume
> ~150-200GB.
OK, I see your point. I think I am getting the hang of graph-based
databases now, i.e., I might not want to put all my data into attributes
but create nodes instead...
My rough guess was to increase the number of nodes to 1'000'000'000
and decrease the bytes consumed to 100B/node and 10B/rel. The result is
approx. 400GB (no problem at all).
But I am still a bit concerned about the 128GB RAM.
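
The rough arithmetic can be sketched in Python as follows. This is a back-of-the-envelope sketch only: it derives the node counts of the test tree from the fan-out numbers in the node summary and multiplies them by the assumed average property size. It covers raw property payload and nothing else, so it is not expected to reproduce the sizing calculator's figures, which include store and index overhead that is not modelled here. The function names, the use of decimal units (1kB = 1000B), and the assumption of exactly one relationship per non-root node (a pure tree) are mine.

```python
# Fan-out per level, taken from the node summary (A -> B -> C -> D -> E -> F).
FAN_OUT = [100, 500, 10, 50, 15]

def level_counts(fan_out):
    """Return the node count per tree level, starting with the single root."""
    counts = [1]
    for factor in fan_out:
        counts.append(counts[-1] * factor)
    return counts

def raw_property_bytes(nodes, rels, bytes_per_node, bytes_per_rel):
    """Raw property payload only; store and index overhead is NOT modelled."""
    return nodes * bytes_per_node + rels * bytes_per_rel

counts = level_counts(FAN_OUT)
total_nodes = sum(counts)
total_rels = total_nodes - 1  # assumption: pure tree, one parent link per node

for label, n in zip("ABCDEF", counts):
    print(f"Type {label}: {n:>11,}")
print(f"Total nodes: {total_nodes:,}")

# The two property-size scenarios discussed: 10kB/1kB and 100B/10B.
for per_node, per_rel in [(10_000, 1_000), (100, 10)]:
    size = raw_property_bytes(total_nodes, total_rels, per_node, per_rel)
    print(f"{per_node} B/node, {per_rel} B/rel: {size / 1e9:.1f} GB raw payload")
```

Raw payload scales linearly with the per-node and per-relationship sizes, so cutting both by a factor of 100 cuts this part of the estimate by the same factor.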


>> Are there any reference applications with this number of nodes and
>> relationships?
>>
> We are in the process of adding case studies. Please get in touch with
> sales for more info at this time.
Thank you, will do so.


>> Also, Neoclipse won't start/connect to the database anymore with this
>> amount of data.
>> Am I missing some configuration for Neoclipse?
>>
> Are you getting an error message?
No error messages. Is there an option to enable logging?
I let Neoclipse run for almost an hour and suddenly the graph appeared.
But I cannot navigate (it seems frozen, but there are calculations going
on).
I am not sure why it takes so long, though: the initial traversal depth is
1, and there are 16 nodes and 15 relationships. I also decreased the number
of nodes to be displayed to 50.
I thought it would load data lazily?


Best regards
alican




>
> Best,
> David
>
>
>> Best regards
>> --
>> alican
>>
>>