[Neo4j] Stumped by performance issue in traversal - would take a month to run!

Martin Neumann m.neumann.1980 at gmail.com
Sun Aug 1 10:55:17 CEST 2010

there are some environmental optimizations you can do to speed things up.
Neo4j is stored as a graph on disk, so traversal translate to moving
the cursor on the hard drive if the data was not in RAM. For good
performance you need a fast hd (flash drive would do best).
Deleting lots of nodes can create holes in the db, so read operations have
to move longer physical distance on the had drive then necessary. Only way I
am aware of to get rid of holes reliably is to copy the DB into a fresh
clean Neo4j DB.

cheers Martin

On Fri, Jul 30, 2010 at 8:10 PM, Jeff Klann <jklann at iupui.edu> wrote:

> Hi, so I got 2GB more RAM and noticed that after adding some more memory
> map
> and increasing the heap space, my small query went from 6hrs to 3min. Quite
> reasonable!
> But the larger one that would take a month would still take a month. So
> I've
> been performance testing parts of it:
> The algorithm as in my first post showed *no* performance improvement on
> more RAM.
> But individual parts....
>   - Traversing only (first three lines) was much speedier, but still seems
> slow. 1.5 million traversals (15 out of 7000 items) took 23sec. It shaves
> off a few seconds if I run this twice and time it the second time, or if I
> don't print any node properties as I traverse. (Does Neo4J load ALL the
> properties for a node if one is accessed?) Even with a double run and not
> reading node properties, it still takes 16sec, which would make traversal
> take two hours. I thought Neo4J was suppposed to do ~1m traversals/sec,
> this
> is doing about 100k. Why? (And in fact on the other query it was getting
> about 800,000 traversals/sec.) Is one of Traversers vs. getRelationship
> iterators faster when getting all relationships of a type at depth 1?
>   - Searching for relationships between A & B (but not writing to them)
> takes it from 20s to 91s. Yuck. Maybe edge indexing is the way to avoid
> that?
>   - Incrementing a property on the root node for every A & B takes it from
> 20s to 61s (57s if it's all in one transaction). THAT seems weird. I
> imagine
> it has something to do with logging changes? Any way that can be turned off
> for a particular property (like it could be marked 'volatile' during a
> transaction or something)?
> I'm much more hopeful with the extra RAM but it's still kind of slow.
> Suggestions?
> Thanks,
> Jeff Klann
> On Wed, Jul 28, 2010 at 11:20 AM, Jeff Klann <jklann at iupui.edu> wrote:
> > Hi, I have an algorithm running on my little server that is very very
> slow.
> > It's a recommendation traversal (for all A and B in the catalog of items:
> > for each item A, how many customers also purchased another item in the
> > catalog B). It's processed 90 items in about 8 hours so far! Before I
> dive
> > deeper into trying to figure out the performance problem, I thought I'd
> > email the list to see if more experienced people have ideas.
> >
> > Some characteristics of my datastore: it's size is pretty moderate for a
> > database application. 7500 items, not sure how many customers and
> purchases
> > (how can I find the size of an index?) but probably ~1 million customers.
> > The relationshipstore + nodestore < 500mb. (Propertystore is huge but I
> > don't access it much in traversals.)
> >
> > The possibilities I see are:
> >
> > 1) *Neo4J is just slow.* Probably not slower than Postgres which I was
> > using previously, but maybe I need to switch to a distributed map-reduce
> db
> > in the cloud and give up the very nice graph modeling approach? I didn't
> > think this would be a problem, because my data size is pretty moderate
> and
> > Neo4J is supposed to be fast.
> >
> > 2) *I just need more RAM.* I definitely need more RAM - I have a measly
> > 1GB currently. But would this get my 20day traversal down to a few hours?
> > Doesn't seem like it'd have THAT much impact. I'm running Linux and
> nothing
> > much else besides Neo4j, so I've got 650m physical RAM. Using 300m heap,
> > about 300m memory-map.
> >
> > 3) *There's some secret about Neo4J performance I don't know.* Is there
> > something I'm unaware that Neo4J is doing? When I access a property, does
> it
> > load a chunk of properties I don't care about? For the current node/edge
> or
> > others? I turned off log rotation and I commit after each item A. Are
> there
> > other performance tips I might have missed?
> >
> > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and maybe
> > there's some optimizations I can do. It looks like:
> >
> >> For each item A in the catalog:
> >>   For each customer C that has purchased that item:
> >>    For each item B that customer purchased:
> >>       Update the co-occurrence edge between A&B.
> >>
> >       (If the edge exists, add one to its weight. If it doesn't exist,
> >> create it with weight one.)
> >>
> > This is O(n^2) worst case, but practically it'll be much better due to
> the
> > sparseness of purchases. The large number of customers slows it down,
> > though. The slowest part, I suspect, is the last line. It's a lot of
> finding
> > and re-finding edges between As and Bs and updating the edge properties.
> I
> > don't see much way around it, though. I wrote another version that avoids
> > this but is always O(n^2), and it takes about 15 minutes per A to check
> > against all B (which would also take a month). The version above seems to
> be
> > averaging 3 customers/sec, which doesn't seem that slow until you realize
> > that some of these items were purchased by thousands of customers.
> >
> > I'd hate to give up on Neo4J. I really like the graph database concept.
> But
> > can it handle data? I hope someone sees something I'm doing wrong.
> >
> > Thanks,
> > Jeff Klann
> >
> _______________________________________________
> Neo4j mailing list
> User at lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user

More information about the User mailing list