[Neo4j] Stumped by performance issue in traversal - would take a month to run!

Martin Neumann m.neumann.1980 at gmail.com
Fri Aug 6 01:13:20 CEST 2010


> >
> > - Martin, I'm confused a bit about SSDs. I read up on them after I read
> > your
> > post. You said flash drives are best, but I read that even the highest
> > performing flash drives are about 30MB/s read, whereas modern hard
drives
> > are at least 50MB/s. True SSDs claim to be 50MB/s too but they're quite
> > expensive. So why is a flash drive best? I could definitely spring for
> one
> > big enough to hold my db if it'd help a lot, but it has that slower read
> > speed. Does the faster seek time really make that much of a difference?
> Any
> > brands you'd recommend?
> >

Neo4j stores the data as a graph on disk.

An example: e = (n1,n2)
e at location 1000
n1 at location 1
n2 at location 5

A traversal, assuming nothing is cached, would move the disk head to
location 1, then to 1000, then back to 5.
A conventional hard drive takes a while to move its head to each location
before it can start reading; an SSD has no such delay. If you read small
amounts of data scattered widely over the storage, as in a traversal, SSDs
are much faster than hard drives even if their raw read bandwidth is lower.
I don't have performance data on this myself, but I've heard reports of
around a 20-40x speedup.
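To see why seek time rather than bandwidth decides this, here is a rough back-of-the-envelope calculation; all numbers below are illustrative assumptions, not measurements of any particular drive:

```python
# Back-of-the-envelope: cost of one random record read during a traversal.
# All figures are illustrative assumptions, not benchmarks.

HDD_SEEK_S = 0.008      # ~8 ms average seek + rotational latency
SSD_SEEK_S = 0.0002     # ~0.2 ms access time for flash
RECORD_BYTES = 33       # a small node/relationship record
HDD_READ_BPS = 50e6     # 50 MB/s sequential read (the "faster" hard drive)
SSD_READ_BPS = 30e6     # 30 MB/s sequential read (the "slower" flash drive)

def time_per_record(seek_s, read_bps):
    # Each scattered read pays the full access latency plus a tiny transfer.
    return seek_s + RECORD_BYTES / read_bps

hdd = time_per_record(HDD_SEEK_S, HDD_READ_BPS)
ssd = time_per_record(SSD_SEEK_S, SSD_READ_BPS)
print(f"HDD: {hdd * 1e3:.3f} ms/record, SSD: {ssd * 1e3:.3f} ms/record")
print(f"speedup roughly {hdd / ssd:.0f}x")  # access latency dominates, not MB/s
```

The transfer time for a 33-byte record is under a microsecond on either device, so the sequential MB/s figures are almost irrelevant for this access pattern.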

cheers Martin


On Thu, Aug 5, 2010 at 9:02 PM, Jeff Klann <jklann at iupui.edu> wrote:

> Thanks for the answers.
>
> Yes, I can do online updates of the datastore, but while this is in R&D I
> will need to rerun the main loop when I change the algorithm and just for
> personal benefit I don't want to wait hours to see the changes. Seems to be
> running acceptably now, though. However, I haven't benchmarked it against
> doing JOINS in Postgres. Are there any good performance stats out there?
> The
> speed is about the same as I'd expect from SQL.
>
> The graph will probably be nearly a complete graph in the end. The edges
> between orders will eventually store various stats on the relationships
> between pairs of items. It'd be nice if I can query an index for outgoing
> edges from nodes with certain properties. Is this possible? I'll have a
> look
> at the edge indexer component.
>
> Thanks,
> Jeff Klann
>
> On Mon, Aug 2, 2010 at 2:40 PM, David Montag <
> david.montag at neotechnology.com
> > wrote:
>
> > Hi Jeff,
> >
> > Please see answers below.
> >
> > On Mon, Aug 2, 2010 at 5:47 PM, Jeff Klann <jklann at iupui.edu> wrote:
> >
> > > Thank you all for your continued interest in helping me. I tweaked the
> > code
> > > more to minimize writes to the database and it now looks like:
> > > For each item A
> > >   For each customer that purchased A
> > >      For each item B (with id>A) that the customer purchased
> > >         Increment (in memory) the weight of (A-B)
> > >   Write out the edges [(A-B):weight] to disk and clear the in-memory
> map
> > >
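For illustration, the loop above can be sketched with plain in-memory maps, independent of Neo4j; the toy purchase data and function name here are made up:

```python
from collections import defaultdict

# Toy purchase data: item -> customers who bought it, customer -> items bought.
# (In the real store these would be BOUGHT relationships in the graph.)
buyers = {"A": {"c1", "c2"}, "B": {"c1"}, "C": {"c2"}}
basket = {"c1": {"A", "B"}, "c2": {"A", "C"}}

def cooccurrence_edges(item_a):
    """Accumulate (A-B) weights in memory, then emit them in one batch."""
    weights = defaultdict(int)
    for customer in buyers[item_a]:
        for item_b in basket[customer]:
            if item_b > item_a:          # only pairs with id > A, as above
                weights[(item_a, item_b)] += 1
    # Written to disk in a single transaction, then cleared for the next A:
    return dict(weights)

print(sorted(cooccurrence_edges("A").items()))
# [(('A', 'B'), 1), (('A', 'C'), 1)]
```

Deferring the writes this way means each (A-B) edge is looked up and touched once per item A, instead of once per co-purchase.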
> > > This actually (if I'm not mistaken) covers all relationships and does
> > 7500
> > > items in about 45 minutes! Not too bad, especially due to (I think)
> > > avoiding
> > > edge-checking, and I think it's usable for my application, though it's
> > > still
> > > ~200k traversals/sec on average, which is a few times slower than I
> > hoped.
> > > I
> > > doubt that's much faster than a two-table join in SQL, though deeper
> > > traversals should show benefits.
> > >
> >
> > Do you need to do this computation on the full graph all the time? Maybe
> it
> > would be enough to do it once, and then update it when a customer buys
> > something? Usually, high one-time costs can be tolerated, and with Neo4j
> > you
> > can actually do the updating for a customer performing a purchase at
> > runtime
> > without performance problems.
> >
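One way to sketch that incremental idea (illustrative data structures and names, not Neo4j API): updating only the pairs that involve the newly purchased item costs O(basket size) per purchase, rather than a full recomputation over the catalog.

```python
from collections import defaultdict

# Toy state: each customer's basket and the co-occurrence weights.
basket = defaultdict(set)
weights = defaultdict(int)   # weights[(a, b)] with a < b

def record_purchase(customer, item):
    """Bump only the pairs involving the new item: O(basket size) work,
    done at purchase time instead of re-deriving the whole graph."""
    for other in basket[customer]:
        if other != item:
            pair = tuple(sorted((other, item)))
            weights[pair] += 1
    basket[customer].add(item)

record_purchase("c1", "A")
record_purchase("c1", "B")   # bumps (A, B)
record_purchase("c2", "B")
record_purchase("c2", "A")   # bumps (A, B) again
print(dict(weights))  # {('A', 'B'): 2}
```

The one-time batch job then only has to be rerun when the algorithm itself changes, not when the data does.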
> >
> > >
> > > - David, thank you for your answers on traversers vs. getRelationships
> > and
> > > on property-loading. I imported some properties I don't really need,
> > > perhaps
> > > if I delete them it'll speed things up? Also I'm using the old
> > > Node.traverse(). How is the new framework better? I expect it has a
> nicer
> > > syntax, which I would like to try, but does it improve performance too?
> > >
> >
> > Well, depending on your setup, you should theoretically be able to improve
> > performance compared to the old traversal framework. The old framework
> > keeps
> > track of visited nodes, so that you don't traverse to the same node
> twice.
> > This behavior is customizable in the new framework. Please see
> > http://wiki.neo4j.org/content/Traversal_Framework and check the
> Uniqueness
> > constraints. If you know exactly when to stop, then you should be able to
> > use Uniqueness.NONE, meaning that the framework does not keep track of
> > visited nodes, meaning that you could end up traversing in a cycle. In
> your
> > network however, you might know that you always traverse (item)
> <--BOUGHT--
> > (customer) --BOUGHT--> (item) --CORRELATION--> (item)*  and no further
> than
> > that, so then you know that you won't end up in a cycle. But yeah, then
> you
> > need to programmatically make sure you don't go too far. And I don't know
> > if
> > this gives you any performance benefits whatsoever.
> >
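To illustrate what Uniqueness.NONE implies (a plain-Python sketch of the semantics, not the actual Neo4j Java API): without a visited set, a fixed depth limit is the only thing preventing the traversal from cycling forever.

```python
# Sketch of Uniqueness.NONE semantics: expand paths without tracking visited
# nodes, relying on a fixed depth limit to avoid cycling. The graph and
# names are made up for illustration.
graph = {
    "item1": ["cust1"],
    "cust1": ["item1", "item2"],   # note the back-edge: a cycle
    "item2": ["cust1"],
}

def traverse_no_uniqueness(start, max_depth):
    """Return all nodes reached at exactly max_depth; revisits are allowed."""
    frontier = [start]
    for _ in range(max_depth):
        frontier = [nbr for node in frontier for nbr in graph[node]]
    return frontier

print(traverse_no_uniqueness("item1", 2))
# ['item1', 'item2'] - the start node is reached again via the cycle
```

The saving is that no visited-set bookkeeping is done per step; the cost is that the same node can be expanded many times, so it only pays off when the depth is small and known in advance, as in the BOUGHT/CORRELATION pattern above.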
> > Also, as I understand it, all properties for a node are loaded when they
> > are
> > first touched. Then they're kept in memory, so if you update properties
> > later on the same node, and it is still cached, it won't reread
> everything.
> >
> >
> > >
> > > - David, on checking relationships, I said checking 15 nodes for
> > > relationships to n other nodes (where n might be large, I'm not sure
> > large,
> > > but <<7500), takes 71s. The nodes are a highly-connected graph and also
> > > with
> > > edges going out to customers. So in the end the max # of edges for a node
> > > would
> > > be very high, up to around 7500 items and 300,000 customers.
> > >
> >
> > Just so I understand your data model: if a customer buys N products A1 -
> > AN,
> > will there be a complete graph between the nodes A1 - AN? When in your
> > algorithm do you need to check for the occurrence of a relationship
> between
> > A and B?
> >
> >
> > >
> > > - Martin, I'm confused a bit about SSDs. I read up on them after I read
> > > your
> > > post. You said flash drives are best, but I read that even the highest
> > > performing flash drives are about 30MB/s read, whereas modern hard
> drives
> > > are at least 50MB/s. True SSDs claim to be 50MB/s too but they're quite
> > > expensive. So why is a flash drive best? I could definitely spring for
> > one
> > > big enough to hold my db if it'd help a lot, but it has that slower
> read
> > > speed. Does the faster seek time really make that much of a difference?
> > Any
> > > brands you'd recommend?
> > >
> >
> > I think the general consensus is that an SSD is usually the single best
> > upgrade you can get for a computer or server. The blazingly fast seeks
> make
> > all the difference. If you have a big file with data spread out over it
> and
> > you need to read and write to different locations of the file rapidly,
> that
> > means a lot of work for the heads in a conventional hard drive. The SSD
> > nails this. You know when you start an application or do something
> > processing-heavy, and you hear your hard drive "work"? That's seeking.
> >
> > As for brands, I've heard good things about the Intel X25 ones. I have an
> > SSD in my mac, but I don't know what brand it is. All I know is that it's
> > ridiculously fast.
> >
> > David
> >
> >
> > >
> > > I will post some code snippets. Looks like there are a lot of sites for
> > > sharing code snippets. Any recommendations?
> > >
> > > Thanks all,
> > > Jeff Klann
> > >
> > > On Mon, Aug 2, 2010 at 8:44 AM, David Montag <
> > > david.montag at neotechnology.com
> > > > wrote:
> > >
> > > > Hi Jeff,
> > > >
> > > > If I'm not mistaken, Neo4j loads all properties for a node or
> > > relationship
> > > > when you invoke any operation that touches a property. As for the
> > > > performance of traversals, it is highly dependent on how deep you
> > > traverse,
> > > > and what you do during the traversal, so ymmv.
> > > >
> > > > Using a traverser is slower than doing getRelationships, as the
> > traverser
> > > > does extra processing to keep state around. Are you using
> > Node#traverse()
> > > > or
> > > > the new traversal framework? Is your code available somewhere?
> > > >
> > > > Are you saying that checking whether there's a relationship between A
> > and
> > > B
> > > > takes over 20s? How many relationships do A and B have? What does
> your
> > > neo
> > > > config look like (params)? Edge indexing might be a solution, you can
> > > look
> > > > at the new indexing component for that. (
> > > > https://svn.neo4j.org/laboratory/components/lucene-index/)
> > > >
> > > > As for the incrementing of a property - while you're within a
> > > transaction,
> > > > couldn't you increment a variable and then write that variable at the
> > end
> > > > of
> > > > the transaction?
> > > >
> > > > David
> > > >
> > > > On Fri, Jul 30, 2010 at 8:10 PM, Jeff Klann <jklann at iupui.edu>
> wrote:
> > > >
> > > > > Hi, so I got 2GB more RAM and noticed that after adding some more
> > > memory
> > > > > map
> > > > > and increasing the heap space, my small query went from 6hrs to
> 3min.
> > > > Quite
> > > > > reasonable!
> > > > >
> > > > > But the larger one that would take a month would still take a
> month.
> > So
> > > > > I've
> > > > > been performance testing parts of it:
> > > > >
> > > > > The algorithm as in my first post showed *no* performance
> improvement
> > > on
> > > > > more RAM.
> > > > > But individual parts....
> > > > >   - Traversing only (first three lines) was much speedier, but
> still
> > > > seems
> > > > > slow. 1.5 million traversals (15 out of 7000 items) took 23sec. It
> > > shaves
> > > > > off a few seconds if I run this twice and time it the second time,
> or
> > > if
> > > > I
> > > > > don't print any node properties as I traverse. (Does Neo4J load ALL
> > the
> > > > > properties for a node if one is accessed?) Even with a double run
> and
> > > not
> > > > > reading node properties, it still takes 16sec, which would make
> > > traversal
> > > > take two hours. I thought Neo4J was supposed to do ~1m
> > traversals/sec,
> > > > > this
> > > > > is doing about 100k. Why? (And in fact on the other query it was
> > > getting
> > > > > about 800,000 traversals/sec.) Is one of Traversers vs.
> > getRelationship
> > > > > iterators faster when getting all relationships of a type at depth
> 1?
> > > > >   - Searching for relationships between A & B (but not writing to
> > them)
> > > > > takes it from 20s to 91s. Yuck. Maybe edge indexing is the way to
> > avoid
> > > > > that?
> > > > >   - Incrementing a property on the root node for every A & B takes
> it
> > > > from
> > > > > 20s to 61s (57s if it's all in one transaction). THAT seems weird.
> I
> > > > > imagine
> > > > > it has something to do with logging changes? Any way that can be
> > turned
> > > > off
> > > > > for a particular property (like it could be marked 'volatile'
> during
> > a
> > > > > transaction or something)?
> > > > >
> > > > > I'm much more hopeful with the extra RAM but it's still kind of
> slow.
> > > > > Suggestions?
> > > > >
> > > > > Thanks,
> > > > > Jeff Klann
> > > > >
> > > > > On Wed, Jul 28, 2010 at 11:20 AM, Jeff Klann <jklann at iupui.edu>
> > wrote:
> > > > >
> > > > > > Hi, I have an algorithm running on my little server that is very
> > very
> > > > > slow.
> > > > > > It's a recommendation traversal (for all A and B in the catalog
> of
> > > > items:
> > > > > > for each item A, how many customers also purchased another item
> in
> > > the
> > > > > > catalog B). It's processed 90 items in about 8 hours so far!
> Before
> > I
> > > > > dive
> > > > > > deeper into trying to figure out the performance problem, I
> thought
> > > I'd
> > > > > > email the list to see if more experienced people have ideas.
> > > > > >
> > > > > > Some characteristics of my datastore: its size is pretty
> moderate
> > > for
> > > > a
> > > > > > database application. 7500 items, not sure how many customers and
> > > > > purchases
> > > > > > (how can I find the size of an index?) but probably ~1 million
> > > > customers.
> > > > > > The relationshipstore + nodestore < 500mb. (Propertystore is huge
> > but
> > > I
> > > > > > don't access it much in traversals.)
> > > > > >
> > > > > > The possibilities I see are:
> > > > > >
> > > > > > 1) *Neo4J is just slow.* Probably not slower than Postgres which
> I
> > > was
> > > > > > using previously, but maybe I need to switch to a distributed
> > > > map-reduce
> > > > > db
> > > > > > in the cloud and give up the very nice graph modeling approach? I
> > > > didn't
> > > > > > think this would be a problem, because my data size is pretty
> > > moderate
> > > > > and
> > > > > > Neo4J is supposed to be fast.
> > > > > >
> > > > > > 2) *I just need more RAM.* I definitely need more RAM - I have a
> > > measly
> > > > > > 1GB currently. But would this get my 20day traversal down to a
> few
> > > > hours?
> > > > > > Doesn't seem like it'd have THAT much impact. I'm running Linux
> and
> > > > > nothing
> > > > > > much else besides Neo4j, so I've got 650m physical RAM. Using
> 300m
> > > > heap,
> > > > > > about 300m memory-map.
> > > > > >
> > > > > > 3) *There's some secret about Neo4J performance I don't know.* Is
> > > there
> > > > > > something I'm unaware that Neo4J is doing? When I access a
> > property,
> > > > does
> > > > > it
> > > > > > load a chunk of properties I don't care about? For the current
> > > > node/edge
> > > > > or
> > > > > > others? I turned off log rotation and I commit after each item A.
> > Are
> > > > > there
> > > > > > other performance tips I might have missed?
> > > > > >
> > > > > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm
> and
> > > > maybe
> > > > > > there are some optimizations I can do. It looks like:
> > > > > >
> > > > > >> For each item A in the catalog:
> > > > > >>   For each customer C that has purchased that item:
> > > > > >>    For each item B that customer purchased:
> > > > > >>       Update the co-occurrence edge between A&B.
> > > > > >>
> > > > > >       (If the edge exists, add one to its weight. If it doesn't
> > > exist,
> > > > > >> create it with weight one.)
> > > > > >>
> > > > > > This is O(n^2) worst case, but practically it'll be much better
> due
> > > to
> > > > > the
> > > > > > sparseness of purchases. The large number of customers slows it
> > down,
> > > > > > though. The slowest part, I suspect, is the last line. It's a lot
> > of
> > > > > finding
> > > > > > and re-finding edges between As and Bs and updating the edge
> > > > properties.
> > > > > I
> > > > > > don't see much way around it, though. I wrote another version
> that
> > > > avoids
> > > > > > this but is always O(n^2), and it takes about 15 minutes per A to
> > > check
> > > > > > against all B (which would also take a month). The version above
> > > seems
> > > > to
> > > > > be
> > > > > > averaging 3 customers/sec, which doesn't seem that slow until you
> > > > realize
> > > > > > that some of these items were purchased by thousands of
> customers.
> > > > > >
> > > > > > I'd hate to give up on Neo4J. I really like the graph database
> > > concept.
> > > > > But
> > > > > > can it handle data? I hope someone sees something I'm doing
> wrong.
> > > > > >
> > > > > > Thanks,
> > > > > > Jeff Klann
> > > > > >
> > > > > _______________________________________________
> > > > > Neo4j mailing list
> > > > > User at lists.neo4j.org
> > > > > https://lists.neo4j.org/mailman/listinfo/user
> > > > >

