[Neo] How to efficiently query in Neo4J?
rick.bullotta at burningskysoftware.com
rick.bullotta at burningskysoftware.com
Thu Apr 8 12:43:20 CEST 2010
Since no one responded yesterday, I wanted to re-emphasize that there
are probably substantial optimizations that can be made in a well-known
problem domain such as this. For example, by using pre-calculated
"relevance" measures for tags, and by narrowing the returned set of
posts/nodes as rapidly as possible using the "least used" tag(s) in
progressive order. It would be quite trivial (and reasonably
performant) to maintain a pair of properties on each node in the tag
hierarchy that count the # of relationships of the tag and all its
children. Each time a tagging relationship was added to a post, simply
add 1 to this property for the tag node and all its ancestors/parents.
Then, when you are provided with a list of tags to search upon, order
them by the least frequently used tag by leveraging this metric and
execute your traversals/set analysis in that order. I also think my
proposal for a two-directional search (first one from the direction of
the least frequently used tag to the posts that include it, followed by
a search from each of those posts back to its tags as described in a
previous message) could be quite fast.
Another "compound index" approach that can be used, which is somewhat
of a brute force method, is to maintain a property on each tag node
that consists of its "aggregrate name" - e.g.
Activities.Active.Cycling.MountainBiking. When doing a search for
Cycling activities in Italy, you could grab the aggregate names for
Italy (Europe.Italy) and Cycling (Activities.Active.Cycling), then,
using whatever mechanism you choose for your initial node traversal
(exhaustive or least-frequently-used tag), you can compare the
aggregate name for the tags assigned to a post node to the aggregate
names for the desired nodes using a simple String.startsWith(). For
example, if I posted regarding a mountain bike ride I took in the hills
around Siena, tagged with the above aggregate names, it would
successfully match. My first thought was that this could be
problematic if a tag term appeared multiple times in the tag hierarchy,
but that's easily managed on the query side.
Just trying to make the point that sometime abstract or generic
traversal schemes aren't always optimal and that it is often worth the
effor to explore domain-specific approaches.
Does that make any sense?
-------- Original Message --------
Subject: Re: [Neo] How to efficiently query in Neo4J?
From: Craig Taverner <craig at amanzi.com>
Date: Wed, April 07, 2010 7:05 pm
To: Neo user discussions <user at lists.neo4j.org>
I have been using what you tag the 'composite index' although in mysql.
> fast, but a pain to manage (as you need to keep the index up to
date), so I
> would like to stay away from indexes *if possible*.
I would think that you only need to take action when you add or modify
node, and then only to (re)connect it to the index tree (creating index
nodes on demand, if missing). This can be embedded in your domain
so indexing is automatic. You can even synchronously 'garbage-collect'
unused index nodes (if the node unlinked was the last node for that
node). I think the index-service for this needs to be well tested for
scenarios, but should ultimately have a very simple API, with no manual
My one concern with the composite index for your case is that all my
thinking in this has been for numerical indexes, where I plan to query
inequalities (eg. return all restaurants with rating >= 4 stars). I've
thought about how to solve hierarchical tags like you have.
One further optimisation is to only store new items in the hash on the
> traversal. Then, in the subsequent traversals, if the key does not
> there is no need to add key with count 1, as it cannot ever be
> limits the memory requirements to the order of the first traversal,
> you pick that well, it should be better.
Nice idea. It makes your approach more like the 'one set intersection'
approach in term of memory.
Picking a good first query seems a common need for many of the
presume RDBMS have a query optimization phase that figures that out.
hoping to completely avoid that kind of non-deterministic approach with
Neo mailing list
User at lists.neo4j.org
More information about the User