[Neo] How to efficiently query in Neo4J?

rick.bullotta at burningskysoftware.com rick.bullotta at burningskysoftware.com
Thu Apr 8 12:43:20 CEST 2010


   Since no one responded yesterday, I wanted to re-emphasize that there
   are probably substantial optimizations that can be made in a well-known
   problem domain such as this.   For example, by using pre-calculated
   "relevance" measures for tags, and by narrowing the returned set of
   posts/nodes as rapidly as possible using the "least used" tag(s) in
   progressive order.  It would be quite trivial (and reasonably
   performant) to maintain a pair of properties on each node in the tag
   hierarchy that count the # of relationships of the tag and all its
   children.  Each time a tagging relationship was added to a post, simply
   add 1 to this property for the tag node and all its ancestors/parents.
   Then, when you are provided with a list of tags to search upon, order
   them by the least frequently used tag by leveraging this metric and
   execute your traversals/set analysis in that order.  I also think my
   proposal for a two-directional search (first one from the direction of
   the least frequently used tag to the posts that include it, followed by
   a search from each of those posts back to its tags as described in a
   previous message) could be quite fast.



   Another "compound index" approach that can be used, which is somewhat
   of a brute force method, is to maintain a property on each tag node
   that consists of its "aggregrate name" - e.g.
   Europe.Italy.Toscana.Siena or
   Activities.Active.Cycling.MountainBiking.  When doing a search for
   Cycling activities in Italy, you could grab the aggregate names for
   Italy (Europe.Italy) and Cycling (Activities.Active.Cycling), then,
   using whatever mechanism you choose for your initial node traversal
   (exhaustive or least-frequently-used tag), you can compare the
   aggregate name for the tags assigned to a post node to the aggregate
   names for the desired nodes using a simple String.startsWith().  For
   example, if I posted regarding a mountain bike ride I took in the hills
   around Siena, tagged with the above aggregate names, it would
   successfully match.  My first thought was that this could be
   problematic if a tag term appeared multiple times in the tag hierarchy,
   but that's easily managed on the query side.



   Just trying to make the point that sometime abstract or generic
   traversal schemes aren't always optimal and that it is often worth the
   effor to explore domain-specific approaches.



   Does that make any sense?







   -------- Original Message --------
   Subject: Re: [Neo] How to efficiently query in Neo4J?
   From: Craig Taverner <craig at amanzi.com>
   Date: Wed, April 07, 2010 7:05 pm
   To: Neo user discussions <user at lists.neo4j.org>
   Hi Alastair,
   I have been using what you tag the 'composite index' although in mysql.
   Its
   > fast, but a pain to manage (as you need to keep the index up to
   date), so I
   > would like to stay away from indexes *if possible*.
   >
   I would think that you only need to take action when you add or modify
   a
   node, and then only to (re)connect it to the index tree (creating index
   nodes on demand, if missing). This can be embedded in your domain
   classes,
   so indexing is automatic. You can even synchronously 'garbage-collect'
   unused index nodes (if the node unlinked was the last node for that
   index
   node). I think the index-service for this needs to be well tested for
   all
   scenarios, but should ultimately have a very simple API, with no manual
   management requirement.
   My one concern with the composite index for your case is that all my
   thinking in this has been for numerical indexes, where I plan to query
   with
   inequalities (eg. return all restaurants with rating >= 4 stars). I've
   not
   thought about how to solve hierarchical tags like you have.
   One further optimisation is to only store new items in the hash on the
   first
   > traversal. Then, in the subsequent traversals, if the key does not
   exist,
   > there is no need to add key with count 1, as it cannot ever be
   emitted.
   > This
   > limits the memory requirements to the order of the first traversal,
   so if
   > you pick that well, it should be better.
   >
   Nice idea. It makes your approach more like the 'one set intersection'
   approach in term of memory.
   Picking a good first query seems a common need for many of the
   solutions. I
   presume RDBMS have a query optimization phase that figures that out.
   I'm
   hoping to completely avoid that kind of non-deterministic approach with
   the
   composite index.
   Cheers, Craig
   _______________________________________________
   Neo mailing list
   User at lists.neo4j.org
   [1]https://lists.neo4j.org/mailman/listinfo/user

References

   1. https://lists.neo4j.org/mailman/listinfo/user


More information about the User mailing list