[Neo] How to efficiently query in Neo4J?
craig at amanzi.com
Thu Apr 8 00:27:12 CEST 2010
This is a very interesting discussion because I've been working on a
solution I call a 'composite index' that, for some datasets, should return
the correct results in one simple traverse, with no temporary memory used,
but first my high level view of this:
I think there are several types of solutions:
- *set intersection*, query on each, create intersection of all
result-sets. This is what Alastair seems to originally suggest, but dislike
because it takes more time and memory than other options.
- *one set intersection*, improved version of above, with only the first
traversers results kept in memory, and used for testing in all further
traversers. This is what Max suggested. This method works best if you can
predict which query will produce the smallest initial set.
- *inner traverse*, as with the previous one, you should pick the right
first traverser, and then for each match, run an inner traverser backwards
up the other tag trees as an extra test. This one requires no memory for
storing the result set, but does make for a longer total traverse,
especially if the first query is poorly chosen.
- *composite index*, this is my latest idea, building an index
(essentially the same thing as the tag tree), but make the first layer of
the tree a composite of all existing tag combinations. The worst case
scenario is that this will contain as many index nodes as data nodes, which
can be innefficient. But in most real world datasets there is some
correlation or locality and the index set is much smaller. Then you query on
the index directly, for all tags of interest.
But Alastairs last suggestion is most interesting. I've not thought of that
before. One concern is that it needs to keep a hash of all nodes that match
any of the query criteria, which is a super-set of the sets found in the
set-intersection approach above, so that might be a memory issue. I'm going
to think about it some more and see how it fits in with the other
approaches. I'm hopefull it really is something new and clever :-)
On Wed, Apr 7, 2010 at 6:19 PM, Alastair James <al.james at gmail.com> wrote:
> I am guessing the most efficient way might be to have a two stage return
> E.g. The custom return evaluator class has a hash table of <node id> =>
> <count> pairs. Each time 'isReturnableNode' is called, it increments the
> count for that node id in the hash. If count >= <total number of tags to
> check for> return the node.
> Thus, if you run a traversal starting from each tag using the same instance
> of the custom return evaluator, none of the traversals will emit any nodes
> (good for memory!) until the last one, where the ones matching all the tags
> will be emitted.
> Does that sound like it would work?
> On 7 April 2010 16:20, Max De Marzi Jr. <maxdemarzi at gmail.com> wrote:
> > I've had similar issues and they way I've done it (which may not be the
> > right way) is to run the first traversal and store the returned nodes.
> > Then
> > run the second traversal and return only if it is contained in the set of
> > returned nodes in the first traversal.
> > The traverses hit each node only once, and since we want to return only
> > they are found twice, I don't think there is a clean way to do it in a
> > single traversal.
> > On Wed, Apr 7, 2010 at 9:53 AM, Alastair James <al.james at gmail.com>
> > > Hi there...
> > >
> > > I am looking at moving a website to a model based on Neo4J, however, I
> > > having trouble seeing how to optimise the 'main query' type for Neo4J.
> > >
> > > Briefly, the site consists of posts, each tagged with various
> > > e.g. (its a travel site) location, theme, cost etc... Also the tags
> > > are hierarchical. So, for location we have (say) 'tuscany' inside
> > > inside 'europe'. For theme we have (say) 'cycling' inside 'activity'.
> > >
> > > I can beautifully model the parent child relationships of these 'tags'
> > > using
> > > a graph Db as objects with a 'CHILD_OF' relationship type. Then the
> > > go
> > > in and have related 'tags' related to them with a 'TAGGED_WITH'
> > > relationship
> > > types. Fine.
> > >
> > > However, here is the complex bit (well to me): I need to be able to
> > > all
> > > posts tagged with a set of tags (AND operation so each post must be
> > tagged
> > > with each tag).
> > >
> > > Whats more, these queries need to respect the parent relationships in
> > > tags. So if I search for 'activity' in 'france' it needs to traverse
> > > CHILD_OF relationships to find things tagged with any child of
> > > AND any child of 'France'.
> > >
> > > It seems pretty easy to write a traversal class that would find all
> > > in
> > > any child of ether 'tag' node, simply follow 'CHILD_OF' and
> > > backwards and return nodes of type post.
> > >
> > > However how to do the AND bit? The only way I can see is to return both
> > > lists and union them in user code... however that seems inelegant and
> > > not scale brilliantly.
> > >
> > > Any ideas how to optimise?
> > >
> > > Al
> > > _______________________________________________
> > > Neo mailing list
> > > User at lists.neo4j.org
> > > https://lists.neo4j.org/mailman/listinfo/user
> > >
> > _______________________________________________
> > Neo mailing list
> > User at lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> Dr Alastair James
> CTO James Publishing Ltd.
> WINNER Travolution Awards Best Travel Information Website 2009
> WINNER IRHAS Awards, Los Angeles, Best Travel Website 2008
> WINNER Travolution Awards Best New Online Travel Company 2008
> WINNER Travel Weekly Magellan Award 2008
> WINNER Yahoo! Finds of the Year 2007
> "Noli nothis permittere te terere!"
> Neo mailing list
> User at lists.neo4j.org
More information about the User