[Neo4j] Question about labelling all connected components

Tobias Ivarsson tobias.ivarsson at neotechnology.com
Thu Jul 22 16:35:37 CEST 2010


The first obvious thing is that labelled.contains(currentNode.getId()) is
going to take more time as your dataset grows, since it's a linear search
for the element in an ArrayList. A HashSet would be a much more appropriate
data structure for your application.

The other thing that comes to mind is the memory overhead of the
labelled-collection. Eventually it is going to contain every node in the
graph, and be very large. This "steals" some of the memory that could have
been used for caching the graph, forcing Neo4j to do more I/O than it would
have if it could have used that memory for cache.

Would it be possible for you to replace
the !labelled.contains(currentNode.getId())-check with
currentNode.getProperty("componentID",null) == null? Or are there situations
where the node could have that property and not be considered labeled?

Cheers,
Tobias

On Thu, Jul 22, 2010 at 3:35 PM, Arijit Mukherjee <arijit72 at gmail.com>wrote:

> Hi All
>
> I'm trying to label all connected components in a graph - i.e. all
> nodes that are connected will have a common "componentID" property
> set. I'm using the Traverser to do this. For each node in the graph
> (unless it is already labelled, which I track by inserting the node ID
> in a list), the traverser finds out all the neighbours using BFS, and
> then the node and all the neighbours are labelled with a certain
> value. The code is something like this -
>
>        Iterable<Node> allNodes = graphDbService.getAllNodes();
>        ArrayList labelled = new ArrayList();
>        for (Node currentNode : allNodes) {
>            if (currentNode.hasProperty("number") &&
> !labelled.contains(currentNode.getId())) {
>                Traverser traverser =
> currentNode.traverse(Order.BREADTH_FIRST,
>                        StopEvaluator.END_OF_GRAPH,
>                        ReturnableEvaluator.ALL_BUT_START_NODE,
>                        RelTypes.CALLS, Direction.BOTH);
>                int currentID = initialID;
>                initialID++;
>                currentNode.setProperty("componentID", currentID);
>                labelled.add(currentNode.getId());
>                for (Node friend : traverser) {
>                    friend.setProperty("componentID", currentID);
>                    // mark each node as labelled
>                    labelled.add(friend.getId());
>                }
>           }
>     }
>
> This works well for a small graph (2000 nodes). But for a graph of
> about 1 million nodes, this is taking about 45 minutes on a 64-bit
> Intel 2.3GHz CPU, 4GB RAM (Java 1.6 update 21 and Neo4J 1.0).
>
> Is this normal? Or is the code I'm using faulty? Is there any other
> way to label the connected components?
>
> Regards
> Arijit
>
> --
> "And when the night is cloudy,
> There is still a light that shines on me,
> Shine on until tomorrow, let it be."
> _______________________________________________
> Neo4j mailing list
> User at lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>



-- 
Tobias Ivarsson <tobias.ivarsson at neotechnology.com>
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857


More information about the User mailing list