[Neo4j] Compacting files?

Alex Averbuch alex.averbuch at gmail.com
Wed Jun 2 18:12:40 CEST 2010


Hi Craig,
Just a quick note about needing to keep all IDs in memory during an
import/export operation. The way I'm doing it at the moment, that isn't
necessary.

When exporting:
Write each Node's ID into the exported format (this could be JSON, XML, GML,
GraphML, etc.).

When importing:
First import all Nodes; this is easy to do in most formats (all that I've
tried).
While importing Nodes, store & index one extra property on every Node; I call
this "GID", for global ID.
Next import all Relationships, using the GID and Lucene to locate the start
Node & end Node. (A rough sketch of this two-pass import follows.)
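
A minimal sketch of the two passes, assuming the Neo4j 1.x batch insertion
API (BatchInserterImpl plus the Lucene batch index; class and package names
vary between releases, and NodeRecord/RelRecord are just hypothetical
holders for whatever your format parser produces):

    import java.util.HashMap;
    import java.util.Map;
    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.index.lucene.LuceneIndexBatchInserter;
    import org.neo4j.index.lucene.LuceneIndexBatchInserterImpl;
    import org.neo4j.kernel.impl.batchinsert.BatchInserter;
    import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

    BatchInserter inserter = new BatchInserterImpl("target/graph.db");
    LuceneIndexBatchInserter index =
            new LuceneIndexBatchInserterImpl(inserter);

    // Pass 1: create every node, storing & indexing its exported ID
    // as the extra "GID" property.
    for (NodeRecord rec : exportedNodes) {
        Map<String, Object> props =
                new HashMap<String, Object>(rec.properties());
        props.put("GID", rec.gid());
        long nodeId = inserter.createNode(props);
        index.index(nodeId, "GID", rec.gid());
    }
    index.optimize(); // flush the index so pass 2 can see every node

    // Pass 2: create relationships, resolving both endpoints via GID.
    for (RelRecord rec : exportedRels) {
        long start = index.getSingleNode("GID", rec.startGid());
        long end = index.getSingleNode("GID", rec.endGid());
        inserter.createRelationship(start, end,
                DynamicRelationshipType.withName(rec.type()),
                rec.properties());
    }

    index.shutdown();
    inserter.shutdown();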

The biggest graph I've tried with this approach had 2.5 million Nodes &
250 million Relationships.
It took quite a long time, but much of the slowness was because it was
performed on an old laptop with 2GB of RAM: I didn't give the BatchInserter
a properties file, and I used default JVM parameters.
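
For anyone trying the same thing, the tuning I skipped would look roughly
like this (a sketch, assuming the Neo4j 1.x memory-mapping config keys;
check the exact names against your release), plus a bigger heap on the JVM
side, e.g. -Xmx1g:

    Map<String, String> config = new HashMap<String, String>();
    config.put("neostore.nodestore.db.mapped_memory", "100M");
    config.put("neostore.relationshipstore.db.mapped_memory", "2G");
    config.put("neostore.propertystore.db.mapped_memory", "500M");
    BatchInserter inserter =
            new BatchInserterImpl("target/graph.db", config);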

There is at least one obvious downside to this though: you "pollute" the
dataset with GID properties.

Alex

On Wed, Jun 2, 2010 at 5:53 PM, Craig Taverner <craig at amanzi.com> wrote:

> I've thought about this briefly, and somehow it actually seems easier (to
> me) to consider a compacting (defragmenting) algorithm than a generic
> import/export. The problem is that in both cases you have to deal with the
> same issue: the node/relationship IDs are changed. For the import/export
> this means you need another way to store the connectedness, so you export
> the entire graph into another format that maintains the connectedness in
> some way (perhaps a whole new set of IDs), and then re-import it again.
> Getting a very complex, large and cyclic graph to work like this seems hard
> to me because you have to maintain a complete identity map in memory during
> the export (which makes the export unscalable).
>
> But de-fragmenting can be done by changing IDs in batches, breaking the
> problem down into smaller steps, and never needing to deal with the entire
> graph at the same time at any point. For example, take the node table, scan
> from the base collecting free IDs. Once you have a decent block, pull that
> many nodes down from above in the table. Since you keep the entire block in
> memory, you maintain the old-to-new mapping and can use that to 'fix' the
> relationship table also. Rinse and repeat :-) (Sketched below.)
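>
> A rough sketch of one such batch, in plain Java over hypothetical
> in-memory tables (NodeRecord/RelationshipRecord are made up here to show
> the remapping idea; this is not Neo4j's actual store format):
>
>     static void compactOneBatch(Map<Long, NodeRecord> nodeTable,
>             Collection<RelationshipRecord> rels,
>             List<Long> freeIds, List<Long> movable) {
>         // freeIds: free slots found scanning from the bottom of the
>         // node table; movable: an equal number of node IDs taken from
>         // the top of the table.
>         Map<Long, Long> oldToNew = new HashMap<Long, Long>();
>         for (int i = 0; i < freeIds.size(); i++) {
>             long oldId = movable.get(i);
>             long newId = freeIds.get(i);
>             nodeTable.put(newId, nodeTable.remove(oldId)); // move record
>             oldToNew.put(oldId, newId);
>         }
>         // Fix every relationship that referenced a moved node.
>         for (RelationshipRecord rel : rels) {
>             Long s = oldToNew.get(rel.startNode);
>             if (s != null) rel.startNode = s;
>             Long e = oldToNew.get(rel.endNode);
>             if (e != null) rel.endNode = e;
>         }
>         // Only this batch's old-to-new map is ever held in memory.
>     }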
>
> One option for the entire graph export that might work for most datasets
> that have predominantly tree structures is to export to a common tree
> format, like JSON (or, .... XML). This maintains most of the relationships
> without requiring any memory of ID mappings. The less common cyclic
> connections can be maintained with temporary IDs and a table of such IDs
> maintained in memory (assuming it is much smaller than the total graph).
> This can allow complete export of very large graphs if the temp ID table
> does indeed remain small. Probably true for many datasets.
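>
> To make that concrete, a hypothetical fragment of such an export: tree
> relationships are implied by nesting, and only the one cyclic edge needs
> a temporary ID ("t1") kept in the in-memory table:
>
>     { "tempId": "t1", "name": "root",
>       "children": [
>         { "name": "child-a",
>           "children": [ { "name": "grandchild", "cyclicTo": "t1" } ] },
>         { "name": "child-b" } ] }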
>
> On Wed, Jun 2, 2010 at 2:30 PM, Johan Svensson <johan at neotechnology.com> wrote:
>
> > Alex,
> >
> > You are correct about the "holes" in the store file and I would
> > suggest you export the data and then re-import it again. Neo4j is not
> > optimized for the use case where more data is removed than added over
> > time.
> >
> > It would be possible to write a compacting utility, but since this is
> > not a very common use case I think it is better to put that time into
> > producing a generic export/import dump utility. The plan is to get an
> > export/import utility in place as soon as possible, so any input on how
> > that should work, what format to use, etc. would be great.
> >
> > -Johan
> >
> > On Wed, Jun 2, 2010 at 9:23 AM, Alex Averbuch <alex.averbuch at gmail.com>
> > wrote:
> > > Hey,
> > > Is there a way to compact the data stores (relationships, nodes,
> > > properties) in Neo4j?
> > > I don't mind if it's a manual operation.
> > >
> > > I have some datasets that have had a lot of relationships removed
> > > from them, but the file is still the same size, so I'm guessing there
> > > are a lot of holes in this file at the moment.
> > >
> > > Would this be hurting lookup performance?
> > >
> > > Cheers,
> > > Alex

