[Neo4j] [SPAM] Cats and Dogs, living together

Craig Taverner craig at amanzi.com
Wed Dec 1 23:33:16 CET 2010


While I don't do anything as fancy as Rick, I do one very specific thing
which is a kind of special case of one of his suggestions. We use neo4j as a
combined index and statistics tree for data stored in very large flat binary
files. We originally imported all of this data into neo4j, but found a few
problems with that:

   - The total database size was usually many, many times larger than the
   original data (I mean 10-50 times larger)
   - The write performance to the database is the bottleneck during import

Once we started working with extremely large datasets, potentially
approaching the database capacity, we decided to improve both performance
and scalability by storing only an index in the database, but one containing
key statistical results, the ones most likely to be needed by any further
analysis. This allows the tool to still perform the analyses required on the
entire dataset, but with higher import speeds, higher analysis speeds and
lower database sizes. As in Rick's suggestion, we maintain references to the
original files and record offsets, so that drill-down to the original data
remains possible.
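
As a rough illustration of the idea, here is a minimal Python sketch. It is
not our actual code: the record layout (one little-endian float64 per
record), the chunk size, and the names build_index, overall_mean and
drill_down are all assumptions made for the example, and a plain list of
dicts stands in for the nodes that would live in the database.

```python
import struct

# Assumed layout for illustration: each record is one little-endian float64.
RECORD = struct.Struct("<d")
# Records summarised by one index entry (tiny on purpose, for the example).
CHUNK_RECORDS = 4


def build_index(stream, filename):
    """Scan the flat file once and return one summary entry per chunk."""
    index = []
    while True:
        offset = stream.tell()
        data = stream.read(RECORD.size * CHUNK_RECORDS)
        # Ignore a trailing partial record, if any.
        usable = len(data) - len(data) % RECORD.size
        values = [v for (v,) in RECORD.iter_unpack(data[:usable])]
        if not values:
            break
        index.append({
            "file": filename,      # reference back to the original file
            "offset": offset,      # byte offset of the first record
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
        })
    return index


def overall_mean(index):
    """Answer a whole-dataset question from the statistics alone."""
    total = sum(entry["sum"] for entry in index)
    count = sum(entry["count"] for entry in index)
    return total / count


def drill_down(stream, entry):
    """Re-read the original records behind one index entry."""
    stream.seek(entry["offset"])
    data = stream.read(RECORD.size * entry["count"])
    return [v for (v,) in RECORD.iter_unpack(data)]
```

The point of the sketch is that overall_mean never touches the binary file,
while drill_down uses only the stored file reference and offset to get back
to the raw records.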

While this is specific to our use case, the principle is probably reusable
in other domains.

On Wed, Dec 1, 2010 at 8:59 PM, Peter Neubauer <
peter.neubauer at neotechnology.com> wrote:

> Thanks for the great feedback Rick! I created enhancement requests for
> your suggestions at https://trac.neo4j.org/ticket/292 - 288 so we
> don't drop them!
>
> Cheers,
>
> /peter neubauer
>
> GTalk:      neubauer.peter
> Skype       peter.neubauer
> Phone       +46 704 106975
> LinkedIn   http://www.linkedin.com/in/neubauer
> Twitter      http://twitter.com/peterneubauer
>
> http://www.neo4j.org               - Your high performance graph database.
> http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
>
>
>
> On Wed, Dec 1, 2010 at 9:22 PM,  <rick.bullotta at burningskysoftware.com>
> wrote:
> >   These topics are fairly integral to what we do in ThingWorx,  so I can
> >   share some feedback:
> >
> >
> >
> >   In our world "what goes where" isn't always determined by us - there
> >   are zillions of legacy data stores that might need to be integrated.
> >   They could be relational, proprietary, or accessed via some type of API
> >   or service invocation.  Thus, our application needs to be elastic enough
> >   to leverage those sources as well.  In our case, we've chosen to create
> >   an abstraction layer for datasets, services, and events that allows us
> >   to manage this complexity and to create heterogeneous
> >   views/services/applications from that complexity.
> >
> >
> >
> >   In terms of data we "can" control, we've chosen (for now) to put it all
> >   into Neo.  This includes our modeling/metamodel (data, scripts/logic,
> >   visualizations, services, domain data types, etc.) as well as the data
> >   we collect (which is basically in two main forms: activity streams and
> >   "tables").  We've implemented an in-memory data transformation engine
> >   that allows us to do "sql-like" things (filter, sort, aggregate, join,
> >   etc.) on data from any of the aforementioned sources, as well as for
> >   data from our own domain objects (which uses the same dataset
> >   abstraction that we apply to external data).
> >
> >
> >
> >   In terms of transactions, at this point, we have not yet gone as far
> >   as implementing hybrid transactions that wrap both external (JDBC)
> >   transactions and Neo transactions.  However, we have abstracted the way
> >   things get "invoked" such that it would be easy to place a single
> >   transaction wrapper around anything that might potentially manipulate
> >   data (in fact, it's implemented today but only for Neo transactions).
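
A minimal Python sketch of the "single transaction wrapper around anything
that gets invoked" idea described above. It covers only one transactional
resource, not the hybrid JDBC plus Neo case. The names invoke and
RecordingTransaction are hypothetical, and the commit/rollback interface is
an assumption for illustration, not the Neo4j API.

```python
class RecordingTransaction:
    """Hypothetical stand-in for a store's transaction handle."""

    def __init__(self, log):
        self.log = log

    def commit(self):
        self.log.append("commit")

    def rollback(self):
        self.log.append("rollback")


def invoke(begin, operation, *args):
    """Run operation(*args) inside a transaction obtained from begin().

    Commits if the operation returns normally, rolls back and re-raises
    if it raises.
    """
    tx = begin()
    try:
        result = operation(*args)
    except Exception:
        tx.rollback()
        raise
    tx.commit()
    return result
```

Because every call goes through invoke, the callers never deal with
transaction handling themselves, which is what would make it possible to
swap in a wrapper spanning several resources later.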
> >
> >
> >
> >   I'm not sure what you mean in terms of "message queues for data
> >   distribution", but we use queues in two main ways within ThingWorx.
> >   First, we use "writer" queues to manage writing of stream entries and
> >   data table entries into Neo, since these will tend to be very high
> >   frequency/high volume writes and we didn't want to have to create a
> >   separate transaction for each of them.  We use a set of workers that
> >   flush writes after each "X" seconds have elapsed or when "Y" records
> >   are waiting to be written.  There are persistence helpers that know how
> >   to persist the various types of domain objects that get queued up.  The
> >   other place we use them is for distribution of "events".  These could
> >   be from internal or external sources, as a result of data mutation,
> >   user interaction, service invocation, timer, etc...we use queues as a
> >   means of regulating the flow/loading and to manage
> >   distribution/subscriptions.
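
The "writer" queue behaviour described above (flush after "X" seconds or
when "Y" records are waiting) could be sketched in Python roughly as
follows. This is a single-threaded simplification, not the ThingWorx
implementation: BatchingWriter, persist_batch, max_records and max_seconds
are made-up names, and a real version with several workers would need
locking and one transaction around each persist_batch call.

```python
import time


class BatchingWriter:
    """Buffers records and hands them to persist_batch in batches.

    A flush is due when max_records are waiting, or when max_seconds have
    passed since the oldest buffered record arrived.
    """

    def __init__(self, persist_batch, max_records=100, max_seconds=5.0,
                 clock=time.monotonic):
        self._persist_batch = persist_batch
        self._max_records = max_records
        self._max_seconds = max_seconds
        self._clock = clock        # injectable so the timing can be tested
        self._pending = []
        self._oldest = None        # arrival time of the oldest pending record

    def submit(self, record):
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(record)
        self.flush_if_due()

    def flush_if_due(self):
        """Called on every submit; a worker loop would also call it on a timer."""
        if not self._pending:
            return False
        too_many = len(self._pending) >= self._max_records
        too_old = self._clock() - self._oldest >= self._max_seconds
        if too_many or too_old:
            self.flush()
            return True
        return False

    def flush(self):
        batch, self._pending = self._pending, []
        self._oldest = None
        if batch:
            # One call (and so one transaction) per batch, not per record.
            self._persist_batch(batch)
```

The design choice this shows is that the cost of a transaction is paid once
per batch, while the time limit bounds how long a record can sit unwritten
when traffic is low.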
> >
> >
> >
> >   In terms of data migration strategies, that's an area where we're
> >   currently doing some exploration.  We already have some basic "stuff"
> >   to take structures from RDBMS tables and turn them into their
> >   equivalent structures (and metamodel structures) in our platform and
> >   therefore in Neo, but we haven't really done much with it yet nor have
> >   we done much with things like indexes and constraints.  Just simple
> >   data for now.  What we are also exploring is using Neo to "index" data
> >   that might reside in external tables.  The searchable view of the data
> >   would reside in Neo and we would maintain a reference back to the
> >   original source (table/row/unique identifier in that row) when we need
> >   to retrieve the original data.  Sort of a spidering/crawling approach
> >   for now, though we would like it to also be event driven at some point.
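
A small Python sketch of the "searchable view plus reference back to the
source" approach described above. Everything in it is an assumption for
illustration: sqlite3 stands in for the external table, a dict of sets
stands in for the searchable structure that would live in Neo, and crawl
and fetch_original are hypothetical names.

```python
import sqlite3
from collections import defaultdict


def crawl(conn, table, key_column, text_column):
    """Scan the external table once and build word -> {(table, key)}.

    Only the reference (table name and unique key) is kept in the index,
    never a copy of the row itself.
    """
    index = defaultdict(set)
    # Identifiers cannot be bound as SQL parameters; in this sketch they
    # are trusted constants supplied by the caller.
    query = f"SELECT {key_column}, {text_column} FROM {table}"
    for key, text in conn.execute(query):
        for word in (text or "").lower().split():
            index[word].add((table, key))
    return index


def fetch_original(conn, ref, key_column):
    """Follow a reference back to the original row in the source table."""
    table, key = ref
    query = f"SELECT * FROM {table} WHERE {key_column} = ?"
    return conn.execute(query, (key,)).fetchone()
```

Searching happens entirely against the index; the external store is only
queried by key when the original row is actually needed. An event-driven
variant would update the index per change instead of rescanning the table.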
> >
> >
> >
> >   In terms of "features" that could help in the context of the questions
> >   you've asked, I suppose a few things come to mind:
> >
> >
> >
> >   - The ability to enlist/contain a Neo transaction into other
> >   transactions (and vice versa, I suppose)
> >
> >   - Richer data typing beyond the primitives that Neo stores today
> >   (DateTime and Location being a few interesting and common ones).
> >   Ideally this could be extensible. Currently, we use the domain object's
> >   metadata to help with this, which works OK
> >
> >   - Special treatment for storing/retrieving large strings or blobs
> >   (perhaps even at the expense of performance on these activities, but
> >   indirectly improving performance on node/relationship/property
> >   reads/writes due to reduced memory consumption)
> >
> >   - Support for "structured" storage (e.g. a property that represents a
> >   structure rather than a primitive). Using stuff like serialization is
> >   too fragile and platform/language-specific, but perhaps with some type
> >   of minimalist metamodeling/schemas this could be accomplished fairly
> >   easily (or some type of generic persistent model that knew how to deal
> >   with JSON objects, XML documents, native Java objects, Maps/Sets,
> >   etc.).  This is all stuff that we've had to write on our own
> >
> >   - Support for the idea of "node types" (similar to relationship
> >   types).  Currently, we stamp each node with a String property that
> >   indicates its "type".  Strings are not the most efficient way to do it,
> >   as we all know.
> >
> >
> >
> >   Rick
> >
> >
> >
> >
> >
> >   -------- Original Message --------
> >   Subject: [SPAM] [Neo4j] Cats and Dogs, living together
> >   From: Andreas Kollegger <[1]andreas.kollegger at neotechnology.com>
> >   Date: Wed, December 01, 2010 12:52 pm
> >   To: Neo4j user discussions <[2]user at lists.neo4j.org>
> >   Would anybody be willing to share experiences with trying to introduce
> >   Neo4j into a system with another relational (or other NoSQL) database?
> >   We're starting to think about best practices for integration:
> >   * Hybrid data-modeling: what goes where?
> >   * XA transactions
> >   * message queues for data distribution
> >   * data migration strategies
> >   Any problems or feature-requests related to living in a
> >   multi-storage-platform world are welcome.
> >   Cheers,
> >   Andreas
> >   _______________________________________________
> >   Neo4j mailing list
> >   [3]User at lists.neo4j.org
> >   [4]https://lists.neo4j.org/mailman/listinfo/user
> >
> > References
> >
> >   1. mailto:andreas.kollegger at neotechnology.com
> >   2. mailto:user at lists.neo4j.org
> >   3. mailto:User at lists.neo4j.org
> >   4. https://lists.neo4j.org/mailman/listinfo/user

