emil at neotechnology.com
Fri Jan 30 01:27:15 CET 2009
On Wed, Jan 28, 2009 at 8:47 PM, Joel Longtine <joel at socialthing.com> wrote:
> Hey All,
> I noticed the copy on neo4j.org:
> - massive scalability. Neo4j can handle graphs of several billion
> nodes/relationships/properties on a single machine and can be sharded
> to scale out across multiple machines.
> But I can't seem to find information about how to actually set up a
> sharded instance of Neo. Is this simply not implemented yet? Or am I
> missing something?
We haven't yet written up an example of how to shard Neo4j, but at its
core it's equivalent to sharding a relational database: there's no
inherent support for sharding on the server side (don't think that's
doable in RDBMS or graph db?) so the logic for defining and
dispatching to shards resides on the client side.
As you're probably aware, there's plenty to read about sharding on the
internets. Fundamentally you need to find the entities in your domain
that make sense to partition horizontally, and then figure out a good
sharding scheme. Common schemes are key- and hash-based partitioning,
vertical partitioning and range-based partitioning. Then in code on
the client side that works with a sharded domain entity, you add some
(hopefully simple) logic to first figure out the destination shard and
direct your messages there.
For example, let's say that we have a simple e-commerce application
with Customers, Products and Orders. They are all represented by nodes
and every Order node has a relationship to one Customer node and one
Product node. (Extremely simplified, obviously.) In this case, the
scalability pain point will typically be Order throughput (imagine
something like 1M products, 100M customers and 100B orders).
One setup for that could be:
o Implement business logic wrapper classes (Customer, Order, etc)
on top of the Neo4j Nodes/Rels/etc as usual. Expose a domain-oriented
API over the wire somehow, for example via REST or a component that
eats messages off a JMS queue and then invokes the domain APIs (i.e.
invokes addCustomer(...) when it gets an AddCustomer JMS message,
etc). This service (Neo4j db code + business logic) is a backend unit
which will be deployed on several machines.
o Shard orders and customers based on e.g. hashed customer id.
o Duplicate products across all backend instances, use a
master-slave setup with asynchronous write updates. This ensures that
there's no write conflicts and is eventually consistent (so it may
take some time for the read-only slaves to get an updated view of the
o So with 20 machines and a total data set as above, each machine
would handle something like 1M products, 5M customers and 5B orders
(assuming a uniform distribution).
o On the client side, when there's an incoming order, pass the
customer id to your shard distribution function which in this case
will simply hash it and return the id of the shard where that customer
resides. The AddOrder message will then be directed to the appropriate
1] Those machines would in a typical RDBMS architecture host only the
relational database. The database would be exposed over the wire,
making SQL and the raw data model (table layout) the de-facto
integration API. With the architecture outlined above, the integration
API is domain oriented -- it speaks Customers and Orders rather than
SQL strings. This is not exclusive for a graph db like Neo4j, you can
implement a similar setup based on RDBMS technology obviously, but I
do believe that it's a better and more maintainable architecture.
Does that make sense?
Emil Eifrém, CEO [emil at neotechnology.com]
Neo Technology, www.neotechnology.com
Cell: +46 733 462 271 | US: 206 403 8808
More information about the User