On Importing Data into Neo4j (Blog Series)

Posted by Michael Hunger on May 25, 2013 in development, neo4j

Being able to run interesting queries against a graph database like Neo4j requires the data to be in there in the first place. As many users have questions in this area, I thought a series on importing data into Neo4j would be helpful to get started. This series covers importing small and moderate data volumes for examples and demonstrations, as well as large-scale data ingestion.

For operations where massive amounts of data flow into or out of a Neo4j database, the interaction with the available APIs deserves more consideration than your usual ad-hoc, local graph queries.
This blog series will discuss several ways of importing data into Neo4j and the considerations you should make when choosing one or the other. There is a dedicated page about importing data on neo4j.org which will help you get started but needs feedback to improve.

Basically, Neo4j offers several APIs to import data. Preferably the Cypher query language should be used, as it is the easiest to use from any programming language. The Neo4j Server’s REST API is not suited for massive data import; only the batch-operation endpoint and the Cypher REST and transactional HTTP (from Neo4j 2.0) endpoints are of interest there. Neo4j’s Java Core APIs provide a way of avoiding network overhead and driving data import directly from a programmatic source of data, and also allow you to drop down to the lowest API levels for high-speed ingestion.

Conceptual Discussion

Neo4j is a fully transactional, ACID database, which includes durability guarantees. That implies that each transaction commit is written to a persistent transaction log (WAL) which is forcibly flushed to disk.
This operation is quite expensive and depends on the operating system and disk speed, so the transaction-commit overhead is not negligible.

For updating or inserting data in Neo4j, a sensible transaction size should be chosen to accommodate that. Transactions that are too small (just one or a few updated elements) suffer from the transaction-commit overhead, and transactions that are too large (100k or millions of elements) reserve a lot of memory for the transient transaction state. A suitable transaction therefore contains between 10k and 50k elements.
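
In the embedded case this simply means committing and re-opening the transaction every so many operations. A minimal sketch (Neo4j 1.9 embedded Java API; the store path, batch size and generated data are illustrative):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class BatchedInsert {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("data/graph.db");
        int batchSize = 20000; // within the 10k-50k sweet spot
        Transaction tx = db.beginTx();
        try {
            for (int i = 0; i < 1000000; i++) {
                Node node = db.createNode();
                node.setProperty("name", "node-" + i);
                if ((i + 1) % batchSize == 0) { // commit and start a fresh transaction
                    tx.success();
                    tx.finish();
                    tx = db.beginTx();
                }
            }
            tx.success();
        } finally {
            tx.finish(); // commits (or rolls back) the last partial batch
        }
        db.shutdown();
    }
}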

For massive imports of hundreds of millions or billions of nodes, the transactional approach doesn’t yield good enough performance. To saturate the full write speed of a modern disk (e.g. SSD RAIDs at up to 500MB/s), it is necessary to skip transaction semantics and build up your initial datastore in a “raw” way using a batch-insertion mechanism.
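
The batch inserter will be covered in detail later in this series; as a quick preview, a minimal sketch with the 1.9 BatchInserter API (store path and data are illustrative) looks like this:

import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class RawImport {
    public static void main(String[] args) {
        // no transactions, no locks - only safe for building a fresh store single-threadedly
        BatchInserter inserter = BatchInserters.inserter("data/graph.db");
        try {
            Map<String, Object> actorProps = new HashMap<String, Object>();
            actorProps.put("name", "Keanu Reeves");
            long actor = inserter.createNode(actorProps);
            Map<String, Object> movieProps = new HashMap<String, Object>();
            movieProps.put("title", "Matrix");
            long movie = inserter.createNode(movieProps);
            inserter.createRelationship(actor, movie,
                    DynamicRelationshipType.withName("ACTED_IN"), null);
        } finally {
            inserter.shutdown(); // flushes the store files to disk
        }
    }
}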

If you’re using the Neo4j Server’s REST API, the maximum insertion speed is much lower than when using Cypher or the Core Java APIs, as it has to cater for the data-to-JSON transformation, HTTP protocol overhead and network transport. Also, by default each HTTP request spawns a single transaction. To accommodate that, Neo4j provides a batch-operation endpoint that streams data to and from the server and can execute many REST operations at once within a single HTTP request and transaction while having a low memory overhead.

Especially for the Neo4j Server, but also in the embedded case, the graph query language Cypher offers a neat way of updating or inserting larger amounts of data with its updating operations and its parameter support for collections of input data.

So you see that there are quite a number of choices for inserting larger data volumes into Neo4j. Let’s see how they are managed in detail. As an example data model I want to use the well-known social movie database (cineasts), which consists of actors that act in movies. Sample data can easily be pulled programmatically from The MovieDB APIs. I also prepared a small Google spreadsheet with sample data and a GitHub repository with some code around this blog series and data model. This dataset is also part of the Neo4j training and documentation, and has been well received so far.

(Actor)-[:ACTED_IN]->(Movie)<-[:RATED]-(User)

Initial Data Import

If you are getting started with Neo4j, you usually want to import existing data into the graph database. For an initial demo or conceptual model it should be enough to create a small graph using the Neo4j Web UI or the Neo4j Console. There you can build your graph either interactively using the data browser or with sets of Cypher CREATE statements.

To achieve a realistic proof of concept, a sensible data volume should be used to test your use cases. One way to do that is to write a data generator in a programming language of your choosing. That data generator can drive any of the mentioned APIs or just generate Cypher statements to be executed (see the sketch below). Another option is to import existing data from a relational database or other data source into the graph.
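
As a minimal sketch of the generator approach (file name, count and data are made up), writing Cypher statements for the shell can be as simple as:

import java.io.PrintWriter;

public class CypherGenerator {
    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter("import.txt");
        // one CREATE statement per generated actor; the shell executes them one by one
        for (int i = 0; i < 10000; i++) {
            out.printf("CREATE (a {name:\"Actor %d\"});%n", i);
        }
        out.close();
    }
}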

You should have a good understanding of what your domain would look like as a graph, to support your thinking and your use cases. Modeling the core concepts and relationships on a whiteboard, ideally together with someone who knows the domain or application well, is a very helpful upfront exercise.

Interlude: Indexing

Finding nodes and relationships as starting points for traversals is handled in Neo4j (pre 2.0) by the integrated indexing framework (with node_auto_index as the index name). Adding elements to an index is either a manual effort, which is mostly relevant to the embedded APIs, or can be configured to happen automatically by property. This auto-indexing should be enabled before you start populating your database. Just edit conf/neo4j.properties and apply the changes listed below. If you enable auto-indexing after you have already inserted data, you have to manually re-set the relevant properties, e.g. by using Cypher’s SET operation to trigger the index update (see the sketch after the configuration).

# Enable auto-indexing for nodes, default is false
node_auto_indexing=true

# The node property keys to be auto-indexed, if enabled
node_keys_indexable=name,title
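
If the data was inserted before auto-indexing was enabled, a one-off statement like the following sketch (pre-2.0 syntax) re-sets the name property to itself, which is enough to trigger the index update; for large graphs, combine it with the paging approach shown further down:

START n=node(*)
WHERE has(n.name)
SET n.name = n.name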

Using Cypher

Cypher, Neo4j’s expressive graph query language, is also well suited for updating the graph (much like SQL INSERT statements are for a relational database). Statements like CREATE, DELETE, SET and CREATE UNIQUE handle the modification of your graph data pretty well.

The easiest way to import data with Cypher is to construct the appropriate statements for your input data. That can be done using something as simple as a spreadsheet, or any programming language that features text concatenation. It is then straightforward to paste or pipe the commands to the Neo4j-Shell, or have it read from a file, as shown below.
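
For example (paths are illustrative and depend on your installation):

# pipe a file of generated Cypher statements into the shell
cat import.txt | bin/neo4j-shell -path data/graph.db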

The basic updating statements are shown below (see also the Cypher Cheat Sheet). They assume you enabled auto-indexing for the properties name and title.

CREATE (k {name:"Keanu Reeves"}), (m {title:"Matrix"}),
       (k)-[:ACTED_IN {role:"Neo"}]->(m)

START k=node:node_auto_index(name="Keanu Reeves")
SET k.age = 48

START m=node:node_auto_index(title="Matrix"),
          a=node:node_auto_index(name="Keanu Reeves")
CREATE UNIQUE (a)-[role:ACTED_IN]->(m)
SET role.role = "Neo"

To reduce the overhead of parsing Cypher queries, the engine employs a query cache of configurable size. For the cache to work, semantically equal queries must also be syntactically identical. That’s why literal expressions like strings and numbers should be provided as named parameters, which makes all executions of the same basic statement look the same and differ only in their parameters.

Parameters can be used for node and relationship properties, but also for start-clause lookups, comparison values and more. Parameters are just placeholder names in curly braces that are used during query execution to look up values from the provided parameter set. At the same time this addresses the issue of code injection (much like SQL’s prepared statements).

The previous example would look like this with parameters:

CREATE (k {actor}),(m {movie}),(k)-[:ACTED_IN {role}]->(m)

-- parameters here are JSON, use the appropriate HashMap, Dictionary, Hash depending on your language
{"params":{"actor":{"name":"Keanu Reeves"}, "movie":{"title":"Matrix"},"role":{"role":"Neo"}}}

To update data with Cypher it is also necessary to take the transaction size into account. For the embedded case, batching transactions is discussed in the next installment of this series. For remote execution via the Neo4j REST API there are a few important things to remember. Especially with large index lookups and match results, it might happen that a query updates hundreds of thousands of elements. In that case a paging mechanism using WITH and SKIP/LIMIT can be put in front of the updating operation.
Note: Neo4j 2.0 syntax

MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WITH a, count(*) AS cnt
SKIP {offset} LIMIT {pagesize}
SET a.movie_count = cnt
RETURN count(*)

Run with pagesize=20000 and increasing offset=0,20000,40000,… until the query returns a count < pagesize, as in the loop sketched below.
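
This is a sketch against the embedded ExecutionEngine (the engine and the paging statement above, passed in as query, are assumed):

import java.util.HashMap;
import java.util.Map;
import org.neo4j.cypher.javacompat.ExecutionEngine;

public class PagedUpdate {
    public static void run(ExecutionEngine engine, String query) {
        int pagesize = 20000;
        for (int offset = 0; ; offset += pagesize) {
            Map<String, Object> params = new HashMap<String, Object>();
            params.put("offset", offset);
            params.put("pagesize", pagesize);
            // the statement returns the number of updated elements as count(*)
            long updated = ((Number) engine.execute(query, params)
                    .iterator().next().get("count(*)")).longValue();
            if (updated < pagesize) break; // last (partial) page processed
        }
    }
}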

There are some features in Cypher that allow handling larger data volumes at once. For one, you can use a collection of maps as a parameter to create multiple nodes at once. You can also use FOREACH to iterate over a collection and manipulate nodes or relationships for each of its elements. In general, the Cypher collection functions and predicates are a very powerful way of working with larger datasets.

-- create many nodes in one go
CREATE (n {nodes})
{"params":{"nodes":[{"name":"Keanu Reeves"},{"name":"Tom Hanks"},{"name":"Clint Eastwood"}] }

-- use FOREACH, a bit contrived
START a=node:node_auto_index("name:*")
MATCH (a)-[r:ACTED_IN]->(m)
WHERE r.top_actor
WITH a, COLLECT(m) AS movies
FOREACH (m IN movies : SET m.star = a.name)

Cypher over REST

Running Cypher over the wire with the Neo4j REST API poses another set of challenges. First you have to take latency into account, so try to bundle as many operations into as few Cypher statements and HTTP requests as possible. Also keep in mind that transferring data over the network (especially response data) is costly and should be minimized, by returning only attributes, ids or nothing at all instead of full node, relationship or path representations. Also remember to use parameters everywhere. And last but not least, if you want to optimize the transaction size, use the REST batch-operations request (example below).

  1. Use parameters, e.g. in a REST request
    POST /db/data/cypher {"query":"CREATE n={props}","params" : {"props": {"name":"Keanu Reeves"}}}
  2. If you expect large results, use streaming (HTTP header X-Stream:true; most driver libraries already support that).
  3. Remember to use application/json both as accept and content-type headers (points 1-3 are combined in the curl sketch after this list).
  4. If you want to execute multiple Cypher statements in one transaction and a single HTTP request, use the REST batch-operation call.
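
Putting the first three points together, a call against a local server could look like this sketch (default URL and port assumed):

curl -X POST http://localhost:7474/db/data/cypher \
     -H "Accept: application/json" -H "Content-Type: application/json" \
     -H "X-Stream: true" \
     -d '{"query":"CREATE n={props}","params":{"props":{"name":"Keanu Reeves"}}}'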

A batch REST operation call that uses separate Cypher statements to create two nodes and a relationship is illustrated below. This is a bit over-engineered, as usually you could do it in a single statement, but it illustrates the point for more complex operations:

POST /db/data/batch [
{"id":0, "to":"/cypher","method":"POST", "body" :
  { "query":"CREATE n={props}", "params":{"props": {"name":"Keanu Reeves"}}}},

{"id":1, "to":"/cypher","method":"POST", "body" :
  { "query":"CREATE n={props}", "params":{"props": {"title":"The Matrix"}}}},

{"id":2, "to":"/cypher","method":"POST", "body" :
  { "query":"START a=node:node_auto_index(name={actor}), m=node:node_auto_index(title={movie}) CREATE (a)-[:ACTED_IN {role}]-&gt;(m)",
    "params":{"movie": "The Matrix", "actor":"Keanu Reeves", "role":"Neo"}}}
]

A word on concurrency and locking

As Neo4j locks each node and relationship for updates on its properties, and both nodes when adding or removing relationships, make sure that concurrently running requests to the server touch different subgraphs; otherwise either lock waiting or, in the unfortunate case, a deadlock will happen (which requires you to retry the failed operation). So grouping the things to update into disjoint subgraphs is really sensible. On the other hand, if you want to lock things for single access (e.g. unique node creation), you can exploit that behavior, e.g. by deleting a non-existing property on an applicable lock node as the first thing in the query (see the sketch below).
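
A hedged sketch of that locking trick (1.9 syntax; using node 0 as the agreed-upon lock node, the property name dummy and the {name} parameter are assumptions of mine):

-- deleting a property that does not exist changes no data,
-- but it still takes the write lock on the lock node first
START lock=node(0)
DELETE lock.dummy
CREATE UNIQUE (lock)-[:SINGLETON]->(n {name:{name}})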

Outlook

Neo4j 2.0 brings a lot of changes in Cypher, enough to justify a separate blog post. Especially MERGE, node labels and automatic indexes are impressive additions. Also new in 2.0 is the transactional HTTP Cypher endpoint. The same goes for the Neo4j Shell, which is a surprisingly versatile instrument for data import. Following that I’ll cover the embedded APIs, both the transactional Java API and the batch inserter. The final posts will look at my CSV batch-importer implementation and the Geoff graph format. If there is enough interest, we can also look at importing RDF data into Neo4j.


8 Comments

  • Mirko says:

Thanks Michael. Some of these hints I had gathered elsewhere, but having a clear, systematic description of batch import as a reference is definitely handy.

  • Leo Zhao says:

Nice article. Looking forward to the rest of the series.
Is there a way to examine the transaction-commit overhead, so that an optimal transaction size could be determined?
Besides, more details on the concurrency issues would be welcome.

  • Not really, unfortunately, except by looking either at memory use for large transactions or at the time spent syncing to disk for small ones.

  • Niranjan says:

    cat import.txt | /var/lib/neo4j/bin/neo4j-shell -config conf/neo4j.properties -path /var/lib/neo4j/data/graph.db

    When I tried to run the above command, which is given in the spreadsheet, I got the following errors:

    Exception in thread “main” java.lang.UnsupportedClassVersionError: org/neo4j/shell/StartClient : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    Could not find the main class: org.neo4j.shell.StartClient. Program will exit.

  • Sounds like a version conflict in your installation. Make sure not to mix up versions.

    Best to ask on the neo4j google group: http://neo4j.org/forums

  • mike says:

    Used your import tool, it’s bloody fast: uploaded 10M nodes and 300k relationships in 95 seconds.

    Cypher is just too slow, especially when there are many variables in the query.

    I wish there was something in between that was fast while being transactional.

  • Cypher is not too slow, you can still import 10k nodes/s, and it will get faster in 2.1. Stay tuned.
    But make sure to use parameters in Cypher and also to batch the transactions, e.g. every 30-50k elements in 1.9 or 1k elements (due to a bug) in 2.0.

  • Mal says:

    Hi Michael,

    I’m having trouble with the performance of batch inserts through the REST server. I’ve been using the Neo4j REST graph DB. I’m only getting around 500 queries per 2 minutes. Here’s a link to SO as well
    Thanks!

