LOAD CSV into Neo4j quickly and successfully

Posted by Michael Hunger on Jun 25, 2014 in cypher, import |

Since version 2.1, Neo4j provides out-of-the-box support for CSV ingestion. The LOAD CSV command that was added to the Cypher query language is a versatile and powerful ETL tool.
It allows you to ingest CSV data from any URL into a friendly parameter stream for your simple or complex graph update operations.

But hear my words of advice before you jump directly into using it. There are some tweaks and configuration aspects that you should know about to be successful on the first run.

Data volume: LOAD CSV was built to support around 1M rows per import. It still works with 10M rows, but you have to wait a bit; at 100M it’ll try your patience.
Except for tiny datasets, never run it without the safeguard of periodic commits, which prevent large transactions from overflowing your available database memory (JVM heap).

The CSV used in this example is pretty basic, but enough to show some issues and make a point: it’s people and the companies they work(ed) for.

PersonName,"Company Name",year
"Kenny Bastani","Neo Technology",2013
"Michael Hunger","Neo Technology",2010
"James Ward","Heroku",2011
"Someone",,
"John","Doe.com","ninetynine"

OS, RAM and DISK

First make sure the machine you want to import on is fit to import the data. That means you should have some RAM available: if your 8GB machine only sports 1GB of free RAM and already uses 3GB of swap, that is a sure sign you shouldn’t try. The more RAM the better; I’d recommend reserving at least 4GB to import your data.

If you run on SSDs you’re lucky; on spinning disks it will definitely take longer.

Note
If you are on Windows, take extra care. On Windows the database memory (JVM heap) is used for BOTH Neo4j’s objects & caches AND the memory-mapping of the storage files. On Unix and OSX the memory mapping resides outside the heap.
So make sure that on Windows you have enough Java memory allocated to cater for both aspects.

Memory Config

For your memory configuration, a sane starting point is to give Neo4j 3-4GB RAM as Java heap (including the memory mapping size on Windows), and a sensible amount (e.g. 1GB+) as memory mapping.
The more memory you have the faster it will import your data.

So make sure to edit conf/neo4j-wrapper.conf and set:

# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096

In conf/neo4j.properties set:

# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M

Data Loading

If you load your CSV file over the network, make sure it is fast enough to sustain the ingestion rate you’d like to have. Otherwise, if possible, download it and use a file:// URL.

Note
File URLs are tricky. On OSX and Unix use file:///path/to/data.csv; on Windows, use file:c:/path/to/data.csv. Beware of spaces in file names and relative paths. HTTP URLs are much easier.
As far as I know, LOAD CSV supports cookies, redirects and HTTPS.
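A quick sanity check of a file URL might look like this (the path is hypothetical; adjust it to your system, and use the file:c:/… form on Windows):

LOAD CSV FROM "file:///path/to/people.csv" AS line
RETURN line
LIMIT 3;

If this returns the first few rows as expected, the URL is good.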

Data Quality

There are some challenges when loading CSV data from “organic” files.

  1. You might not load the right file. On a local filesystem Cypher will complain if it can’t find it, but will happily ingest your latest movie-trilogy while trying to find the first comma.
For remote URLs it will fail if the server returns a 404, but otherwise it will happily try to load HTML error pages, which might be fun.

  2. If you have the right file, make sure that it loads correctly, i.e. all the columns and lines are separated by delimiters to your liking. You can easily check that by sampling the first few lines and seeing that they are returned correctly.

Initial check of the first few lines

Does everything look OK? Anything that looks dubious? Fix it in your input data!

LOAD CSV FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/a147631215456d86a77edc7ec08c128b9ef05e3b/people_leading_empty.csv"
  AS line
WITH line LIMIT 4
RETURN line
+--------------------------------------------+
| line                                       |
+--------------------------------------------+
| [""]                                       |
| ["PersonName","Company Name","year"]       |
| ["Kenny Bastani","Neo Technology","2013"]  |
| ["Michael Hunger","Neo Technology","2010"] |
+--------------------------------------------+

Caveats are: a leading blank line, wrong or doubled line breaks.

Check for correct columns

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/a147631215456d86a77edc7ec08c128b9ef05e3b/people_leading_empty.csv"
  AS line
WITH line LIMIT 5
RETURN line.PersonName, line.Company, line.company, line.`Company Name`
+---------------------------------------------------------------------+
| line.PersonName | line.Company | line.company | line.`Company Name` |
+---------------------------------------------------------------------+
| <null>          | <null>       | <null>       | <null>              |
| <null>          | <null>       | <null>       | <null>              |
| <null>          | <null>       | <null>       | <null>              |
| <null>          | <null>       | <null>       | <null>              |
| <null>          | <null>       | <null>       | <null>              |
+---------------------------------------------------------------------+

A leading empty line will break WITH HEADERS.

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
WITH line LIMIT 5
RETURN line.PersonName, line.Company, line.company, line.`Company Name`
+--------------------------------------------------------------------------+
| line.PersonName      | line.Company | line.company | line.`Company Name` |
+--------------------------------------------------------------------------+
| "Kenny Bastani"      | <null>       | <null>       | "Neo Technology"    |
| "Michael Hunger"     | <null>       | <null>       | "Neo Technology"    |
| "James Ward"         | <null>       | <null>       | "Heroku"            |
| "Matthew McCullough" | <null>       | <null>       | "GitHub"            |
| "Someone"            | <null>       | <null>       | ""                  |
+--------------------------------------------------------------------------+

This is better, but there are still some misspelled column names.

Note
Misspelled column names (they are case-sensitive) return null, and empty columns are also treated as null.

You can skip null values by adding a

WHERE line.value IS NOT NULL after the LOAD CSV .. WITH line

If you don’t, you might end up with errors in Neo4j’s indexing subsystem when it tries to index null values.

If you want, you can alternatively provide defaults with

coalesce(value_or_null, value2_or_null, …, default_value).

Filter out NULL values

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
WITH line

WHERE line.PersonName IS NOT NULL

RETURN line.PersonName, coalesce(line.Company,"None")
+------------------------------------------------------+
| line.PersonName      | coalesce(line.Company,"None") |
+------------------------------------------------------+
| "Kenny Bastani"      | "None"                        |
| "Michael Hunger"     | "None"                        |
| "James Ward"         | "None"                        |
| "Matthew McCullough" | "None"                        |
| "Someone"            | "None"                        |
| "John"               | "None"                        |
+------------------------------------------------------+

Data Conversion

Cypher reads all CSV columns as strings by default. You can use conversion functions like toInt(str) and toFloat(str), boolean expressions, split(str, " ") and substring(str,start,count) (e.g. for extracting day, month and year from a date string). Note that the conversion functions return null if they can’t convert, so use the null handling from above to deal with unclean data.

Note: Make sure to use the same conversion for the same value in all places: both when creating nodes and relationships, and when using the CSV data to look them up again.
A helpful tip is to do the conversion upfront with WITH toInt(line.number) AS number

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line

WITH line, toInt(line.year) as year
RETURN line.PersonName, year
+-------------------------------+
| line.PersonName      | year   |
+-------------------------------+
| "Kenny Bastani"      | 2013   |
| "Michael Hunger"     | 2010   |
| "James Ward"         | 2011   |
| "Matthew McCullough" | 2010   |
| "Someone"            | <null> |
| "John"               | <null> |
+-------------------------------+
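The date extraction mentioned above can be sketched like this; the file URL, the name and date columns, and the "2014-06-25" date format are hypothetical:

LOAD CSV WITH HEADERS FROM "file:///events.csv" AS line
WITH line, split(line.date, "-") AS parts
RETURN line.name,
       toInt(parts[0]) AS year,
       toInt(parts[1]) AS month,
       toInt(parts[2]) AS day;

As with toInt(line.year) above, any malformed date parts simply come back as null.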

Field Separation

You can specify alternative field separators for your data. For a tab-separated file (.tsv) use … AS line FIELDTERMINATOR '\t', for semicolons use … AS line FIELDTERMINATOR ';'.
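For example, loading a tab-separated file would look like this (the file URL is hypothetical):

LOAD CSV WITH HEADERS FROM "file:///people.tsv" AS line FIELDTERMINATOR '\t'
RETURN line
LIMIT 5;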

Batch Your Transactions

This is really important

If you import more than 100k elements in a single transaction, it is very likely (depending on your available memory) that you’ll fail. And it might not be a quick death of your operation; it can drag on for quite a while, desperately trying to recover memory to store its intermediate transaction data.

So make sure to ALWAYS prefix your LOAD CSV with USING PERIODIC COMMIT 1000. The number given is the number of input rows after which a commit of the imported data happens. Depending on the complexity of your import operation, you might create from 100 elements per 1000 rows (if you have a lot of duplicates) up to 100,000 when you have complex operations that generate up to 100 nodes and relationships per row of input.
That’s why a commit size of 1000 might be a safe bet (or not).

There is also an issue within Neo4j’s index operations that makes it work better with smaller commit sizes.
If you use LOAD CSV without any create or update operation, you cannot use PERIODIC COMMIT. If you use it from within an embedded Java application, make sure not to start a manual transaction beforehand.

Batch Transactions after every 1000 Rows

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
CREATE (:Person {name:line.PersonName})
Nodes created: 6
Properties set: 6

Indexing

Usually you import your nodes and relationships in one step, creating connections as you process each line. As most CSV files represent denormalized tables, you’ll get a lot of duplication in them, especially for the joined entity tables (which will probably be converted to nodes). So you want to use either MERGE or MATCH to look up nodes in your graph database, to connect or update them in a later step.

Make sure you create the necessary indexes and constraints upfront, so that both operations can utilize them and look up values really quickly.

Creating Indexes and Constraints Upfront

This example shows a multi-step import where people are created upfront from one source, and then later only looked up to connect them to merged Companies.

CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
CREATE (p:Person {name:line.PersonName});

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
MATCH (p:Person {name:line.PersonName})
MERGE (c:Company {name:line.`Company Name`})
CREATE (p)-[:WORKS_FOR]->(c);

Test First

Use the same sampling approach as before, but now take only the first 10k or 100k lines and import them. Try importing into a throwaway test database with the neo4j-shell (see below).
If that goes well, remove the added data or clean the database (by deleting the db directory).

Use the Neo4j-Shell for larger Imports

The Neo4j Browser is great for quick queries and playing around with your import data, but if you really want to import millions of rows, go with the Neo4j shell.

If you downloaded the zip distribution of Neo4j, the shell can be found in bin/neo4j-shell (bin\Neo4jShell.bat on Windows). By default it connects to a running server but you can also specify a dedicated database directory with the -path people.db parameter. With -config conf/neo4j.properties you’d use the same config as the Neo4j server for that db-directory.

For importing lots of data you should probably edit the shell script and change the line EXTRA_JVM_ARGUMENTS="-Xmx4G -Xms4G -Xmn1G" to provide sensible Java heap settings.
You can add your import statements (including index creation) to a Cypher script and execute it with -file import.cql. Don’t forget the semicolons at the end of your statements.

If you run into errors, you might try export STACKTRACES=true and re-run the command to have a good error message to share when asking on StackOverflow.

The Import Query in all its Beauty

Clean out the database for my final import.

MATCH (n)
WITH n LIMIT 10000
OPTIONAL MATCH (n)-[r]->()
DELETE n,r;
CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/1bd8c19bf8b49d9eb7149918cc11a34faf996dd8/people.tsv"
  AS line FIELDTERMINATOR '\t'

WITH line, toInt(line.year) as year, coalesce(line.`Company Name`,"None") as company

WHERE year IS NOT NULL

MERGE (p:Person {name:line.PersonName})
MERGE (c:Company {name:company})
CREATE (p)-[r:WORKS_FOR {since:year}]->(c)
RETURN p,r,c

Ready to go?

That’s it. If you hit any issue I haven’t covered here, please don’t hesitate to reach out to me, either by commenting below or dropping me an email to michael at neo4j.org

Real World Example

I just went to http://www.mockaroo.com to generate a CSV sample: a last-name and country tuple, 1M lines of them (18MB of data), which I put in my public Dropbox folder.

The data looks like this.

last_name,country
Fuller,Saint Vincent and the Grenadines
Lynch,Israel
Crawford,Iceland
Fowler,Belgium
... 1M lines ...
Walker,Mali
Wilson,Turkey
Hart,Saint Vincent and the Grenadines
Fowler,Nigeria

Then I checked the CSV as outlined above, and used bin/neo4j-shell -path people.db -config conf/neo4j.properties -file import-names.cql

CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/name_country.csv"
  AS line

WITH line
WHERE line.last_name IS NOT NULL and line.country IS NOT NULL

MERGE (p:Person {name:line.last_name})
MERGE (c:Country {name:line.country})
CREATE (p)-[r:LIVES_IN]->(c);

The output is shown here; you can also look at the full log.

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 499
Relationships created: 1000000
Properties set: 499
Labels added: 499
119200 ms

27 Comments

  • Tom Nishim says:

    So does this information supersede the use of your batch-import program on GitHub? Also, you say that 1M rows is a good amount for LOAD CSV, but 10M or higher is pushing my luck… why does it matter if we are committing every 1000 rows? And are you suggesting that, if we have a 10M-row file, we split it into ten 1M-row files and run LOAD CSV ten times?
    Thank you,
    Tom

  • The batch importer is still way faster than load csv but less convenient to use and less powerful in terms of graph structure that it can create.
    Also there are some shortcomings in terms of eagerly fetching data before starting to create anything (match before create) as you don’t want to match on things that are created later.

  • MauroD says:

    Thanks for this!

    Amazing.
    I would like to know better cypher, why don’t you write a cypher cookbook ?

  • Max says:

    Hi Michael — thanks for the post, really informative!

    I’m trying to use LOAD CSV to create nodes with the labels being set to values from the CSV. Is that possible? I’m trying something like:


    LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
    CREATE (x:line.class)

    …but I get an invalid syntax error. Is there any way to do this?

  • Good question, thanks! There is already one out (“learning cypher” from packt), and I wrote actually a (german) book on Neo4j 2.0 and worked on the cypher ref-card and the dzone refcard :)

  • [...] first blog post on LOAD CSV is still valid in it own right, and contains important aspects that I won’t repeat here. Both [...]

  • [...] LOAD CSV into Neo4j quickly and successfully by Michael Hunger on Jun 25, 2014. Note: You can also read an interactive and live version of this blog post as a Neo4j GraphGist. [...]

  • Felix says:

    Hi! Thanks for this tutorial. It helped me a lot already. I wonder what values you would recommend for the heap and the mapped memory, if RAM isn’t really a concern (64 GB available on an Ubuntu system)?

  • I think I’d recommend at most 16G heap and the rest for memory mapping

  • Ehren Headley says:

    Thank you for writing this! it has helped me so much!

    I was running neo4j on my windows machine, but recently we set it up on a linux box. My old LOAD CSV steps no longer work, is there any difference in the cypher when a linux box is trying to reach out to a windows machine?

    Thanks!

  • There should be no difference, best ask with more details on StackOverflow, or our Google Group, you’ll get quick help there.

  • Amit says:

    Running all the imports via a single script, fed into neo4j-console.

    Any way to spit out lines to the console for debugging during the import, so I can check on progress and how far a script gets, before an error?

    Been searching around, and oddly, can’t find anything….

  • Best join neo4j.com/slack

    There is no progress output from LOAD CSV
    if you run it with periodic commit you can check in a separate session the count of nodes and rels already created.

  • Kent says:

    I’m copying Max’s comment because I’m having the same problem. I’m using Windows, have the file on a local drive (can’t upload files because of the company firewalls).

    I’m trying to use LOAD CSV to create nodes with the labels being set to values from the CSV. Is that possible? I’m trying something like:

    LOAD CSV WITH HEADERS FROM ‘file:///testfile.csv’ AS line

    CREATE (x:line.class)

    I wrote the following:
    LOAD CSV WITH HEADERS FROM “C:/Users/gladstone/Neo4j/Docs/Report2.cvs” AS line WITH line
    RETURN line
    LIMIT 5;

    I’m getting a cryptic error: org/parboiled/errors/ErrorUtils

    Suggestions?

  • zarina says:

    Hi,

    I have a csv file that has ~280 columns, i want to create a node with all the columns as its properties, is it possible to do it automatically.

  • Hi Michael,

    Thanks for this wonderful post. I had read on SO that the batch importer is good with > 1M edges and that LOAD CSV can be used with smaller networks. Your blog is really helpful, as I am not able to see neo4j-import in my Neo4j directory for some reason.

    Also the newer versions of Neo4j insist on using PowerShell on Windows, which is giving me some problems.

    Your suggestions to modify the conf/properties files really bailed me out. Thanks!

  • LOAD CSV works with 10M-100M
    Batch Importer tools are good for 10M-xxBN nodes

    You can also join neo4j.com/slack for more support.

  • Try these.

    LOAD CSV WITH HEADERS FROM "url" AS row
    CREATE (n:Foo) SET n = row

    LOAD CSV WITH HEADERS FROM "url" AS row
    MERGE (n:Foo {id:row.id}) SET n += row

  • If the server is running remotely and your files are only locally you can try to serve them from a http server and access that from your neo4j server.

    Otherwise you could try to check out: http://neo4j-csv-import.herokuapp.com/

  • Zarina says:

    Thank you Michael, your reply helped.

    I have another question,

    I have datafile ABC.csv with header, in the following format

    Company,Sample1,Sample2,Sample3….
    abc12,a:23:b,z:34:r,v:45:b …..

    How can I have nodes Company and Sample with unique ids abc12, Sample1, Sample2, Sample3 etc.?

    How can I create a relations between

    1.Company with id abc12 and Sample with id Sample1 with a property AD with value “a:23:b”
    2.Company with id abc12 and Sample with id Sample2 with a property AD with value “z:34:r”
    … etc

    with a generic script, I have thousands of samples and over a million Companies in my db.

  • Kevin P. McGinn says:

    I am new to neo4j; trying to port data from a relational database into neo4j. For now I only want to load and update node data. My initial import is no issue: 37M rows in 8 minutes. Where I am going off the rails is with the daily updates using LOAD CSV. Each of those daily files contains both new and updated data. For new data I simply want to add a node; for updated data, I want to replace the existing node. New inputs seem to work fine. But with updates I have an Eager issue no matter what permutation of load logic I use, which causes the load to be very slow or run out of heap space.

    I have a node labeled as BASEaccounts. This node contains 4M data elements. The update file contains 500K data elements that are to be updated. It seemed that a simple statement:

    LOAD CSV WITH HEADERS FROM 'file:/export/warehouse/DW/accounts_testData.txt' AS row FIELDTERMINATOR '\t' WITH row LIMIT 0 MATCH (n:BASEaccounts { AccountsID: row.AccountsID }) WITH n, row SET n = row;

    Should suffice for the update. But, the unique index on the AccountsID of PASSPOaccounts is ignored. This runs until heap space expires. The Centos server I am using has 24G of ram.

  • [...] including memory allocation, batch imports, and data filtration, please read Michael Hunger’s blog. I admittedly am no Python or Cypher guru, so if you see mistakes or would like to make a [...]

  • Ahmed Nahar says:

    Hi Michael,
    Thanks a lot for this useful post. My problem is that the more data I load from new CSV files, the slower Neo4j is in responding… it gets to the point that while writing command statements I need to wait a few seconds to see the full statement.

  • Bharat says:

    Thank you for your post. I am using Neo4j Community Edition 3.0.6. I have a CSV file on my local machine and am unable to map the file to upload it. Can you advise how I can load it? Do I need to set up a root folder? If so, how?

    Thanks,

    Bharat Bhanushali

  • You have to share more detail, best on StackOverflow or on neo4j.com/slack

  • 1. create a constraint on your unique keys
    2. use MERGE which is GET-OR-CREATE
    3. either use SET after merge, or better ON CREATE SET and ON MATCH SET to do conditional updates
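
    A sketch of that pattern (the label, key, and CSV file here are hypothetical):

    USING PERIODIC COMMIT 1000
    LOAD CSV WITH HEADERS FROM "file:///accounts.csv" AS row
    MERGE (n:Account {id: row.id})
    ON CREATE SET n += row, n.created = timestamp()
    ON MATCH SET n += row, n.updated = timestamp();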

  • Not sure I understand, you would just match the two nodes and connect them.
    You can also set the samples as properties on your company nodes, depends on what you want to do.

    If you can share more detail, best on StackOverflow or on neo4j.com/slack


Copyright © 2007-2016 Better Software Development All rights reserved.