
LOAD CSV with SUCCESS

Posted by Michael Hunger on Oct 18, 2014 in cypher, import, neo4j

I have to admit that using our LOAD CSV facility is trickier than you and I would expect.
Several people ran into issues that they could not solve on their own.

My first blog post on LOAD CSV is still valid in its own right, and contains important aspects that I won’t repeat here, both in terms of data quality checking (broken CSV files, misspelt header names or incorrect data types) and the concern of transaction size, where PERIODIC COMMIT comes to the rescue.

To address the most frequent issues and questions, I decided to write this follow up post.

In general you might have a better experience using Neo4j Enterprise, as it contains some components that are more memory-efficient.

If you want to import much more than 10-15 million lines of data, you might consider using our non-transactional batch-insertion facilities.

Stay tuned for some new announcements from Neo Technology about a super-fast batch-insertion mechanism.

Clean and Check your CSV-Files

We ran into many issues where CSV files were simply broken. Please make sure that your files are not, otherwise you will spend hours hunting for bugs in the wrong place.

The CSV reader used by Cypher (OpenCSV) handles quotes and escaping correctly. That means if you have quotes in places where they do not belong, please escape or remove them.
Otherwise you might end up with a million lines of CSV concatenated into a single string value, just because you had a stray quoted string in one place.

Other bad things:

  • Having binary zeros (\u0000) in your file; remove them, e.g. tr < file-with-nulls.csv -d '\000' > file-without-nulls.csv

  • a UTF byte order mark (BOM) at the start of the file will trip it up; remove it

  • Escaped quotes instead of normal quotes in your cells break your file, e.g. "A title\", "An Author"; unescape them

  • Quotes in the middle of a text field will trip up the file structure, e.g. "I love :") this smiley"; escape those

  • Make sure to have no unquoted text fields containing newlines

  • Windows newlines (CRLF) can sometimes trip it up when the file is imported under a non-Windows OS; make sure to clean them up first

  • if you use non-ASCII characters (umlauts, accents, etc.), make sure to use the appropriate locale or provide the system property -Dfile.encoding=UTF8

Some tools that can help you check and fix your CSV files:

  • CSV Kit

  • CSV Lint

  • hexdump, and the hex-mode of editors like vi, emacs, UltraEdit and Notepad++

  • the tips on checking your CSV files from my last blog post

Data Conversion

When you convert data from the CSV via toInt, toFloat, split or otherwise, make sure to do it consistently in all places.
One tip that can help is to use WITH to declare the converted values as identifiers once, right after the conversion:

LOAD CSV ... AS data
WITH data, toInt(data.id) as id, extract(p IN split(data.parts,";") | toInt(p)) as partIds
CREATE (n:Node {id:id})
FOREACH (partId in partIds | CREATE (:Part {id:partId})-[:PART_OF]->(n) )

Partially addressed Issue: Eager Loading for Change Isolation

The biggest issue that people ran into, even when following the advice I gave earlier, was that for large imports of more than one million rows, Cypher ran into an out-of-memory situation.

That was not related to commit sizes, so it happened even with PERIODIC COMMIT of small batches.

The issue is that within a single Cypher statement you have to isolate changes that affect matches further on, e.g. when you CREATE nodes with a label that are suddenly matched by a later MATCH or MERGE operation.
Generating one row of results would affect other, subsequent ones in unexpected ways.

One example query that illustrates the behavior that would happen without isolation is:

MATCH (person:Person)
CREATE (clone:Person {name:"Clone of "+person.name});

If you don’t execute all the reads before all the updates, you’ll end up creating armies of clones of clones.
If you profile that query you see that there is an “Eager” step in the query plan.
That is where the “pull in all data” happens.

+-------------+------+--------+----------------+------------+
|    Operator | Rows | DbHits |    Identifiers |      Other |
+-------------+------+--------+----------------+------------+
| EmptyResult |    ? |      ? |                |            |
| UpdateGraph |    ? |      ? |          clone | CreateNode |
|     *Eager* |    ? |      ? |                |            |
| NodeByLabel |    ? |      ? | person, person |    :Person |
+-------------+------+--------+----------------+------------+

How does this affect LOAD CSV?

Cypher deals with this as follows: as soon as it detects an update operation followed by a read (or the other way round), it will execute the first operation for all rows before continuing with the second operation.
This happens by inserting an Eager operator (which you can spot in the query plan) that fetches all intermediate results from the previous step before continuing.

In normal queries, where you create at most a few (hundred) thousand nodes or relationships in one statement, that’s not an issue.
But when you deal with a CSV file with millions of rows of input, it will fill your memory with both the file contents and the created data (plus transaction state).
And as PERIODIC COMMIT is tied to the CSV lines read at the end of the statement, it is also effectively disabled.

This is not a problem if you have enough heap. I have run very complex LOAD CSV commands with several Eager operators in their execution plan over a lot of CSV data on machines with enough heap (e.g. 8, 16 or 32 GB), and pulling all intermediate state into memory was no problem there.
But you might not be able to afford such a luxury.

Don’t worry, here are some simple tips on how to avoid it:

Some Tips

  • Upgrade to 2.1.5+; Cypher has learned that a number of constructs are actually independent, so it doesn’t have to put an Eager operator between their reads and writes

  • Profile your statement upfront (you can avoid pulling actual rows of input by adding WITH data LIMIT 0); if Eager shows up, simplify your statement

  • Write only simple LOAD CSV statements if you want to save memory, and make multiple passes across the same or multiple CSV files instead (see the sketch after this list):

    • only CREATE nodes or MERGE different types of nodes in one statement

    • don’t mix MERGE of nodes and MERGE of relationships
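
For example, instead of one big statement that merges people, companies and their relationships at once, a multi-pass import could look like this (a minimal sketch; the file URL and header names are placeholders):

// pass 1: person nodes only
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "..." AS data
MERGE (:Person {name:data.name});

// pass 2: company nodes only
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "..." AS data
MERGE (:Company {name:data.company});

// pass 3: relationships only, matching the nodes created in the first two passes
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "..." AS data
MATCH (p:Person {name:data.name})
MATCH (c:Company {name:data.company})
CREATE (p)-[:WORKS_AT]->(c);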

This “Eager” step also shows up in the following LOAD CSV statement in versions before 2.1.5:

PROFILE LOAD CSV WITH HEADERS FROM "..." AS data
WITH data LIMIT 0 // limit 0 for profiling only
MATCH (p:Person {name:data.name})
MATCH (c:Company {name:data.company})
CREATE (p)-[:WORKS_AT]->(c)
Neo4j before 2.1.5
+----------------+------+--------+--------------+----------------------------------------+
|       Operator | Rows | DbHits |  Identifiers |                                  Other |
+----------------+------+--------+--------------+----------------------------------------+
|    EmptyResult |    0 |      0 |              |                                        |
|    UpdateGraph |    0 |      0 |   UNNAMED161 |                     CreateRelationship |
|       !! Eager |    0 |      0 |              |                     ! Watch this !     |
| SchemaIndex(0) |    0 |      0 |         c, c | Property(data,company); :Company(name) |
| SchemaIndex(1) |    0 |      0 |         p, p |  Property(data,name(0)); :Person(name) |
|          Slice |    0 |      0 |              |                           {  AUTOINT0} |
|        LoadCSV |    1 |      0 |         data |                                        |
+----------------+------+--------+--------------+----------------------------------------+

Fortunately, Cypher was improved in 2.1.5 to recognize that some patterns are unrelated, so it doesn’t add the Eager step for them by default.
Here is the profiler output of the same query in 2.1.5; you can see that the Eager operation is missing.

Neo4j 2.1.5+
+----------------+------+--------+--------------+----------------------------------------+
|       Operator | Rows | DbHits |  Identifiers |                                  Other |
+----------------+------+--------+--------------+----------------------------------------+
|    EmptyResult |    0 |      0 |              |                                        |
|    UpdateGraph |    0 |      0 |   UNNAMED179 |                     CreateRelationship |
| SchemaIndex(0) |    0 |      0 |         c, c | Property(data,company); :Company(name) |
| SchemaIndex(1) |    0 |      0 |         p, p |  Property(data,name(0)); :Person(name) |
|          Slice |    0 |      0 |              |                           {  AUTOINT0} |
|        LoadCSV |    1 |      0 |         data |                                        |
+----------------+------+--------+--------------+----------------------------------------+

There are some statements that are not yet covered, e.g. property updates like this:

LOAD CSV ... AS data
MATCH (n:Node {id:data.id})
SET n.value = data.value

Fixed Issue: Read Your Own Changes (2.1.5+)

Another issue that could slow down an import was a read-your-own-writes problem in Neo4j (recently fixed in 2.1.5) when using a statement like the one below.
It happened especially when you had schema indexes in place to speed up your node lookups by label and value.

CREATE INDEX ON :Person(name);
CREATE INDEX ON :Company(name);
...
MATCH (p:Person {name:"John"}),(c:Company {name:"ACME"})
CREATE (p)-[:WORKS_AT]->(c);

The reason for that issue was that the overlaid transaction-state check for index lookups (i.e. for potential node changes that affect index results, like added or removed labels and properties) also checked nodes where other aspects had changed (e.g. relationships added).
That check also did not take labels into account.
So the more relationships you created, the more nodes it had to scan.
That’s why PERIODIC COMMIT with a small transaction size (100 or 1000) helped.
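
On versions before the fix, the practical mitigation was exactly that: keep the batch size small. A minimal sketch of the relationship-import statement with a small periodic commit (the file URL is a placeholder):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "..." AS data
MATCH (p:Person {name:data.name})
MATCH (c:Company {name:data.company})
CREATE (p)-[:WORKS_AT]->(c);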

Avoid Windows for Import

Due to a variety of reasons, disk and memory-mapping operations on Windows are much slower than on Linux and Mac.
This might not be so apparent in day-to-day operations with Neo4j but for imports where every millisecond counts, it quickly adds up and becomes a bottleneck.
So even if you just grab a live-boot CD, an AWS or DigitalOcean instance (better with SSD) or your friend’s Linux machine, you’ll be happier.

Use the Shell, Luke

The Neo4j-Shell is most helpful when importing data, as you can point it to different test-database directories (-path test.db), kill it with Ctrl-C and run several instances in parallel (on different databases).
You can also supply a config file where you adapted the memory mapping sizes to fit your projected store sizes (-config conf/neo4j.properties).
And you can load commands from a file (-file import.cyp), no need for tedious copy & paste.

You find the neo4j-shell (or Neo4jShell.bat) script in your path/to/neo4j/bin and you can run it from anywhere.
If you have a server running and don’t provide the -path parameter, it will connect to the running server (if you didn’t disable the remote shell).
For Windows users that installed the database via the graphical installer, my colleague Mark explained the steps to access the Neo4j-Shell.

There is only one caveat: if you run neo4j-shell without a server, you have to provide it with more RAM for the import.

You can do that by setting an environment variable: export JAVA_OPTS="-Xmx4G -Xms4G -Xmn1G". For machines with more RAM you can increase that to 8 or 16 GB, but to no more than a quarter of your total RAM.

For really large imports, you should use the remainder of your RAM for memory mapping, sized according to the expected node, relationship and property counts.
In the file you provide to the shell via -config conf/neo4j.properties:

# eg. for 25M nodes, 250M relationships, total 10.4G, with 4G heap and 2G OS of 16GB total
# 15 bytes per node
neostore.nodestore.db.mapped_memory=400M
# 35 bytes per rel
neostore.relationshipstore.db.mapped_memory=7G
# 42 bytes per property
neostore.propertystore.db.mapped_memory=2G
# long strings, chopped up into 60 char segments
neostore.propertystore.db.strings.mapped_memory=1G
# arrays if needed
#neostore.propertystore.db.arrays.mapped_memory=100M
export JAVA_OPTS="-Xmx4G -Xms4G -Xmn1G"
path/to/neo4j/bin/neo4j-shell -path import-test.db -config path/to/neo4j/conf/neo4j.properties -file import-test.cyp

Need Help? We’re there

If you have any questions regarding importing data into Neo4j, don’t worry, we can help you quickly.

 

Flexible Neo4j Batch Import with Groovy

Posted by Michael Hunger on Oct 9, 2014 in import, neo4j

You might have data as CSV files to create nodes and relationships from in your Neo4j Graph Database.
It might be a lot of data, like many tens of millions of lines.
Too much for LOAD CSV to handle transactionally.

Usually you can just fire up my batch-importer and prepare node and relationship files that adhere to its input format requirements.

Your Requirements

There are some things you probably want to do differently than the batch-importer does by default:

  • not create legacy indexes

  • not index properties at all that you only need for connecting data

  • create schema indexes

  • skip certain columns

  • rename properties from the column names

  • create your own labels based on the data in the row

  • convert column values into Neo4j types (e.g. split strings or parse JSON)

Batch Inserter API

Here is where you, even as a non-Java developer, can write a few lines of code and make it happen.
The Neo4j Batch-Inserter APIs are fast and simple; you get very far with just a few main methods:

  • inserter.createNode(properties, labels) → node-id

  • inserter.createRelationship(fromId, toId, type, properties) → rel-id

  • inserter.createDeferredSchemaIndex(label).on(property).create()

Demo Data

For our demo we want to import articles, authors and citations from the MS citation database.

The file looks like this:

author   title    date
Max      Matches  2012-01-01
Mark     Clojure  2013-05-21
Michael  Forests  2014-02-03

Setup with Groovy

To keep the “Java”-ness of this exercise to a minimum, I chose Groovy, a dynamic JVM language that feels closer to Ruby and JavaScript than to Java.
You can run Groovy programs as scripts, much like in other scripting languages.
More importantly, it comes with a cool feature that allows us to run the whole thing without setting up a build configuration or installing dependencies manually.

If you want to, you can also do the same in JRuby, Jython, JavaScript (Rhino on Java 7, Nashorn on Java 8) or Lisp (Clojure).
I would love to get a variant of the program in those languages.

Make sure you have a Java Development Kit (JDK) and Groovy installed.

We need two dependencies, a CSV reader (GroovyCSV) and Neo4j, and can just declare them with a @Grab annotation and import the classes into scope. (Thanks to Stefan for the tip.)

@Grab('com.xlson.groovycsv:groovycsv:1.0')
@Grab('org.neo4j:neo4j:2.1.4')
import static com.xlson.groovycsv.CsvParser.parseCsv
import org.neo4j.graphdb.*

Then we create a batch-inserter instance, which we have to make sure to shut down at the end, otherwise our store will not be valid.

The CSV reading is a simple one-liner. Here is a quick example; more details on the versatile configuration are in the [API docs].

csv = new File("articles.csv").newReader()
for (line in parseCsv(csv)) {
   println "Author: $line.author, Title: $line.title Date $line.date"
}

One trick we want to employ is keeping our authors unique by name, so even if they appear on multiple lines, we only want to create them once and then keep them around for the next time
they are referenced.

Note
To keep the example simple we just use a Map; if you have to save memory, you can look into a more efficient data structure and a two-pass approach.
My recommendation would be a sorted name array where the array-index equals the node-id, so you can use Arrays.binarySearch(authors,name) on it to find the node-id of the author.

We define two enums, one for labels and one for relationship-types.

enum Labels implements Label { Author, Article }
enum Types implements RelationshipType { WROTE }

So when reading our data, we now check if we already know the author; if not, we create the Author node and cache its node-id by name.
Then we create the Article node and connect both with a WROTE relationship.

  // batch (the BatchInserter), authors (a name -> node-id map), format (a date format),
  // NO_PROPS and trace() are set up earlier in the full script
  for (line in parseCsv(csv)) {
     name = line.author
     // create each author only once and cache its node-id by name
     if (!authors[name]) {
        authors[name] = batch.createNode([name:name], Labels.Author)
     }
     // parse the date into a timestamp and create the article node
     date = format.parse(line.date).time
     article = batch.createNode([title:line.title, date:date], Labels.Article)
     // connect the author to the article
     batch.createRelationship(authors[name], article, Types.WROTE, NO_PROPS)
     trace()
  }

And that’s it.

I ran the data import with the Kaggle Authorship Competition dataset, which contains 12M Author→Paper relationships.

The import used a slightly modified script that took care of the three files for authors, articles and the mapping between them.
It loaded the 12M rows, 1.8M authors and 1.2M articles in 175 seconds, taking about 1-2 seconds per 100k rows.

groovy import_kaggle.groovy papers.db ~/Downloads/kaggle-author-paper

Total 11.160.348 rows 1.868.412 Authors and 1.172.020 Papers took 174.122 seconds.

You can find the full script in this GitHub gist.

If you’re interested in meeting me, more treats (a full-day training, a hackathon) or graph database fun in general,
don’t hesitate to join us on Oct 22 2014 in San Francisco for this year’s GraphConnect conference for all things Neo4j.

 

LOAD CSV into Neo4j quickly and successfully

Posted by Michael Hunger on Jun 25, 2014 in cypher, import

Since version 2.1, Neo4j provides out-of-the-box support for CSV ingestion. The LOAD CSV command that was added to the Cypher query language is a versatile and powerful ETL tool.
It allows you to ingest CSV data from any URL into a friendly parameter stream for your simple or complex graph update operations.

But hear my words of advice before you jump directly into using it. There are some tweaks and configuration aspects that you should know to be successful on the first run.

Data volume: LOAD CSV was built to support around 1M rows per import; it still works with 10M rows, but you have to wait a bit; at 100M it will try your patience.
Except for tiny datasets, never run it without the safeguard of periodic commits, which prevent large transactions from overflowing your available database memory (JVM heap).

The CSV used in this example is pretty basic, but enough to show some issues and make a point: it’s people and the companies they work(ed) for.

PersonName,"Company Name",year
"Kenny Bastani","Neo Technology",2013
"Michael Hunger","Neo Technology",2010
"James Ward","Heroku",2011
"Someone",,
"John","Doe.com","ninetynine"

Read more…

 

Rendering a Neo4j Database in UbiGraph

Posted by Michael Hunger on Jun 23, 2014 in cypher, server

I had never heard of UbiGraph before, but this tweet by @a61dr41n made me curious.

So I checked it out. UbiGraph is a graph rendering server that is controlled remotely and interactively via an XML-RPC API (which is a weird choice).
It comes with example clients in Java, Python, Ruby and C.
You can download it from here. After unzipping the file and starting bin/ubigraph_server &, you should see a black window rendering the void, waiting for your commands.

Read more…

 

Presentation: “Using AsciiArt to Analyse your SourceCode with Neo4j and OSS Tools” at GeekOut.ee 2014

Posted by Michael Hunger on Jun 15, 2014 in conference, neo4j, programming languages

During the awesome GeekOut conference organized by my friends at ZeroTurnaround I was asked to stand in for Tim Fox who couldn’t come.

So instead of using an existing presentation, I decided to finally write one up overnight that covers one aspect of graph databases that is close to my heart:

Software Analytics with Graphs

When I first learned about Neo4j in 2008, my first project was pulling Java class-file information into Neo4j to find interesting tidbits about the JDK. Fast forward four years.

Other things kept me busy until 2012, when I was speaking at an InnoQ tech day and thought this would be a good topic to talk about.

I was so amazed by the projects that others did in this area that I published a blog post on "Graph Databases and Software Metrics" to show what I had found. These were:

  • Raoul-Gabriel Urma: Expressive and Scalable Source Code Queries with Graph Databases (Paper)
  • Rickard Öberg: NeoMVN is tracing maven dependencies (GitHub)
  • Pavlo Baron: Graphlr, an ANTLR storage in Neo4j (GitHub)

During a two-hour train ride with my friend Dirk from Buschmais, he got a full load of my excitement about this topic, and he saw a really good practical use for it in his daily work with large software projects. Having your project’s structure in a graph allows you to:

  1. Query the graph structures for insights on the code level (e.g. code-smells)
  2. Enrich the graph structure with higher level, technical, architectural and business concepts
  3. Define rules and metrics based on those higher level concepts.
  4. Run the parsing, enrichment, metrics computation and rule checking as part of your build process, generating reports and failing the build in case of violations of those rules

All those ideas resulted in an impressive open-source project called jQAssistant which does all of the above (and much more).

So back to my GeekOut presentation. I sat down late at night until 3am and wrote it up in AsciiDoc (+ deck.js), so you can fork it from the GitHub repo, download the PDF or view the HTML slides online.

The session has been recorded; I’ll embed the video as soon as it is online. Then you can even listen to my hoarse voice.

 

Styling Neo4j Server Visualisation

Posted by Michael Hunger on Jun 3, 2014 in neo4j, server


To give you a head start when using Neo4j-Browser I wanted to share these quick tips for styling and querying.


Read more…

 

Using LOAD CSV to import Git History into Neo4j

Posted by Michael Hunger on Jun 1, 2014 in cypher, neo4j

In this blog post, I want to show the power of LOAD CSV, which is much more than just a simple data ingestion clause for Neo4j’s Cypher.
I want to demonstrate how easy it is to use by importing a project’s git commit history into Neo4j. For demonstration purposes, I use Neo4j’s repository on GitHub, which contains
about 27000 commits.

It all started with this tweet by Paul Horn, a developer from Avantgarde Labs in my lovely Dresden.

I really liked the idea and wanted to take a look. His python script takes the following approach:

Read more…

 

Importing Forests into Neo4j

Posted by Michael Hunger on Apr 10, 2014 in cypher, neo4j

Sometimes you don’t see the forest for the trees. But if you do, you probably use a graph database.

Giant Tree

Trees are one of the simpler graph data structures, a special case of directed acyclic graphs (DAGs).

For our example we use a time-tree that we want to import into the database.

Data Volume

A quick Soulver calculation (thanks, Mark) later, we know how many nodes and rels (nodes - 1) we will have to import to represent a full year down to the second level.

1 year = 12 months = 365 days = 8.760 hours = 525.600 minutes = 31.536.000 seconds

So we have to import about 32M nodes and 32M relationships. Sounds good enough.
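
To make the structure concrete, here is a minimal Cypher sketch of such a time tree down to the day level (not the actual import from this post, and simplified to 31 days for every month):

FOREACH (y IN [2014] |
  MERGE (year:Year {value:y})
  FOREACH (m IN range(1,12) |
    // create the month and attach it to its year
    CREATE (month:Month {value:m})
    MERGE (year)-[:HAS_MONTH]->(month)
    FOREACH (d IN range(1,31) |
      // create the day and attach it to its month
      CREATE (day:Day {value:d})
      MERGE (month)-[:HAS_DAY]->(day))));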

Read more…

 

Sampling A Neo4j Database

Posted by Michael Hunger on Mar 25, 2014 in cypher, neo4j

After reading the interesting blog post by my colleague Rik van Bruggen on "Media, Politics and Graphs", I thought it would be really cool to render it as a GraphGist, especially as he had already shared all the queries as a GitHub Gist.


Unfortunately the dataset was a bit large for a sensible GraphGist representation, so I thought about means of extracting a smaller sample of his raw data that he made available (see his blog post for the link).

Read more…

 

Quickly create a 100k Neo4j graph data model with Cypher only

Posted by Michael Hunger on Mar 21, 2014 in cypher, neo4j

We want to run some test queries on an existing graph model but have no sample data at hand, and also no input files (CSV, GraphML) that would provide it.

Why not quickly create it on our own just using Cypher? First I thought about using Cypher to generate CSV files and loading them back, but it is much easier than that.

The domain is simple, (:User)-[:OWN]→(:Product), but good enough for collaborative filtering or demographic analysis.
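
As a rough illustration of the idea (a sketch with made-up counts, not necessarily the post’s exact statements), Cypher alone can generate the users, products and random OWN relationships:

// create users and products
FOREACH (i IN range(1,1000) | CREATE (:User {id:i}));
FOREACH (i IN range(1,100) | CREATE (:Product {id:i}));

// connect each user to roughly one in a hundred products at random
MATCH (u:User), (p:Product)
WHERE rand() < 0.01
CREATE (u)-[:OWN]->(p);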

Read more…
