Natural Language Analytics made simple and visual with Neo4j

Posted by Michael Hunger on Jan 8, 2015 in cypher, fun

I was really impressed by this blog post on Summarizing Opinions with a Graph from Max and always waited for Part 2 to show up :)

The blog post explains a really interesting approach by Kavita Ganesan which uses a graph representation of the sentences in review content to extract the most significant statements about a product.

[Figure: opiniosis overview]

Each word of a sentence is represented by a shared node in the graph. The order of the words is reflected by relationships pointing to the next word, and each relationship carries the sentence id and the position of the leading word.

By just looking at the graph structure, it turns out that the most significant statements (positive or negative) are repeated across many reviews.
Differences in formulation or inserted fill words only affect the graph structure minimally but reinforce it for the parts where they overlap.
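To make the structure concrete, here is a minimal in-memory sketch in Java. It is not the Cypher statement from the full post and not the original Opinosis implementation; the class and method names (`WordGraph`, `Occurrence`, `sentenceCount`) are made up for illustration, and tokenization is simplified to lower-casing and splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: words are shared nodes, "next word" links carry
// the sentence id and the position of the leading word.
class WordGraph {
    static final class Occurrence {
        final int sentenceId;
        final int position; // index of the leading word within its sentence
        Occurrence(int sentenceId, int position) {
            this.sentenceId = sentenceId;
            this.position = position;
        }
    }

    // leading word -> following word -> all occurrences of that link
    private final Map<String, Map<String, List<Occurrence>>> next = new LinkedHashMap<>();

    void addSentence(int sentenceId, String sentence) {
        String[] words = sentence.trim().toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            next.computeIfAbsent(words[i], k -> new LinkedHashMap<>())
                .computeIfAbsent(words[i + 1], k -> new ArrayList<>())
                .add(new Occurrence(sentenceId, i));
        }
    }

    List<Occurrence> occurrences(String from, String to) {
        Map<String, List<Occurrence>> targets = next.get(from);
        if (targets == null || !targets.containsKey(to)) {
            return new ArrayList<>();
        }
        return targets.get(to);
    }

    // Number of distinct sentences that contain the link from -> to.
    int sentenceCount(String from, String to) {
        Set<Integer> ids = new HashSet<>();
        for (Occurrence o : occurrences(from, to)) {
            ids.add(o.sentenceId);
        }
        return ids.size();
    }
}
```

Links that many sentences share, such as "battery" followed by "life" across several reviews, end up with a high sentence count, which is the reinforcement effect described above.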

You can find all the details of the approach in this presentation or the accompanying research.

I always joked that you could create this graph representation without programming just by writing a simple Cypher statement, but I actually never tried.

Until now. To be honest, I’m impressed by how easy it was to write down the essence and then extend and expand the statement until it covered a large number of inputs.

Read more…


Spring Data Neo4j 3.3.0 – Improving Remoting Performance

Posted by Michael Hunger on Dec 9, 2014 in neo4j, spring-data-neo4j

With the first milestone of the Spring Data “Fowler” release train, Spring Data Neo4j 3.3.0.M1 was released. Besides a lot of smaller fixes, it contains one big improvement. I finally found some time to work on the remoting performance of the library, i.e. when used in conjunction with Neo4j Server. This blog post explains the history behind the issue and the first steps taken to address it.

In the past, the remote performance of Spring Data Neo4j was unsatisfactory for many of its users. The reasons for that were twofold – historical and development bandwidth. Let’s start with the history.


When Spring Data Neo4j started, neither Cypher nor Neo4j-Server existed; only the embedded Java APIs of Neo4j were available.

So initially, the AspectJ-based, write- and read-through mapping mode and then later the simple mapping mode were built based on these in-process Java-APIs.

Later, adding Neo4j server support was made “easy” with the java-rest-binding library, which pretended to be an embedded GraphDatabaseService API but actually made remote calls to the Neo4j Server REST endpoints for each of the API operations.

Both were very bad ideas. Not just because network transparency is a very leaky abstraction.

So what we ended up with was an Object-Graph-Mapper that assumed it was talking to an embedded database. In the embedded case, the frequency of API calls is not problematic, and re-fetching nodes and relationships by id was just a cache-hit away.

But unknowingly making REST calls over the wire for each of the operations has quite an impact.

So a call that took nanoseconds in embedded mode may have taken as long as twice your network latency in the remote case. Combined with the number of calls, this pretty quickly added up to an unsatisfying remote performance of the whole library.
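As a rough back-of-the-envelope illustration of that effect, the following Java sketch multiplies the number of remote calls by one network round trip each. The numbers in the usage note are purely hypothetical, not measurements, and the class name `RemoteCostEstimate` is invented for this example; server-side processing time is ignored.

```java
// Hypothetical estimate: each remote API call costs at least one round trip,
// i.e. twice the one-way network latency.
class RemoteCostEstimate {
    static double roundTripMillis(double oneWayLatencyMillis) {
        return 2 * oneWayLatencyMillis;
    }

    static double totalMillis(int remoteCalls, double oneWayLatencyMillis) {
        return remoteCalls * roundTripMillis(oneWayLatencyMillis);
    }
}
```

For example, `RemoteCostEstimate.totalMillis(1000, 0.5)` assumes 1000 calls at 0.5 ms one-way latency and yields a full second spent on the network alone, whereas the same calls in embedded mode would be negligible.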

Unfortunately, fixing this did not fit into the development bandwidth available for the library. In retrospect that was a really bad decision too, as many users (esp. in larger organizations) really like the Spring Data (Neo4j) convenience for CRUD and use-case specific operations.

Recommendations for Working with Neo4j Server

Usually I recommended either moving the persistence layer of such an SDN application into a server extension and exposing only REST endpoints that use a domain-level protocol, or rewriting complex remote mapping logic into a few Cypher statements that are executed within the server.

Both suggestions are sensible and can improve the performance of your operations by up to a factor of 20.

Performance Example

To show the implications, I created a small test project that runs the same medium-complex SDN-mapping operations for each of the setups, creating 1000 medium size business objects (9 entities and 8 relationships with some properties):

  • on an embedded database,

  • against an SDN-based server extension and

  • remotely via Spring Data Neo4j (3.2.1.RELEASE).

For completeness I also added two tests that indicate remote Cypher performance, once via JDBC and once via the SDN Cypher execution.

The speed differences are quite big:



Setup                    Time (ms)    Time/Op (ms)
SDN remote (3.2.1):
SDN embedded:
SDN server extension:
SDN Cypher:
JDBC Cypher:




Fortunately, I recently got around to addressing at least a few of the root causes.


I looked into the hotspots of the remote execution, and fixed the ones with the highest impact.

I refrained from rewriting the whole object graph mapping logic within SDN as this is a much larger effort and will be worked on by our partner GraphAware as part of the SDN.next efforts (see below).

The places with the highest impact were:

  • Separate call to fetch node-labels as the REST-representation doesn’t expose labels

  • Continuous re-fetching of nodes and relationships from the database as part of the simple mapping process (as only the ID is available in the @GraphId field of the actual entity).

  • Setting properties individually via propertyContainer.setProperty(), which is the only available API in embedded mode


The existing java-rest-binding library also can’t expose any transaction semantics over the wire, as each REST-operation creates a new independent transaction within Neo4j Server.

The only approach that was supported for “larger transactions” was the REST-Batch-Operations, which encapsulated a number of operations in one single large HTTP request. But that API didn’t allow you to read your own writes within a transaction and make decisions based on that information. So this not only removed transactional safety but also created a lot of tiny transactions which all had to be forcibly synced to disk by Neo4j.

Changes impacting Performance

I started by comparing the remote performance of single REST-API calls with the appropriate Cypher calls and found that the former are twice as fast for single and simple operations. Cypher has its strengths in more complex operations and in running multiple statements within the same transaction with the new transactional endpoint.

As a first step, I inlined all the code of java-rest-binding that I still intended to use into Spring Data Neo4j and removed the operations that were no longer relevant. So starting with this version, the java-rest-binding dependency will no longer be used.

For some quick gains I changed:

  • loaded nodes with their labels in one step using two batch-REST calls, and only loaded labels from SDN if they were not already loaded

  • used the label and id meta-information that Neo4j’s REST format for nodes provides since Neo4j 2.1.5

  • added a local client-cache for nodes loaded from the server; updates and refreshes also go through this cache

  • added a separate interface UpdateableState to RestEntity (nodes and relationships) that allows bulk-updates of properties

  • changed node- and relationship-creation to utilize maps of properties for the initial create or merge call

All of those changes already improved the performance of the Spring Data Neo4j remote operations by a factor of 3 as shown by the sample project, but it was still not good enough.
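The client-cache idea can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions, not the code that went into Spring Data Neo4j: the `NodeCache` class, its method names and the use of a plain property map per node are all hypothetical, and the sketch is neither thread-safe nor bounded in size.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical read-through cache: a node is fetched from the server once,
// later reads and local updates go through the cached copy.
class NodeCache {
    private final Map<Long, Map<String, Object>> nodes = new HashMap<>();
    private final LongFunction<Map<String, Object>> remoteLoader;

    NodeCache(LongFunction<Map<String, Object>> remoteLoader) {
        this.remoteLoader = remoteLoader;
    }

    // Returns the cached properties, loading them remotely only on a miss.
    Map<String, Object> get(long id) {
        Map<String, Object> cached = nodes.get(id);
        if (cached == null) {
            cached = new HashMap<>(remoteLoader.apply(id));
            nodes.put(id, cached);
        }
        return cached;
    }

    // Called after a successful write so that readers see the new state.
    void update(long id, Map<String, Object> properties) {
        nodes.put(id, new HashMap<>(properties));
    }

    // Forces the next read to hit the server again.
    void evict(long id) {
        nodes.remove(id);
    }
}
```

The loader passed to the constructor stands in for whatever remote call fetches a node; with the cache in front of it, repeated re-fetching by id during mapping no longer turns into repeated HTTP requests.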

Transactional Cypher #FTW

As you probably know, Neo4j supports a streaming HTTP endpoint that is able to run multiple Cypher statements per request and can keep a transaction running across multiple HTTP-requests.
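For readers who have not used that endpoint: in Neo4j 2.x a request body is a JSON object with a "statements" array, where each entry has a "statement" string and a "parameters" map. The Java sketch below only builds such a body as a string; it is not the client described in this post. The class `CypherPayload` is hypothetical, parameter values are limited to numbers, booleans and strings, and escaping covers only quotes and backslashes.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper that assembles the JSON body for the transactional endpoint.
class CypherPayload {
    // Minimal escaping: backslashes and double quotes only (no control characters).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String toJson(Object value) {
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        return "\"" + escape(String.valueOf(value)) + "\"";
    }

    // One entry of the "statements" array.
    static String statement(String cypher, Map<String, Object> params) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"statement\":\"").append(escape(cypher)).append("\",\"parameters\":{");
        boolean first = true;
        for (Map.Entry<String, Object> e : params.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":").append(toJson(e.getValue()));
            first = false;
        }
        return sb.append("}}").toString();
    }

    // Several statements batched into a single request body.
    static String payload(List<String> statements) {
        return "{\"statements\":[" + String.join(",", statements) + "]}";
    }
}
```

Posting such a body to the server's transaction resource opens or continues a transaction, while posting it to the commit resource runs the statements and commits in one request. A real client would use a JSON library instead of string concatenation.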

Although I originally wanted to use the Neo4j-JDBC driver, it was not yet available on Maven Central. So I deferred that and instead wrote a quick Jersey-based HTTP client for the transactional Cypher endpoint which also supports batching of queries and running multiple HTTP requests within the same transaction.

The internal API looks like this:

String query = "MATCH (n) WHERE id(n) = {id} " +
        "RETURN id(n) AS id, labels(n) AS labels, n AS data";

Map<String, Object> params = singletonMap("id", nodeId);

CypherTransaction tx = new CypherTransaction(url, ResultType.row);

// send a single statement right away
Result result = tx.send(query, params);

// or queue statements and send them together on commit
tx.add(query, params);
List<Result> results = tx.commit();

// column-oriented access
List<String> cols = result.getColumns();
if (result.hasData()) {
   Iterable<List<Object>> data = result.getData();
}

// row-oriented access, keyed by column name
for (Map<String,Object> row : result) {
   Long id = ((Number) row.get("id")).longValue();
   List<String> labels = (List<String>) row.get("labels");
   Map props = (Map) row.get("data");
}

I then rewrote all remote operations that were expressible with Cypher from REST-HTTP calls into parameterized Cypher statements, for instance the Create-Node call into:

CREATE (n:`Label1`:`Label 2` {props})
RETURN id(n) as id, labels(n) as labels, n as data

This allowed me to set all the labels and properties of the node with a single CREATE operation and return the property data as well as the metadata like id and labels in a single call. I used the same return format consistently for nodes and relationships to map them easily back into the appropriate graph objects that SDN expects. The variant for relationships actually also returns start- and end-node-ids.

The list of operations that were (re)written is pretty long:

  • createNode, mergeNode

  • createRelationship

  • getDegree, getRelationships

  • findByLabelAndPropertyValue, findAllByLabel

  • findByQuery (lucene)

  • setProperty, setProperties, addLabel, removeLabel

  • deleteNode, deleteRelationship

  • …​

All other methods still forward to the existing REST operations (e.g. adding nodes to legacy indexes).
The new Cypher based REST-Api-Impl also utilizes the node cache that I already mentioned.
Some of these operations also send multiple statements on the same HTTP-request.

All Cypher operations run within a transaction. If there is none running, a single transaction will be opened just for this operation. If there is already a transaction started (stored in a ThreadLocal), the following operations will participate in it. So if the transaction is started on the outside, e.g. on a method boundary (annotated with @Transactional), all operations in the same thread will continue to use that transaction until the transactional scope is closed by issuing a commit or rollback operation.

To integrate this functionality with the outside, the Cypher based Rest-API exposes a method to create new Transactions. Those are held in a thread-local variable so that you can run independent threads with individual, concurrent transactions.
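The participation rule can be sketched in plain Java as follows. This is a simplified illustration, not the actual SDN implementation: `TxContext` and `Tx` are hypothetical names, statements are merely collected in a list instead of being sent to the server, and rollback is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of thread-bound transactions: an operation joins the
// transaction of its thread, or opens one just for itself.
class TxContext {
    static final class Tx {
        final List<String> statements = new ArrayList<>();
        boolean committed;
    }

    private static final ThreadLocal<Tx> CURRENT = new ThreadLocal<>();

    // Starts a transaction from the outside, e.g. at a method boundary.
    static Tx begin() {
        Tx tx = new Tx();
        CURRENT.set(tx);
        return tx;
    }

    // Closes the transactional scope of the current thread.
    static void commit() {
        Tx tx = CURRENT.get();
        if (tx != null) {
            tx.committed = true;
            CURRENT.remove();
        }
    }

    // Runs a statement in the running transaction, or in its own short one.
    static Tx execute(String statement) {
        Tx tx = CURRENT.get();
        boolean ownTransaction = (tx == null);
        if (ownTransaction) {
            tx = begin();
        }
        tx.statements.add(statement);
        if (ownTransaction) {
            commit();
        }
        return tx;
    }
}
```

Because the current transaction lives in a `ThreadLocal`, two threads calling `TxContext.begin()` each get their own independent transaction, which matches the concurrent usage described above.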

For integration with the Java world, a.k.a. JTA, I also implemented a javax.transaction.TransactionManager on top of that API, which can be used on its own.
But of course for integrating with Spring it is injected into a Jta(Platform)TransactionManager in the Spring Data Neo4j configuration.

So whenever you annotate a method or class with @Transactional, the Spring transaction infrastructure will use that bean to tie into the remote transaction mechanism provided by the transactional Cypher endpoint.

It was pretty cool that it worked out of the box after I brought the individual pieces together.

To make this new remote integration usable from Spring Data Neo4j I created a SpringCypherRestGraphDatabase (an implementation of the SDN-Database API that is more comprehensive than Neo4j’s GraphDatabaseService).

This is what you should use now to connect your Spring Data Neo4j application remotely to a Neo4j Server.

@Configuration
@EnableTransactionManagement(mode = AdviceMode.PROXY)
public static class RemoteConfiguration extends Neo4jConfiguration {
    public RemoteConfiguration() {
        // constructor body not shown here
    }

    @Bean
    public GraphDatabaseService graphDatabaseService() {
        return new SpringCypherRestGraphDatabase(BASE_URI);
    }
}

The steps taken here improved the performance of the use-case we were looking at by a factor of 8, which is not that bad.



Setup                    Time (ms)    Time/Op (ms)
SDN remote (3.2.1):
SDN remote (3.3.0):




My changes only addressed the remoting aspect of this challenge; the next step is to think big.

Ad Astra – SDN.next

We started work on completely rewriting the internals of Spring Data Neo4j to embrace a single, fast object graph mapping library for Neo4j.

As part of this effort, which is mainly developed by our partner GraphAware in London, we will simplify the architecture that Spring Data Neo4j is built on.

While we will keep the external APIs that you see as SDN users as stable as possible, the internals will change completely.

The idea is to build a fast, pure Java-Object Graph Mapper that utilizes the transactional Cypher Endpoint.
It will provide APIs for specifying mapping metadata from the outside and focus on simple CRUD operations for your entities and on mapping Cypher query results into arbitrary result object structures (DTOs, view objects).

Spring Data Neo4j’s single future mapping mode will then utilize these APIs to provide mapping meta-information from its annotations, run the CRUD operations for updating and reading entities and support Cypher execution and result handling like you already use today.

As all that relies on the execution of compound Cypher statements, you can do much more in a single call, depending on how clever the OGM will become.

And going forward its performance will benefit from all Cypher performance improvements, new schema indexes (spatial and fulltext) and new remoting protocols.

I’m really excited to accompany this work and see it advancing every day. If you want to get a glimpse of these developments, check out the GraphAware GitHub repositories. But please be patient: this is work in progress in its early stages, and although it progresses quickly, the first publicly usable version is still a while out.

I hope you will all join me on this journey and are as excited as I am about these latest developments.


The Story of GraphGen

Posted by Michael Hunger on Nov 1, 2014 in community, development, neo4j

This is the story behind the really useful and ingenious Neo4j example graph data generator developed by Christophe Willemsen.

I don’t just want to show you the tool but also tell the story of how it came to be.

First of all: The Neo4j Community is awesome.
There are so many enthusiastic and creative people, that it is often humbling for me to be part of it.

On October 1st, Christophe tweeted a short screencast he had recorded about a new tool (NeoGen) he was developing, which converted a YAML domain specification into Cypher statements to populate a Neo4j database.

Read more…



Posted by Michael Hunger on Oct 18, 2014 in cypher, import, neo4j

I have to admit that using our LOAD CSV facility is trickier than you and I would expect.
Several people ran into issues that they could not solve on their own.

My first blog post on LOAD CSV is still valid in its own right, and contains important aspects that I won’t repeat here.
It covers both data quality checking (broken CSV files, misspelt header names or incorrect data types) and the concern of transaction size, where PERIODIC COMMIT comes to the rescue.

To address the most frequent issues and questions, I decided to write this follow up post.

In general, you might have a better experience using Neo4j-Enterprise as it contains some components which are more memory efficient.

If you want to import much more than 10-15 million lines of data, you might consider using our non-transactional batch-insertion facilities:

Read more…


Flexible Neo4j Batch Import with Groovy

Posted by Michael Hunger on Oct 9, 2014 in import, neo4j

You might have data as CSV files to create nodes and relationships from in your Neo4j Graph Database.
It might be a lot of data, like many tens of millions of lines.
Too much for LOAD CSV to handle transactionally.

Usually you can just fire up my batch-importer and prepare node and relationship files that adhere to its input format requirements.

Your Requirements

There are some things you probably want to do differently than the batch-importer does by default:

  • not create legacy indexes

  • not index properties at all that you just need for connecting data

  • create schema indexes

  • skip certain columns

  • rename properties from the column names

  • create your own labels based on the data in the row

  • convert column values into Neo4j types (e.g. split strings or parse JSON)

Read more…


LOAD CSV into Neo4j quickly and successfully

Posted by Michael Hunger on Jun 25, 2014 in cypher, import

Since version 2.1, Neo4j provides out-of-the-box support for CSV ingestion. The LOAD CSV command that was added to the Cypher Query language is a versatile and powerful ETL tool.
It allows you to ingest CSV data from any URL into a friendly parameter stream for your simple or complex graph update operation, that … conversion.

But hear my words of advice before you jump directly into using it. There are some tweaks and configuration aspects that you should know to be successful on the first run.

Data volume: LOAD CSV was built to support around 1M rows per import. It still works with 10M rows, but you have to wait a bit; at 100M it’ll try your patience.
Except for tiny datasets, never run it without the safeguard of periodic commits, which prevent large transactions from overflowing your available database memory (JVM Heap).

The CSV used in this example is pretty basic, but enough to show some issues and make a point. It lists people and the companies they work(ed) for.

PersonName,"Company Name",year
"Kenny Bastani","Neo Technology",2013
"Michael Hunger","Neo Technology",2010
"James Ward","Heroku",2011

Read more…


Rendering a Neo4j Database in UbiGraph

Posted by Michael Hunger on Jun 23, 2014 in cypher, server

I never heard of UbiGraph before, but this tweet by @a61dr41n made me curious.

So I checked it out. UbiGraph is a graph rendering server that is controlled remotely and interactively via an XML-RPC API (which is a weird choice).
It comes with example clients in Java, Python, Ruby and C.
You can download it from here. After unzipping the file and starting bin/ubigraph_server &, you should see a black window rendering the void, waiting for your commands.

Read more…


Presentation: “Using AsciiArt to Analyse your SourceCode with Neo4j and OSS Tools” at GeekOut.ee 2014

Posted by Michael Hunger on Jun 15, 2014 in conference, neo4j, programming languages

During the awesome GeekOut conference organized by my friends at ZeroTurnaround I was asked to stand in for Tim Fox who couldn’t come.

So instead of using an existing presentation I decided to finally write one up overnight that covers one aspect of graph databases that is close to my heart:

Software Analytics with Graphs

When I first learned about Neo4j in 2008, my first project was pulling in Java class-file information into Neo4j, to find interesting tidbits about the JDK. Fast forward 4 years.

Other things kept me busy until 2012, when I was speaking at an InnoQ tech-day and thought this would be a good topic to talk about.

I was so amazed by the projects that others had done in this area that I published a blog post on “Graph Databases and Software Metrics” to show what I’d found. These were:

  • Raoul-Gabriel Urma: Expressive and Scalable Source Code Queries with Graph Databases (Paper)
  • Rickard Öberg: NeoMVN is tracing maven dependencies (GitHub)
  • Pavlo Baron: Graphlr, an ANTLR storage in Neo4j (GitHub)

During a two-hour train ride with my friend Dirk from Buschmais, he got a full load of my excitement about this topic, and he saw a really good practical use for it in his daily work with large software projects. Having your project’s structure in a graph allows you to:

  1. Query the graph structures for insights on the code level (e.g. code-smells)
  2. Enrich the graph structure with higher level, technical, architectural and business concepts
  3. Define rules and metrics based on those higher level concepts.
  4. Run the parsing, enrichment, metrics computation and rule checking as part of your build process, generating reports and failing it in case of violation of those rules

All those ideas resulted in an impressive open-source project called jQAssistant which does all of the above (and much more).

So back to my GeekOut presentation. I sat up until 3am and wrote it up in AsciiDoc (+ deck.js) so you can fork it from the GitHub repo, download the PDF or view the HTML-Slides online.

The session has been recorded; I’ll embed the video as soon as it is online. Then you can even listen to my hoarse voice.


Styling Neo4j Server Visualisation

Posted by Michael Hunger on Jun 3, 2014 in neo4j, server


To give you a head start when using Neo4j-Browser I wanted to share these quick tips for styling and querying.


Read more…


Using LOAD CSV to import Git History into Neo4j

Posted by Michael Hunger on Jun 1, 2014 in cypher, neo4j

In this blog post, I want to show the power of LOAD CSV, which is much more than just a simple data ingestion clause for Neo4j’s Cypher.
I want to demonstrate how easy it is to use by importing a project’s git commit history into Neo4j. For demonstration purposes, I use Neo4j’s repository on GitHub, which contains about 27000 commits.

It all started with this tweet by Paul Horn, a developer from Avantgarde Labs in my lovely Dresden.

I really liked the idea and wanted to take a look. His python script takes the following approach:

Read more…

Copyright © 2007-2015 Better Software Development All rights reserved.