The Reddit Meme Graph with Neo4j

Posted by Michael Hunger on Feb 25, 2017 in cypher, import

Saturday night after not enough drinks, I came across these tweets by @LeFloatingGhost.

memegraph tweet.jpg

This definitely looks like a meme graph.

We can do that too

memegraph meme.jpg

Recorded Session

If you want to see me struggle to get this going live, watch my session here:

memegraph gif preview.jpg

If you want to see an interactive version of this post, check it out at the Graph Gist Collection.

memegraph graphgist.jpg

Find us some memes


There is this really nice CSV of the top Reddit memes, from the reddit-top-2.5-million dataset on GitHub (see the URL in the statements below).

And grab an empty Neo4j Sandbox from http://neo4jsandbox.com.

What’s the data like?

Check CSV

WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
RETURN count(*);
│"count(*)"│
│"1000"    │
WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row limit 3;
│"row"                                                                                               │
│:"120","edited":"False","title":"Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Compan│
│y?","created_utc":"1368627364.0","is_self":"False"}                                                 │

Load them memes

WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
WITH row LIMIT 10000
CREATE (m:Meme) SET m=row // we take it all into Meme nodes

Added 100 labels, created 100 nodes, set 1700 properties, statement completed in 120 ms.

Get some memes

MATCH (m:Meme) return m limit 25;
memegraph memes.jpg
MATCH (m:Meme) return m.id, m.title limit 5;
│"m.id"  │"m.title"                                                                       │
│"1edsw9"│"Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Company?"         │
│"1ihc34"│"Given the competitive nature of redditors, I assume you all feel the same way."│
│"1gmt99"│"This man left this woman..."                                                   │
│"1ds9y4"│"How to cure bad breath..."                                                     │

But we want the words!

Let’s grab the first meme and get going.

Split the text into words.

MATCH (m:Meme) WITH m limit 1
RETURN split(m.title, " ") as words;


MATCH (m:Meme) WITH m limit 1
RETURN split(toUpper(m.title), " ") as words;

Remove Punctuation

Create an array of punctuation with split on empty string.

return split(",!?'.","") as chars;

And replace each of the characters with an empty string ''.

with "a?b.c,d" as word
return word,
       reduce(s=word, c IN split(",!?'.","") | replace(s,c,'')) as no_chars;
│"word"   │"no_chars"│
│"a?b.c,d"│"abcd"    │

We got us some nice words

MATCH (m:Meme)  WITH m limit 1
// let's split the text into words
RETURN split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words;
│"words"                                                                                          │
│["CAN","WE","PLEASE","START","A","CRAZY","AMY","MEME","FOR","AMY","OF","AMYS","BAKING","COMPANY"]│

Enough words, where are the nodes?

Let’s create some word nodes

(MERGE does get-or-create)

MATCH (m:Meme)  WITH m limit 1
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
MERGE (a:Word {text:words[0]})
MERGE (b:Word {text:words[1]});

Our first two words

MATCH (n:Word) RETURN n;
memegraph two words.jpg

Unwind the ra(n)ge

But we want all in the array, so let’s unwind a range.

MATCH (m:Meme)  WITH m limit 1
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m

UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx

MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]});
MATCH (n:Word) RETURN n;

No Limits

MATCH (m:Meme) WITH m // no limits
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m

UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx

MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]});
memegraph all words.jpg
MATCH (n:Word) RETURN count(*);

Chain up the memes

Connect the words via :NEXT and store the meme-ids on each rel in an ids property

And for the first word (idx = 0) let’s also connect the Meme node to the first Word

MATCH (m:Meme) WITH m
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx
MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]})

// Connect the words via :NEXT and store the meme-ids on each rel in an `ids` property
MERGE (a)-[rel:NEXT]->(b) SET rel.ids = coalesce(rel.ids,[]) + [m.id]

// to later recreate the meme along the next chain
// connect the first word to the meme itself
WITH * WHERE idx = 0
MERGE (m)-[:FIRST]->(a);

Set 546 properties, created 614 relationships, statement completed in 65 ms.

Yay done!

MATCH (m:Meme)-[:FIRST]->(w:Word)-[:NEXT]->(w2:Word)
RETURN * LIMIT 10;
memegraph example.jpg

Which words appear most often?

MATCH (w:Word)
WHERE length(w.text) > 4
RETURN w.text, size( (w)--() ) as relCount
ORDER BY relCount DESC LIMIT 10;
│"w.text"  │"relCount"│
│"AFTER"   │"56"      │
│"REDDIT"  │"34"      │
│"ABOUT"   │"33"      │
│"TODAY"   │"33"      │
│"SCUMBAG" │"32"      │
│"EVERY"   │"31"      │
│"FIRST"   │"30"      │
│"ALWAYS"  │"28"      │
│"FRIEND"  │"27"      │
│"THOUGHT" │"24"      │

Now let’s find our memes again

// first meme
MATCH (m:Meme) WITH m limit 1
// from the :FIRST :Word follow the :NEXT chain
MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->() // let's follow the chain of words starting
// from the meme, where all relationships contain the meme-id
WHERE ALL(r in rels WHERE m.id IN r.ids)
RETURN path;

Show meme by id

We can also get a meme from the CSV list,
e.g. id '1kc9p2': 'As stupid as memes are they can actually make valid points'

MATCH (m:Meme) WHERE m.id = '1kc9p2'

MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->()
WHERE ALL(r in rels WHERE m.id IN r.ids)
RETURN path;
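To turn such a path back into readable text, one way (a sketch) is to reduce over the word nodes of the path:

```cypher
MATCH (m:Meme) WHERE m.id = '1kc9p2'
MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->()
WHERE ALL(r in rels WHERE m.id IN r.ids)
// take the longest matching chain and join its words back together
WITH path ORDER BY length(path) DESC LIMIT 1
RETURN reduce(s = "", n IN [x IN nodes(path) WHERE x:Word] | s + n.text + " ") AS text;
```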

memegraph 2.jpg

Done. Enjoy!

PS: If you want to connect your own stuff, grab a Neo4j Sandbox or use Neo4j on your machine.
If you have questions, ask me, Michael, on Twitter or on Slack.



User Defined Functions in Neo4j 3.1.0-M10

Posted by Michael Hunger on Oct 6, 2016 in apoc, cypher

Neo4j 3.1 brings some really neat improvements in Cypher alongside other cool features

I already demonstrated the – GraphQL inspired – map projections and pattern comprehensions in my last blog post.

User Defined Procedures

In the 3.0 release my personal favorite was user defined procedures which can be implemented using Neo4j’s Java API and called directly from Cypher.
You can tell, because I wrote about half of the 270 procedures in the APOC procedure collection, with the remainder provided by other contributors.

Remember the syntax: … CALL namespace.procedure(arg1, arg2) YIELD col1, col2 AS alias …

MATCH (from:Place {coords:{from}}), (to:Place {coords:{to}})

CALL apoc.algo.dijkstra(from, to, "ROAD", "cost") YIELD path, weight

RETURN nodes(path)
ORDER BY weight LIMIT 10;

Procedure improvements in 3.1

As of 3.1, the named procedure parameters can now have default values.
This means you can leave them off during the call (from right to left), e.g.

@Name(value="costProperty", defaultValue="cost") String prop

The annotations for the database mode were folded into the @Procedure annotation.
You can also now set DBMS as mode, which encompasses Schema operations like creating indexes or constraints. @Procedure(value="namespace.procedure",mode=READ|WRITE|DBMS).

Based on the security work in Neo4j Enterprise, procedures can now get an additional set of roles configured in the allowed attribute of the @Procedure annotation, that also permit users with that role (or only that role) to execute the procedure.

The last change is dropping the support for result objects without columns, which we sometimes used for changing the cardinality of a query (between 0 and 1) for boolean operations.
But that’s handled now much better by boolean functions (predicates) as discussed below.

User Defined Functions

If you used or wrote procedures in the past, you most probably came across instances where it felt quite unwieldy to call a procedure just to compute something, convert a value or provide a boolean decision.

For example:

CREATE (v:Value {id:{id}, data:{data}})
CALL apoc.date.formatDefault(timestamp(), "ms") YIELD value as created
SET v.created = created

You’d rather write it as a function:

CREATE (v:Value {id:{id}, data:{data}, created: apoc.date.format(timestamp()) })

Now in 3.1 that’s possible, and you can also leave off the "ms" and use a single function name, because the unit and format parameters have a default value.

Functions are more limited than procedures: they can’t execute writes or schema operations and are expected to return a single value, not a stream of values.
But this makes it also easier to write and use them.

By having information about their types, the Cypher Compiler can also check for applicability.

The signature of the procedure above changed from:

public Stream<StringResult> formatDefault(@Name("time") long time, @Name("unit") String unit) {
   return Stream.of(format(time, unit, DEFAULT_FORMAT));
}

to the much simpler function signature (ignoring the parameter name and value annotations):

public String format(@Name("time") long time,
                     @Name(value="unit", defaultValue="ms") String unit,
                     @Name(value="format", defaultValue=DEFAULT_FORMAT) String format) {
   return getFormatter().format(time, unit, format);
}

This can then be called in the manner outlined above.

In our APOC procedure library we already converted about 50 procedures into functions from the following areas:

- date & time conversion
- number conversion
- general type conversion
- type information and checking
- collection and map functions
- JSON conversion
- string functions
- hash functions

You can list the available user defined functions with CALL dbms.functions().


We also started to add default values for parameters and to deprecate procedures that had required alternative names because of a wider parameter set.

We also moved the Description annotation from APOC’s own to the one now provided within Neo4j, so you’ll see the descriptions for all functions and procedures in dbms.procedures().

All of this is available now, if you run Neo4j 3.1.0-M10 and use this latest APOC release.

You can follow the development of APOC for Neo4j 3.1 in the 3.1 branch.


Neo4j 3.0 Stored Procedures

Posted by Michael Hunger on Feb 29, 2016 in cypher, java

One of the many exciting features of Neo4j 3.0 is “Stored Procedures”, which, unlike the existing Neo4j-Server extensions, are directly callable from Cypher.

At the time of this writing it is only possible to call them in a stand-alone statement with CALL package.procedure(params),
but the plan is to make them a fully integrated part of Cypher statements,
either by making CALL a clause or by turning procedures into function-expressions (which would be my personal favorite).

Currently procedures can only be written in Java (or other JVM languages).
You might say, “WTF … Java”, but it is less tedious than it sounds.

First of all, the effort of setting up a procedure project, writing and building it is minimal.

To get up and running you first need a recent copy of Neo4j 3.0,
either the 3.0.0-M04 milestone or the latest build from the Alpha Site.

To get you started you also need a JDK and a build tool like Gradle or Maven.

You can effectively copy the procedure template example that Jake Hansson provided in neo4j-examples as a starting point.

But let me quickly walk you through an even simpler example (GitHub Repository).

You need to declare the org.neo4j:neo4j:3.0.0[-M04] dependency in the provided scope, to get the necessary annotations and the Neo4j API to talk to the database.

project.ext {
    neo4j_version = "3.0.0-M04"
}

dependencies {
	compile group: "org.neo4j", name: "neo4j", version: project.neo4j_version
	testCompile group: "org.neo4j", name: "neo4j-kernel", version: project.neo4j_version, classifier: "tests"
	testCompile group: "org.neo4j", name: "neo4j-io", version: project.neo4j_version, classifier: "tests"
	testCompile group: "junit", name: "junit", version: "4.12"
}
If you have a great idea on what kind of procedure you want to write, just open a file with a new class.

Please note that only the package and method names become the procedure name (the class name does not).

In our example we will create a very simple procedure that just computes the minimum and maximum degrees of a certain label.

The reference to Neo4j’s GraphDatabaseService instance is injected into your class into the field annotated with @Context.
As procedures are meant to be stateless, declaring non-injected non-static fields is not allowed.

In our case the procedure will be named stats.degree and called like CALL stats.degree('User').

package stats;

import java.util.stream.Stream;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.ResourceIterator;
import org.neo4j.procedure.Context;
import org.neo4j.procedure.Name;
import org.neo4j.procedure.Procedure;

public class GraphStatistics {

    // Neo4j injects the database into this public field
    @Context public GraphDatabaseService db;

    // Result class
    public static class Degree {
        public String label;
        // note, that "int" values are not supported
        public long count, max, min = Long.MAX_VALUE;

        public Degree(String label) { this.label = label; }

        // method to consume a degree and compute min, max, count
        private void add(long degree) {
            if (degree < min) min = degree;
            if (degree > max) max = degree;
            count++;
        }
    }

    // procedure name is derived from package and method name: stats.degree
    @Procedure
    public Stream<Degree> degree(@Name("label") String label) {
        // create holder class for results
        Degree degree = new Degree(label);
        // iterate over all nodes with label
        try (ResourceIterator<Node> it = db.findNodes(Label.label(label))) {
            while (it.hasNext()) {
                // submit degree to holder for consumption (i.e. max, min, count)
                degree.add(it.next().getDegree());
            }
        }
        // we only return a "Stream" of a single element in this case.
        return Stream.of(degree);
    }
}
If you want to test the procedures quickly without spinning up an in-process server and connecting to it remotely (e.g. via the new binary bolt protocol as shown in the procedure-template), then you can use the test-facilities of Neo4j’s Java API.

Now we can test our new and shiny procedure by writing a small unit-test.

package stats;

import java.util.Map;

import org.junit.Test;
import static org.junit.Assert.*;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.kernel.impl.proc.Procedures;
import org.neo4j.kernel.internal.GraphDatabaseAPI;
import org.neo4j.test.TestGraphDatabaseFactory;

public class GraphStatisticsTest {
    @Test public void testDegree() throws Exception {
        // impermanent test database with our procedure registered
        GraphDatabaseService db = new TestGraphDatabaseFactory().newImpermanentDatabase();
        ((GraphDatabaseAPI) db).getDependencyResolver()
            .resolveDependency(Procedures.class)
            .registerProcedure(GraphStatistics.class);

        // given Alice knowing Bob and Charlie and Dan knowing no-one
        db.execute("CREATE (alice:User)-[:KNOWS]->(bob:User),(alice)-[:KNOWS]->(charlie:User),(dan:User)").close();

        // when retrieving the degree of the User label
        Result res = db.execute("CALL stats.degree('User')");

        // then we expect one result-row with min-degree 0 and max-degree 2
        Map<String,Object> row = res.next();
        assertEquals("User", row.get("label"));
        // Dan has no friends
        assertEquals(0L, row.get("min"));
        // Alice knows 2 people
        assertEquals(2L, row.get("max"));
        // We have 4 nodes in our graph
        assertEquals(4L, row.get("count"));
        // only one result record was produced
        assertFalse(res.hasNext());
        db.shutdown();
    }
}
Of course you can use procedures to create procedures, e.g. in other languages that are supported natively on the JVM like JavaScript via Nashorn, or Clojure, Groovy, Scala, Frege (Haskell), (J)Ruby or (J/P)ython.
I wrote one for creating and running procedures implemented in JavaScript.

There are many other cool things that you can do with procedures, see the resources below.

If you have ideas for procedures or wrote some of your own, please let us know.

Join our public Slack channel and visit #neo4j-procedures.



Using XRebel 2 with Neo4j

Posted by Michael Hunger on May 5, 2015 in neo4j

At Spring.IO in Barcelona I met my pal Oleg from ZeroTurnaround and we looked at how the new XRebel 2
integrates with Neo4j, especially with the remote access using the transactional Cypher http-endpoint.

As you probably know, Neo4j currently offers a remoting API based on HTTP requests (a new binary protocol is in development).

Our JDBC driver utilizes that http-based protocol to connect to the database and execute parameterized statements while adhering to the JDBC APIs.

XRebel is a lightweight Java Application Profiler which is loaded as java-agent and instruments your application.
It traces runtime for web requests and records your backend-application CPU usage, database- (JDBC) and http-requests to other services.
For web-applications it integrates automatically with the http-processing and injects profiling information into the response.

Movies Webapp

For this quick demo, we use the example Movies application which is available for many programming languages from our developer resources.
The application is just a plain Java webapp that serves three JSON endpoints to a simple Javascript frontend page.
The backend connects to Neo4j via JDBC to retrieve the requested information via our Cypher query language.

To prepare for running our app, just download, unzip and start Neo4j, open it on http://localhost:7474/ and run the :play movies statement in the Neo4j browser.
Then we can get and build the application and run it.
To test that it works, open the app in your browser at http://localhost:8080

git clone http://github.com/neo4j-contrib/developer-resources
cd developer-resources/language-guides/java/jdbc

mvn compile exec:java -DmainClass="org.neo4j.example.movies.Movies"

Setup with XRebel

To use XRebel we just download it, get an eval license and attach the jar as a java-agent to our application.

MAVEN_OPTS="-javaagent:$HOME/Downloads/xrebel/xrebel.jar" mvn compile exec:java -DmainClass="org.neo4j.example.movies.Movies"

If we check our example application page again, we see a small green XRebel icon in the left corner.
It provides access to the XRebel UI which has tabs for application performance, database queries, exceptions and more.

For our initial query for the “Matrix” movie, it shows both the request time for the web-application, as well as the database calls to Neo4j.
Interestingly both the JDBC level as well as the underlying http calls to Neo4j are displayed.

If we uglify our app so that our queries are executed inefficiently, simulating an n+1 select, that shows up clearly in XRebel as massive database interaction.

Runtime Exceptions due to a programming error are also made immediately accessible from the XRebel UI.

For non-visual REST-services you can access the same profiling information via a special endpoint that is added to your application, in our case: http://localhost:8080/xrebel

As you can see, XRebel can give you quick insights into the performance profile of your Neo4j-backed application and highlights which queries, pages or secondary requests need further optimization.

Ping Oleg or me if you have more questions.

If you’re in London this week and want to have a relaxing election day,
make sure to grab a seat for GraphConnect on May 7, the Neo4j conference.
Ping me via email (michael at neo4j.org) for a steep discount as an avid reader of this blog.


Neo4j Server Extension for Single Page Experiments

Posted by Michael Hunger on Apr 24, 2015 in neo4j, server

Sometimes you have a nice dataset in Neo4j and you’d want to provide a self-contained way of quickly exposing it to the outside world without a multi-tier setup.

So for experiments and proofs of concept it would be helpful to be able to extend the Neo4j Browser to accommodate new types of frames and commands.
Unfortunately we're not there yet; there is still some work to be done until this becomes possible.

Until then … why not use what we already have?

I was discussing with different people some helpful database and server extensions which would benefit from a tiny built-in UI.
Then I had the idea to just use the JAX-RS mechanisms that Neo4j Server supports to serve not only JSON, text or XML but also
HTML, JS, CSS and image files to the browser.
Those files would not live on the file-system but be packaged directly into the jar of the extension, e.g. residing in a resources/webapp folder.

How it works

This was actually much easier than expected.
This is a normal JAX-RS resource class that can then be mounted on an endpoint using the neo4j-server.properties configuration.

The HTTP-GET endpoint handles certain patterns declared by a regular expression.
There is one function that tries to find that file within the webapp folder within the JAR classpath, returning null if not found or the InputStream otherwise.
And one function for determining the content-type to be returned.

You can easily use this approach for your own Neo4j extension by copying that StaticWebResource into your project and providing the HTML, JS and CSS files in the webapp directory.

The Demo: Popoto.js

As my demo I used a setup that exposes popoto.js automatically on top of the data you have in your graph.

The StaticWebResource provides the web-files of the visualization from the resources/webapp directory.
And PopotoResource adds a second endpoint to provide a config/config.js file which uses label, property and index information
to provide the necessary config for popoto’s visualization.

Note that you have to disable auth for this demo, as I haven't added a means to configure a username/password.

You can use the demo by cloning and building (mvn clean install) this repository.
Copy the resulting jar in the server’s plugin directory.
Edit conf/neo4j-server.properties to register the package name with an endpoint.

cp target/neo4j-web-extension-2.2-SNAPSHOT.jar /path/to/neo/plugins/
echo 'org.neo4j.server.thirdparty_jaxrs_classes=extension.web=/popoto' >>  /path/to/neo/conf/neo4j-server.properties
/path/to/neo/bin/neo4j restart
open http://localhost:7474/popoto

popoto in neo demo

You can also download the JAR from here.

Enjoy exploring!


How To: Neo4j Data Import – Minimal Example

Posted by Michael Hunger on Apr 18, 2015 in import, neo4j

We want to import data into Neo4j, but there are so many resources with so much information that it gets confusing.
Here is the minimal thing you need to know.

Imagine the data coming from the export of a relational or legacy system, just plain CSV files without headers (this time).


Graph Model

Our graph model is very simple:

import data model.jpg
(p1:Person {userId:10, name:"Anne"})-[:KNOWS]->(p2:Person {userId:123,name:"John"})

Import with Neo4j Server & Cypher

  1. Download, install and start Neo4j Server.

  2. Open http://localhost:7474

  3. Run the following statements one by one:

I used http-urls here to run this as an interactive, live Graph Gist.

LOAD CSV FROM "https://gist.githubusercontent.com/jexp/d8f251a948f5df83473a/raw/people.csv" AS row
CREATE (:Person {userId: toInt(row[0]), name:row[1]});
LOAD CSV FROM "https://gist.githubusercontent.com/jexp/d8f251a948f5df83473a/raw/friendships.csv" AS row
MATCH (p1:Person {userId: toInt(row[0])}), (p2:Person {userId: toInt(row[1])})
CREATE (p1)-[:KNOWS]->(p2);
You can also use file-urls.
Best with absolute paths like file:/path/to/data.csv, on Windows use: file:c:/path/to/data.csv

If you want to find your people not only by id but also by name quickly, also run:

CREATE INDEX ON :Person(name);

For instance, find all second-degree friends of “Anne” and the number of ways they can be reached.

MATCH (:Person {name:"Anne"})-[:KNOWS*2..2]-(p2)
RETURN p2.name, count(*) as freq

Bulk Data Import

For tens of millions up to billions of rows.

Shut down the server first!

Create two additional header files:
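For our simple model they would look roughly like this (a sketch using the neo4j-import header syntax, with userId serving as the node :ID):

```
people_header.csv:
userId:ID,name

friendships_header.csv:
:START_ID,:END_ID
```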


Execute from the terminal:

path/to/neo/bin/neo4j-import --into path/to/neo/data/graph.db  \
--nodes:Person people_header.csv,people.csv --relationships:KNOWS friendships_header.csv,friendships.csv

After starting your database again, run:
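The bulk importer creates no indexes or constraints, so a typical follow-up (a sketch matching our model) is:

```cypher
// recreate the lookup indexes the importer does not build
CREATE INDEX ON :Person(userId);
CREATE INDEX ON :Person(name);
```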



On Neo4j Indexes, Match & Merge

Posted by Michael Hunger on Apr 11, 2015 in cypher, neo4j

We at Neo4j do our fair share to cause confusion among our users. I'm talking about indexes, my friends.
My trusted colleagues Nigel Small (Index Confusion) and Stefan Armbruster (Indexing, an Overview) already did a great job explaining the indexing situation in Neo4j;
I want to add a few more aspects here.

Since the release of Neo4j 2.0 and the introduction of schema indexes, I have had to answer an increasing number of questions arising from confusion between the two types of index now available: schema indexes and legacy indexes.
For clarification, these are two completely different concepts and are not interchangable or compatible in any way.
It is important, therefore, to make sure you know which you are using.

— Nigel Small

Why do we need indexes in a graph database at all? Aren't we all about graph navigation?
We need them to quickly find starting points for your graph traversals or pattern matches.

Schema Indexes

Neo4j 2.0 introduced the optional schema which was built around the concept of node labels.
Labels can be used for path matching and – along with properties – are used as the basis for schema indexes and constraints.
Schema indexes can automatically speed up queries, unlike legacy indexes which you have to use explicitly.

Note: Only schema indexes are aware of labels; legacy indexes are completely and utterly unaware of labels.

Further Note: Schema indexes are also only available for nodes whereas legacy indexes allowed relationships to be indexed as well.
The use cases for relationship indexing were few and could generally be worked around by introducing extra nodes.

— Nigel Small

Today Neo4j uses exact, case-sensitive, automatic schema indexes and constraints based on a single label and a single property.
Labels and indexes are part of Neo4j’s “optional” schema concept.

Schema Indexes and MATCH

You create a schema index with CREATE INDEX ON :Label(property) e.g. CREATE INDEX ON :Person(name).

You list available schema indexes (and constraints) and their status (POPULATING,ONLINE,FAILED) with :schema in the browser or schema in the shell.
Always make sure that the indexes and constraints you want to use in your operations are ONLINE otherwise they won’t be used and your queries will be slow.

When you create a new schema index, it will also asynchronously index all existing nodes with that label and property combination.
After the index is available it will show as “ONLINE”. Later changes to nodes (label addition and removal and property updates) are automatically and transactionally reflected in the index as well.

You can await the index creation with schema await.

Fulltext, spatial and composite schema indexes are not available as of Neo4j 2.2.

You need to have an index or constraint to efficiently find starting nodes via MATCH, otherwise Neo4j has to run a full label scan with property comparisons to find your node which can be very expensive.

Schema indexes will be used for both inline syntax MATCH (p:Person {name:"Mark"}) as well as WHERE conditions MATCH (p:Person) WHERE p.name="Mark" and also IN predicates like MATCH (p:Person) WHERE p.name IN ["Mark","Max"].

Neo4j uses schema indexes automatically, yet you can force certain index usage with hints like USING INDEX p:Person(name) after a MATCH clause.

You can check that Neo4j actually uses an index, by prefixing your query with EXPLAIN and looking at the query plan visualization.
It should show a NodeIndexSeek instead of a NodeByLabelScan + Filter.
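For example, with the :Person(name) index from above in place, a quick check could look like this (a sketch):

```cypher
// prefix the query with EXPLAIN to see the plan without running it
EXPLAIN MATCH (p:Person) WHERE p.name = "Mark" RETURN p;
// with the index ONLINE the plan starts with a NodeIndexSeek;
// without it you would see a NodeByLabelScan followed by a Filter
```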

Remember that only the single label and property combination you defined an index for is considered for lookups!

Constraints and MERGE

A MERGE operation will also use the index for faster lookups, but an index alone doesn’t guarantee uniqueness of nodes.

That’s where unique constraints come into play, again for a single label and single property.
You create them with this unwieldy syntax CREATE CONSTRAINT ON (n:Label) ASSERT n.property IS UNIQUE, e.g. CREATE CONSTRAINT ON (b:Book) ASSERT b.isbn IS UNIQUE

Constraints will, on creation, check all nodes in the database with that label and property combination for uniqueness, and will fail and list the duplicates if there are any.
Constraint creation is blocking.
It will only return after the constraint has been either successfully created (“ONLINE”) or aborted (“FAILED”).

Each constraint creates an accompanying index, which is noted in the listing; the constraint creation will fail if such an index already exists.

Unique constraints will ensure that no other operation will create a node with a duplicate label+property-value combination by failing the transaction with an exception.
That goes for all operations like CREATE node, SET property value, ADD label, and also for calls via the Java API.

MERGE uses constraints for an efficient lookup and uniqueness check as well as acquiring a focused lock which guarantees uniqueness even in concurrent operations and across a cluster.

If you need to set other properties as part of your node-creation, use the ON CREATE SET option:
MERGE (b:Book {isbn:{isbn}}) ON CREATE SET b.title = {title}, b.year = {year}

Other types of constraints, e.g. property (type) or composite constraints, are not available as of Neo4j 2.2.

Composite Constraints (and Indexes)

If you really need composite-key constraints or index lookups, consider concatenating the values into an artificial id-property or using an array of those composite values as the “id”. For example:

CREATE CONSTRAINT ON (a:Address) ASSERT a.composite_id IS UNIQUE;

MERGE (a:Address {composite_id: [{zip},{street},{number}]}) ON CREATE SET a.zip = {zip}, a.street={street}, a.number = {number};
// or
MERGE (a:Address {composite_id: {zip}+"_"+{street}+"_"+{number}}) ON CREATE SET a.zip = {zip}, a.street={street}, a.number = {number};

Manual (Legacy, Deprecated) Indexes

Prior to the release of Neo4j 2.0, legacy indexes were just called indexes. These were powered by Lucene outside the graph and allowed nodes and relationships to be indexed under a key:value pair. From the perspective of the REST interface, most things called “index” will still refer to these legacy indexes.

Note: Legacy indexes were generally used as pointers to start nodes for a query; they provided no automatic ability to speed up queries.

— Nigel Small

Historically there were manual indexes that you will come across in the documentation, old blog posts or examples.

If you don’t need a fulltext, spatial or relationship index or you don’t have to deal with a “legacy” Neo4j application, you can ignore them.
Stop reading here.

You had to add nodes and relationships (with key and value) to a named index yourself; that's why they're called “manual indexes”.

Those indexes were exact or fulltext Lucene indexes for nodes or relationships or spatial indexes for nodes.

The Lucene indexes also optionally exposed the lucene query syntax and had configurable cases and analyzers.

You could use manual indexes via the Java API or the START clause in Cypher, e.g.
START post=node:posts("title:Graphs") MATCH (post)<-[:WROTE]-(author) RETURN post, author

You manage them with the index command in the shell; you list them with index --indexes.

There is also a tab in the old Webadmin UI that lists them.

Legacy indexes also had options for unique node- and relationship-creation, which is now superseded by MERGE with constraints.

Deprecated Auto-Indexes

Because people didn't like adding nodes and relationships manually, but we didn't have labels back then, there was a way of having “automatic” indexes.

You could configure exactly one automatic index for all nodes (node_auto_index) and one for all relationships (relationship_auto_index) by listing the properties that were to be indexed.
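For reference, that configuration lived in neo4j.properties and looked roughly like this (a sketch; the setting names are the 2.x ones, the indexed property keys are made up for illustration):

```
# enable the single automatic node index and pick the indexed properties
node_auto_indexing=true
node_keys_indexable=name,email

# and the same for the relationship auto index
relationship_auto_indexing=true
relationship_keys_indexable=since
```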

You could use them again with the Java API and the START clause but this time with the fixed _auto_index name (see above).

You will still find the configuration options in the `neo4j.properties` config file, and the APIs both in Java and in the REST endpoints.
All of those are safe to ignore, unless you know what you’re doing and want to try an automatic spatial or fulltext index.
Be aware that that is a tricky business.
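For reference, the legacy auto-index options in `neo4j.properties` look like this (the property keys listed are just examples):

```properties
# enable the automatic node index and choose which properties get indexed
node_auto_indexing=true
node_keys_indexable=name,email

# same for relationships
relationship_auto_indexing=true
relationship_keys_indexable=since
```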

So which should I use?

If you are using Neo4j 2.0 or above and do not have to support legacy code from a pre-2.0 era, use only schema indexes and avoid legacy indexes.
Conversely, if you are stuck with an earlier version of Neo4j and are unable to upgrade, you only have one type of index available to you anyway.

If you need full text indexing, regardless of Neo4j version, you will need to use legacy indexes.

The more complicated scenarios are those that involve a period of transition from one type of index to another.
In these cases, make sure you are fully aware of the differences and try, wherever possible, to use either schema or legacy indexes but not both.
Mixing the two will often lead to more confusion.

— Nigel Small

And if in doubt, ask a question on the Neo4j mailing list or on StackOverflow.


Natural Language Analytics made simple and visual with Neo4j

Posted by Michael Hunger on Jan 8, 2015 in cypher, fun

I was really impressed by this blog post on Summarizing Opinions with a Graph from Max and have been waiting ever since for Part 2 to show up :)

The blog post explains a really interesting approach by Kavita Ganesan which uses a graph representation of the sentences of review content to extract the most significant statements about a product.

opiniosis overview

Each word of a sentence is represented by a shared node in the graph, with the order of words reflected by relationships pointing to the next word; each relationship carries the sentence id and the position of its leading word.

By just looking at the graph structure, it turns out that the most significant statements (positive or negative) are repeated across many reviews.
Differences in formulation or inserted fill words only affect the graph structure minimally but reinforce it for the parts where they overlap.

You can find all the details of the approach in this presentation or the accompanying research.

I always joked that you could create this graph representation without programming just by writing a simple Cypher statement, but I actually never tried.

Until now. And to be honest, I’m impressed by how easy it was to write down the essence and then extend and expand the statement until it covered a large number of inputs.
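To give a taste of it, a minimal version of such a word-graph statement could look like this (the label, property and relationship names are my own choice for illustration, not necessarily what the full post uses):

```cypher
WITH split(tolower("the phone is great"), " ") AS words
UNWIND range(0, size(words) - 2) AS i
MERGE (w1:Word {name: words[i]})
MERGE (w2:Word {name: words[i + 1]})
MERGE (w1)-[:NEXT]->(w2)
```

Because MERGE reuses existing word nodes and relationships, feeding in many sentences reinforces the shared parts of the graph instead of duplicating them.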

Read more…


Spring Data Neo4j 3.3.0 – Improving Remoting Performance

Posted by Michael Hunger on Dec 9, 2014 in neo4j, spring-data-neo4j

With the first milestone of the Spring Data “Fowler” release train, Spring Data Neo4j 3.3.0.M1 was released. Besides a lot of smaller fixes, it contains one big improvement. I finally found some time to work on the remoting performance of the library, i.e. when used in conjunction with Neo4j Server. This blog post explains the history behind the issue and the first steps taken to address it.

In the past for many of its users, the remote performance of Spring Data Neo4j was not satisfying. The reasons for that were twofold – historical and development bandwidth. Let’s start with the history.


When Spring Data Neo4j started, neither Cypher, nor Neo4j-Server existed, only the embedded Java APIs of Neo4j were available.

So initially, the AspectJ-based, write- and read-through mapping mode and then later the simple mapping mode were built based on these in-process Java-APIs.

Later adding Neo4j server support was made “easy” with the java-rest-binding library which pretended to be an embedded GraphDatabaseService API but actually made remote calls to the Neo4j Server REST endpoints for each of the API operations.

Both were very bad ideas, not least because network transparency is a very leaky abstraction.

So what we ended up with was an Object-Graph-Mapper that assumed it was talking to an embedded database. In the embedded case, the frequency of API calls is not problematic, and re-fetching nodes and relationships by id is just a cache hit away.

But unknowingly making REST calls over the wire for each of the operations has quite an impact.

So a call that took nanoseconds or less in embedded mode may have taken as long as twice your network latency in the remote case. Combined with the sheer number of calls, this quickly added up to unsatisfying remote performance for the whole library.

Unfortunately, fixing this did not fit into the development bandwidth available for the library. In retrospect that was a really bad decision too, as many users (esp. in larger organizations) really like the Spring Data (Neo4j) convenience for CRUD and use-case-specific operations.

Recommendations for Working with Neo4j Server

Usually I recommended either moving the persistence layer of such an SDN application into a server extension and just exposing REST endpoints that use a domain-level protocol, or rewriting complex remote mapping logic into a few Cypher statements that are executed within the server.

Both suggestions are actually sensible and can improve the performance of your operations up to 20 times.

Performance Example

To show the implications, I created a small test project that runs the same medium-complex SDN-mapping operations for each of the setups, creating 1000 medium size business objects (9 entities and 8 relationships with some properties):

  • on an embedded database,

  • against a SDN based server extension and

  • remotely via Spring Data Neo4j (3.2.1.RELEASE).

For completeness I also added two tests that indicate remote Cypher performance, once via JDBC and once via the SDN Cypher execution.

The speed differences are quite big:



[Table: Time (ms) and Time/Op (ms) for SDN remote (3.2.1), SDN embedded, SDN server extension, SDN Cypher, and JDBC Cypher; the measured values were not recovered from the original page]

Fortunately, I recently got around to addressing at least a few of the root causes.


I looked into the hotspots of the remote execution, and fixed the ones with the highest impact.

I refrained from rewriting the whole object graph mapping logic within SDN as this is a much larger effort and will be worked on by our partner GraphAware as part of the SDN.next efforts (see below).

The places with the highest impact were:

  • A separate call to fetch node labels, as the REST representation doesn’t expose labels

  • Continuous re-fetching of nodes and relationships from the database as part of the simple mapping process (as only the id is available in the @GraphId field of the actual entity)

  • Setting properties individually via propertyContainer.setProperty(), which is the only API available in embedded mode


The existing java-rest-binding library also can’t expose any transaction semantics over the wire, as each REST-operation creates a new independent transaction within Neo4j Server.

The only approach supported for “larger transactions” was the REST batch operations endpoint, which encapsulated a number of operations in one single large HTTP request. But that API didn’t allow you to read your own writes within a transaction and make decisions based on that information. So this not only removed transactional safety but also created a lot of tiny transactions, all of which had to be forcibly synced to disk by Neo4j.

Changes impacting Performance

I started by comparing the remote performance of single REST-API calls with the appropriate Cypher calls and found that the former are twice as fast for single and simple operations. Cypher has its strengths in more complex operations and in running multiple statements within the same transaction with the new transactional endpoint.

As a first step, I inlined all the code of java-rest-binding that I still intended to use into Spring Data Neo4j and removed the operations that were no longer relevant. So starting from the next version, the java-rest-binding dependency will no longer be used.

For some quick gains I changed:

  • Loaded nodes together with their labels in one step using two batched REST calls, and only loaded labels from SDN if they were not already present

  • Used the label and id meta-information from Neo4j’s REST format for nodes (available since Neo4j 2.1.5)

  • Added a local client-side cache for nodes loaded from the server; updates and refreshes also go through this cache

  • Added a separate interface UpdateableState to RestEntity (nodes and relationships) that allows bulk updates of properties

  • Changed node and relationship creation to pass maps of properties to the initial create or merge call

All of those changes already improved the performance of the Spring Data Neo4j remote operations by a factor of 3 as shown by the sample project, but it was still not good enough.

Transactional Cypher #FTW

As you probably know, Neo4j supports a streaming HTTP endpoint that is able to run multiple Cypher statements per request and can keep a transaction running across multiple HTTP requests.
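For reference, a single auto-commit request against that endpoint in Neo4j 2.x posts statements and parameters as JSON to /db/data/transaction/commit (the statement here is just an example):

```json
{"statements": [
  {"statement": "MATCH (n) WHERE id(n) = {id} RETURN n",
   "parameters": {"id": 42}}
]}
```

Posting to /db/data/transaction instead keeps the transaction open and returns a URL for follow-up requests, which is what makes transactions spanning multiple HTTP calls possible.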

Although I originally wanted to use the Neo4j-JDBC driver, it was not yet available on Maven Central. So I deferred that and instead wrote a quick Jersey-based HTTP client for the transactional Cypher endpoint, which also supports batching of queries and running multiple HTTP requests within the same transaction.

The internal API looks like this:

String query = "MATCH (n) WHERE id(n) = {id} " +
               "RETURN id(n) as id, labels(n) as labels, n as data";

Map<String, Object> params = singletonMap("id", nodeId);

CypherTransaction tx = new CypherTransaction(url, ResultType.row);

Result result = tx.send(query, params);

// or batch several statements and commit them in one request
tx.add(query, params);
List<Result> results = tx.commit();

List<String> cols = result.getColumns();
if (result.hasData()) {
    for (Map<String, Object> row : result) {
        Long id = (Long) row.get("id");
        List<String> labels = (List<String>) row.get("labels");
        Map<String, Object> props = (Map<String, Object>) row.get("data");
    }
}

I then rewrote all remote operations that were expressible with Cypher from REST-HTTP calls into parameterized Cypher statements, for instance the Create-Node call into:

CREATE (n:`Label1`:`Label 2` {props})
RETURN id(n) as id, labels(n) as labels, n as data

This allowed me to set all the labels and properties of the node with a single CREATE operation and return the property data as well as the metadata like id and labels in a single call. I used the same return format consistently for nodes and relationships to map them easily back into the appropriate graph objects that SDN expects. The variant for relationships actually also returns start- and end-node-ids.
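A relationship variant consistent with that return format could look like this (a sketch based on the description above, not necessarily the exact statement SDN uses):

```cypher
MATCH (a), (b) WHERE id(a) = {start} AND id(b) = {end}
CREATE (a)-[r:`TYPE` {props}]->(b)
RETURN id(r) as id, type(r) as type, r as data,
       id(a) as start, id(b) as end
```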

The list of operations that were (re)written is pretty long:

  • createNode, mergeNode

  • createRelationship

  • getDegree, getRelationships

  • findByLabelAndPropertyValue, findAllByLabel

  • findByQuery (lucene)

  • setProperty, setProperties, addLabel, removeLabel

  • deleteNode, deleteRelationship

  • …​

All other methods still forward to the existing REST operations (e.g. adding nodes to legacy indexes).
The new Cypher based REST-Api-Impl also utilizes the node cache that I already mentioned.
Some of these operations also send multiple statements on the same HTTP-request.

All Cypher operations run within a transaction; if none is running, a single transaction is opened just for that operation. If a transaction has already been started (stored in a ThreadLocal), subsequent operations will participate in it. So if the transaction is started on the outside, e.g. at a method boundary annotated with @Transactional, all operations in the same thread will continue to use that transaction until the transactional scope is closed by a commit or rollback.

To integrate this functionality with the outside, the Cypher based Rest-API exposes a method to create new Transactions. Those are held in a thread-local variable so that you can run independent threads with individual, concurrent transactions.
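The thread-local bookkeeping described above can be sketched roughly like this (CypherTx and TxRegistry are hypothetical names for illustration, not SDN’s actual classes):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a transaction against the Cypher endpoint.
class CypherTx {
    final List<String> statements = new ArrayList<>();
    void add(String cypher) { statements.add(cypher); }
    void commit() { /* would flush all statements in one HTTP request */ }
}

class TxRegistry {
    // Each thread sees its own current transaction, so concurrent
    // threads can run independent transactions.
    private static final ThreadLocal<CypherTx> CURRENT = new ThreadLocal<>();

    // Join the running transaction if one exists, otherwise open a new one.
    static CypherTx currentOrNew() {
        CypherTx tx = CURRENT.get();
        if (tx == null) {
            tx = new CypherTx();
            CURRENT.set(tx);
        }
        return tx;
    }

    // Called after commit/rollback closes the transactional scope.
    static void close() { CURRENT.remove(); }
}
```

Operations on the same thread keep joining the same transaction until the scope is closed, which is exactly the behavior the Spring @Transactional integration relies on.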

For integration with the Java world, aka. JTA, I also implemented a javax.transaction.TransactionManager on top of that API which can be used on its own.
But of course for integrating with Spring it is injected into a Jta(Platform)TransactionManager in the Spring Data Neo4j configuration.

So whenever you annotate a method or class with @Transactional, the Spring transaction infrastructure will use that bean to tie into the remote transaction mechanism provided by the transactional Cypher endpoint.

It was pretty cool that it worked out of the box after I brought the individual pieces together.

To make this new remote integration usable from Spring Data Neo4j I created a SpringCypherRestGraphDatabase (an implementation of the SDN-Database API that is more comprehensive than Neo4j’s GraphDatabaseService).

This is what you should use now to connect your Spring Data Neo4j application remotely to a Neo4j Server.

@Configuration
@EnableTransactionManagement(mode = AdviceMode.PROXY)
public static class RemoteConfiguration extends Neo4jConfiguration {
    public RemoteConfiguration() {
        setBasePackage("org.example.domain"); // your entity package (example)
    }

    @Bean
    public GraphDatabaseService graphDatabaseService() {
        return new SpringCypherRestGraphDatabase(BASE_URI);
    }
}
The steps taken here improved the performance of the use-case we were looking at by a factor of 8, which is not that bad.



[Table: Time (ms) and Time/Op (ms) comparing SDN remote (3.2.1) with SDN remote (3.3.0); the measured values were not recovered from the original page]

My changes only addressed the remoting aspect of this challenge, the next step is to think big.

Ad Astra – SDN.next

We started work on completely rewriting the internals of Spring Data Neo4j to embrace a single, fast object graph mapping library for Neo4j.

As part of this effort which is mainly developed by our partner GraphAware in London, we will simplify the architecture that Spring Data Neo4j is built on.

While we will keep the external APIs that you see as SDN users as stable as possible, the internals will change completely.

The idea is to build a fast, pure Java-Object Graph Mapper that utilizes the transactional Cypher Endpoint.
It will provide APIs for specifying mapping metadata from the outside and focus on simple CRUD operations on your entities and on mapping Cypher query results into arbitrary result object structures (DTOs, view objects).

Spring Data Neo4j’s single future mapping mode will then utilize these APIs to provide mapping meta-information from its annotations, run the CRUD operations for updating and reading entities and support Cypher execution and result handling like you already use today.

As all that relies on the execution of compound Cypher statements, you can do much more in a single call, depending on how clever the OGM becomes.

And going forward its performance will benefit from all Cypher performance improvements, new schema indexes (spatial and fulltext) and new remoting protocols.

I’m really excited to accompany this work and see it advancing every day. If you want to get a glance of these developments, check out the GraphAware GitHub repositories. But please be patient, this is work in progress in its early stages and although it progresses quickly, the first publicly usable version is still a while out.

I hope you all join me on this journey and are as excited as I am about these latest developments.


The Story of GraphGen

Posted by Michael Hunger on Nov 1, 2014 in community, development, neo4j

This is the story behind the really useful and ingenious Neo4j example graph data generator developed by Christophe Willemsen.

I don’t just want to show you the tool but also tell the story of how it came to be.

First of all: The Neo4j Community is awesome.
There are so many enthusiastic and creative people, that it is often humbling for me to be part of it.

So on October 1st, Christophe tweeted out a short screencast he had recorded about a new tool (NeoGen) he was developing, which converted a YAML domain specification into Cypher statements to populate a Neo4j database.

Read more…

Copyright © 2007-2017 Better Software Development All rights reserved.