On Streaming Cypher

Posted by Michael Hunger on Apr 13, 2012 in development, java, neo4j

After being annoyed for a long time by the performance of the Neo4j REST protocol, I decided to have a look at streaming JSON last night. It seemed simple enough.

Today Peter pushed me to follow through and use the lab day to finish this lab project.

So I started a server-extension project that does two things differently. First, it uses a more compact format for the Cypher results than the current RESTful representation. Second, it uses streaming JSON, writing the result through a StreamingOutput into a Jersey Response.


For a query like the one on the first line below, it would return the following (the // lines are annotations, not part of the JSON):

start n=node(*) match p=n-[r]-m return  n as first,r as rel,m as second,m.name? as name,r.foo? as foo,ID(n) as id, p as path , NODES(p) as all
// columns
{"columns":["first","rel","second","name","foo","id","path","all"],
// rows is an array of array of objects each object is { type : value }
"rows":[
    [{"Node":{"id":0,"data":{"name":42}}},
     {"Relationship":{"id":0,"start":0,"end":1,"type":"knows","data":{"name":"rel1"}}},
     {"Node":{"id":1,"data":{"name":"n2"}}},
     {"String":"n2"},
     {"Null":null},
     {"Long":0},
     {"Path":
         {"length":1,
          "start":{"id":0,"data":{"name":42}},
          "end":{"id":1,"data":{"name":"n2"}},
          "last_rel":{"id":0,"start":0,"end":1,"type":"knows","data":{"name":"rel1"}},
          "nodes":[{"id":0,"data":{"name":42}},{"id":1,"data":{"name":"n2"}}],
          "relationships":[{"id":0,"start":0,"end":1,"type":"knows","data":{"name":"rel1"}}]}},
     {"Array":[{"id":0,"data":{"name":42}},{"id":1,"data":{"name":"n2"}}]}]],
// number of rows
"count":1,
// full runtime including streaming all the results
"time":29}

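To make the shape of this format concrete, here is a minimal Java sketch of a writer that emits it row by row. It is not the extension's actual code: the class name CompactResultWriter and its methods are hypothetical, it covers only the String, Long and Null cell types (no Node, Relationship, Path or Array), and tagging every integer value as Long is an assumption.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not the extension's real writer.
final class CompactResultWriter {

    // Wraps a value as { type : value }; only three cell types are handled here.
    static String cell(Object value) {
        if (value == null) return "{\"Null\":null}";
        if (value instanceof Long || value instanceof Integer) return "{\"Long\":" + value + "}";
        return "{\"String\":" + quote(value.toString()) + "}";
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char ch : s.toCharArray()) {
            if (ch == '"' || ch == '\\') sb.append('\\').append(ch);
            else if (ch < 0x20) sb.append(String.format("\\u%04x", (int) ch));
            else sb.append(ch);
        }
        return sb.append('"').toString();
    }

    // Writes the columns first, then one row at a time, then count and time.
    static void write(List<String> columns, Iterator<List<Object>> rows, OutputStream out) {
        long start = System.currentTimeMillis();
        try {
            Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            w.write("{\"columns\":[");
            for (int i = 0; i < columns.size(); i++) {
                if (i > 0) w.write(',');
                w.write(quote(columns.get(i)));
            }
            w.write("],\"rows\":[");
            long count = 0;
            while (rows.hasNext()) {
                List<Object> row = rows.next();
                if (count++ > 0) w.write(',');
                w.write('[');
                for (int i = 0; i < row.size(); i++) {
                    if (i > 0) w.write(',');
                    w.write(cell(row.get(i)));
                }
                w.write(']');
                w.flush(); // push each finished row out instead of buffering the whole result
            }
            w.write("],\"count\":" + count + ",\"time\":" + (System.currentTimeMillis() - start) + "}");
            w.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<List<Object>> rows = Arrays.asList(
                Arrays.<Object>asList("n2", 0L),
                Arrays.<Object>asList(null, 1L));
        write(Arrays.asList("name", "id"), rows.iterator(), System.out);
    }
}
```

Because the rows arrive through an Iterator and are flushed one by one, memory use in this sketch stays bounded by the size of a single row. This is also why count and time can only appear at the end of the document.
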
The Jersey (JAX-RS) code needed to stream the data is also simple enough:

StreamingOutput stream = new StreamingOutput() {
    @Override
    public void write(OutputStream output) throws IOException, WebApplicationException {
        try {
            // run the query and write the JSON result directly to the response stream
            service.execute(params.get("query"), params.get("params"), output);
        } catch (Exception e) {
            throw new WebApplicationException(e);
        }
    }
};
// Jersey calls write() once it starts sending the response body
return Response.ok(stream).build();
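
For readers without a Jersey setup at hand, the same idea (hand the response stream to the producer instead of building the whole body first) can be tried with the JDK's built-in com.sun.net.httpserver package. This is a stand-in, not the extension's code: the class name, the /stream path, port 8080 and the dummy row generator are all made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the Jersey resource, using only the JDK.
final class ChunkedStreamingDemo {

    // Dummy producer: writes n rows straight to the stream, flushing after each one.
    static void streamRows(int n, OutputStream out) {
        try {
            out.write("{\"rows\":[".getBytes(StandardCharsets.UTF_8));
            for (int i = 0; i < n; i++) {
                String row = (i > 0 ? "," : "") + "[{\"Long\":" + i + "}]";
                out.write(row.getBytes(StandardCharsets.UTF_8));
                out.flush();
            }
            out.write(("],\"count\":" + n + "}").getBytes(StandardCharsets.UTF_8));
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/stream", exchange -> {
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            // A response length of 0 selects chunked transfer encoding,
            // so the total size does not have to be known up front.
            exchange.sendResponseHeaders(200, 0);
            try (OutputStream body = exchange.getResponseBody()) {
                streamRows(1000, body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080); // then: curl http://localhost:8080/stream
    }
}
```

The point in both variants is that the producer never holds the complete result; it only ever sees an OutputStream.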

Performance-wise, it allows me to:

  • convert 1M Nodes into JSON (with 50MB data) and send it through a stream in 3 seconds
  • read 1M Nodes from a JSON-Stream (with 50MB data) in 3 seconds

To make it work with Neo4j-Server, just put the streaming-cypher-extension-1.7.M03.jar into neo4j-server/plugins and add this to the conf/neo4j-server.properties file:

org.neo4j.server.thirdparty_jaxrs_classes=org.neo4j.server.extension.streaming.cypher=/streaming

Then you can issue

curl -d'{"query":"start n=node(*) return n"}' -H accept:application/json -H content-type:application/json http://localhost:7474/streaming/cypher

to query the graph and stream the results.
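
The same request can also be issued from Java. The sketch below is only an illustration and assumes Java 11 or later for java.net.http; the class and method names are hypothetical, and the body builder escapes only quotes and backslashes, which is enough for the query shown here but not for arbitrary input.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch; needs Java 11+ for java.net.http.
final class StreamingCypherClient {

    // Builds the same JSON body as the curl call, escaping quotes and backslashes.
    static String jsonBody(String query) {
        return "{\"query\":\"" + query.replace("\\", "\\\\").replace("\"", "\\\"") + "\"}";
    }

    // Mirrors the curl flags: POST with accept and content-type set to application/json.
    static HttpRequest buildRequest(String url, String query) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody(query)))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpRequest request = buildRequest(
                "http://localhost:7474/streaming/cypher", "start n=node(*) return n");
        // ofInputStream() hands over the body as it arrives instead of collecting it first.
        HttpResponse<InputStream> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream body = response.body()) {
            body.transferTo(System.out);
        }
    }
}
```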

A sample Parser/Client implementation is in org.neo4j.server.extension.streaming.cypher.CypherResultReader which uses a callback to provide notifications about columns, rows and cells and meta-information.
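
The callback style can be illustrated without a JSON library. The following sketch is not the CypherResultReader mentioned above; it is a hypothetical, much cruder reader that only tracks nesting depth and hands each row to a callback as raw JSON text as soon as that row is complete. It assumes the layout shown earlier, where the only arrays at the third nesting level are the rows.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical sketch, not the CypherResultReader from the extension.
final class RowSplitter {

    // Scans the stream once and calls onRow with each row's raw JSON; returns the row count.
    static int splitRows(Reader in, Consumer<String> onRow) {
        int depth = 0, rows = 0;
        boolean inString = false, escaped = false;
        StringBuilder row = null; // non-null only while inside a row
        try {
            int c;
            while ((c = in.read()) != -1) {
                char ch = (char) c;
                if (inString) { // brackets inside strings must not change the depth
                    if (row != null) row.append(ch);
                    if (escaped) escaped = false;
                    else if (ch == '\\') escaped = true;
                    else if (ch == '"') inString = false;
                    continue;
                }
                if (ch == '"') {
                    inString = true;
                } else if (ch == '[' || ch == '{') {
                    depth++;
                    // outer object = 1, rows array = 2, so an array opening at depth 3 is a row
                    if (ch == '[' && depth == 3) row = new StringBuilder();
                }
                if (row != null) row.append(ch);
                if (ch == ']' || ch == '}') {
                    depth--;
                    if (depth == 2 && row != null) { // the row's closing bracket was just appended
                        onRow.accept(row.toString());
                        row = null;
                        rows++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String json = "{\"columns\":[\"id\"],\"rows\":[[{\"Long\":0}],[{\"Long\":1}]],\"count\":2,\"time\":1}";
        splitRows(new StringReader(json), row -> System.out.println("row: " + row));
    }
}
```

Only one row is held in memory at a time, so the consumer can process or discard each row before the next one arrives. When reading from a network stream, wrapping the Reader in a BufferedReader avoids the cost of character-by-character reads.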

Besides running the performance test, we also tried it on a fresh Neo4j Community 1.7.M03 server. We used a dirty Gremlin script to create 10k connected nodes:

// NSFW it assumes an empty db + single threaded access
(0..10000).inject(0) { count,idx -> v2=g.addVertex(); g.addEdge(g.v(idx),v2,"TYPE"); count+1;}

And then curl to retrieve the data:

curl -o result.txt -H accept:application/json -H content-type:application/json -d '{"query":"start n=node(*) match p=n-[r:TYPE]->m return n,r,m,p"}' http://localhost:7474/streaming/cypher

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
100 3255k    0 3255k  100    64  2134k     41  0:00:01  0:00:01 --:--:-- 2136k

The results were impressive: 3.2MB of data was queried, converted and transferred in 1 second (on a previous generation MBA).

Then I tried the same query on the existing Cypher endpoint and got an OutOfMemoryError after 3 minutes (it builds up the whole result set in memory before transferring it).

curl -o result.txt -H accept:application/json -H content-type:application/json -d '{"query":"start n=node(*) match p=n-[r:TYPE]->m return n,r,m,p"}' http://localhost:7474/db/data/cypher

% Total    % Received % Xferd  Average Speed   Time    Time     Time Current
Dload  Upload   Total   Spent    Left  Speed
100    64    0     0    0    64      0      0 --:--:--  0:01:44 --:--:--     0
100  4414  100  4350    0    64     25      0  0:02:54  0:02:50  0:00:04   879

I’m really interested in your feedback on this, both on the format and on the approach. I’m even more interested in clients in other languages that consume this data in a streaming manner.

After this experiment I also want to look into allowing the current Neo4j-REST API to be streamed across the wire (and also get a consistent, more compact / non-discoverable-links mode).

Update

  • Added mode=compat to emit the JSON format of the Neo4j REST API (without the discoverable URLs)
  • Added format=pretty for a pretty printed output (to allow streaming consumption by line-by-line parsers)

8 Comments

  • Vinicius says:

    Hi Michael, very interesting, thanks for sharing. Nice to know we may have this support for large graphs over the wire.

    One question (pardon me for the possible ignorance). Would it mean that we may see a long polling (comet like) one day on neo4j?

    I mean, that would be nice to have a persistent connection with the server and receive events over it, I don’t know what’s behind the scenes with neo4j but combining this with netty for instance should be very interesting.

    Thanks once again for the great post.

    Cheers

  • There has been some work on this that keeps sessions open and also allows transactions on sessions, but unfortunately it was not finished.

    Also the structr team uses a websocket connection to send messages to a thin layer in front of the Neo4j databases.

  • [...] Kollegger writes: Recently, Michael Hunger blogged about his lab work to use streaming in Neo4j’s REST interface. On lab days, everyone on the Neo4j team gets to [...]

  • delbe says:

    I just started researching solutions for a problem I have when I came across your blog entry. I have a long running process that will be building a graph, with nodes added at indeterminate intervals. I would like to have a web page that automatically, with no user interaction, updates whenever a new node is entered. Essentially I have a graph that is being built up slowly over time and that graph needs to be displayed as it is built. Will the streaming work you present here solve this problem? Does your streaming mechanism push updates from multiple transactions?

  • What you want is kind of an audit stream which is not really what the discussed streaming does.
    But it should be pretty simple to add to Neo4j as a Server Extension which uses a transaction-event-handler, streaming JSON and a long connection timeout.

  • [...] If you are planning on passing a large amount of data to and from a Neo4j instance, it is definitely worth considering using the native Embedded Database in Java to achieve this instead of using their REST API. Although language independent, their REST API as of 1.7 is considerably slower by several orders of magnitude than their embedded database. However, their new streaming REST API for 1.8-SNAPSHOT at the moment plans on being considerably faster than the current API (though still slower than the embedded database, naturally) – those interested should check out an article by the developer on the project. [...]

  • Karen Smrha says:

    We have a Neo4j 1.7.1 client and are looking into using Jackson to implement client-side Cypher streaming.

    Our problem now is that we sometimes get a result from the Neo4j server that is very large, and having it saved in one Java object is a problem.

    The system is in production so the hope is to implement a quick client fix using the non-streamed Neo4j 1.7.1 server and later upgrade to 1.8 and make use of server-side streaming.

    I see that the sample output for the streamed server has “rows” whereas our non-streamed server has “data”. I also see that the sample streamed server output has type:value pairs whereas our non-streamed server results have only values.

    My question is:

    Can a Jackson Cypher streaming client work correctly with a non-streamed Neo4j server?

    Thank you for your time,
    Karen

  • Hi Karen,

    if you have streaming in the client you can easily consume a result from a non-streaming server so that the client has less memory consumption (at least if you don’t keep the results around but just use them to calculate something).

    In Neo4j Server 1.8 we implemented streaming for the format that was there before (aka columns:[], data: [[row1],[row2]] ). So you should use that format.

Copyright © 2007-2014 Better Software Development All rights reserved.