On Streaming Cypher

Posted by Michael Hunger on Apr 13, 2012 in Uncategorized |

After being annoyed for a long time about the Neo4j REST protocol performance I decided to have a look at streaming JSON last night. It seemed simple enough.

Today Peter pushed me to continue through and use the Lab day for finishing the lab-project.

So I started to create a server-extension project that does 2 things differently. First it uses a more compact format for the cypher results than the current restful representation. Secondly it uses streaming JSON to send a StreamingOutput into a Jersey-Response.


For a query like this it would return:

start n=node(*) match p=n-[r]-m return  n as first,r as rel,m as second,m.name? as name,r.foo? as foo,ID(n) as id, p as path , NODES(p) as all
// columns
{"columns":["first","rel","second","name","foo","id","path","all"],
// rows is an array of array of objects each object is { type : value }
"rows":[
    [{"Node":{"id":0,"data":{"name":42}}},
     {"Relationship":{"id":0,"start":0,"end":1,"type":"knows","data":{"name":"rel1"}}},
     {"Node":{"id":1,"data":{"name":"n2"}}},
     {"String":"n2"},
     {"Null":null},
     {"Long":0},
     {"Path":
         {"length":1,
          "start":{"id":0,"data":{"name":42}},
          "end":{"id":1,"data":{"name":"n2"}},
          "last_rel":{"id":0,"start":0,"end":1,"type":"knows","data":{"name":"rel1"}},
          "nodes":[{"id":0,"data":{"name":42}},{"id":1,"data":{"name":"n2"}}],
          "relationships":[{"id":0,"start":0,"end":1,"type":"knows","data":{"name":"rel1"}}]}},
     {"Array":[{"id":0,"data":{"name":42}},{"id":1,"data":{"name":"n2"}}]}]],
// number of rows
"count":1,
// full runtime including streaming all the results
"time":29}

The Jersey- JAX-RS code is also simple enough to stream data:

StreamingOutput stream = new StreamingOutput() {
    public void write(OutputStream output) throws IOException, WebApplicationException {
        try {
            service.execute(params.get("query"), params.get("params"), output);
        } catch (Exception e) {
            throw new WebApplicationException(e);
        }
    }
};
return Response.ok(stream).build();

Performance wise it allows me to:

  • convert 1M Nodes into JSON (with 50MB data) and send it through a stream in 3 seconds
  • read 1M Nodes from a JSON-Stream (with 50MB data) in 3 seconds

To make it work with Neo4j-Server, just put the streaming-cypher-extension-1.7.M03.jar into neo4j-server/plugins and add this to the conf/neo4j-server.properties file:

org.neo4j.server.thirdparty_jaxrs_classes=org.neo4j.server.extension.streaming.cypher=/streaming

Then you can issue

curl -d'{"query":"start n=node(*) return n"}' -H accept:application/json -H content-type:application/json http://localhost:7474/streaming/cypher

to query the graph and stream the results.

A sample Parser/Client implementation is in org.neo4j.server.extension.streaming.cypher.CypherResultReader which uses a callback to provide notifications about columns, rows and cells and meta-information.

Besides running the performance test we also ran it on a fresh Neo4j Community 1.7.M03 server.  So we used a dirty gremlin script to create 10k connected nodes:

// NSFW it assumes an empty db + single threaded access
(0..10000).inject(0) { count,idx -> v2=g.addVertex(); g.addEdge(g.v(idx),v2,"TYPE"); count+1;}

And then curl to retrieve the data:

curl -o result.txt -H accept:application/json -H content-type:application/json -d '{"query":"start n=node(*) match p=n-[r:TYPE]->m return n,r,m,p"}' http://localhost:7474/streaming/cypher

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
100 3255k    0 3255k  100    64  2134k     41  0:00:01  0:00:01 --:--:-- 2136k

The results where impressive,3.2MB of data was queried, converted and transferred in 1 second (on a previous generation MBA).

Then I tried the same query on the existing cypher endpoint and got an OutOfMemory-Error after 3 minutes. (It builds up the whole result set in memory before transferring it).

curl -o result.txt -H accept:application/json -H content-type:application/json -d '{"query":"start n=node(*) match p=n-[r:TYPE]->m return n,r,m,p"}' http://localhost:7474/db/data/cypher

% Total    % Received % Xferd  Average Speed   Time    Time     Time Current
Dload  Upload   Total   Spent    Left  Speed
100    64    0     0    0    64      0      0 --:--:--  0:01:44 --:--:--     0
100  4414  100  4350    0    64     25      0  0:02:54  0:02:50  0:00:04   879

I’m really interested in your feedback on this, both the format and the approach. Even more in clients in other languages consuming this data in a streaming manner.

After this experiment I also want to look into allowing the current Neo4j-REST API to be streamed across the wire (and also get a consistent, more compact / non-discoverable-links mode).

Update

  • Added mode=compat to emit the json format of the Neo4j REST API (without the discoverable URLs)
  • Added format=pretty for a pretty printed output (to allow streaming consumption by line-by-line parsers)
Share and Enjoy:
  • Print
  • Digg
  • del.icio.us
  • Facebook
  • LinkedIn
  • Netvibes
  • PDF
  • Ping.fm

Leave a Reply

XHTML: You can use these tags:' <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Copyright © 2007-2015 Better Software Development All rights reserved.
Multi v1.4.5 a child of the Desk Mess Mirrored v1.4.6 theme from BuyNowShop.com.