On Creating a MapDB Schema Index Provider for Neo4j 2.0

Posted by Michael Hunger on May 11, 2013 in Uncategorized, code, java, neo4j

Writing a Neo4j 2.0 Schema Index Provider for MapDB

Neo4j 2.0 introduced real automatic indexes, backed by a new underlying indexing subsystem SPI. I thought it would be worthwhile to try it out and provide a faster indexing implementation than the default Lucene one. I chose MapDB for it, and the results are here on GitHub.

Using the index is quite easy from Cypher and the other APIs:

// Cypher
CREATE INDEX ON :Label(property)
// e.g.
CREATE INDEX ON :Person(name)

// the index is used automatically, but can be enforced with
MATCH (n:Person)
USING INDEX n:Person(name)
WHERE n.name = "Andres"
RETURN n

// Java
Label LABEL = DynamicLabel.label("foo");
String PROPERTY = "bar";

// Creation
Transaction tx = db.beginTx();
IndexCreator indexCreator = db.schema().indexCreator(LABEL).on(PROPERTY);
IndexDefinition indexDefinition = indexCreator.create();
tx.success(); tx.finish();
db.schema().awaitIndexOnline(indexDefinition, 5, TimeUnit.SECONDS);

// Usage, get Index Information
IndexDefinition index = IteratorUtil.single(db.schema().getIndexes(LABEL));
assertEquals(LABEL.name(), index.getLabel().name());

// Create matching Node
tx = db.beginTx(); // reuse the variable declared above
Node node = db.createNode(LABEL);
node.setProperty(PROPERTY, 42);
tx.success(); tx.finish();

// Find nodes
ResourceIterable<Node> nodes = db.findNodesByLabelAndProperty(LABEL, PROPERTY, 42);

MapDB is a very potent implementation of an efficient in-memory and persistent map structure, available either as a B-tree or as a hash map. It supports optimized serialization of arbitrary Java objects including collections, compresses data on the fly (including id compression) and much more. A very important feature for adding MapDB as an index provider is its support for snapshots.

Support for transaction-like semantics allows for batch updates, which suits the index provider well because it applies its updates in batches too.

A code-example from the MapDB website:

    import org.mapdb.*;

    //Configure and open database using builder pattern.
    DB db = DBMaker.newFileDB(new File("testdb")).closeOnJvmShutdown().make();

    //create new collection (or open existing)
    ConcurrentNavigableMap map = db.getTreeMap("collectionName");
    map.put(1,"one");
    map.put(2,"two");

    //persist changes into disk, there is also rollback() method
    db.commit();

    db.close();

So choosing MapDB as an index provider was really straightforward. Now the only small task left is to implement the SPI.

The requirements for implementing the SPI are quite simple. We have to tie into Neo4j’s lifecycle management with an IndexProviderFactory that registers the index provider. The provider itself implements SchemaIndexProvider, which supplies an IndexPopulator and an IndexAccessor that handle index updates, and an IndexReader that has to provide a repeatable-read snapshot view of the data in the index. Actually I just copied the code from org.neo4j.kernel.impl.api.index.InMemoryIndexProvider and adapted it for MapDB.

The MapDbIndexProviderFactory is tiny: it just returns the single instance of MapDbSchemaIndexProvider from the newKernelExtension lifecycle method.

@Service.Implementation(KernelExtensionFactory.class)
public class MapDbIndexProviderFactory extends KernelExtensionFactory<MapDbIndexProviderFactory.Dependencies> {
    public interface Dependencies {}

    private final MapDbSchemaIndexProvider singleProvider;

    public MapDbIndexProviderFactory() {
        // name and version
        super(new SchemaIndexProvider.Descriptor("mapdb-index", "1.0"));
        this.singleProvider = new MapDbSchemaIndexProvider();
    }

    @Override
    public Lifecycle newKernelExtension(Dependencies dependencies) throws Throwable {
        return singleProvider;
    }
}

To register the MapDbIndexProviderFactory we have to provide a file named org.neo4j.kernel.extension.KernelExtensionFactory in META-INF/services that contains the fully qualified name of our Factory, in its role as KernelExtensionFactory, which is: org.neo4j.index.mapdb.MapDbIndexProviderFactory.

The MapDbSchemaIndexProvider extends SchemaIndexProvider. It is also an instance of Lifecycle, so it implements init(), start(), stop() and shutdown(). In the constructor it registers itself with a descriptor and a priority (2 is higher than the default 1 for Lucene) and creates a MapDB database instance which is used later on.

public MapDbSchemaIndexProvider() {
    super(new SchemaIndexProvider.Descriptor("mapdb-index", "1.0"), 2);
    db = DBMaker.newFileDB(new File("mapdb-index"))
      .compressionEnable().closeOnJvmShutdown().make();
}
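To illustrate why the priority matters, here is a minimal sketch of "highest priority wins" selection. It is not Neo4j's actual selection code; ProviderInfo and ProviderSelection are hypothetical stand-ins, and only the two priorities (1 for Lucene, 2 for MapDB) come from the text above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a registered provider: just a name and a priority. */
final class ProviderInfo {
    final String name;
    final int priority;

    ProviderInfo(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }
}

final class ProviderSelection {
    /** Returns the provider with the highest priority; fails if none is registered. */
    static ProviderInfo highestPriority(List<ProviderInfo> providers) {
        if (providers.isEmpty()) {
            throw new IllegalArgumentException("no index provider registered");
        }
        return Collections.max(providers,
                Comparator.comparingInt((ProviderInfo p) -> p.priority));
    }

    public static void main(String[] args) {
        List<ProviderInfo> providers = Arrays.asList(
                new ProviderInfo("lucene", 1),        // default priority, per the post
                new ProviderInfo("mapdb-index", 2));  // the provider built here
        System.out.println(highestPriority(providers).name); // prints "mapdb-index"
    }
}
```

With both jars on the classpath, a selection like this would pick the MapDB provider simply because it declares the larger number.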

It keeps an internal CopyOnWriteHashMap of index instances keyed by index id, each of which is represented by the appropriate MapDB tree map. The three methods from SchemaIndexProvider provide access to each concrete index (indexId is unique per declared :Label(property) combination).

@Override
public MapDbIndex getOnlineAccessor(long indexId) {
    MapDbIndex index = indexes.get(indexId);
    if (index == null || index.state != InternalIndexState.ONLINE)
        throw new IllegalStateException("Index " + indexId + " not online yet");
    return index;
}

@Override
public InternalIndexState getInitialState(long indexId) {
    MapDbIndex index = indexes.get(indexId);
    return index != null ? index.state : InternalIndexState.POPULATING;
}

@Override
public MapDbIndex getPopulator(long indexId) {
    BTreeMap<Object,Set<Long>> map = db.getTreeMap(String.valueOf(indexId));
    MapDbIndex index = new MapDbIndex(map,db);
    indexes.put(indexId, index);
    return index;
}

The getPopulator method returns the IndexPopulator, which is responsible for updating the index. That happens within a separate class called MapDbIndex, which handles the addition or removal of batches of value->nodeId pairs to the MapDB tree-map instance. All of this happens in the implementations of updateAndCommit and recover, which both call update(Iterable updates). That method then looks at the mode of each NodePropertyUpdate to decide whether to add, remove or update data. In this demo I base the implementation on storing Sets of Long values for the node ids. The real implementation is a bit more involved to save space and skip (un-)boxing.

private void add(Object value, Long id) {
    Set<Long> ids=indexData.get(value);
    if (ids==null) ids = new HashSet<Long>();
    ids.add(id);
    indexData.put(value,ids);
}
private void remove(Object value, Long id) {
    Set<Long> ids=indexData.get(value);
    if (ids==null) return;
    ids.remove(id);
    indexData.put(value,ids);
}

public void update(Iterable<NodePropertyUpdate> updates) {
    for (NodePropertyUpdate update : updates) {
        switch (update.getUpdateMode()) {
            case ADDED:
                add(update.getValueAfter(),update.getNodeId());
                break;
            case CHANGED:
                remove(update.getValueBefore(), update.getNodeId());
                add(update.getValueAfter(),update.getNodeId());
                break;
            case REMOVED:
                remove(update.getValueBefore(), update.getNodeId());
                break;
            default:
                throw new UnsupportedOperationException();
        }
    }
    db.commit();
}
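The remark about saving space and skipping (un-)boxing could look roughly like the following. This is only a sketch of one possible approach, not the code from the repository: it stores a sorted long[] per value instead of a Set of Long, and a plain java.util.TreeMap stands in for the MapDB tree map. The class name PrimitivePostings is made up for the example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Value -> sorted long[] of node ids; a TreeMap stands in for the MapDB tree map. */
final class PrimitivePostings {
    private final Map<Object, long[]> indexData = new TreeMap<>();

    void add(Object value, long id) {
        long[] ids = indexData.get(value);
        if (ids == null) {
            indexData.put(value, new long[] {id});
            return;
        }
        int pos = Arrays.binarySearch(ids, id);
        if (pos >= 0) return;                  // id already indexed for this value
        int insertAt = -(pos + 1);             // binarySearch encodes the insertion point
        long[] grown = new long[ids.length + 1];
        System.arraycopy(ids, 0, grown, 0, insertAt);
        grown[insertAt] = id;
        System.arraycopy(ids, insertAt, grown, insertAt + 1, ids.length - insertAt);
        indexData.put(value, grown);           // write back a fresh array, like put() above
    }

    void remove(Object value, long id) {
        long[] ids = indexData.get(value);
        if (ids == null) return;
        int pos = Arrays.binarySearch(ids, id);
        if (pos < 0) return;                   // id was not indexed for this value
        if (ids.length == 1) {                 // last id gone: drop the key entirely
            indexData.remove(value);
            return;
        }
        long[] shrunk = new long[ids.length - 1];
        System.arraycopy(ids, 0, shrunk, 0, pos);
        System.arraycopy(ids, pos + 1, shrunk, pos, ids.length - pos - 1);
        indexData.put(value, shrunk);
    }

    /** Returns a copy of the ids for the value, or an empty array if there are none. */
    long[] lookup(Object value) {
        long[] ids = indexData.get(value);
        return ids == null ? new long[0] : ids.clone();
    }
}
```

Unlike the Set-based version, this variant also removes the key once its last node id is gone, so empty entries do not pile up in the map.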

The IndexReader must supply a repeatable-read view of the data, which it gets from MapDB’s treeMap.snapshot() facility. So implementing the MapDbIndexReader is not complicated.

@Override
public IndexReader newReader() {
    // hand the reader a snapshot so later updates stay invisible to it
    return new MapDbIndexReader((BTreeMap<Object, Set<Long>>) indexData.snapshot());
}

private static class MapDbIndexReader implements IndexReader {
    private final BTreeMap<Object, Set<Long>> snapshot;

    MapDbIndexReader(BTreeMap<Object, Set<Long>> snapshot) {
        this.snapshot = snapshot;
    }

    @Override
    public Iterator<Long> lookup(Object value) {
        final Set<Long> result = snapshot.get(value);
        return result == null ? IteratorUtil.<Long>emptyIterator() : result.iterator();
    }

}
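To make the repeatable-read requirement concrete, here is a small self-contained sketch using only the JDK. It does not use MapDB: the snapshot is imitated with a deep copy of a TreeMap, which is an assumption made purely for illustration (a real snapshot facility would presumably avoid copying everything). SnapshotReaderSketch and its Reader are hypothetical names.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class SnapshotReaderSketch {
    private final Map<Object, Set<Long>> indexData = new TreeMap<>();

    void add(Object value, long nodeId) {
        indexData.computeIfAbsent(value, k -> new HashSet<>()).add(nodeId);
    }

    /** Creates a reader over a frozen deep copy of the current index contents. */
    Reader newReader() {
        Map<Object, Set<Long>> copy = new TreeMap<>();
        for (Map.Entry<Object, Set<Long>> e : indexData.entrySet()) {
            copy.put(e.getKey(), new HashSet<>(e.getValue()));
        }
        return new Reader(copy);
    }

    static final class Reader {
        private final Map<Object, Set<Long>> snapshot;

        Reader(Map<Object, Set<Long>> snapshot) {
            this.snapshot = snapshot;
        }

        /** Looks up node ids as of the moment the reader was created. */
        Iterator<Long> lookup(Object value) {
            Set<Long> result = snapshot.get(value);
            return result == null ? Collections.<Long>emptyIterator() : result.iterator();
        }
    }

    public static void main(String[] args) {
        SnapshotReaderSketch index = new SnapshotReaderSketch();
        index.add("Andres", 1L);
        Reader before = index.newReader();
        index.add("Andres", 2L);                 // written after the reader was created

        int seen = 0;
        for (Iterator<Long> it = before.lookup("Andres"); it.hasNext(); it.next()) seen++;
        System.out.println(seen);                // prints 1: the later write is not visible
    }
}
```

A reader created after the second add would see both node ids, while the earlier reader keeps returning the same answer however often it is asked.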

That’s about it.

You just have to clone the repository, build it with mvn package, and put the jar file target/mapdb-index-1.0.jar as well as org.mapdb:mapdb:jar:0.9.1 on your classpath or in the server/plugins directory to use the index. For both you can also just use the contents of the generated target/target/mapdb-index-1.0-provider.zip.

So far in tests it was twice as fast as Lucene but there is certainly optimization potential.

In general it was really simple to implement the index provider, so I suggest you go ahead and try it for other NOSQL stores. Would really love to see some other implementations out there.

 

Cool first Neo4j 2.0 milestone – Now with Labels and “real” Indexes

Posted by Michael Hunger on Apr 10, 2013 in neo4j

With the addition of node labels, the property graph model that is the foundation of Neo4j was changed for the first time. It has already been thirteen years since the founders (Emil, Johan and Peter) sketched the original property graph model over some beers.

With the new node-label feature you can assign any number of types from your domain to a node. Imagine labels like Person, Location, Product, Project, User etc. Adding, querying and removing labels is supported in all Neo4j-APIs: Cypher, Java-API and REST-API (Batch-Inserter is in the works).

Starting with Neo4j 2.0, a last missing piece of Cypher functionality was added too. The new labels make it possible to provide label-based indexes which are handled automatically by the database. That means that after an index is created, all existing nodes with the label and properties are added to it behind the scenes, and once that task completes the index is updated transactionally.

These indexes are used by Cypher to perform index based lookups based on the label and properties that are part of the index. That either happens automatically for simple expressions or with an explicit index hint.

In this quick presentation the team outlines the basic ideas behind the node labels:

Introducing labels caused a change to the store format of Neo4j. As always, if you want to upgrade an older installation to Neo4j 2.0, the store will be upgraded automatically if you add the configuration option “allow_store_upgrade=true” to either conf/neo4j.properties or the configuration passed in to the embedded API.

This quick screencast shows some of these new features in action:

If you want to know how labels work, look no further, here are some examples:

CREATE a node with a label

CREATE (movie:Movie {title: "Matrix"})
RETURN movie,labels(movie);

Find the node via a label expression and property check

MATCH (movie:Movie)
WHERE movie.title="Matrix"
RETURN movie;

CREATE INDEX to speed up the previous query

CREATE INDEX ON :Movie(title);

Create a second labeled node and a relationship

CREATE (actor:Actor {name: "Keanu Reeves"})
WITH actor
MATCH (movie:Movie)
WHERE movie.title="Matrix"
CREATE (actor)-[role:ACTED_IN {role:"Neo"}]->(movie)
RETURN actor, role, movie;

Find the labeled nodes using only the MATCH expression

MATCH (actor:Actor)-[role:ACTED_IN]->(movie:Movie)
WHERE movie.title = "Matrix"
RETURN actor, role, movie;

Force USING the created index

MATCH (actor:Actor)-[role:ACTED_IN]->(movie:Movie)
USING INDEX movie:Movie(title)
WHERE movie.title = "Matrix"
RETURN actor, role, movie;

There is much more to indexes. Make sure to check out the documentation of Cypher, the Java-API, REST-API.

Besides that major new feature set there are some more niceties in the Neo4j 2.0 Milestone 1 release:

Cypher

You can now merge multiple query results with UNION [ALL]:

MATCH actor:Actor WHERE actor.age > 50 RETURN actor
UNION
MATCH director:Director RETURN director

For computing different values depending on an expression or value two versions of the well known CASE ... WHEN ... ELSE statement were added to Cypher:

MATCH (m:Movie)<-[r:RATED]-(u:User)
RETURN CASE r.stars
WHEN 5 THEN "awesome"
WHEN 3 THEN "good"
WHEN 1 THEN "bad"
ELSE "unknown"
END

MATCH (m:Movie)<-[r:RATED]-(u:User)
WITH m, count(r) AS ratings
RETURN m, CASE
WHEN ratings > 100000 THEN "blockbuster"
WHEN ratings < 10 THEN "flop"
ELSE "average"
END

Neo4j-Shell

Import, Export, Parameters

This is something I worked on and had much fun with, as I hadn't dived into the shell internals before.

The Neo4j shell can now DUMP the contents of the database or the result of a Cypher statement as a single big Cypher CREATE statement (not yet with labels). This can then be piped or pasted into another shell to create that (sub)graph. The shell can now also read commands directly from a file using the -f file option.

bin/neo4j-shell -path data/graph.db -c "dump" | bin/neo4j-shell -path data/new.db

bin/neo4j-shell -path data/graph.db -c "dump START n=node(*) match n-[r:LOVES]->m return n,r,m;" > love.cql
bin/neo4j-shell -path data/new.db -f love.cql

I also added support for executing parameterized statements by automatically passing shell variables (set with EXPORT) to a cypher statement. So you can run your parameterized queries without rewriting them for the shell.

EXPORT name="Keanu Reeves"
MATCH actor:Actor WHERE actor.name = {name} RETURN actor;

Neo4j Console

I also updated the Neo4j Test Console to the new 2.0 milestone and changed the default graph to use labels (:Crew and :Matrix for the characters).
To be new and shiny it also got the light theme as default that we already used on Neo4j.org

 

Parallel Batch Inserter with Neo4j imported 20 billion relationships on EC2

Posted by Michael Hunger on Oct 27, 2012 in code, java, neo4j

As massive data insertion performance has bothered me for a while, I made it the subject of my last lab days (20% time) at Neo4j. The results of my work are available on GitHub and I explain the approach below.

Data Insertion issues

When getting started with a new database like the graph database Neo4j it is important to quickly get an initial data-set to work with. Either you write a data-generator to generate it or you have existing data in relational- or NOSQL-databases that you want to import.

In both cases the import is unusual as oftentimes hundreds of millions or billions of nodes and relationships have to be imported in a short time. The normal write load of a graph database doesn’t cater for those insertion speeds. That’s why Neo4j has a BatchInserter that is able to import data quickly by loosening the transactional constraints but in doing so only working in a single thread.

If for instance only nodes without properties are imported, the inserter reaches a speed of 1 million nodes per second, which is nice. But as soon as relationships and properties come into the picture, the insertion speed drops noticeably. The reason for that degradation is that the single-threaded approach doesn’t utilize all available resources in a modern system. Neither the plethora of CPUs nor the high concurrent throughput (up to 200MB/s) of modern SSDs, even in multiple streams, is used.

Read more…

 

On Streaming Cypher

Posted by Michael Hunger on Apr 13, 2012 in development, java, neo4j

After being annoyed for a long time about the Neo4j REST protocol performance I decided to have a look at streaming JSON last night. It seemed simple enough.

Today Peter pushed me to follow through and use the lab day to finish the lab project.

So I started to create a server-extension project that does 2 things differently. First it uses a more compact format for the cypher results than the current restful representation. Secondly it uses streaming JSON to send a StreamingOutput into a Jersey-Response.
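The two ideas can be sketched together in plain Java. This is not the code of the server extension: the compact layout (column names once, then each row as a bare array) is an assumed format chosen for illustration, StreamingRows is a made-up class name, and a java.io.Writer stands in for the output stream that Jersey's StreamingOutput would hand over. String escaping is deliberately minimal.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class StreamingRows {
    /** Writes {"columns":[...],"rows":[[...],...]} one row at a time. */
    static void write(List<String> columns, Iterator<List<Object>> rows, Writer out)
            throws IOException {
        out.write("{\"columns\":");
        writeArray(columns, out);
        out.write(",\"rows\":[");
        boolean first = true;
        while (rows.hasNext()) {
            if (!first) out.write(',');
            writeArray(rows.next(), out);
            out.flush();                 // push each row out as soon as it is ready
            first = false;
        }
        out.write("]}");
        out.flush();
    }

    private static void writeArray(List<?> values, Writer out) throws IOException {
        out.write('[');
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) out.write(',');
            Object v = values.get(i);
            if (v == null || v instanceof Number || v instanceof Boolean) {
                out.write(String.valueOf(v));
            } else {
                out.write("\"" + escape(v.toString()) + "\"");
            }
        }
        out.write(']');
    }

    /** Escapes only backslashes and quotes; control characters are not handled here. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Convenience wrapper for trying the writer out in memory. */
    static String toJson(List<String> columns, Iterator<List<Object>> rows) {
        StringWriter out = new StringWriter();
        try {
            write(columns, rows, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<Object>> rows = Arrays.asList(
                Arrays.<Object>asList("Neo", 1),
                Arrays.<Object>asList("Trinity", 2));
        System.out.println(toJson(Arrays.asList("name", "id"), rows.iterator()));
    }
}
```

Because the rows arrive through an Iterator and are written as they come, the full result never has to be held in memory on the server side.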

Read more…

 

MovieHackDay Berlin Recap

Posted by Michael Hunger on Jun 6, 2011 in fun, neo4j

It was really a great event. Perfectly organized by MoviePilot.de (@Jannis) where Pere works as well.

You can find all information about the event on their site moviehackday.com and of course @moviehackday and #moviehackday (wiki)

There was free pizza, drinks (also beer), food coupons and cake from Pere, ping pong, lots of space, great conversations.

Some people had heard of Neo4j and graphs, some hadn’t, so Achim, Pere and I did a quick intro to graph databases.

For hacking 3 teams decided to go for Neo4j for recommendations / connections etc. which is great (two of which never used it before).

MoviePilot are using/going to use Neo4j for their international site (moviepilot.com). They have 60 people working, running everything on Ruby on Rails. Their German site is running on MySQL, but the German team got interested in Neo4j too, so they will perhaps add it for recommendations and such (imho: it should replace MySQL in the long run, we should work on that :)

They have a really great, golden, light-flooded office (previously a dance school, with mirrors and such, my wife even knew that one).
It is also a good location (Mehringdamm 33), easy to reach via subway and very cheap and great ho(s)tels around the corner (complete room for 3 w/ bathroom & breakfast for 42 EUR).

I sponsored a Kymera Magic Wand as 2nd prize, courtesy Neo4j.

I also announced the Graph Database Meetup Berlin there (during the demo show-off), even got a good location suggestion (c-base).

By the way, our team of four won the first prize (http://moviehackday.com/alien-egg.jpg), an Alien Collectors Set with “MovieTrail” a different visualization of a movie (like a twitter stream with geo-loc of the movie and appropriate images), all just extracted from the movie-script doc-file.
We didn’t use any storage so far, as we wanted to concentrate on the parsing, geo-lookup and visualization.

Public MovieTrail URL (dam’n it doesn’t show the images, although the urls are correct)
Repository is on github

I’m looking forward to the next movie-hackday, there were also suggestions on twitter to hold another one this year in the US which would be great.

Thanks again to the organizers.


 

Industrial Grade Barcode Scanning w/ an Apple iPad USB barcode scanner and camera connection kit

Posted by Michael Hunger on Oct 26, 2010 in fun, iOS

I was pondering the means of how to scan barcodes with an iPad for a while.

Today I found a really cheap solution. Just connect a stock USB keyboard barcode scanner via the camera connection kit USB connector to the iPad, and you’re done.

Now you can scan your book EANs for librarything or other things like price searches. But as I’m currently working on some iPad business applications, those would benefit even more from this setup.

Normally you would go with a quite expensive (300-700 USD) Bluetooth barcode scanner (one of those that is accepted by the iPad as a keyboard).

Now you can scan whatever barcodes your scanner is configured for (EAN, UPC, CODE128, CODABAR,I3OF9). Those USB scanners are also able to send prefix and postfix key strokes in addition to the barcode itself and also other optional information like the barcode type. All that can be used by your app to process the barcodes.

Another alternative that came up today is to use the iPod touch 4G as the barcode scanning device (creating an App that scans the barcodes with the typical barcode SDKs like redlaser or Big In Japan/Shop Savvy).


 

Keynote at 4developers: The Game Of Life – Java‘s Siblings and Heirs are populating the Ecosystem

Posted by Michael Hunger on Mar 29, 2010 in code, development, java, programming languages


I was invited to give a keynote talk at the 4developers conference in Poznan, Poland.

Topic

I’d like to talk about the Java.next programming languages on the JVM and polyglot programming. When pondering how to address this issue, two things came to my mind.
Read more…

 

Switching to Wordpress

Posted by Michael Hunger on Mar 6, 2010 in blogging

After running this blog for some years on Serendipity, I finally switched to wordpress. All my other blogs are running on wordpress, so this was the only black sheep.

I use the Aspire theme for many of my other blogs but for this one a work/desk like theme seemed more appropriate. So I took a quick look and chose Desk Mess Mirrored.

Importing the content worked like a charm using the S9 Importer plugin.

So here we are.

For the interested here is the list of the other blogs I’m running:


 

97TESPK: Scoping Methods

Posted by Michael Hunger on Mar 5, 2010 in 97TESPK, code, patterns, writing

Now that 97 things every programmer should know lies on my compass table, I’ll post my contributions here that didn’t make it into the book.

The first is “scoping methods” which I thought about while reading Uncle Bob Martin’s Clean Code. He discussed scoping variables but only about putting methods near to each other. Obviously there was a missing piece. I tried to write it down.
Read more…


 

12 patterns of development

Posted by Michael Hunger on Oct 5, 2009 in patterns

Martin Fowler’s patterns talk at #jaoo made me write this “12 patterns of development” song.

Have fun

Michael

On the twelfth day of Development,
my true dev sent to me
Twelve bridges bridging,
Eleven factories making,
Ten observers observing,
Nine builders building,
Eight visitors a-visiting,
Seven composites composing,
Six iterators iterating,
Five golden states,
Four calling proxies,
Three nice adaptors,
Two commands commanding,
And a singleton in a pair tree!


Copyright © 2007-2013 Better Software Development All rights reserved.