Toolbox
  • Printable version
 
Toolbox
LANGUAGES
Language
Personal tools
Wikipedia Affiliate Button
 

Neo4j

From BrightByte

Jump to: navigation, search
Free Content
Freecontent.png

neo4j is a graph database written in Java (neo4j.org). I recently poked at it a little to see if it could be used to make fast queries over Wikipedia's category structure.

Contents

The Problem

Using the category structure when searching content on Wikipedia, or when looking for maintenance task in a specific topic area, has long been a pending item on the wishlist of a lot of people. Some years back, I wrote catscan to address the issue, but it's slow, truncates results, prone to failure, and generally ugly. So I'm looking for better ways to do this, and neo4j looks like an option.

But first off, a closer look at the problem: Categories on Wikipedia are not tags: they can't easily be combined (intersected), but they can put into relation to each other (making subcategories). A category can be a subcategoriy of several other categories: American Writers may be a subcategory of American people and Writers. By convention, there should be a single root category, and there should be no circles in the category structure, so the resulting graph is a directed graph that has no circles and is (weakly) connected. This is alsy called a poly-hierarchy. However, there is nothing that actually prevents circles, and nothing that forces the structure to be connected. So, both loops and islands may occur.

The most wanted feature now is commonsly called deep category intersection: we want all pages that are contained in two categories, while also considering all of their subcategories. Formally, this is the intersection of the transitive closure of the two categories alon the subcategory-relation. Calculating the transitive closure is typically done by recursively evaluating all subcategories. However, this is something traditional relational database systems are particularly bad at - it's only possible with lots of individual queries, which makes the proces quite slow.

The Idea

So, the idea is to use a database system that is better suited for managing a graph structure and analysing it. This is where neo4j comes in. Neo4j is an embedded, disk-based graph database library for Java. It can handle graphis with millions of nodes easily. To model Wikipedias category structure in neo4j, I wanted to load all category links between categories (that is, not including actual pages, the "leaf" nodes) connected by a single type of relation ("CONTAINS") into neo4j, and then tried how fast I could traverse all subcategories of a given category recursively.

The Program

The program I wrote for this is called CatGraph [1]. There are several things to note about it:

Indexing

Neo4j is purely a graph database: it lets you navigate nodes and relations in your graph, associate properties, etc. It does not however provide any way to find nodes by some property. I however needed a way to find the node in the graph that corresponds to a particular category by its MediaWiki page-id (the title would be more convenient, but would require a lot more memory).

The solution comes as an optional module for neo4j: while there is no index mechanism built into neo4j, it provides an abstract interface for indexing nodes and querying the index. It also provides an implementation for the indexing service, based on Lucene. So, whenever I create a node for a given page-id, i put it into the index. And when I need a node for a given page-id, I look it up in the index, and only create it if it's not already there. Here's the code:

	GraphDatabaseService graphDb = new EmbeddedGraphDatabase( dbDirectory );
	IndexService indexer = new LuceneIndexService(graphDb);
 
	....
 
	public Node getNodeByPageId(int pageId) {
		return indexer.getSingleNode("page_id", pageId);
	}
 
	public Node aquireNodeByPageId(int pageId) {
		Node n = getNodeByPageId(pageId);
 
		if (n==null) {
			n = graphDb.createNode(); 
			n.setProperty("page_id", pageId);
			indexer.index(n, "page_id", pageId);
		}
 
		return n;
	}

Transactions

One pitfall I found are transcations. More precisely, I did not recon on the fact that all operations done within a transcation are buffered in memory until the transaction is committed. So, if you do too many operations in the same transaction, you will get an OutOfMemoryError at some point. The solution is to devide the insert operations into smaller chunks that can each be wraped in a transaction - in my cased, I added 100000 relations per transaction. Here's some code (somewhat simplified):

	public void loadArcs(DataCursor<Pair<Integer, Integer>> args) throws PersistenceException {
		int count = 0;
		Transaction tx = null;
		try {
			Pair<Integer, Integer> row ;
			while ((row = args.next()) != null) {
				int from = row.getA();
				int to = row.getB();
 
				if (tx==null) tx = graphDb.beginTx();
 
				Node f = aquireNodeByPageId(from);
				Node t = aquireNodeByPageId(cat);
 
				t.createRelationshipTo( f, CategoryRelationships.CONTAINS );
 
				if ( count++ % 100000 ) {
					if (tx!=null) {
						tx.success();
						tx.finish();
						tx = null;
					}
				}
			}
 
			tx.success();
		} finally {
			if ( tx != null) {
				tx.finish();
			}
		}
 
	}

Neo4j's transactions can be nested, and a generally rather painless to use [2].

Performance Tuning

I did some basic tuning of neo4j's configuration and also of the Java-VM I used for running my tests, based on the configuration hints given in the manual [3]. First of all, neo4j uses memory-mapping to allow fast access to the graph structure. I chose a setup that is optimized for traversal:

neostore.nodestore.db.mapped_memory=15M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=15M
neostore.propertystore.db.strings.mapped_memory=0M
neostore.propertystore.db.arrays.mapped_memory=0M

I didn't play with this much, so it's probably not optimal, but it worked ok.

As to the Java-VM, I used Suns Java HotSpot 64 bit server engine, with 312 MB of heap space and the concurrent mark sweep garbage collector, as suggested by the neo4j manual. The comand line looked like this:

java -d64 -server -Xmx312m  -XX:+UseConcMarkSweepGC ...

Findings

For the test run, I created a dump of all subcategory relations in the German language Wikipedia, that is, a dump of the category structure without the leaf nodes (regular pages):

SELECT cl_from, page_id FROM categorylinks JOIN page ON cl_to = page_title AND page_namespace = 14

This resulted in a CSV file with about four and a half million entries (4'623'236 exactly). Loading this graph into neo4j, using my CatGraph program, took about half an hour at a rate of 2500 arcs per second on one core of amaranth (a mostly idle Sun Fire X4150 with 8 cores and 32GB RAM). The test system has fast disk I/O (12 disks in an external array, striped).

Finding all descendants (subcategories) of a smallish category like Berlin or Physiker took about 200ms. Very large categories like Geschichte or Wissenschaft were done after 4 minutes, an attempt to traverse the entire structure starting from !Hauptkategorie failed with an OutOfMemoryException.

Cutting off the traversal at a given depth (8 perhaps) would probably lead to reasonably quick results in allmost all cases, but this needs further investigation. Note that cutting off the traversal at a sertain depth may even improve the quality of the result: due to the rather unkept nature of the category structure, the topic relations become rather blurry after a few steps. Why is Irland (Ireland) filed under Wissenschaft (science)? Because geography is a science, and Irland is a geographic category...

Conclusion

With some limitations and a bit of tweaking, Neo4j seems to be a viable option for working with the category structure. What makes it particularly attractive is the fact that it can be closely integrated with Lucene: Wikimedia's search infrastructure is based on Lucene, Neo4j could easily be incorporated into this infrastructure for doing category-based queries.

However, a purpose-build graph management server, which keeps all data in memory at all times, and does not suffer the overhead of a high level object oriented API, is likely to deliver much better results, in terms of speed and also memory efficiency. This is an option that should be explored further.

[talk page]Talk:Neo4j

Contents

[edit] Real benefit of a graph database for Wikipedia

The real benefit of a graph database for Wikipedia would be not to trace the full structure from one given node but to find similar articles and categories. With Gremlin you should be able to query all categories that are most often used with a given other category, categories that are most often used in pages that link to a given page etc. Such queries are very hard to answer with a relational database but easy in a graph database. By the way you should ask the Neo4j community if there is a bulk insert. 2500 arc per seconds sounds little. Isn't the whole database hold in main memory anyway? -- Jakob (not logged on and only read the Gremlin specs.)

Hi Jakob! Finding similar pages and categories is indeed an interresting issue, I have been playing with that in the context of WIkiWord quite a bit. I have been using feature vectors to calculate similaries, and have been using links (in/out/category) as the features. Works pretty well, too :)
Thanks for the pointer to Gremlin, that looks awesome! I'm not quite sure though any of the graph implementations quite fit our needs. Neo4j is disk-base and a bit too slow. Having to use Lucene just to associate the native IDs is somewhat annoying. Maybe I'll play with TinkerGraph a bit and see how that works out. However, Java isn't very memory-efficient. I'm looking for something that would allow me to have the structure of all wikis in memory all the time... -- Daniel 18:44, 29 July 2010 (UTC)

[edit] comparision

Interesting.

You give some numbers for how long it takes with neo4J, it'd be interesting to have the numbers for sql to compare against. --bawolff

true - it's hard to compare directly though. The trivial, recusive approach would take extremly long, the algoithem i'm using with catscan is much faster, but limited by the max size of sql queries. Also, I'm afraid I can't invest much more time into benchmarking right now. -- Daniel 18:44, 29 July 2010 (UTC)

[edit] CatScan is for readers

Hi BrightByte. Nice work!

I'd love to see a CatScan-like feature more prominent, perhaps even include it in MediaWiki itself, rather than an external tool (as discussed here]). The fact that Neo4j is written in Java is no problem imho – for large sites we can expect them to install Java for tools like this (and Lucene). And for smaller sites, categories don't get that big, so a simple non-scaling solution in MediaWiki's core should work (in case smaller sites need that functionality at all).

It has always been my understanding that CatScan is not only for tech-folks and Wikipedia contributors, but mostly for readers. This more efficient and scaling approach gives the opportunity to provide that service to more users.

Bests, --Church of emacs 09:09, 29 July 2010 (UTC)

In my experience, CatSCan is mostly for editors and admins: people doing maintenance stuff in specific topic areas (look at tnew pages about physics, check disputed pages in the category religion, etc). But once it gets more efficient and better integrated, it could be used for each - in particular, it could be used to improve image search. Often enough, you find no images in a given category - because they are all in subcategories. Being able to expand the result automatically to include subcategories would help wikipics a lot. -- Daniel 18:44, 29 July 2010 (UTC)

[edit] Request: Can you use this to create an index/classification system for Wikipedia articles?

Thanks a lot for your work. I work on offline releases of the English Wikipedia, and we have one remaining technical problem - we cannot easily produce an index of our articles. We produced a collection of 31,000 articles (called Version 0.7) earlier this year, but the index had significant problems in it. We produced it by analysing category keywords, and based on these I created a lookup table manually which assigned words like "Warsaw" to a category of "Poland-related", etc. We used this to create a classification scheme with things like Polish artists, etc. Not ideal, but it gave us a basic start.

We tried to look at the category organisation problem you've been addressing, but found it fraught with challenges caused by the hierarchical system. For example, this building comes under the high-level category of "Chemistry" because it is a public house, i.e., it serves alcohol, and alcohol is a chemical substance. However it also comes under "England" which is a good mapping for indexing purposes. Clearly if we create an index or classification scheme, it would be nice to list this under "Buildings in England" but not under "Chemistry". Can your program solve this?

Please forgive me, I don't read code, and my understanding of the technical side of this is very poor - I teach chemistry! But if you can see how to solve this problem for us, we would be extremely grateful! Thanks, Walkerma 19:39, 30 July 2010 (UTC)

This is indeed a tricky problem. Neo4j itself doesn't do much to solve it, but perhaps WikiWord could help. If you like, send me an email, so we can discuss this a bit more. It's especially unclear to me what kind of index you need, and for what purpose... -- Daniel 13:54, 31 July 2010 (UTC)


The above comments may have been left by visitors.

This site's operators can not take responsibility for the content of such comments.