Real benefit of a graph database for Wikipedia
The real benefit of a graph database for Wikipedia would be not to trace the full structure from one given node but to find similar articles and categories. With Gremlin you should be able to query all categories that are most often used together with a given category, categories that are most often used in pages that link to a given page, etc. Such queries are very hard to answer with a relational database but easy in a graph database. By the way, you should ask the Neo4j community whether there is a bulk insert. 2500 arcs per second sounds low. Isn't the whole database held in main memory anyway? -- Jakob (not logged in, and I have only read the Gremlin specs.)
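As a rough illustration of the first query mentioned above (categories most often used together with a given category), here is a minimal in-memory sketch in plain Python rather than Gremlin. The page and category data and the function name `co_occurring_categories` are made up for the example; a real graph database would traverse category-to-page-to-category edges instead of scanning a dictionary.

```python
from collections import Counter

# Hypothetical toy data: page title -> set of categories on that page.
PAGE_CATEGORIES = {
    "Warsaw": {"Capitals in Europe", "Cities in Poland"},
    "Krakow": {"Cities in Poland", "World Heritage Sites"},
    "Wroclaw": {"Cities in Poland", "World Heritage Sites"},
    "Prague": {"Capitals in Europe", "World Heritage Sites"},
    "Gdansk": {"Cities in Poland"},
}


def co_occurring_categories(page_categories, category):
    """Count how often every other category shares a page with `category`."""
    counts = Counter()
    for cats in page_categories.values():
        if category in cats:
            # Count all categories on this page except the one we asked about.
            counts.update(cats - {category})
    # Most frequent co-occurring categories first.
    return counts.most_common()


if __name__ == "__main__":
    for cat, n in co_occurring_categories(PAGE_CATEGORIES, "Cities in Poland"):
        print(n, cat)
```

The same counting pattern would apply to "categories most often used in pages that link to a given page": select the linking pages first, then count their categories.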
- Hi Jakob! Finding similar pages and categories is indeed an interesting issue; I have been playing with that quite a bit in the context of WikiWord. I have been using feature vectors to calculate similarities, with links (in/out/category) as the features. Works pretty well, too :)
- Thanks for the pointer to Gremlin, that looks awesome! I'm not quite sure, though, that any of the graph implementations quite fits our needs. Neo4j is disk-based and a bit too slow. Having to use Lucene just to associate the native IDs is somewhat annoying. Maybe I'll play with TinkerGraph a bit and see how that works out. However, Java isn't very memory-efficient. I'm looking for something that would allow me to have the structure of all wikis in memory all the time... -- Daniel 18:44, 29 July 2010 (UTC)
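The feature-vector idea from the reply above can be sketched as follows. This is only an assumption about how such a comparison might look, not WikiWord's actual code: features are treated as binary (a link is present or not), so cosine similarity reduces to the size of the shared feature set divided by the geometric mean of the two set sizes. All data and names are hypothetical.

```python
import math

# Hypothetical toy data: page -> set of features. Each feature is a
# (kind, target) pair for an outgoing link, an incoming link, or a category.
FEATURES = {
    "Warsaw": {("out", "Poland"), ("out", "Vistula"), ("cat", "Cities in Poland")},
    "Krakow": {("out", "Poland"), ("out", "Vistula"), ("cat", "Cities in Poland"),
               ("in", "Wawel")},
    "Ethanol": {("out", "Alcohol"), ("cat", "Chemistry")},
}


def cosine_similarity(a, b):
    """Cosine similarity of two binary feature vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def most_similar(features, page):
    """Rank all other pages by similarity to `page`, best match first."""
    target = features[page]
    scores = [
        (other, cosine_similarity(target, feats))
        for other, feats in features.items()
        if other != page
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar(FEATURES, "Warsaw"):
        print(f"{score:.3f} {other}")
```

A weighted variant (for example, down-weighting very common link targets) would follow the same shape with dictionaries of weights instead of sets.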
Comparison
Interesting.
You give some numbers for how long it takes with Neo4j; it'd be interesting to have the numbers for SQL to compare against. --bawolff
- True, though it's hard to compare directly. The trivial, recursive approach would take extremely long; the algorithm I'm using with CatScan is much faster, but limited by the maximum size of SQL queries. Also, I'm afraid I can't invest much more time into benchmarking right now. -- Daniel 18:44, 29 July 2010 (UTC)
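To make the trade-off in the reply above concrete, here is one possible reading of it, sketched in Python. This is an assumption, not a description of CatScan's actual algorithm: instead of one query per category (the recursive approach), the tree is expanded level by level, and each level is fetched in batches that stand in for a single SQL query with a bounded `IN (...)` list. The data, `fetch_children`, and the `max_batch` parameter are hypothetical.

```python
# Hypothetical toy data: category -> direct subcategories.
# "Mechanics" points back to "Physics" to show that cycles are handled.
SUBCATEGORIES = {
    "Science": ["Physics", "Chemistry"],
    "Physics": ["Optics", "Mechanics"],
    "Chemistry": ["Alcohols"],
    "Optics": [],
    "Mechanics": ["Physics"],
    "Alcohols": [],
}


def fetch_children(batch):
    """Stand-in for one SQL query of the form `... WHERE parent IN (<batch>)`."""
    children = set()
    for cat in batch:
        children.update(SUBCATEGORIES.get(cat, []))
    return children


def all_subcategories(root, max_batch=2):
    """Collect all transitive subcategories of `root`, one level at a time.

    Returns the set of subcategories and the number of simulated queries.
    `max_batch` models the limit on how many IDs fit into one query.
    """
    seen = {root}
    frontier = [root]
    queries = 0
    while frontier:
        next_frontier = set()
        # Split the current level into batches that fit into one query each.
        for i in range(0, len(frontier), max_batch):
            next_frontier |= fetch_children(frontier[i:i + max_batch])
            queries += 1
        # Only expand categories we have not visited yet (guards against cycles).
        frontier = sorted(next_frontier - seen)
        seen.update(frontier)
    return seen - {root}, queries


if __name__ == "__main__":
    cats, queries = all_subcategories("Science")
    print(sorted(cats), "in", queries, "queries")
```

With a larger `max_batch` the number of queries approaches the depth of the tree, which is where a limit on query size would start to matter.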
CatScan is for readers
Hi BrightByte. Nice work!
I'd love to see a CatScan-like feature made more prominent, perhaps even included in MediaWiki itself rather than as an external tool (as discussed here). The fact that Neo4j is written in Java is no problem imho: for large sites we can expect them to install Java for tools like this (and Lucene). And for smaller sites, categories don't get that big, so a simple non-scaling solution in MediaWiki's core should work (in case smaller sites need that functionality at all).
It has always been my understanding that CatScan is not only for tech-folks and Wikipedia contributors, but mostly for readers. This more efficient and scaling approach gives the opportunity to provide that service to more users.
Bests, --Church of emacs 09:09, 29 July 2010 (UTC)
- In my experience, CatScan is mostly for editors and admins: people doing maintenance stuff in specific topic areas (look at new pages about physics, check disputed pages in the category religion, etc). But once it gets more efficient and better integrated, it could be used for each - in particular, it could be used to improve image search. Often enough, you find no images in a given category because they are all in subcategories. Being able to expand the result automatically to include subcategories would help wikipics a lot. -- Daniel 18:44, 29 July 2010 (UTC)
Request: Can you use this to create an index/classification system for Wikipedia articles?
Thanks a lot for your work. I work on offline releases of the English Wikipedia, and we have one remaining technical problem - we cannot easily produce an index of our articles. We produced a collection of 31,000 articles (called Version 0.7) earlier this year, but the index had significant problems in it. We produced it by analysing category keywords, and based on these I created a lookup table manually which assigned words like "Warsaw" to a category of "Poland-related", etc. We used this to create a classification scheme with things like Polish artists, etc. Not ideal, but it gave us a basic start.
We tried to look at the category organisation problem you've been addressing, but found it fraught with challenges caused by the hierarchical system. For example, this building comes under the high-level category of "Chemistry" because it is a public house, i.e., it serves alcohol, and alcohol is a chemical substance. However it also comes under "England" which is a good mapping for indexing purposes. Clearly if we create an index or classification scheme, it would be nice to list this under "Buildings in England" but not under "Chemistry". Can your program solve this?
Please forgive me, I don't read code, and my understanding of the technical side of this is very poor - I teach chemistry! But if you can see how to solve this problem for us, we would be extremely grateful! Thanks, Walkerma 19:39, 30 July 2010 (UTC)
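The effect described in this request can be reproduced with a small sketch. The category chain below is invented for illustration and is not the real Wikipedia hierarchy; the point is only that following parent categories transitively reaches both "England" and "Chemistry". The distance to each ancestor is one possible signal for preferring the closer one, offered here as an assumption rather than as a solution proposed in this discussion.

```python
from collections import deque

# Hypothetical toy data: category (or page) -> direct parent categories.
PARENTS = {
    "Example pub": ["Pubs in England"],
    "Pubs in England": ["Pubs", "Buildings in England"],
    "Pubs": ["Drinking establishments"],
    "Drinking establishments": ["Alcohol"],
    "Alcohol": ["Chemical substances"],
    "Chemical substances": ["Chemistry"],
    "Buildings in England": ["England"],
}


def ancestor_distances(parents, start):
    """Breadth-first search upwards; returns ancestor -> shortest distance."""
    distances = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for parent in parents.get(node, []):
            # The first visit is the shortest path because BFS goes level by level.
            if parent not in distances and parent != start:
                distances[parent] = depth + 1
                queue.append((parent, depth + 1))
    return distances


if __name__ == "__main__":
    distances = ancestor_distances(PARENTS, "Example pub")
    for top in ("England", "Chemistry"):
        print(top, "is", distances[top], "steps away")
```

In this toy graph "England" is closer than "Chemistry", so a distance cutoff would keep the first and drop the second; whether that holds for real category chains would have to be checked against actual data.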



