There has long been talk about a "data wiki", that is, a way to collect and maintain structured, factual data in a collaborative, wiki-like fashion. The most obvious application for this would be to manage the information we now see in Wikipedia's infoboxes on the right side of many articles. The basic requirements for such a system are:
- centralized. Data used on several web pages (wikis) is maintained in one place. There may, however, be multiple data wikis for different kinds of data.
- multilingual. If values are language-specific, it should be possible to enter a value for each language, and there should be a mechanism for selecting a language (or a preference list of languages) when querying results.
- versioned. The system must provide a mechanism to store all old revisions of a record, make them available upon request, and present differences between arbitrary revisions of records.
- scalable. The system should be able to handle dozens or hundreds of millions of records, with up to a hundred properties each, and with hundreds of revisions for each record.
- flexible. It should be easy to introduce new types of records and modify the specification of existing records, without disturbing the system.
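As a rough illustration of the multilingual requirement, here is a minimal sketch that assumes language-specific values are stored as a map from language code to value, and that a query passes a preference list of languages. The record layout and the name `pick_value` are assumptions made for this example, not part of any existing system.

```python
# Hypothetical record: language-specific values are maps keyed by language
# code; language-independent values are stored directly.
record = {
    "name": {"en": "Munich", "de": "München"},
    "population": 1_000_000,  # placeholder value, not real data
}


def pick_value(value, preferred_languages):
    """Return the value for the first preferred language that is available.

    Values that are not language maps are returned unchanged. None is
    returned if no preferred language has a value.
    """
    if not isinstance(value, dict):
        return value
    for lang in preferred_languages:
        if lang in value:
            return value[lang]
    return None


# A reader who prefers French, then German, then English gets the German name.
name = pick_value(record["name"], ["fr", "de", "en"])
```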
Requirements 1, 4 and 5 (centralized, scalable, flexible) are more or less met by existing document-based database systems like MongoDB, CouchDB or even Lucene. Multilingual values can be added without much trouble if the DB supports complex data values. Versioning, however, is a bit more tricky: none of the existing systems seems to support it.
With a bit of thought, however, versioning can be implemented on top of a regular document-based system (thank you, Dirk). In order to achieve this, we introduce meta-properties that are not part of the actual record's data, but are used for management. As a convention, we start the names of these properties with an underscore "_". We would need at least the following:
- _record_id: an id that is the same for all revisions of the same record, but unique among different records.
- _revision_is_current: boolean, true if this revision is the most current one. Could also become a bitfield in order to support flagged revisions.
- _revision_id: a globally unique id for that specific object instance.
- _revision_timestamp: the time at which the revision was created.
- _revision_user: the user who created that revision.
Optionally, there could also be:
- _parent_revisions: a list of revisions this revision was based upon.
- _child_revisions: a list of revisions that have been derived from this revision
That way, a complex revision graph including forks and merges could be modeled.
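To make the meta-properties concrete, the following sketch builds the first revision of a record as a plain document. Using random UUIDs for the ids and an ISO timestamp are assumptions of this example; the function name `new_record` is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def new_record(data, user):
    """Create the first revision of a record as a plain document (dict)."""
    doc = dict(data)  # the actual record data, e.g. {"name": "Berlin"}
    doc.update({
        "_record_id": uuid.uuid4().hex,      # shared by all revisions of this record
        "_revision_id": uuid.uuid4().hex,    # unique to this revision
        "_revision_is_current": True,
        "_revision_timestamp": datetime.now(timezone.utc).isoformat(),
        "_revision_user": user,
        "_parent_revisions": [],             # optional: empty for a first revision
        "_child_revisions": [],              # optional: filled in by later edits
    })
    return doc


berlin = new_record({"name": "Berlin"}, user="alice")
```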
In any case, the interesting bit is how references work in such a system. A property value may be a primitive value (such as a number or a text), or a reference to another record (or a list of such references). Such a reference would be modeled as a pair of ids: the _record_id and the _revision_id of the target record, the _revision_id being the one that was "current" when the reference was created.
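A reference of this kind might look as follows. This is only a sketch: representing the pair as a small dict, and the ids and property names used here, are assumptions of the example.

```python
def make_ref(target):
    """Build a reference to `target`, pinned to the revision passed in."""
    return {
        "_record_id": target["_record_id"],      # identifies the record as such
        "_revision_id": target["_revision_id"],  # the revision current right now
    }


# Hypothetical documents with made-up ids.
germany = {"_record_id": "rec-germany", "_revision_id": "rev-17",
           "_revision_is_current": True, "name": "Germany"}

berlin = {"_record_id": "rec-berlin", "_revision_id": "rev-42",
          "_revision_is_current": True,
          "name": "Berlin",                 # primitive value
          "capital_of": make_ref(germany),  # single reference
          "twinned_with": []}               # a list of references would go here
```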
When a record is edited, a new revision is created by cloning the previous revision, then modifying it. The following modifications are necessary:
- set _revision_is_current to false on the old revision.
- set the _revision_id to a new, unique id on the new revision.
- set the _revision_timestamp and _revision_user to appropriate values.
- change all references to other objects so their _revision_id corresponds to the latest revision of the target object.
- set _parent_revisions to include the old revision.
- set the old revision's _child_revisions to include this new revision.
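The steps above can be sketched as a single function. The sketch assumes an in-memory dict keyed by _revision_id standing in for the database, references stored as dicts holding exactly the two ids, and a simple (non-merge) edit with one parent; the helper names are hypothetical.

```python
import copy
import uuid
from datetime import datetime, timezone


def is_ref(value):
    """A reference is assumed to be a dict holding exactly the two ids."""
    return isinstance(value, dict) and set(value) == {"_record_id", "_revision_id"}


def current_revision_id(store, record_id):
    """Linear scan for the current revision; a real store would use an index."""
    for doc in store.values():
        if doc["_record_id"] == record_id and doc["_revision_is_current"]:
            return doc["_revision_id"]
    raise KeyError(record_id)


def refresh(value, store):
    """Point a reference (or each reference in a list) at the target's latest revision."""
    if is_ref(value):
        return {"_record_id": value["_record_id"],
                "_revision_id": current_revision_id(store, value["_record_id"])}
    if isinstance(value, list):
        return [refresh(item, store) for item in value]
    return value


def edit(store, old_revision_id, changes, user):
    """Create a new revision from an old one and register it in the store."""
    old = store[old_revision_id]
    new = copy.deepcopy(old)                 # clone the previous revision
    new.update(changes)                      # apply the actual edit
    old["_revision_is_current"] = False
    new["_revision_is_current"] = True
    new["_revision_id"] = uuid.uuid4().hex
    new["_revision_timestamp"] = datetime.now(timezone.utc).isoformat()
    new["_revision_user"] = user
    for key in list(new):                    # update references in the record data
        if not key.startswith("_"):
            new[key] = refresh(new[key], store)
    new["_parent_revisions"] = [old["_revision_id"]]
    new["_child_revisions"] = []             # the clone must not inherit children
    old.setdefault("_child_revisions", []).append(new["_revision_id"])
    store[new["_revision_id"]] = new
    return new
```

Note that the old revision is modified in two places only (the current flag and the child list); its data and its references stay exactly as they were.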
Because of the revision id in the references, it becomes possible to look at the web of records (the object network) in two ways:
- by looking at the _record_id in the references on a "current" revision, and resolving them to the current revisions of the target records (the ones with _revision_is_current set), we get a "current" or "trunk" view of the structure.
- by looking only at the _revision_id in the references on any revision, we get a "time warp" snapshot of the network that is consistent with the time at which that revision was created.
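The two ways of following a reference could be sketched like this, again assuming an in-memory dict keyed by _revision_id and references stored as a dict of the two ids. The function names and the sample data are made up for the example.

```python
def resolve_trunk(store, ref):
    """Follow the _record_id to whatever revision is current now."""
    for doc in store.values():
        if doc["_record_id"] == ref["_record_id"] and doc["_revision_is_current"]:
            return doc
    return None


def resolve_snapshot(store, ref):
    """Follow the _revision_id to the exact revision that was referenced."""
    return store.get(ref["_revision_id"])


# Two revisions of one hypothetical record; only the second is current.
store = {
    "g1": {"_record_id": "germany", "_revision_id": "g1",
           "_revision_is_current": False, "label": "old value"},
    "g2": {"_record_id": "germany", "_revision_id": "g2",
           "_revision_is_current": True, "label": "new value"},
}
ref = {"_record_id": "germany", "_revision_id": "g1"}

trunk_label = resolve_trunk(store, ref)["label"]        # "new value"
snapshot_label = resolve_snapshot(store, ref)["label"]  # "old value"
```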
This can all be done on top of a conventional document-based database system. However, this means a lot of overhead in terms of "old" revisions of objects. There are several ways the underlying platform could help with that:
- remove indexes from old revisions. When running queries against the full data set, we are generally only interested in the latest version (or, with flagged revisions, in the last "stable" version). Old revisions don't need to be indexed (except by record-id and revision-id, of course).
- allow "current" objects to be kept in a fast access storage (in addition to an MRU cache), while moving old revisions to slower but larger long term storage
- allow smart cloning, where only the values that get modified are stored in the new version of the object, and unmodified properties are taken from previous revisions. This however does not work well if the previous suggestion is implemented (moving old revisions to slower storage).
- in the store for old revisions, allow consecutive revisions to be bunched together and compressed
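To illustrate the smart cloning idea, here is a minimal sketch in which each stored revision holds only the properties that changed, and the full record is rebuilt by walking back through the parents. It assumes a linear history (only the first parent is followed) and does not handle deleted properties, which would need some kind of marker; `materialize` is a hypothetical name.

```python
def materialize(store, revision_id):
    """Rebuild the full record for a revision from a chain of partial revisions."""
    chain = []
    current = revision_id
    while current is not None:
        doc = store[current]
        chain.append(doc)
        parents = doc.get("_parent_revisions", [])
        current = parents[0] if parents else None  # assumption: linear history
    merged = {}
    for doc in reversed(chain):  # oldest first, so newer values win
        merged.update(doc)
    return merged


# The second revision stores only the property that changed.
store = {
    "r1": {"_revision_id": "r1", "_parent_revisions": [],
           "name": "Berlin", "population": 100},
    "r2": {"_revision_id": "r2", "_parent_revisions": ["r1"],
           "population": 120},
}
full = materialize(store, "r2")
```

Every read of a record touches its older revisions, which is why this approach clashes with moving old revisions to slower storage.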
How these things could best be implemented with the current database systems will need some more thought.
Talk:Versioning Structured Data
Semantic Wikipedia stores data as attributes ((Berlin) (has population) (number)) and relations ((Berlin) (is capital of) (Germany)), but real life is a bit more complicated. Each fact will need to be qualified:
(Berlin) (has population) (number) (date this was valid) (citation for the information)
(Berlin) (was capital of) (Bismarck's Germany) from (date) to (date) (citation)
(Berlin) (was capital of) (Hitler's Germany) from (date) to (date) (citation)
(Berlin) (was capital of) (East Germany) from (date) to (date) (citation)
(Berlin) (was capital of) (Reunited Germany) from (date) to (date) (citation)
How would you cope with this?
Validated data
I think something like this would be excellent - and perhaps it could be done in collaboration with the DBpedia people?
Some data never, or very rarely, change: the birth date of Beethoven, or the melting point of sodium chloride. Such things do not improve over many years of editing; they are either right or wrong. These data should (IMHO) be (a) checked rigorously against the most reliable sources possible, then (b) established as a reliable value.
On the English Wikipedia, the Chemicals WikiProject started a validation project, to try to do this for a handful of key data fields. To obtain the correct CAS number (a very popular identifier for chemical substances), we collaborated with CAS themselves to ensure that all of our numbers are correct (see this press release) and now all of our validated CAS numbers are (a) linked to a relevant record on a CAS website (set up through the collaboration) and (b) patrolled by a bot which flags all changes within a few seconds by replacing a green tick (OK!) with a red X (Danger!). (This bot is CheMoBot, written by User:Beetstra.)
This job of data validation should only be done once, then the data shared across different languages. Could your proposal flag data as validated, unvalidated or validated-then-changed? Could it help us share validated data across the different languages? Walkerma 01:51, 5 August 2010 (UTC)
Hmmm
If you already have records with IDs, then the versioning doesn't seem very different from pages in a normal wiki. But how do you assign those IDs and what do the records contain? What does a "reference to another record" look like? I think you need many small experiments as proof of concept before we have to worry about scalability and version history. --LA2
Inaccurate and changing data
Take for example the birthday of Jesus of Nazareth. Some say it is between 6 and 4 BC, others say it is between 7 and 2 BC. Another problem is, for example, the size of a country. It changes over the years because the shoreline changes. And there are disputed parts of a country where one party says they belong to it and the other party says they are theirs. Therefore, a lot of data exists that is full of problems.
DataTransclusion Extension
I am not sure if this is the only place to ask questions about the DataTransclusion extension. Is it possible to get a couple of examples of what to put in the LocalSettings.php file to create sources, say one for a fictional database and one for web access? Thanks.