WikiData light

From BrightByte

Support for structured data records is virtually non-existent in MediaWiki. It's sorely missed on Wikipedia though, and there have been several attempts to implement it: full-blown systems like WikiData or Semantic MediaWiki, and also some more limited extensions like the Data Extension, PageAttributes, DataTable, or Infobox_Data_Capture.

To add my 2¢ to all that, here's what I came up with when thinking about the issue:

We would want a simple, flexible, minimalist approach that can be tested non-intrusively on Wikipedia, and whose use can be expanded over time. Conceptually, I would limit this to plain properties assigned to entities, where a page would represent exactly one entity. The set of possible properties would be divided into groups, which correspond to types or aspects of entities: examples would be person.firstName, book.title, place.longitude, etc.

Obvious applications of this would be to capture existing structured data from infoboxes, taxoboxes, geotags, etc. One use I would like to suggest (and which I may write about in detail later) would be bibliographic records: make a Book namespace, and put a data record and a short description of the book there. Split it into an <includeonly> and a <noinclude> section, so it can be used as a template, especially for <cite> entries.

Properties would be assigned to the entity described by a page using a parser function called #property, which would produce no visible output. For example, basic data about Albert Einstein could be recorded like this:

{{#property:person.firstName|Albert}}
{{#property:person.lastName|Einstein}}
{{#property:person.dateOfBirth|1879-03-14}}
{{#property:person.placeOfBirth|Ulm, Germany}}

This could of course be done in a template that is used on the actual page about Einstein, for example wp:de:Template:Personendaten - that way, only a single template needs to be edited, and structured data about thousands of entities would start to flow into the database automatically.
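To make the idea concrete, here is a minimal Python sketch of how such calls could be pulled out of a page's wikitext. It is not how MediaWiki parser functions are actually implemented; the names `PropertyRecord`, `PROPERTY_CALL` and `extract_properties` are made up for illustration, and the sketch assumes simple values that contain no nested templates or pipes.

```python
import re
from typing import NamedTuple


class PropertyRecord(NamedTuple):
    """One property assignment: group, property name, and value."""
    group: str
    name: str
    value: str


# Matches {{#property:group.name|value}}.
# Assumption: the value contains no braces or pipes (no nested templates).
PROPERTY_CALL = re.compile(r"\{\{#property:\s*(\w+)\.(\w+)\s*\|([^{}|]*)\}\}")


def extract_properties(wikitext: str) -> list[PropertyRecord]:
    """Return all simple #property assignments found in the wikitext."""
    return [
        PropertyRecord(group, name, value.strip())
        for group, name, value in PROPERTY_CALL.findall(wikitext)
    ]


if __name__ == "__main__":
    page = (
        "{{#property:person.firstName|Albert}}\n"
        "{{#property:person.lastName|Einstein}}\n"
        "{{#property:person.dateOfBirth|1879-03-14}}\n"
        "{{#property:person.placeOfBirth|Ulm, Germany}}\n"
    )
    for record in extract_properties(page):
        print(record)
```

A real implementation would hook into the parser rather than use a regular expression, so that values produced by template parameters are expanded before they are recorded.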

It might be useful to have a meta-page describing the meaning and use of the properties for each property group - in this example, maybe Wikipedia:Person properties. Also, it may be good to allow only a preconfigured set of properties to be used. In that case, a pattern could be defined that would allow MediaWiki to check the values supplied for each property: for example, for person.dateOfBirth, the regular expression -?\d{4}-\d{2}-\d{2} could be used to check for valid dates.
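The check against a preconfigured set of properties could look roughly like the following Python sketch. The table `ALLOWED_PROPERTIES` and the function `is_valid` are hypothetical names, and the patterns for the name properties are placeholders; only the date pattern is the one given above.

```python
import re

# Hypothetical whitelist: full property name -> pattern the whole value must match.
ALLOWED_PROPERTIES = {
    "person.firstName": re.compile(r".+"),  # placeholder: any non-empty text
    "person.lastName": re.compile(r".+"),   # placeholder: any non-empty text
    "person.dateOfBirth": re.compile(r"-?\d{4}-\d{2}-\d{2}"),
}


def is_valid(prop: str, value: str) -> bool:
    """Accept a value only if the property is configured and the value matches."""
    pattern = ALLOWED_PROPERTIES.get(prop)
    if pattern is None:
        return False  # property is not in the preconfigured set
    return pattern.fullmatch(value) is not None


if __name__ == "__main__":
    print(is_valid("person.dateOfBirth", "1879-03-14"))
    print(is_valid("person.dateOfBirth", "14 March 1879"))
```

Note that such a pattern only checks the format: a string like 1879-13-45 would pass, so real date validation would need more than a regular expression.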

In order to utilize this data, some special pages would have to be implemented. A generic query mask would be a start, allowing users to query pages (and properties of pages) by existence of a property or group, and by the value of a property, possibly restricted further by namespace and category. But it would also be possible, and probably useful, to have application-specific query pages that can be used to retrieve pages about books, people, places, etc., with a specialized query mask and results presented in a specialized form suitable for that kind of data.

The properties would be stored in a special database table (maybe called page_data, because page_props is already used for something else), with the following columns:

  • page: the id of the page the properties are assigned to; that is, the subject
  • group: the property group
  • property: the name of the property
  • value: the property's value
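As a rough illustration, the table and two of the generic queries mentioned earlier can be tried out with SQLite from Python. The column types, the index, the page ids and the helper function names are assumptions made for this sketch; MediaWiki itself would define the table in its own schema files. Note that group is a reserved word in SQL, so the column has to be quoted (or renamed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns as listed above; types and the index are assumptions.
conn.execute("""
    CREATE TABLE page_data (
        page      INTEGER NOT NULL,  -- id of the page (the subject)
        "group"   TEXT    NOT NULL,  -- property group, e.g. person
        property  TEXT    NOT NULL,  -- property name, e.g. lastName
        value     TEXT    NOT NULL   -- the property's value
    )
""")
conn.execute('CREATE INDEX page_data_lookup ON page_data ("group", property, value)')

# Illustrative rows; the page ids 1 and 2 are made up.
conn.executemany(
    "INSERT INTO page_data VALUES (?, ?, ?, ?)",
    [
        (1, "person", "firstName", "Albert"),
        (1, "person", "lastName", "Einstein"),
        (1, "person", "dateOfBirth", "1879-03-14"),
        (2, "place", "population", "123456"),
    ],
)


def pages_with_property(db, group, prop):
    """Ids of pages on which the given property exists, whatever its value."""
    rows = db.execute(
        'SELECT DISTINCT page FROM page_data '
        'WHERE "group" = ? AND property = ? ORDER BY page',
        (group, prop),
    )
    return [row[0] for row in rows]


def pages_with_value(db, group, prop, value):
    """Ids of pages on which the given property has exactly the given value."""
    rows = db.execute(
        'SELECT DISTINCT page FROM page_data '
        'WHERE "group" = ? AND property = ? AND value = ? ORDER BY page',
        (group, prop, value),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(pages_with_property(conn, "person", "lastName"))
    print(pages_with_value(conn, "place", "population", "123456"))
```

Restricting by namespace or category would mean joining against the existing page and category tables, which this sketch leaves out.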

It may be a good idea to allow comments or qualifications to values, especially if they are restricted by a pattern: for example, the population of a city could be given as {{#property:place.population|123456|in 2001}} indicating the date the census data was collected. This could be stored in an extra comment column in the database. This could also be used for citing sources, giving units of measurement, indicating disputed figures, etc. Also, if we ever get flagged revisions, property records would have to be subject to those flags too - if we have a "current" and a "reviewed" version of a page, we should also have a "current" and a "reviewed" version of the corresponding data record.
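The optional third argument could be handled along these lines. This is only a sketch in Python: `PropertyValue` and `parse_property_call` are hypothetical names, and the function assumes it is handed a single, complete #property call with no nested templates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyValue:
    """A property assignment with an optional comment or qualification."""
    name: str
    value: str
    comment: Optional[str] = None


def parse_property_call(call: str) -> PropertyValue:
    """Split {{#property:name|value|comment}} into its parts; comment is optional."""
    prefix, suffix = "{{#property:", "}}"
    if not (call.startswith(prefix) and call.endswith(suffix)):
        raise ValueError("not a #property call: " + call)
    # Split on at most two pipes; any further pipes stay inside the comment.
    parts = [part.strip() for part in call[len(prefix):-len(suffix)].split("|", 2)]
    if len(parts) < 2:
        raise ValueError("missing value in: " + call)
    comment = parts[2] if len(parts) == 3 else None
    return PropertyValue(parts[0], parts[1], comment)


if __name__ == "__main__":
    print(parse_property_call("{{#property:place.population|123456|in 2001}}"))
```

Keeping the comment separate from the value means a pattern check can still be applied to the value alone.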

Property data may also be saved to different tables depending on the property group: this may improve performance for much-used property groups, and it might also be used to allow sharing this data across wikis (though that entails several complex problems, like keeping the wikitext-version of the data synchronized).

To conclude, this way of associating structured data with articles would be simple to implement, and would fit well with the current way data is managed on Wikipedia. It could be tested in a restricted domain of application, like bibliographic or biographic records, and then be expanded over time to cover more and more things like geodata and other information from infoboxes.

Talk:WikiData light

better?

imho the name and group should go into a separate table.

well, normalization is good, until you have too much of it :) Not quite sure in this case. Worth thinking about. -- Daniel 23:31, 2 June 2008 (CEST)

Specific features

You mix several ideas. I'll try to separate them and add my own thoughts to each:

  • Having a tree-structure of properties (groups are just a limited case of trees of depth 1) like place.population, person.name.given, etc. - I am not sure the benefit justifies the rise in complexity
  • Putting data in parser-functions - the obvious benefit is probably ease of implementation
  • Comments/sources/additional information about statements - really useful! Data without this will mostly be POV
  • Data types, for instance with patterns - quickly wished for, but this will cause additional pain. Data types should be more than just formal patterns; you also need to define their semantics. Keep it for a second step.

Frankly, the benefit compared to templates is small in the beginning. We can already extract data from templates and analyze it offline, for instance at the toolserver. There will always be the need to first extract data from Wikipedia and then analyze and use it, so we should not try to implement too much intelligence in the MediaWiki core or a WikiData extension. My 2¢. -- JakobVoss 16:08, 3 June 2008 (CEST)


using it in multiple projects

How do you use it in srn.wikipedia or for the xh.wikipedia... and the over 120 Wikipedias that may not know Albert Einstein yet? Thanks, GerardM


