
Extending MediaWiki's dump format

(Redirected from Better dumps)

From BrightByte


There are quite a few things missing from MediaWiki's dump format as specified. Since Brion talked about adding some things to support incremental dumps (yay!), this might be a good time to make a few more suggestions. I would like to propose additional information in two categories: export problems and compatibility issues.

Export problems: currently, if Special:Export encounters problems, pages or revisions are silently ignored. This can be quite confusing, and even lead to data loss, if you are trying to use this to transfer pages to another wiki. Some of the conditions that may cause pages or revisions to be left out are:

  • The requested page does not exist
  • The requested page has restricted read access
  • The maximum number of pages was exceeded, so further pages were omitted
  • The maximum number of revisions was exceeded, so the revision history was truncated
  • Only the last revision was requested

It would be good to have a general-purpose <notice> element. It should contain a canonical message identifier that can be resolved to a localized message on the target system but is still understandable on its own. This matters because new versions of MediaWiki can then issue new messages that older systems can still handle in a meaningful way. For example:

 <mediawiki>
   <siteinfo>
    ....
   </siteinfo>
   <notice>using-last-revision-only</notice>
   <page>
     <title>Foo</title>
     <id>1234</id>
     <revision>
        ...
     </revision>      
   </page>    
   <page>
     <title>Bar</title>
     <notice>not-found</notice>
   </page>    
   <page>
     <title>Secret:Bar</title>
     <notice>access-denied</notice>
   </page>    
   <notice>page-limit-exceeded</notice>
 </mediawiki>

Showing the corresponding warnings on import would avoid silent data loss. If no localized message is found for a given notice key, the key itself could be shown, which should still convey the basic issue, at least to English-speaking users.
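
To illustrate the fallback rule just described, here is a minimal sketch of how an importing tool might collect <notice> elements. It is not MediaWiki's importer: the function name collect_notices, the LOCAL_MESSAGES table and its wording are hypothetical, and the sample assumes the dump carries no XML namespace declaration, as in the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical local message table; the keys are taken from the example above,
# the wording is made up for illustration.
LOCAL_MESSAGES = {
    "not-found": "The requested page does not exist.",
    "access-denied": "The requested page has restricted read access.",
}


def collect_notices(dump_xml, messages=LOCAL_MESSAGES):
    """Return (scope, text) pairs for every <notice> in the dump."""
    root = ET.fromstring(dump_xml)
    warnings = []
    # Dump-level notices are direct children of <mediawiki>.
    for notice in root.findall("notice"):
        key = (notice.text or "").strip()
        # Unknown keys fall back to the key itself.
        warnings.append(("dump", messages.get(key, key)))
    # Page-level notices sit inside a <page>, next to its <title>.
    for page in root.findall("page"):
        title = page.findtext("title", default="")
        for notice in page.findall("notice"):
            key = (notice.text or "").strip()
            warnings.append((title, messages.get(key, key)))
    return warnings


SAMPLE = """<mediawiki>
  <notice>using-last-revision-only</notice>
  <page><title>Foo</title><id>1234</id></page>
  <page><title>Bar</title><notice>not-found</notice></page>
  <page><title>Secret:Bar</title><notice>access-denied</notice></page>
  <notice>page-limit-exceeded</notice>
</mediawiki>"""

for scope, text in collect_notices(SAMPLE):
    print(f"{scope}: {text}")
```

Keys without a local translation, such as page-limit-exceeded here, are reported verbatim, which is the degraded but still readable behaviour the proposal asks for.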

Compatibility issues: the dump should contain information that allows MediaWiki to check on import whether all prerequisites for handling the imported text are met. This includes installed extensions (especially tag hooks and parser functions / magic words). This information should be included in the <siteinfo> element, perhaps like this:

 <siteinfo>
   <sitename>MediaWiki</sitename>
   <base>http://www.mediawiki.org/wiki/MediaWiki</base>
   <generator>MediaWiki 1.12alpha</generator>
   <case>first-letter</case>
   <extension uri="http://www.mediawiki.org/wiki/Extension:CategoryTree" type="tag">
       CategoryTree
   </extension>
   <extension uri="http://meta.wikimedia.org/wiki/Help:ParserFunctions" type="parserfunction">
       ParserFunctions
   </extension>
   ...
 </siteinfo>

This information could be taken from the $wgExtensionCredits array (though we don't seem to have a category for parser function extensions yet - that would be needed).

On import, MediaWiki should issue a warning if one of the extensions is not installed, or if the local version of MediaWiki is older than the one that created the dump. Both cases may lead to the imported wikitext not being rendered as expected. Perhaps it would even be good to show a warning if the local $wgCapitalLinks setting differs from the one indicated by the <case> element.
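
The extension and case checks could look roughly like the sketch below. It reads the proposed <extension> elements, which do not exist in the current dump format, and the helper names missing_extensions and case_mismatch are hypothetical. It also assumes that "first-letter" in <case> corresponds to $wgCapitalLinks being true, and it leaves out the version comparison, since parsing generator strings such as "MediaWiki 1.12alpha" is a separate problem.

```python
import xml.etree.ElementTree as ET


def missing_extensions(siteinfo_xml, installed):
    """Return (name, type, uri) for each <extension> not in `installed`."""
    siteinfo = ET.fromstring(siteinfo_xml)
    missing = []
    for ext in siteinfo.findall("extension"):
        # The extension name is the element text, padded with whitespace.
        name = (ext.text or "").strip()
        if name not in installed:
            missing.append((name, ext.get("type"), ext.get("uri")))
    return missing


def case_mismatch(siteinfo_xml, local_capital_links):
    """True if <case> disagrees with the local $wgCapitalLinks value.

    Assumption: "first-letter" means $wgCapitalLinks is true.
    """
    case = ET.fromstring(siteinfo_xml).findtext("case", default="first-letter")
    return (case.strip() == "first-letter") != local_capital_links


SITEINFO = """<siteinfo>
  <sitename>MediaWiki</sitename>
  <generator>MediaWiki 1.12alpha</generator>
  <case>first-letter</case>
  <extension uri="http://www.mediawiki.org/wiki/Extension:CategoryTree" type="tag">
      CategoryTree
  </extension>
  <extension uri="http://meta.wikimedia.org/wiki/Help:ParserFunctions" type="parserfunction">
      ParserFunctions
  </extension>
</siteinfo>"""

# Pretend only CategoryTree is installed on the importing wiki.
for name, kind, uri in missing_extensions(SITEINFO, {"CategoryTree"}):
    print(f"Warning: extension {name} ({kind}) is not installed, see {uri}")
```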

License issues: MediaWiki knows about the license that applies to the site and shows it in the page footer. This info should be included in the dump, for legal reasons, and also in order to compare it with the license used locally on import. If the two licenses are incompatible, a warning should be shown.

Independently of this, it would be good if the Special:Export and Special:Import pages showed the site's license info prominently, not only in the small print in the footer.

Namespaces: First of all, the <namespaces> section should define all names usable for a namespace, including canonical names and aliases. Secondly, it would be nice to split the title like this:

 <title><namespace id="120">Foo</namespace>:<name>Bar</name></title>

or, to stay backward compatible:

 <title>Foo:Bar</title>
 <namespace id="120">Foo</namespace>
 <name>Bar</name>
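
As a rough sketch of how a reader might consume the backward-compatible form, the following prefers the proposed <namespace> and <name> elements and falls back to the plain <title> when they are absent. The function name read_title is hypothetical, and the surrounding <page> wrapper is only there to make the fragment parseable.

```python
import xml.etree.ElementTree as ET


def read_title(page_xml):
    """Return (namespace_id, namespace_name, page_name) for a <page>."""
    page = ET.fromstring(page_xml)
    ns = page.find("namespace")
    name = page.findtext("name")
    if ns is not None and name is not None:
        raw_id = ns.get("id")
        ns_id = int(raw_id) if raw_id is not None else None
        return ns_id, (ns.text or "").strip(), name.strip()
    # Old-style dump: only <title> is present, so the namespace is unknown
    # and the full title is returned unsplit.
    return None, None, page.findtext("title", default="").strip()


NEW_STYLE = """<page>
  <title>Foo:Bar</title>
  <namespace id="120">Foo</namespace>
  <name>Bar</name>
</page>"""
OLD_STYLE = "<page><title>Foo:Bar</title></page>"

print(read_title(NEW_STYLE))
print(read_title(OLD_STYLE))
```

The fallback deliberately does not split on the colon: without the namespace list, a reader cannot tell whether "Foo:" is a namespace prefix or part of the page name.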

Talk:Extending MediaWiki's dump format

Some notes

The <notice> chunks could be kind of nice.

I like the <extension> bit in principle. General warnings of missing extensions or version mismatches may be inappropriate, however, as many extensions won't actually be in use on a given page, making the warnings unnecessary and confusing.

License markers aren't necessarily going to be exact or useful; frequently you'll see individual pages with exceptions, making the bulk metadata incorrect. Currently we don't have per-page copyright metadata, so can't really pull from that either.

Including namespace aliases would be nice.

Note that the wiki's case-sensitivity setting could change to have per-namespace overrides in the near future...

Namespace splitting in <title> fields; I'd probably recommend doing it this way:

<title namespace="Foo" text="Bar">Foo:Bar</title>

if it's necessary at all. Including the internal namespace index key shouldn't be necessary here (it's redundant to the <siteinfo> list), unless we want to make fragments easier to work with independently. Ultimately however, all that's redundant, and we want to support the target wiki being able to import pages with different namespaces (e.g. 'Portal:Foo' should import just fine on a wiki with no portal namespace, since 'Portal:Foo' is a perfectly legitimate title.)

-- brion, 2008-11-10

Thanks for the input, brion -- Daniel 09:22, 11 November 2008 (UTC)

<notice>

Rather than using <notice>using-last-revision-only</notice>, where "using-last-revision-only" is an identifier, it might be better to use <notice type="last-revision-only">Only the most recent revision has been exported.</notice>, where 'last-revision-only' is an identifier and the text inside the tag is the value of that identifier as set on the export wiki, in the language of the user doing the export. Other tools would then map the identifier to a local string in a local language, if it exists, or report the literal text inside the tag if not. --HappyDog 16:36, 10 January 2009 (UTC)

Yes, indeed, that sounds better -- 217.234.224.156 22:26, 10 January 2009 (UTC)

