Extension:ConTeXtXML


TODO: This page documents the situation that will become active in a few days to a week, when the newly developed wiki extension is ported over from the test wiki. This page is written in preparation. Soon, this todo block will be deleted.

We have a heatwave right now, and even though there are some things in the extension that I still want to improve on, it is too hot in the actual office to do any programming. Instead, I am on the couch in front of a fan.

--Taco (talk) 19:48, 10 August 2020 (CEST) (See: To-Do List)


An extension for editing /Command subpages

The ConTeXtXML extension is a new wiki feature specifically designed for editing the ConTeXt command reference pages (the ones that live under the /Command/ URL).

It does this by intercepting the creation of new wiki pages below /Command/, and by using a ContentHandler extension to maintain those pages. The content model of those pages is contextxml, a special XML format for documenting ConTeXt commands that is based on the interface XML files by Wolfgang Schuster.

For details of the XML format and the subtree structure below /Command/, see the pages Command and Help:Command. This page documents some features of the wiki extension itself.

Building on wikitext

The core of the extension is made up of two connected php classes:

  • ConTeXtXMLContentHandler, which extends WikitextContentHandler
  • ConTeXtXMLContent, which extends WikitextContent

Together they ensure that even though the declared page format is CONTENT_FORMAT_XML (which expands to the mime-type text/xml) and the page model is contextxml instead of wikitext, it remains possible to use wiki code. The extension achieves this by running a preprocessor over the XML data that converts it to wiki code, which the normal mediawiki page viewing and parser code can then process. At the same time, it keeps the XML format available as the format that edits take place on, so that any documentation text added by the user(s) can easily be extracted and exported for other uses outside of the wiki.

It turns out that for this to work, some tweaks have to be made. Either I have not understood the mediawiki documentation well, or there are issues with extending WikitextContent, or perhaps even both. Anyway:

  • The content_models mediawiki SQL database table needed an extra row with the values 4,contextxml. I added it manually, as I could not figure out how to do this automatically at extension registration time.
  • WikitextContent does not like being subclassed, so some parent:: functions had to be copied wholesale instead of just wrapping a bit of code around the parent implementation.

The ConTeXtXMLContent class also runs various checks on the XML before allowing the user to save the page. It uses a hand-written XML parser because it not only verifies that the XML is well-formed, it also performs various checks on the textual content, making sure that only documentation is added and that nothing is removed from the main XML database information.

Building on Article

When a user loads an existing page, that page is normally of class Article (except for some special cases). The Article class sets the page content model to wikitext, which makes it use the standard WikitextContent and WikitextContentHandler. Because that would not work for the ConTeXtXML pages, there is a third class:

  • CmdPage, which extends Article

This class is really small. It only exists as a coatrack for using ConTeXtXMLContent. It may not even be strictly needed in the current implementation, but it could prove useful for future extensions.

And then the Hooks

The extension uses a set of hooks to link into the mediawiki processing:

ArticleFromTitle

Creates a CmdPage if the wiki page title starts with /Command.

ContentHandlerDefaultModelFor

Sets the content model to contextxml if the wiki page title starts with /Command.

PageContentSave

On save, this saves contextxml pages to a designated harddisk location as well as in the wiki database.

ArticleAfterFetchContentObject

On load, this checks whether the page content on the designated harddisk location has changed. If yes, it will replace the text of the mediawiki revision with the content of the file on the harddisk. This is so that there is an easy interface for integrating updated versions of the interface xml files from Wolfgang.

EditFormPreloadText

This fills the edit area for newly created /Command pages from the file on the harddisk.

EditPageNoSuchSection

Error hook that is triggered if the user tries to edit a section that is generated from wiki code instead of from the XML data. This is an error because in that case it is quite hard to extract the right block of text and still keep track of where it is in relation to the XML data.

EditPage::showEditForm:fields

Prints a simple help message at the top of the edit field for /Command pages.


Generating the wikitext code for page views and previews

All the above is written in php. But since I am a complete noob in php and rather at home in Lua, I decided to write the conversion from XML to wiki format as an mtxrun script. The script is called mtx-wikipage.lua, and it converts our special XML format into wiki code.

I call it an mtxrun script, but really it only takes an option for the input file (in the case of section editing, this is a temporary file generated by php) and an optional output file (useful for debugging). Besides that, it is almost pure Lua. It uses an integrated tiny expat-style XML parser and a few handwritten XML tree processing functions to create the wiki code.
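
To give an idea of what those tree processing functions do, here is a minimal sketch of the conversion idea. It is not the actual mtx-wikipage.lua code; the node table shape, and picking out cd:command elements, are assumptions made for this illustration only.

-- Minimal sketch of the tree-to-wikitext conversion idea, not the
-- actual mtx-wikipage.lua code. It assumes nodes are Lua tables of
-- the form { tag = "...", attr = { ... }, children = { ... } } with
-- plain strings as text nodes; that shape is an assumption here.
local function towiki(node, out)
  if type(node) == "string" then
    out[#out + 1] = node                      -- text node: copy verbatim
  else
    if node.tag == "cd:command" then          -- one heading per command
      local name = node.attr and node.attr.name or ""
      out[#out + 1] = "== \\" .. name .. " ==\n"
    end
    for _, child in ipairs(node.children or {}) do
      towiki(child, out)                      -- recurse into the tree
    end
  end
  return out
end

-- usage: io.write(table.concat(towiki(tree, {})))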

The penalty for calling an external program from php is relatively small, because 1) the processed page content is fairly small, 2) the internal caching of mediawiki means that the code is only called when a page is actively being edited, and 3) starting luametatex as a pure Lua interpreter is surprisingly fast.

The only unsolved problem with this subsystem is that it needs an intermediate two-line shell script that does nothing except adjust the PATH environment variable, just so that mtxrun can run without complaining about paths and configuration files it cannot find. (Otherwise, mtxrun's warning messages would have to be stripped from the mtx-wikipage.lua output.)

Implementation notes

XML parser

The extension uses a hand-written simple XML parser in pure Lua. The parser is expat-style and the implementation is based on string.find() and string.sub(). The advantage of this approach is that it can handle bad XML input by throwing an appropriate (and understandable) error. Neither the Lpeg-based Lua parser from the 13th ConTeXt meeting nor the built-in ConTeXt parser allows for that: both assume well-formed XML as input.
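
To illustrate the approach, here is a heavily simplified sketch of such a string.find()/string.sub() scanner. The real parser handles much more (attributes, comments, entities, and the verbatim elements discussed below); the callback names are made up for this example.

-- Simplified expat-style scanner built on string.find() and
-- string.sub(). No attributes, comments, or entities: this only
-- shows how malformed input leads to an understandable error.
-- The callback names (startelement, text, endelement) are made up.
local function parse(xml, callbacks)
  local pos, stack = 1, {}
  while pos <= #xml do
    local lt = string.find(xml, "<", pos, true)
    if not lt then
      callbacks.text(string.sub(xml, pos))     -- trailing character data
      break
    end
    if lt > pos then
      callbacks.text(string.sub(xml, pos, lt - 1))
    end
    local gt = string.find(xml, ">", lt, true)
    if not gt then
      error(("unclosed tag at byte %d"):format(lt))
    end
    local tag = string.sub(xml, lt + 1, gt - 1)
    if string.sub(tag, 1, 1) == "/" then       -- an end tag: </name>
      local name = string.sub(tag, 2)
      if stack[#stack] ~= name then
        error(("</%s> does not match <%s>"):format(name, stack[#stack] or "nothing"))
      end
      stack[#stack] = nil
      callbacks.endelement(name)
    else                                       -- a start (or empty) tag
      local name = string.match(tag, "^[^%s/]+")
      if not name then
        error(("empty tag at byte %d"):format(lt))
      end
      if string.sub(tag, -1) ~= "/" then
        stack[#stack + 1] = name               -- wait for the end tag
      end
      callbacks.startelement(name)
    end
    pos = gt + 1
  end
  if #stack > 0 then
    error(("<%s> is never closed"):format(stack[#stack]))
  end
end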

A tailored parser also allowed for easy extension to deal with the CDATA issue mentioned below.

But the main motivation for a private dedicated parser written in Lua is that we want to check not only the well-formedness of the XML, but also its adherence to a set of extra rules:

  1. The documentation should not modify the argument structure of the command’s formal specification, only add explanations to it. Theoretically, each of the 3900+ formal specifications has its own private XML Schema.
  2. The documentation should be easily parseable by an external system, meaning that the use of wiki code and HTML tags needs to be governed.

These additional rules made using the DOM-based parser in php unwieldy, at least for me. I am sure a good php programmer could implement these extra checks, but I could not, at least not in a reasonable amount of time. But I knew how to tackle both requirements in Lua, and could write an implementation quite quickly and effortlessly.

The first point is handled like this:

  • When a fresh set of ‘virgin’ XML files is created from context-en.xml, each separate file is parsed using a set of functions that create a Lua table representing the ‘virginal’ parse tree of the XML file. This Lua table is dumped to disk and distributed along with the XML file.
  • When a wiki user presses the ‘Save’ button in the page editor, their edited XML is parsed using a slightly different set of functions from the ones used for viewing. The functions in this set skip all documentation content while building the parse tree. The two Lua tables representing the parse trees are then compared; they should be identical (see the sketch below). If not, an error is raised and the save action is aborted with a user-visible error message.
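
The comparison itself can be a plain recursive walk over the two tables. A minimal sketch, assuming the parse trees are ordinary nested Lua tables (the real tree layout is an implementation detail of the extension):

-- Minimal sketch of comparing two parse trees stored as nested Lua
-- tables; the actual tree layout in the extension is an assumption.
local function sametree(a, b)
  if type(a) ~= type(b) then
    return false
  end
  if type(a) ~= "table" then
    return a == b                     -- leaf values must match exactly
  end
  for k, v in pairs(a) do             -- every key in a must match in b ...
    if not sametree(v, b[k]) then
      return false
    end
  end
  for k in pairs(b) do                -- ... and b must not have extra keys
    if a[k] == nil then
      return false
    end
  end
  return true
end

-- if not sametree(virgin, edited_without_docs) then
--   error("the command specification was modified; save aborted")
-- end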

The second point is taken care of during that same XML parse step of the user's page revision. It uses a combination of a tag lookup table and plain string matching to make sure the user followed the rules (as explained in Help:Command).
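
That check can be pictured as a lookup table consulted for every tag the parser encounters. The set of allowed tags below is only an illustration; Help:Command documents the real rules.

-- Illustration only: the actual allowed set lives in the extension
-- and is documented in Help:Command.
local allowed = {
  texcode = true, xmlcode = true, context = true,
  code = true, tt = true, cmd = true,
}

local function checktag(name, position)
  if not allowed[name] then
    error(("tag <%s> is not allowed in documentation (at byte %d)")
          :format(name, position))
  end
end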

About those extension tags

The special tags <texcode>, <xmlcode>, and <context> on our wiki are handled by an extension (context) written a long time ago by Patrick Gundlach. That extension converts the parsed XML output from mediawiki into HTML code that looks 'right'. In normal wiki pages this works, because the mediawiki parser is quite forgiving (more like an HTML browser than an XML parser) and makes some recovery attempts itself when a user types in something that is not quite well-formed HTML/XML.

For example, in a normal wiki page you do not need to properly quote the attributes of <context>. And the structure within <xmlcode> does not have to be properly nested.

But it also sometimes backfires. If you use an XML tag name inside a <context source="yes"> call or within <texcode>, it will not be displayed in the verbatim display section of the page (but it will be seen by ConTeXt while processing the <context>).

To resolve this ambiguity between 'is it data?' and 'is it markup?' in a standalone XML file, you would wrap a CDATA section around things like the content of <xmlcode>. But unfortunately that is something that either the mediawiki parser, the context extension, or the HTML browser does not understand (I don't know which is the exact problem).

For now, within the ConTeXtXML XML parser, I decided to treat the content of <texcode>, <xmlcode>, and <context> 'as if' they were SGML elements with declared content CDATA. That means that the generated XML files on disk that make use of this feature are not actually well-formed; for example, this content of <xmlcode>:

<xmlcode>
<document>
This <highlight detail="important">you</highlight> need to know.
</document>
</xmlcode>

should actually be this:

<xmlcode><![CDATA[
<document>
This <highlight detail="important">you</highlight> need to know.
</document>
]]></xmlcode>

but then it could not be displayed on the wiki properly, or (with some internal patching by ConTeXtXML) there would be a constant difference between the XML version on disk and the wiki database version of a page (resulting in endless 'This revision is outdated' messages). So, I think this is the best solution for now.
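
In the parser, treating those three elements 'as if' their content is CDATA simply means scanning ahead for the literal closing tag instead of tokenizing the content. A minimal sketch of that idea (the function and variable names are made up, and the position handling matches the scanner sketch above):

-- Minimal sketch: when one of the verbatim elements is opened, skip
-- normal tokenizing and search for the literal end tag instead, so
-- that '<' and '>' inside it are treated as plain data.
local verbatim = { texcode = true, xmlcode = true, context = true }

local function scanverbatim(xml, name, pos)
  local endtag = "</" .. name .. ">"
  local s, e = string.find(xml, endtag, pos, true)  -- plain-text find
  if not s then
    error(("<%s> is never closed"):format(name))
  end
  -- return the raw content and the position just after the end tag
  return string.sub(xml, pos, s - 1), e + 1
end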