Getting Started with XML and ConTeXt using TeXML

Latest revision as of 13:21, 9 August 2020

Goal

This document is for XML authors who want to use open source software to produce high quality PDF documents--right now.

One and a half years ago I gave up using LaTeX to format my XML documents. I had found--or so I thought--a much superior solution in the Formatting Objects language, or FO. FO would allow me to create high quality PDF documents in XML and Unicode instead of the cumbersome and unfamiliar syntax of TeX. It would have allowed me to convert from an XML tree to an XML tree, exactly what an XML author wants. The FO language was established in the same manner as HTML and therefore represented the power and acceptance of open source software. The open source tool called FOP, which did the actual conversion from the abstract language to PDF, could already do much of what I wanted for my small needs, such as creating simple tables and formatting paragraphs.

One and a half years passed, however, and nothing changed with the development of FOP. While I could produce basic documents, I still couldn't meet other basic formatting needs, such as controlling widow paragraphs or centering a table. Since the developers of FOP had made no changes to their software in all this time, I came to the conclusion that I would be stuck with these limitations if I continued to use FO and FOP to convert my documents.

Thus I turned back to TeX, knowing I would face almost none of these limitations. I could produce beautiful documents right now, without having to wait for an open source FO converter that actually implemented all the standards. ConTeXt seemed the most advanced form of TeX, allowing me to format in the most direct manner without having to rely on many different macros (or outside libraries), so I chose it.

If you are an XML author who wants to convert your documents to PDF via XSLT, you will find this document useful. I try to first describe how to do something in FO before explaining how I would do it in ConTeXt, but even if you do not know any FO you should find the hints about formatting useful.

What You Should Know

This document assumes that you already have ConTeXt installed and know how to use it. If neither is true, take a little bit of time and visit the ConTeXt website to get familiar with how to run ConTeXt on your system. At the minimum, you should know the commands to issue to convert the ConTeXt examples here to PDF. You don't need to know more than that to get started, though of course the more you learn, the clearer this document will be.

Converting From XML

Being text-based, ConTeXt does not always lend itself well to a direct conversion from an XML tree. If an innocent blank line from an XML document finds its way into a ConTeXt document, we end up with an extra paragraph division. One way around this problem is to use ConTeXt's native XML mapping, which you can read about on the ConTeXt home page. I find this mapping scheme too complicated, which is why I advocate using TeXML (http://getfo.sourceforge.net/texml/index.html). TeXML is a Python utility that converts its own special form of XML into ConTeXt. That means you can use XSLT to convert from one XML tree to another and then let the Python utility do the dirty work of handling whitespace.
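To make the blank-line problem concrete, here is a small Python sketch. It is not part of TeXML, and the function name `collapse_blank_lines` is made up for illustration; TeXML's own whitespace handling is more thorough. It only shows the idea of cleaning text taken from an indented XML document so that no blank line survives to become a TeX paragraph break.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Join the non-empty lines of `text` with single spaces.

    In TeX a blank line ends a paragraph, so text copied from an
    indented XML document must not carry blank lines along.
    """
    # Strip every line and drop the ones that end up empty.
    lines = [line.strip() for line in text.splitlines()]
    kept = [line for line in lines if line]
    # Collapse remaining runs of spaces or tabs inside each line.
    return " ".join(re.sub(r"[ \t]+", " ", line) for line in kept)


xml_text = "\n   first line of a paragraph\n\n   second   line\n"
print(collapse_blank_lines(xml_text))
```

A real converter has to be more careful than this, for example by keeping intentional paragraph breaks, which is exactly the work I prefer to leave to TeXML.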

TeXML uses a very simple XML language. Basically, it represents ConTeXt commands in XML and does little more. One could look at a TeXML document and immediately know what the author meant to express in ConTeXt. In converting an XML document such as TEI to TeXML, one is coming as close as possible to actually converting to ConTeXt itself, without having to worry about whitespace, and while having the comfort of still working with an XML tree. If you use TeXML to convert, you really won't have to learn a new XML language, since TeXML consists of very few elements. Instead, you will still think in terms of ConTeXt.

Simple Document in ConTeXt and in TeXML

Here is the simplest ConTeXt document:

\starttext
hello world
\stoptext

On my system, I issue the command:

texexec [document_name] 

to produce a formatted document. Along with several other files, this command produces a file with the extension ".dvi", which I can view with the xdvi software. Follow the ConTeXt instructions to produce other types of output.

In TeXML, this simple document looks like:

<?xml version="1.0"?>
<TeXML>
  <env name="text">
    Hello World
  </env>
</TeXML>

I first need to convert this document to ConTeXt and then issue the exact same command I used above. It is a two-step process. To convert the XML to ConTeXt, I issue the command:

texml -e utf8 -c [infile.xml] [outfile.tex]

The "-e" option along with its argument of "utf8" tells TeXML to produce a document that is encoded in utf8. The "-c" option tells TeXML to produce ConTeXt output rather than LaTeX. Make sure you include both options.
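If you drive the conversion from a script, the call can be wrapped as in the following sketch. It assumes that the `texml` executable is on your PATH and uses only the two options described above; the helper names and the file names `hello.xml` and `hello.tex` are placeholders of my own.

```python
import subprocess


def texml_command(infile: str, outfile: str) -> list[str]:
    """Build the argument list for the texml call described above."""
    # -e utf8 sets the output encoding; -c asks for ConTeXt output.
    return ["texml", "-e", "utf8", "-c", infile, outfile]


def convert(infile: str, outfile: str) -> None:
    """Run texml; raises CalledProcessError if the conversion fails."""
    subprocess.run(texml_command(infile, outfile), check=True)


# Only print the command here; call convert() once texml is installed.
print(texml_command("hello.xml", "hello.tex"))
```

Passing the arguments as a list rather than one shell string avoids quoting problems when file names contain spaces.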

Although the ConTeXt and TeXML documents look different, both are built around the main text environment. Like all environments in ConTeXt, this one starts with a backslash followed by the word "start", followed in turn by the name of the environment without a space. We end this environment in the same way, replacing "start" with "stop."

In TeXML, we enclose environments with the env element. The mandatory "name" attribute defines the environment's name.
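The mapping from an env element to a start/stop pair can be sketched in a few lines of Python. This is a toy renderer written for this page, not the code TeXML actually uses; `render_env` is a hypothetical name, and it ignores nested elements entirely.

```python
import xml.etree.ElementTree as ET


def render_env(env: ET.Element) -> str:
    """Render one <env> element as a ConTeXt start/stop pair."""
    name = env.get("name")
    if name is None:
        raise ValueError("env element needs a name attribute")
    # Normalize the whitespace in the element's text content.
    body = " ".join((env.text or "").split())
    return f"\\start{name}\n{body}\n\\stop{name}"


doc = ET.fromstring('<TeXML><env name="text">  Hello World  </env></TeXML>')
print(render_env(doc.find("env")))
```

The point is only that the "name" attribute is pasted directly after "start" and "stop", which is why you can keep thinking in ConTeXt terms while writing TeXML.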

Commands

Aside from environments, we also have commands in ConTeXt. Through commands we control the text formatting in ConTeXt. Commands start with a backslash and can be followed by setups, which are placed in square brackets, and by the "scope or range of the command," which is placed in curly brackets.

For example, to create a simple document with just one box containing the words "that's it," we write:

\starttext
\framed[width=2cm,height=1cm]{that's it} 
\stoptext

In TeXML, this looks like:

<?xml version="1.0"?>
<TeXML>
  <env name="text">
    <TeXML>
        <cmd name="framed">
           <opt>width=2cm, height=1cm</opt>
           <parm>that's it</parm>
        </cmd>
    </TeXML>
  </env>
</TeXML>
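To show how the cmd, opt and parm elements line up with the pieces of a ConTeXt command, here is another toy Python renderer. Again, this is my own illustration and not TeXML's implementation; `render_cmd` is a hypothetical helper that handles only opt and parm children.

```python
import xml.etree.ElementTree as ET


def render_cmd(cmd: ET.Element) -> str:
    """Render <cmd> with <opt> and <parm> children as a ConTeXt command."""
    parts = ["\\" + cmd.get("name", "")]
    for child in cmd:
        text = (child.text or "").strip()
        if child.tag == "opt":
            # Setups go in square brackets.
            parts.append(f"[{text}]")
        elif child.tag == "parm":
            # The scope of the command goes in curly brackets.
            parts.append("{" + text + "}")
    return "".join(parts)


cmd = ET.fromstring(
    '<cmd name="framed">'
    "<opt>width=2cm, height=1cm</opt>"
    "<parm>that's it</parm>"
    "</cmd>"
)
print(render_cmd(cmd))
```

Each opt becomes a bracketed setup and each parm becomes a braced argument, in document order.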

Other Preliminaries

To make sure that our Unicode-encoded XML documents are converted properly, we want to put the following line at the top of all our documents:

\enableregime[utf]

Apparently, this allows ConTeXt to handle both utf8 and utf16.

In addition, we want to disable as many of ConTeXt's automatic modes as possible, since we will generate things like titles and sections ourselves. ConTeXt automatically places a number on each page, and starts a new number with each part. To turn this feature off, place this line somewhere at the top of your document:

\setuppagenumbering[state=stop, way=bytext]

We might alter this command in some ways later.

Example Documents

Here are two very simple documents, one in plain old ConTeXt, and one in TeXML.


Simple_page.tex

\enableregime[utf]
\setuppagenumbering[state=stop, way=bytext]

\starttext
Wie schön!
\stoptext

and Simple_page.texml

<?xml version="1.0"?>
<TeXML>
<!--
Attributes nl1 and nl2 can be used to force a new line before (nl1) or after (nl2) TeX command.
-->
 <cmd name="setuppagenumbering">
  <opt>state=stop, way=bytext</opt>
 </cmd>
 <cmd name="enableregime" nl1="1">
  <opt>utf</opt>
 </cmd>
 <env name="text">
  Wie schön!
 </env>
</TeXML>
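In a real pipeline XSLT would produce this tree, but the shape of the output is easy to see if we assemble it with Python's standard ElementTree module. The following is only a sketch; the helper names `add_cmd` and `build_simple_page` are made up for this page, and the nl1 attribute is left out for brevity.

```python
import xml.etree.ElementTree as ET


def add_cmd(parent: ET.Element, name: str, options: str) -> ET.Element:
    """Append <cmd name="..."><opt>...</opt></cmd> to `parent`."""
    cmd = ET.SubElement(parent, "cmd", {"name": name})
    ET.SubElement(cmd, "opt").text = options
    return cmd


def build_simple_page(body: str) -> str:
    """Return a TeXML document with the two preamble commands and a body."""
    root = ET.Element("TeXML")
    add_cmd(root, "setuppagenumbering", "state=stop, way=bytext")
    add_cmd(root, "enableregime", "utf")
    ET.SubElement(root, "env", {"name": "text"}).text = body
    # encoding="unicode" returns a str and keeps characters like ö as-is.
    return ET.tostring(root, encoding="unicode")


print(build_simple_page("Wie schön!"))
```

Because the result is serialized by an XML library, special characters in the body are escaped for us before TeXML ever sees the file.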

To Do

  • Include more documentation about TeXML.