TEI xml

General

TEI (Text Encoding Initiative) is "a consortium which collectively develops and maintains a standard for the representation of texts in digital form," to quote their own website. The consortium has developed a series of guidelines for editing texts in digital form. In their latest version (called P5), these guidelines weigh in at a hefty 1350 pages (OK, that's counting the bibliography and the index too; there are only 1290 pages of real text). They describe an xml format which is suitable for editing texts. The TEI guidelines have the advantage of being very well documented, and a number of free resources are available to help anyone interested in getting started (one extremely helpful website with lots of tutorials, examples, and tests is TEI by example). The guidelines are not (and do not aspire to be) an absolute standard that everyone has to follow, but many academic projects use them, and following them is a pretty good way to make sure that your electronic edition of a text will remain useful in the future.

Since editing texts is something which quite a few users of ConTeXt are involved in, it makes sense to think about ways in which xml documents which follow the TEI guidelines can be typeset with ConTeXt. We would invite users to keep a few caveats in mind:

  1. The TEI guidelines are very detailed because they try to cater to a large number of needs. Most users will only need a small subset of the tags and attributes which the guidelines offer. In fact, TEI is aware of this and offers a slimmed-down version of the guidelines called TEI Lite, which is a very good starting place to familiarize yourself with TEI. It would not make sense to try to provide a monolithic solution that defines all TEI tags; instead, localized ConTeXt style sheets are needed, each defining the subset that is relevant for a group of texts with similar features.
  2. Even with this huge number of tags, TEI does not expect to be sufficient for every text. Users are encouraged to develop their own adaptations; again, this requires special ConTeXt style sheets to process them.
  3. Encoding and typesetting texts in xml is an ongoing process. As you go forward in your edition, you realize that you need more tags, that you need to distinguish more special cases, that you want to add more information to your edition. This means that you will have to go back and forth between your xml file and the ConTeXt style and adapt both to your needs.

All of which means that the following paragraphs are just the first step in an ongoing attempt. I (Thomas) have written down a setup for a text that I am editing (for those who are interested: the Lives of the Sophists by Philostratus). I fully expect this to be a community effort: as others use TEI xml, they will discover new ways of handling things, will want to add features or add examples for other sorts of texts. My example is meant to start the discussion. Since those who edit texts usually have a background in the humanities, not in programming, I have added lengthy comments which will explain every step.

Our xml file

Philostratus's text is in ancient Greek, but since the text itself doesn't matter much when we talk about structure and typesetting xml, I have replaced it here with a simple lorem ipsum text that is easier to display. So here's what the first paragraphs of the xml file philostratus.xml look like:

<?xml version="1.0"?>

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
  <teiHeader>
    <fileDesc>
      <titleStmt>
	<title>Lives of the Sophists</title>
	<author>Philostratus</author>
	<respStmt>
	  <resp>editor</resp>
	  <name xml:id="TAS">Thomas</name>
	</respStmt>
      </titleStmt>
      <publicationStmt>
	<p>Work in progress</p>
      </publicationStmt>
      <sourceDesc>
	<p>See indication of manuscripts</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front>
      <div type="sigla">
	<listWit>
	  <witness xml:id="c2">codd. 2</witness>
	  <witness xml:id="Richards">Richards</witness>
	</listWit>
      </div>
      <div type="work">
	<head type="main">Philostrati</head>
	<head type="sub">Vitae Sophistarum.</head>
	<opener>
	  <salute>Lorem <pb ed="Olearius" n="479"/>Ipsum</salute>
	</opener>
      </div>
    </front>
    <body>
      <div xml:id="VS1" n="I" type="book">
	<div xml:id="VS1.1" n="1" type="chapter">
	  <div xml:id="VS1.1.1" n="1" type="section">
	    <p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed
	    diam nonumy eirmod tempor invidunt
	    <app>
	      <rdg wit="#Richards">induunt</rdg>
	    </app>
	    ut labore et dolore magna aliquyam erat, sed diam voluptua. At
	    vero eos et accusam et justo duo dolores et ea rebum. Stet
	    clita kasd gubergren
	    <app>
	      <rdg wit="#c2">arrgl</rdg>
	    </app>
	    <pb ed="Olearius" n="480"/>
	    no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem
	    ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
	    nonumy eirmod tempor invidunt.</p>
	  </div>
	  <div xml:id="VS1.1.2" n="2" rend="inline" type="section">
	    <p>ut labore et dolore magna aliquyam erat, sed diam
	    voluptua. <pb ed="Olearius" n="481"/> At vero eos et accusam et
	    justo duo dolores et ea rebum. Stet clita kasd gubergren, no
	    sea takimata sanctus est Lorem ipsum dolor sit amet.
	    <lg>
	      <l>At vero eos et accusam et justo duo dolores</l>
	    </lg>
	    et
	    <lg>
	      <l>ea rebum. Stet clita kasd gubergren, no sea takimata</l>
	    </lg>
	    sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit
	    amet</p>
	  </div>
	  <div xml:id="VS1.1.3" n="3" rend="paragraph" type="section">
	    <p>Duis autem vel eum iriure dolor in hendrerit in vulputate
	    velit esse molestie consequat, vel illum dolore eu feugiat
	    nulla facilisis at vero eros et accumsan et iusto odio
	    dignissim qui blandit praesent luptatum zzril delenit augue
	    duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit
	    amet, consectetuer adipiscing elit, sed diam nonummy nibh
	    euismod tincidunt ut laoreet dolore magna aliquam erat
	    volutpat.</p>
	  </div>
	</div>
      </div>
    </body>
  </text>
</TEI>

So let's have a look at this file. This can be brief since most of the tags are described in the TEI guidelines and tutorials.

Every TEI file has as its root level (i.e. the "outer" level of the xml file) the element

<TEI>

</TEI>

which defines it as a TEI xml file. Everything else is a "child" of this root level. At the next level, you see two of these children: on the one hand, the <teiHeader> element. This contains meta-information about your electronic edition: title, author, editor, publication status, source of your edition. There can be much more information here. This is meta-information which will usually not be typeset in your edition.

The other child is the <text> element. This is what will really be in a typeset, printed edition. As you see, the <text> element has again two children. The <front> contains the title of the work you edit in the form in which it will appear in your typeset document, prefatory material, etc. The <body> element contains the text itself. This text has a logical structure: It consists of books, chapters, and sections. All of these logical parts are expressed via different <div> elements; to distinguish them from each other, these <div> elements have so-called attributes, so we have:

      <div xml:id="VS1" n="I" type="book">
        <div xml:id="VS1.1" n="1" type="chapter">
          <div xml:id="VS1.1.1" n="1" type="section">
          </div>
        </div>
      </div>

As you can see, most of these "div" elements have other attributes as well. The "xml:id" attribute gives every section in your document a unique identifier, which makes it easier to refer to these sections later. You are free to choose the values of these identifiers; as an example, I have opted for a short tag that combines the prefix VS with the book, chapter, and section numbers. The "n" attribute is the name of the section as it will appear in your typeset edition.

For classical prose texts, it is customary to have the chapter and section numbers appear in the margin of the edition, with no prefix and no additional information about the structure. E.g., at the beginning of chapter 8, there will be a bold 8 in the margin (the mark for "section 1" is understood and usually not expressed). For subsequent sections of chapter 8, there will be smaller section numbers in the margin, like "2," "3," etc.

Finally, such sections of chapters do not necessarily begin a new paragraph. In order to make this clear, I have used the "rend" attribute (not exactly in the way TEI defines it, but close enough). For sections, I use two values of the "rend" attribute: "inline" means that this section should just continue the typographical paragraph; "paragraph" means that it should begin a new paragraph. This is an important distinction which I want to emphasize: in your typeset edition, these two will look very different. For the logical structure of your digital text, however, they are both on the same level. That's why they are both "div" elements of the same type, but with different "rend" attributes.

Further, we have <pb> elements. These mark the page breaks of a standard edition, which are often used for reference purposes and displayed in the margin; in the case of the Lives of the Sophists, this is the 18th-century edition of Olearius. The elements are inserted at the places where these page breaks occur.

Finally, we have the critical apparatus. Its notes are included in <app> elements. Each entry in the apparatus is within a <rdg> (= reading) element.

This should be enough to get us started. We will now look at the way in which we can typeset such an xml document with ConTeXt.

The ConTeXt style file

NB: Some of the functionality described here has been introduced quite recently. You will need a ConTeXt version not earlier than December 2010 in order to try this example!

In order to typeset such a file with ConTeXt, we need a style file which will map xml elements and attributes to specific ConTeXt commands. We have to save this file (let's call it tei-style.tex) somewhere where ConTeXt can find it (e.g., somewhere in your personal texmf tree or in the same directory as the xml file) and then typeset with the command context --environment=tei-style philostratus.xml. We will look at this file in detail:

\startxmlsetups xml:teisetups
        \xmlsetsetup{#1}{*}{-}
\stopxmlsetups

We define a set of xml setups between \startxmlsetups and \stopxmlsetups, and we give it a name in the namespace xml:. The first line of these setups does only one thing: the \xmlsetsetup operates on the current xml tree (that's what the first argument {#1} refers to), takes all its elements ({*}) and discards them ({-}). That means only elements which we address explicitly will be typeset. This is necessary in our case because we do not want the information in the TEI header to be typeset.

For those elements we do want typeset, we have to add instructions. This involves a three-step process:

  1. We have to add their names to a line which defines a \xmlsetsetup
  2. We define a specific setup for them
  3. (optional) we define TeX commands for typesetting

Let us begin with some easy steps. The xml tree we are operating on is empty now. So we first have to tell ConTeXt to pass the content of the topmost elements to its typesetting engine. The topmost element is TEI, so we write:

\startxmlsetups xml:teisetups
        \xmlsetsetup{#1}{*}{-}
        \xmlsetsetup{#1}{TEI}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:teisetups}

\startxmlsetups xml:TEI
	\xmlflush{#1}
\stopxmlsetups

So: we add the TEI element to a new \xmlsetsetup. We "register" the setups we have defined. And then we declare that the content of the element TEI should be passed to ConTeXt; this is what the line \xmlflush{#1} does.

Of course, we will do the same for the text element, but not for the teiHeader element, which we do not want to be typeset. So we now have:

\startxmlsetups xml:teisetups
        \xmlsetsetup{#1}{*}{-}
        \xmlsetsetup{#1}{TEI|text}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:teisetups}

\startxmlsetups xml:TEI
	\xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:text
	\xmlflush{#1}
\stopxmlsetups
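
To see the effect of these setups without touching the real edition, one could embed a toy xml snippet in a buffer and process it directly. The following is only a minimal, untested sketch: the buffer name toy and its content are made up for illustration (the snippet is not valid TEI), and it assumes that \xmlprocessbuffer is available in your ConTeXt version. If the setups behave as described above, only the sentence inside the text element should be typeset:

% A toy document; "toy" is a hypothetical buffer name
\startbuffer[toy]
<TEI>
  <teiHeader>This header text is discarded.</teiHeader>
  <text>This text is typeset.</text>
</TEI>
\stopbuffer

% Discard every element, then re-enable TEI and text only
\startxmlsetups xml:teisetups
        \xmlsetsetup{#1}{*}{-}
        \xmlsetsetup{#1}{TEI|text}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:teisetups}

\startxmlsetups xml:TEI
	\xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:text
	\xmlflush{#1}
\stopxmlsetups

\starttext
	% "main" is an arbitrary name for the xml tree built from the buffer
	\xmlprocessbuffer{main}{toy}{}
\stoptext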

Things become a bit more interesting when we look at the next level. We will start with the text proper, which is contained in the body element. For the text, we want line numbers in the margin, and we want these line numbers in steps of five, in a small font. Here you can see the three steps we have to take:

\startxmlsetups xml:teisetups
        \xmlsetsetup{#1}{*}{-}
        \xmlsetsetup{#1}{TEI|text|body}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:teisetups}

\startxmlsetups xml:TEI
	\xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:text
	\xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:body
    \startlinenumbering
    \xmlflush{#1}
    \stoplinenumbering
\stopxmlsetups

\setuplinenumbering[location=inner,
                    step=5,
                    method=page,
                    style=\tfxx,
                    align=left,
                    distance=0.3em,
                    width=0.3cm]

So we have:

  1. added the element body to our \xmlsetsetup
  2. added a specific setup for the element which puts its content within a \startlinenumbering environment
  3. added ConTeXt setup commands for the \startlinenumbering environment.

Things become even more interesting at the next level. When you look at our xml document, you will see that the entire body consists of different divisions in div elements; the different levels are distinguished by different type attributes. This means we cannot simply add the div element to our general \xmlsetsetup, but have to add a specific \xmlsetsetup for every type. Fortunately, ConTeXt makes it easy to address these different elements. We begin with the book level. (For clarity, I will from now on show only the new steps, not the entire style document; the \xmlsetsetup lines added in earlier steps stay in place.)

\startxmlsetups xml:teisetups
        \xmlsetsetup{#1}{*}{-}
        \xmlsetsetup{#1}{TEI|text|body}{xml:*}
	\xmlsetsetup{#1}{div[@type='book']}{xml:div:book}
\stopxmlsetups

\startxmlsetups xml:div:book
	\blank[line]\midaligned{\xmlatt{#1}{n}}\blank[medium]
	\xmlflush{#1}
\stopxmlsetups

What happens here? The expression div[@type='book'] means "every element div which has an attribute 'type' with the value 'book'." We want a blank line before the title of the book. Then, we take the value of the n attribute (that's what the construct \xmlatt{#1}{n} expands to: the value of the attribute n of the current element) and typeset it midaligned. We add another, smaller blank. And don't forget to "flush" the content of the div element!

For the next level, the chapter, we need again three steps: add it to the \xmlsetsetup, define a setup command and a ConTeXt macro for it:

\startxmlsetups xml:teisetups
        \xmlsetsetup{#1}{*}{-}
        \xmlsetsetup{#1}{TEI|text|body}{xml:*}
	\xmlsetsetup{#1}{div[@type='chapter']}{xml:div:chapter}
\stopxmlsetups

\startxmlsetups xml:div:chapter
	\PhilSection{\xmlatt{#1}{n}}
	\xmlflush{#1}
	\par
\stopxmlsetups

\defineinmargin [PhilSection] [outer] [normal] [distance=0.3em,style=\tfa\bf]

So: here, the value of the n attribute is passed to a ConTeXt macro \PhilSection. This macro is defined as an \inmargin which will be typeset in the outer margin, in a bigger, bold font. This will be the "chapter" numbering in the outer margin.

For the section numbering, we take a similar approach, but as you will see, we need to define even more different setups:

\startxmlsetups xml:teisetups
        \xmlsetsetup{#1}{*}{-}
        \xmlsetsetup{#1}{TEI|text|body}{xml:*}
	\xmlsetsetup{#1}{div[@type='section']}{xml:div:section}
\stopxmlsetups

\startxmlsetups xml:div:section
	\doifelse
	 {\xmlatt{#1}{n}}
	 {1}
	 {\xmlflush{#1}}
	 {\doifelse
	  {\xmlatt{#1}{rend}}
	  {paragraph}
	  {\par\PhilSubsection{\xmlatt{#1}{n}}\xmlflush{#1}}
	  {\PhilSubsection{\xmlatt{#1}{n}}\xmlflush{#1}}}
\stopxmlsetups

\defineinmargin [PhilSubsection] [outer] [normal] [distance=0.3em,style=normal]

Here, we define a setup for the section level which contains two nested tests, for which we use ConTeXt's \doifelse macro. The first \doifelse tests whether the value of the n attribute is "1", i.e., whether this is the first section in a chapter. If it is, it does nothing more than "flush" the content of this section -- remember, the number for the first section should not appear in the margin since it is implied in the chapter number. It's still good to have this number in the xml file: if you ever decide that your typeset output should look different, the information is there and can be shown. But for the time being, we do not want it to appear, and that's what the first condition takes care of. If the n attribute's value isn't 1, another test is performed; this time, we look at the value of the rend attribute. If this attribute has the value "paragraph", we insert a \par, pass the value of the n attribute to the macro \PhilSubsection, and "flush" the content of our section. If the value is anything else (i.e., "inline"), we do the same, but without inserting a \par. Finally, we define \PhilSubsection as another \inmargin, which will appear in the outer margin, at the same place as the chapter numbering, but in a normal font.

When you look at the main text, you will see that we have now defined setups for books, chapters, and sections, but not yet for the smallest element, p. Remember: we don't want paragraph breaks for these elements, so all we need to do is "flush" them. This means that we add the p element to the list:

\xmlsetsetup{#1}{TEI|text|body|p}{xml:*}

and the appropriate setup is:

\startxmlsetups xml:p
	\xmlflush{#1}
\stopxmlsetups

And that's it! This is our structure for the main text! If you typeset the xml file with this setup, you get text with marginal numbering for your chapters and sections.
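
Since the steps above were shown piece by piece, it may help to see them collected in one place. The following listing is merely an assembly of the snippets discussed so far (nothing new is introduced, apart from the comments); it assumes that the three \xmlsetsetup lines for the div types are meant to sit side by side in the same xml:teisetups block:

% tei-style.tex: setups for the main text, collected from the steps above
\startxmlsetups xml:teisetups
        % discard everything, then re-enable the elements we handle
        \xmlsetsetup{#1}{*}{-}
        \xmlsetsetup{#1}{TEI|text|body|p}{xml:*}
        \xmlsetsetup{#1}{div[@type='book']}{xml:div:book}
        \xmlsetsetup{#1}{div[@type='chapter']}{xml:div:chapter}
        \xmlsetsetup{#1}{div[@type='section']}{xml:div:section}
\stopxmlsetups

\xmlregistersetup{xml:teisetups}

\startxmlsetups xml:TEI
	\xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:text
	\xmlflush{#1}
\stopxmlsetups

% the body is typeset with line numbers
\startxmlsetups xml:body
	\startlinenumbering
	\xmlflush{#1}
	\stoplinenumbering
\stopxmlsetups

% book: centered number taken from the n attribute
\startxmlsetups xml:div:book
	\blank[line]\midaligned{\xmlatt{#1}{n}}\blank[medium]
	\xmlflush{#1}
\stopxmlsetups

% chapter: bold number in the outer margin
\startxmlsetups xml:div:chapter
	\PhilSection{\xmlatt{#1}{n}}
	\xmlflush{#1}
	\par
\stopxmlsetups

% section: no number for section 1; new paragraph only for rend="paragraph"
\startxmlsetups xml:div:section
	\doifelse
	 {\xmlatt{#1}{n}}
	 {1}
	 {\xmlflush{#1}}
	 {\doifelse
	  {\xmlatt{#1}{rend}}
	  {paragraph}
	  {\par\PhilSubsection{\xmlatt{#1}{n}}\xmlflush{#1}}
	  {\PhilSubsection{\xmlatt{#1}{n}}\xmlflush{#1}}}
\stopxmlsetups

\startxmlsetups xml:p
	\xmlflush{#1}
\stopxmlsetups

\setuplinenumbering[location=inner,
                    step=5,
                    method=page,
                    style=\tfxx,
                    align=left,
                    distance=0.3em,
                    width=0.3cm]

\defineinmargin [PhilSection] [outer] [normal] [distance=0.3em,style=\tfa\bf]
\defineinmargin [PhilSubsection] [outer] [normal] [distance=0.3em,style=normal]

Note that the front matter, the verse lines (lg, l) and the apparatus (app, rdg) of philostratus.xml are not addressed by any of these setups, so with this file they are discarded together with the TEI header.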

We now add the bells and whistles. We begin with the Olearius page breaks, the <pb> elements. If you've followed so far, this should be easy. As you see, these elements contain a reference to the relevant edition (the ed attribute) and the page number (the n attribute). If we had page breaks from several editions, it would make sense to define a separate \xmlsetsetup for each of them. In the case of Philostratus, we will probably only have Olearius, so we just add pb to our list:

\xmlsetsetup{#1}{TEI|text|body|p|pb}{xml:*}

and add both the setup for the xml element and a new definition for a marginal text (since we're a bit paranoid, we still test whether the xml attribute ed is set to Olearius). Since I want the Olearius numbers in square brackets, I needed to take a two-step approach (the square brackets would confuse the ConTeXt parser). So I first define an inmargin \ZOlearius, which is typeset in the outer margin, at a distance of 2em from the main text, and then a macro \Olearius which takes the page number and passes it to \ZOlearius within square brackets:

\startxmlsetups xml:pb
	\doifelse
	{\xmlatt{#1}{ed}}
	{Olearius}
	{\Olearius{\xmlatt{#1}{n}}}
	{}
\stopxmlsetups

\defineinmargin [ZOlearius] [outer] [normal] [distance=2em,style=small]

\define[1]\Olearius%
  {\ZOlearius{[#1]}}

Thomas 21:38, 7 November 2010 (UTC)
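
If page breaks from more than one edition were recorded, the attribute filter already used for the div elements could presumably select them by edition, so that the \doifelse test in xml:pb would no longer be needed. The following is a minimal, untested sketch under that assumption. The edition name Other, the setup names xml:pb:olearius and xml:pb:other, and the macro \OtherEdition are hypothetical and only serve as illustration; \Olearius and \ZOlearius are the macros defined above:

% these two lines would replace the entry pb in the general list
\xmlsetsetup{#1}{pb[@ed='Olearius']}{xml:pb:olearius}
\xmlsetsetup{#1}{pb[@ed='Other']}{xml:pb:other}

\startxmlsetups xml:pb:olearius
	\Olearius{\xmlatt{#1}{n}}
\stopxmlsetups

\startxmlsetups xml:pb:other
	\OtherEdition{\xmlatt{#1}{n}}
\stopxmlsetups

% hypothetical macro: same margin as Olearius, but in parentheses
\define[1]\OtherEdition%
  {\ZOlearius{(#1)}}

With one setup per edition, each edition can get its own margin style without nested tests.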

Removing unwanted strings from xml source

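In some cases you might want to remove strings or characters from the xml source. For example, ConTeXt cannot process a hashmark (#) that it receives as ordinary text, because # is a special character in TeX. The following example shows how to remove the leading hashmark from an xml identifier with a Lua string function, called through the command \cldcontext: string.sub with the argument 2 returns the attribute value from its second character onward.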

The xml source:

<a href="#myspecialid">the previous section</a>

The setup code:

\startxmlsetups xml:initialize
      \xmlsetsetup{#1}{a}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:initialize}

\startxmlsetups xml:a
     \cldcontext{string.sub([[\xmlatt{#1}{href}]],2)}
\stopxmlsetups
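
The same technique might be applied to the wit attributes in philostratus.xml, whose values (#Richards, #c2) also begin with a hashmark. The following is an untested sketch, not part of the style file above: it assumes that each reading should simply become a footnote followed by the name of its witness. The use of \footnote and the setup names xml:app and xml:rdg are illustrative choices only.

% add app and rdg to the list of elements that are passed on
\xmlsetsetup{#1}{TEI|text|body|p|pb|app|rdg}{xml:*}

\startxmlsetups xml:app
	\xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:rdg
	% the reading itself, then the witness reference without its leading #
	\footnote{\xmlflush{#1} \cldcontext{string.sub([[\xmlatt{#1}{wit}]],2)}}
\stopxmlsetups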