Campus Publishing in Standardized Electronic Formats -- HTML and TEI.

David Seaman
Electronic Text Center
University of Virginia Library
November 1994

Introduction

In the past year, HyperText Markup Language (HTML) has done more to popularize the notion of Standard Generalized Markup Language than any single preceding use of SGML. Used on the World Wide Web through a graphical client such as Netscape or NCSA Mosaic, HTML documents and their associated image, sound, and digital video files result in sophisticated network publications and services. And even when viewed through the plain text (VT100) client Lynx, HTML files can still be exciting clusters of interlinked documents.

In common with Internet users all over the world, the University of Virginia Library now uses and produces HTML documents; unlike most other academic institutions, however, we came to HTML with practical experience in another, more sophisticated, form of SGML -- that of the Text Encoding Initiative Guidelines. For two years the Electronic Text Center has been using the TEI Guidelines, through several drafts, to tag and distribute hundreds of electronic texts. The purpose of this paper is both to explain how we are using these various forms of SGML mark-up to publish a variety of documents, and to sound a cautionary note about the wholesale use of HTML as a primary authoring language.

HyperText Markup Language: HTML

HyperText Markup Language is exciting as an implementation of SGML not least because it is easy to use and to learn. It has a real pedagogical value as a form of SGML that makes clear to newcomers the concept of standardized markup. To the novice, the mass of information that constitutes the Text Encoding Initiative Guidelines -- the premier tagging scheme for most humanities documents -- is not easily grasped. In contrast, the concise guidelines to HTML that are available on-line (and usually as a "help" option from the menu of a Web client) are a good introduction to some of the basic SGML concepts. The combination of a small number of tags with a viewer such as Netscape means that in the space of a single training session a user new to the concept of standardized markup can create a document from scratch and see it operate, with images and hypertext links in place. In HTML we finally have a way to build functional hypertexts easily and cheaply and to publish the results in a networked manner that will never be an option with the proprietary hypertext software packages (ToolBook, HyperCard, MediaView, AuthorWare, and so on).

This ease of use is a mixed blessing. The current Web clients are very forgiving, requiring no real conformity to the specific rules governing the usage of HTML tags (the Document Type Definition, or DTD), which leads to bad habits in all of us. HTML documents become in a practical sense defined as "texts that work when viewed through Mosaic", rather than files that conform to a specific set of rules governing tag names and their usage. Leaving aside the potential long-term problems that non-conforming HTML texts may cause, there is a more pressing problem -- we need to be careful that the texts we mark up primarily in HTML are ones that do not suffer from the use of such a simple form of SGML.

In the other SGML tagsets with which I am familiar, the aim of the markup is to describe structurally and conceptually what an item is -- for example, in a TEI text a chapter might appear thus:

<div type="chapter" n=1>
<head rend="italics">
Chapter Name </head>
[TEXT OF CHAPTER]
</div>


In the example above, the structure as well as the typography of the text is explicitly named. About all the HTML tagset allows one to do is to mark for line breaks and typographic features:
<br>
<i>
Chapter Name </i>
<br>
[TEXT OF CHAPTER]
<br>
It does not necessarily take appreciably longer to mark up a text in a structurally specific manner than in a non-specific one. To indicate that a chapter starts and ends here, and that a phrase is an italicized chapter heading (the TEI way) is not more work than to indicate that this is a line break, and this is an italicized phrase of unspecified type (the possibilities given to us in HTML currently). However, the former is inherently more useful over a wider range of uses. In a collection of a thousand novels one may well want to search for items that appear only in opening chapters, or for words only when they appear in chapter headings, and the former example provides the level of specificity to do exactly that.
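The payoff of the structural tagging can be sketched in a few lines of code (Python here, purely as a modern illustration -- the Center's actual searching was done with SGML-aware software, and the sample chapter titles below are invented):

```python
import re

# Sample of TEI-style markup: structure and typography are both explicit.
tei_sample = """
<div type="chapter" n=1>
<head rend="italics">Loomings</head>
Call me Ishmael. Some years ago...
</div>
<div type="chapter" n=2>
<head rend="italics">The Carpet-Bag</head>
I stuffed a shirt or two into my old carpet-bag...
</div>
"""

# Because chapter headings are explicitly tagged, we can search the
# headings alone, excluding the body text -- a query the HTML tagset,
# which marks only line breaks and italics, cannot support.
headings = re.findall(r'<head[^>]*>(.*?)</head>', tei_sample, re.DOTALL)
print(headings)
```

The same approach extends to any structurally named element: restricting a search to opening chapters, for instance, means matching only within `<div type="chapter" n=1>`.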

There are practical and appropriate uses for HTML as a primary form of mark-up, rather than as a form to which another variety of SGML is converted. At the Electronic Text Center, and elsewhere in our Library and our University, HTML is being used to publish on-line guides and brochures for various projects, including the University Press of Virginia's Fall 1994 Catalog. In my on-line description of the Etext Center I have color images and a web of hypertext links that are too expensive or impossible to duplicate in print -- its URL is
http://www.lib.virginia.edu/etext/ETC.html.

And HTML works very well for the various training handouts that accompany the humanities computing courses I teach. These documents are written and printed out for class in WordPerfect, and they change with great frequency. I use a WordPerfect-to-HTML converter to perform the conversions, and it is a quick and easy way of keeping the on-line documentation up to date. For training manuals such as the Electronic Text Center's Guide to Optical Character Recognition, there is much to be gained from using HTML to show digital images of a range of different typefaces linked hypertextually to the results generated by our OCR technology. HTML functions perfectly well for me when I am building such a helpsheet, and the ability to make it available to others on the Internet is an added and welcome bonus. Little by little, our faculty too are seeing HTML as a usable pedagogical tool, to construct supplementary courseware items -- "Kinko's Packets on-line", as they have been called.

For the moment, I have no real qualms about using HTML as the primary mark-up language for such items. What is clearly inappropriate and unnecessary is to undertake any large amount of text production and markup in HTML for the following items: finding aids, full texts, sets of journal titles, encyclopedias, and dictionaries. Such items are only really useful and navigable if their structure is clearly and explicitly delimited. And this is where tagging schemes such as the TEI come in.

The Text Encoding Initiative: TEI

I have become fond of saying that the TEI Guidelines are 1,600 pages long and the HTML guidelines are 16 tags long -- this is an exaggeration, but currently not an outrageously distorting one. The TEI is a splendid if somewhat daunting undertaking; it provides us with a full set of tags, a methodology, and a set of Document Type Definitions (DTDs) that allow the detailed (or not so detailed) description of the spatial, intellectual, structural, and typographic form of a work. For a library such as ours at the University of Virginia, which is busy buying and creating texts for use on-line through a single SGML-aware software package, TEI is just what we need. It means that each new work tagged by us can be added to an existing database of related work and can be searched and analyzed either as an individual entity or as a part of a larger context (to do this we use Pat, an SGML textual analysis and display tool from Open Text Corp.). Moreover, it is comforting to work within a large tagset so that a user in the future can add to the complexity of the mark-up and still conform to the same general tagging "universe".

TEI-to-HTML Conversion

As we began at Virginia to create a set of publicly accessible SGML texts (in addition to the commercial databases that we have on-line for UVa users only) we ran into something of a problem, for which the World Wide Web is a partial answer. We wanted to be able to share access to the public domain SGML materials, and could do this by putting copies on an ftp site, or by depositing them at another central repository such as the Oxford Text Archive. In both of these cases, a user retrieving a file would get a "raw" SGML text, and in all likelihood would have no SGML-aware software through which to use the item. The advent of HTML and the World Wide Web meant that we could provide on-line access to HTML copies of these public domain materials: as both the TEI tags and the HTML tags are predictable, it was no great undertaking to write a search-and-replace routine to turn the TEI tags into HTML tags automatically. Converting from a specific and precise form of SGML to a simpler form -- TEI to HTML -- presents little problem.


Examples of this process, drawn from our English language holdings, are linked from the on-line version of this paper.

Crucially, the access to the texts through the World Wide Web is achieved without the need for us to maintain two copies of every file: a TEI-encoded and an HTML-encoded version. Instead, the TEI-to-HTML conversion is performed by a Perl script on the Unix machine that houses all the electronic texts in the UVa on-line archive; for most of the full texts available through our Web server, the conversion is set in motion by the action of choosing a particular text. The conversion is done "on the fly", and the freshly-generated HTML version is sent to whatever Web client one uses.

This, then, is a convenient and low maintenance means by which to mark up the data in a manner appropriate to the subject matter and also to get Web-browsable HTML copies with no real additional effort. As the HTML tagset grows, the filter can be enlarged to let more of the TEI tags through, without the need to do anything to the texts.
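The filter itself amounts to little more than a table of tag substitutions. A minimal sketch of the idea follows (in Python rather than the Perl actually used, and with a hypothetical, much-abbreviated mapping table):

```python
import re

# A hypothetical fragment of a TEI-to-HTML tag mapping; the Center's
# actual filter was a Perl script with a far fuller table.
TEI_TO_HTML = [
    (r'<div type="chapter"[^>]*>', ''),       # structure HTML cannot express is dropped
    (r'</div>',                    '<br>'),
    (r'<head[^>]*>',               '<i>'),    # headings survive only as italics
    (r'</head>',                   '</i><br>'),
    (r'<hi[^>]*>',                 '<i>'),
    (r'</hi>',                     '</i>'),
]

def tei_to_html(text):
    """Convert a TEI-tagged string to simpler HTML, one pattern at a time."""
    for pattern, replacement in TEI_TO_HTML:
        text = re.sub(pattern, replacement, text)
    return text

result = tei_to_html(
    '<div type="chapter" n=1><head rend="italics">Chapter Name</head>TEXT</div>')
print(result)   # <i>Chapter Name</i><br>TEXT<br>
```

Because the mapping runs from the richer tagset to the poorer one, enlarging the filter as HTML grows is simply a matter of adding rows to the table.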

The Problem of Unidentified Images

Almost as soon as we had this system of TEI-to-HTML conversions running, a problem dawned on us. A growing number of our electronic texts have book illustrations and other book-related images along with the tagged ASCII text. Our TEI-to-HTML filter takes the TEI manner of indicating the location of an image (example below from Booth Tarkington's The Flirt):

<figure entity="TarFlir1"> <head> "You Darling!" </head> </figure>

and turns it into an in-line image with a hypertext link to an associated full-size version:

<A href="TarFlir1.jpg"> <img src="TarFlir1.gif"> </A> <br> <h1> "You Darling!" </h1>
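This rewrite is mechanical enough to sketch as a single substitution (Python here, purely for illustration -- the real transformation was part of the Perl conversion script, and the assumption that the entity name maps directly onto the .gif and .jpg filenames is taken from the example above):

```python
import re

def figure_to_html(tei_text):
    """Rewrite a TEI <figure> element as an in-line HTML image linked
    to a full-size version (a sketch, not the Center's actual code)."""
    pattern = (r'<figure entity="([^"]+)">\s*<head>\s*(.*?)\s*</head>\s*</figure>')
    replacement = (r'<A href="\1.jpg"> <img src="\1.gif"> </A> '
                   r'<br> <h1> \2 </h1>')
    return re.sub(pattern, replacement, tei_text)

tei = '<figure entity="TarFlir1"> <head> "You Darling!" </head> </figure>'
print(figure_to_html(tei))
```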

It is at this point that the problem arises -- if a user saves a book illustration from one of our works there is often nothing on that image to identify what work it comes from, where the image was created, or by whom. A month after saving a book illustration such as the narcissistic frontispiece to Booth Tarkington's The Flirt [see figure 1], the user has in all likelihood forgotten where it came from, or to what work it belongs. By strewing unlabelled images across the Net, we would be contributing to the problem of unattributed texts that we have been complaining about for two years. For many of the public domain texts we have taken off other network sites, we have had to spend a lot of time trying to identify the source text. Without the print source in hand the checking and tagging is largely impossible; without knowledge of the print source, a user cannot safely make scholarly or pedagogical use of the material.

The solution to the problem of unlabelled book illustrations wandering free from their texts presented itself quite readily: the user who downloads an image file of a book illustration or manuscript page needs to have delivered along with it a copy of the bibliographical header that is at the top of every TEI text [see Appendix A for an example]. The TEI header is a catalog record, a finding aid, and a description of the production of the electronic text. A user saving a copy of the text gets the bibliographical information along with it, because the header is part of the TEI text file and is converted along with the rest of the text into HTML.

By the surprisingly simple expedient of taking a version of the TEI header out of the text and burying this "image header" into the binary code of the image itself, the user who saves an image from our server now gets -- in Trojan Horse fashion and whether they know it or not -- a tagged full-text record of the creation of that image as part of the single image file they save. [See Appendix B for an example of an image header from the Rita Dove work whose TEI header appears in Appendix A]. If a user has an image tool that permits the viewing of text comments in the image file (I use XV, the X Windows viewer) then both image and header can be seen simultaneously [see Figures 1 and 2 to see what this looks like through XMosaic], but any program that lets you see the contents of a file is sufficient to read the text.
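The burying itself is a simple byte-level splice. Here is a minimal sketch (in Python, purely for illustration; the Center's actual procedure is not reproduced here, and the use of the GIF89a comment extension is my assumption, since that is the field a viewer such as XV displays):

```python
# Sketch: splice a TEI-style header into a GIF image file as a GIF89a
# Comment Extension (introducer 0x21, label 0xFE, length-prefixed
# sub-blocks of up to 255 bytes, zero terminator).  The header text and
# the file handling are illustrative assumptions, not the actual code.

def embed_comment(gif_bytes, comment):
    """Return gif_bytes with `comment` spliced in before the trailer."""
    if not gif_bytes.startswith(b'GIF89a'):
        raise ValueError('comment extensions require a GIF89a file')
    if not gif_bytes.endswith(b'\x3b'):
        raise ValueError('not a complete GIF: trailer byte missing')
    data = comment.encode('ascii')
    block = b'\x21\xfe'                        # extension introducer + comment label
    for i in range(0, len(data), 255):         # comment data, <=255 bytes per sub-block
        chunk = data[i:i + 255]
        block += bytes([len(chunk)]) + chunk
    block += b'\x00'                           # block terminator
    return gif_bytes[:-1] + block + b'\x3b'    # re-append the trailer

header = '<teiHeader> ... bibliographic record ... </teiHeader>'
# A real call would read the actual image file:
#   gif_bytes = open('TarFlir1.gif', 'rb').read()
dummy_gif = b'GIF89a' + b'\x00' * 10 + b'\x3b'   # stand-in bytes, not a renderable image
tagged = embed_comment(dummy_gif, header)
```

An image tool that shows text comments then displays the header alongside the picture, and any program that dumps the file's bytes will reveal it.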

I'm hoping that the practice of burying tagged ASCII data in the code of an image file will become much more widespread in the electronic data communities, and I am writing up the procedure we use to aid others who wish to do the same. The practice may even be extendable to sound files or digital video files in the future [I have done no experiments with these file formats yet]. The text that goes into the image file does not have to be SGML-tagged data, but SGML tagging does seem a logical extension of the purpose of the TEI header, and there are long-term advantages to making this "text in the image file" contain clearly delimited fields: when we have software that can search (rather than simply view) the text contained in image files, then suddenly we have the possibility of a database of images that is searchable by keyword. And when that happens, the text in those images is much more useful over much larger collections if various fields are marked off with SGML tags, so that the text is searchable within specific categories.

Anticipating such "text in image files" search capabilities, we have started very recently to add into the image file not only a bibliographical header but also any written text that occurs as part of an image. In the case of the example in Figure 2, for example, the TEI header in the image file is followed by a tagged text transcription of the words on the page:

<text>
<lg type="stanza">
<l n=4> don't mutter <hi>oh no</hi> </l>
<l n=5> <hi>not another one</hi> </l>
<l n=6> <hi>get a job fly a kite</hi> </l>
<l n=7> <hi>go bury a bone</hi> </l>
</lg>
</text>
Once we have the ability to search this ASCII text, we will suddenly have images of typescript (or manuscript) pages that are fully searchable, long before tools for matching the shapes of letters within an image give us searchable images in any form that requires pattern recognition. The addition of a series of key descriptive terms to the text in the image file would allow for images to be searched according to content.
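The anticipated search capability is easy to imagine: pull the comment text back out of each image file and match keywords against it. A sketch of the extraction follows (again in Python and again assuming the GIF89a comment extension; a robust tool would walk the GIF's full block structure rather than scanning for the two marker bytes, which can in principle also occur inside compressed image data):

```python
def extract_comments(gif_bytes):
    """Collect the text of every GIF89a comment extension in a file
    (a naive scan, for illustration only -- see caveat above)."""
    comments = []
    i = 0
    while i < len(gif_bytes) - 1:
        if gif_bytes[i:i + 2] == b'\x21\xfe':    # comment extension marker
            i += 2
            text = b''
            while gif_bytes[i] != 0:             # walk the length-prefixed sub-blocks
                size = gif_bytes[i]
                text += gif_bytes[i + 1:i + 1 + size]
                i += 1 + size
            comments.append(text.decode('ascii', errors='replace'))
        i += 1
    return comments

# A file carrying the comment "abc", as a stand-in byte string:
sample = b'GIF89a' + b'\x21\xfe\x03abc\x00' + b'\x3b'
print(extract_comments(sample))   # ['abc']
```

With such extraction in place, simple keyword matching over a directory of image files yields exactly the "database of images searchable by keyword" described above.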

Conclusion

New and significant electronic documents are being produced at a rapid pace, many of them driven by the excitement generated by the arrival of the World Wide Web. At the University of Virginia all sorts of guides, manuals, teaching documents, finding aids, and texts are appearing on Web servers, and the Electronic Text Center's short courses in HTML fill up repeatedly. In the Center itself, we are making heavy use of both HTML and TEI tagging as we create and/or convert full-text collections, and are facing some of the issues involved in describing the texts and images we send out onto the Internet.

Despite the allure of the Web as a distribution medium it is increasingly important to choose the SGML tagging system that best allows one to describe with precision the material being created -- for different texts this may be the TEI Guidelines; Dan Pitti's Finding Aids Project (showcased elsewhere at this conference); the AAP tagset; HTML; or something else entirely. The decision to use a form of SGML other than HTML does not deny one the use of the World Wide Web as a delivery mechanism for that document: it is not difficult to convert a specific set of SGML tags to a simpler, less descriptive form by employing a "search and replace" conversion routine that gives you HTML output with little extra effort and no extra tagging. Crucially -- and this is the central message of this paper -- the electronic document that one creates needs to receive the form of standardized description suitable to the nature of the document, and not one simply dictated by a desire to publish in HTML on the Web.


Figures 1 and 2
Appendix A
Appendix B
etext@virginia.edu