© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
To set the scene for the following description of the facilities provided in the Standard Generalized Markup Language, this chapter provides an overview of how an SGML system can be built from a set of interrelated software components. The classification of components used in this chapter follows that used in Steve Pepper's definitive Whirlwind Guide to SGML Tools and Vendors, the latest version of which can be obtained over the WWW from http://www.falch.no/people/pepper/sgmltool/.
This chapter is split into the following sections:

Figure 3.1 Parts of an SGML-based system
Figure 3.1 shows the main components of a comprehensive SGML system. The central part of any SGML system is the data repository. This can take many forms, from a simple file store to a complex database of reusable SGML elements. Files to be placed into the repository must be validated by an SGML parser. Validation may also be required before documents are despatched from the repository to ensure that no referenced data has been omitted from the transmitted data set.
Input to the data repository can come from:
Data stored in the repository can be:
SGML document editors fall into three main categories:
Editors that try to present the printed page as it is input normally hide SGML markup from users. If they do not provide a separate view of the marked up file that allows users to edit the markup they will have mechanisms for displaying details of the currently open elements, and their attributes, in well-defined areas of the screen window, or in pop-up or pull-down windows. Alternatively the editor may provide a pictorial representation of the document structure that helps users to understand where they are in the document structure.
Editors that show an unpaginated, quasi-WYSIWYG, view of the SGML file typically have mechanisms that allow the markup tags to be displayed as part of the text. Markup is usually shown in the form of an icon. Often these icons will not contain details of the element's attributes, which can only be displayed in the form of a pop-up window. Figure 3.2 shows a typical view of the screen of such an editor when attribute editing is in progress.

Figure 3.2 Example of quasi-WYSIWYG editor screen
As SGML concerns itself principally with the logical structure of data, rather than the way the data will be presented to end-users, there is no ideal way to display an SGML document during data capture. In many situations there are advantages in capturing the data on an editor that does not seek to represent a 'final view' of the data but instead concentrates on making the SGML structure as clear as possible to the document creator. In such circumstances an editor that displays the SGML markup as part of the document has advantages.
Of the many SGML-based text editors mentioned in The Whirlwind Guide to SGML Tools and Vendors the following are among the most popular:
There are a number of tools that are designed to make it easier for document analysts to create and validate their DTDs. Some of these are designed to work with specific editors or document parsers; others are designed as free-standing tools.
One class of document analysis tool takes existing DTDs and displays them in the form of a directed graph. The more advanced of such tools allow users to modify the displayed graph and then output a revised DTD. Where this is possible it is normally also possible to create a DTD from scratch by drawing a graph of the required structure. A well known example of such a tool is Microstar's NEAR & FAR Designer.
One of the main strengths of SGML DTDs is their ability to define recursive structures, and to have rules for including and excluding certain parts of the document model in certain circumstances. Trying to display such rules graphically is extremely difficult. Where a DTD has been defined using a carefully controlled set of nested structures it will often be impossible to graphically display the whole document structure in a viewable form. In such circumstances other forms of DTD analysis tools can help to identify, for example, all the places where a particular element can occur. An example of such a tool is SoftQuad's DTD Documentor.
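The kind of question such a tool answers can be sketched in a few lines of Python. This is only an illustration, not a description of how DTD Documentor works: the element names and the `CONTENT_MODELS` table are hypothetical, and the content models are reduced to plain lists of permitted children, ignoring occurrence indicators, inclusions and exclusions.

```python
# Hypothetical, simplified content models: element name -> permitted children.
# A real DTD also carries occurrence indicators, inclusions and exclusions.
CONTENT_MODELS = {
    "book": ["title", "chapter"],
    "chapter": ["title", "para", "list"],
    "list": ["item"],
    "item": ["para", "list"],  # lists can nest, so the model is recursive
    "para": [],
    "title": [],
}


def parents_of(element, models):
    """Return the sorted names of elements whose model allows `element`."""
    return sorted(name for name, children in models.items() if element in children)


def is_recursive(element, models):
    """Return True if `element` can occur, directly or indirectly, inside itself."""
    seen = set()
    stack = list(models.get(element, []))
    while stack:
        child = stack.pop()
        if child == element:
            return True
        if child not in seen:
            seen.add(child)
            stack.extend(models.get(child, []))
    return False


print(parents_of("para", CONTENT_MODELS))
print(is_recursive("list", CONTENT_MODELS))
```

A textual "where used" report of this kind stays readable even when the recursion in the model makes a complete graphical display impractical.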
Another category of useful data analysis tools allows users to scan a set of existing SGML and non-SGML documents to identify the elements they contain and the relationships between these elements. A typical example of such a tool is Avalanche's Document Analyzer.
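As a rough illustration of this kind of analysis (not of how any particular product works), the Python sketch below scans fully tagged text and counts the elements it contains and the parent/child pairs it finds. It assumes that no tags have been omitted and that attribute values contain no `>` characters; the sample memo and its element names are hypothetical.

```python
import re
from collections import Counter

# Matches start-tags and end-tags; declarations such as <!DOCTYPE are skipped.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")


def analyze(text):
    """Count elements and parent/child pairs in fully tagged markup."""
    elements = Counter()
    pairs = Counter()
    open_elements = []
    for match in TAG.finditer(text):
        is_end = match.group(1) == "/"
        name = match.group(2).lower()
        if is_end:
            # Pop back to the matching start-tag, if there is one.
            if name in open_elements:
                while open_elements.pop() != name:
                    pass
        else:
            elements[name] += 1
            if open_elements:
                pairs[(open_elements[-1], name)] += 1
            open_elements.append(name)
    return elements, pairs


sample = "<memo><to>Staff</to><body><p>One</p><p>Two</p></body></memo>"
elements, pairs = analyze(sample)
print(elements)
print(pairs)
```

The resulting counts give a document analyst a first impression of which elements a DTD for the scanned documents would need to declare and how they nest.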
While sophisticated text editors often include facilities for importing data from the more popular forms of word processors, they often do not have sufficient knowledge of data structures to be able to convert non-SGML documents into the format required for a specific SGML DTD. Stand-alone data conversion tools can provide fully programmable solutions to the problems of converting word processor documents into SGML.
Data conversion tools can be categorized into two main classes:
Tools that can validate the SGML structure during conversion typically use the output DTD as the starting point for describing the conversions to be made. As each element is identified it is checked to ensure that it is valid at the current point in the document structure. If it is not, either an alternative conversion is attempted or an error is reported to the user.
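The check-then-fall-back behaviour described above might be sketched as follows. The style names, element names and both lookup tables are hypothetical, and the permitted-children table stands in for the far richer content models of a real DTD.

```python
# Hypothetical rules: which elements may appear directly inside each element.
ALLOWED_CHILDREN = {
    "report": {"title", "section"},
    "section": {"title", "para"},
    "title": set(),
    "para": set(),
}

# Hypothetical mapping from word processor style names to candidate elements,
# listed in order of preference.
STYLE_TO_ELEMENTS = {
    "Heading": ["section", "title"],
    "Body Text": ["para"],
}


def convert_style(style, parent):
    """Return the first candidate element that is valid inside `parent`.

    Raises ValueError when no candidate is valid at this point in the
    structure, which corresponds to reporting an error to the user.
    """
    for candidate in STYLE_TO_ELEMENTS.get(style, []):
        if candidate in ALLOWED_CHILDREN.get(parent, set()):
            return candidate
    raise ValueError(f"no valid element for style {style!r} inside <{parent}>")


print(convert_style("Heading", "report"))   # first candidate is valid here
print(convert_style("Heading", "section"))  # falls back to the alternative
```

Listing the candidates in order of preference is one simple way to express "attempt an alternative conversion"; a real tool would normally let the user program such rules.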
Most existing data conversion tools fail to validate the SGML structure during conversion. They simply apply a set of programmed rules to the input document and then pass the unvalidated output file to an SGML parser for post-conversion validation.
Among the many data conversion tools listed in The Whirlwind Guide to SGML Tools and Vendors the following are possibly the most popular:
Before a converted, or new, SGML document is placed in a document repository it should be parsed to ensure that the structure of the completed document conforms to the rules specified in the associated document type definition. An SGML parser is defined in the SGML standard (ISO 8879:1986) as:
A program (or portion of a program or a combination of programs) that recognizes markup in SGML documents.
NOTE - If an analogy were drawn to programming language processors, an SGML parser would be said to perform the functions of both a lexical analyzer and a parser with respect to SGML documents.
In practice an SGML parser must do more than just recognize markup. It must also:
A validating SGML parser should be able to report errors and identify where the errors occurred. It should also be able to identify points at which the markup rules defined in the SGML declaration associated with each DTD have not been followed. For example, a validating SGML parser should report the use of any invalid characters within SGML names, and whenever the document's name or string length limits have been exceeded. In addition, a validating parser should identify when the restrictions on memory storage capacity or the structure of formal public identifiers have been broken.
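Two of these checks, invalid name characters and the name length limit, can be illustrated with a short Python sketch. It assumes the reference concrete syntax, in which names start with a letter, continue with letters, digits, full stops or hyphens, and are limited by a NAMELEN of 8; a real parser would read these values from the SGML declaration. The function name `check_name` is hypothetical.

```python
import string

# Assumed reference concrete syntax: a name starts with a letter and
# continues with letters, digits, "." or "-".
NAME_START = set(string.ascii_letters)
NAME_CHARS = NAME_START | set(string.digits) | {".", "-"}


def check_name(name, namelen=8):
    """Return a list of error messages for one SGML name (empty if valid)."""
    if not name:
        return ["name is empty"]
    errors = []
    if name[0] not in NAME_START:
        errors.append(f"invalid name start character {name[0]!r}")
    for position, char in enumerate(name[1:], start=2):
        if char not in NAME_CHARS:
            errors.append(f"invalid name character {char!r} at position {position}")
    if len(name) > namelen:
        errors.append(f"name length {len(name)} exceeds NAMELEN {namelen}")
    return errors


for candidate in ("chapter", "subsection", "1st", "a_b"):
    print(candidate, check_name(candidate))
```

Returning every problem found, rather than stopping at the first, mirrors the requirement that a validating parser report errors and where they occurred.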
The most commonly used free-standing SGML parsers are those that form part of James Clark's set of public domain parsers (sgmls, nsgmls and SP). Commercially supported parsers include Exoterica's SGML Kernel and Sema's Mark-It.
SGML documents can be stored in any type of data repository, including compressed file stores, encrypted file stores, relational, hierarchical or object oriented databases. The techniques used to identify the object to be stored, and the level at which stored objects can be reused, are heavily dependent on the storage manager that controls access to the storage facilities.
When documents are stored in a file store, rather than a database, an SGML entity manager can be used to control the way in which documents, and the entities they reference, are loaded into, and recalled from, the file store. An SGML entity manager should be able to convert documents from the coding format used for storage to that required for parsing. This can involve the application of algorithms for data compression/decompression, encryption/decryption and conversion from single-byte (e.g. 8-bit) to multi-byte (e.g. 16-bit or 32-bit) formats. Where support is provided for bit combination transformations the entity manager can also provide facilities for converting from the encoding used for data storage to that used for document parsing/editing (e.g. from EBCDIC to ASCII). Sophisticated SGML parsers, such as James Clark's SP, have built-in SGML entity managers.
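The storage-to-parsing conversion can be illustrated with a minimal Python sketch. It is not modelled on SP's entity manager: the function names are hypothetical, zlib stands in for whatever compression scheme the file store uses, and Python's `cp037` codec is assumed as a representative EBCDIC code page.

```python
import zlib


def store_entity(text, compressed=False, storage_encoding="cp037"):
    """Encode entity text into the byte form kept in the file store."""
    raw = text.encode(storage_encoding)
    return zlib.compress(raw) if compressed else raw


def load_entity(stored_bytes, compressed=False, storage_encoding="cp037"):
    """Return entity text ready for parsing from its stored byte form."""
    raw = zlib.decompress(stored_bytes) if compressed else stored_bytes
    return raw.decode(storage_encoding)


original = "<para>Stored in EBCDIC, parsed as text.</para>"
stored = store_entity(original, compressed=True)
print(load_entity(stored, compressed=True) == original)
```

The parser never sees the stored form: decompression and code conversion both happen inside the entity manager before any markup is recognized.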
Where databases are being used as the basic storage mechanism, documents can be split into separate elements for storage. The techniques used for identifying relevant storage units depend on the type of database being used and the sophistication of the database loading/unloading software.
When relational database management systems (RDBMSs) are being used for storage it is not common practice to treat each element in the DTD as a storable object. Instead the DTD is normally analyzed to identify reusable units, such as numbered sections or tables, that can be stored as a binary large object (BLOB) within the database. An example of an RDBMS-based SGML document repository is IDI's Basis SGMLserver.
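Storing a reusable unit as a BLOB can be sketched with Python's built-in `sqlite3` module. This is an assumption-laden illustration, not a description of Basis SGMLserver: the table layout, the column names and the unit identifier are all hypothetical.

```python
import sqlite3


def open_repository(path=":memory:"):
    """Open (or create) a repository with one table of reusable units."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS unit ("
        " unit_id TEXT PRIMARY KEY,"
        " element TEXT NOT NULL,"
        " content BLOB NOT NULL)"
    )
    return conn


def store_unit(conn, unit_id, element, sgml_text):
    """Store one reusable unit (e.g. a numbered section) as a BLOB."""
    conn.execute(
        "INSERT OR REPLACE INTO unit (unit_id, element, content) VALUES (?, ?, ?)",
        (unit_id, element, sgml_text.encode("utf-8")),
    )
    conn.commit()


def fetch_unit(conn, unit_id):
    """Return the stored markup for `unit_id`, or None if it is unknown."""
    row = conn.execute(
        "SELECT content FROM unit WHERE unit_id = ?", (unit_id,)
    ).fetchone()
    return None if row is None else row[0].decode("utf-8")


conn = open_repository()
store_unit(conn, "sec-3.2", "section", "<section><title>Editors</title></section>")
print(fetch_unit(conn, "sec-3.2"))
```

The database knows only the unit's identifier and element type; the markup inside the BLOB remains opaque to it, which is why the choice of storable unit matters.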
When object oriented databases (OODBs) are being used the SGML document structure provides a natural hierarchy of related objects that can be directly loaded into the database. A good SGML-based OODB will be able to determine the relationships between elements, and between elements and their attributes, automatically by reading the DTD. In ideal situations the OODB should be able to map the changes required to the document structure when a new variant of a DTD is introduced for a particular set of existing documents. An example of an OODB-based SGML repository is Electronic Book Technology's DynaWeb database.
There are some hybrid systems that combine object oriented front ends with relational storage. As most RDBMS vendors are now developing object oriented front ends such systems are likely to become more common. An early example of the use of this approach can be seen in Texcel's ERIC data repository.
A good data repository will offer facilities over and above simple data loading and retrieval. A data repository should offer facilities for version control, data archiving and data sharing. An example of an RDBMS-based system that offers such facilities is CRI's Life*CDM. An example of a managed file-based data repository is Xerox's Astoria.
For many systems workflow management will form an integral part of the data repository. Such systems are able to assign tasks to specific users, set and monitor target dates for completion of specific processes, report on progress-to-date, and route files and messages from one process to another. The degree of integration of the workflow management system with the data repository is often a key factor in the selection of an SGML data repository.
Because of SGML's background as a controlled method of documentation capture, most SGML systems are designed specifically for the production of printed documents. Where WYSIWYG text editors are being used for data capture there may be no need for additional facilities for document formatting: printing the image shown on the screen may be sufficient.
Because SGML systems are specifically designed to divorce the document storage/interchange format from the document presentation format, SGML documents will typically have to be formatted when retrieved from storage. In many cases this will be done using a high speed, high quality, formatting package. There are four main ways in which format control information can be assigned to the elements of a DTD:
SGML's LINK facility provides a number of techniques for enhancing the existing markup of an SGML document to provide the information required by a text formatter to format the text into columns and pages. Details of these techniques are given in Chapter 10 of this book.
Conversion of SGML markup into formatter specific forms can often be undertaken using the same tools that are used to convert word processor documents into SGML. Alternatively conversion facilities may be offered as part of the text formatting system. Examples of pagination systems offering such facilities include Datalogic's DL Composer and 3B2's Advent system.
SGML systems that are specifically designed for use as part of the US Department of Defense's Continuous Acquisition and Life-cycle Support (CALS) program often support the interchange of formatting specifications in the form of Formatting Output Specification Instances (FOSIs). While FOSIs only support a limited range of formatting options they do provide a limited degree of specification transportability. A document composition system that supports the use of FOSIs is Arbortext's ADEPT Publisher.
April 1996 saw the publication of the long-awaited Document Style Semantics and Specification Language (DSSSL) as ISO/IEC 10179:1996. Despite its recent release many vendors have already committed themselves to upgrading their products to support at least a subset of the comprehensive set of data transformation and formatting options detailed in this exciting new standard. Those interested in developing long-term systems development strategies should look carefully into the option of using this standardized mechanism for interchanging formatting specifications.
Whilst SGML was originally designed for the production of printed documents one of its key roles today is for the electronic delivery of documents. Four main techniques for the electronic delivery of SGML documents can be identified:
Early examples of electronic delivery involved the processing of a set of SGML files to produce a binary format that could be displayed using a customized document browser. Typically such systems will fully index the text so that fast text searches can be performed on the stored data. One of the most comprehensive browsers to use this technique was Electronic Book Technology's DynaText system.
An alternative that is widely used by many publishers is to pass the PostScript files produced as part of the text formatting process to an Acrobat Distiller program to produce PDF encoded files that can be displayed using Adobe's Acrobat Reader.
Another approach is to store the data in its SGML form and then use an SGML document browser, such as SoftQuad's Panorama, to browse through the native SGML. In such cases the formatting rules to be associated with the SGML files normally form part of a browser-specific information set that is associated with the SGML DTDs or document instances.
With the rapid spread of the Internet many users are switching to HTML document browsers for presenting information over the Internet. In this case conversion from the SGML storage format to the HTML transmission format will be undertaken at the WWW server site as a transparent process. In this scenario formatting is controlled by the set-up of the HTML browser. Where the browser supports the new Cascading Style Sheets (CSS) specification it may be possible for the server to control document formatting to a limited degree, though user-set parameters should still be able to override formatting specified by the document's creator.
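A server-side conversion of this kind can be as simple as a table-driven tag mapping, as in the Python sketch below. The SGML element names and the `TAG_MAP` table are hypothetical, attributes are discarded, and the input is assumed to be fully tagged; production converters handle far more than this.

```python
import re

# Hypothetical mapping from DTD-specific element names to HTML elements.
TAG_MAP = {"chapter": "div", "title": "h1", "para": "p", "emph": "em"}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")


def to_html(sgml_text):
    """Rename mapped tags; drop unmapped tags but keep their content."""

    def replace(match):
        slash, name = match.group(1), match.group(2).lower()
        html_name = TAG_MAP.get(name)
        return f"<{slash}{html_name}>" if html_name else ""

    return TAG.sub(replace, sgml_text)


source = "<chapter><title>Intro</title><para>Some <emph>text</emph>.</para></chapter>"
print(to_html(source))
```

Because the mapping discards the richer SGML structure, the stored SGML, not the generated HTML, remains the master copy.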
When documents are interchanged between, or within, SGML systems three components must be interchanged in a predefined order: the SGML declaration, the document prolog and the document instance.
Not all SGML systems are able to process SGML declarations. Some systems have a built-in SGML declaration. Others rely on the default SGML declaration defined in ISO 8879:1986. In some cases facilities normally controlled via the SGML declaration have to be specified using menus within the program.
A good SGML system will be able to accept documents coded using different SGML declarations, and will be able to interpret files containing SGML declarations. Because the SGML declaration controls many of the functions of an SGML system we will briefly cover its facilities in the next chapter of this book. You should not, however, worry too much about understanding the role of the SGML declaration when you first come to grips with SGML. All you need to know initially is that the SGML declaration places certain restrictions on what can and cannot be done by an SGML system. A quick scan through the next chapter will start to give you some idea of what these restrictions are.
For most SGML systems, document prologs will be stored in the form of a set of files that can be referenced through DOCTYPE declarations at the start of a document instance. For some systems, however, the DTD will need to be stored in the form of a detached DOCTYPE declaration so that it can be precompiled before being associated with one or more document instances. Most document prologs consist of a single DTD. The reasons why multiple DTDs may be required, and the role of LPDs, will be explained in Chapter 10.
Where document instances are to be associated with precompiled prologs they can simply start with a markup tag identifying the base document element. In most cases, however, the document instance will start with a DOCTYPE declaration that identifies the DTD to be used to process the document. When moving documents from system to system it is important to determine whether or not the DOCTYPE declaration needs to be attached to, or detached from, the document instance.
When interchanging SGML document sets it is important to ensure that all relevant files have been interchanged. For products that conform to the specifications laid down by the SGML Open vendor consortium this will normally be achieved through the interchange of an SGML catalog that identifies the public and system names of all the files required to process a specific set of document instances.
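A completeness check based on such a catalog might look like the following Python sketch. It handles only `PUBLIC` entries whose public and system identifiers are both quoted, which is a simplification of the catalog format; the public identifiers, file names and function names are hypothetical.

```python
import re

# Simplified: only PUBLIC entries with both identifiers in double quotes.
PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')


def public_entries(catalog_text):
    """Map each public identifier in the catalog to its system identifier."""
    return {public_id: system_id
            for public_id, system_id in PUBLIC_ENTRY.findall(catalog_text)}


def missing_files(catalog_text, available):
    """Return system identifiers named in the catalog but not in `available`."""
    return sorted(
        system_id
        for system_id in public_entries(catalog_text).values()
        if system_id not in available
    )


# Hypothetical catalog for a small document set.
CATALOG = '''
PUBLIC "-//Example//DTD Report//EN" "report.dtd"
PUBLIC "-//Example//ENTITIES Company Names//EN" "names.ent"
'''

print(missing_files(CATALOG, {"report.dtd", "chapter1.sgm"}))
```

Running a check like this before despatch guards against sending a document set whose DTD or entity files have been left behind.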
Pepper, S (1996) The Whirlwind Guide to SGML Tools and Vendors Oslo, Norway: Falch Infotek A/S (for the latest version contact http://www.falch.no/people/pepper/sgmltool/)
'Exchange of Formatting Information using the Output Specification', Department of Defense Application of MIL-M-28001 Using Standard Generalized Markup Language (SGML) (MIL-HDBK-SGML) Washington: US Department of Defense (1993)
W3C Working Draft (05-May-96) Cascading Style Sheets, level 1 (For latest version of specification contact http://www.w3.org/pub/WWW/TR/WD-css1.html)