© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman


Chapter 15
The Future for SGML and HTML

This chapter looks briefly at extensions to SGML and HTML that are likely to be defined in 1998. It covers what DSSSL brought to SGML, the HyTime SGML General Facilities annex, and possible extensions to SGML and to HTML.

15.1 What DSSSL brought to SGML

April 1996 saw the long-awaited publication of ISO/IEC standard 10179, which defined the Document Style Semantics and Specification Language (DSSSL) used to formally describe the formatting rules to be associated with an SGML document type definition. The new standard introduced two concepts that are central to the future development of SGML: SGML groves and the SGML Document Query Language (SDQL).

15.1.1 SGML groves

To make it possible to describe the sort of transformations that are needed to formally describe the relationships between the logical structure of an SGML document and the physical structure of a formatted document, DSSSL introduced a generalized model for the description of SGML document structures that went far beyond that defined in the Element Structure Information Set (ESIS) previously used for SGML conformance testing. To allow for the creation of books from more than one SGML document instance, and for the creation of more than one formatted document from an SGML document instance, the new concept is based on defining one or more trees of SGML properties. The generic name for this set of trees is an SGML grove.

SGML groves are defined in terms of a grove plan, which identifies the elements, attributes and related properties that are to form part of the trees in the grove. DSSSL extended the set of SGML properties defined in ISO/IEC standard 10744, the Hypermedia/Time-based Structuring Language (HyTime), but continued to use HyTime's formal mechanism for describing property sets to allow application developers to define any additional properties they might need to assign to groves over and above the standard set of properties defined in HyTime and DSSSL.

Once a grove plan has been established it becomes possible to identify nodes in an SGML tree. A node is an addressable object in an SGML tree. It should be noted that the tree of a document instance differs somewhat from the trees of the DTDs shown in Chapter 2 and elsewhere in this book. For document instances each occurrence of an element forms a node of the tree, rather than having one node for each type of element. Each element can have a set of relationships with other elements. For example, most embedded elements have a parent element, which will normally be linked to earlier ancestors of the element. If the element forms part of a model group it is likely to have older siblings that precede it in the document instance and younger siblings that follow it in the document instance. Unless the element is an empty element it will have children, either in the form of parsed data characters or in the form of embedded trees of descendant elements. Each attribute of an element will also form a tree attached to the element node. Attributes have properties such as name, value, tokens and characters that form their own hierarchy of addressable nodes.
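The relationships described above can be illustrated with a minimal Python sketch. It is only a simplified model under stated assumptions, not the grove data structure defined by DSSSL or HyTime: the Node class, its method names and the sample element names are all hypothetical, and only the parent, ancestor, sibling, child and attribute relationships are represented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # nodes are compared by identity, not by content
class Node:
    """Hypothetical stand-in for one element occurrence in a document instance."""
    name: str
    attributes: dict = field(default_factory=dict)
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list = field(default_factory=list)  # child Nodes or character data (str)

    def add(self, child):
        """Attach a child element (Node) or a run of character data (str)."""
        if isinstance(child, Node):
            child.parent = self
        self.children.append(child)
        return child

    def ancestors(self):
        """Return the parent first, then each earlier ancestor up to the root."""
        found = []
        node = self.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found

    def siblings(self):
        """Return (preceding, following) element siblings of this node."""
        if self.parent is None:
            return [], []
        elements = [c for c in self.parent.children if isinstance(c, Node)]
        position = next(i for i, c in enumerate(elements) if c is self)
        return elements[:position], elements[position + 1:]


# Each occurrence of <chapter> is its own node, unlike a DTD tree,
# which has one node per element type.
book = Node("book")
ch1 = book.add(Node("chapter", {"id": "c1"}))
ch2 = book.add(Node("chapter", {"id": "c2"}))
ch3 = book.add(Node("chapter", {"id": "c3"}))
title = ch2.add(Node("title"))
title.add("Groves")

before, after = ch2.siblings()
```

In this sketch the second chapter has one preceding and one following sibling, and the title element reaches the root by way of its parent chapter.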

15.1.2 The SGML Document Query Language

The SGML Document Query Language (SDQL) is specifically designed to identify nodes within an SGML tree using the properties identified in the grove plan defined for parsing a document instance. Expressed as extensions to IEEE's LISP-based Scheme programming language, SDQL expressions are designed to provide a transportable form of query that can be interpreted at run-time, without having to be predefined or preprocessed.

SDQL queries use keywords to identify nodes that have specific relationships to previously identified nodes, or to nodes that have identifiable properties, such as a unique identifier. Like all LISP expressions, SDQL expressions can be nested to any depth. For example, to select the children of an element with a particular unique identifier you would enter an expression of the form:

(children (select-match (attributes (current-node)) '(name: id value: object-x)))

SDQL expressions can reference a wide range of standard SGML properties, and can also map to user-defined properties of trees or documents. For example, they can be set up to look for elements that contain strings that require special processing, or to identify elements containing part-numbers or other data that needs to be processed by querying a relational database using SQL.
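For readers more familiar with general-purpose languages, the intent of a query that selects the children of the element carrying a particular unique identifier can be approximated in Python. This is a rough sketch and not SDQL: it assumes a small XML sample as a stand-in for a parsed SGML document instance, and the function name children_of_id and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a parsed SGML document instance.
SAMPLE = """
<doc>
  <section id="object-x"><title>One</title><para>Text</para></section>
  <section id="object-y"><title>Two</title></section>
</doc>
"""


def children_of_id(root, wanted_id):
    """Return the child elements of the first element whose id matches."""
    for element in root.iter():  # walks the root and all of its descendants
        if element.get("id") == wanted_id:
            return list(element)
    return []


root = ET.fromstring(SAMPLE)
names = [child.tag for child in children_of_id(root, "object-x")]
print(names)  # prints ['title', 'para']
```

Unlike this fixed function, an SDQL expression is interpreted at run-time against whichever properties the grove plan exposes.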

15.2 The HyTime SGML General Facilities annex

1997 saw the publication of a new SGML General Facilities annex as part of the Hypermedia/Time-based Structuring Language (HyTime) published as ISO/IEC standard 10744. The new annex will define a set of facilities that will be applicable to all SGML systems, not just those conforming to HyTime. The new annex will cover three main areas: Architectural Form Definition Requirements, Formal System Identifiers and Property Set Definition Requirements.

15.2.1 Architectural Form Definition Requirements

HyTime introduced the concept of architectural forms to SGML to allow a DTD to be defined in terms of known classes of elements. The Architectural Form Definition Requirements (AFDR) section of the SGML General Facilities annex will take those features of HyTime that are applicable to any architectural form and define them in a way that makes them easily referenceable by all SGML applications. It will define a set of attributes, and formal processing instructions, that can be used to identify which architectural forms are associated with a particular document prolog and instance.

The AFDR specifications allow architectural forms to be derived from other architectural forms. This will allow superclasses, such as HyTime, to form the basis of other architectural forms, such as those being defined for topic map navigation by the Committee for the Application of HyTime (CApH).

15.2.2 Formal System Identifiers

The SGML General Facilities annex will also contain the definition for the formal system identifiers mentioned in Section 6.3.1. Formal system identifiers have been introduced to make SGML system identifiers more transportable, and to add security control features to entity managers. As well as identifying the path and file names of the relevant files in a system independent way, a formal system identifier can also identify features such as its methods of compression, checksum sealing and encryption, any bit combination transformation process (bctp) carried out on data before transmission, and the type of record boundaries used within the file.

The HyTime SBENTO facility for creating a container that will allow a number of related files to be referenced using a single file identifier has also been moved to this part of the SGML General Facilities annex. As well as ensuring that related files can be referenced as a single unit, SBENTO containers also allow streams of data to be multiplexed in such a way that suitable amounts of data are transmitted in an appropriate sequence. This is particularly important when sending multimedia presentations over slow networks such as the Internet, where you need to ensure that both sound and visual components are delivered, in a synchronizable way, at rates that are suitable for presentation on the screen before all of the file has been received.
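The general idea of multiplexing streams can be sketched in Python as a simple round-robin interleave. This illustrates the concept only and does not reflect the SBENTO container format: the function names, the stream names and the chunk sizes are all hypothetical.

```python
from itertools import zip_longest


def tagged_chunks(name, data, size):
    """Yield (stream name, chunk) pairs of at most `size` bytes."""
    for start in range(0, len(data), size):
        yield name, data[start:start + size]


def multiplex(streams, chunk_sizes):
    """Interleave the streams, taking one chunk from each in turn."""
    sources = [
        tagged_chunks(name, data, chunk_sizes[name])
        for name, data in streams.items()
    ]
    for one_round in zip_longest(*sources):
        for item in one_round:
            if item is not None:  # this stream has already been exhausted
                yield item


# Hypothetical payloads: video needs more bytes per round than audio.
streams = {"audio": b"aaaa", "video": b"VVVVVVVVVVVV"}
chunk_sizes = {"audio": 2, "video": 6}
sequence = list(multiplex(streams, chunk_sizes))
```

Because each round carries a slice of every stream, a receiver could begin presenting sound and picture together before the whole container has arrived.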

15.2.3 Property Set Definition Requirements

The SGML General Facilities annex will also contain those parts of the HyTime standard that are used for the definition of the property sets used to identify the component parts of an SGML grove. This will include a set of definitions that can be used to identify virtually all aspects of SGML. In addition the annex will define mechanisms for defining classes of properties.

As well as formally defining SGML groves this annex will also provide facilities for controlling what SGML parsers will generate when parsing particular document instances. It will also contain the definitions for the HyTime <LEXTYPE> element, which can be used to create user-defined lexical types. This will mean that the concepts introduced in HyTime to provide a customizable method for checking the values of attributes or the contents of elements will now be usable in non-HyTime applications. To see the effect of this, consider the following attribute list definition, which allows more than one attribute to have a value of YES or NO:

<!ATTLIST switch1 CDATA --lextype: ("YES"|"NO")-- NO
          switch2 CDATA --lextype: ("YES"|"NO")-- NO >
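The kind of check that such a lexical type makes possible can be sketched in Python. This is an illustration under stated assumptions rather than HyTime's lexical model notation: regular expressions are used as a stand-in for the lextype declarations, and the LEXTYPES table and check_attributes function are hypothetical names.

```python
import re

# Hypothetical table pairing each attribute with a pattern that plays
# the role of its lexical type.
LEXTYPES = {
    "switch1": r"YES|NO",
    "switch2": r"YES|NO",
}


def check_attributes(attributes, lextypes=LEXTYPES):
    """Return a list of messages for values that break their lexical type."""
    errors = []
    for name, pattern in lextypes.items():
        value = attributes.get(name)
        if value is not None and re.fullmatch(pattern, value) is None:
            errors.append(f"{name}={value!r} does not match {pattern}")
    return errors


valid = check_attributes({"switch1": "YES", "switch2": "NO"})
invalid = check_attributes({"switch1": "MAYBE", "switch2": "NO"})
```

A declared value type of CDATA alone would accept MAYBE; the additional lexical check is what allows it to be rejected.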

15.3 Possible extensions to SGML

As part of ISO's standard 5-year review cycle SGML is now undergoing its 10-year review. During 1998 JTC1/WG4, the ISO committee responsible for SGML, will be doing a clause by clause examination of ISO 8879:1986. A number of areas of improvement have already been identified, including:

Many more extensions to SGML are planned, but the guiding principle behind all extensions will be that any currently valid SGML document will remain valid under any future revision to the standard.

15.4 Possible extensions to HTML

Whilst the December 1997 version of the HTML DTD introduces a number of useful new concepts it still only provides a limited set of elements, which do not include all the elements currently supported by the manufacturers of HTML document browsers. In December 1997 Microsoft announced that they plan to use the Extensible Markup Language (XML) developed by what was the SGML on the Web Working Group for future extensions to HTML, which they plan to introduce across their product range as part of their 1998 product upgrades.

The Extensible Markup Language became a formal W3C recommendation in February 1998. It will be followed by an XML Link Language (XLL) and an XML Style Language (XSL). At present, timeframes for these important languages are unclear, though XLL should be formalized in 1998, with XSL perhaps taking slightly longer due to problems with getting buy-in from some manufacturers.

Warning: Expect a book called XML Explained during 1999!

November 1997 saw the publication of a discussion draft for a second version of the Cascading Style Sheet language (CSS2). By March 1998 W3C member companies were being asked to formally approve the completed specification. The new specification allows much greater control of on-screen presentation, control of audio presentation and the ability to "paginate" displayed and printed output.

In December 1997 ISO/IEC JTC1/WG4 approved the preparation and publication of a subset of the strict variant of the HTML Version 4.0 DTD that is acceptable to nearly every existing WWW browser as an ISO standardized form of HTML. This will become ISO standard 15445.

The latest version of the HTML DTD presumes that the full ISO/IEC 10646 character set will be supported on future releases of WWW browsers. Mechanisms for identifying language preferences of users at the HTTP level will help in the selection of suitable versions of documents, but until a better mechanism is provided for switching from one language to another while maintaining the current point in the file it will not be possible to develop truly multilingual HTML applications. As the existing mechanism provided through the <LINK> element is poorly understood, and rarely implemented suitably, it is likely to be some time before facilities for supporting the sort of multi-cultural applications now being postulated for the WWW are available.

As this book has shown, SGML provides a very flexible method for marking up documents for interchange between computer systems. New facilities are constantly being added to both SGML and HTML as more complex applications are coded using these markup languages. Today's new ideas will be taken for granted in a few years' time. By the beginning of the 21st century the concepts of structurally based generic markup introduced by the ISO as far back as 1986 will have become a standard part of virtually every word processing system, and of many millions of computerized data servers. The last 10 years have been very interesting for those of us involved in the development of SGML. The next 10 years are likely to be even more exciting.