© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
The phenomenal growth of the Internet during the first half of the 1990s has led to an ever increasing awareness of the benefits of generic markup in the interchange of documents. In particular, the success of the HyperText Markup Language (HTML) has improved awareness of the relevance of the Standard Generalized Markup Language (SGML) developed by the International Organization for Standardization (ISO) as a method for describing documents in a way that makes it easy to move them from one platform to another.
At present there is a large gap between the levels of knowledge of those who work only with HTML and those who have been using the full power of SGML for some time to manage their documents. One of the purposes of this book is, therefore, to try to increase the awareness of those using HTML of the role SGML plays in ensuring that their documents can be easily exchanged between systems.
This chapter provides some background material on the development history of both HTML and SGML. It is split into the following headed sections:
The Internet is a set of interconnected computer networks that has been developed over the last three decades of the 20th century. The ARPANET, funded by the US Defense Advanced Research Projects Agency (DARPA), was one of the first networks to interconnect government, academic and private research organizations. One of the early European connections was established with the European Organization for Nuclear Research (CERN) in Geneva. Today the Internet that grew from this early beginning covers many millions of computers located on all the continents of this planet, including Antarctica!
Initially the Internet was used primarily for sending electronic messages and transferring files. It was not long before groups of people interested in the same subjects set up news groups so that messages could be shared among the relevant user community. To allow people to request files from another system without having to send a message requesting a copy of it, the Internet community developed a File Transfer Protocol (FTP).
While these early techniques were very helpful in speeding up the interchange of information between researchers, they had limitations when it came to document handling: you could not be sure what format the files you requested would be in, or whether you would be able to process them when you received them. Tim Berners-Lee, a researcher at CERN, developed a document browser that could request files over the Internet and display them in a predefined format. To do this he introduced two new specifications to the Internet: the HyperText Transfer Protocol (HTTP) and the HyperText Markup Language (HTML). The CERN browser soon became the standard tool through which researchers requested documents ('surfed') over the Internet.
As HTML document browsers became readily available, more and more people started referencing existing documents over the Internet. A standard method of identifying files using Uniform Resource Locators (URLs) was developed so that browsers could share files more easily. Files that were interconnected in this manner were seen to form a World Wide Web (WWW) of data. This name is, nowadays, synonymous with the use of HTTP and HTML to interchange electronic documents.
The spread of the Internet has led to an increased awareness of the advantages of transferring documents from one word processor to another in electronic form. Many of the currently available techniques do not, however, allow information about how a document is to be presented to the user to be transferred with the text. Sometimes presentation details cannot be transferred because they are coded in a machine-specific manner that may not be understood at the receiving end; sometimes they are coded using characters that are outside the range allowed by the electronic delivery subsystem. The absence of formatting clues to the "structure" of a received document can lead to a degeneration of the transmitted message. SGML allows information about a document's structure to be preserved as it is transferred from one computer system to another.
Documents that are to be freely interchanged over the WWW are typically coded using the HyperText Markup Language (HTML), which is a simple application of SGML. For more complex documents users can define the structure of their information set using an SGML document type definition. This structure can be used directly by a plug-in to a standard HTML document browser that understands SGML to present the stored information in a suitable manner, or can be converted to HTML for display using a basic WWW browser.
The structure of a document can be used to help users identify and move to relevant parts of an electronic document. Electronic documents do not need to rely on the tables of contents, indexes and cross-references used in printed documents to help users find information. By making tables of contents active, users can move directly to headed sections of text without having to scroll through intervening pages.
Electronic books are typically displayed as a continuous set of paragraphs rather than being split up into arbitrary pages. Cross-references to a particular page number are not, therefore, appropriate in an electronic environment. Instead electronic cross-references should be presented as a field that the user can click on to move directly to the relevant point in the text. Indexing is often replaced in electronic documents by features that allow you to search for particular words or phrases within the document.
Markup is the term used to describe codes added to electronically prepared text to define the structure of the text. (Markup is spelt as one word when applied to electronic files to distinguish it from the form of mark up traditionally used by graphic designers, which was normally hand-written onto printed copy.)
Any device that stores formatted text for later recall uses some form of markup, though this may not be apparent. Sometimes the points at which markup has been added to a document can only be identified on a display screen by a change of typeface, or by the addition of a special marker. Wherever a change of features is found in an electronic document it can be presumed that some form of electronic markup has been recorded. In many cases some form of delimiter will be used to identify the start and, optionally, end of this markup.
Each word processing program has its own set of markup instructions, though it will often be able to import files coded by other programs. Moving from one word processor to another involves authors in the costly and time consuming exercise of learning a new set of markup instructions. The changeover may also require the conversion of existing text before it can be used with the new software.
There are three types of electronic markup in common use today:
Specific markup describes the format of a document by use of instructions that are specific to the program used to generate or output the text. Such instructions normally have immediate effect on the appearance of the text. They can either affect the appearance of the characters (e.g. by selecting bold or italic text) or the position of characters or lines (e.g. by adjusting indent, margin or spacing values).
Generalized markup normally identifies a style sheet to be associated with the following text. The name of the style sheet can indicate the basic structure of the document, but in general a different style sheet name is required for each variant of element concerned. For example, if paragraphs in an appendix are to be set in a smaller size than those in the body of a document they must be marked up using a different name.
Generic markup concentrates on the role of the associated data, leaving differences in the way the data is to be presented in different contexts to a later stage in the process. Typically generic markup instructions identify elements such as headings, paragraphs, highlighted words, quoted text, etc. The presentation format of the elements of a generically marked up document is context specific. For example, a paragraph in an appendix can have the same name as one in the body of the text, but will be presented in a different format.
Specific markup instructions can take many forms. One of the most commonly used forms is the Rich Text Format (RTF) used to interchange documents between different environments using the Microsoft Word program. (This format is often used as a common denominator when interchanging documents between different word processors as most programs can import and export files in RTF format.) RTF commands begin with a backslash (\) followed by one or more letters identifying the function to be used (e.g. \b for Bold).
There are three basic types of RTF commands:
Although these three basic types of markup instruction occur in most markup languages, the way in which they are used can differ from program to program. For example:
Some markup instructions need to be qualified by the addition of one or more numbers or other parameters. As such instructions can be of variable length, two basic techniques are used to identify them:
Wordstar Dot Commands are an example of text formatting instructions that are placed on separate lines. A Wordstar Dot Command consists of a dot (full stop) followed by two letters identifying the command required and any relevant parameter(s). (The initial dot always appears in the first column of a line.) Parameters may be variable length numbers or one of a predefined set of control words (e.g. ON or OFF).
A typical Wordstar document might start:
.PL 66
.MT 6
.MB 9
.UJ ON
^A^BChapter 1
INTRODUCTION^B
The spread of word processors ...
These instructions tell the program that the page length is 11 inches (.PL 66 = 66 lines at the standard spacing of 1/6th of an inch), with a top margin of 1 inch (.MT 6 = 6 lines) and a bottom margin of 1.5 inches (.MB 9 = 9 lines). After switching on justification (.UJ ON) the first Print Control Command requests that the character pitch be changed to 12 characters per inch (^A = Elite) while the second (^B) requests the bold version of the typeface. This bolder face remains in force until the end of the heading, where it is switched off by a second ^B sequence.
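The line-count arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not a Wordstar implementation: the function names are invented for the example, and it assumes that every line beginning with a dot is a dot command taking at most one parameter.

```python
LINES_PER_INCH = 6  # the standard 1/6th of an inch line spacing quoted above


def parse_dot_commands(text):
    """Collect dot commands from lines that start with a dot in column one."""
    settings = {}
    for line in text.splitlines():
        if line.startswith("."):
            parts = line[1:].split()
            if not parts:
                continue
            name, params = parts[0].upper(), parts[1:]
            settings[name] = params[0] if params else None
    return settings


def lines_to_inches(lines):
    """Convert a line count such as '66' into inches."""
    return int(lines) / LINES_PER_INCH


sample = """.PL 66
.MT 6
.MB 9
.UJ ON
^A^BChapter 1
INTRODUCTION^B
The spread of word processors ..."""

settings = parse_dot_commands(sample)
for name in ("PL", "MT", "MB"):
    print(name, lines_to_inches(settings[name]), "inches")
```

The text lines, including the caret sequences that stand for the Print Control Commands, are ignored by this sketch because they do not start with a dot.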
RTF uses delimiters to identify the scope of its formatting instructions. For example, to set the above heading without associating a style sheet with it, the following RTF command sequence could be used:
\pard\plain\qc
{\b\f4\fs36 Chapter 1
\par INTRODUCTION
\par }
The first line defines the style of following paragraphs (\pard), identifying it as being based on the default style (\plain) but quad centered (\qc) rather than left aligned. The text between the curly braces is to be set using the bold (\b) variant of the font identified by the number 4 (\f4) using an 18pt font size (\fs36, as RTF font sizes are expressed in half-points). The end of paragraph (\par) command shows where the quadding rules defined in the paragraph definition are to be applied.
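As a rough illustration of how a program might pick such a sequence apart, the following Python sketch extracts the control words and their numeric parameters. It is a simplification made for this example (the function names are invented, and control symbols, escaped characters and group nesting are ignored), not a complete RTF reader.

```python
import re

# A control word is a backslash, some letters and an optional numeric parameter.
CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)?")


def control_words(rtf):
    """Return (name, parameter) pairs; the parameter is None when absent."""
    return [
        (m.group(1), int(m.group(2)) if m.group(2) else None)
        for m in CONTROL_WORD.finditer(rtf)
    ]


def font_size_points(words):
    """Return the size requested by the first fs control word, in points."""
    for name, param in words:
        if name == "fs" and param is not None:
            return param / 2  # fs values are given in half-points
    return None


sample = r"""\pard\plain\qc
{\b\f4\fs36 Chapter 1
\par INTRODUCTION
\par }"""

words = control_words(sample)
print(words)
print(font_size_points(words))  # 18.0
```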
It will be seen from the above examples that inter-relating the instructions used by different word processing programs is not straightforward. To convert a file created using one package to a form that can be understood by another may require the entries to be redefined using different mnemonics in a different sequence.
To avoid having to re-specify markup instructions whenever a change of format, or output device, is required, the concept of a generalized markup language was postulated by Goldfarb et al. (1970) as part of an IBM research project. The idea was based on two premises:
These techniques formed the basis of IBM's Document Composition Facility Generalized Markup Language (DCF GML). A typical chapter in a GML coded document might start:
:book.
:body.
:h1.INTRODUCTION
:p.The spread of word processors ...
It can be seen that this generalized markup differs from the earlier, specifically coded, examples in a number of important respects. It starts with two instructions identifying the type of document being prepared (a book) and the section of the document in which the text is to be placed (the main body of the book). These instructions clearly show the structure of the document and define the role of the following text.
The next point to notice is the absence of the line containing Chapter 1. With generic coding, the format and numbering of chapters (identified by the :h1. instruction) can be taken care of automatically by the text formatting program. This has the advantage that the sequence in which chapters are output can be changed without the author having to renumber each one. In addition, decisions as to whether or not the word Chapter is to appear, and whether numbers are to be printed using arabic or roman numerals, or simply as words, can be controlled by the printer of the document rather than its author.
The final point to notice about the GML coded version of the chapter opening is that it has not been necessary to indicate that the heading is to be set in bold, or in a larger point size. The text formatting program knows that, when it reaches the instruction to start a paragraph of text (:p.), it should return to the standard form of text, after applying any necessary paragraph indents.
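The automatic numbering of chapters can be illustrated with a short Python sketch. It is not how DCF GML itself works: the function names, the prefix and roman options, and the second heading (MARKUP) are all assumptions made for the example.

```python
def to_roman(n):
    """Convert a positive integer to a roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for value, numeral in pairs:
        while n >= value:
            out += numeral
            n -= value
    return out


def format_headings(gml_lines, prefix="Chapter", roman=False):
    """Number each :h1. heading in the order in which it is encountered."""
    headings = []
    chapter = 0
    for line in gml_lines:
        if line.startswith(":h1."):
            chapter += 1
            number = to_roman(chapter) if roman else str(chapter)
            label = f"{prefix} {number}".strip()
            headings.append((label, line[len(":h1."):]))
    return headings


source = [
    ":book.",
    ":body.",
    ":h1.INTRODUCTION",
    ":p.The spread of word processors ...",
    ":h1.MARKUP",  # hypothetical second chapter
]

print(format_headings(source))
print(format_headings(source, prefix="", roman=True))
```

Because the numbers are generated when the text is formatted, reordering the :h1. lines changes the numbering without any change to the author's markup.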
Dr. Goldfarb used the concepts of generalized markup developed for the DCF GML project as the basis for developing a generic markup scheme for the American National Standards Institute (ANSI). This scheme was then internationalized to become ISO's Standard Generalized Markup Language (SGML).
Generic markup schemes differ from other generalized ones in that it is no longer necessary to have a one-to-one correspondence between style sheet names and the names of markup tags. With generic coding the markup tag only identifies the role of the data element. The way this element will be presented to users depends on the context in which it is used. For example, a paragraph may be formatted differently if it appears immediately after a heading, or in a foreword or appendix, without having to be coded differently from other paragraphs.
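One way to picture context-dependent presentation is as a lookup keyed on both the element and its surroundings. The Python sketch below is purely illustrative: the style rules, the sizes and the function name are assumptions, not part of SGML or of any particular formatter.

```python
# Hypothetical style rules keyed by (containing element, element name).
# A key with None as the container is the default for that element.
STYLES = {
    ("appendix", "p"): {"size_pt": 9},
    ("foreword", "p"): {"size_pt": 10, "italic": True},
    (None, "p"): {"size_pt": 11},
}


def style_for(element, ancestors):
    """Pick the rule for the nearest ancestor that has one, else the default."""
    for ancestor in reversed(ancestors):
        if (ancestor, element) in STYLES:
            return STYLES[(ancestor, element)]
    return STYLES[(None, element)]


print(style_for("p", ["book", "body"]))      # {'size_pt': 11}
print(style_for("p", ["book", "appendix"]))  # {'size_pt': 9}
```

The paragraph is tagged identically in both cases; only the list of containing elements differs.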
SGML operates at a number of levels, depending on the features required by the application. The standard provides facilities for defining:
The structure of an SGML-coded document, and details of optional SGML features used in its preparation, are formally defined in a set of markup declarations that form a document type definition (DTD). These markup declarations describe a set of markup instructions, known as tags, that can be used to identify the start or end of the logical elements of the text. The start of each element is normally marked by a start-tag, an end-tag being used to indicate where the element ends.
While separate sets of markup tags can be used for different applications, the same basic elements occur in most documents. For instance, a paragraph of text forms a logical element of a letter, report, paper or book, and can be allocated the same markup tag in each type of document (e.g. <p>). Shared markup declarations can be stored in files that can be referenced from a number of document type definitions.
Where necessary SGML markup tags can be qualified by attributes. Attributes are used within SGML to:
Attributes can also allow users to control the way in which text is presented to readers.
The amount of keying required to capture a document can be reduced by assigning names to data storage entities that contain text that will be used in more than one place. For example, a general entity reference called &SGML; could be used to enter the phrase Standard Generalized Markup Language at relevant points. The replacement text for each entity can either be declared within the document type definition currently being used (local entities) or can be stored in an external file that is recalled as the document is processed (external entities).
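The effect of a general entity reference can be sketched in Python as a simple substitution. This is a simplified illustration with invented names: it assumes the replacement text has already been collected into a dictionary, and it does not handle replacement text that itself contains further references.

```python
import re

# An entity reference: ampersand, a name, then a semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, entities):
    """Replace each known entity reference; leave unknown ones untouched."""
    def replace(match):
        return entities.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)


entities = {"SGML": "Standard Generalized Markup Language"}
expanded = expand_entities("The &SGML; is defined in ISO 8879.", entities)
print(expanded)
```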
Where complex structures are required for specific parts of a document, SGML allows externally stored subdocuments, based on an alternative document type definition, to be merged with the main document.
Numeric or named character references can be used within SGML documents to request characters that are not included in the word processor's character set. This technique allows characters outside the basic character set of a word processor to be incorporated into a transmitted file in a way that makes them interpretable when received by a different program or operating system.
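The idea can be illustrated in Python by replacing every character outside the 7-bit range with a numeric character reference before transmission. The function name is an assumption made for this sketch; the receiving side is simulated with the standard library's html.unescape, which understands references of this form.

```python
import html


def to_ascii_with_references(text):
    """Replace each non-ASCII character with a numeric character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


original = "Centre Européen"
encoded = to_ascii_with_references(original)
print(encoded)                 # Centre Europ&#233;en
print(html.unescape(encoded))  # Centre Européen
```

The encoded form contains only characters that can be safely transmitted, yet the receiving program can restore the original text.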
SGML also provides ways of declaring short cuts to document markup. The amount of markup that needs to be entered or transmitted can be reduced by:
These techniques make it possible to reduce the amount of markup required in an SGML-coded file to a minimum.
Users of SGML systems may hardly notice the difference between their existing word processors and SGML-based programs. Both may use the same sequences of key or button depressions to enter and format the text. The main difference will be that the set of tags/buttons that are permitted/active at a particular point in an SGML document will be restricted to the set that is defined for the containing element in the document type definition. This will mean, for example, that it will no longer be possible to place a third level heading directly under a first level heading if the document type definition requires there to be an intervening second level heading.
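The heading restriction mentioned above can be sketched as a simple check in Python. An SGML system would derive such a rule from the content models in the document type definition; this illustration, with an invented function name, hard-codes the single rule that a heading may be at most one level deeper than the heading before it.

```python
def check_heading_order(levels):
    """Return the positions of headings that skip a level (e.g. 1 then 3)."""
    errors = []
    previous = 0
    for index, level in enumerate(levels):
        if level > previous + 1:
            errors.append(index)
        previous = level
    return errors


print(check_heading_order([1, 2, 3, 2, 1]))  # []
print(check_heading_order([1, 3]))           # [1]
```

An editing program could use a check of this kind to disable the third level heading button whenever the cursor sits directly under a first level heading.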
Word processors will only be able to import and format SGML-coded documents if they have a program that is capable of converting SGML markup into a form that can be understood by the formatting program used by the word processor. For generalized SGML documents this can be a difficult process. There is, however, one particular application of SGML that is specifically designed to make it as easy as possible to convert word processed text into and out of SGML - the HyperText Markup Language (HTML) used on the World Wide Web.
When Tim Berners-Lee created the first HTML browser at CERN he was not designing an SGML system. Initially he just wanted a mechanism that would describe the processes going on within his browser in a form that could be safely transmitted over the Internet. As the SGML developers had found, the safest code set for the transmission of information between computer systems is that defined in ISO 646, which formally defines an International Reference Version (IRV) of the code set originally created by the American National Standards Institute (ANSI) as the American Standard Code for Information Interchange (ASCII). To delimit his markup instructions from the text Tim Berners-Lee chose the same delimiters as SGML, the < and > characters. His initial markup instruction set included things like end of paragraph (<P>), italic (<I>) and bold (<B>). To end italic and bold text strings Tim Berners-Lee chose to use end delimiters of the same form as SGML, </I> and </B>.
While at first glance this initial coding scheme looked like SGML, it was apparent to those who knew SGML that there were fundamental differences between the concepts behind HTML and those behind SGML. In particular the role of the <P> tag was fundamentally different. In HTML this tag initially only indicated the point at which a paragraph end was required. In SGML this tag indicates the start of a paragraph, with a matching </P> tag identifying the end of the paragraph, where the actual paragraph break occurs.
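The difference between the two readings of the tag can be shown with a small Python sketch that converts text in which <P> acts as a paragraph separator into text in which each paragraph is a container. The function name and the sample text are invented for the example, and other markup is ignored.

```python
def separators_to_containers(text):
    """Wrap each run of text between <P> separators in <P>...</P>."""
    paragraphs = [part.strip() for part in text.split("<P>")]
    return "".join(f"<P>{p}</P>" for p in paragraphs if p)


early_style = "First paragraph.<P>Second paragraph.<P>"
print(separators_to_containers(early_style))
# <P>First paragraph.</P><P>Second paragraph.</P>
```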
Another fundamental difference was that HTML originally had no control on when you should switch bold and italic on and off. As with many word processors, there was nothing to stop users from switching on bold in the middle of one paragraph, and then switching on italic at the start of the next. In such cases the first part of the second paragraph would be set in bold italic. If bold was then switched off the text would continue to be presented in italic until a command to switch off italic was received.
The problem with adopting this approach is that you could not be sure that all HTML document browsers would work in exactly the same way. Some might choose to automatically switch bold and italic off when they started a new paragraph. This led to the same document providing different results in different browsers.
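The two behaviours can be compared with a small Python sketch. It models no real browser: the function name, the reset_at_paragraph switch and the sample markup are assumptions made for the illustration, and <P> is treated as a paragraph break in the manner of the earliest HTML.

```python
import re

TAG = re.compile(r"(</?[A-Za-z]+>)")


def render(source, reset_at_paragraph):
    """Return (text, bold, italic) runs for on/off style markup."""
    bold = italic = False
    runs = []
    for token in TAG.split(source):
        if not token:
            continue
        tag = token.upper()
        if tag == "<B>":
            bold = True
        elif tag == "</B>":
            bold = False
        elif tag == "<I>":
            italic = True
        elif tag == "</I>":
            italic = False
        elif tag == "<P>":
            if reset_at_paragraph:
                bold = italic = False
        else:
            runs.append((token, bold, italic))
    return runs


source = "plain <B>bold<P><I>both</B>italic</I>"
print(render(source, reset_at_paragraph=False))
print(render(source, reset_at_paragraph=True))
```

With the first setting the word both is reported as bold italic; with the second it is reported as italic only, so the same source yields two different presentations.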
The philosophical differences between HTML and SGML were resolved with Version 2.0 of HTML, when it was decided to use an SGML document type declaration to formalize HTML so that restrictions could be placed on where each of the HTML elements could be started and ended. The formal definition made it clear that it was no longer permissible to extend formatting instructions over paragraph boundaries. It also introduced logical equivalents for formatting related instructions. For example, emphasis (<EM>) was introduced to replace italic (<I>) and <STRONG> was introduced to replace bold (<B>).
HTML is not as well controlled structurally as most SGML document type definitions. It is still more presentation oriented than structurally ordered. Version 3.2 of the standard, which was announced in May 1996, introduced some additional structural elements, including one that can be used to arbitrarily group sets of elements. It is likely that the trend of introducing a greater range of logical elements will increase over time.
HTML is very easy to map to and from word processing software. At its simplest level it can be looked at as an SGML representation of RTF. HTML is an ideal way of getting users of existing word processors to start to use generic markup tags as it can be introduced into existing word processors with very few changes to existing document creation processes. This book will use HTML to explain many of the features of SGML that it utilizes, but will also explain why some of the techniques used in HTML still represent poor practice for SGML document creation.
Goldfarb, C.F., Mosher, E.J. and Peterson, T.I. (1970). 'An Online System for Integrated Text Processing', Proceedings of the American Society for Information Science, 7, 147-150.
International Organization for Standardization (1991), Information Processing - 7-bit coded character set for information interchange (ISO 646:1991) Geneva: ISO.
International Organization for Standardization (1986), Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML) (ISO 8879:1986) Geneva: ISO.
Internet Engineering Task Force RFC 1866 The HyperText Markup Language (HTML), Version 2.0 (1995) Reston, Virginia: Internet Society.
Microsoft Word for Windows Technical Reference (1989) Microsoft Corporation.