© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman


Chapter 9
Tag Minimization

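This chapter explains the rules provided in SGML to reduce the number and size of markup tags that need to be entered by users. It is split into the following sections:

  9.1 Types of minimization
  9.2 Tag omission
  9.3 Short tags
  9.4 Tag grouping (rank)
  9.5 Automatic tag recognition (data tags)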

9.1 Types of minimization

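SGML provides four main techniques for minimizing the number and length of a document's markup tags:

  1. tag omission (OMITTAG)
  2. short tags (SHORTTAG)
  3. tag grouping (RANK)
  4. automatic tag recognition (DATATAG)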

The intention to use tag minimization features must be indicated by activating the appropriate MINIMIZE options in the FEATURES clause of the SGML declaration. By default only the OMITTAG and SHORTTAG options may be used.
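As a rough sketch, the FEATURES clause of the reference SGML declaration, in which only OMITTAG and SHORTTAG are switched on, takes the following form (the layout is illustrative, and only the MINIMIZE line concerns this chapter):

   FEATURES
      MINIMIZE  DATATAG NO   OMITTAG YES   RANK NO   SHORTTAG YES
      LINK      SIMPLE NO    IMPLICIT NO   EXPLICIT NO
      OTHER     CONCUR NO    SUBDOC NO     FORMAL NO

Changing DATATAG NO or RANK NO to YES would activate the techniques described in Sections 9.5 and 9.4 respectively.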

9.2 Tag omission

The most commonly used form of minimization is tag omission. This optional SGML feature allows tags to be omitted when their presence can be unambiguously implied by the program from the structure of the document declared in the document type definition.

Whenever the FEATURES clause of the SGML declaration contains the entry OMITTAG YES all element type declarations in the document type definition must contain two characters defining the type of omitted tag minimization permitted for the declared element(s). The first of these two characters is set to O (the letter O rather than the number 0) if the element's start-tag can be omitted: otherwise a hyphen is entered. If the element's end-tag can be omitted the second character is set to O: otherwise a hyphen is entered. The two characters must be separated by a space (or any other valid separator character) and must be separated from the preceding element type name, and the following model for the element's contents, by further spaces or separator characters.

Note: If tag omission has not been permitted in the SGML declaration the two tag omission characters can still be present in the element type declaration. It is, therefore, standard practice to put tag omission rules into DTDs even when the MINIMIZE section of the FEATURES clause contains an entry of OMITTAG NO in the SGML declaration.
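The declarations below sketch the most common combinations of the two tag omission characters. The element names are hypothetical and are not taken from any particular DTD:

   <!ELEMENT chapter  - -  (heading, para+)  -- both tags required -- >
   <!ELEMENT para     - O  (#PCDATA)         -- end-tag may be omitted -- >
   <!ELEMENT heading  O O  (#PCDATA)         -- either tag may be omitted -- >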

9.2.1 Start-tag omission

Start-tags can only be omitted from a document when:

  1. The element concerned is contextually required (i.e. must occur at that position) and
  2. Any other element that could occur at the same point, such as an element specified as an inclusion, is contextually optional.

Start-tag minimization should, therefore, only be used where the elements in the currently active model group are connected with sequence (SEQ) connectors. (If the AND or OR connectors are used the parsing program will not be able to uniquely determine which element's tag has been omitted.) Similarly, elements whose start-tags may be omitted should not be optional, i.e. have an OPT (?) or REP (*) occurrence indicator next to the element type name, as such indicators make it impossible to identify which element should occur next.
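As a sketch of this guideline, compare the two hypothetical content models below (the declarations for the other elements are left out). In the first the elements are connected by SEQ connectors and each is required, so a parser meeting data directly after <MEMO> can only be at the start of a <TO> element. In the second the OR connector leaves the choice open, so the start-tags of the embedded elements should always be entered:

   <!ELEMENT memo  - -  (to, from, body) >
   <!ELEMENT to    O O  (#PCDATA) >

   <!ELEMENT note  - -  (para | list)+ >
   <!ELEMENT para  - O  (#PCDATA) >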

Start-tags cannot be omitted for elements whose content type has been declared using the RCDATA, CDATA or EMPTY reserved names. Start-tags should also not be omitted where the first character in the element with the omitted tag is one of the short reference characters associated with the element in a short reference (SHORTREF) declaration that has been associated with the element through a short reference use (USEMAP) declaration, especially where the short reference would be associated with a different entity if the start-tag was not present.

When the presence of an omitted start-tag is implied by the parser, the currently defined default values will automatically be used for any attributes associated with the element. It is important, therefore, to ensure that the default values of any attributes associated with elements which may have their start-tags omitted are checked carefully before start-tag omission is permitted. If any of the attributes associated with the element is required, either because its default value has been declared using the #REQUIRED keyword or because the default value is #CURRENT and the element has not yet been used, the start-tag cannot be omitted. In such cases the full tag, including all compulsory attributes, must be added to the text.
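The following sketch, which uses hypothetical element and attribute names (the declarations for <TITLE> and <P> are left out), shows the effect of attribute defaults. The start-tag of <FRONT> can be implied because its only attribute has a declared default value, whereas the start-tag of <BODY> must always be entered because its ID attribute is required:

   <!ELEMENT report  - -  (front, body) >
   <!ELEMENT front   O O  (title) >
   <!ATTLIST front   status  (draft|final)  draft >
   <!ELEMENT body    - O  (p+) >
   <!ATTLIST body    id      ID             #REQUIRED >

   <REPORT><TITLE>Annual Review</TITLE>
   <BODY ID="b1"><P>...

Here the parser treats the text as if <FRONT STATUS="draft"> had been entered before <TITLE>.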

9.2.2 End-tag omission

The rules governing the omission of end-tags are much less restrictive than those for start-tags. End-tags can be omitted wherever the tag is followed by the end-tag of another currently open element (i.e. one started at a higher level than the current element) or when the tag is followed by an element, or data character, that is not a permitted part of the element's content model. End-tags can also be omitted if their presence can be implied by the end of an SGML document or subdocument.

9.2.3 Omitting tags

The following elements will be used to illustrate the effect of tag omission:

   <!ELEMENT section      - O  (title, p+, subsection*) >
   <!ELEMENT title        O O  (#PCDATA)  >
   <!ELEMENT p            - O  (#PCDATA|%phrases;|q)+  >

If tag omission is not permitted by the current concrete syntax (i.e. OMITTAG NO has been specified in the SGML declaration), a section defined using this model would need to be coded as:

   <SECTION><TITLE>Section Headings</TITLE>
   <P>Section headings should indicate ...
   ... end of the paragraph.</P>
   <P>An alternative use for section headings ...    
   ... at the end of the section.</P></SECTION>

Here each start-tag is matched by an equivalent end-tag, the start of each compulsory element in the model group always being required.

When tag omission is allowed, however, the coding can be simplified to:

   <SECTION>Section Headings
   <P>Section headings should indicate ... 
   ... end of the paragraph.
   <P>An alternative use for section headings ...   
   ... at the end of the section.</SECTION>

In this example the tag minimization option has led to a halving of the number of tags that need to be added to the text. Let us look at how this was achieved.

The most important saving occurred in the section title, where both the start- and end-tags have been omitted. The start-tag can be omitted because the presence of this compulsory first embedded subelement can be implied by the parser from the content model of the section element (<SECTION>). The parser knows from the content model that, before it can accept any data for the section, it must receive a start-tag for the <TITLE> element. As soon as it sees a character other than a start-tag delimiter (<) it will recognize that the character should be preceded by <TITLE>.

The end-tag for the title can be omitted because the <P> used to identify the start of the first paragraph in the section is not valid within the content model of the section title. As the section title can only consist of text the parsing program will automatically recognize that a </TITLE> tag should precede the first of the <P> tags.

The two paragraph end-tags (</P>) have been omitted from the minimized version of the coded text for different reasons. At the end of the first paragraph the tag can be omitted because the content model for the paragraph element does not allow other paragraphs to be directly embedded within a paragraph. As soon as the parsing program sees the second <P> it knows it can infer the presence of the end-tag of the preceding paragraph.

The second of the paragraph end-tags has been omitted because it is immediately followed by the end-tag of an element at a higher level in the document's structure. Providing the OMITTAG option has been activated, the parsing program will automatically close any currently open embedded element which has been declared as having omissible end-tags when it encounters an end-tag for a higher level element. (If an embedded element whose end-tag cannot be omitted is still open, however, the program will report an error in the coding.)

It should be noted that the presence of the </SECTION> tag is not compulsory. If the section shown in the above example was immediately followed by another section its end-tag could be omitted to give an entry of the form:

   <SECTION>Section Headings
   <P>Section headings should indicate ... 
   ... end of the paragraph.
   <P>An alternative use for section headings ... 
   ... at the end of the section.
   <SECTION>Omitting Tags
   <P>When specified in an element's declaration ...

When an SGML parser analyses this part of the document the presence of the start-tag (<SECTION>) for the second section will cause it to infer the presence of a </SECTION> tag at the end of the first section. From the presence of this implied end-tag the program will also be able to identify the need for a </P> tag to close the last paragraph of the preceding section.

9.3 Short tags

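Shortened versions of tags can be used whenever the FEATURES clause of the SGML declaration contains the statement SHORTTAG YES. There are four ways in which element tags can be shortened:

  1. by using empty tags
  2. by leaving tags unclosed
  3. by using null end-tags
  4. by omitting attribute names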

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the use of each of these features can be controlled individually by use of the short tag form control extensions to the SHORTTAG specification detailed in Chapter 4.

9.3.1 Empty tags

Empty tags are tags from which the element's name (and any attributes) has been omitted. An empty start-tag consists of the currently defined start-tag open and tag close symbols (< and > respectively in the reference concrete syntax), without an intervening space. Similarly an empty end-tag consists of the currently defined end-tag open and tag close delimiters (e.g. </>).

Empty tags can only be associated with the base document type, i.e. they apply to the first document type defined in the prolog. For empty end-tags the generic identifier added by the program is always the name of the last element to be opened in the base document type.

The way in which the program interprets an empty start-tag depends on whether or not tags can also be omitted from the document. If OMITTAG YES has been specified in the SGML declaration, the parser will give an empty start-tag the generic identifier of the most recently started element in the base document type. Otherwise the generic identifier used will be that of the most recently ended element in the base document.

Where tag omission is permitted, the first of the above rules allows the program to infer which generic identifier it should use before determining whether or not a tag has been omitted from the markup. By presuming that the last opened element is to be repeated the parser has a value which it can use to check for the omission of end-tags. It can then determine whether the last element used should be closed by the addition of an implied end-tag, or whether the new tag represents a further level of nesting within the document's structure.

When tags cannot be omitted, the last element to be closed is presumed to be the one to be repeated, even if the element is not a repeatable element. This can, unfortunately, lead to errors where the last element used is not repeatable as the program can report an empty start-tag as an error even when the content model unambiguously defines which element must occur next. For this reason it is better not to allow short tag minimization while tag omission is forbidden.
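A short sketch of this problem, assuming OMITTAG NO and the section declarations used in Section 9.2.3:

   <SECTION><TITLE>Section Headings</TITLE>
   <>Section headings should indicate ...

Because <TITLE> was the last element to be closed, the empty start-tag is read as <TITLE> rather than <P>, and the parser reports an error because a second title is not permitted by the content model.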

Typically a piece of text coded using empty tags will take the form:

   <P>This paragraph contains two lists. The first has four
   entries:<OL>
   <LI>item 1
   <>item 2
   <>item 3
   <>item 4</></>
   while the second only has two:<UL>
   <LI>first item
   <>second item.</></>
   <>Multiple lists ...

In this example the first three empty start-tags, and the first empty end-tag, will be given a generic identifier of LI as this was the last element to be started. Once the last of the items in the first list has been formally closed by the first empty end-tag, the ordered list (<OL>) becomes the currently active element. This list is then closed by the next empty end-tag (the second one of the first pair) before the second part of the paragraph element can be processed. The next empty start-tag and the first of the two empty end-tags are assigned LI as their element type name. The last end-tag will close down the unordered list (<UL>).

Provided OMITTAG YES has been specified in the SGML declaration, and tag omission has been allowed for all end-tags, the last empty start-tag will cause the currently open element, the paragraph element, to be closed before the element type name is reused as the name of the new element. The final result of parsing will be a file of the form:

   <P>This paragraph contains two lists. The first has four
   entries:<OL>
   <LI>item 1</LI>
   <LI>item 2</LI>
   <LI>item 3</LI>
   <LI>item 4</LI></OL>
   while the second only has two:<UL>
   <LI>first item</LI>
   <LI>second item.</LI></UL></P>
   <P>Multiple lists ... 

9.3.2 Unclosed tags

Where two or more consecutive tags are required in a document the end delimiters of all tags but the last one in the sequence can be omitted if SHORTTAG YES (or UNCLOSED YES when the Web SGML Adaptation extensions are in use) has been specified in the SGML declaration. No restriction is placed on whether the next tag in the sequence is a start-tag or an end-tag: end-tags can be followed by start-tags, and vice versa. The four permissible combinations are illustrated by the following examples:

   <P<EM>
   </EM</P>
   </TITLE<P>
   <ARTWORK sizey=120mm</P>

In the first case a new paragraph is to start with an emphasized phrase. An unclosed start-tag has been used for the first of the tags. The second example shows how the tags could be minimized by using an unclosed end-tag if a paragraph ended with an emphasized phrase. The third example shows a paragraph starting immediately after a title. The final example shows how the tags might be combined when the first tag has attributes associated with it.

Note: The use of unclosed tags is deprecated by the SGML community, but it can be useful in overcoming keying errors in environments that do not use SGML-sensitive editors for data capture.

9.3.3 Null end-tags

Null end-tags provide a means of specifying the end of an element in the base document type with a single character. In the reference concrete syntax the character defined as the null end-tag (NET) delimiter is the solidus (slash), but any code not defined as a name character can be assigned to this role within the SGML declaration.

Two stages are involved in activating the null end-tag option. The first step involves creating a net-enabling start-tag by replacing the tag close (TAGC) delimiter at the end of an element's start-tag with a NET delimiter. The second step involves replacing the whole of the element's end-tag with a matching NET delimiter.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available an optional net-enabling start-tag close (NESTC) delimiter can be defined in addition to the null end-tag delimiter. When this delimiter has been defined, the start-tag of an element to be ended with a null end-tag must end with the NESTC delimiter.

Note: If no NESTC delimiter is defined it is assumed to have the same definition as the NET delimiter.

In addition the adaptations provide a NETENABL extension in the SGML declaration's rules for short tag use that controls the ways in which null end-tags can be used. This extension allows one of three options to be selected:

  1. NETENABL ALL allows null end-tags to be used with all elements
  2. NETENABL IMMEDNET restricts null end-tags to elements that have no contents
  3. NETENABL NO forbids the use of null end-tags

Note: NETENABL IMMEDNET cannot be used if the empty element ending rule is set to EMPTYNRM NO.

The effect of applying these new rules can be seen in the definition assigned to the NESTC delimiter in the SGML declaration for XML shown in Chapter 4. Here the delimiter clause contains the entry NESTC "/" (which happens to be the default entry for the NET delimiter) while the NET delimiter has been redefined using NET ">". As a result an element with no contents would be presented as <HR/>.
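A sketch of the relevant part of such a delimiter clause, limited to the two entries quoted above (all other entries are left out):

   DELIM  GENERAL  SGMLREF
                   NESTC  "/"
                   NET    ">"

With these assignments the tag <HR/> is read as a net-enabling start-tag (<HR followed by the NESTC delimiter) that is immediately closed by the NET delimiter (>), which has the same effect as entering <HR></HR>.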

Note: XML uses this facility to provide functionality similar to that provided by the #CONREF attribute default value, which it does not support.

To see how null end-tags work in practice consider their use in conjunction with an <ISBN> element which can be defined as:

   <!ELEMENT ISBN  - -  CDATA --ISBN number-- >

Instead of entering an ISBN number as:

   <ISBN>0 201 17535 5</ISBN>

we can use the null end-tag option to enter the element in the shortened form:

   <ISBN/0 201 17535 5/

Notice that, by replacing the end delimiter of the start-tag with the special NET code, we have been able to reduce the end-tag to a single, matching, character. This feature is particularly useful when the content of the element has been declared, as in the above case, using the reserved name CDATA, or RCDATA, where the presence of an end-tag is compulsory.

Null end-tags can be used for any element declared in the base document that does not require the character assigned to the null end-tag role in its contents, or those of any embedded elements. However, because care is needed to ensure that the relevant element is not prematurely ended by entry of the character assigned as the NET code, null end-tags are normally only used for elements that do not contain embedded subelements.

Null end-tags can be nested to any level permitted for elements (typically the default value of 24 levels). Each null end-tag identified by the program closes down the last element defined using a net-enabling start-tag (i.e. one whose tag ends with a NET code). The following example shows how null end-tags can, with care, be embedded within each other:

   <P/Nested <EM/net-enabling start-tags/ are permitted, as this
      example shows./

The main point to notice about this example is that the number of net-enabling start-tags exactly matches the number of individual null end-tags in the paragraph.
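Under the reference concrete syntax, a parser would treat this example as if it had been entered in the following unminimized form:

   <P>Nested <EM>net-enabling start-tags</EM> are permitted, as this
      example shows.</P>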

The following is an example of an illegal use of a null end-tag:

   <P/Paragraphs cannot contain either/or choices if
   started with a net-enabling start-tag./

A program receiving this would terminate the paragraph after the word "either". It would then place the rest of the paragraph in the next higher open element, if permitted to, treating the slash intended to end the paragraph as a normal text character (unless the element concerned had been opened using a net-enabling start-tag!).
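In other words, assuming the reference concrete syntax and a parent element that was not itself opened with a net-enabling start-tag, the text would be read as if it had been entered as:

   <P>Paragraphs cannot contain either</P>or choices if
   started with a net-enabling start-tag./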

One way of avoiding this problem is to use a character reference, or an entity reference, to generate the embedded slash. In this case the paragraph could be amended to read:

   <P/Paragraphs cannot contain either&#47;or choices if
   started with a net-enabling start-tag./

or:

   <P/Paragraphs cannot contain either&sol;or choices if
   started with a net-enabling start-tag./

where &sol; has been defined as:

   <!ENTITY sol CDATA "/">

9.3.4 Omitting attribute names

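As mentioned in Section 5.4, there are two ways in which attribute specifications can be shortened when the default SHORTTAG YES entry in the SGML declaration has been left unchanged:

  1. by omitting the quotation marks around attribute values that consist only of name characters
  2. by omitting the attribute name when the value entered is one of the name tokens listed in the attribute's declaration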

For example, a start-tag of the form <ARTWORK width="40mm" align="center"> could be shortened to read <ARTWORK width=40mm center>.
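For this shortened tag to be valid the element would need declarations along the following lines. This is only a sketch: the element is assumed to be empty, and the declared value of width and the permitted alignment values are invented for illustration:

   <!ELEMENT artwork  - O  EMPTY >
   <!ATTLIST artwork  width  NUTOKEN              #IMPLIED
                      align  (left|center|right)  left >

The value 40mm can be entered without quotation marks because it contains only name characters, and the name align can be omitted because center appears in the name token group declared for that attribute.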

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the use of each of these omission features can be controlled individually by use of the ATTRIB options in the short tag form control extensions to the SHORTTAG specification detailed in Chapter 4. The options can also be used to define whether or not attribute value defaulting is applicable.

Note: As attribute value name tokens do not need to be unique when the Web SGML adaptations are being used, omission of attribute names is only permitted when the name token is valid for only one of the attributes of the element.

9.4 Tag grouping (rank)

When the current SGML declaration contains the statement RANK YES in the FEATURES clause, elements can be declared as ranked elements. When an element type declaration contains a rank stem and a rank suffix, the element's start-tag can be shortened by omission of the rank suffix, provided that the element concerned has been entered in full (i.e. with a numeric rank suffix) at some preceding point in the document. A typical declaration might take the form:

   <!ELEMENT HEADING 1 - - (#PCDATA) >

Where a single element type declaration has been used for a ranked group of elements, the rank suffix that is added to the end of a minimized ranked element is the last rank number entered for any member of that ranked group. For example, headings and related paragraphs could share the following model:

   <!ELEMENT (HEADING|P) 1 - - (#PCDATA) +(%phrases;) >

Note: As the rank option has been found to be confusing to users, use of this option will be deprecated in the next edition of ISO 8879. For this reason this optional SGML feature is not covered in any depth in this book. A more in-depth explanation of this feature can be found in Chapter 5 of SGML - An author's guide to the Standard Generalized Markup Language.

9.5 Automatic tag recognition (data tags)

Data tags are sequences of characters which, as well as forming part of the document, also mark the end of an element. Whenever the data tag option has been activated by entry of DATATAG YES in the FEATURES clause of the SGML declaration, the program will check the content of those elements whose declarations contain data tag definitions to see if it can identify the end of the element from the presence of a specified string of characters.

Data tags are declared within the content model of an element in the base document's DTD by replacing the name of one or more of the embedded elements with a data tag group. Each data tag group is enclosed within a special pair of delimiters known as the data tag group open (DTGO) and data tag group close (DTGC) delimiters. In the reference concrete syntax these delimiters are the open and close square brackets. The declaration within these delimiters has two main parts: the generic identifier of the element concerned and the data tag pattern to be checked for whenever the specified element has been opened. The data tag pattern can also consist of two parts. Each pattern must start with a data tag template, or a data tag template group, which defines the character sequence(s) the program is looking for. This can optionally be followed by a data tag padding template that identifies one or more characters which should be skipped if they occur immediately after a data tag. Each part of the data tag group is separated from the others by the currently defined sequence indicator (SEQ, a comma in the reference concrete syntax).

To see how data tags work consider the following set of declarations:

   <!ELEMENT  mission   - - (delegate+)    >
   <!ELEMENT  delegate  O O ([name, ", ", " "], for) >
   <!ELEMENT  name      O O (#PCDATA)   >
   <!ELEMENT  for       O O (#PCDATA)   >
   <!ENTITY   entry     STARTTAG  "delegate" >
   <!SHORTREF map1      "&#RS;"  entry >
   <!USEMAP   map1      mission >

A data tag template, consisting of a comma followed by a space, has been associated with the element called <NAME> that forms the first element within the <DELEGATE> element's content model. Optionally this template can be followed by more spaces, which the program should treat as part of the data tag. The short reference map associated with the <MISSION> element shows that a record start code (RS) will be recognized as the start of a new <DELEGATE> entry.

To see how these declarations affect the coding of a document consider the following list of members on a mission:

   <MISSION>
   James D. Mason,      ANSI
   Charles F. Goldfarb, ANSI
   James Clark,         BSI
   Martin Bryan,        BSI
   Yushi Komachi,       JIS</MISSION>

As soon as the program encounters the start-tag for the <MISSION> element it will invoke the short reference map (map1) that will identify the start of each line as the start of a <DELEGATE> entry.

Each <DELEGATE> element starts with the <NAME> of a delegate, the end of the name being indicated by a comma and at least one space. When the program sees this data tag sequence it will automatically infer the presence of a </NAME> end-tag. As the comma, and any immediately following spaces, are not part of the delegate's name, or part of the <FOR> element, the program automatically treats the data tag as an implied #PCDATA element, i.e. the <DELEGATE> element is considered to be defined as:

   <!ELEMENT delegate (name, #PCDATA, for) >

where #PCDATA can only consist of a comma followed by one or more spaces.

Because data tags act as real end-tags, rather than omitted end-tags, once the program has identified the end-tag for the <NAME> element it will also be able to infer the presence of the <FOR> start-tag immediately after the data tag, because the declaration for the <DELEGATE> element tells it that a <FOR> element must follow.

The combination of the data tag and the short reference for the <MISSION> element means that the program would treat the uncoded text as if it had been coded:

   <MISSION>
   <DELEGATE><NAME>James D. Mason</NAME>,      <FOR>ANSI</FOR>
   <DELEGATE><NAME>Charles F. Goldfarb</NAME>, <FOR>ANSI</FOR>
   <DELEGATE><NAME>James Clark</NAME>,         <FOR>BSI</FOR>
   <DELEGATE><NAME>Martin Bryan</NAME>,        <FOR>BSI</FOR>
   <DELEGATE><NAME>Yushi Komachi</NAME>,       <FOR>JIS</FOR></MISSION>

It is important to realize the difference between the role of the data tag and the short reference in the above example. While the short reference replaces the record start character it is linked to, the characters defined in the data tag template are retained as a special piece of (implied) parsed character data. It should also be realized that the data tag is looked for only while the <NAME> element remains the currently open element, while the short reference applies to any embedded elements as well (unless they invoke their own short reference map).

When declaring data tag templates it is important to ensure that the length of the data tag template, or data tag padding template, does not exceed that declared as the DTEMPLEN quantity in the SGML declaration. Similarly the data tags entered in the text, including any padding characters, must not exceed the DTAGLEN quantity. In the reference concrete syntax both of these quantities are set to 16 characters.
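If longer templates are needed, these quantities can be raised in the QUANTITY section of the concrete syntax defined in the SGML declaration. The values below are arbitrary and serve only as a sketch:

   QUANTITY  SGMLREF
             DTAGLEN   32
             DTEMPLEN  32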

Data tag templates cannot include numeric character references to non-SGML characters or SGML function characters, though these are permitted in entity strings. This prohibits the use of the &#9; sequence to identify a Tab code in a data tag, though it would be a valid part of the replacement text of an entity. If, for example, the Tab code had been used within the above example to position the second column, the declaration for the <DELEGATE> element would need to be altered to:

   <!ELEMENT delegate - O ([name, ",&#TAB;", "&#TAB;"], for) >

It should, however, be noted that if this declaration is to be used all entries must be keyed without any spaces between the comma and the Tab code. If there is a likelihood that the typist may key a space after the comma the range of permitted data tags should be extended by defining all the valid templates in a data tag template group. As with other groups within the content model, the data tag template group is enclosed by group open and group close symbols (left and right brackets in the reference concrete syntax). In the case of data tag template groups the entries within the group must be separated by OR connectors. This gives the entry the form:

   <!ELEMENT delegate - O ([name, (",	"|",&#TAB;"|", &#TAB;"), 
                                        "&#TAB;"], for) >

Unfortunately, groups are not permitted for the data tag padding template, so if Tab codes are used to move to the start of the next column any following spaces will not be recognized as a part of the template, and so will not be removed from the data stream.

References

Bryan, M.T. (1988) SGML - An author's guide to the Standard Generalized Markup Language Wokingham: Addison-Wesley (ISBN 0 201 17535 5).