© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
This chapter explains the rules provided in SGML to reduce the number and size of markup tags that need to be entered by users.
SGML provides four main techniques for minimizing the number and length of a document's markup tags:
tag omission (OMITTAG)
short tags (SHORTTAG)
ranked elements (RANK)
data tags (DATATAG).
The intention to use tag minimization features must be indicated by
activating the appropriate MINIMIZE options in the FEATURES
clause of the SGML declaration. By default only the OMITTAG and
SHORTTAG options may be used.
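As a sketch of where these options are set, a FEATURES clause that keeps the default settings would read roughly as follows. The layout is free, and only the MINIMIZE line concerns this chapter:

```sgml
FEATURES
  MINIMIZE  DATATAG NO   OMITTAG  YES  RANK     NO  SHORTTAG YES
  LINK      SIMPLE  NO   IMPLICIT NO   EXPLICIT NO
  OTHER     CONCUR  NO   SUBDOC   NO   FORMAL   NO
```

Changing RANK NO to RANK YES, or DATATAG NO to DATATAG YES, activates the corresponding minimization technique.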
The most commonly used form of minimization is tag omission. This optional SGML feature allows tags to be omitted when their presence can be unambiguously implied by the program from the structure of the document declared in the document type definition.
Whenever the FEATURES clause of the SGML declaration contains
the entry OMITTAG YES, all element type declarations in the document type
definition must contain two characters defining the type of omitted
tag minimization permitted for the declared element(s). The first of
these two characters is set to O (the letter O rather than the
number 0) if the element's start-tag can be omitted: otherwise a hyphen is
entered. If the element's end-tag can be omitted the second character is set to
O: otherwise a hyphen is entered. The two characters must be
separated by a space (or any other valid separator character) and must be
separated from the preceding element type name, and the following model for the
element's contents, by further spaces or separator characters.
Note: If tag omission has not been permitted in the SGML declaration
the two tag omission characters can still be present in the element type
declaration. It is, therefore, standard practice to put tag omission rules into
DTDs even when the MINIMIZE section of the FEATURES
clause contains an entry of OMITTAG NO in the SGML declaration.
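The four possible combinations of the two tag omission characters can be sketched with the following declarations. The element names and content models are hypothetical, chosen only to show where the characters sit:

```sgml
<!ELEMENT chapter - - (heading, body)  -- both tags must be entered -- >
<!ELEMENT heading O O (#PCDATA)        -- both tags may be omitted -- >
<!ELEMENT body    O - (para+)          -- start-tag may be omitted -- >
<!ELEMENT para    - O (#PCDATA)        -- end-tag may be omitted -- >
```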
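Start-tags can only be omitted from a document when the element is contextually required (i.e. it must occur at that point in the document's structure) and any other elements that could occur at that point are contextually optional.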
Start-tag minimization should, therefore, only be used where the elements in
the currently active model group are connected with sequence (SEQ)
connectors. (If the AND or OR connectors are used
the parsing program will not be able to uniquely determine which element's tag
has been omitted.) Similarly, elements whose start-tags may be omitted should
not be optional, i.e. have an OPT (?) or REP
(*) occurrence indicator next to the element type name, as such
indicators make it impossible to identify which element should occur next.
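The difference can be seen by comparing two hypothetical content models. The element names are illustrative only and do not come from any particular DTD:

```sgml
<!ELEMENT entry  - O (term, def)    -- SEQ: term must come first, so its start-tag can be implied -- >
<!ELEMENT term   O O (#PCDATA) >
<!ELEMENT def    - O (#PCDATA) >

<!ELEMENT choice - O (para | list)  -- OR: either may come first, so neither start-tag can be implied -- >
<!ELEMENT para   - O (#PCDATA) >
<!ELEMENT list   - O (item+) >
<!ELEMENT item   - O (#PCDATA) >
```

In the first model any data following the start-tag of the entry can only belong to a term. In the second, the parser has no way of choosing between a paragraph and a list, so both start-tags must always be entered.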
Start-tags cannot be omitted for elements whose content type has been
declared using the RCDATA, CDATA or EMPTY
reserved names. Start-tags should also not be omitted where the first character
in the element with the omitted tag is one of the short reference characters
associated with the element in a short reference (SHORTREF)
declaration that has been associated with the element through a short reference
use (USEMAP) declaration, especially where the short reference
would be associated with a different entity if the start-tag was not present.
When the presence of an omitted start-tag is implied by the parser, the
currently defined default values will automatically be used for any attributes
associated with the element. It is important, therefore, to ensure that the
default values of any attributes associated with elements which may have their
start-tags omitted are checked carefully before start-tag omission is permitted.
If any of the attributes associated with the element is required,
either because its default value has been declared using the #REQUIRED
keyword or because the default value is #CURRENT and the element
has not yet been used, the start-tag cannot be omitted. In such cases the full
tag, including all compulsory attributes, must be added to the text.
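The effect of attribute defaults on start-tag omission can be sketched with two hypothetical elements (the names note and fig, and their attributes, are assumptions made for illustration):

```sgml
<!ELEMENT note  O O (#PCDATA) >
<!ATTLIST note  type  (tip|warning)  tip
                -- default supplied: an implied start-tag gives type the value tip -- >

<!ELEMENT fig   O O (#PCDATA) >
<!ATTLIST fig   id    ID             #REQUIRED
                -- no default available: the start-tag must always be entered in full -- >
```

Although both elements are declared with omissible start-tags, only the first can actually have its start-tag implied, because the parser has no value to supply for the required id attribute of the second.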
The rules governing the omission of end-tags are much less restrictive than those for start-tags. End-tags can be omitted wherever the tag is followed by the end-tag of another currently open element (i.e. one started at a higher level than the current element) or when the tag is followed by an element, or data character, that is not a permitted part of the element's content model. End-tags can also be omitted if their presence can be implied by the end of an SGML document or subdocument.
The following elements will be used to illustrate the effect of tag omission:
<!ELEMENT section - O (title, p+, subsection*) >
<!ELEMENT title   O O (#PCDATA) >
<!ELEMENT p       - O (#PCDATA|%phrases;|q)+ >
If tag omission is not permitted by the current concrete syntax (i.e.
OMITTAG NO has been specified in the SGML declaration), a section
defined using this model would need to be coded as:
<SECTION><TITLE>Section Headings</TITLE>
<P>Section headings should indicate ...
... end of the paragraph.</P>
<P>An alternative use for section headings ...
... at the end of the section.</P></SECTION>
Here each start-tag is matched by an equivalent end-tag, the start of each compulsory element in the model group always being required.
When tag omission is allowed, however, the coding can be simplified to:
<SECTION>Section Headings
<P>Section headings should indicate ...
... end of the paragraph.
<P>An alternative use for section headings ...
... at the end of the section.</SECTION>
In this example the tag minimization option has led to a halving of the number of tags that need to be added to the text. Let us look at how this was achieved.
The most important saving occurred in the section title, where both the
start- and end-tags have been omitted. The start-tag can be omitted because the
presence of this compulsory first embedded subelement could be implied by the
parser from the content model of the section element (<SECTION>).
The parser knows from the content model that, before it can accept any data for
the section, it must receive a start-tag for the <TITLE>
element. As soon as it sees a character other than a start-tag delimiter (<)
it will recognize that the character should be preceded by <TITLE>.
The end-tag for the title can be omitted because the <P>
used to identify the start of the first paragraph in the section is not valid
within the content model of the section title. As the section title can only
consist of text the parsing program will automatically recognize that a </TITLE>
tag should precede the first of the <P> tags.
The two paragraph end-tags (</P>) have been omitted from
the minimized version of the coded text for different reasons. At the end of the
first paragraph the tag can be omitted because the content model for the
paragraph element does not allow other paragraphs to be directly embedded within
a paragraph. As soon as the parsing program sees the second <P>
it knows it can infer the presence of the end-tag of the preceding paragraph.
The second of the paragraph end-tags has been omitted because it is
immediately followed by the end-tag of an element at a higher level in the
document's structure. Providing the OMITTAG option has been
activated, the parsing program will automatically close any currently open
embedded element which has been declared as having omissible end-tags when it
encounters an end-tag for a higher level element. (If an embedded element whose
end-tag cannot be omitted is still open, however, the program will report an
error in the coding.)
It should be noted that the presence of the </SECTION>
tag is not compulsory. If the section shown in the above example was immediately
followed by another section its end-tag could be omitted to give an entry of the
form:
<SECTION>Section Headings
<P>Section headings should indicate ...
... end of the paragraph.
<P>An alternative use for section headings ...
... at the end of the section.
<SECTION>Omitting Tags
<P>When specified in an element's declaration ...
When an SGML parser analyses this part of the document the presence of the
start-tag (<SECTION>) for the second section will cause it
to infer the presence of a </SECTION> tag at the end of the
first section. From the presence of this implied end-tag the program will also
be able to identify the need for a </P> tag to close the
last paragraph of the preceding section.
Shortened versions of tags can be used whenever the FEATURES
clause of the SGML declaration contains the statement SHORTTAG YES.
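There are four ways in which element tags can be shortened:
by omitting the element name and any attributes (empty tags)
by omitting the tag close delimiter (unclosed tags)
by ending an element with a single character (null end-tags)
by shortening attribute specifications.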
| Web SGML Adaptations Extension When the Web SGML adaptations provided by Annex K to ISO 8879 are available the use of each of these features can be controlled individually by use of the short tag form control extensions to the SHORTTAG specification detailed in Chapter 4. |
Empty tags are tags from which the element's name (and any attributes) has
been omitted. An empty start-tag consists of the currently
defined start-tag open and tag close symbols (< and >
respectively in the reference concrete syntax), without an intervening space.
Similarly an empty end-tag consists of the currently defined
end-tag open and tag close delimiters (e.g. </>).
Empty tags can only be associated with the base document type, i.e. they apply to the first document type defined in the prolog. For empty end-tags the generic identifier added by the program is always the name of the last element to be opened in the base document type.
The way in which the program interprets an empty start-tag depends on
whether or not tags can also be omitted from the document. If OMITTAG YES
has been specified in the SGML declaration, the parser will give an empty
start-tag the generic identifier of the most recently started element in the
base document type. Otherwise the generic identifier used will be that of the
most recently ended element in the base document.
Where tag omission is permitted, the first of the above rules allows the program to infer which generic identifier it should use before determining whether or not a tag has been omitted from the markup. By presuming that the last opened element is to be repeated the parser has a value which it can use to check for the omission of end-tags. It can then determine whether the last element used should be closed by the addition of an implied end-tag, or whether the new tag represents a further level of nesting within the document's structure.
When tags cannot be omitted, the last element to be closed is presumed to be the one to be repeated, even if the element is not a repeatable element. This can, unfortunately, lead to errors where the last element used is not repeatable as the program can report an empty start-tag as an error even when the content model unambiguously defines which element must occur next. For this reason it is better not to allow short tag minimization while tag omission is forbidden.
Typically a piece of text coded using empty tags will take the form:
<P>This paragraph contains two lists. The first has four entries:<OL>
<LI>item 1
<>item 2
<>item 3
<>item 4</></> while the second only has two:<UL>
<LI>first item
<>second item.</></>
<>Multiple lists ...
In this example the first three empty start-tags, and the first empty
end-tag, will be given a generic identifier of LI as this was the
last element to be started. Once the last of the items in the first list has
been formally closed by the first empty end-tag, the ordered list (<OL>)
becomes the currently active element. This list is then closed by the next empty
end-tag (the second one of the first pair) before the second part of the
paragraph element can be processed. The next empty start-tag and the first of
the second pair of empty end-tags are assigned LI as their element type name.
The last end-tag will close down the unordered list (UL).
Provided OMITTAG YES has been specified in the SGML
declaration, and tag omission has been allowed for all end-tags, the last empty
start-tag will cause the currently open element, the paragraph element, to be
closed before the element type name is reused as the name of the new element.
The final result of parsing will be a file of the form:
<P>This paragraph contains two lists. The first has four entries:<OL>
<LI>item 1</LI>
<LI>item 2</LI>
<LI>item 3</LI>
<LI>item 4</LI></OL> while the second only has two:<UL>
<LI>first item</LI>
<LI>second item.</LI></UL></P>
<P>Multiple lists ...
Where two or more consecutive tags are required in a document the end
delimiters of all tags but the last one in the sequence can be omitted if
SHORTTAG YES (or UNCLOSED YES when the Web SGML
Adaptation extensions are in use) has been specified in the SGML declaration. No
restriction is placed on whether the next tag in the sequence is a start-tag or
an end-tag: end-tags can be followed by start-tags, and vice versa. The four
permissible combinations are illustrated by the following examples:
<P<EM>
</EM</P>
</TITLE<P>
<ARTWORK sizey=120mm</P>
In the first case a new paragraph is to start with an emphasized phrase. An unclosed start-tag has been used for the first of the tags. The second example shows how the tags could be minimized by using an unclosed end-tag if a paragraph ended with an emphasized phrase. The third example shows a paragraph starting immediately after a title. The final example shows how the tags might be combined when the first tag has attributes associated with it.
Note: The use of unclosed tags is deprecated by the SGML community, but it can be useful in overcoming keying errors in environments that do not use SGML-sensitive editors for data capture.
Null end-tags provide a means of specifying the end of an element in the
base document type with a single character. In the reference concrete syntax the
character defined as the
null end-tag (NET) delimiter is the solidus
(slash), but any code not defined as a name character can be assigned to this
role within the SGML declaration.
Two stages are involved in activating the null end-tag option. The first
step involves creating a net-enabling start-tag by replacing
the tag close (TAGC) delimiter at the end of an element's
start-tag with a NET delimiter. The second step involves replacing
the whole of the element's end-tag with a matching NET delimiter.
| Web SGML Adaptations
Extension When the Web SGML adaptations provided by Annex K to ISO 8879 are available, an optional net-enabling start-tag close (NESTC) delimiter
can be defined in addition to the null end-tag delimiter. When this delimiter has
been defined, the start-tag of an element to be ended with a null end-tag must
end with the NESTC delimiter.
Note: If no In addition the adaptations provide a
Note: The effect of applying these new rules can be seen in the definition
assigned to the Note: XML uses this facility to provide facilities similar to those
provided by the |
To see how null end-tags work in practice consider their use in conjunction
with an <ISBN> element which can be defined as:
<!ELEMENT ISBN - - CDATA --ISBN number-- >
Instead of entering an ISBN number as:
<ISBN>0 201 17535 5</ISBN>
we can use the null end-tag option to enter the element in the shortened form:
<ISBN/0 201 17535 5/
Notice that, by replacing the end delimiter of the start-tag with the
special NET code, we have been able to reduce the end-tag to a
single, matching, character. This feature is particularly useful when the
content of the element has been declared, as in the above case, using the reserved
name CDATA, or RCDATA, where the presence of an
end-tag is compulsory.
Null end-tags can be used for any element declared in the base document that
does not require the character assigned to the null end-tag role in its
contents, or those of any embedded elements. However, because care is needed to
ensure that the relevant element is not prematurely ended by entry of the
character assigned as the NET code, null end-tags are normally
only used for elements that do not contain embedded subelements.
Null end-tags can be nested to any level permitted for elements (typically
the default value of 24 levels). Each null end-tag identified by the program
closes down the last element defined using a net-enabling start-tag (i.e. one
whose tag ends with a NET code). The following example shows how
null end-tags can, with care, be embedded within each other:
<P/Nested <EM/net-enabling start-tags/ are permitted, as this
example shows./
The main point to notice about this example is that the number of net-enabling start-tags exactly matches the number of individual null end-tags in the paragraph.
The following is an example of an illegal use of a null end-tag:
<P/Paragraphs cannot contain either/or choices if started with a net-enabling start-tag./
A program receiving this would terminate the paragraph after the word "either". It would then place the rest of the paragraph in the next higher open element, if permitted to, treating the slash intended to end the paragraph as a normal text character (unless the element concerned had itself been opened using a net-enabling start-tag!).
One way of avoiding this problem is to use a character reference, or an entity reference, to generate the embedded slash. In this case the paragraph could be amended to read:
<P/Paragraphs cannot contain either&#47;or choices if started with a net-enabling start-tag./
or:
<P/Paragraphs cannot contain either&sol;or choices if started with a net-enabling start-tag./
where &sol; has been defined as:
<!ENTITY sol CDATA "/">
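As mentioned in Section 5.4, there are two
ways in which attribute specifications can be shortened when the default SHORTTAG
YES entry in the SGML declaration has been left unchanged:
the delimiters (quotation marks) around an attribute value can be omitted when the value consists solely of name characters
the attribute name, and the value indicator (=), can both be omitted when the value is a member of a name token group declared for the attribute.
For example, a start-tag of the form <ARTWORK width="40mm"
align="center"> could be shortened to read <ARTWORK
width=40mm center>.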
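A sketch of an attribute definition list under which this shortened form would be acceptable is given below. The declaration is an assumption made for illustration, as the DTD entry for the artwork element is not shown in this chapter:

```sgml
<!ELEMENT artwork - O EMPTY >
<!ATTLIST artwork
          width  CDATA                #IMPLIED
          align  (left|center|right)  left >
```

Because center appears in the name token group declared for align, the parser can work out which attribute the bare value belongs to. The value 40mm needs no delimiters because it contains only name characters.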
| Web SGML Adaptations Extension When the Web SGML adaptations provided by Annex K to ISO 8879 are available the use of each of these omission features can be controlled individually by use of the ATTRIB options in the short tag
form control extensions to the SHORTTAG specification detailed
in Chapter 4. The options can also be used to define whether or not attribute
value defaulting is applicable.
Note: As attribute value name tokens do not need to be unique when the Web SGML adaptations are being used, omission of attribute names is only permitted when the name token is only valid for one of the attributes of the element. |
When the current SGML declaration contains the statement RANK YES
in the
FEATURES clause, elements can be declared as ranked
elements. When an element type declaration contains a rank
stem and a rank suffix, the element's start-tag can
be shortened by omission of the rank suffix, provided that the element concerned
has been entered in full (i.e. with a numeric rank suffix) at some preceding
point in the document. A typical declaration might take the form:
<!ELEMENT HEADING 1 - - (#PCDATA) >
Where a single element type declaration has been used for a ranked group of elements, the rank suffix that is added to the end of a minimized ranked element is the last rank number entered for any member of that ranked group. For example, headings and related paragraphs could share the following model:
<!ELEMENT (HEADING|P) 1 - - (#PCDATA) +(%phrases;) >
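The following sketch shows how such a ranked group might be used. It assumes a second, hypothetical declaration for rank 2, and simplifies the content models to #PCDATA:

```sgml
<!ELEMENT (HEADING|P) 1 - - (#PCDATA) >
<!ELEMENT (HEADING|P) 2 - - (#PCDATA) >

<HEADING1>Tag minimization</HEADING1>
<P>This paragraph is treated as P1.</P>
<HEADING2>Tag omission</HEADING2>
<P>This paragraph is treated as P2.</P>
```

Each paragraph tag entered without a rank suffix takes the rank last entered for any member of the group, here the rank of the preceding heading.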
Note: As the rank option has been found to be confusing to users, use of this option will be deprecated in the next edition of ISO 8879. For this reason this optional SGML feature is not covered in any depth in this book. A more in-depth description of this feature can be found in Chapter 5 of SGML - An author's guide to the Standard Generalized Markup Language.
Data tags are sequences of characters which, as well as
forming part of the document, also mark the end of an element.
Whenever the data tag option has been activated by entry of DATATAG YES
in the
FEATURES clause of the SGML declaration, the program will check
the content of those elements whose declarations contain data tag definitions to
see if it can identify the end of the element from the presence of a specified
string of characters.
Data tags are declared within the content model of an element in
the base document's DTD by replacing the name of one or more of the embedded
elements with a data tag group. Each data tag group is
enclosed within a special pair of delimiters known as the data tag
group open (DTGO) and
data tag group close (DTGC) delimiters. In the
reference concrete syntax these delimiters are the open and close square
brackets. The declaration within these delimiters has two main parts: the
generic identifier of the element concerned and the data
tag pattern to be checked for whenever the specified element has been
opened. The data tag pattern can also consist of two parts. Each pattern must
start with a data tag template, or a data tag
template group, which defines the character sequence(s) the program is
looking for. This can optionally be followed by a data tag padding
template that identifies one or more characters which should be skipped
if they occur immediately after a data tag. Each part of the data tag group is
separated from the others by the currently defined sequence indicator (SEQ,
a comma in the reference concrete syntax).
To see how data tags work consider the following set of declarations:
<!ELEMENT mission  - - (delegate+) >
<!ELEMENT delegate O O ([name, ", ", " "], for) >
<!ELEMENT name     O O (#PCDATA) >
<!ELEMENT for      O O (#PCDATA) >
<!ENTITY entry STARTTAG "delegate" >
<!SHORTREF map1 "&#RS;" entry >
<!USEMAP map1 mission >
A data tag template, consisting of a comma followed by a space, has been
associated with the element called <NAME> that forms the
first element within the <DELEGATE> element's content model.
Optionally this template can be followed by more spaces, which the program
should treat as part of the data tag. The short reference map associated with
the <MISSION> element shows that a record start code (RS)
will be recognized as the start of a new <DELEGATE> entry.
To see how these declarations affect the coding of a document consider the following list of members on a mission:
<MISSION>
James D. Mason, ANSI
Charles F. Goldfarb, ANSI
James Clark, BSI
Martin Bryan, BSI
Yushi Komachi, JIS</MISSION>
As soon as the program encounters the start-tag for the <MISSION>
element it will invoke the short reference map (map1) that will
identify the start of each line as the start of a <DELEGATE>
entry.
Each <DELEGATE> element starts with the <NAME>
of a delegate, the end of the name being indicated by a comma and at least one
space. When the program sees this data tag sequence it will automatically infer
the presence of a </NAME> end-tag. As the comma, and any
immediately following spaces, are not part of the delegate's name, or part of
the
<FOR> element, the program automatically treats the data tag
as an implied #PCDATA element, i.e. the <DELEGATE>
element is considered to be defined as:
<!ELEMENT delegate (name, #PCDATA, for) >
where #PCDATA can only consist of a comma followed by one or
more spaces.
Because data tags act as real end-tags, rather than omitted end-tags, once
the program has identified the end-tag for the <NAME>
element it will also be able to infer the presence of the <FOR> start-tag immediately
after the data tag template because the declaration for the <DELEGATE>
element tells it that a <FOR> element must follow.
The combination of the data tag and the short reference for the <MISSION>
element means that the program would treat the uncoded text as if it had been
coded:
<MISSION>
<DELEGATE><NAME>James D. Mason</NAME>, <FOR>ANSI</FOR>
<DELEGATE><NAME>Charles F. Goldfarb</NAME>, <FOR>ANSI</FOR>
<DELEGATE><NAME>James Clark</NAME>, <FOR>BSI</FOR>
<DELEGATE><NAME>Martin Bryan</NAME>, <FOR>BSI</FOR>
<DELEGATE><NAME>Yushi Komachi</NAME>, <FOR>JIS</FOR></MISSION>
It is important to realize the difference between the role of the data tag
and the short reference in the above example. While the short reference replaces
the record start character it is linked to, the characters defined in the data
tag template are retained as a special piece of (implied) parsed character
data. It should also be realized that the data tag is looked for only while
the <NAME> element remains the currently open element, while
the short reference applies to any embedded elements as well (unless they invoke
their own short reference map).
When declaring data tag templates it is important to ensure that the length
of the data tag template, or data tag padding template, does not exceed that
declared as the DTEMPLEN quantity in the SGML declaration.
Similarly the data tags entered in the text, including any padding characters,
must not exceed the DTAGLEN quantity. In the reference concrete
syntax both of these quantities are set to 16 characters.
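Where longer data tags are needed, the two limits can be raised in the QUANTITY section of the SYNTAX clause of the SGML declaration. The fragment below is a sketch only, and the value of 32 is an arbitrary assumption rather than a recommended setting:

```sgml
QUANTITY SGMLREF
         DTAGLEN  32
         DTEMPLEN 32
```

Quantities that are not listed keep the values assigned to them in the reference quantity set (SGMLREF).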
Data tag templates cannot include numeric character references to
non-SGML characters or SGML function characters, though these are permitted in
entity strings. This prohibits the use of the &#9; sequence to
identify a Tab code in a data tag, though it would be a valid part of the
replacement text of an entity. If, for example, the Tab code had been used
within the above example to position the second column, the declaration for the
<DELEGATE> element would need to be altered to:
<!ELEMENT delegate - O ([name, ",&#TAB;", "&#TAB;"], for) >
It should, however, be noted that if this declaration is to be used all
entries must be keyed without any spaces between the comma and the Tab code. If
there is a likelihood that the typist may key a space after the comma the range
of permitted data tags should be extended by defining all the valid templates in
a data tag template group. As with other groups within the content model, the
data tag template group is enclosed by group open and group close symbols (left
and right brackets in the reference concrete syntax). In the case of data tag
template groups the entries within the group must be separated by
OR connectors. This gives the entry the form:
<!ELEMENT delegate - O ([name, (", "|",&#TAB;"|", &#TAB;"),
"&#TAB;"], for) >
Unfortunately, groups are not permitted for the data tag padding template, so if Tab codes are used to move to the start of the next column any following spaces will not be recognized as a part of the template, and so will not be removed from the data stream.
Bryan, M.T. (1988) SGML - An author's guide to the Standard Generalized Markup Language Wokingham: Addison-Wesley (ISBN 0 201 17535 5).