© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
This chapter explains how SGML document type declarations can be used to define which elements can occur in an SGML document instance, and the attributes that can be used to control the processing of each type of element.
The role of an SGML element depends on the context it is found in and the form of markup used to enter it. The following roles can be identified:
The first element specified in any SGML document is the base document element. This element must be
formally
declared, within the document type declaration, by entry of an
element type declaration whose name is the same as that
of the document type declaration. The name of the
base document element identifies the class of document to be created (e.g.
<ACT>).
Normally the last tag entered within an SGML document will be an end-tag
whose name matches that used for the first tag (e.g. </ACT>).
This tag ensures that the end of the document is correctly identified by the
SGML document parser.
Further elements can be embedded between
the two tags identifying the limits of the base document element, up to the
level specified by the current tag level quantity
(TAGLVL). (In the reference concrete syntax this value is 24, i.e.
up to 24 levels of embedded elements can be used within each base document
element.)
Where necessary, elements can be qualified by attributes. An attribute is a named value that can be used during processing to control such things as the presentation or content type of the element it is associated with. In the reference concrete syntax up to 40 different attributes can be associated with any element, provided that the total length of the attribute names and values does not exceed 960 characters.
Elements can optionally be ranked so that
specific groups of elements are used at the same ranked element level. Ranked
element level is specified by adding a number to the end of the element's
generic identifier. When the RANK
tag minimization option is being used, the current rank level can be implied
from that of preceding elements.
Minimized elements are elements
whose presence, name or attribute names can be implied by an SGML parser when
the appropriate minimization options have been enabled in the FEATURES
clause of the SGML declaration. Minimization techniques provided within SGML
include:
In addition, data tags can be used to identify references to end-tags that have been entered using pre-defined character strings.
Element type declarations must be entered as part of a
document type declaration subset. The
reserved name
ELEMENT (or its previously declared replacement) appears
immediately after the markup declaration open (MDO) code that
identifies the start of the markup declaration.
In its shortest form an element type declaration takes the form:
<!ELEMENT name model>
where name is the element type name (generic
identifier) that uniquely identifies the type of element and model
is either a formal declaration of the type of data that may be entered within
the element, or a content model showing
which subelements can be embedded within the element.
Where elements share a content model a bracketed name group can replace the element type name. A
name group consists of a set of connected element type names bracketed by
group open (GRPO) and group close
(GRPC) delimiters. The element type names are normally connected
by an OR connector (| in the
reference concrete syntax) to give an entry of the form:
<!ELEMENT (name-1|name-2|...|name-n) model>
The length of an element type name must not exceed the current value
of the NAMELEN quantity. The
first character must be alphabetic, or one of the additional
name start characters defined in the syntax
clause. Subsequent characters may be alphanumeric characters or one of the
name characters declared in the syntax clause.
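As a sketch of these naming rules, assume the reference concrete syntax, in which NAMELEN is 8 and the only additional name characters are the full stop and the hyphen. The element names below are hypothetical and chosen purely for illustration. The two hyphens before each model assume that OMITTAG YES is in force.

```sgml
<!-- Valid names: start with a letter, at most 8 characters -->
<!ELEMENT para     - - (#PCDATA) >
<!ELEMENT sect.1   - - (#PCDATA) >
<!ELEMENT fig-ref  - - (#PCDATA) >

<!-- Invalid names (listed inside a comment so the fragment still parses):
     1stpara    starts with a digit
     paragraph  nine characters, exceeding NAMELEN 8
     fig_ref    the underscore is not a name character in the
                reference concrete syntax
-->
```

A DTD that needs longer names or extra name characters would have to say so in the syntax clause of its SGML declaration.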
When the OMITTAG entry in the FEATURES
clause of the current SGML declaration reads OMITTAG YES two extra
characters must be entered between the name and content of the element
declaration to define the type of omitted tag minimization to
be applied to the element. These extra characters define whether or not the
start-tag and/or end-tag can be omitted if its presence can be unambiguously
implied from the model of the element it is embedded within. If the first
character is O (the letter O, not the number zero) the start-tag
for the element can be omitted at appropriate points. If it is -
(hyphen) it can never be omitted. If the second character is O the
end-tag can be omitted: otherwise it is
- to show that the end-tag must never be omitted. The two
characters must be separated from each other, and the adjacent element type name
and content model, by at least one space or another valid separator code (e.g.
TAB). For example, an element whose end-tag may be omitted might be declared as:
<!ELEMENT artwork - O EMPTY >
This element type declaration defines an empty
element,
<ARTWORK>, which has no embedded content. The element is
simply a tag that marks the point at which an illustration is to be added to the
document. The <artwork> start-tag cannot be omitted but, as
the element contains no text, the end-tag must be omitted as it serves no
purpose.
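The effect of omitted tag minimization on a document instance can be sketched as follows. The note element and its model are hypothetical additions made for this illustration. The fragment assumes the reference concrete syntax with OMITTAG YES.

```sgml
<!DOCTYPE note [
<!ELEMENT note    - - (para|artwork)+ >
<!ELEMENT para    - O (#PCDATA) >
<!ELEMENT artwork - O EMPTY >
]>
<note>
<para>The end of this paragraph is implied by the next start-tag.
<artwork>
<para>The end of this paragraph is implied by the end of the note.
</note>
```

Because a paragraph declared in this way cannot contain artwork or another paragraph, the parser can close each paragraph as soon as it meets a tag that is not allowed inside it.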
When an element can contain embedded subelements the declaration's content
model must be defined as a model group. Like name
groups, model groups consist of one or more connected element type names (called
element tokens in this context)
bracketed by group open (GRPO) and group close (GRPC)
delimiter sequences, e.g.:
<!ELEMENT book - O (prelims, body, annexes) >
In this case a book is said to be made up of three nested subelements,
<PRELIMS>, <BODY> and <ANNEXES>.
For model groups, unlike name groups, the type of connector used is significant. The three types of connector used in model groups to define the logical sequence in which elements are to appear are shown in Figure 5.1.
| Default character | Delimiter name | Meaning |
|---|---|---|
| , | SEQ | All must occur, in the order specified |
| & | AND | All must occur, in any order |
| \| | OR | One (and only one) must occur |
Figure 5.1 SGML connectors
The sequence connector (a comma in the reference concrete syntax) connects element types which must occur in a predefined sequence. In the above example, therefore, the prelims must precede the body of the text, which must precede any annexes.
Where the sequence in which the elements are used is not fixed, subelement
names should be connected with an AND connector
(&). For example, the fields at the head of a memo could be
defined using a model of the form:
<!ELEMENT heading O O (from & to & date) >
If only one element could be applicable at a given point, the
relevant element type names can be connected by an OR
connector (|). For example, the following element
could occur in the prelims of a book:
<!ELEMENT by O O (author|editor) >
Note: The OR connector used in SGML is an exclusive OR rather than an inclusive OR.
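The AND connector can be illustrated with a short document instance. This is a minimal sketch: the declarations for from, to and date, and the sample data, are assumptions made for the example, and every tag is entered in full to keep the structure visible.

```sgml
<!DOCTYPE heading [
<!ELEMENT heading        - - (from & to & date) >
<!ELEMENT (from|to|date) - - (#PCDATA) >
]>
<heading>
<date>1 May 1997</date>
<to>B. Reader</to>
<from>A. Writer</from>
</heading>
```

The three fields appear in a different order from that of the model group, which is acceptable for an AND group. Had the connectors been commas, only the order from, to, date would have been valid.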
The use of each embedded subelement can be further qualified by the addition of an occurrence indicator immediately after the element type name, or immediately after a group close delimiter linking a number of element type names. The three types of occurrence indicator defined in SGML are shown in Figure 5.2.
| Default character | Delimiter name | Meaning |
|---|---|---|
| + | PLUS | Repeatable element(s) that must occur at least once |
| * | REP | Optional element(s) that may be repeated |
| ? | OPT | Optional element(s) that can occur at most once |
Figure 5.2 SGML occurrence indicators
For example, to make the use of annexes in a book optional you would extend the definition given above to read:
<!ELEMENT book - O (prelims, body, annexes?) >
An alternative, and somewhat better, approach would be to use an optional
and repeatable <annex> element:
<!ELEMENT book - O (prelims, body, annex*, index?) >
To allow more than one author or editor to be defined in the prelims you
could extend the definition of the <BY> element shown above
to read:
<!ELEMENT by O O (author+|editor+) >
Occurrence indicators have a higher precedence than connectors. For example,
a model group such as (author|editor)+ differs from one defined as
(author+|editor+) because the first model permits any sequence of
author and editor details to be entered, whereas the second model only permits a
set of author details or a set of editor details to be entered within
a <BY> element.
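The difference between the two models can be seen by testing sample content against each of them. The sketch below assumes that author and editor simply contain text. The names used as data are invented.

```sgml
<!-- Valid for (author|editor)+ but not for (author+|editor+):
     authors and editors are mixed in one sequence -->
<by><author>A. Writer</author><editor>B. Reader</editor><author>C. Scribe</author></by>

<!-- Valid for both models: only one kind of element is repeated -->
<by><author>A. Writer</author><author>C. Scribe</author></by>
```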
Model groups can be nested within each other up to the level indicated by
SGML's GRPLVL quantity value.
(The reference concrete syntax allows up to 16 levels of nested model groups.)
Each nested model group can, if necessary, have its elements linked by a
different connector. Each name in the group, each nested group, and the whole
content model, can be qualified by an occurrence indicator. An example of a
nested set of model groups is the element used for a Text
Encoding Initiative (TEI) title statement, which is defined as:
<!ELEMENT titleStmt - O (title+, (author|editor|sponsor|
funder|principal|respStmt)*) >
When parameter entities (see Chapter 6) are used to define the contents of model groups, a word of warning is required: you cannot associate an occurrence indicator with a parameter entity. If the elements whose names are listed in the replacement string of the parameter entity are to be qualified by an occurrence indicator either:
Note: When using parameter entities to define part, or all, of a model group it is important to remember that the associated entity declaration must precede the entity reference. The safest way to ensure this is to place all parameter entity declarations at the start of the document type definition.
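Two ways of combining parameter entities with occurrence indicators are sketched below. The entity names and the credits element are hypothetical. The sketch shows approaches that are valid SGML rather than wording taken from this chapter.

```sgml
<!-- Declared first, so that the references below can be resolved -->
<!ENTITY % people "author+|editor+" >
<!ENTITY % names  "author|editor" >

<!-- Option 1: the occurrence indicators are part of the replacement text -->
<!ELEMENT by      O O (%people;) >

<!-- Option 2: the reference sits in its own group, and the occurrence
     indicator qualifies that group -->
<!ELEMENT credits - O (%names;)+ >
```

After the entity references have been resolved the first model reads (author+|editor+) and the second reads (author|editor)+.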
A special form of primitive
content token is used in model groups to indicate points at which
the element can contain text. This token consists of a
reserved name indicator (RNI,
# in the reference concrete syntax) followed by the reserved name
PCDATA, which stands for parsed character data.
#PCDATA indicates that, at that point in the model, the element
can contain text which has been checked by the SGML parser to ensure that any
embedded tags or entity references have been identified.
When the #PCDATA token is present in a model group the element's
content is referred to as mixed content.
If text is not permitted the model group is defined as having only element content. Different rules for
processing record boundaries apply to mixed
content. A typical example of an element defined using mixed content is:
<!ELEMENT para - O (#PCDATA|emphasis)+ >
A special feature of the #PCDATA keyword is that it is
automatically presumed to have a repeatable (REP) occurrence
indicator. All characters occurring between successive markup tags are
considered to satisfy a single
#PCDATA token (including any entered as character data in a
marked section).
It is recommended that #PCDATA is only used when data
characters are permitted anywhere in the content of an element, i.e. where
#PCDATA is the only token in the model group or where it is a
member of a repeatable model group whose members are connected using an OR
connector.
Note: This recommendation is made to avoid potential problems relating to the processing of record boundaries within mixed content.
Where nested subelements cannot occur within an element, its contents can be declared to consist of one of the following types of declared content:
- replaceable character data (RCDATA), which can contain text, character references and/or general entity references that resolve to character data
- character data (CDATA), which contains only valid SGML characters
- empty (EMPTY), i.e. having no contents, or contents that can be generated by the program.

A variant of the basic content model allows the reserved name ANY
to replace an element type declaration's model group. This tells the program
that text or any element defined within the same document type declaration can
be used as an embedded element.
For elements with declared content, or using the ANY reserved
name, the keyword replaces the model group, including its brackets.
Because, unlike #PCDATA, these reserved names cannot occur within
a model group, they do not need to be preceded by the reserved name indicator.
A typical example of the use of declared content is shown in the following element type declaration:
<!ELEMENT ISBN - - CDATA >
It should be noted that, in the case of elements defined using the
replaceable character data and character data options, the program will ignore
any requests to start a new element until such time as it encounters a valid
end-tag open (ETAGO) delimiter, i.e. </ followed
by any valid name start character. For this reason all elements declared using
the CDATA or RCDATA declared content keywords should
have compulsory end-tags and should not contain the ETAGO
character sequence within their content.
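The practical difference between the two keywords can be sketched with a pair of hypothetical elements, example and sample, that hold the same characters. The manual element and the declaration of the amp entity are assumptions made so that the fragment is complete.

```sgml
<!DOCTYPE manual [
<!ELEMENT manual  - - (example|sample)+ >
<!ELEMENT example - - CDATA >
<!ELEMENT sample  - - RCDATA >
<!ENTITY  amp     CDATA "&#38;" >
]>
<manual>
<!-- CDATA: neither the start-tag nor the entity reference is recognized,
     so the content is exactly: Type <para> &amp; press return -->
<example>Type <para> &amp; press return</example>
<!-- RCDATA: the start-tag is still treated as data, but the entity
     reference is resolved, giving: Type <para> & press return -->
<sample>Type <para> &amp; press return</sample>
</manual>
```

In both cases the element is ended only by its own compulsory end-tag.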
| Web SGML Adaptations Extension When the Web SGML adaptations provided by Annex K to ISO 8879 are available the first character of any potentially clashing ETAGO sequence can be
replaced by the predefined character data
entity used to escape markup delimiter opening sequences, e.g. &lt;
|
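Model groups can be qualified by the addition of lists of exceptions. There are two types of exceptions: inclusions, which name elements that may additionally occur anywhere within the element, and exclusions, which name elements that may not occur within it.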
Exceptions are specified by entry of name groups immediately after the model
group defining the permitted contents of the element. The group open (GRPO)
delimiter at the start of each such name group must be preceded by a plus sign
if the names identify inclusions, or a hyphen (minus sign) if they represent
exclusions. Both sets may be present at the same time, provided that:
Exclusions that prohibit the use of embedded elements required as part of a model group are not permitted.
Inclusions are typically associated with elements whose content model is
#PCDATA, or are used at the start of a document to allow commonly
occurring floating elements, such as footnotes, figures and tables, to occur
anywhere in the text. For example the definition given for a book above could be
extended to read:
<!ELEMENT book - O (prelims, body, annexes?) +(footnote|figure|table) >
This model would allow footnotes, figures and tables to occur anywhere within a book.
It should be noted that inclusions are inherited by all the elements declared in the model, and any of their children. Where inclusions have been declared at a high level in the data structure, it will often become necessary to define exclusions at lower levels in the data model. For example, to prevent footnotes from containing embedded footnotes, or figures, the following declaration could be used for footnotes:
<!ELEMENT footnote - O (#PCDATA) -(footnote|figure) >
Note that this model would not prevent tables, or any other element that had been declared as an inclusion in a parent element of the footnote, from being entered within footnotes.
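Both kinds of exception can appear on one declaration, in which case ISO 8879 requires the exclusion group to be written before the inclusion group. The following sketch uses hypothetical annex, title and sidenote elements that are not part of the book example above.

```sgml
<!ELEMENT annex - O (title, para+) -(figure) +(sidenote) >
```

If an annex declared like this were used inside a book that includes figures, the exclusion would stop figures being entered anywhere within the annex, while the inclusion would allow sidenotes to be entered at any point in it.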
Comments may be entered at most points where spaces are permitted within an element type declaration, except within bracketed name or model groups. They must be preceded and followed by comment delimiters (a pair of hyphens in the reference concrete syntax). Comments can run over more than one line if necessary, e.g.:
<!ELEMENT position - O (#PCDATA|line+) -- one or more lines of text
describing position held -- >
It is recommended that comments within element type declarations are placed after the model group.
An SGML content model cannot be ambiguous. Every element or character found in the document instance must be able to satisfy only one content token without looking ahead in the document instance. For example, an element whose content model is:
<!ELEMENT contact - O ((name, address?), company?, address)>
is ambiguous because it cannot be determined whether an <ADDRESS>
element entered after a name satisfies the optional address in
the nested subset or the compulsory one in the outermost group until you know
what markup tag follows the address. In most cases such content models can be
easily avoided by introducing another level of container, e.g.
<!ELEMENT contact - O (person, company?, address)>
<!ELEMENT person O - (name, address?) >
By making the end-tag of the container element (e.g. </PERSON>)
compulsory you can ensure that the position of the two <ADDRESS>
elements can always be distinguished.
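A document instance for the revised model might look like the sketch below. It assumes that name and address simply contain text. The tags are entered in full, and the data is invented.

```sgml
<contact>
<person><name>A. Writer</name><address>Home address</address></person>
<address>Office address</address>
</contact>
```

Any address entered before the person end-tag can only be the optional one belonging to the person. Any address entered after it can only be the compulsory one belonging to the contact, so no look-ahead is needed.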
The base document element, whose name matches that of the document type declaration, provides the starting point for the analysis of any set of element type declarations. The rules that should be applied when analysing a DTD are:
#PCDATA,
CDATA, RCDATA, EMPTY or ANY).
To see the effect of these rules we will use them to create a tree diagram for the following simplified DTD for a memorandum:
<!DOCTYPE memo [
<!ELEMENT memo O O (heading, body, signature?) >
<!ELEMENT heading O O (from & to & copied-to? & date) >
<!ELEMENT (from|to|copied-to) - O (name, position?)+ >
<!ELEMENT name O O (#PCDATA) >
<!ELEMENT (position|date) - O (#PCDATA) >
<!ELEMENT body O O (para+) +(artwork) >
<!ELEMENT para - O (#PCDATA|emphasis)+ >
<!ELEMENT emphasis - - (#PCDATA) -(artwork) >
<!ELEMENT artwork - O EMPTY >
<!ELEMENT signature O O (salutation?, (name, position?)+)>
<!ELEMENT salutation - O (#PCDATA) >
]>
The first thing that needs to be done is to identify the element type
declaration for the element whose name matches that of the document type
declaration, i.e.
memo in this case. The tag omission rules associated with this
root element type declaration tell us that both the start-tag and the end-tag
can be omitted as their presence can be determined by the SGML parser from the
presence of embedded elements. The model group for the memo
element shows us that it contains only element
content, and that three elements, heading, body
and, optionally, signature must occur in a fixed sequence.
Following the third of the rules listed above we find that the model for the
first element in the initial model group, heading, shows that the
start-tag and end-tag for this element can also be omitted. Again the model
group consists solely of element content, but this time the four elements are
connected by an AND connector to indicate
that the order in which the elements are entered is not important (they have a
fixed position on preprinted paper). Also one of the elements, the copied-to
element, is optional.
When we look for the model for the first of these elements we find that it
shares a declaration with two of the other components of the heading. The tag
omission rules for this declaration tell us that the start-tag for these
elements must be present, but that the end-tag is omissible. Each of these
elements must contain an embedded name element, optionally
followed by details of the position held by the named person. The
PLUS occurrence indicator associated with
the whole model group shows that multiple names, with or without positions, may
be entered for each component of the heading if required.
The model for the name element shows that both the
start-tag and the end-tag can be omitted where their presence can be determined
from the preceding and following elements. The content model consists of the
special #PCDATA keyword showing that this is a terminal node that
may contain parsed character data. At this point the fifth of our rules is
invoked, so we need to return to the model group for the parent element, in this
case the model shared by from, to and copied-to.
The second element in the model group shared by this set of elements is the
optional position element.
The only difference between the declaration for the position
element, which is shared with the date element, and that for the
name element is that the start-tag is not omissible. This is
because the position element is always optional while the position
of the date element cannot be determined by the parser as it is
part of an AND group. The presence of an optional component of a
model, or an element within an AND group, must always be indicated
by a start-tag.
As the model group for the position and date
elements consists solely of the #PCDATA content token we must
return to the model of position's parent element(s), from,
to and copied-to. As we have already seen the model
for all the elements listed in this model group we must immediately return to
their parent element, heading. As all the elements in the model
group for a heading have now been examined, the
fifth rule requires us to return to its parent element,
memo, and look at the second element listed in its model group,
body.
The start-tag and end-tag for the body container can be
omitted as the first paragraph in the memo will indicate the start of the
body, and the presence of a signature will indicate the end of the
body. The model group for the element shows that the body
must contain one or more paragraphs (para). In this case the
occurrence indicator has been placed adjacent to the element type name, rather
than being applied to the whole of the model group. In addition the
inclusion added after the model group shows that
artwork can be interspersed between paragraphs, or placed within
any embedded text or subelements.
The model for the para element is an example of the use of
mixed content. In this case the parsed character data (#PCDATA)
can be repeatedly mixed with emphasis elements. It must be
remembered, however, that this element inherits the inclusion specified for its
parent element, body, so artwork can also be embedded within
paragraphs. The tag omission rules show that the start-tag of each paragraph must
be present in the document instance.
The model for the emphasis element also indicates that it
should contain parsed character data (#PCDATA), but in this case
the model is qualified by the presence of an exclusion
that prohibits the inheritance of the artwork inclusion from the
model of the body element. Both the start-tag and the end-tag must
be present to indicate the full scope of the emphasized text.
The model for the artwork element shows that this is an
empty element that consists simply of a start-tag, with no
end-tag. (The role of this element will be examined further shortly.)
Now that all the elements in the model group for body have
been identified we must return to the model group of its parent, memo,
and look at the next component of its model group, signature.
Again both the start-tag and end-tag can be omitted from this element, as their
presence can be determined from the presence of its subelements or its parent's
end-tag. The model group here is slightly more complex, consisting of the name
of an optional element, salutation, followed by a repeatable
model group which uses the same subelements as the heading elements. There must
be at least one name element within each signature.
The model for the salutation element shows that this consists
simply of parsed character data. As this element is optional its start-tag is
always required, though its end-tag could be omitted.
Figure 5.3 shows how a graph could be drawn to represent the model of a memo.

Figure 5.3 Graph showing structure of memo DTD
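A document instance that conforms to the memo DTD is sketched below. It is intended to follow the document type declaration shown earlier. To keep the structure visible every tag is entered in full, even where the DTD would allow it to be omitted. The names, date and wording are invented for the illustration.

```sgml
<memo>
<heading>
<from><name>A. Writer</name><position>Editor</position></from>
<to><name>B. Reader</name></to>
<date>1 May 1997</date>
</heading>
<body>
<para>Please review the <emphasis>draft</emphasis> chapter.</para>
<artwork>
<para>The schedule is shown in the illustration above.</para>
</body>
<signature><salutation>Regards</salutation><name>A. Writer</name></signature>
</memo>
```

The artwork element appears between the two paragraphs by virtue of the inclusion on body, and has no end-tag because it is declared EMPTY.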
Within a document instance the contents of elements are indicated by the use
of
start-tags and end-tags. A start-tag
consists of the element's name between the currently declared start-tag
open (STAGO) and tag close (TAGC)
delimiters (< and > respectively in the
reference concrete syntax). Optionally the tag close code can be replaced by a
null end-tag (NET) delimiter (e.g.
/) so that a matching null end-tag can be used in place of the
normal end-tag. Where appropriate, the element type name can be qualified by the
entry of one or more of the attributes declared for the element.
An end-tag consists of the element's name between the currently declared
end-tag open (ETAGO) and tag close
(TAGC) delimiters (e.g. </ and >).
Where a null end-tag has been used to close the element's start-tag, the whole
of the end-tag must be replaced by a single null end-tag code (/).
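The two forms of tagging can be compared in the following sketch. It assumes the reference concrete syntax with SHORTTAG YES, and borrows the para and emphasis elements from the memo example.

```sgml
<!-- Normal form -->
<para>This point is <emphasis>urgent</emphasis>.</para>

<!-- Null end-tag form: the first / closes the start-tag and enables the
     null end-tag, the second / ends the element -->
<para>This point is <emphasis/urgent/.</para>
```

The short form is only safe when the content of the element contains no / character, since the first one encountered would be taken as the end of the element.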
| Web SGML Adaptations Extension When the Web SGML adaptations provided by Annex K to ISO 8879 are available the delimiter used at the end of the start-tag to identify that the element is to end with a null end-tag can be specified using the optional null end-tag start code (NETSC) delimiter. When a NETSC delimiter has
been specified in the
DELIMS section of the SYNTAX clause of the SGML
declaration it should be different from that declared for the null end-tag by
the specification for the NET delimiter. If no definition for the
NETSC delimiter is provided the value for the NET
delimiter is used, which by default is /.
Note: These new options allow code sequences such as |
Not all tags need to be present in a document. Provided that the OMITTAG
feature has been enabled in the SGML declaration, tags can be omitted when their
presence can be implied without ambiguity.
Element type names can normally be entered in either uppercase or lowercase.
(The
NAMECASE section of the SGML declaration defaults to GENERAL
YES. Where this entry is altered, DTD developers should take special
care to warn users of the need to enter tags in the appropriate case.)
The main problem that can occur when using elements is that, unlike the style sheets used in uncontrolled word processors, SGML markup tags cannot be entered at a level for which they have not been declared. SGML-based text editors will be able to prevent users from entering invalid tags, but if documents are prepared without the guidance of an SGML parser, errors can occur.
A more detailed example of the use of SGML elements can be found in the description of the HTML DTD in Chapter 13.
An attribute is a named parameter (value) used to qualify an element's start-tag. Attributes are typically used to:
<BOOK status=draft>
<FIGURE
id="piechart1"><REFER to="piechart1"><INPUT value="100"><TEXTAREA
rows=6 cols=70> and <FIGURE source="Reuters"><IMG src="new.gif"
align=bottom>.
There are two parts to an
attribute specification: an attribute
name and an attribute value. These two parts are
joined by a
value indicator (VI, = in the
reference concrete syntax) to give an attribute specification of the form:
<element-name attribute-name=attribute-value ... >
Attribute values can be entered as attribute value literals.
A literal is a string of characters recognized as a single unit by the system
because the characters have been entered between a matched pair of literal
delimiters. Two alternative sets of literal delimiters are provided
in SGML. They are referred to within the standard as LIT (literal)
and
LITA (alternative form of literal). In the reference concrete
syntax these are represented by the quotation mark (") and
apostrophe (') respectively.
Note: The choice of which set of literal delimiters should be used is a matter of user convenience. The only restriction is that the character chosen cannot appear in the entered attribute value.
Only one type of literal delimiter can be used to delimit a particular attribute value, but the two types can be used interchangeably within the same tag, e.g.:
<A href='http://www.u-net.com/~sgml/piechart1.gif'
title="Martin's Work Breakdown">
As well as showing how changing the type of literal delimiter can allow you to use a literal delimiter within a string, the above example also exhibits the fact that line breaks can occur between attribute specifications in place of the normal space. This is particularly useful when the start-tag would otherwise be too long to fit on a line, as is the case in the example.
Where the FEATURES clause of the SGML
declaration contains the statement SHORTTAG YES (as it does by
default), the literal delimiters can be omitted if the only characters used in
the value are ones currently declared as name
characters. For example, a start-tag of the form <INPUT name="field1"
size="60"> could also be entered as <INPUT
name=field1 size=60>.
When the entered attribute value has been declared as a member of a set of
valid attribute values for the element, the attribute name, with the associated
value indicator, can also be omitted when SHORTTAG YES has been
specified. For example, an entry such as <H1 align="center">
can be shortened to give an attribute specification of the form <H1
center>. In this case the attribute value must not be entered
within literal delimiters because, if it is, the program will be unable to
identify the attribute referred to.
Attribute values can also consist of delimited lists of values,
each part of which is separated from the others by a space, or another valid
separator character (e.g. RE,
RS or TAB).
Each attribute can be given a default value when it is declared. If either
OMITTAG YES or SHORTTAG YES has been specified in
the FEATURES clause, this default value will be used if a specific
attribute value is not entered in the start-tag.
| Web SGML Adaptations Extension When the Web SGML adaptations provided by Annex K to ISO 8879 are available the use of default values can be controlled via the attribute options in the short tag omission rules section of the FEATURES clause. If this entry
starts ATTRIB DEFAULT NO default values should not be specified in
the DTD. If default values are to be applied to attributes which have not been
assigned a value in the document instance then ATTRIB DEFAULT YES
must be specified in the SGML declaration.
Note: While an error need not be reported if a default attribute value
is specified in the DTD when |
Where a default value cannot be specified a special reserved name must be
entered. For example, to tell the program that it should use internal rules for
determining what value to assign to the attribute the reserved name
#IMPLIED is used. The reserved name #CURRENT can be
used to tell the program to repeat the last value entered for that attribute on
any element that shares the attribute list declaration in which it was declared.
Where an attribute value must be entered whenever the element is requested
the reserved name
#REQUIRED can be used as the default value.
| Web SGML Adaptations Extension When the Web SGML adaptations provided by Annex K to ISO 8879 are available selection of the permitted attribute omission rules is controlled through the entries following the ATTRIB keyword in the
SGML declaration. |
Attributes are declared in attribute definition list declarations.
Each attribute definition list is a separate markup declaration, delimited by
the currently defined markup declaration delimiters (e.g. <!
and
>). Attribute definition list declarations start with the
reserved name ATTLIST (or its previously declared replacement)
which is followed, after one or more separators (space, etc.), by details of the
element(s) the list is to be associated with. Once this associated
element type specification has been entered one or more attribute
definitions can be entered before the closing delimiter, to give the
attribute definition list the general form:
<!ATTLIST elements attribute-definition-1
...
attribute-definition-n >
Where more than one element needs to be associated with a given list of attributes, the names of the associated element types are entered as a bracketed name group, individual names being separated by one of the SGML connectors to give an entry of the form:
<!ATTLIST (element1|...|elementn) attribute-definition-1
...
attribute-definition-n >
Each attribute definition consists of an attribute name, a
declared value and a default
value. They are separated from each other by a parameter
separator which is a
separator character (e.g. space,
RE, RS, or TAB), a system specific
entity end code, a comment delimited
by pairs of hyphens, or a
parameter entity reference for an entity
whose replacement text starts with a parameter separator.
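A complete attribute definition list might look like the following sketch. The choice of a figure element and of these three attributes is an assumption made for illustration. Each line holds one attribute definition made up of a name, a declared value and a default value, with comments used as parameter separators.

```sgml
<!ATTLIST figure
    id      ID                   #IMPLIED   -- unique identifier, optional --
    source  CDATA                #REQUIRED  -- must always be entered --
    align   (top|middle|bottom)  bottom     -- default value is bottom -- >
```

With this declaration a start-tag such as <FIGURE source="Reuters"> is treated as though align=bottom had also been entered.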
Attribute names must start with a valid name start character and must
contain only valid name characters. Their length
must not exceed the current value of the NAMELEN
quantity. This means that, when the default reference concrete syntax is being
used because no SGML declaration has been transmitted for use with the document,
attribute names must consist of not more than eight alphanumeric characters,
full stops or hyphens, starting with a letter. (This is why many DTDs,
unnecessarily, use cryptic short forms of attribute names: such short forms of
names will be recognizable by any SGML parser.) An attribute name can only be
used once in any attribute definition list declaration, but the same attribute
name can be used in other declarations.
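The naming rules above can be checked mechanically. The following Python sketch assumes the reference concrete syntax (a NAMELEN of 8, letters as name start characters, and letters, digits, full stops and hyphens as name characters). The function names are illustrative only and do not belong to any SGML toolkit.

```python
import re

NAMELEN = 8  # reference concrete syntax value; an SGML declaration may raise it

# A letter, followed by any mix of letters, digits, full stops and hyphens.
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_valid_attribute_name(name, namelen=NAMELEN):
    """Return True if name is a legal attribute name under the assumed syntax."""
    return len(name) <= namelen and _NAME.fullmatch(name) is not None


def duplicated_names(names):
    """Return the names that repeat within one attribute definition list.

    Names are compared after folding to uppercase, which assumes the
    default NAMECASE GENERAL YES setting.
    """
    seen, repeats = set(), []
    for name in names:
        key = name.upper()
        if key in seen:
            repeats.append(name)
        seen.add(key)
    return repeats
```

Passing a larger `namelen` (for example 32) models a document whose SGML declaration has increased the NAMELEN quantity, in which case longer, more readable attribute names become acceptable.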
The declared value of an attribute is either a bracketed list of valid attribute values, or a reserved name identifying the type of value(s) that can be entered. Where specific attribute values are defined each listed attribute value must be unique to the attribute definition list, but where reserved names are used the same attribute value can be used for a number of different attributes.
Note: The uniqueness rule for listed values can lead to problems where users
need to assign Y/N values to more than one attribute in a list. Typically this
is overcome by using a %boolean parameter entity whose replacement text is
NUMBER in place of the token list, and then defining booleans in
such a way that any number other than 0 is considered to be true (yes).
Figure 5.4 lists the reserved names that can be used for attribute types.
| Reserved name | Purpose |
|---|---|
| CDATA | Attribute value consists of character data (valid SGML characters, including markup delimiters) |
| ENTITY | Attribute value can be any currently declared subdocument or data entity name |
| ENTITIES | Attribute value is a list of subdocument or data entity names |
| ID | Attribute value is a unique identifier (ID) for the element |
| IDREF | Attribute value is an ID reference value (i.e. a reference to a name entered as the unique identifier of an element elsewhere in the same document) |
| IDREFS | Attribute value is a list of ID reference values |
| NAME | Attribute value is a valid SGML name |
| NAMES | Attribute value is a list of valid SGML names |
| NMTOKEN | Attribute value is a name token (i.e. contains only name characters, but in this case with digits and other valid name characters accepted as the first character) |
| NMTOKENS | Attribute value is a list of name tokens |
| NOTATION | Attribute value is a member of the bracketed list of notation names that qualifies this reserved name |
| NUMBER | Attribute value is a number |
| NUMBERS | Attribute value is a list of numbers |
| NUTOKEN | Attribute value is a number token (i.e. a name that starts with a number) |
| NUTOKENS | Attribute value is a list of number tokens |
Figure 5.4 Reserved names for attribute declared values
| Web SGML Adaptations Extension: When the Web SGML adaptations provided by Annex K to ISO 8879 are available, attributes can be assigned an explicit datatype by use of the keyword DATA
followed by the name of one of the notations
declared in the document type definition. Optionally a
data attribute specification can be used to
qualify the way in which the identified notation processor is to operate. |
As tokens (NMTOKEN, NMTOKENS,
NUTOKEN and NUTOKENS) provide a more flexible
approach to checking attribute values they are sometimes used in preference to
their more specific equivalents (NAME, NAMES,
NUMBER and NUMBERS), which place more restrictions
on the characters that can be used in attribute values.
The default value entry of the attribute definition consists of either a
specific value or one of the reserved names listed in Figure
5.5. Notice that these reserved names are preceded by the reserved name
indicator (RNI) to ensure that they are not mistaken for attribute
values of the same name which have not been enclosed in literal delimiters.
| Reserved name | Purpose |
|---|---|
| #FIXED | The following value is a fixed default value (i.e. cannot be changed by entry of another value in the start-tag) |
| #REQUIRED | The attribute value must be entered within the start-tag of the element |
| #CURRENT | If no attribute value is specified in the start-tag the value entered for this attribute on the start-tag for the nearest preceding element to share the attribute definition list declaration is to be used |
| #IMPLIED | If no attribute value is specified the program may imply a value |
| #CONREF | The element may contain either specific cross-reference text or an attribute whose value is a recognized ID reference value (i.e. a name that has been entered as the unique identifier to another element) |
Figure 5.5 Reserved names for attribute default values
The restrictions that apply to the use of the reserved names listed in Figures 5.4 and 5.5 are:
the ID and NOTATION reserved names may be used only once in any attribute definition list
the NOTATION and #CONREF reserved names cannot be used for attributes associated with EMPTY elements
the default value of an attribute declared using the ID reserved name must be either #REQUIRED or #IMPLIED
CDATA.
Further restrictions also apply to names associated with attributes declared
using the ID and NOTATION reserved names, as will be
explained when examples of the use of these reserved names are given.
| Web SGML Adaptations Extension: When the Web SGML adaptations provided by Annex K to ISO 8879 are available, the restriction on a name token appearing in the token lists associated with more than one attribute in a list is removed, with the proviso that names assigned to more than one list cannot be used to identify the occurrence of a short form of an attribute specification. The Web SGML adaptations also allow multiple attribute list declarations for the same element. All attribute lists defined for a particular element are concatenated in the order encountered. If the same attribute is defined twice, the first attribute definition applies. To make extensibility easier, empty attribute list declarations are no longer an error. The Web SGML adaptations also allow attributes to be defined that apply to
all elements, or to elements whose presence has been detected implicitly in
response to use of the new |
The simplest type of attribute is one with just two declared values, one of which is the default. For example:
<!ATTLIST book status (draft|final) draft >
This declaration shows that the two valid values (tokens) for the
status attribute are draft and final.
If a value of status="final" is not specified in the
start-tag the default value of status="draft" will be
applied. As both tokens contain only valid name characters, provided the
SGML declaration contains the default SHORTTAG YES entry,
<BOOK final> is all that is required to indicate that a book
has moved from draft to final status.
A list of declared values can contain as many names as required. For example, acts presented to the European Parliament use the following set of attributes:
<!ATTLIST act leg.val (agr|dec|rec|rec.ecsc|dir|reg|dec.ecsc|
dec.eea|proc|opin|prot|com.pos|other) #REQUIRED
ld (da|de|el|en|es|ga|fi|fr|it|nl|pt|sv|ml) #REQUIRED >
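The shortened form <BOOK final> can be resolved because each listed token may appear only once in an attribute definition list, so a bare token identifies its attribute unambiguously. The Python sketch below illustrates that lookup for the book and act declarations. The dictionary layout and the function name are assumptions made for illustration, and uppercase folding assumes the default NAMECASE GENERAL YES setting.

```python
# Token lists copied from the declarations above: attribute name -> permitted tokens.
BOOK = {"status": ("draft", "final")}
ACT = {
    "leg.val": tuple("agr dec rec rec.ecsc dir reg dec.ecsc dec.eea "
                     "proc opin prot com.pos other".split()),
    "ld": tuple("da de el en es ga fi fr it nl pt sv ml".split()),
}


def resolve_bare_token(token, attlist):
    """Map a value entered without an attribute name to (attribute, token).

    Returns None when the token is not listed for exactly one attribute,
    in which case the shortened form cannot be used.
    """
    wanted = token.upper()
    matches = [
        (name, listed)
        for name, tokens in attlist.items()
        for listed in tokens
        if listed.upper() == wanted
    ]
    return matches[0] if len(matches) == 1 else None
```

With these tables, `resolve_bare_token("final", BOOK)` identifies the status attribute, while a token that is not listed anywhere yields `None`.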
Where the list of possible values is liable to change regularly, or cannot be fully defined, the list of declared values can be replaced by a declared value reserved name that identifies names or name tokens. For example, elements conforming to the rules specified by the Text Encoding Initiative (TEI) can have an attribute associated with them that identifies which TEI element type form they are. The default definition for this attribute is:
TEIform NAME #IMPLIED
The value to be implied by the program if no value is specified is the name of the element with which the attribute is associated.
When an element that is not part of one of the TEI standard DTDs is required,
it should be associated with one of the predefined TEI forms. To do this you
simply add a TEIform attribute to its definition, e.g.:
<!ELEMENT special-para - O (heading, text) >
<!ATTLIST special-para TEIform NAME #FIXED "p" >
By using NAME as the declared value this declaration ensures
that the name must conform to the rules used for naming elements in the DTD
without having to list all the TEI element type names as permitted values in a
very long list.
Where more than one value may be required the declared value can be changed
to NAMES.
Where names required for an attribute may need to begin with a digit, or
another name character not defined as a valid name start character (e.g. a
hyphen), the NMTOKEN or NMTOKENS declared values can
be used in place of NAME and NAMES. The values
entered for such name token attributes will be parsed to
ensure that they only contain name characters and that their length does not
exceed the limit specified by the NAMELEN quantity.
The only difference between the NUTOKEN and NUTOKENS
number token declared value keywords and the name token
keywords (NMTOKEN and NMTOKENS) is that the first
character of any number token must be numeric. As the following
declaration for an <ARTWORK> element shows, the two declared
value types can be used together:
<!ELEMENT artwork - O EMPTY >
<!ATTLIST artwork width NMTOKEN colwidth
depth NUTOKEN #REQUIRED >
In this case the horizontal size (width) of the artwork
defaults to a special parameter value (colwidth) known to the
formatting program, unless a specific value is entered for the attribute. As
this special name starts with a letter, rather than a number, the NMTOKEN
keyword has been used for the declared value. For the depth
attribute the declared value has been defined as NUTOKEN to ensure
that the first part of the compulsory vertical size value is always a number.
To see the differences between these two definitions compare these valid tags:
<ARTWORK width=150mm depth=100mm>
<ARTWORK depth = 8in>
<ARTWORK width = "30-picas" depth = '4-in'>
with these invalid ones:
<ARTWORK>
<ARTWORK width=mm150 depth=mm100>
<ARTWORK depth = 8 in>
<ARTWORK width = "30 picas" depth = "4 in">
<ARTWORK colwidth depth=24pi>
<ARTWORK depth = 6">
Many of the entries in the invalid list may seem at first sight to be valid. You need to understand why they are invalid if you are to make full use of the checks provided when number tokens are being used in place of name tokens.
In the first of the invalid examples an error occurs because the compulsory
depth attribute has not been entered. (Remember that attributes
whose default value is #REQUIRED must have their value entered as
part of the start-tag.)
The second example is invalid because the value entered for the depth
attribute does not start with a number. When the declared value is NUTOKEN
the attribute value must start with a digit, rather than a letter. It should be
noted, however, that the value for the width attribute, which has
been given a declared value of NMTOKEN, is valid as it begins with a
letter and only contains name characters.
The faults in the third and fourth invalid examples relate to the use of
spaces in the value. Spaces may only occur within delimited literals. With the
third invalid example the SGML parser would accept depth = 8 as a
valid entry, but would then be required to treat in as the value
of the width attribute. As only attribute values taken from a list of permitted
values can be entered without an attribute name and value indicator, however,
this format would be invalid for the width attribute.
The fourth invalid entry will be treated as incorrect because the presence
of a space identifies the entry within the literal as a list of tokens rather
than a single token. This is invalid because the attributes have been declared
using NMTOKEN and
NUTOKEN rather than NMTOKENS and NUTOKENS.
For the fifth of the invalid entries an attempt has been made to minimize
the entry by omitting the attribute name and value indicator (width=).
As mentioned above, this technique of shortening tags is only valid where a
specific set of valid name tokens has been entered as the declared value.
The final example illustrates another subtle fault. Here an attempt has been
made to use the quotation mark (") to represent inches: but,
unless otherwise instructed by a change in the document's set of delimiter
characters, the parser should treat the symbol as an unmatched literal
delimiter, and so flag the entry as invalid.
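The value-level checks described above (the first, second and fourth invalid examples) can be sketched in Python as follows. The sketch assumes the start-tag has already been split into attribute names and values, so the faults that arise while parsing the tag itself (the third, fifth and sixth examples) are outside its scope. The table layout and function name are illustrative assumptions, and the patterns assume the reference concrete syntax.

```python
import re

NAMELEN = 8  # reference concrete syntax limit on token length
NMTOKEN = re.compile(r"[A-Za-z0-9.-]+")       # name characters only
NUTOKEN = re.compile(r"[0-9][A-Za-z0-9.-]*")  # as NMTOKEN, but starting with a digit

# Mirrors:  width NMTOKEN colwidth   and   depth NUTOKEN #REQUIRED
ARTWORK = {
    "width": (NMTOKEN, "colwidth"),
    "depth": (NUTOKEN, None),  # None stands for #REQUIRED in this sketch
}


def check_artwork(specified):
    """Return a list of problems for the attribute values of one start-tag."""
    problems = []
    for name, (pattern, default) in ARTWORK.items():
        value = specified.get(name, default)
        if value is None:
            problems.append(f"{name}: required attribute not specified")
        elif len(value) > NAMELEN or not pattern.fullmatch(value):
            # A space makes the value a token list, which is also rejected here.
            problems.append(f"{name}: {value!r} is not a single valid token")
    return problems
```

For example, `check_artwork({"width": "mm150", "depth": "mm100"})` reports only the depth value, because mm150 is an acceptable name token for width.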
One way of simplifying the attribute definition list for the artwork
element would be to treat both values as part of a single number token list by
use of a declaration of the form:
<!ATTLIST artwork size NUTOKENS #REQUIRED >
In this case a valid start-tag for the artwork element might
take the form:
<ARTWORK size = "100mm 5in">
Note how the only space within the literal separates the two entries.
Where only one unit of measurement is being used on the output system the
NUTOKEN and NUTOKENS keywords can be replaced by
NUMBER or NUMBERS to restrict attribute values to
numeric values only.
There is, however, one danger with the NUMBER keyword. Only
integers can be entered for attributes declared in this way: decimal values
cannot be defined. If decimal values are likely to be needed the NUTOKEN
or NUTOKENS option must be used as this will allow periods to be
used as decimal points at any point other than the first character. (Values less
than one must be entered with a zero in front of the decimal point.) If negative
values are required, however, NMTOKEN or NMTOKENS
must be used because, while a hyphen is a valid name character, it is not a digit.
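The trade-off between the three keywords can be seen by testing a few values against each of them. This Python sketch assumes the reference concrete syntax and ignores the NAMELEN length limit for brevity. The function names are illustrative.

```python
import re

# One pattern per singular keyword; insertion order runs from strictest to loosest.
DECLARED_VALUES = {
    "NUMBER": re.compile(r"[0-9]+"),
    "NUTOKEN": re.compile(r"[0-9][A-Za-z0-9.-]*"),
    "NMTOKEN": re.compile(r"[A-Za-z0-9.-]+"),
}


def accepted_by(value):
    """List the declared value keywords that would accept a single value."""
    return [kw for kw, pattern in DECLARED_VALUES.items() if pattern.fullmatch(value)]


def list_accepted(keyword, value):
    """Check a space-separated list against the plural form of keyword."""
    tokens = value.split()
    pattern = DECLARED_VALUES[keyword]
    return bool(tokens) and all(pattern.fullmatch(token) for token in tokens)
```

Under these assumptions "36" is accepted by all three keywords, "2.5" and "0.5" only by NUTOKEN and NMTOKEN, and ".5" and "-4" only by NMTOKEN.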
The following special types of attribute values are catered for by SGML:
The declaration given for the artwork element above said
nothing about the source of the artwork. (It only defined the size of the space
to be left for the image.) If the illustration was one that could be processed
by the pagination system, the file containing the coded picture could be
declared as an
external entity in the DTD. It might be
declared in a declaration of the form:
<!ENTITY fig1 SYSTEM "fig1.gif" NDATA GIF>
To allow this picture to be processed at the appropriate point the
declarations for the
artwork element could be extended to read:
<!ELEMENT artwork - O EMPTY >
<!ATTLIST artwork width NMTOKEN colwidth
depth NUTOKEN #IMPLIED
file ENTITY #REQUIRED >
The attribute definition list shows that the <ARTWORK>
element must have its start-tag qualified by an attribute, called file,
whose value is the name of an entity declared in the DTD referenced by the
document instance. Using this definition the illustration can be referenced
using a start-tag of the form:
<ARTWORK file="fig1">
When using ENTITY as the declared value it is important to
remember that the associated element must be declared as EMPTY
and, therefore, requires no end-tag. It should also be noted that if the program
encounters an <ARTWORK> start-tag without a file name the
program will report an error.
Entities do not have to contain non-SGML data. They could equally well
contain an SGML subdocument, or text which
does not contain SGML markup instructions (e.g. CDATA or SDATA
entities). Entities referenced using attributes may not, however, contain markup
or other text that requires parsing.
The ID declared value allows a unique identifier to be
associated with specific start-tags. Once a start-tag has been given a unique
identifier it can be cross-referred to by other attributes declared using the
IDREF or IDREFS declared value.
The default value associated with an attribute that has the ID
keyword as its declared value must be either:
#REQUIRED, indicating that a unique identifier must be entered, or
#IMPLIED, indicating that the identifier can be implied by the system if not present.
Because each identifier must be unique to the document, the standard
recommends that the same attribute name (e.g. id) is used for all
identifiers. This recommendation is not, however, compulsory.
The <ARTWORK> element could be assigned a unique
identifier that could be referenced from the text by extending its definition to
read:
<!ELEMENT artwork - O EMPTY >
<!ATTLIST artwork width NMTOKEN colwidth
depth NUTOKEN #IMPLIED
file ENTITY #REQUIRED
id ID #REQUIRED >
Each SGML identifier (known as an id value) must be a
valid SGML name, starting with a letter. This means that identifiers such as
<ARTWORK file=fig1 id=1> are invalid. If you do want to use
numbers as identifiers you must place at least one letter in front of the first
digit, e.g.
<ARTWORK file=fig1 id=f1>
Unique identifiers can be entered in either case, any lowercase characters
being converted to uppercase before the uniqueness of the identifier is
determined (unless the SGML declaration has been altered to contain the
statement NAMECASE GENERAL NO). This means, for example, that a
start-tag of the form <ARTWORK FILE=fig1 ID=F1> would be
treated as identical to the tag shown above. Note, however, that a start-tag
reading <ARTWORK FILE=FIG1 ID=F1> would not be identical to
its predecessors because entity names are normally case sensitive. This means
that fig1 and FIG1 refer to different entities.
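The uniqueness test for identifiers, including the effect of case folding, might be sketched in Python as shown below. The sketch assumes the reference concrete syntax for names. The `namecase_general` parameter is a hypothetical switch standing in for the NAMECASE GENERAL setting of the SGML declaration.

```python
import re

_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")  # an SGML name must start with a letter


def find_id_problems(ids, namecase_general=True):
    """Return (value, reason) pairs for invalid or duplicated identifiers.

    ids holds the ID attribute values in document order.
    """
    seen = {}
    problems = []
    for raw in ids:
        if not _NAME.fullmatch(raw):
            problems.append((raw, "not a valid SGML name"))
            continue
        # Fold to uppercase before comparing unless folding is switched off.
        key = raw.upper() if namecase_general else raw
        if key in seen:
            problems.append((raw, f"duplicates {seen[key]}"))
        else:
            seen[key] = raw
    return problems
```

With folding enabled, f1 and F1 clash; with `namecase_general=False` they are treated as two distinct identifiers.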
An attribute with a declared value of IDREF or IDREFS
can be used to refer to a unique identifier within the same document instance.
Normally only one unique identifier will be involved, so the attribute can be
declared using the singular keyword (IDREF). Typically the
declaration will take the form:
<!ELEMENT figref - O EMPTY >
<!ATTLIST figref to IDREF #REQUIRED
page (yes|no) no >
In this case the figure reference element (<FIGREF>) has
been declared as an empty element because its contents are automatically
generated by the program. It has a compulsory attribute (to) which
must be a reference to a unique identifier used in the same document instance.
At the point where the artwork is to be referred to within the text a figure
reference should be entered in the form <p>As shown in <FIGREF
to=f1>, ... . This might generate a cross-reference of the form
As shown in Figure 3.1 ....
If the start-tag was changed to read <FIGREF to=f1 page=yes>
the generated text might be extended to read As shown in Figure 3.1 on
page 94 ...
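How a formatting program might turn a figure reference into generated text can be sketched in Python. The lookup table, its contents and the function name are hypothetical: they stand in for whatever numbering information the formatter has collected for each unique identifier.

```python
# Hypothetical table built by a formatter: folded identifier -> (figure number, page).
FIGURES = {"F1": ("3.1", 94)}


def render_figref(to, page="no", figures=FIGURES):
    """Generate the text for a <FIGREF> element from its attribute values.

    A KeyError signals an IDREF with no matching identifier in the document.
    """
    number, page_number = figures[to.upper()]
    text = f"Figure {number}"
    if page == "yes":
        text += f" on page {page_number}"
    return text
```

The default of "no" for the `page` parameter mirrors the default value declared for the page attribute.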
While attributes using the IDREF or IDREFS
keywords will normally have a default value of #REQUIRED, there
are circumstances in which entries whose default value is
#CONREF may apply.
The content reference (#CONREF)
default value reserved name is particularly useful where documents are being
prepared as a number of individual files, which will be linked together as
subdocuments to a master document prior to
output. Because cross-references can only be made to identifiers entered in the
same document, cross-references to identifiers used in other subdocuments will
need to be entered specifically by the author. To allow for this, the #CONREF
default value option permits references to be made in two ways:
To see how this works, consider the following declaration for a figure reference:
<!ELEMENT figref - O (#PCDATA) >
<!ATTLIST figref to IDREF #CONREF
page (yes|no) no >
Because the to attribute has, in this case, been given a
default value keyword of #CONREF the contents of the associated
element cannot be declared to be EMPTY. Instead the element
declaration has been given a content model that allows parsed character data to
be entered.
Cross references to a unique identifier can still be made in the format used for the last example. When, however, the reference is to a figure in another subdocument the relevant entry should be entered as text within a start-tag and end-tag, e.g.:
<p>As shown in <FIGREF>Figure A.1 in Appendix A</FIGREF> ...
When the content reference attribute is present in the start-tag, the
element is treated as an EMPTY element (without content) and,
therefore, no end-tag is present. When the attribute value is not specified,
however, the element's end-tag must be entered to identify the end of the
reference. Because the end-tag is present in some cases and not in others, the
second of the tag omission indicators for any element associated with an
attribute whose default value is #CONREF should be O.
Only one attribute should be defined using the #CONREF default
value in any attribute definition list declaration. If the attribute list were,
for some unusual reason, to contain two #CONREF default value
keywords the parser must be able to imply values for both attributes
because, if either attribute is present, the element will automatically become
an empty one.
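The way a content reference attribute switches an element between its two forms can be sketched in Python. This is a simplified model: `content=None` is an assumed convention meaning that the element has neither content nor an end-tag, and the names used are illustrative.

```python
# Attributes of <FIGREF> whose default value is #CONREF.
CONREF_ATTRIBUTES = {"to"}


def is_treated_as_empty(specified):
    """True if any content reference attribute was given in the start-tag."""
    return any(name in specified for name in CONREF_ATTRIBUTES)


def check_figref(specified, content):
    """Check that content is present exactly when no #CONREF attribute is specified."""
    if is_treated_as_empty(specified):
        return content is None
    return content is not None
```

A reference such as `check_figref({"to": "f1"}, None)` is consistent, as is a reference with no to attribute that carries its own text.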
Two other keywords can be used to control entered values:
#FIXED when a fixed attribute value is required
#CURRENT when the current attribute value is to be used as the default value.
If an entered default value is preceded by the reserved name #FIXED
its value can never be changed. An example of an element with a fixed attribute
value is the version attribute associated with the <HTML>
element (see Chapter 12).
When the SGML declaration contains both SHORTTAG YES and
OMITTAG YES the #CURRENT default value keyword can
be used. This keyword tells users that, for the first occurrence of the
associated element, a value must be entered (as if #REQUIRED had
been used) but if no value is entered for subsequent occurrences of the element
the last entered value will be used as the current default value.
It should be noted, however, that only one current value is associated with each attribute. If an attribute declaration is shared by a number of elements, the value used as the current value will be the last value entered for the named attribute in any of the associated elements. For example, if the following attribute definition was added to the document type declaration subset:
<!ATTLIST (p|note) indent NUTOKEN #CURRENT>
and a section of text was coded as:
<P indent=0>This is an example of a normal, unindented paragraph of text. Notice that, because the paragraph tag was the first one that used the indent attribute a value had to be entered, even though no indent was required.
<NOTE indent=36pt>This note has been set with a 36pt indent.</NOTE>
<P>Because no specific indent value has been stated this paragraph has also been indented by 36pt as this is the value currently associated with the indent attribute.
<P indent=0>To cancel the indent applied to the note it is necessary to enter a new value for the indent attribute as part of the paragraph's start-tag.
the set text might appear in the form:
This is an example of a normal, unindented paragraph of text. Notice that,
because the paragraph tag was the first one that used the indent attribute,
a value had to be entered even though no indent was required.
NOTE: This note has been set with a 36pt indent.
Because no specific indent value has been stated this paragraph has also
been indented by 36pt as this is the value currently associated with the
indent attribute.
To cancel the indent applied to the note it is necessary to enter a new value for
the indent attribute as part of the paragraph's start-tag.
Notice that, until the indent is specifically restated, the value entered at
the start of the <NOTE> element remains in force for the
<P> element as well.
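The behaviour of #CURRENT across elements that share an attribute definition list can be sketched in Python. The sketch follows the description given in this section, with a single current value shared by all the associated elements. The input format (a list of element names paired with the attributes specified on each start-tag) and the function name are assumptions made for illustration.

```python
def resolve_current(start_tags, attribute="indent"):
    """Return (element, effective value) pairs for a #CURRENT attribute.

    start_tags is a list of (element name, specified attributes) pairs in
    document order, for elements sharing one attribute definition list.
    """
    current = None
    resolved = []
    for element, specified in start_tags:
        if attribute in specified:
            current = specified[attribute]  # an entered value becomes the new current value
        elif current is None:
            # The first occurrence behaves as if #REQUIRED had been declared.
            raise ValueError(f"{attribute} must be specified on the first <{element}>")
        resolved.append((element, current))
    return resolved


# The four start-tags of the example above.
EXAMPLE = [
    ("P", {"indent": "0"}),
    ("NOTE", {"indent": "36pt"}),
    ("P", {}),
    ("P", {"indent": "0"}),
]
```

Applied to `EXAMPLE`, the third entry inherits 36pt from the note, matching the set text shown above.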
| Web SGML Adaptations Extension: When the Web SGML adaptations provided by Annex K to ISO 8879 are available, the following forms of attributes can be defined:
|
Guidelines for Electronic Text Encoding and Interchange (TEI P3). Edited by C. M. Sperberg-McQueen and Lou Burnard for The Association for Computers and the Humanities (ACH), The Association for Computational Linguistics (ACL) and The Association for Literary and Linguistic Computing (ALLC), Chicago/Oxford, 1994, 1289pp
Web SGML Adaptations, Annex K to ISO 8879:1986, ISO/IEC JTC1/WG4, December 1997