© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman


Chapter 5
Elements and Attributes

This chapter explains how SGML document type declarations can be used to define which elements can occur in an SGML document instance, and the attributes that can be used to control the processing of each type of element. It contains sections on:

5.1 Element roles

The role of an SGML element depends on the context it is found in and the form of markup used to enter it. The following roles can be identified:

The first element specified in any SGML document is the base document element. This element must be formally declared, within the document type declaration, by entry of an element type declaration whose name is the same as that of the document type declaration. The name of the base document element identifies the class of document to be created (e.g. <ACT>).

Normally the last tag entered within an SGML document will be an end-tag whose name matches that used for the first tag (e.g. </ACT>). This tag ensures that the end of the document is correctly identified by the SGML document parser.

Further elements can be embedded between the two tags identifying the limits of the base document element, up to the level specified by the current tag level quantity (TAGLVL). (In the reference concrete syntax this value is 24, i.e. up to 24 levels of embedded elements can be used within each base document element.)

Where necessary, elements can be qualified by attributes. An attribute is a named value that can be used during processing to control such things as the presentation or content type of the element it is associated with. In the reference concrete syntax up to 40 different attributes can be associated with any element, provided that the total length of the attribute names and values does not exceed 960 characters.

Elements can optionally be ranked so that specific groups of elements are used at the same ranked element level. Ranked element level is specified by adding a number to the end of the element's generic identifier. When the RANK tag minimization option is being used, the current rank level can be implied from that of preceding elements.

Minimized elements are elements whose presence, name or attribute names can be implied by an SGML parser when the appropriate minimization options have been enabled in the FEATURES clause of the SGML declaration. Minimization techniques provided within SGML include:

In addition, data tags can be used to identify references to end-tags that have been entered using pre-defined character strings.

5.2 Element type declarations

Element type declarations must be entered as part of a document type declaration subset. The reserved name ELEMENT (or its previously declared replacement) appears immediately after the markup declaration open (MDO) code that identifies the start of the markup declaration.

In its shortest form an element type declaration takes the form:

   <!ELEMENT name model>

where name is the element type name (generic identifier) that uniquely identifies the type of element and model is either a formal declaration of the type of data that may be entered within the element, or a content model showing which subelements can be embedded within the element.

Where elements share a content model a bracketed name group can replace the element type name. A name group consists of a set of connected element type names bracketed by group open (GRPO) and group close (GRPC) delimiters. The element type names are normally connected by an OR connector (| in the reference concrete syntax) to give an entry of the form:

   <!ELEMENT (name-1|name-2|...|name-n) model>

The length of an element type name must not exceed the current value of the NAMELEN quantity. The first character must be alphabetic, or one of the additional name start characters defined in the syntax clause. Subsequent characters may be alphanumeric characters or one of the name characters declared in the syntax clause.

When the OMITTAG entry in the FEATURES clause of the current SGML declaration reads OMITTAG YES two extra characters must be entered between the name and content of the element declaration to define the type of omitted tag minimization to be applied to the element. These extra characters define whether or not the start-tag and/or end-tag can be omitted if its presence can be unambiguously implied from the model of the element it is embedded within. If the first character is O (the letter O, not the number zero) the start-tag for the element can be omitted at appropriate points. If it is - (hyphen) it can never be omitted. If the second character is O the end-tag can be omitted: otherwise it is - to show that the end-tag must never be omitted. The two characters must be separated from each other, and the adjacent element type name and content model, by at least one space or another valid separator code (e.g. TAB). For example, an element whose end-tag may be omitted might be declared as:

   <!ELEMENT artwork   - O   EMPTY >

This element type declaration defines an empty element, <ARTWORK>, which has no embedded content. The element is simply a tag that marks the point at which an illustration is to be added to the document. The <artwork> start-tag cannot be omitted but, as the element contains no text, the end-tag must be omitted as it serves no purpose.
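
As a minimal sketch of how the two minimization characters work in practice, the declarations below use hypothetical chapter, title and para elements (assumptions made for illustration, not taken from any particular DTD), followed by an instance that relies on OMITTAG YES:

   <!ELEMENT chapter   - -   (title, para+) -- both tags required -- >
   <!ELEMENT title     - O   (#PCDATA)      -- end-tag omissible  -- >
   <!ELEMENT para      - O   (#PCDATA)      -- end-tag omissible  -- >

   <chapter>
   <title>Fitting the bracket
   <para>Align the bracket with the frame.
   <para>Tighten each bolt in turn.
   </chapter>

Each omitted end-tag in this instance is implied by the start-tag or end-tag that follows it.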

5.2.1 Model groups

When an element can contain embedded subelements the declaration's content model must be defined as a model group. Like name groups, model groups consist of one or more connected element type names (called element tokens in this context) bracketed by group open (GRPO) and group close (GRPC) delimiter sequences, e.g.:

   <!ELEMENT book   - O   (prelims, body, annexes) >

In this case a book is said to be made up of three nested subelements, <PRELIMS>, <BODY> and <ANNEXES>.

For model groups, unlike other name groups, the type of connector used is significant. The three types of connector used in model groups to define the logical sequence in which elements are to appear are shown in Figure 5.1.

Default character   Delimiter name   Meaning
,                   SEQ              All must occur, in the order specified
&                   AND              All must occur, in any order
|                   OR               One (and only one) must occur

Figure 5.1 SGML connectors

The sequence connector (a comma in the reference concrete syntax) connects element types which must occur in a predefined sequence. In the above example, therefore, the prelims must precede the body of the text, which must precede any annexes.

Where the sequence in which the elements are used is not fixed, subelement names should be connected with an AND connector (&). For example, the fields at the head of a memo could be defined using a model of the form:

   <!ELEMENT heading O O (from & to & date) >
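
To illustrate the effect of the AND connector, assume that the three subelements are declared as simple text fields (a hypothetical declaration used only for this sketch). Either of the headings below should then satisfy the model, because the order of the fields is not fixed:

   <!ELEMENT (from|to|date)   - O   (#PCDATA) >

   <heading><from>Sender<to>Recipient<date>1 May 1997</heading>
   <heading><date>1 May 1997<to>Recipient<from>Sender</heading>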

If only one element could be applicable at a given point, the relevant element type names can be connected by an OR connector (|). For example, the following element could occur in the prelims of a book:

   <!ELEMENT by O O (author|editor) >

Note: The OR connector used in SGML is an exclusive OR rather than an inclusive OR.

The use of each embedded subelement can be further qualified by the addition of an occurrence indicator immediately after the element type name, or immediately after a group close delimiter linking a number of element type names. The three types of occurrence indicator defined in SGML are shown in Figure 5.2.

Default character   Delimiter name   Meaning
+                   PLUS             Repeatable element(s) that must occur at least once
*                   REP              Optional element(s) that may be repeated
?                   OPT              Optional element(s) that can occur at most once

Figure 5.2 SGML occurrence indicators

For example, to make the use of annexes in a book optional you would extend the definition given above to read:

   <!ELEMENT book - O (prelims, body, annexes?) >

An alternative, and somewhat better, approach would be to use an optional and repeatable <annex> element (the example below also adds an optional index):

   <!ELEMENT book - O (prelims, body, annex*, index?) >

To allow more than one author or editor to be defined in the prelims you could extend the definition of the <BY> element shown above to read:

   <!ELEMENT by O O (author+|editor+) >

Occurrence indicators have a higher precedence than connectors. For example, a model group such as (author|editor)+ differs from one defined as (author+|editor+) because the first model permits any sequence of author and editor details to be entered, whereas the second model only permits a set of author details or a set of editor details to be entered within a <BY> element.
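
The difference can be seen by comparing instances. The sketch below assumes, purely for illustration, that the author and editor elements are declared as text fields with omissible end-tags:

   <!ELEMENT (author|editor)   - O   (#PCDATA) >

With the model (author|editor)+ a mixed sequence such as the following would be accepted:

   <by><author>First author<editor>First editor<author>Second author</by>

With the model (author+|editor+) only one kind of entry can appear within a single <BY> element, so the mixed sequence above would be rejected, while each of these would be accepted:

   <by><author>First author<author>Second author</by>
   <by><editor>First editor<editor>Second editor</by>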

Model groups can be nested within each other up to the level indicated by SGML's GRPLVL quantity value. (The reference concrete syntax allows up to 16 levels of nested model groups.) Each nested model group can, if necessary, have its elements linked by a different connector. Each name in the group, each nested group, and the whole content model, can be qualified by an occurrence indicator. An example of a nested set of model groups is the element used for a Text Encoding Initiative (TEI) title statement, which is defined as:

   <!ELEMENT titleStmt  - O (title+, (author|editor|sponsor|
                                       funder|principal|respStmt)*) >

When parameter entities (see Chapter 6) are used to define the contents of model groups, a word of warning is required: you cannot associate an occurrence indicator with a parameter entity. If the elements whose names are listed in the replacement string of the parameter entity are to be qualified by an occurrence indicator either:

Note: When using parameter entities to define part, or all, of a model group it is important to remember that the associated entity declaration must precede the entity reference. The safest way to ensure this is to place all parameter entity declarations at the start of the document type definition.
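
Two commonly used ways of working within this restriction are sketched below. The entity name contributors and the element name credits are hypothetical, chosen only for this illustration. In the first form the brackets and the occurrence indicator are part of the entity's replacement string:

   <!ENTITY % contributors "(author|editor|sponsor)*" >
   <!ELEMENT credits  - O  (title, %contributors;) >

In the second form the replacement string holds only the connected names, and the entity reference is enclosed in a nested model group that carries the occurrence indicator:

   <!ENTITY % contributors "author|editor|sponsor" >
   <!ELEMENT credits  - O  (title, (%contributors;)*) >

In both forms the entity declaration precedes the declaration that references it.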

5.2.2 Text elements

A special form of primitive content token is used in model groups to indicate points at which the element can contain text. This token consists of a reserved name indicator (RNI, # in the reference concrete syntax) followed by the reserved name PCDATA, which stands for parsed character data. #PCDATA indicates that, at that point in the model, the element can contain text which has been checked by the SGML parser to ensure that any embedded tags or entity references have been identified.

When the #PCDATA token is present in a model group the element's content is referred to as mixed content. If text is not permitted the model group is defined as having only element content. Different rules for processing record boundaries apply to mixed content. A typical example of an element defined using mixed content is:

   <!ELEMENT para  - O (#PCDATA|emphasis)+ >

A special feature of the #PCDATA keyword is that it is automatically presumed to have a repeatable (REP) occurrence indicator. All characters occurring between successive markup tags are considered to satisfy a single #PCDATA token (including any entered as character data in a marked section).

It is recommended that #PCDATA is only used when data characters are permitted anywhere in the content of an element, i.e. where #PCDATA is the only token in the model group or where it is a member of a repeatable model group whose members are connected using an OR connector.

Note: This recommendation is made to avoid potential problems relating to the processing of record boundaries within mixed content.
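
As an illustrative sketch (the item and label element names are hypothetical), the first two declarations below follow the recommendation, while the third does not:

   <!ELEMENT title  - O  (#PCDATA) >
   <!ELEMENT para   - O  (#PCDATA|emphasis)* >
   <!ELEMENT item   - O  (label, #PCDATA) >

In the third declaration text is only permitted after the label, so a line break entered between the <item> start-tag and the <label> start-tag is liable to be treated as data at a point where data is not allowed.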

Where nested subelements cannot occur within an element, its contents can be declared to consist of one of the following types of declared content: character data (CDATA), replaceable character data (RCDATA) or no content at all (EMPTY).

A variant of the basic content model allows the reserved name ANY to replace an element type declaration's model group. This tells the program that text or any element defined within the same document type declaration can be used as an embedded element.

For elements with declared content, or using the ANY reserved name, the keyword replaces the model group, including its brackets. Because, unlike #PCDATA, these reserved names cannot occur within a model group, they do not need to be preceded by the reserved name indicator.

A typical example of the use of declared content is shown in the following element type declaration:

   <!ELEMENT ISBN - - CDATA >

It should be noted that, in the case of elements defined using the replaceable character data and character data options, the program will ignore any requests to start a new element until such time as it encounters a valid end-tag open (ETAGO) delimiter, i.e. </ followed by any valid name start character. For this reason all elements declared using the CDATA or RCDATA declared content keywords should have compulsory end-tags and should not contain the ETAGO character sequence within their content.
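
The following sketch uses a hypothetical code element to show the effect. Within character data the < and & characters in the first instance are not treated as markup:

   <!ELEMENT code  - -  CDATA >

   <code>if (a<b && b<c) swap(a, c);</code>

In contrast, content such as the following would cause problems, because the </p sequence inside the quoted string is an end-tag open delimiter followed by a name start character and so is recognized as markup before the intended </code> end-tag is reached:

   <code>print("</p>");</code>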

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the first character of any potentially clashing ETAGO sequence can be replaced by the predefined character data entity used to escape markup delimiter opening sequences, e.g. &lt;

5.2.3 Exceptions

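Model groups can be qualified by the addition of lists of exceptions. There are two types of exceptions: inclusions, which name additional elements that may occur anywhere within the element, and exclusions, which name elements that may not occur within it.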

Exceptions are specified by entry of name groups immediately after the model group defining the permitted contents of the element. The group open (GRPO) delimiter at the start of each such name group must be preceded by a plus sign if the names identify inclusions, or a hyphen (minus sign) if they represent exclusions. Both sets may be present at the same time, provided that:

Exclusions that prohibit the use of embedded elements required as part of a model group are not permitted.

Inclusions are typically associated with elements whose content model is #PCDATA, or are used at the start of a document to allow commonly occurring floating elements, such as footnotes, figures and tables, to occur anywhere in the text. For example the definition given for a book above could be extended to read:

   <!ELEMENT book - O (prelims, body, annexes?) +(footnote|figure|table) >

This model would allow footnotes, figures and tables to occur anywhere within a book.

It should be noted that inclusions are inherited by all the elements declared in the model, and any of their children. Where inclusions have been declared at a high level in the data structure, it will often become necessary to define exclusions at lower levels in the data model. For example, to prevent footnotes from containing embedded footnotes, or figures, the following declaration could be used for footnotes:

   <!ELEMENT footnote  - O  (#PCDATA)  -(footnote|figure) >

Note that this model would not prevent tables, or any other element that had been declared as an inclusion in a parent element of the footnote, from being entered within footnotes.
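
If tables were also to be kept out of footnotes, one possible variant (a sketch, not a declaration taken from a published DTD) would simply extend the exclusion group:

   <!ELEMENT footnote  - O  (#PCDATA)  -(footnote|figure|table) >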

5.2.4 Comments within element type declarations

Comments may be entered at most points where spaces are permitted within an element type declaration, except within bracketed name or model groups. They must be preceded and followed by comment delimiters (a pair of hyphens in the reference concrete syntax). Comments can run over more than one line if necessary, e.g.:

   <!ELEMENT position - O (#PCDATA|line+)  -- one or more lines of text
                                              describing position held --  >

It is recommended that comments within element type declarations are placed after the model group.

5.2.5 Ambiguous content models

An SGML content model cannot be ambiguous. Every element or character found in the document instance must be able to satisfy only one content token without looking ahead in the document instance. For example, an element whose content model is:

   <!ELEMENT contact - O ((name, address?), company?, address)>

is ambiguous because it cannot be determined whether an <ADDRESS> element entered after a name satisfies the optional address in the nested subset or the compulsory one in the outermost group until you know what markup tag follows the address. In most cases such content models can be easily avoided by introducing another level of container, e.g.:

   <!ELEMENT contact - O (person, company?, address)>
   <!ELEMENT person O - (name, address?) >

By making the end-tag of the container element (e.g. </PERSON>) compulsory you can ensure that the position of the two <ADDRESS> elements can always be distinguished.
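
The two instances below sketch how the compulsory end-tag resolves the question. They assume, for illustration only, that name, address and company are declared as text fields with omissible end-tags:

   <!ELEMENT (name|address|company)  - O  (#PCDATA) >

   <contact><person><name>Name of person<address>Home address</person>
   <address>Office address</contact>

   <contact><person><name>Name of person</person>
   <address>Office address</contact>

In the first instance the address entered before </person> belongs to the person and the one after it belongs to the contact. In the second the person has no address of its own.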

5.2.6 Analysing content models

The base document element, whose name matches that of the document type declaration, provides the starting point for the analysis of any set of element type declarations. The rules that should be applied when analysing a DTD are:

  1. Identify the element type declaration that has the relevant document type name (which may be part of a name group or the replacement text of a parameter entity).
  2. Study the element's tag omission rules and its model group, including any exceptions.
  3. Find the element type declaration for the first element listed in the model group.
  4. Repeat stages 2) and 3) until you come to one of the terminal keywords (#PCDATA, CDATA, RCDATA, EMPTY or ANY).
  5. Go back to the previous declaration and look for the declaration for the next element listed in its model group (or that of one of its parents).

To see the effect of these rules we will use them to create a tree diagram for the following simplified DTD for a memorandum:

   <!DOCTYPE memo [
    <!ELEMENT memo                  O O  (heading, body, signature?) >
    <!ELEMENT heading               O O  (from & to & copied-to? & date) >
    <!ELEMENT (from|to|copied-to)   - O  (name, position?)+ >
    <!ELEMENT name                  O O  (#PCDATA) >
    <!ELEMENT (position|date)       - O  (#PCDATA) >
    <!ELEMENT body                  O O  (para+)   +(artwork) >
    <!ELEMENT para                  - O  (#PCDATA|emphasis)+ >
    <!ELEMENT emphasis              - -  (#PCDATA) -(artwork) >
    <!ELEMENT artwork               - O  EMPTY >
    <!ELEMENT signature             O O  (salutation?, (name, position?)+)>
    <!ELEMENT salutation            - O  (#PCDATA) >
   ]>

The first thing that needs to be done is to identify the element type declaration for the element whose name matches that of the document type declaration, i.e. memo in this case. The tag omission rules associated with this root element type declaration tell us that both the start-tag and the end-tag can be omitted as their presence can be determined by the SGML parser from the presence of embedded elements. The model group for the memo element shows us that it contains only element content, and that three elements, heading, body and, optionally, signature must occur in a fixed sequence.

Following the third of the rules listed above we find that the model for the first element in the initial model group, heading, shows that the start-tag and end-tag for this element can also be omitted. Again the model group consists solely of element content, but this time the four elements are connected by an AND connector to indicate that the order in which the elements are entered is not important (they have a fixed position on preprinted paper). Also one of the elements, the copied-to element, is optional.

When we look for the model for the first of these elements we find that it shares a declaration with two of the other components of the heading. The tag omission rules for this declaration tell us that the start-tag for these elements must be present, but that the end-tag is omissible. Each of these elements must contain an embedded name element, optionally followed by details of the position held by the named person. The PLUS occurrence indicator associated with the whole model group shows that multiple names, with or without positions, may be entered for each component of the heading if required.

The model for the name element shows that both the start-tag and the end-tag can be omitted where their presence can be determined from the preceding and following elements. The content model consists of the special #PCDATA keyword showing that this is a terminal node that may contain parsed character data. At this point the fifth of our rules is invoked, so we need to return to the model group for the parent element, in this case the model shared by from, to and copied-to. The second element in the model group shared by this set of elements is the optional position element.

The only difference between the declaration for the position element, which is shared with the date element, and that for the name element is that the start-tag is not omissible. This is because the position element is always optional while the position of the date element cannot be determined by the parser as it is part of an AND group. The presence of an optional component of a model, or an element within an AND group, must always be indicated by a start-tag.

As the model group for the position and date elements consists solely of the #PCDATA content token we must return to the model of position's parent element(s), from, to and copied-to. As we have already seen the model for all the elements listed in this model group we must immediately return to their parent element, heading. As all the elements in the model group for a heading share the same element type declaration, the fifth rule requires us to return to its parent element, memo, and look at the second element listed in its model group, body.

The start-tag and end-tag for the body container can be omitted as the first paragraph in the memo will indicate the start of the body, and the presence of a signature will indicate the end of the body. The model group for the element shows that the body must contain one or more paragraphs (para). In this case the occurrence indicator has been placed adjacent to the element type name, rather than being applied to the whole of the model group. In addition the inclusion added after the model group shows that artwork can be interspersed between paragraphs, or placed within any embedded text or subelements.

The model for the para element is an example of the use of mixed content. In this case the parsed character data (#PCDATA) can be repeatedly mixed with emphasis elements. It must be remembered, however, that this element inherits the inclusion specified for its parent element, body, so artwork can also be embedded within paragraphs. The tag omission rules show that the start-tag of each paragraph must be present in the document instance.

The model for the emphasis element also indicates that it should contain parsed character data (#PCDATA), but in this case the model is qualified by the presence of an exclusion that prohibits the inheritance of the artwork inclusion from the model of the body element. Both the start-tag and the end-tag must be present to indicate the full scope of the emphasized text.

The model for the artwork element shows that this is an empty element that consists simply of a start-tag, with no end-tag. (The role of this element will be examined further shortly.)

Now that all the elements in the model group for body have been identified we must return to the model group of its parent, memo, and look at the next component of its model group, signature. Again both the start-tag and end-tag can be omitted from this element as their presence can be determined from the presence of its subelements, or its parent's end-tag. The model group here is slightly more complex, consisting of the name of an optional element, salutation, followed by a repeatable model group which uses the same subelements as the heading elements. There must be at least one name element within each signature.

The model for the salutation element shows that this consists simply of parsed character data. As this element is optional its start-tag is always required, though its end-tag could be omitted.

Figure 5.3 shows how a graph could be drawn to represent the model of a memo.

Figure 5.3 Graph showing structure of memo DTD
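
As a cross-check on the analysis, the following is a sketch of a fully tagged document instance that should conform to the memo DTD. All of the text content is placeholder material invented for the illustration:

   <memo>
   <heading>
   <from><name>Sender name</name><position>Editor</position></from>
   <to><name>Recipient name</name></to>
   <date>1 May 1997</date>
   </heading>
   <body>
   <para>Please check the <emphasis>revised</emphasis> layout shown below.</para>
   <artwork>
   <para>Comments are needed by the end of the week.</para>
   </body>
   <signature><salutation>Regards</salutation><name>Sender name</name></signature>
   </memo>

Every tag has been entered here for clarity. Where the declarations show an O, the corresponding tag could be left out wherever the parser can imply its presence without ambiguity.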

5.3 Using elements

Within a document instance the contents of elements are indicated by the use of start-tags and end-tags. A start-tag consists of the element's name between the currently declared start-tag open (STAGO) and tag close (TAGC) delimiters (< and > respectively in the reference concrete syntax). Optionally the tag close code can be replaced by a null end-tag (NET) delimiter (e.g. /) so that a matching null end-tag can be used in place of the normal end-tag. Where appropriate, the element type name can be qualified by the entry of one or more of the attributes declared for the element.

An end-tag consists of the element's name between the currently declared end-tag open (ETAGO) and tag close (TAGC) delimiters (e.g. </ and >). Where a null end-tag has been used to close the element's start-tag, the whole of the end-tag must be replaced by a single null end-tag code (/).
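
For example, assuming the reference concrete syntax with SHORTTAG YES, and an emphasis element of the kind used in the memo DTD above, the following two lines are intended to be equivalent ways of marking up the same word:

   Please check the <emphasis>revised</emphasis> layout.
   Please check the <emphasis/revised/ layout.

Because the / character ends the element, this short form is only suitable where the element's content does not itself contain that character.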

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the delimiter used at the end of the start-tag to identify that the element is to end with a null end-tag can be specified using the optional null end-tag start code (NETSC) delimiter. When a NETSC delimiter has been specified in the DELIMS section of the SYNTAX clause of the SGML declaration it should be different from that declared for the null end-tag by the specification for the NET delimiter. If no definition for the NETSC delimiter is provided the value for the NET delimiter is used, which by default is /.

Note: These new options allow code sequences such as <ISBN[0 201 40394 3]> to be used to delimit elements where this markup format is considered to be appropriate.

Not all tags need to be present in a document. Provided that the OMITTAG feature has been enabled in the SGML declaration, tags can be omitted when their presence can be implied without ambiguity.

Element type names can normally be entered in either uppercase or lowercase. (The NAMECASE section of the SGML declaration defaults to GENERAL YES. Where this entry is altered, DTD developers should take special care to warn users of the need to enter tags in the appropriate case.)

The main problem that can occur when using elements is that, unlike the style sheets used in uncontrolled word processors, SGML markup tags cannot be entered at a level for which they have not been declared. SGML-based text editors will be able to prevent users from entering invalid tags, but if documents are prepared without the guidance of an SGML parser errors can occur.

A more detailed example of the use of SGML elements can be found in the description of the HTML DTD in Chapter 13.

5.4 Attributes

An attribute is a named parameter (value) used to qualify an element's start-tag. Attributes are typically used to:

There are two parts to an attribute specification: an attribute name and an attribute value. These two parts are joined by a value indicator (VI, = in the reference concrete syntax) to give an attribute specification of the form:

   <element-name attribute-name=attribute-value ... >

Attribute values can be entered as attribute value literals. A literal is a string of characters recognized as a single unit by the system because the characters have been entered between a matched pair of literal delimiters. Two alternative sets of literal delimiters are provided in SGML. They are referred to within the standard as LIT (literal) and LITA (alternative form of literal). In the reference concrete syntax these are represented by the quotation mark (") and apostrophe (') respectively.

Note: The choice of which set of literal delimiters should be used is a matter of user convenience. The only restriction is that the character chosen cannot appear in the entered attribute value.

Only one type of literal delimiter can be used to delimit a particular attribute value, but the two types can be used interchangeably within the same tag, e.g.:

   <A href='http://www.u-net.com/~sgml/piechart1.gif'
      title="Martin's Work Breakdown">

As well as showing how changing the type of literal delimiter can allow you to use a literal delimiter within a string, the above example also exhibits the fact that line breaks can occur between attribute specifications in place of the normal space. This is particularly useful when the start-tag would otherwise be too long to fit on a line, as is the case in the example.

Where the FEATURES clause of the SGML declaration contains the statement SHORTTAG YES (as it does by default), the literal delimiters can be omitted if the only characters used in the value are ones currently declared as name characters. For example, a start-tag of the form <INPUT name="field1" size="60"> could also be entered as <INPUT name=field1 size=60>.

When the entered attribute value has been declared as a member of a set of valid attribute values for the element, the attribute name, with the associated value indicator, can also be omitted when SHORTTAG YES has been specified. For example, an entry such as <H1 align="center"> can be shortened to give an attribute specification of the form <H1 center>. In this case the attribute value must not be entered within literal delimiters because, if it is, the program will be unable to identify the attribute referred to.

Attribute values can also consist of delimited lists of values, each part of which is separated from the others by a space, or another valid separator character (e.g. RE, RS or TAB).

Each attribute can be given a default value when it is declared. If either OMITTAG YES or SHORTTAG YES has been specified in the FEATURES clause, this default value will be used if a specific attribute value is not entered in the start-tag.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the use of default values can be controlled via the attribute options in the short tag omission rules section of the FEATURES clause. If this entry starts ATTRIB DEFAULT NO default values should not be specified in the DTD. If default values are to be applied to attributes which have not been assigned a value in the document instance then ATTRIB DEFAULT YES must be specified in the SGML declaration.

Note: While an error need not be reported if a default attribute value is specified in the DTD when ATTRIB DEFAULT NO has been specified, an error must be reported if any element in the document instance fails to provide a value for this attribute.

Where a default value cannot be specified a special reserved name must be entered. For example, to tell the program that it should use internal rules for determining what value to assign to the attribute value the reserved name #IMPLIED is used. The reserved name #CURRENT can be used to tell the program to repeat the last value entered for that attribute on any element that shares the attribute list declaration in which it was declared.

Where an attribute value must be entered whenever the element is requested the reserved name #REQUIRED can be used as the default value.
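
The sketch below shows how these alternatives might look for the artwork element used earlier in this chapter. The attribute names file, align and scale are hypothetical, and the declared values CDATA and NUMBER are reserved names of the kind described in Section 5.5:

   <!ATTLIST artwork  file   CDATA                #REQUIRED
                      align  (left|center|right)  center
                      scale  NUMBER               #IMPLIED >

   <artwork file="chart1.gif">
   <artwork file="chart2.gif" align="left" scale="50">

In the first start-tag align takes its default value of center and the value of scale is left to the processing program. A start-tag that omitted the file attribute would be reported as an error.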

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available selection of the permitted attribute omission rules is controlled through the entries following the ATTRIB keyword in the SGML declaration.

5.5 Declaring attributes

Attributes are declared in attribute definition list declarations. Each attribute definition list is a separate markup declaration, delimited by the currently defined markup declaration delimiters (e.g. <! and >). Attribute definition list declarations start with the reserved name ATTLIST (or its previously declared replacement) which is followed, after one or more separators (space, etc.), by details of the element(s) the list is to be associated with. Once this associated element type specification has been entered one or more attribute definitions can be entered before the closing delimiter, to give the attribute definition list the general form:

   <!ATTLIST elements attribute-definition-1
                      ...
                      attribute-definition-n >

Where more than one element needs to be associated with a given list of attributes, the names of the associated element types are entered as a bracketed name group, individual names being separated by one of the SGML connectors to give an entry of the form:

   <!ATTLIST (element1|...|elementn) attribute-definition-1
                                     ...
                                     attribute-definition-n >

Each attribute definition consists of an attribute name, a declared value and a default value. They are separated from each other by a parameter separator which is a separator character (e.g. space, RE, RS, or TAB), a system specific entity end code, a comment delimited by pairs of hyphens, or a parameter entity reference for an entity whose replacement text starts with a parameter separator.

Attribute names must start with a valid name start character and must contain only valid name characters. Their length must not exceed the current value of the NAMELEN quantity. This means that, when the default reference concrete syntax is being used because no SGML declaration has been transmitted for use with the document, attribute names must consist of not more than eight alphanumeric characters, full stops or hyphens, starting with a letter. (This is why many DTDs, unnecessarily, use cryptic short forms of attribute names: such short forms of names will be recognizable by any SGML parser.) An attribute name can only be used once in any attribute definition list declaration, but the same attribute name can be used in other declarations.
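The naming rule can be checked mechanically. The following Python sketch assumes the reference concrete syntax (a NAMELEN of 8, letters as name start characters, and letters, digits, full stops and hyphens as name characters). The function name is_valid_name is purely illustrative and does not belong to any SGML tool.

```python
import re

# Assumption: reference concrete syntax, where NAMELEN is 8 and the name
# characters are letters, digits, full stop and hyphen.
NAMELEN = 8
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_valid_name(name, namelen=NAMELEN):
    """Return True if name is a valid SGML name of at most namelen characters."""
    return len(name) <= namelen and NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("status", "leg.val", "show-as", "2nd", "background"):
        print(candidate, is_valid_name(candidate))
```

Passing a larger namelen models a document whose SGML declaration raises the NAMELEN quantity, which is what permits longer, more descriptive attribute names.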

The declared value of an attribute is either a bracketed list of valid attribute values, or a reserved name identifying the type of value(s) that can be entered. Where specific attribute values are defined each listed attribute value must be unique to the attribute definition list, but where reserved names are used the same attribute value can be used for a number of different attributes.

Note: This last rule can lead to problems where users need to assign Y/N values to more than one attribute in a list. Typically this is overcome by using a %boolean parameter entity whose replacement text is NUMBER in place of the token list, and then defining booleans in such a way that any number other than 0 is considered to be true (yes).

Figure 5.4 lists the reserved names that can be used for attribute types.

Reserved name  Purpose
CDATA          Attribute value consists of character data (valid SGML characters, including markup delimiters)
ENTITY         Attribute value can be any currently declared subdocument or data entity name
ENTITIES       Attribute value is a list of subdocument or data entity names
ID             Attribute value is a unique identifier (ID) for the element
IDREF          Attribute value is an ID reference value (i.e. a reference to a name entered as the unique identifier of an element elsewhere in the same document)
IDREFS         Attribute value is a list of ID reference values
NAME           Attribute value is a valid SGML name
NAMES          Attribute value is a list of valid SGML names
NMTOKEN        Attribute value is a name token (i.e. contains only name characters, but in this case with digits and other valid name characters accepted as the first character)
NMTOKENS       Attribute value is a list of name tokens
NOTATION       Attribute value is a member of the bracketed list of notation names that qualifies this reserved name
NUMBER         Attribute value is a number
NUMBERS        Attribute value is a list of numbers
NUTOKEN        Attribute value is a number token (i.e. a name token that starts with a digit)
NUTOKENS       Attribute value is a list of number tokens

Figure 5.4 Reserved names for attribute declared values

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available attributes can be assigned an explicit datatype by use of the keyword DATA followed by the name of one of the notations declared in the document type definition. Optionally a data attribute specification can be used to qualify the way in which the identified notation processor is to operate.

As tokens (NMTOKEN, NMTOKENS, NUTOKEN and NUTOKENS) provide a more flexible approach to checking attribute values they are sometimes used in preference to their more specific equivalents (NAME, NAMES, NUMBER and NUMBERS), which place more restrictions on the characters that can be used in attribute values.

The default value entry of the attribute definition consists of either a specific value or one of the reserved names listed in Figure 5.5. Notice that these reserved names are preceded by the reserved name indicator (RNI) to ensure that they are not mistaken for attribute values of the same name which have not been enclosed in literal delimiters.

Reserved name  Purpose
#FIXED         The following value is a fixed default value (i.e. cannot be changed by entry of another value in the start-tag)
#REQUIRED      The attribute value must be entered within the start-tag of the element
#CURRENT       If no attribute value is specified in the start-tag the value entered for this attribute on the start-tag for the nearest preceding element to share the attribute definition list declaration is to be used
#IMPLIED       If no attribute value is specified the program may imply a value
#CONREF        The element may contain either specific cross-reference text or an attribute whose value is a recognized ID reference value (i.e. a name that has been entered as the unique identifier to another element)

Figure 5.5 Reserved names for attribute default values

The restrictions that apply to the use of the reserved names listed in Figures 5.4 and 5.5 are:

Further restrictions also apply to names associated with attributes declared using the ID and NOTATION reserved names, as will be explained when examples of the use of these reserved names are given.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, the restriction on a name token appearing in the token lists associated with more than one attribute in a list is removed, with the proviso that names assigned to more than one list cannot be used to identify the occurrence of a short form of an attribute specification.

The Web SGML adaptations also allow multiple attribute list declarations for the same element. All attribute lists defined for a particular element are concatenated in the order encountered. If the same attribute is defined twice the first attribute definition applies. To make extensibility easier, empty attribute list declarations are no longer an error.

The Web SGML adaptations also allow attributes to be defined that apply to all elements, or to elements whose presence has been detected implicitly in response to use of the new IMPLYDEF option. In these cases the associated element name(s) are replaced by the keyword #ALL or #IMPLICIT.

5.6 Using attributes

5.6.1 Simple attributes

The simplest type of attribute is one with just two declared values, one of which is the default. For example:

   <!ATTLIST book     status (draft|final)   draft  >

This declaration shows that the two valid values (tokens) for the status attribute are draft and final. If a value of status="final" is not specified in the start-tag the default value of status="draft" will be applied. As both tokens only contain valid name characters, providing the concrete syntax contains the default SHORTTAG YES entry:

A list of declared values can contain as many names as required. For example, acts presented to the European Parliament use the following set of attributes:

   <!ATTLIST act leg.val  (agr|dec|rec|rec.ecsc|dir|reg|dec.ecsc|
                           dec.eea|proc|opin|prot|com.pos|other)    #REQUIRED
                   ld     (da|de|el|en|es|ga|fi|fr|it|nl|pt|sv|ml)  #REQUIRED >

5.6.2 Using tokens

Where the list of possible values is liable to change regularly, or cannot be fully defined, the list of declared values can be replaced by a declared value reserved name that identifies names or name tokens. For example, elements conforming to the rules specified by the Text Encoding Initiative (TEI) can have an attribute associated with them that identifies which TEI element type form they are. The default definition for this attribute is:

   TEIform     NAME   #IMPLIED

The value to be implied by the program if no value is specified is the name of the element with which the attribute is associated.

When an element that is not part of one of the TEI standard DTDs is required, it should be associated with one of the predefined TEI forms. To do this you simply add a TEIform attribute to its definition, e.g.:

   <!ELEMENT special-para - O (heading, text) >
   <!ATTLIST special-para TEIform      NAME    #FIXED "p" > 

By using NAME as the declared value this declaration ensures that the name must conform to the rules used for naming elements in the DTD without having to list all the TEI element type names as permitted values in a very long list.

Where more than one value may be required the declared value can be changed to NAMES.

Where names required for an attribute may need to begin with a digit, or another name character not defined as a valid name start character (e.g. a hyphen), the NMTOKEN or NMTOKENS declared values can be used in place of NAME and NAMES. The values entered for such name token attributes will be parsed to ensure that they only contain name characters and that their length does not exceed the limit specified by the NAMELEN quantity.

5.6.3 Numeric attributes

The only difference between the NUTOKEN and NUTOKENS number token declared value keywords and the name token keywords (NMTOKEN and NMTOKENS) is that the first character of any number token must be numeric. As the following declaration for an <ARTWORK> element shows, the two declared value types can be used together:

   <!ELEMENT artwork     - O EMPTY                >
   <!ATTLIST artwork  width  NMTOKEN  colwidth
                      depth  NUTOKEN  #REQUIRED   >

In this case the horizontal size (width) of the artwork defaults to a special parameter value (colwidth) known to the formatting program, unless a specific value is entered for the attribute. As this special name starts with a letter, rather than a number, the NMTOKEN keyword has been used for the declared value. For the depth attribute the declared value has been defined as NUTOKEN to ensure that the first part of the compulsory vertical size value is always a number.

To see the differences between these two definitions compare these valid tags:

   <ARTWORK width=150mm depth=100mm>
   <ARTWORK depth = 8in>
   <ARTWORK width = "30-picas" depth = '4-in'>

with these invalid ones:

   <ARTWORK>
   <ARTWORK width=mm150 depth=mm100>
   <ARTWORK depth = 8 in>
   <ARTWORK width = "30 picas" depth = "4 in">
   <ARTWORK colwidth depth=24pi>
   <ARTWORK depth = 6">

Many of the entries in the invalid list may seem at first sight to be valid. You need to understand why they are invalid if you are to make full use of the checks provided when number tokens are being used in place of name tokens.

In the first of the invalid examples an error occurs because the compulsory depth attribute has not been entered. (Remember that attributes whose default value is #REQUIRED must have their value entered as part of the start-tag.)

The second example is invalid because the value entered for the depth attribute does not start with a number. When the declared value is NUTOKEN the attribute value must start with a digit, rather than a letter. It should be noted, however, that the value for the width attribute, which has been given a declared value of NMTOKEN, is valid as it begins with a letter and only contains name characters.

The faults in the third and fourth invalid examples relate to the use of spaces in the value. Spaces may only occur within delimited literals. With the third invalid example the SGML parser would accept depth = 8 as a valid entry, but would then be required to treat in as the value of the width attribute. As only attributes that have lists of permitted values can be entered without an attribute name and value indicator, however, this format would be invalid for the width attribute.

The fourth invalid entry will be treated as incorrect because the presence of a space identifies the entry within the literal as a list of tokens rather than a single token. This is invalid because the attributes have been declared using NMTOKEN and NUTOKEN rather than NMTOKENS and NUTOKENS.

For the fifth of the invalid entries an attempt has been made to minimize the entry by omitting the attribute name and value indicator (width=). As mentioned above, this technique of shortening tags is only valid where a specific set of valid name tokens has been entered as the declared value.

The final example illustrates another subtle fault. Here an attempt has been made to use the quotation mark (") to represent inches: but, unless otherwise instructed by a change in the document's set of delimiter characters, the parser should treat the symbol as an unmatched literal delimiter, and so flag the entry as invalid.
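The value checks described above can be sketched in Python. This is only an illustration under stated assumptions: the start-tag has already been split into attribute names and values (so the faults caused by unquoted spaces or an unmatched literal delimiter are outside its scope), the reference concrete syntax applies, and no case folding is modelled. The ARTWORK table and the check_artwork function are hypothetical names introduced for this example.

```python
import re

NAMELEN = 8  # assumption: reference concrete syntax
NMTOKEN = re.compile(r"[A-Za-z0-9.-]+")       # name characters only
NUTOKEN = re.compile(r"[0-9][A-Za-z0-9.-]*")  # must start with a digit

# Hypothetical in-memory form of the artwork attribute definitions:
# attribute name -> (pattern for the declared value, default value).
ARTWORK = {
    "width": (NMTOKEN, "colwidth"),
    "depth": (NUTOKEN, "#REQUIRED"),
}


def check_artwork(specified):
    """Return (values, errors) for a dict of attribute specifications."""
    values, errors = {}, []
    for name, (pattern, default) in ARTWORK.items():
        if name not in specified:
            if default == "#REQUIRED":
                errors.append(f"{name}: value must be specified")
            else:
                values[name] = default  # apply the declared default
            continue
        value = specified[name]
        if len(value) > NAMELEN or not pattern.fullmatch(value):
            errors.append(f"{name}: {value!r} is not a valid token")
        else:
            values[name] = value
    for name in specified:
        if name not in ARTWORK:
            errors.append(f"{name}: attribute not declared")
    return values, errors


if __name__ == "__main__":
    print(check_artwork({"depth": "8in"}))
    print(check_artwork({"width": "mm150", "depth": "mm100"}))
    print(check_artwork({"width": "30 picas", "depth": "4 in"}))
```

A value containing a space fails both patterns, which mirrors the rule that a single token attribute cannot hold a list of tokens.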

One way of simplifying the attribute definition list for the artwork element would be to treat both values as part of a single number token list by use of a declaration of the form:

   <!ATTLIST artwork size NUTOKENS #REQUIRED >

In this case a valid start-tag for the artwork element might take the form:

   <ARTWORK size = "100mm 5in">

Note how the only space separates the two entries.

Where only one unit of measurement is being used on the output system the NUTOKEN and NUTOKENS keywords can be replaced by NUMBER or NUMBERS to restrict attribute values to numeric values only.

There is, however, one danger with the NUMBER keyword. Only integers can be entered for attributes declared in this way: decimal values cannot be defined. If decimal values are likely to be needed the NUTOKEN or NUTOKENS option must be used as this will allow periods to be used as decimal points at any point other than the first character. (Values less than one must be entered with a zero in front of the decimal point.) If negative values are required, however, NMTOKEN or NMTOKENS must be used because, while a hyphen is a valid name character, it is not a digit.
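The differences between the single-token declared values can be made concrete with a small Python sketch. It assumes the name characters of the reference concrete syntax and ignores the NAMELEN limit; accepted_by is an illustrative helper, not part of any parser.

```python
import re

# Assumption: reference concrete syntax name characters; NAMELEN is ignored.
PATTERNS = {
    "NAME": re.compile(r"[A-Za-z][A-Za-z0-9.-]*"),
    "NMTOKEN": re.compile(r"[A-Za-z0-9.-]+"),
    "NUTOKEN": re.compile(r"[0-9][A-Za-z0-9.-]*"),
    "NUMBER": re.compile(r"[0-9]+"),
}


def accepted_by(value):
    """Return the single-token declared values that would accept value."""
    return [kind for kind, pattern in PATTERNS.items() if pattern.fullmatch(value)]


if __name__ == "__main__":
    for value in ("36", "0.5", ".5", "-5", "colwidth"):
        print(value, accepted_by(value))
```

Under these assumptions a decimal such as 0.5 is rejected by NUMBER but accepted by NUTOKEN, while .5 and -5 are accepted only by NMTOKEN.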

5.7 Specialized attributes

The following special types of attribute values are catered for by SGML:

5.7.1 Entity attributes

The declaration given for the artwork element above said nothing about the source of the artwork. (It only defined the size of the space to be left for the image.) If the illustration was one that could be processed by the pagination system, the file containing the coded picture could be declared as an external entity in the DTD. It might be declared in a declaration of the form:

   <!ENTITY fig1 SYSTEM "fig1.gif" NDATA GIF>

To allow this picture to be processed at the appropriate point the declarations for the artwork element could be extended to read:

   <!ELEMENT artwork - O    EMPTY              >
   <!ATTLIST artwork width  NMTOKEN  colwidth
                     depth  NUTOKEN  #IMPLIED
                     file   ENTITY   #REQUIRED >

The attribute definition list shows that the <ARTWORK> element must have its start-tag qualified by an attribute, called file, whose value is the name of an entity declared in the DTD referenced by the document instance. Using this definition the illustration can be referenced using a start-tag of the form:

   <ARTWORK file="fig1">

When using ENTITY as the declared value it is important to remember that the associated element must be declared as EMPTY and, therefore, requires no end-tag. It should also be noted that if the program encounters an <ARTWORK> start-tag without a file name the program will report an error.

Entities do not have to contain non-SGML data. They could equally well contain an SGML subdocument, or text which does not contain SGML markup instructions (e.g. CDATA or SDATA entities). Entities referenced using attributes may not, however, contain markup or other text that requires parsing.

5.7.2 Unique identifiers

The ID declared value allows a unique identifier to be associated with specific start-tags. Once a start-tag has been given a unique identifier it can be cross-referred to by other attributes declared using the IDREF or IDREFS declared value.

The default value associated with an attribute that has the ID keyword as its declared value must be either #REQUIRED or #IMPLIED.

Because each identifier must be unique to the document, the standard recommends that the same attribute name (e.g. id) is used for all identifiers. This recommendation is not, however, compulsory.

The <ARTWORK> element could be assigned a unique identifier that could be referenced from the text by extending its definition to read:

   <!ELEMENT artwork - O    EMPTY              >
   <!ATTLIST artwork width  NMTOKEN  colwidth
                     depth  NUTOKEN  #IMPLIED
                     file   ENTITY   #REQUIRED
                     id     ID       #REQUIRED >

Each SGML identifier (known as an id value) must be a valid SGML name, starting with a letter. This means that identifiers such as <ARTWORK file=fig1 id=1> are invalid. If you do want to use numbers as identifiers you must place at least one letter in front of the first digit, e.g. <ARTWORK file=fig1 id=f1>.

Unique identifiers can be entered in either case, any lowercase characters being converted to uppercase before the uniqueness of the identifier is determined (unless the SGML declaration has been altered to contain the statement NAMECASE GENERAL NO). This means, for example, that a start-tag of the form <ARTWORK FILE=fig1 ID=F1> would be treated as identical to the tag shown above. Note, however, that a start-tag reading <ARTWORK FILE=FIG1 ID=F1> would not be identical to its predecessors because entity names are normally case sensitive. This means that fig1 and FIG1 refer to different entities.
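The effect of case folding on uniqueness can be sketched as follows. The Python function below is a hypothetical illustration: it takes the identifiers in document order and reports any that repeat an earlier one, folding to uppercase unless the namecase_general flag (standing in for NAMECASE GENERAL) is turned off.

```python
def find_duplicate_ids(ids, namecase_general=True):
    """Return identifiers that repeat an earlier one, after case folding."""
    seen = set()
    duplicates = []
    for identifier in ids:
        # With NAMECASE GENERAL YES, f1 and F1 are the same identifier.
        key = identifier.upper() if namecase_general else identifier
        if key in seen:
            duplicates.append(identifier)
        seen.add(key)
    return duplicates


if __name__ == "__main__":
    print(find_duplicate_ids(["f1", "F1", "f2"]))
    print(find_duplicate_ids(["f1", "F1", "f2"], namecase_general=False))
```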

5.7.3 References to unique identifiers

An attribute with a declared value of IDREF or IDREFS can be used to refer to a unique identifier within the same document instance. Normally only one unique identifier will be involved, so the attribute can be declared using the singular keyword (IDREF). Typically the declaration will take the form:

   <!ELEMENT figref - O   EMPTY              >
   <!ATTLIST figref to    IDREF    #REQUIRED
                    page  (yes|no) no        >

In this case the figure reference element (<FIGREF>) has been declared as an empty element because its contents are automatically generated by the program. It has a compulsory attribute (to) which must be a reference to a unique identifier used in the same document instance.

At the point where the artwork is to be referred to within the text a figure reference should be entered in the form <p>As shown in <FIGREF to=f1>, ... . This might generate a cross-reference of the form As shown in Figure 3.1 ....

If the start-tag was changed to read <FIGREF to=f1 page=yes> the generated text might be extended to read As shown in Figure 3.1 on page 94 ...

While attributes using the IDREF or IDREFS keywords will normally have a default value of #REQUIRED, there are circumstances in which entries whose default value is #CONREF may apply.

The content reference (#CONREF) default value reserved name is particularly useful where documents are being prepared as a number of individual files, which will be linked together as subdocuments to a master document prior to output. Because cross-references can only be made to identifiers entered in the same document, cross-references to identifiers used in other subdocuments will need to be entered specifically by the author. To allow for this, the #CONREF default value option permits references to be made in two ways:

  1. By entering the wording required for the cross-reference as the contents of the element.
  2. By using a cross-reference attribute.

To see how this works, consider the following declaration for a figure reference:

   <!ELEMENT figref - O   (#PCDATA) >
   <!ATTLIST figref to    IDREF            #CONREF
                    page  (yes|no)         no       >

Because the to attribute has, in this case, been given a default value keyword of #CONREF the contents of the associated element cannot be declared to be EMPTY. Instead the element declaration has been given a content model that allows parsed character data to be entered.

Cross references to a unique identifier can still be made in the format used for the last example. When, however, the reference is to a figure in another subdocument the relevant entry should be entered as text within a start-tag and end-tag, e.g.:

   <p>As shown in <FIGREF>Figure A.1 in Appendix A</FIGREF> ...

When the content reference attribute is present in the start-tag, the element is treated as an EMPTY element (without content) and, therefore, no end-tag is present. When the attribute value is not specified, however, the element's end-tag must be entered to identify the end of the reference. Because the end-tag is present in some cases and not in others, the second of the tag omission indicators for any element associated with an attribute whose default value is #CONREF should be O.

Only one attribute should be defined using the #CONREF default value in any attribute definition list declaration. If the attribute list were, for some unusual reason, to contain two #CONREF default value keywords the parser must be able to imply values for both attributes because, if either attribute is present, the element will automatically become an empty one.

5.8 Controlling attribute values

Two other keywords can be used to control entered values:

  1. #FIXED when a fixed attribute value is required
  2. #CURRENT when the current attribute value is to be used as the default value.

If an entered default value is preceded by the reserved name #FIXED its value can never be changed. An example of an element with a fixed attribute value is the version attribute associated with the <HTML> element (see Chapter 12).

When the SGML declaration contains both SHORTTAG YES and OMITTAG YES the #CURRENT default value keyword can be used. This keyword tells users that, for the first occurrence of the associated element, a value must be entered (as if #REQUIRED had been used) but if no value is entered for subsequent occurrences of the element the last entered value will be used as the current default value.

It should be noted, however, that only one current value is associated with each attribute. If an attribute declaration is shared by a number of elements, the value used as the current value will be the last value entered for the named attribute in any of the associated elements. For example, if the following attribute definition was added to the document type declaration subset:

   <!ATTLIST (p|note) indent NUTOKEN #CURRENT>

and a section of text was coded as:

   <P indent=0>This is an example of a normal, unindented
   paragraph of text. Notice that, because the paragraph
   tag was the first one that used the indent attribute a
   value had to be entered, even though no indent was
   required.
   <NOTE indent=36pt>This note has been set with a 36pt
   indent.</NOTE>
   <P>Because no specific indent value has been stated this
   paragraph has also been indented by 36pt as this is
   the value currently associated with the indent
   attribute.
   <P indent=0>To cancel the indent applied to the note it
   is necessary to enter a new value for the indent
   attribute as part of the paragraph's start-tag.

the set text might appear in the form:

This is an example of a normal, unindented paragraph of text. Notice that,
because the paragraph tag was the first one that used the indent attribute, 
a value had to be entered even though no indent was required.

     NOTE: This note has been set with a 36pt indent.

     Because no specific indent value has been stated this paragraph has also
     been indented by 36pt as this is the value currently associated with the
     indent attribute.

To cancel the indent applied to the note it is necessary to enter a new value for
the indent attribute as part of the paragraph's start-tag.

Notice that, until the indent is specifically restated, the value entered at the start of the <NOTE> element remains in force for the <P> element as well.
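The behaviour described in this section can be modelled with a short Python sketch. It follows the description given here, in which one current value is shared by every element named in the attribute definition list declaration. The function resolve_current and its list-of-pairs input format are assumptions made for illustration only.

```python
def resolve_current(start_tags, attribute="indent"):
    """Return (element, value) pairs with the #CURRENT default applied."""
    current = None  # one value shared by all elements in the declaration
    resolved = []
    for element, attributes in start_tags:
        if attribute in attributes:
            current = attributes[attribute]  # a new value becomes current
        elif current is None:
            # The first occurrence behaves as if #REQUIRED had been used.
            raise ValueError(f"<{element}>: {attribute} must be specified")
        resolved.append((element, current))
    return resolved


if __name__ == "__main__":
    tags = [
        ("P", {"indent": "0"}),
        ("NOTE", {"indent": "36pt"}),
        ("P", {}),
        ("P", {"indent": "0"}),
    ]
    print(resolve_current(tags))
```

Run on the four start-tags of the example, the third tag (a <P> with no indent) picks up the 36pt value entered on the <NOTE>.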

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the following forms of attributes can be defined:
  1. Attributes using shared name token sets

    Duplicated name tokens may occur in attribute definition lists, as the following example shows:

    <!ATTLIST figref to    IDREF    #REQUIRED
                        page  (yes|no) no       
                        figno (yes|no) yes >

    Note: Duplicated name tokens cannot be used to identify short forms of attribute specifications.

  2. Attributes declared in multiple definitions

    The same associated element type name can occur in multiple attribute list declarations. Where multiple declarations for the same attribute are encountered the first declaration applies. For example, the following declarations could be used to define the attributes associated with a figure reference:

    <!ATTLIST figref to    IDREF    #REQUIRED
                     page  (yes|no) no         >

    followed by:

    <!ATTLIST (figref|xref) page   (yes|no) yes
                            figno  (yes|no) yes        >

    When concatenated and duplicate entries are removed this would be equivalent to:

    <!ATTLIST figref to    IDREF    #REQUIRED
                     page  (yes|no) no       
                     figno (yes|no) yes >
  3. Attributes associated with all elements

    Where an attribute is applicable to all elements in a DTD the reserved keyword #ALL can be used in place of an element type name, e.g.:

    <!ATTLIST #ALL language NAME #IMPLIED>

    Note: Attributes assigned to an element using the #ALL keyword will always be overwritten by attributes assigned specifically to the element, irrespective of the order in which the attributes have been defined.

  4. Attributes associated with implied elements

    When ELEMENTS YES has been specified as one of the implied definition formats in the IMPLYDEF section of the minimization features specification in the SGML declaration, the reserved keyword #IMPLICIT can be used in place of the associated element type name to indicate that the attribute list applies only to elements which have not been formally defined in the DTD. For example, the following attributes could be added to implied elements:

    <!ATTLIST #IMPLICIT language NAME           #IMPLIED 
                        show-as  (inline|block) inline   >
  5. Typed data attributes

    The following declaration will allow a notation processor known by the identifier ISO8601 to be used to validate date and time attributes:

    <!ATTLIST message date  DATA ISO8601 [format="dateonly"] #REQUIRED
                      time  DATA ISO8601 [format="timeonly"] #IMPLIED  >

    Note particularly the use of a data attribute specification to qualify the way in which the notation processor is invoked as part of the attribute specification.
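The concatenation rule described in item 2 of this list (declarations are combined in the order encountered and the first definition of an attribute applies) can be sketched in Python. The data layout, a list of (element names, definitions) pairs, and the names merge_attlists and DECLARATIONS are assumptions made for this illustration.

```python
def merge_attlists(declarations, element):
    """Combine attribute definitions for element; the first definition wins."""
    merged = {}
    for elements, definitions in declarations:
        if element not in elements:
            continue
        for name, definition in definitions.items():
            # setdefault keeps an earlier definition of the same attribute.
            merged.setdefault(name, definition)
    return merged


# The two declarations from the example, in the order they were encountered.
DECLARATIONS = [
    (("figref",), {"to": ("IDREF", "#REQUIRED"), "page": ("(yes|no)", "no")}),
    (("figref", "xref"), {"page": ("(yes|no)", "yes"), "figno": ("(yes|no)", "yes")}),
]

if __name__ == "__main__":
    print(merge_attlists(DECLARATIONS, "figref"))
    print(merge_attlists(DECLARATIONS, "xref"))
```

For figref the page attribute keeps its first default of no, whereas xref, which appears only in the second declaration, receives a default of yes.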

References

Guidelines for Electronic Text Encoding and Interchange (TEI P3). Edited by C. M. Sperberg-McQueen and Lou Burnard for The Association for Computers and the Humanities (ACH), The Association for Computational Linguistics (ACL) and The Association for Literary and Linguistic Computing (ALLC), Chicago/Oxford, 1994, 1289pp

Web SGML Adaptations, Annex K to ISO 8879:1986, ISO/IEC JTC1/WG4, December 1997