© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
This chapter reviews how the declarations described in the preceding chapters can be combined to form a document type definition. It contains the following sections:
The rules that are required by an application to control the markup of a document are known as a document type definition (DTD). The prolog of each SGML document must contain at least one document type declaration. The first document type declaration in any prolog is referred to as the base document type declaration.
Note: Short references may only be associated with the base document type declaration.
A document type declaration is an SGML markup declaration which starts with
the keyword DOCTYPE (or its declared replacement). When the
reference concrete syntax is being used a document type declaration starts:
<!DOCTYPE docname
where docname is a unique document type name
used to identify the base document element of one of the logical structures used
in the document/subdocument.
| Web SGML Adaptations Extension: When the Web SGML adaptations provided by Annex K to ISO 8879 are available, docname
can be replaced by the reserved keyword #IMPLIED. When this
keyword occurs in the document type declaration, the base document element name
is taken from the first start-tag in the accompanying document instance.
Note: The If |
Where an externally stored document type definition is being used as part (or all) of the document type declaration, the relevant external identifier for the external subset can be entered immediately after the document type name to give the document type declaration the form:
<!DOCTYPE docname external-subset-identifier ... >
The start of any locally defined set of entity and element type declarations
(referred to as the internal subset)
is identified by a declaration subset open (DSO)
delimiter entered immediately after the document's name or external identifier.
A matching declaration subset close (DSC)
delimiter is used to identify the end of the document type declaration subset:
this must immediately precede the markup declaration close (MDC)
code that terminates the document type declaration. In the reference concrete
syntax the left square bracket ([) is used for DSO,
the right square bracket (]) is used for DSC and the
MDC code is the greater than sign (>), giving the
document type declaration the overall form:
<!DOCTYPE docname optional-external-subset-identifier
[ internal subset ] >
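As a minimal sketch of this overall form, the following self-contained document keeps all of its declarations in the internal subset. The memo, to and body element names are invented for illustration and do not come from any published DTD.

```sgml
<!DOCTYPE memo [
<!-- Hypothetical element type set, used only for illustration -->
<!ELEMENT memo - - (to, body)>
<!ELEMENT to   - - (#PCDATA)>
<!ELEMENT body - - (#PCDATA)>
]>
<memo><to>All staff</to><body>The office is closed on Friday.</body></memo>
```

Here the left square bracket after memo is the DSO, and the closing ]> sequence is the DSC followed immediately by the MDC.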
Only markup declarations, including comment declarations and marked sections, processing instructions and valid SGML separator characters, as defined in the associated SGML declaration, may be entered within the document type declaration subset. The collective noun used to describe these declarations is DTD declarations.
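External identifiers can be used to add externally stored declarations to a document type declaration. Two mechanisms are provided for this: an external identifier entered immediately after the document type name, which identifies the external subset, and parameter entity references within the internal subset, which call externally stored entities that have been declared there.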
It is important to understand the difference between these two approaches.
Declarations that are called using parameter entity references in the internal
subset are activated at the point indicated by the parameter entity reference.
Files containing the external subset are not activated until the closing
delimiter of the document type declaration subset (e.g. the ]
character) is encountered, even though the SYSTEM or PUBLIC
keyword and the file identifier immediately follows the document type name at
the start of the document type declaration.
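The difference in timing can be sketched as follows. The entity names local and status and the file name local.ent are assumptions made for this illustration; the public identifier is the one used for the examples in this chapter.

```sgml
<!DOCTYPE act PUBLIC "-//OPOCE//DTD Act of the European Commission//EN" [
<!-- Step 1: the internal subset is read first, from top to bottom -->
<!ENTITY % local SYSTEM "local.ent">
<!-- Step 2: the declarations stored in local.ent take effect here -->
%local;
<!-- Step 3: this declaration is read after those in local.ent -->
<!ENTITY status "draft">
<!-- Step 4: the external subset named in the first line is read only
     once the closing ] below has been reached -->
]>
```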
Typically the external identifier in a document's prolog will identify a file containing the markup declarations that form the bulk of the document type declaration subset. A typical example might be:
<!DOCTYPE act PUBLIC "-//OPOCE//DTD Act of the European Commission//EN" [
<!-- internal subset declarations -->
]>
Because SGML only recognizes the first definition of an entity it receives, the fact that the externally stored declarations are added to the end of the subset ensures that any entity declarations entered in the local document type declaration subset will override a definition with the same name in the externally stored declarations. For example, if the document type declaration subset contains the definition:
<!ENTITY p CDATA "<p>">
while the recalled document type definition contains the entity:
<!ENTITY p STARTTAG "p">
the local definition would be used within the document, causing all &p;
entity references (or short references calling this entity) to be output as the
character string <p> rather than being recognized as a
paragraph element tag.
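The same precedence is often put to practical use by letting a document switch on optional parts of a shared DTD. The sketch below shows two separate fragments. It assumes a hypothetical external subset, report.dtd, that guards one declaration with a marked section controlled by a parameter entity called draft; none of these names come from a published DTD.

```sgml
<!-- Fragment 1: part of the hypothetical file report.dtd -->
<!ENTITY % draft "IGNORE">
<![ %draft; [
<!ELEMENT note - - (#PCDATA)>
]]>

<!-- Fragment 2: prolog of a document that wants the note element -->
<!DOCTYPE report SYSTEM "report.dtd" [
<!ENTITY % draft "INCLUDE">
]>
```

Because the internal subset is read first, draft is already defined as INCLUDE when report.dtd is processed, so the IGNORE definition stored there is disregarded and the note element is declared.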
| Web SGML Adaptations Extension: When the Web SGML adaptations provided by Annex K to ISO 8879 are available, the same precedence rules apply to duplicated attribute declarations where more than one attribute definition list declaration applies to a given element type. |
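A sketch of how this might look, assuming a parser that supports the Web SGML adaptations; the p element and its attributes are illustrative only.

```sgml
<!-- Internal subset: read first -->
<!ATTLIST p align (left|right) left>

<!-- External subset: read later -->
<!ATTLIST p align (left|center|right) center
            id    ID                  #IMPLIED>
```

The first definition of align is kept, so its default remains left, while id, which has not been declared before, is added to the attributes of p.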
Only one external identifier can be associated with each document type declaration. Where more than one set of externally stored declarations needs to be referenced in a document type definition, subsequent files will have to be declared and called within the document type declaration subset. For example, a document type declaration might start:
<!DOCTYPE act PUBLIC
"-//OPOCE//DTD Act of the European Commission//EN"
[ <!ENTITY % ISOchem PUBLIC
"ISO 9573-13:1991//ENTITIES Chemistry//EN" >
%ISOchem;
<!-- other local declarations -->
]>
Where required, public identifiers can be further qualified by system
identifiers showing where the relevant details can be found on the originating
system. For example, if the required declarations were stored in a file called
a-010296.dtd, a document type declaration might start:
<!DOCTYPE act PUBLIC
"-//OPOCE//DTD Act of the European Commission//EN" "a-010296.dtd"
[ ... ]>
Notice that, for publicly declared entities, the word SYSTEM
does not precede the system identifier.
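For comparison, when no public identifier is supplied the SYSTEM keyword introduces the system identifier directly. This sketch reuses the file name from the declaration above:

```sgml
<!DOCTYPE act SYSTEM "a-010296.dtd" [
<!-- local declarations -->
]>
```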
Element type sets are sets of inter-linked declarations that define the structure of a document, or part of a document. Element type sets can contain:
Where element type sets are stored in entities that can be identified using
formal public identifiers they should be assigned
a public text class of ELEMENTS.
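For instance, a set of table elements might be stored as a public entity and called from a DTD as sketched below. The owner identifier, text description and entity name are invented for illustration; only the ELEMENTS text class is taken from the text above.

```sgml
<!ENTITY % tables PUBLIC "-//Example Corp//ELEMENTS Multi-level Tables//EN">
%tables;
```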
Element type sets should contain a number of comment declarations that uniquely identify their provenance and history. Comment declarations should be used to identify:
Comment declarations are declarations which contain only comments. The
markup declaration open (MDO) delimiter must be immediately
followed by the two hyphens identifying the start of a comment, or by whatever
alternative comment (COM) delimiter has been defined in the
current concrete syntax. The closing comment delimiter is followed by the markup
declaration close symbol to give a declaration of the form:
<!-- Elements used in a typical textbook -->
Note: Unlike comments embedded within element and entity declarations,
comment declarations cannot have a space in front of the opening comment delimiter,
as spaces are not permitted between the initial markup declaration open sequence
(<!) and the opening comment delimiter. The following
declaration is, therefore, illegal:
<! -- Elements used in a typical textbook -- >
as it should have been entered as:
<!-- Elements used in a typical textbook -->
Note, however, that a space is permitted after the second of the comment delimiters.
More than one comment can be included in a single comment declaration, if required.
A special short form of comment declaration, consisting of a markup
declaration open delimiter immediately followed by a markup declaration close
delimiter (e.g. <!>) can be used where the comment
declaration is simply being inserted to provide a blank space between markup
declarations. (This form of dummy line should be used wherever a blank line is
required between element type declarations in an element type set to indicate
that the line has deliberately been left blank.)
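Putting these points together, the start of an element type set might look like the following sketch. The provenance details and element names are placeholders rather than values from a real set.

```sgml
<!-- Element type set for textbook front matter  --
  -- Placeholder: owner, version and date go here -->
<!>
<!ELEMENT title  - - (#PCDATA)>
<!>
<!ELEMENT author - - (#PCDATA)>
```

The first declaration holds two comments, separated only by a line break and spaces, and each <!> marks a line that has deliberately been left blank.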
Note: The format of comments used within Netscape, whereby the hyphen pairs are omitted, is not valid SGML and should be carefully guarded against.
It will be found that most of the declarations required for textual elements embedded within paragraphs are similar to those used in existing element type sets. In most cases it will be simpler to copy an existing definition from another set rather than try to redefine the embedded elements from first principles.
Before modifying an existing element type set it is important that the currently declared document structure is fully understood. If the relevant tree diagrams and element descriptions are available this should not be difficult, but if all you have is an uncommented document type declaration it may take some time to work out all the details of the existing structure.
Where the changes required to the existing structure are minor their incorporation into the document type declaration subset is usually a simple matter. Major changes may, however, require a careful re-appraisal of the parameter entities used within the document. This latter point is especially important where two or more existing element type sets are being combined because both sets may contain parameter entities which have the same name, but different declarations.
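The hazard can be seen in the following sketch, in which two hypothetical element type sets, stored in files called seta.elm and setb.elm, each declare a parameter entity named inline.

```sgml
<!ENTITY % setA SYSTEM "seta.elm">
<!ENTITY % setB SYSTEM "setb.elm">
<!-- seta.elm is assumed to contain: <!ENTITY % inline "EM | B"> -->
%setA;
<!-- setb.elm is assumed to contain: <!ENTITY % inline "CODE | VAR"> -->
%setB;
```

Because only the first definition of an entity is recognized, every %inline; reference in setb.elm would resolve to EM | B, silently changing the model groups that the second set was designed around. Renaming one of the entities before combining the sets avoids this.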
When creating a completely new element type set you should try to start by identifying a base document element that the element type set is to cover. Sometimes this is not possible because the element type set is designed to provide facilities for a number of DTDs, but where this is the case it will not be possible to test the validity of the element type set outside of the context in which it is used. Where a single element, or a set of elements, has been identified as a potential base document element for the element type set it will become possible to test the element type set before referencing it in other DTDs. This will assist the DTD maintenance process.
It is advisable to use existing element type names wherever possible, even when the use of the element slightly changes between applications. Where elements are named consistently users will find it easier to recognize the elements, and will be less likely to enter the wrong tags. As well as reducing the likelihood of keying errors, using commonly recognized names will also reduce the time needed to train document creators in the use of a new document structure.
Another way of making element type sets easier to understand is to structure the declarations so that elements declared at the same level start underneath each other. By applying this technique you can build up structured element type sets in simple stages.
The following DTD fragment shows how part of a FORMEX-coded multi-level table, as used within the OPOCE, has been defined:
<!ELEMENT BLKROW - - (#PCDATA, ROW1+)>
<!ELEMENT ROW1 - - (#PCDATA, ROW2*)>
<!ELEMENT ROW2 - - (#PCDATA, ROW3*)>
<!ELEMENT ROW3 - - (#PCDATA, ROW4*)>
<!ELEMENT ROW4 - - (#PCDATA, ROW5*)>
<!ELEMENT ROW5 - - (#PCDATA, ROW6*)>
<!ELEMENT ROW6 - - (#PCDATA, ROW7*)>
<!ELEMENT ROW7 - - (#PCDATA, ROW8*)>
<!ELEMENT ROW8 - - (#PCDATA)>
While such structured sets can make the relationships of elements easier to understand, structures can be complicated by use of parameter entities. Where a large number of parameter entities are used in model group definitions a structured format may not be advantageous.
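A short sketch of the problem, using invented entity and element names:

```sgml
<!ENTITY % phrase "EM | CITE">
<!ENTITY % inline "#PCDATA | %phrase; | XREF">
<!ELEMENT P     - - (%inline;)*>
<!ELEMENT ENTRY - - (%inline;)*>
```

Neither element type declaration shows directly which subelements it allows. The reader has to expand two levels of entity reference to find out, so the layout of the declarations no longer reveals the structure.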
A record is defined within SGML as any data between a record start (RS)
and a record end (RE) code. In the reference concrete syntax the
record start code is the hexadecimal 0A (the ASCII line feed code) while the
record end code is hexadecimal 0D (the ASCII carriage return code).
SGML does not restrict the length of a record, and record boundaries do not need to be present. Where they are present within parsed text (as opposed to marked sections or markup declarations) their effect depends on their position.
When parsing the data within mixed content an SGML program ignores any record start codes, using the record end code as the sole guide to record boundaries. Three rules control the effect of the record end code:
1. The first RE in an element is ignored if it is not preceded by an RS code, some recognized data or a proper subelement (i.e. a subelement that is specified in the model group for the element rather than in an inclusion clause associated with the element or one of its parents).
2. The last RE in an element is ignored if it is not followed by data or a proper subelement.
3. RE codes that do not immediately follow an RS code, or another RE code, are ignored unless the program identifies data, or a proper subelement, between the codes.
In applying these rules subelement content is ignored, as both proper and included subelements are treated as an atom which ends in the record it starts in.
Note: When start-tag omission is in force omitted markup recognition occurs before the above rules are applied.
The effect of these three rules on an element containing mixed content can be seen in the following example:
Record   Contents
___________________________________________________________
1        <P>
2        Record end codes immediately after tags are ignored.
3
4        <EM>
5        Emphasized phrases
6        </EM> do not always start on a new line.
7        </P>
Each of the records shown above starts with a record start code and ends with a record end code (the record numbers are not part of the file; they are simply shown for reference).
The first element of the text is the paragraph element whose start-tag appears in line 1. As this start-tag is immediately followed by a record end code, without any preceding text, rule 1 will result in the record end code on line 1 being ignored. This means the program will treat the first two lines of coding as if they had been entered in a single line reading:
<P>Record end codes immediately after tags are ignored.
The third record appears to consist simply of a record start code followed by a record end code, but it could also contain other, hidden, codes such as Tab codes and Backspaces. If this were the case, rule 3 above might result in the record end code at the end of the line being ignored. If the line is a true blank line, consisting of a record start code followed immediately by a record end code, however, the record end code will be retained when the document is parsed, as the following element is a proper subelement.
As with the first record, the record end code at the end of the fourth record will be ignored as no data for the embedded subelement precedes it. At the end of the fifth record, rule 2 will cause the last record end of the embedded subelement to be ignored as, in this case, it is followed by the end-tag of the subelement rather than data or another level of embedded subelement. The program will, therefore, treat the records 4 to 6 as if a single record had been entered as:
<EM>Emphasized phrases</EM> do not always start on a new line.
As the record end code at the end of the sixth record is the last such code in the paragraph element, this will also be ignored, following rule 2, so the whole example will be treated by the program as if it had been entered as:
<P>Record end codes immediately after tags are ignored.

<EM>Emphasized phrases</EM> do not always start on a new line.</P>
Note: The blank line will only be retained if the third record was actually blank.
| Web SGML Adaptations Extension: When the Web SGML adaptations provided by Annex K to ISO 8879 are available, the KEEPRSRE YES option of
the extended FEATURES clause can be
used to switch off these rules, retaining all record start and end codes as
entered. When the adaptations are in force the rules shown above will only be
applied when KEEPRSRE NO is specified. |
Where an element only contains nested subelements (i.e. its model group identifies it as having element content) record start and end codes will always be treated as separator characters, which are ignored within markup. Each proper or included subelement is treated as an atom that ends in the same record in which it begins.
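As a sketch of element content, using invented element names, the record boundaries between the tags in the following document are treated purely as separators:

```sgml
<!DOCTYPE list [
<!ELEMENT list - - (item+)>
<!ELEMENT item - - (#PCDATA)>
]>
<list>
<item>First point</item>
<item>Second point</item>
</list>
```

The list element is parsed exactly as if it had been entered on a single line as <list><item>First point</item><item>Second point</item></list>.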
Note: In the days when the SGML standard was being written, record-based mainframes were commonplace. Today, as such systems become less common, the anomalies introduced by adopting this approach are beginning to look dated. It is likely that this area of the SGML standard will be simplified when the standard is next extended.
Office for Official Publications of the European Communities (1985) FORMEX - Formalized Exchange of Electronic Publications (ISBN 92-825-5399-X) Luxembourg : OPOCE