
© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman


Chapter 4
The SGML Declaration

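This chapter briefly explains the component parts of the SGML declaration and gives some idea of the role each part plays. It covers the role of the SGML declaration, the SYNTAX clause, the other clauses in the SGML declaration, and alternative concrete syntaxes.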

Many readers will find the concepts covered in this chapter difficult to grasp at first reading. Do not worry if you do not understand the role of every part of the SGML declaration straight away. You are not meant to at this stage! The reason for asking you to read quickly through this chapter at the beginning of this explanation of SGML is that the restrictions imposed by the SGML declaration are fundamental to understanding many of the rules in SGML. Terms introduced in this chapter will be used throughout the remainder of the book. When you return to this chapter to remind yourself of the concepts these terms refer to, you should find that the summary of each term given here explains the restrictions imposed on other SGML constructs.

4.1 The role of the SGML declaration

When interchanging documents it is important that each transmitted code has a well-defined function. It is equally important that document markup can be correctly distinguished from the codes that form the text of the document.

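The rules defining the meanings of the constructs used by a particular language are known as the syntax of that language. Two distinct types of syntax have been defined for SGML: an abstract syntax, which defines the constructs of the language independently of the characters used to represent them, and concrete syntaxes, which assign specific characters and values to those constructs.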

This chapter will introduce you to many of the terms used to describe SGML's abstract syntax. The use to which the abstract syntax is put will be explained in the following chapters.

One particular concrete syntax, called the reference concrete syntax, has been formally defined within ISO 8879:1986 to provide a reference against which variant concrete syntaxes can be compared. It is a requirement of conforming SGML systems that they be able to parse documents conforming to the reference concrete syntax.

Each SGML document transferred to another system should be accompanied by a declaration, called the SGML declaration, which defines the coding scheme used in its preparation. Figure 4.1 shows the SGML declaration that should be used if a document is transmitted without one. (Such documents are referred to as basic SGML documents.)

<!SGML "ISO 8879:1986"
       -- Declaration for typical Basic SGML Document --
  CHARSET  BASESET   "ISO 646:1983//CHARSET International
                      Reference Version (IRV)//ESC 2/5 4/0"
           DESCSET   0   9  UNUSED
                     9   2  9
                     11  2  UNUSED
                     13  1  13
                     14  18 UNUSED
                     32  95 32
                     127 1  UNUSED
  CAPACITY PUBLIC    "ISO 8879:1986//CAPACITY Reference//EN"
  SCOPE    DOCUMENT
  SYNTAX   PUBLIC    "ISO 8879:1986//SYNTAX Reference//EN"
  FEATURES MINIMIZE  DATATAG NO  OMITTAG  YES  RANK     NO  SHORTTAG YES
           LINK      SIMPLE  NO  IMPLICIT NO   EXPLICIT NO
           OTHER     CONCUR  NO  SUBDOC   NO   FORMAL   NO
  APPINFO  NONE
>

Figure 4.1 SGML declaration for basic SGML document

The SGML declaration starts with a markup declaration open (mdo) sequence consisting of the codes <! . The declaration is closed by a matching markup declaration close (mdc) angle bracket (>) at the end of the declaration.

The rest of the first line of the SGML declaration consists of the letters SGML followed by a delimited string containing the number and date of the ISO standard in which SGML is defined ("ISO 8879:1986"). This statement indicates which version of the standard was used to prepare the following declarations.

The second line of the default SGML declaration contains some text bracketed by pairs of hyphens. Text entered in an SGML markup declaration between pairs of hyphens is treated as a comment. In this case the comment acts as a heading explaining the purpose of the following entries.
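The comment convention can be illustrated with a minimal Python sketch. The helper name `strip_comments` is hypothetical, and the sketch assumes that a pair of hyphens never occurs inside a quoted literal, which a real SGML parser would have to check.

```python
import re

# Matches "-- ... --" comments, including ones that span several lines.
# Simplification: a "--" inside a quoted literal is not protected.
COMMENT = re.compile(r"--.*?--", re.DOTALL)


def strip_comments(declaration: str) -> str:
    """Return the declaration text with every "-- ... --" comment removed."""
    return COMMENT.sub("", declaration)


sample = '''<!SGML "ISO 8879:1986"
       -- Declaration for typical Basic SGML Document --
  SCOPE    DOCUMENT
>'''

print(strip_comments(sample))
```

Only the heading comment disappears; the clause keywords and their values are left in place.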

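The names of the six main clauses that make up an SGML declaration are shown in the first column of the SGML declaration. They identify the document character set (CHARSET), the capacity set (CAPACITY), the scope of the concrete syntax (SCOPE), the concrete syntax (SYNTAX), the optional features used (FEATURES) and any application-specific information (APPINFO).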

4.2 The Syntax clause

A key part of the SGML declaration is the SYNTAX clause, which controls the codes that can be used for document markup. In Figure 4.1 the syntax has been entered as a formal public identifier which references the default syntax defined in ISO 8879:1986, which is shown in Figure 4.2.

      SYNTAX   SHUNCHAR  CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13
                           14 15 16 17 18 19 20 21 22 23 24 25 26
                           27 28 29 30 31 127 255
               BASESET   "ISO 646-1983//CHARSET International
                          Reference Version (IRV)//ESC 2/5 4/0"
               DESCSET   0 128 0
               FUNCTION  RE           13
                         RS           10
                         SPACE        32
                         TAB  SEPCHAR  9
               NAMING    LCNMSTRT     ""
                         UCNMSTRT     ""
                         LCNMCHAR     "-."
                         UCNMCHAR     "-."
                         NAMECASE     GENERAL YES
                                      ENTITY  NO
               DELIM     GENERAL      SGMLREF
                         SHORTREF     SGMLREF
               NAMES     SGMLREF
               QUANTITY  SGMLREF   

Figure 4.2 Formal definition of reference concrete syntax

4.2.1 The Reference Concrete Syntax

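The declaration for SGML's reference concrete syntax given in the SYNTAX clause shown in Figure 4.2 contains eight subclause definitions, each identified by a keyword. These define the characters that should be shunned (SHUNCHAR), the base character set of the syntax (BASESET), the described character set (DESCSET), the function characters (FUNCTION), the naming rules (NAMING), the delimiter set (DELIM), the reserved names (NAMES) and the quantity set (QUANTITY).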

The base character set used for the reference concrete syntax is that defined in international standard ISO 646. This 7-bit character set, known as the International Reference Version (IRV), is used as a starting point for all international standards that define character sets, e.g. ISO 6937, ISO 8859 and ISO/IEC 10646.

Note: A revision of ISO 646 took place in 1991. The revision (ISO 646:1991) matches the American Standard Code for Information Interchange (ASCII) used by many computer systems. (ISO 646 does allocate different names to some of the control characters, but these names do not affect the way these codes are used.) In addition it has been identified that the ISO 2022 Escape sequence used for ISO 646 in the SGML reference concrete syntax was incorrect: it should have been ESC 2/8 4/0. Strictly speaking, therefore, the reference concrete syntax should be updated to read "ISO 646:1991//CHARSET International Reference Version (IRV)//ESC 2/8 4/0". In practice it is likely that the next revision of SGML will adopt the 16-bit version of ISO/IEC 10646:1993 as its default code set.

The described character set portion of the reference concrete syntax character set description shows that 128 characters, starting from position 0 in the list, should be mapped to identical positions in the reference concrete syntax. Figure 4.3 shows the 128 codes defined in ISO 646.

Value (decimal) Value (hexadecimal) ISO (16-bit) representation ISO name/character Purpose
0 0 0/0 NUL Null code
1 1 0/1 TC1/SOH Transmission code 1 / Start of header
2 2 0/2 TC2/STX Transmission code 2 / Start of text
3 3 0/3 TC3/ETX Transmission code 3 / End of text
4 4 0/4 TC4/EOT Transmission code 4 / End of transmission
5 5 0/5 TC5/ENQ Transmission code 5 / Enquiry
6 6 0/6 TC6/ACK Transmission code 6 / Acknowledge
7 7 0/7 BEL Bell
8 8 0/8 FE0/BS Format effector 0 / Backspace
9 9 0/9 FE1/HT Format effector 1 / Horizontal tab
10 A 0/10 FE2/LF Format effector 2 / Line feed
11 B 0/11 FE3/VT Format effector 3 / Vertical tab
12 C 0/12 FE4/FF Format effector 4 / Form feed
13 D 0/13 FE5/CR Format effector 5 / Carriage return
14 E 0/14 SO Shift out
15 F 0/15 SI Shift in
16 10 1/0 TC7/DLE Transmission code 7 / Data link escape
17 11 1/1 DC1 Device control character 1
18 12 1/2 DC2 Device control character 2
19 13 1/3 DC3 Device control character 3
20 14 1/4 DC4 Device control character 4
21 15 1/5 TC8/NAK Transmission code 8 / Negative acknowledge
22 16 1/6 TC9/SYN Transmission code 9 / Synchronize
23 17 1/7 TC10/ETB Transmission code 10 / End of transmission block
24 18 1/8 CAN Cancel
25 19 1/9 EM End of medium
26 1A 1/10 SUB Substitute character
27 1B 1/11 ESC Escape
28 1C 1/12 FS/DT/IS4 File separator / ISO 6937 document terminator
29 1D 1/13 GS/PT/IS3 Group separator / ISO 6937 page terminator
30 1E 1/14 RS/IS2 Record separator
31 1F 1/15 US/IS1 Unit separator
32 20 2/0 Space
33 21 2/1 ! Exclamation mark
34 22 2/2 " Quotation mark
35 23 2/3 # Number sign
36 24 2/4 ¤ General currency sign (Dollar in ISO 646:1991)
37 25 2/5 % Percent
38 26 2/6 & Ampersand
39 27 2/7 ' Apostrophe
40 28 2/8 ( Left parenthesis
41 29 2/9 ) Right parenthesis
42 2A 2/10 * Asterisk
43 2B 2/11 + Plus sign
44 2C 2/12 , Comma
45 2D 2/13 - Hyphen
46 2E 2/14 . Full stop (Period)
47 2F 2/15 / Forward Slash (Solidus)
48 30 3/0 0
49 31 3/1 1
50 32 3/2 2
51 33 3/3 3
52 34 3/4 4
53 35 3/5 5
54 36 3/6 6
55 37 3/7 7
56 38 3/8 8
57 39 3/9 9
58 3A 3/10 : Colon
59 3B 3/11 ; Semicolon
60 3C 3/12 < Less-than sign
61 3D 3/13 = Equals sign
62 3E 3/14 > Greater-than sign
63 3F 3/15 ? Question mark
64 40 4/0 @ Commercial at
65 41 4/1 A
66 42 4/2 B
67 43 4/3 C
68 44 4/4 D
69 45 4/5 E
70 46 4/6 F
71 47 4/7 G
72 48 4/8 H
73 49 4/9 I
74 4A 4/10 J
75 4B 4/11 K
76 4C 4/12 L
77 4D 4/13 M
78 4E 4/14 N
79 4F 4/15 O
80 50 5/0 P
81 51 5/1 Q
82 52 5/2 R
83 53 5/3 S
84 54 5/4 T
85 55 5/5 U
86 56 5/6 V
87 57 5/7 W
88 58 5/8 X
89 59 5/9 Y
90 5A 5/10 Z
91 5B 5/11 [ Left square bracket
92 5C 5/12 \ Backward slash (Reverse solidus)
93 5D 5/13 ] Right square bracket
94 5E 5/14 ^ Circumflex accent
95 5F 5/15 _ Low line
96 60 6/0 ` Grave accent
97 61 6/1 a
98 62 6/2 b
99 63 6/3 c
100 64 6/4 d
101 65 6/5 e
102 66 6/6 f
103 67 6/7 g
104 68 6/8 h
105 69 6/9 i
106 6A 6/10 j
107 6B 6/11 k
108 6C 6/12 l
109 6D 6/13 m
110 6E 6/14 n
111 6F 6/15 o
112 70 7/0 p
113 71 7/1 q
114 72 7/2 r
115 73 7/3 s
116 74 7/4 t
117 75 7/5 u
118 76 7/6 v
119 77 7/7 w
120 78 7/8 x
121 79 7/9 y
122 7A 7/10 z
123 7B 7/11 { Left curly bracket
124 7C 7/12 | Vertical line
125 7D 7/13 } Right curly bracket
126 7E 7/14 ~ Tilde
127 7F 7/15 DEL Delete

Figure 4.3 The ISO 646 character set

Codes with values less than 32, and the code with a value of 127, have been allocated to control functions, while the 95 codes with values from 32 to 126 are associated with printable (data) characters. Note that the character numbers entered in the SHUNCHAR section of the syntax clause shown in Figure 4.2 include all of the codes defined as control codes within ISO 646:

        SHUNCHAR  CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13
                           14 15 16 17 18 19 20 21 22 23 24 25 26
                           27 28 29 30 31 127 255

There are, however, certain control codes that are significant within an SGML document, not as characters but as codes which serve particular functions. These codes are identified in the FUNCTION section of the syntax definition. In the case of the reference concrete syntax four functions are defined: record end (RE), record start (RS), space (SPACE) and horizontal tab (TAB).

The carriage return code (13) is used as the Record End code for the reference concrete syntax, with the line feed code (10) being used for the Record Start. The special rules that apply to the processing of these codes are explained in the section headed The effect of record boundaries in Chapter 11.

The Space character (32) is treated as a function character because it has a special function as a separator within SGML markup declarations. The Tab code (9) can also be used as a separator, but as it does not have exactly the same role as the space it is placed into a special group of separator characters identified by the SEPCHAR control word.

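Additional function codes can be specified by adding to the list a triplet consisting of the name assigned to the function, the function class to which it belongs, and the number of the character that invokes it (as in the TAB SEPCHAR 9 entry in Figure 4.2).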

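The types of function class that can be identified in SGML are FUNCHAR (inert function character), MSICHAR (markup-scan-in character), MSOCHAR (markup-scan-out character), MSSCHAR (markup-scan-suppress character) and SEPCHAR (separator character).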

The most commonly used function classes are SEPCHAR, which is used for all codes that can separate the component parts of a markup declaration (in addition to RE, RS and SPACE), and FUNCHAR, which is used to identify system specific functions.

Note: Markup scanning is suppressed between codes defined as markup-scan-out characters and codes defined as markup-scan-in characters, and for the code immediately following a markup-scan-suppress character.

The NAMING section of the syntax clause identifies which characters can be used in tag or entity names, and in SGML unique identifiers. By default SGML presumes that names can only start with alphabetic characters, in either case, with subsequent characters being alphanumeric. The LCNMSTRT and UCNMSTRT entries in the syntax clause allow other, non-alphanumeric, characters to be defined as name start characters, while the LCNMCHAR and UCNMCHAR entries define which non-alphanumeric characters can be used as name characters after a name start character.

The reference concrete syntax only allows alphabetic characters to be used as name start characters, but within names the unaccented alphanumeric characters (a-z, A-Z and 0-9) can be supplemented by full stops and hyphens.

Note: Digits cannot be used as the first character of an SGML name.
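The naming rules of the reference concrete syntax can be sketched as a small Python check. This is an illustrative sketch, not part of any SGML tool: the function name `is_reference_name` is hypothetical, and the length limit is the NAMELEN value of 8 taken from the reference quantity set shown in Figure 4.7.

```python
import re

NAMELEN = 8  # reference quantity set limit on the length of a name

# Name start characters: letters only (LCNMSTRT and UCNMSTRT are empty).
# Name characters: letters and digits, plus "-" and "." from LCNMCHAR/UCNMCHAR.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_reference_name(candidate: str) -> bool:
    """Return True if candidate is a valid name in the reference concrete syntax."""
    return len(candidate) <= NAMELEN and NAME.fullmatch(candidate) is not None


for name in ("chap.1", "sub-sect", "1chap", "paragraph", "a_b"):
    print(name, is_reference_name(name))
```

The last three candidates are rejected: the first starts with a digit, the second is nine characters long, and the third contains a low line, which is not a name character in the reference concrete syntax.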

Other characters that are required as parts of tag, attribute or entity names, or within unique identifiers, must be declared as valid name characters by putting the appropriate characters in the uppercase and lowercase name start or name character strings. The position of the entries in the string is important as characters in position n in the lowercase string may be replaced by the character in position n in the uppercase string during parsing. If there is no uppercase equivalent the lowercase character must be repeated in the uppercase string (and vice versa).
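The positional pairing of the lowercase and uppercase strings can be illustrated with a short Python sketch. The helper `build_case_map` and the variant syntax it models (one that adds "é" and the low line as name characters) are assumptions made for illustration only.

```python
import string


def build_case_map(lower_extra: str, upper_extra: str) -> dict:
    """Pair position n of the lowercase string with position n of the uppercase string."""
    if len(lower_extra) != len(upper_extra):
        raise ValueError("lowercase and uppercase strings must have the same length")
    table = {ord(lo): ord(up)
             for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)}
    table.update({ord(lo): ord(up) for lo, up in zip(lower_extra, upper_extra)})
    return table


# Hypothetical variant: "é" has the uppercase form "É", while "_" has no
# uppercase equivalent and is therefore repeated in the uppercase string.
case_map = build_case_map("é_", "É_")
print("réf_1".translate(case_map))  # RÉF_1
```

Because the pairing is purely positional, swapping two characters in one string but not the other would silently substitute the wrong character.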

The NAMECASE entries of the syntax clause show that, by default, the reference concrete syntax allows uppercase substitution of lowercase characters within element and related markup (GENERAL YES) but for entity names such substitution is not permitted (ENTITY NO). This allows different entity declarations to be defined for &Eacute; and &eacute;, etc., while allowing <p> and <P> to be treated identically.

The GENERAL SGMLREF entry in the DELIM section of the syntax clause shows that the general default set of SGML delimiters is used in the reference concrete syntax. Figure 4.4 lists these default delimiters and shows the formal name assigned to each delimiter.

Character(s) Name    Purpose

   &         ERO     Entity reference open or
   &         AND     And connector (within declaration group)
   &#        CRO     Character reference open
   %         PERO    Parameter entity reference open
   ;         REFC    Entity reference close
   <         STAGO   Start-tag open
   </        ETAGO   End-tag open
   <!        MDO     Markup declaration open
   <?        PIO     Processing instruction open
   >         TAGC    Tag close or
   >         MDC     Markup declaration close or
   >         PIC     Processing instruction close
   (         GRPO    Group open (within declaration)
   )         GRPC    Group close (within declaration)
   [         DSO     Declaration subset open or
   [         DTGO    Data tag group open
   ]         DSC     Declaration subset close or
   ]         DTGC    Data tag group close
   ]]        MSC     Marked section close
   "         LIT     Start or end of literal string
   '         LITA    Alternative start or end of literal string
   =         VI      Value indicator (within attributes)
   --        COM     Start and end of comment
   -         MINUS   Exclusion set identifier
   +         PLUS    Inclusion set identifier or
   +         PLUS    Required and repeatable occurrence indicator
   *         REP     Optional and repeatable occurrence indicator
   ?         OPT     Optional occurrence indicator
   |         OR      Or connector (within declaration group)
   ,         SEQ     Sequence connector (within declaration group)
   /         NET     Null end-tag
   #         RNI     Reserved name indicator

Figure 4.4 Reference concrete syntax delimiter set

Note that some codes are assigned more than one meaning. This is because the meaning of a markup delimiter is dependent on the context in which it is encountered. There are 10 different markup contexts:

  1. CON - Recognized in content, including marked section content
  2. CXT - Recognized within both CON and DSM contexts
  3. DS - Recognized only within a declaration subset
  4. DSM - Recognized within a declaration subset or a marked section
  5. GRP - Recognized within a group
  6. LIT - Recognized within a literal
  7. MD - Recognized within a markup declaration
  8. PI - Recognized within a processing instruction
  9. REF - Recognized within an entity or character reference
  10. TAG - Recognized within a start-tag or end-tag

Figure 4.5 shows which delimiters are recognized in which contexts.

Context Delimiters recognized
CON CRO ERO STAGO ETAGO NET MDO MSC PIO and short reference delimiters
CXT COM DSO GRPO MDC TAGC
DS DSC
DSM MDO MSC PERO PIO
GRP GRPO GRPC LIT LITA PERO AND OR SEQ PLUS REP RNI DTGO DTGC
LIT CRO ERO LIT LITA PERO
MD COM DSO DSC GRPO LIT LITA MINUS PLUS PERO RNI
PI PIC
REF REFC
TAG STAGO ETAGO TAGC VI LIT LITA

Figure 4.5 Contexts in which delimiters can be recognized
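The way context resolves a delimiter with several meanings can be sketched in Python by transcribing Figures 4.4 and 4.5 into two lookup tables. The names `DELIMITER_STRINGS`, `RECOGNIZED_IN` and `roles` are hypothetical, and short reference delimiters are left out of this sketch.

```python
# Delimiter strings of the reference concrete syntax (Figure 4.4).
DELIMITER_STRINGS = {
    "ERO": "&", "AND": "&", "CRO": "&#", "PERO": "%", "REFC": ";",
    "STAGO": "<", "ETAGO": "</", "MDO": "<!", "PIO": "<?",
    "TAGC": ">", "MDC": ">", "PIC": ">", "GRPO": "(", "GRPC": ")",
    "DSO": "[", "DTGO": "[", "DSC": "]", "DTGC": "]", "MSC": "]]",
    "LIT": '"', "LITA": "'", "VI": "=", "COM": "--", "MINUS": "-",
    "PLUS": "+", "REP": "*", "OPT": "?", "OR": "|", "SEQ": ",",
    "NET": "/", "RNI": "#",
}

# Delimiters recognized in each context (Figure 4.5).
RECOGNIZED_IN = {
    "CON": ["CRO", "ERO", "STAGO", "ETAGO", "NET", "MDO", "MSC", "PIO"],
    "CXT": ["COM", "DSO", "GRPO", "MDC", "TAGC"],
    "DS": ["DSC"],
    "DSM": ["MDO", "MSC", "PERO", "PIO"],
    "GRP": ["GRPO", "GRPC", "LIT", "LITA", "PERO", "AND", "OR", "SEQ",
            "PLUS", "REP", "RNI", "DTGO", "DTGC"],
    "LIT": ["CRO", "ERO", "LIT", "LITA", "PERO"],
    "MD": ["COM", "DSO", "DSC", "GRPO", "LIT", "LITA", "MINUS", "PLUS",
           "PERO", "RNI"],
    "PI": ["PIC"],
    "REF": ["REFC"],
    "TAG": ["STAGO", "ETAGO", "TAGC", "VI", "LIT", "LITA"],
}


def roles(chars: str, context: str) -> list:
    """Return the delimiter names that the given characters can have in a context."""
    return [name for name in RECOGNIZED_IN[context]
            if DELIMITER_STRINGS[name] == chars]


print(roles(">", "PI"))   # ['PIC']
print(roles(">", "TAG"))  # ['TAGC']
print(roles("&", "GRP"))  # ['AND']
print(roles("&", "CON"))  # ['ERO']
```

The same character therefore yields a different delimiter role depending on where the parser currently is, and no role at all in contexts where it is not recognized.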

The SHORTREF SGMLREF entry in the DELIM section of the syntax clause shows that the standard set of SGML short reference delimiters, shown in Figure 4.6, can be used in conjunction with the reference concrete syntax.

Character(s) Number(s) Purpose

   &#TAB;       9       Horizontal tab
   &#RS;       10       Record start (line feed)
   &#RE;       13       Record end (carriage return)
               32       Space
   "           34       Quotation mark
   #           35       Number sign
   %           37       Percent
   '           39       Apostrophe
   (           40       Left parenthesis
   )           41       Right parenthesis
   *           42       Asterisk
   +           43       Plus sign
   ,           44       Comma
   -           45       Hyphen
   :           58       Colon
   ;           59       Semicolon
   =           61       Equals sign
   @           64       Commercial at
   [           91       Left square bracket
   ]           93       Right square bracket
   ^           94       Circumflex accent
   _           95       Low line
   {           123      Left curly bracket
   |           124      Vertical line
   }           125      Right curly bracket
   ~           126      Tilde
   --          45,45    Two hyphens
   BB          66,66    Two or more blanks (spaces or tabs)
   B&#RE;      66,13    Trailing blank(s) followed by record end
   &#RS;B      10,66    Record start followed by leading blanks
   &#RS;B&#RE; 10,66,13 Blank records (one or more blanks)
   &#RS;&#RE;  10,13    Empty record

Figure 4.6 Reference concrete syntax short reference delimiters

In the reference concrete syntax most punctuation characters can be used as short reference delimiters, though tag delimiters (&, <, /, !, ? and >) and certain other significant symbols (e.g. grave accent, backslash, full stop and the general currency sign) are excluded. Six special code sequences are also defined, five of which allow common word processor line ending conventions to be used as short reference strings.

The QUANTITY entry at the end of the syntax clause also requires the presence of the SGMLREF keyword to indicate that unless otherwise specified the default quantity set will be used. Figure 4.7 shows the default quantity limits.

Reserved Name Value Purpose
ATTCNT 40 Maximum number of attribute names and name tokens in an attribute definition list
ATTSPLEN 960 Maximum length of a start-tag attribute specification
BSEQLEN 960 Maximum length of blank sequence mappable to a short reference string
DTAGLEN 16 Maximum length of data tag string
DTEMPLEN 16 Maximum length of data tag template or pattern template
ENTLVL 16 Maximum number of nesting levels for entities
GRPCNT 32 Maximum number of tokens in group (one level)
GRPGTCNT 96 Maximum number of tokens at all levels in a model group (data tag groups count as 3 tokens)
GRPLVL 16 Maximum number of nesting levels in a model group
LITLEN 240 Maximum length of a delimited literal (within delimiters)
NAMELEN 8 Maximum length of names, numbers, tokens, etc.
NORMSEP 2 Default separator length when calculating the normalized length of names, tokens, etc.
PILEN 240 Maximum length of processing instructions
TAGLEN 960 Maximum length of start-tags
TAGLVL 24 Maximum number of open elements

Figure 4.7 Default Quantities

The most restrictive entries in the default quantity set are:

These entries often need to be increased from their default values. When SGML is revised it is anticipated that the default values will be changed to 32, 2048, 32 and 64 respectively. Most SGML parsers already default to these, or higher, levels, though they should still warn users when the standard values have been exceeded.
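The relationship between the default quantity set and a declaration that raises some of its limits can be sketched in Python. The function name `effective_quantities` is hypothetical, and the two overrides in the example are arbitrary illustrative values rather than values taken from any particular declaration.

```python
# Default quantity limits of the reference concrete syntax (Figure 4.7).
REFERENCE_QUANTITIES = {
    "ATTCNT": 40, "ATTSPLEN": 960, "BSEQLEN": 960, "DTAGLEN": 16,
    "DTEMPLEN": 16, "ENTLVL": 16, "GRPCNT": 32, "GRPGTCNT": 96,
    "GRPLVL": 16, "LITLEN": 240, "NAMELEN": 8, "NORMSEP": 2,
    "PILEN": 240, "TAGLEN": 960, "TAGLVL": 24,
}


def effective_quantities(overrides: dict) -> dict:
    """Start from the SGMLREF defaults and apply the listed replacements."""
    unknown = set(overrides) - set(REFERENCE_QUANTITIES)
    if unknown:
        raise KeyError(f"not a quantity name: {sorted(unknown)}")
    return {**REFERENCE_QUANTITIES, **overrides}


# Models an entry of the form: QUANTITY SGMLREF NAMELEN 32 LITLEN 2048
quantities = effective_quantities({"NAMELEN": 32, "LITLEN": 2048})
print(quantities["NAMELEN"], quantities["LITLEN"], quantities["TAGLVL"])  # 32 2048 24
```

Quantities that are not listed after SGMLREF keep their default values, which is why TAGLVL is still 24 in the example.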

4.3 Other Clauses in the SGML Declaration

4.3.1 The Character Set description

The BASESET and DESCSET clauses in the character set description (CHARSET) that starts the SGML declaration are used to define the character set used within an SGML document. By default the ISO 646 character set used for markup is defined as the first component of the document's character set. This default document character set can be extended by referencing other ISO character sets. For example, the 96 character supplementary set of Latin accented characters, as defined in ISO 8859/1, could be added to the document's character set by placing the following entries underneath the standard DESCSET entry in the CHARSET clause at the start of the SGML declaration shown in Figure 4.1:

   BASESET "ISO 8859-1:1987//CHARSET Right Part of 
                                     Latin Alphabet No. 1//ESC 2/13 4/1"
   DESCSET 128 32 UNUSED -- Control character positions --
           160 96 160    -- 96 characters in set --
          

The extra characters would be accessed by codes with values between 160 and 255, the codes from 128 to 159 being left unused.

Where the 16-bit ISO 10646 code set is required the default definitions for the base set, and its associated description, can be changed to read:

   BASESET "ISO Registration Number 176//CHARSET ISO/IEC 10646-1:1993
            UCS-2 with implementation level 3//ESC 2/5 2/15 4/5"
   DESCSET   0      9  UNUSED
             9      2  9
            11      2  UNUSED
            13      1  13
            14     18  UNUSED
            32     95  32
           127      1  UNUSED
           128     32  UNUSED
           160  65374  160

Note: The first 128 codes of ISO/IEC 10646 are the 128 codes defined in ISO 646.

To use the full 32-bit ISO 10646 UCS-4 code set this would be changed to read:

   BASESET  "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993
             UCS-4 with implementation level 3//ESC 2/5 2/15 4/6"
   DESCSET    0 9          UNUSED
              9 2          9
             11 2          UNUSED
             13 1          13
             14 18         UNUSED
             32 95         32
            127 1          UNUSED
            128 32         UNUSED
            160 2147483486 160

Note: The above character set was proposed as the definition to be used for the internationalized version of the HyperText Markup Language (HTML) on the World Wide Web in August 1996.

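The described character set portion of the default document character set description shown in Figure 4.1 defines the purpose of the characters in ISO 646 more clearly than the matching entry in the syntax clause. It can be interpreted as follows: codes 0 to 8 are unused; codes 9 and 10 (horizontal tab and line feed) map to the same positions in ISO 646; codes 11 and 12 are unused; code 13 (carriage return) maps to position 13; codes 14 to 31 are unused; the 95 codes from 32 to 126 map to the matching printable characters; and code 127 (delete) is unused.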
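The triples in a DESCSET entry can also be expanded mechanically. The following Python sketch treats each triple as a starting code in the document character set, a count, and either a starting code in the base set or the keyword UNUSED. The function name `describe` and the list name `FIGURE_4_1` are hypothetical.

```python
def describe(descset):
    """Expand DESCSET triples into {document code: base set code}.

    Positions marked UNUSED are simply left out of the result.
    """
    mapping = {}
    for start, count, base in descset:
        if base == "UNUSED":
            continue
        for offset in range(count):
            mapping[start + offset] = base + offset
    return mapping


# The DESCSET entries of the document character set in Figure 4.1.
FIGURE_4_1 = [
    (0, 9, "UNUSED"), (9, 2, 9), (11, 2, "UNUSED"), (13, 1, 13),
    (14, 18, "UNUSED"), (32, 95, 32), (127, 1, "UNUSED"),
]

usable = describe(FIGURE_4_1)
print(sorted(usable)[:4])  # [9, 10, 13, 32]
print(len(usable))         # 98
```

Of the 128 codes, only 98 are usable in a basic SGML document: the tab, line feed and carriage return codes plus the 95 printable characters.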

4.3.2 The Capacity Set

The capacity set used with the reference concrete syntax is shown in Figure 4.8. This reference capacity set restricts the total number of stored markup characters within an SGML document to 35000 characters, but places no restrictions on the capacity of any one of the component parts of the markup, which can take up all 35000 characters of storage if required. In some large documents it is possible for this default total capacity to be exceeded.

Note: Most current SGML systems will ignore the default capacity set restrictions, perhaps providing a warning message to users if the default limits are exceeded. Modern large-memory systems do not have the memory restrictions that were typically found in desktop systems of the 1980s, where it was important to warn users of large documents that they could exceed the program's memory allocation. Many of the existing restrictions defined in the capacity set clause will be removed when ISO 8879 is next updated.

Name Default Value Points Per Unit Purpose
TOTALCAP 35000 All Grand total of capacity points
ENTCAP 35000 NAMELEN Entity name capacity
ENTCHCAP 35000 1 No. of entity replacement characters
ELEMCAP 35000 NAMELEN Element name capacity
GRPCAP 35000 NAMELEN Tokens within model groups (data tag groups count as 3 tokens)
EXGRPCAP 35000 NAMELEN Number of exception groups
EXNMCAP 35000 NAMELEN Tokens within exception groups
ATTCAP 35000 NAMELEN Attribute name capacity
ATTCHCAP 35000 1 Attribute default values capacity
AVGRPCAP 35000 NAMELEN Attribute value token capacity
NOTCAP 35000 NAMELEN Data content notation name capacity
NOTCHCAP 35000 1 No. of characters in notation identifiers
IDCAP 35000 NAMELEN Explicit or default ID value capacity
IDREFCAP 35000 NAMELEN Explicit or default IDREF value capacity
MAPCAP 35000 NAMELEN Short reference map declaration capacity
LKSETCAP 35000 NAMELEN Link set/type declaration capacity
LKNMCAP 35000 NAMELEN Link/document type name storage capacity

Figure 4.8 Reference capacity set
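The points-per-unit column can be read as a simple weighting scheme, sketched below in Python. The sketch assumes that each counted object contributes the listed number of points, that NAMELEN has its reference value of 8, and that only a handful of the capacities are in play. The document statistics are invented for illustration.

```python
NAMELEN = 8        # reference quantity set value
TOTALCAP = 35000   # reference capacity set grand total

# Points charged per unit for a few of the capacities in Figure 4.8.
POINTS_PER_UNIT = {
    "ENTCAP": NAMELEN, "ENTCHCAP": 1, "ELEMCAP": NAMELEN,
    "ATTCAP": NAMELEN, "ATTCHCAP": 1, "IDCAP": NAMELEN,
}


def capacity_points(units: dict) -> dict:
    """Convert unit counts into capacity points for each named capacity."""
    return {name: count * POINTS_PER_UNIT[name] for name, count in units.items()}


# Hypothetical document: 250 entities holding 12000 characters of
# replacement text, 120 element types and 2000 ID values.
usage = capacity_points({"ENTCAP": 250, "ENTCHCAP": 12000,
                         "ELEMCAP": 120, "IDCAP": 2000})
total = sum(usage.values())
print(total, total <= TOTALCAP)  # 30960 True
```

In this invented case no individual capacity comes close to its own limit of 35000, yet another 600 ID values would be enough to push the grand total past TOTALCAP.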

4.3.3 The Scope clause

By default the scope of the concrete syntax declared in an SGML declaration is the whole document (i.e. the syntax is used in both the document prolog and the document instance). If, however, the syntax defined in the SYNTAX clause is only used to mark up the text (i.e. all declarations have been coded using the reference concrete syntax), the default SCOPE DOCUMENT entry can be changed to read SCOPE INSTANCE.

4.3.4 The Features clause

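The FEATURES clause of the SGML declaration shows which of SGML's optional features are required to process the document. The optional features are grouped into markup minimization features (MINIMIZE: DATATAG, OMITTAG, RANK and SHORTTAG), link process features (LINK: SIMPLE, IMPLICIT and EXPLICIT) and other features (OTHER: CONCUR for concurrent document instances, SUBDOC for nested subdocuments and FORMAL for formal public identifiers).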

By default only the OMITTAG and SHORTTAG options are available, all other options being set to NO.
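The default settings can be read out of the FEATURES clause of Figure 4.1 with a few lines of Python. This is a sketch under a stated simplification: it handles only the plain YES/NO form shown in that figure, and the function name `parse_features` is hypothetical.

```python
# FEATURES clause of Figure 4.1, without the leading FEATURES keyword.
FEATURES = """MINIMIZE  DATATAG NO  OMITTAG  YES  RANK     NO  SHORTTAG YES
LINK      SIMPLE  NO  IMPLICIT NO   EXPLICIT NO
OTHER     CONCUR  NO  SUBDOC   NO   FORMAL   NO"""

GROUPS = {"MINIMIZE", "LINK", "OTHER"}


def parse_features(text: str) -> dict:
    """Map each feature name to True (YES) or False (NO)."""
    tokens = text.split()
    features = {}
    index = 0
    while index < len(tokens):
        if tokens[index] in GROUPS:
            index += 1  # group keywords carry no value of their own
            continue
        features[tokens[index]] = tokens[index + 1] == "YES"
        index += 2
    return features


enabled = [name for name, on in parse_features(FEATURES).items() if on]
print(enabled)  # ['OMITTAG', 'SHORTTAG']
```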

4.3.5 Application-specific information

The last clause in the SGML declaration can be used to transmit any application-specific information (APPINFO) needed to process the document. For example, a document that uses the ISO/IEC 10744 Hypermedia/Time-based Structuring Language (HyTime) application of SGML would have an entry reading APPINFO "HyTime". When no application-specific information needs to be exchanged the default entry of APPINFO NONE applies.

4.4 Alternative concrete syntaxes

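ISO 8879 also identifies some special sets of alternative concrete syntaxes. The most important of these are the core concrete syntax, the multicode basic concrete syntax and the multicode core concrete syntax.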

The core concrete syntax is exactly the same as the reference concrete syntax except that the SHORTREF entry in the DELIM section is followed by NONE rather than SGMLREF. A document prepared using the core concrete syntax is referred to as a minimal SGML document.

Where the code extension techniques defined in ISO 2022 are being used to extend the character set beyond the 95 characters available in the reference concrete syntax, the multicode basic concrete syntax defined in Annex D of ISO 8879 can be used. If the short reference facility is not required the equivalent multicode core concrete syntax can be used.

Where characters outside the standard ISO 646 unaccented Latin alphabet are required in markup, variants of the reference concrete syntax will be needed. Each such variant concrete syntax can be publicly declared as a public concrete syntax and given a public identifier that can be used to call it from within the SGML declaration. For example, a German variant concrete syntax might be identified as:

     SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Deutscher Hinweis//DE"

4.4.1 The HTML SGML declaration

The most famous variant concrete syntax is that used for the HyperText Markup Language (HTML). In the definition of Version 2.0 of this language, in RFC 1866, the following SGML declaration was specified:

<!SGML "ISO 8879:1986"
-- SGML Declaration for HyperText Markup Language (HTML). --

CHARSET BASESET "ISO 646:1983//CHARSET
                 International Reference Version
                 (IRV)//ESC 2/5 4/0"
        DESCSET   0 9  UNUSED
                  9 2  9
                 11 2  UNUSED
                 13 1  13
                 14 18 UNUSED
                 32 95 32
                127 1  UNUSED

        BASESET "ISO Registration Number 100//CHARSET
                ECMA-94 Right Part of
                Latin Alphabet Nr. 1//ESC 2/13 4/1"
        DESCSET 128 32 UNUSED
                160 96 32

CAPACITY SGMLREF
         TOTALCAP 150000
         GRPCAP 150000
         ENTCAP 150000

SCOPE DOCUMENT

SYNTAX SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
                         17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
       BASESET "ISO 646:1983//CHARSET
                International Reference Version
                (IRV)//ESC 2/5 4/0"
       DESCSET 0 128 0
       FUNCTION RE 13
                RS 10
                SPACE 32
                TAB SEPCHAR 9
       NAMING   LCNMSTRT ""
                UCNMSTRT ""
                LCNMCHAR ".-"
                UCNMCHAR ".-"
       NAMECASE GENERAL YES
                ENTITY NO
       DELIM    GENERAL SGMLREF
                SHORTREF SGMLREF
       NAMES    SGMLREF
       QUANTITY SGMLREF
                ATTSPLEN 2100
                LITLEN 1024
                NAMELEN 72 -- somewhat arbitrary; taken from
                              Internet line length conventions --
                PILEN 1024
                TAGLVL 100
                TAGLEN 2100
                GRPGTCNT 150
                GRPCNT 64

FEATURES MINIMIZE DATATAG NO
                  OMITTAG YES
                  RANK NO
                  SHORTTAG YES
         LINK     SIMPLE NO
                  IMPLICIT NO
                  EXPLICIT NO
         OTHER    CONCUR NO
                  SUBDOC NO
                  FORMAL YES
APPINFO "SDA" -- conforming SGML Document Access application -- >

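This SGML declaration specifies the following changes to the default SGML declaration: the document character set is extended with the right part of Latin Alphabet No. 1 at codes 160 to 255; the TOTALCAP, GRPCAP and ENTCAP capacities are raised to 150000; code 255 is dropped from the list of shunned characters; the ATTSPLEN, LITLEN, NAMELEN, PILEN, TAGLVL, TAGLEN, GRPGTCNT and GRPCNT quantities are increased; the FORMAL feature is set to YES; and the application-specific information reads "SDA".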

For version 4.0 of the HTML DTD, which supports multiple languages and the use of bidirectional texts, the following SGML declaration should be used to invoke the full ISO/IEC 10646 character set:

   <!SGML  "ISO 8879:1986"
    --
         SGML Declaration for HyperText Markup Language version 4.0

         With support for the first 17 planes of ISO 10646 and
         increased limits for tag and literal lengths etc.
    --

    CHARSET
          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 160     55136   160
                 55296   2048    UNUSED  -- SURROGATES --
                 57344   1056768 57344

CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000
                ENTCAP          150000

SCOPE    DOCUMENT
SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
           17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"    
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   60      -- increased --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   64

FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  YES
    RANK     NO
    SHORTTAG YES
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
  APPINFO NONE
 >
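The DESCSET entries in the declaration above are triples of starting character number, number of characters, and either a base-set character number or UNUSED. A minimal sketch, assuming a hypothetical helper named descset_end and using None to stand for UNUSED, can confirm that the entries follow one another without gaps and that they stop at the end of the seventeenth plane:

```python
# Illustrative only: the DESCSET triples are copied from the HTML 4.0
# declaration above; None stands for the UNUSED keyword.
HTML4_DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 55136, 160),
    (55296, 2048, None),      # surrogates
    (57344, 1056768, 57344),
]


def descset_end(entries):
    """Return the first character number after the last entry.

    Raises ValueError if an entry does not start exactly where the
    previous one stopped (a gap or an overlap).
    """
    expected_start = 0
    for start, count, _base in entries:
        if start != expected_start:
            raise ValueError(
                f"entry at {start} should start at {expected_start}"
            )
        expected_start = start + count
    return expected_start


end = descset_end(HTML4_DESCSET)
print(hex(end))          # 0x110000
print(end // 0x10000)    # 17 planes of 65536 characters each
```

The same check can be applied to the DESCSET of any other declaration by swapping in its triples.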

Extensions to the SGML Declaration

Warning: Before using the extensions listed below you should ensure that both the document creation and document receiving systems can process these additional features.

Two extensions to the facilities provided by SGML declarations have been defined in the form of optional annexes to ISO 8879:

Annex J allows SGML to make full use of the extensive ranges of characters provided in the ISO/IEC 10646 character set by providing for the specification of ranges of characters and for identifying name characters for which case substitution is not permitted.

Annex K provides additional controls for optional features of SGML, and relaxes some of the previously mandatory restrictions to allow for situations where parts of the document type definition may not be accessible due to network constraints. Annex K also includes facilities for identifying where externally defined constraints on the use of declarations have been defined.

The following example shows how these extensions can be used to create an SGML declaration that defines the syntax used for the World Wide Web Consortium's Extensible Markup Language (XML):

<!SGML   -- SGML Declaration for XML --
         "ISO 8879:1986 (WWW)"
CHARSET  BASESET
         "ISO Registration Number 176//CHARSET
          ISO/IEC 10646-1:1993 UCS-4 with implementation
          level 3//ESC 2/5 2/15 4/6"
         DESCSET    0         9  UNUSED
                    9         2  9
                   11         2  UNUSED
                   13         1  13
                   14        18  UNUSED
                   32        95  32
                  127         1  UNUSED
                  128        32  UNUSED
                  160   1113952  160    -- 160 + 1113952 = 0x110000 --
CAPACITY NONE
SCOPE    DOCUMENT
SYNTAX   SHUNCHAR NONE
         BASESET "ISO Registration Number 176//CHARSET
                  ISO/IEC 10646-1:1993 UCS-4 with implementation
                  level 3//ESC 2/5 2/15 4/6"
         DESCSET   0 1114112 0
         FUNCTION  RE    13
                   RS    10
                   SPACE 32 
                   TAB   SEPCHAR 9
         NAMING    LCNMSTRT ""
                   UCNMSTRT ""
                   NAMESTRT 95 170 181 186 192-214 216-246 248-305 305-383 383-452
                            452-455 455-458 458-497 497-501 506-535 592-680 688-696
                            699-705 736-740 768-837 864-865 890 902 904-906 908
                            910-929 914 920 922 928-929 931-939 934 940-974 976-982
                            986 988 990 992 994-1011 1025-1036 1038-1103 1105-1116
                            1118-1153 1155-1158 1168-1220 1223-1224 1227-1228
                            1232-1259 1262-1269 1272-1273 1329-1366 1369 1377-1415
                            1425-1441 1443-1465 1467-1469 1471 1473-1474 1476
                            1488-1514 1520-1522 1569-1594 1601-1618 1648-1719
                            1722-1726 1728-1742 1744-1747 1749-1768 1770-1773
                            2305-2307 2309-2361 2364-2381 2385-2388 2392-2403
                            2433-2435 2437-2444 2447-2448 2451-2472 2474-2480 2482
                            2486-2489 2492 2494-2500 2503-2504 2507-2509 2519
                            2524-2525 2527-2531 2544-2545 2562 2565-2570 2575-2576
                            2579-2600 2602-2608 2610-2611 2613-2614 2616-2617 2620
                            2622-2626 2631-2632 2635-2637 2649-2652 2654 2672-2676
                            2689-2691 2693-2699 2701 2703-2705 2707-2728 2730-2736
                            2738-2739 2741-2745 2748-2757 2759-2761 2763-2765 2784
                            2817-2819 2821-2828 2831-2832 2835-2856 2858-2864
                            2866-2867 2870-2873 2876-2883 2887-2888 2891-2893
                            2902-2903 2908-2909 2911-2913 2946-2947 2949-2954
                            2958-2960 2962-2965 2969-2970 2972 2974-2975 2979-2980
                            2984-2986 2990-2997 2999-3001 3006-3010 3014-3016
                            3018-3021 3031 3073-3075 3077-3084 3086-3088 3090-3112
                            3114-3123 3125-3129 3134-3140 3142-3144 3146-3149
                            3157-3158 3168-3169 3202-3203 3205-3212 3214-3216
                            3218-3240 3242-3251 3253-3257 3262-3268 3270-3272
                            3274-3277 3285-3286 3294 3296-3297 3330-3331 3333-3340
                            3342-3344 3346-3368 3370-3385 3390-3395 3398-3400
                            3402-3405 3415 3424-3425 3585-3630 3632-3642 3648-3653
                            3655-3662 3713-3714 3716 3719-3720 3722 3725 3732-3735
                            3737-3743 3745-3747 3749 3751 3754-3755 3757-3758
                            3760-3769 3771-3773 3776-3780 3784-3789 3804-3805
                            3864-3865 3893 3895 3897 3902-3911 3913-3945 3953-3972
                            3974-3979 3984-3989 3991 3993-4013 4017-4023 4025
                            4256-4293 4304-4342 4352-4441 4447-4514 4520-4601
                            7680-7835 7840-7929 7936-7957 7960-7965 7968-8005
                            8008-8013 8016-8023 8025 8027 8029 8031-8061 8064-8116
                            8118-8124 8126 8130-8132 8134-8140 8144-8147 8150-8155
                            8160-8172 8178-8180 8182-8188 8319 8400-8412 8417 8450
                            8455 8458-8467 8469 8472-8477 8484 8486 8488 8490-8497
                            8499-8504 8544-8578 12295 12321-12335 12353-12436
                            12441-12442 12449-12538 12549-12588 12593-12686
                            19968-40869 44032-55203
                   LCNMCHAR ""
                   UCNMCHAR ""    
                   NAMECHAR 45 46 58 183 720 721 1600 1632-1641 1776-1785 2406-2415
                            2534-2543 2662-2671 2790-2799 2918-2927 3047-3055
                            3174-3183 3302-3311 3430-3439 3654 3664-3673 3782
                            3792-3801 3872-3881 8204-8207 8234-8238 8298-8303 12293
                            12337-12341 12443-12446 12540-12542    
         NAMECASE  GENERAL  NO
                   ENTITY   NO
         DELIM     GENERAL  SGMLREF 
                            HCRO     "&#38;#x"  -- 38 = ampersand --
                            NESTC    "/"
                            NET      ">"
                            PIC      "?>"
                   SHORTREF NONE
         NAMES     SGMLREF
         QUANTITY  NONE
         ENTITIES  "amp" 38 "lt" 60 "gt" 62 "quot" 34 "apos" 39
FEATURES MINIMIZE  DATATAG NO  OMITTAG NO  RANK NO    
                   SHORTTAG STARTTAG EMPTY   NO  UNCLOSED NO  NETENABL IMMEDNET
                            ENDTAG   EMPTY   NO  UNCLOSED NO
                            ATTRIB   DEFAULT YES OMITNAME NO  VALUE NO 
                   EMPTYNRM YES
                   IMPLYDEF ATTLIST  YES
                            DOCTYPE  YES 
                            ELEMENT  YES 
                            ENTITY   YES
                            NOTATION YES 
         LINK  SIMPLE   NO  IMPLICIT NO  EXPLICIT NO
         OTHER CONCUR   NO  SUBDOC   NO  FORMAL   NO  URN  NO
               KEEPRSRE YES VALIDITY TAG ENTITIES REF ANY  INTEGRAL YES
APPINFO NONE SEEALSO "http://www.w3.org/TR/PR-xml-971208"
>

Extended Naming Rules

To allow large character sets to be defined for SGML markup, Annex J of ISO 8879 defines the following extensions for SGML declarations:

  1. The use of decimal character numbers to identify characters to be added to the default set of name characters.
  2. The specification of contiguous ranges of valid name characters by the specification of two decimal character numbers connected by a hyphen.
  3. The addition of a NAMESTRT keyword within NAMING to allow name start characters that are not case sensitive to be defined separately from those that are case sensitive.
  4. The addition of a NAMECHAR keyword within NAMING to allow name characters that are not case sensitive to be defined separately from those that are case sensitive.

Examples of the use of these facilities can be seen in the NAMING section of the SGML declaration for XML shown above.
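To make the notation concrete, the following sketch reads a short extract of a NAMESTRT list and tests whether a given character number is covered by it. The function names are hypothetical, and the sketch only looks at the characters added by the list; it does not model the default name start characters or case substitution.

```python
def parse_char_numbers(spec):
    """Turn a string such as "95 170 192-214" into (low, high) pairs.

    A single decimal number becomes a one-character range.
    """
    ranges = []
    for token in spec.split():
        low, _, high = token.partition("-")
        ranges.append((int(low), int(high or low)))
    return ranges


def is_listed(code, ranges):
    """Report whether a character number falls inside any listed range."""
    return any(low <= code <= high for low, high in ranges)


# The first few entries of the NAMESTRT list in the XML declaration.
NAMESTRT_SAMPLE = parse_char_numbers("95 170 181 186 192-214 216-246")

print(is_listed(ord("_"), NAMESTRT_SAMPLE))   # True: 95 is listed
print(is_listed(201, NAMESTRT_SAMPLE))        # True: 201 lies in 192-214
print(is_listed(215, NAMESTRT_SAMPLE))        # False: 215 is skipped
```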

When the Extended Naming Rules are being used without applying the extensions defined in Annex K, the minimum literal following <!SGML must be extended to read "ISO 8879:1986 (ENR)".

Web SGML Adaptations

When both the extensions in Annex J and those in Annex K apply to an SGML declaration the minimum literal at the start of the definition is extended to read "ISO 8879:1986 (WWW)", where WWW stands for World Wide Web.

The additions provided by Annex K are:

  1. The ability to switch off capacity checking by specifying CAPACITY NONE.
  2. The ability to switch off quantity checking by specifying QUANTITY NONE in the SYNTAX clause.
  3. Two additional delimiters, to allow character references to be entered using hexadecimal numbers and to allow the null end-tag start delimiter to differ from that used to indicate the position of the null end-tag.
  4. An option to specify a predefined set of data character entity names in the SYNTAX clause.
  5. Additional controls for short forms of markup tags to allow separate selection of the rules for start-tags, attributes in start-tags and end-tags.
  6. An optional minimization feature to allow end-tags to be assigned to elements that are declared to be EMPTY.
  7. An additional minimization option to allow a fixed impliable definition of markup declarations to be used when no explicit definition is provided.
  8. An optional feature to allow systems to require the use of Internet Uniform Resource Names for public identifiers.
  9. An optional feature to allow Record Start and Record End codes to be retained during parsing where they would otherwise be discarded.
  10. An optional feature to control the types of validity checking to be carried out by the SGML parser.
  11. An optional feature to allow constraints to be placed on where entities can be used and whether or not elements can cross entity boundaries.
  12. An optional pointer to a file containing details of constraints over and above those provided in the SGML declaration that should be applied for a specific class of documents.
  13. An option to store the clauses that form the body of an SGML declaration in a separate file.

Hexadecimal character references

Because ISO/IEC 10646 character sets are typically displayed as 'rows' of 256 characters (16 columns of 16 characters) it is often easier to reference them using a hexadecimal (base 16) number than a decimal (base 10) number. For this reason the Web SGML adaptations include a new delimiter name, hexadecimal character reference open (HCRO), which can be used in the DELIM section of the SYNTAX clause. A typical use of this option is:

DELIM GENERAL SGMLREF HCRO "&#38;#x"

Note particularly the use of an embedded (decimal) character reference, which indicates that the delimiter starts with an ampersand (&) code. This form of double escaping is required to ensure that an error is not reported when the parser checks the contents of the string defining the delimiter.
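The relationship between the two forms of character reference can be sketched as follows. This is not an SGML parser: it assumes the delimiters shown in this chapter (&#, &#x and a closing semicolon), and the function names are hypothetical.

```python
import re

# Matches &#233; (decimal) and &#xE9; (hexadecimal) style references.
_REFERENCE = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def _code_of(match):
    """Return the character number named by a matched reference."""
    body = match.group(1)
    return int(body[1:], 16) if body.startswith("x") else int(body)


def expand_references(text):
    """Replace every numeric character reference with its character."""
    return _REFERENCE.sub(lambda m: chr(_code_of(m)), text)


def to_hex_reference(char):
    """Write a character as a hexadecimal character reference."""
    return f"&#x{ord(char):X};"


print(to_hex_reference("\u00e9"))               # &#xE9;
print(expand_references("&#233;") == "\u00e9")  # True
print(expand_references("&#xE9;") == "\u00e9")  # True
# The delimiter string from the declaration: &#38; expands to an
# ampersand, leaving the three-character delimiter &#x.
print(expand_references("&#38;#x"))             # &#x
```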

Predefined character data entities

At the end of the SYNTAX clause an optional new entry can be added to specify named character data entities that can be used to escape markup characters. It is recommended that all characters that are defined as the first character in a markup delimiter be provided with escape entity names to allow them to be used within the contents of elements or entities. For example, XML defines the following set of default entities:

ENTITIES  "amp" 38 "lt" 60 "gt" 62 "quot" 34 "apos" 39

This declaration states that the characters used to identify the start or end of markup declarations, processing instructions, elements, attribute values and entity references within a document instance can be identified using predefined character data entities as follows:

Delimiters using character by default   Default character (decimal value)   Predefined entity name in XML
ERO                                     & (38)                              &amp;
STAGO, ETAGO, MDO and PIO               < (60)                              &lt;
TAGC, MDC and PIC                       > (62)                              &gt;
LIT                                     " (34)                              &quot;
LITA                                    ' (39)                              &apos;
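As an illustration of how an application might use these predefined entities, the sketch below escapes the five characters named in the XML ENTITIES entry so that they can appear in element content without being read as markup. The function name escape_markup is hypothetical; the names and character numbers are taken from the entry shown earlier.

```python
# Names and decimal character numbers from the XML ENTITIES entry.
XML_ENTITIES = {"amp": 38, "lt": 60, "gt": 62, "quot": 34, "apos": 39}

# Map each character to its entity reference, e.g. "<" to "&lt;".
_REFERENCE_FOR = {
    chr(code): f"&{name};" for name, code in XML_ENTITIES.items()
}


def escape_markup(text):
    """Replace each markup character with its predefined entity reference.

    Working one character at a time means an ampersand that is produced
    by an earlier replacement is never escaped a second time.
    """
    return "".join(_REFERENCE_FOR.get(char, char) for char in text)


print(escape_markup('if a < b && c == "x"'))
# if a &lt; b &amp;&amp; c == &quot;x&quot;
```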

Short tag form control

In the MINIMIZE section of the FEATURES clause the options that can be associated with SHORTTAG have been extended. Instead of just saying NO to indicate that minimization of tags is not allowed or YES to say that all forms of tag minimization are allowed, you can now select each of the minimization options individually. If the new option is used then three new keywords must be added to the declaration, and each of these keywords must be followed by entries that consist of a keyword identifying a minimization option followed by an appropriate value.

The SHORTTAG option sets are:

Empty element ending rules

To enable the use of end-tags with empty elements users can add the optional EMPTYNRM YES empty element ending rules specification to the end of the MINIMIZE part of the FEATURES clause. When this empty element normalization option is activated, omission of the end-tag is controlled by the tag omission rules of the element. If the default rules for the automatic omission of end-tags from empty elements are to apply then EMPTYNRM NO must be specified.

Note: If this option is specified the options for implicit definitions must immediately follow it.

Implicit definitions

When empty element ending rules are specified they must be immediately followed by the keyword IMPLYDEF followed by the following keywords, in the order shown:

Note: When ENTITY YES is specified the reserved name #DEFAULT cannot be assigned to an entity declaration. DOCTYPE YES cannot be specified if concurrent or linked document types are being used.

The URN feature

To specify that public identifiers must be entered in the form of Internet Uniform Resource Names (URNs) you can now add URN YES after the FORMAL NO option at the end of the FEATURES clause. If no entry is specified, URN NO is assumed, indicating that no checking of public identifiers is required.

Note: If this option is used it must be immediately followed by the options listed under the Other new features heading below.

Other new features

When the URN feature has been added to the FEATURES clause it must be followed by specifications relating to the following new options:

SEEALSO

When the Web SGML Adaptations are being used, the APPINFO clause can be extended with a SEEALSO statement followed by a public identifier referencing a file that describes any additional constraints to be applied by applications using the SGML declaration. For example, the constraints specified in the XML specification can be referenced using the following statement:

APPINFO NONE SEEALSO "http://www.w3.org/TR/PR-xml-971208"

More than one public identifier can be specified if appropriate. If no additional rules apply the entry can either be omitted or changed to SEEALSO NONE.

SGML declaration references

When the Web SGML Adaptations are being used, a shortened form of SGML declaration can be associated with a document type declaration. This takes the form of an SGML declaration reference to an externally stored SGML declaration body. The format of the shortened declaration is:

<!SGML name external-identifier? >

where name is a reference concrete syntax name used to identify the SGML declaration and the optional external-identifier identifies the external entity that contains the clauses of the SGML declaration body. (If it is omitted, the system is assumed to be able to use the name to find the relevant SGML declaration: this is equivalent to having an unqualified system identifier of SYSTEM for the external identifier.)

The SGML declaration used by the Extensible Markup Language (XML) shown above could be referenced using the new Internet Domain Name form for formal public identifiers as:

<!SGML XML PUBLIC "+//IDN www.w3c.org//SD SGML declaration body for XML//EN">

The file referenced must start with the minimum literal that indicates which version of the SGML declaration is being used, followed by definitions for each of the clauses that make up the SGML declaration body. Comments can be interspersed between clauses, but may not precede the minimum literal.

If the system knows where to find an SGML declaration known as XML all that needs to be added to the start of the file is <!SGML XML>.
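The public identifier used in the reference above can be taken apart mechanically. The sketch below assumes the usual layout of a formal public identifier with an owner prefix (prefix, owner identifier, then a text identifier made up of a public text class, a description and a language, separated by //). It is written only for identifiers of that shape, and the function name is hypothetical.

```python
def split_formal_public_identifier(fpi):
    """Split a prefixed formal public identifier into labelled parts.

    Assumes exactly four fields separated by "//"; identifiers without
    an owner prefix, or with extra fields, raise ValueError here.
    """
    prefix, owner, text, language = fpi.split("//")
    text_class, _, description = text.partition(" ")
    return {
        "registered": prefix == "+",   # "+" marks a registered owner
        "owner": owner,
        "class": text_class,           # SD: SGML declaration body
        "description": description,
        "language": language,
    }


parts = split_formal_public_identifier(
    "+//IDN www.w3c.org//SD SGML declaration body for XML//EN"
)
print(parts["owner"])        # IDN www.w3c.org
print(parts["class"])        # SD
print(parts["description"])  # SGML declaration body for XML
print(parts["language"])     # EN
```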

References

Readers wishing to know more about the role of the SGML declaration should refer to the following books:

Goldfarb, C.F. (1990) The SGML Handbook Oxford: Clarendon Press

Bryan, M.T. (1987) SGML: An Author's Guide to the Standard Generalized Markup Language Wokingham: Addison-Wesley

The following ISO standards define character sets typically referenced in SGML declarations:

International Organization for Standardization/International Electrotechnical Commission (1991), Information Processing - 7-bit coded character set for information interchange (ISO/IEC 646:1991) Geneva: ISO

International Organization for Standardization/International Electrotechnical Commission (1994), Information technology - Coded graphic character set for text communication - Latin alphabet (ISO/IEC 6937:1994) Geneva: ISO

International Organization for Standardization (1987), Information processing - 8-bit single-byte coded graphic character sets Parts 1-10. (ISO 8859:1987) Geneva: ISO

International Organization for Standardization/International Electrotechnical Commission (1993), Information technology - Universal Multiple-Octet Coded Character Set (UCS) (ISO/IEC 10646:1993) Geneva: ISO

Details of the Web SGML Adaptations can be found in:

Web SGML Adaptations, Annex K to ISO 8879:1986, ISO/IEC JTC1/WG4, December 1997