© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
This chapter explains briefly the component parts of the SGML declaration, and tries to give some idea of the role each part plays. It covers:
Many readers will find the concepts covered in this chapter difficult to grasp at first reading. Do not worry if you do not understand the role of any part of the SGML declaration at this point. You are not meant to at this stage! The reason for asking you to read quickly through this chapter at the beginning of this explanation of SGML is that the restrictions imposed by the SGML declaration are fundamental to understanding many of the rules in SGML. Terms introduced in this chapter will be used throughout the remainder of the book. When you return to this chapter to remind yourself of the concepts these terms refer to, you should find that the summary given here explains the restrictions imposed on other SGML constructs.
When interchanging documents it is important that each transmitted code has a well defined function. In addition it is important that document markup can be correctly distinguished from codes that form the text of the document.
The rules defining the meanings of the constructs used by a particular language are known as the syntax of that language. Two distinct types of syntax have been defined for SGML:
This chapter will introduce you to many of the terms used to describe SGML's abstract syntax. The use to which the abstract syntax is put will be explained in the following chapters.
One particular concrete syntax, called the reference concrete syntax, has been formally defined within ISO 8879:1986 to provide a reference against which variant concrete syntaxes can be compared. It is a requirement of conforming SGML systems that they be able to parse documents conforming to the reference concrete syntax.
Each SGML document transferred to another system should be accompanied by a declaration, called the SGML declaration, which defines the coding scheme used in its preparation. Figure 4.1 shows the SGML declaration that should be used if a document is transmitted without an SGML declaration. (Such documents are referred to as basic SGML documents.)
<!SGML "ISO 8879:1986"
-- Declaration for typical Basic SGML Document --
CHARSET BASESET "ISO 646:1983//CHARSET International
Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
SCOPE DOCUMENT
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN"
FEATURES MINIMIZE DATATAG NO OMITTAG YES RANK NO SHORTTAG YES
LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
OTHER CONCUR NO SUBDOC NO FORMAL NO
APPINFO NONE
>
Figure 4.1 SGML declaration for basic SGML document
The SGML declaration starts with a markup declaration open
(mdo) sequence consisting of the codes <! .
The declaration is closed by a matching markup declaration close
(mdc) angle bracket (>) at the end of the
declaration.
The rest of the first line of the SGML declaration consists of the letters
SGML followed by a delimited string containing the number and date
of the ISO standard in which SGML is defined ("ISO 8879:1986").
This statement indicates which version of the standard was used to prepare the
following declarations.
The second line of the default SGML declaration contains some text bracketed by pairs of hyphens. Text entered in an SGML markup declaration between pairs of hyphens is treated as a comment. In this case the comment acts as a heading explaining the purpose of the following entries.
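The comment convention can be illustrated with a minimal Python sketch. The function name `strip_comments` and the abbreviated sample string are assumptions made for illustration. The sketch also ignores the possibility of a pair of hyphens appearing inside a quoted literal, which a real SGML parser would have to handle.

```python
import re

def strip_comments(declaration: str) -> str:
    """Remove text bracketed by pairs of hyphens ('--' ... '--').

    Simplification: hyphen pairs inside quoted literals are not protected.
    """
    # Non-greedy match, so each comment ends at the nearest closing '--'.
    return re.sub(r"--.*?--", "", declaration, flags=re.DOTALL)

# Abbreviated fragment based on Figure 4.1 (most clauses omitted).
sample = ('<!SGML "ISO 8879:1986" '
          '-- Declaration for typical Basic SGML Document -- '
          'APPINFO NONE >')

# Collapse the whitespace left behind by the removed comment.
cleaned = " ".join(strip_comments(sample).split())
print(cleaned)
```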
The names of the six main clauses that make up an SGML declaration are shown in the first column of the SGML declaration. They identify:
the document character set (CHARSET)
the capacity set (CAPACITY)
the scope of the concrete syntax (SCOPE)
the concrete syntax (SYNTAX)
the optional features used (FEATURES)
application-specific information (APPINFO).
A key part of the SGML declaration is the SYNTAX clause, which controls the codes that can be used for document markup. In Figure 4.1 the syntax has been entered as a formal public identifier which references the default syntax defined in ISO 8879:1986, which is shown in Figure 4.2.
SYNTAX SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 127 255
BASESET "ISO 646-1983//CHARSET International
Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 128 0
FUNCTION RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR "-."
UCNMCHAR "-."
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
Figure 4.2 Formal definition of reference concrete syntax
The declaration for SGML's reference concrete syntax given in the SYNTAX clause shown in Figure 4.2 contains eight subclause definitions, each identified by a keyword. These define:
the characters to be avoided (SHUNCHAR: shunned characters)
the base character set (BASESET) declaration followed by a description of how these characters are to be used to define the concrete syntax (DESCSET: described character set)
the function characters (FUNCTION)
the naming rules (NAMING)
the delimiters (DELIM)
the reserved names (NAMES)
the quantity set (QUANTITY).
The base character set used for the reference concrete syntax is that defined in international standard ISO 646. This 7-bit character set, known as the International Reference Version (IRV), is used as a starting point for all international standards that define character sets, e.g. ISO 6937, ISO 8859 and ISO/IEC 10646.
Note: A revision of ISO 646 took place in 1991. The revision (ISO
646:1991) matches the American Standard Code for Information Interchange (ASCII)
used by many computer systems. (ISO 646 does allocate different names to some of
the control characters, but these names do not affect the way these codes are
used.) In addition it has been identified that the ISO 2022 Escape sequence used
for ISO 646 in the SGML reference concrete syntax was incorrect: it should have
been ESC 2/8 4/0. Strictly speaking, therefore, the reference
concrete syntax should be updated to read "ISO 646:1991//CHARSET
International Reference Version (IRV)//ESC 2/8 4/0". In practice it
is likely that the next revision of SGML will adopt the 16-bit version of
ISO/IEC 10646:1993 as its default code set.
The described character set portion of the reference concrete syntax character set description shows that 128 characters, starting from position 0 in the list, should be mapped to identical positions in the reference concrete syntax. Figure 4.3 shows the 128 codes defined in ISO 646.
| Decimal value | Hexadecimal value | ISO (16-bit) representation | ISO name/character | Purpose |
|---|---|---|---|---|
| 0 | 0 | 0/0 | NUL | Null code |
| 1 | 1 | 0/1 | TC1/SOH | Transmission code 1 / Start of header |
| 2 | 2 | 0/2 | TC2/STX | Transmission code 2 / Start of text |
| 3 | 3 | 0/3 | TC3/ETX | Transmission code 3 / End of text |
| 4 | 4 | 0/4 | TC4/EOT | Transmission code 4 / End of transmission |
| 5 | 5 | 0/5 | TC5/ENQ | Transmission code 5 / Enquiry |
| 6 | 6 | 0/6 | TC6/ACK | Transmission code 6 / Acknowledge |
| 7 | 7 | 0/7 | BEL | Bell |
| 8 | 8 | 0/8 | FE0/BS | Format effector 0 / Backspace |
| 9 | 9 | 0/9 | FE1/HT | Format effector 1 / Horizontal tab |
| 10 | A | 0/10 | FE2/LF | Format effector 2 / Line feed |
| 11 | B | 0/11 | FE3/VT | Format effector 3 / Vertical tab |
| 12 | C | 0/12 | FE4/FF | Format effector 4 / Form feed |
| 13 | D | 0/13 | FE5/CR | Format effector 5 / Carriage return |
| 14 | E | 0/14 | SO | Shift out |
| 15 | F | 0/15 | SI | Shift in |
| 16 | 10 | 1/0 | TC7/DLE | Transmission code 7 / Data link escape |
| 17 | 11 | 1/1 | DC1 | Device control character 1 |
| 18 | 12 | 1/2 | DC2 | Device control character 2 |
| 19 | 13 | 1/3 | DC3 | Device control character 3 |
| 20 | 14 | 1/4 | DC4 | Device control character 4 |
| 21 | 15 | 1/5 | TC8/NAK | Transmission code 8 / Negative acknowledge |
| 22 | 16 | 1/6 | TC9/SYN | Transmission code 9 / Synchronize |
| 23 | 17 | 1/7 | TC10/ETB | Transmission code 10 / End of transmission block |
| 24 | 18 | 1/8 | CAN | Cancel |
| 25 | 19 | 1/9 | EM | End of medium |
| 26 | 1A | 1/10 | SUB | Substitute character |
| 27 | 1B | 1/11 | ESC | Escape |
| 28 | 1C | 1/12 | FS/DT/IS4 | File separator / ISO 6937 document terminator |
| 29 | 1D | 1/13 | GS/PT/IS3 | Group separator / ISO 6937 page terminator |
| 30 | 1E | 1/14 | RS/IS2 | Record separator |
| 31 | 1F | 1/15 | US/IS1 | Unit separator |
| 32 | 20 | 2/0 | Space | |
| 33 | 21 | 2/1 | ! | Exclamation mark |
| 34 | 22 | 2/2 | " | Quotation mark |
| 35 | 23 | 2/3 | # | Number sign |
| 36 | 24 | 2/4 | ¤ | General currency sign (Dollar in ISO 646:1991) |
| 37 | 25 | 2/5 | % | Percent |
| 38 | 26 | 2/6 | & | Ampersand |
| 39 | 27 | 2/7 | ' | Apostrophe |
| 40 | 28 | 2/8 | ( | Left parenthesis |
| 41 | 29 | 2/9 | ) | Right parenthesis |
| 42 | 2A | 2/10 | * | Asterisk |
| 43 | 2B | 2/11 | + | Plus sign |
| 44 | 2C | 2/12 | , | Comma |
| 45 | 2D | 2/13 | - | Hyphen |
| 46 | 2E | 2/14 | . | Full stop (Period) |
| 47 | 2F | 2/15 | / | Forward Slash (Solidus) |
| 48 | 30 | 3/0 | 0 | |
| 49 | 31 | 3/1 | 1 | |
| 50 | 32 | 3/2 | 2 | |
| 51 | 33 | 3/3 | 3 | |
| 52 | 34 | 3/4 | 4 | |
| 53 | 35 | 3/5 | 5 | |
| 54 | 36 | 3/6 | 6 | |
| 55 | 37 | 3/7 | 7 | |
| 56 | 38 | 3/8 | 8 | |
| 57 | 39 | 3/9 | 9 | |
| 58 | 3A | 3/10 | : | Colon |
| 59 | 3B | 3/11 | ; | Semicolon |
| 60 | 3C | 3/12 | < | Less-than sign |
| 61 | 3D | 3/13 | = | Equals sign |
| 62 | 3E | 3/14 | > | Greater-than sign |
| 63 | 3F | 3/15 | ? | Question mark |
| 64 | 40 | 4/0 | @ | Commercial at |
| 65 | 41 | 4/1 | A | |
| 66 | 42 | 4/2 | B | |
| 67 | 43 | 4/3 | C | |
| 68 | 44 | 4/4 | D | |
| 69 | 45 | 4/5 | E | |
| 70 | 46 | 4/6 | F | |
| 71 | 47 | 4/7 | G | |
| 72 | 48 | 4/8 | H | |
| 73 | 49 | 4/9 | I | |
| 74 | 4A | 4/10 | J | |
| 75 | 4B | 4/11 | K | |
| 76 | 4C | 4/12 | L | |
| 77 | 4D | 4/13 | M | |
| 78 | 4E | 4/14 | N | |
| 79 | 4F | 4/15 | O | |
| 80 | 50 | 5/0 | P | |
| 81 | 51 | 5/1 | Q | |
| 82 | 52 | 5/2 | R | |
| 83 | 53 | 5/3 | S | |
| 84 | 54 | 5/4 | T | |
| 85 | 55 | 5/5 | U | |
| 86 | 56 | 5/6 | V | |
| 87 | 57 | 5/7 | W | |
| 88 | 58 | 5/8 | X | |
| 89 | 59 | 5/9 | Y | |
| 90 | 5A | 5/10 | Z | |
| 91 | 5B | 5/11 | [ | Left square bracket |
| 92 | 5C | 5/12 | \ | Backward slash (Reverse solidus) |
| 93 | 5D | 5/13 | ] | Right square bracket |
| 94 | 5E | 5/14 | ^ | Circumflex accent |
| 95 | 5F | 5/15 | _ | Low line |
| 96 | 60 | 6/0 | ` | Grave accent |
| 97 | 61 | 6/1 | a | |
| 98 | 62 | 6/2 | b | |
| 99 | 63 | 6/3 | c | |
| 100 | 64 | 6/4 | d | |
| 101 | 65 | 6/5 | e | |
| 102 | 66 | 6/6 | f | |
| 103 | 67 | 6/7 | g | |
| 104 | 68 | 6/8 | h | |
| 105 | 69 | 6/9 | i | |
| 106 | 6A | 6/10 | j | |
| 107 | 6B | 6/11 | k | |
| 108 | 6C | 6/12 | l | |
| 109 | 6D | 6/13 | m | |
| 110 | 6E | 6/14 | n | |
| 111 | 6F | 6/15 | o | |
| 112 | 70 | 7/0 | p | |
| 113 | 71 | 7/1 | q | |
| 114 | 72 | 7/2 | r | |
| 115 | 73 | 7/3 | s | |
| 116 | 74 | 7/4 | t | |
| 117 | 75 | 7/5 | u | |
| 118 | 76 | 7/6 | v | |
| 119 | 77 | 7/7 | w | |
| 120 | 78 | 7/8 | x | |
| 121 | 79 | 7/9 | y | |
| 122 | 7A | 7/10 | z | |
| 123 | 7B | 7/11 | { | Left curly bracket |
| 124 | 7C | 7/12 | \| | Vertical line |
| 125 | 7D | 7/13 | } | Right curly bracket |
| 126 | 7E | 7/14 | ~ | Tilde |
| 127 | 7F | 7/15 | DEL | Delete |
Figure 4.3 The ISO 646 character set
Codes with values less than 32, and the code with a value of 127, have been allocated to control functions, while the 95 codes with values between 32 and 126 are associated with printable (data) characters. Note that the character numbers entered in the SHUNCHAR section of the syntax clause shown in Figure 4.2 include all of those defined as control codes within ISO 646:
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 127 255
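The relationship between the ISO 646 control codes and the SHUNCHAR list can be checked with a short Python sketch. The variable names are illustrative assumptions. The numbers come from Figure 4.2 and from the description of the control codes given above.

```python
# Control codes in ISO 646: every value below 32, plus 127 (DEL).
control_codes = [c for c in range(128) if c < 32 or c == 127]

# Printable (data) characters: values 32 to 126 inclusive.
printable_codes = [c for c in range(128) if 32 <= c <= 126]

# SHUNCHAR character numbers as listed in Figure 4.2.
shunchar = list(range(0, 32)) + [127, 255]

# Entries in SHUNCHAR that are not ISO 646 control codes.
extra = sorted(set(shunchar) - set(control_codes))

print(len(control_codes), len(printable_codes), extra)
```

The sketch shows that every ISO 646 control code appears in the list, and that the one additional entry, 255, lies outside the 7-bit range of ISO 646.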
There are, however, certain control codes that are significant within an
SGML document, not as characters but as codes which serve particular functions.
These codes are identified in the FUNCTION section of the syntax
definition. In the case of the reference concrete syntax four functions are
defined:
record end (RE)
record start (RS)
space (SPACE)
horizontal tab (TAB).
The carriage return code (13) is used as the Record End code for the reference concrete syntax, with the line feed code (10) being used for the Record Start. The special rules that apply to the processing of these codes are explained in the section headed The effect of record boundaries in Chapter 11.
The Space character (32) is treated as
a function character because it has a special function as a separator
within SGML markup declarations. The Tab code (9) can also be used as a
separator, but as it does not have exactly the same role as the space it is
placed into a special group of separator characters identified
by the SEPCHAR control word.
Additional function codes can be specified by adding to the list a triplet consisting of:
The types of function class that can be identified in SGML are:
SEPCHAR - separator character
MSOCHAR - markup-scan-out character
MSICHAR - markup-scan-in character
MSSCHAR - markup-scan-suppress character
FUNCHAR - unspecified form of function character.
The most commonly used function classes are SEPCHAR, which is used for all codes that can separate the component parts of a markup declaration (in addition to RE, RS and SPACE), and FUNCHAR, which is used to identify system specific functions.
Note: Markup scanning is suppressed between codes defined as markup-scan-out characters and codes defined as markup-scan-in characters, and for the code immediately following a markup-scan-suppress character.
The NAMING section of the syntax clause identifies which
characters can be used in tag or entity names, and in SGML unique identifiers.
By default SGML presumes that names can
only start with alphabetic characters, in either shift, with subsequent
characters being alphanumeric. The LCNMSTRT and UCNMSTRT
entries in the syntax clause allow other, non-alphanumeric, characters to be
defined as name start characters, the
LCNMCHAR and UCNMCHAR entries defining which
non-alphanumeric characters can be used as name
characters after a name start character.
The reference concrete syntax only allows alphabetic characters to be used as name start characters, but within names the unaccented alphanumeric characters (a-z, A-Z and 0-9) can be supplemented by full stops and hyphens.
Note: Digits cannot be used as the first character of an SGML name.
Other characters that are required as parts of tag, attribute or entity names, or within unique identifiers, must be declared as valid name characters by putting the appropriate characters in the uppercase and lowercase name start or name character strings. The position of the entries in the string is important as characters in position n in the lowercase string may be replaced by the character in position n in the uppercase string during parsing. If there is no uppercase equivalent the lowercase character must be repeated in the uppercase string (and vice versa).
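The naming rules of the reference concrete syntax can be sketched in Python as follows. The function names `is_name` and `fold_upper` are assumptions made for illustration, and the default length limit of 8 is the NAMELEN value of the reference quantity set. This is a simplified model rather than a full implementation of SGML name parsing.

```python
import string

# NAMING entries of the reference concrete syntax (Figure 4.2).
LCNMSTRT, UCNMSTRT = "", ""
LCNMCHAR, UCNMCHAR = "-.", "-."

def is_name(token: str, namelen: int = 8) -> bool:
    """Check a token against the name start and name character rules."""
    start = set(string.ascii_letters) | set(LCNMSTRT) | set(UCNMSTRT)
    rest = start | set(string.digits) | set(LCNMCHAR) | set(UCNMCHAR)
    return (0 < len(token) <= namelen
            and token[0] in start
            and all(ch in rest for ch in token[1:]))

def fold_upper(token: str) -> str:
    """Replace each lowercase character by the character in the same
    position of the matching uppercase string."""
    table = str.maketrans(
        string.ascii_lowercase + LCNMSTRT + LCNMCHAR,
        string.ascii_uppercase + UCNMSTRT + UCNMCHAR,
    )
    return token.translate(table)

print(is_name("sub.sect"), is_name("2col"), fold_upper("sub.sect"))
```

Because the substitution is positional, a full stop or hyphen maps to itself here: each is repeated at the same position in both the lowercase and uppercase strings.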
The NAMECASE entries of the syntax clause show that, by default, the reference concrete syntax allows uppercase substitution of lowercase characters within element and related markup (GENERAL YES) but for entity names such substitution is not permitted (ENTITY NO). This allows different entity declarations to be defined for &Eacute; and &eacute;, etc., while allowing <p> and <P> to be treated identically.
The GENERAL SGMLREF entry in the DELIM section of the syntax clause shows that the general default set of SGML delimiters is used in the reference concrete syntax. Figure 4.4 lists these default delimiters and shows the formal name assigned to each delimiter.
| Character(s) | Name | Purpose |
|---|---|---|
| & | ERO | Entity reference open |
| or & | AND | And connector (within declaration group) |
| &# | CRO | Character reference open |
| % | PERO | Parameter entity reference open |
| ; | REFC | Entity reference close |
| < | STAGO | Start-tag open |
| </ | ETAGO | End-tag open |
| <! | MDO | Markup declaration open |
| <? | PIO | Processing instruction open |
| > | TAGC | Tag close |
| or > | MDC | Markup declaration close |
| or > | PIC | Processing instruction close |
| ( | GRPO | Group open (within declaration) |
| ) | GRPC | Group close (within declaration) |
| [ | DSO | Declaration subset open |
| or [ | DTGO | Data tag group open |
| ] | DSC | Declaration subset close |
| or ] | DTGC | Data tag group close |
| ]] | MSC | Marked section close |
| " | LIT | Start or end of literal string |
| ' | LITA | Alternative start or end of literal string |
| = | VI | Value indicator (within attributes) |
| -- | COM | Start and end of comment |
| - | MINUS | Exclusion set identifier |
| + | PLUS | Inclusion set identifier |
| or + | PLUS | Required and repeatable occurrence indicator |
| * | REP | Optional and repeatable occurrence indicator |
| ? | OPT | Optional occurrence indicator |
| \| | OR | Or connector (within declaration group) |
| , | SEQ | Sequence connector (within declaration group) |
| / | NET | Null end-tag |
| # | RNI | Reserved name indicator |
Figure 4.4 Reference concrete syntax delimiter set
Note that some codes are assigned more than one meaning. This is because the meaning of a markup delimiter is dependent on the context in which it is encountered. There are 10 different markup contexts:
CON - Recognized in content, including marked section content
CXT - Recognized within either CON or DSM context
DS - Recognized only within a declaration subset
DSM - Recognized within a declaration subset or a marked section
GRP - Recognized within a group
LIT - Recognized within a literal
MD - Recognized within a markup declaration
PI - Recognized within a processing instruction
REF - Recognized within an entity or character reference
TAG - Recognized within a start-tag or end-tag.
Figure 4.5 shows which delimiters are recognized in which contexts.
| Context | Delimiters recognized |
|---|---|
| CON | CRO ERO STAGO ETAGO NET MDO MSC PIO and short reference delimiters |
| CXT | COM DSO GRPO MDC TAGC |
| DS | DSC |
| DSM | MDO MSC PERO PIO |
| GRP | GRPO GRPC LIT LITA PERO AND OR SEQ PLUS REP RNI DTGO DTGC |
| LIT | CRO ERO LIT LITA PERO |
| MD | COM DSO DSC GRPO LIT LITA MINUS PLUS PERO RNI |
| PI | PIC |
| REF | REFC |
| TAG | STAGO ETAGO TAGC VI LIT LITA |
Figure 4.5 Contexts in which delimiters can be recognized
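The context dependence shown in Figure 4.5 can be expressed as a simple lookup. The Python sketch below copies the figure into a dictionary. The names `CONTEXTS` and `contexts_for` are assumptions made for illustration, and the short reference delimiters recognized in CON are left out for brevity.

```python
# Delimiter roles recognized in each context, as listed in Figure 4.5.
CONTEXTS = {
    "CON": "CRO ERO STAGO ETAGO NET MDO MSC PIO".split(),
    "CXT": "COM DSO GRPO MDC TAGC".split(),
    "DS":  "DSC".split(),
    "DSM": "MDO MSC PERO PIO".split(),
    "GRP": "GRPO GRPC LIT LITA PERO AND OR SEQ PLUS REP RNI DTGO DTGC".split(),
    "LIT": "CRO ERO LIT LITA PERO".split(),
    "MD":  "COM DSO DSC GRPO LIT LITA MINUS PLUS PERO RNI".split(),
    "PI":  "PIC".split(),
    "REF": "REFC".split(),
    "TAG": "STAGO ETAGO TAGC VI LIT LITA".split(),
}

def contexts_for(delimiter: str) -> list[str]:
    """Return the contexts in which a delimiter role is recognized."""
    return sorted(ctx for ctx, roles in CONTEXTS.items() if delimiter in roles)

print(contexts_for("PERO"))  # parameter entity reference open
print(contexts_for("MDO"))   # markup declaration open
```

A lookup of this kind makes it easy to see why one character can carry several roles: the roles ERO and AND, for example, both use the ampersand but are never recognized in the same context.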
The
SHORTREF SGMLREF entry in the DELIM section of the
syntax clause shows that the standard set of SGML short reference delimiters,
shown in Figure 4.6, can be used in conjunction with the
reference concrete syntax.
Character(s) Number(s) Purpose
&#TAB; 9 Horizontal tab
&#RS; 10 Record start (line feed)
&#RE; 13 Record end (carriage return)
32 Space
" 34 Quotation mark
# 35 Number sign
% 37 Percent
' 39 Apostrophe
( 40 Left parenthesis
) 41 Right parenthesis
* 42 Asterisk
+ 43 Plus sign
, 44 Comma
- 45 Hyphen
: 58 Colon
; 59 Semicolon
= 61 Equals sign
@ 64 Commercial at
[ 91 Left square bracket
] 93 Right square bracket
^ 94 Circumflex accent
_ 95 Low line
{ 123 Left curly bracket
| 124 Vertical line
} 125 Right curly bracket
~ 126 Tilde
-- 45,45 Two hyphens
BB 66,66 Two or more blanks (spaces or tabs)
B&#RE; 66,13 Trailing blank(s) followed by record end
&#RS;B 10,66 Record start followed by leading blanks
&#RS;B&#RE; 10,66,13 Blank records (one or more blanks)
&#RS;&#RE; 10,13 Empty record
Figure 4.6 Reference concrete syntax short reference delimiters
In the reference concrete syntax most punctuation characters can be used as short reference delimiters, though tag delimiters (&, <, /, !, ? and >), and certain other significant symbols (e.g. grave accent, backslash, full stop and the general currency sign) are excluded. Six special code sequences are also defined, five of which allow common word processor line ending conventions to be used as short reference strings.
The QUANTITY entry at the end of the syntax clause also requires the presence of the SGMLREF keyword to indicate that, unless otherwise specified, the default quantity set will be used. Figure 4.7 shows the default quantity limits.
| Reserved Name | Value | Purpose |
|---|---|---|
| ATTCNT | 40 | Maximum number of attribute names and name tokens in an attribute definition list |
| ATTSPLEN | 960 | Maximum length of a start-tag attribute specification |
| BSEQLEN | 960 | Maximum length of blank sequence mappable to a short reference string |
| DTAGLEN | 16 | Maximum length of data tag string |
| DTEMPLEN | 16 | Maximum length of data tag template or pattern template |
| ENTLVL | 16 | Maximum number of nesting levels for entities |
| GRPCNT | 32 | Maximum number of tokens in group (one level) |
| GRPGTCNT | 96 | Maximum number of tokens at all levels in a model group (data tag groups count as 3 tokens) |
| GRPLVL | 16 | Maximum number of nesting levels in a model group |
| LITLEN | 240 | Maximum length of a delimited literal (within delimiters) |
| NAMELEN | 8 | Maximum length of names, numbers, tokens, etc. |
| NORMSEP | 2 | Default separator length when calculating the normalized length of names, tokens, etc. |
| PILEN | 240 | Maximum length of processing instructions |
| TAGLEN | 960 | Maximum length of start-tags |
| TAGLVL | 24 | Maximum number of open elements |
Figure 4.7 Default Quantities
The most restrictive entries in the default quantity set are:
NAMELEN, which restricts the maximum length of entity and tag names used with the reference concrete syntax to eight characters
LITLEN, which restricts the maximum length of an entity replacement string to 240 characters
TAGLVL, which restricts the number of nested (open) tags to 24
GRPCNT, which restricts the number of elements within a single model group to 32.
These entries often need to be increased from their default values. When SGML is revised it is anticipated that the default values will be changed to 32, 2048, 32 and 64 respectively. Most SGML parsers already default to these, or higher, levels, though they should still warn users when the standard values have been exceeded.
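The effect of these four limits can be illustrated with a small Python sketch. The function name `exceeded` and the measured values are hypothetical. The two sets of limits are the reference values from Figure 4.7 and the anticipated revised values quoted above.

```python
# The four most restrictive reference quantities (Figure 4.7).
REFERENCE = {"NAMELEN": 8, "LITLEN": 240, "TAGLVL": 24, "GRPCNT": 32}

# Values anticipated for a revised standard, as quoted in the text.
ANTICIPATED = {"NAMELEN": 32, "LITLEN": 2048, "TAGLVL": 32, "GRPCNT": 64}

def exceeded(measured: dict[str, int], limits: dict[str, int]) -> list[str]:
    """Return the names of the quantities whose limit is exceeded."""
    return sorted(name for name, value in measured.items()
                  if value > limits[name])

# Hypothetical measurements taken from a document type definition.
measured = {
    "NAMELEN": len("blockquote"),  # a ten-character element name
    "LITLEN": 300,                 # longest entity replacement string
    "TAGLVL": 12,                  # deepest nesting of open elements
    "GRPCNT": 40,                  # largest single model group
}

print(exceeded(measured, REFERENCE))
print(exceeded(measured, ANTICIPATED))
```

A parser that relaxes the limits could use a check of this kind to issue the warnings mentioned above instead of rejecting the document.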
The BASESET and DESCSET clauses in the character
set description (CHARSET) that starts the SGML
declaration are used to define the character set used within an SGML document.
By default the ISO 646 character set used for markup is defined as the first
component of the document's character set. This default document character set
can be extended by referencing other ISO character sets. For example, the 96
character supplementary set of Latin accented characters, as defined in
ISO 8859/1, could be added to the
document's character set by placing the following entries underneath the
standard DESCSET entry in the CHARSET clause at the start of the SGML
declaration shown in
Figure 4.1:
BASESET "ISO 8859-1:1987//CHARSET Right Part of
Latin Alphabet No. 1//ESC 2/13 4/1"
DESCSET 128 32 UNUSED -- Control character positions --
160 96 160 -- 96 characters in set --
The extra characters would be accessed by codes with values between 160 and 255, the codes between 128 and 159 being left unused.
Where the 16-bit ISO 10646 code set is required the default definitions for the base set, and its associated description, can be changed to read:
BASESET "ISO Registration Number 176//CHARSET ISO/IEC 10646-1:1993
UCS-2 with implementation level 3//ESC 2/5 2/15 4/5"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 65374 160
Note: The first 128 codes of ISO/IEC 10646 are the 128 codes defined in ISO 646.
To use the full 32-bit ISO 10646 UCS-4 code set this would be changed to read:
BASESET "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993
UCS-4 with implementation level 3//ESC 2/5 2/15 4/6"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 2147483486 160
Note: The above character set was proposed as the definition to be used for the internationalized version of the HyperText Markup Language (HTML) on the World Wide Web in August 1996.
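The character counts in the final DESCSET entries of the two ISO/IEC 10646 listings above can be checked with a little arithmetic: the last code described is the starting number plus the count, minus one. The Python sketch below is illustrative only, and its variable names are assumptions.

```python
# Final DESCSET entry of each listing: (first code, number of codes).
final_entries = {
    "UCS-2": (160, 65374),
    "UCS-4": (160, 2147483486),
}

last_codes = {}
for name, (first, count) in final_entries.items():
    # Codes first .. first + count - 1 are described by the entry.
    last_codes[name] = first + count - 1
    print(name, last_codes[name], hex(last_codes[name]))
```

In both cases the described range stops two codes short of the top of the code space (16 bits for UCS-2, 31 bits for UCS-4).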
The described character set portion of the default document character set description shown in Figure 4.1 defines the purpose of the characters in ISO 646 more clearly than the matching entry in the syntax clause. It can be interpreted as:
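The triplets in that DESCSET can also be read mechanically. In the minimal Python sketch below each entry is treated as a starting character number, a count and either a base set number or the keyword UNUSED. The function name `describe` and the wording of its output are assumptions made for illustration.

```python
# DESCSET entries from the CHARSET clause of Figure 4.1:
# (first document character number, number of characters, base set number).
FIGURE_4_1 = [
    (0, 9, "UNUSED"),
    (9, 2, 9),
    (11, 2, "UNUSED"),
    (13, 1, 13),
    (14, 18, "UNUSED"),
    (32, 95, 32),
    (127, 1, "UNUSED"),
]

def describe(descset):
    """Turn each DESCSET triplet into a one-line description."""
    lines = []
    for first, count, target in descset:
        last = first + count - 1
        span = str(first) if count == 1 else f"{first}-{last}"
        if target == "UNUSED":
            lines.append(f"{span}: unused")
        else:
            lines.append(f"{span}: mapped to base set starting at {target}")
    return lines

for line in describe(FIGURE_4_1):
    print(line)
```

Read this way, the only control codes that remain available in a basic SGML document are 9, 10 and 13, which are the codes that the FUNCTION section of the reference concrete syntax assigns to TAB, RS and RE.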
The capacity set used with the reference concrete syntax is shown in Figure 4.8. This reference capacity set restricts the total number of stored markup characters within an SGML document to 35000 characters, but places no restrictions on the capacity of any one of the component parts of the markup, which can take up all 35000 bytes of storage if required. In some large documents it is possible for this default total capacity to be exceeded.
Note: Most current SGML systems will ignore the default capacity set restrictions, perhaps providing a warning message to users if the default limits are exceeded. Modern large-memory systems do not have the memory restrictions that were typically found in desktop systems of the 1980s, where it was important to warn users of large documents that they could exceed the program's memory allocation. Many of the existing restrictions defined in the capacity set clause will be removed when ISO 8879 is next updated.
| Name | Default Value | Points Per Unit | Purpose |
|---|---|---|---|
| TOTALCAP | 35000 | All | Grand total of capacity points |
| ENTCAP | 35000 | NAMELEN | Entity name capacity |
| ENTCHCAP | 35000 | 1 | No. of entity replacement characters |
| ELEMCAP | 35000 | NAMELEN | Element name capacity |
| GRPCAP | 35000 | NAMELEN | Tokens within model groups (data tag groups count as 3 tokens) |
| EXGRPCAP | 35000 | NAMELEN | Number of exception groups |
| EXNMCAP | 35000 | NAMELEN | Tokens within exception groups |
| ATTCAP | 35000 | NAMELEN | Attribute name capacity |
| ATTCHCAP | 35000 | 1 | Attribute default values capacity |
| AVGRPCAP | 35000 | NAMELEN | Attribute value token capacity |
| NOTCAP | 35000 | NAMELEN | Data content notation name capacity |
| NOTCHCAP | 35000 | 1 | No. of characters in notation identifiers |
| IDCAP | 35000 | NAMELEN | Explicit or default ID value capacity |
| IDREFCAP | 35000 | NAMELEN | Explicit or default IDREF value capacity |
| MAPCAP | 35000 | NAMELEN | Short reference map declaration capacity |
| LKSETCAP | 35000 | NAMELEN | Link set/type declaration capacity |
| LKNMCAP | 35000 | NAMELEN | Link/document type name storage capacity |
Figure 4.8 Reference capacity set
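The way capacity points accumulate can be sketched in Python. The sketch assumes that each counted object costs the number of points shown in the Points Per Unit column of Figure 4.8, with NAMELEN set to the reference value of 8. The object counts are hypothetical, and only a few of the capacities are modelled.

```python
NAMELEN = 8        # reference quantity set value
TOTALCAP = 35000   # reference capacity set grand total

# Points charged per counted object, for a few capacities in Figure 4.8.
POINTS_PER_UNIT = {
    "ENTCAP": NAMELEN,   # per entity defined
    "ENTCHCAP": 1,       # per character of entity replacement text
    "ELEMCAP": NAMELEN,  # per element defined
    "ATTCAP": NAMELEN,   # per attribute defined
    "ATTCHCAP": 1,       # per character of attribute default values
}

def capacity_points(counts: dict[str, int]) -> dict[str, int]:
    """Convert object counts into capacity points."""
    return {name: n * POINTS_PER_UNIT[name] for name, n in counts.items()}

# Hypothetical document type: 120 elements, 300 entities whose
# replacement text totals 9000 characters.
usage = capacity_points({"ELEMCAP": 120, "ENTCAP": 300, "ENTCHCAP": 9000})
total = sum(usage.values())

print(usage, total, total <= TOTALCAP)
```

Under these assumptions a few hundred entity declarations already consume a noticeable share of the 35000 points, which is why larger capacity values are commonly declared.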
By default the SCOPE clause of an SGML declaration covers the whole document (i.e. the syntax is used in both the document prolog and the document instance). If, however, the character set defined in the syntax section is only used to mark up the text (i.e. all declarations have been coded using the reference concrete syntax) the default SCOPE DOCUMENT entry can be changed to read SCOPE INSTANCE.
The FEATURES clause of the SGML declaration shows which of
SGML's optional features are required to process the document. The optional
features are:
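data tag minimization (DATATAG)
omitted tag minimization (OMITTAG)
short tag minimization (SHORTTAG)
ranked element minimization (RANK)
link process definitions (SIMPLE, IMPLICIT or EXPLICIT links)
concurrent document instances (CONCUR)
nested subdocuments (SUBDOC)
formal public identifiers (FORMAL).
By default only the OMITTAG and SHORTTAG options are available, all other options being set to NO.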
The last clause in the SGML declaration can be used to transmit any application-specific information (APPINFO) needed to process the document. For example, a document that uses the ISO/IEC 10744 Hypermedia/Time-based Structuring Language (HyTime) application of SGML would have an entry reading APPINFO "HyTime". When no application-specific information needs to be exchanged the default entry of APPINFO NONE applies.
ISO 8879 also identifies some special sets of alternative concrete syntaxes. The most important of these are:
The core concrete syntax is exactly the same as the
reference concrete syntax except that the SHORTREF entry in the
DELIM section is followed by NONE rather than
SGMLREF. A document prepared using the core concrete syntax is
referred to as a minimal SGML document.
Where the code extension techniques defined in ISO 2022 are being used to extend the character set beyond the 95 characters available in the reference concrete syntax, the multicode basic concrete syntax defined in Annex D of ISO 8879 can be used. If the short reference facility is not required the equivalent multicode core concrete syntax can be used.
Where characters outside the standard ISO 646 unaccented Latin alphabet are required in markup, variants of the reference concrete syntax will be needed. Each such variant concrete syntax can be publicly declared as a public concrete syntax and given a public identifier that can be used to call it from within the SGML declaration. For example, a German variant concrete syntax might be identified as:
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Deutscher Hinweis//DE"
The most famous variant concrete syntax is that used for the HyperText Markup Language (HTML). In the definition of Version 2.0 of this language, in RFC 1866, the following SGML declaration was specified:
<!SGML "ISO 8879:1986"
-- SGML Declaration for HyperText Markup Language (HTML). --
CHARSET BASESET "ISO 646:1983//CHARSET
International Reference Version
(IRV)//ESC 2/5 4/0"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
BASESET "ISO Registration Number 100//CHARSET
ECMA-94 Right Part of
Latin Alphabet Nr. 1//ESC 2/13 4/1"
DESCSET 128 32 UNUSED
160 96 32
CAPACITY SGMLREF
TOTALCAP 150000
GRPCAP 150000
ENTCAP 150000
SCOPE DOCUMENT
SYNTAX SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
BASESET "ISO 646:1983//CHARSET
International Reference Version
(IRV)//ESC 2/5 4/0"
DESCSET 0 128 0
FUNCTION RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR ".-"
UCNMCHAR ".-"
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
ATTSPLEN 2100
LITLEN 1024
NAMELEN 72 -- somewhat arbitrary; taken from
Internet line length conventions --
PILEN 1024
TAGLVL 100
TAGLEN 2100
GRPGTCNT 150
GRPCNT 64
FEATURES MINIMIZE DATATAG NO
OMITTAG YES
RANK NO
SHORTTAG YES
LINK SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER CONCUR NO
SUBDOC NO
FORMAL YES
APPINFO "SDA" -- conforming SGML Document Access application -- >
This SGML declaration specifies the following changes to the default SGML declaration:
Note: As these characters have not also been specified as part of the
SYNTAX clause they cannot be used within markup, only within the
text of the document instance.
APPINFO clause.
For version 4.0 of the HTML DTD, which supports multiple languages and the use of bidirectional texts, the following SGML declaration should be used to invoke the full ISO/IEC 10646 character set:
<!SGML "ISO 8879:1986"
--
SGML Declaration for HyperText Markup Language version 4.0
With support for the first 17 planes of ISO 10646 and
increased limits for tag and literal lengths etc.
--
CHARSET
BASESET "ISO Registration Number 177//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with
implementation level 3//ESC 2/5 2/15 4/6"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 55136 160
55296 2048 UNUSED -- SURROGATES --
57344 1056768 57344
CAPACITY SGMLREF
TOTALCAP 150000
GRPCAP 150000
ENTCAP 150000
SCOPE DOCUMENT
SYNTAX
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
BASESET "ISO 646IRV:1991//CHARSET
International Reference Version
(IRV)//ESC 2/8 4/2"
DESCSET 0 128 0
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR ".-_:"
UCNMCHAR ".-_:"
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
ATTCNT 60 -- increased --
ATTSPLEN 65536 -- These are the largest values --
LITLEN 65536 -- permitted in the declaration --
NAMELEN 65536 -- Avoid fixed limits in actual --
PILEN 65536 -- implementations of HTML UA's --
TAGLVL 100
TAGLEN 65536
GRPGTCNT 150
GRPCNT 64
FEATURES
MINIMIZE
DATATAG NO
OMITTAG YES
RANK NO
SHORTTAG YES
LINK
SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER
CONCUR NO
SUBDOC NO
FORMAL YES
APPINFO NONE
>
Warning: Before using the extensions listed below you should ensure that both the document creation and document receiving systems can process these additional features.
Two extensions to the facilities provided by SGML declarations have been defined in the form of optional annexes to ISO 8879:
Annex J allows SGML to make full use of the extensive ranges of characters provided in the ISO/IEC 10646 character set by providing for the specification of ranges of characters and for identifying name characters for which case substitution is not permitted.
Annex K provides additional controls for optional features of SGML, and relaxes some of the previously mandatory restrictions to allow for situations where parts of the document type definition may not be accessible due to network constraints. Annex K also includes facilities for identifying where externally defined constraints on the use of declarations have been defined.
The following example shows how these extensions can be used to create an SGML declaration that defines the syntax used for the World Wide Web Consortium's Extensible Markup Language (XML):
<!SGML -- SGML Declaration for XML --
"ISO 8879:1986 (WWW)"
CHARSET BASESET
"ISO Registration Number 176//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with implementation
level 3//ESC 2/5 2/15 4/6"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 1113952 160 -- 160 + 1113952 = 0x110000 --
CAPACITY NONE
SCOPE DOCUMENT
SYNTAX SHUNCHAR NONE
BASESET "ISO Registration Number 176//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with implementation
level 3//ESC 2/5 2/15 4/6"
DESCSET 0 1114112 0
FUNCTION RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
NAMESTRT 95 170 181 186 192-214 216-246 248-305 305-383 383-452
452-455 455-458 458-497 497-501 506-535 592-680 688-696
699-705 736-740 768-837 864-865 890 902 904-906 908
910-929 914 920 922 928-929 931-939 934 940-974 976-982
986 988 990 992 994-1011 1025-1036 1038-1103 1105-1116
1118-1153 1155-1158 1168-1220 1223-1224 1227-1228
1232-1259 1262-1269 1272-1273 1329-1366 1369 1377-1415
1425-1441 1443-1465 1467-1469 1471 1473-1474 1476
1488-1514 1520-1522 1569-1594 1601-1618 1648-1719
1722-1726 1728-1742 1744-1747 1749-1768 1770-1773
2305-2307 2309-2361 2364-2381 2385-2388 2392-2403
2433-2435 2437-2444 2447-2448 2451-2472 2474-2480 2482
2486-2489 2492 2494-2500 2503-2504 2507-2509 2519
2524-2525 2527-2531 2544-2545 2562 2565-2570 2575-2576
2579-2600 2602-2608 2610-2611 2613-2614 2616-2617 2620
2622-2626 2631-2632 2635-2637 2649-2652 2654 2672-2676
2689-2691 2693-2699 2701 2703-2705 2707-2728 2730-2736
2738-2739 2741-2745 2748-2757 2759-2761 2763-2765 2784
2817-2819 2821-2828 2831-2832 2835-2856 2858-2864
2866-2867 2870-2873 2876-2883 2887-2888 2891-2893
2902-2903 2908-2909 2911-2913 2946-2947 2949-2954
2958-2960 2962-2965 2969-2970 2972 2974-2975 2979-2980
2984-2986 2990-2997 2999-3001 3006-3010 3014-3016
3018-3021 3031 3073-3075 3077-3084 3086-3088 3090-3112
3114-3123 3125-3129 3134-3140 3142-3144 3146-3149
3157-3158 3168-3169 3202-3203 3205-3212 3214-3216
3218-3240 3242-3251 3253-3257 3262-3268 3270-3272
3274-3277 3285-3286 3294 3296-3297 3330-3331 3333-3340
3342-3344 3346-3368 3370-3385 3390-3395 3398-3400
3402-3405 3415 3424-3425 3585-3630 3632-3642 3648-3653
3655-3662 3713-3714 3716 3719-3720 3722 3725 3732-3735
3737-3743 3745-3747 3749 3751 3754-3755 3757-3758
3760-3769 3771-3773 3776-3780 3784-3789 3804-3805
3864-3865 3893 3895 3897 3902-3911 3913-3945 3953-3972
3974-3979 3984-3989 3991 3993-4013 4017-4023 4025
4256-4293 4304-4342 4352-4441 4447-4514 4520-4601
7680-7835 7840-7929 7936-7957 7960-7965 7968-8005
8008-8013 8016-8023 8025 8027 8029 8031-8061 8064-8116
8118-8124 8126 8130-8132 8134-8140 8144-8147 8150-8155
8160-8172 8178-8180 8182-8188 8319 8400-8412 8417 8450
8455 8458-8467 8469 8472-8477 8484 8486 8488 8490-8497
8499-8504 8544-8578 12295 12321-12335 12353-12436
12441-12442 12449-12538 12549-12588 12593-12686
19968-40869 44032-55203
LCNMCHAR ""
UCNMCHAR ""
NAMECHAR 45 46 58 183 720 721 1600 1632-1641 1776-1785 2406-2415
2534-2543 2662-2671 2790-2799 2918-2927 3047-3055
3174-3183 3302-3311 3430-3439 3654 3664-3673 3782
3792-3801 3872-3881 8204-8207 8234-8238 8298-8303 12293
12337-12341 12443-12446 12540-12542
NAMECASE GENERAL NO
ENTITY NO
DELIM GENERAL SGMLREF
HCRO "&#38;#x" -- 38 = ampersand --
NESTC "/"
NET ">"
PIC "?>"
SHORTREF NONE
NAMES SGMLREF
QUANTITY NONE
ENTITIES "amp" 38 "lt" 60 "gt" 62 "quot" 34 "apos" 39
FEATURES MINIMIZE DATATAG NO OMITTAG NO RANK NO
SHORTTAG STARTTAG EMPTY NO UNCLOSED NO NETENABL IMMEDNET
ENDTAG EMPTY NO UNCLOSED NO
ATTRIB DEFAULT YES OMITNAME NO VALUE NO
EMPTYNRM YES
IMPLYDEF ATTLIST YES
DOCTYPE YES
ELEMENT YES
ENTITY YES
NOTATION YES
LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
OTHER CONCUR NO SUBDOC NO FORMAL NO URN NO
KEEPRSRE YES VALIDITY TAG ENTITIES REF ANY INTEGRAL YES
APPINFO NONE SEEALSO "http://www.w3.org/TR/PR-xml-971208"
>
To allow large character sets to be defined for SGML markup, Annex J of ISO 8879 defines the following extensions for SGML declarations:
NAMESTRT keyword within NAMING, to allow name start characters that are not case sensitive to be defined separately from those that are case sensitive.
NAMECHAR keyword within NAMING, to allow name characters that are not case sensitive to be defined separately from those that are case sensitive.
Examples of the use of these facilities can be seen in the NAMING section of the SGML declaration for XML shown above.
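As a rough illustration of how a NAMESTRT list might be consumed, the following Python sketch parses a space-separated list of decimal character numbers and ranges and tests whether a character falls inside it. The function names are hypothetical, and only the first six entries of the XML declaration's NAMESTRT list are used.

```python
def parse_char_ranges(spec):
    """Turn a string such as '95 170 192-214' into a list of
    inclusive (low, high) pairs of decimal character numbers."""
    ranges = []
    for token in spec.split():
        low, _, high = token.partition("-")
        ranges.append((int(low), int(high or low)))
    return ranges


def in_ranges(char, ranges):
    """Report whether the character's number falls in any range."""
    code = ord(char)
    return any(low <= code <= high for low, high in ranges)


# The first few NAMESTRT entries from the XML declaration above.
NAMESTRT_EXCERPT = parse_char_ranges("95 170 181 186 192-214 216-246")

print(in_ranges(chr(95), NAMESTRT_EXCERPT))    # underscore
print(in_ranges(chr(201), NAMESTRT_EXCERPT))   # inside 192-214
print(in_ranges(chr(215), NAMESTRT_EXCERPT))   # between the two ranges
```

A linear scan is adequate for a sketch; a real parser handling the full list would more likely use a sorted table and binary search.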
When the Extended Naming Rules are being used without applying the
extensions defined in Annex K, the minimum literal following the <!SGML
must be extended to read "ISO 8879:1986 (ENR)".
When both the extensions in Annex J and those in Annex K apply to an SGML
declaration, the minimum literal at the start of the definition is extended to
read "ISO 8879:1986 (WWW)", where WWW
stands for World Wide Web.
The additions provided by Annex K are:
CAPACITY NONE.
QUANTITY NONE in the SYNTAX clause.
SYNTAX clause.
EMPTY.
Because ISO/IEC 10646 character sets are typically displayed as rows of
256 characters (16 columns of 16 characters) it is often easier to reference
them using a hexadecimal (base 16) number than a decimal (base 10) number. For
this reason the Web SGML adaptations include a new delimiter name, hexadecimal
character reference open (HCRO), which can be used in the DELIM
section of the SYNTAX clause. A typical use of this option is:
DELIM GENERAL SGMLREF HCRO "&#38;#x"
Note particularly the use of an embedded (decimal)
character reference, &#38;, which indicates that the
delimiter starts with an ampersand (&) code. This form of
double escaping is required to ensure that an error is not reported when the
parser checks the contents of the string defining the delimiter.
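The relationship between decimal and hexadecimal character references can be sketched in a few lines of Python. This is an illustration only, not how an SGML parser is implemented: the function name `resolve_char_refs` is hypothetical, and the sketch assumes the delimiters `&#`, `&#x` and `;`.

```python
import re

# Matches decimal (&#38;) and hexadecimal (&#x26;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def resolve_char_refs(text):
    """Replace each numeric character reference with its character.
    The text is scanned once; replacement text is not rescanned."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return _CHAR_REF.sub(replace, text)


# The same character written in base 16 and in base 10.
print(resolve_char_refs("&#x3B1;") == resolve_char_refs("&#945;"))

# One level of resolution turns the escaped literal into the delimiter string.
print(resolve_char_refs("&#38;#x"))
```

Because the result is not rescanned, resolving the escaped literal once yields the three characters of the delimiter rather than the start of a further reference.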
At the end of the SYNTAX clause an optional new entry can be
added to specify named character data entities that can be used to escape markup
characters. It is recommended that all characters that are defined as the first
character in a markup delimiter be provided with escape entity names to allow
them to be used within the contents of elements or entities. For example, XML
defines the following set of default entities:
ENTITIES "amp" 38 "lt" 60 "gt" 62 "quot" 34 "apos" 39
This declaration states that the characters used to identify the start or end of markup declarations, processing instructions, elements, attribute values and entity references within a document instance can be identified using predefined character data entities as follows:
| Delimiters using character by default | Default character (decimal value) | Predefined entity name in XML |
|---|---|---|
| ERO | & (38) | &amp; |
| STAGO, ETAGO, MDO and PIO | < (60) | &lt; |
| TAGC, MDC and PIC | > (62) | &gt; |
| LIT | " (34) | &quot; |
| LITA | ' (39) | &apos; |
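A minimal Python sketch of how these five predefined entities might be used to escape text before it is placed in a document instance follows. The names `PREDEFINED` and `escape_markup` are illustrative; only the entity names and character numbers come from the ENTITIES entry.

```python
# Entity names and decimal character numbers from the ENTITIES entry.
PREDEFINED = {"amp": 38, "lt": 60, "gt": 62, "quot": 34, "apos": 39}

# Map each character to its entity reference, for example "<" to "&lt;".
_ESCAPES = {chr(code): f"&{name};" for name, code in PREDEFINED.items()}


def escape_markup(text):
    """Replace every markup-significant character with an entity reference."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


print(escape_markup('if a < b && c > "d"'))
```

Working character by character avoids the usual pitfall of chained string replacement, where the ampersand introduced by one substitution is escaped again by the next.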
In the MINIMIZE section of the FEATURES clause
the options that can be associated with SHORTTAG
have been extended. Instead of just saying NO to indicate that
minimization of tags is not allowed or YES to say that all forms
of tag minimization are allowed, you can now select each of the minimization
options individually. If the new option is used then three new keywords must be
added to the declaration, and each of these keywords must be followed by entries
that consist of a keyword identifying a minimization option followed by an
appropriate value.
The SHORTTAG option sets are:
STARTTAG, to indicate options that apply to start-tags, which are:
  EMPTY, to indicate whether empty start-tags are to be allowed (YES) or not (NO)
  UNCLOSED, to indicate whether unclosed start-tags are to be allowed (YES) or not (NO)
  NETENABL, to indicate whether null end-tags can be enabled using the new NESTC delimiter for all elements (ALL), only for empty elements whose end-tag immediately follows the start-tag (IMMEDNET), or for no elements (NO).
ENDTAG, to indicate options that apply to end-tags, which are:
  EMPTY, to indicate whether empty end-tags are to be allowed (YES) or not (NO)
  UNCLOSED, to indicate whether unclosed end-tags are to be allowed (YES) or not (NO).
ATTRIB, to indicate options for short forms of attribute specification in start-tags, which are:
  DEFAULT, to indicate whether attributes can be omitted if they have been assigned default values in an attribute list declaration (YES) or not (NO)
  OMITNAME, to indicate whether attribute name omission is permitted for attributes whose value is a unique member of a named token group (YES) or not (NO)
  VALUE, to indicate whether the literal delimiters surrounding an attribute value can be omitted (YES) or not (NO).
To enable the use of end-tags with empty elements users can add the optional
EMPTYNRM YES empty element ending rules specification to the end
of the MINIMIZE part of the FEATURES clause. When
this empty element normalization option is activated, omission of the end-tag is
controlled by the tag omission rules of the element. If the default rules for
the automatic omission of end-tags from empty elements are to apply then
EMPTYNRM NO must be specified.
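In XML, whose declaration specifies NETENABL IMMEDNET and EMPTYNRM YES, the practical effect is that an empty element may be written either as `<br/>` or as `<br></br>`. The following Python sketch uses the standard library's XML parser (an XML processor, not an SGML parser, so this is an analogy, not a test of the declaration) to show that the two forms produce the same element.

```python
import xml.etree.ElementTree as ET

# Null end-tag immediately following a net-enabling start-tag.
short_form = ET.fromstring("<p>one<br/>two</p>")

# Explicit end-tag on an element with no content.
long_form = ET.fromstring("<p>one<br></br>two</p>")

for doc in (short_form, long_form):
    br = doc.find("br")
    # Tag name, number of children, and the text following the element.
    print(br.tag, len(br), repr(br.tail))

# Both documents serialize identically.
print(ET.tostring(short_form) == ET.tostring(long_form))
```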
Note: If this option is specified the options for implicit definitions must immediately follow it.
When empty element ending rules are specified they must be immediately
followed by the keyword IMPLYDEF and then by the following
keywords, in the order shown:
ATTLIST, to indicate whether undefined attributes can be implied to have been declared as CDATA #IMPLIED (YES) or not (NO)
DOCTYPE, to indicate whether a document instance with no associated document type declaration is treated as though the declaration were <!DOCTYPE #IMPLIED SYSTEM [ ]> (YES) or not (NO)
ELEMENT, to indicate whether undeclared elements can be implied to have been declared as - - ANY (YES) or not (NO)
ENTITY, to indicate whether undeclared entities can be implied to have been declared as SYSTEM (YES) or not (NO)
NOTATION, to indicate whether undeclared notations can be implied to have been declared as SYSTEM (YES) or not (NO).
Note: When ENTITY YES is specified the reserved name
#DEFAULT cannot be assigned to
an entity declaration. DOCTYPE YES cannot be specified if
concurrent or linked document types are being used.
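The effect of implied definitions resembles the way a non-validating XML processor accepts a document that has no document type declaration. The Python sketch below is only an analogy: the standard library parser does not implement IMPLYDEF, and the `memo` document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# No document type declaration and no element or attribute declarations.
doc = ET.fromstring('<memo priority="high"><to>Editor</to></memo>')

print(doc.tag)               # the undeclared element is accepted
print(doc.get("priority"))   # the attribute value is a plain character string
print(doc.find("to").text)   # the undeclared child may appear anywhere
```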
To specify that public identifiers must be entered in the form of Internet
Uniform Resource Names (URNs) you can now add URN YES after the
FORMAL NO option at the end of the FEATURES clause.
If no entry is specified, URN NO is assumed,
indicating that no checking of public identifiers is required.
Note: If this option is used it must be immediately followed by the options listed under the Other new features heading below.
When the URN feature has been added to the FEATURES clause it
must be followed by specifications relating to the following new options:
KEEPRSRE, to indicate whether Record End and Record Start codes found between elements in mixed content are to be retained (YES) or not (NO)
VALIDITY, to indicate which of the following types of validity checking is to be applied to elements in the document instance:
  TAG, to indicate that checking is only required to ensure that the document is fully tagged, i.e. that every non-EMPTY element within the document instance has both a start-tag and an end-tag
  TYPE, to indicate that checking is required to ensure that the element and its attributes are permitted in the current context (the default condition when validity is not specified)
  TAG-TYPE, to indicate that checking is required to ensure that the element and its attributes are permitted in the current context, and that every non-EMPTY element has both a start-tag and an end-tag
  NOASSERT, to indicate that no validity checking needs to be applied to elements in the document instance.
ENTITIES, to indicate the types of checks to be performed on entities referenced in the document instance, which can be defined using the following options:
  NOASSERT, to indicate that no validity checking of entity references is required
  REF, to indicate whether any entity reference can occur (ANY), only references to internal entities can occur (INTERNAL), or document instances can contain no entity references other than those that reference predefined data character entities (NONE)
  INTEGRAL, to indicate whether elements must be integrally stored, so that every element starts and ends in the same entity (YES), or not (NO).
Note: If the REF option is present it must be immediately
followed by the INTEGRAL option. The NOASSERT option
must be used on its own.
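The ordering rules for the ENTITIES options can be expressed as a small check. The following Python sketch is an illustration of the rules as stated in this section only; the function name `check_entities_options` is hypothetical and the sketch is not drawn from any SGML parser.

```python
def check_entities_options(tokens):
    """Check the tokens following the ENTITIES keyword against the rules
    stated above: NOASSERT stands alone; REF takes ANY, INTERNAL or NONE
    and must be followed at once by INTEGRAL YES or INTEGRAL NO."""
    if tokens == ["NOASSERT"]:
        return True
    return (
        len(tokens) == 4
        and tokens[0] == "REF"
        and tokens[1] in ("ANY", "INTERNAL", "NONE")
        and tokens[2] == "INTEGRAL"
        and tokens[3] in ("YES", "NO")
    )


# The form used in the SGML declaration for XML.
print(check_entities_options("REF ANY INTEGRAL YES".split()))

# REF without INTEGRAL, and NOASSERT combined with another option.
print(check_entities_options("REF ANY".split()))
print(check_entities_options("NOASSERT INTEGRAL YES".split()))
```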
When the Web SGML Adaptations are being used the APPINFO
clause can be extended by adding a SEEALSO statement which is
followed by a public identifier that
references a file that contains information on any additional constraints to be
applied by applications using the SGML declaration. For example, the constraints
specified in the XML specification can be referenced using the following
statement:
APPINFO NONE SEEALSO "http://www.w3.org/TR/PR-xml-971208"
More than one public identifier can be specified if appropriate. If no
additional rules apply the entry can either be omitted or changed to SEEALSO
NONE.
When the Web SGML Adaptations are being used a shortened form of SGML declaration can be associated with a document type declaration. This takes the form of an SGML declaration reference to an externally stored SGML declaration body. The format of the shortened declaration is:
<!SGML name external-identifier? >
where name is a reference concrete syntax name used to
identify the SGML declaration and the optional external-identifier
identifies the external entity which contains
the clauses of the SGML declaration body. (If it is omitted the system is
assumed to be able to use the name to find the relevant SGML
declaration: this is equivalent to having an unqualified system identifier of
SYSTEM for the external identifier.)
The SGML declaration used by the Extensible Markup Language (XML) shown above could be referenced using the new Internet Domain Name form for formal public identifiers as:
<!SGML XML PUBLIC "+//IDN www.w3c.org//SD SGML declaration body for XML//EN">
The file referenced must start with the minimum literal that indicates which version of the SGML declaration is being used, followed by definitions for each of the clauses that make up the SGML declaration body. Comments can be interspersed between clauses, but may not precede the minimum literal.
If the system knows where to find an SGML declaration known as XML
all that needs to be added to the start of the file is <!SGML XML>.
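A processing system has to pull the name and any external identifier out of the shortened declaration before it can locate the declaration body. The following Python sketch shows one way this might be done. It handles only the two forms shown in this section (a bare name, and a name followed by PUBLIC and a quoted identifier); the pattern and function name are hypothetical.

```python
import re

# Hypothetical pattern for: <!SGML name >  or  <!SGML name PUBLIC "..." >
_SHORT_DECL = re.compile(
    r'<!SGML\s+(?P<name>[^\s>]+)'
    r'(?:\s+PUBLIC\s+"(?P<public>[^"]*)")?'
    r'\s*>'
)


def parse_short_declaration(text):
    """Return (name, public_identifier); the identifier is None if absent."""
    match = _SHORT_DECL.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a shortened SGML declaration")
    return match.group("name"), match.group("public")


name, public_id = parse_short_declaration(
    '<!SGML XML PUBLIC "+//IDN www.w3c.org//SD SGML declaration body for XML//EN">'
)
print(name)

# The components of the public identifier are separated by "//".
print(public_id.split("//"))

print(parse_short_declaration("<!SGML XML>"))
```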
Readers wishing to know more about the role of the SGML declaration should refer to the following books:
Goldfarb, C.F. (1990) The SGML Handbook Oxford: Clarendon Press
Bryan, M.T. (1987) SGML: An Author's Guide to the Standard Generalized Markup Language Wokingham: Addison-Wesley
The following ISO standards define character sets typically referenced in SGML declarations:
International Organization for Standardization/International Electrotechnical Commission (1991), Information Processing - 7-bit coded character set for information interchange (ISO/IEC 646:1991) Geneva: ISO
International Organization for Standardization/International Electrotechnical Commission (1994), Information technology - Coded graphic character set for text communication - Latin alphabet (ISO/IEC 6937:1994) Geneva: ISO
International Organization for Standardization (1987), Information processing - 8-bit single-byte coded graphic character sets Parts 1-10. (ISO 8859:1987) Geneva: ISO
International Organization for Standardization/International Electrotechnical Commission (1993), Information technology - Universal Multiple-Octet Coded Character Set (UCS) (ISO/IEC 10646:1993) Geneva: ISO
Details of the Web SGML Adaptations can be found in:
Web SGML Adaptations, Annex K to ISO 8879:1986, ISO/IEC JTC1/WG4, December 1997