© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
This chapter explains how SGML entities are declared and used.
An entity is defined in ISO 8879 as "a collection of characters that can be referenced as a unit". SGML places no constraints on the maximum size of an entity.
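An entity that contains a complete SGML document is known as an SGML document entity. SGML document entities have three main sections: an SGML declaration, a document prolog containing the document type declaration, and a document instance.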
SGML document entities can contain embedded references to other entities. There are two main types of entity: general entities and parameter entities.
Both of these categories can be further subdivided into:
Where the replacement text of a general entity should not be parsed when being incorporated into the document it can be declared as a character data entity (CDATA). Where the replacement text is defined in a manner that is system-specific it can be defined as a specific character data entity (SDATA). Where the replacement text contains codes intended to control processing it can be defined as a processing instruction (PI) entity.
There are three main types of external entity that would be stored in a separate file:
Character data (CDATA) and specific character data (SDATA) can also be stored in external entities.
Embedded entities are the key to understanding SGML. Each embedded entity has two components: an entity declaration and one or more entity references. The entity declaration defines the name and contents of the entity: the entity references identify the points at which those contents are to be incorporated into the document.
Entity declarations form part of the document type declaration. Parameter entity references are used within document type and link type declarations to identify the points at which the replacement text of parameter entities is to be read and interpreted. General entity references are used within the document instance to identify the points at which the replacement text or external file defined in the entity declaration are to be incorporated into the text.
Closely associated with SGML entities are character references and short references. Character references allow authors to enter characters that are not available on the keyboard by reference to a character number or a function name. Short references allow single characters, or specially defined groups of characters, to act as a shorthand reference to an entity.
An entity reference is entered into an SGML document to indicate each point at which the contents of a previously defined entity are to be incorporated into the document. There are two types of entity reference: general entity references and parameter entity references.
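A general entity reference consists of an entity reference open (ERO) delimiter, the name of the entity, and a reference end.
When the reference concrete syntax is being used the entity reference open delimiter is &. The length of the entity name must not exceed the value currently specified for the NAMELEN quantity, and the name must start with a valid name start character and be followed by valid name characters.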
A reference end is either:
a reference close (REFC) delimiter (a semicolon by default)
a record end (RE) function code
A typical general entity reference will, therefore, take the form &name; (or &name if immediately followed by a space or record end).
A special entity, known as the default entity, can be declared in a document type definition. If such a default entity has been declared its contents will be output whenever an otherwise undeclared name is encountered within an entity reference. Normally the default entity will contain a message warning that an unrecognized entity name has been encountered at that point in the document, e.g.:
*** Reference to undeclared entity found here ***
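The way general entity references are resolved, with the default entity as a fallback, can be sketched in Python. This is a minimal illustration and not a conforming SGML parser: the `expand` function, the `sgml` entity and the regular expression are assumptions made for the example, names are assumed to follow the reference concrete syntax, and only the semicolon form of the reference end is modelled explicitly.

```python
import re

# Assumed for illustration: a name is a letter followed by letters,
# digits, '.' or '-'; the closing semicolon is optional.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")

DEFAULT_TEXT = "*** Reference to undeclared entity found here ***"

def expand(text, entities, default=None):
    """Replace each general entity reference in text with its replacement text."""
    def substitute(match):
        name = match.group(1)
        if name in entities:
            return entities[name]
        if default is not None:
            return default  # stands in for a declared #DEFAULT entity
        raise KeyError(f"undeclared entity: {name}")
    return REFERENCE.sub(substitute, text)

# Hypothetical entity table used only for this example.
entities = {"sgml": "Standard Generalized Markup Language"}
print(expand("Coded in &sgml;.", entities, DEFAULT_TEXT))
print(expand("See &unknown; for details.", entities, DEFAULT_TEXT))
```

The second call shows the warning text appearing in place of the unrecognized reference, which is how an author would normally discover the mistake.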
Parameter entity references may only occur within SGML markup declarations. A distinction is made between general entities and parameter entities to avoid the possibility of an author accidentally trying to declare an entity whose name has already been used by a DTD developer. By distinguishing between the uses to which the two types of entity are put, it is possible to unambiguously use the same name for a parameter entity and a general entity.
Parameter entities can also be referenced within markup declarations, such as those used to identify the role of marked sections, that can occur within document instances.
A parameter entity reference consists of a parameter entity reference open (PERO) delimiter, the name of the parameter entity, and a reference end.
When the reference concrete syntax is being used the parameter entity reference open delimiter is %. When the length of the parameter entity reference open delimiter and the parameter entity name are added together their length may not exceed that currently specified for the NAMELEN quantity. (The delimiter is treated as part of the entity name.)
A typical parameter entity reference will, therefore, have the form %name;.
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K to ISO 8879 are available options in the
Note: If the
Entity declarations form part of a document type declaration subset (or a link type declaration subset) defined within the document prolog.
Within the document type declaration subset, each individual entity declaration is entered between its own set of markup declaration delimiters. The reserved name ENTITY (or its previously declared replacement) follows the markup declaration open (MDO) delimiter to identify the declaration as an entity declaration. The rest of the declaration consists of the entity name followed by the replacement entity text to give an entity declaration the general form:
<!ENTITY name "replacement entity text">
In its simplest form the replacement text will consist of a string of characters delimited by a matched pair of either quotation marks (") or apostrophes ('). A typical SGML text entity might be declared as:
<!ENTITY OPOCE "Office for Official Publications of the European Communities">
This entity can be referenced by entering &OPOCE; at the points in the text of relevant document instances at which the replacement text is to appear.
The replacement text of SGML text entities can include markup codes, including start-tags, embedded entity references, character references, short references and data tags, which will be interpreted as the entity text is added to the document. For example, a general entity declaration might take the form:
<!ENTITY en-reg "<em lang=fr>en r&egrave;gle</em>" >
When this entity is called, by entering &en-reg; in the text, the program will recognize the embedded text as a French emphasized phrase, bracketed by an <em lang=fr> start-tag and an </em> end-tag. Before outputting this highlighted phrase in the appropriate font the program will expand the reference to the entity called egrave to obtain the system specific code needed to generate e with a grave accent.
One word of warning: you cannot reference an entity within its replacement text as this will create a recursive loop. For this reason, the replacement string cannot contain any characters that might be treated as short references which should be mapped to the entity being defined.
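A check for such recursive loops can be sketched in Python. This is an illustrative assumption about how a tool might detect the problem, not a description of any particular parser: `find_cycle` and the sample entity tables are hypothetical, and short reference mappings are not modelled.

```python
import re

# Assumed name syntax: a letter followed by letters, digits, '.' or '-'.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")

def find_cycle(entities):
    """Return a list of entity names forming a reference loop, or None."""
    def visit(name, path):
        if name in path:
            return path[path.index(name):] + [name]
        # Undeclared names have no replacement text to follow.
        for ref in REFERENCE.findall(entities.get(name, "")):
            cycle = visit(ref, path + [name])
            if cycle:
                return cycle
        return None

    for name in entities:
        cycle = visit(name, [])
        if cycle:
            return cycle
    return None

safe = {"en-reg": "<em lang=fr>en r&egrave;gle</em>", "egrave": "[egrave]"}
looped = {"a": "see &b;", "b": "see &a;"}
print(find_cycle(safe))
print(find_cycle(looped))
```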
A parameter entity declaration is distinguished from a general entity declaration by having a parameter entity reference open (PERO) delimiter, %, and one or more spaces immediately in front of the required name to give it the form:
<!ENTITY % name "replacement text" >
Typically the replacement text for a parameter entity will consist of a series of element type names separated by the relevant SGML model group connectors, e.g.:
<!ENTITY % heading "H1|H2|H3|H4|H5|H6">
It is important to remember that parameter entities must be declared before the entity is referred to within the document type definition. In most prologs you will find that all parameter entities are declared at the start of the document type definition subset. Where parameter entities are used to define the replacement text required for other parameter entity declarations care must be taken to ensure that the declarations always precede the references. For example, the following declarations are used in Version 4.0 of the HTML DTD:
<!ENTITY % fontstyle "TT | I | B | BIG | SMALL">
<!ENTITY % phrase "EM | STRONG | DFN | CODE | SAMP | KBD | VAR | CITE | ABBR">
<!ENTITY % special "A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO">
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">
It is important to ensure that the parameter entities referenced in the replacement text for the inline parameter entity are declared before they are referenced, as in the case of the above sequence of declarations.
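The declare-before-use rule lends itself to a mechanical check. The Python sketch below is a hedged illustration: `undeclared_references` and the two regular expressions are assumptions made for the example, and they only recognize the simple quoted-literal form of parameter entity declaration shown above.

```python
import re

# Matches declarations of the form <!ENTITY % name "replacement text">.
DECLARATION = re.compile(r'<!ENTITY\s+%\s+([A-Za-z][A-Za-z0-9.-]*)\s+"([^"]*)"\s*>')
# Matches parameter entity references such as %phrase; in replacement text.
PARAM_REF = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")

def undeclared_references(dtd_text):
    """List (entity, reference) pairs where a reference precedes its declaration."""
    declared = set()
    problems = []
    for name, replacement in DECLARATION.findall(dtd_text):
        for ref in PARAM_REF.findall(replacement):
            if ref not in declared:
                problems.append((name, ref))
        declared.add(name)
    return problems

# A deliberately misordered fragment: phrase is used before it is declared.
dtd = '''
<!ENTITY % fontstyle "TT | I | B | BIG | SMALL">
<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase;">
<!ENTITY % phrase "EM | STRONG">
'''
print(undeclared_references(dtd))
```

Moving the phrase declaration above the inline declaration makes the reported list empty.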
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the replacement text for a parameter entity occurring as a token separator in a markup declaration must contain only complete tokens, and may not include an unmatched group delimiter.
The purpose of an entity can be explained by incorporating comments within the definition. The start and end of each comment must be indicated by entering comment (COM) delimiters (a pair of consecutive hyphens in the reference concrete syntax). Like the replacement entity text, comments can take up more than one line, e.g.:
<!ENTITY disclaim "Users should note that all International Standards undergo revision from time to time and that any reference made herein to any other International Standard implies its latest edition, unless otherwise stated." -- Must appear in the Foreword of each ISO standard -- >
Variations to the basic declaration allow users to specify the following special forms of general entities:
A special default entity can be declared by using the reserved word #DEFAULT in place of an entity name, e.g.:
<!ENTITY #DEFAULT "*** Reference to undeclared entity found here ***">
The replacement text for this default entity will be used for any general entity reference whose name is not recognized as one of the entities declared in the currently active DTD.
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, and
The CDATA keyword can be placed between the entity name and its replacement text to tell the program that the replacement text is to be treated as a character data entity. This means that any characters within the string that could possibly be interpreted as markup codes will be ignored. For example, the declaration:
<!ENTITY para CDATA "<P>" >
would allow a &para; entity reference to generate the characters <P> rather than the start-tag for a paragraph (which would be output if CDATA was not used).
Where a document, such as this one, contains a lot of text that may be mistaken for markup, it is better to declare special entities that can be used to generate SGML delimiter sequences. The characters most likely to need treating in this way are the less-than sign (<) used at the start of many types of markup declaration and the entity reference open delimiter (&). The following declarations could be used to set up entities that would meet this need:
<!ENTITY lt CDATA "<" >
<!ENTITY amp CDATA "&" >
Using this definition, code for a paragraph start-tag (<P>) could be entered as &lt;P>. (This would not be recognized as a valid start-tag because tags and entity references are only recognized if they are contained within the same entity.) Similarly, the general entity reference &SGML; could be entered as &amp;SGML; to ensure that it is not recognized as a reference to an entity called SGML (or the default entity if no such entity has been declared in the DTD).
It should be noted that the semicolon is a compulsory element of both the last two entity references because they are immediately followed by a name character. If the semicolon had been left out of the first example, the program would have tried to find an entity whose declared name was ltP. In the case of the second example, the program would look for an entity called ampSGML. If entities with these names had not previously been declared, and no default entity had been defined, the parser should flag the entity reference as invalid.
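The effect of omitting the semicolon can be demonstrated with a few lines of Python. The sketch assumes the name syntax of the reference concrete syntax; `referenced_names` is a hypothetical helper that reports which entity names would be looked up, without expanding them.

```python
import re

# The name runs on for as long as name characters follow the ampersand;
# a semicolon, if present, closes the reference.
NAME = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")

def referenced_names(text):
    """Return the entity names that would be looked up in text."""
    return [match.group(1) for match in NAME.finditer(text)]

print(referenced_names("&lt;P>"))      # ['lt']
print(referenced_names("&ltP>"))       # ['ltP']
print(referenced_names("&amp;SGML;"))  # ['amp']
print(referenced_names("&ampSGML;"))   # ['ampSGML']
```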
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the predefined character data entities extension can be used to assign reserved character names for delimiters.
Another approach to this problem is to use the reserved name SDATA to identify the declaration as a specific character data entity. For example, the following entity declarations are provided in ISO 8879:
<!ENTITY lt SDATA "[lt ]" --=less-than sign-->
<!ENTITY amp SDATA "[amp ]" --=ampersand-->
Most SGML-based programs would automatically expand standard ISO entity references such as these to give the code sequence required to generate the character during formatting because the convention of enclosing the entity name in square brackets is one used for all ISO entity sets. But where users have defined their own, system-specific, replacement codes in the entity replacement text the inclusion of the SDATA reserved name in the entity definition will allow receiving programs to request the information needed to generate the requested character(s) on the local text formatter.
Note: While characters defined as valid in the document's character set but as invalid in the document's concrete syntax can be included in SDATA entities, non-SGML characters that have been declared as unused in the document's character set cannot be entered as part of the replacement text of an SDATA entity.
When short references are being used the replacement text for some entities will consist solely of an element start-tag or end-tag. In such cases the role of the entity can be unambiguously defined by preceding the undelimited element type name, and any associated attributes, with a STARTTAG or ENDTAG reserved name.
For example, the declaration:
<!ENTITY refstart STARTTAG "sub align=left" >
will cause the program to replace &refstart; with <sub align=left>, while the declaration:
<!ENTITY refend ENDTAG "sub" >
will cause it to replace &refend; with </sub>.
Alternatively the two entities could have been defined as:
<!ENTITY refstart "<sub align=left>" >
<!ENTITY refend "</sub>" >
but in this case the program would not know that the replacement text contained a markup tag which needed further processing until it had added the replacement text to the main text stream. In addition, while the first pair of definitions would work irrespective of what definition is used for markup delimiters, the second pair of definitions will only work while the reference concrete syntax definitions for STAGO, ETAGO and TAGC are in force.
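The delimiter independence of STARTTAG and ENDTAG entities can be illustrated with a small Python sketch. The `tag_entity_text` function and the variant delimiter set are hypothetical; only the reference concrete syntax strings for the start-tag open, end-tag open and tag close delimiters are taken from SGML itself.

```python
# Delimiter strings of the reference concrete syntax; a variant concrete
# syntax could assign different strings to the same delimiter roles.
REFERENCE_DELIMITERS = {"STAGO": "<", "ETAGO": "</", "TAGC": ">"}

def tag_entity_text(keyword, content, delimiters=REFERENCE_DELIMITERS):
    """Build the tag that a STARTTAG or ENDTAG entity stands for."""
    if keyword == "STARTTAG":
        return delimiters["STAGO"] + content + delimiters["TAGC"]
    if keyword == "ENDTAG":
        return delimiters["ETAGO"] + content + delimiters["TAGC"]
    raise ValueError(f"unsupported keyword: {keyword}")

print(tag_entity_text("STARTTAG", "sub align=left"))
print(tag_entity_text("ENDTAG", "sub"))

# Hypothetical variant syntax, invented purely for this example.
variant = {"STAGO": "{", "ETAGO": "{/", "TAGC": "}"}
print(tag_entity_text("STARTTAG", "sub align=left", variant))
```

Because the declaration stores only the undelimited content, the same entity yields a correctly delimited tag whichever delimiter set is in force.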
Other keywords can be used to identify parameter or general entities whose replacement text defines an embedded SGML markup instruction. The reserved words that are placed between the entity name and the replacement text to identify such entities are:
MS to identify the replacement text as a marked section
MD to identify the text as a markup declaration (bracketed by <! and >)
PI to identify the text as a processing instruction (bracketed by <? and >)
Typically these keywords will be used in entity declarations such as:
<!ENTITY special MD "USELINK special">
This declaration allows the entity reference &special; to be used to generate a <!USELINK special> markup declaration in the text at a point where special processing of embedded elements is required.
External entities are declared with an external entity specification in place of the replacement text. The external entity specification indicates the source of the data to be added to the document and, optionally, the type of entity being defined.
Two types of external entity are recognized by SGML: system-specific external entities and publicly declared external entities.
Each of these main types can be further subdivided into one of three entity types:
The simplest way of declaring a system-specific external entity that is known only to the systems it is used on is to use the reserved name SYSTEM in place of the replacement text in the entity declaration, in a declaration of the form:
<!ENTITY file1 SYSTEM >
where file1 is a valid entity name that is also recognized by the system as a reference to a file on the local storage system.
Note: Experience has shown that the use of this simplified form of referencing external entities leads to problems when interchanging documents between systems. For this reason the use of this shortened form for the identification of external entities within distributed systems is discouraged.
More typically, however, the SYSTEM keyword in an entity declaration will be qualified by a system identifier that uniquely identifies the source of the required entity. In many instances the system identifier will consist of a file name, optionally qualified by a pathname, e.g.:
<!ENTITY module4 SYSTEM "c:\SGML\course\module4.sgm" >
When the program encounters a &module4; entity reference within the text, it will call the file identified by the system identifier and parse its contents as SGML encoded text at the point identified by the entity reference.
In 1996 a new annex to ISO/IEC standard 10744, the Hypermedia/Time-based Structuring Language (HyTime), introduced the concept of formal system identifiers (FSIs) to SGML. A formal system identifier has a structured form that identifies both the file required and the source of the file. For example, an FSI could be used to identify that a file has been referenced through an Internet Uniform Resource Locator (URL) using a system identifier of the following form:
<!ENTITY chapter4 SYSTEM "<url SOIbase='http://www.u-net.com/~sgml/'>sgml-4.htm" >
When a system-specific external entity contains data that has been coded using a form of markup that differs from that used in the main document the system identifier can be qualified by an entity type statement. Four entity types are recognized within SGML:
subdocument (SUBDOC) entities that contain a complete SGML-coded document
character data (CDATA) entities that contain only valid SGML characters that do not require parsing
specific character data (SDATA) entities that contain characters whose interpretation is specific to the system
non-SGML data (NDATA) entities that contain codes outside the set declared to be valid SGML characters for the document.
If an external entity contains a complete SGML-coded document, including the appropriate document type declaration, it can be declared as a system-specific SGML subdocument entity by placing the reserved word SUBDOC after the entity's system identifier, e.g.:
<!ENTITY table1 SYSTEM "table1.sgm" SUBDOC >
Note: This feature can only be used if the SGML declaration has a FEATURES clause containing a SUBDOC YES entry. It should be noted, however, that the FEATURES clause defaults to SUBDOC NO.
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K to ISO 8879 are available files containing entities identified as subdocuments may start with an SGML declaration or an SGML declaration reference.
When an SGML program that has received the above entity declaration encounters a &table1; entity reference, it should store its current parsing state before calling up the file called table1.sgm. While it is processing the subdocument it will use the document type declaration identified at the start of the subdocument file, rather than the one previously being used. When the end of the subdocument is reached, the parsing state stored when the entity reference was encountered will be restored.
Where the retrieved entity contains data that is not coded in SGML the entity must be declared as a data entity. This is done by entering a reserved word (NDATA) immediately after the system identifier. A notation name identifying the type of coding used within the data entity must follow the keyword, optionally followed by any data attributes that are required to process the contents of the referenced file.
If an Encapsulated PostScript file has been created it could be incorporated into a document by adding an entity declaration of the form:
<!ENTITY fig1 SYSTEM "fig1.eps" NDATA postscript >
A reference of the form &fig1; will cause the parser to pass the file called fig1.eps in the current working directory to the process identified in the notation declaration that has been given the name postscript.
Notation declarations have the general form:
<!NOTATION name identifier >
where name is the notation name (used after NDATA in the entity declaration) and identifier is a valid notation identifier, which is either system-specific or publicly declared.
If the notation identifier is system-specific, it will consist of the reserved name SYSTEM followed by a system identifier that identifies the process control file that needs to be activated to parse the non-SGML data, e.g.:
<!NOTATION postscript SYSTEM "eps.bat" >
When the system has finished processing the data it will transmit a special, system dependent, signal, known as an entity end signal, to the SGML parser.
Note: This signal is output by the system at the end of each entity to tell the parser that it can continue processing the rest of the text now that the entity reference has been satisfied. The entity end signal is not a control code, and need not be one of the codes declared within the document's character set. It can be any signal recognized by the operating system as an indication that the end of an entity's replacement text has been reached.
Where an external entity contains character data, or other system-specific information, its declaration must also be qualified by a suitable notation name, e.g.:
<!ENTITY example1 SYSTEM "example1.dtd" CDATA SGML >
<!ENTITY our-logo SYSTEM "our-logo.out" SDATA logo >
where the associated notation declarations could take the form:
<!NOTATION SGML SYSTEM "newstream.in" >
<!NOTATION logo SYSTEM "Logo.bat" >
When a notation declaration has been associated with a data entity the notation name can optionally be qualified by data attributes. These data attributes can either be passed to the system as parameters associated with the commands that activate the required notation interpreter, or they can be used to determine which commands should be sent to the system.
Data attributes are declared in the same way as other attributes except that the associated element type statement is replaced by the name(s) of the notation(s) the attributes are to be associated with. To indicate the changed role of the attribute definition the reserved name #NOTATION must precede the notation name(s).
To see how data attributes can be used in practice, consider the following data attribute definition:
<!ATTLIST #NOTATION (postscript|TeX)
          width NUTOKEN #IMPLIED
          depth NUTOKEN #IMPLIED >
Here two data attributes, width and depth, have been associated with the notation declarations whose names are postscript and TeX. If no values are entered for these attributes the width and depth of the illustration will be as supplied. If the width or depth of the illustration to be processed is to be altered during processing the entity declaration can be extended to read:
<!ENTITY fig1 SYSTEM "fig1.eps" NDATA postscript [width="5in" depth="3in"]>
Notice that, within the entity declaration, the data attribute specification uses the currently defined declaration subset open (DSO) and declaration subset close (DSC) codes to delimit the entered list of attributes. In the reference concrete syntax these are the open and close square brackets.
Like other attributes, data attributes can be minimized if the permitted values have been defined as a name token group. For example, if the attribute list was extended to read:
<!ATTLIST #NOTATION (postscript|TeX)
          width NUTOKEN #IMPLIED
          depth NUTOKEN #IMPLIED
          align (left|right|centre) left >
the entity declaration could take the form:
<!ENTITY fig1 SYSTEM "fig1.eps" NDATA postscript [width="5in" depth="3in" centre]>
Note: When defining attributes for use with notations, certain declared value keywords, including NOTATION, must not be used, as these keywords can only be used as declared values for attributes associated with elements. Certain default value options, including #CONREF, cannot be used either.
Publicly declared external entities are external entities that contain declarations, text or other data designed to be used on more than one SGML system.
Many publicly declared entities consist solely of a pre-defined set of markup declarations which can be used to extend document type declaration subsets defined within the prolog. When the relevant parameter entity reference is encountered in the document type declaration, the program will add the declarations it has previously stored as a publicly declared entity to any local declarations.
The advantage of using publicly declared entities is that the declarations do not need to be transmitted between systems when the receiving system is already known to have access to them. Instead, all the user needs to do is add the necessary public entity declarations, with the associated references, to the document to tell receiving systems which sets of declarations will be referenced in transmitted documents.
Publicly declared external entities are said to be "publicly declared" because the relevant declarations have been assigned names known by receiving systems, but the declarations can be "private" in the sense that the associated definitions are only provided to a closed user community.
There are, however, certain publicly declared entities that may truly be called "publicly declared". These contain sets of declarations that have been defined by one of the organizations authorized by the International Organization for Standardization (ISO) to keep registers of declarations used in more than one document. Once a declaration set has been registered in this way it will have a unique name by which it can be recognized by all systems referencing the standardized data.
The International Organization for Standardization (ISO) has defined entity sets to identify most commonly used Latin, Greek and Cyrillic characters, and other constructs defined in international standards. Such sets are identified by special ISO owner identifiers. ISO owner identifiers start with the letters ISO followed by the number and date of the standard referred to, e.g. ISO 8879:1986.
Note: Before requesting any publicly declared entity it is important to check that the relevant declarations will be available to any system receiving the marked-up SGML document. The fact that an entity has been publicly declared does NOT mean that it will be known to all SGML systems, it simply means that its definition does not need to be transmitted between systems that already know the definition.
Publicly declared external entities which just contain text can be requested by entering a declaration of the form:
<!ENTITY name PUBLIC "public identifier" >
This entity can be recalled at any point in the text by entering a general entity reference of the form &name;.
Optionally the public identifier can be followed by the filename used to store the entity contents on the local system, e.g.
<!ENTITY name PUBLIC "public identifier" "filename.ext" >
Because this makes the entity declaration less portable, however, this form of external entity declaration is discouraged.
Note: With the development of the concept of entity catalogs by the members of the SGML Open vendors' consortium there is nowadays little need for this form of qualified public identifier.
If the external entity contains SGML markup declarations that are to be added to the document type declaration subset it must be declared by entering a parameter entity declaration of the form:
<!ENTITY % name PUBLIC "public identifier">
the entity then being recalled by entering the relevant parameter entity reference (%name;) at some point between the entity declaration and the declaration subset close character (e.g. ]) marking the end of the document type declaration subset.
The public identifier used in the entity declaration of the above external entities is either a formal public identifier or a name agreed between users.
Note: If the FEATURES clause of the current SGML declaration contains the entry FORMAL YES the public identifier must be a formal public identifier.
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, and the
Note: If both
ISO 8879 restricts the characters that may be used in public identifiers to a special set of minimum data characters consisting of the uppercase and lowercase alphabetic letters, spaces (or RE and RS codes), numbers and the following characters, which are declared to be part of a special character class:
' ( ) + , . - / : = ?
Note: This list of special characters cannot be extended or otherwise altered in the SGML declaration. If you wish to use an agreed name to identify a set of declarations you must make sure your name consists only of characters mapped in the current document character set to these special characters or one of the ISO 646 (ASCII) alphanumeric characters.
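A check against the minimum data character set can be sketched in Python. This is an illustration only: `invalid_characters` is a hypothetical helper, the carriage return and line feed are assumed to stand for the RE and RS codes, the second identifier is an invented example, and the additional characters permitted by the Web SGML adaptations are deliberately not included.

```python
import string

# The special character class listed above.
SPECIAL = set("'()+,.-/:=?")
# Letters, digits, specials, space, and (assumed here) CR and LF for RE and RS.
MINIMUM_DATA = set(string.ascii_letters) | set(string.digits) | SPECIAL | {" ", "\r", "\n"}

def invalid_characters(public_identifier):
    """Return the disallowed characters, in order of first appearance."""
    found = []
    for ch in public_identifier:
        if ch not in MINIMUM_DATA and ch not in found:
            found.append(ch)
    return found

print(invalid_characters("ISO 8879:1986//ENTITIES Added Latin 1//EN"))  # []
print(invalid_characters("-//Example_Org//DTD Test & Demo//EN"))        # ['_', '&']
```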
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the following characters are added to the list of minimum data characters allowed within public identifiers:
; ! * # @ $ _ %
These characters are added so that Internet Uniform Resource Names (URNs) can be used as formal public identifiers.
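Formal public identifiers fall into one of three categories, distinguished by their owner identifier: ISO owner identifiers, registered owner identifiers and unregistered owner identifiers.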
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K to ISO 8879 are available Internet domain names can be used to identify owners of formal public identifiers.
The rules for defining ISO assigned public identifiers are defined in ISO 9070:1991, Information Processing - SGML Support Facilities - Registration Procedures for Public Text Owner Identifiers. A typical ISO registered entity set will be identified by a declaration, within the document type declaration subset, of the form:
<!ENTITY % ISOlat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" >
This declaration must then be invoked by incorporating the parameter entity reference %ISOlat1; into the document type declaration subset before the closing square bracket, e.g.:
<!DOCTYPE docname [
<!ENTITY % ISOlat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" >
      .
      .      -- other required declarations --
      .
%ISOlat1;
]>
ISO assigned public identifiers have three components, separated from each other by a pair of solidus strokes (slashes). The three components are:
an ISO owner identifier (e.g. ISO 8879:1986)
a public text class (e.g. ENTITIES) and an associated public text description (e.g. Added Latin 1)
a public text language (e.g. EN)
Entity sets registered by bodies other than ISO will use a registered owner identifier in place of the ISO owner identifier. The registered name is preceded by +// to identify the following identifier as one applying to a registered set. A typical declaration might be:
<!ENTITY % EC-acts PUBLIC "+//OPOCE//DTD for European Community Acts//EN">
One special class of registered owner identifier that is of special interest to book publishers is that assigned to the International Standard Book Numbering Agency in ISO 9070. This special identifier allows publishers to be identified by reference to the group and publisher identifiers that form the first part of the ISBN numbers assigned to their books. For example, a set of entity declarations for use in European Commission publications could be assigned an identifier of the form:
<!ENTITY % OPOCE PUBLIC "+//ISBN 93 826::Office for Official Publications of the European Communities//ENTITIES Accented Characters//EN" >
Companies can apply to register their own ISO 9070 owner names through the Graphic Communications Association, who are located at 100 Dangerfield Road, Alexandria, Virginia, USA.
|Web SGML Adaptations Extension|
When the Web SGML adaptations provided by Annex K of SGML are available the keyword
Where the declarations have not been formally registered, an unregistered owner identifier must be used as the owner identifier. This has the same form as a registered owner identifier, except that a hyphen is used in place of the initial plus. The name used to identify the owner must consist only of the minimum data characters (alphanumeric or special) allowed in public identifiers. A typical declaration might take the form:
<!ENTITY % SGMLdoc PUBLIC "-//The SGML Centre//DTD for Manual Production//EN">
The public text class name that follows the owner identifier may have one of the following values:
TEXT when the only text of the document instance (including any element tags and entity references) is stored in the external entity
DOCUMENT when the external entity contains an SGML declaration, a document type declaration and text
SUBDOCUMENT when the external entity contains the document type declaration and the text for a subdocument to be referenced within the current document
DTD when the external entity contains a document type declaration subset containing declarations defining the document's structure and entities
ELEMENTS when the external entity only contains element, attribute or notation declarations, with their associated parameter entities, comments, processing instructions and marked sections
ENTITIES when the external entity only contains entity declarations
SHORTREF when the external entity only contains the short reference, entity and map use declarations making up a short reference set
LPD when the external entity only contains the link set, link attribute and entity declarations making up a link type declaration subset
CHARSET when the external entity only contains details of the base character set to be used within an SGML declaration
SYNTAX where the external entity only contains details of the concrete syntax to be used in the SGML declaration
CAPACITY when the external entity only contains capacity set declarations to be referenced in the SGML declaration
NOTATION when the external entity identifies the process to be used when processing non-SGML data
NONSGML when the stored data contains non-SGML characters.
Note: All public text class keywords must be entered using capital letters only.
Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available
Only one keyword can appear in any formal identifier, though most of the keywords can be used more than once in a document. (Only one syntax or capacity declaration can be made in any SGML declaration.)
When the TEXT, DOCUMENT, SUBDOCUMENT or NONSGML public text class keywords are used in a public identifier the entity must be defined as a general entity. The ELEMENTS, ENTITIES and SHORTREF keywords may only be used in parameter entity declarations because they refer to files that contain markup declarations that need to be added to any local declarations within the document type declaration subset. The DTD and LPD keywords must be associated with document type declarations and link process definitions respectively. The CHARSET, SYNTAX and CAPACITY keywords may only be used in the relevant section of an SGML declaration.
The NOTATION keyword differs slightly from the other public text class keywords in that it is only used to qualify notation declarations. It is typically used in the form:
<!NOTATION postscript PUBLIC "-//my-system//NOTATION EPS Processor//EN">
This declaration defines a notation called postscript as a locally recognized notation that will be used to process Encapsulated PostScript.
Each public text class keyword is qualified by a public text description explaining the purpose of the publicly declared entity. (This description is restricted to the alphanumeric and special minimum data characters used for public identifiers.) Where the entity consists of declarations that are not generally available to the public, the public text description should be preceded by an unavailable text indicator (-//) to give the identifier the form:
<!ENTITY % name PUBLIC "-//owner//class -//description//language" >
The public text language parameter that normally ends a formal public identifier must be one of the two-character language codes defined in ISO 639. This code tells the system which language the public text has been prepared in. The most commonly encountered codes are:
Codes have also been defined for most European languages, including "dead" languages such as Latin, and for international languages such as Esperanto, Interlingua and Interlingue.
Notice that the language codes are all defined as a pair of capital letters. These codes cannot be replaced by the equivalent lower-case letters within the formal public identifier, even if the name case rules in the SGML declaration permit general substitution of tag characters, because the standard specifically states that the uppercase form must be used for all public text language values.
Where the public text class keyword is
CHARSET the public text
language code is replaced by a public text designating sequence.
This sequence of codes uniquely identifies the selected character set by using
techniques defined in ISO 2022. Each sequence starts with an Escape code (1/11,
hexadecimal 1B) followed by a number indicating the type of code set being
described. This number is further qualified by one or more numbers identifying
the required set of characters.
The types of code that can be indicated by the first code after the Escape code include:
Note: ESC 2/5 4/0 is a special sequence used to return to the basic G0 character set from a character set defined outside ISO 2022.
A typical use of a public text designating sequence to define a document's character set in an SGML declaration would be:
CHARSET BASESET "ISO6937:1994//CHARSET Latin Alphabet//ESC 2/14 4/1"
The 96-character set required is that specified in the 1994 version of ISO/IEC standard 6937 which, in this instance, is being used as a G2 code set.
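The column/row notation used in a designating sequence can be turned into byte values mechanically: a position written as column/row denotes the code column × 16 + row, which is why Escape at 1/11 is hexadecimal 1B. The following Python sketch illustrates that arithmetic only. The function name is hypothetical, and it assumes the sequence is written as space-separated tokens, as in the example above.

```python
def designating_sequence_to_bytes(sequence: str) -> bytes:
    """Convert a sequence such as 'ESC 2/14 4/1' into the bytes it denotes.

    Hypothetical helper: assumes space-separated tokens, each either the
    word ESC or a column/row position such as 2/14.
    """
    out = bytearray()
    for token in sequence.split():
        if token.upper() == "ESC":
            out.append(0x1B)  # Escape is position 1/11, i.e. 1 * 16 + 11
            continue
        column_text, row_text = token.split("/")
        column, row = int(column_text), int(row_text)
        if not (0 <= column <= 15 and 0 <= row <= 15):
            raise ValueError(f"position out of range: {token}")
        out.append(column * 16 + row)
    return bytes(out)


# The designating sequence from the CHARSET example above.
print(designating_sequence_to_bytes("ESC 2/14 4/1").hex(" "))  # 1b 2e 41
```

The three bytes are the Escape code, the code selecting the type of set being designated, and the code identifying the set itself.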
Where a publicly declared entity consists of entity declarations which
contain system-specific data (i.e. use the
SDATA option) the
associated formal public identifier can be further qualified by the addition of
a public text display version description. This description
identifies which types of device the entities will be recognized by.
A typical extended entry might take the form:
<!ENTITY % Ventura PUBLIC "-//The SGML Centre//ENTITIES Dingbats used in Ventura//EN//Ventura">
Note: When creating a formal public identifier it is important not to split the literal string immediately after one of the slashes (solidii) used to identify the start of a new component within the public identifier because the start of a new line in the delimited string will be treated as if a space had been entered, and spaces are not permitted immediately after a slash.
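The components of a formal public identifier described above can be separated mechanically. The Python sketch below is an illustration rather than a conforming parser: the class and function names are hypothetical, it only checks that the public text class is in capitals (it does not check it against the list of keywords), and it treats any identifier that does not begin with +// or -// as having an ISO owner identifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormalPublicIdentifier:
    owner_kind: str                 # "registered", "unregistered" or "ISO"
    owner: str
    text_class: str                 # e.g. DTD, ENTITIES, CHARSET
    available: bool                 # False if an unavailable text indicator is present
    description: str
    language_or_sequence: str       # language code, or designating sequence for CHARSET
    display_version: Optional[str] = None


def parse_fpi(fpi: str) -> FormalPublicIdentifier:
    fpi = " ".join(fpi.split())     # a line break inside the literal counts as a space
    if "// " in fpi:
        raise ValueError("space is not permitted immediately after //")
    if fpi.startswith("+//"):
        owner_kind, rest = "registered", fpi[3:]
    elif fpi.startswith("-//"):
        owner_kind, rest = "unregistered", fpi[3:]
    else:
        owner_kind, rest = "ISO", fpi   # assumption made by this sketch
    owner, sep, text_id = rest.partition("//")
    if not sep or not owner:
        raise ValueError("missing owner identifier or // separator")
    text_class, _, remainder = text_id.partition(" ")
    if not text_class.isupper():
        raise ValueError("public text class must be in capital letters")
    available = not remainder.startswith("-//")
    if not available:
        remainder = remainder[3:]
    parts = remainder.split("//")
    if len(parts) not in (2, 3):
        raise ValueError("expected description//language[//display version]")
    description, last = parts[0], parts[1]
    is_language = len(last) == 2 and last.isalpha() and last.isupper()
    if text_class != "CHARSET" and not is_language:
        raise ValueError("language must be two capital letters")
    display_version = parts[2] if len(parts) == 3 else None
    return FormalPublicIdentifier(owner_kind, owner, text_class, available,
                                  description, last, display_version)


print(parse_fpi("-//The SGML Centre//ENTITIES Dingbats used in Ventura//EN//Ventura"))
```

Run on the Ventura example, the sketch reports an unregistered owner, the ENTITIES class, the language EN and the display version Ventura.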
The way in which formal public identifiers are resolved into local system identifiers is system dependent. Software conforming to the rules laid down by the SGML Open consortium will use exchangeable SGML Open catalogs to record these mappings. If you receive a document without receiving a catalog with it you will need to create your own mapping between formal public identifiers and local files. If you receive a catalog with the files you will need to ensure that the filenames shown in the catalog are available at the specified location. If your versions of the referenced files are stored at other locations you will need to change the location shown in the catalog. If you do not have copies of any of the files listed in the catalog you will need to obtain them prior to parsing the document or DTD.
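As a rough illustration of such a mapping, the Python sketch below reads catalog entries of the form PUBLIC "public identifier" "system identifier" and looks up a local file name for a given public identifier. It is a minimal sketch, not an implementation of the full catalog format: it handles only PUBLIC entries, ignores comments and every other entry type, and keeps the first entry when a public identifier appears twice (an assumption of this sketch). The function names and the file names in the example catalog are hypothetical.

```python
import re
from typing import Dict, Optional

# One PUBLIC entry: a quoted public identifier followed by a quoted or bare file name.
PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+(?:"([^"]*)"|(\S+))')


def normalize_public_id(public_id: str) -> str:
    """Collapse runs of white space so line breaks in the literal do not matter."""
    return " ".join(public_id.split())


def load_catalog(catalog_text: str) -> Dict[str, str]:
    """Build a public identifier -> system identifier table from catalog text."""
    mapping: Dict[str, str] = {}
    for match in PUBLIC_ENTRY.finditer(catalog_text):
        public_id = normalize_public_id(match.group(1))
        system_id = match.group(2) if match.group(2) is not None else match.group(3)
        mapping.setdefault(public_id, system_id)  # first entry kept (sketch assumption)
    return mapping


def resolve(public_id: str, catalog: Dict[str, str]) -> Optional[str]:
    """Return the local file recorded for public_id, or None if it is not mapped."""
    return catalog.get(normalize_public_id(public_id))


# Hypothetical catalog; the file names are invented for illustration.
EXAMPLE_CATALOG = '''
PUBLIC "-//The SGML Centre//DTD for Manual Production//EN" "dtds/manual.dtd"
PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" isolat1.ent
'''

catalog = load_catalog(EXAMPLE_CATALOG)
print(resolve("-//The SGML Centre//DTD for Manual Production//EN", catalog))
print(resolve("-//Unknown Owner//DTD Missing//EN", catalog))
```

A lookup that returns None corresponds to the case described above where no mapping exists and one must be created before the document or DTD can be parsed.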
Keyboards can only provide keys for a limited range of characters. While many systems provide a menu option that can be used to select characters that are not directly available on the keyboard these menus do not always provide access to all the characters that are accessible on a printer. For characters that are not part of the SGML character set a mechanism is needed for referencing them that only requires the use of characters known to be in the character set.
Characters that cannot be entered by use of a dedicated keystroke can be entered either as a reference to a previously declared system-specific (SDATA) entity or as a character reference, or by using a combination of both techniques.
A character reference is a reference entered within the text that specifies the required character either by entry of its decimal value or by reference to the function name it has been allocated in the currently defined concrete syntax (e.g. &#RE;).
A special character reference open (CRO) delimiter is used to identify character references. In the reference concrete syntax CRO is defined as &#, giving a typical character reference the form &#nnn;, where nnn is the decimal value of the required character.
Character references can be used within the replacement text of an entity declaration. For example, the entity declaration:
<!ENTITY microns "&#181;m">
could be defined to allow &microns; to be entered to generate the characters µm.
Character references are often used when a markup character, such as a double quote, is required in the replacement text of an entity. For example, to define a piece of quoted text within an entity it could be declared as:
<!ENTITY OPOCE "&#34;Office for Official Publications of the European Commission&#34;">
Alternatively the entity could be defined within single quotes as:
<!ENTITY OPOCE '"Office for Official Publications of the European Commission"'>
If an apostrophe needs to be included in replacement text surrounded by single quote delimiters it must be entered in the form of a character reference to the code whose decimal number is 39 (i.e. &#39;).
Numeric character references are always treated as data (i.e. the character they resolve to is not checked to see if it could be part of a markup delimiter). Characters referenced via a function name, however, will be treated as markup if entered at an appropriate point. For example, &#13; will always be output as a carriage return, whereas &#RE; might be interpreted as the end of an entity reference if placed at the end of an entity name, rather than as a code to be sent to the printer.
One point to remember about numeric character references is that their
numbers may need to be altered if the document's character set is changed, or if
the document is passed to a system using a different character set. For this
reason special characters should, wherever possible, be given function names, or
be defined as
SDATA entities, within a document's character set.
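The dependence of numeric character references on the character set can be illustrated with a short Python sketch. It is a hedged illustration only: the function name is hypothetical, it assumes a single-byte document character set named by a Python codec, it uses the function names and code values of the reference concrete syntax (RE 13, RS 10, SPACE 32, TAB 9), and it does not model the data-versus-markup distinction described above.

```python
import re

# Function characters of the reference concrete syntax and their decimal values.
FUNCTION_CHARS = {"RE": 13, "RS": 10, "SPACE": 32, "TAB": 9}

# &# followed by a decimal number or a function name, closed by a semicolon.
CHAR_REF = re.compile(r"&#([0-9]+|[A-Za-z]+);")


def expand_char_refs(text: str, charset: str = "latin-1") -> str:
    """Replace each character reference with the character it denotes in charset.

    Hypothetical helper: charset must be a single-byte Python codec, so only
    code values 0 to 255 are supported.
    """
    def replace(match: "re.Match[str]") -> str:
        ref = match.group(1)
        if ref.isdigit():
            code = int(ref)
        else:
            # Upper-cased for convenience; unknown names raise KeyError.
            code = FUNCTION_CHARS[ref.upper()]
        return bytes([code]).decode(charset)

    return CHAR_REF.sub(replace, text)


# 181 is the micro sign in ISO 8859-1 but a different character elsewhere.
print(expand_char_refs("&#181;m", "latin-1"))
print(expand_char_refs("&#181;m", "cp437"))
```

The same reference yields different characters under the two codecs, which is the reason the number in a numeric character reference may have to change when a document moves between systems.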
Web SGML Adaptations
When the Web SGML adaptations provided by Annex K of SGML are available an extra delimiter string can optionally be defined in the
Note: The decimal numeric character reference
The predefined data character entities section of the SGML declaration provides an additional mechanism for identifying delimiter start characters. Entity names declared in this subclause are treated as if their replacement text was a numeric character reference.
Note: This means that the replacement text for entities named in this way will always be treated as data rather than markup. Entity names specified as part of the predefined data character entities section of the SGML declaration cannot be used to define entities within an associated document type definition or link type definition, as the definitions provided in the SGML declaration are deemed to occur before any definitions in the prolog.
An entity set is a set of entity declarations that are designed to be used together in a number of prologs. As such they are often stored in a separate file that can be referenced using a publicly declared external entity.
Entity sets are typically used to declare:
Entity sets may also contain:
which complement the entity declarations within the set.
As with all external entities, there are two basic types of entity set:
Publicly declared entity sets are typically used to define sets of characters which are not part of the main ASCII character set or the document character set. The standardized entity names defined in such sets can be converted (once) by any receiving system to the local equivalent by use of appropriate system-specific (SDATA) replacement text.
ISO has defined (in ISO 8879 and in ISO/IEC TR 9573) character sets for:
ISOlat1 for accented and diphthong characters used in Western Europe
ISOlat2 for accented and diphthong characters used in Eastern and Northern Europe
ISOgrk2), together with an alternative set typically used in maths (
Wherever possible entity set declarations should start with a comment declaration indicating the purpose of the set and how it should be invoked. For example, the ISOlat1 entity set starts with the following comment declarations:
<!-- (C) International Organization for Standardization 1986
     Permission to copy in any form is granted for use with
     conforming SGML systems and applications as defined in
     ISO 8879, provided this notice is included in all copies.
-->
<!-- Character entity set. Typical invocation:
     <!ENTITY % ISOlat1 PUBLIC
       "ISO 8879:1986//ENTITIES Added Latin 1//EN">
     %ISOlat1;
-->
To invoke this entity set on a system that has a copy of the relevant entity declaration file users need only add the entity declaration and parameter entity reference specified in the comment to their document type declaration subset.
Private entity sets can be used to define entities that are to be used in a number of locally produced documents. Each entity set should be prepared as a separate file. Each set of entity declarations making up the entity set should be preceded by a comment declaration indicating the purpose of the set and how it should be invoked.
A document type definition can call any number of different entity sets, and contain its own entity declarations. The sequence in which the sets are called may be important. Normally entity sets will be called after the document's entity declarations to ensure that any locally declared entities having the same name as one of the entities in an entity set will retain their local declarations. (The first definition found for an entity name is the one used by the system.) If, for some reason, the same entity name is used in two or more entity sets, the declaration used will be the one in the set whose parameter entity reference was encountered first.
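The precedence rule described above (the first definition found for an entity name is the one used) can be sketched in a few lines of Python. This models only the ordering rule, not SGML parsing: each group of declarations is represented as a list of name and replacement text pairs, and the function name, entity names and replacement texts are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Declaration = Tuple[str, str]  # (entity name, replacement text)


def effective_entities(groups: Iterable[List[Declaration]]) -> Dict[str, str]:
    """Return the declarations in force when groups are read in the given order.

    The first declaration seen for a name is kept; later ones are ignored.
    """
    table: Dict[str, str] = {}
    for group in groups:
        for name, replacement in group:
            table.setdefault(name, replacement)
    return table


# Hypothetical declarations: a local override and two entity sets sharing a name.
local_declarations = [("copy", "(c)")]
entity_set_a = [("copy", "[copy ]"), ("reg", "[reg  ]")]
entity_set_b = [("reg", "(R)"), ("trade", "(TM)")]

# Local declarations first, then the sets in the order they are referenced.
print(effective_entities([local_declarations, entity_set_a, entity_set_b]))
```

Because the local declarations are read first, the local definition of copy is retained, and reg takes its definition from whichever entity set is referenced first.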
International Organization for Standardization (1988), Codes for the representation of languages (ISO 639:1988). Geneva: ISO.
International Organization for Standardization/International Electrotechnical Commission (1994), Information technology - Character set structure and extension techniques (ISO 2022:1994). Geneva: ISO.
International Organization for Standardization/International Electrotechnical Commission (1991), Information processing - SGML support facilities - Registration procedures for Public Text Object Identifiers (ISO 9070:1991(E)). Geneva: ISO.
International Organization for Standardization/International Electrotechnical Commission (1991-4), Information processing - SGML support facilities - Techniques for using SGML - Parts 12-16 (ISO/IEC TR 9573). Geneva: ISO.
International Organization for Standardization/International Electrotechnical Commission (1992), Information technology - Hypermedia/Time-based Structuring Language (HyTime) (ISO/IEC 10744:1992). Geneva: ISO.