© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman


Chapter 6
Entity Declaration and Use

This chapter explains how SGML entities are declared and used. It covers the types of entity, entity references, entity declarations and external entities.

6.1 Types of entity

An entity is defined in ISO 8879 as "a collection of characters that can be referenced as a unit". SGML places no constraints on the maximum size of an entity.

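An entity that contains a complete SGML document is known as an SGML document entity. SGML document entities have three main sections:
  • an SGML declaration
  • a document prolog, containing the document type declaration(s)
  • a document instance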

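SGML document entities can contain embedded references to other entities. There are two main types of entity:
  • general entities, which are referenced within the document instance
  • parameter entities, which are referenced within markup declarations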

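Both of these categories can be further subdivided into:
  • internal entities, whose replacement text is given in the entity declaration
  • external entities, whose contents are stored in a separate file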

Where the replacement text of a general entity should not be parsed when being incorporated into the document it can be declared as a character data entity (CDATA). Where the replacement text is defined in a manner that is system-specific it can be defined as a specific character data entity (SDATA). Where the replacement text contains codes intended to control processing it can be defined as a processing instruction entity (PI).

There are three main types of external entity that would be stored in a separate file:
  • SGML text entities
  • SGML subdocument entities (SUBDOC)
  • non-SGML data entities (NDATA)

Character data (CDATA) and specific character data (SDATA) can also be stored in external entities.

Embedded entities are the key to understanding SGML. Each embedded entity has two components: an entity declaration and one or more entity references. The entity declaration defines the name and contents of the entity: the entity references identify the points at which those contents are to be incorporated into the document.

Entity declarations form part of the document type declaration. Parameter entity references are used within document type and link type declarations to identify the points at which the replacement text of parameter entities is to be read and interpreted. General entity references are used within the document instance to identify the points at which the replacement text or external file defined in the entity declaration are to be incorporated into the text.

Closely associated with SGML entities are character references and short references. Character references allow authors to enter characters that are not available on the keyboard by reference to a character number or a function name. Short references allow single characters, or specially defined groups of characters, to act as a shorthand reference to an entity.

6.2 Entity references

An entity reference is entered into an SGML document to indicate each point at which the contents of a previously defined entity are to be incorporated into the document. There are two types of entity reference:
  • general entity references
  • parameter entity references

6.2.1 General entity references

A general entity reference consists of:
  • an entity reference open (ERO) delimiter
  • the name of the entity
  • a reference end

When the reference concrete syntax is being used the entity reference open delimiter is &. The length of the entity name must not exceed the current NAMELEN quantity, and the name must start with a valid name start character and be followed by valid name characters.

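A reference end is either:
  • a reference close (REFC) delimiter, which is a semicolon (;) in the reference concrete syntax
  • a record end (RE) code
  • nothing at all, when the character that follows the name (e.g. a space) cannot form part of a name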

A typical general entity reference will, therefore, take the form &name;, or just &name if immediately followed by a space or record end code.
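
As a minimal sketch, assuming the reference concrete syntax and the OPOCE entity declared later in this chapter (the <p> element is a hypothetical one used only for illustration), these forms of reference could be entered as:

   <!ENTITY OPOCE "Office for Official Publications of the European Communities">

   <p>Published by the &OPOCE; in Luxembourg.
   <p>Published by the &OPOCE in Luxembourg.
   <p>An &OPOCE;-approved supplier.

In the first line the semicolon ends the reference; in the second the space does. The third line needs the semicolon because a hyphen is a name character in the reference concrete syntax, so &OPOCE-approved would be read as a reference to an entity called OPOCE-approved.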

A special entity, known as the default entity, can be declared in a document type definition. If such a default entity has been declared its contents will be output whenever an otherwise undeclared name is encountered within an entity reference. Normally the default entity will contain a message warning that an unrecognized entity name has been encountered at that point in the document, e.g.:

   *** Reference to undeclared entity found here ***

6.2.2 Parameter entity references

Parameter entity references may only occur within SGML markup declarations. A distinction is made between general entities and parameter entities to avoid the possibility of an author accidentally trying to declare an entity whose name has already been used by a DTD developer. By distinguishing between the uses to which the two types of entity are put, it is possible to unambiguously use the same name for a parameter entity and a general entity.

Parameter entities can also be referenced within markup declarations, such as those used to identify the role of marked sections, that can occur within document instances.

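A parameter entity reference consists of:
  • a parameter entity reference open (PERO) delimiter
  • the name of the parameter entity
  • a reference end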

When the reference concrete syntax is being used the parameter entity reference open delimiter is %. The combined length of the parameter entity reference open delimiter and the parameter entity name may not exceed the current NAMELEN quantity. (The delimiter is treated as part of the entity name.)

A typical parameter entity reference will, therefore, have the form %name;.
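
The following sketch, which assumes the reference concrete syntax, shows the two places in which such references may appear. The heading entity is the one declared later in this chapter; the draft entity and the <p> element are hypothetical names introduced for illustration.

   <!ENTITY % heading "H1|H2|H3|H4|H5|H6">
   <!ELEMENT (%heading;) - - (#PCDATA) >

   <!ENTITY % draft "IGNORE">

   <![ %draft; [ <p>This paragraph is only wanted in draft copies. ]]>

The first reference occurs inside an element declaration in the document type declaration. The second occurs inside a marked section declaration in the document instance: changing the replacement text of draft to INCLUDE would cause the paragraph to be processed.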

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, options in the ENTITIES extension to the FEATURES clause can be used to control whether, and where, entity references can be added to a file. By default no assertions (NOASSERT) are deemed to apply, but users can choose to restrict entity references as follows:
  • no references to entities other than those declared as predefined data character entities are permitted (REF NONE): the document instance must therefore be a reference-free document
  • only references to internal entities and predefined data character entities are allowed (REF INTERNAL): the document instance must therefore be an external-reference-free document
  • external, internal and predefined data character entities may be referenced (REF ANY)
  • only references to integrally stored entities, in which all opened elements and marked sections also end, are permitted (INTEGRAL YES): the document instance must therefore be an integrally-stored document instance
  • elements may start in one entity and end in another (INTEGRAL NO)

Note: If the NOASSERT keyword is not present both the REF and INTEGRAL keywords must be present, followed by one of the listed option identifiers.

6.3 Entity declarations

Entity declarations form part of a document type declaration subset (or a link type declaration subset) defined within the document prolog.

Within the document type declaration subset, each individual entity declaration is entered between its own set of markup declaration delimiters. The reserved name ENTITY (or its previously declared replacement) follows the markup declaration open (MDO) delimiter to identify the declaration as an entity declaration. The rest of the declaration consists of the entity name followed by the replacement entity text to give an entity declaration the general form:

   <!ENTITY name "replacement entity text">

In its simplest form the replacement text will consist of a string of characters delimited by a matched pair of either quotation marks (") or apostrophes ('). A typical SGML text entity might be declared as:

  <!ENTITY OPOCE "Office for Official Publications of the European Communities">

This entity can be referenced by entering &OPOCE; at points in the text of relevant document instances at which the replacement text is to appear.

The replacement text of SGML text entities can include markup codes, including start-tags, embedded entity references, character references, short references and data tags, which will be interpreted as the entity text is added to the document. For example, a general entity declaration might take the form:

   <!ENTITY en-reg "<em lang=fr>en r&egrave;gle</em>" >

When this entity is called, by entering &en-reg; in the text, the program will recognize the embedded text as a French emphasized phrase, bracketed by an <em lang=fr> start-tag and an </em> end-tag. Before outputting this highlighted phrase in the appropriate font the program will expand the reference to the entity called &egrave; to obtain the system specific code needed to generate a lowercase e with a grave accent.

One word of warning: you cannot reference an entity within its replacement text as this will create a recursive loop. For this reason, the replacement string cannot contain any characters that might be treated as short references which should be mapped to the entity being defined.
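
As an illustration, both of the following sets of declarations (using hypothetical entity names) would create such a loop, either directly or through a second entity:

   <!ENTITY SGML "Standard Generalized Markup Language (&SGML;)" >

   <!ENTITY first  "see &second;" >
   <!ENTITY second "see &first;" >

A parser would normally be expected to report the error at the point where one of these entities is referenced, because that is when the embedded reference is expanded.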

6.3.1 Declaring parameter entities

A parameter entity declaration is distinguished from a general entity declaration by having a parameter entity reference open (PERO) delimiter, e.g. %, and one or more spaces immediately in front of the required name to give it the form:

   <!ENTITY % name "replacement text" >

Typically the replacement text for a parameter entity will consist of a series of element type names separated by the relevant SGML model group connectors, e.g.:

   <!ENTITY % heading "H1|H2|H3|H4|H5|H6">

It is important to remember that parameter entities must be declared before the entity is referred to within the document type definition. In most prologs you will find that all parameter entities are declared at the start of the document type definition subset. Where parameter entities are used to define the replacement text required for other parameter entity declarations care must be taken to ensure that the declarations always precede the references. For example, the following declarations are used in Version 4.0 of the HTML DTD:

   <!ENTITY % fontstyle "TT | I | B | BIG | SMALL">
   <!ENTITY % phrase    "EM | STRONG | DFN | CODE |
                         SAMP | KBD | VAR | CITE | ABBR">
   <!ENTITY % special   "A | IMG | OBJECT | BR | SCRIPT |
                         MAP | Q | SUB | SUP | SPAN | BDO">
   <!ENTITY % formctrl  "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
   <!ENTITY % inline    "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">

It is important to ensure that the parameter entities referenced in the replacement text for the inline parameter entity are declared before they are referenced, as they are in the above sequence of declarations.
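
Once declared, such an entity can be referenced in later declarations. The sketch below shows an element declaration of the kind found in the HTML DTD, followed by a deliberately incorrect ordering of two of the declarations shown above:

   <!ELEMENT P - O (%inline;)* >

   <!-- Incorrect: fontstyle is referenced before it has been declared -->
   <!ENTITY % inline    "#PCDATA | %fontstyle;">
   <!ENTITY % fontstyle "TT | I | B | BIG | SMALL">

In the incorrect sequence the reference to %fontstyle; cannot be resolved when the declaration of inline is read, because parameter entity references in replacement text are expanded as the declaration is processed.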

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, the replacement text for a parameter entity occurring as a token separator in a markup declaration must contain only complete tokens, and may not include an unmatched group delimiter.

6.3.2 Comments

The purpose of an entity can be explained by incorporating comments within the definition. The start and end of each comment must be indicated by entering comment (COM) delimiters (a pair of consecutive hyphens in the reference concrete syntax). Like the replacement entity text, comments can take up more than one line, e.g.:

   <!ENTITY disclaim "Users should note that all International
    Standards undergo revision from time to time and that
    any reference made herein to any other International
    Standard implies its latest edition, unless otherwise
    stated." -- Must appear in the Foreword of each ISO
                standard -- >

6.3.3 Special forms of general entity declaration

Variations to the basic declaration allow users to specify a number of special forms of general entity. These are described in turn below.

A special default entity can be declared by using the reserved word #DEFAULT in place of an entity name, e.g.:

   <!ENTITY #DEFAULT
    "*** Reference to undeclared entity found here ***">

The replacement text for this default entity will be used for any general entity reference whose name is not recognized as one of the entities declared in the currently active DTD.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, and ENTITY YES has been declared in the IMPLYDEF section of the SGML declaration, entities that are not declared in the DTD are treated as if they were system-specific external entities whose storage location had been defined using the keyword SYSTEM without a qualifying filename. When this option is being used a default entity cannot be defined in the prolog.

The CDATA keyword can be placed between the entity name and its replacement text to tell the program that the replacement text is to be treated as a character data entity. This means that any characters within the string that could possibly be interpreted as markup codes will be ignored. For example, the declaration:

   <!ENTITY para CDATA "<P>" >

would allow a &para; entity reference to generate the characters <P> rather than the start-tag for a paragraph (which would be output if CDATA was not used).

Where a document, such as this one, contains a lot of text that may be mistaken for markup, it is better to declare special entities that can be used to generate SGML delimiter sequences. The characters most likely to need treating in this way are the less-than sign (<) used at the start of many types of markup declaration and the entity reference open (ERO) delimiter (&). The following declarations could be used to set up entities that would meet this need:

   <!ENTITY lt    CDATA "<" >
   <!ENTITY amp   CDATA "&" >

Using this definition, code for a paragraph start-tag (<P>) could be entered as &lt;P>. (This would not be recognized as a valid start-tag because tags and entity references are only recognized if they are contained within the same entity.) Similarly, the general entity reference &SGML; could be entered as &amp;SGML; to ensure that it is not recognized as an entity called SGML (or the default entity if no such entity has been declared in the DTD).

It should be noted that the semicolon is a compulsory element of both the last two entity references because they are immediately followed by a name character. If the semicolon had been left out of the first example, the program would have tried to find an entity whose declared name was ltP (entity names are case-sensitive in the reference concrete syntax). In the case of the second example, the program would look for an entity called ampSGML. If entities with these names had not previously been declared, and no default entity had been defined, the parser should flag the entity reference as invalid.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the predefined character data entities extension can be used to assign reserved character names for delimiters.

Another approach to this problem is to use the reserved word SDATA to identify the declaration as a specific character data entity. For example, the following entity declarations are provided in ISO 8879:

   <!ENTITY lt     SDATA "[lt    ]" --=less-than sign-->
   <!ENTITY amp    SDATA "[amp   ]" --=ampersand-->

Most SGML-based programs would automatically expand standard ISO entity references such as these to give the code sequence required to generate the character during formatting because the convention of enclosing the entity name in square brackets is one used for all ISO entity sets. But where users have defined their own, system-specific, replacement codes in the entity replacement text the inclusion of the SDATA reserved name in the entity definition will allow receiving programs to request the information needed to generate the requested character(s) on the local text formatter.

Note: While characters defined as valid in the document's character set but as invalid in the document's concrete syntax can be included in SDATA entities, non-SGML characters that have been declared as unused in the document's character set cannot be entered as part of the replacement text of an SDATA entity.

When short references are being used the replacement text for some entities will consist solely of an element start-tag or end-tag. In such cases the role of the entity can be unambiguously defined by preceding the undelimited element type name, and any associated attributes, with a STARTTAG or ENDTAG reserved name. For example, the declaration:

   <!ENTITY refstart STARTTAG "sub align=left" >

will cause the program to replace &refstart; with <sub align=left>, while:

   <!ENTITY refend ENDTAG "sub" >

will cause it to replace &refend; with </sub>.

Alternatively the two entities could have been defined as:

   <!ENTITY refstart  "<sub align=left>" >
   <!ENTITY refend    "</sub>" >

but in this case the program would not know that the replacement text contained a markup tag which needed further processing until it had added the replacement text to the main text stream. In addition, while the first pair of definitions would work irrespective of what definition is used for markup delimiters, the second pair of definitions will only work while the reference concrete syntax definitions for STAGO, ETAGO and TAGC are in force.

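Other keywords can be used to identify parameter or general entities whose replacement text defines an embedded SGML markup instruction. The reserved words that are placed between the entity name and the replacement text to identify such entities are:
  • MS, where the replacement text forms the contents of a marked section
  • MD, where the replacement text forms the contents of a markup declaration
  • PI, where the replacement text forms the contents of a processing instruction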

Typically these keywords will be used in entity declarations such as:

   <!ENTITY special MD "USELINK special">

This declaration allows the entity reference &special; to be used to generate a <!USELINK special> markup declaration in the text at a point where special processing of embedded elements is required.

6.4 External entities

External entities are declared with an external entity specification in place of the replacement text. The external entity specification indicates the source of the data to be added to the document and, optionally, the type of entity being defined.

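Two types of external entity are recognized by SGML:
  • system-specific external entities, identified by the reserved name SYSTEM
  • publicly declared external entities, identified by the reserved name PUBLIC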

Each of these main types can be further subdivided into one of three entity types:
  • SGML text entities
  • SGML subdocument entities
  • data entities

6.4.1 System-specific external entities

The simplest way of declaring a system-specific external entity that is known only to the systems it is used on is to use the reserved name SYSTEM in place of the replacement text in the entity declaration, in a declaration of the form:

   <!ENTITY file1 SYSTEM >

where file1 is a valid entity name that is also recognized by the system as a reference to a file on the local storage system.

Note: Experience has shown that the use of this simplified form of referencing external entities leads to problems when interchanging documents between systems. For this reason the use of this shortened form for the identification of external entities within distributed systems is discouraged.

More typically, however, the SYSTEM keyword in an entity declaration will be qualified by a system identifier that uniquely identifies the source of the required entity. In many instances the system identifier will consist of a file name, optionally qualified by a pathname, e.g.:

   <!ENTITY module4 SYSTEM "c:\SGML\course\module4.sgm" >

When the program encounters a &module4; entity reference within the text, it will call the file identified by the system identifier and parse its contents as SGML encoded text at the point identified by the entity reference.
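
A minimal sketch of how such declarations might be used to assemble a document from separately stored files is shown below. The course document type, the course.dtd file and the module3 entity are hypothetical names added for illustration.

   <!DOCTYPE course SYSTEM "course.dtd" [
     <!ENTITY module3 SYSTEM "c:\SGML\course\module3.sgm" >
     <!ENTITY module4 SYSTEM "c:\SGML\course\module4.sgm" >
   ]>
   <course>
   &module3;
   &module4;
   </course>

Each referenced file would contain only marked-up text that is valid at the point of reference; it would not have a document type declaration of its own.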

In 1996 a new annex to ISO/IEC 10744, the Hypermedia/Time-based Structuring Language (HyTime), introduced the concept of formal system identifiers (FSIs) to SGML. A formal system identifier has a structured form that identifies both the file required and the source of the file. For example, an FSI could be used to identify that a file has been referenced through an Internet Uniform Resource Locator (URL) using a system identifier of the following form:

   <!ENTITY chapter4 SYSTEM 
                     "<url SOIbase='http://www.u-net.com/~sgml/'>sgml-4.htm" >

6.4.2 Alternative markup notations

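When a system-specific external entity contains data that has been coded using a form of markup that differs from that used in the main document the system identifier can be qualified by an entity type statement. Four entity types are recognized within SGML:
  • SGML subdocument entities (SUBDOC)
  • character data entities (CDATA)
  • specific character data entities (SDATA)
  • non-SGML data entities (NDATA)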

If an external entity contains a complete SGML-coded document, including the appropriate document type declaration, it can be declared as a system-specific SGML subdocument entity by placing the reserved word SUBDOC after the entity's system identifier, e.g.:

     <!ENTITY table1 SYSTEM "table1.sgm" SUBDOC >

Note: This feature can only be used if the SGML declaration has a FEATURES clause containing a SUBDOC YES entry. It should be noted, however, that the FEATURES clause defaults to SUBDOC NO.
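
As a sketch, a FEATURES clause that enables subdocuments might read as follows. The other entries are those of the default SGML declaration, and the number following SUBDOC YES, which sets the maximum number of subdocument entities that may be open at one time, is an illustrative value.

   FEATURES
     MINIMIZE DATATAG NO  OMITTAG  YES  RANK     NO  SHORTTAG YES
     LINK     SIMPLE  NO  IMPLICIT NO   EXPLICIT NO
     OTHER    CONCUR  NO  SUBDOC   YES 1  FORMAL NO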

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, files containing entities identified as subdocuments may start with an SGML declaration or an SGML declaration reference.

When an SGML program that has received the above entity declaration encounters a &table1; entity reference, it should store its current parsing state before calling up the file called table1.sgm. While it is processing the subdocument it will use the document type declaration identified at the start of the subdocument file, rather than the one previously being used. When the end of the subdocument is reached, the parsing state stored when the entity reference was encountered will be restored.

Where the retrieved entity contains data that is not coded in SGML the entity must be declared as a data entity. This is done by entering a reserved word (CDATA, SDATA or NDATA) immediately after the system identifier. A notation name identifying the type of coding used within the data entity must follow the keyword, optionally followed by any data attributes that are required to process the contents of the referenced file.

If an Encapsulated PostScript file has been created it could be incorporated into a document by adding an entity declaration of the form:

   <!ENTITY fig1 SYSTEM "fig1.eps" NDATA postscript >

A reference of the form &fig1; will cause the parser to pass the file called fig1.eps in the current working directory to the process identified in the notation declaration that has been given the name postscript.
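
A common alternative, not described above, is to name the data entity as the value of an attribute whose declared value is ENTITY, rather than referencing it within the text. The sketch below assumes a hypothetical figure element with a hypothetical file attribute:

   <!NOTATION postscript SYSTEM "eps.bat" >
   <!ENTITY fig1 SYSTEM "fig1.eps" NDATA postscript >
   <!ELEMENT figure - O EMPTY >
   <!ATTLIST figure file ENTITY #REQUIRED >

   <figure file=fig1>

With this approach the application, rather than the position of an entity reference, decides how and where the illustration is presented.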

6.4.3 Notation declarations

Notation declarations have the general form:

   <!NOTATION name identifier >

where name is the notation name (used after NDATA in the entity declaration) and identifier is a valid notation identifier, which is either system-specific or publicly declared.

If the notation identifier is system-specific, it will consist of the reserved word SYSTEM followed by a system identifier that identifies the process control file that needs to be activated to parse the non-SGML data, e.g.:

   <!NOTATION postscript SYSTEM "eps.bat" >

When the system has finished processing the data it will transmit a special, system dependent, signal, known as an entity end signal, to the SGML parser.

Note: This signal is output by the system at the end of each entity to tell the parser that it can continue processing the rest of the text now that the entity reference has been satisfied. The entity end signal is not a control code, and need not be one of the codes declared within the document's character set. It can be any signal recognized by the operating system as an indication that the end of an entity's replacement text has been reached.

Where an external entity contains character data, or other system-specific information, its declaration must also be qualified by a suitable notation name, e.g.:

   <!ENTITY example1 SYSTEM "example1.dtd" CDATA SGML >
   <!ENTITY our-logo SYSTEM "our-logo.out" SDATA logo >

where the associated notation declarations could take the form:

   <!NOTATION SGML SYSTEM "newstream.in" >
   <!NOTATION logo SYSTEM "Logo.bat" >

Data attributes

When a notation declaration has been associated with a data entity the notation name can optionally be qualified by data attributes. These data attributes can either be passed to the system as parameters associated with the commands that activate the required notation interpreter, or they can be used to determine which commands should be sent to the system.

Data attributes are declared in the same way as other attributes except that the associated element type statement is replaced by the name(s) of the notation(s) the attributes are to be associated with. To indicate the changed role of the attribute definition the reserved name #NOTATION must precede the notation name(s).

To see how data attributes can be used in practice, consider the following data attribute definition:

   <!ATTLIST #NOTATION (postscript|TeX) width NUTOKEN #IMPLIED
                                        depth NUTOKEN #IMPLIED >

Here two data attributes, width and depth, have been associated with the notation declarations whose names are postscript and TeX. If no values are entered for these attributes the width and depth of the illustration will be as supplied. If the width or depth of the illustration to be processed are to be altered during processing the entity declaration can be extended to read:

   <!ENTITY fig1 SYSTEM "fig1.eps" NDATA postscript [width="5in" depth="3in"]>

Notice that, within the entity declaration, the data attribute specification uses the currently defined declaration subset open (DSO) and declaration subset close (DSC) codes to delimit the entered list of attributes. In the reference concrete syntax these are the open and close square brackets respectively.

Like other attributes, data attributes can be minimized if the permitted values have been defined as a name token group. For example, if the attribute list was extended to read:

   <!ATTLIST #NOTATION (postscript|TeX) width NUTOKEN #IMPLIED
                                        depth NUTOKEN #IMPLIED
                                        align (left|right|centre) left >

the entity declaration could take the form:

   <!ENTITY fig1 SYSTEM "fig1.eps" NDATA 
                        postscript [width="5in" depth="3in" centre]>

Note: When defining attributes for use with notations the ENTITY, ENTITIES, ID, IDREF, IDREFS and NOTATION keywords must not be used as these keywords can only be used as declared values for attributes associated with elements. Similarly the #CURRENT and #CONREF default value options cannot be used.

6.4.4 Publicly declared external entities

Publicly declared external entities are external entities that contain declarations, text or other data designed to be used on more than one SGML system.

Many publicly declared entities consist solely of a pre-defined set of markup declarations which can be used to extend document type declaration subsets defined within the prolog. When the relevant parameter entity reference is encountered in the document type declaration, the program will add the declarations it has previously stored as a publicly declared entity to any local declarations.

The advantage of using publicly declared entities is that the declarations do not need to be transmitted between systems when the receiving system is already known to have access to them. Instead, all the user needs to do is add the necessary public entity declarations, with the associated references, to the document to tell receiving systems which sets of declarations will be referenced in transmitted documents.

Publicly declared external entities are said to be "publicly declared" because the relevant declarations have been assigned names known by receiving systems, but the declarations can be "private" in the sense that the associated definitions are only provided to a closed user community.

There are, however, certain publicly declared entities that may truly be called "publicly declared". These contain sets of declarations that have been defined by one of the organizations authorized by the International Organization for Standardization (ISO) to keep registers of declarations used in more than one document. Once a declaration set has been registered in this way it will have a unique name by which it can be recognized by all systems referencing the standardized data.

The International Organization for Standardization (ISO) has defined entity sets to identify most commonly used Latin, Greek and Cyrillic characters, and other constructs defined in international standards. Such sets are identified by special ISO owner identifiers. ISO owner identifiers start with the letters ISO followed by the number and date of the standard referred to, e.g. "ISO 8879:1986".

Note: Before requesting any publicly declared entity it is important to check that the relevant declarations will be available to any system receiving the marked-up SGML document. The fact that an entity has been publicly declared does NOT mean that it will be known to all SGML systems; it simply means that its definition does not need to be transmitted between systems that already know the definition.

Requesting publicly declared external entities

Publicly declared external entities which just contain text can be requested by entering a declaration of the form:

   <!ENTITY name PUBLIC "public identifier" >

This entity can be recalled at any point in the text by entering a general entity reference of the form &name;.

Optionally the public identifier can be followed by the filename used to store the entity contents on the local system, e.g.

   <!ENTITY name PUBLIC "public identifier" "filename.ext" >

Because this makes the entity declaration less portable, however, this form of external entity declaration is discouraged.

Note: With the development of the concept of entity catalogs by the members of the SGML Open vendors' consortium there is nowadays little need for this form of qualified public identifier.

If the external entity contains SGML markup declarations that are to be added to the document type declaration subset it must be declared by entering a parameter entity declaration of the form:

     <!ENTITY % name PUBLIC "public identifier">

the entity then being recalled by entering the relevant parameter entity reference (e.g. %name;) at some point between the entity declaration and the declaration subset close character (e.g. ]) marking the end of the document type declaration subset.

The public identifier used in the entity declaration of the above external entities is either a formal public identifier or a name agreed between users.

Note: If the FEATURES clause of the current SGML declaration contains the entry FORMAL YES the public identifier must be a formal public identifier.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, and the FEATURES clause contains an entry reading URN YES, the contents of the public identifier will be interpreted according to the rules specified in Internet Engineering Task Force document RFC 2141 governing the creation of Uniform Resource Names.

Note: If both FORMAL YES and URN YES have been specified, public identifiers are interpreted either as formal public identifiers or as URNs.

ISO 8879 restricts the characters that may be used in public identifiers to a special set of minimum data characters consisting of the uppercase and lowercase alphabetic letters, spaces (or RE and RS codes), numbers and the following characters, which are declared to be part of a special character class:

     ' ( ) + , . - / : = ?

Note: This list of special characters cannot be extended or otherwise altered in the SGML declaration. If you wish to use an agreed name to identify a set of declarations you must make sure your name consists only of characters mapped in the current document character set to these special characters or one of the ISO 646 (ASCII) alphanumeric characters.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, the following characters are added to the list of minimum data characters allowed within public identifiers:
; ! * # @ $ _ %

These characters are added so that Internet Uniform Resource Names (URNs) can be used as formal public identifiers.

Formal public identifiers

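Formal public identifiers fall into one of three categories:
  • those with an ISO owner identifier
  • those with a registered owner identifier
  • those with an unregistered owner identifier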

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, Internet domain names can be used to identify owners of formal public identifiers.

The rules for defining ISO assigned public identifiers are defined in ISO 9070:1991, Information Processing - SGML Support Facilities - Registration Procedures for Public Text Owner Identifiers. A typical ISO registered entity set will be identified by a declaration, within the document type declaration subset, of the form:

   <!ENTITY % ISOlat1 PUBLIC
     "ISO 8879:1986//ENTITIES Added Latin 1//EN" >

This declaration must then be invoked by incorporating the parameter entity reference %ISOlat1; into the document type declaration subset before the closing square bracket, e.g.:

   <!DOCTYPE docname [
     <!ENTITY % ISOlat1 PUBLIC
       "ISO 8879:1986//ENTITIES Added Latin 1//EN" >
     .
     .      -- other required declarations --
     .
     %ISOlat1; ]>

ISO assigned public identifiers have three components, separated from each other by a pair of solidus strokes (slashes). The three components are:

Entity sets registered by bodies other than ISO will use a registered owner identifier in place of the ISO owner identifier. The registered name is preceded by +// to identify the following identifier as one applying to a registered set. A typical declaration might be:

   <!ENTITY % EC-acts PUBLIC
     "+//OPOCE//DTD for European Community Acts//EN">

One special class of registered owner identifier that is of special interest to book publishers is that assigned to the International Standard Book Numbering Agency in ISO 9070. This special identifier allows publishers to be identified by reference to the group and publisher identifiers that form the first part of the ISBN numbers assigned to their books. For example, a set of entity declarations for use in European Commission publications could be assigned an identifier of the form:

   <!ENTITY % OPOCE PUBLIC "+//ISBN 93 826::Office for Official Publications
              of the European Communities//ENTITIES Accented Characters//EN" >

Companies can apply to register their own ISO 9070 owner names through the Graphic Communications Association, who are located at 100 Dangerfield Road, Alexandria, Virginia, USA.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available the keyword IDN can follow the +// to indicate that the rest of the registered owner identifier is an Internet Domain Name that has been registered by one of the organizations authorized to apply the rules defined by the Internet Assigned Numbers Authority (IANA). For example, +//IDN www.sgml.u-net.com can be used to identify external entities created by The SGML Centre. If ISO 9070 structured naming is to be applied to an Internet Domain Name then the same address space could be specified as +//IDN u-net.com::sgml::www.

Where the declarations have not been formally registered, an unregistered owner identifier must be used as the owner identifier. This has the same form as a registered owner identifier, except that a hyphen is used in place of the initial plus. The name used to identify the owner must consist only of the minimum data characters (alphanumeric or special) allowed in public identifiers. A typical declaration might take the form:

   <!ENTITY % SGMLdoc PUBLIC
     "-//The SGML Centre//DTD for Manual Production//EN">

The public text class name that follows the owner identifier may have one of the following values:

Note: All public text class keywords must be entered using capital letters only.

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available SD can be used to identify an externally stored SGML declaration body referenced using the new SGML declaration reference form of specifying SGML declarations.

Only one keyword can appear in any formal identifier, though most of the keywords can be used more than once in a document. (Only one syntax or capacity declaration can be made in any SGML declaration.)

Where the TEXT, NONSGML or SUBDOC public text class keywords are used in a public identifier the entity must be defined as a general entity. The ELEMENTS, ENTITIES and SHORTREF keywords may only be used in parameter entity declarations because they refer to files that contain markup declarations that need to be added to any local declarations within the document type declaration subset. The DTD and LPD keywords must be associated with document type declarations and link process definitions respectively. The CHARSET, SYNTAX and CAPACITY keywords may only be used in the relevant section of an SGML declaration.

The NOTATION keyword differs slightly from the other public text class keywords in that it is only used to qualify notation declarations. It is typically used in the form:

   <!NOTATION postscript PUBLIC "-//my-system//NOTATION EPS Processor//EN">

This declaration defines a notation called postscript as a locally recognized notation that will be used to process Encapsulated PostScript files.

Each public text class keyword is qualified by a public text description explaining the purpose of the publicly declared entity. (This description is restricted to the alphanumeric and special minimum data characters used for public identifiers.) Where the entity consists of declarations that are not generally available to the public, the public text description should be preceded by an unavailable text indicator (-//) to give the identifier the form:

   <!ENTITY % name PUBLIC
            "-//owner//class -//description//language" >

The public text language parameter that normally ends a formal public identifier must be one of the two-character codes for identifying languages defined in ISO 639. This language code tells the system which language the public text has been prepared in. The most commonly encountered codes are:

Codes have also been defined for most European languages, including "dead" languages such as Latin, and for international languages such as Esperanto, Interlingua and Interlingue.

Notice that the language codes are all defined as a pair of capital letters. These codes cannot be replaced by the equivalent lower-case letters within the formal public identifier, even if the name case rules in the SGML declaration permit general substitution of tag characters, because the standard specifically states that the uppercase form must be used for all public text language values.

Where the public text class keyword is CHARSET the public text language code is replaced by a public text designating sequence. This sequence of codes uniquely identifies the selected character set by using techniques defined in ISO 2022. Each sequence starts with an Escape code (1/11, hexadecimal 1B) followed by a number indicating the type of code set being described. This number is further qualified by one or more numbers identifying the required set of characters.

The types of code that can be indicated by the first code after the Escape code include:

Note: ESC 2/5 4/0 is a special sequence used to return to the basic G0 character set from a character set defined outside ISO 2022.

A typical use of a public text designating sequence to define a document's character set in an SGML declaration would be:

   CHARSET BASESET "ISO6937:1994//CHARSET Latin Alphabet//ESC 2/14 4/1"

The 96-character set required is that specified in the 1994 version of ISO/IEC standard 6937 which, in this instance, is being used as a G2 code set.
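The column/row notation used in designating sequences maps directly onto code values: a position written as c/r stands for the value 16 × c + r, so the Escape code at 1/11 is hexadecimal 1B. The short Python sketch below illustrates the arithmetic. The function name designating_sequence_bytes is invented for this example and forms no part of ISO 2022 or ISO 8879.

```python
def designating_sequence_bytes(sequence):
    """Convert a sequence such as 'ESC 2/14 4/1' into the bytes it denotes."""
    out = bytearray()
    for token in sequence.split():
        if token == "ESC":
            out.append(0x1B)  # the Escape code, position 1/11
            continue
        column, row = token.split("/")
        out.append(int(column) * 16 + int(row))
    return bytes(out)


# Show the byte values of the sequence used in the example above.
print(designating_sequence_bytes("ESC 2/14 4/1").hex(" "))
```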

Where a publicly declared entity consists of entity declarations which contain system-specific data (i.e. use the SDATA option) the associated formal public identifier can be further qualified by the addition of a public text display version description. This description identifies which types of device the entities will be recognized by.

A typical extended entry might take the form:

   <!ENTITY % Ventura PUBLIC
           "-//The SGML Centre//ENTITIES Dingbats used in Ventura//EN//Ventura">

Note: When creating a formal public identifier it is important not to split the literal string immediately after one of the pairs of slashes (solidi) used to identify the start of a new component within the public identifier, because the start of a new line in the delimited string will be treated as if a space had been entered, and spaces are not permitted immediately after a slash.
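The component structure described in this section can be summarized by a small parser. The Python sketch below is an illustration only, not an implementation of the full ISO 8879 and ISO 9070 rules: the names FormalPublicId and parse_fpi are invented for the example, and the only checks made are those stated in the text (the owner prefix, the capitalized public text class keyword and the optional unavailable text indicator).

```python
from typing import NamedTuple, Optional


class FormalPublicId(NamedTuple):
    owner_kind: str                  # "ISO", "registered" or "unregistered"
    owner: str
    text_class: str
    unavailable: bool                # True when -// precedes the description
    description: str
    language: str                    # a designating sequence for CHARSET
    display_version: Optional[str]


def parse_fpi(literal):
    # A line break inside the literal counts as a space, so normalize first.
    fpi = " ".join(literal.split())
    parts = fpi.split("//")
    if parts[0] == "+":
        owner_kind, parts = "registered", parts[1:]
    elif parts[0] == "-":
        owner_kind, parts = "unregistered", parts[1:]
    else:
        owner_kind = "ISO"
    if len(parts) < 3:
        raise ValueError("too few components in " + repr(fpi))
    owner, rest = parts[0], parts[1:]
    text_class, _, description = rest[0].partition(" ")
    if not text_class.isupper():
        raise ValueError("public text class must be in capital letters")
    unavailable = description == "-"
    if unavailable:
        if len(rest) < 3:
            raise ValueError("missing description in " + repr(fpi))
        description, rest = rest[1], rest[2:]
    else:
        rest = rest[1:]
    if len(rest) > 2:
        raise ValueError("too many components in " + repr(fpi))
    display_version = rest[1] if len(rest) == 2 else None
    return FormalPublicId(owner_kind, owner, text_class, unavailable,
                          description, rest[0], display_version)


print(parse_fpi("-//The SGML Centre//ENTITIES Dingbats used in Ventura//EN//Ventura"))
```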

The way in which formal public identifiers are resolved into local system identifiers is system dependent. Software conforming to the rules laid down by the SGML Open consortium will use exchangeable SGML Open catalogs to record these mappings. If you receive a document without receiving a catalog with it you will need to create your own mapping between formal public identifiers and local files. If you receive a catalog with the files you will need to ensure that the filenames shown in the catalog are available at the specified location. If your versions of the referenced files are stored at other locations you will need to change the location shown in the catalog. If you do not have copies of any of the files listed in the catalog you will need to obtain them prior to parsing the document or DTD.
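To give a feel for what such a mapping involves, the Python sketch below reads catalog entries of the form PUBLIC "public identifier" "system identifier" and looks up a public identifier. It is a deliberately minimal illustration, not a conforming catalog processor: only quoted PUBLIC entries are recognized, the file names in the sample catalog are hypothetical, and the choice to let the first entry for an identifier win is an assumption of the sketch.

```python
import re

# Matches entries of the form: PUBLIC "public identifier" "system identifier"
PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')


def normalize(public_id):
    """Collapse runs of spaces and line breaks so that lookups are consistent."""
    return " ".join(public_id.split())


def load_public_entries(catalog_text):
    mapping = {}
    for public_id, system_id in PUBLIC_ENTRY.findall(catalog_text):
        # Assumption: the first entry for an identifier takes precedence.
        mapping.setdefault(normalize(public_id), system_id)
    return mapping


# Hypothetical catalog; the file names are invented for this example.
CATALOG = '''
PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" "entities/isolat1.ent"
PUBLIC "-//The SGML Centre//DTD for Manual Production//EN" "dtds/manual.dtd"
'''

entries = load_public_entries(CATALOG)
# An identifier with no entry yields None: the file must be obtained or mapped.
print(entries.get(normalize("ISO 8879:1986//ENTITIES Added Latin 1//EN")))
print(entries.get(normalize("-//Unknown//DTD Missing//EN")))
```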

6.4 Character references

Keyboards can only provide keys for a limited range of characters. While many systems provide a menu option that can be used to select characters that are not directly available on the keyboard these menus do not always provide access to all the characters that are accessible on a printer. For characters that are not part of the SGML character set a mechanism is needed for referencing them that only requires the use of characters known to be in the character set.

Characters that cannot be entered by use of a dedicated keystroke can be entered either as a reference to a previously declared system-specific (SDATA) entity or as a character reference, or by using a combination of both techniques.

A character reference is a reference entered within the text that specifies the required character either by entry of its decimal value or by reference to the function name it has been allocated in the currently defined concrete syntax (e.g. RE, RS, TAB or SPACE).

A special character reference open (CRO) delimiter is used to identify character references. In the reference concrete syntax CRO is defined as &#, giving a typical character reference the form &#181; or &#RE;.

Character references can be used within the replacement text of an entity declaration. For example, the entity declaration:

   <!ENTITY microns "&#181;m">

could be defined to allow &microns; to be entered to generate the characters µm.

Character references are often used when a markup character, such as a double quote, is required in the replacement text of an entity. For example, to define a piece of quoted text within an entity it could be declared as:

   <!ENTITY OPOCE
     "&#34;Office for Official Publications of the European Commission&#34;">

Alternatively the entity could be defined within single quotes as:

   <!ENTITY OPOCE
     '"Office for Official Publications of the European Commission"'>

If an apostrophe needs to be included in replacement text surrounded by single quote delimiters it must be entered in the form of a character reference to the code whose decimal number is 39 (i.e. &#39;).

Numeric character references are always treated as data (i.e. the character they resolve to is not checked to see if it could be part of a markup delimiter). Characters referenced via a function name, however, will be treated as markup if entered at an appropriate point. For example, &#13; will always be output as a carriage return, whereas &#RE; might be interpreted as the end of an entity reference if placed at the end of an entity name, rather than as a code to be sent to the printer.

One point to remember about numeric character references is that their numbers may need to be altered if the document's character set is changed, or if the document is passed to a system using a different character set. For this reason special characters should, wherever possible, be given function names, or be defined as SDATA entities, within a document's character set.
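The effect of a numeric character reference can be imitated in a few lines of Python. This is only a sketch of the idea and not how an SGML parser works: the function name expand_decimal_references is invented, only references closed by a semicolon are handled, and the document character set is assumed to share its code numbers with Unicode (as ISO 646 and ISO 8859-1 do), which is exactly the dependency warned about above.

```python
import re

# A decimal character reference: CRO (&#), digits, then a closing semicolon.
DECIMAL_REFERENCE = re.compile(r"&#([0-9]+);")


def expand_decimal_references(text):
    """Replace each decimal character reference by the character it numbers."""
    # Assumption: code numbers in the document character set match Unicode.
    return DECIMAL_REFERENCE.sub(lambda m: chr(int(m.group(1))), text)


print(expand_decimal_references("&#181;m"))
print(expand_decimal_references("&#34;Office for Official Publications&#34;"))
# Function name references such as &#RE; are deliberately left untouched.
print(expand_decimal_references("line one&#RE;"))
```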

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available an extra delimiter string can optionally be defined in the DELIM section of the SYNTAX clause to identify the start of hexadecimal character references. For example, in the SGML declaration defined for XML the hexadecimal character reference open (HCRO) delimiter is defined by adding HCRO "&#38;#x" to the DELIM entry immediately after GENERAL SGMLREF. This indicates that character references can be made by entry of a hexadecimal number preceded by &#x and followed by a valid reference end code (e.g. &#x0A;).

Note: The decimal numeric character reference &#38; has to be used to identify that the hexadecimal character reference open string starts with a character that would otherwise be identified as the start of a possible character reference within the SGML declaration (&).
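A hexadecimal reference names the same code as the equivalent decimal reference; only the notation differs. As a hedged illustration (the function name expand_hex_references is invented, and code numbers are again assumed to coincide with Unicode), the idea can be sketched in Python as:

```python
import re

# A hexadecimal character reference: &#x, hexadecimal digits, a semicolon.
HEX_REFERENCE = re.compile(r"&#x([0-9A-Fa-f]+);")


def expand_hex_references(text):
    """Replace each hexadecimal character reference by its character."""
    # Assumption: code numbers in the document character set match Unicode.
    return HEX_REFERENCE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Hexadecimal B5 is decimal 181, the micro sign.
print(expand_hex_references("&#xB5;m"))
print(int("0A", 16))
```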

The predefined data character entities section of the SGML declaration provides an additional mechanism for identifying delimiter start characters. Entity names declared in this subclause are treated as if their replacement text was a numeric character reference.

Note: This means that the replacement text for entities named in this way will always be treated as data rather than markup. It should be noted that entity names specified as part of the predefined data character entities section of the SGML declaration cannot be used to define entities within an associated document type definition or link type definition, as the definitions provided in the SGML declaration are deemed to occur before any definitions in the prolog.

6.5 Entity sets

An entity set is a set of entity declarations that are designed to be used together in a number of prologs. As such they are often stored in a separate file that can be referenced using a publicly declared external entity.

Entity sets are typically used to declare:

Entity sets may also contain:

which complement the entity declarations within the set.

As with all external entities, there are two basic types of entity set:

6.5.1 Publicly declared entity sets

Publicly declared entity sets are typically used to define sets of characters which are not part of the main ASCII character set or the document character set. The standardized entity names defined in such sets can be converted (once) by any receiving system to the local equivalent by use of appropriate system-specific (SDATA) replacement text.

ISO has defined (in ISO 8879 and in ISO/IEC TR 9573) character sets for:

Wherever possible entity set declarations should start with a comment declaration indicating the purpose of the set and how it should be invoked. For example, the ISOlat1 set starts with the following comment declarations:

<!-- (C) International Organization for Standardization 1986
     Permission to copy in any form is granted for use with
     conforming SGML systems and applications as defined in
     ISO 8879, provided this notice is included in all copies.
-->
<!-- Character entity set. Typical invocation:
     <!ENTITY % ISOlat1 PUBLIC 
       "ISO 8879:1986//ENTITIES Added Latin 1//EN">
     %ISOlat1;
-->

To invoke this entity set on a system that has a copy of the relevant entity declaration file users need only add the entity declaration and parameter entity reference specified in the comment to their document type declaration subset.

6.5.2 Private entity sets

Private entity sets can be used to define entities that are to be used in a number of locally produced documents. Each entity set should be prepared as a separate file. Each set of entity declarations making up the entity set should be preceded by a comment declaration indicating the purpose of the set and how it should be invoked.

A document type definition can call any number of different entity sets, and contain its own entity declarations. The sequence in which the sets are called may be important. Normally entity sets will be called after the document's entity declarations to ensure that any locally declared entities having the same name as one of the entities in an entity set will retain their local declarations. (The first definition found for an entity name is the one used by the system.) If, for some reason, the same entity name is used in two or more entity sets, the declaration used will be the one in the set whose parameter entity reference was encountered first.
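The "first definition wins" rule can be modelled as a table that is only ever added to. The Python sketch below is an illustration of the rule rather than of any real parser: the function name build_entity_table, the set names and the entity declarations are all invented for the example.

```python
def build_entity_table(declaration_groups):
    """Record, for each entity name, the first declaration encountered."""
    table = {}
    for source, declarations in declaration_groups:
        for name, replacement in declarations:
            # setdefault keeps an existing entry, so later duplicates are ignored.
            table.setdefault(name, (replacement, source))
    return table


# Groups are listed in the order in which they are encountered in the subset:
# local declarations first, then each entity set as it is referenced.
groups = [
    ("local declarations", [("copy", "(c)")]),
    ("entity set A", [("copy", "&#169;"), ("reg", "&#174;")]),
    ("entity set B", [("reg", "(R)"), ("trade", "(TM)")]),
]

table = build_entity_table(groups)
for name in ("copy", "reg", "trade"):
    print(name, table[name])
```

Here the local declaration of copy survives because it precedes entity set A, and reg is taken from entity set A because that set is referenced before entity set B.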

References

International Organization for Standardization (1988), Codes for the representation of languages (ISO 639:1988) Geneva: ISO.

International Organization for Standardization/International Electrotechnical Commission (1994), Information technology - Character set structure and extension techniques (ISO 2022:1994) Geneva: ISO.

International Organization for Standardization/International Electrotechnical Commission (1991), Information processing - SGML support facilities - Registration procedures for Public Text Owner Identifiers (ISO 9070:1991(E)) Geneva: ISO.

International Organization for Standardization/International Electrotechnical Commission (1991-4), Information processing - SGML support facilities - Techniques for using SGML - Parts 12-16 (ISO/IEC TR 9573) Geneva: ISO.

International Organization for Standardization/International Electrotechnical Commission (1992), Information technology - Hypermedia/Time-based Structuring Language (HyTime) (ISO/IEC 10744:1992) Geneva: ISO.