© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
This chapter gives a brief overview of the roles of marked sections and processing instructions.
Marked sections can be used to identify sections of text that need special processing. There are two principal reasons for marking sections of text:
Where examples of SGML markup need to be included in a document they can be
entered as part of a marked section to ensure they are not misinterpreted as
markup. Using this approach obviates the need to convert the initial characters
(e.g. < and &) of markup sequences used in
examples to references to CDATA or SDATA entities
(e.g. &lt; and &amp;).
Where a single document contains more than one version of the text, marked sections allow users to determine which sections should be processed during a particular parse. Text which may or may not appear in the printed version can also be entered as marked sections so that users can determine, for example, when they should be included in drafts or proofs of the text, or left out of final versions of the text. Marked sections can also be used to identify text that has been added to a document on a temporary basis.
Marked sections can be used within the document prolog to identify variants of document type definitions. For example, they are used extensively in Text Encoding Initiative DTDs, and within later versions of the HTML DTD, where they identify elements that have been retained to ensure backward compatibility but whose use in new documents is deprecated.
Marked section declarations are enclosed between special
marked section start and marked section end
delimiter sequences within the document instance or prolog. The marked section
delimiter sequences are both made up of a combination of two other delimiters.
The marked section start delimiter sequence consists of the markup declaration
open (MDO) delimiter immediately followed by a declaration subset
open (DSO) delimiter. This initial DSO is followed
by a status keyword specification identifying the type of
marked section required, which is followed by a second DSO
delimiter that identifies the start of the marked section text.
The end of each marked section is indicated by a special marked
section close (MSC) delimiter sequence which is
immediately followed by a markup declaration close (MDC)
delimiter. The marked section close delimiter will consist of two matching
characters which are obviously linked to the DSO delimiters used
for the marked section start sequence. For example, in the reference concrete
syntax the opening square brackets used for the two DSO delimiters
in the marked section start sequence are matched by two closing square brackets
(]]) in the MSC. This gives a marked section the
form:
<![ status-keywords [ ... marked section ... ]]>
There are two special points to notice about this declaration format:
The five status keywords that can be used in the status keyword specification are:
CDATA to indicate that the contents of the marked section are
to be treated as character data
which does not contain any resolvable SGML markup
RCDATA to indicate that the contents of the marked section
are to be treated as replaceable character
data in which any character references or embedded text, CDATA
or SDATA entities are to be resolved before the marked section is
output
IGNORE to indicate that the contents of the marked section
are to be ignored during parsing
INCLUDE to indicate that the contents of the marked section
are to be parsed
TEMP to indicate that the section is a temporary part of the
document.
Where no keyword is specified the INCLUDE keyword is assumed.
Where multiple keywords have been entered, and where marked sections are embedded within other marked sections, the order of precedence/inheritance that applies to the entered keywords (highest priority shown first) is:
IGNORE
CDATA
RCDATA
INCLUDE
Embedded marked sections are only recognized where INCLUDE,
IGNORE and TEMP are the only keywords used in the
status keyword specification. Marked sections cannot be embedded within sections
defined using the CDATA or RCDATA keywords because,
within such sections, the program only looks for the next marked section end
delimiter sequence (]]>). As soon as it encounters this
sequence it will terminate the section that started with CDATA or
RCDATA, rather than any embedded marked section.
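The precedence and nesting rules just described can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this chapter, not part of any SGML parser; the function names effective_status and may_contain_nested_sections are invented for the example.

```python
# Priority order as given in this chapter, highest priority first.
PRECEDENCE = ["IGNORE", "CDATA", "RCDATA", "INCLUDE"]
KNOWN = set(PRECEDENCE) | {"TEMP"}


def effective_status(keywords):
    """Return the keyword that governs a marked section.

    `keywords` is the list of status keywords from one declaration.
    TEMP is only a flag, so it never wins; an empty list means INCLUDE.
    """
    upper = [k.upper() for k in keywords]
    for k in upper:
        if k not in KNOWN:
            raise ValueError("unknown status keyword: " + k)
    for candidate in PRECEDENCE:
        if candidate in upper:
            return candidate
    return "INCLUDE"


def may_contain_nested_sections(keywords):
    """Nested sections are recognized only if INCLUDE, IGNORE and TEMP
    are the only keywords used in the specification."""
    return all(k.upper() in ("INCLUDE", "IGNORE", "TEMP") for k in keywords)


print(effective_status(["IGNORE", "TEMP"]))
print(may_contain_nested_sections(["RCDATA"]))
```

With these definitions a specification of IGNORE TEMP is governed by IGNORE, while a specification containing only TEMP, or no keyword at all, behaves as INCLUDE.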
It should also be noted that marked sections can only contain valid SGML
characters as non-SGML data must always be called as part of an NDATA
external entity. The fact that a marked section is flagged to be ignored does
not mean that it may contain non-SGML (shunned) characters.
The CDATA and RCDATA keywords are typically used
in situations where the author wishes to output SGML tags as part of his text.
Only one of these two keywords should be used in any marked section keyword
list.
To include an example such as:
<EM>emphasized phrase</EM>
within a paragraph (without using an entity reference) you could enter it as:
<![ CDATA [<EM>emphasized phrase</EM>]]>
The CDATA keyword tells the parser that the contents of the
marked section are to be sent directly to the parser's output stream, without
being checked for embedded markup.
If the section of text to be marked contains characters which cannot be
entered directly (e.g. because they are not part of the document's character set
and so have to be defined as character references) the RCDATA
keyword can be used in place of CDATA. This will tell the parser
that it must resolve any general entity or character references within the
marked section of text during parsing. For example, to generate the sequence:
<SIZE>12µm</SIZE>
you could enter:
<![ RCDATA [<SIZE>12&#181;m</SIZE>]]>
to ensure that the character reference will be correctly resolved to the µ character while the start-tag and end-tag are retained as part of the text.
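The difference between the two keywords can be illustrated with a small Python sketch. It is a simplification under stated assumptions: character numbers are treated as Unicode code points, general entities are looked up in a plain dictionary supplied by the caller, and the function name marked_section_text is invented for the example.

```python
import re

# One pattern for both kinds of reference, so that text produced by one
# replacement is never scanned again for further references.
REFERENCE = re.compile(r"&(?:#(\d+)|([A-Za-z][A-Za-z0-9.-]*));")


def marked_section_text(keyword, content, entities=None):
    """Return the text passed on for a CDATA or RCDATA marked section."""
    entities = entities or {}
    if keyword == "CDATA":
        # Nothing is resolved: the content is passed through untouched.
        return content
    if keyword == "RCDATA":
        def resolve(match):
            if match.group(1) is not None:
                # Numeric character reference (assumed to be a code point).
                return chr(int(match.group(1)))
            # General entity reference; unknown names are left as entered.
            return entities.get(match.group(2), match.group(0))
        return REFERENCE.sub(resolve, content)
    raise ValueError("this sketch only handles CDATA and RCDATA")


print(marked_section_text("RCDATA", "<SIZE>12&#181;m</SIZE>"))
```

In both cases the tags survive as text; only the RCDATA branch replaces references.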
The sequence ]]> cannot be entered directly within a
marked section when the reference concrete syntax is being used. When preparing
examples of marked sections remember to change at least one character of the
nested marked section close delimiter sequence. Normally the last character of the
nested marked section will be changed to &gt;, and the first
to &lt;, to ensure that the example will not be treated as a
marked section, giving it the form:
<![ CDATA [&lt;![ RCDATA [<SIZE>12&#181;m</SIZE>]]&gt;]]>
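The reason the close sequence cannot appear literally can be seen from a sketch of how the end of a CDATA section is located. The Python below is an illustration only, with an invented function name and a fixed opening sequence; it simply searches for the first ]]> after the start of the section.

```python
MS_END = "]]>"            # MSC followed by MDC in the reference concrete syntax
OPENER = "<![ CDATA ["    # the exact spacing is an assumption of this sketch


def split_cdata_section(text):
    """Return (content, remainder) for the first CDATA section in text.

    The section ends at the FIRST close sequence that follows its start,
    whatever appears in between.
    """
    start = text.index(OPENER) + len(OPENER)
    end = text.find(MS_END, start)
    if end == -1:
        raise ValueError("marked section is not closed")
    return text[start:end], text[end + len(MS_END):]


content, remainder = split_cdata_section("<![ CDATA [<![ RCDATA [x]]>]]>")
print(repr(content))
print(repr(remainder))
```

For the input shown, the section content is cut off after the x, and the final ]]> is left behind as stray text, which is why the nested example has to be disguised.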
The IGNORE and INCLUDE status keywords are
normally used to identify marked sections of text that belong to different
versions of a document. To allow users to control which version is to be output
the relevant status keywords are normally defined as parameter entities in a
document type declaration subset at the
start of the document so that they can be quickly redefined when the job is
reprocessed. (Marked section keyword definitions are one of the few places a
parameter entity reference is valid within a document instance.)
In a typical application the necessary parameter entities might be defined as:
<!ENTITY % mark1 "IGNORE" -- Identifies text specifically for Mark 1 -- >
<!ENTITY % mark2 "INCLUDE" -- Identifies text specifically for Mark 2 -- >
The associated parameter entity references, %mark1; and
%mark2; can be used within a marked section declaration in the
text to identify text that applies to a particular version of the product, as
the following example illustrates:
<P>To install the card:<UL>
<LI>switch off the power supply
<![ %mark1; [<LI>unscrew the retaining bolts holding the cover]]>
<![ %mark2; [<LI>unclip the cover]]>
<LI>select a spare card slot
...
Normally the first of the marked sections would be ignored so that the later, Mark 2, version is printed. If, however, it became necessary to reprint the Mark 1 version of the instructions the only parts of the document that need to be changed are the two entity declarations in the document type declaration subset, which would be changed to read:
<!ENTITY % mark1 "INCLUDE" -- Identifies text specifically for Mark 1 -- >
<!ENTITY % mark2 "IGNORE" -- Identifies text specifically for Mark 2 -- >
The above technique can be extended to any number of versions, affecting many sections of text, provided that care is taken not to overlap marked sections.
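The effect of switching the two parameter entities can be imitated with a short Python sketch. It is deliberately minimal: it assumes each marked section takes its status from a single parameter entity reference, that the value is INCLUDE, IGNORE or empty, and that sections are not nested. The function name select_version and the dictionary of entity values are inventions of the example.

```python
import re

# Matches <![ %name; [ ... ]]> with no nested marked sections inside.
SECTION = re.compile(
    r"<!\[\s*%([A-Za-z][A-Za-z0-9.-]*);\s*\[(.*?)\]\]>", re.DOTALL)


def select_version(text, parameter_entities):
    """Keep the content of INCLUDE sections and drop IGNORE sections."""
    def replace(match):
        # An empty replacement text behaves as INCLUDE.
        keyword = parameter_entities[match.group(1)].strip().upper() or "INCLUDE"
        if keyword == "INCLUDE":
            return match.group(2)
        if keyword == "IGNORE":
            return ""
        raise ValueError("this sketch only handles INCLUDE and IGNORE")
    return SECTION.sub(replace, text)


steps = (
    "<LI>switch off the power supply\n"
    "<![ %mark1; [<LI>unscrew the retaining bolts holding the cover]]>\n"
    "<![ %mark2; [<LI>unclip the cover]]>\n"
    "<LI>select a spare card slot"
)
print(select_version(steps, {"mark1": "IGNORE", "mark2": "INCLUDE"}))
```

Swapping the two values in the dictionary produces the Mark 1 instructions from the same source text, which mirrors the change to the entity declarations described above.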
The IGNORE keyword can also be used to prevent notes added to the text as
reminders from being printed. While such notes can be flagged directly with the
IGNORE keyword, so that they are never processed, it is better
practice to use a parameter entity to control when they should be parsed. For
example, the parameter entity:
<!ENTITY % comment "IGNORE" >
could be used in conjunction with a declaration of the form:
<![ %comment; [Remember to say something about Marked Sections.]]>
If such notes are to appear in a draft all the author needs to do is change the entity declaration to read:
<!ENTITY % comment "INCLUDE" >
or even:
<!ENTITY % comment "" >
(Remember that when no keyword is specified the INCLUDE
keyword is assumed.)
There are occasions when part of a document may only be required
temporarily. For example, you may need to add the phrase "in preparation"
to a citation until such time as the cited work is published. In this case the
TEMP keyword can be used to identify the marked section as one
that will need to be removed later. Typically the entry will take the form:
<CITE>Bryan, M. T. (1996) <EM>SGML and HTML Explained</EM> <![ TEMP [(in preparation)]]></CITE>
The TEMP keyword is only a flag: it does not affect the way in
which the text is processed. In the above example the program treats the marked
section in exactly the same way that it would treat a section for which no
keyword has been entered; it acts as if INCLUDE had been entered
alongside TEMP. The only difference between the above declaration
and a declaration of the form:
<CITE>Bryan, M. T. (1996) <EM>SGML and HTML Explained</EM> <![[(in preparation)]]></CITE>
is that requesting the removal of the section when it is no longer required
will be easier when the TEMP keyword is present as the program can
identify such marked sections as ones that may need to be discarded.
More than one keyword can appear at the start of a marked section
declaration. For example, once the publication mentioned in the above citation
has been published, the word IGNORE could be added to the marked
section declaration to avoid having to delete the text. The stored citation
could then have the form:
<CITE>Bryan, M. T. (1996) <EM>SGML and HTML Explained</EM> <![ IGNORE TEMP [(in preparation)]]> </CITE>
By retaining the temporary section in this case, any future users of the citation will be able to see that it was prepared before the book was published, which should act as a warning that the citation may not be complete.
Where a marked section is likely to be used more than once in a document it
can be stored as an entity. To speed up identification of the text as a marked
section the MS keyword should be used in the entity declaration.
The program will then automatically add the marked section open and close
delimiter sequences to the replacement text entered for the entity. For example,
if the temporary "in preparation" marked section declaration defined
above was to be used in a number of citations it could be declared as an entity
by entering:
<!ENTITY inprep MS " TEMP [(in preparation)" >
Once the entity has been defined in this way the citation can be altered to read:
<CITE>Bryan, M. T. (1996) <EM>SGML Explained</EM> &inprep;</CITE>
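What the MS keyword asks the program to do can be shown with a small Python sketch. The function names and the dictionary of entities are invented for the illustration, and the delimiters are those of the reference concrete syntax.

```python
import re

MS_OPEN = "<!["    # MDO followed by DSO
MS_CLOSE = "]]>"   # MSC followed by MDC


def expand_ms_entity(replacement_text):
    """Wrap the stored text in the marked section start and end sequences."""
    return MS_OPEN + replacement_text + MS_CLOSE


def expand_references(text, ms_entities):
    """Replace each &name; that names an MS entity with its marked section."""
    def replace(match):
        name = match.group(1)
        if name in ms_entities:
            return expand_ms_entity(ms_entities[name])
        return match.group(0)  # leave other references untouched
    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", replace, text)


entities = {"inprep": " TEMP [(in preparation)"}
print(expand_references("<EM>SGML Explained</EM> &inprep;", entities))
```

The reference is replaced by a complete marked section, with both delimiter sequences supplied by the program rather than by the stored text.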
When storing marked sections as entities, however, it is important to remember that all of the delimiters of the marked section must be defined within the same entity. For example, the entity declarations:
<!ENTITY ignore "<![IGNORE [" >
<!ENTITY message "Remember to check this before publication" >
could not be used to create a marked section by entry of a definition such as:
&ignore;&message;]]>
Web SGML Adaptations Extension: When the Web SGML adaptations provided by Annex K to ISO 8879 are available, and the SGML declaration contains INTEGRAL YES as one of the
new entries in its FEATURES clause,
a marked section must start and end in the same entity.
SGML allows marked sections to be nested up to the level defined by the
currently active TAGLVL quantity. By default up to 24 different
levels of marked sections can be active at any point in the document. Only
sections defined using the INCLUDE, IGNORE and
TEMP keywords can contain nested marked sections.
To see how nesting works, consider the following example:
<!DOCTYPE book SYSTEM "my-book.dtd" [
<!ENTITY % bibliog "INCLUDE" >
<!ENTITY % comment "IGNORE" >
<!ENTITY inprep MS " TEMP [(in preparation)" >
]>
<book>
...
<![ %bibliog; [<H1>References</H1>
<CITE>Bryan, M. T. (1996) <EM>SGML Explained</EM> &inprep;</CITE>
<![ %comment; [Need to cite Burn's paper here &inprep;]]>
...
]]>
Here the bibliography is to be included in the document but, because in some
cases it will not be required, it has been treated as a marked section whose
presence can be controlled by use of the %bibliog; parameter
entity. Within the bibliography further marked sections have been used to define
temporary additions to the citation (e.g. the &inprep; entity
reference) and to add a note for the author. Notice that, within the note, the
&inprep; entity reference has been used as a reminder of why
the details still need to be added. This reference to an embedded marked section
will, however, only be expanded if the keyword stored in the %comment;
parameter entity is changed to INCLUDE.
If the entity declaration for the %bibliog; entity reference
is changed to:
<!ENTITY % bibliog "IGNORE" >
all the marked sections embedded within the bibliography will be ignored
because the IGNORE keyword in the outermost marked section has
precedence over all other embedded keywords.
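The way an outer IGNORE overrides everything nested inside it can be followed in a short Python sketch. It is an illustration under stated assumptions, not a parser: only the INCLUDE, IGNORE and TEMP keywords are accepted, parameter entity values come from a dictionary, and the names process and resolve_keywords are invented for the example.

```python
import re

MS_START = re.compile(r"<!\[([^\[]*)\[")   # <![ status keywords [
MS_END = "]]>"
PARAM_REF = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);")


def resolve_keywords(spec, parameter_entities):
    """Expand %name; references and return the list of status keywords."""
    expanded = PARAM_REF.sub(lambda m: parameter_entities[m.group(1)], spec)
    keywords = expanded.upper().split()
    for keyword in keywords:
        if keyword not in ("INCLUDE", "IGNORE", "TEMP"):
            raise ValueError("keyword not handled by this sketch: " + keyword)
    return keywords


def process(text, parameter_entities):
    """Return the text left after nested marked sections are applied."""
    out = []
    stack = []   # one flag per open section: True while its content is ignored
    pos = 0
    while pos < len(text):
        start = MS_START.match(text, pos)
        if start:
            keywords = resolve_keywords(start.group(1), parameter_entities)
            inherited = stack[-1] if stack else False
            # An ignored ancestor forces every nested section to be ignored.
            stack.append(inherited or "IGNORE" in keywords)
            pos = start.end()
        elif stack and text.startswith(MS_END, pos):
            stack.pop()
            pos += len(MS_END)
        else:
            if not (stack and stack[-1]):
                out.append(text[pos])
            pos += 1
    if stack:
        raise ValueError("marked section is not closed")
    return "".join(out)


sample = ("<![ %bibliog; [<H1>References</H1>"
          "<![ %comment; [Need to cite]]>"
          "<![ TEMP [(in preparation)]]>]]>")
print(process(sample, {"bibliog": "INCLUDE", "comment": "IGNORE"}))
```

Changing the value for bibliog to IGNORE makes the function return an empty string, whatever value comment has, because the flag of the outermost section is inherited by every section opened inside it.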
Processing instructions are instructions to the local system telling it, in its own language, how to process the document. Typically processing instructions are used to define how the following text should be formatted. Because such instructions are system specific, and often also application dependent, they need to be specially identified so that, for example, when the document is sent to another system, or its format is changed, any processing instructions incorporated in the document, or its document type declaration, can be changed accordingly.
Because processing instructions are normally written in a language that is
known only to the current system, or those using a similar set of instructions,
they cannot be entered as part of the generalized coding used for SGML. To
distinguish the processing instructions from other markup, therefore, they are
enclosed in a special set of delimiters known as the processing
instruction open (PIO) and processing
instruction close (PIC) delimiters. In the reference
concrete syntax PIO is <? and PIC
is >.
The format of the data within the processing instruction is determined by the processing system. The only restrictions placed on the format of processing instructions by SGML are that:
It should be noted that, once a processing instruction has been started, the SGML parser will ignore all characters up to and including the currently defined processing instruction close sequence.
The maximum length of individual processing instructions is controlled by
the PILEN quantity in the SGML declaration. In the reference
concrete syntax PILEN is set to 240.
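The delimiter and length rules can be put together in a small Python sketch that pulls processing instructions out of a piece of text. It is an illustration only: the function name is invented, the instruction names in the sample are made up, and the sketch assumes that PILEN is measured on the text between the delimiters.

```python
PIO = "<?"     # processing instruction open, reference concrete syntax
PIC = ">"      # processing instruction close, reference concrete syntax
PILEN = 240    # reference quantity set value


def find_processing_instructions(text, pilen=PILEN):
    """Return the body of every processing instruction found in text."""
    instructions = []
    pos = 0
    while True:
        start = text.find(PIO, pos)
        if start == -1:
            return instructions
        # Everything up to the next PIC belongs to the instruction.
        end = text.find(PIC, start + len(PIO))
        if end == -1:
            raise ValueError("processing instruction is not closed")
        body = text[start + len(PIO):end]
        # Assumption: the limit applies to the body, without delimiters.
        if len(body) > pilen:
            raise ValueError("processing instruction exceeds PILEN")
        instructions.append(body)
        pos = end + len(PIC)


print(find_processing_instructions("<P>Text<?newpage>more<?font size=10>"))
```

Because the first PIC character ends the instruction, an instruction written for a system whose own language uses > would need a different concrete syntax or a different way of expressing that character.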
Where two different systems may be used to format a document, the marked section facility can be used to control the processing instructions used on each system. For example, two sets of processing instructions could be entered in the text as:
<![ %systema; [<?processing instruction for System A>]]>
<![ %systemb; [<?processing instruction for System B>]]>
While the document is being processed on System A the associated parameter entities would be defined as:
<!ENTITY % systema "INCLUDE" >
<!ENTITY % systemb "IGNORE" >
When the document is transferred to System B the processing instructions used can be quickly changed by altering the entity declarations to read:
<!ENTITY % systema "IGNORE" >
<!ENTITY % systemb "INCLUDE" >