
© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman

Chapter 8
Marked Sections and Processing Instructions

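This chapter gives a brief overview of the roles of marked sections and processing instructions. It contains the following sections:

8.1 The role of marked sections
8.2 Marked section declarations
8.3 Using marked sections
8.4 Processing instructions
8.5 Using processing instructions in marked sections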

8.1 The role of marked sections

Marked sections can be used to identify sections of text that need special processing. There are two principal reasons for marking sections of text:

Where examples of SGML markup need to be included in a document they can be entered as part of a marked section to ensure they are not misinterpreted as markup. Using this approach obviates the need to convert the initial characters (e.g. < and &) of markup sequences used in examples to references to CDATA or SDATA entities (e.g. &lt; and &amp;).

Where a single document contains more than one version of the text, marked sections allow users to determine which sections should be processed during a particular parse. Text which may or may not appear in the printed version can also be entered as a marked section so that users can determine, for example, whether it should be included in drafts or proofs of the text or left out of the final version. Marked sections can also be used to identify text that has been added to a document on a temporary basis.

Marked sections can be used within the document prolog to identify variants of document type definitions. For example, they are used extensively in Text Encoding Initiative DTDs, and within later versions of the HTML DTD, where they identify elements that have been retained to ensure backward compatibility but whose use in new documents is deprecated.

8.2 Marked section declarations

Marked section declarations are enclosed between special marked section start and marked section end delimiter sequences within the document instance or prolog. The marked section delimiter sequences are both made up of a combination of two other delimiters. The marked section start delimiter sequence consists of the markup declaration open (MDO) delimiter immediately followed by a declaration subset open (DSO) delimiter. This initial DSO is followed by a status keyword specification identifying the type of marked section required, which is followed by a second DSO delimiter that identifies the start of the marked section text.

The end of each marked section is indicated by a special marked section close (MSC) delimiter sequence which is immediately followed by a markup declaration close (MDC) delimiter. The marked section close delimiter will consist of two matching characters which are obviously linked to the DSO delimiters used for the marked section start sequence. For example, in the reference concrete syntax the opening square brackets used for the two DSO delimiters in the marked section start sequence are matched by two closing square brackets (]]) in the MSC. This gives a marked section the form:

   <![ status-keywords [ ... marked section ... ]]>

There are two special points to notice about this declaration format:

The five status keywords that can be used in the status keyword specification are INCLUDE, IGNORE, CDATA, RCDATA and TEMP.

Where no keyword is specified the INCLUDE keyword is assumed.

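Where multiple keywords have been entered, and where marked sections are embedded within other marked sections, the order of precedence/inheritance that applies to the entered keywords (highest priority shown first) is IGNORE, CDATA, RCDATA, INCLUDE. The TEMP keyword does not take part in this ordering because it does not affect processing.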

Embedded marked sections are only recognized where INCLUDE, IGNORE and TEMP are the only keywords used in the status keyword specification. Marked sections cannot be embedded within sections defined using the CDATA or RCDATA keywords because, within such sections, the program only looks for the next marked section end delimiter sequence (]]>). As soon as it encounters this sequence it will terminate the section that started with CDATA or RCDATA, rather than any embedded marked section.

It should also be noted that marked sections can only contain valid SGML characters as non-SGML data must always be called as part of an NDATA external entity. The fact that a marked section is flagged to be ignored does not mean that it may contain non-SGML (shunned) characters.
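The keyword rules described in this section can be sketched in a few lines of Python. This is an illustration only, not part of any SGML parser: the function names are hypothetical, and the priority order used (IGNORE, CDATA, RCDATA, INCLUDE) is the one given in ISO 8879.

```python
# Status keywords in priority order, highest first.
# TEMP is only a flag and never decides how a section is processed.
PRIORITY = ["IGNORE", "CDATA", "RCDATA", "INCLUDE"]
VALID = set(PRIORITY) | {"TEMP"}


def effective_status(spec):
    """Return the keyword that governs a marked section.

    `spec` is the status keyword specification, i.e. the text between
    the two DSO delimiters, such as "IGNORE TEMP" or "".
    """
    keywords = spec.split()
    unknown = [k for k in keywords if k not in VALID]
    if unknown:
        raise ValueError(f"unknown status keyword(s): {unknown}")
    for keyword in PRIORITY:
        if keyword in keywords:
            return keyword
    # No keyword at all, or TEMP on its own: INCLUDE is assumed.
    return "INCLUDE"


def allows_nested_sections(spec):
    """Nested marked sections are not recognized inside CDATA or RCDATA."""
    return effective_status(spec) in ("INCLUDE", "IGNORE")
```

For example, `effective_status("IGNORE TEMP")` yields `"IGNORE"`, while an empty specification yields `"INCLUDE"`.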

8.3 Using marked sections

The CDATA and RCDATA keywords are typically used in situations where the author wishes to output SGML tags as part of the text. Only one of these two keywords should be used in any marked section keyword list.

To include an example such as:

   <EM>emphasized phrase</EM>

within a paragraph (without using an entity reference) you could enter it as:

   <![ CDATA [<EM>emphasized phrase</EM>]]>

The CDATA keyword tells the parser that the contents of the marked section are to be sent directly to the parser's output stream, without being checked for embedded markup.

If the section of text to be marked contains characters which cannot be entered directly (e.g. because they are not part of the document's character set and so have to be entered as entity or character references) the RCDATA keyword can be used in place of CDATA. This will tell the parser that it must resolve any general entity or character references within the marked section of text during parsing. For example, to generate the sequence:

   <SIZE>12µm</SIZE>

you could enter:

   <![ RCDATA [<SIZE>12&micro;m</SIZE>]]>

to ensure that the &micro; entity reference will be correctly resolved to the µ character while the start-tag and end-tag of the example are retained as part of the text.
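The difference between the two keywords can be sketched as follows. This is a simplified Python illustration, not a parser: the entity table and function names are hypothetical, every reference is assumed to end with a semicolon, and numeric character references are assumed to map directly to Unicode code points.

```python
import re

# Hypothetical entity table; a real parser would build this from the DTD.
ENTITIES = {"micro": "\u00b5", "lt": "<", "gt": ">", "amp": "&"}

# A general entity reference (&name;) or numeric character reference (&#nnn;).
_REFERENCE = re.compile(r"&(#\d+|[A-Za-z][A-Za-z0-9.-]*);")


def _resolve(match):
    name = match.group(1)
    if name.startswith("#"):
        return chr(int(name[1:]))
    # Unknown references are left untouched in this sketch.
    return ENTITIES.get(name, match.group(0))


def section_text(keyword, content):
    """Return the data produced by a CDATA or RCDATA marked section."""
    if keyword == "CDATA":
        # Nothing inside the section is treated as markup.
        return content
    if keyword == "RCDATA":
        # Tags stay as text, but references are replaced.
        return _REFERENCE.sub(_resolve, content)
    raise ValueError("this sketch only covers CDATA and RCDATA")
```

With this table, `section_text("RCDATA", "<SIZE>12&micro;m</SIZE>")` replaces the reference but leaves both tags as text, whereas the same content under CDATA is returned unchanged.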

The sequence ]]> cannot be entered directly within a marked section when the reference concrete syntax is being used. When preparing examples of marked sections, remember to change at least one character of the nested marked section end delimiter sequence. Normally the last character of the nested marked section will be changed to &gt;, and the first to &lt;, to ensure that the example will not be treated as a marked section, giving it the form:

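   <![ RCDATA [&lt;![ RCDATA [<SIZE>12&amp;micro;m</SIZE>]]&gt;]]>

(The enclosing section is RCDATA rather than CDATA because references such as &lt; and &gt; are not resolved within a CDATA section. The reference &micro; is entered as &amp;micro; so that it appears literally in the output.)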

8.3.1 Ignored sections

The IGNORE and INCLUDE status keywords are normally used to identify marked sections of text that belong to different versions of a document. To allow users to control which version is to be output, the relevant status keywords are normally defined as parameter entities in a document type declaration subset at the start of the document so that they can be quickly redefined when the job is reprocessed. (Marked section keyword definitions are one of the few places a parameter entity reference is valid within a document instance.)

In a typical application the necessary parameter entities might be defined as:

   <!ENTITY % mark1 "IGNORE"  -- Identifies text specifically for Mark 1 -- >
   <!ENTITY % mark2 "INCLUDE" -- Identifies text specifically for Mark 2 -- >

The associated parameter entity references, %mark1; and %mark2; can be used within a marked section declaration in the text to identify text that applies to a particular version of the product, as the following example illustrates:

<P>To install the card:<UL>
   <LI>switch off the power supply
   <![ %mark1; [<LI>unscrew the retaining bolts holding the cover]]>
   <![ %mark2; [<LI>unclip the cover]]>
   <LI>select a spare card slot ...

Normally the first of the marked sections would be ignored so that the later, Mark 2, version is printed. If, however, it became necessary to reprint the Mark 1 version of the instructions the only parts of the document that need to be changed are the two entity declarations in the document type declaration subset, which would be changed to read:

   <!ENTITY % mark1 "INCLUDE" -- Identifies text specifically for Mark 1 -- >
   <!ENTITY % mark2 "IGNORE"  -- Identifies text specifically for Mark 2 -- >

The above technique can be extended to any number of versions, affecting many sections of text, provided that care is taken not to overlap marked sections.
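The effect of switching the parameter entities can be sketched in Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, only INCLUDE and IGNORE are handled, the marked sections are not nested, and each status keyword is supplied through a single parameter entity reference.

```python
import re

# A marked section whose status keyword comes from one parameter entity
# reference, e.g. <![ %mark1; [ ... ]]> (no nesting in this sketch).
_SECTION = re.compile(
    r"<!\[\s*%([A-Za-z][A-Za-z0-9.-]*);\s*\[(.*?)\]\]>", re.DOTALL
)


def select_version(text, parameter_entities):
    """Keep INCLUDE sections, drop IGNORE sections, remove the delimiters."""

    def replace(match):
        name, content = match.group(1), match.group(2)
        # An empty replacement text means INCLUDE is assumed.
        status = parameter_entities[name].strip() or "INCLUDE"
        if status == "INCLUDE":
            return content
        if status == "IGNORE":
            return ""
        raise ValueError(f"unsupported status keyword: {status}")

    return _SECTION.sub(replace, text)


steps = (
    "<LI>switch off the power supply\n"
    "<![ %mark1; [<LI>unscrew the retaining bolts holding the cover]]>\n"
    "<![ %mark2; [<LI>unclip the cover]]>\n"
)
mark2_text = select_version(steps, {"mark1": "IGNORE", "mark2": "INCLUDE"})
mark1_text = select_version(steps, {"mark1": "INCLUDE", "mark2": "IGNORE"})
```

Only the dictionary of entity values changes between the two calls, which mirrors the way only the two entity declarations change between versions.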

The IGNORE keyword can also be used to prevent notes added to the text as reminders from being printed. While such notes can be flagged directly with the IGNORE keyword, so that they are never processed, it is better practice to use a parameter entity to control when they should be parsed. For example, the parameter entity:

   <!ENTITY % comment "IGNORE" >

could be used in conjunction with a declaration of the form:

   <![ %comment; [Remember to say something about Marked Sections.]]>

If such notes are to appear in a draft all the author needs to do is change the entity declaration to read:

   <!ENTITY % comment "INCLUDE" >

or even:

   <!ENTITY % comment "" >

(Remember that when no keyword is specified the INCLUDE keyword is assumed.)

8.3.2 Temporary sections

There are occasions when part of a document may only be required temporarily. For example, you may need to add the phrase "in preparation" to a citation until such time as the cited work is published. In this case the TEMP keyword can be used to identify the marked section as one that will need to be removed later. Typically the entry will take the form:

   <CITE>Bryan, M. T. (1996) <EM>SGML and HTML Explained</EM>
   <![ TEMP [(in preparation)]]></CITE>

The TEMP keyword is only a flag: it does not affect the way in which the text is processed. In the above example the program treats the marked section in exactly the same way that it would treat a section for which no keyword has been entered; it acts as if INCLUDE had been entered alongside TEMP. The only difference between the above declaration and a declaration of the form:

   <CITE>Bryan, M. T. (1996) <EM>SGML and HTML Explained</EM>
   <![[(in preparation)]]></CITE>

is that requesting the removal of the section when it is no longer required will be easier when the TEMP keyword is present as the program can identify such marked sections as ones that may need to be discarded.

8.3.3 Combining keywords

More than one keyword can appear at the start of a marked section declaration. For example, once the publication mentioned in the above citation has been published, the word IGNORE could be added to the marked section declaration to avoid having to delete the text. The stored citation could then have the form:

   <CITE>Bryan, M. T. (1996)
   <EM>SGML and HTML Explained</EM>
   <![ IGNORE TEMP [(in preparation)]]></CITE>

By retaining the temporary section in this case, any future users of the citation will be able to see that it was prepared before the book was published, which should act as a warning that the citation may not be complete.

8.3.4 Storing marked sections as entities

Where a marked section is likely to be used more than once in a document it can be stored as an entity. To speed up identification of the text as a marked section the MS keyword should be used in the entity declaration. The program will then automatically add the marked section start and end delimiter sequences to the replacement text entered for the entity. For example, if the temporary "in preparation" marked section defined above were to be used in a number of citations it could be declared as an entity by entering:

   <!ENTITY inprep MS " TEMP [(in preparation)" >

Once the entity has been defined in this way the citation can be altered to read:

   <CITE>Bryan, M. T. (1996) <EM>SGML and HTML Explained</EM> &inprep;</CITE>
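What the MS keyword does to the entity text can be shown with a very small Python sketch. The function name is hypothetical and the delimiters are those of the reference concrete syntax.

```python
MARKED_SECTION_START = "<!["  # MDO immediately followed by DSO
MARKED_SECTION_END = "]]>"    # MSC immediately followed by MDC


def ms_replacement_text(parameter_literal):
    """Wrap the text of an MS entity in the marked section delimiters."""
    return MARKED_SECTION_START + parameter_literal + MARKED_SECTION_END


inprep = ms_replacement_text(" TEMP [(in preparation)")
```

Here `inprep` holds the complete marked section, which is what a reference to the entity stands for in the citation.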

When storing marked sections as entities, however, it is important to remember that all of the delimiters of the marked section must be defined within the same entity. For example, the entity declarations:

   <!ENTITY ignore  "<![IGNORE ["                               >
   <!ENTITY message "Remember to check this before publication" >

could not be used to create a marked section by entry of a definition such as:

   &ignore;&message;]]>

Web SGML Adaptations Extension
When the Web SGML adaptations provided by Annex K to ISO 8879 are available, and the SGML declaration contains INTEGRAL YES as one of the new entries in its FEATURES clause, a marked section must start and end in the same entity.

8.3.5 Nested marked sections

SGML allows marked sections to be nested up to the level defined by the currently active TAGLVL quantity. By default up to 24 different levels of marked sections can be active at any point in the document. Only sections defined using the INCLUDE, IGNORE and TEMP keywords can contain nested marked sections.

To see how nesting works, consider the following example:

   <!DOCTYPE book SYSTEM "my-book.dtd" [
   <!ENTITY % bibliog "INCLUDE" >
   <!ENTITY % comment "IGNORE" >
   <!ENTITY inprep MS " TEMP [(in preparation)" > ]>
   <![ %bibliog; [<H1>References</H1>
   <CITE>Bryan, M. T. (1996) <EM>SGML and HTML Explained</EM> &inprep;</CITE>
   <![ %comment; [Need to cite Burn's paper here &inprep;]]>
   ]]>

Here the bibliography is to be included in the document but, because in some cases it will not be required, it has been treated as a marked section whose presence can be controlled by use of the %bibliog; parameter entity. Within the bibliography further marked sections have been used to define temporary additions to the citation (e.g. the &inprep; entity reference) and to add a note for the author. Notice that, within the note, the &inprep; entity reference has been used as a reminder of why the details still need to be added. This reference to an embedded marked section will, however, only be expanded if the keyword stored in the %comment; parameter entity is changed to INCLUDE.

If the declaration of the %bibliog; parameter entity is changed to:

   <!ENTITY % bibliog "IGNORE" >

all the marked sections embedded within the bibliography will be ignored because the IGNORE keyword in the outermost marked section has precedence over all other embedded keywords.
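The way an outer IGNORE overrides everything nested inside it can be sketched with a small stack-based scanner in Python. This is an illustration under stated assumptions, not a conforming parser: the function name is hypothetical, only INCLUDE, IGNORE and TEMP are handled, the nesting limit is not checked, and general entity references such as &inprep; are left unexpanded.

```python
import re

# Either a marked section start (<![ keywords [) or a marked section end (]]>).
_DELIMITER = re.compile(r"<!\[\s*([^\[\]]*?)\s*\[|\]\]>")
_PARAMETER_REF = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);")


def resolve_marked_sections(text, parameter_entities):
    """Drop ignored sections and strip the delimiters from included ones."""
    output = []
    stack = []  # one flag per open section: False if it carries IGNORE
    position = 0
    for match in _DELIMITER.finditer(text):
        # Text is kept only if no enclosing section is ignored.
        if all(stack):
            output.append(text[position:match.start()])
        position = match.end()
        if match.group(0) == "]]>":
            if not stack:
                raise ValueError("marked section end without a start")
            stack.pop()
            continue
        spec = _PARAMETER_REF.sub(
            lambda m: parameter_entities[m.group(1)], match.group(1)
        )
        keywords = spec.split()
        if "CDATA" in keywords or "RCDATA" in keywords:
            raise ValueError("CDATA and RCDATA are outside this sketch")
        stack.append("IGNORE" not in keywords)
    if stack:
        raise ValueError("marked section is not closed")
    output.append(text[position:])
    return "".join(output)
```

Because `all(stack)` is false as soon as any enclosing section is ignored, changing the outermost keyword to IGNORE suppresses every nested section, whatever its own keyword says.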

8.4 Processing instructions

Processing instructions are instructions to the local system telling it, in its own language, how to process the document. Typically processing instructions are used to define how the following text should be formatted. Because such instructions are system specific, and often also application dependent, they need to be specially identified so that, for example, when the document is sent to another system, or its format is changed, any processing instructions incorporated in the document, or its document type declaration, can be changed accordingly.

Because processing instructions are normally written in a language that is known only to the current system, or those using a similar set of instructions, they cannot be entered as part of the generalized coding used for SGML. To distinguish the processing instructions from other markup, therefore, they are enclosed in a special set of delimiters known as the processing instruction open (PIO) and processing instruction close (PIC) delimiters. In the reference concrete syntax PIO is <? and PIC is >.

The format of the data within the processing instruction is determined by the processing system. The only restrictions placed on the format of processing instructions by SGML are that:

It should be noted that, once a processing instruction has been started, the SGML parser will ignore all characters up to and including the currently defined processing instruction close sequence.

The maximum length of individual processing instructions is controlled by the PILEN quantity in the SGML declaration. In the reference concrete syntax PILEN is set to 240.
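How a parser skips to the processing instruction close delimiter, and how the PILEN limit applies, can be sketched in Python. The function name is hypothetical, the delimiters and the limit of 240 are those of the reference concrete syntax, and the sketch ignores contexts (such as CDATA marked sections) in which <? would not be recognized.

```python
PIO = "<?"   # processing instruction open
PIC = ">"    # processing instruction close
PILEN = 240  # maximum length of the instruction data


def processing_instructions(text, pilen=PILEN):
    """Return the data of each processing instruction found in `text`."""
    found = []
    position = 0
    while True:
        start = text.find(PIO, position)
        if start == -1:
            return found
        # Everything up to the first PIC belongs to the instruction.
        end = text.find(PIC, start + len(PIO))
        if end == -1:
            raise ValueError("processing instruction is not closed")
        data = text[start + len(PIO):end]
        if len(data) > pilen:
            raise ValueError("processing instruction exceeds PILEN")
        found.append(data)
        position = end + len(PIC)
```

Note that the first occurrence of the close delimiter ends the instruction, so the instruction data itself cannot contain that sequence.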

8.5 Using processing instructions in marked sections

Where two different systems may be used to format a document, the marked section facility can be used to control the processing instructions used on each system. For example, two sets of processing instructions could be entered in the text as:

   <![ %systema; [<?processing instruction for System A>]]>
   <![ %systemb; [<?processing instruction for System B>]]>

While the document is being processed on System A the associated parameter entities would be defined as:

   <!ENTITY % systema "INCLUDE" >
   <!ENTITY % systemb "IGNORE"  >

When the document is transferred to System B the processing instructions used can be quickly changed by altering the entity declarations to read:

   <!ENTITY % systema "IGNORE"  >
   <!ENTITY % systemb "INCLUDE" >