© Martin Bryan 1997 from SGML and HTML Explained published by Addison Wesley Longman
This chapter explains some of the optional SGML features that are provided by a few advanced SGML tools. It is split into the following sections:
More than one document structure may be required to cope with the varying roles of the data stored in an SGML document. Sometimes the different document structures are used individually; at other times concurrent multiple roles can be allocated to a single piece of text. Three different techniques for relating document structures are provided in SGML:
subdocuments (SUBDOCuments).
CONCURrent DTDs.
LINKs between the structures declared in different DTDs.
Where a document is made up from a number of previously prepared subsections, each of which may have its own document structure, the individual subsections can be created as externally stored subdocuments of the main document. Subdocuments must be declared as external entities in the document type declaration used by the calling document. At the appropriate point in the text an entity reference is used to call the previously declared subdocument into the main document.
Concurrent document structures can be used, for example, to distinguish between the different purposes to which data may be put (e.g. for book production, CD-ROM delivery or on-line database retrieval), or to identify structures generated during the processing of a document (e.g. the layout structure of a formatted book). Each concurrent structure used within the document must be declared in the document's prolog using a separate document type declaration.
Note: The concept of concurrent structures is particularly important to the Association for Computing in the Humanities' Text Encoding Initiative (TEI), where it is used for recording the way in which particular editions of a work have been paginated or edited.
Where it is possible to automatically create one structure from another the
parser can be instructed to automatically create the alternative structure using
SGML's explicit link (EXPLICIT) feature.
Simpler controls on processing are provided by the simple
link (SIMPLE) option to control the processing of the whole
document, and the implicit link (IMPLICIT)
option to control the processing of individual elements.
Note: Links can only be associated with a base document type - never with concurrent document types. Explicit links cannot be used at the same time as concurrent markup structures because concurrent document types cannot be activated at the same time as link types are active.
SGML subdocuments are self-contained, externally stored, entities which consist of a document type declaration followed by text marked up using the entities, elements and attributes defined in the local declaration.
SGML subdocuments are particularly useful where a document contains special sections of data that are, typically, produced from different sources. If, for example, complex tables are to be generated from a spreadsheet package, the DTD used to validate tables produced by the package can be transmitted with the table when it is imported into another document, avoiding the need to ensure that the table structure in the receiving document matches that of all the programs supplying it with information.
Before preparing text for use as a subdocument of another document it is important to ensure that the same SGML declaration will be used for each subdocument in the overall document. It is the SGML declaration of the main document that applies to an SGML subdocument, as the subdocument may not contain its own SGML declaration. If a local SGML declaration has been added to the subdocument while it is being prepared, it will need to be removed before the file can be used as a subdocument within another document.
Web SGML Adaptations Extension: When the Web SGML adaptations provided by Annex K to ISO 8879 are available, each subdocument can have its own SGML declaration. If the SGML declaration is omitted, the subdocument uses the SGML declaration applicable to the entity from which the subdocument entity is referenced.
If subdocuments are to be used the SUBDOC NO entry in the
OTHER section of the FEATURES clause of the
SGML declaration must be changed to
SUBDOC YES n, where n indicates the maximum number
of subdocuments that will be open at any point in the document.
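One way to estimate a suitable value for n is to find the deepest chain of nested subdocument references. The following Python sketch assumes that the references have already been collected into a mapping from each entity name to the subdocument entities it calls; the mapping, the function name and the "table1" entity are hypothetical and are not part of SGML itself.

```python
def max_open_subdocs(references, root):
    """Return the largest number of subdocuments open at once below root.

    references maps an entity name to the subdocument entities it references.
    """
    def depth(name, open_chain):
        if name in open_chain:
            raise ValueError(f"subdocument {name!r} references itself")
        children = references.get(name, [])
        if not children:
            return 0
        chain = open_chain | {name}
        # One level for the child itself plus whatever it opens in turn.
        return 1 + max(depth(child, chain) for child in children)

    return depth(root, frozenset())


# Hypothetical layout: the main document calls two subdocuments,
# one of which calls a further subdocument of its own.
references = {
    "main": ["rec94372", "txt94372"],
    "txt94372": ["table1"],
}
needed = max_open_subdocs(references, "main")
print(f"SUBDOC YES {needed}")
```

For this layout the sketch suggests SUBDOC YES 2, because txt94372 and table1 are open together while the nested table is being processed.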
Once a subdocument file has been prepared it must be declared as an external entity before it can be called. If the subdocument is stored locally it can be declared as a system-specific entity by entry of an entity declaration such as:
<!ENTITY paper1 SYSTEM "c:\pub\captured.txt" SUBDOC>
If the system can automatically recognize the name of the file from the
entity's name, the optional system identifier (e.g. "c:\pub\captured.txt")
can be omitted to give an entity declaration such as:
<!ENTITY captured SYSTEM SUBDOC>
Where the subdocument's contents are already known to all systems likely to receive the document, the entity can be publicly declared by entry of a declaration such as:
<!ENTITY copyrite PUBLIC "-//OPOCE//DOCUMENT Copyright notice//EN" SUBDOC >
It should be noted that, even though they contain markup declarations, SGML subdocuments are defined as general entities rather than parameter entities. This is because the entity reference for the subdocument must occur at the appropriate point within the text of the document instance, rather than within the document prolog.
A previously defined subdocument should be called by entering an entity
reference, such as &copyrite;, at the appropriate point in the
text. Before requesting a subdocument from the system, an SGML parser will
record the current state of the processor for recall after the subdocument has
been processed.
A typical document instance referencing stored subdocuments might look like this:
<!DOCTYPE SUP PUBLIC "-//OPOCE//DTD OJ Supplement//EN" [
<!ENTITY rec94372 SYSTEM "rec94372.cat" SUBDOC>
<!ENTITY txt94372 SYSTEM "al94-372.enc" SUBDOC>
]>
<SUP><RECORD ID="FXAL94372ENC">&rec94372;&txt94372;</sup>
Within a subdocument only locally defined markup declarations apply. This means that you cannot cross-refer to an identifier declared in the main document from within a subdocument, or vice versa.
Note: ISO/IEC 10744, the Hypermedia/Time-based Structuring Language (HyTime), shows how SGML can be extended to allow references in one document to reference identifiers in another document, which could be an embedded subdocument.
Where two or more "views" of a document's contents can exist concurrently, more than one document type declaration can be specified at the start of a document. For example, if data is to be stored in a controlled document database and also displayed on the World Wide Web, it may be necessary to indicate two different roles for a piece of text in its markup, as the following (somewhat simplified) example shows:
<(TEI.2)TEI.2><(HTML)HTML>
<!--TEI header elements omitted here for simplicity-->
<(TEI.2)BIBL><(TEI.2)MONOGR><(HTML)HEAD>
<(TEI.2)AUTHOR><(HTML)TITLE>Shirley, James</(HTML)TITLE></(TEI.2)AUTHOR>
</(HTML)HEAD><(HTML)BODY>
<(TEI.2)TITLE type=main><(HTML)H2>
The Gentlemen of Venice
</(HTML)H2></(TEI.2)TITLE>
<(TEI.2)TITLE type=subordinate><(HTML)H3>
A tragi-comedie presented at the private house in Salisbury Court
by Her Majesties servants
</(HTML)H3></(TEI.2)TITLE>
<(TEI.2)IMPRINT><(HTML)ADDRESS>
<(TEI.2)PUBLISHER>H. Moseley</(TEI.2)PUBLISHER><(HTML)BR>
<(TEI.2)PUBPLACE>London</(TEI.2)PUBPLACE></(HTML)ADDRESS>
<(TEI.2)DATE><(HTML)P><(HTML)STRONG>
1655
</(HTML)STRONG></(HTML)P></(TEI.2)DATE></(TEI.2)IMPRINT>
<(TEI.2)EXTENT><(HTML)P>78pp</(HTML)P></(TEI.2)EXTENT>
</(TEI.2)MONOGR></(TEI.2)BIBL></(HTML)BODY>
</(TEI.2)TEI.2></(HTML)HTML>
Notice how the names of the elements have been qualified by a bracketed
document type specification to indicate which of the DTDs
declared in the document prolog they are associated with. In this case two well
known industry standard DTDs have been used, that of the
Text Encoding Initiative (TEI) and
that of the HyperText Markup Language (HTML)
used on the World Wide Web.
Note: For simplicity's sake I have omitted the TEI header elements that
should precede the start of the bibliographic entry (<BIBL>)
information. The reasons for this omission will be explained shortly.
It is important to note that there is not, in this example, a one-to-one
correspondence between the occurrences of elements within the two DTDs. For
example, within the address element of HTML there is no equivalent of a line
element. Instead the HTML linebreak (<BR>) empty element is
associated with the end-tag of the TEI publisher element.
More than one element may sometimes need to be used in one structure to
obtain the correct result in another structure. For example, to print the TEI
date in a bolder typeface and on a separate line within HTML it is necessary to
associate the <(TEI.2)DATE> element with two HTML elements,
<(HTML)P> and <(HTML)STRONG>.
Normally such concurrent document structures will not be entered directly by authors but will be created through automatic processes. The concurrent structures produced by these processes will be used to tell the system how it should present data to different processes without having to maintain a separate copy of the file for each process.
Concurrent document structures can also be used to record intermediate
stages in a process, or the final state of a file when processing has been
completed. For example, in a typical publishing application a document will pass
through a number of production stages to produce galley proofs, paginated text,
imposed sheets, etc. Traditionally each of these stages has resulted in an
output file which is coded to suit the use to which it will be put. The problem
with this is that, if changes are required to the text, more than one version of
the file may have to be updated. To avoid having to create different files at
each stage in the production process SGML allows the details required for each
structure to be stored in the same file. SGML marked sections can then be used
to identify text that is specific to a particular version of the file, other
structures defining the relevant parameter entities as IGNOREd
text.
When the FEATURES clause of the SGML declaration has been
altered to contain a CONCUR YES n entry, where n
indicates the maximum number of document structures that can be used
concurrently, DTDs can be declared in the document's prolog
in addition to that of the base document
type (which is always the first DTD declared in the prolog). Each
document structure must be declared by entry of a document type declaration.
Typically a document with two concurrent document structures might start:
<!SGML "ISO 8879:1986"
BASESET "ISO 646:1983//CHARSET
International Reference Version (IRV)//ESC 2/5 4/0"
.
.
.
FEATURES MINIMIZE DATATAG NO OMITTAG YES RANK NO SHORTTAG YES
LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
OTHER CONCUR YES 2 SUBDOC YES 2 FORMAL NO
APPINFO NONE
>
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd"
[<!ENTITY % TEIonly "INCLUDE">
<!ENTITY % HTMLonly "IGNORE" >]>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"
[<!ENTITY % TEIonly "IGNORE" >
<!ENTITY % HTMLonly "INCLUDE">]>
The first document type declaration entered after the SGML declaration is that for the TEI P3 DTD. As the locally stored file identified in this declaration is the first DTD in the prolog, it is the base document type for the document. The SGML parser automatically takes its name as the DTD name to be associated with any tag that does not have one specified.
When the document is processed to create the equivalent HTML file the second structure, which is aimed at presentation of the information on a screen, can be added to the TEI logical structure used to capture and store the data.
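The way an application might pull a single view out of concurrently marked-up text can be sketched in a few lines of Python. This is only an illustration, not a description of how an SGML parser works: it assumes that tags contain no ">" inside attribute values, it ignores marked sections, and the function name extract_view is hypothetical.

```python
import re

# Matches start-tags and end-tags, with an optional "(DOCTYPE)" qualifier.
TAG = re.compile(r"<(/?)(?:\(([^)]+)\))?([A-Za-z][\w.-]*)([^<>]*)>")


def extract_view(text, doctype, base):
    """Keep only the tags that belong to doctype and drop their qualifiers."""
    def keep_or_drop(match):
        slash, qualifier, name, rest = match.groups()
        owner = qualifier or base  # unqualified tags belong to the base DTD
        if owner.upper() != doctype.upper():
            return ""
        return f"<{slash}{name}{rest}>"

    return TAG.sub(keep_or_drop, text)


sample = (
    "<(TEI.2)TITLE type=main><(HTML)H2>"
    "The Gentlemen of Venice"
    "</(HTML)H2></(TEI.2)TITLE>"
)
print(extract_view(sample, "HTML", base="TEI.2"))
print(extract_view(sample, "TEI.2", base="TEI.2"))
```

The first call leaves only the HTML heading around the title text, while the second leaves only the TEI title element. The shared character data appears in both views, which is why data must be valid in every structure at the point where it is entered.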
The above example also illustrates how parameter entity definitions can be added to the document type declaration subset to allow users and applications to identify data that is specific to a particular structure. The definitions given above could be used to extend the earlier example as follows:
<(TEI.2)TEI.2><(HTML)HTML>
<![ %(HTML)TEIonly; [
  <!--TEI header elements-->
  <(TEI.2)TEIHEADER> ... </(TEI.2)TEIHEADER>
]]>
<(TEI.2)BIBL><(TEI.2)MONOGR>
<![ %(TEI.2)HTMLonly; [
  <(HTML)HEAD><(HTML)TITLE>TEI Monograph Catalogue</(HTML)TITLE>
  <(HTML)BASE href="http://www.u-net.com/~sgml/TEI">
  <(HTML)LINK rel=translation title="Français" href="Bibl-FR.htm">
  </(HTML)HEAD>
]]>
<(HTML)BODY>
<(TEI.2)AUTHOR><(HTML)H1>Shirley, James</(HTML)H1></(TEI.2)AUTHOR>
<(TEI.2)TITLE type=main><(HTML)H2>
The Gentlemen of Venice
</(HTML)H2></(TEI.2)TITLE>
<(TEI.2)TITLE type=subordinate><(HTML)H3>
A tragi-comedie presented at the private house in Salisbury Court
by Her Majesties servants
</(HTML)H3></(TEI.2)TITLE>
<(TEI.2)IMPRINT><(HTML)ADDRESS>
<(TEI.2)PUBLISHER>H. Moseley</(TEI.2)PUBLISHER><(HTML)BR>
<(TEI.2)PUBPLACE>London</(TEI.2)PUBPLACE></(HTML)ADDRESS>
<(TEI.2)DATE><(HTML)P><(HTML)STRONG>
1655
</(HTML)STRONG></(HTML)P></(TEI.2)DATE>
</(TEI.2)IMPRINT><(TEI.2)EXTENT><(HTML)P>78pp</(HTML)P></(TEI.2)EXTENT>
</(TEI.2)MONOGR></(TEI.2)BIBL>
<![ %(TEI.2)HTMLonly; [
  <(HTML)HR>
  <(HTML)P>Webmaster:
  <(HTML)A href="mailto:webmaster@our-site.com">
  webmaster@our-site.com</(HTML)a>
  </(HTML)P>
]]>
</(HTML)BODY>
</(TEI.2)TEI.2></(HTML)HTML>
The indented material belongs to only one of the structures. The marked section delimiters will ensure that any enclosed text will not be reproduced as part of the structure it does not belong to.
Marked sections provide a useful method of overcoming one of the major
problems with using concurrent structures. Where data occurs in the
document instance it must be valid in all structures at the point entered.
To understand the types of problems that can occur because of this rule you
should note the difference between the example given above and
the earlier example which did not use marked sections.
In the initial example the TEI <AUTHOR> element is
associated with the HTML <TITLE> element, which forms part
of the header of the HTML document, rather than part of the main body text. Once
marked sections are used it becomes possible to provide a separate HTML title
(which is only used as a window title) and move the author details into the text
body.
Note: If the latter version of the <TITLE> element
had been used in the first example an error would be reported because the
<MONOGR> element cannot contain parsed character data. (Its
model is purely element content.)
Not all SGML features can be used within concurrent document structures. In particular, those SGML features that may only be used within the base document type, such as empty start-tags and net-enabling start-tags, cannot be used within concurrent document structures. Similarly care must also be taken to ensure that notation names associated with data entities or attribute lists are declared in each of the DTDs they are associated with.
SGML uses five types of declarations to link concurrent document structures:
Link type declarations are
similar in structure to document type declarations. They must be entered in the
document prolog after the DTDs they relate to. Like other markup
declarations, link type declarations begin with the markup declaration open
(MDO) delimiter followed, without any intervening
spaces, by a reserved name, LINKTYPE by default. This is followed
by a link type name that uniquely identifies the link
process definition. As well as being different from that of any other
link type declaration in the same prolog, this name must also be different from
the names used for document type definitions in the same prolog.
Link type declarations may be separated from DTDs, and other link type declarations, by comment declarations, spaces, record start and end codes, valid separator characters or processing instructions, which are collectively referred to within ISO 8879 as other prolog.
Except for link set use declarations, which are used in a way similar to
short reference maps, link related declarations must either be embedded within a
link type declaration subset within the link type declaration,
or stored in a separate file that is referenced as all or part of the link type
declaration subset. The mechanism used is similar to that used for document type
definitions (see Chapter 11), with any definitions
called from a separate file being read after any local definitions encountered
between a matched pair of declaration subset open (DSO) and
declaration subset close (DSC) delimiters.
Three types of link are recognized by SGML: simple, implicit and explicit.
The types of links that can be used within a document are controlled by the
LINK entries in the FEATURES clause of the SGML
declaration. In the reference concrete syntax the LINK features
are disabled by entry of the following line:
LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
If simple links are required in a document the first entry in this line must
be changed to SIMPLE YES followed by a number indicating the
maximum number of simple links to be used in the document. If implicit links are
to be used the second entry changes to IMPLICIT YES (without a
qualifying number). Where explicit links are required the maximum number of
links to be used within a single chain in the document must be stated after the
entry EXPLICIT YES, giving a composite entry of the form:
LINK SIMPLE YES 4 IMPLICIT NO EXPLICIT YES 2
If EXPLICIT YES is specified, multiple DTDs can be declared in
the prolog. The number of explicit links allowed in the FEATURES
clause must be, at least, one less than the number of document type declarations
in the longest chain of linked documents declared in the prolog.
A simple link specification takes the form:
<!LINKTYPE proof #SIMPLE #IMPLIED
[<!ATTLIST book style CDATA #FIXED "300dpi.prn">]>
This declaration tells the system to "use the style sheet that generates 300 dot per inch proofs of books when this linktype declaration is active".
The link type name (proof) that follows the
LINKTYPE declaration type keyword is followed by two, compulsory,
keywords (#SIMPLE and #IMPLIED) to indicate the type
of link being specified and the type of result expected.
The link type declaration subset that follows the specification, between the square brackets, contains a single attribute definition list declaration that defines one or more fixed attributes that are to be assigned to the base document type element (the first one defined in the prolog).
Note: When used in link type declaration subsets, the attribute's
declared value cannot be defined by use of the ID, IDREF,
IDREFS or NOTATION keywords, and #CURRENT
and
#CONREF cannot be used for the default value. (These restrictions
apply because link attributes cannot be used within an element's start-tag. In
each of the above cases users would need to specify the applicable attribute
values or supply the contents the attributes are to apply to.)
More than one simple link can be specified in a prolog. For example, the following link types could be associated with the TEI DTD:
<!LINKTYPE print #SIMPLE #IMPLIED
[<!ATTLIST tei.2 style CDATA #FIXED "postscript">]>
<!LINKTYPE load-dbs #SIMPLE #IMPLIED
[<!ATTLIST tei.2 dbs-name CDATA #FIXED "catalogue">]>
to specify the way the TEI.2 document should be processed before being sent for printing or loading into a bibliographic catalogue. Note, however, that a simple link could not be associated with the HTML DTD shown in the concurrent document examples above as only the base document type can be linked to the implied structure.
As with document type declarations, the declaration subset can be stored in an external file which can be referenced using system or public identifiers. For example, the two entries shown above could be shortened to:
<!LINKTYPE print #SIMPLE #IMPLIED SYSTEM "print.lpd">
<!LINKTYPE load-dbs #SIMPLE #IMPLIED
PUBLIC "-//our-firm//LPD Database loading link process definition//EN">
Note particularly the use of the LPD public class name in the
formal public identifier to indicate that the file to be referenced contains a
link type declaration subset.
The way a link is activated depends on the application. For conformance testing purposes an SGML parser should be able to use a processing instruction of the following form to activate both of the link type definitions given above:
<?rast-active-lpd: print load-dbs>
Implicit link specifications can be used to associate
processing attributes with any element in any DTD. When the LINK
entries in the FEATURES clause of the SGML declaration contain the
entry IMPLICIT YES, the link type name can be followed by the name
of a document type whose document type declaration precedes it in the prolog,
and the word
#IMPLIED. This tells the system that this link type declaration,
when activated as detailed above, will add link attributes to
elements in the named DTD.
The following example shows how some printing properties could be associated with elements making up a TEI bibliographic entry for a monograph:
<!LINKTYPE print tei.2 #IMPLIED
[<!ENTITY % bibl "(author|title|publisher|pubPlace|date|extent)" >
<!ATTLIST %bibl; align (start|end|centred|justified) start
family CDATA "Times Roman"
weight NAME medium
posture NAME upright
size NUTOKEN 12pt
measure NMTOKEN 36pi
l-indent NUTOKEN 0
r-indent NUTOKEN 0
attcond CDATA #IMPLIED >
<!LINK #INITIAL
author [centred weight=bold size=18pt]
title [attcond="type=main" centred size=24pt]
title [attcond="type=subordinate" centred size=16pt]
publisher [weight=bold]
date [family="Arial" posture=italic]
extent [end family="Arial"] >
]>
The parameter entity defined at the start of the link type declaration subset identifies all the elements in a monograph's bibliographic entry that have text associated with them. The attribute definition list declaration then associates 9 attributes with each of these elements, and assigns default values to 8 of the 9 attributes.
Each implicit link type declaration must have at
least one link set declaration whose associated name is a
special reserved name, #INITIAL. This identifies the start point
for the link process. Like other markup declarations the link set declaration
begins with a markup declaration open (MDO) delimiter followed,
without intervening spaces, by the reserved name identifying the type of
declaration, LINK. The name assigned to the link set must follow
this reserved name, separated from it and the subsequent link rules
by one or more spaces or other separator characters. The link set declaration
ends when the next markup declaration close (MDC) delimiter is
encountered.
For implicit links the link rules take the form of one or more source
element specifications each of which consists of the name of an
associated element type, which must be that of an element
defined in the DTD identified by the link type declaration, and a link
attribute specification. As in other SGML declarations, the associated
element type specification can be either a single element type name or a
bracketed
name group. The link attribute
specification consists of an attribute
specification list, as found within an element's start-tag, bracketed by the
current declaration subset open (DSO) and declaration subset close
(DSC) delimiters.
Notice that there are two declarations for the title element.
Multiple entries are permitted where the selection of an appropriate option can
be determined using some application-specific rule. In this example an attribute
condition (attcond) attribute has been defined to check the
current value of the type attribute. If the value of the type
attribute is main the title will be set in 24pt Times Roman
centred on the 36 pica (6 inch) default measure. If the value is subordinate
the title will be set, centred on the measure, in 16pt Times Roman.
Notice also that there is no entry for the pubPlace element.
This is because this element should be set using the default settings for the
attributes, which specify that the text should be set in 12pt Times Roman, using
a medium weight and upright posture, so that the start of the place name
is aligned with the start of the 36 pica measure, to which no indents are to be
applied.
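The way the defaults and the link rules combine can be modelled with a short Python sketch. It copies the values from the print link type shown above, but the matching logic is an assumption: SGML leaves the interpretation of attcond to the application, so the simple "name=value" test used here is only one plausible reading, and the function name link_attributes is hypothetical.

```python
# Defaults declared in the link attribute definition list.
DEFAULTS = {
    "align": "start", "family": "Times Roman", "weight": "medium",
    "posture": "upright", "size": "12pt", "measure": "36pi",
    "l-indent": "0", "r-indent": "0",
}

# The #INITIAL link set: (element, attcond, link attributes).
# A bare value such as "centred" belongs to the align attribute.
INITIAL = [
    ("author", None, {"align": "centred", "weight": "bold", "size": "18pt"}),
    ("title", "type=main", {"align": "centred", "size": "24pt"}),
    ("title", "type=subordinate", {"align": "centred", "size": "16pt"}),
    ("publisher", None, {"weight": "bold"}),
    ("date", None, {"family": "Arial", "posture": "italic"}),
    ("extent", None, {"align": "end", "family": "Arial"}),
]


def link_attributes(element, attributes):
    """Return the formatting attributes that apply to one element."""
    result = dict(DEFAULTS)
    for name, attcond, overrides in INITIAL:
        if name != element.lower():
            continue
        if attcond is not None:
            # Assumed reading of attcond: a single name=value test.
            key, _, wanted = attcond.partition("=")
            if attributes.get(key) != wanted:
                continue
        result.update(overrides)
        break
    return result


print(link_attributes("title", {"type": "main"})["size"])
print(link_attributes("pubPlace", {})["size"])
```

An element with no rule in the link set simply receives the declared defaults, while an element with a rule has only the attributes named in that rule overridden.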
Where style sheets are used to record the details of the parameters to be
associated with each element, implicit links provide a natural route for linking
elements to style sheets. As the following example shows, the name assigned to the
style sheet specification (e.g. h1) is the only attribute needed
to control the link process:
<!LINKTYPE format tei.2 #IMPLIED
[<!ENTITY % bibl "(author|title|publisher|pubPlace|date|extent)" >
<!ATTLIST %bibl; style NAME "normal"
attcond CDATA #IMPLIED >
<!LINK #INITIAL
author [style=h1]
title [attcond="type=main" style=h2]
title [attcond="type=subordinate" style=h3]
publisher [style=p]
date [style=date]
extent [style=extent] >
]>
In this case the style sheet name, which is passed to the text formatting routines, activates a predefined set of formatting instructions which control the appearance of the element's text.
As explained above, the link type declaration subset can be stored in a separate, easily reusable, file that can be associated with many document instances. A typical implementation might add the following prolog to the TEI example document shown above:
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd">
<!LINKTYPE format tei.2 #IMPLIED
PUBLIC "-//our-firm//LPD Styles for TEI Bibliography//EN">
Sometimes you need to associate a specific set of processing rules with a particular occurrence of an element. If an element has been assigned a unique identifier, an ID link set declaration can be used to assign link attributes to elements with relevant IDs.
When the reference concrete syntax is being used ID link set declarations
are defined, within the link type declaration subset, in a declaration that
begins
<!IDLINK and ends with >. Between these markup
delimiters there must be one or more entries consisting of:
The following example shows how an ID link set declaration could be used to provide overrides for specific instances of a publisher name:
<!IDLINK
isea publisher [family="Avant Garde"]
OPOCE publisher [family="Helvetica"]
sgml-cen publisher [weight=bold posture=italic] >
If the publisher element in the TEI bibliographic entry used above had been:
<(TEI.2)PUBLISHER id=isea>isea sa</(TEI.2)PUBLISHER>
the IDLINK definition would ensure that the company name would
be printed using the house style for that company, which requires that the name
be set in 12pt Avant Garde.
Note: Only one ID link set declaration may be specified in each link type declaration.
Explicit links are used to link elements in a source document
structure to elements in a result document structure. The explicit
link specification that follows the LINKTYPE keyword and
the link type name consists of the names of two DTDs defined earlier
in the prolog. The subsequent link type declaration subset, between the square
brackets, can contain:
a link set declaration named #INITIAL, which will form the start point for the link process.
The following prolog could be used to link a TEI bibliographic entry for a monograph to an HTML document structure:
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd">
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!LINKTYPE CreateCD TEI.2 HTML
[<!ATTLIST title attcond CDATA #IMPLIED>
<!LINK #INITIAL author title
title [attcond="type=main"] h2
title [attcond="type=subordinate"] h3
imprint address
pubPlace br
date strong
extent p >
]>
For explicit links the link set name (e.g. #INITIAL) is
followed by matched pairs of element type names, optionally qualified by
attribute specifications, that form an explicit link rule. The
first element type name and attribute specification pair is known as the
source element specification. This starts with the name of
one of the elements (or a bracketed group of names) defined in the first of the
DTDs named in the link type declaration. The element type name can be qualified
by one or more attributes forming a link attribute specification,
as can be seen in the entries for the two variants of the TEI <TITLE>
element.
The source element specification is followed by a result element specification, which also starts with an element type name. This second name must be one of the element type names defined in the second of the DTDs named in the link type declaration. This element type name can, optionally, be followed by a result attribute specification showing the attributes to be associated with the selected element in the result document. These attributes must be ones whose attribute definition list declaration has been declared as part of the second of the DTDs named in the link type declaration.
The following document instance could be processed using the prolog shown above:
<TEI.2>
<!--TEI header elements omitted here for simplicity-->
<BIBL><MONOGR>
<AUTHOR>Shirley, James</AUTHOR>
<TITLE type=main>
The Gentlemen of Venice
</TITLE>
<TITLE type=subordinate>
A tragi-comedie presented at the private house in Salisbury Court
by Her Majesties servants
</TITLE>
<IMPRINT>
<PUBLISHER>H. Moseley</PUBLISHER>
<PUBPLACE>London</PUBPLACE>
<DATE>1655</DATE>
</IMPRINT><EXTENT>78pp</EXTENT>
</MONOGR></BIBL></TEI.2>
The HTML document structure produced as a result of processing the link process definition will be:
<TITLE> Shirley, James </TITLE><H2> The Gentlemen of Venice </H2><H3> A tragi-comedie presented at the private house in Salisbury Court by Her Majesties servants </H3><ADDRESS> H. Moseley <BR> London <STRONG> 1655 </STRONG></ADDRESS><P> 78pp </P>
This is not a complete HTML document, but it will be accepted by most HTML-based programs as the only elements missing from it are ones that are declared omissible in the HTML DTD. When the output of the link process is parsed against this DTD it will generate a file of the form:
<HTML VERSION= "-//W3C//DTD HTML 2.0//EN"> <HEAD><TITLE> Shirley, James </TITLE></HEAD><BODY><H2> The Gentlemen of Venice </H2><H3> A tragi-comedie presented at the private house in Salisbury Court by Her Majesties servants </H3><ADDRESS> H. Moseley <BR> London <STRONG> 1655 </STRONG></ADDRESS><P> 78pp </P></BODY></HTML>
Note that the parser has determined from the HTML DTD that the <TITLE>
element should be placed in the <HEAD> section of the HTML
document instance, and that all the other elements should be placed in the
<BODY> section. It has also added the default values to
those attributes that were assigned one in the DTD.
If you compare this document structure with that given as an
example of concurrent markup earlier in the chapter
you will find that they are not identical. Using SGML links it is not possible
to position the date outside of the <ADDRESS> result element
generated by the <IMPRINT> source element, or to generate
more than one start-tag in the result structure when a source element is
encountered. (To get the structures to be the same both a <P>
and a <STRONG> start-tag would have needed to be generated
in response to the <DATE> start-tag.)
If you compare the document structure with the example
showing the use of both concurrent markup and marked sections you will find
that the TEI <AUTHOR> element has been linked to the HTML
<TITLE> element rather than to the H1 element
as, without this, the compulsory title element in the header would not be
present and an error would be reported when the HTML file was parsed.
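The one-to-one nature of explicit link rules can be seen in the following Python sketch, which applies the CreateCD rules to a stream of start, end and text events. It is a simplified model rather than a description of a conforming parser: the event format, the function names and the name=value reading of attcond are all assumptions made for illustration.

```python
# Explicit link rules taken from the CreateCD example:
# (source element, attcond) -> result element.
RULES = {
    ("author", None): "title",
    ("title", "type=main"): "h2",
    ("title", "type=subordinate"): "h3",
    ("imprint", None): "address",
    ("pubplace", None): "br",
    ("date", None): "strong",
    ("extent", None): "p",
}
EMPTY = {"br"}  # result elements declared EMPTY have no end-tag


def result_element(name, attributes):
    """Return the result element linked to a source element, if any."""
    for (source, attcond), result in RULES.items():
        if source != name.lower():
            continue
        if attcond is None:
            return result
        key, _, wanted = attcond.partition("=")
        if attributes.get(key) == wanted:
            return result
    return None


def apply_explicit_links(events):
    """Turn (kind, value, attributes) source events into result markup."""
    out = []
    open_results = []
    for kind, value, attributes in events:
        if kind == "start":
            result = result_element(value, attributes)
            open_results.append(result)
            if result:
                out.append(f"<{result.upper()}>")
        elif kind == "end":
            result = open_results.pop()
            if result and result not in EMPTY:
                out.append(f"</{result.upper()}>")
        else:
            out.append(value)
    return "".join(out)


events = [
    ("start", "IMPRINT", {}),
    ("start", "PUBLISHER", {}), ("text", "H. Moseley", {}), ("end", "PUBLISHER", {}),
    ("start", "PUBPLACE", {}), ("text", "London", {}), ("end", "PUBPLACE", {}),
    ("start", "DATE", {}), ("text", "1655", {}), ("end", "DATE", {}),
    ("end", "IMPRINT", {}),
]
print(apply_explicit_links(events))
```

Because each source element maps to at most one result element, the sketch cannot produce both a paragraph and a strong element for the date, and the date necessarily stays inside the address generated by the imprint.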
Sometimes problems such as those illustrated by the last example can be overcome by using one of the two mechanisms provided by SGML for switching link sets:
#USELINK, which allows an alternative link set to be used within a specified element
#POSTLINK, which allows an alternative link set to be activated when the end-tag for an element is encountered.
The #USELINK option is particularly useful for differentiating
between the ways in which an element can be processed in different contexts.
For example, one of the DTDs used by the Office for Official Publications of
the European Communities (OPOCE) contains the following specification for a
paragraph:
<!ELEMENT (p|elem) - - (#PCDATA|list)+ >
<!ELEMENT list - O (elem)+ >
Notice that this definition is recursive: lists are made up of elements (elem)
that can themselves contain lists.
The following, made-up, example shows how a complex paragraph could be marked up using these elements. The marked-up text, which contains four levels of nested lists, has been indented to make it clear which level each component is nested to.
<P>The budget for 1996 will be distributed as follows:
<LIST>
  <ELEM>Payments to Directorates
  <LIST>
    <ELEM>DGI
    <LIST>
      <ELEM>Division: A
      <LIST>
        <ELEM>Brussels: 12 Mecus</ELEM>
        <ELEM>Luxembourg: 8 Mecus</ELEM></LIST>
      <ELEM>Division: B
      <LIST>
        <ELEM>Brussels: 10 Mecus</ELEM>
        <ELEM>Luxembourg: 6 Mecus</ELEM></LIST>
      <ELEM>Division: C
      <LIST>
        <ELEM>Brussels: 9 Mecus</ELEM>
        <ELEM>Luxembourg: 12 Mecus</ELEM></LIST></LIST>
    <ELEM>DGII
    <LIST>
      <ELEM>Division: D
      <LIST>
        <ELEM>Brussels: 21 Mecus</ELEM>
        <ELEM>Luxembourg: 18 Mecus</ELEM></LIST>
      <ELEM>Division: E
      <LIST>
        <ELEM>Brussels: 5 Mecus</ELEM>
        <ELEM>Luxembourg: 2 Mecus</ELEM></LIST>
      <ELEM>Division: F
      <LIST>
        <ELEM>Brussels: 19 Mecus</ELEM>
        <ELEM>Luxembourg: 2 Mecus</ELEM></LIST></LIST></LIST>
  <ELEM>Payments to Member States
  <LIST>
    <ELEM>Greece: 25 Mecus</ELEM>
    <ELEM>Austria: 12 Mecus</ELEM>
    <ELEM>Finland: 8 Mecus</ELEM></LIST>
  <ELEM>Payments to Other Bodies
  <LIST>
    <ELEM>OPOCE: 10 Mecus</ELEM></LIST>
</P>
It could be decided that, when distributed over the World Wide Web, the
first two levels of list should be numbered, but subsequent levels should be
bulleted. The following (simplified) example of an explicit link shows how the
#USELINK option can be used to enforce the correct nesting and
numbering of lists:
<!DOCTYPE blk0 SYSTEM "cat2.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<!LINKTYPE to-HTML blk0 HTML [
<!LINK #INITIAL p p
list #USELINK level-1 ol >
<!LINK level-1 elem li
list #USELINK level-2 ol [type=a] >
<!LINK level-2 elem li
list ul [type=disc] >
]>
HTML distinguishes between two types of lists, ordered lists (<OL>),
which are preceded by an automatically generated number or letter (depending on
the level of nesting), and unordered lists (<UL>), which are
preceded by a bullet or dash (depending on both level and the style
specifications of the browser you are using).
The rules for processing paragraphs are defined in the initial link set.
OPOCE paragraphs are to be mapped to HTML paragraphs. In this case the same
markup tag (<P>) is used in both DTDs. (It should be noted,
however, that the two elements have different content models.)
When a top level list (<LIST>) element is identified
within the OPOCE paragraph it will be associated with an HTML ordered list (<OL>)
element with the default rules relating to numbering. Within the top level list
the parser will use the link set called level-1 to link elements.
This will link any OPOCE <ELEM> elements to HTML list items
(<LI>). If a nested list is encountered this will be mapped
to a nested ordered list whose type attribute is set to a, so
that the items of the nested list are labelled with lower-case letters.
Within the nested list the rules defined in the link set called level-2
will be used.
Within a second level list any <ELEM> elements will be
linked to HTML list items and any nested lists will be linked to HTML unordered
lists (<UL>). To ensure that the list will be bulleted a
type=disc attribute value has been specified. Note that there is
no #USELINK statement at this level. If there are further levels
of nesting they will continue to use the current link set, level-2,
which will map all levels of nested list in the same way.
When the link rules are applied to the sample paragraph they will generate an HTML file of the form:
<P>The budget for 1996 will be distributed as follows:
<OL>
<LI>Payments to Directorates
<OL type="a">
<LI>DGI
<UL type="disc">
<LI>Division: A
<UL type="disc">
<LI>Brussels: 12 Mecus</LI>
<LI>Luxembourg: 8 Mecus</LI></UL>
<LI>Division: B
<UL type="disc">
<LI>Brussels: 10 Mecus</LI>
<LI>Luxembourg: 6 Mecus</LI></UL>
<LI>Division: C
<UL type="disc">
<LI>Brussels: 9 Mecus</LI>
<LI>Luxembourg: 12 Mecus</LI></UL></UL>
<LI>DGII
<UL type="disc">
<LI>Division: D
<UL type="disc">
<LI>Brussels: 21 Mecus</LI>
<LI>Luxembourg: 18 Mecus</LI></UL>
<LI>Division: E
<UL type="disc">
<LI>Brussels: 5 Mecus</LI>
<LI>Luxembourg: 2 Mecus</LI></UL>
<LI>Division: F
<UL type="disc">
<LI>Brussels: 19 Mecus</LI>
<LI>Luxembourg: 2 Mecus</LI></UL></UL></OL>
<LI>Payments to Member States
<OL type="a">
<LI>Greece: 25 Mecus</LI>
<LI>Austria: 12 Mecus</LI>
<LI>Finland: 8 Mecus</LI></OL>
<LI>Payments to Other Bodies
<OL type="a">
<LI>OPOCE: 10 Mecus</LI></OL></OL>
This will generate a listing of the following form:
The budget for 1996 will be distributed as follows:
1. Payments to Directorates
   a. DGI
      - Division: A
        - Brussels: 12 Mecus
        - Luxembourg: 8 Mecus
      - Division: B
        - Brussels: 10 Mecus
        - Luxembourg: 6 Mecus
      - Division: C
        - Brussels: 9 Mecus
        - Luxembourg: 12 Mecus
   b. DGII
      - Division: D
        - Brussels: 21 Mecus
        - Luxembourg: 18 Mecus
      - Division: E
        - Brussels: 5 Mecus
        - Luxembourg: 2 Mecus
      - Division: F
        - Brussels: 19 Mecus
        - Luxembourg: 2 Mecus
2. Payments to Member States
   a. Greece: 25 Mecus
   b. Austria: 12 Mecus
   c. Finland: 8 Mecus
3. Payments to Other Bodies
   a. OPOCE: 10 Mecus
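The way #USELINK switches link sets as lists are nested can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not part of any SGML parser: the table USELINK_SETS restates the three link sets of the to-HTML example, while the function name apply_links and the event-tuple input format are assumptions made for the sketch.

```python
# Each link set maps a source element to
# (result element, result attributes, link set named by #USELINK or None).
USELINK_SETS = {
    "#INITIAL": {"p": ("P", "", None), "list": ("OL", "", "level-1")},
    "level-1": {"elem": ("LI", "", None), "list": ("OL", ' type="a"', "level-2")},
    "level-2": {"elem": ("LI", "", None), "list": ("UL", ' type="disc"', None)},
}


def apply_links(events):
    """Map a stream of ("start", name), ("end", name) and ("text", data)
    events (a hypothetical input format) to result markup."""
    current = "#INITIAL"
    stack = []  # (result element, link set to return to at the end-tag)
    out = []
    for kind, value in events:
        if kind == "text":
            out.append(value)
        elif kind == "start":
            result, attrs, uselink = USELINK_SETS[current][value]
            out.append(f"<{result}{attrs}>")
            stack.append((result, current))
            if uselink is not None:
                current = uselink  # applies only inside this element
        else:  # end-tag: leave the element and restore the outer link set
            result, current = stack.pop()
            out.append(f"</{result}>")
    return "".join(out)


def nested(depth):
    """Build events for a paragraph holding `depth` levels of nested lists."""
    events = [("start", "p")]
    for level in range(depth):
        events += [("start", "list"), ("start", "elem"), ("text", str(level + 1))]
    for _ in range(depth):
        events += [("end", "elem"), ("end", "list")]
    return events + [("end", "p")]
```

Because the level-2 link set has no #USELINK entry, calling apply_links(nested(4)) keeps producing UL elements for the third and fourth levels of list, as the text describes.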
Where special instructions are to be associated with specific instances of an element, the source document type definition must include an attribute definition list declaration for that element which contains an attribute declared using the ID keyword as its declared value. In the source document instance each element that is to be treated differently must be given a unique identifier. The system uses this identifier to determine where one of the link rules in the ID link set declaration of the current link type declaration subset is to be applied.
When preparing ID link sets it is important to remember that each link set
must cater for any subelements that the model allows to be embedded within the
element being linked to the result document. Failure to do so may result in
embedded elements not being formatted properly, and at best will result in their
being given default formatting parameters. To overcome this problem the #USELINK
option can be combined with the IDLINK option to give a
declaration of the form:
<!IDLINK special p #USELINK #INITIAL block [style=special-p] >
The #POSTLINK option can be used to specify that an
alternative link set is to be activated when the end-tag for the specified
element is encountered, or implied by the program.
The classic example of the need to change the link rules immediately after a specific element is provided by books in which the first paragraph of text after a chapter heading is set in a different format from other paragraphs. This could be handled using a link set of the following type:
<!LINKTYPE fromHTML HTML page [
<!LINK #INITIAL
h1 #POSTLINK firstpar block [style=chapter-head]
p block [style=para] >
<!LINK firstpar
p #POSTLINK #INITIAL block [style=initial-para] >
] >
In the initial link set the first level of heading, e.g. the chapter
heading, has been declared in such a way that, when the end-tag for the heading
(</H1>) is encountered, the parser will switch to a special
link set (firstpar) before processing the following text. The
first paragraph differs from other paragraphs: it uses the style sheet known as
initial-para to format the block it forms on the
page.
When the end-tag for the first paragraph (</P>) is
detected by the parser the #POSTLINK #INITIAL entry associated
with the paragraph element type name (P) in the firstpar
link set will cause the parser to revert to using the link definition for
paragraphs given in the initial link set. This will ensure that subsequent
paragraphs use the normal style sheet for paragraph blocks (para).
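The switching performed by #POSTLINK can be traced with a short Python sketch. It is a simplified model rather than a parser: it treats the document as a flat sequence of sibling elements, and the names POSTLINK_SETS and styles_for are assumptions introduced for the illustration. The style names are those used in the link sets above.

```python
# Each link set maps a source element to
# (style of the result block, link set named by #POSTLINK or None).
POSTLINK_SETS = {
    "#INITIAL": {"h1": ("chapter-head", "firstpar"), "p": ("para", None)},
    "firstpar": {"p": ("initial-para", "#INITIAL")},
}


def styles_for(elements):
    """Return the style chosen for each element in a flat sequence of
    sibling element names (a simplification of a real document tree)."""
    current = "#INITIAL"
    styles = []
    for name in elements:
        style, postlink = POSTLINK_SETS[current][name]
        styles.append(style)
        if postlink is not None:
            current = postlink  # takes effect once the end-tag has been seen
    return styles
```

For a chapter heading followed by three paragraphs, styles_for(["h1", "p", "p", "p"]) selects chapter-head, then initial-para for the first paragraph only, then para for the remaining two.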
As the above simplified examples show, preparing link type declarations for a document structure as complex as that defined for a book can be a time-consuming task. Fortunately a number of short cuts are available.
As with the other markup declarations, parameter entities can be used to reduce the amount of repetitive keying required. As the following example shows, this can greatly reduce the length of link attribute specifications:
<!LINKTYPE print act page [
<!ENTITY % catalog SYSTEM "catalog.lpd" >
<!ENTITY % preamble SYSTEM "preamble.lpd" >
<!ENTITY % terms SYSTEM "terms.lpd" >
<!LINK #INITIAL %catalog;
title [attcond="type='main'"] block [style="big-title"]
title [attcond="type='subordinate'"] block [style="small-title"]
%preamble;
%terms; >
]>
Here the majority of the links to be used are defined in three link process definition files stored on the local system. These standard link set definitions are called in using parameter entities, with any elements not covered by the standard definitions being added to the stored definitions.
Entity declarations made in the document type declaration subset of the source document type named in the link type declaration can also be used within the link type declaration subset where they are appropriate. If entities of the same name are declared in both the link type declaration and the document type declaration, however, the one in the link type declaration will have priority while the link is active.
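The precedence rule for entities of the same name can be pictured as a two-level lookup. The following Python sketch is illustrative only: the entity names, the file names other than catalog.lpd, and the function resolve_entity are hypothetical and do not come from any particular SGML system.

```python
from collections import ChainMap

# Hypothetical entity tables: name -> system identifier.
dtd_entities = {"catalog": "dtd-catalog.ent", "legal": "legal.ent"}
link_entities = {"catalog": "catalog.lpd"}


def resolve_entity(name, link_active):
    """Look an entity up, letting link type declarations shadow DTD ones
    only while the link is active."""
    if link_active:
        scope = ChainMap(link_entities, dtd_entities)  # link declarations first
    else:
        scope = ChainMap(dtd_entities)
    return scope[name]
```

While the link is active, resolve_entity("catalog", True) returns the declaration made in the link type declaration, whereas an entity declared only in the DTD, such as the hypothetical "legal", is still found through the fallback.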
Another short cut is to use a name group for the associated element type part of a source element specification in the link set declaration (or in the link attribute set declarations). For example, if the HTML tags for entering computer coding examples are to have the same format when printed they could be declared as:
<!LINK #INITIAL ...
(xmp|listing|code|plaintext) block [face=courier]
... >
Care must, however, be taken in using this technique. In particular, it is important to check that the linked elements require the same set of attributes for both the source and result elements. If different sets of attributes apply on either side of the link this technique should not be used.
Where a number of elements share the same output format the values should be used as the default values in the specification of the attributes in the result DTD. This will allow the attribute specification list to be omitted where the result element only requires these default values.
Another possible short cut is to let the program imply the elements that a
link applies to, and the relevant attribute values. This can be done for
individual elements within an explicit link set by replacing the details of the
result element specification with the keyword #IMPLIED to give a
link set declaration of the form:
<!LINK linkname element [attributes] #IMPLIED>
Here the result of the link process will automatically be implied by the formatting program whenever the specified element is encountered, the optional link attribute specification associated with the source element specification defining parameters to be passed to the program.
The parser can also be instructed to link all otherwise unlinked source
elements to a default result element by using the #IMPLIED keyword
in place of the source element specification, e.g.:
<!LINK linkset1 #IMPLIED block [style=normal]>
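The effect of using #IMPLIED in place of the source element specification can be sketched as a lookup with a fallback entry. In this Python illustration the title rule, the element name note and the function result_for are assumptions added for the example; only the #IMPLIED rule mirrors the declaration above.

```python
IMPLIED = "#IMPLIED"

# linkset1 with one explicit (hypothetical) rule plus the #IMPLIED default.
linkset1 = {
    "title": ("block", {"style": "big-title"}),
    IMPLIED: ("block", {"style": "normal"}),
}


def result_for(link_set, element):
    """Return the (result element, attributes) pair for a source element,
    falling back to the #IMPLIED rule when no explicit rule exists."""
    return link_set.get(element, link_set.get(IMPLIED))
```

An element with its own rule keeps it, while any otherwise unlinked element, such as the hypothetical note, is mapped to a block with style normal. A link set without an #IMPLIED entry yields None for unlinked elements in this sketch.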
Link type declarations can be controlled from within the text by use of link set use declarations. (These have a very similar form to the short reference map use declarations.) The general form of such declarations is:
<!USELINK setname linkname>
where USELINK is the default version of the keyword defined in
the reference concrete syntax, setname is the name given to one of
the link set declarations (<!LINK ...>) in the document's
prolog and linkname is the link type name used to identify the
link type declaration (<!LINKTYPE ...>) that contains the
relevant link set declarations. A typical entry would be:
<!USELINK #INITIAL to-HTML >
As with the USEMAP declaration, the special #EMPTY
keyword can be used to switch off a link. To disable the to-HTML
link type once it has been enabled, for example, you could enter the following
declaration at any point in the text:
<!USELINK #EMPTY to-HTML >
If the link set map is switched off within an element by entry of a link set
use declaration with the keyword #EMPTY, the original link set can
be restored by entering a declaration of the form:
<!USELINK #RESTORE to-HTML >
On seeing this markup declaration, the program will restore the link set that was associated with the current element prior to the preceding link set use declaration (e.g. the one that was current when the element began).
Where a publicly declared link type declaration subset is already known to
the receiving system it can be invoked, like other publicly declared declaration
sets, by use of a formal public identifier. In this case the public identifier
qualifies a link type declaration, and so the public text class keyword used in
the formal public identifier is LPD. A typical declaration might
be:
<!LINKTYPE create-CD OPOCE HTML PUBLIC
"-//OPOCE//LPD CD creation link set//EN">
If the publicly declared link set is to be extended by local definitions, which may override some of the definitions in the publicly declared set, the formal public identifier can be followed by a link type declaration subset. It should be noted that, as with DTDs, externally stored link declarations are added to the end of the local definitions, the first definition always taking precedence. To ensure the proper handling of entity references, all entities declared within the link are treated as preceding entities declared in the source DTD. This means that any entity declarations within the link with the same name as entities declared in the DTD will take precedence. Similarly, if the link declarations contain attributes which reference general entities not declared in the link type declaration or the link's source DTD, they will, if no default entity has been defined in the link's source DTD, be taken from the declarations within the document type definition for the same element.
From the above examples it can be seen that SGML provides both document creators and document designers with a number of techniques for controlling how entered text is to be processed. It should not, however, be thought that link statements and concurrent document types provide all the tools needed to produce paginated text. Fully paginated text requires a powerful text formatter, which will normally need to be set up for specific applications. The degree of interaction possible between the SGML document designer and the text formatter will depend on the skill of the system's designers in linking the formatter to the information stored as an integral part of the SGML document.
A new standard, ISO/IEC 10179, published in 1996, provides the power needed to manage the more complex forms of transformation required for formatting documents. The Document Style Semantics and Specification Language (DSSSL) defined in ISO/IEC 10179 uses a variant of the LISP-based Scheme programming language to control the way in which SGML document trees are converted for formatting. The language also defines an SGML Document Query Language (SDQL) that can be used to identify specific components of an SGML document tree.
Guidelines for Electronic Text Encoding and Interchange (TEI P3), edited by C. M. Sperberg-McQueen and Lou Burnard for the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL) and the Association for Literary and Linguistic Computing (ALLC), Chicago/Oxford, 1994, 1289pp.
International Organization for Standardization (1996), Information technology - Text and office systems - Document Style Semantics and Specification Language (DSSSL) (ISO/IEC 10179:1996) Geneva: ISO.
International Organization for Standardization (1992), Information technology - Hypermedia/Time-based Structuring Language (HyTime) (ISO/IEC 10744:1992) Geneva: ISO.