The need for process-based semantic sets

Martin Bryan, The SGML Centre

There are three basic ways of looking at a set of information objects that make up an electronic commerce message:

  1. as a single hierarchy of objects conforming to the rules specified in a document model
  2. as a set of individual information objects which has been ordered for the purpose of transmission
  3. as an ordered set of processes, each process of which consists of an ordered or hierarchical set of information objects.

The first of these approaches is the one traditionally used for EDI messages, and is also the traditional view used for structured documents such as those defined using XML. It is presumed that a hierarchy of nested information objects has to be traversed in order from beginning to end, each object being processed as it is encountered within the message.

The second approach is the one traditionally used for data dictionaries and computer programs. It presumes that the contents of each information object have their own rules regarding creation and processing, and that each such information object can be used in combination with any other information object to form a message that will convey meaning irrespective of the order in which the information objects are processed.

The third approach recognises that there are relationships between individual information objects that require them to be used in a specific sequence if a specific process is to be undertaken, but that a message typically contains the results of multiple processes, which may not need to be processed in a specific order. For example, for the purposes of identifying a delivery address we need to process the information objects that make up the address in a particular order. The order in which the information objects are to be presented may depend on the processes that are to be used for the delivery. For example, the order in which address components are listed for postal delivery may differ from the order required for delivery using a transport contractor. Neither of these orders may match the order in which information about the relevant address was originally captured, stored or transmitted.

The processes that are used to capture the information that is to form an electronic commerce message will typically differ from the processes that are used to process the information at its point of receipt. Typically the information that is used to create the message will be captured in a set of discrete processes, each of which captures one subset of the final message. The message, therefore, represents the ordered concatenation of a sequence of information object sets/hierarchies. At the receiving end the message will be broken down into the individual information object sets/hierarchies, each of which could be passed to a separate process. What is unlikely to happen, however, is that the individual information objects that form a specific set/hierarchy will be passed to different processes. For example, if a name is broken down into a surname, forenames and initials it is unlikely that the surname will be sent to one process and the initials sent to another. It is much more likely that all three components of the name will be passed to a single process, which will then decide which of the component parts of the name it needs to process, and the order in which it will use them within the process.
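
The receiving-end behaviour described above can be sketched in Python. This is only an illustration under stated assumptions: the message layout (an ordered list of named information object sets), the handler functions, the registry and the sample values are all invented for the example and are not part of any existing message standard.

```python
# Hypothetical sketch: a message is an ordered concatenation of
# information object sets, and each set is passed whole to one process.

def handle_personal_name(objects):
    # The process itself decides which components it needs, and in what order.
    return f"{objects['GivenName']} {objects['FamilyName']}"

def handle_address(objects):
    # A postal process might use only some components, in its own order.
    wanted = ("Street", "Region", "Country")
    return ", ".join(objects[k] for k in wanted if k in objects)

# Assumed registry: the name of a set selects the single process that receives it.
HANDLERS = {
    "PersonalName": handle_personal_name,
    "Address": handle_address,
}

def dispatch(message):
    """Pass each complete set to its process; a set is never split up."""
    return [HANDLERS[name](objects) for name, objects in message]

# Made-up sample message.
message = [
    ("PersonalName", {"FamilyName": "Smith", "GivenName": "Jane", "MiddleInitials": "Q"}),
    ("Address", {"Street": "1 High Street", "Region": "Glos.", "Country": "UK"}),
]

results = dispatch(message)
```

Note that the name handler receives the initials even though it does not use them: the choice of components is left to the process, not to the dispatcher.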

In describing information objects it is important to identify the process with which an information object is intended to be associated. This name forms a natural "parent" for the information objects in a hierarchical view of the information set. For example:

Address
   RoomIdentifier
   BuildingIdentifier
   Street
   AdministrationArea
   PostingIdentifier
   Region
   Country
or
PersonalName
   FamilyName
   GivenName
   MiddleInitials
   Title

Process-based structures provide natural "units of reuse" for message components. When constructing a new message what should be selected are complete process units rather than the individual information objects.

Any repository of reusable information objects should allow objects to be selected based on intended process. Whilst being able to select complete messages is useful when the process sequence you want to describe has already been exactly described by someone else, there will not always be a match between the current information exchange requirements and the current set of messages. Any new set of processes will require a new message format, but in most cases it will not require that all the processes described be new processes that have to be described in terms of unique sets of information objects. Typically a new message type consists of a combination of a number of existing information object sets with one or more new sets of information objects that equate to a new process.

Given the above, it is important to structure any repository of lexical semantics in such a way that the information objects used by a process can be easily identified. This is not to suggest that the same information object can only apply to one process. There are situations where a particular information object can validly be associated with more than one process. However, the name of the information object should not be dependent on the processes that use it. It should, therefore, be possible to separate the process name from the information object name, so that the two components can be used separately, one as the parent of the other. For example:

Delivery
  Date
  Time
or
Born
  Date
  Time
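
One way to picture this separation of process name from information object name is the following Python sketch. It is an assumption-laden illustration: the dictionaries, the wording of the definitions and the "/" separator used to join parent and child are all hypothetical.

```python
# Hypothetical definitions: each information object is defined once,
# under a name that does not mention any process.
OBJECT_DEFINITIONS = {
    "Date": "A calendar date",
    "Time": "A time of day",
}

# Hypothetical process definitions: each process lists the objects it uses.
PROCESSES = {
    "Delivery": ["Date", "Time"],
    "Born": ["Date", "Time"],
}

def qualified_names(process):
    """Combine the process name (parent) with each object name (child)."""
    return [f"{process}/{obj}" for obj in PROCESSES[process]]

def definition(qualified_name):
    """Look up the shared definition using only the object part of the name."""
    _process, obj = qualified_name.split("/")
    return OBJECT_DEFINITIONS[obj]
```

Here `Delivery/Date` and `Born/Date` are distinguished by their parent alone, while both resolve to the same definition of `Date`.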

Processes may contain nested processes. For example, as well as a date and time a set of delivery instructions may need to include a delivery address, which can be captured by reusing the information objects defined for an existing process, e.g.

Delivery
  Date
  Time
  Address
    RoomIdentifier
    BuildingIdentifier
    Street
    AdministrationArea
    PostingIdentifier
    Region
    Country

Sometimes only a subset of the information recorded by a particular process may be required, e.g.

Born
  Date
  Time
  Address
    AdministrationArea
    Region
    Country
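
Reuse of a whole process unit, and of a subset of one, might be expressed along the following lines. This is a sketch only: the nested-dictionary representation (with `None` marking a plain information object) and the `reuse` helper are assumptions made for illustration.

```python
# The existing Address process unit; None marks a plain information object.
ADDRESS = dict.fromkeys([
    "RoomIdentifier", "BuildingIdentifier", "Street", "AdministrationArea",
    "PostingIdentifier", "Region", "Country",
])

def reuse(unit, only=None):
    """Return a copy of a process unit, optionally restricted to a subset."""
    if only is None:
        return dict(unit)
    missing = [name for name in only if name not in unit]
    if missing:
        raise KeyError(f"not part of the reused process: {missing}")
    # Keep the order defined by the original unit.
    return {name: unit[name] for name in unit if name in only}

# Delivery reuses the whole Address unit as a nested process.
DELIVERY = {"Date": None, "Time": None, "Address": reuse(ADDRESS)}

# Born reuses only part of it.
BORN = {
    "Date": None,
    "Time": None,
    "Address": reuse(ADDRESS, only={"AdministrationArea", "Region", "Country"}),
}
```

Rejecting names that are not in the original unit keeps a subset from quietly turning into a new, undocumented process.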

How does this relate to the structure of existing repositories of electronic commerce business objects?

Microsoft's BizTalk Repository and the OASIS Catalog provide access to XML Schemas or Document Type Definitions (DTDs) that define complete messages. With the BizTalk Repository you can register your interest in a particular schema so that you will be notified of any future changes. With the OASIS Catalog you can register your own message format on-line.

CommerceOne's Common Business Library and the ISO TC154 Basic Semantic Register provide unordered sets of information objects that you can use to create a new message. The Common Business Library definitions do contain information about any subelements that make up another element, but provide no automated mechanism to obtain the definitions of all the information elements associated with a process. The Basic Semantic Register uses structured object names that are supposed to indicate process relationships, but these are very inconsistently applied, and you cannot rely on all the objects used by a process having common initial components to their names.

Perhaps not surprisingly, the nearest match to the concept of identifying the processes for which related information objects are used is to be found in the UN-EDIFACT Directories that have been used to define the structure of EDI messages for years. Each UN segment and segment group acts as a parent for a set of related information objects. The problem with the UN directory is that, over time, it has become overloaded. Rather than recognize parentage separately from the application of the information objects, attempts have been made to generalize the use of particular segments, and then to use qualifiers to indicate differences in parentage. For example, the UN directories define a single DateTime segment, and then try to qualify this by saying that this date/time relates to delivery date, this one to birth date, etc.
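
The difference between qualifying a generic segment and naming its parent can be illustrated with a short Python sketch. Everything in it is assumed for the purpose of illustration: the dictionary layout, the qualifier labels (which are not real directory codes), the mapping table and the sample values.

```python
# Qualifier style: one generic segment plus a code saying what it refers to.
# The qualifier values are illustrative labels, not real directory codes.
qualified_segments = [
    {"segment": "DateTime", "qualifier": "delivery", "Date": "2000-03-01", "Time": "09:00"},
    {"segment": "DateTime", "qualifier": "birth", "Date": "1970-05-17", "Time": "23:15"},
]

# Assumed mapping from qualifier to the process that should act as parent.
QUALIFIER_TO_PROCESS = {"delivery": "Delivery", "birth": "Born"}

def to_parentage(segments):
    """Re-express qualified segments as sets with process-named parents."""
    message = {}
    for seg in segments:
        parent = QUALIFIER_TO_PROCESS[seg["qualifier"]]
        # Drop the bookkeeping fields; keep only the information objects.
        message[parent] = {
            k: v for k, v in seg.items() if k not in ("segment", "qualifier")
        }
    return message
```

In the parentage form the information objects keep the same names (`Date`, `Time`) in both cases, and the distinction is carried entirely by the parent.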

Some efforts have been made to restructure EDI message formats into more process-dependent sets. In particular, the possibility of creating object-oriented EDI messages (OO-edi) has been discussed for many years. The recent advent of XML as a language for creating and transmitting electronic commerce messages has highlighted the benefits of using object-oriented message structures, but has at the same time highlighted the benefits of using parentage to differentiate between different uses of the same information objects, rather than creating separately named information objects for each process.

More recently, discussions have taken place within TC154 about the advantages of using a tiered naming approach, with the first tier representing an object class name, the second a property name and the third a representation name. My personal opinion is that information object names should not include a component to indicate datatype, as this makes it difficult for applications to assign a more restricted definition to the object at a later date. This is particularly important where the data type is an enumerated list, or a number that may be required to fall within a range.

What can we conclude from the preceding discourse? Firstly we can note that reusability of processes is the key factor in deciding which information objects should be related to one another. Secondly we should note that a message represents the ordered concatenation of information created by multiple processes. Thirdly we should note that identification of the process is more important than identification of its individual information objects. This identification should provide a natural "parent" name for the set of information objects.

What does this tell us about the structure of any repository of reusable information object semantic definitions? Firstly that the main selection criteria should be based on process names. Secondly that all information objects related to a process should be automatically associated with the process name: asking for semantics about the process will automatically provide semantics about its information objects, including any nested sub-processes.
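
The second requirement, that asking about a process automatically yields its information objects and any nested sub-processes, might look like this in Python. The repository contents are shortened and the lookup function is hypothetical; the sketch also assumes that no process includes itself, directly or indirectly.

```python
# Hypothetical repository: each process lists its members by name.
# A member that is itself a key in the repository is a nested sub-process.
REPOSITORY = {
    "Address": ["Street", "Region", "Country"],
    "WhenRequired": ["Date", "Time"],
    "DeliverySpecification": ["QuantityRequired", "WhenRequired", "Address"],
}

def describe(process, repository=REPOSITORY):
    """Return the process as a nested dict, expanding sub-processes."""
    tree = {}
    for member in repository[process]:
        if member in repository:
            # Nested sub-process: expand it in place.
            tree[member] = describe(member, repository)
        else:
            # Plain information object.
            tree[member] = None
    return tree
```

A single call such as `describe("DeliverySpecification")` then returns the whole hierarchy, so the user never has to look up the nested processes one by one.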

Semantic repositories that are process based will themselves be hierarchically structured. Whilst the best structure for such a repository still needs to be researched, let me suggest how it could be structured, either as an LDAP directory structure, or as a structured document. The following hypothetical structure shows how such a directory could be presented to users:

OrganizationIdentification
  OrganizationName
  Address
    RoomIdentifier
    BuildingIdentifier
    Street
    AdministrationArea
    PostingIdentifier
    Region
    Country
  ContactPoints
    TelephoneNo
    FaxNo
    E-mail
    WebSite

PersonIdentification
  PersonalName
    FamilyName
    GivenName
    MiddleInitials
    Title
  Nickname
  Address
    RoomIdentifier
    BuildingIdentifier
    Street
    AdministrationArea
    PostingIdentifier
    Region
    Country
  ContactPoints
    TelephoneNo
    FaxNo
    E-mail
    WebSite

ProductIdentification
  ReferenceNumber
  Name
  PackedQty
  Measurements
    Height
    Width
    Depth
    Weight

DeliverySpecification
  ProductIdentification
    ReferenceNumber
    Name
  QuantityRequired
  WhenRequired
    Date
    Time
  WhereRequired
    Address
      RoomIdentifier
      BuildingIdentifier
      Street
      AdministrationArea
      PostingIdentifier
      Region
      Country
  WhoRequiredBy
    PersonalName
      FamilyName
      GivenName
      MiddleInitials
      Title
    Nickname
    ContactPoints
      TelephoneNo
      FaxNo
      E-mail
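
As an illustration of the structured document option, the following Python sketch turns one entry of the hypothetical directory into an XML element tree using the standard library. The nested-dictionary input format and the helper functions are assumptions made for the example; no particular schema is implied.

```python
import xml.etree.ElementTree as ET

# One entry from the directory above; None marks a plain information object.
PRODUCT_IDENTIFICATION = {
    "ReferenceNumber": None,
    "Name": None,
    "PackedQty": None,
    "Measurements": {"Height": None, "Width": None, "Depth": None, "Weight": None},
}

def to_element(name, children):
    """Build an element for a process, nesting its members beneath it."""
    elem = ET.Element(name)
    for child_name, grandchildren in (children or {}).items():
        elem.append(to_element(child_name, grandchildren))
    return elem

def paths(elem, prefix=""):
    """List every process and information object with its full parentage."""
    here = f"{prefix}/{elem.tag}" if prefix else elem.tag
    result = [here]
    for child in elem:
        result.extend(paths(child, here))
    return result

root = to_element("ProductIdentification", PRODUCT_IDENTIFICATION)
xml_text = ET.tostring(root, encoding="unicode")
```

Calling `paths(root)` lists entries such as `ProductIdentification/Measurements/Height`, which shows how the parent names alone distinguish one use of an information object from another.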

Observations on the concepts presented in this paper should be addressed in the first instance to Martin Bryan at The SGML Centre, 29 Oldbury Orchard, Churchdown, Glos. GL3 2PU, UK (mtbryan@sgml.u-net.com)