Managing the Multilingual Semantic Web

When I use a word it means just what I choose it to mean, neither more nor less. - Humpty Dumpty/Lewis Carroll
You mustn't take what I said out of context - Every 20th Century British Prime Minister
Words mean what I say they mean today - not what I thought they meant yesterday, or will think tomorrow - Martin's Corollary

This short paper seeks to highlight some of the main problems that will be encountered by those trying to manage multilingual semantic webs, and suggests some simple extensions to the concepts defined by the Resource Description Framework (RDF) and XML Topic Maps (XTM), based on techniques developed for linguistic analysis, that could be used to overcome these problems.

A basic assumption of this paper, which may prove invalid over the long term, is that, because current thinking requires the use of the Extensible Markup Language (XML) for the interchange of semantic information over the Internet, the best way to represent this information is through the use of XML. The advantage of this approach is that existing XML tools can be used to associate meaning with components of XML documents. The disadvantage is that XML cannot efficiently define all the operations required to manage a multilingual semantic web. Necessarily, therefore, what is proposed in this paper is a subset of what will be required to completely manage a truly multilingual semantic web.

This paper does not discuss the ways in which concepts can or should be associated. It concentrates on the way in which concept definitions should be managed, rather than applied, though it does suggest a modification, prompted by the linguistic requirements identified in the text, to the way occurrences of concepts are identified.

Problems to be solved

  1. Words have many meanings

    A particular string of characters can have multiple meanings. In the case of the three letters 'run', for example, the Oxford English Dictionary distinguishes over 40 different meanings that can be assigned to the word. Therefore the string itself is not sufficient to identify meaning. 

    Words have different forms, depending on the grammatical context in which they are used. Nouns can be singular or plural. Sometimes the same noun serves as both singular and plural (e.g. sheep); more often there is a regular change from singular to plural (e.g. ewe, ewes) but occasionally the connection between the singular and plural forms is irregular (e.g. goose, geese). Verbs can be in the past, present or future tense and, for many languages, will change form depending on whether they are used in the first, second or third person. Different forms of each application of a verb need to be recognized as identifying a single meaning, whichever grammatical form was used to identify the concept.

    The same string of characters can have different meanings in different languages. Alternatively the same string can mean the same thing in different languages. Some meanings are shared by all languages; some meanings are unique to a set of languages. It is important to identify which languages a particular term applies to.

    The ordering of meanings is critical to human understanding. Wherever possible, the most common meanings of a set of characters should be presented before the less common ones, if only to minimize the time needed to identify the correct meaning. However, it is important to note that the ordering of words is language dependent.

    In many languages the grammatical gender assigned to a word is important, either as a differentiator or as a means of determining which qualifying phrases can legitimately be applied to the word.

    If we automatically assign generated unique identifiers to each word, the link between the characters and the meaning is broken. Identifiers can be useful for grouping together related subjects. Identifiers are used to reference the concept. In many cases you also need to be able to identify the individual names assigned to concepts.

    Possible solutions

    Each name should be individually referable through a unique identifier, which should be related in some way to that assigned to identify the concept as a whole. Where practicable identifiers should be assigned in a manner that allows related subjects to be placed together in a sorted list of identifiers.

    Ordering should be a property of a name, not of the concept. Each sense in which a sequence of characters can be used should be assigned a meaning identifier string that can be used to reference the meaning. Ordering strings should start with the string concerned (minus any spaces or punctuation), followed by a qualifying number that indicates the relative importance of the meaning, chosen so that entries fall into the most logical order when sorted according to the algorithm assigned for the selected language (e.g. run-010, run-020, etc. for the most common, second most common, etc. meaning of the term). The ordering string should not change between the singular and plural forms of a noun, or between the different tenses of a verb.

    The grammatical form of the word should be recorded as part of the identification of its permitted forms. This can either be as an attribute of a naming element or by assigning separate elements for each grammatical form.

    The different forms in which the term can be used should be recorded with the meaning. For example, if the term is a noun, then singular and plural forms should be recorded, as should the gender of the word. The following example suggests how this could be achieved:

    <concept id="animal-sheep">
     <identifying-terms xml:lang="en">
      <noun type="s" gender="n" meaning="sheep-010" id="animal-sheep-01">sheep</noun>
      <noun type="s" gender="f" meaning="ewe-010" id="animal-sheep-02">ewe</noun>
      <noun type="s" gender="m" meaning="ram-030" id="animal-sheep-03">ram</noun>
      <noun type="s" gender="n" meaning="lamb-020" id="animal-sheep-04">lamb</noun>
      <noun type="p" gender="n" meaning="sheep-010" id="animal-sheep-05">sheep</noun>
     </identifying-terms>
     <identifying-terms xml:lang="fr">
      <noun type="s" gender="m" meaning="mouton-010" id="animal-sheep-11">mouton</noun>
      <noun type="s" gender="f" meaning="brebis-010" id="animal-sheep-12">brebis</noun>
      <noun type="s" gender="m" meaning="agneau-020" id="animal-sheep-13">agneau</noun>
      <noun type="s" gender="f" meaning="agnelle-010" id="animal-sheep-14">agnelle</noun>
      <noun type="p" gender="m" meaning="mouton-010" id="animal-sheep-15">moutons</noun>
     </identifying-terms>
     ...
    </concept>
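
    The example above records grammatical form in the type attribute. The alternative mentioned earlier, a separate element for each grammatical form, might look like the following sketch, which uses the irregular plural from the discussion of word forms. The element names noun-singular and noun-plural, the concept identifier animal-goose and the ordering string goose-010 are assumptions made for illustration only:

    <concept id="animal-goose">
     <identifying-terms xml:lang="en">
      <!-- noun-singular and noun-plural are hypothetical element names -->
      <noun-singular gender="n" meaning="goose-010" id="animal-goose-01">goose</noun-singular>
      <noun-plural gender="n" meaning="goose-010" id="animal-goose-02">geese</noun-plural>
     </identifying-terms>
    </concept>

    Note that the plural form geese keeps the ordering string goose-010, so both forms sort with, and resolve to, the same meaning.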

  2. Meanings are dependent on context

    The meaning of a word is determined by the context in which it is used. Simply identifying a string in a piece of text is not sufficient in the general case to identify its meaning. For example, consider the following applications of the letters 'runs': the water runs ...; an athlete runs ...; the computer runs ...; the computer bus runs ...; the bus runs .... In the last case, what follows the word can also affect the meaning of the term, as the following examples show: the bus runs on Thursdays only; the bus runs via the High Street; the bus runs on petrol; the bus runs profitably; the bus runs on time most days.

    The meaning of the word depends on the generic type of the associated subject. For example, the meaning of the word runs depends on whether it is applied to a liquid, an animal, a machine, a form of transport, etc. The generic type identifies the "domain" in which the term is applicable.

    Meaning can depend on whether you are referring to a whole thing or a part of something, or on relative position in space and time. For example, the uppermost bed and the lowest bed of a sequence of rocks are clearly distinguishable, though the meaning of the word 'bed' is the same in both cases, while the bed of a truck is dependent on its position within the whole.

    Context identifiers take different forms in different languages. There is not a one-to-one correspondence between the forms required for different languages.

    Possible solutions

    Identify the set of domains in which a term can validly be applied. 

    Associate each application of a term with the qualifying domain in which it has been applied. 

    <generic-domain id="liquid">
      <definition xml:lang="en">
        Fluid that has a sufficiently low viscosity to allow
        easy flow under the pressure of gravity on Earth.
      </definition>
      ...
    </generic-domain>
    <concept id="liquid-water">
     <domain is="liquid"/>
     <identifying-terms xml:lang="en">
      <noun type="s" meaning="water-010" id="liquid-water-01">water</noun>
      <noun type="p" meaning="water-010" id="liquid-water-02">waters</noun>
     </identifying-terms>
     <identifying-terms xml:lang="fr">
      <noun type="s" gender="f" meaning="eau-010" id="liquid-water-11">eau</noun>
      <noun type="p" gender="f" meaning="eau-010" id="liquid-water-12">eaux</noun>
     </identifying-terms>
      ...
    </concept>
    <concept id="liquid-flow">
     <domain is="liquid"/>
     <identifying-terms xml:lang="en">
      <verb tense="present" person="1s 2s 1p 2p 3p" meaning="flow-010" id="liquid-flow-01">flow</verb>
      <verb tense="present" person="3s" meaning="flow-010" id="liquid-flow-02">flows</verb>
      <verb tense="past" person="1s 2s 3s 1p 2p 3p" meaning="flow-010" id="liquid-flow-03">flowed</verb>
      <verb tense="future" person="1s 2s 3s 1p 2p 3p" meaning="flow-010" id="liquid-flow-04">flow</verb>
      <verb tense="present" person="1s 2s 1p 2p 3p" meaning="run-010" id="liquid-flow-05">run</verb>
      <verb tense="present" person="3s" meaning="run-010" id="liquid-flow-06">runs</verb>
      <verb tense="past" person="1s 2s 3s 1p 2p 3p" meaning="run-010" id="liquid-flow-07">ran</verb>
      <verb tense="future" person="1s 2s 3s 1p 2p 3p" meaning="run-010" id="liquid-flow-08">run</verb>
     </identifying-terms>
      ...
    </concept>
    <instance of="liquid-flow-05" context-is="liquid-water-02" location="..."/>
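
    To show how the domain separates senses of the same string, the sketch below adds a second generic domain alongside the liquid example above. The identifiers machine, machine-computer and machine-operate, the definition text and the ordering strings computer-010 and run-030 are assumptions made for illustration only; they are not taken from any published vocabulary.

    <generic-domain id="machine">
      <definition xml:lang="en">
        Device that uses power to perform a task.
      </definition>
    </generic-domain>
    <concept id="machine-computer">
     <domain is="machine"/>
     <identifying-terms xml:lang="en">
      <noun type="s" meaning="computer-010" id="machine-computer-01">computer</noun>
      <noun type="p" meaning="computer-010" id="machine-computer-02">computers</noun>
     </identifying-terms>
    </concept>
    <concept id="machine-operate">
     <domain is="machine"/>
     <identifying-terms xml:lang="en">
      <!-- run-030 is an assumed ranking for this sense of 'run' -->
      <verb tense="present" person="1s 2s 1p 2p 3p" meaning="run-030" id="machine-operate-01">run</verb>
      <verb tense="present" person="3s" meaning="run-030" id="machine-operate-02">runs</verb>
      <verb tense="past" person="1s 2s 3s 1p 2p 3p" meaning="run-030" id="machine-operate-03">ran</verb>
     </identifying-terms>
    </concept>
    <!-- 'the computer runs ...': the string 'runs' resolves to the machine sense -->
    <instance of="machine-operate-02" context-is="machine-computer-01" location="..."/>

    Because the instance names both the term used and the term that supplies its context, 'runs' in 'the computer runs' is kept distinct from 'runs' in 'the water runs', although the character strings are identical.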

  3. Terms used to convey meanings change over time, as well as space

    Terms used to define meaning can change over time for technical or cultural reasons. For example, many chemicals have both generic and formal names, the former having been originally assigned by the manufacturer and the latter being derived using a set of scientific rules determined at a later date.

    The same term may have different meanings in different geographical areas within a given linguistic community. For example, the terms trunk and boot are used in different ways in the UK and the USA, despite the two countries' claims to share a common language.

    Terms of measurement can change both over time and space. For example, changes from Fahrenheit to Centigrade for measuring temperature have occurred in different UK industries at different times, but a similar change has not affected many USA industries.

    Possible solutions

    Record old and new names. Identify the source of technical terms.

    Record the dates at which terms started and/or ended their validity. Currently valid terms would have no end-date.

    <concept id="liquid-ethanol">
     <domain is="liquid"/>
     <identifying-terms xml:lang="en">
      <noun type="s" meaning="ethanol-010" id="liquid-ethanol-01" use-started="1550">alcohol</noun>
      <noun type="s" meaning="ethanol-010" id="liquid-ethanol-02"
            source="Chemical Society" use-started="1903">ethanol</noun>
      <noun type="s" meaning="ethanol-010" id="liquid-ethanol-03">ethyl alcohol</noun>
      <noun type="s" meaning="ethanol-010" id="liquid-ethanol-04" use-ended="1945">spirit</noun>
      <noun type="p" meaning="ethanol-010" id="liquid-ethanol-05" xml:lang="en-GB">spirits</noun>
     </identifying-terms>
     ...
    </concept>
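
    Regional variation can be recorded in the same way as variation over time. The following sketch covers the trunk/boot example; the concept identifier vehicle-boot, the domain vehicle and the ordering strings are assumptions made for illustration only, and the region is given by narrowing xml:lang on the individual term.

    <concept id="vehicle-boot">
     <domain is="vehicle"/>
     <identifying-terms xml:lang="en">
      <!-- Ordering strings below are assumed rankings, not dictionary data -->
      <noun type="s" meaning="boot-020" id="vehicle-boot-01" xml:lang="en-GB">boot</noun>
      <noun type="p" meaning="boot-020" id="vehicle-boot-02" xml:lang="en-GB">boots</noun>
      <noun type="s" meaning="trunk-030" id="vehicle-boot-03" xml:lang="en-US">trunk</noun>
      <noun type="p" meaning="trunk-030" id="vehicle-boot-04" xml:lang="en-US">trunks</noun>
     </identifying-terms>
    </concept>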

  4. Linguistic boundaries - Vive la difference

    Across languages the boundaries of terms can differ. For example, the Spanish term that maps to the English words agriculture and farming, agricultura, does not cover the raising of cattle, which is presumed to be an integral part of the concept in the UK. A farmer in the UK would not, however, refer to the raising of rabbits and horses as part of agriculture, as they are not part of the normal human food chain, whereas a Belgian farmer would consider them to be part of the food chain.

    The only way to determine what a term is expected to mean is to provide an adequate statement of its boundaries in at least one language. However, it is important that the statement be in a form that is comprehensible to all users of the concept, irrespective of their linguistic background. Therefore, translations of the definition should be provided in all supported languages.

    It is not realistic to expect one person to provide all the linguistic definitions and terms associated with a particular concept. Translations of definitions and the identification of relevant terms for describing the concept will be done by different people at different times in different countries. The maintenance cycle for each language will often be separate. There needs to be a mechanism whereby separately maintained sets of terms and definitions can refer to each other in a way that allows the different languages to be interconnected quickly.

    Possible solutions

    Allow definitions and sets of identifying terms to be imported from external web sites.

    Allow definitions and sets of identifying terms to identify the concept of which they form part.

    Provide a mechanism for indicating that the mapping between a term and a concept is only partial, and whether that is because the term covers only part of the concept or because the concept covers only part of the term.

    The following example suggests how this could be achieved:

    <concepts id="FarmingTerminologyEnglish"> 
     <concept id="animal-sheep">
      <definition xml:lang="en">Animals of genus Ovis</definition>
      <identifying-terms xml:lang="en">
       <noun type="s" gender="n" meaning="sheep-010" id="animal-sheep-01">sheep</noun>
       <noun type="s" gender="f" meaning="ewe-010" id="animal-sheep-02" coverage="partial">ewe</noun>
       <noun type="s" gender="m" meaning="ram-030" id="animal-sheep-03" coverage="partial">ram</noun>
       <noun type="s" gender="n" meaning="lamb-020" id="animal-sheep-04" coverage="partial">lamb</noun>
       <noun type="p" gender="n" meaning="sheep-010" id="animal-sheep-05">sheep</noun>
      </identifying-terms>
      <import-identifying-terms from="http://www.myco.com/farming/terms-fr.xml#animal-sheep" xml:lang="fr"/>
      ...
     </concept>
     ...
    </concepts>
    <concepts id="FarmingTerminologyFrench"> 
     <concept id="animal-sheep" in="http://www.myco.com/farming/terms-en.xml">
      <definition xml:lang="fr">Animaux de genre Ovis</definition>
      <identifying-terms xml:lang="fr">
       <noun type="s" gender="m" meaning="mouton-010" id="animal-sheep-11">mouton</noun>
       <noun type="s" gender="f" meaning="brebis-010" id="animal-sheep-12" coverage="partial">brebis</noun>
       <noun type="s" gender="m" meaning="agneau-020" id="animal-sheep-13" coverage="partial">agneau</noun>
       <noun type="s" gender="f" meaning="agnelle-010" id="animal-sheep-14" coverage="partial">agnelle</noun>
       <noun type="p" gender="m" meaning="mouton-010" id="animal-sheep-15">moutons</noun>
      </identifying-terms>
     </concept>
     ...
    </concepts>
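
    The coverage attribute in the example above records that a mapping is partial, but not in which direction. One way of recording the direction is sketched below. The attribute values part-of-concept and part-of-term, the concept identifier activity-farming, its definition text and the ordering strings are assumptions made for illustration only:

    <concept id="activity-farming">
     <definition xml:lang="en">Growing of crops and raising of livestock</definition>
     <identifying-terms xml:lang="en">
      <noun type="s" meaning="farming-010" id="activity-farming-01">farming</noun>
      <noun type="s" meaning="agriculture-010" id="activity-farming-02">agriculture</noun>
     </identifying-terms>
     <identifying-terms xml:lang="es">
      <!-- Term is narrower than the concept: it excludes the raising of cattle -->
      <noun type="s" gender="f" meaning="agricultura-010" id="activity-farming-11"
            coverage="part-of-concept">agricultura</noun>
     </identifying-terms>
    </concept>
    <concept id="liquid-ethanol">
     <identifying-terms xml:lang="en">
      <!-- Concept is narrower than the term: 'alcohol' also names a wider class of compounds -->
      <noun type="s" meaning="ethanol-010" id="liquid-ethanol-01"
            coverage="part-of-term">alcohol</noun>
     </identifying-terms>
    </concept>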

  5. Managing semantic definitions over time

    Sets of concepts will necessarily change with time. The sets of definitions and terms associated with a particular concept will change over time. The set of instances of a particular concept will change regularly. The person responsible for defining a concept in a particular language at a particular time will typically be different from the person who identifies instances of the concept within information resources. Individual elements of a concept definition may need to be updated individually, without updating the whole concept.

    Possible solutions

    Identify the person who created each definition, using inheritance to limit the number of identifications necessary. Distinguish modifiers from creators. 

    Identify the time at which each record was created and last modified.

    The following example suggests how this might be achieved:

    <concepts id="FarmingTerminologyEnglish" version="20030621T164530Z">
     <concept id="animal-sheep" created-by="martin@is-thought.co.uk" on="20030604T102333Z+0100">
      <definition xml:lang="en">Animals of genus Ovis</definition>
      <identifying-terms xml:lang="en">
       <noun type="s" gender="n" meaning="sheep-010" id="animal-sheep-01">sheep</noun>
       <noun type="s" gender="f" meaning="ewe-010" id="animal-sheep-02" coverage="partial">ewe</noun>
       <noun type="s" gender="m" meaning="ram-030" id="animal-sheep-03" coverage="partial">ram</noun>
       <noun type="s" gender="n" meaning="lamb-020" id="animal-sheep-04" coverage="partial"
             modified-by="steve@coolheads.com" last-modified="20030607T093422Z-0600">lamb</noun>
       <noun type="p" gender="n" meaning="sheep-010" id="animal-sheep-05">sheep</noun>
      </identifying-terms>
      <import-identifying-terms from="http://www.myco.com/farming/terms-fr.xml#animal-sheep" xml:lang="fr" 
                                created-by="michel@coolheads.com" on="20030608T093422Z-0500"/>
      ...
     </concept>
     ...
    </concepts>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
     <rdf:Description rdf:about="http://www.myco.com/farming/terms-en.xml#animal-sheep">
      <concepts id="FarmingTerminologyFrench" version="20030621T164530Z">
       <concept id="animal-sheep" created-by="michel@coolheads.com" on="20030608T093422Z-0500">
        <definition xml:lang="fr">Animaux de genre Ovis</definition>
        <identifying-terms xml:lang="fr">
         <noun type="s" gender="m" meaning="mouton-010" id="animal-sheep-11">mouton</noun>
         <noun type="s" gender="f" meaning="brebis-010" id="animal-sheep-12" coverage="partial">brebis</noun>
         <noun type="s" gender="m" meaning="agneau-020" id="animal-sheep-13" coverage="partial">agneau</noun>
         <noun type="s" gender="f" meaning="agnelle-010" id="animal-sheep-14" coverage="partial">agnelle</noun>
         <noun type="p" gender="m" meaning="mouton-010" id="animal-sheep-15">moutons</noun>
        </identifying-terms>
       </concept>
       ...
      </concepts>
     </rdf:Description>
    </rdf:RDF>
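
    Because instances will typically be identified by someone other than the person who defined the concept, the same attributes could be applied to instance records. In the following sketch the wrapper element instances, the addresses at example.org and the timestamps are assumptions made for illustration only; the instance element and its of and location attributes follow the earlier examples.

    <!-- Hypothetical wrapper: creator and creation time are inherited by each instance -->
    <instances created-by="indexer@example.org" on="20030622T090000Z">
     <instance of="animal-sheep-05" location="..."/>
     <!-- This record was later corrected by a different (hypothetical) person -->
     <instance of="animal-sheep-04" location="..."
               modified-by="reviewer@example.org" last-modified="20030623T141500Z"/>
    </instances>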

Martin Bryan, IS-Thought, June 2003