The Swedish SIMPLE Lexicon

A language engineering resource with access to semantic information in Swedish
elaborated by Språkdata, Göteborgs Universitet

Contact person Maria Toporowska Gronostaj

1. General design information

The Swedish SIMPLE lexicon is a language engineering resource being developed by Språkdata, Göteborgs Universitet, with a focus on encoding semantic information in the lexicon. Its semantic data is linked to the morphological and syntactic data in the Swedish PAROLE lexicon, which results in an NLP resource covering the whole PAROLE-SIMPLE mandatory spectrum of lexical information. (See the document on the Swedish PAROLE lexicon.)

The Swedish lexicon is one of a suite of lexicons for the 12 EU languages, harmonized with respect to their content and encoding formalism. Its theoretical and formal design conforms to the guidelines presented by the Specification Group in the Linguistic Specifications SIMPLE (Lenci et al. 1998). The semantic data is encoded according to an SGML Document Type Definition (DTD).

The lexical data in the LE-PAROLE and SIMPLE lexicons draws to a great extent on lexical resources developed at Språkdata, Göteborgs Universitet. Machine-readable monolingual dictionaries such as Svenska ord and Nationalencyklopedins ordbok, as well as the lexical database Göteborgs Lexikaliska Databas (GLDB), have proved to be valuable sources of information on Swedish morphology, syntax and semantics. Some of the lexical material from GLDB, for example definitions and examples, has been reused. It is worth noting that the discrimination of senses in the Swedish SIMPLE lexicon follows, to a large extent, the sense distinctions made in GLDB.

The Swedish SIMPLE lexicon is populated from a subset of the morphological and syntactic units encoded in the Swedish PAROLE lexicon. At its core are Swedish equivalents of the set of base concepts shared by all the SIMPLE partners. In the final phase of the project, the lexicon will comprise 10000 semantic units (Usems), distributed among nouns (7000 Usems), verbs (2000 Usems) and adjectives (1000 Usems). (By August 1, 1999, 3442 noun Usems and 325 verb Usems had been encoded in SGML in the lexicon.)

2. The semantic layer

A semantic unit (Usem) provides access to the semantic layer. It represents a meaning, that is, a word sense. Each semantic unit is assigned a semantic type plus other kinds of semantic information. For example, the semantic units hund ('dog') and katt ('cat') are assigned the semantic type Animal. The semantic type names an ontological category from the SIMPLE ontology and predefines the other kinds of semantic information associated with that type, such as the semantic class or the semantic relations invoked by Pustejovsky's qualia (Pustejovsky, 1995).

The following mandatory semantic information is encoded in the Swedish SIMPLE lexicon, whenever relevant:

template type: semantic type of the Usem
domain: domain information from the ERLI/LexiQuest domain list
semantic class: a node in the ERLI/LexiQuest ontology
gloss: definition taken from the GLDB database
predicative representation: predicate associated with the Usem and its argument structure
selectional restrictions: restrictions on the arguments in the predicative representation

In addition to the types of semantic information listed above, the semantic layer also records links between morphological and syntactic units on the one hand, and between syntactic and semantic units on the other. The relations between units are either one-to-one or one-to-many, which is reflected in the statistics below:

	Total number of morphological units:	2187
	Total number of syntactic units:	2795
	Total number of semantic units:	3767

(The statistics are based on the data available on August 1, 1999.)

The material encoded so far seems to confirm the assumption that assigning different kinds of semantic information to a Usem supports sense discrimination. In many cases, a cluster of semantic information comprising template type, domain and semantic class is sufficient to disambiguate different readings, especially if it is linked to the information on the syntactic unit (Usyn) in the Swedish PAROLE lexicon. The example below illustrates the point:

Usyn	Usem	Template	Domain	SemClass

DN0	doktor1	Profession	Medicine	Occupation_agent
DN0	doktor2	Social_status	Higher_Education	Situ
DNNCOMPLFIX	doktor2	Social_status	Higher_Education	Situ
DNNRESTRAPPA	doktor2	Social_status	Higher_Education	Situ

(Situ stands for situational or punctual properties of nouns referring to humans.)
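The disambiguation idea illustrated by the doktor table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Reading record and the candidate_usems function are hypothetical names, not part of the PAROLE-SIMPLE tools, and the rows are simply copied from the table.

```python
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    """One row of the table: a Usyn linked to a Usem and its features."""
    usyn: str
    usem: str
    template: str
    domain: str
    sem_class: str


# Rows copied from the doktor table in the text.
READINGS = [
    Reading("DN0", "doktor1", "Profession", "Medicine", "Occupation_agent"),
    Reading("DN0", "doktor2", "Social_status", "Higher_Education", "Situ"),
    Reading("DNNCOMPLFIX", "doktor2", "Social_status", "Higher_Education", "Situ"),
    Reading("DNNRESTRAPPA", "doktor2", "Social_status", "Higher_Education", "Situ"),
]


def candidate_usems(readings, usyn: Optional[str] = None,
                    domain: Optional[str] = None):
    """Return the sorted Usem ids compatible with the clues that are given."""
    return sorted({
        r.usem for r in readings
        if (usyn is None or r.usyn == usyn)
        and (domain is None or r.domain == domain)
    })


# The syntactic description DN0 alone leaves both senses open ...
print(candidate_usems(READINGS, usyn="DN0"))
# ... adding the domain narrows it to one reading ...
print(candidate_usems(READINGS, usyn="DN0", domain="Medicine"))
# ... and some syntactic descriptions select a single sense by themselves.
print(candidate_usems(READINGS, usyn="DNNCOMPLFIX"))
```

In this sketch the syntactic description DN0 is compatible with both senses, whereas DNNCOMPLFIX and DNNRESTRAPPA occur only with doktor2, which is why combining Usyn information with the semantic cluster helps to separate the readings.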

The data encoded in the Swedish SIMPLE lexicon can be readily clustered and viewed according to a number of different semantic and morphosyntactic criteria. For example, clustering the Usems under the template Instrument that belong to the General (unspecified) domain by the semantic classes Apparatus and Instrument results in the following lists:

WVSF_TEMPLATE_Instrument_PROT
TSVP_APPARATUS_TS_classificateur_de_nom_C:
assistent2 diskmaskin1 element3 freestyle1 frys1 giljotin1 hushållsassistent1 kaffebryggare1 kamera1 kassettradio1 kopiator1 kpist1 kulspruta1 kylskåp1 lampa1 ljudradio1 mixer1 persondator1 skrivmaskin1 torktumlare1

WVSF_TEMPLATE_Instrument_PROT
TSVP_INSTRUMENT_TS_classificateur_de_nom_C:
borr1 durkslag1 dyrk1 gaffel1 hyvel1 kaffesked1 kamin1 kastspö1 kikare1 klocka1 kniv1 kompass1 kratta1 kälke1 matsked1 miniräknare1 pistol1 sabel1 sked1 skruvmejsel1 spritkök1 staffli1 såg1 tesked1 timer1 ur1 visselpipa1

A search for items that belong to the template Instrument but are restricted to the domain Music, with the semantic class Apparatus or Musical_Instrument, results in the following lists:

WVSF_TEMPLATE_Instrument_PROT
TSVP_MUSIC_TS_domaine_D
TSVP_APPARATUS_TS_classificateur_de_nom_C:
grammofon1 jukebox1 kassettdäck1

WVSF_TEMPLATE_Instrument_PROT
TSVP_MUSIC_TS_domaine_D
TSVP_MUSICAL_INSTRUMENT_TS_classificateur_de_nom_C:
fiol1 flöjt1 gitarr1 piano1 trombon1 trumpet1 tuba1 violin1 violoncell1 xylofon1
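The kind of clustering shown in these lists can be approximated with a simple grouping step. The following Python sketch is an assumption about how such a view could be computed, not a description of the actual lexicon tools: the tuple layout and the cluster function are hypothetical, only a handful of Usems from the lists are included, and the label "General" stands in for Usems that carry no explicit domain feature.

```python
from collections import defaultdict

# (usem, template, domain, semantic class): a small sample from the lists.
# "General" is used here as the label for Usems without a domain feature.
USEMS = [
    ("kamera1", "Instrument", "General", "Apparatus"),
    ("kniv1", "Instrument", "General", "Instrument"),
    ("kompass1", "Instrument", "General", "Instrument"),
    ("grammofon1", "Instrument", "Music", "Apparatus"),
    ("fiol1", "Instrument", "Music", "Musical_Instrument"),
    ("piano1", "Instrument", "Music", "Musical_Instrument"),
]


def cluster(usems, template, domain=None):
    """Group the Usems of one template by (domain, semantic class).

    If a domain is given, only Usems of that domain are kept.
    """
    groups = defaultdict(list)
    for usem, tmpl, dom, sem_class in usems:
        if tmpl == template and (domain is None or dom == domain):
            groups[(dom, sem_class)].append(usem)
    return {key: sorted(members) for key, members in groups.items()}


for key, members in cluster(USEMS, "Instrument", domain="Music").items():
    print(key, " ".join(members))
```

Calling cluster(USEMS, "Instrument") without a domain would return all four groups of the sample, mirroring the two searches described in the text.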

2.1 Nouns

The encoding of semantic information according to the SGML DTD is illustrated below for the noun bild ('picture'). From the Usyn part of the excerpt, it follows that one syntactic description, DNNCOMPLFAVAX, is linked to two Usems, namely USEMbild1_n319 and USEMbild2_n320. These two senses are assigned different clusters of semantic features, namely:

trait_sem_valpond_l="WVSF_TEMPLATE_Artwork_PROT
TSVP_ARTS_TS_domaine_D TSVP_ARTIFACT_TS_classificateur_de_nom_C"

trait_sem_valpond_l="WVSF_TEMPLATE_Cognitive_fact_PROT
TSVP_COGNITIVE_FACT_TS_classificateur_de_nom_C"

It is worth observing that the two senses have different definitions and examples but share the same syntactic description. One of the senses is equivalent to a base concept (BC), which is noted in the comment field (commentaire="BC:455\02985557").

bild

<Um_S
    id="NO1710_1867"
    catgram="NOUN"
    sscatgram="COMMON"
    autonomie="YES"
    usyn_l="US1710_1867 US1710_1867_1 US1710_X">
    <Umg
        nieme="0"
        appellation="bild"
        mf="NO_ENER1">
        <Lib>bild</Lib>
        <Radg
            nieme="1">
            <Lib>bild</Lib></Radg></Umg></Um_S>

<Usyn
    id="US1710_1867_1"
    description="DNNCOMPLFAVAX">
    <Corresp_Usyn_Usem
        usem_cible="USEMbild1_n319">
    <Corresp_Usyn_Usem
        usem_cible="USEMbild2_n320">
</Usyn>

<Usem
    id="USEMbild1_n319"
    appellation="bild"
    exemple="en medalj med konungens &; på &en ser man gränden strax till höger om domkyrkan; text och &er är dåligt samordnade; hans &er har utvecklats i romantisk riktning; i det moderna samhället översköljs vi av &er"
    commentaire="BC:455\02985557"
    definition_libre="GLDB:bild1/1:(plant) föremål som för synsinnet återger en del av verkligheten::el. ngt som kunde vara verkligt; om teckning, målning, fotografi m. m."
    trait_sem_valpond_l="WVSF_TEMPLATE_Artwork_PROT TSVP_ARTS_TS_domaine_D TSVP_ARTIFACT_TS_classificateur_de_nom_C">
</Usem>

<Usem
    id="USEMbild2_n320"
    appellation="bild"
    exemple="rapporten ger en god & av det ekonomiska läget"
    commentaire="BC:ZZ"
    definition_libre="GLDB:bild1/2:föreställning som ger en helhetsuppfattning (av ngt)::vanl. av ngn komplicerad företeelse med många ingående delar"
    trait_sem_valpond_l="WVSF_TEMPLATE_Cognitive_fact_PROT TSVP_COGNITIVE_FACT_TS_classificateur_de_nom_C">
</Usem>
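The value of trait_sem_valpond_l packs the template, domain and semantic class into one space-separated string. As a rough illustration, the Python sketch below splits such a value into its parts. The token patterns are inferred solely from the examples in this document and may not cover every value allowed by the DTD; the parse_features function and the dictionary keys are hypothetical names.

```python
import re

# Token shapes inferred from the examples in this document (an assumption,
# not taken from the DTD itself).
PATTERNS = {
    "template": re.compile(r"^WVSF_TEMPLATE_(.+)_PROT$"),
    "domain": re.compile(r"^TSVP_(.+)_TS_domaine_D$"),
    "sem_class": re.compile(r"^TSVP_(.+)_TS_classificateur_de_(?:nom|verbe)_C$"),
}


def parse_features(value):
    """Split a trait_sem_valpond_l value into template, domain and class.

    Features that are absent from the value are reported as None.
    """
    features = {"template": None, "domain": None, "sem_class": None}
    for token in value.split():
        for name, pattern in PATTERNS.items():
            match = pattern.match(token)
            if match:
                features[name] = match.group(1)
                break
    return features


bild1 = parse_features(
    "WVSF_TEMPLATE_Artwork_PROT TSVP_ARTS_TS_domaine_D "
    "TSVP_ARTIFACT_TS_classificateur_de_nom_C"
)
bild2 = parse_features(
    "WVSF_TEMPLATE_Cognitive_fact_PROT "
    "TSVP_COGNITIVE_FACT_TS_classificateur_de_nom_C"
)
print(bild1)
print(bild2)
```

For the two senses of bild, the sketch yields different templates and semantic classes, and no domain for the second sense, which matches the feature clusters quoted in the text.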

The encoding of deverbal nouns differs somewhat from the encoding above. It is more extensive, because it also includes information on the predicative representation: the predicate, its argument structure, and the type of relation holding between the deverbal noun and the predicate (the valued attributes Master, type_of_link, affected_arg). The objects and attributes used to describe deverbal nouns are thus shared with those used to describe events.

(For more examples, see the Swedish PAROLE-SIMPLE sample lexicon.)

2.2 Verbs

The guidelines for the representation of events elaborated by the Pisa Specification Group (1999), the set of templates with structured semantic information, and the DTD together specify how the semantic data on verbs and deverbal nouns is represented in the SIMPLE project. The assignment of semantic information to verb Usems in the Swedish SIMPLE lexicon focuses primarily on encoding the following mandatory data:

template type, specified by the SIMPLE ontology and the events templates
domain, LexiQuest/ERLI's list of domains
semantic class, based on LexiQuest semantic classes for verbs
predicative representation, including information on predicate, its arguments, selectional restrictions

The example below illustrates these types of information in the SGML representation of the verb avskeda ('to dismiss').

avskedar

<Um_S
    id="VB935_1006"
    catgram="VERB"
    autonomie="YES"
    usyn_l="US935_1006">
    <Umg
        nieme="0"
        appellation="avskedar"
        mf="VB_711">
        <Lib>avskedar</Lib>
        <Radg
            nieme="1">
            <Lib>avskeda</Lib></Radg></Umg></Um_S>

<Usyn
    id="US935_1006"
    description="D01P11DO">
    <Corresp_Usyn_Usem
        usem_cible="USEMavskedar1"
        correspondance="ISObivalent">
</Usyn>

<Usem
    id="USEMavskedar1"
    appellation="avskedar"
    exemple="en arbetare avskedades för ordervägran"
    commentaire="BC:ZZ"
    definition_libre="GLDB:säga upp från tjänst"
    trait_sem_valpond_l="WVSF_TEMPLATE_Cause_constitutive_change_PROT TSVP_CHANGE_TS_classificateur_de_verbe_C">
    <RepresentationPredicative
        type_de_lien="Master"
        arg_inclus="sans_i"
        predicat="PRED_avskedar1">
</Usem>

<Predicat
    id="PRED_avskedar1"
    appellation="avskedar"
    exemple=""
    type="LEXICAL"
    pivot="NO"
    argument_l="Arg0PRED_avskedar1 Arg1PRED_avskedar1">

<Argument
    id="Arg0PRED_avskedar1"
    informe_arg_decrit_l="IA_ArgHuman">

<Argument
    id="Arg1PRED_avskedar1"
    informe_arg_decrit_l="IA_ArgHuman">

<InformeArg
    id="IA_ArgHuman"
    statut="CHECK"
    trait_sem_valpond_l="WVSF_TEMPLATE_Human_PROT">

<Correspondance
    id="ISObivalent"
    corresp_arg_pos_l="ARG0P0 ARG1P1">

The SGML representation above captures the following semantic features assigned to the verb avskeda:

Template type:	Cause_constitutive_change
Domain:	general (default value)
Semantic class:	Change
Predicate type:	lexical
Multilingual:	No
Argument list:	Arg0 Arg1
Selectional restrictions on arguments:	Human
Obligatory status of arguments:	Check
Correspondence between syntactic and semantic arguments:	isomorphic bivalent
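The way the summary above follows from the identifier references in the SGML excerpt can be sketched in Python. The dictionaries below are a hand-made transcription of the avskedar example, not the output of an SGML parser, and the two functions are hypothetical. Reading ARG0P0 as "semantic argument 0 corresponds to syntactic position 0" is an interpretation of the example, not something stated in the DTD.

```python
import re

# Hand-transcribed from the avskedar excerpt; keys are element ids.
PREDICATE = {
    "id": "PRED_avskedar1",
    "argument_l": "Arg0PRED_avskedar1 Arg1PRED_avskedar1",
}
ARGUMENTS = {
    "Arg0PRED_avskedar1": {"informe_arg_decrit_l": "IA_ArgHuman"},
    "Arg1PRED_avskedar1": {"informe_arg_decrit_l": "IA_ArgHuman"},
}
INFORME_ARGS = {
    "IA_ArgHuman": {
        "statut": "CHECK",
        "trait_sem_valpond_l": "WVSF_TEMPLATE_Human_PROT",
    },
}
CORRESPONDANCE = {"id": "ISObivalent", "corresp_arg_pos_l": "ARG0P0 ARG1P1"}


def selectional_restrictions(predicate):
    """Follow Predicat -> Argument -> InformeArg and collect the features."""
    result = {}
    for arg_id in predicate["argument_l"].split():
        informe_ids = ARGUMENTS[arg_id]["informe_arg_decrit_l"].split()
        result[arg_id] = [
            INFORME_ARGS[ia]["trait_sem_valpond_l"] for ia in informe_ids
        ]
    return result


def argument_positions(correspondance):
    """Split tokens such as 'ARG0P0' into an argument/position pair."""
    pairs = {}
    for token in correspondance["corresp_arg_pos_l"].split():
        match = re.fullmatch(r"(ARG\d+)(P\d+)", token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        pairs[match.group(1)] = match.group(2)
    return pairs


print(selectional_restrictions(PREDICATE))
print(argument_positions(CORRESPONDANCE))
```

Both arguments resolve to the same InformeArg element, which is why the summary lists a single selectional restriction (Human) and a single status (Check) for the two arguments.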

More SGML-encoded verbs can be found in the Swedish PAROLE-SIMPLE sample lexicon.

3. Conclusion

A lexical resource that specifies semantic information is a prerequisite for all content-oriented NLP. Applications such as machine translation, parsing and information retrieval are only some examples of applications that require a considerable amount of semantic information bound to lexical items in order to support tasks such as the assignment of translation equivalents, text disambiguation or information retrieval. The semantic information encoded in the SIMPLE project provides links to the morphological and syntactic information in the Swedish PAROLE lexicon. The integration of the information in the three modules turns the PAROLE-SIMPLE lexicons into a robust NLP resource.

4. References

Pisa Specification Group, 1999. Guidelines for the Representation of Events.

Pustejovsky, J., 1995. The Generative Lexicon. Cambridge, MA: The MIT Press.

Footnotes

SIMPLE, Semantic Information for Multifunctional Plurilingual Lexica, is an EC-funded project with a focus on the encoding of semantic data relevant for NLP tasks.

PAROLE, Preparatory Action for Linguistic Resources Organization for Language Engineering, is an EC-funded project aiming at generating LE resources.