The Swedish SIMPLE Lexicon
A language engineering resource with access to semantic information in Swedish
elaborated by Språkdata, Göteborgs Universitet
Contact person Maria Toporowska Gronostaj
1. General design information
The Swedish SIMPLE lexicon is a language engineering resource under development at Språkdata, Göteborgs Universitet, with a focus on encoding semantic information in the lexicon. Its semantic data is linked to the morphological and syntactic data in the Swedish PAROLE lexicon, which results in an NLP resource covering the whole PAROLE-SIMPLE mandatory spectrum of lexical information. (See the document on the Swedish PAROLE lexicon.)
The Swedish lexicon is one of a suite of lexicons for 12 EU languages that are harmonized with respect to their content and encoding formalism. Its theoretical and formal design conforms to the guidelines presented by the Specification Group in the Linguistic Specifications SIMPLE (Lenci et al. 1998). The semantic data is encoded according to an SGML Document Type Definition (DTD).
The lexical data in the LE-PAROLE and SIMPLE lexicons draws to a great extent on lexical resources developed at Språkdata, Göteborgs Universitet. Machine-readable monolingual dictionaries such as Svenska ord and Nationalencyklopedins ordbok, as well as the lexical database Göteborgs Lexikaliska Databas (GLDB), have proved to be valuable sources of information on Swedish morphology, syntax and semantics. Some lexical material from GLDB, for example definitions and examples, has been reused. It is worth noting that sense discrimination in the Swedish SIMPLE lexicon follows, to a large extent, the sense distinctions made in GLDB.
The population of the Swedish SIMPLE lexicon is a subset of the morphological and syntactic units encoded in the Swedish PAROLE lexicon. At the core of the lexicon are the Swedish equivalents of the set of base concepts shared by all the SIMPLE partners. In the final phase of the project, the lexicon will comprise 10000 semantic units (Usems) distributed among nouns (7000 Usems), verbs (2000 Usems) and adjectives (1000 Usems). (By August 1, 1999, 3442 noun Usems and 325 verb Usems had been encoded in SGML in the lexicon.)
2. The semantic layer
A semantic unit (Usem) provides access to the semantic layer. It represents a meaning, that is, a word sense. Each semantic unit is assigned a semantic type plus other kinds of semantic information. For example, the semantic units hund ('dog') and katt ('cat') are assigned to the semantic type Animal. The semantic type names an ontological category from the SIMPLE ontology and predefines the other kinds of semantic information associated with that type, such as semantic class or the semantic relations invoked by Pustejovsky's qualia (Pustejovsky, 1995).
The following mandatory semantic information is encoded in the Swedish SIMPLE lexicon, whenever relevant:
Besides the above types of semantic information encoded on the semantic layer, there is also information on the links between morphological and syntactic units on the one hand and between syntactic and semantic units on the other. The relations between the units are either one-to-one or one-to-many, which is reflected in the statistics below:
Unit type | Total number |
---|---|
Morphological units | 2187 |
Syntactic units | 2795 |
Semantic units | 3767 |
(The statistics are based on the data available as of August 1, 1999.)
The material encoded so far seems to confirm the assumption that assigning different kinds of semantic information to a Usem supports sense discrimination. In many cases, a cluster of semantic information, including template type, domain and semantic class, is sufficient to disambiguate different readings, especially if it is linked to the information on the syntactic unit (Usyn) in the Swedish PAROLE lexicon. The example below illustrates the point:
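As a rough illustration of how these totals reflect one-to-many links, the ratios between adjacent layers can be computed directly. The sketch below uses only figures quoted in this document; the variable names are illustrative and not part of the lexicon, and the ratios are coarse averages that say nothing about how individual units are linked.

```python
# Totals quoted above (data as of August 1, 1999).
morphological_units = 2187
syntactic_units = 2795
semantic_units = 3767

# Noun and verb Usem counts quoted in Section 1; together they account for
# the total number of semantic units.
noun_usems = 3442
verb_usems = 325
assert noun_usems + verb_usems == semantic_units

# A ratio above 1.0 means that, on average, a unit on one layer is linked to
# more than one unit on the next layer, so one-to-many links must exist.
usyn_per_um = syntactic_units / morphological_units
usem_per_usyn = semantic_units / syntactic_units

print(f"Syntactic units per morphological unit: {usyn_per_um:.2f}")
print(f"Semantic units per syntactic unit: {usem_per_usyn:.2f}")
```

Both ratios come out above 1, which is consistent with a mix of one-to-one and one-to-many links between the layers.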
Usyn | Usem | Template | Domain | SemClass |
---|---|---|---|---|
DN0 | doktor1 | Profession | Medicine | Occupation_agent |
DN0 | doktor2 | Social_status | Higher_Education | Situ |
DNNCOMPLFIX | doktor2 | Social_status | Higher_Education | Situ |
DNNRESTRAPPA | doktor2 | Social_status | Higher_Education | Situ |
(Situ stands for situational or punctual properties of nouns referring to humans.)
The data encoded in the Swedish SIMPLE lexicon can be instantly clustered and viewed according to a number of different semantic and morphosyntactic criteria. For example, clustering Usems under the template Instrument, with semantic classes Apparatus and Instrument, within the General (unspecified) domain results in the following lists:
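The disambiguation idea in the doktor table can be sketched as a simple filter over the table rows. This is a minimal sketch, not the project's tooling: the `UsemEntry` record and the `candidate_senses` helper are hypothetical names introduced here, and only the row values come from the table above.

```python
from typing import NamedTuple


class UsemEntry(NamedTuple):
    """One row of the table above (hypothetical record type)."""
    usyn: str
    usem: str
    template: str
    domain: str
    sem_class: str


# Rows copied from the table above.
ENTRIES = [
    UsemEntry("DN0", "doktor1", "Profession", "Medicine", "Occupation_agent"),
    UsemEntry("DN0", "doktor2", "Social_status", "Higher_Education", "Situ"),
    UsemEntry("DNNCOMPLFIX", "doktor2", "Social_status", "Higher_Education", "Situ"),
    UsemEntry("DNNRESTRAPPA", "doktor2", "Social_status", "Higher_Education", "Situ"),
]


def candidate_senses(entries, usyn=None, domain=None):
    """Return the sorted Usem names compatible with the given clues."""
    return sorted({
        e.usem
        for e in entries
        if (usyn is None or e.usyn == usyn)
        and (domain is None or e.domain == domain)
    })


# The syntactic description DN0 alone leaves both readings open.
print(candidate_senses(ENTRIES, usyn="DN0"))
# A more specific syntactic description selects a single reading.
print(candidate_senses(ENTRIES, usyn="DNNCOMPLFIX"))
# Adding the domain resolves the DN0 case.
print(candidate_senses(ENTRIES, usyn="DN0", domain="Medicine"))
```

In this toy setting the syntactic description narrows the candidates, and the semantic cluster (here just the domain) settles the remaining ambiguity.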
WVSF_TEMPLATE_Instrument_PROT
TSVP_APPARATUS_TS_classificateur_de_nom_C:
assistent2 diskmaskin1 element3 freestyle1 frys1 giljotin1 hushållsassistent1 kaffebryggare1 kamera1 kassettradio1 kopiator1 kpist1 kulspruta1 kylskåp1 lampa1 ljudradio1 mixer1 persondator1 skrivmaskin1 torktumlare1
WVSF_TEMPLATE_Instrument_PROT
TSVP_INSTRUMENT_TS_classificateur_de_nom_C:
borr1 durkslag1 dyrk1 gaffel1 hyvel1 kaffesked1 kamin1 kastspö1 kikare1 klocka1 kniv1 kompass1 kratta1 kälke1 matsked1 miniräknare1 pistol1 sabel1 sked1 skruvmejsel1 spritkök1 staffli1 såg1 tesked1 timer1 ur1 visselpipa1
The search for items that sort under the template Instrument but are restricted to the domain Music, with semantic classes Apparatus and Musical_Instrument, results in the following lists:
WVSF_TEMPLATE_Instrument_PROT
TSVP_MUSIC_TS_domaine_D
TSVP_APPARATUS_TS_classificateur_de_nom_C:
grammofon1 jukebox1 kassettdäck1
WVSF_TEMPLATE_Instrument_PROT
TSVP_MUSIC_TS_domaine_D
TSVP_MUSICAL_INSTRUMENT_TS_classificateur_de_nom_C:
fiol1 flöjt1 gitarr1 piano1 trombon1 trumpet1 tuba1 violin1 violoncell1 xylofon1
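Clustering of this kind amounts to grouping Usems by a key built from template, domain and semantic class. The following is a minimal sketch under that assumption; the tuple layout and the `cluster` function are hypothetical, and only a handful of Usems from the lists above are included.

```python
from collections import defaultdict

# A few Usems from the lists above, as (usem, template, domain, semantic class).
# "General" stands for the unspecified default domain.
USEMS = [
    ("kamera1", "Instrument", "General", "Apparatus"),
    ("kylskåp1", "Instrument", "General", "Apparatus"),
    ("kniv1", "Instrument", "General", "Instrument"),
    ("kompass1", "Instrument", "General", "Instrument"),
    ("grammofon1", "Instrument", "Music", "Apparatus"),
    ("fiol1", "Instrument", "Music", "Musical_Instrument"),
    ("piano1", "Instrument", "Music", "Musical_Instrument"),
]


def cluster(usems):
    """Group Usem names by their (template, domain, semantic class) key."""
    groups = defaultdict(list)
    for usem, template, domain, sem_class in usems:
        groups[(template, domain, sem_class)].append(usem)
    return {key: sorted(members) for key, members in groups.items()}


clusters = cluster(USEMS)
for key, members in sorted(clusters.items()):
    print(key, members)
```

Restricting a search to one domain or semantic class then corresponds to selecting the matching keys from the resulting dictionary.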
2.1 Nouns
How the semantic information is encoded according to the SGML DTD is illustrated below for the noun bild ('picture'). From the Usyn part of the excerpt below, it follows that one syntactic description, DNNCOMPLFAVAX, is linked to two Usems, namely USEMbild1_n319 and USEMbild2_n320. These two senses are assigned different clusters of semantic features, namely:
trait_sem_valpond_l="WVSF_TEMPLATE_Artwork_PROT
TSVP_ARTS_TS_domaine_D TSVP_ARTIFACT_TS_classificateur_de_nom_C
trait_sem_valpond_l="WVSF_TEMPLATE_Cognitive_fact_PROT
TSVP_COGNITIVE_FACT_TS_classificateur_de_nom_C
Note that these two senses have different definitions and examples but share the same syntactic description. One of the senses is equivalent to a base concept (BC), which is noted in the comment field (commentaire="BC:455\02985557").
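The feature codes in these attribute values appear to follow a regular naming pattern, so they can be split into template, domain and semantic class. The sketch below assumes that pattern holds in general, which is inferred from the examples in this document and not taken from the DTD; the `parse_features` function and the dictionary keys are hypothetical.

```python
import re

# Patterns inferred from the codes shown in this document (an assumption,
# not a specification taken from the SIMPLE DTD).
PATTERNS = {
    "template": re.compile(r"WVSF_TEMPLATE_(.+)_PROT"),
    "domain": re.compile(r"TSVP_(.+)_TS_domaine_D"),
    "sem_class": re.compile(r"TSVP_(.+)_TS_classificateur_de_nom_C"),
}


def parse_features(value):
    """Split a whitespace-separated feature string into named features."""
    features = {}
    for token in value.split():
        for name, pattern in PATTERNS.items():
            match = pattern.fullmatch(token)
            if match:
                features[name] = match.group(1)
    return features


artwork_sense = parse_features(
    "WVSF_TEMPLATE_Artwork_PROT "
    "TSVP_ARTS_TS_domaine_D TSVP_ARTIFACT_TS_classificateur_de_nom_C"
)
cognitive_sense = parse_features(
    "WVSF_TEMPLATE_Cognitive_fact_PROT "
    "TSVP_COGNITIVE_FACT_TS_classificateur_de_nom_C"
)

print(artwork_sense)
print(cognitive_sense)
```

The second cluster carries no domain code, so the parsed result simply has no `domain` key; the sketch does not fill in a default.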
bild
[SGML excerpt truncated in the source]
The encoding of deverbal nouns differs somewhat from the above encoding. It is more extensive, because it also includes information on the predicative representation: the predicate, its argument structure, and the types of relation holding between the deverbal noun and the predicate (the valued attributes Master, type_of_link, affected_arg). Thus the objects and attributes used to describe deverbal nouns are shared with those used to describe events.
(For more examples, see the Swedish PAROLE-SIMPLE sample lexicon.)
2.2 Verbs
The guidelines for the representation of events, elaborated by the Pisa Specification Group (1999), together with the set of templates with structured semantic information and the DTD, specify how to represent the semantic data on verbs and deverbal nouns in the SIMPLE project. The assignment of semantic information to verb Usems in the Swedish SIMPLE lexicon focuses primarily on encoding the following mandatory data:
The example below illustrates the above types of information in the SGML representation of the verb avskeda ('to dismiss').
avskedar
[SGML excerpt truncated in the source]
The above SGML representation captures the following semantic features assigned to the verb avskeda:
Feature | Value |
---|---|
Template type | Cause_constitutive_change |
Domain | general (default value) |
Semantic class | Change |
Predicates type | lexical |
Multilingual | No |
Argument_list | Arg0 Arg1 |
Arguments selectional restrictions | Human |
Arguments obligatory status | Check |
Correspondence between syntactic and semantic arguments | isomorphic bivalent |
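For readers who want to work with such feature sets programmatically, the values listed for avskeda can be held in a small record. This is only a sketch: the `VerbUsem` class and its field names are hypothetical and do not mirror the SGML element names, and the obligatory-status feature is left out because its value is not spelled out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbUsem:
    """Hypothetical record for the verb features listed above."""
    lemma: str
    template: str
    domain: str
    sem_class: str
    predicate_type: str
    multilingual: bool
    argument_list: tuple
    selectional_restriction: str
    correspondence: str

    @property
    def valency(self) -> int:
        """Number of semantic arguments in the predicate."""
        return len(self.argument_list)


# Values copied from the feature list for avskeda.
avskeda = VerbUsem(
    lemma="avskeda",
    template="Cause_constitutive_change",
    domain="general",
    sem_class="Change",
    predicate_type="lexical",
    multilingual=False,
    argument_list=("Arg0", "Arg1"),
    selectional_restriction="Human",
    correspondence="isomorphic bivalent",
)

print(avskeda.valency)
```

Under the assumed reading that "bivalent" refers to two arguments, the two-element argument list agrees with the stated correspondence.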
More SGML-encoded verbs can be found in the Swedish PAROLE-SIMPLE sample lexicon.
3. Conclusion
A lexical resource that aims at the specification of semantic information is a prerequisite for all content-oriented NLP. Applications such as machine translation, parsing and information retrieval are but some examples that require a considerable amount of semantic information bound to lexical items in order to support tasks such as the assignment of translation equivalents and text disambiguation. The semantic information encoded in the SIMPLE project provides links to the morphological and syntactic information in the Swedish PAROLE lexicon. The integration of the information in the three modules turns the PAROLE-SIMPLE lexicons into a robust NLP resource.
4. References
Pisa Specification Group, 1999, Guidelines for the Representation of Events.
Footnotes
SIMPLE (Semantic Information for Multifunctional Plurilingual Lexica) is an EC-funded project with a focus on the encoding of semantic data relevant for NLP tasks.
PAROLE (Preparatory Action for Linguistic Resources Organization for Language Engineering) is an EC-funded project aiming at generating LE resources.