The aim of this project is to make existing FrameNet (FN) resources computationally accessible for multilingual natural language generation and controlled semantic parsing via a shared semantico-syntactic grammar and lexicon API.
We provide a currently bilingual but potentially multilingual FN-based grammar and lexicon library, implemented in Grammatical Framework (GF) on top of the GF Resource Grammar Library (RGL). The API of the FN-based library represents a shared set of semantico-syntactic verb valence patterns, automatically extracted from 66,918 annotated sentences in Berkeley FrameNet (BFN 1.5) and 4,267 sentences in Swedish FrameNet (SweFN, a snapshot taken on December 3, 2014). The concise set of 869 patterns covers 483 shared frames (using BFN frames as an interlingua) and 77.5% of the sentences that evoke the shared frames in both BFN and SweFN (44,645 and 2,596 sentences, respectively).
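To make the notion of a shared valence pattern concrete, the following minimal sketch treats each annotated sentence as a frame plus a sequence of (frame element, syntactic type) pairs, takes the patterns attested in both languages as the shared set, and measures how many sentences that set covers. The toy sentences, the syntactic type labels, and the helper name `coverage` are hypothetical and chosen only for illustration. Reading "shared" as plain set intersection is an assumption made here; the extraction heuristics actually used in the project may differ.

```python
# Hypothetical toy data: each annotated sentence is reduced to
# (frame, valence pattern), where the pattern is a tuple of
# (frame element, syntactic type) pairs.
bfn_sentences = [
    ("Desiring", (("Experiencer", "NP"), ("Event", "VP"))),
    ("Desiring", (("Experiencer", "NP"), ("Focal_participant", "NP"))),
    ("Desiring", (("Experiencer", "NP"), ("Event", "VP"))),
    ("Motion", (("Theme", "NP"), ("Goal", "PP"))),
]
swefn_sentences = [
    ("Desiring", (("Experiencer", "NP"), ("Event", "VP"))),
    ("Motion", (("Theme", "NP"), ("Source", "PP"))),
]

# Assumption: a pattern is "shared" if it is attested in both corpora.
shared_patterns = set(bfn_sentences) & set(swefn_sentences)


def coverage(sentences, patterns):
    """Fraction of sentences whose (frame, pattern) is in the given set."""
    covered = sum(1 for sentence in sentences if sentence in patterns)
    return covered / len(sentences)


print("shared patterns:", len(shared_patterns))
print("BFN coverage:", coverage(bfn_sentences, shared_patterns))
print("SweFN coverage:", coverage(swefn_sentences, shared_patterns))
```

In this toy setting, a single pattern for the frame Desiring is attested in both corpora, so only the sentences that instantiate it count as covered.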
Based on the FN-annotated sentences covered by the shared valence patterns, and the GF RGL type system for verbs, we have extracted 3,432 lexical entries (subcategorized lexical units, LUs) from BFN and 1,899 entries from SweFN. LUs are not directly aligned between BFN and SweFN, so a separate lexicon is generated for each language. However, a partial shared lexicon has been automatically derived on top of the language-specific lexicons, currently providing a mapping between 703 LUs in BFN and 900 LUs in SweFN. Of the above-mentioned sentences represented by the shared valence patterns, the shared lexicon covers 25.1% (11,223) of the BFN sentences and 35.8% (928) of the SweFN sentences.
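The shared-lexicon percentages are taken relative to the sentences already covered by the shared valence patterns (44,645 for BFN and 2,596 for SweFN), not relative to the full corpora. The short sketch below recomputes them from the counts quoted in the text; the variable and function names are illustrative only, and the results should match the reported figures up to rounding.

```python
# Sentence counts quoted in the text.
bfn_pattern_covered = 44_645    # BFN sentences covered by shared patterns
swefn_pattern_covered = 2_596   # SweFN sentences covered by shared patterns
bfn_lexicon_covered = 11_223    # ... of which covered by the shared lexicon
swefn_lexicon_covered = 928     # ... of which covered by the shared lexicon


def coverage_percent(covered: int, total: int) -> float:
    """Coverage as a percentage, rounded to one decimal place."""
    return round(100.0 * covered / total, 1)


bfn_pct = coverage_percent(bfn_lexicon_covered, bfn_pattern_covered)
swefn_pct = coverage_percent(swefn_lexicon_covered, swefn_pattern_covered)

print(f"BFN shared-lexicon coverage:   {bfn_pct}%")
print(f"SweFN shared-lexicon coverage: {swefn_pct}%")
```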
All numbers are indicative and subject to change if more corpus examples, translation equivalents, or improved heuristics are provided.
As a by-product, we propose a unified method for comparing and mapping semantic and syntactic valence patterns and lexical units across framenets. Thus, from the perspective of developers of FN-annotated corpora, this work can be seen as a tool that provides cross-lingual hints on how to improve coverage.
This work has been supported by the Swedish Research Council under Grant No. 2012-5746 (Reliable Multilingual Digital Communication: Methods and Applications) and by the Centre for Language Technology in Gothenburg. The research leading to these results has also received funding from the Latvian State Research Programme NexIT.