Project within VR “Rambidrag” (framework grant): “Det digitaliserade samhället - igår, idag, imorgon” (The digitalized society - yesterday, today, tomorrow), 2013–2017
Aarne Ranta (PI), Gerardo Schneider, Koen Claessen
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
The project will develop technology for multilingual digital communication. The technology will enable services that citizens can use in their own languages, as well as the dissemination of up-to-date information in multiple languages. This will be done in a reliable, precise way, so that users can trust the information they get.
Let us consider a possible use case: a service in which two participants can prepare a rental contract for a house. The house might be owned by an Italian person, located in Germany, and rented by a Swede. Three languages would thereby be involved, to make sure that the owner and the tenant understand each other accurately and that the contract complies with German regulations. It should be customized to the details relevant to the house and also to the wishes of the owner and the tenant. Ideally, all of the parties should be able to pose questions such as “can the contract be transferred to a third person?” and get answers via an inference engine, without reading the whole contract, let alone involving an expert lawyer.
Today, such services are scarce because they require manual work. Translations into different languages must be made manually, because automatic tools such as Google Translate are not reliable enough for such precision tasks. The inference required for question answering is equally manual, because the information contained in contracts is not formal enough to be reasoned about mechanically; methods like string-based search are not accurate enough. The aim of this project is to solve both problems with a largely common solution.
The solution is to use controlled natural language (CNL), a subset of natural language with a formally specifiable structure. Our use of CNL is inspired by compiler technology, where abstract syntax is the formal structure underlying programming languages. When compilers analyse programs and reason about them, they work on the abstract syntax. This idea is adapted to natural languages in the Grammatical Framework (GF) (Ranta 2011), which moreover allows the mapping between an abstract syntax and multiple simultaneous languages. For the use case above, the following workflows are possible using GF and abstract syntax trees (ASTs):
Static translation of contracts:
contract in German/Italian/Swedish/...
→ contract AST
→ contract in German&Italian&Swedish&...
Corrections and updates of the contract:
changes in German/Italian/Swedish/...
→ changes in AST
→ changes in German&Italian&Swedish&...
Queries about the contract:
question in German/Italian/Swedish/...
→ question in AST
→ answer in AST
→ answer in German/Italian/Swedish/...
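To make the shared-AST idea behind these workflows concrete, here is a minimal sketch in Python (a hypothetical toy fragment; a real GF grammar is far richer): one abstract clause type and one linearization table per language, so that translation amounts to parsing into the AST and linearizing out again.

```python
# Hypothetical toy fragment: one AST type, one linearization table per
# language; translation = parse to AST, then linearize out again.
from dataclasses import dataclass

@dataclass(frozen=True)
class MayTransfer:
    party: str  # "Owner" or "Tenant"

LINEARIZATIONS = {
    "eng": {"Owner": "the owner may transfer the contract",
            "Tenant": "the tenant may transfer the contract"},
    "ger": {"Owner": "der Vermieter darf den Vertrag übertragen",
            "Tenant": "der Mieter darf den Vertrag übertragen"},
    "swe": {"Owner": "ägaren får överlåta kontraktet",
            "Tenant": "hyresgästen får överlåta kontraktet"},
}

def linearize(lang, clause):
    return LINEARIZATIONS[lang][clause.party]

def translate_all(clause):
    # Static translation: render one AST in all target languages at once.
    return {lang: linearize(lang, clause) for lang in LINEARIZATIONS}

def parse(lang, text):
    # Parsing = inverting the linearization table.
    for party, s in LINEARIZATIONS[lang].items():
        if s == text:
            return MayTransfer(party)
    return None
```

The same mechanism supports the update workflow: editing the text in any language is parsing the change into the AST and re-linearizing all languages.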
In addition to grammars, the project uses formal methods, such as automated reasoning and software verification. Reasoning is used for question answering, but also for consistency checking of documents. In this project, software verification is applied to a new field: computational grammars, which are very complex programs, often created in collaborative and distributed ways. Their quality is crucial for the reliability of the overall system.
An example of how formal methods can be applied in this context is ambiguity analysis of computational grammars. In the contract example above, it is crucial that all parties understand each other unambiguously, i.e. that the text of each contract can be understood in only one way. It is a fact of life that the use of natural language leaves open different ways of interpreting the same text. It is therefore important to analyze the grammars involved, first to become aware of ambiguities, and then to eliminate them from formal documents such as contracts.
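As an illustration of mechanical ambiguity detection (a hypothetical toy grammar, not the project's actual tools), one can enumerate all parse trees of a small grammar and flag every string that is the linearization of more than one distinct tree:

```python
# Toy modifier grammar: a phrase is a word or one phrase modifying
# another; linearization flattens the tree to a string of words.
from collections import defaultdict

def trees(words):
    """All binary bracketings (parse trees) over a sequence of words."""
    if len(words) == 1:
        return [words[0]]
    result = []
    for k in range(1, len(words)):
        for left in trees(words[:k]):
            for right in trees(words[k:]):
                result.append((left, right))
    return result

def linearize(tree):
    if isinstance(tree, str):
        return tree
    left, right = tree
    return linearize(left) + " " + linearize(right)

def ambiguous(words):
    """Strings produced by more than one tree, with all their readings."""
    readings = defaultdict(list)
    for t in trees(words):
        readings[linearize(t)].append(t)
    return {s: ts for s, ts in readings.items() if len(ts) > 1}
```

For a three-word phrase like "pretty little girls", the two bracketings linearize to the same string, so the string is reported as ambiguous together with both readings.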
The world-wide web is highly multilingual. For instance, Wikipedia is available in over 300 languages, maintained by volunteers. In Sweden, authorities try to follow their obligation to publish information in the official minority languages. In both cases, however, the information is far from being in synchrony. In Wikipedia, a majority of the articles are only available in English; the translations that exist are often shorter than the originals or otherwise differ from them. In Sweden, usually only the information in Swedish is up to date. The other languages, if available at all, are often produced in unprofessional ways (for instance by employees who happen to speak the targeted languages) or even by Google Translate. The report by Funka Nu and the Swedish Language Council (Funka Nu 2011) gives a survey of, and recommendations on, the authorities' usage of Google Translate.
Now, Google Translate is a tool aimed at the consumers of information. They use it at their own risk, and no-one is responsible for the translations - neither the original author nor Google. A common error in Google Translate and other statistical translation systems is alignment errors. For instance, the French prix 99 euros can become pris 99 kronor in Swedish, because euros and kronor are aligned due to their frequent co-occurrence in parallel texts. If a consumer reads a French e-commerce site via Google Translate and sees this offer, she cannot claim the product for this extraordinary price. But if the offer has been published by the e-commerce site itself, then the site is obliged to honour it.
What is needed are automatic translation tools for producers. These tools should be quickly adaptable to frequently changing information and render it accurately in the targeted languages. What makes this possible is that the producers know what they need to say: for instance, that they only need to publish e-commerce offers or rental contracts. They can therefore use translation tools that work on limited domains, which can be made reliable. Consumers' translation tools, in contrast, must work on an open domain. This means that they must be able to cope with any document thrown at them - but their users are happy with browsing quality.
Multilingual GF grammars have been established as a reliable and efficient tool for limited-domain translation, for instance in the European MOLTO project (Multilingual On-Line Translation, FP7-ICT-247914, 2010–2013). The focus of MOLTO has been on making it easy to produce translation systems for new domains and languages, via software tools and libraries. The methods have been tested on several domains: tourist phrasebooks, mathematical teaching material, museum object descriptions, and pharmaceutical patents. The systems have covered up to 17 simultaneous languages. 25 languages are currently included in the GF Resource Grammar Library, which makes the complex linguistic rules (morphology, syntax) available to application programmers, who can thereby concentrate on the semantics and the abstract syntax. The current proposal builds partly on MOLTO's results, applying them to new areas and new kinds of users. The main scientific innovation is the introduction of formal methods into the loop, which at the same time permits scaling up the applications in a reliable way, by semi-automatic means using statistics and machine learning.
In MOLTO, some ingredients of formal methods have already been involved, although they fall outside the project's main mission, which is translation. As one example, formal reasoning is involved in the form of SPARQL queries, which enable information retrieval from documents whose abstract syntax is aligned with an ontology (Dannélls et al. 2012). As another example, grammar testing has been developed as an adaptation of the QuickCheck tool (Claessen and Hughes 2000) to GF grammars. Automated testing of grammars is becoming more and more important as the grammars grow larger and come from heterogeneous sources - not only from programmers with varying skills, but also from automatic grammar extraction from ontologies and statistical translation models (Détrez et al. 2012).
The two main cases for grammar testing identified so far are ambiguity testing and adequacy testing. Testing for ambiguities is essential since the CNL described by the grammar is an interface to a logical system that will be used for formal verification, and in this context language ambiguities could compromise the functionality of the system. Most CNLs used for reasoning, such as Attempto (Fuchs et al. 1999), have mechanisms to disambiguate natural language constructions, so that the meaning is always unique. In our case, the situation is more complicated due to the multilingual context, since we also need to resolve cases of ambiguous translation between natural languages within the CNL (Angelov and Ranta 2010).
Adequacy testing entails checking that the semantics defined by the abstract syntax of the grammar is preserved in every language for which we write a concrete syntax. This also ensures meaning-preserving translation between any pair of languages in the CNL. A prototype of this method was used in the MOLTO Phrasebook (Détrez et al. 2012), a tourist phrasebook grammar available in 15 languages, where each newly added language was tested by comparing its constructions with their equivalents in English and in the abstract syntax.
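The round-trip idea behind adequacy testing can be sketched as follows (hypothetical two-language lexicon, standing in for the real Phrasebook grammars): for every phrase and every language, parsing back the linearization must recover the original abstract phrase, which in turn yields meaning-preserving translation between any pair of languages.

```python
# Hypothetical toy phrasebook: one concrete string per language for
# each abstract phrase.
LEXICON = {
    "Hello":    {"eng": "hello",     "swe": "hej"},
    "ThankYou": {"eng": "thank you", "swe": "tack"},
}

def linearize(lang, phrase):
    return LEXICON[phrase][lang]

def parse(lang, text):
    for phrase, by_lang in LEXICON.items():
        if by_lang[lang] == text:
            return phrase
    return None

def adequate():
    """Adequacy: linearize-then-parse is the identity in every language."""
    return all(parse(lang, linearize(lang, p)) == p
               for p in LEXICON
               for lang in ("eng", "swe"))

def translate(src, dst, text):
    """Meaning-preserving translation via the abstract syntax."""
    phrase = parse(src, text)
    return None if phrase is None else linearize(dst, phrase)
```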
The project is divided into four work packages:
Building on the translation methodology developed in the European MOLTO project, we will create a set of information producer tools. These tools will enable companies, authorities, and other organizations to customize their translation systems. The goal is to enable system building with the same effort as it takes to manually translate a set of static web pages. The effort will instead be spent on building GF grammars, which then enable automatic translation of any new web pages. The special technique involved here is example-based grammar writing, which enables bootstrapping of GF grammars from translation examples (Détrez et al. 2012). The examples can be produced manually by translators without training in GF, but we will also investigate the extraction of examples from statistical translation models.
The main asset for productive grammar writing is the GF Resource Grammar Library (Bringert et al. 2012), which we will develop further, concentrating in particular on the major immigrant languages of Sweden and on the official languages of the European Union. The library has traditionally been developed through volunteer work in an open-source community; over its 11 years, it has received contributions from over 40 programmers around the world. But this work needs continuous coordination and quality control. The formal methods work package will be applied to support this work as well.
Grammars are objects with a very intricate structure, comparable in complexity to software systems. Thus the occurrence of bugs in grammars is just as likely as their occurrence in software. We plan to use successful methods from software verification, namely automated formal verification and software testing techniques, to reason about properties of grammars and detect bugs.
One strand of work here is to create methods for systematically looking for errors in grammars. By an error we mean that the grammar can produce sentences which are not grammatically correct, for example with the wrong word order or the wrong inflection of a verb. Another strand of work is to analyze grammars for properties that are important to know when translating formal documents, for example: which sentences can be interpreted in more than one way?
Although grammars and software systems have much in common, there are also several challenges in adapting software verification methods to grammars. One obvious challenge is that we often do not have a formal specification of what constitutes a correct grammar, which means we need to involve human experts in some of the grammar testing methods. We need to minimize this need by automatically identifying the “hot spots” where human expertise is required. Another challenge is that the set of sentences generated by a grammar is governed not only by the application at hand, but also by the actual natural language being used, which limits the control we have over the grammar. For example, if we discover that a certain French sentence can be interpreted in two different ways, we have to deal with that in some other way than just changing the grammar so that the ambiguity is removed by stipulation; we cannot decide what is and what is not part of the French language!
The technique we will initially use in this part of the project is QuickCheck, a testing tool originally developed for Haskell (Claessen and Hughes 2000) but later ported to Java and Erlang as well. QuickCheck deploys random generation of structured test data, properties to test against, and shrinking to find minimal counterexamples. QuickCheck is a good fit: its random generation methods are already suited to generating data for simple grammars (see the preliminary results for an overview of what has been done).
The challenges that remain to be solved here are: (a) How to effectively generate test data for GF grammars in their most general form? So far, we know how to generate test data well for simply-typed first-order grammars, and we plan to use our experience from generating well-typed functional programs to take us further along this line (Palka et al. 2011). (b) How to mix formal properties checkable by a computer with properties only checkable by hand by a human? Using humans as test oracles places different demands on a test data generator than having automated oracles. We plan to develop methods, based on test data generation, for creating minimal sets of sentences for a human expert to inspect, which cover all constructs in a given grammar while keeping the structure simple enough to be understandable.
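The idea of producing a small covering test set can be sketched as follows (a deterministic enumeration rather than QuickCheck's random generators, to keep the example dependency-free; the grammar is hypothetical): enumerate all sentences of a toy grammar up to a depth bound, yielding a finite set that exercises every construct and that a human oracle could inspect.

```python
# Hypothetical toy grammar: a sentence is a subject-verb predication
# or a coordination of two smaller sentences.
def sentences(depth):
    """All sentence trees up to a given coordination depth."""
    base = [("Pred", np, "walks") for np in ("John", "Mary")]
    if depth == 0:
        return base
    smaller = sentences(depth - 1)
    return smaller + [("Conj", a, b) for a in smaller for b in smaller]

def linearize(tree):
    if tree[0] == "Pred":
        return f"{tree[1]} {tree[2]}"
    _, a, b = tree
    return linearize(a) + " and " + linearize(b)
```

Even at depth 1 the set already covers both constructors with six sentences, small enough for human inspection; the research question is how to keep such sets minimal yet covering for realistic GF grammars.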
Formal methods are a means to verify grammars, but they can also be applied to the documents created by using the grammars. The techniques developed here will thus be complementary to the ones developed in the previous work package. For instance, while the techniques developed there help us disambiguate grammars with respect to their semantics, here we will be able to reason about logical inconsistencies, causality of events, dependencies, etc., on the concrete documents. For doing so, we will adapt existing techniques from static analysis and model checking, and develop new ones.
By combining the two levels of reasoning, we will achieve a two-level modular approach in which the verification results at the abstract level feed, and get feedback from, the verification at the concrete level, enhancing precision and performance at both levels and making them complementary in the kinds of properties to be verified.
Since it is not possible to reason mechanically on documents written in unrestricted natural language, we will define an abstract but expressive CNL as a target language for reasoning, with an automata-based semantics. We will use GF to create CNL renderings of the abstract structure in many languages. Reasoning about the abstract documents thereby also covers the translations - as far as these are faithful and unambiguous renderings of the abstract structure, which is guaranteed by the formal methods in work package 2.
Later in the project, we shall widen the scope of the methods by exploring ways to automate the extension of grammars and reasoning methods, using statistical machine translation models and machine learning techniques. This will be done in a controlled way, without compromising quality. In particular, we will provide algorithms to extract operational models (i.e., automata) from CNL documents, and develop techniques to query a CNL document (via its automaton) based on model checking.
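A minimal sketch of the automaton view (hypothetical contract, states, and events; plain reachability standing in for a full model checker): a contract becomes a transition system over events, and a query such as "can the contract ever be transferred?" reduces to asking whether a transfer transition is enabled in some reachable state.

```python
# Hypothetical rental-contract automaton: states 0 (running),
# 1 (transferred), 2 (terminated); transitions labelled by events.
CONTRACT = [
    (0, "pay", 0),
    (0, "transfer", 1),
    (1, "pay", 1),
    (1, "terminate", 2),
]

def reachable(automaton, start):
    """Fixpoint computation of the states reachable from `start`."""
    seen = {start}
    while True:
        new = {t for (f, _, t) in automaton if f in seen and t not in seen}
        if not new:
            return seen
        seen |= new

def can_ever(automaton, event, start=0):
    """Query by reachability: is `event` enabled in some reachable state?"""
    states = reachable(automaton, start)
    return any(f in states and e == event for (f, e, _) in automaton)
```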
Finally, we will use GF to provide a two-way translation between the CNL developed here and the formal language (targeted at legal contracts) to be developed in the SAMECO project, if accepted (see Other Grants below). We will also relate the CNL automata to the Kripke semantics of the SAMECO formal language.
Contracts. We will consider the representation and analysis of legal contracts, focusing in particular on financial contracts. The reason is the relevance of such contracts for end users, banks and other financial institutions, and the economy in general. Moreover, our approach seems complementary to existing commercial solutions such as those offered by LexiFi, adding new value.
E-commerce. Web shops are by their nature potentially international, but language is a great obstacle. In the EU, for instance, a true Common Market would mean the availability of e-commerce sites in all EU languages. This is an area where CNL-based methods show great promise. It also has a natural connection with contracts, which are an essential part of e-commerce transactions.
Official information. Authorities in Sweden and other countries need reliable translations and automatic information access in many languages. We will find partnerships among potential users later in the project and build case studies that meet their needs. Building digital infrastructure in developing countries is another source of needs, which we can address through our international contacts (e.g. the MOLTO partner BeInformed, and collaborations in South Africa and Kenya).
The four work packages will run in parallel, so that we can start early with a case study that integrates all aspects of linguistic resources, formal methods, and reasoning. As the project proceeds, we will scale the case studies up to larger coverage. Thus we will start with the contract case study in Year 1 and extend it to e-commerce in Year 2. For official information, we will search for interested partners and meet their needs; this will happen in Years 3 and 4.
The language coverage of the GF library is expected to grow from the current 25 to around 50, since the use of GF and its international community are expanding. By then we count on covering all the official EU languages and the top 10 languages of the world. But we will also focus on languages in Africa and Asia.
The personnel represent three world-leading research groups at the Department: Language Technology (Ranta), Formal Methods (Schneider, Claessen), and Functional Programming (Claessen).
The main contribution of the project is its solution to the problem of providing reliable multilingual translation methods in controlled/restricted domains, leading to (a) a safer situation for consumers, because they can trust what they see and interactively query their rights and obligations, and (b) a much cheaper way (or a possibility at all) for providers to offer multilingual services that are understandable by all users.
On the research side, the contribution consists of methods for the systematic analysis and testing of grammars, leading to more reliable translations, more effective grammar development, wider grammar applicability, better grammar interaction (needed for translation between natural languages), and scalability of grammars via semi-automatic bootstrapping controlled by formal methods.
We have identified a challenge in today's digital and internationalized society: documents and other forms of communication have to be provided in multiple languages in order for all involved parties to understand what is going on. We also propose a combination of techniques that has the potential to solve this problem in restricted/controlled, well-defined domains, and to scale these methods up and make them more widely usable in the future. We acknowledge that the project is mainly about technology and focuses on “tomorrow's” digital society - but we believe that better and more mature technology will free people from adapting to today's constraints, such as the dominance of English and the poor quality of translations.
The MOLTO project will end in May 2013, and it has already generated interest in this unique combination of formal methods and natural languages, both in academia and in industry. In the last two decades, machine translation and information retrieval have made an enormous impact on society, mainly using methods based on huge amounts of data, statistics, and machine learning. But the limitations of these methods are also becoming more and more obvious: they will never be reliable in the way that, for example, calculators and compilers are. The GF approach, proven in MOLTO, has had such reliability as a goal, and the proposed project aims to take it to a new level by bringing in new expertise from the formal methods community.
Previous work within the MOLTO project showed the advantages that GF offers for verbalizing large ontologies (Dannélls et al. 2012) and business platform meta-models (Davis et al. 2012). The latter is also significant because CNLs have previously been shown to be of great help in the modelling of large models that are also used for reasoning (Spreeuwenberg et al. 2012).
Also relevant to this application is Schneider's previous work on the development of CL, a deontic-based formal language for partially representing normative texts (Prisacariu and Schneider 2007; Pace and Schneider 2009), carried out in the context of the Nordunet3 project COSoDIS: Contract-Oriented Software Development for Internet Services (http://folk.uio.no/gerardo/nordunet3/index.shtml). Besides, some initial work has been done on the relation between a formal language and a CNL (Montazeri et al. 2011).
There is work in progress on ambiguity testing, which focuses on discovering ambiguities in monolingual grammars and creating a database with distinct instances of ambiguous constructions. The approach is based on QuickCheck, which we use to generate test cases, in this case valid parse trees. We automatically retain the ambiguous constructions and determine whether they are instances of a previously seen case or represent a new ambiguity. The database keeps generalizations of the ambiguous cases, along with a context that shows how the ambiguity propagates to the top-level category of the grammar.
The approach covers both lexical and syntactic ambiguities, giving a comprehensive analysis of all relevant cases of ambiguity that the grammar could display. Moreover, since GF grammars correspond to Parallel Multiple Context-Free Grammars (PMCFGs) (Ljunglöf 2004), the current work is the first attempt to test such grammars for ambiguities. Ambiguity testing has so far mainly focused on context-free grammars, which are a strict subset of PMCFGs (Brabrand et al. 2007).
Angelov, K. and Ranta, A.: Implementing Controlled Languages in GF. N. Fuchs (ed.), CNL-2009 Controlled Natural Languages, LNCS/LNAI 5972, 2010.
Bringert, B., Ranta, A. and Hallgren, T.: GF Resource Grammar Library Synopsis, http://www.grammaticalframework.org/lib/doc/synopsis.html, 2012.
Brabrand, C., Giegerich, R. and Møller, A.: Analyzing Ambiguity of Context-Free Grammars. In Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007), Springer-Verlag LNCS vol. 4783, 2007.
Claessen, K., and Hughes, J.: QuickCheck: A lightweight tool for random testing of Haskell programs. In Proc. of International Conference on Functional Programming (ICFP). ACM SIGPLAN, 2000.
Davis, B., Enache, R., Pretorius, L. and Van Grondelle, J.: Multilingual Verbalization of Modular Ontologies with GF and Lemon. Submitted to the Controlled Natural Languages Workshop, CNL 2012, Zurich, Switzerland, 2012.
Dannélls, D., Enache, R., Damova, M. and Chechev, M.: Multilingual Online Generation from Semantic Web Ontologies. In the WWW 2012 Conference, Lyon, France, 2012, to appear.
Détrez, G., Enache, R. and Ranta, A.: Controlled Languages for Everyday Use: The MOLTO Phrasebook. In Post-Proceedings of the Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, LNCS-LNAI vol. 7175, 2012, to appear.
Fuchs, N., Schwertel, U. and Schwitter, R.: Attempto Controlled English - Not Just Another Logic Specification Language. In P. Flener (ed.): Logic-Based Program Synthesis and Transformation, Eighth International Workshop LOPSTR'98, Manchester, UK, June 1998. LNCS 1559, Springer Verlag, 1999.
Funka Nu, Slutrapport: Översättning på internet, 2011. http://www.funkanu.se/PageFiles/3596/Slutrapport-Oversattning-pa-internet.pdf
Ljunglöf, P.: Expressivity and Complexity of the Grammatical Framework. PhD thesis, Computer Science, University of Gothenburg, 2004.
Montazeri, S.M., Roy, N., and Schneider, G.: From Contracts in Structured English to CL Specifications. In FLACOS'11, vol. 68 of EPTCS, pp. 55-69, 2011.
Pace, G.J. and Schneider, G.: Challenges in the specification of full contracts. In iFM'09, vol. 5423 of LNCS, pp. 292-306, 2009. Springer
Palka, M., Claessen, K., Russo, A. and Hughes, J.: Testing an optimising compiler by generating random lambda terms. In Proc. of 6th International Workshop on Automation of Software Test (AST), 2011.
Prisacariu, C. and Schneider, G.: A formal language for electronic contracts. In FMOODS'07, vol. 4468 of LNCS, pp. 174-189, 2007. Springer.
Ranta, A.: Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Stanford, 2011.
Ranta, A.: När kan man lita på maskinöversättning? (When can we trust machine translation?) In Språkteknologi för ökad tillgänglighet (Language technology for increased accessibility), report from a Nordic seminar, Linköping, 27-28 October 2010, pp. 49-60, 2011. http://www.ep.liu.se/ecp/054/006/ecp10054006.pdf
Spreeuwenberg, S., Van Grondelle, J., Heller, R. and Grijzen, H.: Design of a CNL to Involve Domain Experts in Modelling. In Post-Proceedings of the Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, LNCS-LNAI vol. 7175, 2012, to appear.