Friday, November 15th, 1996 
    IPO, Center for Research on User-System Interaction, Eindhoven
    
We are pleased to say that the seventh CLIN meeting was a success. We would like to thank everyone who attended for their contributions, which helped make it a fruitful conference. The meeting was organised by IPO in Eindhoven. At the time, IPO was the Institute for Perception Research; it has since changed its name to IPO, Center for Research on User-System Interaction.
A selection of the papers presented at CLIN 1996 will be compiled into a volume, to be issued at the next CLIN meeting in 1997.
At CLIN meetings, computational linguistics researchers in the Netherlands and Dutch-speaking Belgium gather to present their research. The meeting is also open to international participants. The default language of the conference is continental English; however, presentations with a Dutch title and abstract may be held in Dutch.
In 1996, Stephen Pulman of SRI International in Cambridge and the University of Cambridge Computer Laboratory presented a keynote lecture on Conversational Games, Belief Revision, and Commitment.
This talk discusses the roles of these three concepts in some recent approaches to dialogue and tries to sketch a hybrid rule-based+statistical framework on which practical implementations could be based.
The data-oriented approach to language processing assumes that previous language experiences (rather than abstract linguistic rules) form the basis for language perception and production. A Data-Oriented Processing (DOP) model therefore maintains a large corpus of linguistic representations of previously occurring utterances. By combining fragments from this corpus, representations for new sentences can be generated. The frequencies of these fragments are used to estimate the most probable representation of a given utterance.
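To make the fragment-frequency idea concrete, here is a minimal sketch (not from the talk) that extracts all DOP fragments from a two-tree toy treebank and estimates fragment probabilities as relative frequencies; the treebank, its tuple encoding, and the example fragment are invented for illustration.

```python
from collections import Counter
from itertools import product

# Toy treebank: a tree is (label, child, ...); leaves are plain strings.
treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "him"))),
    ("S", ("NP", "he"), ("VP", ("V", "left"))),
]

def rooted(tree):
    """All DOP fragments rooted at this node: at each non-lexical child
    we either cut (leaving a frontier nonterminal) or substitute one of
    the child's own rooted fragments."""
    label, children = tree[0], tree[1:]
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])                    # keep the word
        else:
            options.append([(child[0],)] + rooted(child))
    return [(label,) + combo for combo in product(*options)]

def all_fragments(tree):
    yield from rooted(tree)
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from all_fragments(child)

counts = Counter(f for tree in treebank for f in all_fragments(tree))
root_totals = Counter()
for frag, n in counts.items():
    root_totals[frag[0]] += n

def fragment_prob(frag):
    """Relative frequency among fragments with the same root label."""
    return counts[frag] / root_totals[frag[0]]

# The probability of a derivation is the product of its fragment
# probabilities; here one fragment's relative frequency:
print(fragment_prob(("S", ("NP",), ("VP",))))   # 2/16 in this toy corpus
```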
        
        
A DOP model can be defined for almost every theory of linguistic representation or utterance analysis. The original DOP model corresponds to a theory in which the linguistic representation of an utterance is given by a phrase structure tree. New representations are produced by combining subtrees of previous representations. In this talk, we show how a DOP model can be developed for the more articulated representations provided by Lexical Functional Grammar (LFG). On this theory of representation, the analysis of every utterance consists of a constituent structure (a phrase structure tree), a functional structure (an attribute-value matrix), and a correspondence function that maps between them. We will show how the definitions for fragments and combination operations of the original DOP can be straightforwardly extended to a DOP model based on LFG representations. However, the original DOP probability calculations do not properly apply to LFG's nonmonotonic constraints on valid fragment combinations. We propose a new probability model that does generalize appropriately to the case of nonmonotonic conditions, and describe how this model applies to LFG representations.
           
        
Word phonemisation, the task of converting a word to its phonemic transcription (with word stress), is hard for two reasons. First, it involves a large amount of language-dependent knowledge that is hard to acquire by handcrafting; this problem may be alleviated by using inductive-learning algorithms to induce the needed knowledge automatically. Second, the task represents a non-linear classification task which is hard to (learn to) represent in a single-process system. Designing a modular system in which the task is solved in more than one step appears to be a good heuristic. However, modularisation may induce unwanted performance effects: e.g., (i) many proposed orderings and separations of subtasks ignore relevant dependencies between subtasks; (ii) modular systems are relatively sensitive to cascading `snowball' errors. The paper provides empirical performance data obtained by systematically varying and optimising the number and the ordering of modules in a word phonemisation system. Individual modules are automatically induced on the basis of a large lexical database of English, by symbolic (IGTree, IB1) or connectionist (back-propagation) inductive learning algorithms. The results show that both the number and the ordering of modules considerably affect generalisation performance. The results offer insight into subtask dependencies in morpho-phonology and, as a spin-off, provide indications for building accurate word phonemisation systems.
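As an illustration of the kind of modular setup at issue, the following toy sketch (our own, not the paper's system) cascades two 1-nearest-neighbour modules in the spirit of IB1: letters to phonemes, then phonemes to stress. The mini-lexicon, window width, and one-symbol-per-letter alignment are invented; the cascade shows where `snowball' errors can enter.

```python
def windows(word, width=3, pad="_"):
    padded = pad * (width // 2) + word + pad * (width // 2)
    return [padded[i:i + width] for i in range(len(word))]

def overlap(a, b):
    return sum(x == y for x, y in zip(a, b))

class IB1:
    """1-nearest-neighbour classifier with a flat overlap metric."""
    def __init__(self):
        self.memory = []                       # (window, label) pairs
    def train(self, pairs):
        self.memory.extend(pairs)
    def classify(self, window):
        return max(self.memory, key=lambda m: overlap(m[0], window))[1]

# One letter -> one phoneme symbol ('-' = null) and one stress marker,
# so module outputs stay aligned with the letter windows.
lexicon = [("book", "bu-k", "1000"), ("look", "lu-k", "1000")]
g2p, stress = IB1(), IB1()
for spelling, phonemes, stresses in lexicon:
    g2p.train(zip(windows(spelling), phonemes))
    stress.train(zip(windows(phonemes), stresses))

# The cascade: the stress module sees the g2p module's output, so any
# g2p error propagates ('snowballs') into stress assignment.
word = "cook"
phon = "".join(g2p.classify(w) for w in windows(word))
print(phon, "".join(stress.classify(w) for w in windows(phon)))
```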
Recent work in HPSG has (1) emphasized the role of argument structure (ARG-ST) as a level of representation independent of valency, (2) demonstrated that (recursive) constraints on lexical entries may lead to accounts of unbounded dependency constructions and quantifier scope that are superior to previous proposals, and (3) used lexical rules to describe processes that were previously considered to be primarily syntactic in nature (extraction, selection of adjuncts, and cliticization).
        
In this talk I want to argue that a combination of (1) and (2) makes (3) superfluous. In particular, lexical rules for complement extraction, introduction of adjuncts on COMPS, and cliticization can be replaced by monotonic constraints on lexical entries which define the relationship between subcategorization and argument structure, between argument structure and valency (including SLASH), and between argument structure and phonology.
        
The advantages of this approach are that the `canonical' mapping between argument structure and valency holds not just for basic lexical entries but for all lexical entries, that the various forms of an entry previously derived by means of lexical rules can be seen as monotonic instantiations of a single basic entry, and that many of the problems associated with the use of lexical rules (order sensitivity, default relations between input and output, spurious derivations, recomputation of values on the output) disappear.
        
        
Wh-chains are known to be restricted by both local and global requirements. To parse wh-relations efficiently, we have to account for the interaction of the local and the global licensing mechanisms.
        
The parser Delilah, which handles Dutch categorially and context-sensitively, is equipped with a deterministic and incremental device for selecting wh-chains. This device, a finite-state network, reduces the number of possible operator-variable relations the parser has to check as much as possible, in an incremental and deterministic fashion. The network operates under the assumption of massive lexical ambiguity with respect to the local licensing of variables, and is fed by knowledge of local and global conditions on wh-chains.
        
For non-coordinated sentences the device may divide the number of possible operator-variable chains by twenty, and for coordinated sentences by six, allowing in the latter case for across-the-board applications. We will present a detailed account of the grammatical and operational aspects of the network, and some figures on its effect.
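The filtering idea can be illustrated with a deliberately simple finite-state sketch (not Delilah's actual network): states count pending wh-operators, and candidate operator-variable hypotheses whose tag sequence cannot be scanned to a final state are pruned incrementally. The tags and the example hypotheses are invented.

```python
# A minimal finite-state filter over wh-chain hypotheses.
TRANSITIONS = {
    # (pending_ops, token_tag) -> new pending_ops; None = prune hypothesis
    (0, "WH"): 1, (1, "WH"): 2,
    (0, "GAP"): None,            # a gap with no licensing operator: prune
    (1, "GAP"): 0, (2, "GAP"): 1,
    (0, "OTHER"): 0, (1, "OTHER"): 1, (2, "OTHER"): 2,
}

def filter_hypotheses(tagged_hypotheses):
    """Keep only hypotheses the network scans to the final state (0)."""
    survivors = []
    for hypo in tagged_hypotheses:
        state = 0
        for tag in hypo:
            state = TRANSITIONS.get((state, tag))
            if state is None:
                break
        if state == 0:
            survivors.append(hypo)
    return survivors

# 'Who did you see _?': lexical ambiguity yields competing readings;
# the second leaves the operator undischarged and is pruned.
hypotheses = [("WH", "OTHER", "OTHER", "OTHER", "GAP"),
              ("WH", "OTHER", "OTHER", "OTHER", "OTHER")]
print(filter_hypotheses(hypotheses))
```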
                  
        
Syntactic analysis can be seen as a cascade of classification problems of two types: segmentation (constituent boundary detection) and disambiguation (morphosyntactic disambiguation, constituent labeling, and attachment decisions). By rephrasing syntactic analysis as a series of instances of a classification problem, machine learning techniques such as decision tree learning and memory-based learning become applicable. When annotated example corpora (treebanks) are used as learning material, these machine learning techniques can generalize the knowledge implicit in the annotations to unseen text. Obvious advantages of this approach include automatic learning (alleviating knowledge acquisition bottlenecks) and robustness due to the statistical nature of the learning algorithms.
        
In previous work, we have applied memory-based learning techniques to segmentation and disambiguation problems in phonology (syllabification, stress assignment, grapheme disambiguation), morphology (analysis and synthesis), and morphosyntax (morphosyntactic disambiguation, i.e. part-of-speech tagging). In this paper we show that a benchmark phrase attachment problem (PP-attachment) can be learned using memory-based learning techniques. Advantages of the approach over existing stochastic techniques include (i) smooth automatic integration of knowledge sources, and (ii) non-parametricity (no parameter estimation needed). We also discuss the impact on generalization accuracy of different similarity metrics in the memory-based learning algorithm and of different input representations.
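A minimal sketch of the memory-based setup for this benchmark, with invented examples rather than treebank data: store (verb, noun1, preposition, noun2) tuples and classify by overlap-based nearest neighbours. Real experiments would use far more data and typically a weighted similarity metric (e.g. information-gain weighting), which is one of the variations the paper discusses.

```python
from collections import Counter

# Memory of labelled 4-tuples; labels say where the PP attaches.
memory = [
    (("ate", "pizza", "with", "fork"), "verb"),
    (("ate", "pizza", "with", "anchovies"), "noun"),
    (("saw", "man", "with", "telescope"), "verb"),
    (("bought", "shirt", "with", "pockets"), "noun"),
]

def overlap(a, b):
    return sum(x == y for x, y in zip(a, b))   # unweighted feature overlap

def classify(tuple4, k=3):
    ranked = sorted(memory, key=lambda m: overlap(m[0], tuple4),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(classify(("ate", "salad", "with", "spoon")))   # -> 'verb'
```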
        
        
The Interface between Text Structure and Linguistic Description within the ALEP Platform
        
Input to the ALEP system is automatically converted into SGML-marked text, which is the input to linguistic processing. For the analysis of these tagged texts, tsls rules (Text Structure to Linguistic Structure) have to be defined. Thus, if an item is tagged as a word (tag `W'), an obligatory tsls rule should define which kind of linguistic object (described in the grammar) applies to this item. This allows a substantial modularization of the grammar, specifying which kinds of linguistic rules will apply.
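As a rough illustration (a Python analogy, not ALEP's actual rule syntax), a tsls-style mapping routes each tagged item to a constructor for the kind of linguistic object the grammar should build; the tag inventory and object shapes here are invented.

```python
# A tsls-like dispatch: SGML tag -> linguistic object constructor.
def word_object(item):
    return {"type": "word", "string": item["text"]}       # full analysis

def generic_object(item):
    return {"type": "fixed_phrase",                       # e.g. 'date'
            "class": item["class"], "string": item["text"]}

TSLS_RULES = {"W": word_object, "USR": generic_object}

def lift(tagged_text):
    """Lift tagged text items to linguistic descriptions."""
    return [TSLS_RULES[item["tag"]](item) for item in tagged_text]

tagged = [{"tag": "W", "text": "meeting"},
          {"tag": "USR", "class": "date", "text": "15 November 1996"}]
print(lift(tagged))
```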
        
Besides the SGML tags, we used the system-defined tag `USR' to deal with fixed phrases and `messy details'. User-defined (multi-)word recognizers have been integrated into the text-handling component of ALEP. The tagged output of these programs provides the input for the tsls rules. We described generic lexicon entries (e.g. `dates') corresponding to the `USR'-tagged expressions. With this technique, the running time of the parser has been significantly improved and the coverage of the grammar considerably extended.
        
The last step of our work consisted in extending the set of tags defined within the ALEP system. For example, a tag `CAT' has been added, which allows us to integrate information delivered by a part-of-speech tagger. We extracted the PoS information and `lifted' it to the linguistic description via the tsls rules. This again leads to very substantial improvements in terms of parser efficiency and grammar coverage.
        
A more theoretical question also arises: can this strategy provide a practicable way of combining corpus-based and knowledge-based approaches to NLP? In any case, we will have to consider reorganizing the (unification-based) grammar description with respect to the possibility of extracting morpho-syntactic information from PoS taggers.
        
In GPSG and HPSG the distinction between elements with and without phrasal projection is drawn in terms of parts of speech, cf. the major V, N, A, P vs. the minor Comp, Conj, etc. Contrary to this practice, I claim that the major/minor distinction is better treated as orthogonal to the part-of-speech classification.
        
To substantiate this claim I will show that the distinction between full and reduced personal pronouns in Dutch (jij/je, zij/ze, ...) is an instance of the major/minor dichotomy. Next, I will spell out an HPSG-style sort hierarchy for the description of minor signs and explore their syntactic peculiarities, i.e. their inability to be used as heads, fillers or conjuncts, and their deviance from the LP constraints which hold for their major counterparts. Criteria will be provided for identifying minor signs in other parts of speech and in other languages.
        
Since minor elements behave differently from major ones, both in terms of constituency and linear order, the distinction is better made explicit in the grammar. This argues against the GB policy of assigning phrasal projections to all lexical elements (and to many affixes), as well as against a trend in HPSG to treat all lexical signs (including the complementizers) as heads.
        
There was a time when this would have been needless to say, but times have changed. Groenendijk & Stokhof define dynamic semantics as follows:
        
        
A semantics is dynamic if and only if its notion of conjunction is dynamic, and hence non-commutative.
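To see what this definition amounts to, consider a toy update model (ours, purely for illustration, not from the paper): sentences denote context updates, `and' is sequential update, and conjunction therefore fails to commute.

```python
# 'A man walks in and he whistles' updates fine; the reverse order fails
# because the pronoun finds no antecedent in the context yet.
def a_man_walks_in(ctx):
    return ctx | {"man"}                 # introduces a discourse referent

def he_whistles(ctx):
    if "man" not in ctx:
        raise ValueError("pronoun 'he' has no antecedent")
    return ctx

def dyn_and(update1, update2):
    """Dynamic conjunction as sequential update: not commutative."""
    return lambda ctx: update2(update1(ctx))

dyn_and(a_man_walks_in, he_whistles)(set())      # succeeds
try:
    dyn_and(he_whistles, a_man_walks_in)(set())  # reversed order fails
except ValueError as e:
    print("reversed order:", e)
```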
        
        
In this paper I argue that dynamic semantics, thus understood, is a rather bad idea. Dynamic semantics is an admittedly elegant but nonetheless misguided implementation of an essentially pragmatic principle. It is an obvious and even important truth that utterances are processed incrementally. The central tenet of dynamic semantics is that, to some extent at least, this processing strategy is encoded in the lexical entries of certain words, and especially in the lexical meaning of 'and'. Thus formulated, it will be plain that the very notion of a dynamic semantics is quite implausible. But apart from its lack of plausibility, it gives rise to all sorts of strange quandaries. Consider, for example, a young child learning the meaning of 'and'. Are we to suppose that he learns it in two steps? The truth-conditional part first, perhaps, and the dynamic part afterwards, or would it be the other way round? Would it be possible for a child to get the truth-conditional import of 'and' right but founder on its dynamic aspects? Clearly, such questions are absurd: the lexical meaning of 'and' isn't dynamic.
        
        
In my talk I will first elaborate on this point and then turn to proposals for giving dynamic interpretations to negation and disjunction as well. I will argue that these, too, are ill-founded, empirically as well as conceptually.
        
        
It is generally accepted nowadays that the scarcity of lexical resources in NLP necessitates some kind of reusability. At least two approaches to reusability can be distinguished, which differ in what is reused. In one approach the lexicon is a purely declarative knowledge base, containing all information to be used by NLP systems. Reusable information includes what is encoded in features; system-specific information includes all procedural knowledge. In the other approach, reusable information is everything that is necessary for the mapping between text words and lexemes in the dictionary. This includes both declarative and procedural knowledge on morphology; system-specific information encompasses syntax and semantics.
        
        
A typical example of the first approach is DATR. The second approach is not represented adequately by two-level morphology, which lacks the notion of lexeme. A better representative is Word Manager, a system developed in Basel. I will argue that this approach to reusability has a number of important advantages compared to the one represented by DATR.
        
        
Translation idioms and structural divergences between languages are classical problems for machine translation. This holds in particular for compositional approaches, which require a translation equivalence between basic expressions and between grammar rules of the source-language and target-language grammars. One way to attack these problems, pursued in the Rosetta system, is to make use of grammar rules that can perform syntactically powerful operations, enabling a distinction between surface structure and compositional derivation structure.
        
        
In this talk I present a formal basis for an alternative approach in which the individual grammars can be relatively simple (e.g. context-free or DCG), but where the translation relation between the grammars is more complex. Translation equivalence is now defined as a relation between combinations of rules and basic expressions, so-called polynomials. Special attention is paid to the issue of completeness, i.e. to the conditions under which this translation method is guaranteed to yield at least one translation for each analysis of every source-language expression.
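A very small sketch of the idea (an invented example, not the talk's formalism): a source rule is paired not with a single target rule but with a target polynomial, here the French 'manquer' pattern, whose rule combination also swaps the arguments.

```python
# Translation-equivalence between rule combinations ('polynomials').
lexical_pairs = {"john": "jean", "mary": "marie"}

def src_misses(x, y):            # one source-language rule ...
    return f"{x} misses {y}"

def tgt_misses(x, y):            # ... paired with a target polynomial:
    return f"{y} manque à {x}"   # 'manquer' plus argument swap

rule_pairs = [(src_misses, tgt_misses)]

def translate(src_rule, args):
    for s, t in rule_pairs:
        if s is src_rule:
            return t(*(lexical_pairs[a] for a in args))
    raise KeyError("no translation-equivalent polynomial")

print(src_misses("john", "mary"))                # john misses mary
print(translate(src_misses, ("john", "mary")))   # marie manque à jean
```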
        
        
In this talk I will give an overview of the GoalGetter system, which generates spoken summaries of football matches on the basis of concise teletext information. The system consists of a language generation component and a speech output generation component. The language generation component will be discussed in more detail in the presentation by Mariet Theune.
        
        
The focus of this presentation will be on the speech output module. Speech output can be realised by either diphone synthesis or phrase concatenation. With diphone synthesis one can generate an unlimited set of sentences. Phrase concatenation is used in applications where the set of sentences is limited: entire words and phrases are recorded and can be strung together to construct the spoken texts without any manipulation of the original recordings. Our approach to phrase concatenation is special in that we record variable words, such as team names and player names, in several prosodic contexts. Depending on the place where the variable is to be inserted in a carrier sentence, and on information about accenting and phrasing, the right prosodic variant is selected.
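A minimal sketch of such prosodically conditioned concatenation (the file names and the context inventory are invented, not GoalGetter's actual ones): recordings of a variable word are keyed by position and accent, and the slot picks the matching variant.

```python
RECORDINGS = {                  # (word, position, accented) -> audio file
    ("Ajax", "medial", True): "ajax_medial_acc.wav",
    ("Ajax", "medial", False): "ajax_medial_unacc.wav",
    ("Ajax", "final", True): "ajax_final_acc.wav",
    ("Ajax", "final", False): "ajax_final_unacc.wav",
}

def select_variant(word, position, accented):
    """Pick the recording matching the slot's prosodic context."""
    return RECORDINGS[(word, position, accented)]

def concatenate(carrier_slots):
    """Slots are fixed recordings or (word, position, accented) triples."""
    return [slot if isinstance(slot, str) else select_variant(*slot)
            for slot in carrier_slots]

# '... scored by AJAX.' : phrase-final, accented variant is selected.
print(concatenate(["fixed_scored_by.wav", ("Ajax", "final", True)]))
```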
        
        
For application in connection with databases, and in particular information systems such as library systems, we shall analyze a few prototypical natural language queries. The query analysis recommended here is essentially automated: it uses logic programming as a tool for analysing natural language semantics, and it involves modelling the information content by means of a logical representation. It comprises the extensive application of induction, using some home-made inductive meta-systems that perform automated program synthesis through, as an intermediate step, a dataflow analysis resulting in the construction of so-called dataflow structures (cf. Understanding & Logic Prog. 2-3). The resulting synthesized programs are logic grammars, more precisely definite clause grammars (DCGs). The method seems very promising.
        
As an illustration, we intend to examine a simple and prototypical query to a library information system.
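As a rough indication of the target of such synthesis, here is a DCG-style rule transcribed as a Python function (the vocabulary and predicate names are invented): it maps a prototypical library query onto a logical representation.

```python
# query -> 'which' noun 'by' author, yielding a logical form.
def parse_query(tokens):
    if tokens[:1] == ["which"] and tokens[2:3] == ["by"] and len(tokens) == 4:
        noun, author = tokens[1], tokens[3]
        return ("answer", "X", [("isa", "X", noun),
                                ("author", "X", author)])
    raise ValueError("unparsable query")

print(parse_query("which books by chomsky".split()))
# ('answer', 'X', [('isa', 'X', 'books'), ('author', 'X', 'chomsky')])
```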
        
Complement clitics in Modern Greek NPs exhibit an idiosyncratic type of climbing: they can attach to the noun head (1), to prenominal adjectives (2), and to a small set of left-periphery elements (3). Though such clitics were taken to be affixes in previous approaches (e.g. Stavrou and Horrocks 1990), they do not satisfy several of the diagnostics that have been proposed to characterize Pronominal Affixes and distinguish them from Postlexical Clitics (see e.g. Miller 1992). Moreover, an account of their positioning in terms of Argument Composition (Hinrichs and Nakazawa 1990, 1994; Miller and Sag 1996) would encounter serious difficulties, including the contrast in (4), which indicates that an adjective with a complement of its own cannot ``attract'' the noun head's clitic complement. I provide an account of clitic climbing in MG NPs in terms of Domain Union (Reape 1994, Kathol 1995) that employs a notion of Attachment in the sense of Dowty (to appear) and Gunji (to appear). This approach can be straightforwardly extended to account for definite articles and NP-internal demonstratives, which, like clitics, cannot stand on their own but require an appropriate host to attach to.
| 1.  | to kenurio vivlio mu-CL | (lit.: the new book my) |
| 2.  | to kenurio tu-CL vivlio | (lit.: the new his book) |
| 3a. | ola tus-CL ta vivlia | (lit.: all their the books) |
| 3b. | afto su-CL to vivlio | (lit.: this your the book) |
| 4a. | i [anagnorismeni [apo olus]] iperohi tu-CL | (lit.: the acknowledged by all superiority of-his) |
| 4b. | *i [anagnorismeni tu [apo olus]] iperohi | |
We will describe research on the treatment of Dutch compounds in the UPLIFT information retrieval project. Results of earlier experiments in the UPLIFT project indicated that splitting up compounds in the query and generating new compounds by simply combining query terms both improved retrieval performance. We subsequently experimented with adding constraints to the compound splitting and generation algorithms in order to restrict both processes and minimize over-generation. We experimented with using information about head-modifier relationships and corpus frequency information to formulate constraints. So far, we have not been able to improve on our initial strategy but the results of initial experiments have provided us with some important clues for further experimentation.
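The constrained splitting strategy might look roughly as follows (a sketch with an invented lexicon and frequencies, not the actual UPLIFT algorithms): only splits whose parts pass a corpus-frequency threshold are proposed, ranked by part frequencies.

```python
# Frequency-constrained splitting of Dutch compounds.
CORPUS_FREQ = {"fiets": 500, "band": 300, "fietsband": 40,
               "voet": 800, "bal": 600, "voetbal": 900}

def split_compound(word, min_freq=100, min_len=3):
    candidates = []
    for i in range(min_len, len(word) - min_len + 1):
        mod, head = word[:i], word[i:]      # Dutch head is the final part
        f1, f2 = CORPUS_FREQ.get(mod, 0), CORPUS_FREQ.get(head, 0)
        if f1 >= min_freq and f2 >= min_freq:   # constraint: both attested
            candidates.append(((mod, head), f1 * f2))
    return sorted(candidates, key=lambda c: -c[1])

print(split_compound("fietsband"))   # [(('fiets', 'band'), 150000)]
```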
Rob van der Sandt's theory of `presuppositions as anaphors' is widely considered to be the empirically most adequate theory of presupposition projection on the market. In this talk, two weaknesses of Van der Sandt's theory are pointed out. The first weakness is that a central notion of the theory, namely that of a `partial match', is not defined in a sufficiently precise way. The second weakness, in our opinion, is that the theory takes only one kind of anaphora into account, in which anaphor and antecedent must always corefer. Both weaknesses are remedied in an updated version of the `presuppositions as anaphors' theory, which we claim to be both more precise and more general than its predecessor.
Research on creating automatic processing systems for texts in Turkic languages shows that it is necessary to determine, and take into consideration, their morphonological regularities. The morphonological changes observed in the formal processing of Turkic texts by computer can be grouped as follows:
Several string operations are introduced as models of the coordination phenomenon in natural languages. Their relationships with other string operations are investigated, and closure properties of the families in the Chomsky hierarchy are obtained. In particular, CF is not closed under these operations. However, if coordination is defined only between strings with a common syntactic structure (both strings have derivations described by identical trees, modulo the coordinated subwords), then coordination preserves context-freeness. The extension of this tree-based coordination operation to TAGs is also discussed.
A definition of the notion of answerhood is formalised using a proof system, namely Constructive Type Theory (CTT). The definition, which was proposed in the mid-eighties by Jeroen Groenendijk and Martin Stokhof, makes use of two concepts which, in the past fifteen years, have become central to the trade of formal semantics: context change and context-dependence. The formalisation in CTT is proposed as an alternative to Groenendijk and Stokhof's original formalisation in possible-world semantics. It is demonstrated that CTT, and in particular the fact that it is a proof system, enables a more fine-grained analysis which can be turned into a computational model. Furthermore, we contend that our formalisation of the definition of answerhood is a natural generalisation of definitions of answerhood phrased in terms of unification of the question and the answer.
In this paper we develop an HPSG analysis of certain (so far unnoticed) syntactic phenomena connected to verbal negation in Polish. First of all, we show that, contrary to the received wisdom, verbal negation is a morphological (rather than syntactic) process, and we model this observation via lexical rules. Then we move to so-called long-distance negative concord, i.e., the requirement that a verb has to be negated if any of its arguments is or contains a negative pronoun. We show that this is essentially a UDC, as this `negation requirement' can cross an arbitrary number of NP and PP boundaries (VPs seem to be islands). Since the `negation requirement' is discharged lexically (by negated verbs), and because of some intriguing lexical exceptions, we adapt the lexical approach to UDCs of Sag (1995) and Sag (1996). Finally, we investigate the interesting behaviour of negative concord and of genitive of negation in the context of verb clusters, and show that this behaviour can be accounted for if arguments of the lower verbs are assumed to be raised to the nearest negated verb (if any), à la Hinrichs and Nakazawa (1989), and if case assignment and `negation percolation' are made sensitive to whether an argument has been realized from the given argument structure or raised to a higher verb. In the latter we follow the non-configurational case assignment approach of Przepiorkowski (1996).
In my talk, I will give a corpus-based analysis of information update in information dialogues. The corpus used consists of 111 naturally occurring telephone conversations recorded at the information service of Schiphol Airport. The information update will be described theoretically by extending the dynamic interpretation theory (DIT) of Bunt (Bunt 1995) with the information packaging notions "topic", "tail", and "focus" (Rats 1996, Vallduvi 1990). The file change semantics of Heim will be used to show how the information update can be formalized. Examples and tables from the corpus will show how the information update is realized linguistically.
In this talk I want to present (1) a summary and the main conclusions of my Ph.D. thesis on the automated syntactic and semantic analysis of nominal compounds in a technical domain, and (2) experiences concerning the practical applicability and the potential business opportunities of speech and language technologies from the perspective of a large IT supplier, Getronics Software.
The ANNO project (An annotated public database for written Dutch; Flemish short-term programme for speech and language technology) intends to initiate the creation of a large database for the variant of Dutch used in Flanders, as no corpus of reasonable size is available for Flemish Dutch.
        
As BRTN Dutch is considered to reflect the national standard, the corpus consists of news bulletins and issues of the current affairs programme Actueel (both BRTN radio). In addition to written texts intended to be spoken, these contain transcriptions of interviews.
        
In this talk we want to report on the choice of the material and its consequences, the types of annotation we used for the whole corpus or just part of it, the way annotating was done ((semi-)automatically or by hand) and why it was done that way, as well as on our future plans.
         
        
In this talk I will discuss some aspects of the language generation component of the GoalGetter system. This system generates spoken summaries of football matches, based on teletext information.
        
        
The focus of the talk will be the accentuation of referring expressions in GoalGetter. Referring expressions play an important role in the football reports we generate, since we constantly have to refer to players and teams. First, I will briefly explain how the system generates different referring expressions depending on the context. Then I will discuss the accentuation rules we currently use: expressions referring to a 'new' object receive an accent, whereas expressions referring to a 'given' object do not. This approach is in line with many accentuation theories. However, it does not always give the correct result. I will argue that we need to add some notion of contrastive accent to our accentuation rules. A problem here is that the few existing theories of contrast do not seem to be applicable to the football domain.
        
        
Linguistics and computer science make extensive use of tree structures. We present here a formalisation of trees (in fact, of forests) within the algebraic theory of binary relations (Del Vigna & Courrége, 1994), and we show how the relational framework also expresses the theory of command relations used in Generative Grammar (Del Vigna, 1996). In fact, this may be applied to various configurations in trees. The expressiveness, simplicity and elegance of relational algebra are widely recognized, particularly in the relational database model. Moreover, as an algebra, it allows blind calculation and proofs based on rewriting. These qualities still hold with syntagmatic structures and, in other respects, the relational approach provides a unifying framework for several definitions of trees which occur in the literature.
        
First, we introduce forests on a finite set N. Then, we define a gridded forest as a pair (V,H) of forests on N. The definition is symmetric, i.e. the pair (H,V) is also a gridded forest on N. Four derived forms of gridded forests are presented: primitive, which corresponds to oriented and ordered trees (Aho & Ullman, 1972); functional, which corresponds to the data structure for binary trees used in programming; DP, which corresponds to the pair (dominance, precedence) in (Partee, Ter Meulen & Wall, 1990); and, finally, total. Algebraic formulae permit transition from any form to another and constitute a basic and useful formal toolbox. Finally, we present the axioms, all expressed in relational algebra, which characterize, for a given forest, the set of its command relations.
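To indicate the flavour of the relational view, a small sketch (our own simplification, with invented node names, not the paper's axiomatisation): a tree given as two binary relations, with dominance and precedence derived purely by relation composition and transitive closure.

```python
def compose(r, s):
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def closure(r):
    """Transitive closure by iterated composition."""
    result = set(r)
    while True:
        extra = compose(result, r) - result
        if not extra:
            return result
        result |= extra

V = {("S", "NP"), ("S", "VP"), ("VP", "Vb"), ("VP", "OBJ")}  # parent-of
H = {("NP", "VP"), ("Vb", "OBJ")}                            # next-sister

nodes = {n for pair in V | H for n in pair}
dominance = closure(V)
dom_eq = dominance | {(n, n) for n in nodes}   # dominates-or-equals

# Derived precedence (the DP form): x precedes y iff sisters a before b
# dominate-or-equal x and y respectively.
precedence = {(x, y) for (a, b) in closure(H)
              for (a2, x) in dom_eq if a2 == a
              for (b2, y) in dom_eq if b2 == b}
print(("NP", "OBJ") in precedence)   # True
```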
        
The Ph.D. research to be carried out attempts to automatically extract the side-effect profile of a drug from the medical literature. First, a profile that is as complete as possible is compiled; in addition, developments over time will be tracked. A first start has been made by treating medical texts as a corpus of individual words, from which subcorpora can be isolated. The results of some comparisons between subcorpora will be presented.
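The word-level comparison described here might, in a minimal sketch with invented texts, look as follows: relative word frequencies in a drug-specific subcorpus are compared against a background corpus to surface over-represented candidate terms.

```python
from collections import Counter

def rel_freq(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

subcorpus = ["patients reported nausea and headache after drug X",
             "mild nausea was observed in the drug X group"]
background = ["the trial enrolled patients in three hospitals",
              "the drug was administered daily"]

sub, bg = rel_freq(subcorpus), rel_freq(background)
# Rank words by how over-represented they are in the subcorpus
# (a small floor stands in for unseen background words).
ratios = sorted(((w, f / bg.get(w, 1e-4)) for w, f in sub.items()),
                key=lambda x: -x[1])
print(ratios[:3])   # candidate side-effect terms, e.g. 'nausea'
```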
        
However, reducing text to individual words loses much information. Other methods for discovering fixed structures in the text will therefore be applied; we are considering, among other things, collocations, concept extraction and part-of-speech tagging.
        
The extraction of side effects is the basis for two lines of research. The first line builds on the results: side effects can be used to find new applications for existing drugs. The second line builds on the techniques, which may be used to draw up a risk profile of a drug. Computational-linguistic analysis of the medical literature could signal certain tendencies earlier than is the case in current practice.
   
For more information:
clin96@ipo.tue.nl
CLIN 96 was sponsored by:

IPO, Center for Research on User-System Interaction