Friday, November 15th, 1996
IPO, Center for Research on User-System Interaction, Eindhoven
We are pleased to say that the seventh CLIN meeting was a success. We would like to thank everyone who attended the meeting for their contributions, which helped make it a fruitful and successful conference. The meeting was held last year and was organised by IPO in Eindhoven. IPO was then the Institute for Perception Research, but has recently changed its name to IPO, Center for Research on User-System Interaction.
A compilation of a selection of the papers presented at CLIN 1996 will be made. It will be issued at the next CLIN meeting in 1997.
At CLIN meetings, computational linguistics researchers in the Netherlands and Dutch-speaking Belgium gather and present their research. The meeting is also open to international participation. The default language of the conference is continental English. However, presentations with a Dutch title and abstract may be held in Dutch.
In 1996, Stephen Pulman of SRI International in Cambridge and the University of Cambridge Computer Laboratory presented a keynote lecture on Conversational Games, Belief Revision, and Commitment.
The local organisers,
This talk discusses the roles of these three concepts in some recent approaches to dialogue and tries to sketch a hybrid rule-based+statistical framework on which practical implementations could be based.
The data-oriented approach to language processing assumes that
previous language experiences (rather than abstract linguistic rules)
form the basis for language perception and production. A Data-Oriented
Processing model (DOP model) therefore maintains a large corpus of
linguistic representations of previously occurring utterances. By
combining fragments from this corpus, representations for new
sentences can be generated. The frequencies of these fragments are
used to estimate the most probable representation of a given
utterance.
A DOP model can be defined for almost every theory of
linguistic representation or utterance analysis. The original DOP
model corresponds to a theory in which the linguistic representation
of an utterance is given by a phrase structure tree. New
representations are produced by combining subtrees of previous
representations. In this talk, we show how a DOP model can be
developed for the more articulated representations provided by Lexical
Functional Grammar (LFG). On this theory of representation, the
analysis of every utterance consists of a constituent-structure (a
phrase structure tree), a functional-structure (an attribute-value
matrix), and a correspondence function that maps between them. We will
show how the definitions for fragments and combination-operations of
original DOP can be straightforwardly extended to a DOP model based on
LFG representations. However, the original DOP probability
calculations do not properly apply to LFG's nonmonotonic constraints
on valid fragment combinations. We propose a new probability model
that does generalize appropriately to the case of nonmonotonic
conditions, and describe how this model applies to LFG representations.
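The frequency-based estimate of the original tree-based DOP model can be sketched as follows. This is a minimal illustration only. It assumes the common formulation in which a fragment's probability is its corpus frequency divided by the total frequency of fragments with the same root category, a derivation's probability is the product of its fragments' probabilities, and the probability of a representation is the sum over its derivations. The fragment strings and counts are invented, and the sketch covers neither the LFG extension nor the new probability model proposed in the talk.

```python
from collections import defaultdict
from math import prod

# Hypothetical fragment counts: (root category, fragment) -> corpus frequency.
FRAGMENT_COUNTS = {
    ("S", "S(NP VP)"): 6,
    ("S", "S(NP(she) VP)"): 2,
    ("NP", "NP(she)"): 3,
    ("NP", "NP(the dog)"): 1,
    ("VP", "VP(sleeps)"): 4,
}

def fragment_probabilities(counts):
    """Relative frequency of each fragment among fragments with the same root."""
    totals = defaultdict(int)
    for (root, _), n in counts.items():
        totals[root] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}

def derivation_probability(derivation, probs):
    """Product of the probabilities of the fragments used in one derivation."""
    return prod(probs[fragment] for fragment in derivation)

probs = fragment_probabilities(FRAGMENT_COUNTS)
# Two derivations that build the same tree from different fragments.
d1 = [("S", "S(NP VP)"), ("NP", "NP(she)"), ("VP", "VP(sleeps)")]
d2 = [("S", "S(NP(she) VP)"), ("VP", "VP(sleeps)")]
# The probability of the tree is the sum over its derivations.
parse_probability = derivation_probability(d1, probs) + derivation_probability(d2, probs)
```

Because the same tree can usually be built from many different fragment combinations, the sum over derivations is what makes finding the most probable representation more involved than finding the most probable derivation.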
Word phonemisation, the task of converting a word to its phonemic transcription (with word stress), is hard for two reasons. First, it involves a large amount of language-dependent knowledge that is hard to acquire by handcrafting; however, this task may be alleviated by using inductive-learning algorithms to automatically induce the knowledge needed. Second, the task represents a non-linear classification task which is hard to (learn to) represent in a single-process system. Designing a modular system in which the task is solved in more than one step appears to be a good heuristic. However, modularisation may induce unwanted effects in performance: e.g., (i) many proposed orderings and separations of subtasks ignore relevant dependencies between subtasks; (ii) modular systems are relatively sensitive to cascading `snowball' errors. The paper provides empirical performance data obtained by systematically varying and optimising the number and the ordering of modules in a word phonemisation system. Individual modules are automatically induced on the basis of a large lexical database of English, by symbolic (IGTree, IB1) or connectionist (Back-propagation) inductive learning algorithms. The results indicate that both the number and the ordering of modules considerably affect generalisation performance. The results offer insight into subtask dependencies in morpho-phonology and, as a spin-off, provide indications for building accurate word phonemisation systems.
Recent work in HPSG has (1) emphasized the role of argument
structure (ARG-ST) as a level of representation independent of
valency, (2) demonstrated that (recursive) constraints on lexical
entries may lead to accounts of unbounded dependency constructions and
quantifier scope that are superior to previous proposals, and (3) used
lexical rules to describe processes that were previously considered to
be primarily syntactic in nature (extraction, selection of adjuncts,
and cliticization).
In this talk I want to argue that a combination of (1) and (2)
makes (3) superfluous. In particular, lexical rules for complement
extraction, introduction of adjuncts on COMPS, and cliticization can be
replaced by monotonic constraints on lexical entries which define the
relationship between subcategorization and argument structure, between
argument structure and valency (including SLASH), and between argument
structure and phonology.
The advantages of this approach are that the `canonical' mapping
between argument structure and valency holds not just for basic lexical
entries, but for all lexical entries, that the various forms of an
entry previously derived by means of lexical rules can be seen as
monotonic instantiations of a single basic entry, and that many of the
problems associated with the use of lexical rules (order sensitivity,
default relations between input and output, spurious derivations,
recomputation of values on the output) disappear.
Wh-chains are known to be restricted by both local and global
requirements. In order to parse wh-relations efficiently, we have to
account for the interaction of the local and the global licensing
mechanisms.
The parser Delilah, which handles Dutch categorially and
context-sensitively, is equipped with a deterministic and incremental
device for selecting wh-chains. By this device - a finite state network
- the number of possible operator-variable relations which the parser
has to check is reduced as much as possible in an incremental and
deterministic fashion. The network operates under the assumption of
massive lexical ambiguity with respect to the local licensing of
variables. It is fed by knowledge of local and global conditions on
wh-chains.
For non-coordinated sentences the device may divide the number of
possible operator-variable chains by twenty, and for coordinated
sentences, by six - allowing in the latter case for across-the-board
applications. We will present a detailed account of the grammatical and
operational aspects of the network, and some figures as to its effect.
Syntactic analysis can be seen as a cascade of classification
problems of two types: segmentation (constituent boundary detection)
and disambiguation (morphosyntactic disambiguation, constituent
labeling, and attachment decisions). By rephrasing syntactic analysis
as a series of instances of a classification problem, machine learning
techniques such as decision tree learning and memory-based learning
become applicable. When using annotated example corpora (treebanks) as
learning material, these machine learning techniques can generalize the
knowledge implicit in the annotations to unseen text. Obvious
advantages of this approach include automatic learning (alleviating
knowledge acquisition bottlenecks) and robustness due to the
statistical nature of the learning algorithms.
In previous work, we have applied memory-based learning techniques
to segmentation and disambiguation problems in phonology
(syllabification, stress assignment, grapheme disambiguation),
morphology (analysis and synthesis), and morphosyntax (morphosyntactic
disambiguation, i.e. part of speech tagging). In this paper we show
that a benchmark phrase attachment problem (PP-attachment) can be
learned using memory-based learning techniques. Advantages of the
approach over existing stochastic techniques include (i) smooth automatic
integration of knowledge sources and (ii) non-parametricity (no parameter
estimation needed). We also discuss the impact on generalization
accuracy of different similarity metrics in the memory-based learning
algorithm and of different input representations.
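Memory-based classification of an attachment decision can be sketched as a nearest-neighbour lookup. The sketch below is an assumption-laden illustration, not the system described in the talk: it assumes instances are (verb, noun, preposition, noun) tuples, uses a simple overlap metric, and relies on a handful of invented training examples. The optional `weights` argument is a hypothetical hook showing where a different similarity metric (for instance, feature weighting) could be plugged in.

```python
from collections import Counter

# Hypothetical training instances: (verb, noun1, preposition, noun2) -> attachment.
# "V" = the PP attaches to the verb, "N" = the PP attaches to the noun.
MEMORY = [
    (("eat", "pizza", "with", "fork"), "V"),
    (("eat", "pizza", "with", "anchovies"), "N"),
    (("see", "man", "with", "telescope"), "V"),
    (("buy", "book", "about", "history"), "N"),
    (("read", "book", "about", "history"), "N"),
]

def overlap_distance(a, b, weights=None):
    """Weighted count of feature positions on which two instances differ."""
    weights = weights or [1.0] * len(a)
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

def classify(instance, memory, weights=None):
    """Return the majority class among the nearest stored instances."""
    distances = [(overlap_distance(instance, feats, weights), label)
                 for feats, label in memory]
    nearest = min(d for d, _ in distances)
    votes = Counter(label for d, label in distances if d == nearest)
    return votes.most_common(1)[0][0]

prediction = classify(("eat", "salad", "with", "fork"), MEMORY)
```

No parameters are estimated: all training instances are simply stored, and generalisation comes from the similarity metric, which is why the choice of metric and of input representation matters for accuracy.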
The Interface between Text Structure and Linguistic Description within
the ALEP platform.
Input to the ALEP system is automatically converted into SGML-marked
text, which will be the input to the linguistic processing. For
the analysis of those tagged texts, some tsls rules (Text Structure to
Linguistic Structure) have to be defined. So, if an item is tagged as a
word (tag `W'), an obligatory tsls rule should define which kind of
linguistic object (described in the grammar) will apply to this item.
This allows a substantial modularization of the grammar, specifying
which kind of linguistic rules will apply.
Besides the SGML tags, we used the system-defined tag `USR' in
order to deal with fixed phrases and `messy details'. User-defined
(multiple-)word recognizers have been integrated into the text-handling
component of ALEP. The tagged output of these programs gives the input
for the tsls rules. We described generic lexicon entries (i.e. `dates',
etc.) corresponding to the `USR'-tagged expressions. With this
technique, the running time of the parser has been significantly
improved and the coverage of the grammar considerably extended.
The last step of our work has consisted in the extension of the
set of tags defined within the ALEP system. So, for example, a tag
`CAT' has been added. This allows us to integrate information delivered
by a Part of Speech tagger. We extracted the PoS information and
`lifted' it to the linguistic description via the tsls rules. This
again leads to a very substantial improvement in terms of efficiency of
the parser and of coverage of the grammar.
And also a more theoretical question arises: can this strategy
provide a practicable way for combining corpus-based and
knowledge-based approaches to NLP? In any case, we will have to
consider the reorganization of the (unification-based) grammar
description with respect to the possibility of extracting
morpho-syntactical information from PoS taggers.
In GPSG and HPSG the distinction between elements with and
without phrasal projection is drawn in terms of speech parts, cf. the
major V, N, A, P vs. the minor Comp, Conj, etc. Contrary to this
practice I claim that the major/minor distinction had better be treated
as orthogonal to the speech part classification.
To substantiate this claim I will show that the distinction
between full and reduced personal pronouns in Dutch (jij/je, zij/ze,
...) is an instance of the major/minor dichotomy. Next, I will spell
out an HPSG style sort hierarchy for the description of minor signs
and explore their syntactic peculiarities, i.e. the impossibility to
be used as heads, fillers or conjuncts, and the deviance from the
LP constraints which hold for their major counterparts. Criteria
will be provided for identifying minor signs in other speech parts
and in other languages.
Since the minor elements behave differently from the major ones,
both in terms of constituency and linear order, the distinction had
better be made explicit in the grammar. This argues against the GB
policy to assign phrasal projections to all lexical elements (and
to many affixes), as well as against a trend in HPSG to treat all
lexical signs (incl. the complementizers) as heads.
There was a time when this would have been needless to say, but
times have changed. Groenendijk & Stokhof define dynamic semantics as
follows:
A semantics is dynamic if and only if its notion of conjunction is
dynamic, and hence non-commutative.
In this paper I argue that dynamic semantics, thus understood, is a
rather bad idea. Dynamic semantics is an admittedly elegant but
nonetheless misguided implementation of an essentially pragmatic
principle. It is an obvious and even important truth that utterances
are processed incrementally. The central tenet of dynamic semantics is
that, to some extent at least, this processing strategy is encoded in
the lexical entries of certain words, and especially in the lexical
meaning of 'and'. Thus formulated, it will be plain that the very
notion of a dynamic semantics is quite implausible. But apart from its
lack of plausibility, it gives rise to all sorts of strange quandaries.
Consider, for example, a young child learning the meaning of 'and'. Are
we to suppose that he learns it in two steps? The truth-conditional
part first, perhaps, and the dynamic part afterwards - or would it be
the other way round? Would it be possible for a child to get the
truth-conditional import of 'and' right but founder on its dynamic
aspects? Clearly, such questions are absurd: the lexical meaning of
'and' isn't dynamic.
In my talk I will first elaborate on this point and then turn to
proposals for giving dynamic interpretations to negation and
disjunction as well. I will argue that these, too, are ill-founded
empirically as well as conceptually.
It is generally accepted nowadays that the scarcity of lexical
resources in NLP necessitates a kind of reusability. At least two
approaches to reusability can be distinguished, resulting in different
domains of what is reused. In one approach the lexicon is a purely
declarative knowledge base, containing all information to be used by
NLP-systems. Reusable information includes what is encoded in
features. System-specific information includes all procedural
knowledge. In the other approach, reusable information is everything
that is necessary for the mapping between text words and lexemes in
the dictionary. This includes both declarative and procedural
knowledge on morphology. In this approach system-specific information
encompasses syntax and semantics.
A typical example of the first approach is DATR. The second
approach is not represented adequately by two-level morphology, which
lacks the notion of lexeme. A better representative is Word Manager, a
system developed in Basel. I will argue that this approach to
reusability has a number of important advantages compared to the one
represented by DATR.
Translation idioms and structural divergencies between languages
are classical problems for machine translation. This holds in
particular for compositional approaches, which require a
translation-equivalence between basic expressions and between grammar
rules of source-language and target-language grammar. One way to
attack these problems, pursued in the Rosetta system, is to make use of
grammar rules that can perform syntactically powerful operations,
enabling a distinction between surface structure and compositional
derivation structure.
In this talk I present a formal basis for an alternative approach
in which the individual grammars can be relatively simple (e.g.
context-free or DCG), but where the translation relation between the
grammars is more complex. Translation-equivalence is now defined as a
relation between combinations of rules and basic expressions, so-called
polynomials. Special attention is paid to the issue of completeness,
i.e. to the conditions under which this translation method guarantees
to yield at least one translation for each analysis of all
source-language expressions.
In this talk I will give an overview of the GoalGetter system. This
system generates spoken summaries of football matches on the basis of
concise teletext information. The system consists of a language
generation component and a speech output generation component. The
language generation component will be discussed in more detail in the
presentation by Mariet Theune.
The focus of this presentation will be on the speech output module.
Speech output can be realised by either diphone synthesis or phrase
concatenation. With diphone synthesis one can generate an unlimited
set of sentences. Phrase concatenation is used in applications where
the set of sentences is limited. Entire words and phrases are recorded
and can be strung together to construct the spoken texts without any
manipulations on the original recordings. Our approach to phrase
concatenation is special in that we record variable words, like team
names and player names, in several prosodic contexts. Dependent on the
place where the variable is to be inserted in a carrier sentence and
information about accenting and phrasing, the right prosodic variant is
selected.
For application in connection with databases and in particular
information systems like library systems, we shall analyze
a few prototypical natural language queries. The query analysis
recommended here is essentially automated and uses logic programming
as a tool for analysis of natural language semantics, and it
involves modelling the information content by means of
a logical representation. It comprises the extensive application
of induction using some homemade inductive meta systems
that perform automated program synthesis through, as an intermediate
step, some dataflow analysis resulting in the construction of
some so-called dataflow structures (cf. Understanding &
Logic Prog.2-3). The resulting synthesized
programs are logic grammars, more precisely definite clause
grammars (DCG). The method seems very promising.
As an illustration, we intend to examine a simple and prototypical
query to a library information system
Complement clitics in Modern Greek NPs exhibit an idiosyncratic type of climbing: they can attach to the noun head (1), prenominal adjectives (2), and a small set of left periphery elements (3). Though such clitics were taken to be affixes in previous approaches (e.g. Stavrou and Horrocks 1990), they fail several of the diagnostics that have been proposed to characterize Pronominal Affixes and distinguish them from Postlexical Clitics (see e.g. Miller 1992). Moreover, an account of their positioning in terms of Argument Composition (Hinrichs and Nakazawa 1990, 1994; Miller and Sag 1996) would encounter serious difficulties, including the contrast in (4), which indicates that an adjective with a complement of its own cannot ``attract'' the noun head's clitic complement. I provide an account of clitic climbing in MG NPs in terms of Domain Union (Reape 1994, Kathol 1995) that employs a notion of Attachment in the sense of Dowty (to appear) and Gunji (to appear). This approach can be straightforwardly extended to account for definite articles and NP-internal demonstratives, which, like clitics, cannot stand on their own but require an appropriate host to attach to.
1. to kenurio vivlio mu-CL (lit.: the new book my)
2. to kenurio tu-CL vivlio (lit.: the new his book)
3a. ola tus-CL ta vivlia (lit.: all their the books)
3b. afto su-CL to vivlio (lit.: this your the book)
4a. i [anagnorismeni [apo olus]] iperohi tu-CL (lit.: the acknowledged by all superiority of-his)
4b. * i [anagnorismeni tu [apo olus]] iperohi
We will describe research on the treatment of Dutch compounds in the UPLIFT information retrieval project. Results of earlier experiments in the UPLIFT project indicated that splitting up compounds in the query and generating new compounds by simply combining query terms both improved retrieval performance. We subsequently experimented with adding constraints to the compound splitting and generation algorithms in order to restrict both processes and minimize over-generation. We experimented with using information about head-modifier relationships and corpus frequency information to formulate constraints. So far, we have not been able to improve on our initial strategy but the results of initial experiments have provided us with some important clues for further experimentation.
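Lexicon-based compound splitting, and the over-generation it can cause, can be illustrated with a small sketch. It is not the UPLIFT algorithm: the lexicon, the minimum part length, and the treatment of the linking element "s" are all assumptions made for this example.

```python
# Hypothetical lexicon of known Dutch words.
LEXICON = {"fiets", "band", "fabriek", "stad", "bestuur", "fietsband"}
LINKERS = ("", "s")  # optional linking element between parts (an assumption)

def split_compound(word, lexicon=LEXICON, min_len=3):
    """Return all ways to split `word` into two or more lexicon words."""
    results = []

    def extend(rest, parts):
        # A complete split is found when the remainder is itself a known word.
        if rest in lexicon and parts:
            results.append(parts + [rest])
        for i in range(min_len, len(rest) - min_len + 1):
            head = rest[:i]
            if head not in lexicon:
                continue
            tail = rest[i:]
            for linker in LINKERS:
                if tail.startswith(linker) and len(tail) - len(linker) >= min_len:
                    extend(tail[len(linker):], parts + [head])

    extend(word, [])
    return results

splits = split_compound("fietsbandfabriek")
```

Here `splits` contains both a three-part and a two-part analysis of the same word. Constraints of the kind mentioned above, such as head-modifier information or corpus frequencies, would be one way to choose between or prune such alternatives.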
Rob van der Sandt's theory of `presuppositions as anaphors' is widely considered to be the empirically most adequate theory of presupposition projection on the market. In this talk, two weaknesses of Van der Sandt's theory are pointed out and remedied. The first weakness is the fact that a central notion of the theory, namely that of a `partial match', is not defined in a sufficiently precise way. The second weakness, in our opinion, is the fact that the theory takes only one kind of anaphora into account, in which anaphor and antecedent must always corefer. Both weaknesses are remedied in an updated version of the `presuppositions as anaphors' theory that we claim to be both more precise and more general than its predecessor.
Research on building an automatic processing system for texts in Turkic languages shows that morphonological regularities must be identified and taken into account. The morphonological changes observed in the formal computer processing of Turkic texts can be grouped as follows:
Several string operations are introduced as models of the coordination phenomenon in natural languages. Their relationships with other string operations are investigated, which yields closure properties for the families in the Chomsky hierarchy. In particular, CF is not closed under these operations. However, if coordination is defined only between strings with a common syntactic structure (both strings have derivations described by identical trees, modulo the coordinated subwords), then coordination preserves context-freeness. The extension of this tree-based coordination operation to TAGs is also discussed.
A definition of the notion of answerhood is formalised using a proof system, i.e., Constructive Type Theory. The definition, which was proposed in the mid-eighties by Jeroen Groenendijk and Martin Stokhof, makes use of two concepts which, in the past fifteen years, have become central to the trade of formal semantics: context change and context-dependence. The formalisation using CTT is proposed as an alternative for Groenendijk and Stokhof's original formalisation in possible-world semantics. It is demonstrated that CTT, and in particular the fact that CTT is a proof system, enables a more fine-grained analysis which can be turned into a computational model. Furthermore, we contend that our formalisation of the definition of answerhood is a natural generalisation of definitions of answerhood which are phrased in terms of unification of the question and the answer.
In this paper we develop an HPSG analysis of certain (so far unnoticed) syntactic phenomena connected to verbal negation in Polish. First of all, we show that -- contrary to the received wisdom -- verbal negation is a morphological (rather than syntactic) process and we model this observation via lexical rules. Then we move to the so-called long distance negative concord, i.e., the requirement that the verb has to be negated if any of its arguments is or contains a negative pronoun. We show that this is essentially a UDC, as this `negation requirement' can cross an arbitrary number of NP and PP boundaries. (VPs seem to be islands.) Since this `negation requirement' is discharged lexically (by negated verbs) and because of some intriguing lexical exceptions, we adapt the lexical approach to UDCs of Sag (1995) and Sag (1996). Finally, we investigate the interesting behaviour of negative concord and of the genitive of negation in the context of verb clusters, and show that this behaviour can be accounted for if arguments of the lower verbs are assumed to be raised to the nearest negated verb (if any), a la Hinrichs and Nakazawa (1989), and if case assignment and `negation percolation' are made sensitive to whether the argument has been realized from the given argument structure or raised to higher verbs. In the latter we follow the non-configurational case assignment approach of Przepiorkowski (1996).
In my talk, I will give a corpus-based analysis of information update in information dialogues. The corpus used consists of 111 naturally occurring telephone conversations recorded at the information service of Schiphol Airport. The information update will be described theoretically by extending the dynamic interpretation theory (DIT) of Bunt (Bunt 1995) with the information packaging notions "topic", "tail", and "focus" (Rats 1996, Vallduvi 1990). The file change semantics of Heim will be used to show how the information update can be formalized. Examples and tables from the corpus will show how the information update is realized linguistically.
In this talk I want to present (1) a summary and the main conclusions of my Ph.D. thesis on the automated syntactic and semantic analysis of nominal compounds in a technical domain, (2) experiences concerning the practical applicability and the potential business opportunities of speech and language technologies from the perspective of a large IT-supplier: Getronics Software
The ANNO-project (An annotated public database for written Dutch;
Flemish short-term programme for speech and language technology)
intends to initiate the creation of a large database for the variant of
Dutch used in Flanders, as there is no corpus of reasonable size
available for Flemish Dutch.
BRTN-Dutch being considered to reflect the national standard,
the corpus consists of news bulletins and issues of the current
affairs programme Actueel (both BRTN-radio). Besides written
texts intended to be spoken, these contain transcriptions of
interviews.
In this talk we want to report on the choice of the material and
the consequences this had, the types of annotation we used for
the whole corpus or just part of it, the way annotating was done
((semi-)automatically or by hand), and why it was done that way,
as well as on our future plans.
In this talk I will discuss some aspects of the language generation
component of the GoalGetter system. This system generates spoken
summaries of football matches, based on teletext information.
The focus of the talk will be the accentuation of referring
expressions in GoalGetter. Referring expressions play an important role
in the football reports we generate, since we constantly have to refer
to players and teams. First, I will briefly explain how the system
generates different referring expressions depending on the context.
Then I will discuss the accentuation rules we currently use:
expressions referring to a 'new' object receive an accent, whereas
expressions referring to a 'given' object do not. This approach is in
line with many accentuation theories. However, it does not always give
the correct result. I will argue that we need to add some notion of
contrastive accent to our accentuation rules. A problem here is that
the few existing contrast theories do not seem to be applicable to the
football domain.
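The new/given accentuation rule described above can be sketched in a few lines. This is an illustration under a simplifying assumption: a referent counts as 'given' as soon as it has been mentioned earlier in the same report. The function name and the example referents are invented, and contrastive accent is deliberately not modelled.

```python
def assign_accents(referents):
    """Mark each referring expression as accented iff its referent is new."""
    mentioned = set()
    result = []
    for referent in referents:
        accented = referent not in mentioned  # 'new' -> accent, 'given' -> none
        result.append((referent, accented))
        mentioned.add(referent)
    return result

accents = assign_accents(["Ajax", "PSV", "Ajax", "Feyenoord"])
```

In this example the second mention of "Ajax" stays unaccented. A contrastive context, in which a given referent should nonetheless carry an accent, is exactly the kind of case this simple rule gets wrong.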
Linguistics and Computer Science make extensive use of tree
structures. We present here a formalisation of trees (in fact, of
forests) within the algebraic theory of binary relations (Del Vigna
& Courrége, 1994) and we show how the relational framework
also expresses the theory of command relations used in Generative
Grammar (Del Vigna, 1996). In fact, this may be applied to various
configurations in trees. The expressiveness, the simplicity and the
elegance of relational algebra are widely recognized, particularly in
the relational database model. Moreover, as an algebra, it allows blind
calculation and proofs based on rewriting. These qualities still
hold with syntagmatic structures and, in other respects, the relational
approach provides a unifying frame for several definitions of trees
which occur in the literature.
First, we introduce forests on a finite set N. Then, we define a
gridded forest as a pair (V,H) of forests on N. The definition is
symmetric, i.e. the pair (H,V) is also a gridded forest on N. Four
derived forms of gridded forests are presented: primitive, which
corresponds to oriented and ordered trees (Aho & Ullman, 1972),
functional, which corresponds to the data structure for binary trees
used in programming, DP, which corresponds to the pair (dominance,
precedence) in (Partee, Ter Meulen & Wall, 1990) and, finally,
total. Algebraic formulae permit transition from any form to another
and constitute a basic and useful formal toolbox. Finally, we present
the axioms, all expressed in relational algebra, which characterize,
for a given forest, the set of its command relations.
The planned doctoral research attempts to automatically extract the
side-effect profile of a drug from the medical literature. First, a
profile that is as complete as possible will be compiled. In addition,
developments over time will be tracked. A first start has been made by
treating medical texts as a corpus of individual words. Subcorpora can
be isolated from this corpus. The results of several comparisons
between subcorpora will be presented.
Reducing text to individual words, however, loses a great deal of
information. Other methods for discovering fixed structures in the
text will be applied, including collocations, concept extraction and
part-of-speech tagging.
The extraction of side effects is the basis for two lines of
research. The first line builds on the results: side effects can be
used to find new applications for existing drugs. The second line
builds on the techniques, which may also be usable for compiling a
risk profile of a drug.
Computational-linguistic analysis of medical literature could
signal certain trends earlier than is the case in current practice.
For more information:
clin96@ipo.tue.nl
CLIN 96 was sponsored by:
IPO, Center for Research on User-System Interaction