Invited Talk

Constraint-based Sentence Compression: An Integer Programming Approach
Mirella Lapata, University of Edinburgh

In this talk we introduce the sentence compression task, which can be viewed as producing a summary of a single sentence. An ideal compression algorithm should produce a shorter version of an original sentence that retains the most important information while remaining grammatical. The task has an immediate impact on several applications ranging from document summarisation to audio scanning devices for the blind and caption generation.

Previous approaches have primarily relied on parallel corpora to determine what is important in a sentence. These include data-intensive methods inspired by machine translation, using the noisy-channel model, and by parsing, treating compression as a series of tree rewriting operations. Our work views sentence compression as an optimisation problem. We develop an integer programming formulation and infer globally optimal compressions in the face of linguistically motivated constraints. We show that such a formulation allows for relatively simple and knowledge-lean compression models that do not require parallel corpora or large-scale resources. The proposed approach yields results comparable and in some cases superior to the state of the art.
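To illustrate the shape of such an optimisation problem, here is a minimal sketch, not the formulation from the talk. Each word gets a binary keep/drop decision, the objective is the sum of the scores of the kept words, and constraints rule out ungrammatical combinations. The word scores, the two constraints and the function name `compress` are all assumptions made for illustration, and exhaustive search stands in for a real integer programming solver.

```python
from itertools import product

def compress(words, scores, min_len, constraints):
    """Pick the keep/drop assignment with the highest total score.

    words: list of tokens; scores: one (hypothetical) importance score per token.
    constraints: functions taking the 0/1 assignment and returning True if it is allowed.
    Exhaustive search is only feasible for short sentences; an ILP solver
    would explore the same space far more efficiently.
    """
    best, best_score = None, float("-inf")
    for keep in product((0, 1), repeat=len(words)):
        if sum(keep) < min_len:
            continue  # compression would be too short
        if not all(c(keep) for c in constraints):
            continue  # violates a linguistic constraint
        total = sum(s for s, k in zip(scores, keep) if k)
        if total > best_score:
            best, best_score = keep, total
    if best is None:
        return None
    return [w for w, k in zip(words, best) if k]

words = ["the", "old", "man", "slept", "soundly"]
scores = [0.1, -0.5, 1.0, 1.0, -0.2]  # toy values, not learned
constraints = [
    lambda k: k[1] <= k[2],  # a modifier may be kept only if its head noun is kept
    lambda k: k[0] <= k[2],  # a determiner may be kept only if its noun is kept
]
print(compress(words, scores, min_len=2, constraints=constraints))
```

With these toy scores the sketch keeps "the man slept" and drops the modifier and the adverb.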

Accepted oral presentations (61)

Aligning complex translational correspondences via bootstrapping
Lieve Macken (Ghent University College) and Walter Daelemans (University of Antwerp)

In this talk we describe our sub-sentential alignment system that aligns translation units below sentence level. The sub-sentential alignment system is conceived as a cascade model consisting of two phases. The objective of the first phase is to link anchor chunks, i.e. chunks that can be linked with a very high precision. Anchor chunks are linked on the basis of lexical correspondences and syntactic similarity.

In the second phase, we try to align more complex translational correspondences via bootstrapping. The anchor chunks of the first phase are used to limit the search space in the second phase. We start from a sentence-aligned parallel corpus in which anchor chunks have been aligned on the basis of lexical clues and syntactic similarity. In a first step, candidate rules are extracted from all sentence pairs that only contain 1:1, 1:n and n:1 unlinked chunks. In a second step, the rules are applied on the whole training corpus, resulting in new sentence pairs containing 1:1, 1:n and n:1 unlinked chunks. The bootstrapping process is repeated several times. We present the first results of the bootstrapping approach and compare the results with the baseline system that aligns only anchor chunks.

Aligning syntactic divergences through parse tree transformation
Tom Vanallemeersch (University of Leuven)

Automated alignment of the parse tree of a source sentence with the parse tree of its translation is interesting for many purposes, such as development of machine translation systems, translation studies and bilingual term extraction. However, the source and target tree may use distinct formalisms and are often not isomorphic due to divergences that are caused by language-dependent syntactic structures and preferences, by paraphrases, by omissions, etc. In the area of statistical machine translation, these divergences have led developers to parse the input sentences of the parallel training corpus and transform the trees in order to make them more similar to the structure of the target sentences. We apply a similar approach for tree alignment purposes by parsing both the source and target sentence, applying transformations to the trees, comparing the structure of the transformed trees and matching their terminal nodes through a bilingual lexicon. The transformations aim at normalizing specific syntactic structures, obtaining more similar trees and making the alignment of syntactic divergences more straightforward. Our study focuses on the language pair French-Dutch.

Annotation of Temporal Information in French Texts
André Bittar (Université Paris 7, France)

An important part of natural language text comprehension is the understanding of temporal information. The annotation language TimeML was created for the marking up of temporal information to allow for the interpretation of the temporal structure of texts, including identification of events and temporal expressions and temporal relations between these entities. The annotation schema has since been adopted by the ISO and work has been carried out towards establishing a temporal annotation standard as well as providing language-specific guidelines.

This presentation focuses on our ongoing work to create a set of coherent resources for the temporal annotation of French texts according to the TimeML standard. We present an annotation guide, based on both corpus studies and linguistic theory, which provides annotation guidelines specific to French as well as proposals for extensions to the TimeML schema, in particular concerning the event ontology (we propose 4 new event classes), and modal and aspectual expressions, which require special attention in Romance languages such as French. The guide also contains instructions for the annotation of event nominals, which are not detailed in the general TimeML guidelines. We also present modules for the automatic annotation of events and temporal expressions in texts in accordance with our proposed guidelines, and a Gold Standard hand-annotated corpus, used to evaluate the performance of these modules. We present the results of a preliminary evaluation, which are comparable to those of a system for English.

Applying the LD to Catalan dialects: a brief comparison of two dialectometric approaches
Esteve Valls (University of Barcelona) and John Nerbonne (Alfa informatica, University of Groningen)

During the last decade, several dialectometric approaches have been applied to Catalan dialects with the aim of reviewing the traditional classification based on bundles of isoglosses. This has been possible thanks to the systematization of the Corpus Oral Dialectal (COD) of the University of Barcelona, a corpus of contemporary Catalan containing more than 660,000 phonetic and morphological items gathered from the 86 county towns of the whole Catalan-speaking area. In recent papers, linguistic distance between dialects has been calculated by means of different similarity/dissimilarity measures, usually based on the number of coincidences/divergences relative to the total number of elements compared between two varieties. Furthermore, special attention has been paid to the prior linguistic analysis of the data, as one of the major goals of the project is to compare the underlying differences of the language, instead of working on the basis of the phonetic outputs.

Despite this growing interest in dialectometry, however, Levenshtein Distance (LD) is still virtually unknown among Catalan linguists, and has not yet been applied to Catalan. The purpose of this paper is, therefore, to show the first results of applying the LD to the COD data, and to compare the dialect groupings that arise from both approaches. Such a comparison will allow us to determine not only the importance of taking into account the structural (i.e. qualitative) differences in a quantitative approach, but also the influence of using two different techniques (LD and the one used at the University of Barcelona) to calculate the linguistic distance.

An Aspect Based Document Representation for Event Detection
Wim De Smet and Marie-Francine Moens (Dept. of Computer Science, K.U.Leuven)

We have studied several techniques for creating and comparing content representations, in the domain of event detection. Our goal is to cluster news stories that discuss the same event, where an event is a specific happening in time, covering certain topics and involving named entities such as persons and locations. We define a (textual) document as a collection of aspects, i.e. disjoint components that reveal latent and/or extracted information. Though general in concept, we let the aspects in this setting correspond to the constituent elements of an event (topics and named entities). As underlying representations of each aspect we consider the vector space model and the probabilistic topic model Latent Dirichlet Allocation. To compare two documents, we combine the comparison of the documents' aspects. Evaluating our techniques on an event detection task in Wikinews, we find that an aspect based representation improves clustering when compared to a document based representation. This finding is confirmed for both the vector space and probabilistic models. There are limits to this improvement, however: when splitting into more aspects, we are confronted with sparseness problems. To resolve this we introduce "importance factors", weights learned without supervision that assess when sparse aspects introduce noise, and when they contain valuable information.
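One way to picture the combination of per-aspect comparisons is the sketch below. It assumes the vector space representation, cosine similarity per aspect, and a weighted average in which the weights play the role of importance factors. The abstract does not state how the comparisons are combined, so the weighted average, the aspect names and the function names are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def aspect_similarity(doc_a, doc_b, weights):
    """Weighted average of per-aspect cosine similarities.

    doc_a / doc_b map an aspect name to a sparse term vector.
    weights maps an aspect name to a (hypothetical) importance factor.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    combined = sum(
        w * cosine(doc_a.get(aspect, {}), doc_b.get(aspect, {}))
        for aspect, w in weights.items()
    )
    return combined / total

story_1 = {"topics": {"election": 2.0, "vote": 1.0}, "entities": {"Brussels": 1.0}}
story_2 = {"topics": {"election": 1.0, "poll": 1.0}, "entities": {"Brussels": 1.0}}
print(aspect_similarity(story_1, story_2, {"topics": 0.7, "entities": 0.3}))
```

Lowering the weight of a sparse aspect reduces its influence on the combined score, which is the effect the importance factors are meant to achieve.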

Our methods for aspect detection, for learning the importance factors of the aspects, and for event clustering are completely unsupervised.

Automated Methods for Extending the Lexicons of Deep Grammars
Kostadin Cholakov (Rijksuniversiteit Groningen)

Recently, the integration of shallow techniques with deep parsing systems in many real-world NLP applications (MT, QA, RE, etc.) has clearly illustrated the continuing importance of Deep Linguistic Processing. However, modern deep parsing approaches still face various problems. In particular, one major issue is the low coverage of the handcrafted grammars, which are the core part of the deep parsing systems. Typical causes for this problem include missing lexical entries (the word is not listed in the lexicon of the grammar) or incomplete lexical entries (e.g. a verb is used with a subcategorization frame which is not listed in the lexicon). The manual extension of the lexicon and the grammar is costly and time-consuming, and we therefore present various machine learning methods for the automated acquisition of missing and incomplete lexical entries. In our experiments we use the Alpino dependency parser for Dutch and we build on the acquisition approaches presented in (van de Cruys, 2006). We have solved the issue of efficiently adapting Alpino's disambiguation model for the purposes of lexical acquisition and we have improved the classifier architecture. Furthermore, we plan to extend Alpino with the results obtained from the lexical acquisition and test its performance on real-world data. In this way, we will ensure the general applicability of the methods we propose for improving the robustness of lexicalised grammars.

Automatic filtering of parallel corpora for improving alignment accuracy
Gideon J. Kotzé (Alfa informatica, University of Groningen)

In this paper, we demonstrate a method of automatically filtering a parallel corpus in order to improve the quality of its alignment. We identified various parameters influencing accuracy, such as sentence length differences, number of sentences per alignment and misalignment of certain identifiable elements such as proper nouns. Every sentence alignment is automatically flagged as being either good or problematic. A manual inspection of the flagged output gives a measure of precision and recall, which in our case has shown satisfactory results so far. In a case study on the aligned English-Dutch Europarl corpus, 2.98% (39659 alignment units) have been tagged as problematic using our approach. Most of them are indeed incorrect: 89.96% according to a manual check on a small selection. Furthermore, 93.57% of the alignments tagged as good are indeed correct according to our study. This shows that the filtering method can improve precision with only minor drops in recall. More details will be given during the presentation. After discussing the results, we finally speculate on how to improve the method and how it could or should be adapted to different text types. The work was done in the context of the STEVIN project Parse and Corpus Based Machine Translation (PaCoMT).
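A minimal sketch of such a flagging step follows, covering only two of the parameters mentioned (sentence length difference and number of sentences per alignment). The thresholds `max_ratio` and `max_sents`, the whitespace tokenisation and the function name are assumptions for illustration and are not taken from the study.

```python
def flag_alignment(src_sents, tgt_sents, max_ratio=2.0, max_sents=2):
    """Flag one alignment unit as 'good' or 'problematic'.

    src_sents / tgt_sents: lists of sentences (strings) on each side.
    max_ratio: hypothetical limit on the token-length ratio of the two sides.
    max_sents: hypothetical limit on the number of sentences per side.
    """
    # An empty side means there is nothing to align against.
    if not src_sents or not tgt_sents:
        return "problematic"
    # Many-to-many units beyond the limit are treated as suspicious.
    if len(src_sents) > max_sents or len(tgt_sents) > max_sents:
        return "problematic"
    src_len = sum(len(s.split()) for s in src_sents)
    tgt_len = sum(len(s.split()) for s in tgt_sents)
    if min(src_len, tgt_len) == 0:
        return "problematic"
    # Large length differences suggest a misalignment.
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return "problematic" if ratio > max_ratio else "good"

print(flag_alignment(["The house is red ."], ["Het huis is rood ."]))
```

A check on identifiable elements such as proper nouns could be added as a further condition in the same function.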

Automatic Text Categorization: Adding Syntactic Knowledge
Tom Ruette (University of Leuven)

Linguistic knowledge is often neglected when representing the semantics of a text. I focus on the use of linguistic characteristics, in this case non-lexical features, for the representation of texts in vector space models.

My hypothesis for the categorization question assumed that the representation of texts benefits from non-lexical, in this case syntactic, knowledge (cf. Pado and Lapata). Linguists have found that Privileged Syntactic Arguments (Van Valin) mark a link between syntactic arguments - subjects and objects - and core semantic participants. As these participants carry the core meaning of a sentence, their syntactic characteristics can be used to create a specific representation. I tested three degrees of syntactic knowledge:

1 feature selection: selecting the head noun from subjects and direct objects: e.g. hond
2 feature representation:
2.1 feature selection, plus syntactic function: e.g. hond_su
2.2 feature selection, plus syntactic function and verbal dependency: e.g. hond_su_blaft
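The three feature variants listed above can be built from a (noun, syntactic function, verb) triple, as in the following sketch. The function name and the dictionary keys are illustrative; only the example strings come from the abstract.

```python
def make_features(noun, function, verb):
    """Build the three feature variants for one head noun.

    noun: head noun of a subject or direct object, e.g. 'hond'
    function: syntactic function label, e.g. 'su' for subject
    verb: the verb the noun depends on, e.g. 'blaft'
    """
    return {
        "1": noun,                           # feature selection only
        "2.1": f"{noun}_{function}",         # plus syntactic function
        "2.2": f"{noun}_{function}_{verb}",  # plus verbal dependency
    }

print(make_features("hond", "su", "blaft"))
```

Each added component makes the feature more specific, so the same noun is spread over more distinct features as one moves from (1) to (2.2).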

The Twente Nieuws Corpus provided the test material. About 6000 random articles from the corpus had to be categorized by the system. The newspaper section containing an article, i.e. binnenland (local news), buitenland (foreign affairs), etc., was used as the a priori category that the system had to find. Compared to a baseline configuration (standard bag-of-words model), the feature selection experiment (1) scored 12% higher. The (2.1) configuration also beats the baseline, by almost 12%. The (2.2) experiment, however, is about 4.5% below that baseline.

Automatically extending linguistically enriched data
Daphne Theijssen, Nelleke Oostdijk, Hans van Halteren (Radboud University Nijmegen)

There are many situations in which speakers can choose between two or more structural variants which are equally grammatical but may differ in their acceptability in the given context. In the current project, we explore the use of Bayesian Network Modelling (BNM) for the purpose of modelling such syntactic variability. At present, we investigate the dative alternation, where speakers and writers can choose between structures with a double object (e.g. She gave him the book.) or prepositional dative structure (e.g. She gave the book to him.). Employing the one-million-word syntactically annotated ICE-GB Corpus, we were able to extract 790 relevant instances. The data set proves too small to allow drawing conclusions about the suitability of BNM for modelling syntactic variability.

To solve this (very common) data sparseness problem, we developed an approach to automatically extend our data, employing large corpora without syntactic annotations (BNC and COCA). First, we created a list of verbs occurring in both constructions and used them to find potentially relevant sentences in the corpora. The sentences found were then (partly) automatically filtered. Next, we wrote algorithms for automatic enrichment with the linguistic and discourse information desired: the animacy, concreteness, definiteness, discourse givenness, pronominality, person and number of the objects (the book and him in the example), and the semantic class of the verb. We evaluated the automatic labelling with the help of the existing data set of 790 manually annotated instances. The details of the method and the results found are presented at the conference.

Bootstrapping Approach for Learning Part-Whole Relations
Ashwin Ittoo (Faculty of Economics and Business, University of Groningen) and Gosse Bouma (Information Science, University of Groningen)

Due to the numerous challenges that it poses, learning of non-taxonomic relations from texts has been overlooked in ontology engineering efforts. Part-Whole relations, like meronymy and its inverse holonymy, are important non-taxonomic relations. A difficulty in automatically learning such relations is pattern ambiguity, due to ambiguous constructs that encode Part-Whole relations depending on the contexts. This paper presents an automatic, unsupervised approach to learn lexico-syntactic patterns that encode Part-Whole relations. Compared to previous research, the novelty in this work lies in its approach to handle pattern ambiguity automatically. It bases itself on the Part-Whole relation categories defined by Winston, Chaffin and Herrmann in "A Taxonomy of Part-Whole Relations", which were obtained from psycholinguistic experiments. Seed concepts participating in these relations are extracted from WordNet, and generalised to abstract entities via their hypernyms. Lexico-syntactic patterns that relate the concepts are extracted from a corpus. The sets of concepts and patterns are further augmented via bootstrapping. This procedure not only alleviates pattern ambiguity, but also ensures that the suggested approach has a high coverage of the different Part-Whole patterns. While most previous work relied on some human intervention to retain accurate patterns, we suggest an automatic validation mechanism that also acts as another safeguard against pattern ambiguity. Furthermore, we generalise semantically similar Part-Whole lexico-syntactic patterns into concise representations which facilitate their subsequent incorporation into Machine Learning techniques, such as decision trees.

Bootstrapping Automatic Synonymy Extraction
Kris Heylen, Yves Peirsman and Dirk Geeraerts (QLVL, K.U.Leuven)

In recent years, Word Space Models have become increasingly popular within Computational Linguistics for modelling word meaning. They are able to automatically retrieve semantically similar words by quantifying the extent to which words appear in the same contexts, with contexts being words occurring around, or standing in a specific syntactic relation to, the target word. Usually, models use a restricted number of possible contexts to limit noise and computational costs. This selection is mostly done using a corpus-internal criterion like frequency (most frequent contexts minus a stop list) or some informativeness measure (e.g. entropy). In this paper, we take the novel approach of bootstrapping context selection: we select features that have proven to be good predictors of semantic similarity for known synonyms and use these to find new ones. For a training set of 10,000 Dutch nouns, we calculate their pairwise semantic similarities based on 20,000 context features and compare these to their Wu & Palmer similarity score in Dutch EuroWordNet. Next, we retain the 10,000 features whose contribution to the similarity measure shows the highest correlation with the EuroWordNet similarity score. These context features are then used to find semantically similar words for a test set of 5,000 Dutch nouns. Finally, we evaluate (again against EuroWordNet) whether our bootstrapped Word Space Model is better at finding synonyms than models that use corpus-internal feature selection criteria.

Clustering Headlines for Automatic Paraphrase Acquisition
Sander Wubben, Antal van den Bosch, Emiel Krahmer, Erwin Marsi (Tilburg Centre for Creative Computing, Tilburg University)

For the development of a memory based paraphrase generation system, large amounts of training material are needed in the form of aligned paraphrases. Valuable sources of paraphrases are news article headlines: these headlines often describe the same event in various different ways. We present a method that acquires headline clusters from the web, and for each cluster selects the available paraphrase candidates. These candidates are found through sub-dividing the headline clusters into sub-clusters that ideally contain paraphrases. From each of these sub-clusters all possible paraphrase pairs can then be extracted and used to train the memory based paraphraser. The sub-clustering is done by using the k-means algorithm. A cluster-stopping algorithm is used to find the optimal k for each sub-cluster. For the development and evaluation of the system, data from the DAESO-project is used. In this project, clusters of headlines were manually divided into sub-clusters, resulting in 889 annotated news clusters. The headlines are converted into vector space by using a bag of words representation with tf.idf scores, so that similar vectors can be clustered. Because in this case quality of the alignments is more important than quantity, we evaluate this task by using an F-score which favours precision above recall. Preliminary results show the system achieves an F_0.5-score of 0.70 when evaluating only real clusters (i.e. two sentences or more) and 0.52 when evaluating all sentences, including outliers.
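For reference, the F-score with beta = 0.5 weights precision more heavily than recall. The sketch below uses the standard F-beta definition; the precision and recall values in the example are made up and are not results from the study.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values only: swapping precision and recall changes the score.
print(f_beta(0.8, 0.4))  # high precision, low recall
print(f_beta(0.4, 0.8))  # low precision, high recall
```

With beta = 0.5 the first call scores higher than the second, which reflects the stated preference for alignment quality over quantity.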

A comparison of HMM and Maximum Entropy models for Sequential Tagging (converted from poster)
Yan Zhao (Alfa informatica, University of Groningen)

We compare two main statistical methods, namely Hidden Markov Models (HMM) and Maximum Entropy (ME) models, on some sequential tagging tasks. The tasks vary in tag sets and language. When the tag set is small, the ME model surpasses HMM, as we may expect. Perhaps more surprisingly, we also find situations in which HMMs perform better than ME models. In addition to the lower memory usage and faster training time of HMM, we show that HMM is a more accurate model when the size of the tag set is large enough.

Constructing a Domain Ontology from a Domain Database and the Web
Marieke van Erp (Tilburg University)

Manual ontology building is a laborious process that requires a lot of time and knowledge from domain experts. Since the early 2000s automatic ontology building has been a hot topic in computer science (see for instance the ontology learning workshops at ECAI 2000, IJCAI 2001 and EKAW 2004). This surge in interest is partly thanks to the increasing abundance of domain-specific knowledge that is available through the WWW. Although often the starting point for ontology learning is text, here a textual database is used. The advantage of using a database is that it is already organised along the concepts in the domain, namely the database columns, which provide numerous instances of these concepts. Therefore we can focus on discovering relations for which an external resource is needed, which is where we turn to the web. Because the web is open to everyone there is information on virtually everything, even very domain specific topics. By querying pairs of database values against the web, snippets can be discovered in which both values occur. After linguistic analysis of the object-verb-subject and object-verb-prepositional phrases of the snippets for each instance pair the most frequent verb phrases for each column pair can be identified as relation label candidates for the relation between two concepts. The candidates are then normalised to their infinitives and clustered by synonymy after which they are judged by human annotators.

Corpus-based Source Language Modeling Using Parallel Treebanks
Joachim Van den Bogaert (Centre for Computational Linguistics, K.U.Leuven)

We describe a source language modeling software module for PaCo-MT, a parse and corpus-based Machine Translation system for Dutch, English and French under development. The focus is on the module's functionality within the system and the difficulties that arise with large-scale datasets.

The PaCo-MT system takes its data from dedicated parsers, such as the Alpino parser for Dutch, to create a large corpus of translation examples. Mapping unknown phrases against such a corpus is a complex problem which requires a lot of computational resources in space and time.

We discuss techniques to maximize the re-use of data already present in the dataset, by using all available syntactic information on different abstraction levels. This increases the precision of the system's output. At the same time it also increases the complexity of searches and the search space of the system itself. To maintain system performance we investigate which data structures can be used to model large data sets, which techniques can be used to narrow down the search space without compromising the output's precision and which criteria can be identified to balance output and performance. A Data-Oriented-Parsing approach is used for representing and querying the data.

Corpus-based Target Language Modeling Using Treebanks
Vincent Vandeghinste (Centre for Computational Linguistics, K.U.Leuven)

In this talk we describe experiments in target language modeling for Dutch for machine translation using large automatically parsed treebanks.

We explain the architecture of the PaCo-MT system, which is a parse and corpus-based MT system under development translating from English to Dutch and from French to Dutch and vice versa, using deep parses and large aligned parallel treebanks. We focus more specifically on the target language generation side for Dutch: how we go from a bag of bags to a surface string. A bag of bags in this case is an intermediate representation of a sentence which is structured like an Alpino dependency tree, without the surface position information: for each node in the tree, we want to determine the surface order of the daughter nodes in order to generate a grammatical and fluent output. This is done by collecting the rewrite rules over the treebank, combining category, relation, head token and other information. We investigate several abstraction levels and treebank sizes and measure the results using BLEU, NIST, TER, and number of exact matches.

Definition extraction using a sequential combination of grammars and machine learning classifiers
Eline Westerhout and Paola Monachesi (Utrecht University)

We compare different combinations of a rule-based approach and machine learning to extract definitions of four types in the domain of automatic glossary creation within eLearning. This area imposes its own requirements on the task.

Our approach has several innovative aspects compared to other research in this area. First, we also address less common definition patterns. Second, we compared a common classification algorithm with an algorithm designed specifically to deal with imbalanced datasets, which seems to be more appropriate in our situation. Third, we investigated the influence of definition structure on the classification results. We expected this information to be especially useful when the basic grammar is used, because the patterns matched with this grammar can have very diverse structures.

Two grammars were used in the first step: a sophisticated grammar aiming at getting the best balance between precision and recall and a basic grammar aiming only at getting a high recall. We investigated whether the machine learning classifiers were able to improve the low precision obtained with the basic grammar while keeping the recall as high as possible and compared the results to the performance of the sophisticated grammar in combination with machine learning.

The conclusions are that the algorithm designed specifically to deal with imbalanced datasets for most types outperforms the standard classifier and that classification results improve when information on definition structure is included. The combination of the sophisticated grammar and machine learning outperforms the combination of a basic grammar and machine learning.

Dutch from logic
Crit Cremers, Hilke Reckman and Maarten Hijzelendoorn (LUCL, Leiden University)

In the framework of the Delilah language machine, we present an algorithm to produce well-formed and meaningful Dutch sentences on the sole basis of a fully specified propositional logical form. As even this logical form underspecifies syntactic form, the algorithm is essentially non-deterministic. The logical form of the produced sentence is homogeneous with the input form, however, and thus their semantic relation can be determined. That is, for a given logical form \phi the algorithm computes a family of sentences s with logical form \psi such that for each \psi the difference between \phi and \psi is minimal and can be computed.

In this talk we will discuss (a) the architecture of logical form in Delilah in the light of semantic theory and underspecification, (b) the architecture of the non-deterministic generation procedure, and (c) the nature of the relation between the propositional semantic constraints and the logical form of the produced sentences.

Since Delilah already parses and interprets Dutch sentences, we are able to demonstrate the para-cycle sentence > parsing > computing logical form > generation > sentence.

Elephants and Optimality Again. SA-OT accounts for pronoun resolution in child language.
Tamás Biró (Alfa informatica, University of Groningen)

How much computational resource is needed for the human brain to comprehend a sentence? This paper argues that divergences between adults and children in pronoun resolution are due to differences in computational resources, and not to fundamentally different mechanisms.

In their experiments, Hendriks, Spenader and colleagues (e.g., forthcoming in Journal of Child Language) asked children to decide whether a sentence correctly describes a picture. An image depicting an alligator and an elephant, with the second hitting himself, was accompanied by the sentence "the elephant hits him". Surprisingly for an adult speaker, children tended to accept the sentence as correctly describing the scene, even though the same children would spontaneously use the reflexive "himself" if asked to produce a sentence recounting the drawing. Why do children accept personal pronouns with a reflexive interpretation?

Hendriks et al. employed bi-directional Optimality Theory to describe the phenomenon and they argued that young children lacking a theory of mind (the capability to read others' minds) are also unable to optimise bi-directionally. Without questioning the validity of their account, this talk will present an alternative explanation, which is based on the Simulated Annealing for Optimality Theory Algorithm. The model, analogous to the one employed for voice assimilation by Bíró (2006, chapter 6), predicts a 50% error rate, unless an infinite number of seemingly useless candidates are also taken into account. Hence, we argue, adults' mental computation differs from children's not through a different optimisation technique, but through a richer candidate set.

Enhancing Coverage of Multilingual Lexicalised Grammars
Yi Zhang and Valia Kordoni (Saarland University) and Kostadin Cholakov (University of Groningen)

At present, various wide-coverage symbolic parsing systems for different languages exist and have been integrated into real-world NLP applications, such as IE, QA, grammar checking, MT and intelligent IR. This integration, though, has reminded us of the shortcomings of symbolic systems, in particular lack of coverage. When the hand-crafted grammars which usually lie at the heart of symbolic parsing systems are applied to naturally occurring text, we often find that they are underperforming. Typical sources of coverage deficiency include unknown words, words for which the dictionary did not contain the relevant category, Multiword Expressions (MWEs), but also more general grammatical knowledge, such as grammar rules and word ordering constraints. Currently, grammars and their accompanying lexica often need to be extended manually.

In this paper, we present a range of machine learning-based methods which enable us to derive linguistic knowledge from corpora, for instance, in order to solve problems of coverage and efficiency deficiency of large-scale lexicalised grammars. We focus in particular on presenting methods for the automated acquisition of lexical information for missing and incomplete dictionary entries, but also of more general linguistic knowledge, such as missing constructions. We ensure the general applicability of the automated grammatical engineering methods we propose by applying them to extend the coverage of the DELPH-IN GG (German Grammar) and ERG (English Resource Grammar). Our practice shows that grammar coverage can be significantly improved while retaining linguistic preciseness. And the statistical methods we propose serve as flexible complements to the traditional (manual) grammar engineering approach.

Evaluation of pairwise string alignment methods
Martijn Wieling, Jelena Prokic & John Nerbonne (University of Groningen)

Several computational methods have been developed to determine the similarity between two strings. Most of these algorithms are based on pairwise string alignment, with the Levenshtein distance being one of the most commonly used. The Levenshtein distance has also successfully been used in determining pronunciation differences in phonetic strings (Kessler, 1995; Heeringa, 2004). An interesting extension of the Levenshtein algorithm adds the swap operation, allowing two adjacent characters to be interchanged (Lowrance & Wagner, 1975). This is clearly relevant when modeling similarity between phonetic strings in which metathesis occurs.

A problem with the Levenshtein distance is that deletion, insertion and substitution costs have to be specified in advance. Mackay and Kondrak (2005) proposed a method for identifying cognates between different languages based on Pair Hidden Markov Models (PairHMMs). In PairHMMs the deletion, insertion and substitution costs are iteratively refined using the Expectation Maximization algorithm. After training, the most probable alignment is found using the Viterbi algorithm. Although Wieling and Nerbonne (2007) found that PairHMMs generated reasonable segment distances, aggregate dialectological results were similar to results obtained using the Levenshtein distance. In the current study we investigated the performance of several pairwise string alignment methods, including the regular Levenshtein distance, the swap-extended Levenshtein distance and the PairHMM algorithm. Instead of evaluating the alignments on the basis of how well the aggregate distance between varieties is modeled, we compare the string alignments to a gold standard set of alignments of Bulgarian phonetic dialect data in which metathesis occurs frequently.
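As an illustration of the swap-extended Levenshtein distance mentioned in this abstract, the following is a minimal sketch of the restricted variant (often called optimal string alignment), in which a swapped pair is not edited again. It is not the authors' implementation; the function name, the unit default costs and the `allow_swap` switch are assumptions made for the example.

```python
def swap_levenshtein(a, b, indel=1, sub=1, swap=1, allow_swap=True):
    """Edit distance between sequences a and b with an optional adjacent swap.

    Costs are fixed in advance (assumed unit costs by default).
    """
    # d[i][j] holds the cheapest way to turn a[:i] into b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * indel
    for j in range(1, len(b) + 1):
        d[0][j] = j * indel
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                                     # deletion
                d[i][j - 1] + indel,                                     # insertion
                d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub),  # match or substitution
            )
            # Swap: the last two symbols of both prefixes are each other's mirror image.
            if (allow_swap and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap)
    return d[len(a)][len(b)]
```

The function accepts any sequences, so a phonetic transcription can be passed as a list of segments rather than as a character string. With `allow_swap=False` it reduces to the regular Levenshtein distance.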

Extending a Shallow Parser with PP-attachment and a comparison with full statistical parsing
Vincent Van Asch and Walter Daelemans (Universiteit Antwerpen)

In this presentation we address the extension of the English version of MBSP (the Memory-Based Shallow Parser) with prepositional phrase attachment. Although the pp-attachment task is a well-studied task in a discriminative learning context, it is mostly addressed in the context of artificial situations like the quadruple classification task (Ratnaparkhi, 1998) in which only two attachment points, each time a noun or a verb, are possible. In this research we provide a method to evaluate the task in a more natural situation, making it possible to compare the approach to full statistical parsing approaches. First, we show how to extract anchor-pp pairs from dependency trees in the GENIA and WSJ treebanks. Next, we discuss the extension of MBSP with a pp-attacher. We compare the memory-based pp-attacher with a statistical full parsing approach (Collins, 2003) and analyze the results. More specifically, we investigate the differences in robustness of both approaches to domain changes (in this case domain shifts between journalistic and medical language). We discuss the advantages of the memory-based pp-attacher over a statistical parser and how these advantages can contribute to a better understanding of the domain specificity of natural language tasks.

Extracting a deep German HPSG grammar from a detailed dependency treebank
Bart Cramer and Yi Zhang (Universität des Saarlandes)

Progress is reported on the parallel construction of a wide-coverage HPSG grammar and treebank for German. The method employed has strong affinity with methods that convert a theory-independent treebank to derivation trees of highly lexicalised grammar formalisms, for instance to HPSG (Miyao et al., 2005) and CCG (Hockenmaier and Steedman, 2003; 2005). After the creation of the treebank, deep linguistic features like valence frames are read off, which are then added to a manually created core grammar. Compiling a lexicon and a treebank are expensive and error-prone tasks, and this paradigm gives the opportunity to create both simultaneously in an automated fashion.

The contribution of this study is that a manually annotated treebank with richer linguistic features (Tiger treebank; Brants et al., 2002) is employed, allowing for a more elaborate core grammar, which results in a grammar with less overgeneration and more informative semantic output (Minimal Recursion Semantics (Copestake et al., 2005)). Currently, about 50% of the source treebank can be converted successfully, yielding a lexicon of 14,000 lemmas. A total of 239 fine-grained verbal lexical types are found, with 1.4 lexical types assigned to a verb lemma, on average. Around 2000 verbs, 6000 nouns and 2500 names are acquired. A comparison in performance and availability of resources will be made with respect to an existing, hand-crafted HPSG grammar for German (Müller and Kasper, 2000; Crysmann, 2003).

Extracting domain ontologies from social media applications
Paola Monachesi and Eelco Mossel (Utrecht University)

Domain ontologies are a useful support for learners since they can guide them in their search for relevant learning material given that they provide a formalization of the domain knowledge approved by an expert (Monachesi et al. 2008). However, this formalization might not correspond to the representation of the domain knowledge available to the learner which might be more easily expressed by the tagging emerging from communities of peers via available social media applications.

Therefore, in the Language Technology for Lifelong Learning project, we are going to develop a knowledge sharing system that connects learners to resources and learners to other learners by means of user profiles, ontologies, social tagging and social networks. More specifically, we are going to relate the tags that emerge from existing social tagging platforms (i.e. to the concepts of an existing domain ontology.

We report on a case study in which we extracted related tags for certain domain tags, based on co-occurrence in a set of bookmarks tagged by multiple users. The objectives were to assess whether the related tags found can be a useful source for enriching a given domain ontology, and whether it is possible to extract related tags from only a small set of tagged resources and from resources tagged by only a few users, since this is the expected situation for a smaller learning community.

Our conclusion is that bookmarks tagged by around 10 people are a valid resource for deriving related tags. With higher numbers of people, a few more relations can be found, but the results are similar.
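The co-occurrence step of the case study can be pictured with the following minimal sketch. It assumes that each bookmark is available as a set of tags and that relatedness is measured by raw co-occurrence counts; both the data layout and the function name `related_tags` are hypothetical, and the abstract does not specify the actual measure used.

```python
from collections import Counter

def related_tags(bookmarks, domain_tag, min_count=1):
    """Rank tags by how often they co-occur with domain_tag on the same bookmark.

    bookmarks: iterable of sets of tags, one set per tagged bookmark (assumed layout).
    """
    counts = Counter()
    for tags in bookmarks:
        if domain_tag in tags:
            for tag in tags:
                if tag != domain_tag:
                    counts[tag] += 1
    # Most frequent co-occurring tags first; rare pairs can be filtered out.
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]
```

Raising `min_count` is one simple way to require that a relation be supported by more than one tagged resource.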

A generalized method for iterative error mining in parsing results
Daniël de Kok and Gertjan van Noord (Alfa informatica, University of Groningen)

Error mining is a useful technique for identifying forms that cause incomplete parses of sentences. Van Noord (2004) has described a method for finding suspicious n-grams of arbitrary length. Sagot and de la Clergerie (2006) have described an extension that uses an iterative process to gradually shift blame to specific unigrams or bigrams, rather than blaming each form occurring in an unparsable sentence. We have generalized the iterative method of Sagot and de la Clergerie to treat arbitrary length n-grams. An inherent problem of incorporating longer n-grams is data sparseness. Since a longer n-gram is less likely to occur, it may (incorrectly) be found to be more interesting than a shorter n-gram. Our new method takes data sparseness into account, producing n-grams that are as long as necessary to identify problematic forms, but not longer. It is not easy to evaluate the various error mining techniques. In our presentation, we propose a new evaluation metric which will enable us to compare the error mining variants.
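To make the basic idea of error mining concrete, here is a simplified, non-iterative sketch that scores each n-gram by the fraction of sentences containing it that could not be parsed. It is neither the iterative method of Sagot and de la Clergerie nor the generalized method of this talk; the input layout and the function name `suspicion_scores` are assumptions for illustration.

```python
from collections import Counter

def suspicion_scores(sentences, n=1, min_count=1):
    """Score each n-gram by the share of its sentences that failed to parse.

    sentences: list of (tokens, parsed_ok) pairs (assumed layout).
    Returns a dict mapping n-gram tuples to a value between 0.0 and 1.0.
    """
    total, failed = Counter(), Counter()
    for tokens, parsed_ok in sentences:
        # Count each n-gram once per sentence.
        ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        for gram in ngrams:
            total[gram] += 1
            if not parsed_ok:
                failed[gram] += 1
    return {g: failed[g] / total[g] for g in total if total[g] >= min_count}
```

The `min_count` threshold hints at the sparseness problem described above: a long n-gram seen once in a single unparsable sentence receives the maximal score of 1.0, even though the evidence for it is weak.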

Generating a Non-English Subjectivity Lexicon: Relations That Matter
Valentin Jijkoun and Katja Hofmann (ISLA, University of Amsterdam)

We describe a method for creating a non-English subjectivity lexicon based on an English lexicon, an online translation service and a general purpose thesaurus: WordNet. We use a PageRank-like algorithm to bootstrap from the translation of the English lexicon and rank the words in the thesaurus by polarity using the network of lexical relations in WordNet. We apply our method to the Dutch language. The best results are achieved when using synonymy and antonymy relations only, and ranking positive and negative words simultaneously. Our method achieves an accuracy of 0.82 at top 3,000 negative words, and 0.62 at top 3,000 positive words.

Getting (un)expected asymmetries in parsing head-final languages
Cristiano Chesi (University of Siena)

Surprisal-based Theory (ST) of processing (Hale 2001, Levy 2008), a kind of constraint-satisfaction parsing model (e.g. MacDonald 1993), has been shown to be superior in accounting for the reading time decrease in verb-final clauses with additional preverbal constituents (Konieczny 2000), contrary to limited-resources psycholinguistic models that predict longer reading times (e.g. Dependency Locality Theory, DLT; Gibson 1998): in simplified terms, extra pre-head constituents pose significant constraints on the following head(s), lowering (on a log-scale) the dependent probability of attaching specific constituents (Levy 2008). From this perspective, Earley parsers based on Probabilistic Context-Free Grammars (PCFGs, Charniak 2001) have been shown to be a plausible psycholinguistic model, highly reliable in predicting certain attachment difficulties/asymmetries (Hale 2001). ST, however, is not as predictive in object relative clauses in head-initial languages, where DLT-like models make the correct predictions (Gibson et al. 2005). In this work I will show that a (Top-Down, parallel) parsing algorithm that uses Phase-based Minimalist Grammars (PMGs, Chesi 2007) reconciles the predictions of constraint-satisfaction and limited-resources models in a fruitful way, recasting "head-final vs. head-initial" asymmetries in terms of inverse selection mechanisms (Choi and Yoon 2006): PMGs can encode unbounded dependencies by means of argumental L(ast)I(n)F(irst)O(ut) memory buffers and selection features on the p(r)hase heads (Chesi 2007); the allocation of selection features (case-marked NPs select VPs in German and Japanese while VPs select NPs in Italian and English) is a reliable cue both in terms of resource allocation and in predicting, incrementally, on a log-scale as in ST, the likelihood of a given attachment point.

Harvesting Accurate and Fast Confusible Detectors
Antal van den Bosch (Tilburg centre for Creative Computing, Tilburg University)

A common type of spelling error is the confusion of one word for another existing word. Detecting an error of this kind necessarily involves analysing the linguistic context of the error. The past decade has seen a range of machine-learning-based solutions to this problem, producing knowledge-free, annotation-cost-free confusible classifiers. Although accuracies above 90% are commonly reported, the end goal of error detection in unseen text requires the active pursuit of near-perfect performance, i.e. approaching 100% accuracy. In this presentation I first describe the automatic selection of perfect or near-perfect pairwise confusible disambiguators (e.g. deciding between the contextual appropriateness of 'achteren' versus 'achtereen') from a very large pool of candidate classifiers. The candidate pool is formed on the basis of a corpus-based lexicon of 4.2 million Dutch words; a confusible classifier is generated for each pair of words within a Levenshtein distance of 1 (i.e. they differ in one edit operation), where both words occur more than 100 times in a 600 million word corpus. In a cross-validation experiment, the score of each candidate classifier is assessed, and all 48,625 pairs are ranked by their estimated accuracy. In a second phase, a selection of all classifiers performing above a desired threshold accuracy (e.g. 99.5%) is aggregated into merged classifiers that could be integrated efficiently in word processing software: learning curve analyses illustrate how near-perfect accuracies can be combined with low memory footprints and processing speeds on the order of tens of thousands of classifications per second.

Identification of bilingual named-entities from Wikipedia using a pair Hidden Markov Model
Peter Nabende (University of Groningen)

We propose the use of a pair Hidden Markov Model (pair-HMM) to aid in the identification of bilingual matching named-entities. Identification of bilingual matching named-entities is important in many linguistic applications, including machine translation and information retrieval, where it helps to deal with Out-Of-Vocabulary (OOV) words. Different models and versions of pair-HMMs have been applied successfully in biological sequence analysis (Durbin et al., 1998), in cognate identification (Mackay and Kondrak, 2005), and in Dutch dialect comparison (Wieling et al., 2007). We adapt the pair-HMM toolkit used by Wieling et al. (2007) to measure the similarity between bilingual named entities extracted from Wikipedia, both for languages that use the same writing system and for languages that use different writing systems. The similarity measures obtained using the pair-HMM are critical to the accuracy of a bilingual named-entity identification system. We have tested the pair-HMM on English-Russian name-pairs, all extracted from links in Wikipedia Inter-language web pages. The pair-HMM is initially trained through the Baum-Welch algorithm using more than 35,000 English-Russian training name pairs extracted from both Wikipedia and the Geonames data dump. Two algorithms have been implemented in the pair-HMM that can be used for computing similarity scores: the Forward-backward and the Viterbi algorithm including their logarithmic variations. Given appropriate input data, the accuracy results obtained using the Forward-log algorithm show a promising application of the pair-HMM for the bilingual named entity identification process.

Impact of lexical probabilities on adapting a PCFG to a new domain
Tejaswini Deoskar (ILLC, University of Amsterdam)

Current statistical parsers trained on data from a specific domain of a language perform poorly on data from a different domain. Several methods have been proposed to improve performance, using both annotated data from the new domain (such as supervised MAP estimation (Bacchiani, 2006)) and unannotated data (for example, self-training (McClosky et al. 2006)). Methods using unannotated data are particularly relevant since annotated corpora are unavailable for most domains. In the current work, we use EM (Expectation-Maximization) re-estimation on unlabeled data from a new domain (such as fiction) to adapt a Wall Street Journal (WSJ) Treebank PCFG to the new domain. We focus our efforts on learning distributions of selectional preferences of lexical items in the new domain (lexico-syntactic properties), such as sub-categorization frames. EM re-estimation of lexico-syntactic probabilities on unlabeled WSJ data has previously been shown to improve the WSJ Treebank PCFG (Deoskar, 2008). Our motivation for using the method on a new domain is as follows: it is known that sub-categorization frames of verbs show a large variation across domains (Roland and Jurafsky, 1998). For instance, it has been shown for an HPSG parser that adapting the HPSG lexical model using annotated data from a new domain is more important than adapting the syntactic model (Hara et al. 2007). Estimating lexico-syntactic preferences from unannotated data, of which much larger amounts are available, thus promises to be beneficial for adaptation.

In Search of the Why: What have we learnt?
Suzan Verberne, Lou Boves, Nelleke Oostdijk and Peter-Arno Coppen (Dept. of Linguistics, Radboud University Nijmegen)

In the research project 'In search of the Why', we have aimed at developing a system for answering why-questions. After four years of work, the project reaches an end. In our talk, we will summarize the main project results: what have we learnt about why-questions and how can a QA system answer them? In a QA system, why-questions need a different approach from factoid questions since their answers are explanations that cannot be stated in a single phrase. Therefore, passage retrieval is a suitable approach to why-QA. With a passage retrieval system based on word overlap only, we are able to find a relevant answer in the top 10 results for 45% of the why-questions. The addition of structural information (syntactic structure, cue words and document structure) to the ranking component of the system improves this success score to 55%.

We experimented with additional types of information that may be able to improve answer ranking. We found that discourse structures, which represent relations between spans of text on a higher level than syntax, can in theory be very useful for why-answer recognition. However, since automatic discourse parsing is not feasible given the current state of the art in NLP, these higher-level structures do not have direct practical usage. We also experimented with the addition of semantic expansions from a number of resources other than WordNet. We found that in their current state of development these semantic resources do not improve QA results because there are too many unresolved issues related to normalization and coverage.

Klinkende Taal: an automatic system for readability testing
Oele Koornwinder (GridLine bv)

Dutch officials are expected to write letters which are readable for a general public. In practice, this appears to be a difficult task, as officials are trained to communicate in the language of law. As a consequence, they are not able to adapt their own letters and information to the reading capacities of the working class. To bridge this gap, the Dutch Language Technology company GridLine developed an automatic, Word-compatible system for readability testing, called Klinkende Taal. It inspects Dutch official letters for passages which can cause reading difficulties, annotating both lexical problems (like officialese and domain-specific words and phrases) and style problems (like passive verb constructions and long sentences). For each type of problem, the system counts the number of occurrences, summarizing the text quality in a report with some key numbers. In this way, writers are stimulated to improve the readability of their texts. The Klinkende Taal software makes use of a number of Dutch HLT-tools, among which a tokenizer, a tagger, a lemmatizer, a compound splitter and a chunk analyser. These tools are implemented in a web service that annotates words and sentence parts which match a human-controlled list of difficult words and phrases, either exactly or in a derived form. We also use language technology for word suggestion in context and offline lexicon construction, applying automatic extraction methods to find terms and difficult words in domain specific text collections. In my presentation, I will demonstrate the application and discuss the underlying methods.

Language models for contextual error detection and correction
Herman Stehouwer and Menno van Zaanen (Tilburg University)

The problem of identifying and correcting confusibles, i.e. context-sensitive (spelling) errors, in text is often tackled using a machine learning classification approach. For each different set of confusibles, for instance containing confusibles then and than, a specific classifier is trained and tuned.

In this research, we investigate a more general approach to context-sensitive correction. Instead of training a classifier for each confusible set, a language model is used to measure the likelihood of the sentence with each possible member of the confusible set in place. The word that yields the sentence with the highest likelihood is selected as the correct one. This approach has the advantage that all confusible sets can be handled by only one model. Training for each new possible set of contextual errors is not required.
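The selection step described above can be sketched as follows. The sketch uses an add-one smoothed bigram model as a stand-in, since the abstract does not specify the language model; the function names `train_bigram`, `log_likelihood` and `correct` are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect bigram and context counts from tokenised sentences."""
    context, bigram, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        for prev, cur in zip(toks, toks[1:]):
            context[prev] += 1
            bigram[prev, cur] += 1
    return context, bigram, len(vocab)

def log_likelihood(model, sentence):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    context, bigram, vocab_size = model
    toks = ["<s>"] + sentence + ["</s>"]
    return sum(
        math.log((bigram[prev, cur] + 1) / (context[prev] + vocab_size))
        for prev, cur in zip(toks, toks[1:])
    )

def correct(model, sentence, position, confusible_set):
    """Try every member of the confusible set at position; keep the likeliest."""
    def variant(word):
        return sentence[:position] + [word] + sentence[position + 1:]
    # sorted() only makes tie-breaking deterministic.
    return max(sorted(confusible_set),
               key=lambda word: log_likelihood(model, variant(word)))
```

Because `correct` takes the confusible set as an argument, the same trained model serves every set, which is the advantage the abstract points out over training one classifier per set.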

We compare the performance of the language model with that of a classifier-based approach and the detection and correction module of Microsoft Word. Firstly, manually and automatically generated lists of confusible sets are created. Secondly, sentences are selected from a corpus, containing words that are found in the list of confusibles. These words are then replaced by their incorrect variant. Finally, the systems are required to select the correct word for each confusible. In addition to the word-based classifiers and language model, more complex systems that incorporate POS- and chunk IOB-tags are tested.

Preliminary results show that the language-model based approach is outperformed by the classifier-based approach. We will investigate why this is the case.

Language technology and semantic knowledge in eLearning
Paola Monachesi (Utrecht University)

In this talk, I will report the final results of the Language Technology for eLearning (LT4eL) project, a European project which has recently ended. The aim of the project is to provide Language Technology based functionalities and to integrate semantic knowledge to enhance the management, distribution and retrieval of the learning material. Specifically, we have employed Language Technology resources and tools for the semi-automatic generation of descriptive metadata. We have developed new functionalities such as a key word extractor and a glossary candidate detector, based on definitions extracted from the learning material tuned for the various languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian).

Semantic knowledge, in the form of ontologies, has been integrated to enhance the management, distribution and searchability of the learning material. The ontology, which is related to lexica in the relevant languages, allows for the cross-lingual retrieval of the required information. The impact of these new functionalities on the learning experience in a multilingual context will be discussed.

Language-independent bilingual terminology extraction from a multilingual parallel corpus
Els Lefever, Lieve Macken and Véronique Hoste (LT3, Ghent University College)

We present a language-pair independent terminology extraction module that is based on a sub-sentential alignment system that links linguistically motivated phrases in parallel texts. Statistical filters are applied to the bilingual list of candidate terms that is extracted from the alignment output. We compare the performance of both the alignment and terminology extraction module for three different language pairs (French-English, French-Italian, French-Dutch) and highlight language-pair specific problems (e.g. different compounding strategies in French and Dutch). Comparisons with standard terminology extraction programs show an improvement of up to 20% for bilingual terminology extraction and competitive results (85% to 90% accuracy) for monolingual terminology extraction, and reveal that the linguistically based alignment module is particularly well suited for the extraction of complex multiword terms.

Linking Wikipedia Categories to Dutch WordNet
Gosse Bouma (Information Science, University of Groningen)

Wikipedia provides valuable category information for a large number of named entities. However, the category structure of Wikipedia is often associative and not strictly taxonomic. Therefore, a merger of Wikipedia and WordNet has been proposed. In this paper, we explore to what extent this is possible for Dutch Wikipedia and WordNet. In particular, we address the word sense disambiguation problem that needs to be solved when linking Wikipedia categories to polysemous WordNet literals. We show that a method based on automatically acquired predominant word senses outperforms a method based on word overlap (between Wikipedia supercategories and WordNet hypernyms). We compare the coverage of the resulting categorization with that of a corpus-based system that uses automatically acquired category labels.

Modelling Lexical Variation with Word Space Models
Yves Peirsman and Dirk Geeraerts (KU Leuven)

Word Space Models capture the meaning of a word in terms of its contexts in a corpus. They thus discover the semantic relatedness between two words on the basis of the similarity between their typical contexts. Contexts can be defined as a set of n words around the target, the syntactic relations in which the target takes part, etc. These models of lexical semantics are usually created on the basis of one corpus only. However, constructing a model on the basis of two corpora opens up interesting paths for the study of the lexical variation between those corpora. In our study, we use the Twente Nieuws Corpus (300 million words of Netherlandic Dutch newspaper articles), as well as a comparable corpus of Belgian Dutch newspaper texts. We construct a word space on the basis of these two corpora, with separate word entries for the two language varieties. We combine this approach with keyword statistics, in order to find Belgian-Netherlandic word pairs like dessert-toetje or gsm-mobieltje, which are indicative of significant lexical variation between the two language varieties. The results are evaluated against established lists of Belgian and Netherlandic Dutch words. In this way, we hope to show that advanced computational-linguistic methods can successfully be applied to the study of language variation.
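A word space of the kind described here can be sketched in a few lines. The sketch assumes window-based contexts, raw co-occurrence counts and cosine similarity, and it keeps separate target entries per variety by keying them on a (variety, word) pair while sharing the context words. These choices and the function names are assumptions for illustration, not the authors' actual settings.

```python
import math
from collections import Counter, defaultdict

def build_word_space(corpora, window=2):
    """Build context-count vectors with one entry per (variety, word) target.

    corpora: dict mapping a variety label to a list of tokenised sentences.
    """
    space = defaultdict(Counter)
    for variety, sentences in corpora.items():
        for toks in sentences:
            for i, target in enumerate(toks):
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        # Context words are shared across varieties.
                        space[variety, target][toks[j]] += 1
    return space

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Comparing the vector of a Belgian Dutch entry with the vectors of Netherlandic Dutch entries then gives a ranking of candidate cross-variety counterparts, which could be combined with keyword statistics as the abstract describes.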

Multiple sequence alignments in linguistics
Jelena Prokic, Martijn Wieling & John Nerbonne (University of Groningen)

In this study we apply and evaluate an iterative pairwise alignment program for linguistics, ALPHAMALIG (Alonso et al. 2004), on phonetic transcriptions of words used in dialectological research. Most studies in dialectometry that measure linguistic differences at the phonetic level are based on pairwise comparison of transcriptions (Kessler 1995, Nerbonne et al. 1996, Gooskens and Heeringa 2002, Kondrak 2002). Although successful in the aggregate analysis of dialect distances, pairwise comparison fails to reveal important information about phone changes that can be easily detected if all transcriptions are compared at the same time. In order to perform multiple sequence alignments of X-SAMPA word transcriptions we use a slightly modified version of ALPHAMALIG. ALPHAMALIG implements the iterative pairwise alignment algorithm, which starts by aligning the two strings that have the minimum distance over all pairs of strings and then iteratively aligns the string with the smallest distance to the multiple alignment to generate a new multiple alignment. The only restriction that we use while aligning transcriptions is that a vowel can only be aligned with a vowel and a consonant can only be aligned with a consonant.

To evaluate the quality of the multiple alignment, we propose a method based on comparing each column in the obtained alignments with the corresponding column in a set of gold standard alignments. Our results show that the alignments produced by ALPHAMALIG correspond closely to the gold standard alignments, making this algorithm suitable for the automatic generation of multiple string alignments.

Phrase Translation Probabilities Estimation Using Latent Segmentations and Smoothed Expectation-Maximization
Markos Mylonakis and Khalil Sima'an (ILLC, University of Amsterdam)

The conditional phrase translation probabilities constitute the principal components of phrase-based machine translation systems. These probabilities are estimated using a heuristic method based on the relative frequency of phrase pairs in the multi-set of the phrase pairs extracted from the word-aligned corpus. However, this approach does not seem to optimize any reasonable objective function of the word-aligned, parallel training corpus, leaving open the question whether it is suboptimal. The mounting number of efforts attacking this problem over the last few years exhibits its scientific relevance for the machine translation community as well as its difficulty. Nevertheless, earlier efforts on devising a better understood estimator either do not scale to reasonably sized training data, or lead to deteriorating performance. In this work we explore a new approach based on three ingredients: (1) A generative model with a prior over latent segmentations derived from Inversion Transduction Grammar (ITG), (2) A phrase table containing all phrase pairs without length limit, and (3) Smoothing as learning objective using a novel Maximum-A-Posteriori version of Deleted Estimation working with Expectation-Maximization. Where others conclude that latent segmentations lead to overfitting and deteriorating performance, we show that these three ingredients give performance equivalent to the heuristic method on reasonably sized training data, providing an effective, better understood estimator for the task.
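For reference, the heuristic baseline mentioned in this abstract, relative frequency over the multiset of extracted phrase pairs, can be sketched as below. Phrase extraction from the word-aligned corpus is assumed to have happened already; the input layout and the function name `relative_frequency` are hypothetical, and the sketch covers only the baseline, not the estimator proposed in the talk.

```python
from collections import Counter

def relative_frequency(phrase_pairs):
    """Estimate p(source | target) as count(source, target) / count(target).

    phrase_pairs: list of (source_phrase, target_phrase) occurrences,
    i.e. the multiset of extracted phrase pairs (assumed layout).
    """
    pair_counts = Counter(phrase_pairs)
    target_counts = Counter(target for _, target in phrase_pairs)
    return {
        (source, target): count / target_counts[target]
        for (source, target), count in pair_counts.items()
    }
```

Each conditional distribution sums to one by construction, but nothing in this counting procedure optimises an objective over the training corpus, which is the concern the abstract raises.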

A practical and extensible algorithm for frequent structure discovery in large corpora
Scott Martens (Centrum voor Computerlinguïstiek, K.U.Leuven)

Discovering frequent structures within large natural language corpora is one of the core problems of corpus linguistics. Identifying and sorting through significant structures, however, is difficult outside of trivial cases, and the growth in structured corpus data (tagged, parsed or otherwise enhanced data) renders string-based algorithms ineffective. This presentation will describe a practical algorithm to extract frequent structures from large corpora consisting of sequences of symbols, trees, bags of symbols, and any other data structure that can be represented as sequential or tree structured. It is applicable to efficiently discovering correlations in bags, repeated subtrees and sequences with gaps (e.g. skipgrams). Furthermore, it extracts the most frequent structures first, making it amenable to many kinds of monotonic search heuristics. This algorithm assumes random constant-time access to all parts of the corpus, but otherwise uses relatively little active memory, allowing it to process very large corpora without overflowing moderate memory bounds. Its performance is roughly linear in the quantity of data extracted from the corpus. That quantity is not readily predictable in many cases, but because the algorithm extracts the most frequent valid structures first, it is suitable for soft thresholds. This method is directly applicable to a wide variety of corpus linguistic analyses that use the automatic collection of statistics about the contents of natural language corpora.

Predicate-argument Frequencies in Dutch Pronoun Resolution
Gerlof Bouma (Linguistics Department, University of Potsdam) and Gosse Bouma (Information Science, University of Groningen)

In state-of-the-art pronoun resolution systems, surface-based and syntactic factors play a central role. They have proven to be reliable, measurable correlates of salience, one of the central forces underlying pronoun interpretation. However, even in the earliest (unimplemented) algorithms, semantic factors such as selection restrictions and plausibility were identified as another important information type (Hobbs 1978).

In a wide-coverage resolution system, predicate-argument frequencies may stand proxy for this semantic information. The suitability of a candidate antecedent is then (in part) determined by its co-occurrence with the pronoun's governing predicate. Dagan et al. (1995) describe an implementation of this idea and report slight improvements in resolution performance. Since the successful use of predicate-argument frequencies relies on having a large corpus, preferably with a high quality of syntactic annotation, one might expect more dramatic improvements in more recent attempts. However, recent papers report contradictory results (Kehler, 2005; Yang et al. 2005).

In this talk we present experiments in using predicate-argument frequencies in pronoun resolution for Dutch. Our starting point is a system based on a maximum entropy ranking model, which includes in its feature set co-occurrence statistics of the predicate-candidate pair. The statistics come from a large, automatically parsed corpus. We also explore a different, novel application of co-occurrence statistics. Since a particular predicate-candidate pair may be rare, we use the co-occurrence statistics of the predicate with the candidate's semantically most similar words. Semantic similarity is also calculated from frequency data (Bouma & vd Plas, 2005). We compare our results with the current leading pronoun resolution systems for Dutch (Hoste, 2003; Mur, 2008).

A Regression Model for the English Benefactive Alternation
Daphne Theijssen, Hans van Halteren, Karin Fikkers, Frederike Groothoff, Lian van Hoof, Eva van de Sande, Simon Tazi, Jorieke Tiems, Veronique Verhagen and Patrick van der Zande (Radboud University Nijmegen)

In English, dative constructions preferably take one of two forms. One uses an indirect object, e.g. He gave me a cake. The other makes use of a prepositional phrase, e.g. He gave a cake to me. In principle, the choice between the two forms is free. However, for the alternation with to it has been shown that there are various factors which influence the choice, such as length and discourse givenness of the objects (Bresnan et al., 2007). It has even proved possible to relate the choice to these feature values with logistic regression models. Given sufficient amounts of training data, these models can reach a prediction accuracy well over 90%.

There has been much less attention to the dative alternation with the preposition for, which can be used for benefactive constructions (He baked me a cake versus He baked a cake for me). We investigate whether this alternation can be modelled with similarly high accuracy, using the same feature set and the same modelling techniques. Furthermore, we want to compare the alternation in adult and child language use. We use a data set containing benefactive constructions in written British English, taken from a number of corpus sources, i.e. ICE-GB, SUSANNE, LUCY and LCCPW, the latter two representing the written English of children (around 10 years of age). We will discuss not only the overall model accuracy, but also the relative role of the various features in models for the different kinds of text.

A Semantic Approach to Antecedent Selection in Verb Phrase Ellipsis
Dennis de Vries (University of Groningen)

Consider this example of Verb Phrase Ellipsis (VPE): The man [ant1 stood up because the door bell [ant2 rang]], but his son [vpe didn't]. Of the two possible antecedents, ant1 is the correct one. In earlier studies (Hardt, 1997; Nielsen et al., 2005), syntactic features of candidate antecedents were used to determine the most plausible antecedent. With their methods, Hardt (1997) and Nielsen et al. (2005) reached accuracies of 84% and 79% respectively. Inspired by the ongoing theoretical debate on whether ellipsis is resolved syntactically or semantically, I elaborated on these studies by adding a number of semantic features. To acquire semantic information from discourse, I use Boxer (Bos, 2005), a semantic parser that constructs Discourse Representation Structures (Kamp and Reyle, 1993) from syntactically parsed discourse. These semantic features are (1) semantic similarity of VPE and antecedent subjects, (2) parallelism of prepositional phrases and (3) similarity in tense. To determine semantic similarity of nouns in (1) and (2), I use WordNet's path distance measure. With (1), ant1 will be chosen over ant2 because "son" has a higher semantic similarity with "man" than with "door bell". As in Hardt (1997), a scoring mechanism is used to determine which of the possible antecedents is the most plausible one. Each feature that an antecedent may or may not have contributes a particular positive or negative value to that antecedent's score. These values are optimized using a Genetic Algorithm with over 500 manually annotated examples of VPE from the Wall Street Journal part of the Penn Treebank. This is work in progress.
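The scoring mechanism described above can be pictured with a minimal sketch. Everything concrete in it is an assumption made for illustration: the feature names, the weight values, and the choice to let only the features a candidate actually has contribute to its score. In the talk the values are tuned by a genetic algorithm; here they are simply written down by hand.

```python
def score(features, weights):
    """Sum the weights of the features that a candidate antecedent has."""
    return sum(weights[name] for name, present in features.items() if present)


def best_antecedent(candidates, weights):
    """Return the name of the candidate with the highest score."""
    return max(candidates, key=lambda name: score(candidates[name], weights))


# Hypothetical weights (not from the talk); a positive value favours a
# candidate, a negative value counts against it.
weights = {"similar_subject": 2.0, "same_tense": 1.0, "embedded_clause": -1.5}

# Hypothetical feature values for the two candidates in the example sentence.
candidates = {
    "ant1": {"similar_subject": True, "same_tense": True, "embedded_clause": False},
    "ant2": {"similar_subject": False, "same_tense": True, "embedded_clause": True},
}

print(best_antecedent(candidates, weights))
```

With these invented values ant1 outscores ant2, mirroring the judgement in the example; different weights would of course give different rankings.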

Semi-supervised adaptation of a Syntactic Disambiguation Model using Structural Correspondence Learning
Barbara Plank and Gertjan van Noord (Alfa informatica, University of Groningen)

Most modern, effective NLP systems are based on supervised Machine Learning. A well-known drawback of such systems is their portability: Whenever we have labeled data from some source domain, but we would like a model that performs well on some new target domain, we face the problem of domain adaptation.

The need for domain adaptation arises in many NLP tasks: PoS tagging, Sentiment Analysis, Statistical Parsing, to name but a few. For parsing, most previous work on domain adaptation has focused on systems employing (constituent or dependency based) treebank grammars (Gildea, 2001; McClosky et al., 2006; Shimizu and Nakagawa, 2007). Adaptation of disambiguation models is a far less studied area, most probably due to the fact that potential gains for this task are inherently bounded by the underlying grammar.

In the current study we focus on semi-supervised adaptation, i.e. no labeled target data at all. We present an application of Structural Correspondence Learning (SCL) (Blitzer et al., 2006) to adapt a stochastic attribute-value grammar (SAVG) to Wikipedia domains. So far, SCL has been applied successfully for PoS tagging and Sentiment Analysis (Blitzer et al., 2006; Blitzer et al., 2007). An attempt was made in the CoNLL 2007 shared task to apply SCL to non-projective dependency parsing (Shimizu and Nakagawa, 2007), however, without any clear conclusions. In this talk, we report on our exploration of applying SCL and show promising initial results on Wikipedia domains.

Solving analogies between symbols strings, with or without constraints
Jean-Luc Manguin (Laboratoire GREYC, Université de Caen)

Solving analogies is one of the main tasks in example-based natural language processing, for instance automatic translation (Lepage, 2005). An analogy between symbol strings can be written [A : B :: C : D], where A, B, C and D are symbol strings; if the analogy is true, it implies that "A is to B what C is to D, and A is to C what B is to D" (a), as in: [to bear : unbearable :: to suit : unsuitable]. Y. Lepage (2003) has set up an important theory about analogy, and formalized the (a) conditions in the following two constraints:

1) The character set of [(A + D)] is the same as that of [(B + C)]
2) [d(A,B)=d(C,D)] and [d(A,C)=d(B,D)], where d is a distance between symbol strings.

Of course, an automatic solver that computes D from A, B and C should check whether the solution matches these conditions, as did Y. Lepage's first solver (Lepage, 1998). But we will present another algorithm that gets rid of the first condition. Moreover, we will prove that if the Levenshtein distance (or "edit distance") is used in the second condition, the "right" solution may sometimes be rejected, as in this example in Dutch: [dagen : ze hadden gedaagd :: zagen : ze hadden gezaagd]. We then propose another distance which allows the solver to produce at least the right solution. We will also present several remaining problems that lead us to give other ways of solving analogies without checking the distance between strings.
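As an illustration, the two conditions can be checked mechanically on the Dutch example. The sketch below is not the solver presented in the talk. It assumes that condition 1 is read as equality of character multisets, and it uses a textbook unit-cost Levenshtein distance for condition 2. All function names are made up for this example.

```python
from collections import Counter


def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def same_characters(a: str, b: str, c: str, d: str) -> bool:
    """Condition 1, read here as multiset equality (an assumption)."""
    return Counter(a + d) == Counter(b + c)


def distances_agree(a: str, b: str, c: str, d: str, dist=levenshtein) -> bool:
    """Condition 2: d(A,B) = d(C,D) and d(A,C) = d(B,D)."""
    return dist(a, b) == dist(c, d) and dist(a, c) == dist(b, d)


A, B = "dagen", "ze hadden gedaagd"
C, D = "zagen", "ze hadden gezaagd"

print(same_characters(A, B, C, D))
print(distances_agree(A, B, C, D))
```

On this example the character condition holds, while the Levenshtein version of the distance condition does not: the initial z of zagen can be matched against the z of ze, whereas the d of dagen has no such counterpart at the start of the phrase, so d(A,B) and d(C,D) differ. Passing another function as `dist` allows alternative distances to be tried.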

Syntactic-Semantic Treebank for Domain Ontology Creation
Petya Osenova and Kiril Simov (Bulgarian Academy of Sciences, Bulgaria)

Treebanks have many applications like grammar development, linguistic knowledge acquisition, benchmarking, etc. Here we present a treebank especially developed for ontology creation and semantic annotation support. The first task is supported by exploitation of the treebank annotation in order to extract relevant concepts and relations for the domain ontology. The second task requires a treebank annotated additionally with semantic information in order to develop and test concept annotation grammars. The treebank is based on several kinds of domain texts: (1) glossaries and terminological lexicons; (2) industrial standards; and (3) domain texts (in our case magazine articles). The first source contains many of the most important terms in the domain with relevant definitions. Its drawback is that the definitions provide only partial information for the ontology creation. The second source includes precise information about the domain concepts and relations, and exhibits description granularity. The third source demonstrates the usage of the terms in broader context. The texts in the articles contain also usages of terms related to other domains which are connected to the domain of interest. The treebank was used in the life cycle of ontology construction and creation of concept annotation grammars. The annotation scheme reflects syntactic structures which represent the semantic relations in text. In addition, the words and phrases are annotated with concepts from the domain ontology, upper ontology and WordNet synsets. The last two sources of conceptual information are necessary to support the interaction of the domain terms with more general concepts.

A Tagger-Lemmatiser for 14th Century Dutch Charters
Hans van Halteren and Margit Rem (Radboud University Nijmegen)

Medieval texts present far greater problems for processing than modern variants. The highly varying orthography in particular makes any NLP task difficult, for not only is spelling far from standardised, but so is the placement of spaces and punctuation. Because of this, it is of vital importance that before any task-directed processing, all texts are first tokenised, tagged and lemmatised. We have built a system that does just this, for 14th century Dutch charters.

Our system is trained on the 750K-word Corpus van Reenen-Mulder, which had previously been manually tagged with a tagset of about 180 wordclass tags (not counting clitic combinations) and lemmatised using modern lemmas. We made use of various off-the-shelf tagger generator components, such as SVM-Tagger, TnT and WPDV-tagger, but also developed new ones to deal with the orthographic variation. For tokenisation we deploy both character and token n-gram models. For spelling variation, we automatically generate generalised spelling variants, based on token-specific observed variation as well as on rules derived from frequent observations over all tokens. For both tagging and lemmatisation, the system reaches accuracies well over 90% when tested in ten-fold cross-validation. In this paper, we will describe the strategies used to handle the orthographic variation in more detail and show their impact by quality measurements with and without various components in place.

TEI-conformant XML encoding scheme for corpora with multiple levels of annotation
Szymon Bemowski and Adam Przepiórkowski (Polish Academy of Sciences)

The aim of this paper is to present a new encoding scheme for text corpora with multiple levels of linguistic annotation. The scheme takes the form of a TEI instantiation designed according to the TEI P5 Guidelines. It specifies conventions for encoding the following levels of annotation: segmentation, text structure, morphosyntax, syntactic words, syntax, senses of words and occurrences of named entities. The design of the scheme also makes it easy to extend with new levels of annotation in a uniform and elegant way.

The scheme, while fully TEI-conformant, also builds on earlier proposed standards, especially XCES (XML Corpus Encoding Scheme), and on the ideas of LAF (Linguistic Annotation Framework). Various levels of annotation are kept as stand-off annotation in separate files, with references to the primary data and to each other. Complex values of attributes, e.g. morphological tags, are abbreviations for feature structures defined in external inventories.

During the talk, the scheme will be illustrated with numerous examples. Some emphasis will be put on the representation of alternatives (ambiguities), also at the segmentation level, and on the handling of discontinuity at various levels.

The proposed scheme is currently being applied to the National Corpus of Polish. To the best of our knowledge, this is the first TEI application embracing so many annotation levels and worked out in such detail.

Towards Acquisition of Taxonomic Inference
Piroska Lendvai (Tilburg University)

In a corpus of Dutch medical encyclopedia texts, we focus on the mechanism of taxonomic inference that involves extraction of taxonomically coordinate terms and identifying a passage in the same document (the hypothesis) that states their relation explicitly. The inference elements are acquired by syntactic and semantic alignment, additionally driven by document structure. From the terms and the related hypothesis we harvest lexical paraphrase patterns, which are next linked to annotated domain concepts. The inference method is bootstrapped and transferred to unstructured documents to detect relations between automatically identified coordinate terms.

Training a parser on artificially fragmented data for spoken language understanding.
Lonneke van der Plas (Département de linguistique, Université de Genève), James Henderson (Département d'Informatique, Université de Genève) and Paola Merlo (Département de linguistique, Université de Genève)

In the near future, adaptive systems for spoken language understanding will need a more sophisticated treatment of natural language. For example, in self-help dialogue systems, simple keyword spotting approaches will no longer suffice due to the extensive situation descriptions with which users describe their difficulties to the system. Such utterances require parsing to be translated into dialogue acts or other shallow semantics.

Spoken data contains many incomplete sentences, especially when spoken within the context of a dialogue. This is in contrast with the corpora on which parsers are usually trained, i.e. newspaper texts such as the Wall Street Journal. Newspapers typically contain very little direct speech, and therefore mostly consist of complete and well-formed sentences. Parser performance on spoken data is degraded by the very different nature of the training data.

To improve performance on spoken data, we trained an SRL parser on an artificially created corpus that includes fragmented data. In order to build this corpus we carefully determined the distribution of fragments found in a subset of spoken data acquired from human-machine dialogues. We then extracted constituents from the Penn Treebank augmented with PropBank annotation in accordance with the distribution found in the subset. We trained the parser on this corpus and compared our results to the model trained on the original data.

We will show that we improve the parser's performance on incomplete sentences acquired from human-machine dialogues by training on artificially fragmented data.

Translating Questions for Cross-lingual QA
Jörg Tiedemann (Alfa informatica, University of Groningen)

Cross-lingual question answering (QA) is an approach to enable the search for information written in a language different from the query language. Adding a translation component to QA makes such systems more widely applicable to a larger group of users. The simplest setup for a cross-lingual system is to add a module for the translation of incoming questions to an existing monolingual QA system. For this, available general purpose translation engines can be applied but they have the drawback that they are not trained for this specific task and data. In our research, we try to develop a task-specific translation component tailored towards the syntax and semantics of questions. Our first experiments using a standard phrase-based SMT approach show that task-specific training data, even if only a small amount, is important for the quality of its output. We evaluate various settings including different types of parallel and monolingual data and the use of linguistic factors for better generalization. Especially tuning the language model that is responsible for grammaticality and fluency seems to be important for the improvement of the MT output. The syntactic structure of questions is usually very specific and linguistic features such as part-of-speech labels, dependency relations and syntactic supertags can help to build a general model for this type of structure. Our experiments are carried out for English-to-Dutch question answering. Details will be given in the actual presentation.

A treebank-driven investigation of predicate complements in Dutch
Frank Van Eynde (Centre for Computational Linguistics, K.U.Leuven)

Predicate complements are typically selected by copular verbs. Their number rarely exceeds a dozen in the lists that one finds in descriptive grammars, and in formal treatments the list tends to be even smaller, with most of the attention going to the copula and a few other verbs.

What this paper aims to show is that treebanks, if properly annotated and documented, provide a more reliable source for identifying the class of predicate selectors. More specifically, a quantitative investigation of the treebank of the Spoken Dutch Corpus will be shown to provide the means to identify a considerably larger class of predicate selectors, yielding a list of three dozen instead of one. Since all of the predicate selectors, including the copula, are also used in other ways, i.e. as intransitive verbs, auxiliaries, etc., it is important for NLP systems to have the means to distinguish the predicate selecting uses from the other ones. Also for this purpose, a quantitative investigation of the Spoken Dutch treebank will be shown to provide valuable information.

This work fits in the larger enterprise of automatic valence acquisition from treebanks, see Sarkar and Zeman 2000 for Czech, O'Donovan et al. 2005 for English, Kupsc and Abeille 2008 for French, Hinrichs and Telljohann 2009 for German.

Unsupervised Frame Induction with Non-negative Matrix Factorization
Tim Van de Cruys (University of Groningen)

In this talk, an algorithm is presented that automatically tries to induce semantic frames from a parsed corpus. It does so by combining all pairwise co-occurrences of a verb and each of its dependencies (subjects, direct objects, prepositional complements and modifiers), as well as the pairwise co-occurrences of the dependencies among each other. The pairwise co-occurrence data is combined in an interleaved non-negative matrix factorization framework, in order to overcome data sparseness and to generalize over the data, while keeping track of the multi-way relationships. The algorithm is evaluated qualitatively, and its advantages and disadvantages are investigated.

Unsupervised Methods for Head Assignment
Federico Sangati (University of Amsterdam)

We present three new algorithms for assigning heads in phrase structure trees based on different linguistic intuitions on the role of heads in natural language syntax.

In our investigation, heads are seen as a bridge between different syntactic representations of natural language sentences: phrase structures, lexicalized tree grammars, and dependency structures. The starting point of our approach is the observation that a head-annotated treebank defines a unique lexicalized tree substitution grammar, a grammar in which each tree fragment has a unique lexical anchor. Analogously, any head annotation of a parse tree can be uniquely mapped into a projective dependency structure of the same sentence.

This allows us to go back and forth between these three representations, and define objective functions for the unsupervised learning of head assignments in terms of features of the implicit derived grammatical structures.

We evaluate algorithms based on the match with gold standard head-annotations, and the comparative parsing accuracy of the lexicalized grammars and the dependency grammars they give rise to. On the first task, we approach the accuracy of hand-designed heuristics for English and inter-annotation-standards agreement for German. On the parsing evaluation tasks, our lexicalized grammars score 4 percentage points higher than those derived by commonly used heuristics.

Unsupervised Morphology Learning Using a Lexicalized Grammar
Cagri Coltekin (CLCG/University of Groningen)

Morphological analysis is essential for a human language user, as well as for a large number of NLP applications. We present a model of unsupervised morphology learning. The primary motivation of the model is simulating human language acquisition. However, especially for agglutinative languages, the system's performance is competitive with the state-of-the-art computational models of unsupervised morpheme analysis.

The model learns the morphology of a natural language by learning a `morphemic' lexicalized grammar based on the Categorial Grammar formalism in an unsupervised manner. The input to the model is unannotated, unsegmented words. The model learns a lexicalized word-grammar which is capable of analyzing and producing the words from the input language. Since the model learns in a completely unsupervised manner, it can in principle learn the morphology of any language. However, the current implementation is restricted to learning concatenative morphologies, and is especially successful at learning highly inflecting languages.

The advantage of using a lexicalized grammar for representing the word grammar is twofold. First, instead of learning a lexicon and a set of rules, it reduces the learning to only learning a lexicon with richer structure. Second, it can easily represent irregularities conditioned on the lexical item. The model uses a simple Bayesian learner. Starting with an empty lexicon, the model updates the lexicon iteratively using the current input and the lexicon.

After outlining the learning model, we will present results from unsupervised learning of morphology of Turkish, a highly agglutinating language, using the data from child directed speech in the CHILDES database.

Using lexico-syntactic patterns for automatic antonym extraction
Anna Lobanova, Jennifer Spenader, Tom van der Kleij (Artificial Intelligence, University of Groningen)

We present results on automatically extracting antonym pairs from the 74 million Dutch CLEF corpus using lexico-syntactic patterns. Previous corpus work on antonyms used manually selected lexico-syntactic patterns for the classification of existing antonyms (Jones 2002, Willners 2001). We identify productive lexico-syntactic patterns automatically using different sets of seed pairs, e.g. arm-rijk (poor-rich). The most frequent patterns, e.g. tussen arm en rijk (between poor and rich), are then used iteratively to find new candidate pairs. Search patterns are weighted by dividing the total number of times a pattern contained seed antonyms by the total number of times the pattern was found in the corpus. Found pairs are scored according to how often they occurred with each pattern, taking pattern weight into account. The best pairs are used as new seeds.
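The weighting step can be made concrete with a small sketch. The pattern weight follows the ratio stated above. The pair score, however, is only one plausible reading of "taking pattern weight into account": a weighted sum of co-occurrence counts, which is an assumption and is not normalised to the 0-1 range that the evaluation threshold suggests. All counts and the placeholder pattern notation (`X`, `Y`) are invented for illustration.

```python
from collections import defaultdict


def pattern_weights(seed_hits, total_hits):
    """weight(p) = matches of p that contain a seed pair / all matches of p."""
    return {p: seed_hits.get(p, 0) / n for p, n in total_hits.items() if n > 0}


def score_pairs(pair_pattern_counts, weights):
    """Assumed scoring: sum over patterns of (count with pattern * pattern weight)."""
    scores = defaultdict(float)
    for (pair, pattern), count in pair_pattern_counts.items():
        scores[pair] += count * weights.get(pattern, 0.0)
    return dict(scores)


# Toy corpus counts, invented for illustration only.
total_hits = {"tussen X en Y": 200, "X of Y": 1000}
seed_hits = {"tussen X en Y": 50, "X of Y": 20}
weights = pattern_weights(seed_hits, total_hits)

pair_pattern_counts = {
    (("oud", "jong"), "tussen X en Y"): 8,
    (("oud", "jong"), "X of Y"): 5,
    (("koffie", "thee"), "X of Y"): 10,
}
scores = score_pairs(pair_pattern_counts, weights)

# Rank candidates; the top of this list would supply the next round of seeds.
for pair, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(value, 3))
```

In this toy setting a pair seen a few times with a highly weighted pattern ranks above a pair seen more often with a pattern that rarely contains seeds, which is the effect the weighting is meant to have.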

Three native Dutch speakers were asked to classify all extracted pairs with a score higher than 0.6 as antonyms, synonyms, correlates (i.e. ontological sisters) or none of the above. For an initial set of 6 seeds, 237 pairs were extracted. 25% were judged (by majority vote) as antonyms, 35% as correlates, 38% none of the above and only 1% as synonyms. Thus, contrary to e.g. clustering techniques, our approach does not erroneously find synonyms. Many found pairs classified by human evaluators as correlates can function as opposites in a context, e.g. democratie-dictatuur (democracy, dictatorship). Future work will focus on distinguishing correlates from antonyms.

Using the transitivity of meaning for distributional methods to remedy data sparseness
Lonneke van der Plas (Département de linguistique, Université de Genève)

One of the reasons for acquiring semantically related words is to augment the coverage of available manually built resources. These resources particularly lack coverage for infrequent words. However, the methods that are generally used to acquire semantically related words from corpora, the so-called distributional methods, suffer from data sparseness and perform poorly precisely on infrequent words. Distributional approaches calculate the similarity between any two words by determining the similarity between their respective context vectors. They build on the distributional hypothesis that states that words that share the same contexts are similar. The outcome of distributional systems for any particular word is a ranked list of related words proposed by the system, the nearest neighbours.

We try to remedy the problem of data sparseness that such methods suffer from by relying on the transitivity of meaning, i.e. if A and B are semantically related and B and C are semantically related, we can infer that A and C are semantically related as well. In practice this comes down to feeding the output of the system, the nearest neighbours, as additional input to the system. The fact that both A and C are nearest neighbours of B will now add to the similarity computed between them. We show that the method that uses the original data augmented with the weighted output of the system performs better on infrequent words than the method that uses the original data as such.
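The feedback idea can be sketched in a few lines. This is a toy illustration and not the system evaluated in the talk: context vectors are plain dictionaries, similarity is cosine, and each word's nearest neighbours are added back as extra features weighted by `alpha` times their similarity. The `nn:` feature prefix, the parameters `k` and `alpha`, and the example vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k):
    """Return the k most similar other words as (similarity, word) pairs."""
    others = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    return sorted(others, reverse=True)[:k]


def augment(vectors, k=1, alpha=1.0):
    """Add each word's nearest neighbours as weighted extra features."""
    augmented = {}
    for word, vec in vectors.items():
        new_vec = dict(vec)
        for sim, neighbour in nearest_neighbours(word, vectors, k):
            if sim > 0:
                new_vec["nn:" + neighbour] = alpha * sim
        augmented[word] = new_vec
    return augmented


# A and C share no context, but both share a context with B.
vectors = {
    "A": {"obj_of:drink": 1.0},
    "B": {"obj_of:drink": 1.0, "obj_of:brew": 1.0},
    "C": {"obj_of:brew": 1.0},
}
before = cosine(vectors["A"], vectors["C"])
after_vectors = augment(vectors)
after = cosine(after_vectors["A"], after_vectors["C"])
print(before, after)
```

Before augmentation A and C have zero similarity; afterwards both carry the feature `nn:B`, so their similarity becomes positive. How strongly the fed-back neighbours should count relative to the original contexts is the job of the weighting, here reduced to the single knob `alpha`.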

Vagueness and Interaction: Notes on a Project
Raquel Fernandez (University of Amsterdam)

The computational and formal study of dialogue has by now reached a good level of maturity (as witnessed e.g. by the SemDial and SIGdial international workshop series). This body of work, however, is very seldom combined with research on language acquisition and language change---phenomena where dialogue interaction plays a crucial role. We will present the main ideas of a new project that aims at using precise formal and computational models to bring together these three areas of linguistic inquiry---dialogue, acquisition, and change---in order to explain their interplay and commonalities.

As a case-study, we focus on `vague expressions' (such as `tall' or `cheap'). One of the aspects that makes these expressions particularly interesting from the perspective of linguistic interaction is their highly context-dependent nature. For instance, to say that `Paula is tall' may be judged appropriate if her height is 1.9m, but also so if her height is 1.5m and she is a 7-year-old. In an interactive setting (amongst humans or in human-machine communication), this context-sensitivity requires the coordination of a common standard of comparison that changes with the context of utterance and is often determined on the fly. The long-term goal of our research is to develop a computational approach that can account for how vague expressions are interpreted in conversation, how they are acquired during development, and how their meaning becomes shared within a linguistic community. In this talk we will present the first steps towards this aim, taking as starting point empirical data extracted from corpora.

PhraseNET: a proposal to identify and to extract phraseological units from electronic texts (no show)
J.L. De Lucca (Politechnical University of Valencia)

This article describes a new method to identify and extract Spanish phraseological units from textual corpora. There are different methods of classifying phraseological units, among which we highlight the one proposed by Corpas Pastor (1996). This author, starting from a wide conception of phraseology, classifies Spanish phraseologisms into three different categories: collocations, locutions and phraseological enunciated units (fixed forms and routine formulas).

A detector and extractor of phraseological units can be defined as a group of computer programs that recognizes and extracts the phraseological units (UFS) that appear in a document. We present a system implementation for the automatic extraction of UFS from textual corpora based on software designed in a project. This software is known as PhraseNET and extracts metalinguistic statements and contexts from electronic documents, using statistics and search algorithms. The extraction is done sentence by sentence and the proposed architecture is based on statistics and relational algebra. The main characteristic of this architecture is the scarce use of linguistic resources, which are replaced by search algorithms and statistical methods. Four experiments are presented in this paper to show the extraction of Spanish phraseological units from textual corpora. This application demonstrates the usefulness of the software architecture by comparing the results of this system with manual extraction.

Retrieval on nonstandard documents: Development of a fuzzy search plugin for Mozilla Firefox (withdrawn)
Thomas Pilz and Wolfram Luther (Universität Duisburg-Essen)

The interdisciplinary project on rule-based search in text databases with nonstandard orthography develops mechanisms to ease working with unstandardized electronic documents. These may contain various types of nonstandard spellings like historical spelling variants, recognition errors, typing errors or idiosyncratic transcription. One of our goals is the development of a fuzzy search module for those texts. There are several distance measures that have proven to give excellent results in fuzzy searching. We do not want to compete with these measures but to provide a slim and easy-to-use way to support retrieval without specific preconditions such as phonetic transcription. Our aim is to provide researchers as well as interested amateurs with the ability to search in the web documents they are working with every day. The trainable stochastic distance measure we developed was embedded into a plugin for the popular browser Firefox. Its calculation is fast enough to allow instant fuzzy search and highlighting. By analyzing the page to be searched for the amount and types of variant spellings included and adjusting the measure accordingly, we expect to increase the quality of retrieval. A Bayesian classifier has already been shown to detect more than 70 percent of spelling variants with less than 3 percent of false positives.

Sentence complexity in French: a corpus-based approach (withdrawn)
Ludovic Tanguy and Nikola Tulechki (Université de Toulouse le Mirail)

Measuring sentence complexity has been investigated in many different fields of (computational) linguistics. These include readability measures, language learning, text profiling and classification, controlled language checking, etc. Researchers in these areas have developed many sets of linguistic features meant to be automatically measured in corpora.

Based on an investigation of these different studies, we identified 52 features associated with complexity. These features range from the simple word length used in readability measures to specific syntactic structures such as the level of NP embedding used in controlled language checkers. These features also differ widely in the types of NLP techniques and resources that are needed to compute them: some only need a simple word tokeniser, while others need a syntactic parser.

We developed specific programs for these features and ran them on a 2 million word French corpus. Based on their statistical redundancy, we established a subset of 21 independent features, privileging low-cost techniques. We then applied a dimensionality reduction technique (principal component analysis) to identify more complex relations between these features. Based on the results we propose a five-dimensional model of the complexity of French sentences. These dimensions give us insight into sentence complexity, through the identification of 5 independent characteristics of sentences: sentence length, nominal vs verbal character, syntactic density, lexical complexity and grammatical subject complexity. The features combined in some of these five dimensions can then be used for text profiling techniques with better results than isolated features.

A Tale of Two Resources: Stochastic Modeling of Hebrew Morphosyntactic Phenomena through a Fuzzy Alignment of Morphological and Syntactic Resources and Annotation Schemes (withdrawn)
Reut Tsarfaty (University of Amsterdam) and Yoav Goldberg (Ben Gurion University of the Negev)

Supervised methods make use of annotated data to learn a model that predicts the structural description of unseen text. A supervised statistical parser uses sentences paired with syntactic descriptions to learn sentence structure, and a morphological analyzer decomposes words into their constituent morphemes. For morphologically ambiguous languages the morphological analyzer may propose a set of analyses, which may be augmented with a probabilistic component to discriminate between them.

These morphological and syntactic tasks are not orthogonal. Languages with productive morphology have many different surface forms, resulting in sparse word-tag distributions. Morphological resources such as wide-coverage dictionaries and FSFTs may be used to increase the treebank's lexical coverage, but such a combination of resources assumes that they agree on the analyses at the level at which they interface. For Modern Hebrew this assumption turns out to be strictly false.

We argue that such a misalignment of resources is not a technical matter. We show that for Modern Hebrew a 1:1 mapping is impossible on linguistic grounds. Semitic phenomena (e.g., intermediate verbs and participials) introduce a level of ambiguity precisely at the morphosyntactic interface, and the different perspectives on the data lead to annotation discrepancies that cannot be reconciled.

Here we propose the novel use of a stochastic fuzzy match learned over an integrated, two-way, morphosyntactic annotation, corresponding to the different linguistic perspectives. We show that the stochastic layer allows us to streamline the use of a treebank and a morphological resource for combating data sparseness, and the parsing results we obtain improve significantly as a result.

Accepted poster presentations (17)

One poster (Zhao) was converted to a talk.

COMPOST Dutch - the Best POS Tagger for Dutch Spoken Language
Jan Raab and Eduard Bejcek (Charles University in Prague)

In this work we present a better than state-of-the-art POS tagger developed for Dutch. It is based on the Averaged Perceptron algorithm (Collins, 2002) and trained on the CGN corpus (Dutch part only so far). The accuracy on randomly selected eval-test data is 97.2%, which represents a 1% error reduction compared to previous work (Bosch et al., 2006).
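For readers unfamiliar with the averaged perceptron, the following minimal sketch shows the core idea: weights are updated only on tagging errors, and the final model is the average of the weight vectors over all training steps. It is not the COMPOST implementation; the feature set, the three-tag inventory and the two toy sentences are assumptions made for illustration, and each token is tagged independently rather than with a sequence search.

```python
from collections import defaultdict


def features(words, i):
    """Simple contextual features for the token at position i (illustrative)."""
    w = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    return ["w=" + w.lower(), "suf=" + w[-2:].lower(), "prev=" + prev.lower()]


def score(weights, feats, tag):
    return sum(weights.get((f, tag), 0.0) for f in feats)


def train(sentences, tags, epochs=5):
    """Return the weights averaged over all training steps."""
    weights = defaultdict(float)  # current weights
    totals = defaultdict(float)   # accumulated weight * number of steps held
    stamps = defaultdict(int)     # step at which each weight last changed
    step = 0
    for _ in range(epochs):
        for words, gold in sentences:
            for i, g in enumerate(gold):
                step += 1
                feats = features(words, i)
                guess = max(tags, key=lambda t: score(weights, feats, t))
                if guess == g:
                    continue
                for f in feats:
                    for tag, delta in ((g, 1.0), (guess, -1.0)):
                        key = (f, tag)
                        # Credit the old value for the steps it was in force.
                        totals[key] += (step - stamps[key]) * weights[key]
                        stamps[key] = step
                        weights[key] += delta
    return {key: (totals[key] + (step - stamps[key]) * w) / step
            for key, w in weights.items()}


def tag(weights, words, tags):
    return [max(tags, key=lambda t: score(weights, features(words, i), t))
            for i in range(len(words))]


# Toy training data with an invented tag set (not the CGN tag set).
TAGS = ["DET", "N", "V"]
DATA = [(["de", "hond", "blaft"], ["DET", "N", "V"]),
        (["een", "kat", "slaapt"], ["DET", "N", "V"])]
model = train(DATA, TAGS)
print(tag(model, ["de", "kat", "blaft"], TAGS))
```

Averaging matters because the last weight vector of a plain perceptron depends heavily on the final few training examples, whereas the average is more stable on unseen data.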

Although the improvement in accuracy may not be significant, COMPOST Dutch has a few more benefits. The algorithm is implemented as a stand-alone program, which is easy to use and quite fast: it can process about 100k words per minute. COMPOST is freely downloadable from our website for research purposes. It works under both Linux and Windows.

Besides these properties, we claim that the accessibility of linguistic tools matters most for practical use (provided they perform reasonably well). Therefore we made COMPOST Dutch accessible as an online tool with no need to install any software.

Finally, we present a few suggestions on how to adapt this system to written data without retraining. Problems of punctuation and capitalization can be treated with a special script.

Discourse Representation Theory as monadic computations
Gianluca Giorgolo, Christina Unger (UiL-OTS, Universiteit Utrecht)

In this paper we show how Discourse Representation Theory (DRT) can be interpreted as a computation with side effects, and how this allows us to mechanically transform a Montague-style lexicon into a dynamic lexicon. The main advantage of this approach is the possibility to exploit already existing Montague-style lexical resources for applications involving discourse level processing. We follow a suggestion made in [1] and interpret Discourse Representation Structures (DRSs) as chunks of stateful computations, i.e. computations that produce a value by possibly reading and modifying an environment. In the case of DRSs the environment is represented by discourse referents and the accessibility relation between DRSs, and we will compactly implement the environment as a set of discourse referents arranged in a tree together with a pointer. In this way accessibility is reduced to the dominance relation between tree-nodes. We formally represent this notion with what is called a monad, defined as a triple consisting in this case of a mapping from any type of value (e.g. entities, truth values, sets, etc.) to functions from states to the same type of value together with a possibly new state (type theoretically ∀a. a -> State -> <a, State>, where State represents the type of states), an operation unit to lift values to monadic computations and an operation bind to compose these computations. Using these building blocks we can obtain a dynamic lexicon by lifting the purely compositional components of the lexicon using unit and adding the state changing effects where needed.
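The triple described above can be sketched in a few lines. Python is used here for brevity even though it has no static types, and the state is simplified to a flat tuple of discourse referents (the tree and the pointer are left out). The helper `new_referent` and the string encoding of conditions are assumptions made for this example, not part of the authors' system.

```python
def unit(value):
    """Lift a plain value into a computation that leaves the state untouched."""
    return lambda state: (value, state)


def bind(m, f):
    """Run m, pass its value to f, and run the resulting computation
    in the state that m produced."""
    def run(state):
        value, new_state = m(state)
        return f(value)(new_state)
    return run


def new_referent(name):
    """A state-changing effect: introduce a discourse referent."""
    return lambda state: (name, state + (name,))


# "a man walks": the indefinite introduces a referent, while the purely
# compositional predicates are lifted unchanged with unit.
a_man_walks = bind(new_referent("x"),
                   lambda x: unit(["man(%s)" % x, "walk(%s)" % x]))

conditions, referents = a_man_walks(())
print(conditions, referents)  # ['man(x)', 'walk(x)'] ('x',)
```

Only `new_referent` touches the state; everything built with `unit` stays a pure value, which is what makes the lifting of a static lexicon mechanical.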

Extracting nouns from a tagged corpus
Lars Hellan and Lars G Johnsen (NTNU)

We report on a development of the project ENOK, funded by the Norwegian Research Council, which was dedicated to constructing three classifiers for nouns: animate vs. inanimate, relational nouns, and mass terms vs. count terms. The classifier for animacy was based on the syntactic information given in the Oslo corpus of tagged texts, limited to the functions subject, object and prepositional object.

A Bayesian model classifying nouns into animate/inanimate/unknown from those syntactic functions reached an accuracy exceeding 90% depending on the decision level. However, a high decision level left a great part of the nouns unclassified. The price of high accuracy was a small catch: only 6000 out of a possible 70000 nouns were classified. The challenge is to extend the coverage while keeping the accuracy intact.
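The trade-off between accuracy and coverage can be made concrete with a small naive Bayes sketch in which a noun is labelled "unknown" whenever the best posterior falls below a decision level. The abstract gives no model details, so the multinomial formulation, the smoothing, the function names and the toy counts are all assumptions for illustration.

```python
from math import exp, log


def train(examples, smoothing=1.0):
    """examples: list of (label, [syntactic functions observed for a noun]).
    Returns log priors and add-one smoothed log likelihoods."""
    labels = sorted({label for label, _ in examples})
    vocab = sorted({f for _, funcs in examples for f in funcs})
    prior, like = {}, {}
    for label in labels:
        rows = [funcs for lab, funcs in examples if lab == label]
        prior[label] = log(len(rows) / len(examples))
        counts = {f: smoothing for f in vocab}
        for funcs in rows:
            for f in funcs:
                counts[f] += 1
        total = sum(counts.values())
        like[label] = {f: log(c / total) for f, c in counts.items()}
    return prior, like


def classify(model, functions, decision_level=0.9):
    """Return the most probable label, or "unknown" if its posterior
    is below the decision level. Unseen functions are ignored."""
    prior, like = model
    logp = {label: prior[label] + sum(like[label].get(f, 0.0) for f in functions)
            for label in prior}
    top = max(logp.values())
    norm = sum(exp(v - top) for v in logp.values())
    post = {label: exp(v - top) / norm for label, v in logp.items()}
    best = max(post, key=post.get)
    return best if post[best] >= decision_level else "unknown"


# Invented observations: the functions in which four nouns were seen.
examples = [("animate", ["subj", "subj", "obj"]),
            ("animate", ["subj", "subj"]),
            ("inanimate", ["obj", "pobj", "pobj"]),
            ("inanimate", ["obj", "pobj"])]
model = train(examples)
print(classify(model, ["subj", "subj", "subj"]))  # animate
print(classify(model, ["obj"]))                   # unknown
```

Lowering `decision_level` classifies more nouns at the cost of more errors, which mirrors the 6000 versus 70000 figures reported above.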

The classifier for animacy is being improved by incorporating verbs into the model along with corresponding syntactic functions. While this makes the context for classification more specific, it also makes the classifier more vulnerable to the effects of sparse data.

Both classifiers are compared with respect to accuracy and coverage.

On the Automatic Extraction of Absolute Phrases from a Tagged Corpus: A Report on the ICE-GB
James Vanden Bosch

Quirk et alia (1985) claimed that, "apart from stereotyped phrases, absolute clauses are formal and infrequent." But what counts as "formal"? And what counts as "infrequent"? In my larger study of this grammatical construction, I have found that absolute phrases are quite common and that they turn up in many genres. Apart from Kortmann (1991), it is quite difficult to find information on the frequency of the absolute phrase in spoken and written English. In this paper I will present a summary and discussion of my analysis of the automatic extraction of absolute phrases from the ICE-GB, a POS-tagged corpus. Of the six extraction formulas provided by Nelleke Oostdijk in these extractions, two formulas produced no absolute phrases at all. The remaining four formulas, however, had quite high rates of success, averaging just over 88% success in isolating absolute phrases in that corpus.

These extraction formulas, however, are by their very nature incapable of finding two broad categories of the absolute phrase, namely, the absolute phrase with a null verb form and the absolute phrase that exists as a sentence fragment, standing alone and punctuated as a sentence. Regardless, these four extraction formulas can provide a good deal of accurate information about the frequency of this construction in contemporary spoken and written English.

SpatioTemporal Analysis in a Multimodal, Multidocument, Multilingual Context
Ineke Schuurman (Centrum voor Computerlinguistiek, K.U.Leuven)

In STEX, our spatiotemporal analysis system, we take full advantage of the fact that the origin of the texts we analyse is known. This enables us not only to detect that a particular event took place at a certain time in a certain place, but also to link this event with a more precise date and a position on a contemporary map and, if applicable, to give the name that place has nowadays. In this respect our system differs from well known approaches like TimeML and SpatialML. Another difference is that in STEX the geospatial and temporal aspects are connected. Grice's Cooperative Principle, as reflected in the well-known maxims (often paraphrased as "Don't say too much, don't say too little"), plays a major role in the process of disambiguation and of stamping or grounding relevant expressions. For STEX to be able to be cooperative, it needs a certain amount of spatiotemporal world knowledge. STEX differs from MiniSTEx, as used in SoNaR, in several ways as it pays more attention to tense & aspect and is also extended to another language/tradition/... In the AMASS++-project (Advanced Multimedia Alignment and Structured Summarization, IWT, 2007-2011) STEX plays a role in linking texts with video/stills (multimodal aspect), in linking documents in the same language (multidocument aspect) through the timelines and/or geographic map locations involved, and in doing the same when several languages are involved (at the moment Dutch (Flanders, the Netherlands) and English (UK, USA)), its multilingual aspect. All these elements come into play when a multimodal, multilingual summary is made.

TACTiCS: a Tool for Analyzing and Categorizing Texts using Characteristics of Style
Kim Luyckx and Walter Daelemans (University of Antwerp)

We present TACTiCS, a Tool for Analyzing and Categorizing Texts using Characteristics of Style, written in Python. It can be used for stylistic analysis of a text and for classifying the author or characteristics of the author (e.g. gender, age, personality) of a text. The modular system takes an automatic text categorization approach that labels documents according to a set of predefined categories. This methodology has been tested in previous studies on authorship attribution and verification, and personality prediction.

In the first module, the user submits texts with associated metadata - depending on the class to be predicted - to the web interface. The text is shallow-parsed using the Memory-Based Shallow Parser and saved in XML format. The second module performs feature construction and selection on the training data using k-fold cross-validation. Training and test instances are used for text categorization (module 3). The default learning algorithm is TiMBL (Tilburg Memory-Based Learner) - an instance of k-nearest-neighbor classification - but different learning algorithms can be plugged into the system.

Because of the simple XML encoding of text data and a user-friendly interface, combined with state-of-the-art machine learning and text categorization technology, this tool will be useful for both computational linguists investigating computational stylometry and wanting to add their own feature construction and machine learning modules, and humanities researchers wanting to investigate specific stylometry problems with their own corpora and data.

Towards the National Corpus of Polish
Adam Przepiórkowski, Rafal L. Górski, Barbara Lewandowska-Tomaszczyk and Marek Lazinski (Polish Academy of Sciences)

This paper presents a new corpus project, aiming at building a national corpus of Polish.

For Polish, the biggest Slavic language of the EU, there still does not exist a national corpus, i.e., a large, balanced, linguistically annotated and publicly available corpus. Currently, there exist three Polish corpora which are, to various extents, publicly available. The largest, and the only one that is fully morphosyntactically annotated, is the IPI PAN Corpus, containing over 250 million segments (over 200 million orthographic words), but as a whole it is rather poorly balanced. The PWN Corpus of Polish, more carefully balanced, contains over 100 million words, of which only a 7.5 million sample is freely available for search. The PELCRA Corpus of Polish also contains about 100 million words, all of which are publicly searchable.

What makes the National Corpus of Polish project different from a typical YACP (Yet Another Corpus Project) is 1) the fact that all four partners in the project have in the past constructed corpora of Polish (including the 3 corpora mentioned above), 2) the partners bring into the project varying areas of expertise and experience, so the synergy effect is anticipated, 3) the corpus will be built with an eye on specific applications in various fields, including lexicography (the corpus will be the empirical basis of a new large general dictionary of Polish) and natural language processing (a number of NLP tools will

User-Centered Views over Ontology
Kiril Simov and Petya Osenova (Bulgarian Academy of Sciences, Bulgaria)

In the framework of two European projects, we have created a model of the relation ontology-to-text, which is ontology centered. A lexicon was created on the basis of the ontology. This lexicon contains items whose meaning corresponds to the conceptual information in the ontology. It is a starting point for the creation of annotation grammars, which connect the lexical items with their realizations in text. The relation ontology-to-text is fundamental for the semantic annotation of texts. The other tasks are ontology-based search and navigation. These tasks need additional information in the lexicon, namely the context of ontology use and the communication goal.

We will present an extension of the lexicon model which supports the contextual usage of the ontology. We encode this usage by links from the lexical item to an ontology of contexts, which includes descriptions of the usage situation of the lexical item. Each situation contains a number of participants, their background, their goals, the history of the interaction between participants, etc. Contextual use requires some operations over the ontology. For example, some concepts (and/or relations, instances) are not appropriate for use in some contexts. Thus, for the presentation of the ontology these concepts have to be "erased" from it. On the other hand, these concepts are important when the ontology is used for inference. Thus, the "erase" operation only affects ontology presentation, not ontology inference.

Using Averaged Perceptron Tagger for Word Sense Disambiguation
Eduard Bejcek and Jan Raab (Charles University in Prague)

Well-performed Word Sense Disambiguation is an important first step for other NLP tasks, such as machine translation or information retrieval. In our paper we present an Averaged Perceptron approach to WSD for Czech. This task is complicated because Czech has rich morphology, so every ambiguous lemma has numerous different word forms.

The usual problem for automatic WSD is the amount of data, since manual sense annotation is expensive and hard. We used data annotated (and to some extent also corrected) with Czech WordNet synsets (Bejcek, Mollerova, Stranak, 2006). We obtained 100,000 occurrences of annotated words in whole sentences. As a baseline we assigned the most frequent synset to each occurrence. We tried both the most frequent synset for a given word form and for a lemma, since morphological analysis had been done for the input data.
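The most-frequent-synset baseline is simple enough to sketch directly. The same function serves both variants, depending on whether it is keyed by word form or by lemma. The data layout, the function names and the toy items (forms of the Czech noun "oko" with two invented sense labels) are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def train_baseline(pairs):
    """pairs: iterable of (key, synset), where key is a word form or a lemma.
    Returns the most frequent synset for each key."""
    counts = defaultdict(Counter)
    for key, synset in pairs:
        counts[key][synset] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def accuracy(baseline, pairs):
    """Share of items whose gold synset equals the baseline guess;
    keys never seen in training count as errors."""
    pairs = list(pairs)
    hits = sum(1 for key, gold in pairs if baseline.get(key) == gold)
    return hits / len(pairs)


# Invented items: (word form, lemma, synset label).
train_items = [("oko", "oko", "eye"), ("oko", "oko", "eye"),
               ("oka", "oko", "mesh")]
test_items = [("okem", "oko", "eye"), ("oko", "oko", "eye")]

by_form = train_baseline((form, syn) for form, _, syn in train_items)
by_lemma = train_baseline((lemma, syn) for _, lemma, syn in train_items)

print(accuracy(by_form, ((form, syn) for form, _, syn in test_items)))     # 0.5
print(accuracy(by_lemma, ((lemma, syn) for _, lemma, syn in test_items)))  # 1.0
```

In this toy case the form-based baseline fails on the unseen form "okem" while the lemma-based one still answers, which is the kind of sparsity a morphologically rich language produces.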

For our experiments we used the system Morče. It is based on the Hidden Markov Model and the Averaged Perceptron (Collins, 2002). It was originally developed for morphological tagging (Raab, 2006), but as a classifier it can be used for WSD, too.

We ran three experiments and exceeded the baselines in all of them. First we took the data as it is (i.e. with manual morphological annotation); we achieved 94.2% (with an 87.9% baseline). Then we used bare input data and assigned synsets to word forms. Our result is 90.7%, compared with a 61.7% baseline. Last, we enriched the input with a morphological tagger and achieved 94.2% (the baseline remained 61.7%).

An Arabic text categorization system based on support vector machines (SVM) (no show)
Abdemadjid Achit (CRSTDLA Research Center) and Hamid Azzoune (USTHB University)

In this paper, we report the application of the machine learning method SVM to classify Arabic text documents into predefined thematic categories. As the number of electronic documents available increases, there is a growing need for software tools to assist users in searching, filtering, and classifying documents. Text categorization deals with the task of automatically assigning one or multiple predefined category labels to documents in natural language. In our work, we used Support Vector Machines (SVM), a statistical machine learning method which has been applied to many real world problems such as text categorization, image/face recognition, hand-written digit recognition and bio-informatics. Applying the SVM approach to a particular language involves resolving a number of design questions. The processing chain for obtaining a learned classifier from a collection of electronic text documents includes several steps: tokenization, stopword removal, stemming, term frequency computation, term weight computation, feature selection/extraction, and computation of the decision function parameters, for instance using the SMO algorithm.
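The first steps of such a processing chain can be sketched as follows. This is a simplified illustration, not the authors' system: stemming, feature selection and the SVM training itself are omitted, the TF-IDF weighting is one common choice among several, and the English stopword list and documents are placeholders (the regular expression also matches Arabic letters, since `\w` is Unicode-aware in Python 3).

```python
import re
from collections import Counter
from math import log


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def preprocess(text, stopwords):
    """Tokenize and drop stopwords (stemming would follow here)."""
    return [t for t in tokenize(text) if t not in stopwords]


def tfidf(docs, stopwords=frozenset()):
    """Return one {term: weight} dict per document, with
    weight = term frequency * log(N / document frequency)."""
    bags = [Counter(preprocess(d, stopwords)) for d in docs]
    df = Counter(term for bag in bags for term in bag)
    n = len(docs)
    return [{t: tf * log(n / df[t]) for t, tf in bag.items()} for bag in bags]


# Placeholder stopword list and documents.
STOPWORDS = {"the", "in", "a"}
docs = ["The match ended in a draw",
        "The bank raised the interest rate",
        "The bank sponsored the match"]
vectors = tfidf(docs, STOPWORDS)
print(sorted(vectors[0].items()))
```

The resulting sparse vectors are what a learner such as an SVM would take as input; terms that occur in every document receive weight zero and so do not influence the decision function.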

Whereas an extensive range of methods has been applied to English text categorization, relatively few have been applied to Arabic. In addition, there is a real lack of publicly available Arabic linguistic resources, such as corpora, stemmers and morphological analysers, for implementing and evaluating Arabic text categorization systems. In this paper, we present our experience in developing and implementing this useful application.

Persian Text Segmentation and Tokenization (no show)
Soheila Kiani, Mehrnoush Shamsfard (Shahid Beheshti University)

Text segmentation is the process of recognizing the boundaries of text constituents such as sentences, phrases and words. This task poses different problems in different natural languages, depending on linguistic features and the prescribed form of writing. In Persian, word segmentation (also known as tokenization) poses more problems than in Indian-derived languages because of differing writing styles, but fewer problems than in Chinese-derived languages because spaces exist. Persian has different prescribed forms of writing, which differ in the style of writing words, the use or omission of spaces, and the use of various forms of characters. There are one to four written forms for each character according to its place in a word. The space is not a deterministic delimiter: it may appear within a word or between words, or may be absent between some consecutive words. In the last case Persian is similar to some Asian languages, such as Chinese, which have no spaces between words. Our proposed approach combines dictionary-based and rule-based methods and uses spaces, punctuation marks, morphological rules and the written forms of letters to determine word boundaries in Persian and to convert the various prescribed forms of writing into a single standard one. The developed tokenizer determines word boundaries, concatenates the separated parts of a single word and separates individual words from each other. It also recognizes multi-part verbs, numbers, dates, abbreviations and some proper nouns. This paper describes the challenges in Persian tokenization and discusses our solutions. The experimental results show high accuracy in tokenization.

Word Sense Disambiguation: An Extended Approach (no show)
Yasaman Motazedi and Mehrnoush Shamsfard (Shahid Beheshti University)

In this paper a new hybrid Word Sense Disambiguation method is introduced. This method is applied in an English to Persian translation system called PEnTrans. The algorithm disambiguates each word in a sentence or phrase without any prior knowledge about the senses of other words or the domain of the text. The novelty of our approach lies in two parts: a new extension to the Lesk WSD algorithm and a new scoring measure. The new algorithm combines grammatical knowledge derived from parsing the sentence, lexical knowledge extracted from WordNet and semantic knowledge obtained from both WordNet and eXtended WordNet. In our extension to the Lesk algorithm, we use the word, its POS tag, its synsets, its gloss, its antecedents (up to two levels) and their tags and glosses to measure the appropriateness of each word sense. The scoring module assigns weights to each of the above parameters and calculates a primary score for each word sense by a linear function. The score reflects the similarity between the parameters of the word sense and its neighbors in the sentence. The primary score is reduced by a penalty if there is any conflict between the POS or WSD tags of the compared elements. The sense with the maximum score is taken as the most probable sense for the current word. The experimental results show the advantages of this method. In this paper, after a brief description of WSD methods, we introduce our proposed approach to disambiguating English words in a sentence and describe its components.

Word to Phoneme Converter for Cross-scripts (no show)
Kyaw Zar Zar Phyu and Khin Mar Lar Tun (University of Yangon, Myanmar)

An automated word-to-phoneme (W2P) conversion module is a crucial requirement for many real world applications in speech processing and natural language processing. Our goal is to construct a good W2P system that can automatically generate accurate phonemic representations for a word. Here, we take advantage of our innovative bilingual pronouncing dictionary to generate phoneme strings for both scripts, Myanmar and English. The bilingual pronouncing dictionary is constructed following the Myanmar orthographic book order. We use the phonemic transcription system for conversion rather than the phonetic transcription system. All of the rules used in our work for the transcription follow Myanmar phonological rules. The entries are categorized into groups by similar pronunciation. The converter takes a Myanmar word written in English and generates the pronunciation for it. It handles all Myanmar names written in English. It can also generate phoneme strings for a Myanmar query word when needed. The performance of the converter is measured in terms of accuracy. We tested our method on two data sets. The first data set, STU-NAMES, is a name collection of over 5000 bachelor and master students from our university. The second, NEWS, is a list of over 2000 names collected from newspapers, magazines and journals. We found that 90% of the names in the first set and 95% of the names in the second set have a correct phonemic representation.

Addressing Subjectivity in Thematic Classification: Using Cluster Analysis to Derive Taxonomies of the Thematic Conceptions of the Prose Fiction of Wilkie Collins (withdrawn)
A A Omar (Newcastle University, United Kingdom)

In spite of the continued importance of thematic classification in critical theory, the issue of objectivity remains problematic. The notion that there are different criteria for grouping literary texts in relation to theme has long been a central argument in literary criticism. Traditionally, thematic classifications have been generated by what will henceforth be referred to as the philological method, that is, by individual researchers' reading of printed materials and the intuitive abstraction of generalizations from that reading. Collectively, studies of this kind have been criticized for their lack of objective, replicable, and reliable methods.

In the face of the problem mentioned above, the study proposes automated text classification (ATC) based on the lexical frequency of occurrences using word-based document representations called bag of words (BOW). For concreteness of exposition, the discussion is based on an artificially constructed corpus of 69 texts. These are the novels, collections of short novels and individual short stories of Wilkie Collins. The main idea is to perform a document ranking wherein cluster information is used within a graph-based framework. The study takes the form of a case-study design, with in-depth analysis of multivariate statistical techniques, particularly cluster analysis, and their feasibility in generating an automated thematic classification of the prose fiction of Collins. The study is in 6 main parts. Part 1 is Introduction. Part 2 is Methodology. Part 3 is Data creation. Part 4 is Data analysis. Part 5 is Interpretation of the results. Part 6 is Conclusion.

French Tough-Adjectives: Getting Easier with Verbs (withdrawn)
Anna Kupsc (Université Michel de Montaigne)

The paper presents a syntactic lexicon of French adjectives automatically extracted from a treebank. The corpus is relatively small (about one million words), which does not allow for the application of statistical techniques, but it contains rich grammatical annotations which have been validated by human experts. For creating the basic lexicon, we rely on corpus annotations and linguistic knowledge. We identify various adjectival constructions in the corpus and focus on separating their components from true arguments of adjectives.

This method works quite well for capturing elements which are locally realised but is unable to recognize arguments which have been extracted or displaced as they are not annotated in the treebank. Hence, the so-called tough-adjectives (1), which combine with a gapped VP, cannot be distinguished from adjectives which require a saturated phrase (2).

(1) La question est facile a [_VP repondre __]
the question is easy to answer
(2) Marc est long a [_VP repondre a la question]
Marc is slow to answer to the question
`It takes Marc a lot of time to answer the question.'

To address this problem, we use a syntactic lexicon of verbs (obtained from the same corpus) to verify the valence of the verb in the VP required by the adjective: the adjective is considered to belong to the `tough'-class only if an argument of the verb is missing. In particular, the paper discusses linguistic constraints on extraction, i.e., which types of dependents can be extracted from the VP complement of the adjective.

Toward a syntagmatic measure of eventhood for Italian event nouns (withdrawn)
Irene Russo and Tommaso Caselli (University of Pisa)

Providing a definition for event nouns is not an easy task: semantic analyses (Lyons, 1977) exploit a limited set of tests which are not satisfactory. The aim of this work is to suggest a measure of eventhood for Italian nouns, based on a set of syntagmatic cues chosen in a preliminary corpus study. This measure goes beyond morphologically identifiable elements and could be useful for several NLP tasks (e.g. event detection). Following Vendler (1967), we consider the eventhood of a word as a property assignable on the basis of co-occurrences: 37 verbs (e.g. annullare/ to cancel, finire/ to finish etc.) and 39 adjectives (e.g. frequente/ frequent, imminente/ imminent etc.) are retained as good indicators of eventhood. To ease the choice of event nouns and to identify polysemic alternations (e.g. assemblea/ meeting = EVENT HUMAN_GROUP), we have used a semantic lexicon resource (Ruimy et al. 2003), which presents a GL-based analysis of nouns. We selected 112 nouns from the "La Repubblica" (380M tokens) corpus according to four macro-classes reported in Table 1.

(Derivational event nouns: 57)
(Non-derivational event nouns: 15)
(Non-morphological event nouns: 30)
(Non event nouns: 10)

Eventhood is computed by applying the formula below, i.e. the sums of co-occurrence frequencies of each noun with the chosen verbs and adjectives normalized by the total frequency of the noun in four syntagmatic patterns (NOUN-VERB, VERB-NOUN, ADJ-NOUN, NOUN-ADJ).
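The formula itself is missing from this copy of the abstract. Going only by the verbal description above, one plausible reading is the ratio of a noun's co-occurrences with the cue verbs and adjectives to its total frequency in the four patterns. The sketch below implements that reading; the function name, the data layout and the counts are assumptions for illustration and are not taken from the paper.

```python
PATTERNS = ("NOUN-VERB", "VERB-NOUN", "ADJ-NOUN", "NOUN-ADJ")


def eventhood(cue_counts, pattern_totals):
    """cue_counts: pattern -> co-occurrences of the noun with a cue verb
    or cue adjective in that pattern.
    pattern_totals: pattern -> all occurrences of the noun in that pattern.
    Returns the summed cue counts divided by the summed totals."""
    total = sum(pattern_totals.get(p, 0) for p in PATTERNS)
    if total == 0:
        return 0.0  # a noun never seen in any pattern gets the lowest score
    return sum(cue_counts.get(p, 0) for p in PATTERNS) / total


# Invented counts for a single noun.
cues = {"NOUN-VERB": 30, "VERB-NOUN": 50, "ADJ-NOUN": 10, "NOUN-ADJ": 10}
totals = {"NOUN-VERB": 200, "VERB-NOUN": 300, "ADJ-NOUN": 250, "NOUN-ADJ": 250}
print(eventhood(cues, totals))  # 0.1
```

Under this reading the score lies between 0 and 1, which is consistent with the magnitudes of the values reported for individual nouns.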


Thus, according to this measure guerra (war) (0.177) is more "eventive" than decisione (decision) (0.142), followed by libro (book) (0.070) and anniversario (anniversary) (0.043).