TLT 7 Abstracts
Invited Talks
-
Linguistic annotation for valence acquisition and for its evaluation
Adam Przepiórkowski
Valence acquisition is the task consisting in the automatic extraction (learning) of subcategorisation -- or argument structure -- from corpora. In this talk I will concentrate on two issues. The first issue is: how much linguistic annotation do we need for valence acquisition? Approaches range from linguistically lean, e.g., Brent's 1993 proposal to infer valence information from co-occurrences of verbs with pronouns and functional words, to more recent proposals to read valence off from lavishly annotated treebanks. I will present the results of some experiments from Polish suggesting that shallow (or partial) parsing may be as useful in this task as more difficult and less efficient deep parsing. The second issue concerns the evaluation of the results of automatic valence acquisition. The common methodology is to compare the automatic results to a manually constructed valence dictionary, but -- again on the basis of some experiments carried out for Polish -- I will point out various weaknesses of this methodology and argue for the more costly and, hence, less common corpus-based evaluation.
-
Treebanks and evolutionary simulation for explaining typological patterns
Robert Malouf
Recent work in Evolutionary Phonology (Blevins 2005, 2006, Blevins & Wedel 2008, Yu 2007, among others) has developed alternative explanations for typological universals or tendencies found across the sound systems of unrelated languages. This research emphasizes the role of patterns of language use and language change in the development of cross-linguistic patterns, rather than placing the burden of explanation on synchronic cognitive factors (i.e., Universal Grammar).
In this talk, we will review extensions of this work into the domain of morphology (Ackerman, Blevins, and Malouf to appear). We investigate the Paradigm Cell Filling Problem, a particular question for inflectional morphological systems which has received relatively little attention in the theoretical literature. Specifically, we ask: How do speakers of morphologically complex languages predict the full inflectional (or derivational) paradigms of novel words, given exposure to a small number of surface word forms? For example, a noun in Tundra Nenets can appear in 210 different inflected forms. Given exposure to one of these forms of a novel noun, how does a Tundra Nenets speaker predict the other 209 forms?
Our hypothesis is that speakers' need to solve the Paradigm Cell Filling Problem serves as a strong evolutionary pressure on language, which in turn leads morphological systems to develop in particular directions. Thus the Paradigm Cell Filling Problem is an indirect explanation for some of the typological patterns found cross-linguistically in morphological systems. In order to test our hypotheses about language development, we perform computer simulations of language evolution across many generations, to see which factors cause which patterns to arise. In this way, we treat language as a complex adaptive system and therefore link linguistic study to larger trends in the biological and social sciences (e.g., Miller and Page 2007).
As with many other complex adaptive systems, the outcome of our simulations of linguistic evolution can be highly dependent on the initial conditions. Therefore, to be reliable, our simulations need to be based on detailed and accurate information about synchronic language states. As the domain of investigation moves from morphophonology to morphosyntax, it becomes more difficult to find the information we need in conventional typological databases and lexicons. The kind of information we need can only be found in treebanks -- corpora with highly detailed linguistic annotations. Therefore, the development of large, richly annotated corpora in a variety of typologically diverse languages is crucial to the evolutionary program for explaining cross-linguistic evolutionary patterns.
Session A: Representation
-
Towards a multi-representational treebank
Fei Xia, Owen Rambow, Rajesh Bhatt, Martha Palmer and Dipti Misra Sharma
Computational, descriptive, and theoretical linguistics use both phrase structure (PS) and dependency structure (DS) to represent syntax. We believe that the next-generation treebank should be multi-representational, designed for both representations with an automatic conversion. In this paper, we highlight the assumptions made by existing PS-to-DS and DS-to-PS conversion algorithms and show the limitations of these algorithms. We then propose a new DS-to-PS conversion algorithm that outperforms existing algorithms and allows more flexibility. Our experiments and error analysis show that high-quality DS-to-PS conversion is possible if the conversion process is performed at the designing stage of treebank construction to ensure that all information we wish to represent in PS is provided in DS.
-
PASSAGE Syntactic Representation
Patrick Paroubek, Eric de la Clergerie, Sylvain Loiseau, Anne Vilnat and Gil Francopoulo
We present the PASSAGE syntactic representation based on syntactic relations, initially developed for French in the scope of national evaluation campaigns. After a brief presentation of the non-nested chunks and syntactic relations of PASSAGE, we reuse the comparison elements that Marneffe and Manning have selected to compare the Stanford typed dependencies (SD) against the GR and PARC representations, and show that PASSAGE is for a large part compatible with these representations, standing closer to GR than to SD. After a presentation of the collaborative software support for the PASSAGE representation, we conclude with some essential characteristics that a pivot representation for syntax should exhibit.
Session B: Lassy
-
Huge Parsed Corpora in LASSY
Gertjan van Noord
One of the goals of the LASSY STEVIN project (Large Scale Syntactic Annotation of written Dutch) is a syntactically annotated (manually verified) corpus of 1 million words. In addition, the full STEVIN reference corpus of 500 million words will be syntactically annotated automatically. In this paper, the potential of such huge treebanks for applications in corpus linguistics, natural language processing and information extraction is illustrated.
-
Cultivating Trees: Adding Several Semantic Layers to the Lassy Treebank in SoNaR
Ineke Schuurman, Veronique Hoste and Paola Monachesi
In a recent STEVIN project, several semantic layers are added to the manually corrected part of the Lassy treebank (1 million words). This part of the Lassy treebank is included in the SoNaR corpus, a reference corpus for Dutch (500 million words). The added layers concern Named Entity labeling, co-reference labeling, semantic role labeling, and spatiotemporal labeling.
-
The Distribution of Weak and Strong Object Reflexives in Dutch
Gosse Bouma and Jennifer Spenader
We use a syntactically annotated corpus to study the distribution of strong and weak reflexive objects in Dutch. Whereas previous work was limited to a small set of accidental reflexive verbs, we look at all transitive verbs in the corpus. We use subcategorization frames to approximate verb senses. We show that comparing the rate of pronominal usage to reflexive usage is a better predictor of strong or weak reflexive choice tendencies (giving a correlation of 33%) than considering all objects, confirming a suggestion by Haspelmath (2004). We also show that the automatic method gives results comparable to those for the semi-automatically collected data in Hendriks, Spenader, and Smits (2008).
Poster/demo session
-
Semantic Annotation of Genitive Attributes in a German Treebank
Maya Bangerter
Genitive attributes are usually tagged as such in treebanks. However, it is well known that this information is not sufficient for determining the type of relation between head nouns and attributes, as genitive attributes can express many different semantic relations. Various classifications have been proposed. For German, Helbig and Buscha use a semantically motivated typology, while Lindauer's classification is based exclusively on syntactic criteria and differentiates only between two large classes of genitive attributes. The Duden Grammatik pursues a "mixed" strategy. To my knowledge, nobody has so far proposed to apply this linguistic knowledge to a corpus. The challenge here is to come up with a classification that is both as easy to verify as Lindauer's and as fine-grained as Helbig and Buscha's. In this paper I propose a detailed annotation scheme for German genitive attributes, using the above-mentioned approaches as guidelines, and report on first insights from its application to the Smultron Treebank. The results show that a detailed annotation of genitive attributes is possible and useful.
-
To Use a Treebank or Not - Which Is Better for Hypernym Extraction?
Erik Tjong Kim Sang
We compare two processing methods for a single natural language processing (NLP) task. One uses a treebank created with a full parser while the other restricts itself to lexical and part-of-speech information. We show that for the task under investigation, hypernym extraction from text, the former does not outperform the latter. We compare the output of the two approaches and look for an explanation for this unexpected result.
-
LFG Parsebanker: A Tool for Building and Searching a Treebank as a Parsed Corpus
Victoria Rosen, Paul Meurer and Koenraad De Smedt
We present the LFG Parsebanker, a comprehensive toolkit for interactive incremental construction of a treebank as a parsed corpus. The toolkit offers an environment for batch and interactive parsing, versioning, inspection of structures, discriminant-based disambiguation, and statistics. It has recently been extended with a comprehensive structural search facility. The toolkit is used through a web interface.
-
PASSAGE Syntactic Representation
Patrick Paroubek, Eric de la Clergerie, Sylvain Loiseau, Anne Vilnat and Gil Francopoulo
We present the PASSAGE syntactic representation based on syntactic relations, initially developed for French in the scope of national evaluation campaigns. After a brief presentation of the two main elements of the representation, non-recursive constituents and syntactic relations, we reuse the comparison elements that M.C. Marneffe and C.D. Manning have selected to compare the Stanford typed dependencies (SD) against the GR and PARC representations, and show that the PASSAGE representation is closer to GR than to SD. We conclude with some essential characteristics that a pivot representation for syntax should exhibit, after having presented the existing collaborative software support for the PASSAGE representation.
Session C: Investigation
-
Similarity Rules! Exploring Methods for Ad-Hoc Rule Detection
Markus Dickinson and Jennifer Foster
We examine the role of similarity in ad hoc rule detection and show how previous methods can be made more corpus independent and more generally applicable. Specifically, we show that the similarity of a rule to others in the grammar is a crucial factor in determining the reliability of a rule, providing information unavailable in frequency. We also include a way to score rules which are not in the training data, thereby providing a platform for grammar generalization.
-
MonaSearch - A Tool for Querying Linguistic Treebanks
Hendrik Maryns and Stephan Kepser
MonaSearch is a new powerful query tool for linguistic treebanks. The query language of MonaSearch is monadic second-order logic, an extension of first-order logic capable of expressing probably all linguistically interesting queries. In order to process queries efficiently, they are compiled into tree automata. A treebank is queried by checking whether the automaton representing the query accepts the tree, for each tree. Experiments show that even complex queries can be executed very efficiently. The tree automaton toolkit MONA is used for the computation of the automata.
Session D: Exploitation
-
Constructing a Valence Lexicon for a Treebank of German
Erhard Hinrichs and Heike Telljohann
This paper describes the TüBa-D/Z valence lexicon that has been constructed in parallel with the TüBa-D/Z treebank of German. After a short introduction of the underlying treebank, the paper focuses on a quantitative analysis of the lexicon, emphasizes the importance of the lexicon for aiding consistency of annotation, and discusses the utility of such a lexicon for incorporation into other language resources and for NLP applications.
-
TePaCoC - A Testsuite for Testing Parser Performance on Complex German Grammatical Constructions
Sandra Kuebler, Ines Rehbein and Josef van Genabith
In recent years, linguistic resources have become an indispensable component for many NLP applications. Their creation, however, involves an immense amount of manual work, which makes them not only valuable, but also extremely costly. One central issue in the creation of treebanks is the standardisation of linguistic annotations rather than their representation. These decisions regarding the target representation of our standard are of vital importance, as they determine the quality, and hence the usefulness, of present as well as of future language resources (since many annotation schemes are used with minimal adaptation for a variety of languages). Unfortunately, there is no straightforward way to assess the quality and suitability of different linguistic annotations. This paper makes a contribution to the problem by providing a resource for testing the impact of different data structures on a well-defined task. We focus on the design of syntactically annotated corpora and its effect on PCFG parsing.
For a fair and unbiased investigation of the impact of treebank annotation schemes on parsing results, we have to resort to human evaluation, which is time-consuming and thus can be applied to small data sets only. Therefore the data in the testsuite has to be chosen carefully.
In the paper we present the testsuite and describe the grammatical phenomena covered in the data. We discuss the different annotation strategies used in the two treebanks to encode these phenomena and present our error classification of potential parser errors.
-
A Data-Driven Dependency Parser for Romanian
Mihaela Calacean and Joakim Nivre
We present the first data-driven dependency parser for Romanian, which has been developed using the MaltParser system and trained and evaluated on a dependency treebank for Romanian developed within the RORIC-LING project. The parser achieves a labeled attachment score of 88.6% (unlabeled 92.0%) when evaluated on held-out data from the treebank. We present a partial error analysis, focusing on accuracy for different parts of speech and dependencies of different length.
Session E: Annotation
-
Automatic Annotation of Morpho-Syntactic Dependencies in a Modern Hebrew Treebank
Noemie Guthmann, Yuval Krymolowski, Adi Milea and Yoad Winter
Morpho-syntactic dependencies between sentence constituents are an inseparable part of syntactic analysis, in particular in Semitic languages. In those languages, because of the relatively free order of certain constituents, morpho-syntactic agreement features are sometimes the main clue for computational parsing models. Despite their centrality for syntactic analysis, morpho-syntactic dependencies have so far not been annotated in Hebrew resources. This paper describes the development and implementation of the morpho-syntactic dependency scheme used in the Modern Hebrew Treebank (MHT) project. The annotation scheme for dependencies is based on familiar but non-trivial grammatical rules for Modern Hebrew. These rules are used for two purposes. The first purpose is to annotate the morpho-syntactic dependencies between nodes in the treebank. The second purpose is to use the generated dependencies for automatically annotating agreement features of compound constituents. The rules are described in XML format and implemented using Python scripts that were run on the manually annotated MHT to produce MHT2, a version of the treebank that includes morpho-syntactic dependencies. A sample of the annotated dependencies was manually evaluated, which showed high accuracy of the automatic scheme. Errors detected mostly resulted from errors in the original syntactic annotation of the MHT, without dependency annotations. Thus, the development of the dependency scheme and its automatic implementation also proved helpful in improving the quality of the manual treebank annotation. A similar methodology is expected to be viable for other treebanks without dependency annotations, and especially for the Penn Arabic Treebank.
-
A Quechua-Spanish Parallel Treebank
Annette Rios Gonzales, Anne Göhring and Martin Volk
Most treebank work in the past has focused on European and Asian languages. Now we want to focus on a very different language, Quechua, for which only a few NLP resources exist. A Quechua-Spanish bilingual corpus was compiled in order to create a parallel treebank for these two languages. Since Quechua is a strongly agglutinative language, we have decided to annotate the Quechua treebank on morphemes rather than words. For this reason, we developed a morphological analyzer that segments the Quechua words automatically. As for the syntax trees, we argue for Role and Reference Grammar (RRG) as a suitable grammar formalism for Quechua. For the Spanish part of the parallel treebank, we used a modified version of the AnCora tagsets with their respective guidelines. The alignment of Quechua to Spanish was challenging due to their different syntactic structures; the annotation and alignment tools we used proved to be suitable to a certain extent but need to be further adapted.
-
Extracting and Annotating Wikipedia Sub-Domains
Gisle Ytrestøl, Stephan Oepen and Daniel Flickinger
We suggest a simple procedure for the extraction of Wikipedia sub-domains, propose a plain-text (human and machine readable) corpus exchange format, reflect on the interactions of Wikipedia markup and linguistic analysis, and report initial experimental results in parsing and treebanking a domain-specific sub-set of Wikipedia content.