
Dependency relations

We also experimented with statistical models based on the dependency relations encoded in the dependency structure. The model assigns a probability to a parse by considering each dependency relation. For this purpose, dependency relations d are 5-tuples $d=\langle w_h,p_h,r,w_a,p_a\rangle$, where $w_h$ is the head word, $p_h$ is the corresponding part-of-speech tag taken from a small set of part-of-speech tags $\{v,n,a,adv,p,\dots\}$, $r$ is the name of the relation taken from a small set of relation names $\{\mbox{\em su,obj1,obj2,vc,mod,det},\dots\}$, $w_a$ is the argument word, and $p_a$ is its part-of-speech tag.
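For concreteness, such a tuple can be represented as follows (a minimal sketch in Python; the class name and the example words are ours, chosen for illustration):

    from typing import NamedTuple

    class Dependency(NamedTuple):
        """A dependency relation as a 5-tuple <w_h, p_h, r, w_a, p_a>."""
        w_h: str  # head word
        p_h: str  # part-of-speech tag of the head, e.g. 'v', 'n', 'a', ...
        r: str    # relation name, e.g. 'su', 'obj1', 'mod', ...
        w_a: str  # argument word
        p_a: str  # part-of-speech tag of the argument

    # e.g. the subject relation between a verb and its subject noun:
    d = Dependency(w_h="slaapt", p_h="v", r="su", w_a="kat", p_a="n")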

The probability of a parse y given a sentence x might then be defined as:

\begin{displaymath}
p(y\vert x) = {1\over Z(x)}\prod_{d\in y} p(r,w_a,p_a\vert w_h,p_h)
\end{displaymath}

For disambiguation, the normalizing factor Z(x) is the same for every parse of a given sentence and can be ignored.

Due to the occurrence of reentrancies, dependency structures are generally not trees but graphs. The product above therefore gives poor results: it has an unjustified bias against such reentrancies, because each reentrancy gives rise to an additional dependency relation, and hence to an additional factor smaller than one in the product. For this reason, we have chosen to score parse trees by the mean value of $-\log p$ over the tuples; this improved results considerably.
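A minimal sketch of this scoring scheme (Python; the names score and prob are ours, and prob(d) is assumed to return $p(r,w_a,p_a\vert w_h,p_h)$ for a dependency tuple d):

    import math

    def score(parse, prob):
        """Score of a dependency structure: the mean of -log p over its
        dependency relations (lower is better).  Taking the mean rather
        than the sum of -log p (i.e. the product of probabilities)
        avoids the bias against structures that contain additional
        dependencies due to reentrancies."""
        neglogs = [-math.log(prob(d)) for d in parse]
        return sum(neglogs) / len(neglogs)

The probability of a dependency is calculated as follows: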

\begin{displaymath}
p(r,w_a,p_a\vert w_h,p_h) = p(r\vert w_h,p_h)\cdot p(p_a\vert w_h,p_h,r)\cdot p(w_a\vert w_h,p_h,r,p_a)
\end{displaymath}

The three components are each calculated using a linear back-off strategy, where the weights are determined by frequency and diversity (formula 2.66 of [9]). The quantities we use for backing off are given in the following table; a sketch of the interpolation scheme is given after the table:

back-off level   p(r|w_h,p_h)   p(p_a|w_h,p_h,r)   p(w_a|w_h,p_h,r,p_a)
1                p(r|p_h)       p(p_a|p_h,r)       p(w_a|p_h,r,p_a)
2                p(r)           p(p_a|r)           p(w_a|r,p_a)
3                               p(p_a)             p(w_a|p_a)
4                                                  p(w_a)
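The Python sketch below illustrates such a linearly interpolated back-off chain for the first component, p(r|w_h,p_h). The class and its weighting formula are our assumptions: we take lambda = f / (f + C*u), with f the frequency of the context, u its diversity (the number of distinct outcomes observed with it), and C a constant, which is one common frequency/diversity scheme; whether it coincides exactly with formula 2.66 of [9] should be checked against that reference.

    from collections import defaultdict

    class BackoffModel:
        """Linearly interpolated back-off chain, e.g. for p(r | w_h, p_h)
        with back-off levels p(r | p_h) and p(r).  The weight of each
        level is assumed to be lambda = f / (f + C * u), where f is the
        context frequency and u its diversity."""

        def __init__(self, n_levels, C=5.0):
            self.C = C
            # per level: context -> outcome -> count
            self.counts = [defaultdict(lambda: defaultdict(int))
                           for _ in range(n_levels)]

        def add(self, contexts, outcome):
            """contexts: one context tuple per level, most specific first."""
            for level, ctx in enumerate(contexts):
                self.counts[level][ctx][outcome] += 1

        def prob(self, contexts, outcome):
            """Back off from the most to the least specific context."""
            p, remaining = 0.0, 1.0
            for level, ctx in enumerate(contexts):
                table = self.counts[level][ctx]
                f = sum(table.values())   # frequency of this context
                u = len(table)            # diversity of this context
                if f == 0:
                    continue              # unseen context: back off fully
                lam = f / (f + self.C * u)
                p += remaining * lam * table[outcome] / f
                remaining *= 1.0 - lam
            return p

    # usage, for the first component p(r | w_h, p_h):
    rel = BackoffModel(n_levels=3)
    rel.add([("slaapt", "v"), ("v",), ()], "su")
    print(rel.prob([("slaapt", "v"), ("v",), ()], "su"))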

Because the treebanks currently available to us are much too small to estimate these quantities accurately, we have chosen to estimate them using unsupervised learning. We parsed a large corpus (`de Volkskrant' newspaper text: the first four months of 1997), using the penalty rules described in the previous section as our disambiguator. This corpus contains about 350,000 sentences and 6,200,000 words. We used only those sentences that the system could analyse as a single constituent, and within a reasonable amount of time; this left the results of about 225,000 sentences. We estimated the quantity p using the best parse (according to the penalty rules) of each of these sentences. Collecting the 225,000 dependency structures took about one month of CPU time (using the high-performance computing cluster of the University of Groningen).

As table 3 shows, such a model performs much better than the baseline. Moreover, a combined model in which we simply add the rule penalties to the quantity p performs better than either model in isolation.


Table 3: Preliminary results on the cdbl10 and cdbl20 development treebanks for a number of disambiguation techniques. The baseline row lists the percentages obtained if we select for each sentence a random parse tree from the parse forest. The maximum row lists the percentages obtained if we take for each sentence the best parse tree. These two numbers thus indicate the lower and upper bounds for parse selection.

                           cdbl10                       cdbl20
technique           precision  recall  f-score   precision  recall  f-score
baseline                 62.3    63.3     62.8        58.5    59.6     59.0
log-linear               76.0    76.6     76.3        66.3    67.6     66.0
penalties                78.6    79.3     78.9        73.1    73.3     73.2
dependency rel's         78.9    79.7     79.3        69.7    71.1     70.4
heur. + dep-rel's        80.9    81.7     81.3        74.6    75.4     75.0
maximum                  89.1    90.0     89.6        83.2    84.1     83.7


