The subset problem

An MT system implementing the notion of `linguistically possible translation' may be constructed as a series of two monolingual grammars. It is assumed that each of the monolingual grammars defines a reversible relation between phonological representations and semantic representations. The translation relation is then simply the composition of these two monolingual relations: the source language grammar maps a phonological representation to a semantic representation, and the target language grammar maps that semantic representation to a phonological representation of the target language. The logical form language is thus used as a kind of interlingua.
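The composition can be illustrated with a minimal sketch, assuming that each grammar is modelled as a finite set of (phonological, semantic) pairs. The example sentences, the logical forms, and the helper names `inverse` and `compose` are hypothetical and serve only as illustration; real grammars define infinite relations and are not enumerated like this.

```python
# Toy model: a grammar is a finite relation, i.e. a set of (phon, sem) pairs.
# All entries below are invented for illustration.

def inverse(relation):
    """Swap each pair, turning a phon->sem relation into a sem->phon relation."""
    return {(y, x) for (x, y) in relation}

def compose(r, s):
    """Relational composition: (x, z) whenever (x, y) is in r and (y, z) is in s."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}

grammar_a = {
    ("der Hund bellt", "bark(dog)"),
    ("die Katze miaut", "meow(cat)"),
}
grammar_b = {
    ("sobaka laet", "bark(dog)"),
    ("koshka myaukaet", "meow(cat)"),
}

# Translation a -> b: analyse with grammar a, then generate with grammar b.
translate_ab = compose(grammar_a, inverse(grammar_b))
```

In this toy setting the reverse direction comes for free: inverting `translate_ab` gives the same relation as composing `grammar_b` with the inverse of `grammar_a`.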

Such an approach faces a problem which has been called the subset problem in [52]. Each of the monolingual grammars (implicitly) defines a set of semantic representations. However, it need not be the case that a semantic representation defined by the grammar of language a is also defined by the grammar of language b. In general we have the situation shown in figure 5.2. In this figure, a grammar for language a relates a set of phonological representations phon_a to a set of semantic representations sem_a; the same holds for a grammar for language b. The set of semantic representations for each language is a subset of the possible semantic representations sem; however, sem_a and sem_b are in general different.
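The mismatch between sem_a and sem_b can be made concrete with a small sketch, again assuming that grammars are modelled as finite sets of (phonological, semantic) pairs. All entries are invented for illustration, and the name `untranslatable` is hypothetical.

```python
# Toy relations; all entries are invented for illustration.
grammar_a = {
    ("der Hund bellt", "bark(dog)"),
    ("die Katze miaut", "meow(cat)"),
}
grammar_b = {
    ("sobaka laet", "bark(dog)"),
}

# The sets of semantic representations (implicitly) defined by each grammar.
sem_a = {sem for (_, sem) in grammar_a}
sem_b = {sem for (_, sem) in grammar_b}

# Representations that grammar a derives but grammar b does not define:
# sentences with these meanings have no translation into language b.
untranslatable = sem_a - sem_b
```

Here `untranslatable` contains `"meow(cat)"`: the sentence is analysed by grammar a, but the composition with grammar b yields nothing for it.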

**Figure 5.2:** The subset problem

[Figure not reproduced: a diagram relating phon_a to sem_a and phon_b to sem_b, with sem_a and sem_b drawn as subsets of sem.]

The problem can be seen as consisting of two parts. The first part of the subset problem can be characterized as a difference in coverage between the source and target language grammars. The second part constitutes the logical equivalence problem.

Difference in coverage.

Logical equivalence.

There are several ways in which we could try to tackle the subset problem. What the solutions seem to have in common is that, in one way or another, the grammars of the languages between which translation is defined are put in correspondence, i.e. are tuned to each other.

For example, if we are to build a translation system between German and Russian, we could construct the monolingual grammars of German and Russian so carefully that we know the subset problem does not surface. This approach is worked out in the Rosetta system [52]. If we encounter an example in which the German grammar produces a semantic representation for which the Russian grammar does not provide a sentence, we simply change the grammar of German or Russian in such a way that there is a possible translation.

The important counterargument to this approach is that it leads to a situation in which monolingual grammars are `impure'. From a methodological and practical point of view we may require, for reasons of modularity, that each monolingual grammar be `pure', i.e. not influenced by the design of the other monolingual grammars. This argument is especially important in multilingual systems. Of course, it remains to be seen how important this modularity is in the construction of a practical system.

Instead, I propose to tune the semantic representations derived by the monolingual grammars explicitly. The tuning is defined in an extra component: the transfer component. In this case, a translation system between German and Russian is constructed as follows. The monolingual grammars are constructed in a modular way, as desired. For each of the semantic representations (implicitly) defined by the German grammar, we define how it relates to the semantic representations (implicitly) defined by the Russian grammar.

This approach to the subset problem is taken in the MiMo2 system. The semantic representations derived by the monolingual grammars are explicitly tuned to each other by a transfer component. Moreover, this transfer component is defined by a constraint-based grammar (just like the monolingual grammars). For this reason it is very easy to guarantee that the translation relation defined by this system is reversible. Furthermore, if each of the monolingual and transfer grammars is reversible, then so is the translation relation.
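The role of the transfer component can be sketched in the same toy setting, assuming that the monolingual grammars and the transfer grammar are each modelled as a finite set of pairs. This is not how MiMo2 is implemented (there each component is a constraint-based grammar); the entries and the helper names `inverse` and `compose` are hypothetical, and the sketch only illustrates how the translation relation arises from composing three relations, and how the reverse direction arises from their inverses.

```python
# Toy model: every component is a finite relation (a set of pairs).
# All entries are invented for illustration.

def inverse(relation):
    """Swap each pair of the relation."""
    return {(y, x) for (x, y) in relation}

def compose(r, s):
    """Relational composition: (x, z) whenever (x, y) is in r and (y, z) is in s."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}

# Each monolingual grammar uses its own, independently designed representations.
grammar_de = {("der Hund bellt", "bellen(hund)")}
grammar_ru = {("sobaka laet", "layat(sobaka)")}

# The transfer component tunes German representations to Russian ones.
transfer = {("bellen(hund)", "layat(sobaka)")}

# German -> Russian: analysis, transfer, generation.
de_to_ru = compose(compose(grammar_de, transfer), inverse(grammar_ru))

# Russian -> German: the same three components, each used in reverse.
ru_to_de = compose(compose(grammar_ru, inverse(transfer)), inverse(grammar_de))
```

Neither monolingual grammar refers to the other; the only place where German and Russian representations meet is `transfer`, which is what keeps the monolingual grammars modular.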