Porting a Stochastic Part-of-Speech Tagger
to Swedish

Douglass R. Cutting

Xerox Palo Alto Research Center (PARC) and
Swedish Institute of Computer Science (SICS)

Abstract

The Xerox Part-of-Speech Tagger (XPOST) claims to be practical. One aspect of practicality as defined here is reusability. Thus it is meant to be easy to port XPOST to a new language. To test this, XPOST was ported to Swedish. This port is described and evaluated.

Practical Part-of-Speech Tagging


In previous work on part-of-speech tagging, a practical part-of-speech tagger was defined as one with the following set of properties [Cutting et al. 1992]: it should be robust, efficient, accurate, tunable, and reusable. The accuracy requirement is stated, and illustrated, by the following tagged sentence:

A     tagger  should assign the correct part
det/2 n       modal  v      det adj/2   n/2

of   speech to     every word in     the text
prep n      prep/2 det   n/2  prep/4 det n.


The author previously helped construct the Xerox Part-of-Speech Tagger (XPOST) in an attempt to meet these criteria. The present paper first reviews XPOST in the light of recent Scandinavian work on part-of-speech tagging. It then describes the author's experiences porting XPOST to Swedish while visiting the Swedish Institute of Computer Science (SICS).

Stochastic Part-of-Speech Tagging

Stochastic part-of-speech taggers operate by constructing a probabilistic model of text, estimating the probabilities of the model (training), and finally using the trained model to assign parts of speech to previously unseen text. The models employed typically contain two sorts of probabilities: transition probabilities and symbol probabilities. Transition probabilities are recorded for sequences of tags, usually pairs, and indicate the probability of that sequence occurring in text, e.g. the probability that a determiner is followed by a noun. Symbol probabilities record the likelihood that a given input item, typically a word, assumes a given part of speech, e.g. the probability that "bank" is a verb. Given an input sequence (e.g. a sentence) and these two sets of probabilities, one may compute the probability of each possible tag assignment by multiplying all the applicable symbol and transition probabilities. The tag assignment with the highest such product is selected as most likely. This simple methodology has been shown to work quite well [Church 1988].
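
This selection can be made efficiently with dynamic programming (the Viterbi algorithm). The following minimal sketch, in Python, illustrates the computation just described; the dictionaries start, trans, and symbol are assumed to hold trained probabilities (missing entries count as zero), and none of this is XPOST's actual code.

def most_likely_tags(words, tags, start, trans, symbol):
    # best[t] = (probability, tag path) of the best sequence ending in t
    best = {t: (start.get(t, 0.0) * symbol.get((words[0], t), 0.0), [t])
            for t in tags}
    for w in words[1:]:
        best = {t2: max(((p * trans.get((t1, t2), 0.0)
                          * symbol.get((w, t2), 0.0), path + [t2])
                         for t1, (p, path) in best.items()),
                        key=lambda cand: cand[0])
                for t2 in tags}
    return max(best.values(), key=lambda cand: cand[0])[1]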

A weakness of this approach is that symbol probabilities are difficult to estimate for words. A substantial portion of text is composed of low-frequency words, for which there are not enough observations to make accurate estimates of symbol probabilities. Words that are unseen during training have no observations at all. Compounding this problem, Samuelsson has shown that symbol probabilities are more significant in improving accuracy than transition probabilities [Samuelsson 1993]. Together these suggest that, if one is not satisfied with the accuracy of a stochastic part-of-speech tagger, one should attempt to improve symbol probability estimation.

A common approach to such sparse-data problems is to develop an alternate representation which pools data into coarser categories, increasing the number of observations of each of a smaller set of phenomena. In XPOST, each word is represented by its ambiguity class: the set of tags it may assume. All words in an ambiguity class are considered identical, and their observations may thus be pooled to provide better estimates.
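
A sketch of this pooling, with an invented toy lexicon (XPOST's data structures differ):

from collections import defaultdict

def ambiguity_classes(lexicon):
    # group words by the set of tags they may assume
    classes = defaultdict(list)
    for word, tags in lexicon.items():
        classes[frozenset(tags)].append(word)
    return classes

lexicon = {"bank": {"n", "v"}, "walk": {"n", "v"}, "the": {"det"}}
for tags, words in ambiguity_classes(lexicon).items():
    print(sorted(tags), "->", words)   # the class {n, v} pools 'bank' and 'walk'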

XPOST guesses ambiguity classes for unknown words based on their suffixes. Frequencies of suffixes of known words in a text are analyzed to generate a table which, given a suffix, names the ambiguity class which accounts for the vast majority of the words with that suffix. This is similar to the method for handling unknown words proposed by Eklund [Eklund 1993, Eklund 1993b].
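
The table construction might be sketched as follows; the suffix length, cutoff, and function names are illustrative assumptions, not XPOST's actual parameters:

from collections import Counter, defaultdict

def build_suffix_table(lexicon, freq, n=3, cutoff=0.9):
    # tally, for each word-final n-gram, the ambiguity classes of the
    # known words bearing it, weighted by corpus frequency
    by_suffix = defaultdict(Counter)
    for word, tags in lexicon.items():
        by_suffix[word[-n:]][frozenset(tags)] += freq.get(word, 1)
    # keep a suffix only if a single class accounts for most of its words
    table = {}
    for suffix, counts in by_suffix.items():
        klass, count = counts.most_common(1)[0]
        if count / sum(counts.values()) >= cutoff:
            table[suffix] = klass
    return table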

Another weakness of many stochastic taggers is their reliance upon hand-tagged corpora for training. While hand-tagged corpora do provide accurate estimates, they are very expensive to produce. XPOST avoids reliance on hand-tagged corpora by using a hidden Markov model (HMM). The Baum-Welch (or forward-backward) algorithm enables one to estimate the symbol and transition probabilities of an HMM without hand-tagged training data [Baum 1972].

The Baum-Welch algorithm operates by incrementally adjusting probabilities to make the training data more likely. One can steer it away from poor local maxima by initializing some of the probabilities manually. For example, one might initialize the transition probability between determiner and noun to be higher than the transition probability between determiner and verb. In effect, this permits one to specify simple a priori grammatical constraints. Here we see that stochastic taggers are not purely data-driven and self-organizing, as is sometimes claimed by those promoting grammar-based taggers, but rather permit the integration of linguistic knowledge.
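
Concretely, one might seed the transition matrix with such constraints and then let re-estimation refine all the parameters. The sketch below, using numpy, shows a biased initialization followed by Baum-Welch steps for a discrete-output HMM; it is a textbook rendering with an invented four-tag inventory, omits the rescaling needed for long training sequences, and is not XPOST's implementation.

import numpy as np

def baum_welch_step(obs, pi, A, B):
    # One forward-backward re-estimation step.  obs: symbol indices;
    # pi: initial state probabilities (K,); A: tag-transition matrix
    # (K, K); B: symbol probabilities (K, M).
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))                   # forward probabilities
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.zeros((T, K))                    # backward probabilities
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                       # state occupancy per position
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((K, K))                      # expected transition counts
    for t in range(T - 1):
        x = alpha[t, :, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    new_A = xi / xi.sum(axis=1, keepdims=True)
    new_B = np.zeros_like(B)
    for t, o in enumerate(obs):
        new_B[:, o] += gamma[t]
    new_B /= new_B.sum(axis=1, keepdims=True)
    return gamma[0], new_A, new_B

# Invented four-tag model: seed det->noun high and det->verb low, then train.
tags = ["det", "n", "v", "prep"]
A = np.full((4, 4), 0.25)
A[0] = [0.04, 0.80, 0.01, 0.15]                # biased det row, sums to 1
pi = np.full(4, 0.25)
B = np.full((4, 8), 1.0 / 8)                   # 8 ambiguity-class symbols
obs = [0, 1, 5, 2, 0, 1]                       # a toy training sequence
for _ in range(10):
    pi, A, B = baum_welch_step(obs, pi, A, B)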

Performance of XPOST on English

On the Brown corpus [Francis and Kučera 1982] XPOST achieves the following results. Unknown words are handled gracefully, as this tagging of a passage of nonsense words shows:

Corandic is   an  emurient grof with many fribs; it  granks
n        v3sg det adj      n    prep adj  pl     pro v3sg

from corite, an  olg which cargs like lange.
prep n       det n   pro   v3sg  prep n

average milliseconds per word

tokenizer    604
lexicon      388
tagging      233
total       1235


XPOST thus appears to meet most of the criteria for practical part-of-speech tagging.

Porting XPOST to Swedish

Teleman's corpus of tagged Swedish was used to evaluate XPOST on Swedish [Teleman 1974]. The methodology was similar to that used for the Brown corpus. First, a lexicon was induced from the entire collection, containing, for each word, the list of tags it may be assigned. The corpus was then divided into two sections, one containing the even-numbered sentences and one containing the odd-numbered sentences. The former were used, without tags, to train XPOST. The latter sentences were then automatically tagged by XPOST, and the tags thus assigned were compared with the tags assigned by Teleman.
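
In outline, the evaluation amounts to the following sketch, where sentences is a list of (word, tag) sequences from the corpus and tag_sentence is the trained tagger; the names are illustrative, not XPOST's interface:

def split_even_odd(sentences):
    # even-numbered sentences, tags stripped, for training;
    # odd-numbered sentences, tags kept, for evaluation
    train = [[w for w, _ in s] for i, s in enumerate(sentences) if i % 2 == 0]
    test = [s for i, s in enumerate(sentences) if i % 2 == 1]
    return train, test

def accuracy(test, tag_sentence):
    # per-word agreement between the tagger's output and the reference
    right = total = 0
    for s in test:
        words, gold = zip(*s)
        for guess, ref in zip(tag_sentence(list(words)), gold):
            right += guess == ref
            total += 1
    return right / total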

The Teleman tagged corpus contains around 85 thousand words tagged with 259 unique tags. Many of these tags occur very infrequently in the corpus, making parameter estimation difficult. The tagset was thus initially recoded to the 13 tags specified by Samuelsson [Samuelsson 1993]. With this tagset, XPOST tagged 91% of the words correctly.

Examination of the errors suggested that XPOST might do better with a somewhat more refined tagset. Refining a tagset is not usually a good idea, as it creates more parameters to be trained and hence leaves less evidence per parameter. However, some added distinctions may not increase the number of ambiguous forms, yet may provide more precise grammatical contexts for disambiguating neighboring ambiguous forms. By this logic, genitive names and nouns were broken out as separate tags, and pronouns were divided into four categories: relative, personal, genitive, and object. After these changes, accuracy rose to 95%.
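
Such a recoding is simply a mapping over tags, as in the sketch below; every tag code shown is an invented placeholder, since neither the Teleman codes nor the 13-tag base set is reproduced here:

# COARSE_MAP collapses the original 259 tags to the 13-tag base set
COARSE_MAP = {"nn": "n", "vb": "v", "pn.pers": "pron"}   # ...and so on

def recode(tag):
    if tag.startswith("pn."):                # pronouns: four subtypes
        for sub in ("rel", "pers", "gen", "obj"):
            if tag.endswith(sub):
                return "pron-" + sub
    if tag in ("nn.gen", "pm.gen"):          # genitive nouns and names
        return COARSE_MAP.get(tag.split(".")[0], "n") + "-gen"
    return COARSE_MAP.get(tag, "other")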

Issues

Some issues which remain to be examined in stochastic taggers include:

Should common words be modeled individually? Some authors (e.g. [Kupiec 1992]) have proposed that high-frequency words should each have their own ambiguity class, even if the set of tags in that class is not distinct from that of other classes. The Swedish word om might benefit from this treatment: it is a high-frequency word which may be used as an adverb, a conjunction, or a preposition. Other words with the same ambiguity, e.g. efter and sedan, are infrequent enough to benefit from having their statistics pooled, while om is frequent enough that it may fare better on its own.
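
One way to realize this is to give any sufficiently frequent word a private ambiguity class, as in this sketch (the threshold and the names are invented for illustration):

from collections import Counter

def assign_classes(tokens, lexicon, threshold=500):
    # words seen at least `threshold` times get a class of their own;
    # the rest share a class with all words having the same tag set
    freq = Counter(tokens)
    def klass(word):
        return word if freq[word] >= threshold else frozenset(lexicon[word])
    return {w: klass(w) for w in lexicon}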

[Voutilainen et al. 1992] have developed a tagging method which not only achieves high accuracy but, moreover, can accurately predict its own errors. In other words, rather than generating wrong tags, it passes the ambiguity along so that it may be resolved by higher-level processing. This is clearly a superior property. It remains to be seen whether a stochastic tagger can match it.

Acknowledgments

Thanks to Jussi Karlgren for suggesting that I visit SICS, to Christer Samuelsson for arranging my stay, and to the whole SICS NLP group for making that stay fun.

Bibliography

Baum, L.E. 1972. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of a Markov Process. Inequalities 3:1-8.

Church, K. 1989. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing.

Church, K.W. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas.

Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P. 1992. A Practical Part-of-Speech Tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy.

Eineborg, Martin and Gambäck, Björn. 1993. Tagging Experiments Using Neural Networks. In Nordiska Datalingvistikdagarna, Stockholm.

Eklund, Robert. 1993. A Probabilistic Word Class Tagging Module Based on Surface Pattern Matching. Stockholm University, Department of Computational Linguistics.

Eklund, Robert. 1993b. A Probabilistic Word Class Tagger Module Based on Surface Pattern Matching. In Nordiska Datalingvistikdagarna, Stockholm.

Francis, W.N. and Kučera, H. 1982. Frequency Analysis of English Usage. Houghton Mifflin.

Jelinek, F. 1985. Markov Source Modeling of Text Generation, in Impact of Processing Techniques on Communication, J.K. Skwirzinski, Editor. Nijhoff: Dordrecht.

Kupiec, J.M. 1992. Robust Part-of-Speech Tagging Using a Hidden Markov Model. Xerox Palo Alto Research Center.

Samuelsson, Christer. 1993. A Morphological Tagger Based Entirely on Bayesian Inference. In Nordiska Datalingvistikdagarna, Stockholm.

Teleman, U. 1974. Manual för Grammatisk Beskrivning av Talad och Skriven Svenska. University of Lund.

Voutilainen, Atro, Heikkilä, Juha, and Anttila, Arto. 1992. Constraint Grammar of English. Department of Linguistics, University of Helsinki.