February 17, 2005

MDL in Linguistic Modeling: Version 2.0 of the doctoral proposal

Less math, more linguistics, slightly shorter, fewer different things at once. And many thanks to Cosma Shalizi for comments in e-mail.

And once again, please comment.

The minimum description length (MDL) principle is a relatively new development in machine learning theory that remains underexplored in corpus linguistics. [Rissanen 1978, Grünwald et al. 1998] At its simplest, it extends the principles used in data compression: by exploiting regularities in data, we can reduce the total amount of information needed to represent it. The approach has become much more important in data modeling in recent years because it can be used to operationalize traditionally vague notions like Occam's Razor. [Hansen & Yu 2001]
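The two-part flavor of MDL can be made concrete with a toy calculation. The sketch below is illustrative only: the thirty-two-bits-per-parameter model cost and the unigram model class are simplifying assumptions of mine, not anything from the literature cited here. It scores a string as model bits plus data bits, so a string with strong regularities comes out cheaper than one of the same length without them:

```python
import math
from collections import Counter

def two_part_mdl(text):
    """Crude two-part MDL score in bits: the cost of stating a unigram
    model of the symbols, plus the Shannon code length of the text
    under that model."""
    counts = Counter(text)
    n = len(text)
    model_bits = 32 * len(counts)  # assumed: 32 bits per model parameter
    data_bits = sum(-k * math.log2(k / n) for k in counts.values())
    return model_bits + data_bits

regular = "abab" * 50           # strong regularity: only two symbols
irregular = "abcdefghij" * 20   # same length, ten distinct symbols
print(two_part_mdl(regular) < two_part_mdl(irregular))  # True
```

A richer model class would of course also capture the periodicity that the unigram model misses; the point is only that total code length trades model complexity against fit to the data.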

I intend to pursue applications of MDL methods in corpus linguistics, with particular emphasis on lexicology and lexicalist theories of language. [Mel'čuk 1997, Hudson 1984] It is no longer very controversial to regard linguistic theories as models of language data in the sense intended by computer scientists and mathematicians. A lexicalist theory of language, however, because it emphasizes the properties of specific utterances rather than generalizations that abstract away from the content of language, is more immediately comparable to the kind of data-driven modeling natural to MDL.

By using MDL to model the stochastic properties of large corpora, I expect the units of analysis that this kind of modeling produces to correlate with linguistic categories. In particular, I expect to be able to extract lexemes, and information about the selectional properties of lexical entries, from texts. MDL offers the opportunity to define linguistic categories (the lexicon, morphemes, and syntactic structures) empirically, without reference to traditional markers like prefix and suffix variation or spaces between words.
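How segmentation might fall out of compression can be sketched in the spirit of the description-length-gain idea. This is a hand-built illustration, not a working system: the lexicon cost of eight bits per character is an assumption of mine, and the two candidate segmentations are supplied by hand rather than searched for. Still, the word-level segmentation of an unspaced string wins on total description length:

```python
import math
from collections import Counter

def description_length(units, bits_per_char=8):
    """Two-part description length of a segmented corpus: the cost of
    spelling out each distinct unit in the lexicon, plus the Shannon
    code length of the corpus as a sequence of lexicon entries."""
    counts = Counter(units)
    n = len(units)
    lexicon_bits = sum(len(u) * bits_per_char for u in counts)
    usage_bits = sum(-k * math.log2(k / n) for k in counts.values())
    return lexicon_bits + usage_bits

corpus = "thedogthecatthedog"                        # no spaces anywhere
by_char = list(corpus)                               # baseline segmentation
by_word = ["the", "dog", "the", "cat", "the", "dog"]
print(description_length(by_word) < description_length(by_char))  # True
```

A real system would search the space of segmentations for the one minimizing total description length; Kit & Wilks (1999) develop a workable version of this search for learning word boundaries.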

Because MDL discovers arbitrary units with a single algorithm, a successful outcome opens the possibility of a more uniform way of defining linguistic categories. This approach is linked to the distributional criteria used to define fundamental units in phonology, and resembles Saussure's definition of linguistic units by difference. I intend in part to expand on recent work in morphology that follows this approach. [Ford 2000, Neuvel in press, Belkin & Goldsmith 2002]

In the end, I expect to be able to extract linguistic schemas and, using the principle of model size minimization, to evaluate how well different stores of schemas model corpora. This extends, in part, the Data-Oriented Parsing framework by adding solid criteria for evaluating candidate parse subtrees for the lexicon. [Bod 1998] Although there is some existing work on using MDL methods to analyze raw corpus data, my intent is to pursue applications using data that is partially or wholly parsed, morphologically analyzed, or otherwise abstracted. [Kit 1998, Kit & Wilks 1999, Belkin & Goldsmith 2002] The relatively recent development of high-quality parsers and morphological analyzers for a few languages makes this possible.

This work has potentially significant applications: in lexicography, as a method of identifying essential linguistic structures; in syntax and morphology, as a stochastic formalization of lexicalist principles; and in natural language processing, as an automatic dictionary-induction scheme driven by digitized data.

Belkin, M., Goldsmith, J. Using Eigenvectors of the Bigram Graph to Infer Morpheme Identity. Morphological and Phonological Learning: Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia, July 2002, pp. 41-47.

Bod, R. Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications, Cambridge University Press, 1998.

Hudson, Richard A. Word Grammar. Oxford: Blackwell. 1984.

Ford, A. In Praise of Sakatyayan: Some Remarks on Whole Word Morphology. The Yearbook of South Asian Languages and Linguistics-2000. Thousand Oaks: Sage. 2000.

Grünwald, P., Kontkanen, P., Myllymäki, P., Silander, T., Tirri, H. Minimum encoding approaches for predictive modeling. Proc. 14th Int. Conf. on Uncertainty in AI (UAI'98), pp. 183-192. G. Cooper and S. Moral (eds.). 1998.

Hansen, M. H. and Yu, B. Model selection and the principle of minimum description length. J. American Statistical Association, vol. 96 (2001), pp. 746-774.

Kit, C. A goodness measure for phrase learning via compression with the MDL principle. The ESSLLI-98 Student Session, pp. 175-187. Aug. 17-28, 1998.

Kit, C. & Wilks, Y. Unsupervised learning of word boundary with description length gain. Computational Natural Language Learning - Proceedings CoNLL99 ACL Workshop. 1999.

Mel'čuk, I. Vers une linguistique Sens-Texte: Leçon inaugurale. Paris: Collège de France. 1997. [http://www.olst.umontreal.ca/FrEng/melcukColldeFr.pdf]

Neuvel, S. Whole word morphologizer - expanding the word-based lexicon: A non-stochastic computational approach. In press. [http://www.neuvel.net/PDF_files/WWM_paper.PDF]

Rissanen, J. Modeling by shortest data description. Automatica, vol. 14, pp. 465-471. 1978.

Posted 2005/02/17 13:58 (Thu)