February 14, 2005

The Hegemonic Lexicon: a first draft

Below the fold is my first draft of my soon-to-be doctoral proposal. I invite - indeed, beg for - comments. I remind everyone that this is a hastily written first draft. It is far from a finished proposal. For example, the title is not transparently linked to the contents.

This, a CV, and two letters of recommendation are now all that stands between me and becoming a doctoral candidate. There's only one more impossible thing to do: I have been asked for two letters of recommendation, and it has been suggested that I really ought to get one from a prof from before my life in Belgium. This poses some real problems. My last bout of education before arriving in Europe was at Stanford in late 2000. Before that, I was at U de Montréal until 1994. My last completed degree before Belgium was an undergrad degree in Physics that I finished (with a pitiful GPA) in 1991. So, either I have to hit up profs who haven't seen me in five years, or ones that haven't seen me in 11 years, or profs I haven't seen in as much as 15 years in a totally different field where I was crap.

The candidate pool:

  • John Koza at Stanford (Good news: He gave me a good grade for a paper that's actually been cited somewhere. Bad news: I took the class by video and saw Koza in person, like, twice, five years ago.)
  • Igor Mel'čuk at U de Montréal (Good news: we got on well. Bad news: my last class with him was incomplete and eleven years ago.)
  • Alan Ford at U de Montréal (Good news: He gave me a good grade. Bad news: The one big paper I wrote for him was total shit, and I'm pretty sure he knew that, and it was some twelve or thirteen years ago.)
  • Carl Helrich at Goshen College (Good news: he was my undergrad advisor. Bad news: He was my undergrad advisor in Physics, at which I was utter crap and got shitty grades. If I'd been less of a stubborn ass, I'd have quit Physics early and majored in something closer to my interests.)

Anyway, I am also soliciting advice on how to approach a prof one hasn't seen in a long, long time - and possibly didn't impress that much - to obtain a letter of recommendation. Any help at all would be appreciated.

So, onward with the one-page proposal:

The Hegemonic Lexicon: Algorithmic Information Theory in lexically driven empirical linguistics

Minimum Description Length (MDL) methods are a relatively recent development in the application of algorithmic information theory to the modeling of stochastic data. [Rissanen 1978] They are widely used in data mining and automated inference, and have found a very secure home in computer science as a data-driven machine-learning model. [Grünwald et al. 1998, Vitányi & Li 2000, Hansen & Yu 2001] They have only recently begun to appear in linguistics and perilinguistic fields, driven in part by the growth in corpus linguistics and increasingly easy access to large corpora and fast computers. This doctorate will attempt to apply MDL to a broad swath of corpus analysis operations with both theoretical and engineering consequences.
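For readers unfamiliar with MDL, the core idea in its simplest two-part form can be stated in one line. The symbols are assumptions introduced here for illustration, not notation from the cited papers: D is the observed data, H ranges over a class of candidate models, and L(.) is a code length in bits.

```latex
\[
  \hat{H} \;=\; \operatorname*{arg\,min}_{H \in \mathcal{H}}
  \bigl[\, L(H) + L(D \mid H) \,\bigr]
\]
```

The preferred model is the one that minimizes the cost of describing the model itself plus the cost of describing the data with the model's help, so a model is only allowed to grow when it pays for itself in a shorter encoding of the data.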

MDL offers us the possibility of operationalising a stochastic definition of the lexicon. Indeed, there are already efforts to do so for Chinese, where the development of lexical resources is hampered by a historical tradition of associating meaning with morphemes (generally, single Chinese characters) rather than words. [Kit 1998, Kit & Wilks 1999] The development of modern standard Mandarin has entailed the creation of multi-syllabic words, which were not as widespread in the classical language, and the shift from defining written Chinese in terms of the classical language to modern standard Mandarin has created this large shortcoming in Chinese lexical theory.
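To make the idea of a stochastic definition of the lexicon concrete, here is a minimal sketch that scores competing segmentations of the same unspaced text by a two-part description length. It is a generic illustration, not the description length gain measure of Kit & Wilks. The function name `description_length`, the toy corpus, and the uniform character code used to spell out lexicon entries are all assumptions made for this example.

```python
import math
from collections import Counter

def description_length(tokens, alphabet_size):
    """Two-part description length, in bits, of a corpus split into tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon cost (assumed coding scheme): spell each distinct entry with a
    # uniform code over the alphabet plus one end-of-entry symbol.
    bits_per_char = math.log2(alphabet_size + 1)
    lexicon_bits = sum((len(entry) + 1) * bits_per_char for entry in counts)
    # Corpus cost given the lexicon: Shannon code length of every token,
    # using the token's relative frequency as its probability.
    corpus_bits = -sum(n * math.log2(n / total) for n in counts.values())
    return lexicon_bits + corpus_bits

# Toy corpus: the same unspaced text under three candidate segmentations.
as_words = ["the", "dog", "the", "cat"] * 40
text = "".join(as_words)
as_chars = list(text)   # every character is its own lexical entry
as_whole = [text]       # the entire text is a single lexical entry

alphabet_size = len(set(text))
dl_words = description_length(as_words, alphabet_size)
dl_chars = description_length(as_chars, alphabet_size)
dl_whole = description_length(as_whole, alphabet_size)
```

Under these assumptions the word-level segmentation gets the shortest total description: single characters make the lexicon cheap but the corpus expensive, while one giant entry makes the corpus free but the lexicon as expensive as the text itself. The lexicon that wins is the one that balances the two costs.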

By extending MDL methods to the analysis of dependency trees rather than linear data sources, I hope to be able to develop a mathematically more rigorous conception of the lexicon and its components, one which is entirely separate from questions of prosody or spacing in texts. This outcome should encompass a variety of linguistic phenomena, from multi-word lexical entries to syntactic trees, and enable all of them to be treated uniformly as lexical phenomena. In this respect, it complements the Data-Oriented Parsing model pioneered by Rens Bod. [Bod 1998] This has considerable significance for linguistic theory, since it implies a far more lexically centred conception of syntactic and morphological rules, one more in line with dependency syntax theories. [Mel'čuk 1997, Hudson 1984]

I intend to apply MDL methods to prepared corpora as a stochastic data source in order to determine if:

  • The extracted segments of text regularly correspond to coherent segments of parse trees.
  • Productive syntactic structures are extracted from partially abstracted corpus data, for example by reducing words to their canonical forms or to parts of speech.

I intend to apply this approach to corpora in a variety of languages, dependent on access to adequate resources, in order to test its cross-linguistic applicability. Applying MDL methods in conjunction with vector quantization techniques [Nasrabadi & King 1988, Cook & Holder 1994], I hope to be able to provide a more rigorous body of techniques for abstracting lexical information - one better suited to lexically driven theories of linguistics and to corpus linguistics in general.

Furthermore, I intend to apply MDL to morphological models to evaluate the efficiency of various morphological analyses as data compressors. By joining this approach to distribution-based methods of automated morphological analysis, I hope to be able to evaluate the viability of Saussurian approaches to morphology. [Singh & Ford 1997, Neuvel in press, Belkin & Goldsmith 2002]
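As a rough illustration of what evaluating a morphological analysis as a data compressor could look like, the sketch below compares the cost in bits of listing a small vocabulary as whole words against the cost of listing stems and suffixes plus one pair of pointers per word. The coding scheme, the function names, and the toy English vocabulary are assumptions chosen for the example. It is not the method of any of the cited papers.

```python
import math

# Assumed character code: 26 letters plus one end-of-entry symbol.
BITS_PER_CHAR = math.log2(27)

def list_bits(entries):
    """Bits needed to spell out a collection of strings."""
    return sum((len(entry) + 1) * BITS_PER_CHAR for entry in entries)

def whole_word_bits(words):
    """Cost of a lexicon that simply lists each distinct word."""
    return list_bits(set(words))

def stem_suffix_bits(analysis):
    """Cost of a lexicon given as (stem, suffix) pairs, one pair per word."""
    pairs = set(analysis)
    stems = {stem for stem, _ in pairs}
    suffixes = {suffix for _, suffix in pairs}
    # Each word is encoded as one pointer into the stem list and one
    # pointer into the suffix list, both with uniform codes.
    pointer_bits = len(pairs) * (math.log2(len(stems)) + math.log2(len(suffixes)))
    return list_bits(stems) + list_bits(suffixes) + pointer_bits

stems = ["walk", "talk", "jump", "play"]
endings = ["", "s", "ed", "ing"]
good_analysis = [(s, e) for s in stems for e in endings]
words = [s + e for s, e in good_analysis]
# A deliberately poor analysis: cut every word after its first letter.
bad_analysis = [(w[0], w[1:]) for w in words]

whole = whole_word_bits(words)
good = stem_suffix_bits(good_analysis)
bad = stem_suffix_bits(bad_analysis)
```

With this toy vocabulary the conventional stem-plus-suffix analysis yields a shorter description than either the whole-word list or the arbitrary split, which is the sense in which one analysis can be called a more efficient compressor than another.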

In the end, this is an effort to apply sophisticated stochastic analysis methods to linguistic corpora in the expectation that more natural and more broadly applicable linguistic categories will emerge from it than those of traditional linguistic analysis. This is potentially very significant for linguistic theory - bringing it in line with recent developments in data modeling - but also has considerable potential impact on language engineering and perilinguistic fields like lexicography, language education and information retrieval.

Belkin, M., Goldsmith, J. Using Eigenvectors of the Bigram Graph to Infer Morpheme Identity. Morphological and Phonological Learning: Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia, July 2002, pp. 41-47.

Bod, R. Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications, Cambridge University Press, 1998.

Cook, D. J. and Holder, L.B. Substructure Discovery Using Minimum Description Length and Background Knowledge. J. Artificial Intelligence Research, Vol. 1, Feb. 1994, pp. 231-255.

Hudson, Richard A. Word Grammar. Oxford: Blackwell. 1984.

Ford, A. In Praise of Sakatyayan: Some Remarks on Whole Word Morphology. The Yearbook of South Asian Languages and Linguistics-2000. Thousand Oaks: Sage. 2000.

Grünwald, P., Kontkanen, P., Myllymäki, P., Silander, T., Tirri, H. Minimum encoding approaches for predictive modeling. Proc. 14th Int. Conf. on Uncertainty in AI (UAI'98), pp. 183-192. G. Cooper and S. Moral (eds.). 1998.

Hansen, M. H. and Yu, B. Model selection and the principle of minimum description length. J. American Statistical Association, vol. 96 (2001), pp. 746-774.

Kit, C. A goodness measure for phrase learning via compression with the MDL principle. In: The ESSLLI-98 Student Session, pp.175-187. Aug. 17-28 1998.

Kit, C. & Wilks, Y. Unsupervised learning of word boundary with description length gain. Computational Natural Language Learning - Proceedings CoNLL99 ACL Workshop. 1999.

Mel'čuk, I. Vers une linguistique Sens-Texte: Leçon inaugurale. Paris: Collège de France. 1997. [http://www.olst.umontreal.ca/FrEng/melcukColldeFr.pdf]

Nasrabadi, N. M. & King, R. A. Image Coding Using Vector Quantization: A review, IEEE Trans. on Communications, vol. COM-36, pp. 957-971, Aug. 1988.

Neuvel, S. Whole Word Morphologizer. Expanding the word-based lexicon: A non-stochastic computational approach. In press. [http://www.neuvel.net/PDF_files/WWM_paper.PDF]

Rissanen, J. Modeling by shortest data description. Automatica, vol. 14, pp. 465-471. 1978.

Vitányi, P. and Li, M. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Information Theory, vol. 47, pp. 446-464. 2000.

Posted 2005/02/14 12:20 (Mon)

Do comments work again?

Posted by: Scott Martens at February 14, 2005 12:32

Apparently so.

Posted by: Scott Martens at February 14, 2005 12:32

Doh, forgot to mention Rens Bod.

Posted by: Scott Martens at February 14, 2005 12:45