April 8, 2004

Formal grammar and information theory: together again?

I'm not sure how many of my readers are into hybrid mathematical linguistics. My guess is circa three. But, if you are, I'm currently reading Formal grammar and information theory: together again? by Fernando Pereira at UPenn. He lays out, explicitly, the false assumption behind the rejection of empirical methods in linguistics and the presumption of poverty of stimulus in language learning:

In the last forty years, research on models of spoken and written language has been split between two seemingly irreconcilable points of view: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. Chomsky (1957)'s famous quote signals the beginning of the split:
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

[...] It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally `remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

[...] Chomsky concluded that sentences (1) and (2) are equally unlikely from the observation that neither sentence or `part' thereof would have occurred previously (Abney, 1996). From this observation, he argued that any statistical model based on the frequencies of word sequences would have to assign equal, zero, probabilities to both sentences. But this relies on the unstated assumption that any probabilistic model necessarily assigns zero probability to unseen events. Indeed, this would be the case if the model probability estimates were just the relative frequencies of observed events (the maximum-likelihood estimator).
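Pereira's point about the maximum-likelihood estimator can be seen in a few lines of Python. This is a toy illustration of my own, not anything from the paper: the relative-frequency (maximum-likelihood) estimate assigns an unseen bigram exactly zero probability, while even the crudest fix, add-one (Laplace) smoothing, keeps unseen events "remote" but strictly nonzero.

```python
from collections import Counter

# Tiny toy corpus; the vocabulary and sentences are invented for illustration.
corpus = "the cat sat on the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(set(corpus))  # vocabulary size, for smoothing

def mle(w2, w1):
    # Relative frequency: any bigram never observed gets exactly zero.
    return bigrams[(w1, w2)] / unigrams[w1]

def laplace(w2, w1):
    # Add-one smoothing: every bigram, seen or not, gets probability > 0.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

# "cat mat" never occurs in the corpus, though both words do.
assert mle("mat", "cat") == 0.0      # ruled out entirely by the MLE
assert laplace("mat", "cat") > 0.0   # unlikely, but not impossible
```

The unstated assumption Pereira identifies is exactly the first assert: it holds only for the maximum-likelihood estimator, and essentially every practical language model uses some form of smoothing instead.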

Since I am primarily interested in estimating the probability of unique, novel linguistic productions (e.g. the probability that sentence X is a translation of sentence Y when I have seen neither X nor Y before), it is very fortunate for me that Chomsky's fair assumption about statistical models of grammaticality is completely, utterly, irretrievably wrong.

Update: Heh, heh, heh. I just got to a good bit on page 7:

Using this estimate [bigram probabilities] for the probability of a string and an aggregate model with C = 16 trained on newspaper text using the expectation-maximization (EM) method (Dempster, Laird, & Rubin, 1977), we find that

p(Colorless green ideas sleep furiously) / p(Furiously sleep ideas green colorless)  ~=  2 x 10^5.

Thus, a suitably constrained statistical model, even a very simple one, can meet Chomsky's particular challenge.

So, in what must be a first in the history of linguistics, Pereira actually goes out and calculates just how much more probable Chomsky's impossible but grammatically correct sentence is when compared to a grammatically incorrect one. Tell your students, your friends, your co-workers: Colorless green ideas sleep furiously is at least 200,000 times as probable a sentence as Furiously sleep ideas green colorless.
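Pereira's actual model is an aggregate bigram model with C = 16 hidden word classes, trained on newspaper text with EM. A drastically simplified, hand-built sketch of the same idea (the word classes, toy training sequences, and all numbers below are my invention, not Pereira's) is enough to show why such a model prefers the grammatical order even when it has never seen any of the word bigrams:

```python
from collections import defaultdict

# Hand-labeled class (part-of-speech) sequences standing in for training data;
# Pereira's model learns its 16 classes from raw text via EM instead.
training_tag_sequences = [
    ["ADJ", "ADJ", "NOUN", "VERB", "ADV"],   # e.g. "big red dogs run quickly"
    ["ADJ", "NOUN", "VERB", "ADV"],          # e.g. "old cars stop suddenly"
    ["ADJ", "ADJ", "NOUN", "VERB"],          # e.g. "small grey cats sleep"
]
TAGS = ["ADJ", "NOUN", "VERB", "ADV"]

# Class-bigram counts, with "<s>" as the start-of-sentence state.
counts = defaultdict(lambda: defaultdict(int))
for seq in training_tag_sequences:
    prev = "<s>"
    for tag in seq:
        counts[prev][tag] += 1
        prev = tag

def p_tag(tag, prev):
    # Add-one-smoothed class transition probability: unseen transitions
    # get a small but nonzero probability.
    total = sum(counts[prev].values())
    return (counts[prev][tag] + 1) / (total + len(TAGS))

# Deterministic word -> class lexicon; p(word | class) is taken as uniform,
# so it cancels when comparing the two orderings of the same five words.
lexicon = {"colorless": "ADJ", "green": "ADJ", "ideas": "NOUN",
           "sleep": "VERB", "furiously": "ADV"}

def sentence_probability(words):
    p, prev = 1.0, "<s>"
    for w in words:
        tag = lexicon[w]
        p *= p_tag(tag, prev)
        prev = tag
    return p

p1 = sentence_probability("colorless green ideas sleep furiously".split())
p2 = sentence_probability("furiously sleep ideas green colorless".split())
assert p1 > p2  # the grammatical order wins, despite zero observed word bigrams
```

Neither sentence's word bigrams appear in the toy data; the grammatical order wins purely because its class sequence (ADJ ADJ NOUN VERB ADV) matches patterns the model has seen. The ratio here is a couple of orders of magnitude, not Pereira's 2 x 10^5, but the mechanism is the same.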

Posted 2004/04/08 16:16 (Thu)

Speaking as one of the three readers, you should really read Abney's paper as well:
http://www.vinartus.net/spa/95c.pdf. He's at UM now, I really should go talk with him at some point...

Posted by: Cosma at April 8, 2004 16:51

And wouldn't 2x10^5 be 200,000, not 20,000?

Posted by: Cosma at April 8, 2004 16:52

Eh. What's an order of magnitude between friends? (I made the change.)

Posted by: Scott Martens at April 8, 2004 16:54

I'm reading Abney's paper now. I agree that one is perhaps a little full of oneself if one thinks that more data and faster computers are the solution to all of one's problems in linguistics. I don't think more CPU cycles will lead to AI, for the same reasons I think it's the wrong approach to linguistics.

I'm absolutely, 100% with Abney that statistical methods have a great deal to offer language acquisition theory. On his other two issues - language change and language variation - I think a genuinely statistical approach is barking up the wrong tree. I have a half-baked (and only half thought out) alternative approach to those two issues, one which is predicated on a theory of language acquisition which is very much in the empirical tradition, but which is not a statistical theory in and of itself.

But, that would be an interminable post, one which would be hampered by my knowledge of the gaps in my theory. There are too many spots where my intuition tells me something follows from something else, but where I can't prove it on paper, nor demonstrate it by experiment, nor support it with real world data. Not yet, anyway.

Part 2 for instance: the Adult Monolingual Speaker. I think one of the foundational problems in linguistic methodology is assuming that this individual exists.

Posted by: Scott Martens at April 8, 2004 17:04

Rather, the last sentence of paragraph 2 should read:

I have a half-baked (and only half thought out) alternative approach to those two issues, one which is predicated on a statistical theory of language acquisition and which is very much in the empirical tradition, but which is not a statistical theory in and of itself.

I mean that I have a theory of language change and variation which isn't statistical - but is in the empirical tradition - although this theory of language change and variation is predicated on a statistical theory of language acquisition.

Posted by: Scott Martens at April 8, 2004 17:08

Ooooh, and he reads Tesnière. I like him already.

Posted by: Scott Martens at April 8, 2004 17:11

Well, as for adult monolingual speakers, I submit that I'm a pretty good approximation to one myself. I suspect it's the additional bit about being a member of a homogeneous speech community that you're really objecting to, though.

Posted by: Cosma at April 8, 2004 17:36

Actually, it's less the lack of homogeneity in the speaking community than the lack of homogeneity in the speaker. But otherwise, yeah, that's more what I'm objecting to.

Posted by: Scott Martens at April 8, 2004 19:55

Let me make sure you know that there are actually two readers - since I would hate you to think from now on you could just as well email your blog entries to Cosma instead of uploading them...

Posted by: Joerg Wenck at April 8, 2004 21:30

Well - two more and I'll have underestimated my audience, which will only lead to more postings on mathematical linguistics.

Posted by: Scott Martens at April 8, 2004 22:18

Fernando's little experiment is an important piece of PR, which indeed deserves to be more widely known. I made a few comments about it in
including the observation that it's a shame it took 43 years for someone to directly engage this argument.

Posted by: Mark Liberman at April 9, 2004 1:12

Just letting you know you have underestimated your audience ;)

Posted by: Peter Dirix at April 9, 2004 12:03

Indeed, you have 'misunderestimated' your audience, Scott. I enjoyed both articles. (I'm working on a topics course in computational linguistics under the direction of someone who has collaborated with Pereira, and possibly Liberman: Richard Sproat.)

Posted by: Pedro Poitevin at April 9, 2004 15:05

You can also still count me in as one of your avid readers, scott!

Posted by: Vincent at April 9, 2004 16:13

Scott, for what it's worth, I know next to nothing about linguistics, but I dilettante around reading about information retrieval and machine learning topics at work sometimes, and it seems to me that mathematical linguistics should be useful there. So I don't know if I count among the "into hybrid mathematical linguistics", but I'm certainly interested.

It seems obvious now that all Chomsky's example did was to demonstrate the uselessness of any statistical model of language which treats sentences as indivisible atoms. Despite my ignorance of linguistics, I'm inclined to suspect that such a model is a straw man.

Posted by: Jeremy Leader at April 10, 2004 2:51

Very late comment...

This isn't a counterexample to Chomsky. First, the statistical model provides a similarity metric, not a notion of "probability of utterance". Given a corpus C and a sentence S, it assigns some value indicating how similar S is to other sentences in C (very roughly speaking). To say that this similarity metric translates into a measure of utterance probability is an empirical hypothesis which cannot, so far as I can see, be tested, and which is intuitively implausible, since there are an indefinite number of factors governing the probability of any given utterance, and these factors primarily relate to the real world, not statistical properties of sentences in a corpus.

Second, it's not necessary for Chomsky's argument that "colorless green ideas..." and "furiously sleep ideas..." have identical probabilities of utterance. As he says in Aspects, the problem is that both of these sentences have probabilities of utterance "empirically indistinguishable from zero"; i.e. even if one is less likely than the other, they will both be marked "very, very ungrammatical" by any probability-based model of grammaticality. But "colourless green ideas..." is perfectly, 100% grammatical. It is just as grammatical as highly probable utterances such as "Good morning".

Finally (a related point), does the statistical model "understand" sentences sufficiently to make the right predictions about semantically anomalous sentences? For example, would it assign very different probabilities to "The Earth is an example of a thing which is round" and "The Earth is an example of a thing which is flat"? (Note that "Earth" and "round/flat" are a long way away from each other, so the higher probability of "round" and "Earth" co-occurring will not be of any help.) Presumably, the first sentence is overwhelmingly more likely than the second; but if the model can't capture this, it can't claim to be a model of utterance probability. And if it isn't a model of utterance probability, it's not going to do any useful work in refuting Chomsky.

Posted by: Alex at May 18, 2006 2:07