April 8, 2004

Formal grammar and information theory: together again?

I'm not sure how many of my readers are into hybrid mathematical linguistics. My guess is circa three. But, if you are, I'm currently reading Formal grammar and information theory: together again? by Fernando Pereira at UPenn. He lays out, explicitly, the false assumption behind the rejection of empirical methods in linguistics and the presumption of poverty of stimulus in language learning:

In the last forty years, research on models of spoken and written language has been split between two seemingly irreconcilable points of view: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. Chomsky (1957)'s famous quote signals the beginning of the split:
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

[...] It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally `remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

[...] Chomsky concluded that sentences (1) and (2) are equally unlikely from the observation that neither sentence nor `part' thereof would have occurred previously (Abney, 1996). From this observation, he argued that any statistical model based on the frequencies of word sequences would have to assign equal, zero, probabilities to both sentences. But this relies on the unstated assumption that any probabilistic model necessarily assigns zero probability to unseen events. Indeed, this would be the case if the model probability estimates were just the relative frequencies of observed events (the maximum-likelihood estimator).

Since I am primarily interested in estimating the probability of unique, novel linguistic productions (e.g. the probability that sentence X is a translation of sentence Y when I have neither seen X nor Y before), it is very fortunate for me that Chomsky's fair assumption about statistical models of grammaticality is completely, utterly, irretrievably wrong.

Update: Heh, heh, heh. I just got to a good bit on page 7:

Using this estimate [bigram probabilities] for the probability of a string and an aggregate model with C = 16 trained on newspaper text using the expectation-maximization (EM) method (Dempster, Laird, & Rubin, 1977), we find that

p(Colorless green ideas sleep furiously) / p(Furiously sleep ideas green colorless) ≈ 2 × 10^5.

Thus, a suitably constrained statistical model, even a very simple one, can meet Chomsky's particular challenge.

So, in what must be a first in the history of linguistics, Pereira actually goes out and calculates just how much more probable Chomsky's impossible but grammatically correct sentence is when compared to a grammatically incorrect one. Tell your students, your friends, your co-workers: Colorless green ideas sleep furiously is at least 200,000 times as probable a sentence as Furiously sleep ideas green colorless.
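The shape of Pereira's aggregate model is easy to sketch. The word classes and every number below are hand-set by me purely to show the factorisation p(w_i | w_{i-1}) = Σ_c p(c | w_{i-1}) p(w_i | c); Pereira learned his C = 16 classes from newspaper text with EM, so take this as a cartoon, not his method.

```python
import math

# Toy class-based bigram in the spirit of Pereira's aggregate model.
# CLASS assigns each word a part-of-speech-like class; TRANS gives
# p(next class | previous class); EMIT gives p(word | class). All
# numbers are made up for illustration.
CLASS = {"colorless": "ADJ", "green": "ADJ", "ideas": "N",
         "sleep": "V", "furiously": "ADV"}
TRANS = {"<s>": {"ADJ": 0.4, "N": 0.4, "V": 0.1, "ADV": 0.1},
         "ADJ": {"ADJ": 0.3, "N": 0.6, "V": 0.05, "ADV": 0.05},
         "N":   {"ADJ": 0.05, "N": 0.2, "V": 0.7, "ADV": 0.05},
         "V":   {"ADJ": 0.05, "N": 0.3, "V": 0.05, "ADV": 0.6},
         "ADV": {"ADJ": 0.1, "N": 0.3, "V": 0.5, "ADV": 0.1}}
EMIT = {"ADJ": 0.5, "N": 1.0, "V": 1.0, "ADV": 1.0}

def logprob(sentence):
    """Log-probability of a sentence under the toy class bigram."""
    lp, prev = 0.0, "<s>"
    for w in sentence:
        c = CLASS[w]
        lp += math.log(TRANS[prev][c] * EMIT[c])
        prev = c
    return lp

s1 = "colorless green ideas sleep furiously".split()
s2 = "furiously sleep ideas green colorless".split()
print(math.exp(logprob(s1) - logprob(s2)))  # ratio favouring the grammatical order
```

Even with these crude hand-set numbers, the grammatical order comes out two orders of magnitude likelier. Pereira's trained model, with real statistics behind it, got five.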


Posted by Scott Martens at 4:16 PM | Comments (16) | TrackBack

April 13, 2004

Language education for the 20th 21st century

Since I seem to have more readers interested in language than I thought, I think I might just post more material on it. I should note, for those who don't read the comments, that my discovery of Fernando Pereira was also blogged by Mark Liberman at Language Log back in October. We're not exactly at the cutting edge of the literature here at Pedantry.

But, that brings me to today's topic, inspired by this post over at Language Log:


Posted by Scott Martens at 2:27 PM | Comments (2) | TrackBack

In defense of linguistic prescriptivism

I'm having a low productivity day. I have a coding problem that is just hounding me, something that simply will not run fast enough and requires a fairly big rethink. I'm avoiding getting into it in hopes that a flash of inspiration will come to me on the toilet or something.

Remarkably, this strategy actually works well. I had a dream last night in which the solution to the whole problem came to me. It was beautiful and simple - something so transparent that I knew instantly it would work, something that would have been obvious to Kolmogorov or Chaitin if they had only been terminologists. Unfortunately, I woke up and couldn't remember any of it.

So, until this flash of genius comes back, assuming it wasn't just the Tandoori chicken I made for dinner talking, I have some time to blog. There's so much work here that I've been putting off. The link list is a mess - out of date, doesn't reflect the people who link to me or are even still blogging, is missing a bunch of folks I do read; Grandpa is still on his way to Africa; I have a bunch of books I'd like to review; I have a couple of posts up on AFOE; there must be something new to say about this mess in Fallujah...

But, nah, I thought, why not crack open another controversial can of worms. I figured I should either explain why I think Dave Sim is a genius, since the last issue of Cerebus came out last month, or I should defend prescriptivism. So, I flipped a coin and decided to defend linguistic prescriptivism.

I'm lying. No, I didn't flip a coin. I'm a couple years behind on Cerebus anyway. I hear he's gone from misogynistic to homophobic over the last few years. I still think he's a genius, but I've never claimed that genius isn't fully compatible with being a flaming loon.

I just saw Mark Liberman's post over at Language Log and thought it merited a comment or two:

A Field Guide to Prescriptivists

[...] Like bacteria transferring genes, prescriptivists -- whether sensible or idiotic -- mix and match ideas about usage. The resulting distribution is far from random: different prescriptive memes are more or less compatible with one another, and with other aspects of critical morphology, ideological metabolism and intellectual history. However, the result is not a nice Linnaean taxonomic tree either.

I don't think anyone can yet plausibly claim to have found memetic DNA, if such a thing is even possible. However, we can identify some key elements of prescriptivist metabolism, in terms of five different motivations that may be given for strictures about usage:

  1. Tradition -- how our forebears talked. Innovation is degeneration.
  2. Fashion-- how an admired group talks. Deviation is alienation.
  3. Universal grammar -- how one ought ideally to talk. Inconsistency is illogical.
  4. Standards -- how we should agree to talk. Variation confuses communication.
  5. Revelation -- how God taught us to talk. Alteration is transgression.

Particular cases are usually a mixture of these. Such metabolic processes may cooperate or conflict depending on details -- thus an appeal to fashion may point in the same direction as an appeal to tradition, or in the opposite direction, depending on whether the prescriptivist admires the old ways or prefers the latest thing.

Except for numbers 2 and 4, these are all unscientific, unfounded fallacies that make a poor basis for prescriptivism.


Posted by Scott Martens at 5:06 PM | Comments (9) | TrackBack

May 4, 2004

I need some linguistics help

I'm trying to find somebody - Language Hat maybe? - who can offer a professional opinion on some work in Indo-European studies and linguistic reconstruction. To wit, is this as nuts as it sounds to me, or is there just stuff going on that I don't know about?

From: Theo Vennemann (2003), Europa Vasconica - Europa Semitica (Trends in Linguistics: Studies and Monographs, 138), Patrizia Noel Aziz Hanna, ed., Berlin: Mouton de Gruyter.

Bemerkung zum frühgermanischen Wortschatz ('Remark on the Early Germanic Vocabulary')

Several accounts of the history of the German language contend that about one third of the Proto-Germanic vocabulary has no Indo-European etymology. The categories cited as those in which these words cumulate are:

  1. warfare and weapons (e.g. Waffe 'weapon', Schwert 'sword')
  2. sea and navigation (e.g. See 'sea', Ufer 'bank, shore', Sturm 'tempest, storm')
  3. law (e.g. Sühne 'atonement', stehlen 'to steal', Dieb 'thief')
  4. state and communal life (e.g. Knecht 'servant', Volk 'division, people', Adel 'nobility')
  5. husbandry, house building, settlement (e.g. Rost 'grill', Fleisch 'meat', Haus 'house')
  6. other expressions of advanced civilization (e.g. Zeit 'time')
  7. names of animals and plants (e.g. Aal 'eel', Möwe 'gull', Bohne 'bean')
  8. expressions from numerous spheres of daily life (e.g. trinken 'to drink', Leder 'leather')

The accounts suggest that these unexplained words may be owed to prehistoric substrates. By contrast, it is shown in this paper that three of the eight categories of words thus claimed to be prehistoric substratal borrowings, categories 1, 3, and 4, are owed to superstrates rather than to substrates in historical cases of language contact. Indeed it is precisely these three categories where superstratal loan-words are shown to abound in the following cases:

  1. the superstratal Norman-French influence on Middle English,
  2. the superstratal Franconian influence on the Gallo-Roman Latin of Northern France,
  3. the superstratal Arabic influence on Spanish,
  4. the superstratal Lombard and Ostrogoth influence on Northern Italian,
  5. the superstratal Turkish influence on the languages of the Balkans,
  6. the superstratal influence of Low German on Danish and Swedish as a consequence of the commercial dominance of the Hansa.

The conclusion drawn in this paper is that if the Germanic vocabulary lacking Indo-European etymologies consists of loan-words, then at least the loan-words in categories 1, 3, and 4 were borrowed from superstrates rather than from substrates. The paper concludes with speculations about the prehistoric settings in which such superstratal influence on Pre-Germanic would have been possible. The megalithic monuments of Western Europe are suggested to be the archaeological vestiges of the culture to which those superstratal languages belonged. No concrete proposal is made concerning the languages or language families from which the problematic vocabulary was borrowed, but Basque and Pictish are mentioned as testimony of a once non-Indo-European Western and Northern Europe.

I am troubled because Theo Vennemann does have papers, particularly on Germanic phonology, that seem to have been cited by respectable people, and he holds down a tenured position in Germanic studies at a very mainstream Bavarian university. The trouble is, this is quite remote from the comparative method in linguistic reconstruction, and I can't understand what he could possibly mean by "borrowing from a superstrate". I was especially suspicious after finding this, which leads me to think that Vennemann hasn't even looked up sword in the OED.

Having tenure and having written a few good papers is no guarantee of mental health. However, I should think that if there were a significant body of highly regarded research on pre-IE European languages coming to the kinds of conclusions Vennemann is coming to, I would have heard about it. So, I'm hoping somebody with more background in Indo-European studies can tell me whether this Vennemann guy is taken seriously, or whether there is a consensus that he's a crackpot.

Update: Does Vennemann mean sociologically dominant when he says superstrate? That's not what I've always understood it to mean - I've always used it as a specialised term in creole studies, meaning the "target" language that substrate speakers are trying to communicate in - but his usage is at least not incompatible with that meaning.

Further update: Mucked up a link. The etymology I found on the web comes from here:

sword: ON swerdh, OE sweord - from a general Gmc root, etymology dubious, perhaps OHG sweran 'cause or suffer pain', swero, swer(a)do 'pain', Ir. serb 'bitter', Av. xara- 'wound', with orig. sense of root 'sting, cut' (Walde-Pokorny, Krogmann, Kluge and Buck).

I can't figure out where that etymology came from though.


Posted by Scott Martens at 2:25 PM | Comments (11) | TrackBack

May 13, 2004

Exams, exams, exams

I think I blew my Chinese luisterexam last night by writing "?" when I meant "?" consistently. It's not like there's any actual phonetic difference between the two. They're homophones not only in Mandarin but I think in every other form of Chinese.

Most Chinese characters seem to be used in primarily phonetic ways - they don't seem to reduce ambiguity in any meaningful way. There are circa 1200 unique syllables in Chinese, probably less than a thousand for the 70 percent of Mandarin speakers for whom "?" sounds the same as "?". Really, would it be so frigging impossible to just pick one character for each distinctive syllable? It would make my life only a thousand times easier.
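To put a number on the many-to-one mapping, here is a hand-picked sample of my own (tones ignored, which is roughly how my ear treats them anyway) showing how many characters collapse onto a single syllable:

```python
from collections import defaultdict

# A few Mandarin characters with their pinyin readings, tones omitted.
# Hand-picked examples to illustrate how many characters share a syllable.
readings = {"他": "ta", "她": "ta", "它": "ta",
            "是": "shi", "十": "shi", "时": "shi", "事": "shi",
            "书": "shu", "树": "shu"}
by_syllable = defaultdict(list)
for char, syllable in readings.items():
    by_syllable[syllable].append(char)
print({s: len(chars) for s, chars in by_syllable.items()})  # {'ta': 3, 'shi': 4, 'shu': 2}
```

And this sample barely scratches the surface: a full dictionary puts dozens of common characters on syllables like shi.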

And Russian... don't get me started on Russian. What is with this "???? ???", "??? ????", "???? ?????" stuff? I get the concept of case as an indicator of verbal argument structure. I get case as a theta marker. But how in hell does case get to be an indicator of number? "Two and three take the genitive singular." WTF?

I'm just having that "my brain is full" feeling. Language classes - like drugs, only less fun.

[Warning: You need to set your Text Encoding to Unicode to read this entry.]


Posted by Scott Martens at 9:52 PM | Comments (17) | TrackBack

May 14, 2004

Aced the Russian luistervaardigheidsexam

The prof told us the answers afterwards. Unless I made a mistake filling the test out, I got all of the questions right. Of course, it was multiple choice and only counts for 5% of the final grade. The real test is Monday and Wednesday - the written and the oral.

In the meantime, my life consists of Chinese interleaved with Russian.

Since I've already pronounced my efforts to study two of the three hardest languages offered at Leuven at the same time, while holding a full-time job, to be completely nuts, I find myself somewhat surprised to be considering taking an intensive, five-week Dutch course in lieu of a third year of weekly classes. Three nights a week, four hours a night.

Upside: If I did it, I could take fourth year Dutch next year or try to do both Russian and Chinese again, and since I feel guilty - vestiges of bad immigrant guilt - for not studying Dutch this last year, I could end up as far along as if I had studied it all year.

Downside: If I did it, I could take fourth year Dutch next year or try to do both Russian and Chinese again, and my wife will kill me.

Classes would start the day before my last exam and run until the end of June. Should I do it? Dutch is easy enough; I just hesitate to give another language three nights a week. On the other hand, it's only a month, and only one more night a week than I've been doing for the last year.


Posted by Scott Martens at 10:35 PM | Comments (4) | TrackBack

May 25, 2004

Café sans frontières

Mark Liberman over at Language Log confronts the rather distinctive bipartite division of coffees in Montreal: velouté or corsé? I was a student when I lived in Montreal. The answer was always corsé. Montreal coffee culture is a bit different. Instead of Starbucks, Van Houtte is the place for du vrai corsé.

But what makes this really funny to me is that I had the same experience in reverse.


Posted by Scott Martens at 11:06 PM | Comments (6) | TrackBack

May 31, 2004

Cussing in Hebrew

Geoffrey Pullum is struck by a passage in the latest New Yorker in which young Jewish settlers in the West Bank use particularly foul language to express their lack of common feeling towards their Arab neighbours. What he finds most striking is the use of the Arabic word for "cunt."

Pullum's sentiment - particularly the futility of looking for a linguistic solution to ethnic hatred - is right on. However, the use of Arabic obscenities by these young Hebrew-speaking men is not terribly surprising. Modern Hebrew is a language that came into being quite recently and which was originally defined in a very artificial way. The language has been transformed by being adopted by many Yiddish, Ladino and Arabic speakers over the last century and much of its real structure and vocabulary comes from those roots.

The consequences of this sort of creolisation were related to me by my Jewish vice principal in high school who tried to order a glass of water in a restaurant in Tel Aviv using his rather rusty Hebrew. It seems that Americans who mispronounce the Hebrew word for water often end up producing something that sounds like the Arabic word that translates as "cunt." Everyone in Israel knows that word because nearly all Hebrew obscenities have been borrowed from Levantine Arabic.


Posted by Scott Martens at 8:03 AM | Comments (1) | TrackBack

June 10, 2004

Reconstructing the original migration out of Africa

John McWhorter is restrainedly enthusiastic about a recently published paper linking an isolated language of Nepal to the languages of the Andaman Islands. Now, as I have recently pointed out, I am not terribly specialised in historical linguistics. The last thing I read on the historical linguistics of Papuan languages was Wurm's book from the mid-70s. Even when I read it over a decade ago, the very idea that the Andamanese languages had any particular connection to the languages of Papua was considered controversial.

Thus, I am ill-equipped to judge the connecting hypothesis - that if this Nepalese language is related to Andamanese languages and Andamanese languages are related to Papuan languages and Papua has been settled for some 75,000 years, then this link is reconstructing an 80,000 year old linguistic connection. But, it seems to me that I'd need to be convinced of the intermediate steps before considering the basic claim.


Posted by Scott Martens at 4:48 PM | Comments (4) | TrackBack

August 13, 2004

Seeing language through a lexical lens

My recent post at AFOE on the German language reform, along with a discussion on Language Hat and It's Ablaut Time, has inspired me to do this post. My professional roots are in several highly heterodox schools of linguistics, and some of the solutions found in them seem relevant to the topic at hand.

David over at It's Ablaut Time starts his post with this:

I used to think we could talk intelligently about the grammars of languages by starting with the assumption that grammars are designed for communication. The more I look at actual languages, the less I believe that this is the case. While languages obviously serve as media of communication, they are in many ways ill-suited to this task. Grammars are too complex, too byzantine, too intricate, and indeed too beautiful, to be optimal codes for communication.

I'm inclined instead to have a problem with the notion that grammars are designed at all. Like Language Hat, I've never been able to develop much enthusiasm for Esperanto, in part because of this very problem.

Since the rise of the structuralists, mainstream linguistics has not thought highly of artificial languages. The main reason for this is the belief that the true object of study for linguistics is the vernacular, and that any sort of role in managing language problems is a sort of unscientific prescriptivism. While I think this is no longer an appropriate attitude for linguistics, I too don't think very highly of Esperanto.

Esperantists talk at great length about how easy their language is to learn, but the simple truth is that it is very hard to learn and effectively impossible to use "correctly." Esperantists also have a long history of complaining about how poorly Esperantists speak Esperanto. But the reason Esperanto is doomed in its current form is because Esperantists believe that language is, in Steven Pinker's words, words and rules which are internalised by the speaker. I don't think that it is any such thing. A language is the set of things a community of speakers does to communicate linguistically.

It is that belief, and the claim that the above is not a circular definition, that places me far outside of the linguistics mainstream.


Posted by Scott Martens at 10:25 AM | Comments (16) | TrackBack

August 17, 2004

Only 18 out of 45 phonic generalizations met the criteria of usefulness

According to the New York Times, Theodore Clymer died last month. He was an important figure in American elementary education, one who reached the end of his life every bit as controversial a figure as when he did his most famous research.

The title of this post is a direct quote. It is his most famous conclusion.

Before the 1960s, most early childhood literacy education focused on certain folk beliefs about the phonetics of English spelling. For example, how many of us have heard that when two vowels are next to each other, you only hear the first one? Not that many, I suspect. It used to be a common rule taught to every first grader, until Dr Clymer showed that the exceptions to it are so numerous that the rule has little practical value.

This was the beginning of the end for phonics-based reading instruction in the US, which is kind of ironic, since Clymer himself advocated a more modern form of phonics for most of his career.
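Clymer's method can be mimicked in miniature: state a rule, take a word list, count the hits. The words and the hit/miss judgments below are my own hand-picked illustration, not Clymer's data, and his actual utility criterion was computed over far larger word lists than this toy:

```python
# The "when two vowels go walking, the first one does the talking" rule,
# checked against a hand-picked word list. Classifications are mine.
examples = {
    "boat": True, "rain": True, "meat": True, "coat": True,   # rule holds
    "bread": False, "great": False, "brief": False,           # rule fails
    "does": False, "shoe": False,
}
utility = sum(examples.values()) / len(examples)
print(f"{utility:.0%}")  # prints 44% - well short of a useful generalization
```

Even a list this short makes the point: the "rule" is closer to a coin flip than a generalization worth drilling into first graders.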


Posted by Scott Martens at 4:41 PM | Comments (9) | TrackBack

When linguists get snarky

Long-term readers will, of course, know that I am not a great fan of Chomsky's, and no more enthusiastic about Steven Pinker. Consequently, I cannot help but giggle at Semantic Compositions' review of the latest from that comedic duo Pinker and Jackendoff, and their reconstruction of Chomsky's latest incarnation.


Posted by Scott Martens at 10:52 PM | TrackBack

August 23, 2004

Pirahã and the art of mathematics

Via Language Log and Crooked Timber, I take note of a Reuters dispatch over here and a brief in the Science section of the WaPo here.


Posted by Scott Martens at 10:59 AM | Comments (1) | TrackBack

October 20, 2004

The Ur-Durkheimat

Cosma Shalizi has a piece up on Durkheim's theories about categorisation and consciousness which, from my logs, seems to be bringing in a few hits. He is discussing a paper by one Albert J. Bergesen analysing and, apparently, rejecting Durkheim on the basis of more recent research in infant cognition.

I mostly don't disagree with Cosma on this one - especially since he points out that one shouldn't "[conflate] pre-social conceptual structure with innate conceptual structure."

All the evidence Bergesen reviews is compatible, I think, with infants simply being born with learning mechanisms which always settle on the same structure when confronted with the kind of environments human beings have always inhabited --- one where space is, to the limits of perception, Euclidean and three dimensional, where gravity has a constant direction and magnitude, etc. Conceivably, in a different environment --- say, free fall --- they would acquire different concepts, or perhaps no coherent concepts at all. (At last, a scientific point to the space station!) Anyway, I think that if you pay enough attention to either the nature of statistical learning procedures or the mechanisms of biological development you'll find the usual argument over innate cognitive structures dissolving in your mind.

Indeed, since this is more or less my position, I don't have anything to object to on that count. The degree to which novelty draws infant (and adult) attention and stimulates cognition has been pretty close to the core of non-nativist accounts of development at least since the days of Piaget. In fact, I would tend to push the argument beyond the mind and even further back than birth. Even our physical features, the ones whose heritability is unquestioned, are dependent on pre- and post-natal developmental environments. If it were possible to let embryos develop radically outside of their expected physical environment, we might not even consider those things that are clearly hereditary to be truly innate. At this point one (or at least I) would begin questioning whether innate actually means very much at all.

Which brings me to his final point, in the footnotes, and the link you clicked on if you got here from Cosma's page:

Of course, the mechanisms implementing those ordinary processes are highly non-trivial, as your friendly neighborhood linguists will tell you. In fact, they're so intricate, and so useful, that it's absurd to believe they're not biological adaptations, no matter what Uncle Noam says. As for coordinated attention, my suspicion, sparked by reading Barbara Ehrenreich, is that it owes a lot to humans being such unusually weak, flabby, small-toothed apes. One of us is "chewy, and good with ketchup", but a lot of us throwing stones at the same thing are trouble. But, again, that doesn't explain how we pull it off (as your friendly neighborhood Vygotskian linguist will tell you).

"Biological adaptation" is, of course, a loaded term in this little corner of social science discourse, but once you find yourself questioning whether something can even be inate, it's not too hard to swallow. I'd prefer one thought of language as biologically adapted more in the sense of the Baldwin effect than, say, the way we understand the opposable thumb to be a biological adaptation. But I do need to slap Cosma's hand gently for thinking that because something is intricate and useful it is a biological adaptation. The bond market is intricate too and quite useful in its own way, but could only be viewed as a biological adaption by extending the term to the point of meaninglessness.

As for Bergesen, he starts in the wrong place by quoting Meltzoff on the fall of Piagetism. Classical Piagetian theory, like classical everything, has serious shortcomings and drawbacks. However, the general assumption that infant cognitive abilities entail biologically fixed, hereditary categories was far more instrumental in the death of Piagetian thought than any actual experimental results. That assumption is vastly more difficult to sustain in the 21st century.

In the same sense, an unmodified Durkheimism is probably not an advisable set of ideas to have. Many areas of his thought have come in for serious reexamination, and obviously Bergesen would like the notion that cognition is acquired through socialisation to also be reexamined.

Since my inclination is to think of cognition as goal-oriented computation, clearly infants undertake cognition from birth or nearly from birth. I would still argue that in the studies Bergesen cites I do not see a convincing case for pre-natal knowledge; and as Cosma points out, the line between innate and acquired is tough to find given modern knowledge of development. Bergesen's survey of the literature cites only one experiment involving newborn infants, and it's one I'm unfamiliar with. Yes, even a one-week-old infant has seen enough of his or her mother's face to recognise it, but that does not mean the ability was innate.

That leads to the other issue I see left unaddressed: what is socialisation? If it means nothing more than interacting with other people, babies start doing that as soon as some doctor pulls them out by the head and whacks them on the ass. If "pre-socialisation" is meant to mean "preceding any human contact", I don't see much of a challenge here.

Thus, I find myself more critical of Bergesen than Cosma is. Essentially, Bergesen is saying: "Here is a bibliography of people saying babies come into the world with lots of pre-existing knowledge and categorisations that they could never have learned, and here are sociologists, pretending that everything about our minds comes from society. Bad sociologists, no treat for you!" But there is a whole literature questioning those radical conclusions about infant knowledge and hereditary, biologically adapted categories. They are far from universally accepted in psychology, developmental theory, or linguistics. Bergesen is stretching some already quite stretched conclusions still further.


Posted by Scott Martens at 3:08 PM | Comments (7) | TrackBack

February 4, 2005

A difference of semantics

Brief entry inspired by Atrios's misuse of the idiom a difference of semantics. This is my personal bugaboo, so bear with me.

Igor Mel'čuk was one of my profs in Montreal, and he used to tell this joke, which, I guess, makes a bit more sense in Russian:

A man walks into a doctor's office and demands to be castrated. The doctor says, reasonably enough, that he doesn't do that sort of thing, and besides, why would the man want to be castrated? At this, the man whips out a gun and demands that the doctor castrate him or else. So the doctor, forced at gunpoint, agrees to do it. He puts the man under and castrates him.

When the man wakes up, the doctor says, "Well, I did what you asked. But why on earth do you want to be castrated?"

The man replies, "Well, you see, I have this Jewish girlfriend, and she won't do it with me unless I've been castrated."

"Don't you mean circumcised?"

The man thinks about it for a moment and says, "Well, doctor, don't you think that's just a difference of semantics?"

The point is that semantic distinctions are terribly important. Most people use the phrase "a difference of semantics" to mean the exact opposite: a difference between two things that makes no semantic difference at all.


Posted by Scott Martens at 1:42 PM | TrackBack

February 14, 2005

The Hegemonic Lexicon: a first draft

Below the fold is my first draft of my soon-to-be doctoral proposal. I invite - indeed, beg for - comments. I remind everyone that this is a hastily written first draft. It is far from a finished proposal. For example, the title is not transparently linked to the contents.

This, a CV, and two letters of recommendation are now all that stands between me and becoming a doctoral candidate. There's only one more impossible thing to do: I have been asked for two letters of recommendation, and it has been suggested that I really ought to get one from a prof from before my life in Belgium. This poses some real problems. My last bout of education before arriving in Europe was at Stanford in late 2000. Before that, I was at U de Montréal until 1994. My last completed degree before Belgium was an undergrad degree in Physics that I finished (with a pitiful GPA) in 1991. So, either I have to hit up profs who haven't seen me in five years, or ones that haven't seen me in 11 years, or profs I haven't seen in as much as 15 years in a totally different field where I was crap.

The candidate pool:

  • John Koza at Stanford (Good news: He gave me a good grade for a paper that's actually been cited somewhere. Bad news: I took the class by video and saw Koza in person, like, twice, five years ago.)
  • Igor Mel'čuk at U de Montréal (Good news: we got on well. Bad news: my last class with him was incomplete and eleven years ago.)
  • Alan Ford at U de Montréal (Good news: He gave me a good grade. Bad news: The one big paper I wrote for him was total shit, and I'm pretty sure he knew that, and it was some twelve or thirteen years ago.)
  • Carl Helrich at Goshen College (Good news: he was my undergrad advisor. Bad news: He was my undergrad advisor in Physics, at which I was utter crap and got shitty grades. If I'd been less of a stubborn ass, I'd have quit Physics early and majored in something closer to my interests.)

Anyway, I am also soliciting advice on how to approach a prof one hasn't seen in a long, long time - and possibly didn't impress that much - to obtain a letter of recommendation. Any help at all would be appreciated.

So, onward with the one-page proposal:


Posted by Scott Martens at 12:20 PM | Comments (3) | TrackBack

February 17, 2005

MDL in Linguistic Modeling: Version 2.0 of the doctoral proposal

Less math, more linguistics, slightly shorter, fewer different things at once. And many thanks to Cosma Shalizi for comments in e-mail.

To read, click the "more" button. And once again, please comment.


Posted by Scott Martens at 1:58 PM | TrackBack

May 10, 2005

The PhD proposal - part 1

Real life strikes again. I'm moving forward on the PhD, doing some fixes on the proposal. This, once again, interrupts my ability to do other things, but that's for the best. I'm not going to flunk Dutch, so the studying can wait. I'm not going to have trouble with Russian - although why it takes five weeks to get an invitation letter through the Foreign Ministry is beyond me. I ought to be getting it in a week.

So, below the fold, part one of the PhD proposal. Critiques solicited. I will probably delete the content below the line after a week or so because I don't really want the search engines finding it. The idea is simple enough, and I really don't want to see someone else doing the same thing in the middle of my doctorate. Call me egotistical, or paranoid. Whatever.


Posted by Scott Martens at 11:23 PM | Comments (4) | TrackBack

May 11, 2005

The PhD proposal - part 2

This is part 2 of the PhD proposal. It, too, will not last on the web. This part covers some of the more innovative elements of my proposal. Once again, I am soliciting comments. However, I did use AppleWorks' built-in HTML converter, and that may cause some display problems.


Posted by Scott Martens at 11:08 PM | TrackBack

February 10, 2006

Wrathful Dispersionism

Fortunately, this is a joke. I was shaking my head at the prospect it might be real until I got to the line about "linguistics is widely and justifiably seen as the centrepiece of the high-school science curriculum".

I asked my grandfather the pastor once why evolution is so widely hated while more modern theories about language aren't. He said there was nothing particularly incompatible between the Tower of Babel and continuous language change. So why, I replied, is it different for evolution? The Catholics seem to have found a way to integrate unique creation with evolution; why not others?

I don't remember any satisfactory answer.

(Link found via Brad Delong.)


Posted by Scott Martens at 7:24 AM | Comments (1) | TrackBack

June 11, 2006

Near mergers and the end of the minimal pair

I'm reading Labov's Principles of Language Change (Volume 1: Internal Factors and Volume 2: Social Factors) and I've come across something absolutely fascinating and totally contradictory to what they teach in Linguistics 101.

First, a little background for non-linguists. Linguistics started out in the reconstruction of language change - what's usually called historical linguistics today. One common phenomenon in language change is called the phonetic merger, where two sounds that used to be different become indistinguishable. Five hundred years ago, the words meet and meat were pronounced very differently, which is why they're still spelled differently. Then, the vowels in the middle merged in sound at some point. This can be tested with a minimal pair test: If you say "meat" or "meet" in isolation, or in a sentence where either word could be used, people can't tell the difference.

Now, when two sounds merge, they sound the same. The general understanding is that this is a one way process: two words that sound the same never, ever start to sound different. Or at least, that's what I was taught. Turns out this isn't exactly true.

Labov talks about something called the near-merger, where two sounds become so alike that listeners can't tell them apart, but using recordings and frequency measurements, a computer can still tell them apart. As an example, he shows that New Yorkers can't hear any difference between source and sauce in speech, but do clearly pronounce them differently. Labov implies that this might explain how line and loin, which sounded the same in the 18th century, have since become very different in most dialects of English - they never merged completely in the first place.
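The instrumental side of this is easy to picture in code. Here's a minimal sketch of how a near-merger shows up in measurements: we compare the second-formant (F2) frequencies of two word classes and compute an effect size. The numbers are invented for illustration - they are not Labov's data, and real sociophonetic work would use many speakers and proper acoustic analysis.

```python
# Sketch: a near-merger is acoustically real even when it is
# imperceptible. Two word classes that listeners judge identical can
# still differ reliably in measured formant frequency.

def mean(xs):
    return sum(xs) / len(xs)

def pooled_sd(a, b):
    # Pooled standard deviation of two samples.
    ma, mb = mean(a), mean(b)
    var = (sum((x - ma) ** 2 for x in a) +
           sum((x - mb) ** 2 for x in b)) / (len(a) + len(b) - 2)
    return var ** 0.5

# Hypothetical F2 measurements in Hz for tokens of "source" and
# "sauce" from one speaker (invented values, for illustration only).
source_f2 = [1180, 1195, 1170, 1210, 1188, 1202]
sauce_f2  = [1150, 1162, 1145, 1158, 1170, 1149]

# Cohen's d: how many pooled standard deviations apart the means are.
gap = mean(source_f2) - mean(sauce_f2)
d = gap / pooled_sd(source_f2, sauce_f2)

print(f"mean F2 gap: {gap:.1f} Hz, Cohen's d: {d:.2f}")
```

A gap of a few dozen hertz can yield a large effect size while still falling below what listeners consciously perceive, which is exactly the near-merger pattern: the distributions barely overlap on the instrument, yet the minimal pair test comes up empty.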

Now, I can think of a sociolinguistic explanation of how this kind of situation could exist and remain stable. Exposure to speakers of different dialects - ones that preserve larger distinctions - through media like TV could influence people's speech enough to retain a difference, even if that difference falls below the threshold of conscious perception. Furthermore, the inability of New Yorkers to consciously detect the difference between source and sauce doesn't mean that it doesn't contribute unconsciously to their ability to understand words in context.

However, I suspect that TV can't explain all cases of near-mergers.

The existence of near-mergers undermines the idea that phonetic spelling systems can be easily constructed, since minimal pair tests may not reveal all real distinctions. Furthermore, it implicitly undermines the idea that language is a property of individuals, since this distinction relies on its social effect to persist. And it really strikes hard at the notion that language can ever be modeled synchronically: near-mergers only exist because of past distinctions, and they can only be modeled in light of the overt past distinction that they retain.

This is serious stuff.

I'm thinking, though, about whether the idea of a near-merger might apply to morphology and syntax. In morphology, I can think of one: gender in Dutch. Dutch no longer makes an overt distinction between masculine and feminine except in the choice of pronoun and the archaic "te + dative" construct. Yet, speakers are routinely capable of making masculine/feminine distinctions correctly. I thought the main reason was that so many people in Belgium speak dialects where the masculine/feminine distinction is still morphologically significant, but now I wonder. What if Dutch speakers who have never used anything other than the standard dialect were able to make those distinctions? Would that be the morphological analogue of a near-merger?

Labov has some other stuff that I think leads to interesting conclusions in creolistics, but that's a different post.


Posted by Scott Martens at 6:31 PM | Comments (4)