Reliability of probability estimates for language research

General idea - Compare probability estimates derived from different data sources and different estimation methods by how well each set of estimates fits behavioral outcomes.
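As a rough illustration of this comparison, the sketch below ranks probability sources by the R^2 of a simple regression of a behavioral outcome on log probability. The source names, probabilities, and latencies are made up for illustration, and a real analysis would include the other predictors from the published models; this only shows the shape of the comparison.

```python
import math

def r_squared(predictor, outcome):
    """R^2 of a simple least-squares regression of outcome on predictor."""
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(outcome) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, outcome))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in outcome)
    return (sxy * sxy) / (sxx * syy)

def compare_estimates(estimates, outcome):
    """Return (source, R^2) pairs sorted by fit, best first.

    estimates maps a source name to a list of probabilities that is
    aligned item by item with the outcome list.
    """
    scored = []
    for name, probs in estimates.items():
        log_probs = [math.log(p) for p in probs]
        scored.append((name, r_squared(log_probs, outcome)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical data: four words, naming latencies in ms, two made-up sources.
latencies = [620.0, 580.0, 540.0, 500.0]
estimates = {
    "source_a": [1e-6, 1e-5, 1e-4, 1e-3],
    "source_b": [1e-5, 1e-6, 1e-3, 1e-4],
}
ranking = compare_estimates(estimates, latencies)
```

Here `ranking` lists the sources from best to worst fit. With real data, the same loop could run over CELEX, BNC, Gigaword, and Google estimates for a shared word list.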

Possible behavioral outcome measures - naming latencies (attachment:balota_2004_visual.pdf; attachment:baayen_2006_morphological.pdf; attachment:balota_2007_english.pdf), word duration (Bell et al. 03; P-SWBD; also in Balota et al.?), word-by-word reading times (Dundee corpus for English; there are also corpora for French and German; also possible: modeling of self-paced reading data).

Study ideas

Unigram reliability: Compare unigram estimates from CELEX and Google against the naming database of Balota et al. We could get the models (including previous predictors) from Baayen et al. (2006). Maybe add a measure of "context distinguishability"?
Ngram vs. POS-ngram vs. PCFG: Compare bigram/trigram estimates against POS-ngram or PCFG estimates based on hand-parsed written text (Brown, WSJ) vs. automatically parsed written text (BNC, Gigaword), evaluated against, e.g., the Dundee eye-tracking corpus.
Ngrams in different corpora: Compare unigram vs. bigram vs. trigram estimates based on balanced corpora (BNC, ANC), Gigaword, and Google counts. Also compare Google four- and five-grams against Google uni-, bi-, and trigrams. Inter-corpus ngram correlations will probably differ quite a bit depending on the target context, but maybe we can isolate some types of contexts for which correlations are high, and some contexts where e.g. Google ngrams should be used with caution.
Different languages: Web-based ngrams as an easy-to-get ngram source for a variety of languages! Demonstrate usability given the other studies and then apply it to one other language?
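
The inter-corpus comparison could start from something like the sketch below, which computes maximum-likelihood bigram log probabilities in two corpora and correlates them over the bigrams both corpora contain. The two token lists are toy stand-ins, not real corpus data, and the unsmoothed estimates are an assumption made for brevity; real counts would need smoothing and a decision about how to handle bigrams missing from one corpus.

```python
import math
from collections import Counter

def bigram_logprobs(tokens):
    """Unsmoothed MLE log P(w2 | w1) for every bigram in a token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return {bg: math.log(c / contexts[bg[0]]) for bg, c in bigrams.items()}

def pearson(xs, ys):
    """Pearson correlation; returns None if either variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def shared_bigram_correlation(tokens_a, tokens_b):
    """Correlate log probabilities over bigrams attested in both corpora."""
    lp_a = bigram_logprobs(tokens_a)
    lp_b = bigram_logprobs(tokens_b)
    shared = sorted(set(lp_a) & set(lp_b))
    xs = [lp_a[bg] for bg in shared]
    ys = [lp_b[bg] for bg in shared]
    return shared, pearson(xs, ys)

# Toy stand-ins for two corpora.
corpus_a = "the dog barks the dog sleeps the cat sleeps".split()
corpus_b = "the cat sleeps the cat purrs the dog sleeps".split()
shared, r = shared_bigram_correlation(corpus_a, corpus_b)
```

To look for contexts where the corpora agree or disagree, the shared bigrams could be grouped by some property of the context word (for example its part of speech or frequency band) before calling `pearson` on each group.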

The Wikipedia Discussion

According to Ben, researchers in AI (especially automatic knowledge extraction) are turning to Wikipedia as a primary training corpus. Wikipedia offers an interesting alternative to the Google n-gram corpus in several ways.

First, n-gram counts from Wikipedia will be tied to a specific release of Wikipedia, allowing for reproducible results and exact knowledge of the corpus. It's currently impossible to know which pages contributed to the Google n-grams, from which of Google's indices, and at what time each text was parsed.
Second, the Wikipedia corpus could be parsed, allowing another opportunity for within-corpus comparisons of PCFGs and n-gram models.
Third, the Wikipedia corpus is a collection of documents, each about a single topic. This would allow for several interesting extensions to our models.