Reliability of probability estimates for language research

General idea - Compare probability estimates derived from different data sources and different estimation methods by how well each set of estimates fits behavioral outcomes.
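One way to make this comparison concrete is sketched below. It is only an illustration and it assumes things the notes do not specify: that fit is measured as the R^2 of a simple linear regression of the behavioral outcome on log probability, and that each source gives one probability per word. The function names (`r_squared`, `compare_estimates`), the source labels and all numbers are made up for the example.

```python
import math


def r_squared(x, y):
    """R^2 of the least-squares line y ~ a + b*x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy ** 2) / (sxx * syy)


def compare_estimates(estimates, outcome):
    """Return {estimate name: R^2 of outcome regressed on log probability}."""
    return {
        name: r_squared([math.log(p) for p in probs], outcome)
        for name, probs in estimates.items()
    }


# Toy, invented data: one behavioral outcome per word (e.g. a naming
# latency in ms) and two hypothetical sets of probability estimates.
latencies_ms = [520.0, 560.0, 610.0, 650.0, 700.0]
estimates = {
    "source_a": [0.02, 0.01, 0.004, 0.002, 0.0008],
    "source_b": [0.01, 0.02, 0.001, 0.004, 0.002],
}

scores = compare_estimates(estimates, latencies_ms)
best = max(scores, key=scores.get)  # estimate with the highest R^2
print(scores, best)
```

In a real study the regression would presumably include control predictors (word length, neighborhood density and so on) and the comparison would be made on held-out data or with a model comparison criterion; the single-predictor R^2 here only shows the shape of the comparison.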

Possible behavioral outcome measures - naming latencies (attachment:balota_2004_visual.pdf; attachment:baayen_2006_morphological.pdf; attachment:balota_2007_english.pdf), word duration (Bell et al. 03; P-SWBD; also in Balota et al.?), word-by-word reading times (Dundee corpus for English; corpora also exist for French and German; modeling of self-paced reading data is also possible).

Study ideas

The Wikipedia Discussion

According to Ben, researchers in AI (especially automatic knowledge extraction) are turning to Wikipedia as a primary training corpus. Wikipedia offers an interesting alternative to the Google n-gram corpus in several ways.
