Reliability of probability estimates for language research

General idea - Compare probability estimates derived from different data sources and different estimation methods in terms of how well they fit behavioral outcomes.

Possible behavioral outcome measures - naming latencies (attachment:balota_2004_visual.pdf ; attachment:baayen_2006_morphological.pdf ; attachment:balota_2007_english.pdf), word duration (Bell et al. 03; P-SWBD; also in Balota et al.?), word-by-word reading times (Dundee corpus for English; there are also corpora for French and German; self-paced reading data could also be modeled).
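
The comparison described in the general idea could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a settled analysis plan: it estimates log probabilities from raw counts (with optional add-alpha smoothing), regresses a behavioral outcome on each estimate with a simple one-predictor least-squares fit, and compares R^2 across sources. The function names, the toy counts, and the toy latencies are all hypothetical; a real study would presumably use the corpora and measures listed above and a richer regression model with control predictors.

```python
import math


def log_prob_estimates(counts, alpha=0.0):
    """Log relative-frequency estimates, with optional add-alpha smoothing.

    Words whose smoothed count is zero are skipped, since log(0) is undefined.
    """
    total = sum(counts.values()) + alpha * len(counts)
    return {
        word: math.log((count + alpha) / total)
        for word, count in counts.items()
        if count + alpha > 0
    }


def r_squared(xs, ys):
    """R^2 of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # no variance, so no fit to speak of
    return (sxy * sxy) / (sxx * syy)


def compare_estimates(count_tables, outcomes, alpha=0.0):
    """Return {source_name: R^2} for each source of counts.

    Only words that have both an estimate and an outcome are used.
    """
    results = {}
    for name, counts in count_tables.items():
        logp = log_prob_estimates(counts, alpha)
        words = [w for w in outcomes if w in logp]
        xs = [logp[w] for w in words]
        ys = [outcomes[w] for w in words]
        results[name] = r_squared(xs, ys)
    return results


# Hypothetical toy data, for illustration only (not real corpus counts
# or real naming latencies).
toy_counts = {
    "source_a": {"the": 1000, "dog": 50, "barked": 5, "loudly": 2},
    "source_b": {"the": 800, "dog": 5, "barked": 60, "loudly": 3},
}
toy_latencies_ms = {"the": 450.0, "dog": 520.0, "barked": 600.0, "loudly": 640.0}

fit = compare_estimates(toy_counts, toy_latencies_ms)
for source, r2 in sorted(fit.items()):
    print(f"{source}: R^2 = {r2:.3f}")
```

The same comparison can be rerun with a different alpha to contrast estimation methods on a fixed data source, which mirrors the two dimensions (data source and estimation method) named in the general idea.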

Study ideas

The Wikipedia Discussion

Knowledge extraction and knowledge representation researchers in AI are becoming very excited about Wikipedia. The idea of a large knowledge base that is linked directly to Wikipedia source articles (both sentence text and structured data, e.g., in tables) is very appealing. If you can figure out a way to automatically extract knowledge from wiki pages, then you have thousands of knowledge engineers cranking out your knowledge base for free. And since wiki contributors tend to be (reasonably) tech savvy, AI folks are hopeful that these contributors will be open to using various tools, tagsets, etc. when they make their edits, so that the NLP people and the contributors can "meet partway" between free-form input and manually encoding things in logical form (à la Cyc).

Information extraction and corpus statistics are terms that should be more commonly associated than they are now; adding Wikipedia to the mix of datasets might (?) widen the pool of those interested in our work.

Misc stat: the premier knowledge extraction group, out of UW, shared their latest web-text dataset with Ben. Over 10% of that corpus (by document count, not sentence or word count) was Wikipedia, making Wikipedia the largest individual contributor (top-level domain).

Some specific benefits of wiki data:

Several interesting ideas came up in this conversation.
