Differences between revisions 4 and 12 (spanning 8 versions)
Revision 4 as of 2008-02-20 23:13:57
Size: 2833
Editor: colossus
Revision 12 as of 2009-05-14 20:09:44
Size: 601
Editor: platypus
Comment: fix broken wiki link
Line 1: Line 1:
Revision 4: #acl HlpLabGroup:read,write,delete,revert,admin
Revision 12: #acl HlpLabGroup:read,write,delete,revert,admin All:read
Line 5: Line 5:
Revision 4: = Reliability of probability estimates for language research =
Revision 12: = Usability of probability estimates for language research =
Line 7: Line 7:
'''General idea -''' Compare different probability estimates based on different data sources and different estimation methods in terms of how good a fit the estimates provide against behavioral outcomes.
Line 9: Line 9:
'''Possible behavioral outcome measure -''' naming latencies (attachment:balota_2004_visual.pdf; attachment:baayen_2006_morphological.pdf; attachment:balota_2007_english.pdf), word duration (Bell et al. 03; P-SWBD; also in Balota et al.?), word-by-word reading times (Dundee corpus for English; there are also corpora for French and German; also possible: modeling of self-paced reading data).
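
As a rough illustration of the general idea, the sketch below fits each behavioral outcome (e.g. a naming latency) against the log-probability of the word under each data source and reports R² per source. It is a minimal sketch under stated assumptions, not the project's analysis: the function names `r_squared` and `compare_sources` are hypothetical, the model is a single-predictor least-squares regression (the models in Baayen et al. 2006 include many more predictors), and the numbers in the usage example are invented toy values, not real latencies or counts.

```python
import math


def r_squared(xs, ys):
    """R^2 of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("predictor and outcome must both vary")
    # With one predictor, R^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


def compare_sources(outcomes, estimates_by_source):
    """Return {source: R^2} for outcome ~ log(probability).

    outcomes:            {word: behavioral measure, e.g. latency in ms}
    estimates_by_source: {source name: {word: probability estimate}}
    Only words with a nonzero estimate in a given source are used for it.
    """
    fits = {}
    for source, probs in estimates_by_source.items():
        words = [w for w in outcomes if probs.get(w, 0) > 0]
        log_probs = [math.log(probs[w]) for w in words]
        measures = [outcomes[w] for w in words]
        fits[source] = r_squared(log_probs, measures)
    return fits


if __name__ == "__main__":
    # Invented toy values, for illustration only.
    toy_latencies = {"the": 450.0, "dog": 520.0, "cat": 530.0, "platypus": 700.0}
    toy_estimates = {
        "source_a": {"the": 5e-2, "dog": 1e-4, "cat": 1e-4, "platypus": 1e-7},
        "source_b": {"the": 4e-2, "dog": 3e-4, "cat": 5e-5, "platypus": 2e-6},
    }
    for name, fit in compare_sources(toy_latencies, toy_estimates).items():
        print(name, round(fit, 3))
```

Note that each source is fitted on the words it covers, so R² values are only directly comparable when the sources share the same word set; restricting to the intersection first would be one way to handle that.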

== Study ideas ==

 * '''Unigram reliability:''' Compare unigram estimates from CELEX and Google against the naming database of Balota et al. We could get the models (including previous predictors) from Baayen et al. (2006). Maybe add a measure of "context distinguishability"?

 * '''Ngram vs. POS-ngram vs. PCFG:''' Compare bigram/trigram estimates against POS-ngram or PCFG estimates based on hand-parsed written text (Brown, WSJ) vs. automatically parsed written text (BNC, Gigaword), again against, e.g., the Dundee eye-tracking corpus.

 * '''Ngrams in different corpora:''' Compare unigram vs. bigram vs. trigram estimates based on a balanced corpus (BNC, ANC), Gigaword, and Google counts. Also compare Google four- and five-grams against Google uni-, bi-, and trigrams. Inter-corpus ngram correlations will probably differ quite a bit depending on the target context, but maybe we can isolate some types of contexts for which correlations are high, and some contexts where, e.g., Google ngrams should be used with caution.

 * '''Different languages''': web-based ngrams as an easy-to-get ngram source for a variety of languages! Demonstrate usability given the other studies and then apply it to one other language?
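
The inter-corpus comparison in the study ideas above could be sketched roughly as follows: estimate bigram conditional probabilities in each corpus, then rank-correlate the estimates over the bigrams both corpora share. This is a minimal sketch under stated assumptions: all function names are hypothetical, the estimates are unsmoothed relative frequencies, and Spearman's rho is just one plausible choice of correlation measure (the page does not specify one).

```python
import math
from collections import Counter


def bigram_cond_probs(tokens):
    """Unsmoothed relative-frequency estimate of P(w2 | w1) from a token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # every token that starts a bigram
    return {bg: count / contexts[bg[0]] for bg, count in bigrams.items()}


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ValueError if either list is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman_shared(probs_a, probs_b):
    """Spearman's rho over items present in both estimate tables.

    Returns (rho, number of shared items).
    """
    shared = sorted(set(probs_a) & set(probs_b))
    a = [probs_a[k] for k in shared]
    b = [probs_b[k] for k in shared]
    return pearson(average_ranks(a), average_ranks(b)), len(shared)
```

Running `spearman_shared` separately on subsets of bigrams grouped by context type would be one way to look for contexts where the corpora agree or diverge.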

=== The Wikipedia Discussion ===
According to Ben, researchers in AI (especially automatic knowledge extraction) are turning to Wikipedia as a primary training corpus. Wikipedia offers an interesting alternative to the Google n-gram corpus in several ways.

 * First, n-gram counts from Wikipedia will be tied to a specific release of Wikipedia, allowing for reproducible results and exact knowledge of the corpus. It's currently impossible to know which pages contributed to the Google n-grams, from which of Google's indices, and at what time each text was parsed.

 * Second, the Wikipedia corpus could be parsed, allowing another opportunity for within-corpus comparisons of PCFGs and n-gram models.

 * Third, the Wikipedia corpus is a collection of documents, each about a single topic. This would allow for several interesting extensions to our models.
This project focuses on resource-provision and methodology-check, but the results will also be relevant to theories of comprehension and production. For more detail continue to the [wiki:Self:HlpLab/Projects/NgramUsability/LabLog lab log].

Usability of probability estimates for language research

General idea - Compare different probability estimates based on different data sources and different estimation methods in terms of how good a fit the estimates provide against behavioral outcomes.

This project focuses on resource-provision and methodology-check, but the results will also be relevant to theories of comprehension and production. For more detail continue to the [wiki:HlpLab/Projects/NgramUsability/LabLog lab log].

ProjectsNgramUsability (last edited 2011-08-10 15:56:49 by echidna)
