Reliability of probability estimates for language research

General idea - Compare probability estimates derived from different data sources and different estimation methods, in terms of how well each set of estimates fits behavioral outcomes.

Possible behavioral outcome measures - naming latencies (attachment:balota_2004_visual.pdf; attachment:baayen_2006_morphological.pdf), word duration (Bell et al. 03; P-SWBD; also in Balota et al.?), word-by-word reading times (the Dundee corpus for English; corpora also exist for French and German; also possible: modeling of self-paced reading data).
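As a rough sketch of the comparison we have in mind: take per-word probability (or frequency) estimates from two sources, use log frequency as the predictor, and ask which source correlates better with observed latencies. All words, counts, and latencies below are invented purely for illustration; a real analysis would use the full regression models with the usual covariates.

```python
import math

# Hypothetical data: per-word frequencies from two sources and observed
# naming latencies (ms). Every number here is made up for illustration.
freqs_a = {"dog": 120000, "sloth": 450, "table": 80000, "zephyr": 90}
freqs_b = {"dog": 98000, "sloth": 610, "table": 91000, "zephyr": 40}
latencies = {"dog": 510, "sloth": 645, "table": 522, "zephyr": 700}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_score(freqs, latencies):
    """Correlate log frequency with latency; more negative = better fit,
    on the usual assumption that frequent words are named faster."""
    words = sorted(latencies)
    logp = [math.log(freqs[w]) for w in words]
    rts = [latencies[w] for w in words]
    return pearson(logp, rts)

for name, freqs in [("source A", freqs_a), ("source B", freqs_b)]:
    print(f"{name}: r = {fit_score(freqs, latencies):.3f}")
```

In practice one would of course compare (partial) R^2 within the full regression models rather than raw correlations, but the skeleton of the comparison is the same.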

Study ideas

  • Unigram reliability: Compare unigram estimates from CELEX and Google against the naming database of Balota et al. We could take the models (including the earlier predictors) from Baayen et al. (2006). Maybe add a measure of "context distinguishability"?

  • Ngram vs. POS-ngram vs. PCFG: Compare bigram/trigram estimates against POS-ngram or PCFG estimates based on hand-parsed (Brown, WSJ) vs. automatically parsed written text (BNC, Gigaword), evaluated against e.g. the Dundee eye-tracking corpus.

  • Ngrams in different corpora: Compare unigram vs. bigram vs. trigram estimates based on a balanced corpus (BNC, ANC), Gigaword, and Google counts. Also compare Google four- and five-grams against Google uni-, bi-, and trigrams. Inter-corpus ngram correlations will probably differ quite a bit depending on the target context, but maybe we can isolate some types of contexts for which correlations are high, and others where e.g. Google ngrams should be used with caution.

  • Different languages: web-based ngrams as an easily obtained ngram source for a variety of languages! Demonstrate usability via the other studies, then apply the approach to one other language?
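For the corpus-comparison studies, the basic quantity is a maximum-likelihood ngram estimate computed separately per corpus. A minimal sketch, using two toy token sequences standing in for e.g. a balanced corpus vs. web text (the sentences are invented; real corpora would be tokenized files):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_prob(tokens, w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) from raw counts.
    Uses the unigram count of w1 as the denominator, which slightly
    overcounts if w1 is the final token; fine for a sketch."""
    uni = ngram_counts(tokens, 1)
    bi = ngram_counts(tokens, 2)
    if uni[(w1,)] == 0:
        return 0.0
    return bi[(w1, w2)] / uni[(w1,)]

# Two toy "corpora" for illustration only.
corpus_a = "the cat sat on the mat the cat slept".split()
corpus_b = "the dog sat on the mat the cat sat".split()

for tokens, name in [(corpus_a, "corpus A"), (corpus_b, "corpus B")]:
    print(f"P(sat | cat) in {name}: {bigram_prob(tokens, 'cat', 'sat'):.3f}")
```

Running the same estimator over each corpus and correlating the resulting estimates for a shared set of target contexts would give the inter-corpus correlations discussed above; for Google counts the raw counts would come from the released ngram tables rather than from tokenized text.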

ProjectsNgramUsability (last edited 2011-08-10 15:56:49 by echidna)