Reliability of probability estimates for language research

General idea - Compare different probability estimates, based on different data sources and different estimation methods, in terms of how well the estimates fit behavioral outcome measures.

Possible behavioral outcome measures - naming latencies (attachment:balota_2004_visual.pdf; attachment:baayen_2006_morphological.pdf; attachment:balota_2007_english.pdf), word durations (Bell et al. 03; P-SWBD; also in Balota et al.?), word-by-word reading times (Dundee corpus for English; there are also corpora for French and German; also possible: modeling of self-paced reading data).
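To make the fitting step concrete, here is a minimal sketch of the evaluation loop: regress a behavioral measure on log relative frequency estimates from two sources and compare the resulting fits. The file names (naming_latencies.csv, celex_unigrams.tsv, google_unigrams.tsv) and column names are hypothetical placeholders, and the single-predictor regression stands in for fuller models with the additional predictors mentioned below.

{{{#!python
# Sketch: regress a behavioral measure (naming latencies) on log-frequency
# estimates from two different sources and compare goodness of fit.
# All file names and column names below are hypothetical placeholders.
import csv
import math

def read_frequencies(path):
    """Read 'word<TAB>count' lines into a dict of raw counts."""
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            freqs[word] = float(count)
    return freqs

def simple_regression(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept, R^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical behavioral data: one mean naming latency (ms) per word.
latencies = {}
with open("naming_latencies.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        latencies[row["word"]] = float(row["rt_ms"])

for label, path in [("CELEX", "celex_unigrams.tsv"), ("Google", "google_unigrams.tsv")]:
    freqs = read_frequencies(path)
    total = sum(freqs.values())
    words = [w for w in latencies if w in freqs]
    xs = [math.log(freqs[w] / total) for w in words]   # log relative frequency
    ys = [latencies[w] for w in words]
    slope, intercept, r2 = simple_regression(xs, ys)
    print(f"{label}: slope = {slope:.2f} ms per log unit, R^2 = {r2:.3f}, n = {len(words)}")
}}}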

Study ideas

  • Unigram reliability: Compare unigram estimates from CELEX and Google against the naming database of Balota et al. We could get the models (including previous predictors) from Baayen et al. (2006). Maybe add a measure of "context distinguishability"?

  • Ngram vs. POS-ngram vs. PCFG: Compare bigram/trigram estimates against POS-ngram or PCFG estimates based on hand-parsed written text (Brown, WSJ) vs. automatically parsed written text (BNC, Gigaword), evaluated against e.g. the Dundee eye-tracking corpus.

  • Ngrams in different corpora: Compare unigram vs. bigram vs. trigram estimates based on a balanced corpus (BNC, ANC), Gigaword, and Google counts. Also compare Google four- and five-grams against Google uni-, bi-, and trigrams. Inter-corpus ngram correlations will probably differ quite a bit depending on the target context, but maybe we can isolate some types of contexts for which correlations are high, and some contexts where e.g. Google ngrams should be used with caution (see the corpus-comparison sketch after this list).

  • Different languages: web-based ngrams as an easy-to-obtain ngram source for a variety of languages! Demonstrate usability via the other studies and then apply the approach to one other language?
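The corpus-comparison idea above lends itself to a simple first pass: correlate log relative frequencies of the n-grams two corpora share, and list the n-grams on which the corpora diverge most. The sketch below assumes hypothetical tab-separated count files (bnc_trigrams.tsv, google_trigrams.tsv); any pair of the corpora listed above could be plugged in.

{{{#!python
# Sketch: compare n-gram estimates from two corpora by correlating
# log relative frequencies over the n-grams they share.
# The two input file names are hypothetical placeholders.
import math

def read_ngram_counts(path):
    """Read 'ngram<TAB>count' lines into a dict of raw counts."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            counts[ngram] = float(count)
    return counts

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

a = read_ngram_counts("bnc_trigrams.tsv")
b = read_ngram_counts("google_trigrams.tsv")
total_a, total_b = sum(a.values()), sum(b.values())

# Correlate log relative frequencies over the shared n-grams.
shared = sorted(set(a) & set(b))
log_a = {g: math.log(a[g] / total_a) for g in shared}
log_b = {g: math.log(b[g] / total_b) for g in shared}
xs = [log_a[g] for g in shared]
ys = [log_b[g] for g in shared]
print(f"shared n-grams: {len(shared)}, r = {pearson(xs, ys):.3f}")

# The n-grams with the largest disagreement are candidate "use with caution" contexts.
by_gap = sorted(shared, key=lambda g: abs(log_a[g] - log_b[g]), reverse=True)
for g in by_gap[:10]:
    print(f"{g}\t{log_a[g]:.2f}\t{log_b[g]:.2f}")
}}}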

The Wikipedia Discussion

According to Ben, researchers in AI (especially automatic knowledge extraction) are turning to Wikipedia as a primary training corpus. Wikipedia offers an interesting alternative to the Google n-gram corpus in several ways.

  • First, n-gram counts from Wikipedia will be tied to a specific release of Wikipedia, allowing for reproducible results and exact knowledge of the corpus. It's currently impossible to know which pages contributed to the Google n-grams, from which of Google's indices, and at what time each text was collected. (A sketch of such release-tied counting follows this list.)
  • Second, the Wikipedia corpus could be parsed, allowing another opportunity for within-corpus comparisons of PCFGs and n-gram models.
  • Third, the Wikipedia corpus is a collection of documents, each about a single topic. This would allow for several interesting extensions to our models.
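As a minimal sketch of the release-tied counting mentioned in the first point, the snippet below derives trigram counts from a single, named dump and records that dump identifier with the output. The dump identifier and file names are placeholders, and it assumes the dump has already been reduced to one plain-text article per line.

{{{#!python
# Sketch: build n-gram counts tied to one specific Wikipedia release.
# DUMP_ID and the file names are hypothetical placeholders.
from collections import Counter

DUMP_ID = "enwiki-20080101"      # placeholder release identifier
N = 3                            # n-gram order (trigrams)

counts = Counter()
with open(f"{DUMP_ID}-extracted.txt", encoding="utf-8") as f:
    for article in f:            # one plain-text article per line (assumed preprocessing)
        tokens = article.lower().split()   # crude tokenization, good enough for a sketch
        for i in range(len(tokens) - N + 1):
            counts[" ".join(tokens[i:i + N])] += 1

# Write the counts together with the release identifier, so the corpus
# behind the counts is known exactly and results can be reproduced.
with open(f"{DUMP_ID}-{N}grams.tsv", "w", encoding="utf-8") as out:
    out.write(f"# source dump: {DUMP_ID}\n")
    for ngram, count in counts.most_common():
        out.write(f"{ngram}\t{count}\n")
}}}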

Several interesting ideas came up in this conversation.

  • We could correlate probabilities based on Google n-grams with probabilities based on Wikipedia n-grams, and with probabilities from a Wikipedia-trained PCFG. We might, for example, be able to find a subset of Google n-grams that correlate well with relatively professional web text. We might also be able to do some kind of clustering on documents, looking for patterns among documents in the correlations between Google probs and Wiki probs.
  • We could train a parser on the parsed Treebank and then run it on the Wikipedia corpus. We could then use some subset of the corpus (e.g., the top 10% of parsed documents as measured by perplexity) as a new corpus in which we have relatively high confidence in the automatic parses (see the sketch after this list). Ben related this specifically to some of the points in Roland et al. 2007.
  • Ben had another good idea that he will have to add here.
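A rough sketch of the document-selection step in the parser idea above: given per-document parse scores from whatever Treebank-trained parser is used, compute per-word perplexity and keep the best 10%. The input file wiki_parse_scores.tsv (doc id, token count, total log probability of the best parse) is a hypothetical intermediate, not an existing resource.

{{{#!python
# Sketch: keep the 10% of automatically parsed Wikipedia documents in which
# we have the most confidence, operationalized as lowest per-word perplexity.
import math

# Hypothetical parser output: one line per document with
# doc_id, token count, and total log probability (natural log) of the best parse.
docs = []
with open("wiki_parse_scores.tsv", encoding="utf-8") as f:
    for line in f:
        doc_id, n_tokens, logprob = line.rstrip("\n").split("\t")
        perplexity = math.exp(-float(logprob) / int(n_tokens))   # per-word perplexity
        docs.append((perplexity, doc_id))

# Lowest perplexity first; keep the top 10% as the high-confidence subcorpus.
docs.sort()
keep = docs[: max(1, len(docs) // 10)]

with open("wiki_high_confidence_docs.txt", "w", encoding="utf-8") as out:
    for perplexity, doc_id in keep:
        out.write(f"{doc_id}\t{perplexity:.2f}\n")
}}}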
