Reliability of probability estimates for language research

General idea - Compare probability estimates derived from different data sources and different estimation methods in terms of how well the estimates fit behavioral outcomes.

This project focuses on resource provision and methodology checking, but the results will also be relevant to theories of comprehension and production. For more detail, continue to the [wiki:/HlpLab/Projects/NgramUsability/LabLog lab log].

Possible behavioral outcome measures - naming latencies (attachment:balota_2004_visual.pdf ; attachment:baayen_2006_morphological.pdf ; attachment:balota_2007_english.pdf), word durations (Bell et al. 03; P-SWBD; also in Balota et al.?), and word-by-word reading times (Dundee corpus for English; corpora also exist for French and German; modeling of self-paced reading data is also possible).

Study ideas

  • Unigram reliability: Compare unigram estimates from CELEX and Google against the naming database of Balota et al. We could take the models (including the other predictors) from Baayen et al. (2006). Maybe add a measure of "context distinguishability"? (A regression sketch follows this list.)

  • Ngram vs. POS-ngram vs. PCFG: Compare bigram/trigram estimates against POS-ngram or PCFG estimates based on hand-parsed (Brown, WSJ) vs. automatically parsed (BNC, Gigaword) written text, evaluated against e.g. the Dundee eye-tracking corpus.

  • Ngrams in different corpora: Compare unigram vs. bigram vs. trigram estimates based on a balanced corpus (BNC, ANC), Gigaword, and Google counts. Also compare Google four- and five-grams against Google uni-, bi-, and trigrams. Inter-corpus ngram correlations will probably differ quite a bit depending on the target context, but maybe we can isolate some types of contexts for which correlations are high, and some contexts where e.g. Google ngrams should be used with caution. (See the correlation sketch after this list.)

  • Different languages: web-based ngrams are an easy-to-obtain ngram source for a variety of languages! Demonstrate usability given the other studies and then apply it to one other language?
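
For the unigram-reliability idea, here is a minimal sketch of the model comparison in Python. It assumes a hypothetical item file naming.csv with one row per word and columns rt (mean naming latency), logf_celex, and logf_google (log unigram frequency from CELEX and Google); the file name, the column names, and the bare two-predictor models (none of the Baayen et al. covariates) are placeholders, not the actual setup.

{{{
# Sketch: compare CELEX vs. Google unigram frequency as predictors of naming latency.
# Assumes a hypothetical naming.csv with columns: word, rt, logf_celex, logf_google.
import pandas as pd
import statsmodels.formula.api as smf

items = pd.read_csv("naming.csv")

# One regression per frequency source; real models would add the Baayen et al. (2006) predictors.
fit_celex = smf.ols("rt ~ logf_celex", data=items).fit()
fit_google = smf.ols("rt ~ logf_google", data=items).fit()

# Compare fit: higher R^2 / lower AIC = frequency estimate that better predicts behavior.
for name, fit in [("CELEX", fit_celex), ("Google", fit_google)]:
    print(f"{name}: R^2 = {fit.rsquared:.3f}, AIC = {fit.aic:.1f}")
}}}

Better fit for one frequency source would be (weak) evidence that that source's estimates track the behavioral data more closely.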

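For the cross-corpus comparison, a similarly rough sketch: load two hypothetical bigram count tables (tab-separated ngram<TAB>count files; the file names and format are assumptions) and correlate log counts over the bigrams the two corpora share. Rank correlation sidesteps the huge difference in corpus size.

{{{
# Sketch: how well do bigram counts from two corpora agree?
# Assumes hypothetical tab-separated files with lines like: "of the\t12345".
import math
from scipy.stats import spearmanr

def load_counts(path):
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            counts[ngram] = int(count)
    return counts

bnc = load_counts("bnc_bigrams.tsv")
google = load_counts("google_bigrams.tsv")

shared = sorted(set(bnc) & set(google))
x = [math.log(bnc[g]) for g in shared]
y = [math.log(google[g]) for g in shared]

rho, p = spearmanr(x, y)
print(f"{len(shared)} shared bigrams, Spearman rho = {rho:.3f} (p = {p:.3g})")
}}}

Running this separately for different context types (e.g., function-word vs. content-word bigrams) would be one way to find the contexts where Google ngrams need caution.
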
The Wikipedia Discussion

Knowledge Extraction and Knowledge Representation researchers in AI are becoming very excited about Wikipedia. The idea of a large knowledge base linked directly into Wikipedia source articles (both sentence text and structured data in, e.g., tables) is very appealing: if you can figure out a way to automatically extract knowledge from wiki pages, you get thousands of knowledge engineers cranking out your knowledge base for free. And since wiki contributors tend to be reasonably tech-savvy, AI folks are hopeful that these contributors will be open to using various tools, tagsets, etc. when they edit, so that the NLP people and the contributors can "meet partway" between free-form input and manually encoding things in logical form (a la Cyc). Information Extraction and corpus statistics are terms that should be more commonly associated than they are now; adding Wikipedia to the mix of datasets might (?) widen the pool of those interested in our work. Misc stat: the premier Knowledge Extraction group, out of UW, shared their latest web-text dataset with Ben. Over 10% of that corpus (by document count, not sentence or word count) was Wikipedia, making Wikipedia the largest individual contributor (by top-level domain).

Some specific benefits of wiki data:

  • n-gram counts from Wikipedia will be tied to a specific release of Wikipedia, allowing for reproducible results and exact knowledge of the corpus. It is currently impossible to know which pages contributed to the Google n-grams, from which of Google's indices, or at what time each text was collected. The Google n-grams are still important, as they are LDC-distributed and cover a lot, but all we know of their origin is that Google had the underlying documents sitting around someplace two years ago; not much is known (???) about the topic/genre distribution of that collection, or how it has changed since then (the web is constantly evolving). Ben related this to points in Roland (2007), where knowledge of the underlying data made it possible to state nice, intuitive explanations of observed properties (for instance, WSJ is quite formal, hence the much higher rate of passives). Such intuitive statements are not possible with the Google n-grams.
  • Wikipedia could be parsed, allowing another opportunity for within-corpus comparisons of PCFGs and n-gram models.
  • Wikipedia is a collection of documents, each about a single topic. This would allow for several interesting extensions to our models.

Several interesting ideas came up in this conversation.

  • We could correlate probabilities based on Google n-grams with probabilities based on Wikipedia n-grams, and with probabilities from a Wikipedia-trained PCFG. We might, for example, be able to find a subset of Google n-grams that correlates well with relatively professional web text. We might also be able to do some kind of clustering on documents, looking for patterns among documents in the correlations between Google probs and Wiki probs.
  • Independent of Wikipedia itself, we might try building "expanded" versions of current corpora given access to a large collection of web documents (Ben has web data he may or may not be able to share, but more could be gathered if we wanted to do this). The idea would be to use, e.g., a language model based on Brown, then take the top x% of web pages according to their KL divergence from Brown. This would fill in missing data with the web pages most similar to the pre-existing corpus. A possible experiment: pick a few words that appear in Brown above some frequency threshold, remove all documents from Brown that contain those words, and build an n-gram model from the remaining documents. Use it to find the pages in a larger web-text corpus that are most similar to Brown but do contain the missing words, and build an n-gram model on the expanded corpus. Compare that model to the Brown n-grams for the missing words, as well as to the n-grams you would get by using the entire web-text corpus. (A sketch of the KL-filtering step follows this list.)
  • Somewhat related to the above, Ben (me) asked whether psycholinguists ever test their subjects before reading studies. Something like: what were your SAT verbal scores? Could you please take the following short vocab test? How many books do you read a month? What genre of text do you prefer (money/sports/fantasy/...)? Based on these results, could we cut down on inter-subject variance (when correlating performance with, e.g., ngram stats) by using mixed language models crudely adapted to each subject from their survey responses? (A mixture-model sketch follows below.)
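
A minimal sketch of the KL-filtering step in the "expanded corpus" idea above, using unigrams only. brown_counts (a word-to-count dict built from the reduced Brown corpus) and web_pages (an iterable of (url, token list) pairs over whatever web collection we end up with) are hypothetical inputs, and add-one smoothing restricted to the Brown vocabulary is a deliberate simplification.

{{{
# Sketch: rank web pages by unigram KL divergence from Brown, keep the most Brown-like.
# brown_counts (dict: word -> count) and web_pages (iterable of (url, token list))
# are hypothetical inputs; add-one smoothing and unigrams-only are crude simplifications.
import math
from collections import Counter

def kl_from_brown(page_tokens, brown_counts, vocab):
    page = Counter(page_tokens)
    page_total = sum(page.values()) + len(vocab)        # add-one smoothing
    brown_total = sum(brown_counts.values()) + len(vocab)
    kl = 0.0
    for w in vocab:                                     # restricted to the Brown vocabulary
        p = (page.get(w, 0) + 1) / page_total           # page distribution
        q = (brown_counts.get(w, 0) + 1) / brown_total  # Brown distribution
        kl += p * math.log(p / q)
    return kl

def most_brown_like(web_pages, brown_counts, keep_fraction=0.05):
    vocab = set(brown_counts)
    scored = [(kl_from_brown(toks, brown_counts, vocab), url) for url, toks in web_pages]
    scored.sort()                                       # smallest KL = most similar to Brown
    return [url for _, url in scored[: max(1, int(len(scored) * keep_fraction))]]
}}}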

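And for the subject-survey idea, the crudest possible version of a subject-adapted model: a linear interpolation of two component models with a per-subject mixture weight derived from the survey. The component models and the survey-to-weight mapping here are entirely made up, just to show the shape of the idea.

{{{
# Sketch: per-subject mixture of two unigram models (e.g., "formal" vs. "casual" text).
# The component probability dicts and the survey-to-weight rule are invented for illustration.

def mixture_prob(word, p_formal, p_casual, lam):
    """Linear interpolation: lam * P_formal(word) + (1 - lam) * P_casual(word)."""
    return lam * p_formal.get(word, 1e-8) + (1 - lam) * p_casual.get(word, 1e-8)

def weight_from_survey(books_per_month, vocab_score):
    """Crude, made-up mapping from survey answers to a mixture weight in [0, 1]."""
    return min(1.0, 0.1 * books_per_month + 0.5 * vocab_score)

# Example: a heavier reader with a higher vocabulary score gets a more "formal" model.
lam = weight_from_survey(books_per_month=3, vocab_score=0.4)
p_formal = {"consequently": 1e-4, "so": 5e-4}
p_casual = {"consequently": 1e-6, "so": 3e-3}
print(lam, mixture_prob("consequently", p_formal, p_casual, lam))
}}}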