Differences between revisions 1 and 2
Revision 1 as of 2008-04-07 14:44:15
Size: 2164
Editor: vandurme
Comment:
Revision 2 as of 2008-04-07 15:32:45
Size: 2161
Editor: colossus
Comment:

web-based ngrams

Papers

Googleology is Bad Science
Adam Kilgarriff
Computational Linguistics March 2007, Vol. 33, No. 1: 147–151.

The World Wide Web is enormous, free, immediately available, and largely linguistic. As we discover, on ever more fronts, that language analysis and generation benefit from big data, so it becomes appealing to use the Web as a data source. The question, then, is how.

Lapata, Mirella and Frank Keller. 2005. Web-based Models for Natural Language Processing. ACM Transactions on Speech and Language Processing 2:1, 1-31.

Previous work demonstrated that web counts can be used to approximate bigram counts, thus suggesting that web-based frequencies should be useful for a wide variety of NLP tasks. However, only a limited number of tasks have so far been tested using web-scale data sets. The present paper overcomes this limitation by systematically investigating the performance of web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine web counts and corpus counts. However, unsupervised web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
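For discussion, the "backoff or interpolation" idea mentioned in the Lapata and Keller abstract can be sketched roughly as follows. This is only an illustrative sketch and not necessarily the formulation used in the paper: the function names, the mixing weight `lam`, and the `threshold` parameter are all assumptions made here, and the counts in the usage lines are made up.

```python
def relative_frequency(count, total):
    """Return count / total, or 0.0 when the total is zero."""
    return count / total if total > 0 else 0.0


def interpolate(web_count, web_total, corpus_count, corpus_total, lam=0.5):
    """Linearly interpolate web and corpus relative frequencies.

    lam is a hypothetical mixing weight in [0, 1]:
    lam=1.0 uses only the web estimate, lam=0.0 only the corpus estimate.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    web_p = relative_frequency(web_count, web_total)
    corpus_p = relative_frequency(corpus_count, corpus_total)
    return lam * web_p + (1.0 - lam) * corpus_p


def backoff(web_count, web_total, corpus_count, corpus_total, threshold=1):
    """Use the corpus estimate when the n-gram was seen at least
    `threshold` times in the corpus; otherwise back off to the web estimate.

    The direction of the backoff (corpus first, then web) is an assumption
    chosen for illustration.
    """
    if corpus_count >= threshold:
        return relative_frequency(corpus_count, corpus_total)
    return relative_frequency(web_count, web_total)


# Made-up counts, purely to show the calls.
mixed = interpolate(web_count=10, web_total=100, corpus_count=30, corpus_total=100)
unseen_in_corpus = backoff(web_count=10, web_total=100, corpus_count=0, corpus_total=100)
```

In practice the weight and threshold would presumably be tuned on held-out data for each task.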


LabMeetingSp08w11 (last edited 2008-09-16 18:46:23 by platypus)
