Papers

Googleology is Bad Science
Adam Kilgarriff
Computational Linguistics March 2007, Vol. 33, No. 1: 147–151.

The World Wide Web is enormous, free, immediately available, and largely linguistic. As we discover, on ever more fronts, that language analysis and generation benefit from big data, so it becomes appealing to use the Web as a data source. The question, then, is how.

Lapata, Mirella and Frank Keller. 2005. Web-based Models for Natural Language Processing. ACM Transactions on Speech and Language Processing 2:1, 1-31.

Previous work demonstrated that web counts can be used to approximate bigram counts, thus suggesting that web-based frequencies should be useful for a wide variety of NLP tasks. However, only a limited number of tasks have so far been tested using web-scale data sets. The present paper overcomes this limitation by systematically investigating the performance of web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine web counts and corpus counts. However, unsupervised web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.

Baayen, Feldman, Schreuder. JML 2006. Morphological influences on the recognition of monosyllabic monomorphemic words.

The study of Balota et al. represents a rigorous analysis that is consistent with twenty years of research on word recognition. Nevertheless, Balota et al. also raised a number of methodological issues inviting further research and clarification. A first such issue is potential non-linearities in the relation between predictors and response latencies. Balota and colleagues observed a non-linear relation between frequency and lexical decision latencies in a univariate regression, but did not explore this non-linearity further in multiple regression. The exclusion of relevant non-linear terms from multiple regression can induce otherwise unnecessary interactions, however.
A second issue concerns the order in which variables are entered into the hierarchical regression model. ...


LabMeetingSp08w11 (last edited 2008-09-16 18:46:23 by platypus)
