We'll discuss opportunities offered and problems raised by using data from the Web for HLP and NLP work. The meeting will cover the following papers:

Fletcher, W. H. (2007). Implementing a BNC-Compare-able Web Corpus. In . Louvain-la-Neuve, Belgium. Retrieved from http://webascorpus.org/wac3/wac3-WHFletcher-revised.pdf.

Keller, F., Lapata, M., & Ourioupina, O. (2002). Using the web to overcome data sparseness. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10 (pp. 230-237). Association for Computational Linguistics, Morristown, NJ, USA.

Kilgarriff, A. (2007). Googleology is Bad Science. Computational Linguistics, 33(1), 147-151.

Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333-347.

BibTeX for papers we're reading:

@inproceedings{keller_usingweb_2002,

},

@article{kilgarriff_introduction_2003,

},

@inproceedings{fletcher_implementingbnc-compare-able_2007,

},

@article{kilgarriff_googleology_2007,

}

Related web-as-corpus papers:

Brants, T., & Franz, A. (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia.

Calvo, H., & Gelbukh, A. (2003). Improving Disambiguation of Prepositional Phrase Attachments Using the Web as Corpus. Procs. of CIARP, 592–598.

Fletcher, W. H. (2007). Implementing a BNC-Compare-able Web Corpus. In . Louvain-la-Neuve, Belgium. Retrieved from http://webascorpus.org/wac3/wac3-WHFletcher-revised.pdf.

Keller, F., & Lapata, M. (2003). Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3), 459-484.

Keller, F., Lapata, M., & Ourioupina, O. (2002). Using the web to overcome data sparseness. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10 (pp. 230-237). Association for Computational Linguistics, Morristown, NJ, USA.

Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333-347.

Lapata, M., & Keller, F. (2004). The Web as a baseline: Evaluating the performance of unsupervised Web-based models for a range of NLP tasks. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 121-128.

Lapata, M., & Keller, F. (2005). Web-based models for natural language processing. ACM Transactions on Speech and Language Processing (TSLP), 2(1).

Lapata, M., Keller, F., & McDonald, S. (2001). Evaluating smoothing algorithms against plausibility judgements. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (pp. 354-361). Association for Computational Linguistics, Morristown, NJ, USA.

Resnik, P., & Smith, N. A. (2003). The Web as a parallel corpus. Computational Linguistics, 29(3), 349-380.

de Schryver, G. M. (2002). WEB for/as corpus: a perspective for the African languages. Nordic Journal of African Studies, 11(2), 266-282.

Volk, M. (2001). Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In Proceedings of Corpus Linguistics (pp. 601–606).

Volk, M. (2002). Using the Web as Corpus for Linguistic Research. Tahendusepuuja. Catcher of the Meaning. A Festschrift for Professor Haldur Oim.

@article{brants_web_????,

},

@article{calvo_improving_2003,

},

@article{de_schryver_web_2002,

},

@inproceedings{fletcher_implementingbnc-compare-able_2007,

abstract = {... English corpus from the web to complement his BNC-based online database {\textquotedblleft}Phrases in English{\textquotedblright}. This new corpus represents the principal English-speaking countries in proportion to their population and will be linguistically annotated with the CLAWS4 tagger using a PoS-tagset comparable to those of the BNC and ANC. Parallel processing on multiple PCs will facilitate reaching the targeted size. This corpus will continue to grow dynamically in response to actual user queries to the author{\textquoteright}s various web as corpus interfaces, but {\textquotedblleft}snapshots{\textquotedblright} of each generation of the corpus will be preserved to ensure replicability of results. This report on work in progress will inspire discussion of the underlying concepts and suggestions for improvement.},

},

@article{keller_usingweb_2003,

},

@inproceedings{keller_usingweb_2002,

},

@article{kilgarriff_introduction_2003,

},

@article{lapata_web_2004,

},

@article{lapata_web-based_2005,

},

@inproceedings{lapata_evaluating_2001,

},

@article{resnik_web_2003,

},

@inproceedings{volk_exploitingwww_2001,

},

@article{volk_usingweb_2002,

}
