We'll discuss opportunities offered and problems raised by using data from the Web for HLP and NLP work.

Papers

Fletcher, W. H. (2007). Implementing a BNC-Compare-able Web Corpus. In . Louvain-la-Neuve, Belgium. Retrieved from http://webascorpus.org/wac3/wac3-WHFletcher-revised.pdf.

Kilgarriff, A. (2007). Googleology is Bad Science. Computational Linguistics, 33(1), 147-151.

Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333-347.

Lapata, M., & Keller, F. (2005). Web-based models for natural language processing. ACM Transactions on Speech and Language Processing (TSLP), 2(1).

Files

The link to the CLEANEVAL project: http://cleaneval.sigwac.org.uk/

Instructions

Read Kilgarriff's Googleology paper if you haven't already. If you have already read that one, read Kilgarriff & Grefenstette.

Read at least one of Fletcher or Lapata & Keller.

BibTeX

@inproceedings{fletcher_implementingbnc-compare-able_2007,
        address = {Louvain-la-Neuve, Belgium},
        title = {Implementing a BNC-Compare-able Web Corpus},
        url = {http://webascorpus.org/wac3/wac3-WHFletcher-revised.pdf},
        author = {W. H. Fletcher},
        month = sep,
        year = {2007},
        keywords = {web-as-corpus}
},

@article{kilgarriff_googleology_2007,
        title = {Googleology is Bad Science},
        volume = {33},
        number = {1},
        journal = {Computational Linguistics},
        author = {A. Kilgarriff},
        year = {2007},
        keywords = {web-as-corpus},
        pages = {147--151}
},

@article{kilgarriff_introduction_2003,
        title = {Introduction to the Special Issue on the Web as Corpus},
        volume = {29},
        number = {3},
        journal = {Computational Linguistics},
        author = {A. Kilgarriff and G. Grefenstette},
        year = {2003},
        keywords = {web-as-corpus},
        pages = {333--347}
},

@article{lapata_web-based_2005,
        title = {Web-based models for natural language processing},
        volume = {2},
        number = {1},
        journal = {ACM Transactions on Speech and Language Processing (TSLP)},
        author = {M. Lapata and F. Keller},
        year = {2005},
        keywords = {web-as-corpus}
}

Brants, T., & Franz, A. (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia.

Calvo, H., & Gelbukh, A. (2003). Improving Disambiguation of Prepositional Phrase Attachments Using the Web as Corpus. Procs. of CIARP, 592–598.

Fletcher, W. H. (2007). Implementing a BNC-Compare-able Web Corpus. In . Louvain-la-Neuve, Belgium. Retrieved from http://webascorpus.org/wac3/wac3-WHFletcher-revised.pdf.

Keller, F., & Lapata, M. (2003). Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3), 459-484.

Keller, F., Lapata, M., & Ourioupina, O. (2002). Using the web to overcome data sparseness. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10 (pp. 230-237). Morristown, NJ, USA: Association for Computational Linguistics.

Kilgarriff, A. (2007). Googleology is Bad Science. Computational Linguistics, 33(1), 147-151.

Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333-347.

Lapata, M., & Keller, F. (2004). The Web as a baseline: Evaluating the performance of unsupervised Web-based models for a range of NLP tasks. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 121-128.

Lapata, M., & Keller, F. (2005). Web-based models for natural language processing. ACM Transactions on Speech and Language Processing (TSLP), 2(1).

Lapata, M., Keller, F., & McDonald, S. (2001). Evaluating smoothing algorithms against plausibility judgements. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (pp. 354-361). Morristown, NJ, USA: Association for Computational Linguistics.

Resnik, P., & Smith, N. A. (2003). The Web as a parallel corpus. Computational Linguistics, 29(3), 349-380.

de Schryver, G. M. (2002). WEB for/as corpus: a perspective for the African languages. Nordic Journal of African Studies, 11(2), 266-282.

Volk, M. (2001). Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In Proceedings of Corpus Linguistics (pp. 601–606).

Volk, M. (2002). Using the Web as Corpus for Linguistic Research. In Tähendusepüüdja. Catcher of the Meaning. A Festschrift for Professor Haldur Õim.

@misc{brants_web_2006,
        title = {Web {1T} 5-gram Version 1},
        howpublished = {Linguistic Data Consortium, Philadelphia},
        author = {T. Brants and A. Franz},
        year = {2006},
        keywords = {web-as-corpus}
},

@inproceedings{calvo_improving_2003,
        title = {Improving Disambiguation of Prepositional Phrase Attachments Using the Web as Corpus},
        booktitle = {Procs. of CIARP},
        author = {H. Calvo and A. Gelbukh},
        year = {2003},
        keywords = {web-as-corpus},
        pages = {592--598}
},

@article{de_schryver_web_2002,
        title = {WEB for/as corpus: a perspective for the African languages},
        volume = {11},
        abstract = {In this article the potential of the multilingual Web to function as a corpus, in addition to a source for corpus creation, is examined. Despite the fact that English dominates the Web, and despite the fact that most work in corpus linguistics revolves around English, it will be argued that African languages do have a place in the bigger picture. Substantial African-language Web corpora can indeed already be compiled (Web for Corpus) and accessed (Web as Corpus), and the list of potential applications grows by the day.},
        number = {2},
        journal = {Nordic Journal of African Studies},
        author = {G. M. de Schryver},
        year = {2002},
        keywords = {web-as-corpus},
        pages = {266--282}
},

@inproceedings{fletcher_implementingbnc-compare-able_2007,
        address = {Louvain-la-Neuve, Belgium},
        title = {Implementing a BNC-Compare-able Web Corpus},
        url = {http://webascorpus.org/wac3/wac3-WHFletcher-revised.pdf},
        abstract = {This paper details the author{\textquoteright}s plans for and progress with compiling and analyzing a new gigaword 
English corpus from the web to complement his BNC-based online database {\textquotedblleft}Phrases in English{\textquotedblright}.  
This new corpus represents the principal English-speaking countries in proportion to their population 
and will be linguistically annotated with the CLAWS4 tagger using a PoS-tagset comparable to those 
of the BNC and ANC.  Parallel processing on multiple PCs will facilitate reaching the targeted size.  
This corpus will continue to grow dynamically in response to actual user queries to  the author{\textquoteright}s 
various web as corpus interfaces, but {\textquotedblleft}snapshots{\textquotedblright} of each generation of the corpus will be preserved to 
ensure replicability of results.  This report on work in progress will inspire discussion of the 
underlying concepts and suggestions for improvement.},
        author = {W. H. Fletcher},
        month = sep,
        year = {2007},
        keywords = {web-as-corpus}
},

@article{keller_usingweb_2003,
        title = {Using the Web to Obtain Frequencies for Unseen Bigrams},
        volume = {29},
        number = {3},
        journal = {Computational Linguistics},
        author = {F. Keller and M. Lapata},
        year = {2003},
        keywords = {web-as-corpus},
        pages = {459--484}
},

@inproceedings{keller_usingweb_2002,
        title = {Using the web to overcome data sparseness},
        booktitle = {Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10},
        publisher = {Association for Computational Linguistics},
        address = {Morristown, NJ, USA},
        author = {F. Keller and M. Lapata and O. Ourioupina},
        year = {2002},
        keywords = {web-as-corpus},
        pages = {230--237}
},

@article{kilgarriff_googleology_2007,
        title = {Googleology is Bad Science},
        volume = {33},
        number = {1},
        journal = {Computational Linguistics},
        author = {A. Kilgarriff},
        year = {2007},
        keywords = {web-as-corpus},
        pages = {147--151}
},

@article{kilgarriff_introduction_2003,
        title = {Introduction to the Special Issue on the Web as Corpus},
        volume = {29},
        number = {3},
        journal = {Computational Linguistics},
        author = {A. Kilgarriff and G. Grefenstette},
        year = {2003},
        keywords = {web-as-corpus},
        pages = {333--347}
},

@inproceedings{lapata_web_2004,
        title = {The Web as a baseline: Evaluating the performance of unsupervised Web-based models for a range of NLP tasks},
        booktitle = {Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics},
        author = {M. Lapata and F. Keller},
        year = {2004},
        keywords = {web-as-corpus},
        pages = {121--128}
},

@article{lapata_web-based_2005,
        title = {Web-based models for natural language processing},
        volume = {2},
        number = {1},
        journal = {ACM Transactions on Speech and Language Processing (TSLP)},
        author = {M. Lapata and F. Keller},
        year = {2005},
        keywords = {web-as-corpus}
},

@inproceedings{lapata_evaluating_2001,
        title = {Evaluating smoothing algorithms against plausibility judgements},
        booktitle = {Proceedings of the 39th Annual Meeting on Association for Computational Linguistics},
        publisher = {Association for Computational Linguistics},
        address = {Morristown, NJ, USA},
        author = {M. Lapata and F. Keller and S. McDonald},
        year = {2001},
        keywords = {web-as-corpus},
        pages = {354--361}
},

@article{resnik_web_2003,
        title = {The Web as a parallel corpus},
        volume = {29},
        number = {3},
        journal = {Computational Linguistics},
        author = {P. Resnik and N. A. Smith},
        year = {2003},
        keywords = {web-as-corpus},
        pages = {349--380}
},

@inproceedings{volk_exploitingwww_2001,
        title = {Exploiting the WWW as a corpus to resolve PP attachment ambiguities},
        booktitle = {Proceedings of Corpus Linguistics},
        author = {M. Volk},
        year = {2001},
        keywords = {web-as-corpus},
        pages = {601--606}
},

@incollection{volk_usingweb_2002,
        title = {Using the Web as Corpus for Linguistic Research},
        booktitle = {T{\"a}hendusep{\"u}{\"u}dja. Catcher of the Meaning. A Festschrift for Professor Haldur {\~O}im},
        author = {M. Volk},
        year = {2002},
        keywords = {web-as-corpus}
}

LabmeetingAu08w4 (last edited 2008-09-30 18:38:57 by colossus)