Some really useful links to corpora on the web

  • [[http://devoted.to/corpora|David Lee's amazing summary of corpora and corpus tools]]
  • [[http://www.ldc.upenn.edu/Catalog/|Searchable catalogue of the Linguistic Data Consortium]] (we are a member and either have or can get most of the corpora for free).
  • If you don't find a corpus for a particular task on any of the above pages, send an email to the international corpus email list (you need to [[http://gandalf.aksis.uib.no/corpora/sub.html|subscribe]]). This is not the same as our local list, which informs you of changes to our corpus environment (see next section).
  • [[http://www.webcorp.org.uk/|The web as corpus]]
  • [[http://www.linguistics.ucla.edu/people/hayes/QueryGoogle/qgapplet.html|Automated Google queries]]
  • [[http://corpus.byu.edu/|BYU corpora]] of American English (100 - 360 million words); British English (100 million words); historical corpora of English and Spanish. Corpora are POS tagged and lemma searchable.
  • [[http://corp.hum.sdu.dk/|Treebanks]] of Danish, Swedish, Norwegian, Icelandic, German, British English, French, Italian, Spanish, Portuguese, Romanian, Esperanto, Faroese, Estonian
  • [[http://faculty.washington.edu/ebender/corpora_sociolx.html|Corpora for sociolinguists]] by Emily Bender
  • [[UnixEnvironment|Setting up your Unix environment in the lab for corpus work]]

Stay up-to-date with the corpora@cs email list

You may want to subscribe to our corpora mailing list (ask Benjamin Van Durme to add you to the list). This list distributes information about our corpus environment. You can also use it to ask and answer questions about the local corpus environment. Several groups use corpora. At the very least you will find useful tools and corpora in /p/hlp/ and /p/nl/ (the natural language processing group in CS). There is also a [[http://www.cs.rochester.edu/research/speech/ldc.html|corpus inventory]] and an [[http://www.cs.rochester.edu/research/cisd/resources/acquisitions.rss|RSS feed]]. To read the feed, you can use a web browser or email client with built-in RSS capabilities (e.g. Mozilla Firefox, Mozilla Thunderbird), a web-based aggregator (e.g. on My Yahoo!, Google), or a standalone aggregator. We have added further corpora under /p/hlp/corpora/ and more are to come. An updated page reflecting our corpus and software inventory is in the works.
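
An RSS feed can also be read from a script rather than an aggregator. The following is a minimal sketch, assuming the feed follows the usual RSS 2.0 layout (channel, item, title, link). The sample feed text and the function name list_feed_items are made up for illustration and do not reflect the contents of the real acquisitions feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in RSS 2.0 layout; NOT the real acquisitions feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example acquisitions feed</title>
    <item>
      <title>Example corpus A</title>
      <link>http://example.org/a</link>
    </item>
    <item>
      <title>Example corpus B</title>
      <link>http://example.org/b</link>
    </item>
  </channel>
</rss>
"""


def list_feed_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in list_feed_items(SAMPLE_FEED):
        print(title, link)
```

To use it on a live feed, the text would first have to be downloaded (for example with urllib.request) and then passed to list_feed_items.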

Corpora

1. Plain text corpora

1.1. Gigaword corpora

  • Chinese
  • Spanish

1.2. Reuters - NIST Corpus

Reuters newswire from 1996-08-20 to 1997-08-19.

  • RCV1 - 810,000 English-language Reuters news stories.
  • RCV2 - 487,000 Reuters news stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish).

To use this corpus you as an individual must sign a license agreement, which is then kept on file in the lab. After signing you will be added to the reuters Unix group on the HLP system. This corpus is not available on the CS system.
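
To confirm that the group membership has taken effect, something like the following can be run on the HLP system. It is a minimal sketch using only the Python standard library on Unix. The function name current_user_in_group is made up for illustration, and the group name is passed in, so nothing about the lab's actual setup is assumed beyond the group name mentioned above.

```python
import grp
import os


def current_user_in_group(group_name):
    """Return True if the current process belongs to the named Unix group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this machine.
        return False
    # os.getgroups() lists supplementary groups; os.getgid() is the primary one.
    return gid in os.getgroups() or gid == os.getgid()


if __name__ == "__main__":
    print("reuters:", current_user_in_group("reuters"))
```

Group lists are fixed when a session starts, so a newly added membership usually shows up only after logging out and back in.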

See http://trec.nist.gov/data/reuters/reuters.html for more detail.

2. Audio and video corpora

Audio corpora (the sound files of corpora) are stored under /p/hlp/corpora/Audio/.

2.1. Switchboard sound files

We've installed the sound files of those Switchboard dialogues (Switchboard 1 release 2) that are part of the Penn Treebank (release 3, Marcus et al. 1999) and the [[#pswbd|Edinburgh-Stanford Paraphrase Switchboard]]. Jason Brenier has developed Python scripts that map the Switchboard annotation layers to the sound files. This makes it possible (via intermediate steps) to, for example, conduct syntactic searches and then extract acoustic information from the sound files for the matched spans.
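
The core of the last step, going from a time-aligned match to the corresponding stretch of audio, is a conversion from seconds to sample indices. The sketch below is not Jason Brenier's code. It is a generic illustration with made-up function names, and it assumes the audio has already been decoded into a sequence of samples for one channel. The 8000 Hz default is an assumption based on the telephone-speech sampling rate and should be checked against the header of the actual sound file.

```python
# Assumed sampling rate in Hz; verify against the sound file header.
ASSUMED_SAMPLE_RATE = 8000


def span_to_sample_range(start_sec, end_sec, sample_rate=ASSUMED_SAMPLE_RATE):
    """Convert a time span in seconds to (first, last) sample indices."""
    if start_sec < 0 or end_sec < start_sec:
        raise ValueError("expected 0 <= start_sec <= end_sec")
    first = int(round(start_sec * sample_rate))
    last = int(round(end_sec * sample_rate))
    return first, last


def slice_samples(samples, start_sec, end_sec, sample_rate=ASSUMED_SAMPLE_RATE):
    """Return the samples that fall inside the given time span."""
    first, last = span_to_sample_range(start_sec, end_sec, sample_rate)
    return samples[first:last]


if __name__ == "__main__":
    # A match aligned to 1.5 s - 2.0 s covers samples 12000 to 16000 at 8 kHz.
    print(span_to_sample_range(1.5, 2.0))
```

The extracted slice can then be handed to whatever acoustic analysis is needed (duration, pitch, intensity, and so on).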

2.2. Buckeye Corpus

2.3. Czech Broadcast Conversation

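[[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02|Speech]] and [[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20|MDE Transcripts]]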

2.4. CALLHOME American English

[[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97T14|Transcripts]]

3. Syntactically annotated corpora

3.1. Treebanks

|| Title || File || LDC Catalog number/Original name || Language || #word || #sentence || #story || Original format ||
|| Arabic Treebank Part 1 V3 || ATB1_V3/ || LDC2005T02 || Arabic || 145,386 || || 734 || ||
|| Arabic Treebank Part 2 V2 || ATB2_V2/ || LDC2004T02 || Arabic || 144,199 || || 501 || ||
|| Arabic Treebank Part 3 V1 || ATB3_V1/ || LDC2004T11 || Arabic || 340,281 || || 600 || ||
|| Chinese Treebank V5.1 || ChineseTreebank5.1/ || LDC2005T01U01 || Chinese || 507,222 || 18,782 || || ||
|| Chinese Treebank V6.0 || ChineseTreebank6.0/ || LDC2007T36 || Chinese || 781,351 || 28,295 || || ||
|| Chinese Treebank V7.0 || ChineseTreebank7.0/ || LDC2010T07 || Chinese || 840,000 || || || ||
|| Penn Discourse Treebank Version 2.0 || pdtb_v2/ || LDC2008T05 || English || || || || ||
|| Prague Dependency Treebank 2.0 || pdt_2/ || LDC2006T01 || Czech || 2,000,000 || || || ||
|| Danish Dependency Treebank V1.0 || ddt1.0/ || ddt-1.0.tar || Danish || || 5,540 || || ||
|| NEGRA corpus V2.0 || Negra2.0/ || negra-corpus.tar.gz || German || || 20,602 || || export/Penn Treebank ||
|| Merge files || mrg/ || || || || || || ||
|| Tübingen Treebank of Spoken English || TuebaES/ || || English || 310,000 || 30,000 || || export/Penn Treebank/XML ||
|| Tübingen Treebank of Spoken German || TuebaDS/ || || German || 360,000 || 38,000 || || export/Penn Treebank/XML ||
|| Tübingen Treebank of Spoken Japanese || TuebaJS/ || || Japanese || 160,000 || 18,000 || || export/XML/CoNLL-X Shared Task dependency ||

3.1.1. Penn Chinese Treebank

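[[http://www.cis.upenn.edu/~chinese/segguide.3rd.ch.pdf|Segmentation Guide]]
[[http://www.cis.upenn.edu/~chinese/posguide.3rd.ch.pdf|POS Tagging Guide]]
[[http://www.cis.upenn.edu/~chinese/parseguide.3rd.ch.pdf|Bracketing Guide]]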

3.2. TGrep2able

All of our syntactically annotated corpora have been processed to make them usable with the TGrep2 tool. See [[CorpusTools|Corpus Tools]] for more info on TGrep2.

|| Filename || Description ||
|| BNC.parsed.t2c.gz || The full British National Corpus ||
|| BNC_spoken.parsed.t2c.gz || Just the spoken part of the BNC ||
|| BNC_written.parsed.t2c.gz || Just the written part of the BNC ||
|| arabic-collapsed.t2c.gz || ||
|| arabic-treebank-with-vowels.t2c.gz || ||
|| arabic-treebank-without-vowels.t2c.gz || ||
|| brown.t2c.gz || ||
|| chtb2.t2c.gz || ||
|| chtb4.t2c.gz || ||
|| chtb5.1.t2c || ||
|| commented-brown.t2c.gz || ||
|| icegb.t2c.gz || ||
|| negra.t2c.gz || ||
|| sw.backtrans.convid_030507.t2c.gz || ||
|| sw.backtrans.t2c.gz || ||
|| swbd.t2c.gz || ||
|| tiger.t2c.gz || ||
|| tuebaes.t2c.gz || Tübingen Treebank of Spoken English ||
|| tuebads.t2c.gz || Tübingen Treebank of Spoken German ||
|| wsj-commented.t2c.gz || ||
|| wsj_mrg.t2c.gz || ||
|| ycoe.t2c.gz || ||
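
A search over one of these files can be scripted by calling TGrep2 from Python. This is a minimal sketch under several assumptions: the tgrep2 binary is on your PATH, the corpus file is named with TGrep2's -c option, and the file is in the current directory (the directory where the .t2c.gz files live is not stated here, so adjust the path). The function names and the example pattern are made up for illustration.

```python
import subprocess


def build_tgrep2_command(corpus_path, pattern):
    """Build the argument list for a TGrep2 search over one corpus file."""
    # -c names the corpus file to search; the pattern is the last argument.
    return ["tgrep2", "-c", corpus_path, pattern]


def run_tgrep2(corpus_path, pattern):
    """Run TGrep2 and return its output as a list of lines."""
    result = subprocess.run(
        build_tgrep2_command(corpus_path, pattern),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tgrep2 exits with an error
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Hypothetical query: NP nodes that immediately dominate a PP.
    for line in run_tgrep2("swbd.t2c.gz", "NP < PP")[:10]:
        print(line)
```

Passing the arguments as a list avoids shell quoting problems, which matters because TGrep2 patterns routinely contain characters such as <, > and |.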

3.3. TIGER Corpora

3.3.1. Tiger2 Corpus

4. Edinburgh-Stanford Paraphrase Switchboard

This corpus combines numerous annotations for the Penn Treebank (release 3, Marcus et al. 1999) portion of the Switchboard corpus (Godfrey et al. 1992). In addition to the part-of-speech, grammatical function, and syntactic annotation of the Treebank, the corpus includes annotation for turn-taking, disfluencies (Taylor et al. 1995), dialogue acts (Shriberg et al. 1998), animacy (Zaenen et al. 2003), coreference and information status (Nissim et al. 2001), kontrast and kontrast-triggers (Calhoun 2006), as well as time-aligned orthographic transcripts (automatically aligned down to the syllable and segment level), some prosodic phrasing and accent annotation, and more. For a detailed summary and manuals, see [[http://groups.inf.ed.ac.uk/switchboard/index.html|the Edinburgh webpages on Switchboard in NXT]]. We have both the XML version and a version that has been back-translated to Penn Treebank format, so that it is TGrep2able (thanks to Jean Carletta, Jason Brenier, Sasha Calhoun, and Neal Snider).

The time-alignment of the orthographic words that are the terminals of the Treebank Switchboard corpus makes it possible to go from the syntactic searches to the [[#swbdsound|corresponding sound files]] of the Switchboard conversations to extract acoustic information from them (or for listening pleasure ;-)).

NB: You must be a member of the pswbd group to have access to this corpus.

Corpora (last edited 2018-06-07 17:57:11 by dhcp-10-5-21-163)
