Differences between revisions 6 and 41 (spanning 35 versions)

Some really useful links to corpora on the web

David Lee's amazing summary of corpora and corpus tools
Searchable catalogue of the Linguistic Data Consortium (we are a member and either have or can get most of the corpora for free).
If you don't find a corpus for a particular task on any of the above pages, send an email to the international corpus email list (you need to subscribe). This is not the same as our local list, which informs you of changes to our corpus environment (see next section).
Online search interfaces for:
- The web as corpus
- Automated Google queries
- BYU corpora of American English (100 - 360 million words); British English (100 million words); historical corpora of English and Spanish. Corpora are POS tagged and lemma searchable.
- Treebanks of Danish, Swedish, Norwegian, Icelandic, German, British English, French, Italian, Spanish, Portuguese, Romanian, Esperanto, Faroese, Estonian
Classes on:
- Corpora for sociolinguists by Emily Bender
Local help
- Setting up your Unix environment in the lab for corpus work

Stay up-to-date with the corpora@cs email list

You may want to subscribe to our corpora mailing list (ask Benjamin Van Durme to add you to the list). This list distributes information about our corpus environment. You can also use it to ask and answer questions about the local corpus environment. There are several groups that use corpora. At the very least you will find useful tools and corpora in /p/hlp/ and /p/nl/ (the natural language processing group in CS). There is also a corpus inventory and RSS feed. To read the feed, you can use a web browser or email client with built-in RSS capabilities (e.g. Mozilla Firefox, Mozilla Thunderbird), a web-based aggregator (e.g. on My Yahoo!, Google), or a standalone aggregator. We have added further corpora to /p/hlp/corpora/, and more are to come. An updated page reflecting our corpus and software inventory is in the works.

Corpora

1. Plain text corpora

1.1. Gigaword corpora

Chinese
Spanish

1.2. Reuters - NIST Corpus

Reuters newswire from 1996-08-20 to 1997-08-19.

RCV1 - 810,000 Reuters, English Language News stories.
RCV2 - 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)

To use this corpus you as an individual must sign a license agreement, which is then kept on file in the lab. After signing you will be added to the reuters Unix group on the HLP system. This corpus is not available on the CS system.

See http://trec.nist.gov/data/reuters/reuters.html for more detail.
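
Once you have been added to the group, you can confirm that your login session actually carries it. The following is a minimal sketch, not a lab-provided tool: the function names are made up for illustration, and it only inspects the groups of the current process. On most Unix systems a newly granted group shows up only after you log out and back in.

```python
import grp
import os


def current_group_names():
    """Return the names of all Unix groups the current process belongs to."""
    gids = set(os.getgroups()) | {os.getegid()}
    names = set()
    for gid in gids:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # This gid has no entry in the group database; skip it.
            pass
    return names


def in_group(group, groups=None):
    """Check whether `group` is among `groups` (default: the current process's groups)."""
    if groups is None:
        groups = current_group_names()
    return group in groups


if __name__ == "__main__":
    # "reuters" is the group named above; run this on the HLP system.
    print("reuters access:", in_group("reuters"))
```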

2. Audio and video corpora

Audio corpora (the sound files of corpora) are stored under /p/hlp/corpora/Audio/.

2.1. Switchboard sound files

We've installed the sound files of those Switchboard dialogues (Switchboard 1 release 2) that are part of the Penn Treebank (release 3, Marcus et al. 1999) and the Edinburgh-Stanford Paraphrase Switchboard. Jason Brenier has developed Python scripts that map the Switchboard annotation layers to the sound files, making it possible (via intermediate steps) to e.g. conduct syntactic searches and then extract acoustic information from the sound files over the syntactic match.

2.2. Buckeye Corpus

2.3. Czech Broadcast Conversation

Speech and MDE Transcripts

2.4. CALLHOME American English

Transcripts

3. Syntactically annotated corpora

3.1. Treebanks

Title	File	LDC Catalog number/Original name	Language	#word	#sentence	#story	Original format
Arabic Treebank Part 1 V3	`ATB1_V3/`	`LDC2005T02`	Arabic	145,386		734
Arabic Treebank Part 2 V2	`ATB2_V2/`	`LDC2004T02`	Arabic	144,199		501
Arabic Treebank Part 3 V1	`ATB3_V1/`	`LDC2004T11`	Arabic	340,281		600
Chinese Treebank V5.1	`ChineseTreebank5.1/`	`LDC2005T01U01`	Chinese	507,222		18,782
Chinese Treebank V6.0	`ChineseTreebank6.0/`	`LDC2007T36`	Chinese	781,351	28,295
Chinese Treebank v7.0	`ChineseTreebank7.0/`	`LDC2010T07`	Chinese	840,000
Penn Discourse Treebank Version 2.0	`pdtb_v2/`	`LDC2008T05`	English
Prague Dependency Treebank 2.0	`pdt_2/`	`LDC2006T01`	Czech	2,000,000
Danish Dependency Treebank V1.0	`ddt1.0/`	`ddt-1.0.tar`	Danish			5,540
NEGRA corpus V2.0	`Negra2.0/`	`negra-corpus.tar.gz`	German			20,602	export/Penn Treebank
Merge files	`mrg/`
Tübingen Treebank of Spoken English	`TuebaES/`		English	310,000	30,000		export/Penn Treebank/XML
Tübingen Treebank of Spoken German	`TuebaDS/`		German	360,000	38,000		export/Penn Treebank/XML
Tübingen Treebank of Spoken Japanese	`TuebaJS/`		Japanese	160,000	18,000		export/XML/CoNLL-X Shared Task dependency

3.1.1. Penn Chinese Treebank

Segmentation Guide
POS Tagging Guide
Bracketing Guide

3.2. TGrep2able

All of our syntactically annotated corpora have been processed to make them usable with the TGrep2 tool. See [wiki:Self/HlpLab/CorpusTools Corpus Tools] for more info on TGrep2.

Filename	Description
BNC.parsed.t2c.gz	The full British National Corpus
BNC_spoken.parsed.t2c.gz	Just the spoken part of the BNC
BNC_written.parsed.t2c.gz	Just the written part of the BNC
arabic-collapsed.t2c.gz
arabic-treebank-with-vowels.t2c.gz
arabic-treebank-without-vowels.t2c.gz
brown.t2c.gz
chtb2.t2c.gz
chtb4.t2c.gz
chtb5.1.t2c
commented-brown.t2c.gz
icegb.t2c.gz
negra.t2c.gz
sw.backtrans.convid_030507.t2c.gz
sw.backtrans.t2c.gz
swbd.t2c.gz
tiger.t2c.gz
tuebaes.t2c.gz	Tübingen Treebank of Spoken English
tuebads.t2c.gz	Tübingen Treebank of Spoken German
wsj-commented.t2c.gz
wsj_mrg.t2c.gz
ycoe.t2c.gz
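
To give an idea of how one of the files above might be queried from a script, here is a hedged sketch that builds and runs a TGrep2 command from Python. The helper names and the example path are hypothetical (this page does not say where the `.t2c.gz` files live), and the flags used (`-c` for the corpus file, `-a` for all matches per sentence, `-t` for terminals only) are taken from our reading of the TGrep2 manual; please check them against the version installed in the lab.

```python
import shutil
import subprocess


def tgrep2_command(corpus_path, pattern, all_matches=True, terminals_only=True):
    """Build the argument list for a TGrep2 search (does not run anything)."""
    cmd = ["tgrep2", "-c", corpus_path]
    if all_matches:
        cmd.append("-a")  # report every match in a sentence, not just the first
    if terminals_only:
        cmd.append("-t")  # print only the words, not the bracketed trees
    cmd.append(pattern)
    return cmd


def run_tgrep2(corpus_path, pattern):
    """Run TGrep2 and return its output as a list of lines."""
    if shutil.which("tgrep2") is None:
        raise RuntimeError("tgrep2 was not found on PATH")
    result = subprocess.run(
        tgrep2_command(corpus_path, pattern),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Hypothetical location; substitute the real path to the corpus file.
    corpus = "/path/to/swbd.t2c.gz"
    # "NP < PP" matches an NP that immediately dominates a PP.
    for line in run_tgrep2(corpus, "NP < PP")[:10]:
        print(line)
```

Keeping the command builder separate from the runner makes it easy to print the exact command first and try it by hand in a shell.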

3.3. TIGER Corpora

3.3.1. Tiger2 Corpus

4. Edinburgh-Stanford Paraphrase Switchboard

This corpus combines numerous annotations for the Penn Treebank (release 3, Marcus et al. 1999) portion of the Switchboard corpus (Godfrey et al. 1992). In addition to the part-of-speech, grammatical function, and syntactic annotation of the Treebank, the corpus includes annotation for turn-taking, disfluencies (Taylor et al. 1995), dialogue acts (Shriberg et al. 1998), animacy (Zaenen et al. 2003), coreference and information status (Nissim et al. 2001), kontrast and kontrast-triggers (Calhoun 2006), as well as time-aligned orthographic transcripts (automatically aligned down to the syllable and segment level), some prosodic phrasing and accent annotation, and some more. For a detailed summary and manuals, see the Edinburgh webpages on Switchboard in NXT. We have both the XML version and a version that has been back-translated to Penn Treebank format, so that it is TGrep2able (thanks to Jean Carletta, Jason Brenier, Sasha Calhoun, and Neal Snider).

The time-alignment of the orthographic words that are the terminals of the Treebank Switchboard corpus makes it possible to go from the syntactic searches to the corresponding sound files of the Switchboard conversations to extract acoustic information from them (or for listening pleasure ;-)).
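
As an illustration of that last step, the sketch below cuts the audio for one time-aligned match out of a sound file. It is not one of the lab's mapping scripts. It assumes you already have start and end times in seconds for the match, and that the conversation is available as a PCM WAV file; the Switchboard distribution itself uses a different audio format, so a conversion step is assumed. All function names are hypothetical.

```python
import wave


def time_span_to_frames(start_sec, end_sec, frame_rate):
    """Convert a time span in seconds to (first_frame, frame_count)."""
    if start_sec < 0 or end_sec < start_sec:
        raise ValueError("need 0 <= start_sec <= end_sec")
    first = int(round(start_sec * frame_rate))
    count = int(round(end_sec * frame_rate)) - first
    return first, count


def extract_span(wav_path, start_sec, end_sec, out_path):
    """Copy the audio between start_sec and end_sec into a new WAV file.

    Returns the number of frames actually written.
    """
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        first, count = time_span_to_frames(start_sec, end_sec, params.framerate)
        src.setpos(first)
        frames = src.readframes(count)
    with wave.open(out_path, "wb") as dst:
        # The frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return len(frames) // (params.sampwidth * params.nchannels)
```

For example, `extract_span("conversation.wav", 12.34, 13.02, "match.wav")` would write the stretch of audio covering a word or phrase aligned to that interval (the file names here are placeholders).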

NB: You must be a member of the pswbd group to have access to this corpus.

-  Revision 6 as of 2007-09-27 19:58:19
  Size: 2301
  Editor: lab1
  Comment:
+  Revision 41 as of 2011-10-07 19:41:30
  Size: 8819
  Editor: echidna
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 6:
+= Some really useful links to corpora on the web =

 * [[http://devoted.to/corpora|David Lee's amazing summary of corpora and corpus tools]]
 * [[http://www.ldc.upenn.edu/Catalog/|Searchable catalogue of the Linguistic Data Consortium]] (we are a member and either have or can get most of the corpora for free).
 * If you don't '''find a corpus''' for a particular task on any of the above pages, send an email to the international corpus email list (you need to [[http://gandalf.aksis.uib.no/corpora/sub.html|subscribe]]). This is not the same as our local list, which informs you of changes to our corpus environment (see next section).

 * Online search interfaces for:
  * [[http://www.webcorp.org.uk/|The web as corpus]]
  * [[http://www.linguistics.ucla.edu/people/hayes/QueryGoogle/qgapplet.html|Automated Google queries]]
  * [[http://corpus.byu.edu/|BYU corpora]] of American English (100 - 360 million words); British English (100 million words); historical corpora of English and Spanish. Corpora are POS tagged and lemma searchable.
  * [[http://corp.hum.sdu.dk/|Treebanks]] of Danish, Swedish, Norwegian, Icelandic, German, British English, French, Italian, Spanish, Portuguese, Romanian, Esperanto, Faroese, Estonian

 * Classes on:
  * [[http://faculty.washington.edu/ebender/corpora_sociolx.html|Corpora for sociolinguists]] by Emily Bender

 * Local help
  * [[UnixEnvironment|Setting up your Unix environment in the lab for corpus work]]
-Line 7:
+Line 25:
-You may want to subscribe to our corpora mailing list (just ask Benjamin Van Durme to add you to the list). This list distributess information about our corpus environment. You can also use it ask/answer questions about the local corpus environment. There are several groups that use corpora. At the very least you will find useful tools and corpora in /p/hlp/ and /p/nl/ (the natural language processing group in CS). There is also a [http://www.cs.rochester.edu/research/speech/ldc.html corpus inventory] and [http://www.cs.rochester.edu/research/cisd/resources/acquisitions.rss RSS feed]. You can use a web browser or email client with built-in RSS capabilities (e.g. Mozilla Firefox, Mozilla Thunderbird), a web-based aggregator (e.g. on My Yahoo!, Google), or a standalone aggregator. We have added further corpora on /p/hlp/corpora/ and more is to come. An updated page reflecting our corpus and software inventory is in the works.
+You may want to subscribe to our corpora mailing list (ask Benjamin Van Durme to add you to the list). This list distributes information about our corpus environment. You can also use it to ask and answer questions about the local corpus environment. There are several groups that use corpora. At the very least you will find useful tools and corpora in /p/hlp/ and /p/nl/ (the natural language processing group in CS). There is also a [[http://www.cs.rochester.edu/research/speech/ldc.html|corpus inventory]] and [[http://www.cs.rochester.edu/research/cisd/resources/acquisitions.rss|RSS feed]]. To read the feed, you can use a web browser or email client with built-in RSS capabilities (e.g. Mozilla Firefox, Mozilla Thunderbird), a web-based aggregator (e.g. on My Yahoo!, Google), or a standalone aggregator. We have added further corpora to /p/hlp/corpora/, and more are to come. An updated page reflecting our corpus and software inventory is in the works.
-Line 11:
+Line 29:
-== Gigaword ==
+== Plain text corpora ==

=== Gigaword corpora ===
-Line 14:
+Line 34:
+ * Spanish
-Line 15:
+Line 36:
-== Parsed Switchboard ==
You must be a member of the `pswbd` Unix group to access this corpus.
+=== Reuters - NIST Corpus ===
Reuters newswire from 1996-08-20 to 1997-08-19.
-Line 18:
+Line 39:
-== TGrep2able ==
Corpora that have been processed to make them usable with the TGrep2 tool. See [wiki:Self/HlpLab/CorpusTools Corpus Tools] for more info on TGrep2.
+ * RCV1 - 810,000 Reuters, English Language News stories.
 * RCV2 - 487,000 Reuters News stories in '''thirteen languages''' (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)
-Line 21:
+Line 42:
-== TIGER Corpora ==
+To use this corpus you as an individual must sign a license agreement, which is then kept on file in the lab. After signing you will be added to the `reuters` Unix group on the HLP system. This corpus is not available on the CS system.
-Line 23:
+Line 44:
-== Tiger2 Corpus ==
+See [[http://trec.nist.gov/data/reuters/reuters.html]] for more detail.
-Line 25:
+Line 46:
-== Treebanks ==
+== Audio and video corpora ==
Audio corpora (the sound files of corpora) are stored under `/p/hlp/corpora/Audio/`.

<<Anchor(swbdsound)>>
=== Switchboard sound files ===
We've installed the sound files of those Switchboard dialogues (Switchboard 1 release 2) that are part of the Penn Treebank (release 3, Marcus et al. 1999) and the [[#pswbd|Edinburgh-Stanford Paraphrase Switchboard]]. Jason Brenier has developed Python scripts that map the Switchboard annotation layers to the sound files, making it possible (via intermediate steps) to e.g. conduct syntactic searches and then extract acoustic information from the sound files over the syntactic match.

=== Buckeye Corpus ===

=== Czech Broadcast Conversation ===
[[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02|Speech]] and [[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20|MDE Transcripts]]

=== CALLHOME American English ===
[[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97T14|Transcripts]]

== Syntactically annotated corpora ==

=== Treebanks ===
-Line 28:
+Line 67:
-||Arabic Treebank Part 1 V3||`ATB1_V3/`||`LDC2005T02`||Arabic||145386|| ||734|| ||
||Arabic Treebank Part 2 V2||`ATB2_V2/`||`LDC2004T02`||Arabic||144199|| ||501|| ||
||Arabic Treebank Part 3 V1||`ATB3_V1/`||`LDC2004T11`||Arabic||340281|| ||600|| ||
||Chinese Treebank V5.1||`Chinese``Treebank5.1/`||`LDC2005T01U01`||Chinese||507222|| ||18782|| ||
||Prague Dependency Treebank 2.0||`pdt_2/`||`LDC2006T01`||Czech||2000000|| || || ||
+||Arabic Treebank Part 1 V3||`ATB1_V3/`||`LDC2005T02`||Arabic||145,386|| ||734|| ||
||Arabic Treebank Part 2 V2||`ATB2_V2/`||`LDC2004T02`||Arabic||144,199|| ||501|| ||
||Arabic Treebank Part 3 V1||`ATB3_V1/`||`LDC2004T11`||Arabic||340,281|| ||600|| ||
||Chinese Treebank V5.1    ||`Chinese``Treebank5.1/`||`LDC2005T01U01`||Chinese||507,222|| ||18,782|| ||
||Chinese Treebank V6.0    ||`Chinese``Treebank6.0/`||`LDC2007T36`||Chinese||781,351||28,295|| || ||
||Chinese Treebank v7.0    ||`Chinese``Treebank7.0/`||`LDC2010T07`||Chinese||840,000||      || || ||
||Penn Discourse Treebank Version 2.0||`pdtb_v2/`||`LDC2008T05`||English|| || || || ||
||Prague Dependency Treebank 2.0||`pdt_2/`||`LDC2006T01`||Czech||2,000,000|| || || ||
-Line 34:
+Line 76:
-||NEGRA corpus V2.0||`Negra2.0/`||`negra-corpus.tar.gz`||German|| || ||20602||export/Penn Treebank||
+||NEGRA corpus V2.0||`Negra2.0/`||`negra-corpus.tar.gz`||German|| || ||20,602||export/Penn Treebank||
-Line 36:
+Line 78:
+||Tübingen Treebank of Spoken English||`TuebaES/` || ||English ||310,000 ||30,000 || ||export/Penn Treebank/XML ||
||Tübingen Treebank of Spoken German||`TuebaDS/` || ||German ||360,000 ||38,000 || ||export/Penn Treebank/XML ||
||Tübingen Treebank of Spoken Japanese||`TuebaJS/` || ||Japanese ||160,000 ||18,000 || ||export/XML/CoNLL-X Shared Task dependency ||

==== Penn Chinese Treebank ====
[[http://www.cis.upenn.edu/~chinese/segguide.3rd.ch.pdf|Segmentation Guide]] <<BR>>
[[http://www.cis.upenn.edu/~chinese/posguide.3rd.ch.pdf|POS Tagging Guide]] <<BR>>
[[http://www.cis.upenn.edu/~chinese/parseguide.3rd.ch.pdf|Bracketing Guide]] <<BR>>

=== TGrep2able ===
All of our syntactically annotated corpora have been processed to make them usable with the TGrep2 tool. See [wiki:Self/HlpLab/CorpusTools Corpus Tools] for more info on TGrep2.

|| '''Filename''' || '''Description''' ||
|| BNC.parsed.t2c.gz || The full British National Corpus ||
|| BNC_spoken.parsed.t2c.gz || Just the spoken part of the BNC ||
|| BNC_written.parsed.t2c.gz || Just the written part of the BNC ||
|| arabic-collapsed.t2c.gz || ||
|| arabic-treebank-with-vowels.t2c.gz || ||
|| arabic-treebank-without-vowels.t2c.gz || ||
|| brown.t2c.gz || ||
|| chtb2.t2c.gz || ||
|| chtb4.t2c.gz || ||
|| chtb5.1.t2c || ||
|| commented-brown.t2c.gz || ||
|| icegb.t2c.gz || ||
|| negra.t2c.gz || ||
|| sw.backtrans.convid_030507.t2c.gz || ||
|| sw.backtrans.t2c.gz || ||
|| swbd.t2c.gz || ||
|| tiger.t2c.gz || ||
|| tuebaes.t2c.gz || Tübingen Treebank of Spoken English ||
|| tuebads.t2c.gz || Tübingen Treebank of Spoken German ||
|| wsj-commented.t2c.gz || ||
|| wsj_mrg.t2c.gz || ||
|| ycoe.t2c.gz || ||

=== TIGER Corpora ===

==== Tiger2 Corpus ====

<<Anchor(pswbd)>>
== Edinburgh-Stanford Paraphrase Switchboard ==
This corpus combines numerous annotations for the Penn Treebank (release 3, Marcus et al. 1999) portion of the Switchboard corpus (Godfrey et al. 1992). In addition to the part-of-speech, grammatical function, and syntactic annotation of the Treebank, the corpus includes annotation for turn-taking, disfluencies (Taylor et al. 1995), dialogue acts (Shriberg et al. 1998), animacy (Zaenen et al. 2003), coreference and information status (Nissim et al. 2001), kontrast and kontrast-triggers (Calhoun 2006), as well as time-aligned orthographic transcripts (automatically aligned down to the syllable and segment level), some prosodic phrasing and accent annotation, and some more. For a detailed summary and manuals, see 
[[http://groups.inf.ed.ac.uk/switchboard/index.html|the Edinburgh webpages on Switchboard in NXT]]. 
We have both the XML version and a version that has been back-translated to Penn Treebank format, so that it is TGrep2able (thanks to Jean Carletta, Jason Brenier, Sasha Calhoun, and Neal Snider).

The time-alignment of the orthographic words that are the terminals of the Treebank Switchboard corpus makes it possible to go from the syntactic searches to the [[#swbdsound|corresponding sound files]] of the Switchboard conversations to extract acoustic information from them (or for listening pleasure ;-)).

'''NB:''' You must be a member of the `pswbd` group to have access to this corpus.