Differences between revisions 8 and 9

Stay up-to-date with the corpora@cs email list

You may want to subscribe to our corpora mailing list (just ask Benjamin Van Durme to add you to the list). This list distributes information about our corpus environment. You can also use it to ask and answer questions about the local corpus environment. Several groups use corpora. At the very least you will find useful tools and corpora in /p/hlp/ and /p/nl/ (the natural language processing group in CS). There is also a [http://www.cs.rochester.edu/research/speech/ldc.html corpus inventory] and an [http://www.cs.rochester.edu/research/cisd/resources/acquisitions.rss RSS feed]. To read the feed, you can use a web browser or email client with built-in RSS capabilities (e.g. Mozilla Firefox, Mozilla Thunderbird), a web-based aggregator (e.g. on My Yahoo!, Google), or a standalone aggregator. We have added further corpora under /p/hlp/corpora/ and more are to come. An updated page reflecting our corpus and software inventory is in the works.
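If you would rather check the feed from a script than from an aggregator, the following is a minimal sketch. It assumes the feed is plain RSS 2.0 with un-namespaced `item`, `title`, and `link` elements, which has not been verified against the actual feed. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Feed URL as given above; the feed layout (RSS 2.0) is an assumption.
FEED_URL = "http://www.cs.rochester.edu/research/cisd/resources/acquisitions.rss"


def parse_rss_items(xml_text):
    """Return a list of (title, link) pairs, one per <item> in the document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext returns None when the child element is missing
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        items.append((title, link))
    return items


def fetch_feed_items(url=FEED_URL, timeout=10):
    """Download the feed and parse it (hypothetical helper, needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_rss_items(response.read())
```

Calling `fetch_feed_items()` and printing each pair would list whatever entries the feed currently contains.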

Corpora (What is installed in `/p/hlp/corpora/`?)

1. Plain text corpora

1.1. Gigaword corpora

Chinese

1.2. Reuters - NIST Corpus

Reuters newswire from 1996-08-20 to 1997-08-19.

RCV1 - 810,000 English-language Reuters news stories.
RCV2 - 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)

To use this corpus you as an individual must sign a license agreement, which is then kept on file in the lab. After signing you will be added to the reuters Unix group on the HLP system. This corpus is not available on the CS system.
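To find out whether your account has already been added to the group, you can run `groups` in a shell, or use a small script such as the sketch below. It only relies on the Python standard library (`grp`, `os`); the function name is made up for illustration, and the group name `reuters` is the one mentioned above.

```python
import grp
import os


def current_user_in_group(group_name):
    """Return True if the running process belongs to the named Unix group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group is not defined on this machine (e.g. on the CS system).
        return False
    # Check both the effective group and the supplementary groups.
    return gid == os.getegid() or gid in os.getgroups()


if __name__ == "__main__":
    if current_user_in_group("reuters"):
        print("You are in the reuters group.")
    else:
        print("Not in the reuters group (yet).")
```

Group changes normally apply only to new login sessions, so log out and back in before concluding that you have not been added.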

See [http://trec.nist.gov/data/reuters/reuters.html] for more detail.

2. Audio and video corpora

Audio corpora (the sound files of corpora) are stored under /p/hlp/corpora/Audio/.

2.1. Switchboard sound files

We've installed the sound files of those Switchboard dialogues (XXXX andrew version? XXXX) that are part of the Penn Treebank (release 3, Marcus et al. 1999) and the [Edinburgh-Stanford Paraphrase Switchboard]. Jason Brenier has developed Python scripts that map the Switchboard annotation layers to the sound files. This makes it possible (via intermediate steps) to, for example, run a syntactic search and then extract acoustic information from the stretch of each sound file that corresponds to the syntactic match.
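To illustrate the last step of that workflow, here is a minimal sketch of cutting a time span out of a sound file. This is not one of Jason Brenier's scripts. It assumes that you already have start and end times in seconds (for instance from a time-aligned transcript) and that the sound file has been converted to RIFF WAV, since Python's `wave` module does not read other formats. The function name and file names are hypothetical.

```python
import wave


def extract_segment(src_path, dst_path, start_sec, end_sec):
    """Copy the audio between start_sec and end_sec into a new WAV file.

    Returns the number of frames written.
    """
    if start_sec < 0 or end_sec <= start_sec:
        raise ValueError("need 0 <= start_sec < end_sec")
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert times to frame indices, clipped to the file length.
        first = min(int(round(start_sec * rate)), total)
        last = min(int(round(end_sec * rate)), total)
        src.setpos(first)
        frames = src.readframes(last - first)
        with wave.open(dst_path, "wb") as dst:
            # Keep the source's channel count, sample width, and rate.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return last - first
```

A call such as `extract_segment("dialogue.wav", "match.wav", 12.40, 13.05)` would write the 0.65 seconds covered by one match to `match.wav`, which can then be passed to whatever acoustic analysis tool you use.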

3. Syntactically annotated corpora

3.1. Treebanks

Title	File	LDC Catalog number/Original name	Language	#word	#sentence	#story	Original format
Arabic Treebank Part 1 V3	`ATB1_V3/`	`LDC2005T02`	Arabic	145386		734
Arabic Treebank Part 2 V2	`ATB2_V2/`	`LDC2004T02`	Arabic	144199		501
Arabic Treebank Part 3 V1	`ATB3_V1/`	`LDC2004T11`	Arabic	340281		600
Chinese Treebank V5.1	`ChineseTreebank5.1/`	`LDC2005T01U01`	Chinese	507222		18782
Prague Dependency Treebank 2.0	`pdt_2/`	`LDC2006T01`	Czech	2000000
Danish Dependency Treebank V1.0	`ddt1.0/`	`ddt-1.0.tar`	Danish			5540
NEGRA corpus V2.0	`Negra2.0/`	`negra-corpus.tar.gz`	German			20602	export/Penn Treebank
Merge files	`mrg/`

3.2. TGrep2able

All of our syntactically annotated corpora have been processed to make them usable with the TGrep2 tool. See [wiki:Self/HlpLab/CorpusTools Corpus Tools] for more info on TGrep2.
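As a rough idea of what a search looks like when driven from a script, the sketch below builds and runs a TGrep2 command line. It assumes that `tgrep2` is on your `PATH`, and it uses the `-c` (corpus file) and `-t` (print terminals only) options as recalled from the TGrep2 manual, so check them against your installed version. The function names and the corpus file name in the usage note are hypothetical.

```python
import subprocess


def tgrep2_command(corpus_file, pattern, terminals_only=False):
    """Build the argument list for one TGrep2 search."""
    argv = ["tgrep2", "-c", corpus_file]
    if terminals_only:
        argv.append("-t")  # print only the words of each match
    argv.append(pattern)
    return argv


def run_tgrep2(corpus_file, pattern, terminals_only=False):
    """Run the search and return one output line per list element."""
    result = subprocess.run(
        tgrep2_command(corpus_file, pattern, terminals_only),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tgrep2 exits with an error
    )
    return result.stdout.splitlines()
```

For example, `run_tgrep2("some_corpus.t2c", "NP < PP", terminals_only=True)` would search a prepared corpus file for noun phrases that immediately dominate a prepositional phrase.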

3.3. TIGER Corpora

3.3.1. Tiger2 Corpus

4. Edinburgh-Stanford Paraphrase Switchboard

This corpus combines numerous annotations for the Penn Treebank (release 3, Marcus et al. 1999) portion of the Switchboard corpus (Godfrey et al. 1992). In addition to the part-of-speech, grammatical function, and syntactic annotation of the Treebank, the corpus includes annotation for turn-taking, disfluencies (Taylor et al. 1995), dialogue acts (Shriberg et al. 1998), animacy (Zaenen et al. 2003), coreference and information status (Nissim et al. 2001), and kontrast and kontrast-triggers (Calhoun 2006). It also includes time-aligned orthographic transcripts (automatically aligned down to the syllable and segment level), some prosodic phrasing and accent annotation, and a few further layers. For a detailed summary and manuals, see [http://groups.inf.ed.ac.uk/switchboard/index.html the Edinburgh webpages on Switchboard in NXT]. We have both the XML version and a version that has been back-translated to Penn Treebank format, so that it is TGrep2able (thanks to Jean Carletta, Jason Brenier, Sasha Calhoun, and Neal Snider).

NB: You must be a member of the pswbd Unix group to access this corpus.

- Revision 8 as of 2008-02-04 00:34:54 (Size: 4173, Editor: cpe-74-67-187-252)
+ Revision 9 as of 2008-02-04 00:41:18 (Size: 4823, Editor: cpe-74-67-187-252)
-Deletions are marked like this.
+Additions are marked like this.
 Line 17:
-== Reuters - NIST Corpus ==
+=== Reuters - NIST Corpus ===
 Line 21:
- * RCV2 - 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)
+ * RCV2 - 487,000 Reuters News stories in '''thirteen languages''' (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)
 Line 26:
+== Audio and video corpora ==
+Audio corpora (the sound files of corpora) are stored under `/p/hlp/corpora/Audio/`.
+
+=== Switchboard sound files ===
+We've installed the sound files of those Switchboard dialogues (XXXX andrew version? XXXX) that are part of the Penn Treebank (release 3, Marcus et al. 1999) and the [''Edinburgh-Stanford Paraphrase Switchboard'']. Jason Brenier has developed python scripts that map the Switchboard annotation layers to the sound files, making it possible (via intermediate steps) to e.g. conduct syntactic searches and then extract acoustic information from the sound files over the syntactic match.