Differences between revisions 5 and 6
Revision 5 as of 2007-09-11 20:11:26
Size: 2298
Editor: 63
Comment:
Revision 6 as of 2007-09-27 19:58:19
Size: 2301
Editor: lab1
Comment:

Stay up-to-date with the corpora@cs email list

You may want to subscribe to our corpora mailing list (just ask Benjamin Van Durme to add you to the list). This list distributes information about our corpus environment. You can also use it to ask and answer questions about the local corpus environment.

Several groups use corpora. At the very least, you will find useful tools and corpora in /p/hlp/ and /p/nl/ (the natural language processing group in CS). There is also a [http://www.cs.rochester.edu/research/speech/ldc.html corpus inventory] and an [http://www.cs.rochester.edu/research/cisd/resources/acquisitions.rss RSS feed]. To read the feed, you can use a web browser or email client with built-in RSS capabilities (e.g. Mozilla Firefox, Mozilla Thunderbird), a web-based aggregator (e.g. on My Yahoo!, Google), or a standalone aggregator.

We have added further corpora in /p/hlp/corpora/, and more are to come. An updated page reflecting our corpus and software inventory is in the works.
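If you would rather check the acquisitions feed from a script than from an aggregator, something along the following lines may work. This is only a sketch: it assumes the feed is plain RSS 2.0 (un-namespaced `item` elements with `title` and `link` children) and that the URL above is still being served. The function names `parse_items` and `fetch_items` are made up for this illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Feed URL as given on this page; it may have moved since the page was written.
FEED_URL = "http://www.cs.rochester.edu/research/cisd/resources/acquisitions.rss"


def parse_items(rss_text):
    """Return a list of {'title', 'link'} dicts, one per <item> in the feed.

    Assumes RSS 2.0 without XML namespaces on the item elements.
    """
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


def fetch_items(url=FEED_URL, timeout=10):
    """Download the feed and parse it (hypothetical helper, needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_items(response.read())
```

Calling `fetch_items()` and printing each entry's `title` should list the most recent acquisitions, provided the assumptions above hold.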

Corpora

1. Gigaword

  • Chinese

2. Parsed Switchboard

You must be a member of the pswbd Unix group to access this corpus.
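To see whether your account already has the required group, you can run `id` in a shell, or use a small script like the sketch below. It relies only on the Python standard library on a Unix system; the helper names `current_user_groups` and `can_access` are invented for this example, and only the group name `pswbd` comes from this page.

```python
import grp
import os


def current_user_groups():
    """Return the names of the Unix groups the current process belongs to."""
    names = set()
    for gid in os.getgroups() + [os.getegid()]:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # GID without an entry in the group database; skip it.
            pass
    return names


def can_access(required_group, user_groups):
    """True if required_group is among the given group names."""
    return required_group in user_groups


if __name__ == "__main__":
    if can_access("pswbd", current_user_groups()):
        print("You are in the pswbd group.")
    else:
        print("You are not in the pswbd group; ask to be added.")
```

Group changes usually take effect only after you log in again, so a freshly added membership may not show up in a session that was already open.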

3. TGrep2able

Corpora that have been processed to make them usable with the TGrep2 tool. See [wiki:Self/HlpLab/CorpusTools Corpus Tools] for more info on TGrep2.

4. TIGER Corpora

5. Tiger2 Corpus

6. Treebanks

||Title||File||LDC Catalog number/Original name||Language||#word||#sentence||#story||Original format||
||Arabic Treebank Part 1 V3||ATB1_V3/||LDC2005T02||Arabic||145386|| ||734|| ||
||Arabic Treebank Part 2 V2||ATB2_V2/||LDC2004T02||Arabic||144199|| ||501|| ||
||Arabic Treebank Part 3 V1||ATB3_V1/||LDC2004T11||Arabic||340281|| ||600|| ||
||Chinese Treebank V5.1||ChineseTreebank5.1/||LDC2005T01U01||Chinese||507222||18782|| || ||
||Prague Dependency Treebank 2.0||pdt_2/||LDC2006T01||Czech||2000000|| || || ||
||Danish Dependency Treebank V1.0||ddt1.0/||ddt-1.0.tar||Danish|| ||5540|| || ||
||NEGRA corpus V2.0||Negra2.0/||negra-corpus.tar.gz||German|| ||20602|| ||export/Penn Treebank||
||Merge files||mrg/|| || || || || || ||

Corpora (last edited 2018-06-07 17:57:11 by dhcp-10-5-21-163)
