Corpora
1. Stay up-to-date with the corpora@cs email list
You may want to subscribe to our corpora mailing list (just ask Benjamin Van Durme to add you to the list). This list distributes information about our corpus environment. You can also use it to ask and answer questions about the local corpus environment.

Several groups use corpora. At the very least you will find useful tools and corpora in /p/hlp/ and /p/nl/ (the natural language processing group in CS). There is also a [http://www.cs.rochester.edu/research/speech/ldc.html corpus inventory] and an [http://www.cs.rochester.edu/research/cisd/resources/acquisitions.rss RSS feed]. To read the feed, you can use a web browser or email client with built-in RSS capabilities (e.g. Mozilla Firefox, Mozilla Thunderbird), a web-based aggregator (e.g. on My Yahoo!, Google), or a standalone aggregator.

We have added further corpora in /p/hlp/corpora/, and more are to come. An updated page reflecting our corpus and software inventory is in the works.
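If you would rather check the acquisitions feed from a script than from an aggregator, something along these lines may work. It is only a sketch: it assumes the feed is plain RSS 2.0 (with `item` elements that carry a `title`), which this page does not confirm. The function name `item_titles` and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def item_titles(rss_xml):
    """Return the title of every <item> in an RSS document given as a string."""
    root = ET.fromstring(rss_xml)
    # Assumption: RSS 2.0 layout, i.e. <rss><channel><item><title>...</title></item>...
    return [item.findtext("title", default="") for item in root.iter("item")]


# Made-up sample feed, NOT real entries from the acquisitions feed.
SAMPLE_FEED = (
    '<rss version="2.0"><channel><title>Example feed</title>'
    "<item><title>Hypothetical corpus A</title></item>"
    "<item><title>Hypothetical corpus B</title></item>"
    "</channel></rss>"
)

if __name__ == "__main__":
    for title in item_titles(SAMPLE_FEED):
        print(title)
```

To use it on the real feed, you would download the XML yourself (for example with `urllib.request.urlopen`) and pass the decoded text to `item_titles`.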
2. Gigaword
- Chinese
3. Parsed Switchboard
You must be a member of the pswbd Unix group to access this corpus.
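A quick way to see whether your current login session already has the pswbd group is sketched below. It uses only Python's standard `os` and `grp` modules. The helper names `current_group_names` and `in_group` are made up for this example. Note that a group added to your account after you logged in usually shows up only in a new login session.

```python
import grp
import os


def current_group_names():
    """Return the names of all Unix groups the current process belongs to."""
    gids = set(os.getgroups()) | {os.getgid()}
    names = set()
    for gid in gids:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # This group id has no entry in the group database; skip it.
            pass
    return names


def in_group(group_name):
    """Return True if the current process is a member of group_name."""
    return group_name in current_group_names()


if __name__ == "__main__":
    print("member of pswbd:", in_group("pswbd"))
```

If this prints `False`, you are probably not in the group yet, and you would need to ask to be added before you can read the corpus.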
4. TGrep2able
Corpora that have been processed to make them usable with the TGrep2 tool. See [wiki:/HlpLab/CorpusTools/ Corpus Tools] for more info on TGrep2.
5. TIGER Corpora
6. Tiger2 Corpus
7. Treebanks
Title | File | LDC Catalog number/Original name | Language | # words | # sentences | # stories | Original format |
Arabic Treebank Part 1 V3 | ATB1_V3/ | LDC2005T02 | Arabic | 145386 | | 734 | |
Arabic Treebank Part 2 V2 | ATB2_V2/ | LDC2004T02 | Arabic | 144199 | | 501 | |
Arabic Treebank Part 3 V1 | ATB3_V1/ | LDC2004T11 | Arabic | 340281 | | 600 | |
Chinese Treebank V5.1 | ChineseTreebank5.1/ | LDC2005T01U01 | Chinese | 507222 | | 18782 | |
Prague Dependency Treebank 2.0 | pdt_2/ | LDC2006T01 | Czech | 2000000 | | | |
Danish Dependency Treebank V1.0 | ddt1.0/ | ddt-1.0.tar | Danish | | | 5540 | |
NEGRA corpus V2.0 | Negra2.0/ | negra-corpus.tar.gz | German | | | 20602 | export/Penn Treebank |
Merge files | mrg/ | | | | | | |