Corpora

1. Gigaword

2. Parsed Switchboard

You must be a member of the pswbd Unix group to access this corpus.
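To confirm access before trying to read the corpus, you can check your group membership. The following is a minimal sketch, assuming a POSIX system on which the group is named exactly `pswbd`; the helper names `current_group_names` and `can_access` are hypothetical and not part of any lab tooling.

```python
import grp
import os


def current_group_names():
    """Return the names of all Unix groups the current process belongs to."""
    names = set()
    # Supplementary groups plus the effective group of this process.
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # The gid has no entry in the group database; skip it.
            continue
    return names


def can_access(group="pswbd", groups=None):
    """Return True if `group` is among `groups` (default: the current user's groups)."""
    if groups is None:
        groups = current_group_names()
    return group in groups


if __name__ == "__main__":
    if can_access():
        print("You are in the pswbd group.")
    else:
        print("You are not in the pswbd group; ask an administrator to add you.")
```

Group changes usually take effect only in a new login session, so a freshly added member may need to log out and back in before the check succeeds.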

3. TGrep2able

These corpora have been processed for use with the TGrep2 tool. See [wiki:/HlpLab/CorpusTools/ Corpus Tools] for more information on TGrep2.

4. TIGER Corpora

5. Tiger2 Corpus

6. Treebanks

||Title||File||LDC Catalog number/Original name||Language||# words||# sentences||# stories||Original format||
||Arabic Treebank Part 1 V3||ATB1_V3/||LDC2005T02||Arabic||145386|| ||734|| ||
||Arabic Treebank Part 2 V2||ATB2_V2/||LDC2004T02||Arabic||144199|| ||501|| ||
||Arabic Treebank Part 3 V1||ATB3_V1/||LDC2004T11||Arabic||340281|| ||600|| ||
||Chinese Treebank V5.1||ChineseTreebank5.1/||LDC2005T01U01||Chinese||507222||18782|| || ||
||Prague Dependency Treebank 2.0||pdt_2/||LDC2006T01||Czech||2000000|| || || ||
||Danish Dependency Treebank V1.0||ddt1.0/||ddt-1.0.tar||Danish|| ||5540|| || ||
||NEGRA corpus V2.0||Negra2.0/||negra-corpus.tar.gz||German|| ||20602|| ||export/Penn Treebank||
||Merge files||mrg/|| || || || || || ||
