Corpus Tools


TGrep2 Manual (PDF)

1. Preparing a Penn Treebank-like corpus for TGrep2

  1. Concatenate the files and filter out comments and XML markup, e.g.:
    cat /p/hlp/corpora/Treebanks/ChineseTreebank6.0/data/gbk/bracketed/*.fid | grep -v '^\*' | grep -v '^[^a-zA-Z(]*<' | sed 's/^(/(TOP /' > chtb6.txt

  2. Run the conversion:
    tgrep2 -p chtb6.txt chtb6.t2c.gz

  3. If there are errors (e.g. a missing "("), go into the txt file and correct them, then repeat the conversion.

  4. To locate the n-th sentence in the txt file (here the 27836th), the following is useful:
    grep -n -m 27836 "^(" chtb6.txt |tail -n 1
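
Steps 3 and 4 above can also be sketched in Python. The following is a minimal sketch, not part of TGrep2: the function names are made up for illustration, and it assumes that every tree starts with "(" in column 0 (as the grep command above does) and that literal parentheses in the text are escaped (e.g. as -LRB-/-RRB-).

```python
def find_bracket_errors(lines):
    """Return (line_number, message) pairs for trees whose parentheses do not balance.

    Assumption: each tree starts with "(" in column 0 and continuation
    lines are indented.
    """
    errors = []
    depth = 0          # nesting depth inside the tree currently being read
    tree_start = None  # line number on which the current tree started
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("("):
            if depth > 0:
                errors.append((tree_start, "tree not closed before next tree starts"))
            depth = 0
            tree_start = lineno
        for ch in line:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    errors.append((lineno, "unmatched ')'"))
                    depth = 0
    if depth > 0:
        errors.append((tree_start, "tree not closed at end of file"))
    return errors


def nth_tree_line(lines, n):
    """Return the line number of the n-th line starting with "(", or None."""
    seen = 0
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("("):
            seen += 1
            if seen == n:
                return lineno
    return None


# Made-up sample: the second tree (starting on line 3) lacks two closing brackets.
sample = [
    "(TOP (S (NP (DT the) (NN dog))",
    "        (VP (VBD barked))))",
    "(TOP (S (NP (PRP it))",
    "        (VP (VBD ran))",
    "(TOP (S (NP (PRP we)) (VP (VBD left))))",
]
print(find_bracket_errors(sample))
print(nth_tree_line(sample, 2))
```

For a real corpus, the lines could be read with open("chtb6.txt", encoding=...) using whatever encoding the concatenated file has.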

2. Tutorial

We will upload a TGrep2 tutorial here soon. For now, see Setting up your local Unix environment, which includes TGrep2-specific information.

Here are also some examples to get started. Before working through them, I recommend opening the Penn Treebank manuals, which summarize and explain all part-of-speech, syntactic, and functional annotation available in the Treebank corpora. Look through the manuals and at one or two sentences in each corpus (tgrep2 -l "* !> *" | more) to see how they are annotated. For now, just focus on the high-level structure: How are sentences annotated? Are subjects sisters of verb phrases? What are verb phrases governed by?
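
When inspecting the high-level structure of a sentence, it can help to print a bracketed tree as an indented outline. The following is a minimal Python sketch, independent of TGrep2. The function names and the sample sentence are made up for illustration, and the parser assumes that every node, including the root, has a label (as after the sed step in section 1).

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Terminals (words, traces, punctuation) are kept as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return read_node()


def outline(node, indent=0):
    """Return the tree as a list of lines, one node per line, indented by depth."""
    label, children = node
    words = [c for c in children if isinstance(c, str)]
    lines = ["  " * indent + " ".join([label] + words)]
    for child in children:
        if not isinstance(child, str):
            lines.extend(outline(child, indent + 1))
    return lines


def child_labels(node):
    """Return the labels of the non-terminal daughters of a node, in order."""
    return [c[0] for c in node[1] if not isinstance(c, str)]


# Made-up example sentence in Penn Treebank style.
tree = parse_tree("(TOP (S (NP-SBJ (PRP He)) (VP (VBZ sleeps)) (. .)))")
print("\n".join(outline(tree)))
```

In this example, child_labels applied to the S node lists NP-SBJ and VP side by side, which is one way to check whether subjects are sisters of verb phrases.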

  1. Let's start simple. The first problem is on questions. The task is to compare question strategies across spoken and written texts. Use the Switchboard portion of the Penn Treebank (release 3, Marcus et al. 1999) as your corpus of spoken American English, and the Brown portion as the written sample of American English.
    1. We start with the frequency of wh-questions. Extract all wh-questions (such as "Who did this?", "Why didn't he come to the party?", etc.) from SWBD and BROWN. Make sure you get all and only wh-questions (i.e. no relative clauses like "The first person who solves this gets free ice cream."; no free relatives like "He likes what she does."; no yes/no-questions like "Does he ever sleep?"; etc.). Is there a tag that marks all wh-questions? Focus on non-embedded questions for now.

    2. Given the different sizes of the two corpora, is the relative frequency of wh-questions higher in speech or in writing? You can normalize your count from (1.a) either by the number of sentences or by the number of words in the corpus. But be careful: make sure you are only counting real words and real sentences. The corpora may contain empty sentences (as a side effect of annotation conventions or bugs). Use a pattern that matches only lexical words (TIP: most non-word terminal nodes don't start with a letter. To count only the words of a corpus, use regular expressions to match terminal nodes that start with a letter or, if you want to count contractions as words, an apostrophe).

    3. Are there more non-embedded or more embedded wh-questions (e.g. "Who did you see?" vs. "She was wondering what was going on."), and does the ratio differ between speech and writing? If you find a difference between speech and writing, what is it due to? Is embedding generally (even outside of questions) more frequent in writing or in speech?

    4. Now it gets trickier. The Penn Treebank actually marks the trace of the wh-phrase (i.e. the place it was extracted from). Compare the frequency (in speech) of subject-extracted vs. non-subject-extracted wh-questions. Note that the trace does not have to be in the matrix clause (the clause that the wh-phrase is in); it may be in an embedded clause, leading to a long-distance dependency. How frequent are such long-distance dependencies (i.e. how many wh-phrases were extracted across a clause boundary)? Does the distribution of the grammatical function or the location (embedding) of the extracted wh-phrase differ between speech and writing?
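
The normalization described in step 2 above can be sketched in Python as follows. This is only an illustration of the TIP (count terminal nodes that start with a letter or an apostrophe), not a replacement for a TGrep2 pattern. The function names, sample trees, and counts are made up. Note that the heuristic is ASCII-only and also excludes numerals.

```python
import re

# A terminal node looks like "(TAG token)" with no brackets inside.
TERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")
# Lexical words: tokens starting with a letter or an apostrophe.
WORD = re.compile(r"^[A-Za-z']")


def count_words(tree_text):
    """Count terminals whose token starts with a letter or an apostrophe."""
    return sum(1 for _tag, token in TERMINAL.findall(tree_text) if WORD.match(token))


def count_nonempty_sentences(trees):
    """Count trees that contain at least one lexical word."""
    return sum(1 for tree in trees if count_words(tree) > 0)


def per_thousand(hits, total):
    """Normalize a raw count to a rate per 1000 units (words or sentences)."""
    return 1000.0 * hits / total


# Made-up trees: one sentence with a trace and punctuation, one empty "sentence".
trees = [
    "(TOP (S (NP-SBJ-1 (PRP It)) (VP (VBD was) (VP (VBN taken) (NP (-NONE- *-1)))) (. .)))",
    "(TOP (NP (-NONE- *)))",
]
print(count_words(trees[0]))            # the trace "*-1" and "." are not counted
print(count_nonempty_sentences(trees))
print(per_thousand(50, 20000))          # hypothetical counts: 50 hits in 20000 sentences
```

Normalizing both corpora to the same unit (e.g. hits per 1000 non-empty sentences) makes the counts from SWBD and BROWN directly comparable despite their different sizes.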

TDT Tools

A collection of Perl scripts written by Florian Jaeger. See TDT for more information.

HlpLab: CorpusTools (last edited 2011-10-07 20:00:20 by echidna)
