
Corpus Tools

1. TGrep2

[http://tedlab.mit.edu/~dr/Tgrep2/tgrep2.pdf TGrep2 Manual (PDF)]

1.1. Tutorial

We will upload a TGrep2 tutorial here soon. For now, see [http://www.bcs.rochester.edu/people/fjaeger/teaching/tutorials/TGrep2/LabSyntax-Tutorial.html] and [http://www.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html].

Here are also some examples to get you started. Work your way through them (I recommend starting by having the Penn Treebank manuals open, because they contain a summary and explanation of all the part-of-speech, syntactic, and functional annotation available in the Treebank corpora). Look through the manuals and look at one or two sentences in each corpus (tgrep2 -l "* !> *" | more) to see how they are annotated. For now, just focus on the high-level structure: how are sentences annotated? Are subjects sisters of verb phrases? What are verb phrases governed by?
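
For example, assuming you have already prepared TGrep2 corpus files for the two corpora (the file names swbd.t2c and brown.t2c below are just placeholders), the following commands let you page through trees one at a time:

    # "* !> *" matches the node that is not dominated by anything, i.e. the
    # root of each tree; -l prints each match in long (indented) form, and
    # since the match here is the root node, that is the whole tree.
    # -c names the corpus file to search.
    tgrep2 -c swbd.t2c -l "* !> *" | more
    tgrep2 -c brown.t2c -l "* !> *" | more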

  1. Let's start simple. The first problem is on questions. The task is to compare question strategies across spoken and written texts. Use the Switchboard portion of the Penn Treebank (release 3, Marcus et al. 1999) as your corpus of spoken American English, and the Brown portion as the written sample of American English.

    1.a. We start with the frequency of wh-questions. Extract all wh-questions (such as "Who did this?", "Why didn't he come to the party?", etc.) from SWBD and BROWN. Make sure you get all and only wh-questions (i.e. no relative clauses like "The first person who solves this gets free ice cream."; no free relatives like "He likes what she does."; no yes/no-questions like "Does he ever sleep?"; etc.). Is there a tag that marks all wh-questions?

    1.b. Given the different sizes of the two corpora, is the relative frequency of wh-questions higher in speech or in writing? You can normalize your count from (1.a) either by the number of sentences or by the number of words in the corpus. But be careful: make sure you are only counting real words and real sentences. The corpora may contain empty sentences (as a side effect of annotation conventions or bugs). Use a pattern that matches only lexical words (TIP: most non-word terminal nodes don't start with a letter; to count only the words of a corpus, use regular expressions to match terminal nodes that start with a letter or an apostrophe [if you want to count contractions as words]). A command-line sketch of the counts for (1.a) and (1.b) is given after this list.

    1.c.
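
Below is a minimal command-line sketch for the count in (1.a). The corpus file name swbd.t2c and the label WHQ-TAG are placeholders (finding the actual tag is the point of the exercise), and the option names should be double-checked against the TGrep2 manual linked above.

    # Once you have identified the clause tag that marks wh-questions,
    # counting them is a one-liner (WHQ-TAG is a placeholder, not a real label).
    tgrep2 -a -c swbd.t2c "WHQ-TAG" | wc -l

    # To read the extracted questions rather than count them, print only the
    # terminal symbols (the words) of each match:
    tgrep2 -a -t -c swbd.t2c "WHQ-TAG" | more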
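
And a sketch for the normalization counts in (1.b), following the TIP above; this assumes tgrep2's default behavior of printing one line per match (again, see the manual for details).

    # Number of non-empty sentences: match the root of each tree (the node
    # not dominated by anything) and require that it dominate at least one
    # terminal starting with a letter or an apostrophe.
    tgrep2 -c swbd.t2c "* !> * << /^[a-zA-Z']/" | wc -l

    # Number of lexical words: terminals are nodes that dominate nothing;
    # restrict them to labels starting with a letter or an apostrophe.
    # The -a option reports every match in a tree, not just the first one.
    tgrep2 -a -c swbd.t2c "/^[a-zA-Z']/ !< *" | wc -l

Repeat both commands with the Brown corpus file and divide the wh-question count by the sentence or word count to compare the relative frequencies.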

2. TDT Tools

A collection of Perl scripts written by Florian Jaeger. See [wiki:/TDT TDT] for more information.
