[wiki:HlpLab/LSA09/Syllabus Syllabus] | [wiki:HlpLab/LSA09/Assignments Assignments] | [wiki:HlpLab/LSA09/People People] | [wiki:HlpLab/LSA09/CorporaTutorials Corpora & Tutorials] | [wiki:HlpLab/LSA09/References Readings] | [http://lsa2009.berkeley.edu/courses/lsa125.html Official LSA course page]

Reading and References

We've put together a couple of general reading suggestions for corpus-based research on psycholinguistics, in addition to the specific readings mentioned on the syllabus. They are listed below the references.

1. References

AttachList

2. Reading Themes

Each section below summarizes a couple of papers on a particular issue that will be covered in class. We don't expect you to read all of these papers; they are meant as pointers for further reading. At the end of each section you will find what we consider a good entry reading on that topic.

2.1. Accessibility: Availability and Alignment in Sentence Production

Syntactic variation has been attributed to accessibility. For the purpose of this class, accessibility refers to ease of retrieval. Accessibility-based accounts of, for example, word order alternations hold that the relative accessibility of the referents described by the different constituents affects speakers' word order preferences.

Two specific proposals have been discussed and tested in detail in the literature. Psycholinguistic alignment accounts (e.g. Bock and Warren, 1985) state that speakers prefer to align conceptually accessible referents with higher grammatical functions (these accounts resemble linguistic accounts of alignment, as e.g. in Aissen, 2003; Bresnan et al., 2001). Availability accounts, on the other hand, state that speakers prefer to mention accessible referents early in the sentence (Levelt & Maassen, 1981; Ferreira, 1996; Ferreira and Dell, 2000). For English these two accounts make very similar predictions, but for other languages they do not necessarily do so. We recommend Branigan et al. (2007) for a direct comparison and summary of previous work. See also Jaeger and Norcliffe (in press) for a summary of the relevant cross-linguistic work.

2.2. Length and Word Order in Sentence Production

A well-documented phenomenon in sentence production is domain minimization (this is Hawkins' (2004) terminology, but the basic observation goes at least as far back as Behaghel (1909))--speakers, given a choice between multiple word orders, will tend to choose the order that minimizes the distance between dependent elements in the sentence. Take for example sentences (1) and (2).

(1) John walked [with his older, popular sister] [to school.]
(2) John walked [to school] [with his older, popular sister.]

While (1) and (2) encode (let's assume) the same meaning, (2) is predicted to be more likely since this order minimizes the distance between the verb and the heads of its dependent prepositional phrases. In addition to influencing production choices, domain minimization has also been argued to play some role in constraining the space of possible grammars, and is therefore also of interest to typologists. For example, languages with nearly-consistent headedness tend to more easily allow domain minimization, which might explain why the vast majority of the world's languages are consistently head-final or head-initial.

Perhaps the most influential theory along these lines is that of John Hawkins, whose 1994 and 2004 books are both classics. We may scan some portions of these, but the basic claims are spelled out in Hawkins (2007), which will be required reading for this section. For experimental support of Hawkins' proposals in Japanese, see Yamashita and Chang (2001).

Length, like everything else, interacts with a number of other factors in word order. For example, there is a well-known influence of discourse status--recently mentioned referents tend to be produced earlier than novel referents (this is also referred to as the "given-before-new" preference). Length and discourse status can either simultaneously support the same word order or give rise to competing pressures. The question then becomes, how do speakers weight each of these considerations in arriving at a particular word order? Two papers that deal with the relationship between length- and discourse-driven factors in production are Arnold et al. (2000) and Choi (1997). We recommend reading one of these.

Finally, Gildea and Temperley (2008) take the typological end of Hawkins-style accounts a step further and draw on methods from computational linguistics to try to determine whether the grammars of natural languages are optimal systems for simultaneously minimizing numerous dependency lengths. The paper is very technical (from a computational linguistics journal), but you can get the gist from the introduction if you don't want to wade through all the details.
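
To make the comparison between (1) and (2) concrete, here is a minimal sketch that counts the distance, in words, between the verb and the heads of its two prepositional phrases. This is a deliberately simplified stand-in for the metrics used by Hawkins or by Gildea and Temperley, not a reimplementation of them; the function name and the word-count distance measure are our own assumptions for illustration.

```python
def total_dependency_length(tokens, head, dependents):
    """Sum of linear distances (in words) between a head and its dependents.

    Simplifying assumption: each word occurs only once in the sentence,
    so its position can be looked up with list.index().
    """
    head_pos = tokens.index(head)
    return sum(abs(tokens.index(dep) - head_pos) for dep in dependents)


# Sentences (1) and (2), with punctuation removed.
s1 = "John walked with his older popular sister to school".split()
s2 = "John walked to school with his older popular sister".split()

# The verb "walked" has two PP dependents, headed by "with" and "to".
len1 = total_dependency_length(s1, "walked", ["with", "to"])
len2 = total_dependency_length(s2, "walked", ["with", "to"])

print(len1, len2)  # 7 4
```

Under this toy measure, order (2) yields the smaller total (4 vs. 7), in line with the prediction that the short PP should precede the long one.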

2.3. Ambiguity Avoidance in Sentence Production

To what extent do people structure their utterances in a way that avoids ambiguities that may arise from the perspective of their interlocutors? Put another way, do speakers structure their utterances in a way that eases production only, or do speakers also attempt to ease comprehension? Evidence on this matter is equivocal. Arnold et al. (2004), for instance, find no evidence of ambiguity avoidance in a production experiment. Haywood et al. (2005), on the other hand, find evidence that supports strategic ambiguity avoidance, possibly due to a more ecologically valid experimental procedure. Please read both articles. Kraljic and Brennan (2005) provide evidence that speakers use prosodic cues to disambiguation, but that their use of these cues is insensitive to the needs of the comprehender (they use prosody to disambiguate even when the context is completely disambiguating). Please read this article if you have time.

2.4. Uniform Information Density

Uniform Information Density is a recently emerging account of language production (Jaeger, 2006; Levy & Jaeger, 2007; Jaeger, submitted, in prep), according to which speakers' choices in production are driven by a preference to distribute information uniformly across the linguistic signal. Information is defined information-theoretically (Shannon, 1948) with reference to probability distributions: the less probable an event is, the more information its occurrence carries.
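
As a rough illustration of the information-theoretic definition, the information (surprisal) of an event with probability p is -log2(p) bits, so less probable events carry more information. The sketch below computes this quantity; the function name and the probabilities are hypothetical values chosen for illustration, not estimates from any corpus.

```python
import math


def surprisal(p):
    """Shannon information, in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)


# Hypothetical probabilities of two words in their contexts.
print(surprisal(0.5))    # 1.0
print(surprisal(0.125))  # 3.0
```

The less probable word (p = 0.125) carries three times as many bits as the more probable one (p = 0.5).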

Uniform Information Density has been tested against corpus data from phonetic reduction (Jaeger & Kidd, 2008; building on Bell et al., 2003, 2009), morpho-syntactic reduction (Frank & Jaeger, 2008), syntactic reduction (Jaeger, 2006, submitted, in prep; Levy & Jaeger, 2007), and inter-clausal planning (Gomez Gallo et al., 2008). Data from the distribution of disfluencies and gestures have also been argued to support the principle of Uniform Information Density (Cook et al., 2009).

Short introductions can be found in Levy and Jaeger (2007, rather technical) and Frank and Jaeger (2008). A more in-depth discussion in journal format is found in Jaeger (submitted).

2.5. Psycholinguistic Corpus-based work on Syntactic Variation

For some examples of corpus-based psycholinguistic research on syntactic production, see:

Bresnan et al. (2007) and Arnold et al. (2000) on the ditransitive alternation;
Wasow (1997) and Arnold et al. (2000) on heavy NP shift;
Lohse et al. (2004) on particle shift;
Roland et al. (2005) and Jaeger (submitted) on complementizer-mentioning;
Wasow et al. (in press) on relativizer-mentioning;
Jaeger (in prep) on passive subject-extracted relative clause reduction;
Frank and Jaeger (2008) on auxiliary contraction.

Examples of a corpus-based approach using mixed logit models are given in Bresnan et al. (2007) and Jaeger (submitted).

3. Sociolinguistic Corpus-based work on Syntactic Variation

While the focus of this course is on the use of corpora in addressing psycholinguistic questions, syntactic corpora have also been used to pursue questions of sociolinguistic interest. Tagliamonte et al. (2005), for example, look at relativizer (e.g. that) mentioning cross-dialectally. This is instructive for psycholinguistic accounts of similar phenomena because it shows that some of the observed variation may have a sociolinguistic rather than psycholinguistic explanation or, perhaps more interestingly, that psycholinguistic principles may have a social dimension. Tagliamonte and Smith (2005) look at complementizer mentioning. Both are very nice papers that are very easy to understand. Please try to read at least one of them.

3.1. Grammaticization and Gradient Grammaticality in Syntactic Variation

Please read Bresnan and Hay (2007) and Torres Cacoullos and Walker (2009).

Optional: Bresnan et al. (2007)

3.2. Statistics for Corpus-based Research

Modern corpus-based research mostly employs multiple regression methods. Since corpus-based work usually involves clustered data (data from different speakers, different groups, etc.) the employed statistical methods need to somehow correct for the resulting violation of the assumption of independence. This can be done, for example, via bootstrapping or by means of multilevel (mixed) models.
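
One way to picture the bootstrapping option is a cluster bootstrap, in which whole speakers (rather than individual observations) are resampled with replacement, so that the dependence among a speaker's own data points is preserved in every resample. The following is a minimal sketch under that assumption; the function names, the toy data, and the choice of a simple proportion as the statistic are all hypothetical and are not taken from any of the readings.

```python
import random


def cluster_bootstrap(data_by_speaker, statistic, n_boot=1000, seed=0):
    """Resample speakers with replacement and recompute the statistic."""
    rng = random.Random(seed)
    speakers = list(data_by_speaker)
    estimates = []
    for _ in range(n_boot):
        # Draw as many speakers as in the original sample, with replacement.
        sample = rng.choices(speakers, k=len(speakers))
        pooled = [obs for spk in sample for obs in data_by_speaker[spk]]
        estimates.append(statistic(pooled))
    return estimates


def proportion(observations):
    """Share of observations coded as 1."""
    return sum(observations) / len(observations)


# Hypothetical toy data: 1 = complementizer produced, 0 = omitted.
toy = {
    "spk1": [1, 1, 0, 1],
    "spk2": [0, 0, 1],
    "spk3": [1, 1, 1, 1, 0],
}

boot = sorted(cluster_bootstrap(toy, proportion))
n = len(boot)
# Percentile interval covering roughly the central 95% of the estimates.
lower, upper = boot[n * 25 // 1000], boot[n * 975 // 1000 - 1]
print(lower, upper)
```

With real data one would use many more speakers and typically a regression coefficient rather than a raw proportion as the statistic; multilevel models address the same problem by modeling the speaker-level variation directly.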

Most papers on these models are still hard to read, but there are some pretty readable introductions to ordinary and multilevel regression methods for language researchers.

Baayen (2008) provides a selection of examples, case studies, and some conceptual background, along with tons of R code to run regression and mixed models on language data. You can download the book for free from his website.

Most research on syntactic production requires binomial or multinomial models (because the outcomes we're analyzing are categorical). Jaeger (2008) provides an introduction to mixed logit models. Readable applications of such models to corpus data are found in Bresnan et al. (2007) and Jaeger (submitted; see also Jaeger, 2006).

A wonderful introduction to linear mixed models and model comparison over such models is found in Baayen et al. (2008).

For a discussion of statistics specifically for sociolinguistic corpus research, have a look at Johnson (2009).

For further references and advanced tutorials, see [wiki:HlpLab/StatsCourses our HLP lab stats course page] and search the entries of the [http://www.hlplab.wordpress.com/ HLP lab blog]. Also consider subscribing to the R-lang email list, a list specifically designed to help language researchers using R.
