Diff for "ProjectsConstantEntropyChinese"

Differences between revisions 12 and 13

Project maintained by: ["TingQian"]

Project-related news

Mar 20 -- I am going to present the current progress of this project at the [http://www.rochester.edu/College/ugresearch/expo.html Undergraduate Research Exposition 2008]. The specific date and time will be announced here.

Motivation

What underlies the difference in linguistic performance between native and non-native speakers? Most of us have found even perfectly grammatical speech or writing produced by non-native speakers hard to understand. At the behavioral level, this is likely to occur if a non-native speaker chooses unconventional terms to express ideas, or ignores contextual information that is specific to the new language and thus adds redundancy to the expression. However, the sources of this added difficulty, that is, what kinds of changes in the information carried by language are caused by non-native speakers' language behavior, have not been clearly identified or explained. In this project, I am trying to formulate an information-theoretic account of this phenomenon by analyzing the quality of both native and non-native language production by native Chinese speakers.

Background

Recent studies have shown that humans tend to optimize the information density of a sentence during speech production in order to make it easy to understand (e.g. Jaeger, 2006). In studies of written English text, Genzel and Charniak (2002) proposed a constant entropy rate principle, which states that if humans try to communicate efficiently, the information content of each individual sentence, measured without regard to the preceding context, will increase with its position in a paragraph. Intuitively, this means that without knowledge of the prior context, a sentence picked at random from a paragraph will be difficult to understand.
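
The reasoning behind this prediction can be written as a standard information-theoretic identity. The notation is assumed here for illustration and is not taken from the project itself: X_i is the i-th word, L_i is its local context (the preceding words in the same sentence), and C_i is its global context (the words of all preceding sentences).

```latex
\[
  H(X_i \mid C_i, L_i) \;=\; H(X_i \mid L_i) \;-\; I(X_i ; C_i \mid L_i)
\]
```

If the left-hand side stays roughly constant, as the principle proposes, and the mutual information term grows as more preceding context accumulates, then the sentence-internal entropy H(X_i | L_i) has to grow with sentence position. That sentence-internal quantity is what a model that ignores earlier sentences would estimate.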

The current study uses computational methods to examine whether information density differs between the first language (Chinese, in this study) and the second language (English) of speakers from the same language group. There are three main tasks:

1. exploring the distribution of information content in written Chinese, for which the constant entropy rate principle has not yet been confirmed;
2. comparing the obtained result with the information content of English written by Chinese speakers;
3. based on 1 and 2, looking for the reason why different levels of linguistic performance are often observed between first- and second-language use.

The entire study will be based on three corpora: Chinese, English, and English used as a second language. N-gram models and lexicalized probabilistic context-free grammars will be used to compute information content for each type of language. The prospective results may shed light on what causes the utterances of non-native speakers to appear unnatural. Is it an imperfect command of grammatical rules, or some other constraint in the brain that interferes with second-language processing? The long-term goal of this project is to find an answer to this question.
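
As a rough illustration of the n-gram part of this plan, the sketch below estimates per-word information content for each sentence and averages it by sentence position within a paragraph. It is only a minimal sketch under stated assumptions, not the method used in this project: the bigram order, the add-one smoothing, the `<s>` start marker, and the function names `train_bigram`, `per_word_bits`, and `bits_by_position` are all hypothetical choices. Sentences are assumed to be already tokenized, which for Chinese text means word segmentation has been done beforehand.

```python
import math
from collections import Counter, defaultdict

BOS = "<s>"  # hypothetical sentence-start marker (an assumption of this sketch)


def train_bigram(sentences):
    """Collect bigram statistics from tokenized sentences (lists of words)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        prev = BOS
        for tok in tokens:
            contexts[prev] += 1          # how often prev occurs as a context
            bigrams[(prev, tok)] += 1    # how often tok follows prev
            vocab.add(tok)
            prev = tok
    return contexts, bigrams, vocab


def per_word_bits(tokens, model):
    """Average information content (bits per word) of one non-empty sentence.

    Uses add-one smoothed bigram probabilities. Only words of the same
    sentence are used as context; earlier sentences are ignored.
    """
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot reserves mass for unseen words
    total, prev = 0.0, BOS
    for tok in tokens:
        p = (bigrams[(prev, tok)] + 1) / (contexts[prev] + v)
        total -= math.log2(p)
        prev = tok
    return total / len(tokens)


def bits_by_position(paragraphs, model):
    """Mean per-word bits for each sentence position (1-based) across paragraphs."""
    sums, counts = defaultdict(float), Counter()
    for paragraph in paragraphs:
        for position, tokens in enumerate(paragraph, start=1):
            if tokens:  # skip empty sentences
                sums[position] += per_word_bits(tokens, model)
                counts[position] += 1
    return {pos: sums[pos] / counts[pos] for pos in sorted(counts)}
```

A typical use would be to call `train_bigram` on a training portion of a corpus and then `bits_by_position` on held-out paragraphs. Under the constant entropy rate principle, the resulting averages would be expected to rise with sentence position. Add-one smoothing is chosen here only for brevity; a real experiment would likely need a better smoothing scheme.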

Method

Current Progress

ProjectsConstantEntropyChinese (last edited 2008-04-07 03:15:41 by cpe-66-67-32-171)

- Revision 12 as of 2008-03-21 03:02:45
  Size: 2903
  Editor: cpe-66-67-32-171
  Comment:
+ Revision 13 as of 2008-03-21 03:12:03
  Size: 3354
  Editor: cpe-66-67-32-171
  Comment:
 Line 13:
-What underlies the difference in linguistic performance between native speakers and non-native speakers? We may all have the experience of finding even perfectly grammatical speech or writings produced by non-native speakers hard to understand. On the behaviorist level, this is likely to occur if a non-native speaker chooses unconventional terms to express ideas, or he or she ignores contextual information that is specific to the new language and thus adds redundancy to expression.
+What underlies the difference in linguistic performance between native speakers and non-native speakers? We may all have the experience of finding even perfectly grammatical speech or writings produced by non-native speakers hard to understand. On the behaviorist level, this is likely to occur if a non-native speaker chooses unconventional terms to express ideas, or, he or she ignores contextual information that is specific to the new language and thus adds redundancy to expression. However, the sources of this added difficulty, that is, what kind of changes in the information that language carries has been caused by non-native speakers' language behaviors, are not clearly identified and explained. In this project, I am trying to formulate an information-theoretic account for this phenomenon, by analyzing the quality of both native language production and non-native language production with respect to native Chinese speakers.