ProjectsConstantEntropyChinese

Project maintained by: ["TingQian"]

Project-related news

Mar 20 -- Ting Qian will present the current progress of this project at the [http://www.rochester.edu/College/ugresearch/expo.html Undergraduate Research Exposition 2008]. The specific date and time will be announced.

Background & Motivation

This study uses computational methods to examine whether the information density of written text differs when the same group of speakers writes in their first language (Chinese, in this study) and in their second language (English). Recent studies have shown that speakers tend to optimize the information density of a sentence during speech production in order to make it easier to understand (e.g. Jaeger, 2006). For written English text, Genzel and Charniak (2002) proposed the constant entropy rate principle: if humans try to communicate efficiently, the information content of each individual sentence, measured without its preceding context, will increase with the sentence's position in a paragraph. Intuitively, this means that a sentence picked at random from a paragraph is difficult to understand without knowledge of the prior context.
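The reasoning behind this prediction can be written out with a standard information-theoretic identity. The notation below is introduced here for illustration only: X_i stands for a unit of text (a word or sentence) at position i, L_i for its local context within the same sentence, and C_i for the context supplied by all preceding sentences.

```latex
% Entropy of X_i given all context = entropy given local context only,
% minus the information that the preceding sentences contribute.
H(X_i \mid C_i, L_i) \;=\; H(X_i \mid L_i) \;-\; I(X_i ; C_i \mid L_i)
```

If the left-hand side stays roughly constant across positions, and the mutual information term grows as more preceding context accumulates, then the out-of-context entropy H(X_i | L_i) has to grow with i. That out-of-context quantity is what a sentence-level language model can estimate.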

This project has three main tasks: 1) exploring the distribution of information content in written Chinese, for which the constant entropy rate principle has not yet been confirmed, 2) comparing the result with the information content of English written by native Chinese speakers, and 3) based on 1 and 2, investigating why different levels of linguistic performance are often observed between first- and second-language use.

The entire study will be based on three corpora: Chinese, English, and English used as a second language. N-gram models and lexicalized probabilistic context-free grammars will be used to estimate the information content of each type of language. The prospective results can shed light on why utterances of non-native speakers may appear unnatural. Is it an imperfect command of grammatical rules, or some other constraint in the brain that interferes with second-language processing? The long-term goal of this project is to answer this question.
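As a rough illustration of how an n-gram model yields a per-sentence information estimate, here is a minimal sketch in Python. It is not the project's actual implementation: the bigram order, the add-one smoothing, the `<s>` and `<unk>` symbols, and the function names `train_bigram` and `sentence_information` are all assumptions made for this example. Sentences are assumed to be pre-tokenized lists of strings (for Chinese, these could be characters or segmented words).

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect bigram and history counts from tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<unk>"}
    for tokens in sentences:
        padded = ["<s>"] + tokens          # hypothetical sentence-start symbol
        contexts.update(padded[:-1])       # how often each token serves as a history
        bigrams.update(zip(padded, padded[1:]))
        vocab.update(tokens)
    return contexts, bigrams, len(vocab)

def sentence_information(tokens, contexts, bigrams, vocab_size):
    """Average surprisal (bits per token) under an add-one smoothed bigram model.

    The sentence is scored without any preceding sentences, i.e. out of context.
    """
    padded = ["<s>"] + tokens
    total_bits = 0.0
    for prev, word in zip(padded, padded[1:]):
        prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        total_bits -= math.log2(prob)
    return total_bits / len(tokens)

# Toy corpus, purely for demonstration.
corpus = [["a", "b"], ["a", "b"]]
contexts, bigrams, vocab_size = train_bigram(corpus)
seen = sentence_information(["a", "b"], contexts, bigrams, vocab_size)
unseen = sentence_information(["b", "a"], contexts, bigrams, vocab_size)
print(seen, unseen)
```

Under the principle discussed above, one would compute this value for every sentence in a paragraph and check whether it tends to rise with sentence position. A realistic study would need a much larger training corpus and a better smoothing method than add-one.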


Method

Current Progress