## page was renamed from HlpLab/Projects/Run
#acl HlpLabGroup:read,write,delete,revert,admin All:read
#format wiki
#pragma section-numbers 3
#language en

= Run the Whole Thing =

Ideas keep coming up where the right approach would be to collect some measurement over the entire corpus.  In some cases this would mean adding a new level of annotation to an existing corpus; in other cases, this would mean running an analysis on existing annotations, but across an entire corpus rather than on specifically extracted data sets.  

It is unclear whether any of this is a good idea.

== Measurements and Analyses ==

This is a collection of measurements that could be taken, or projects that could come out of, the general strategy of "running the whole thing":

 * Add frequency, relative frequency, and predictability given the n words to the left and to the right as new levels of annotation in the NiteXML representation.  (These are what we use on almost every project.)

 * Add entropy, importance, and predictability from both directions as new annotations.  (These are things we want to use in the future or already use in particular contexts.)

 * Add a separate parse tree generated by an automatic dependency parser.

 * Learn a PCFG from the parsed SWBD corpus.  If it is stable, add the probability of each word under the PCFG as a level of annotation.

 * Extract our best estimate of speech rate at each word in the corpus.  Use this information to investigate the issue of the size of the planning window for production.

 * Compare the citation form of each word to the hand-transcription and/or the automatic segmentation of that word.  Do segment alignment and score the number of additions/deletions/insertions/exchanges.  Possibly do the same at the syllable level.  Characterize reduction and "massive" reduction in corpora, and correlate with information measures (for every word in the corpus).

 * Use a computational model of articulation to calculate the articulatory effort for each word in the corpus (ideally do this sublexically, for each n-phone).  Calculate an effort score for the citation form and for the transcribed form of each word.  Identify effort/information trade-offs, and correlate effort measures with other measures of reduction.  Combine with speech rate information to try to compute a cost function relating effort and time.
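
As a rough illustration of the first item in the list above, the per-word annotations could be computed in a single pass over a token sequence. This is only a sketch under simplifying assumptions: the function name `annotate` and the field names are hypothetical, the context is limited to one neighbouring word (n = 1), probabilities are unsmoothed maximum-likelihood estimates, and writing the values back into NiteXML is not shown.

```python
from collections import Counter


def annotate(tokens):
    """Return one dict of annotations per token (hypothetical field names)."""
    total = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # How often each word serves as a left context / right context.
    left_context = Counter(tokens[:-1])
    right_context = Counter(tokens[1:])

    rows = []
    for i, word in enumerate(tokens):
        row = {
            "word": word,
            "freq": unigrams[word],
            "rel_freq": unigrams[word] / total,
            "p_given_left": None,   # P(word | previous word)
            "p_given_right": None,  # P(word | following word)
        }
        if i > 0:
            prev = tokens[i - 1]
            row["p_given_left"] = bigrams[(prev, word)] / left_context[prev]
        if i < total - 1:
            nxt = tokens[i + 1]
            row["p_given_right"] = bigrams[(word, nxt)] / right_context[nxt]
        rows.append(row)
    return rows


rows = annotate("a b a b a c".split())
```

On a real corpus the estimates would presumably need smoothing and a longer context window; the sketch only shows that every word, not just the words in an extracted data set, ends up with the same set of annotations.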
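
The segment alignment mentioned in the citation-form item above could be approximated with a standard minimum-edit-distance alignment. The sketch below is an assumption about how the scoring might work, not an agreed method: the function name `align_counts` is hypothetical, all edits have unit cost, "exchanges" are treated as substitutions, and the phone symbols in the example are made up for illustration.

```python
def align_counts(citation, produced):
    """Count matches, substitutions, deletions and insertions needed to
    turn the citation-form segments into the produced segments."""
    n, m = len(citation), len(produced)
    # dist[i][j] = minimum number of edits from citation[:i] to produced[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if citation[i - 1] == produced[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion from citation form
                dist[i][j - 1] + 1,         # insertion into produced form
            )

    # Trace one optimal path back from the end to count each edit type.
    counts = {"match": 0, "substitution": 0, "deletion": 0, "insertion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if citation[i - 1] == produced[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["match" if cost == 0 else "substitution"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


# Illustrative (made-up) transcriptions of a heavily reduced word.
reduced = align_counts(
    ["p", "r", "aa", "b", "ah", "b", "l", "iy"],
    ["p", "r", "aa", "l", "iy"],
)
```

The deletion count per word (here three of eight citation segments) is one candidate reduction score that could then be correlated with the information measures; the same function works on syllables if the inputs are syllable lists.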