General Notes

Brief overview of dataflow

Text Processing

  1. Run swbdext_markup.R on swbdext.tab to produce swbdext_clean.tab. The working directory must also contain context_markup.R, 14-b.t2o, and 4-a.t2o. The result is a version of swbdext.tab in which a) utterances are almost entirely plain text and b) some sampling class annotation has been added.
  2. Run more_regex_fixes.R and makeS1TargetSpeaker.py (probably in that order) to fix text artifacts that the *_markup.R scripts do not currently handle.
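
Before running step 1, it can help to confirm that the working directory holds everything swbdext_markup.R expects. The following is a minimal Python sketch, assuming the files are looked up by exactly the names listed above. The helper name missing_inputs is hypothetical and is not part of the existing scripts.

```python
from pathlib import Path

# Files that step 1 says must sit in the working directory.
REQUIRED_FILES = [
    "swbdext_markup.R",
    "swbdext.tab",
    "context_markup.R",
    "14-b.t2o",
    "4-a.t2o",
]


def missing_inputs(workdir="."):
    """Return the required file names that are absent from workdir.

    Hypothetical helper: it only checks for presence by name and does
    not validate the contents of any file.
    """
    base = Path(workdir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing from working directory: " + ", ".join(missing))
    else:
        print("All inputs for swbdext_markup.R are present.")
```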

Human Annotation - Alternations & Questions

  1. Run altgen.R on swbdext_clean.tab to produce alternations.tab. This new file contains automatically alternated versions of the sentences in swbdext_clean.tab that are (at this point) judged worthy of being sampled.
  2. Someone (thus far, just me - Eric Meinhardt) then runs judge.py to judge the acceptability of the automatic alternations, producing rated_annotations.tab.
  3. The script import_ratings.R takes the information in rated_annotations.tab and adds new details to the sampling class information present in swbdext_clean.tab to produce swbdext_final.tab.
  4. Someone (thus far, Eric Meinhardt) runs reviewItems.py on swbdext_final.tab to add new questions to usable items and to edit, rate, and review previously added questions. Any items that disambiguate the gender of either speaker can also be marked and annotated as such.
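
The file dependencies in steps 1 to 3 can be written down as data, which makes it easy to see which step is due next. This is only a sketch: the input listed for judge.py is an assumption (the text says only that it judges the automatic alternations), and next_step is a hypothetical helper, not one of the existing scripts.

```python
from pathlib import Path

# (script, input files, output file), in the order described above.
# Assumption: judge.py reads alternations.tab; the text does not name its input.
ANNOTATION_STEPS = [
    ("altgen.R", ["swbdext_clean.tab"], "alternations.tab"),
    ("judge.py", ["alternations.tab"], "rated_annotations.tab"),
    (
        "import_ratings.R",
        ["rated_annotations.tab", "swbdext_clean.tab"],
        "swbdext_final.tab",
    ),
]


def next_step(workdir="."):
    """Return the first script whose output file does not exist yet.

    Returns None when every output is present, i.e. swbdext_final.tab
    exists and reviewItems.py (step 4) can be run on it.
    """
    base = Path(workdir)
    for script, _inputs, output in ANNOTATION_STEPS:
        if not (base / output).exists():
            return script
    return None


if __name__ == "__main__":
    step = next_step()
    print(step if step else "swbdext_final.tab exists; run reviewItems.py")
```

The check looks only at whether each output file exists, so it will not notice an output that is older than its inputs.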

Producing samples

  1. Finally, sample.py can be run on swbdext_final.tab to produce a sample file whose name indicates the random seed that produced the sample, the target sentence manipulation condition, and the order (e.g. 1348349902_orig_rev.sample). This script currently depends on two other files: optimization.py and RRHC.py. Currently, sample.py also writes a record of the optimization process used to produce the final sample (in a file ending in .RRHCrun); these records can be used later to analyze the optimization algorithms. Command-line options (and default values) for sample.py are detailed in the module docstring and further below.
  2. driver.py is a shell script for conveniently producing a variety of samples with a systematic range of command-line parameters. Usage is straightforward once you have looked at the script and know sample.py's command-line options.
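
For later analysis it is often useful to recover the seed, condition, and order from a sample file name. The sketch below assumes the name always has the form seed_condition_order.sample, as in the example above, and that neither the condition nor the order contains an underscore. The names SampleName and parse_sample_name are hypothetical and do not come from sample.py.

```python
from pathlib import Path
from typing import NamedTuple


class SampleName(NamedTuple):
    seed: int        # random seed that produced the sample
    condition: str   # target sentence manipulation condition, e.g. "orig"
    order: str       # order, e.g. "rev"


def parse_sample_name(path):
    """Split a name like 1348349902_orig_rev.sample into its three fields.

    Assumes exactly three underscore-separated fields before ".sample".
    """
    name = Path(path).name
    suffix = ".sample"
    if not name.endswith(suffix):
        raise ValueError("not a .sample file: " + name)
    parts = name[: -len(suffix)].split("_")
    if len(parts) != 3:
        raise ValueError("expected seed_condition_order, got: " + name)
    seed, condition, order = parts
    return SampleName(int(seed), condition, order)


if __name__ == "__main__":
    print(parse_sample_name("1348349902_orig_rev.sample"))
    # prints SampleName(seed=1348349902, condition='orig', order='rev')
```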

Utility scripts

Detailed description of scripts and dataflow

Text Processing

Alternation Annotation

Question Annotation

Sampling

Adaptation in Production Script Guide (last edited 2012-10-01 06:35:55 by user-0c9a9bf)
