General Notes

Brief overview of dataflow

Text Processing

  1. Run swbdext_markup.R on swbdext.tab to produce swbdext_clean.tab. The working directory must also contain context_markup.R, 14-b.t2o, and 4-a.t2o. The result is a version of swbdext.tab in which a) utterances are almost entirely plain text and b) some amount of sampling class annotation has been added.
  2. Run more_regex_fixes.R and makeS1TargetSpeaker.py (probably in that order) to fix text artifacts that the *_markup.R scripts currently do not handle.

Human Annotation - Alternations & Questions

  1. Run altgen.R on swbdext_clean.tab to produce alternations.tab. This new file contains automatically generated alternations of the sentences in swbdext_clean.tab that are (at this point) judged worthy of being sampled.
  2. Someone (thus far, just me - Eric Meinhardt) then runs judge.py to judge the acceptability of the automatic alternations, producing rated_annotations.tab.
  3. The script import_ratings.R takes the information in rated_annotations.tab and adds new details to the sampling class information present in swbdext_clean.tab to produce swbdext_final.tab.
  4. Someone (thus far, Eric Meinhardt) runs reviewQs.py on swbdext_final.tab to add new questions to useable items as well as to edit, rate, and review previously added questions.
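
The file dependencies described so far can be summarized in a small table. The following is only a sketch for checking how far a working directory has progressed, not part of the actual scripts: STAGES and next_stage are hypothetical names, judge.py's input is assumed to be alternations.tab (the notes above do not state it explicitly), and more_regex_fixes.R, makeS1TargetSpeaker.py, and reviewQs.py are left out because their output files are not named here.

```python
import os

# (script, input files, output file), in the order described above.
# Assumption: judge.py reads alternations.tab; the notes do not say so explicitly.
STAGES = [
    ("swbdext_markup.R", ["swbdext.tab"], "swbdext_clean.tab"),
    ("altgen.R", ["swbdext_clean.tab"], "alternations.tab"),
    ("judge.py", ["alternations.tab"], "rated_annotations.tab"),
    ("import_ratings.R",
     ["rated_annotations.tab", "swbdext_clean.tab"],
     "swbdext_final.tab"),
]


def next_stage(directory="."):
    """Return the first (script, inputs, output) whose output file is
    missing from `directory`, or None if every listed output exists."""
    present = set(os.listdir(directory))
    for script, inputs, output in STAGES:
        if output not in present:
            return script, inputs, output
    return None
```

Calling next_stage() in a directory that holds only swbdext.tab and swbdext_clean.tab would point at altgen.R as the next script to run.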

Producing samples

  1. Finally, run sample.py on swbdext_final.tab to produce a sample file named samplen.tab, where n is one greater than the number of other .tab files in the current directory whose filenames contain the substring "sample". This script currently depends on two other files: optimization.py and RRHC.py. sample.py also currently writes a record of the optimization process used to produce the final sample (in a file ending in .RRHCrun); these records can be used later to analyze the optimization algorithms. Command-line options (and default values) for sample.py are detailed in the module docstring and further below.

  2. driver.py is a shell script for conveniently producing a variety of samples with a systematic range of command-line parameters. Its use is straightforward once you have looked at the script and know sample.py's command-line options.
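
The naming rule for sample files in step 1 can be illustrated with a few lines of Python. This is a minimal sketch of the rule as stated in these notes, not the code in sample.py; next_sample_name is a hypothetical helper, and it assumes the count is taken before the new file is written.

```python
import os


def next_sample_name(directory="."):
    """Return 'sample<n>.tab', where n is one greater than the number of
    .tab files in `directory` whose names contain the substring 'sample'."""
    existing = [
        name for name in os.listdir(directory)
        if name.endswith(".tab") and "sample" in name
    ]
    return "sample{}.tab".format(len(existing) + 1)
```

Note that, under this rule, any .tab file with "sample" in its name counts toward n, not only files previously written by sample.py.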

Utility scripts

Detailed description of scripts and dataflow

Text Processing

Alternation Annotation

Question Annotation

Sampling
