#acl HlpLabGroup:read,write,delete,revert,admin All:read
#format wiki
#language en
#pragma section-numbers 3

== General Notes ==
 * All Python scripts are Python 3; some (as a result of using the argparse module) require Python 3.2.
 * All scripts should explain usage, goals, and (often) summarize mechanisms in the comments at the top of the script. Use this page as a supplementary overview or imperfect all-in-one reference.
 * The embarrassingly bad Python scripts have been replaced with much more readable and idiomatic code; the R scripts are still rough around the edges, but I'm not sure that there is such a thing as an elegant, well-organized, and easily readable R script longer than a few dozen lines.
 * See the [[ProjectsSyntacticAdaptationProduction#To Do|todo list on the main project page]] for further ways to improve the clarity and quality of the code for this project.

== Brief overview of dataflow ==
=== Text Processing ===
 1. Call swbdext_markup.R to process swbdext.tab and yield swbdext_clean.tab. Other files that need to be in the working directory: context_markup.R, 14-b.t2o, and 4-a.t2o. The result is a version of swbdext.tab with a) utterances that are almost entirely just text and b) some amount of sampling-class annotation.
 2. Call more_regex_fixes.R and makeS1TargetSpeaker.py (probably in that order) to fix text artifacts that the *_markup.R scripts currently do not handle.

=== Human Annotation - Alternations & Questions ===
 3. Call altgen.R to process swbdext_clean.tab and yield alternations.tab. This new file contains automatically alternated versions of the sentences in swbdext_clean.tab that are (at this point) judged worthy of being sampled.
 4. Someone (thus far, just me - Eric Meinhardt) then runs judge.py to judge the acceptability of the automatic alternations, producing rated_alternations.tab.
 5. The script import_ratings.R takes the information in rated_alternations.tab and adds new details to the sampling-class information present in swbdext_clean.tab to produce swbdext_final.tab.
 6. Someone (thus far, Eric Meinhardt) runs reviewItems.py on swbdext_final.tab to add new questions to useable items as well as to edit, rate, and review previously added questions. Any items that disambiguate the gender of either speaker can also be marked and annotated as such.

=== Producing samples ===
 7. FINALLY, sample.py can be called on swbdext_final.tab to produce a sample file whose name indicates the random seed that produced the sample, the target-sentence manipulation condition, and the order (e.g. 1348349902_orig_rev.sample). This script currently depends on two other files - optimization.py and RRHC.py. Currently, sample.py also writes a record of the optimization process used to produce the ultimate sample (ending in .RRHCrun); these records can be used later for analysis of the optimization algorithms. Command-line options (and default values) for sample.py are detailed in the module docstring and further below.
 8. driver.py is a shell script for conveniently producing a variety of samples with a systematic range of command-line parameters. Use is pretty straightforward if you take a look and already know what sample.py's command-line options are.

=== Utility scripts ===
 * db_stats.py finds the counts and proportions of a number of categories of items in the input database, printing results to stdout. Adapted from code in reviewItems.py, but would probably be shorter and easier to maintain in R rather than Python.
   You must specify an input file and a filter option; the filter options correspond to the only three uses I've ever had for such a file: calculating counts/percentages with respect to the entire database (a - all items), with respect to the useable items (u), and with respect to the useable items that have questions (q).
 * inspect&modify.py is a script template for one or both of a) quickly seeing how many items match certain sets of properties (typically defined by simple boolean functions and combinations thereof) and b) doing something with those items (typically modifying all matching elements in the input database). Useful for checking out annotation issues and doing something about them with a minimum of new code.

== Detailed description of scripts and dataflow ==
=== Text Processing ===
 * swbdext_markup.R (reads primarily swbdext.tab and produces swbdext_clean.tab, by default) produces a cleaned-up and somewhat more annotated version of the original database in swbdext.tab.
 1. swbdext_markup.R calls context_markup.R (reading 14-b.t2o and 4-a.t2o to produce clean_14-b.t2o and clean_4-a.t2o, by default).
   a. context_markup.R removes
    * "-NONE- "
    * gap codes
    * spaces before periods
    * spaces in the middle of (most) contractions and possessives
    * disfluency markers
    * "end sentence" markers
    * "empty" speaker codes
    * string-final speaker codes of the 'after' context
   b. context_markup.R also replaces
    * speaker codes with "S1:" and "S2:"
    * ":." with ":"
    * double, triple, and quadruple spaces with single spaces
   c. Finally, context_markup.R also infers the first and last speaker of each item's before and after context.
 2. swbdext_markup.R also moves two misplaced lines in the original database so that the two context files and the database in swbdext_clean.tab have a 1-1 correspondence.
 3. It re-imports the two contexts of each item in the database from the "cleaned" context files produced by context_markup.R.
 4. It infers the speaker of the target sentence of each item given the speaker data inferred about both contexts by context_markup.R.
 5. It adds, if necessary, missing string-initial speaker codes on all "before" contexts.
 6. It also removes gap codes from the target sentence and recalculates target-sentence length.
 7. It capitalizes the first letter of the target and both contexts, every letter preceded by a period and one or two spaces, and every " i{ ,.'}".
 8. It replaces all double periods with single periods.
 9. It removes string-initial ".. " and ". ".
 10. It expands all contractions except for 'd and 's.
 11. It adds the "SamplingClass" column to the database. Codes A through E are added.
 12. swbdext_markup.R finally adds the columns Question1, Answer1, Question2, and Answer2 (and leaves them empty).
 * makeS1TargetSpeaker.py assumes swbdext_clean.tab is in the same directory and modifies it, ensuring that the speaker of the target item always has the same speaker code (which happens to be the default speaker code "S1"), except when the order of speaker labels is not clear.
 * more_regex_fixes.R corrects a laundry list of minor text-cleanup issues in swbdext_final.tab that should have been mended by swbdext_markup.R and/or context_markup.R; its fixes should be incorporated into those two files for future re-use.
 1. 10 instances of ". ." are replaced with ".".
 1. 15 instances of "empty" speaker tags (e.g. the first tag in "S1: S2:") are deleted.
 1. 21 instances of " ." are replaced with ".".
 1. 20 instances of " ," are replaced with ",".
 1. 409 instances of lower-case characters following speaker tags ("S[12]:([a-z])") are replaced with upper-case versions.

=== Alternation Annotation ===
 * altgen.R takes swbdext_clean.tab as input and produces alternated versions of every item potentially worth sampling at that point; the output is alternations.tab.
 1. Only items with a SamplingClass of exactly 1, 2, or 3 are copied from the database in swbdext_clean.tab to the database to be written to alternations.tab.
 2. Three additional fields are present in alternations.tab: Acceptability Code, AlternativePrep, and CorrectedSentence. Their uses are explained in the description of judge.py.
 3. The database to be written to alternations.tab is scrambled.
 * At this point, judge.py takes alternations.tab as input and iterates over the file, prompting the user for acceptability judgements and recording them in rated_alternations.tab.
 1. judge.py checks for an existing rated_alternations.tab file; if found, it prompts the user whether to
   a. accept all judgements found in this file (and then judge unrated items) up front, or
   b. decide whether to accept each existing judgement on a case-by-case basis.
 2. It then iterates over alternations.tab, presenting an automatically alternated form of each sentence and prompting the user for one of the following acceptability codes (assuming no existing rating has been found or the user rejects the existing rating):
  * 'a' - acceptable as is.
  * 'b' - borderline-acceptable as is.
  * 'u' - unacceptable.
  * 'o' - acceptable with a preposition other than 'to'. The user is then prompted to give the alternate preposition; this is stored in "AlternativePrep".
  * 'm' - acceptable with some other modification. The user is then prompted to rewrite the alternated form of the sentence; this is stored in "CorrectedSentence".
  * 'STOP' - the iteration halts and the script exits.
 * import_ratings.R uses rated_alternations.tab's acceptability codes to decide what and how to annotate the database in swbdext_clean.tab to produce the database in swbdext_final.tab.
   Items with the acceptability code
  * 'a' have a 'Z' added to their SamplingClass
  * 'b' have a 'Y' added to their SamplingClass
  * 'o' have an 'X' added to their SamplingClass
  * 'u' have a 'W' added to their SamplingClass and have their Item_IDs added to exceptions.tab

=== Question Annotation ===
 * reviewItems.py can take parameters specifying an input and an output file, but defaults to swbdext_final.tab for both; the script loops through each item in (what may be, at the user's direction via the command line, a specific subset of) the input file and presents the item id and sampling-class field, a pretty-printed, easily readable version of the stimulus, data related to content questions, and then a prompt for what to do next. The user may move on to the next item, add a new question, edit an existing question, answer, or rating field, mark the item as gender-disambiguating, or re-print the complete item (i.e. to allow review of committed changes to a given item).
 * The most useful command-line option is --filter (-f); the user may currently pass one of 5 arguments, directing the script to iterate only through: items with unrated questions, items with rated questions, items with a sampling class of 1, 2, or 3, or items with probable gender-disambiguating words. Because it has not been useful thus far, passing more than one filter argument is not supported.

=== Sampling ===
 * sample.py takes swbdext_final.tab as input and produces a file (<seed>_<manipulation>_<order>.sample, where <manipulation> is one of {{{orig, nppp, npnp, flip}}} and <order> is one of {{{fwd, rev}}}) in the manner specified via the following parameters, all of which have default values if not specified in command-line arguments.
 A. Note that none of these command-line arguments are necessary and that they can be given in any order. Each command-line argument is of the form "~VAR_CHAR(VALUE)" or "~VAR_CHAR(VALUE1,VALUE2)" (without quotes and, as here, with no whitespace anywhere). VAR_CHAR is one of {{{q, n, f, s, c, p, r, o}}}.
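To make the "~VAR_CHAR(VALUE)" / "~VAR_CHAR(VALUE1,VALUE2)" format concrete, here is a minimal Python sketch of a parser for arguments in that shape. This is purely illustrative - sample.py's actual parsing code may look quite different, and the helper name parse_arg is made up for this example.

```python
import re

# Matches "~X(V)" or "~X(V1,V2)" where X is one of the documented VAR_CHARs.
# ('o' is included here because produceFlippedSample is documented below.)
ARG_RE = re.compile(r"^~([qnfscpro])\(([^),]+)(?:,([^)]+))?\)$")

def parse_arg(arg):
    """Return (var_char, (value,)) or (var_char, (value1, value2))."""
    m = ARG_RE.match(arg)
    if m is None:
        raise ValueError("malformed argument: %r" % arg)
    var_char, v1, v2 = m.groups()
    return var_char, (v1,) if v2 is None else (v1, v2)
```

For example, "~s(100)" parses to ('s', ('100',)) and "~c(0.5,0.75)" to ('c', ('0.5', '0.75')); the values are left as strings, since each parameter interprets its value(s) differently.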
   An explanation of each argument, its default value(s), and the range of values it can take on is detailed below:
  * q. QuestionsOverClass. Range: one of the two characters {T,F}. Default: T. If true, the script places more importance on content questions than on sampling class; that is, the script attempts to build a sample entirely composed of items with content questions, if possible. If TRUE and this is not possible, the sample might have a higher proportion of sampling class 2 and sampling class 3 items than if this parameter were FALSE. See Steps 4 and 5 below for more details.
  * n. initNPNPprop. Range: floating-point numbers in the interval [0,1]. Default: 0.5. The proportion of the sample desired to be in NPNP order coming in originally (that is, before any sentences are alternated). See Step 6 below for the context in which this is used.
  * f. flipProp. Range: floating-point numbers in the interval [0,1]. Default: 0.25. After items have been alternated, the sample is ordered into pairs consisting of an NPNP item followed by an NPPP item (until the script runs out of the limiting category, at which point it tacks the remaining items on at the end); the value associated with f gives the probability that each NPNP+NPPP pair will have its order flipped to NPPP+NPNP. See Step 8 below for the context in which this is used.
  * s. sampleSize. Range: integers between 0 and the number of items in swbdext_final.tab. Default: 100. Determines the number of items the script will attempt to put in the sample file.
  * c. finalCondition. Range for both values: floating-point numbers in the interval [0,1]. Default for both values: 1. Determines the final manipulation (if any) to apply to each item.
   The first float in the tuple gives the proportion of items in the sample that were ORIGINALLY NPNP to ''preserve the order of'' (that is, leave unmanipulated); the second float, similarly, gives the proportion of NPPP items in the sample whose order should be preserved. The default values of (1,1) leave all items in original database order.
  * p. speakerCodes. Range for both values: strings. Default for value 1: S1. Default for value 2: S2. Substitutions for speaker codes 1 and 2. Colons are automatically appended to each of these two values later.
  * r. randSeed. Range: anything that Python can convert to an int. By default, this is assigned the value of time.time() at execution.
  * o. produceFlippedSample. Range: 'T', 'True', True, 'F', 'False', False. Default: True. Determines whether or not sample.py takes the almost-final sample and produces a copy identical except for the ordering among items (which is reversed).
 A. All stimuli files in the project directory with names of the form <seed>_<manipulation>_<order>.sample were produced by running sample.py a few times, taking the random seed of the first sample that met all criteria (cross-item speaker-view coherency, no within-item gender disambiguation, no cross-item overlap), and then running driver092612.py with that random seed. driver092612.py calls sample.py with that random seed and four different target-manipulation arguments. This is inefficient, but means a minimal amount of additional code is necessary to produce a given sample from the database in each of the 8 order/manipulation conditions.
 A. Execution steps:
  1. Database items with unacceptable sampling classes are trimmed from the database in memory.
  2. Each entry of the database is annotated with lemma frequency (number of instances in the database with this lemma / total number of items in the database).
   * Maybe this should be step 1 rather than step 2?
  3. Create a collection of smaller, immutable versions of each database entry, with just the information we need to determine how good a sample is.
  4. Using problem-specific sample-evaluation functions, pass the database to an optimization algorithm and take the "uncompressed version" of the solution it produces.
   * Algorithm: Use random-restart hillclimbing to separately find a good NPNP subsample and a good NPPP subsample, and combine the two. Each "step" in the act of hillclimbing only ever moves upwards in the fitness landscape; the number of neighboring states a hillclimber looks at at a given step before accepting the best found state better than the current one changes according to a cooling schedule explored in more detail in optimalStoppingScribbles.zip.
   * Parameters: sample.py currently takes the best result of 5 runs (for both subsamples); during each run, the hillclimber class in RRHC.py currently stops after no more than 50 steps from the initial state. Increasing the number of runs or the cutoff step value currently just results in multiple runs that yield (distinct) results with fitness values that differ somewhere past the 14th decimal place; reasons for this extreme granularity of difference are discussed below. As of 02.13.12, what granularity is still present should be "good" granularity.
   * Fitness function: There are four desirable features in a sample. One of them is maintained perfectly by design, and the other three all take on values from 0 to 1 (inclusive); this makes them easy to combine. For now, all three values are multiplied together to a) collapse the three subscores into one (making comparisons much easier) and b) prevent advances in one subscore from coming at the expense of another during the hillclimbing process.
    1. '''Initial NPNP/NPPP proportions''': This is customizable via command-line option, defaulting to 0.5/0.5.
     Currently this is enforced by splitting the sample-search process into two sub-problems: creating a reasonably good NPNP subsample (of length exactly equal to 0.5 * sampleSize) and creating a reasonably good NPPP subsample. The sum of these two search-space sizes is enormously smaller than the size of the complete search space, particularly if the NPNP/NPPP command-line parameter were to become a strong suggestion (rather than a fixed constraint, as it is currently treated).
    2. '''The proportion of sample items with questions''': A perfectly fit (sub)sample (with respect to this subscore) has as many questions as are possible, given the number of items with at least one question in the subset of the database the relevant subsample is drawn from (i.e. the NPNP portion) and the sample size.
    3. '''Sampling-class distribution''': This fitness subscore measures the normalized Hellinger distance (chosen from [[http://www.math.hmc.edu/~su/papers.dir/metrics.pdf|Gibbs & Su 2002]] because a) I could figure out how to implement it (unlike Earthmover's distance), b) it takes values between 0 and 1, and c) it's much less coarse than total variation distance, the previous metric) between the current distribution over sampling classes in the subsample and the distribution over sampling classes in an "ideal" subsample; an ideal subsample maximizes the number of sampling class 1 items, filling spots remaining after using up the SC1 pool with as many sampling class 2 items as possible, and then filling any final remaining spots with sampling class 3 items.
    4. '''Lemma-frequency fitness''': An ideal (sub)sample has a distribution of verb lemmas equal to that of the sampleable subportion (i.e. NPNP or NPPP) of the database; this fitness value is the normalized Hellinger distance between the sample's lemma distribution and that of the sampleable subportion of the database.
  5. Convert the sentence order of (1 - finalCondition[0]) * [# of NPNP items in the sample] NPNP items and (1 - finalCondition[1]) * [# of NPPP items in the sample] NPPP items.
  6. Organize a proportion of the sample list equal to 1 - flipProp into NPNP+NPPP pairs (and the remaining proportion, flipProp, into NPPP+NPNP pairs) until either category of items runs out, and then add the rest of the items from the longer set to the end of the rebuilt sample. (This intermediate pairlist is composed of tuples of paired sentences and single sentences from the longer list.)
  7. The list of pairs and left-over items is scrambled, to intersperse the leftovers.
  8. The tuples wrapping each pair are then "unpacked" to produce a list with items of just one data type.
  9. Ibex columns are added.
   * Stimulus - before context + target sentence (alternated form if the item was alternated, original sentence string otherwise) + after context, adjusting capitalization and adding the after context's periods to restore the text to what it was in the original database.
   * List - the filename, minus the '.sample' extension; gives the random seed, target manipulation, and list order.
   * Trial Order - the item's position in a zero-initial order.
   * ItemId - set to the same value as the item's Item_ID field.
   * StimulusType - 'item'.
   * Condition - <original order> -> <final order> (e.g. "NPNP -> NPNP" for an item that was originally in NPNP order and was NOT alternated).
  10. Substitute speaker codes in all columns of all rows.
  11. Write to file.
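For reference, the normalized Hellinger distance used by the sampling-class and lemma-frequency fitness subscores above can be sketched as follows. This is an illustration, not the actual code in sample.py/optimization.py; distributions are represented here as dicts mapping categories (sampling classes or lemmas) to probabilities.

```python
from math import sqrt

def hellinger(p, q):
    """Normalized Hellinger distance between two discrete distributions.

    Lies in [0, 1]: 0 for identical distributions, 1 for distributions
    with disjoint support.
    """
    categories = set(p) | set(q)
    s = sum((sqrt(p.get(c, 0.0)) - sqrt(q.get(c, 0.0))) ** 2
            for c in categories)
    return sqrt(s) / sqrt(2.0)

def fitness_subscore(p, q):
    # Higher is better: identical distributions score 1.0, so multiplying
    # several such subscores together still yields a value in [0, 1].
    return 1.0 - hellinger(p, q)
```

Because each subscore lies in [0, 1], multiplying them (as the fitness function above does) keeps the combined score in [0, 1] and penalizes any single subscore collapsing toward 0.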
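Similarly, the pairing/flip logic of Steps 6-8 can be sketched in Python. This is a hypothetical illustration of the described behavior (the function name pair_and_flip and the list-based representation are made up for this example), not sample.py's actual implementation.

```python
import random

def pair_and_flip(npnp, nppp, flip_prop, rng=random):
    """Interleave NPNP and NPPP items into pairs, flip each pair's internal
    order with probability flip_prop, scramble, then flatten (Steps 6-8)."""
    pairs = []
    for a, b in zip(npnp, nppp):
        # With probability flip_prop, use NPPP+NPNP order instead of NPNP+NPPP.
        pairs.append((b, a) if rng.random() < flip_prop else (a, b))
    # Whatever is left of the longer list is kept as unpaired singletons.
    leftovers = npnp[len(nppp):] + nppp[len(npnp):]
    mixed = pairs + leftovers
    rng.shuffle(mixed)  # Step 7: scramble to intersperse the leftovers.
    # Step 8: "unpack" the pair tuples into a flat, single-type list.
    flat = []
    for entry in mixed:
        if isinstance(entry, tuple):
            flat.extend(entry)
        else:
            flat.append(entry)
    return flat
```

Passing an explicit random.Random(seed) as rng would reproduce the role of the randSeed parameter: the same seed yields the same pair flips and the same scramble.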