Differences between revisions 3 and 4

TGrep2 Database Tools (TDT)

1. General options

-c corpusname or --corpus corpusname, where corpusname is an argument describing the corpus to be used. This determines a variety of things, e.g. which ngram files will be used and how corpus-specific information will be stripped from the terminal output of TGrep2 (e.g. when extracting strings from the corpus). Currently the following arguments are recognized:
- swbd - Switchboard Corpus, back-translated from NITE XML (sw.backtrans.convid_030507.t2c.gz)
- wsj - Wall Street Journal (wsj_mrg.t2c.gz)
- brown - Brown corpus (brown.t2c.gz)
- bnc - the entire parsed BNC
- bncs - only the spoken parts of the BNC
- bncw - only the written parts of the BNC
-d databasename or --database databasename, where databasename is an argument describing the filename of the database that you wish to manipulate (create, add information to, etc.).
- Defaults to corpusname.tab
-f factornames or --factors factornames, where factornames is one or more names of factors (=columns) in the database that you wish to create, import, or manipulate. If there is more than one factor, the names have to be separated by commas with no intervening spaces, as in factorname1,factorname2,...,factornameN. Most scripts accept several factor names as input and will loop over all of them. If the script expects an input file (e.g. a TGrep2 output file) for each factor, the files can either be provided separately (see --files option) or together with the factor specification by using factorname1=file1,factorname2=file2,...,factornameN=fileN.
-o or --overwrite specifies that cells that already have a value should be overwritten if a new value is obtained by the operations conducted by the script (e.g. by importing information from a TGrep2 output file).
- Default is no.
-r or --reset specifies that all cell values of the specified factors (=columns) should be reset to the default value (usually an empty cell).
- Default is no.
-w or --warnings specifies whether detailed warnings should be printed.
- Default is no.
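
The factorname=file syntax of --factors can be illustrated with a small shell sketch. It is not part of TDT: parse_factors is a hypothetical helper, and the file paths are only examples. It splits a specification on commas and prints one name/file pair per line, leaving the file field empty when a factor is given without =file.

```shell
#!/bin/sh
# Hypothetical helper (not part of TDT): split a --factors argument of the
# form name1=file1,name2=file2 into one "name<TAB>file" line per factor.
# A factor given without "=file" yields an empty file field.
parse_factors() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r spec; do
        case $spec in
            *=*) printf '%s\t%s\n' "${spec%%=*}" "${spec#*=}" ;;
            *)   printf '%s\t\n' "$spec" ;;
        esac
    done
}

# Example specification; the paths are illustrative only.
parse_factors 'Form=../data/swbd/Form,Len_position=../data/swbd/LEN_position,Dform'
```

Because commas separate the factors and no spaces are allowed, the whole specification is a single shell word and needs no special quoting beyond the usual.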

old notes from Andrew?

what directory structure is expected (all the subdirectories, factor files, etc)

i am constantly modifying these things and you should probably just decide for yourself. the only things that assume a directory structure are the run script and the ExtractData shellscript. run is a shellscript that calls all the other stuff, including ExtractData, which is a list of calls to perl scripts. it is ExtractData that constructs the database. run contains the commands that call TGrep2 and create the data files that the perl scripts extract information from. this is a somewhat redundant step, but it means that you do not have to re-extract all data when you want to change one little thing in the final database.

those scripts assume that they are in a directory called shellscripts, which is in a directory with the following four directories

data ptn results shellscripts

data will contain the data extracted by TGrep2 in data/corpus_name, e.g.

[tiflo@bcs-cycle1 ~/DT]$ ls data/swbd/
Dfollowing.sn Form.sn LEN_position.sn NPother.sn POSpreceding.sn WORDpreceding.sn
Dform.sn ID.sn NPempty.sn NPsingleNother.sn TOPstring.sn XML.sn
Dpreceding.sn LEN_np.sn NPmultipleN.sn NPsingleN.sn WORDfollowing.sn

as you can see, the file extension for those data files is sn. by the way, in ~tiflo/DT you'll find the most up to date version, where the run script defines the data endings and paths at the top of the script ...

ptn is the directory with the TGrep2 patterns:

[tiflo@bcs-cycle1 ~/DT]$ ls ptn
ContFactor Dpreceding.ptn NPmultipleN.ptn NPsingleN.ptn UFactor
Dfollowing.ptn ID.ptn NPother.ptn POSFactor
Dform.ptn NPempty.ptn NPsingleNother.ptn StringFactor

the *.ptn files in ptn are assumed to be categorical factors; depending on the assumed factor type the run script will have a slightly different tgrep2 call. check it out. all calls have in common that they create output that starts with a tgrep2 id on each line, but they differ in what follows it (if anything). for categorical factors, nothing follows the id. each file is just a list of all tgrep2 IDs that match a certain pattern
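
to make the categorical case concrete, here is a rough shell sketch. it is not the real addCategoricalFactor.pl: mark_ids is a hypothetical helper, the IDs are made up, and it assumes one ID per line. every database ID that appears in the ID list gets the given value, all others get an empty cell.

```shell
#!/bin/sh
# Hypothetical helper, not the real addCategoricalFactor.pl.
# Usage: mark_ids ID_LIST VALUE < file_with_one_database_ID_per_line
mark_ids() {
    awk -v idlist="$1" -v value="$2" '
        BEGIN {
            # Load the ID list; only the first field of each line is used.
            while ((getline line < idlist) > 0) {
                split(line, field, " ")
                matched[field[1]] = 1
            }
        }
        # Emit "ID<TAB>value" for matches and "ID<TAB>" (empty cell) otherwise.
        { print $1 "\t" (($1 in matched) ? value : "") }
    '
}

tmp=$(mktemp -d)
printf '12:3\n40:7\n' > "$tmp/Dform.sn"            # made-up IDs
printf '12:3\n15:1\n40:7\n' | mark_ids "$tmp/Dform.sn" 1
rm -rf "$tmp"
```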

the *.ptn files in ptn/ContFactor are assumed to be length factors: each ID is followed by terminals (could be one, could be many) that correspond to the node marked =print in the .ptn file. the addLengthFactor.pl script later sums up everything that counts as a word. an ID may occur several times (on several rows) and addLengthFactor.pl will accumulate counts over those rows. the *.ptn files in ptn/CountFactor are assumed to be count factors: similar to length factors, but this time we do not accumulate words, but matches (rows). that is, for each occurrence of an ID in the .sn file, the addCountFactor.pl script increments the value of the variable by one. so cont and count factors are treated the same by run (it's just for organizational purposes that there are two directories).
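
the difference between length and count factors can be sketched in a few lines of shell and awk. this is not the real addLengthFactor.pl or addCountFactor.pl: length_per_id and count_per_id are hypothetical names, the data is made up, and the sketch assumes that every whitespace-separated token after the ID counts as a word (the real script may be more selective).

```shell
#!/bin/sh
# Hypothetical sketch, not the real addLengthFactor.pl / addCountFactor.pl.
# Input lines look like "<ID> <terminal> <terminal> ..."; IDs may repeat.

# Length factor: number of tokens after the ID, summed over all rows of an ID.
length_per_id() {
    awk 'NF > 0 { len[$1] += NF - 1 }
         END { for (id in len) print id "\t" len[id] }' "$1" | sort
}

# Count factor: number of rows (matches) per ID.
count_per_id() {
    awk 'NF > 0 { rows[$1]++ }
         END { for (id in rows) print id "\t" rows[id] }' "$1" | sort
}

tmp=$(mktemp -d)
printf '12:3 the old dog\n12:3 a cat\n40:7 it\n' > "$tmp/LEN_np.sn"   # made-up data
length_per_id "$tmp/LEN_np.sn"
count_per_id "$tmp/LEN_np.sn"
rm -rf "$tmp"
```

both functions read the same kind of file, which matches the point that run treats the two factor types identically; only the later accumulation step differs.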

the *.ptn files in ptn/StringFactor ... yes, again similar, but the addStringFactor.pl script extracts and concatenates the words. the *.ptn files in ptn/POSFactor are assumed to be part of speech factors: the tgrep2 call will print the ID followed by a terminal (=print) and the value of the node dominating it (=pos).
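
for string factors, the concatenation step might look roughly like the following. again this is only a sketch under assumptions, not the real addStringFactor.pl: concat_per_id is a hypothetical name, the data is made up, and words are simply joined with single spaces in the order in which they appear.

```shell
#!/bin/sh
# Hypothetical sketch, not the real addStringFactor.pl.
# Input lines look like "<ID> <word> <word> ..."; IDs may repeat.
concat_per_id() {
    awk 'NF > 0 {
             id = $1
             if (!seen[id]++) order[++n] = id      # remember first-seen order
             for (i = 2; i <= NF; i++)
                 text[id] = ((text[id] == "") ? $i : (text[id] " " $i))
         }
         END { for (k = 1; k <= n; k++) print order[k] "\t" text[order[k]] }' "$1"
}

tmp=$(mktemp -d)
printf '12:3 the old\n40:7 it\n12:3 dog\n' > "$tmp/Form.sn"   # made-up data
concat_per_id "$tmp/Form.sn"
rm -rf "$tmp"
```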

results/ is where the final data file will be stored by ExtractData, e.g.

[tiflo@bcs-cycle1 ~/DT]$ ls results/ swbd.tab
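
if you want to set up the same layout from scratch, something like this would do. make_layout is a hypothetical helper, not part of the scripts. it creates the four top-level directories and the ptn subdirectories named in the text above; UFactor appears in the listing but is not explained here, so the sketch leaves it out.

```shell
#!/bin/sh
# Hypothetical helper: create the directory skeleton described above.
# Usage: make_layout ROOT CORPUS
make_layout() {
    root=$1
    corpus=$2
    mkdir -p "$root/data/$corpus" \
             "$root/ptn/ContFactor" "$root/ptn/CountFactor" \
             "$root/ptn/StringFactor" "$root/ptn/POSFactor" \
             "$root/results" "$root/shellscripts"
}

tmp=$(mktemp -d)
make_layout "$tmp/DT" swbd
ls "$tmp/DT"
rm -rf "$tmp"
```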

shellscripts/ contains ExtractData, run, and the MACROS-*.ptn files. different macrofiles can be used for different corpora and run accepts an arbitrary number of *MACROS-*.ptn files for different patterns. e.g. in

ls ../SRC/shellscripts/
bncExtractData bncwExtractData MACROS-VP.ptn Rimport.sh runcluster.sh~
bncMACROS-SBAROPT.ptn ExtractData nohup.out run runclusterswbd.sh
bncMACROS-VP.ptn MACROS-SBAROPT.ptn _old runcluster.sh

you could execute: ./run -c bncw -e SBAROPT VP -join -collect

which would use the bncwMACROS-SBAROPT.ptn and bncwMACROS-VP.ptn files and run the entire set of ptn files in ../ptn/*.ptn with both macrofiles (this is useful when you want one database out of several patterns that cannot be combined into one disjoined pattern). the results will be stored in ../data/bncw-SBAROPT and ../data/bncw-VP. then the -join option will cat the corresponding data files together and store them in ../data/bncw/. -collect will look for a corpus_nameExtractData file and then execute (in this case) bncwExtractData. bncwExtractData calls all the perl scripts, i.e. it constructs the database.
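
what -join does can be pictured with the following sketch. it is not taken from the run script: join_data is a hypothetical function, and it assumes that files with the same name in the per-macro directories belong together and that simple concatenation is all that is needed.

```shell
#!/bin/sh
# Hypothetical sketch of a -join style step, not taken from the run script.
# Usage: join_data DATA_DIR CORPUS EXT MACRO...
# Concatenates DATA_DIR/CORPUS-MACRO/*.EXT into DATA_DIR/CORPUS/, file by file.
join_data() {
    data=$1
    corpus=$2
    ext=$3
    shift 3
    mkdir -p "$data/$corpus"
    rm -f "$data/$corpus"/*."$ext"          # start clean so reruns do not append twice
    for macro in "$@"; do
        for f in "$data/$corpus-$macro"/*."$ext"; do
            [ -e "$f" ] || continue         # skip unmatched globs
            cat "$f" >> "$data/$corpus/$(basename "$f")"
        done
    done
}

tmp=$(mktemp -d)
mkdir -p "$tmp/bncw-SBAROPT" "$tmp/bncw-VP"
printf '1:1\n' > "$tmp/bncw-SBAROPT/ID.sn"   # made-up IDs
printf '2:1\n' > "$tmp/bncw-VP/ID.sn"
join_data "$tmp" bncw sn SBAROPT VP
cat "$tmp/bncw/ID.sn"
rm -rf "$tmp"
```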

perl scripts in ~/scripts/

all the TDT scripts assume that there is a data file called corpus_name.tab in the script directory, where corpus_name is whatever you give them as the corpus name (the -c parameter).

btw, you can check out the most up to date script files in ~tiflo/scripts/. it's now a git archive and git is installed in /p/hlp/bin (where all our binaries are). you can create your own git copy by "git clone ~tiflo/scripts". that'll make updating a bit easier.

ok, the actual perl scripts ... i'll talk about that another time because that's like writing a manual! but see below

what all the options to all the commands mean (-oc vs -c, etc?)

see top of format.pl, which is the central file. it should have short comments on the different options.

-r resets the values in the database
-o overwrites values
-c is the name of the corpus. this is normed. it should accept (or know how to deal with): swbd, pswbd, wsj, brown, bnc, wbnc, sbnc

-f (in the newest version applicable to all basic scripts) gives the factorname, and you can now write things like:

[..]
# $corpus is read in; $Pdata is the directory where the data is stored, e.g. ../data/swbd/

echo Creating new corpus file $corpus.tab
# creates database
perl createDatabase.pl -oc $corpus --files $Pdata/ID

# adds info about speakers and conversation
perl addConversationInfo.pl -oc $corpus

# adds string factor: adds a column called Form to the database and writes content of Form.sn into it
# (.sn is now the default ending)
perl addStringFactor.pl -oc $corpus -f Form=$Pdata/Form

# adds length factor: adds a column called Len_position to the database and writes content of LEN_position.sn into it
perl addLengthFactor.pl -oc $corpus -f Len_position=$Pdata/LEN_position

# fixes a tgrep2 bug, for cases where more than one instance is matched per sentence
perl accumulateFactorValues.pl -oc $corpus -f Len_position

perl addStringFactor.pl -oc $corpus -f TOPstring=$Pdata/TOPstring
perl addStringFactor.pl -oc $corpus -f WORDpreceding=$Pdata/WORDpreceding
perl addStringFactor.pl -oc $corpus -f WORDfollowing=$Pdata/WORDfollowing

# adds categorical factor: factor name is followed by pairs of (value, file_with_IDs_that_should_have_that_value)
perl addCategoricalFactor.pl -oc $corpus -f Dpreceding 1 $Pdata/Dpreceding
perl addCategoricalFactor.pl -oc $corpus -f Dform 1 $Pdata/Dform
perl addCategoricalFactor.pl -oc $corpus -f Dfollowing 1 $Pdata/Dfollowing

am I supposed to run the commands from /p/hlp/TDT/tools, or from somewhere specific in my home directory? (if I do one, it can't find the speaker info database, if I do the other it can't find *my* database)

see above. currently the perl scripts assume a file like swbd.tab wherever the perl scripts are. the newer version does contain a way to execute the scripts from other locations, if an environment variable is set. see top of format.pl.

how do I aggregate data from multiple corpora into a single database with the appropriate corpusID

the easiest way: addCategoricalFactor.pl -c corpus_name corpus_name [sic] writes a column into each data file that is the corpus name. then you can cat them. ExtractData in principle loops through several corpora. so you can say ExtractData swbd wsj brown and it will create (and fill) swbd.tab, wsj.tab, brown.tab. just include addCategoricalFactor.pl -c $corpus $corpus and then cat the three files at the very end of ExtractData.
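
as a rough illustration of the "add a corpus column, then cat" idea, here is a shell sketch that does both steps in one go. it is not how ExtractData does it: merge_corpora is a hypothetical function, the column name Corpus is made up, and the sketch assumes that each .tab file is tab-separated with a header row on its first line (so the header is kept only once).

```shell
#!/bin/sh
# Hypothetical sketch, not part of ExtractData.
# Usage: merge_corpora OUT FILE.tab...
# Assumes tab-separated files whose first line is a header row.
merge_corpora() {
    out=$1
    shift
    first=1
    : > "$out"
    for tab in "$@"; do
        corpus=$(basename "$tab" .tab)      # e.g. swbd.tab -> swbd
        awk -v corpus="$corpus" -v first="$first" '
            BEGIN { FS = OFS = "\t" }
            FNR == 1 { if (first) print $0, "Corpus"; next }   # header only once
            { print $0, corpus }' "$tab" >> "$out"
        first=0
    done
}

tmp=$(mktemp -d)
printf 'ID\tForm\n1:1\tthe\n' > "$tmp/swbd.tab"   # made-up data
printf 'ID\tForm\n9:2\ta\n' > "$tmp/wsj.tab"
merge_corpora "$tmp/all.tab" "$tmp/swbd.tab" "$tmp/wsj.tab"
cat "$tmp/all.tab"
rm -rf "$tmp"
```

if the real .tab files have no header row, the FNR == 1 rule would have to go, otherwise the first data row of every corpus is lost.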

- Revision 3 as of 2009-04-21 01:25:36
  Size: 10116
  Editor: cpe-67-240-134-21
  Comment:
+ Revision 4 as of 2009-04-21 01:30:03
  Size: 10697
  Editor: cpe-67-240-134-21
  Comment:
 Line 18:
- * '''-f''' ''factornames'' or '''-factors''' ''factornames'', where ''factornames'' is one or more names of factors (=columns) in the database that you wish to create, import, or manipulate. If there is more than one one factor, their names have to be separated by commas and no spaces should intervene, as in ''factorname1,factorname2,...,factornameN''. Most scripts allow several factor names as input and the will loop over all factors. In case the script expects an input file for each factor, these can either be provided separately (see --files option) or in conjunction with the factor specification by using ''factorname1=file1,factorname2=file2,...,factornameN=fileN''.
+ * '''-f''' ''factornames'' or '''--factors''' ''factornames'', where ''factornames'' is one or more names of factors (=columns) in the database that you wish to create, import, or manipulate. If there is more than one one factor, their names have to be separated by commas and no spaces should intervene, as in ''factorname1,factorname2,...,factornameN''. Most scripts allow several factor names as input and the will loop over all factors. In case the script expects an input file (e.g. a TGrep2 output file) for each factor, these can either be provided separately (see --files option) or in conjunction with the factor specification by using ''factorname1=file1,factorname2=file2,...,factornameN=fileN''.
 * '''-o''' or '''--overwrite''' specifies that cells that already have a value should be overwritten if a new value is obtained by the operations conducted by the script (e.g. by importing information from a TGrep2 output file).
  * Default is ''no''.
 * '''-r''' or '''--reset''' specifies that all cell values of the specified factors (=columns) should be reset to the default value (usually an empty cell). 
  * Default is ''no''.
 * '''-w''' or '''--warnings''' specifies whether detailed warnings should be printed.   
  * Default is ''no''.