TGrep2 Database Tools (TDT)

1. General options

Old notes from Florian.

What directory structure is expected (all the subdirectories, factor files, etc.)?

I am constantly modifying these things, so you should probably just decide for yourself. The only things that assume the directory structure are the run script and the ExtractData shell script. run calls all the other stuff, including ExtractData, which is a list of calls to Perl scripts; it is ExtractData that constructs the database. run itself contains the commands that call TGrep2 and create the data files that the Perl scripts extract information out of. That is a somewhat redundant step, but it means that you do not have to re-extract all the data when you want to change one little thing in the final database.

Those scripts assume that they are in a directory called shellscripts, which is in a directory containing the following four directories:

data ptn results shellscripts

data/ will contain the data extracted by TGrep2, under data/corpus_name, e.g.

[tiflo@bcs-cycle1 ~/DT]$ ls data/swbd/
Dfollowing.sn      Dform.sn          Dpreceding.sn   Form.sn        ID.sn
LEN_np.sn          LEN_position.sn   NPempty.sn      NPmultipleN.sn
NPother.sn         NPsingleN.sn      NPsingleNother.sn
POSpreceding.sn    TOPstring.sn      WORDfollowing.sn
WORDpreceding.sn   XML.sn

As you can see, the file extension for those data files is .sn. By the way, in ~tiflo/DT you'll find the most up-to-date version, where the run script defines the data file endings and paths at the top of the script ...

ptn is the directory with the TGrep2 patterns:

[tiflo@bcs-cycle1 ~/DT]$ ls ptn
ContFactor        Dfollowing.ptn   Dform.ptn         Dpreceding.ptn
ID.ptn            NPempty.ptn      NPmultipleN.ptn   NPother.ptn
NPsingleN.ptn     NPsingleNother.ptn                 POSFactor
StringFactor      UFactor

The *.ptn files directly in ptn/ are assumed to be categorical factors. Depending on the assumed factor type, the run script makes a slightly different tgrep2 call (check it out). All calls have in common that they create output that starts with a TGrep2 ID on each line, but they differ in what follows it (if anything). For categorical factors, nothing follows the ID: each file is just a list of all TGrep2 IDs that match a certain pattern.
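For concreteness, a categorical factor data file contains nothing but one TGrep2 ID per line; the ID values in this excerpt are invented for illustration:

$ head -3 data/swbd/NPsingleN.sn
2:4
15:7
23:12

The IDs listed in the file are exactly the ones that matched the pattern in NPsingleN.ptn.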

The *.ptn files in ptn/ContFactor are assumed to be length factors: each ID is followed by terminals (could be one, could be many) that correspond to the node marked =print in the .ptn file. The addLengthFactor.pl script later sums up everything that counts as a word. An ID may occur several times (on several rows), and addLengthFactor.pl will accumulate counts over all of them.

The *.ptn files in ptn/CountFactor are assumed to be count factors: similar to length factors, but this time we accumulate not words but matches (rows). That is, for each occurrence of an ID in the .sn file, the addCountFactor.pl script increments the value of the variable by one. So cont and count factors are treated the same by run (it's just for organizational purposes that there are two directories).
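To make the difference concrete, suppose a data file contained this (IDs and words invented for illustration):

$ cat data/swbd/LEN_position.sn
2:4 the old house
2:4 on the corner

addLengthFactor.pl would assign ID 2:4 the value 6, accumulating the words over both rows, whereas addCountFactor.pl applied to the same file would assign 2, one per row/match.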

The *.ptn files in ptn/StringFactor are, yes, again similar, but the addStringFactor.pl script extracts and concatenates the words. The *.ptn files in ptn/POSFactor are assumed to be part-of-speech factors: the tgrep2 call will print the ID followed by a terminal (=print) and the value of the node dominating it (=pos).
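Again with invented contents: a string-factor file looks like a length-factor file (ID followed by words), except that addStringFactor.pl concatenates the words instead of counting them, and a POS-factor file has the dominating node's label after the terminal:

$ head -1 data/swbd/WORDpreceding.sn
2:4 the old house

$ head -1 data/swbd/POSpreceding.sn
2:4 house NN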

results/ is where the final data file will be stored by ExtractData, e.g.

[tiflo@bcs-cycle1 ~/DT]$ ls results/
swbd.tab

shellscripts/ contains ExtractData, run, and the MACRO.ptn files. Different macro files can be used for different corpora, and run accepts an arbitrary number of *MACRO-*.ptn files for different patterns. E.g., in

ls ../SRC/shellscripts/
bncExtractData          bncMACROS-SBAROPT.ptn   bncMACROS-VP.ptn
bncwExtractData         ExtractData             MACROS-SBAROPT.ptn
MACROS-VP.ptn           nohup.out               _old
Rimport.sh              run                     runcluster.sh

you could execute:

./run -c bncw -e SBAROPT VP -join -collect

which would use the bncwMACROS-SBAROPT.ptn and bncwMACROS-VP.ptn files and run the entire set of ptn files in ../ptn/*.ptn with both macro files (this is useful when you want one database built out of several patterns that cannot be combined into one disjunctive pattern). The results will be stored in ../data/bncw-SBAROPT and ../data/bncw-VP. Then the -join option will cat the corresponding data files together and store them in ../data/bncw/. -collect will look for a corpus_nameExtractData file and then execute (in this case) bncwExtractData. bncwExtractData calls all the Perl scripts, i.e. it constructs the database.
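In effect, -join does something along these lines (a sketch, not the literal code in run; the directory names come from the example above):

for f in ../data/bncw-SBAROPT/*.sn; do
    b=$(basename "$f")
    # concatenate the per-macrofile data files into one file per factor
    cat ../data/bncw-SBAROPT/"$b" ../data/bncw-VP/"$b" > ../data/bncw/"$b"
done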

Perl scripts in ~/scripts/


All the TDT scripts assume that there is a data file called corpus.tab (for whatever you give them as the corpus name via the -c parameter) in the script directory.

By the way, you can check out the most up-to-date script files in ~tiflo/scripts/. It's now a git repository, and git is installed in /p/hlp/bin (where all our binaries are). You can create your own copy with "git clone ~tiflo/scripts". That will make updating a bit easier.
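For example (standard git commands, using the paths mentioned above):

export PATH=/p/hlp/bin:$PATH   # make git available
git clone ~tiflo/scripts       # create your own working copy
cd scripts
git pull                       # later, to pick up updates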

OK, the actual Perl scripts ... I'll talk about those another time, because that's like writing a manual! But see below.

What do all the options to all the commands mean (-oc vs. -c, etc.)?

See the top of format.pl, which is the central file. It should have short comments on the different options.

-r resets the values in the database.
-o overwrites values.
-c gives the name of the corpus. This is normed: it should accept (or know how to deal with) swbd, pswbd, wsj, brown, bnc, wbnc, sbnc.

-f (in the newest version applicable to all basic scripts) gives the factor name, and you can now write things like:

[..] # $corpus is read in; $Pdata is the directory where the data is stored, e.g. ../data/swbd/

# creates the database
$Pdata/ID

# adds info about speakers and conversations
perl addConversationInfo.pl -oc $corpus

# adds a string factor: adds a column called Form to the database and writes the content of Form.sn into it
# (.sn is now the default ending)
perl addStringFactor.pl -oc $corpus -f Form=$Pdata/Form

# adds a length factor: adds a column called Len_position and writes the content of LEN_position.sn into it
perl addLengthFactor.pl -oc $corpus -f Len_position=$Pdata/LEN_position

# fixes a tgrep2 bug, for cases where more than one instance is matched per sentence
perl accumulateFactorValues.pl -oc $corpus -f Len_position

perl addStringFactor.pl -oc $corpus -f TOPstring=$Pdata/TOPstring
perl addStringFactor.pl -oc $corpus -f WORDpreceding=$Pdata/WORDpreceding
perl addStringFactor.pl -oc $corpus -f WORDfollowing=$Pdata/WORDfollowing

# adds a categorical factor: the factor name is followed by pairs of (value, file_with_IDs_that_should_have_that_value)
perl addCategoricalFactor.pl -oc $corpus -f Dpreceding 1 $Pdata/Dpreceding
perl addCategoricalFactor.pl -oc $corpus -f Dform 1 $Pdata/Dform
perl addCategoricalFactor.pl -oc $corpus -f Dfollowing 1 $Pdata/Dfollowing
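These calls build up a tab-separated database with one row per TGrep2 ID and one column per factor. Purely for illustration (the column order and the values here are invented):

$ head -2 results/swbd.tab
ID     Form           Len_position  Dpreceding
2:4    the old house  6             1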

addPhonology script

This script adds the number of syllables in a string. It can be run as follows:

perl addPhonology -c corpus -f field

The file containing your data should be named corpus.tab, and "field" is the name of the column of your data for which you want to count syllables.
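For instance, to count the syllables of the strings in the Form column of swbd.tab (assuming you run it from the script directory, as with the other scripts):

perl addPhonology -c swbd -f Form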

Am I supposed to run the commands from /p/hlp/TDT/tools, or from somewhere specific in my home directory? (If I do one, it can't find the speaker info database; if I do the other, it can't find *my* database.)

See above. Currently the Perl scripts assume a file like swbd.tab wherever the scripts themselves are. The newer version does contain a way to execute the scripts from other locations, if an environment variable is set; see the top of format.pl.
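Something along these lines should then work; note that the variable name below is a placeholder, since the actual name is documented at the top of format.pl:

export TDT_DIR=~/mydatabases   # hypothetical variable name; check format.pl for the real one
perl addStringFactor.pl -oc swbd -f Form=../data/swbd/Form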

How do I aggregate data from multiple corpora into a single database with the appropriate corpus ID?

The easiest way: addCategoricalFactor.pl -c corpus_name corpus_name [sic] writes a column into each data file that holds the corpus name; then you can cat the files together. ExtractData in principle loops through several corpora, so you can say "ExtractData swbd wsj brown" and it will create (and fill) swbd.tab, wsj.tab, and brown.tab. Just include "addCategoricalFactor.pl -c $corpus $corpus" and then cat the three files at the very end of ExtractData.
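Put together, the relevant pieces of ExtractData might look like this (a sketch; all.tab is a made-up name for the combined file):

# inside the per-corpus part of ExtractData:
perl addCategoricalFactor.pl -c $corpus $corpus   # writes the corpus name into a column

# at the very end, after all corpora are done:
cat swbd.tab wsj.tab brown.tab > all.tab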
