## page was renamed from HlpLab/StatsMiniCourse/Session2
#acl HlpLabGroup,TanenhausLabGroup:read,write,delete,revert,admin All:read
#format wiki
#language en
#pragma section-numbers 4
== Session 2: Issues in linear regression ==
'''June 5, 2008'''

=== Materials ===
 * attachment:lexdecRT.R
 * attachment:BaayenETAL06.pdf

=== Reading ===
|| G&H07 || Chapter 4 (pp. 53-74) || Linear regression: before and after fitting the model ||
|| Baa08 || Sections 6.2.2-6.2.4 (pp. 198-212) || Collinearity, model criticism, and validation ||
|| || Section 6.4 (pp. 234-239) || Regression with breakpoints ||

=== Notes on the readings ===

=== Additional terminology ===
Feel free to add terms you want clarified in class:
 *
 *

=== Questions ===
 * Q: Determining the significance of a coefficient: one-tailed or two-tailed t test?
 * A: It's a two-tailed test because we cannot assume a priori which direction the coefficient will go. I guess if one had a really strong theoretical reason to assume one direction, one could do a one-tailed test (which is less conservative).
 * Q: What is really happening behind the scenes when we call the {{{data()}}} command?
 * A: Even after reading {{{?data}}}, it's not totally clear what's happening. It turns out that most packages distributed with R use a database to store their data, consisting of three files: foo.rdb, foo.rds, foo.rdx. The internal mechanism used to deal with such a database is the function {{{lazyLoad()}}}. Calling {{{lazyLoad()}}} on a database has the effect of putting all of the objects in the database into your workspace, but the objects aren't actually filled with their values until you use them. This means that you can load all of the datasets from a large library like {{{languageR}}} without using up all of your memory. The underlying action taken by a call to {{{data()}}} is something like
{{{
lazyLoad(file.path(system.file(package="languageR"), "data", "Rdata"))
}}}
 * Q: Why isn't there a function to center variables?
{{{scale(x, scale=FALSE)}}} is a pain to remember.
 * A: There should be! Let's define our own:
{{{
center <- function (x) {
  as.numeric(scale(x, scale=FALSE))
}
}}}
 Now, whenever we call the function {{{center()}}}, it will subtract the mean of a vector of numbers from each element of the vector. If you want to have access to this function every time you use R, without defining it anew each time, you can put it in a file called ".Rprofile" in your home directory.
 * Q: I don't totally buy the argument in G&H about scaling the predictors before doing the regression. Why is this a good idea? Should I really divide by 2 times the standard deviation?
 * A: You can read more detailed arguments from Gelman in [http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf this paper] and [http://www.stat.columbia.edu/~cook/movabletype/archives/2008/04/another_salvo_i.html this blog post] (read the comments, too!).

== Suggested topics ==
If you have any material that you would like to cover that isn't included in the list below, please make a note of it here.

[[Anchor(assignments)]]
=== Assignments ===
Upload your solutions to this page by 10pm Monday.
|| G&H07 || Section 4.9 (p. 76) || Exercise 4 ||
|| Baa08 || Section 6.7 (p. 260) || Exercises 1, 8 ||
 * attachment:VanDurmeSession2.R
 * attachment:APS-hw2.R

[[Anchor(Topics)]]
== Topics ==
 * More on outliers
  * detecting outliers: {{{boxplot()}}}, scatterplots {{{plot(), identify()}}}[[BR]]
  * dealing with outliers: exclusion {{{subset()}}}[[BR]] robust regression (based on the t-distribution): {{{tlm()}}} (in package {{{hett}}})[[BR]]
  * overly influential cases (can be, but don't have to be, outliers): {{{lm.influence()}}}, also {{{library(Rcmdr)}}}[[BR]]
 * Collinearity:
  * tests: {{{vif(), kappa()}}}, summary of correlations between fixed effects in {{{lmer()}}}[[BR]]
  * countermeasures:
   * centering and/or standardizing: {{{scale()}}}[[BR]]
   * use of residuals: {{{resid(lm(x1 ~ x2, data))}}}[[BR]]
   * principal component analysis (PCA): {{{princomp()}}}[[BR]]
 * Model evaluation: Where is the model off?
  * case-by-case: {{{residuals(), predict()}}}[[BR]]
  * based on predictors: {{{residuals()}}} against predictors, {{{calibrate()}}}[[BR]]
  * overall: {{{validate()}}}[[BR]]
 * Corrections:
  * correcting for clusters (violation of the assumption of independence): {{{bootcov()}}}[[BR]]
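The collinearity countermeasures listed above (centering and residualization, checked with {{{kappa()}}}) can be sketched in a few lines of base R. This is a minimal illustration on simulated data; the variable names ({{{x1}}}, {{{x2}}}, {{{y}}}) are made up for the example, not taken from the lexdecRT data.
{{{
## Simulate two strongly correlated predictors (hypothetical data).
set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- x1 + rnorm(n, sd = 3)              # x2 is nearly a copy of x1
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

## kappa() returns the condition number of the model matrix;
## large values signal collinearity.
kappa(model.matrix(~ x1 + x2))

## Countermeasure 1: centering. This removes the correlation between
## the predictors and the intercept column of the model matrix, so the
## condition number drops (though x1 and x2 stay correlated with each other).
center <- function(x) as.numeric(scale(x, scale = FALSE))
kappa(model.matrix(~ center(x1) + center(x2)))

## Countermeasure 2: residualization. Regress x2 on x1 and keep the
## residuals -- the part of x2 that is orthogonal to x1.
x2.res <- resid(lm(x2 ~ x1))
cor(x1, x2.res)                          # essentially zero

## The residualized predictor can then be entered alongside x1.
summary(lm(y ~ x1 + x2.res))
}}}
Note that after residualization the coefficient on {{{x1}}} absorbs all of the variance shared between {{{x1}}} and {{{x2}}}, so its interpretation changes; only the coefficient on {{{x2.res}}} reflects the unique contribution of {{{x2}}}.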