Session 1: Linear and logistic regression
This session will cover the basics of linear and logistic regression. See below for a [#Topics list of topics]. Please make sure to do the readings and let me know in advance if there is terminology you would like clarified. The goal of this session is to go through the basic steps of building a linear or logistic regression model, understanding its output, and assessing how good the model is.
I've also posted some [#assignments assignments] below. There is only one way to learn how to use the methods we will talk about, and that is to apply them yourself to a data set that you understand. The tutorial is intended to get you to the level where you can do that. Make sure to do the assigned readings and, if you have time, familiarize yourself with the scripts attached below; they use data sets from the R package languageR.
Materials
- attachment:lexdecRT.R (a simple linear regression example)
- attachment:walpiri.R (a simple logistic regression example)
- attachment:case-influence.ppt
- attachment:BaayenETAL06.pdf
Reading
Baa08 | Section 4.3.2 (pp. 91-105)       | Functional relations: linear regression
      | Sections 6 - 6.2.4 (pp. 181-212) | Regression Modeling (Introduction and Ordinary Least Squares Regression)
      |                                  | Collinearity, Model criticism, and Validation
      | Section 6.3 (pp. 214-234)        | Generalized Linear Models
      | Section 6.6 (pp. 258-259)        | General considerations
Optional reading:
G&H07 | Chapter 3 (pp. 29-49)  | Linear regression: the basics
      | Chapter 4 (pp. 53-74)  | Linear regression: before and after fitting the model
      | Chapter 5 (pp. 79-105) | Logistic regression
Q&A
- Q: On page 181, Baayen refers to one of the models, fit via lm(), as a covariance model. Why is this considered a model of covariance rather than a regression?
- A: This refers to the fact that lm(), like any regression model, allows the inclusion of continuous predictors (unlike ANOVA, aov() in R, but like ANCOVA, analysis of covariance). The idea is that, in a linear model, where the outcome is a continuous variable, a continuous predictor co-varies (to whatever extent) with the continuous outcome.
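A minimal sketch of the same point, assuming the lexdec data set from languageR and its columns RT, Frequency, and NativeLanguage (treat these names as assumptions): one and the same lm() call takes a categorical predictor, a continuous covariate, or both, which is exactly the ANCOVA-style situation Baayen has in mind.
library(languageR)                                        # assumed source of the lexdec data set
data(lexdec)
m <- lm(RT ~ NativeLanguage + Frequency, data = lexdec)   # factor plus continuous covariate
summary(m)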
- Q: When doing a regression with a categorical variable (as Baayen does on page 182), is there an easy way to see which level is coded as 1 and which as 0?
- A: By default, R orders the levels of a factor alphanumerically and treats the first level as the reference (0); the modeled outcome (1) is the level that comes second in alphanumeric order.
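If you want to check (or change) the coding yourself, base R has levels(), contrasts(), and relevel(); a short sketch, again assuming the lexdec column NativeLanguage and its level name "Other":
levels(lexdec$NativeLanguage)      # first level listed = reference level (coded 0)
contrasts(lexdec$NativeLanguage)   # shows the dummy coding explicitly
lexdec$NativeLanguage <- relevel(lexdec$NativeLanguage, ref = "Other")   # change the reference level
For a binary dependent variable in glm(..., family = binomial), the level modeled as the "success" (1) is likewise the second factor level.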
- Q: Determining the significance of a coefficient: one-tailed or two-tailed t-test?
- A: It's a two-tailed test, because we cannot assume a priori which direction the coefficient will go. If one had a really strong theoretical reason to assume one direction, one could do a one-tailed test (which is less conservative).
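If you ever did have a strong directional prediction, the one-tailed p-value can be read off the usual lm() output by halving the two-tailed one, provided the estimate goes in the predicted direction; a sketch, reusing the model m fit above:
coefs <- summary(m)$coefficients
p.two <- coefs["Frequency", "Pr(>|t|)"]   # two-tailed p reported by lm()
p.one <- p.two / 2                        # one-tailed, only if the sign matches the prediction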
- Q: What is really happening behind the scenes when we call the data() command?
- A: Even after reading ?data, it's not totally clear what's happening. It turns out that most packages distributed with R store their data in a database (consisting of three files: foo.rdb, foo.rds, foo.rdx). The internal mechanism used to deal with such a database is the function lazyLoad(). Calling lazyLoad() on a database puts all of the objects in the database into your workspace, but the objects aren't actually filled with their values until you use them. This means that you can load all of the data sets from a large library like languageR without using up all of your memory. The underlying action taken by a call to data() is something like
lazyLoad(file.path(system.file(package="languageR"), "data", "Rdata"))
- Q: Why isn't there a function to center variables? scale(x, scale=FALSE) is a pain to remember.
- A: There should be! Let's define our own:
center <- function (x) { as.numeric(scale(x, scale=FALSE)) }
- Now, whenever we call the function center(), it will subtract the mean of a vector of numbers from each element of the vector. If you want to have access to this function every time you use R, without defining it anew each time, you can put it in a file called ".Rprofile" in your home directory.
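A usage sketch (lexdec column names assumed, as above): with center() defined, the intercept of the model is the predicted RT at the mean Frequency rather than at Frequency == 0.
m.c <- lm(RT ~ center(Frequency), data = lexdec)   # centered continuous predictor
summary(m.c)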
- Q: I don't totally buy the argument in G&H about scaling the predictors before doing the regression. Why is this a good idea? Should I really divide by two times the standard deviation?
- A: You can read more detailed arguments from Gelman in [http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf this paper] and [http://www.stat.columbia.edu/~cook/movabletype/archives/2008/04/another_salvo_i.html this blog post] (read the comments, too!).
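If you want to try Gelman's rescaling, here is a hypothetical helper in the same spirit as center() above (not a standard R function):
rescale2sd <- function (x) { (x - mean(x)) / (2 * sd(x)) }   # center, then divide by 2 standard deviations
m.s <- lm(RT ~ rescale2sd(Frequency), data = lexdec)
The coefficient then gives the predicted change in RT when Frequency moves from one SD below its mean to one SD above it, which makes it roughly comparable to a coefficient on a binary predictor.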
Assignments
Baa08 | Section 4.7 (p. 126) | Exercises 3 and 7*
      | Section 6.7 (p. 260) | Exercises 1 and 8
* For Exercise 7, note that Baayen treats linear regression using lm() or ols() as equivalent to analysis of covariance (see Section 4.4.1, pp. 117-119).
Possible topics
(R functions are given as pointers; a short worked sketch follows the list.)
- The Linear Model (LM)
  - Building a linear model (for data analysis)
  - Interpreting the output of a linear model
  - Using a model to predict unseen data: predict()
- Understanding the influence of individual cases, identifying outliers
  - see also the slides on case influence (case-influence.ppt) attached under [#Materials Materials] above
  - detecting outliers: boxplot(); scatterplots with plot() and identify()
  - dealing with outliers: exclusion with subset(); robust regression (based on the t-distribution) with tlm() (in package hatt)
  - overly influential cases (can be, but don't have to be, outliers): lm.influence(), also library(Rcmdr)
- Collinearity
  - tests: vif(), kappa(), the summary of correlations between fixed effects in lmer()
  - countermeasures: centering and/or standardizing with scale(); using residuals, e.g. resid(lm(x1 ~ x2, data)); principal component analysis (PCA) with princomp()
- Model evaluation: Where is the model off?
  - case by case: residuals(), predict()
  - by predictor: plotting residuals() against predictors; calibrate()
  - overall: validate()
- Corrections
  - correcting for clusters (violation of the independence assumption): bootcov()
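The worked sketch mentioned above, touching several of these topics with the lexdec data from languageR (column names are assumptions; vif(), calibrate(), validate(), and bootcov() come from the Design/rms package that Baayen uses and are left out here):
library(languageR)
data(lexdec)
m <- lm(RT ~ Frequency + FamilySize, data = lexdec)   # building the model
summary(m)                                            # interpreting the output
predict(m, newdata = lexdec[1:5, ])                   # predicting (here just for the first rows)
plot(fitted(m), resid(m))                             # model criticism: residuals vs. fitted values
which(abs(scale(resid(m))) > 2.5)                     # crude outlier screen
sort(lm.influence(m)$hat, decreasing = TRUE)[1:5]     # highest-leverage (potentially influential) cases
kappa(model.matrix(m))                                # condition number as a collinearity check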