Session 1: Linear regression
May 29 2008
This session will cover the basics of linear regression. See below for a [#Topics list of topics]. Please make sure to do the readings, and post below any terminology you'd like clarified or other questions you have. You can also suggest further topics, but keep in mind that Session2 also covers aspects of linear regression modeling, specifically typical issues that come up during modeling. The goal of this first session is to go through the basic steps of building a linear regression model and understanding its output. Session 2 is on validating how good the model is.
We've also posted some [#assignments assignments] below that you should hand in by Friday, so that we can post them on this wiki page. There is only one way to learn how to use the methods we will talk about, and that is to apply them yourself to a data set that you understand. The tutorial is intended to get you to the level where you can do that.
Materials
- attachment:attention-r-data.csv
- attachment:attention-procedure.ppt
- attachment:attention-r-commands.R
- attachment:case-influence.ppt
- For the kidiq data from G&H:
- attachment:kidiq.dta
- attachment:contrast-coding.R
Reading
- G&H07, Chapter 3 (pp. 29-49): Linear regression: the basics
- Baa08, Section 4.3.2 (pp. 91-105): Functional relations: linear regression
- Baa08, Sections 6-6.2.1 (pp. 181-198): Regression Modeling (Introduction and Ordinary Least Squares Regression)
- Baa08, Section 6.6 (pp. 258-259): General considerations
Notes on the readings
If you'd like to follow along, the dataset used in the G&H07 reading can be found here: [http://www.stat.columbia.edu/~gelman/arm/examples/child.iq/]. To use the file, you will need to load the "foreign" package, then use the read.dta() function. E.g.:
library("foreign")
kidiq <- read.dta(file="kidiq.dta")
Additional terminology
Feel free to add terms you want clarified in class:
Questions
- Q: On page 181, Baayen refers to one of the models, done via lm(), as a covariance model. Why is this considered a model of covariance rather than a regression?
- A: This refers to the fact that lm(), or any regression model, allows the inclusion of continuous predictors (unlike ANOVA, aov() in R, but like ANCOVA, analysis of covariance). The idea is that, for a linear model, where the outcome is a continuous variable, a continuous predictor co-varies (to whatever extent) with the continuous outcome.
- Q: When doing a regression with a categorical variable (as Baayen does on page 182), is there an easy way to see which level is coded as 1 and which as 0?
- A: By default, R uses treatment (dummy) coding for factors: the level that comes first in alphanumeric order is the reference level (coded 0), and each remaining level gets its own dummy variable (coded 1). You can check the coding with levels() and contrasts().
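A quick way to see the coding R assigns is to inspect the factor directly; this is a minimal R sketch using a made-up two-level factor (the name "condition" and its levels are illustrative, not from the tutorial data):

```
# Illustrative factor; levels are sorted alphanumerically: "high", then "low"
condition <- factor(c("low", "high", "low", "high"))

levels(condition)     # first level listed is the reference level (coded 0)
contrasts(condition)  # dummy coding: "high" is the baseline, "low" is coded 1
```

Here "high" precedes "low" alphabetically, so "high" is the reference level and the model's coefficient reports the effect of "low" relative to it.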
Assignments
Please upload your solutions by Friday 3:30pm.
- G&H07, Section 3.9 (pp. 50-51): Exercises 3 and 5
- Baa08, Section 4.7 (p. 126): Exercises 3 and 7*
* For Exercise 7, note that Baayen treats linear regression using lm() or ols() as the same as analysis of covariance (see Section 4.4.1, pp. 117-119).
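An analysis of covariance in this sense is simply a linear model with one categorical and one continuous predictor; a minimal R sketch with made-up variables (none of these names come from the exercise data):

```
# Made-up data: continuous outcome, one factor and one continuous covariate
set.seed(2)
group <- factor(rep(c("a", "b"), each = 25))
covariate <- rnorm(50)
outcome <- ifelse(group == "b", 1, 0) + 0.5 * covariate + rnorm(50)

# lm() fits this "analysis of covariance" as an ordinary linear model
fit <- lm(outcome ~ group + covariate)
summary(fit)
```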
- attachment:APS-hw1.R
- attachment:BenVanDurme-hw1.R
- attachment:TingQian-hw1.R
Suggested topics
If you have any material that you would like to cover that isn't included in the list below, please make note of it here.
Topics
- Interacting with R and R files
  - Using a command line: command history, continuation lines, stopping execution; defining variables; calling functions
  - Installing packages: install.packages(), update.packages()
  - Using the R workspace: ls(), rm(), setwd(), getwd(), library()
  - Using an R script file
  - Saving R objects: save(), save.image()
- Quick recap: Formulating your research questions; hypothesis testing; a "model"
  - Dependence on assumptions, on the sample, and on the available outcome and input measures
  - Goal: find generalizations that hold beyond the sample; predict an outcome based on a set of predictors
- Understanding your data set: predictors, outcome, available information
  - str(), summary(), names()
- Understanding the distributions of your variables
  - plot(), points(), lines(), barplot()
  - xtabs(), table(), prop.table()
  - hist(), histogram(), densityplot()
- Understanding dependencies between your variables
  - pairs(), cor(), abline(), loess()
- The Linear Model (LM)
  - Geometric interpretation
  - Ordinary least squares (and how it is "optimal" for the purpose of predicting an outcome Y)
- Building a linear model (for data analysis)
  - lm(), ols()
  - Structure and class of these objects: coef(), display(), summary(), fitted(), resid()
  - Standard output
- Interpreting the output of a linear model
  - What hypotheses are we testing? What are coefficients, and how do we read them?
  - anova(), drop1()
  - Coding: contrasts()
  - Transformations and other non-linearities: log(), sqrt(), rcs(), pol()
- Using a model to predict unseen data
  - predict()
- Understanding the influence of individual cases, identifying outliers
  - boxplot(), lm.influence()
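The core modeling steps in the list above (fit, inspect, predict) can be sketched in R; this uses made-up data, and the variable names are illustrative, not from the tutorial datasets:

```
# Illustrative data: a continuous outcome and one continuous predictor
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
d <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = d)  # ordinary least squares fit

summary(fit)      # coefficients, standard errors, R-squared
coef(fit)         # intercept and slope estimates
head(fitted(fit)) # fitted values
head(resid(fit))  # residuals

# Predict the outcome for unseen predictor values
predict(fit, newdata = data.frame(x = c(-1, 0, 1)))
```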