## page was renamed from HlpLab/StatsMiniCourse/Session1
#acl HlpLabGroup,TanenhausLabGroup:read,write,delete,revert,admin All:read
#format wiki
#language en
#pragma section-numbers 4

== Session 1: Linear and logistic regression ==

This session will cover the basics of linear and logistic regression. See below for a [#Topics list of topics]. Please make sure to do the readings and let me know in advance if there is terminology that you would like to have clarified.

The goal of this session is to go through the basic steps of building a linear or logistic regression model, understanding the output, and validating how good the model is. I've also posted some [#assignments assignments] below. There is only one way to learn how to use the methods we will talk about, and that is to apply them yourself to a data set that you understand. The tutorial is intended to get you to the level where you can do that.

Make sure to do the assigned readings and, if you have time, familiarize yourself with the scripts attached below. They use data sets from the R package ''languageR''.

[[Anchor(Materials)]]
=== Materials ===
 * attachment:lexdecRT.R (a simple linear regression example)
 * attachment:walpiri.R (a simple logistic regression example)
 * attachment:case-influence.ppt
 * attachment:BaayenETAL06.pdf (contains more information about the data set ''english'', which is used for the linear regression example; it's the Balota et al. database of lexical decision time and word naming norms)
 * [http://www.public.iastate.edu/~dnett/S401/wreganova.pdf nice 2-page handout on the relation between ANOVA and linear models]

=== Reading ===
|| Baa08 || Section 4.3.2 (pp. 91-105) || Functional relations: linear regression ||
|| || Sections 6 - 6.2.4 (pp. 181-212) || Regression Modeling (Introduction and Ordinary Least Squares Regression) ||
|| || || Collinearity, Model criticism, and Validation ||
|| || Section 6.3 (pp. 214-234) || Generalized Linear Models ||
|| || Section 6.6 (pp. 258-259) || General considerations ||

Optional reading:
|| G&H07 || Chapter 3 (pp. 29-49) || Linear regression: the basics ||
|| || Chapter 4 (pp. 53-74) || Linear regression: before and after fitting the model ||
|| || Chapter 5 (pp. 79-105) || Logistic regression ||

=== Q&A ===
 * Q: On page 181, Baayen refers to one of the models, fit via {{{lm()}}}, as a covariance model. Why is this considered a model of covariance rather than a regression?
 * A: This refers to the fact that {{{lm()}}}, or any regression model, allows the inclusion of continuous predictors (unlike ANOVA, {{{aov()}}} in R, but like ANCOVA, the analysis of covariance). The idea is that, in a linear model, where the outcome is a continuous variable, a continuous predictor co-varies (to whatever extent) with the continuous outcome.
 * Q: When doing a regression with a categorical variable (as Baayen does on page 182), is there an easy way to see which level is coded as 1 and which as 0?
 * A: By default, R orders the levels of a factor alphanumerically: the first level is the reference level (coded 0) and the second level is coded 1. (Likewise, for the outcome of a logistic regression, it is the second level whose probability is modeled.)
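 A minimal sketch of how to check (and change) this coding yourself; the little data frame here is made up purely for illustration:
{{{
## made-up example data: a two-level factor predictor
d <- data.frame(RT = rnorm(10),
                Cond = factor(rep(c("noun", "verb"), 5)))

levels(d$Cond)     # "noun" "verb": the first level, "noun", is the reference (0)
contrasts(d$Cond)  # spells out the 0/1 dummy coding

## relevel() changes which level serves as the reference
d$Cond <- relevel(d$Cond, ref="verb")
}}}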
 * Q: Determining the significance of a coefficient: one-tailed or two-tailed t-test?
 * A: It's a two-tailed test, because we cannot assume a priori in which direction the coefficient will go. If one had a really strong theoretical reason to assume one direction, one could arguably run a one-tailed test (which is less conservative).
 * Q: What is really happening behind the scenes when we call the {{{data()}}} command?
 * A: Even after reading {{{?data}}}, it's not totally clear what's happening. It turns out that most packages distributed with R use a database to store their data (this consists of three files: foo.rdb, foo.rds, foo.rdx). The internal mechanism used to deal with a database like this is the function {{{lazyLoad()}}}. Calling {{{lazyLoad()}}} on a database has the effect of putting all of the objects in the database into your workspace, but the objects aren't actually filled with all of their values until you use them. This means that you can load all of the data sets from a large library like {{{languageR}}} without using up all of your memory. The underlying action taken by a call to {{{data()}}} is something like
{{{
lazyLoad(file.path(system.file(package="languageR"), "data", "Rdata"))
}}}
 * Q: Why isn't there a function to center variables? {{{scale(x, scale=FALSE)}}} is a pain to remember.
 * A: There should be! Let's define our own:
{{{
## subtract the mean of a vector from each of its elements
center <- function (x) as.numeric(scale(x, scale=FALSE))
}}}
 Now, whenever we call the function {{{center()}}}, it will subtract the mean of a vector of numbers from each element of the vector. If you want to have access to this function every time you use R, without defining it anew each time, you can put it in a file called ".Rprofile" in your home directory.
 * Q: I don't totally buy the argument in G&H about scaling the predictors before doing the regression. Why is this a good idea? Should I really divide by two times the standard deviation?
 * A: You can read more detailed arguments from Gelman in [http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf this paper] and [http://www.stat.columbia.edu/~cook/movabletype/archives/2008/04/another_salvo_i.html this blog post] (read the comments, too!).
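 For reference, a minimal sketch of the transformation under discussion (the function name {{{rescale2sd}}} and its exact form are ours, following the description in G&H):
{{{
## G&H's suggestion: center, then divide by two standard deviations,
## so that binary and continuous predictors end up on roughly
## comparable scales
rescale2sd <- function (x) {
  (x - mean(x, na.rm=TRUE)) / (2 * sd(x, na.rm=TRUE))
}
}}}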
[[Anchor(assignments)]]
=== Assignments ===
|| Baa08 || Section 4.7 (p. 126) || Exercises 3 and 7* ||
|| || Section 6.7 (p. 260) || Exercises 1 and 8 ||

[[Anchor(Topics)]]
== Possible topics ==
 * The Linear Model (LM)[[BR]] Geometric interpretation[[BR]] Ordinary least squares (and how they are "optimal" for the purpose of predicting an outcome Y)
 * Building a linear model (for data analysis); see the first sketch at the bottom of this page[[BR]] {{{lm(), ols()}}}[[BR]] Structure and class of these objects[[BR]] {{{coef(), display(), summary()}}}[[BR]] {{{fitted(), resid()}}}[[BR]] Standard output
 * Building a logistic model (for data analysis); see the second sketch at the bottom of this page[[BR]] {{{glm(..., family="binomial"), lrm()}}}[[BR]] Standard output
 * Interpreting the output of an ordinary regression model[[BR]] What hypotheses are we testing?[[BR]] What are coefficients and how do we read them?[[BR]] {{{anova(), drop1()}}}[[BR]] Coding?[[BR]] {{{contrasts()}}}[[BR]] Transformations and other non-linearities[[BR]] {{{log(), sqrt()}}}[[BR]] {{{rcs(), pol()}}}
 * Using a model to predict unseen data[[BR]] {{{predict()}}}
 * Understanding the influence of individual cases, identifying outliers (see also the [#Materials slides on case influence] attached above)[[BR]] {{{boxplot()}}}[[BR]] {{{lm.influence()}}}
  * detecting outliers: {{{boxplot()}}}, scatterplots {{{plot(), identify()}}}
  * dealing with outliers: exclusion {{{subset()}}}[[BR]] robust regression (based on the t-distribution): {{{tlm()}}} (in package {{{hett}}})
  * overly influential cases (these can be, but don't have to be, outliers): {{{lm.influence()}}}, also {{{library(Rcmdr)}}}
 * Collinearity
  * tests: {{{vif(), kappa()}}}, summary of correlations between fixed effects in {{{lmer()}}}
  * countermeasures:
   * centering and/or standardizing {{{scale()}}}
   * use of residuals {{{resid(lm(x1 ~ x2, data))}}}
   * principal component analysis (PCA) {{{princomp()}}}
 * Model evaluation: where is the model off?
  * case-by-case: {{{residuals(), predict()}}}
  * based on predictors: {{{residuals()}}} against predictors, {{{calibrate()}}}
  * overall: {{{validate()}}}
 * Corrections:
  * correcting for clusters (violation of the assumption of independence): {{{bootcov()}}}
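To make the linear modeling steps above concrete, here is a minimal sketch of the workflow on the ''english'' data set from ''languageR'' (assuming the package is installed; the model is deliberately simple and not meant as a serious analysis):
{{{
## a bare-bones linear regression workflow on the 'english'
## lexical decision data
library(languageR)
data(english)

## log lexical decision RT as a function of written frequency
## and subject age group
english.lm <- lm(RTlexdec ~ WrittenFrequency + AgeSubject, data=english)
summary(english.lm)   # coefficients, t-tests, R-squared

## model criticism: residuals against fitted values should show
## no systematic structure if the model is adequate
plot(fitted(english.lm), resid(english.lm))

## a rough collinearity check: the condition number of the model
## matrix (large values, roughly > 30, suggest trouble)
kappa(model.matrix(english.lm))

## using the model to predict unseen data
predict(english.lm,
        newdata=data.frame(WrittenFrequency=5, AgeSubject="young"))
}}}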
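And a corresponding sketch for logistic regression, with simulated data so that it runs on its own (the walpiri.R script above does the same with real data):
{{{
## simulate a binary outcome whose log odds increase with x
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size=1, prob=plogis(-0.5 + 1.2 * x))
d <- data.frame(x=x, y=y)

## glm() with family="binomial" fits a logistic regression;
## the coefficients are on the log odds scale
d.glm <- glm(y ~ x, data=d, family="binomial")
summary(d.glm)

## type="response" returns predicted probabilities rather than log odds
predict(d.glm, newdata=data.frame(x=c(-1, 0, 1)), type="response")
}}}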