The week before the class -- please prepare

A week before the class you should check that you have the right R version and all necessary packages (see main page). AndrewWatts wrote some very useful [http://linginst07.stanford.edu/florianR/software/ instructions for installing R] last summer. You should still be able to follow those steps, but make sure you download and install version 2.7.1.

Please also make sure that you refresh your memory with regard to standard R commands (see topics below). I will assume basic knowledge of probability theory and that you generally know what a linear model is (the basic ideas behind regression). You may want to do the readings and play around with the following R file:

attachment:session_0.R
attachment:sleepstudy.csv
attachment:cake.tab

Reading

Understanding of this material will be assumed throughout the course. Please read these introductory materials and make sure you understand them before beginning the readings for the first session. All readings are meant as a refresher -- just browse through them.

Baa08	Chapter 1 (pp. 1-20)	Intro to R
Wiki	intros to [http://en.wikipedia.org/wiki/Probability_theory probability theory] and [http://en.wikipedia.org/wiki/Probability_distribution distributions] ([http://en.wikipedia.org/wiki/Bernoulli_distribution Bernoulli] and [http://en.wikipedia.org/wiki/Normal_distribution normal] distribution)

For absolute beginners, this is also very useful:

Dal04	Chapter 1.1 - 1.2	Basics of R
	Chapter 1.5 - 1.5

Additionally, feel free to download and print out this [attachment:R-Refcard.pdf reference card] for R. While it's a few years old, the basics it covers have not changed.

Notes on the readings

Absolute novices to R should start by reading at least the parts of Dalgaard's Ch1 indicated above (but really Baayen is pretty easy even without that). Everyone should read Baayen's R intro chapter. Then read the top part of the wiki entries given above (they get pretty technical after some time, so just read the intro; for a nice condensed intro to probability theory, I recommend [http://nlp.stanford.edu/fsnlp/ Manning and Schuetze 1999:Ch1.2]). Then do a quick walk through Dalgaard's Ch2 on probability distributions in R to play around with some distributions and to get familiar with R. Plan for 1-2 hours of reading and typing (in R) just for the Baayen and Dalgaard chapters.

Before the first class session you should also make sure to read at least the assigned readings for that session (better still, try to read ahead for session 2, since it's a lot of reading for one day).

Additional terminology

Here are some questions that came up in previous class sessions when I talked about this material. Maybe some of it is useful. Most of the answers were provided by Austin Frank.

file.choose (where do you put this & what exactly happens?)
- file.choose() uses the graphical interface to choose a file. Try read.csv(file=file.choose())
ls() v objects()
- They are the same thing. Try ?ls and notice that both functions have the same arguments with the same defaults. Also, ?objects takes you to the same help page.
tapply() v aggregate()
- As long as the data that you want to do computations on is in a vector, tapply() can probably be made to do what you want to do. One difference is that aggregate() is a generic function which can be specialized to deal with different data structures. This means that there could be different versions of aggregate() for lme4 models and Design models. In practice, this really just means that whatever type of data structure you apply aggregate() to, you'll probably get the same data structure back. With tapply(), you get an array back: a plain array of values when the function returns a single value per group, and an array of lists otherwise. NB: in the latter case, use unlist() on the results of tapply() to get them back into a useful form.
- I don't use either very much. Instead, I use summarize() from the Hmisc package, which is loaded along with Design. The syntax is similar to aggregate(), but it has one crucial ability that aggregate() lacks. When you want to apply a user-made function to each subset of a data set, aggregate() (in the R version used for this course) can only return a single scalar result for each subset. summarize() is smart enough to handle functions that return a list of results, putting each element of the list into its own column in the results. tapply() can return lists as well, but generally only works on vectors, not on data frames.

> with(sleepstudy, aggregate(Reaction, by=list(Subject), 
       FUN = function (x) {c(is.character(x), mean(x))}))
Error in aggregate.data.frame(as.data.frame(x), ...) : 
  'FUN' must always return a scalar

> with(sleepstudy, summarize(Reaction, by=llist(Subject), 
       FUN= function (x) { llist(mean (x), is.character(x)) }))
   Subject Reaction is.character.x.
1      308 342.1338               0
2      309 215.2330               0
3      310 231.0013               0
4      330 303.2214               0
5      331 309.4361               0
6      332 307.3021               0
7      333 316.1583               0
8      334 295.3021               0
9      335 250.0700               0
10     337 375.7210               0
11     349 275.8345               0
12     350 313.6027               0
13     351 290.0978               0
14     352 337.4215               0
15     369 306.0346               0
16     370 291.7018               0
17     371 294.9840               0
18     372 317.8861               0

Also try with(sleepstudy, tapply(X=Reaction, INDEX=c(Subject), FUN = function (x) { list( mean(x), is.character(x)) } )) to see the same computation carried out using tapply().
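
For readers without lme4 or Design installed, here is a minimal sketch of the same comparison on a small made-up data frame. The names d, subj, and rt are hypothetical and not part of the course files; only base R functions are used.

```r
# Hypothetical toy data: two reaction times for each of three subjects
d <- data.frame(subj = factor(c("s1", "s1", "s2", "s2", "s3", "s3")),
                rt   = c(250, 270, 300, 320, 400, 420))

# tapply() with a function that returns one value per group:
# the result is a named array with one mean per subject
by_subj <- with(d, tapply(rt, subj, mean))
by_subj

# tapply() with a function that returns several values per group:
# the result is an array of mode list, one list per subject
ranges <- with(d, tapply(rt, subj, function(x) list(min(x), max(x))))
is.list(ranges)
unlist(ranges[["s1"]])

# aggregate() with a scalar-valued function returns a data frame
agg <- aggregate(d$rt, by = list(subj = d$subj), FUN = mean)
agg
```

The group means computed by tapply() and aggregate() are the same; only the container they come back in differs.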

Questions

Q: A clarification of the stance on multiple comparisons in G&H (p. 22). Why do they think it's nothing to worry about?
A: [http://www.stat.columbia.edu/~cook/movabletype/archives/2008/03/why_i_dont_usua_1.html Here] is Gelman's elaboration on the matter. The paper is readable and not too long.

R Primer

Topics you should be vaguely familiar with

Interacting with R and R files

Using a command line
- command history, continuation lines, stopping execution
- defining variables
- calling functions
Installing packages
- install.packages(), update.packages()
Using the R workspace
- ls(), rm(), setwd(), getwd(), library()
Using an R script file
Saving R objects
- save(), save.image()
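
As a quick self-check on the workspace topics above, the following sketch defines a variable, inspects the workspace, and saves and restores an object. The variable name x and the use of a temporary file are illustrative choices, not something the course materials prescribe.

```r
# Define a variable and call a function on it
x <- c(2, 4, 6)
mean(x)

# Show the current working directory
getwd()

# List the objects in the workspace; objects() is a synonym for ls()
ls()
identical(ls(), objects())

# Save x to a (temporary) file, remove it from the workspace, load it back
f <- tempfile()
save(x, file = f)
rm(x)
load(f)
x

# Installing a package needs an internet connection, so it is left commented out
# install.packages("lme4")
```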

Getting help

Function-specific help
- ?(), help()
Searching the help
- apropos(), help.search(), RSiteSearch()
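
A small sketch of searching for functions by name. The help-page calls are commented out because they open a viewer and are best tried in an interactive session.

```r
# Find all visible objects whose name matches "read.csv"
hits <- apropos("read.csv")
hits

# Two equivalent ways to open the help page for mean()
# ?mean
# help("mean")
```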

Loading data

general purpose functions
- scan()
specific formats
- read.csv(), read.delim(), library(foreign)
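
To try the loading functions without any of the course files, one option is to write a small made-up table to a temporary file and read it back. The column names and values below are invented for illustration.

```r
# Write a small made-up table to a temporary file in CSV format
f <- tempfile()
writeLines(c("subject,days,reaction",
             "s1,0,250.5",
             "s1,1,262.0",
             "s2,0,301.2"), f)

# read.csv() returns a data frame; read.delim() does the same for
# tab-separated files
d <- read.csv(f)
str(d)

# scan() reads raw values into a vector (numbers by default)
f2 <- tempfile()
writeLines("1 2 3 4", f2)
v <- scan(f2, quiet = TRUE)
v
```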

General data structures

vectors / arrays
- c()
matrices
- cbind(), rbind(), table()
lists
- list(), unlist()
data frames
- data.frame()
general structure manipulation and interaction
- [], [[]], $, subset(), str(), cut(), rep()
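
The structures and accessors listed above can be sketched on a few made-up values. All object names here (v, m, l, d) are hypothetical.

```r
v <- c(10, 20, 30)                                   # vector
m <- rbind(v, v * 2)                                 # 2 x 3 matrix, built row by row
l <- list(name = "s1", rt = v)                       # list with named elements
d <- data.frame(subj = c("s1", "s2", "s3"), rt = v)  # data frame

v[2]                            # single brackets index a vector
l[["rt"]]                       # double brackets extract one list element
d$rt                            # $ extracts a column by name
subset(d, rt > 15)              # rows that satisfy a condition
cut(v, breaks = c(0, 15, 35))   # bin a numeric vector into a factor
table(d$rt > 15)                # count the values of a logical vector
unlist(list(a = 1, b = 2))      # flatten a list into a named vector
str(d)                          # compact overview of the structure
```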

Basic descriptive statistics

summary stats
- mean(), sd(), var(), quantile()
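
A minimal example of these summary functions on a made-up sample (the vector x is hypothetical):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)            # hypothetical sample

mean(x)                                   # arithmetic mean
var(x)                                    # sample variance (divides by n - 1)
sd(x)                                     # sample standard deviation, sqrt(var(x))
quantile(x, probs = c(0.25, 0.5, 0.75))   # quartiles
```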

Probability distributions

random sampling
- runif(), rnorm(), rbinom()
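
The sampling functions can be explored along these lines. The seed, sample sizes, and parameter values are arbitrary choices for illustration; rbinom() with size = 1 gives Bernoulli trials.

```r
set.seed(1)   # make the random draws reproducible

u <- runif(5)                             # 5 draws from Uniform(0, 1)
z <- rnorm(1000, mean = 0, sd = 1)        # 1000 draws from a standard normal
b <- rbinom(1000, size = 1, prob = 0.3)   # 1000 Bernoulli(0.3) trials

mean(z)   # should be close to 0
sd(z)     # should be close to 1
mean(b)   # proportion of successes, should be close to 0.3
```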

Investigating and visualizing your data

Understanding your data set, predictors, and outcome, available information
- str(), summary(), names()
Understanding the distributions of your variables
- plot(), points(), lines(), barplot()
- xtabs(), table(), prop.table()
- hist(), histogram(), densityplot()
Understanding dependencies between your variables
- pairs(), cor(), abline(), loess()
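
A rough sketch of a first look at a data set, using simulated data and base graphics only. The data frame d and its columns are made up, and the lattice functions histogram() and densityplot() are left out so that the example needs no extra packages.

```r
set.seed(1)
# Hypothetical data: a predictor x and an outcome y that depends on it
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
d$group <- factor(rep(c("a", "b"), 25))

summary(d)                   # per-column summaries
hist(d$y)                    # distribution of the outcome
prop.table(table(d$group))   # proportion of observations per group
r <- cor(d$x, d$y)           # linear dependency between x and y
r

plot(d$x, d$y)               # scatterplot of outcome against predictor
abline(lm(y ~ x, data = d))  # add the least-squares line

# Add a loess smooth; points are sorted by x so the line is drawn left to right
fit <- loess(y ~ x, data = d)
o <- order(d$x)
lines(d$x[o], fitted(fit)[o], lty = 2)
```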

DenmarkMiniCourseSession0 (last edited 2008-11-13 04:20:03 by cpe-67-240-134-21)
