
= Knitr: automatic report generation =

== What does it do? ==

[[http://yihui.name/knitr/|Knitr]] is an R package that lets you embed code (R, Python, etc.) directly inside documents (LaTeX, HTML, Markdown, etc.). At the most basic level, this means no more copying and pasting or retyping of generated quantities, no more `\includegraphics{}` and `pdf()` or `ggsave()` calls in R, and no more typing scads of LMER coefficients.

What Knitr actually does is scan the document, extract all the '''code chunks''', evaluate them, format the results in a nice way (that you can control), and insert them back into the document, which can then be compiled in whatever way is appropriate (`pdflatex` for a LaTeX document, or a web browser/markdown interpreter for HTML or markdown output).

Knitr is based on Sweave, but is a lot better. There's built-in caching (so that a document can be re-knit without re-running code that hasn't changed since the last knitting, like time-consuming data analysis code), more transparency (it can include all input and output in the final document, as if you had run the code chunks in an R console), pretty formatting of R code and results in LaTeX, and a more modular interface that is easy to hack and extend.

== Why would I want to do that?? ==

Generally, including the code which generated your output (figures, tables, p-values, etc.) makes it easier for other people (including future-you) to replicate and check the analyses, and I've found that it cuts down on my tendency to lose code.

 1. '''Flexibility'''. Including the code that produced the output in the output file itself makes it really easy to ''update analyses when more data becomes available'', although you still would have to re-write the text if anything substantial changes :)
 2. '''Transparency'''. It also makes it easier for other people (especially including future-you) to ''understand and reproduce your analysis'', and can help you check and correct problems or bugs in your analysis.
 3. '''Stability'''. Keeping everything together in one document also helps ''prevent lost code'': the code that generated the figures/tables/analyses can't get separated from the output document (like a previous manuscript) that uses them.

It's also just really easy to do, especially if you're already using LaTeX. Even if you're not, you can create [[http://daringfireball.net/projects/markdown/|markdown]] documents which are easy to read and can be quickly converted into HTML.

== How do I do it? ==

The quickest way to get started is to use [[http://www.rstudio.com/|RStudio]]. Once you've installed the knitr package, RStudio has built-in support for knitr processing of LaTeX, HTML, and Markdown. Open up preferences, click on 'Sweave', and change 'Weave Rnw files using ...' to knitr. There's also a [[http://yihui.name/knitr/|video tutorial]] on the knitr homepage showing how to do this.

If you don't want to use RStudio, you can either call `knit('filename.Rnw')` directly in an R console, which will produce a `.tex` file, or you can use this shell script that I created which will automatically do that and then `pdflatex` the resulting `.tex` file: [[https://gist.github.com/kleinschmidt/5407250|knit.sh]]
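
As a rough sketch of the console route (this is not the contents of `knit.sh`), the following writes a tiny made-up `filename.Rnw` into a scratch directory and knits it. The directory name and the file contents are assumptions for illustration only, and the LaTeX step is left commented out because it needs a working LaTeX installation.

```r
library(knitr)

# Work in a scratch directory so nothing is written next to your real files.
work_dir <- file.path(tempdir(), "knitr-console-demo")
dir.create(work_dir, showWarnings = FALSE)
old_dir <- setwd(work_dir)

# A minimal, made-up .Rnw document: one chunk plus one inline \Sexpr{}.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<hello>>=",
  "x <- 1 + 1",
  "@",
  "The value of $x$ is \\Sexpr{x}.",
  "\\end{document}"
), "filename.Rnw")

# knit() writes filename.tex and returns the path of that output file.
tex_file <- knit("filename.Rnw", quiet = TRUE)

# Compile the result to PDF (requires LaTeX):
# tools::texi2pdf(tex_file)
# knitr also provides knit2pdf("filename.Rnw"), which does both steps in one call.

setwd(old_dir)
```

For a real document you would skip the `writeLines()` part and simply call `knit()` on your own `.Rnw` file from the directory that contains it.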

I recommend using RStudio (or another IDE, like Emacs+ESS+AUCTeX), because you can run the R code in the console as you go, instead of having to knit/latex the whole document to see what happens.

For LaTeX+R, the basic (default) syntax is just like Sweave. R code blocks are inserted like so:
{{{
Here's some latex stuff. Then, an R chunk:
<<block-name, option1='value', option2=TRUE, option3=4>>=
x <- rnorm(10, 0, 1)
print('hello, world!')
@
Now, back to more latex stuff.
}}}
You can also insert code inline, using the `\Sexpr{}` syntax:
{{{
This is latex stuff, but if I want to get the value of $x$ then I can say \Sexpr{x[1]}
}}}

When such a document is 'knit', all of these code chunks will be pulled out, evaluated, and re-inserted, after proper formatting. The formatting (among other things) is controlled by '''chunk options''', which go between the `<<` and `>>=` of the chunk header. Here are some of the ones I use most often (they are all documented very well on the [[http://yihui.name/knitr/options#chunk_options|knitr page]]):
 * `echo=TRUE` or `FALSE`: controls whether the input (code in the chunk) is displayed in the output.
 * `results='hide'`, `'markup'`, or `'asis'`: controls how the text output that would normally show up in the R console is displayed. It can be hidden completely, inserted in a "marked up" form (light gray background, monospaced font, syntax coloring, etc.), or inserted as "naked" LaTeX (useful when you use something like `xtable` to print a string which is valid LaTeX code for a table).
 * `message`, `error`, `warning` (each `TRUE` or `FALSE`): control whether messages, errors, and warnings are displayed in the output or not.
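
To see what `echo` and `results` do without building a whole document, here is a small sketch that knits a one-chunk Rnw snippet held in a string. The helper `knit_chunk()` and the variable name `hiddenvalue` are made up for this example; `knit(text = ...)` is the real knitr call being exercised.

```r
library(knitr)

# Hypothetical helper: knit a one-chunk Rnw snippet with the given chunk header
# and return the generated LaTeX as a string.
knit_chunk <- function(header) {
  knit(text = c(header, "hiddenvalue <- 12345", "print(hiddenvalue)", "@"),
       quiet = TRUE)
}

with_source    <- knit_chunk("<<demo, echo=TRUE>>=")
without_source <- knit_chunk("<<demo, echo=FALSE>>=")
no_results     <- knit_chunk("<<demo, echo=FALSE, results='hide'>>=")

# Is the source code (the variable name) present in the output?
shows_source <- c(
  echo_true  = any(grepl("hiddenvalue", with_source)),
  echo_false = any(grepl("hiddenvalue", without_source))
)

# Is the printed value present in the output?
shows_output <- c(
  results_markup = any(grepl("12345", without_source)),
  results_hide   = any(grepl("12345", no_results))
)

print(shows_source)
print(shows_output)
```

Printing `with_source` itself is a quick way to look at the LaTeX that knitr generates for a chunk.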

=== Caching ===

One of the best features of knitr is that it can '''cache the results of each chunk''', which means that time-consuming chunks (like a sampler or a call to `lmer`) don't need to be re-run every time the document is knit. If a cached chunk has changed at all since the last time it was evaluated, it will be re-run, and the new results stored in the cache. Furthermore, with the chunk option `autodep=TRUE`, knitr will try to determine which other chunks each chunk depends on; if any of those chunks have changed in the meantime, the chunk will be re-run. You can also specify the dependencies manually, using the chunk option `dependson=c('chunk1', 'chunk2')`.

Note that caching is '''off''' by default. You can turn it on for a single chunk by setting `cache=TRUE` as a chunk option, or turn it on as a default by putting `opts_chunk$set(cache=TRUE)` in a code chunk and setting `cache=FALSE` for any chunks that you don't want cached.

I've found that caching (especially using auto-dependency) can be a bit glitchy. If you want to totally re-run everything, just delete (or rename, to be safe) the `cache/` subdirectory and knit again.
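
A minimal sketch of both steps, assuming the default `cache/` directory: the first call is what you might put in a setup chunk near the top of the document, and `reset_cache()` is a made-up helper (not part of knitr) that renames the cache directory so the next knit starts from scratch.

```r
library(knitr)

# Typically placed in a setup chunk: cache every chunk by default and let
# knitr try to work out the dependencies between chunks.
opts_chunk$set(cache = TRUE, autodep = TRUE)

# Hypothetical helper: move the cache directory aside (rather than deleting it)
# so that the next knit re-runs every chunk.
reset_cache <- function(cache_dir = "cache") {
  if (!dir.exists(cache_dir)) return(invisible(NULL))
  backup <- paste0(sub("/+$", "", cache_dir), "-old-",
                   format(Sys.time(), "%Y%m%d-%H%M%S"))
  file.rename(cache_dir, backup)
  invisible(backup)  # path of the renamed directory
}
```

Calling `reset_cache()` from the directory that holds your `.Rnw` file, and then knitting again, re-runs everything; the old results stay in the renamed directory until you delete them.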

=== Figures ===

Knitr will include any figures that are produced by a chunk in the output. They will be saved to a subdirectory (`figure/` by default), named after the chunk, and numbered if the chunk produces more than one image. So:
{{{
<<some-figs>>=
x <- rnorm(1000)
hist(x)
qqnorm(x)
@
}}}
will create two files, `figure/some-figs1.pdf` and `figure/some-figs2.pdf`: the first is the histogram and the second is the QQ plot (the exact numbering format depends on the knitr version). The size of the images produced, and how they are inserted into the document, are controlled by chunk options:
 * `fig.width`, `fig.height`: control the width and height of any graphics produced in the chunk. Specify with numbers, as you would for `quartz()`, `pdf()`, or `ggsave()`.
 * `out.width`, `out.height`: control the size at which graphics will be displayed ''in the document''. I usually say something like `out.width='\\textwidth'` to make one figure take up the whole page width.
(There are more, of course.)
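
To check where the figures end up and what they are called in your knitr version, you can knit the chunk above in a scratch directory and list the result. This is only a sketch: the directory and file names (`knitr-figure-demo`, `figs.Rnw`) and the particular sizes are assumptions for illustration.

```r
library(knitr)

# Work in a scratch directory so the figure/ folder is created there.
work_dir <- file.path(tempdir(), "knitr-figure-demo")
dir.create(work_dir, showWarnings = FALSE)
old_dir <- setwd(work_dir)

# The chunk from above, with figure size options added. The four backslashes
# become two in the file, i.e. out.width='\\textwidth' in the chunk header.
writeLines(c(
  "<<some-figs, fig.width=5, fig.height=4, out.width='\\\\textwidth'>>=",
  "x <- rnorm(1000)",
  "hist(x)",
  "qqnorm(x)",
  "@"
), "figs.Rnw")

knit("figs.Rnw", quiet = TRUE)

# One PDF per plot, saved under the default figure/ directory.
fig_files <- list.files("figure")
print(fig_files)

setwd(old_dir)
```

Here `fig.width` and `fig.height` set the size of the PDF files themselves (in inches), while `out.width` only affects the `\includegraphics` call written into the `.tex` output.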

== A couple of use cases ==

=== Lab notebook ===

Besides the obvious case of doing an entire camera-ready publication in LaTeX+R using knitr, it can be useful for keeping track of analyses for a particular data set. I have only started to do this, but here's a quick (and pretty ugly/not very literate) example:

[[attachment:fs_notebook.Rnw]]
[[attachment:fs_notebook.pdf]]

=== Stats homework ===

Here's a statistics problem set that is run using knitr. It has examples of
 * Graphics of different sizes, using both ggplot and base graphics.
 * LaTeX-formatted tables generated by `xtable()`.
 * `echo=TRUE` and `results='markup'` to show the input and output as if on a console.

[[attachment:kleinschmidt-dave-hw5.Rnw]]
[[attachment:kleinschmidt-dave-hw5.pdf]]

=== Poster/talk content ===

I've used knitr as a convenient way to sketch ideas for poster content and keep track of figures in a single document. The finished poster is [[http://www.bcs.rochester.edu/people/dkleinschmidt/pubs/kleinschmidt-jaeger-cimsa-2013.pdf|here]].

[[attachment:poster-content.Rnw]]
[[attachment:poster-content.pdf]]

Here's the knitr document that I used to generate all the schematics for my lunch talk:

[[attachment:plots.Rnw]]
[[attachment:plots.pdf]]


LabmeetingSP13w13 (last edited 2013-10-16 03:31:19 by cpe-74-74-158-116)
