How to Get Started with R and Statistics

I'm often asked for advice about how to get involved with statistics and the R statistical programming language. Without a doubt, R is the premier open-source tool for data analysis, owing to its enormous library of visualization and model-fitting packages, its relatively straightforward data munging abilities, and its in-depth online documentation. What's more, R is largely built by a community of educators, and a large number of datasets are included in the standard library. For all its quirks, it remains a tremendous pedagogical and practical tool.

But without a solid grasp of the underlying statistical theory, it can be difficult to suss out just how to interpret the outputs of, say, glm(). Statistical knowledge and R knowledge each cover only half of the foundation; you need both to work effectively. If I were designing a basic curriculum, it would include four texts -- one each on practical statistics, R, statistical theory, and statistical learning.
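To make the glm() point concrete, here is a minimal sketch using the mtcars dataset that ships with base R. The choice of model (transmission type as a function of weight) is my own illustration, not taken from any of the books below. The coefficients come back on the log-odds scale, which is exactly the kind of output that is hard to read without some theory.

```r
# mtcars ships with base R: am is 0 (automatic) or 1 (manual),
# wt is the car's weight in units of 1000 lb.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Estimates, standard errors, z-values, and deviances
summary(fit)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
odds_ratios <- exp(coef(fit))
odds_ratios

# Predicted probability of a manual transmission for a hypothetical 3000 lb car
p_manual <- predict(fit, newdata = data.frame(wt = 3), type = "response")
p_manual
```

An odds ratio below 1 for wt means that heavier cars in this dataset are less likely to have a manual transmission. Reading that off the raw summary() output takes some familiarity with logistic regression.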

Practical Statistics

For a first introduction to practical statistics, OpenIntro Statistics, Second Edition really cannot be beat, owing both to its intuitive presentation of almost every foundational concept in statistics, and to its price -- it's a freely available PDF download written by David Diez, Christopher Barr, and Mine Çetinkaya-Rundel. More than merely a book on statistics, the OpenIntro Statistics text is a primer on experimental design and the scientific method. The book uses practical motivation (and the kinds of plots that R would produce) to develop ideas from the basics of probability distributions through point estimation and hypothesis testing, arriving finally at both normal-theory linear models and logistic regression. If I were to teach a course on statistics for beginners again, this would be my chosen text.

Among the strongest features of the book are its clear diagrams, most of which seem to be set using R, and the authors' focus on visualizing data in terms of distributions. Whereas many texts display distributions and histograms sparingly, OpenIntro makes them a centerpiece of its theoretical explanations. After all, they form the foundation of almost all practical data exploration as well. For example, here's an early caveat to understanding distributions only in terms of mean and variance:
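One way to see this caveat for yourself in R is a sketch like the following. It is my own illustration rather than the book's figure, and the sample size and seed are arbitrary. Two simulated samples are rescaled to share the same mean and variance, yet their histograms look nothing alike.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 1000

# Rescale a sample to mean 0 and standard deviation 1
standardize <- function(x) (x - mean(x)) / sd(x)

symmetric <- standardize(rnorm(n))  # bell-shaped
skewed    <- standardize(rexp(n))   # strongly right-skewed

# The summary statistics agree (up to floating-point error) ...
c(mean(symmetric), mean(skewed))
c(var(symmetric), var(skewed))

# ... but the shapes are very different
par(mfrow = c(1, 2))
hist(symmetric, breaks = 30, main = "Symmetric")
hist(skewed, breaks = 30, main = "Right-skewed")
```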

Also available are the datasets required for the exercises, offered as a bulk download and indexed by a chapter guide. However, the book itself does not offer a solid introduction to R.

R

There are several introductions to R available for people of different skill levels. Most folks who ask me for a way into R are software engineers who already have some understanding of other languages, and are interested in picking up some basic ideas about data manipulation at the same time. So far, my favorite text to recommend is Peter Dalgaard's Introductory Statistics with R, Second Edition. Though it's costly, and DRM'ed by Springer (ick), I have yet to find a more comprehensive and useful overview of R as a programming language and practical tool. For example:

All of the major data structures are clearly explicated, as are the core plotting, modeling, and hypothesis testing functions. However, the explanations of the theory behind these topics are not comprehensive -- the book is best used after statistical theory has been learned.
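For a rough idea of the ground covered, here is a short sketch of the core data structures, a hypothesis test, and a plot. It is my own example and uses the sleep dataset from base R rather than the datasets that accompany the book.

```r
# Core data structures
v  <- c(1.5, 2.0, 3.5)              # atomic (numeric) vector
f  <- factor(c("a", "b", "a"))      # factor: categorical data with levels
l  <- list(values = v, groups = f)  # list: heterogeneous container
m  <- matrix(1:6, nrow = 2)         # matrix: 2 rows, 3 columns
df <- data.frame(x = v, group = f)  # data frame: columns of equal length

str(df)  # compact summary of the structure

# Hypothesis testing: Welch two-sample t-test on the built-in sleep data
# (extra hours of sleep under two different drugs)
res <- t.test(extra ~ group, data = sleep)
res

# Plotting: the same comparison as side-by-side boxplots
boxplot(extra ~ group, data = sleep,
        xlab = "Drug", ylab = "Extra hours of sleep")
```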

Statistical Theory

For the more mathematically oriented and those interested in a calculus-driven understanding of statistical theory, I really cannot recommend Bernard Lindgren's 1976 Statistical Theory (Third Edition) enough. It's practically free these days, with used copies typically selling for under $5.00. The text is lucidly written and covers sets, combinatorics, probability, distribution functions, point estimation, and more. The presentation is logical and mathematical, and really contributes toward developing geometric intuition about statistical constructs:

This calculus-based thinking is incredibly helpful for understanding what a probability distribution really is, which distributions are helpful for which purposes, and how to construct completely novel models. For those interested in building real-time processing algorithms that leverage mathematical facts based on approximations -- rather than relying on brute force in offline precomputation -- this introduction to the toolkit is invaluable.
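The calculus view is easy to check numerically in R. In this small sketch of my own (not from Lindgren), a probability is the area under a density curve, so integrating the standard normal density over an interval should match the difference of the cumulative distribution function at its endpoints.

```r
# P(-1.96 < Z < 1.96) for a standard normal Z, as an area under the density
area <- integrate(dnorm, lower = -1.96, upper = 1.96)
area$value

# The same probability from the cumulative distribution function
cdf_diff <- pnorm(1.96) - pnorm(-1.96)
cdf_diff

# Any density must integrate to 1 over its whole support
total <- integrate(dnorm, lower = -Inf, upper = Inf)
total$value
```

The first two numbers agree up to the tolerance of the numerical integrator, which is the fundamental theorem of calculus applied to a density and its distribution function.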

Statistical Learning

Outside of simple linear models, there exist entire worlds of classification and prediction problems that suggest nonlinear models or sophisticated projections, as well as problems that occupy high-dimensional spaces. For more experienced folks, Trevor Hastie, Robert Tibshirani, and Jerome Friedman have provided an instant classic in The Elements of Statistical Learning, available for free in PDF form. The cohesiveness of the book's chapter organization reflects a strong theoretical framework that unites almost all concepts in the state of the art of statistical learning.

However, it is Hastie and Tibshirani's introductory text -- An Introduction to Statistical Learning with Applications in R, written with Gareth James and Daniela Witten -- that will really become the best friend of a newcomer to statistical learning techniques. The theoretical and the practical are well-married in this text, with inline code examples accompanying beautiful diagrams:

While linear algebra is part of the offering, a solid foundation in statistical theory is presumed. Once the basics of statistics and R have been learned, this book will take the reader to the next level of model-fitting, and I recommend it highly.
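As a small taste of moving beyond a straight line, here is a sketch of my own (not an excerpt from either book) that compares a linear and a quadratic fit of fuel economy against horsepower, using the mtcars dataset from base R.

```r
# Straight-line fit versus a quadratic polynomial in horsepower
linear    <- lm(mpg ~ hp, data = mtcars)
quadratic <- lm(mpg ~ poly(hp, 2), data = mtcars)

# F-test for the nested models: does the quadratic term earn its keep?
comparison <- anova(linear, quadratic)
comparison

# Residual sums of squares on the training data
rss <- c(linear = deviance(linear), quadratic = deviance(quadratic))
rss

# Overlay both fits on the data
plot(mpg ~ hp, data = mtcars)
hp_grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
lines(hp_grid$hp, predict(linear, hp_grid), lty = 2)  # dashed: linear
lines(hp_grid$hp, predict(quadratic, hp_grid))        # solid: quadratic
```

The more flexible model can never fit the training data worse than the simpler one it contains, so a lower residual sum of squares is not by itself evidence of a better model. Judging that properly means estimating error on data the model has not seen.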