I'm often asked for advice about how to get involved with statistics
and the R statistical programming language.
Without a doubt, R is
the premier open source tool for data analysis, owing to its enormous
library of visualization and model-fitting packages, its relatively
straightforward data munging abilities, and its in-depth online
documentation. What's more, R is largely built by a community of
educators, and quite a large number of datasets are included in the
standard library. For all its quirks, it remains a tremendous
pedagogical and practical tool.
But without a solid grasp of the underlying statistical theory,
it can be difficult to suss out just how to interpret R's outputs.
Together, statistical knowledge and R knowledge provide a basic foundation
to work with. If I were designing a basic curriculum, it would
include four texts -- one each on practical
statistics, R, statistical theory, and statistical learning.
For a first introduction to practical statistics, the
OpenIntro Statistics, Second Edition
really cannot be beat, owing both to its intuitive presentation of almost
every foundational concept in statistics, and to its price -- it's a freely
available PDF download written by David Diez, Christopher Barr, and Mine
Çetinkaya-Rundel.
More than merely a book on statistics, the OpenIntro Statistics text
is a primer on experimental design and the scientific method. The book
uses practical motivation (and common views that R would produce) to
develop ideas from the basics of probability distributions through
point estimation and hypothesis testing, arriving finally at
both normal-theory and logistic linear models. If I were to teach a
course on statistics for beginners again, this would be my chosen text.
Among the strongest features of the book are its clear diagrams,
most of which seem to be set using R, and the authors' focus on
visualizing data in terms of distributions. Whereas many texts will
display distributions and histograms sparingly, the OpenIntro makes
them a centerpiece of theoretical explanations. After all, they
form the foundation of almost all practical data explorations as
well. For example, the book gives an early caveat against understanding
distributions only in terms of mean and variance.
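
To make that caveat concrete, here is a minimal sketch in plain Python, not taken from the book. The two samples are made up for illustration and were chosen so that both have mean 0 and population variance 1; the helper names `summarize` and `text_histogram` are likewise hypothetical.

```python
from collections import Counter
from statistics import fmean, pvariance


def summarize(sample):
    """Return (mean, population variance) for a list of numbers."""
    return fmean(sample), pvariance(sample)


def text_histogram(sample):
    """Return one text row per distinct value: the value, then one '#' per observation."""
    counts = Counter(sample)
    return [f"{value:+.0f} {'#' * counts[value]}" for value in sorted(counts)]


# Made-up sample 1: every observation sits at -1 or +1.
two_point = [-1.0] * 4 + [1.0] * 4
# Made-up sample 2: most observations sit at 0, with one far out on each side.
peaked = [-2.0] + [0.0] * 6 + [2.0]

for name, sample in [("two_point", two_point), ("peaked", peaked)]:
    mean, variance = summarize(sample)
    print(name, "mean:", mean, "variance:", variance)
    print("\n".join(text_histogram(sample)))
```

Both samples report mean 0 and variance 1, yet one has no observations at 0 at all while the other piles most of them there. The two summary numbers alone cannot tell them apart; the histogram can.
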
Also available are the datasets required for the exercises, indexed by
a chapter guide in a bulk dataset download. However, the book itself
does not offer a solid introduction to R.
There are several different introductions to R available for people of
different skill levels. Most folks who ask me for a way into R are
software engineers who already have some understanding of other
languages, and are interested in picking up some basic ideas about
data manipulation at the same time. So far, my favorite text to
recommend is Peter Dalgaard's Introductory Statistics with R.
Though it's costly, and DRM'ed by Springer (ick), I have yet to find
a more comprehensive and useful overview of R as a programming
language and practical tool.
All of the major data structures are clearly explicated, as are
the core plotting, modeling, and hypothesis testing functions.
However, explanations of the underlying theory behind these topics
are not comprehensive; the book is best used after statistical
theory has been learned.
For the more mathematically oriented and those interested in a
calculus-driven understanding of statistical theory, I really cannot
recommend Bernard Lindgren's 1976 Statistical Theory (Third Edition) enough.
It's practically free these days, with used copies typically selling
for under $5.00. The text is lucidly written and covers sets, combinatorics,
probability, distribution functions, point estimation, and more, all
with refreshing clarity. The presentation is logical and mathematical,
and really contributes toward developing geometric intuition.
This calculus-based thinking is incredibly helpful for understanding
the basic idea of what a probability distribution really is, which
distributions are helpful for which purposes, and how to
construct completely novel models. For those interested
in building real-time processing algorithms that leverage
mathematical facts based on approximations -- rather than
relying on brute force in offline precomputation -- this
introduction to the toolkit is invaluable.
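
As a small taste of that calculus-based view (a sketch of the general idea, not an example from Lindgren's text), a probability for a continuous variable is an area under its density curve. The Python below assumes a standard normal distribution and a simple trapezoid rule; the function names `normal_pdf` and `integrate` and the interval endpoints are chosen here purely for illustration.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


# P(-1.96 <= X <= 1.96) for a standard normal X, computed as an area under the density.
area = integrate(normal_pdf, -1.96, 1.96)
# The same probability from the closed-form CDF, written with the error function.
exact = math.erf(1.96 / math.sqrt(2.0))
print(round(area, 4), round(exact, 4))
```

Both numbers come out at about 0.95, which is where the familiar "plus or minus 1.96 standard deviations" rule for 95% intervals comes from.
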
Outside of simple linear models, there exist entire worlds of
classification and prediction problems that suggest nonlinear
models or sophisticated projections, as well as problems that
occupy high-dimensional spaces.
For more experienced folks, Trevor Hastie, Robert Tibshirani, and
Jerome Friedman have provided an instant classic in The Elements of Statistical
Learning, available for free in PDF form. The cohesiveness of the book's chapter
organization reflects a strong theoretical framework that unites almost all
concepts in the state of the art of statistical learning.
However, it is Hastie and Tibshirani's introductory text,
An Introduction to Statistical Learning with Applications in R,
written with Gareth James and Daniela Witten, that will really
become the best friend of a newcomer to statistical learning
techniques. The theoretical and the practical are well married in this
text, with inline code examples accompanying beautiful diagrams.
While linear algebra is part of the offering, a solid foundation
in statistical theory is presumed.
Once the basics of statistics and R have been learned, this book will take
the reader to the next level of model-fitting, and I recommend it highly.