1 Statistics

1.1 History

The practice of statistics has been around for hundreds of years. The early application of statistics was centered around collecting demographic and economic data for governments. The terms “statistics” was coined in the mid-1700s, which is derived from its relationship to state or government data.

Early methodological ideas introduced in statistics involved procedures such as calculating the mean or median, representing data graphically, and trying to model the distribution of “errors” between observed data and a model.

In the 1800s, probability started to become integrated into statistical thinking, which lead to the era of inferential statistics; it gave rise to deep results on how we design and analyze studies to use finite amounts of data that have been collected through a probabilistic mechanism. During the early to mid 1900s, inferential statistics became a well developed and understood component to statistics. Its ties to probability were deepened, especially with the formal axiomatization and rigorous development of probability.

The latter half of the 1900s resulted in another major leap forward for the field of statistics, with the introduction of modern computing. This had a massive impact on how we collect and analyze data. Statistics became less reliant on mathematical models and more immersed in computational approaches.

In the 2000s we have witnessed yet another major leap forward. Data collection has never been faster, cheaper, or larger than today. We are able to collect and analyze massive amounts of data in most areas of science and industry. This has led to a sea-change in statistics, both in its scope and in its importance. The field of statistics has never been more challenging or impactful than it is today.

1.2 Definition

The modern definition of Statistics is the study of how to extract information from data, including how to collect, organize, analyze, and present information in data.

Applied Statistics is concerned with the practical considerations and implementations needed to carry out a statistical analysis.

In Chapter 2 below, we make this definition concrete by discussing the various problems that statistics tackles.

1.3 Relationship to Machine Learning

From https://en.wikipedia.org/wiki/Machine_learning:

Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Machine learning is closely related to and often overlaps with computational statistics; a discipline which also focuses in prediction-making through the use of computers.

1.4 Relationship to Data Science

From https://en.wikipedia.org/wiki/Data_science:

Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics.

1.5 Some History On Data Science

John Tukey

John Tukey pioneered a field called “exploratory data analysis” (EDA)

From The Future of Data Analysis (1962) Annals of Mathematical Statistics

For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.

All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.

Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.

Jeff Wu

From https://en.wikipedia.org/wiki/Data_science:

In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled “Statistics = Data Science?”. In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer science, usage of the term “data science” and advocated that statistics be renamed data science and statisticians data scientists.

William Cleveland

From https://en.wikipedia.org/wiki/Data_science:

In 2001, William Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate “advances in computing with data” in his article Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics in International Statistical Review

Cleveland establishes six technical areas which he believed to encompass the field of data science:

  1. multidisciplinary investigations
  2. models and methods for data
  3. computing with data
  4. pedagogy
  5. tool evaluation
  6. theory


Individuals working in industry began to call themselves “data scientists” in the late 2000’s, but this was far after statisticians had introduced the field.