7 EDA of High-Dimensional Data

7.1 Definition

High-dimensional data (HD data) typically refers to data sets where many variables are simultaneously measured on any number of observations.

The number of variables is often represented by \(p\) and the number of observations by \(n\).

HD data are collected into a \(p \times n\) or \(n \times p\) matrix.

Many methods exist for “large \(p\), small \(n\)” data sets.
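As a minimal sketch of the two matrix layouts, the following Python snippet simulates a "large \(p\), small \(n\)" data set with NumPy. The sizes (n = 20, p = 1000), the random values, and the variable names are illustrative assumptions, not data from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 20    # number of observations (assumed for illustration)
p = 1000  # number of variables (assumed for illustration)

# n x p layout: one row per observation, one column per variable
X = rng.normal(size=(n, p))

# p x n layout: the transpose, one row per variable
Xt = X.T

print(X.shape)   # (20, 1000)
print(Xt.shape)  # (1000, 20)
print(p > n)     # True: many more variables than observations
```

Both layouts hold the same values. Which one is used is a matter of convention, so it is worth checking which axis indexes the variables before applying a method.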

7.2 Examples

  • Clinical studies
  • Genomics (e.g., gene expression)
  • Neuroimaging (e.g., fMRI)
  • Finance (e.g., time series)
  • Environmental studies
  • Internet data (e.g., Netflix movie ratings)

7.3 Big Data vs HD Data

“Big data” are data sets that cannot fit into a standard computer’s memory.

HD data were defined above.

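The two are not necessarily equivalent: an HD data set can be small enough to fit in memory, and a big data set need not have many variables.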

7.4 Definition of HD Data

A high-dimensional data set is one in which many variables are measured.

A large sample size data set is one in which few variables are measured, but on many observations.

A big data set is one with so many data points that it cannot be managed straightforwardly in memory and must instead be stored and accessed elsewhere. Big data can be high-dimensional, large sample size, or both.
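To make the distinction concrete, here is a rough sketch that estimates the memory needed to hold a dense numeric matrix. The helper `footprint_bytes` and the three data set shapes are hypothetical, chosen only for illustration. The estimate assumes 8-byte floating-point entries and ignores any storage overhead.

```python
import numpy as np

def footprint_bytes(n, p, dtype=np.float64):
    """Bytes needed to hold a dense n x p matrix of the given dtype."""
    return n * p * np.dtype(dtype).itemsize

# Hypothetical shapes: (n observations, p variables)
shapes = {
    "HD, small n": (100, 20_000),
    "large sample size": (10_000_000, 10),
    "HD and large sample size": (1_000_000, 20_000),
}

for label, (n, p) in shapes.items():
    gb = footprint_bytes(n, p) / 1e9  # decimal gigabytes
    print(f"{label}: n={n}, p={p}, about {gb:.3g} GB")
```

Under these assumed sizes, the first data set is high-dimensional without being big, whereas the third would likely exceed the memory of a typical workstation.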

We will abbreviate high-dimensional with HD.

7.5 Rationale

Exploratory data analysis (EDA) of high-dimensional data poses the additional challenge that many variables must be examined simultaneously. Therefore, in addition to the EDA methods we discussed earlier, methods are often employed to organize, visualize, or numerically summarize high-dimensional data in a smaller number of dimensions.

Examples of EDA approaches applied to HD data include:

  • Traditional EDA methods covered earlier
  • Cluster analysis
  • Dimensionality reduction
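
As one possible illustration of dimensionality reduction, the sketch below projects simulated HD data onto two dimensions using principal component analysis computed from a singular value decomposition. PCA is used here only as a familiar example and is not necessarily the method these notes develop. The data, the sizes, and the group structure are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 30, 500  # assumed: 30 observations, 500 variables

# Simulated data: the first half of the observations has a shifted
# mean on the first 50 variables, giving two loose groups.
X = rng.normal(size=(n, p))
X[: n // 2, :50] += 2.0

# Center each variable (column) at zero
Xc = X - X.mean(axis=0)

# Thin SVD: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of each observation on the first k principal components
k = 2
scores = U[:, :k] * s[:k]

# Proportion of total variance captured by each component
var_explained = s**2 / np.sum(s**2)

print(scores.shape)        # (30, 2)
print(var_explained[:k])
```

Plotting the two columns of `scores` against each other gives a two-dimensional view of all 500 variables. A cluster analysis could then be applied to the scores or to `X` itself.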