3/28/17: Script modified for readability; added Loading Summary table for quick viewing.
A persistent problem in modern experimental biology is the increasing complexity of data sets retrieved from a wide variety of experiments. New experimental techniques and more sensitive assays allow researchers to be more efficient with biological samples, measuring multiple aspects of the same system while using less of the original sample. Because all of these measurements are derived from the same biological sample, researchers are extracting a very large and complex snapshot of the animal’s extant phenotype. Indeed, there has been a trend to view biological systems in a more holistic light, attempting to understand complexities that were once thought out of reach. With these gains, researchers can become inundated with data and may get lost in the apparent complexity of a data set that has a more concise explanation. Advances in the technical aspects of experimentation have likewise been followed by a shift in the mathematical methods used to determine the biological relevance of the results.
Standard inferential statistics impose a number of constraints when working with very large and complex data sets. One might be inclined to make every possible comparison between all groups within a given variable. The number of comparisons within a data set increases as the number of dependent variables increases, resulting in a higher probability of false discovery. A shotgun-style approach not only decreases statistical power, but also saturates the data set with significant findings that may or may not have biological relevance. A large number of significant results may deceive an investigator into believing that the data set is much more complex than it actually is, because redundancies among variables are difficult to recognize. Thus, it would be helpful if the data could be reduced to fewer variables that give the researcher a more holistic understanding of the data. This is precisely the goal of principal component analysis (PCA). PCA employs linear algebra to minimize signal noise and redundancy while revealing previously hidden dynamics of a complex data set.
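The inflation of false discoveries described above can be illustrated with a short calculation. This is a minimal sketch with hypothetical group and variable counts, and it assumes that all tests are independent and run at a significance level of 0.05, which is a simplification.

```r
# Hypothetical design: 4 groups compared pairwise on each dependent variable
n_groups <- 4
n_variables <- c(1, 5, 10, 20)
alpha <- 0.05

# Pairwise group comparisons for one variable, then across all variables
comparisons_per_variable <- choose(n_groups, 2)
total_comparisons <- comparisons_per_variable * n_variables

# Probability of at least one false positive, assuming independent tests
familywise_error <- 1 - (1 - alpha)^total_comparisons

data.frame(n_variables, total_comparisons,
           familywise_error = round(familywise_error, 3))
```

Even with a single dependent variable, six pairwise comparisons already push the chance of at least one false positive above 25% under these assumptions.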
Application of PCA
The goal of PCA is variable reduction. Variable reduction can be defined as a distillation of a large number of diverse variables into a few more easily interpretable ones. One of the primary strengths of PCA is the diversity of the types of data that can be analyzed as covariates. For example, the measurement of serum hormone concentrations is a common way to assess endocrine disorders. However, hormone levels often increase with body weight or the size of the secreting organ, so these measures can be entered into the analysis alongside the hormone concentrations.
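As a rough illustration of variable reduction, the sketch below runs base R's `prcomp` on simulated data in which a hormone measure and an organ weight both track body weight. The variable names, units, and effect sizes are invented for the example and do not come from any real data set.

```r
set.seed(1)
n <- 50

# Simulated measures: organ weight and hormone level both scale with body weight
body_weight  <- rnorm(n, mean = 300, sd = 30)
organ_weight <- 0.01 * body_weight + rnorm(n, sd = 0.2)
hormone      <- 0.05 * body_weight + rnorm(n, sd = 1)
measures <- data.frame(body_weight, organ_weight, hormone)

# Center and scale so that variables in different units contribute equally
fit <- prcomp(measures, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each component
summary(fit)$importance["Proportion of Variance", ]

# Loadings: how each original variable contributes to each component
fit$rotation
```

Because the three measures are correlated by construction, most of their shared variance should load onto the first component, which is the sense in which several variables are distilled into fewer.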
Assumptions and Limitations
A great strength of principal component analysis is its leniency toward standard statistical assumptions. This is because PCA is not a p-value driven analysis and is primarily descriptive in nature.
A major limitation of PCA is the necessity of a complete data set. In the matrix calculations required for PCA, the covariance of every possible pair of measures must be computed, so missing values cannot be left in place. There are a few ways to deal with this problem. First, a variable can be removed if few samples actually have the measurement. Second, an animal can be removed if it is missing multiple measures within the data set. As a last resort, the missing data can be filled in with probabilistic principal component analysis (Tipping & Bishop, 1999). While not an optimal solution, this method can be fairly accurate at predicting missing values. Care has to be taken in interpreting these results.
Last, a meaningful interpretation of the results does not have to exist. PCA will always find the axes that account for the most variance, regardless of whether that variance is biologically important. It is up to the researcher to understand the output and interpret it in a way that corresponds to the literature and a priori knowledge on the subject.
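The three strategies above might be sketched as follows. The data, the column names, and the cutoffs (drop a measure missing in more than half of the animals, drop an animal missing more than one measure) are hypothetical choices made for illustration. The last step uses `pca(..., method = "ppca")` and `completeObs` from pcaMethods and only runs if that package is installed.

```r
# Hypothetical data: 8 animals (rows) by 4 measures (columns), with gaps
dat <- data.frame(
  weight  = c(310, 295, NA, 320, 305, 298, 315, 289),
  hormone = c(12.1, 10.4, 11.8, NA, 12.9, 11.0, 12.5, 10.1),
  organ   = c(3.1, 2.9, NA, 3.3, 3.0, 2.8, 3.2, 2.7),
  assay_x = c(NA, NA, NA, NA, NA, 0.9, 0.8, NA)
)

# Option 1: drop measures that are missing in most animals
missing_per_measure <- colMeans(is.na(dat))
dat_reduced <- dat[, missing_per_measure <= 0.5, drop = FALSE]

# Option 2: drop animals missing more than one of the remaining measures
missing_per_animal <- rowSums(is.na(dat_reduced))
dat_reduced <- dat_reduced[missing_per_animal <= 1, , drop = FALSE]

# Option 3 (last resort): estimate the remaining gaps with probabilistic PCA
if (requireNamespace("pcaMethods", quietly = TRUE)) {
  ppca_fit <- pcaMethods::pca(as.matrix(dat_reduced), method = "ppca",
                              nPcs = 2, scale = "uv", center = TRUE)
  dat_complete <- pcaMethods::completeObs(ppca_fit)
}
```

Applying the first two options before imputation keeps the number of estimated values small, which matters because imputed values should be interpreted with care.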
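The point that PCA always returns components, whether or not they mean anything, can be seen by running it on pure noise. The sketch below uses simulated, mutually independent variables, so there is no real structure to find.

```r
set.seed(42)

# 200 simulated "animals" by 5 independent noise variables
noise <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

fit_noise <- prcomp(noise, center = TRUE, scale. = TRUE)

# Proportion of variance per component; PC1 is always at least 1/5 here
prop_var <- fit_noise$sdev^2 / sum(fit_noise$sdev^2)
round(prop_var, 3)
```

The first component still captures more than its equal share of the variance, simply because PCA orders components by variance. A ranked list of components is therefore not evidence of biological structure on its own.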
While the scripts used to create these figures are our own, the software used to run them is R. Here, we offer a standardized workflow for PCA, contained within a single script. This script depends greatly on the work of others and uses two very useful R packages (pcaMethods and ggplot2). You will need to install both packages if you would like to use our script (directions are below). You are free to use or edit these scripts, but we require acknowledgment when results from analyses using our script are published or presented.
To install pcaMethods:
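A minimal sketch, assuming a current R installation: pcaMethods is distributed through Bioconductor, so one way to install it is with the BiocManager package. Older versions of R used a different Bioconductor installer, so the exact command may differ on your system.

```r
# Install the Bioconductor installer if it is not already present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install pcaMethods from Bioconductor
BiocManager::install("pcaMethods")
```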
To install ggplot2:
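ggplot2 is available from CRAN, so under a default R setup it can typically be installed from the R console with:

```r
# Install ggplot2 from CRAN
install.packages("ggplot2")
```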
Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine, 2, 559-572.
Samuel V Scarpino, Ross Gillette, and David Crews. 2014. multiDimBio: An R Package for the Design, Analysis, and Visualization of Systems Biology Experiments. arXiv:1404.0594v1 [q-bio.QM].
David Crews, Ross Gillette, Samuel V. Scarpino, Mohan Manikkam, Marina I. Savenkova, and Michael K. Skinner. 2012. Epigenetic transgenerational inheritance of altered stress responses. PNAS. 109 (23): 9143–9148.
Download our script here.
The output of this script is automated and described below. However, you must organize your data so that the script can understand the input. Your data must be presented to the script as a .csv file and must not contain any spacer (blank) rows or columns. Rows should be unique individuals and columns should be unique measure endpoints. At least one of your columns should contain grouping information, either as numbers or as text.
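As an illustration of this layout, the sketch below writes a small hypothetical data file to a temporary directory, reads it back, and checks for spacer rows or columns. The column names and values are invented, and the check is only a convenience that is not part of our script.

```r
# A small hypothetical data set: one row per animal, one column per measure
example <- data.frame(
  animal_id = c("A1", "A2", "A3", "A4"),
  group     = c("control", "control", "treated", "treated"),
  weight    = c(310, 295, 320, 305),
  hormone   = c(12.1, 10.4, 12.9, 11.0)
)
csv_path <- file.path(tempdir(), "example_data.csv")
write.csv(example, csv_path, row.names = FALSE)

# Read the file back and look for entirely blank rows or columns
dat <- read.csv(csv_path, stringsAsFactors = FALSE)
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
blank_rows <- which(apply(dat, 1, function(r) all(is_blank(r))))
blank_cols <- which(vapply(dat, function(col) all(is_blank(col)), logical(1)))

# Both should be empty before the file is passed to the script
length(blank_rows)
length(blank_cols)
```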
Lines you must change:
Line 2, set your working directory. We advise creating a new directory that contains your data file and our script. If you intend to perform multiple analyses, we recommend creating a new directory for each, because previous results will be overwritten if you run a new analysis within the same working directory:
Line 4, read in your data file:
Line 7, create a new data set that does not contain any categorical data (i.e. unique identifiers, grouping variables, etc.). In the example below, columns 1 through 3 are removed.
Line 10, define a column that contains group identification information. In the example below, column 2 of the original data file contains group identification:
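The four edits above might look like the following sketch. The directory, file name, and object names (`dat`, `dat_numeric`, `groups`) are hypothetical and are not taken from the script itself. A temporary directory and a small simulated file stand in for your own data so that the example runs on its own.

```r
# Stand-in for your analysis directory and data file (hypothetical names)
analysis_dir <- file.path(tempdir(), "pca_analysis")
dir.create(analysis_dir, showWarnings = FALSE)
write.csv(
  data.frame(animal_id = c("A1", "A2", "A3", "A4"),
             group     = c("control", "control", "treated", "treated"),
             cohort    = c(1, 1, 2, 2),
             weight    = c(310, 295, 320, 305),
             hormone   = c(12.1, 10.4, 12.9, 11.0)),
  file.path(analysis_dir, "my_data.csv"),
  row.names = FALSE
)

# Set the working directory to the folder holding the data and the script
setwd(analysis_dir)

# Read in the data file
dat <- read.csv("my_data.csv")

# Drop the categorical columns (here, columns 1 through 3)
dat_numeric <- dat[, -(1:3)]

# Column 2 of the original file holds group identification
groups <- dat[, 2]
```

In your own copy of the script, replace the directory, file name, and column indices with the ones that match your data.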
The rest of the script can be run as is.