Chapter 3 Day 1 - Section 3 : MSstats, Differential abundance and sample size calculation
3.1 Summary
- Significance analysis for MS proteomics based peak intenesities data
3.2 Data
- The processed data,
quant.skyline.rda
,quant.maxquant.rda
,quant.openms.rda
,quant.progenesis.rda
, which is output ofdataProcess
from section 2.
3.3 Load MSstats
Load MSstats first. Then you are ready to start MSstats.
library(MSstats)
?MSstats
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
3.4 Inference and future design of experiment with Skyline output
3.4.1 Load the processed data (output of dataProcess
) of Skyline output
load(file='data/data_Skyline/quant.skyline.rda')
3.4.2 Finding differentially abundant proteins across conditions
3.4.2.1 Assign contrast matrix
After we normalized the data and summarized each protein’s behaviour across conditions in dataProcess
step, we are all set to compare protein changes between groups of conditions. Within MSstats we can do this with the groupComparison
function, which takes as input the output of the dataProcess
function.
?groupComparison
We have to tell groupComparison
which are the conditions we would like to compare.
You can make your contrast.matrix
in R in a text editor. We define our contrast matrix by adding a column for every condition, in alphabetical order. We add a row for every comparison we would like to make between groups of conditions.
0 is for conditions we would like to ignore. 1 is for conditions we would like to put in the numerator of the ratio or fold-change. -1 is for conditions we would like to put in the denumerator of the ratio or fold-change.
If you have multiple groups, you can assign any gruop comparisons you are interested in. For example, if you have 4 different conditions, Condition1, Condition2, Condition3, Condition4, there are many possible comparisons.
# check unique conditions and check order of condition information
# In this case, Disease and Healthy
unique(quant.skyline$ProcessedData$GROUP_ORIGINAL)
## [1] Condition1 Condition2 Condition3 Condition4
## Levels: Condition1 Condition2 Condition3 Condition4
comparison1<-matrix(c(-1,1,0,0),nrow=1)
comparison2<-matrix(c(-1,0,1,0),nrow=1)
comparison<-rbind(comparison1, comparison2)
row.names(comparison)<-c("C2-C1","C3-C1")
comparison
## [,1] [,2] [,3] [,4]
## C2-C1 -1 1 0 0
## C3-C1 -1 0 1 0
Challenges
Make the matrix, named
comparison
, for all 6 paired comparision among 4 conditions (condition1, condition2, condition3, condition4)
unique(quant.skyline$ProcessedData$GROUP_ORIGINAL)
## [1] Condition1 Condition2 Condition3 Condition4
## Levels: Condition1 Condition2 Condition3 Condition4
comparison1<-matrix(c(-1,1,0,0),nrow=1)
comparison2<-matrix(c(-1,0,1,0),nrow=1)
comparison3<-matrix(c(-1,0,0,1),nrow=1)
comparison4<-matrix(c(0,-1,1,0),nrow=1)
comparison5<-matrix(c(0,-1,0,1),nrow=1)
comparison6<-matrix(c(0,0,-1,1),nrow=1)
comparison<-rbind(comparison1, comparison2, comparison3, comparison4, comparison5, comparison6)
row.names(comparison)<-c("C2-C1","C3-C1","C4-C1","C3-C2","C4-C2","C4-C3")
comparison
## [,1] [,2] [,3] [,4]
## C2-C1 -1 1 0 0
## C3-C1 -1 0 1 0
## C4-C1 -1 0 0 1
## C3-C2 0 -1 1 0
## C4-C2 0 -1 0 1
## C4-C3 0 0 -1 1
3.4.2.2 Comparing conditions with groupComparison
groupComparison
uses the run-level summarized data ($RunlevelData
from dataProcess
function) for hypothesis testing.
test.skyline <- groupComparison(contrast.matrix=comparison, data=quant.skyline)
Let’s check the output.
class(test.skyline)
## [1] "list"
names(test.skyline)
## [1] "ComparisonResult" "ModelQC" "fittedmodel"
# Show test result
# Label : which comparison is used
# log2FC : estimated log2 fold change between Diseased and Healthy
# adj.pvalue : adjusted p value
# issue : detect whether this protein has any issue for comparison
# such as, there is measurement in certain group, or no measurement at all.
# MissingPercentage : the number of missing intensities/total number of intensities
# in conditions your are interested in for comparison
# ImputationPercentage : the number of imputed intensities/total number of intensities
# in conditions your are interested in for comparison
head(test.skyline$ComparisonResult)
## Protein Label log2FC SE Tvalue DF
## 1 sp|D6VTK4|STE2_YEAST C2-C1 0.06094263 0.10304517 0.5914167 8
## 2 sp|O13297|CET1_YEAST C2-C1 -0.11024579 0.07589698 -1.4525714 8
## 3 sp|O13329|FOB1_YEAST C2-C1 -0.16114268 0.29273717 -0.5504688 8
## 4 sp|O13539|THP2_YEAST C2-C1 -0.43646953 0.83013182 -0.5257834 8
## 5 sp|O13547|CCW14_YEAST C2-C1 -0.05679619 0.14506388 -0.3915254 8
## 6 sp|O13563|RPN13_YEAST C2-C1 -0.16850033 0.09496183 -1.7744006 8
## pvalue adj.pvalue issue MissingPercentage ImputationPercentage
## 1 0.5705706 0.9983122 NA 0 0
## 2 0.1844096 0.9983122 NA 0 0
## 3 0.5970289 0.9983122 NA 0 0
## 4 0.6132973 0.9983122 NA 0 0
## 5 0.7056311 0.9983122 NA 0 0
## 6 0.1139193 0.9983122 NA 0 0
# After fitting linear model, residuals and fitted values can be shown.
head(test.skyline$ModelQC)
## RUN PROTEIN ABUNDANCE NumMeasuredFeature MissingPercentage
## 1: 1 sp|D6VTK4|STE2_YEAST 26.95958 7 0
## 2: 2 sp|D6VTK4|STE2_YEAST 26.78088 7 0
## 3: 3 sp|D6VTK4|STE2_YEAST 26.68074 7 0
## 4: 4 sp|D6VTK4|STE2_YEAST 26.98540 7 0
## 5: 5 sp|D6VTK4|STE2_YEAST 26.90384 7 0
## 6: 6 sp|D6VTK4|STE2_YEAST 26.71479 7 0
## more50missing NumImputedFeature originalRUN GROUP
## 1: FALSE 0 JD_06232014_sample1_B.raw 1
## 2: FALSE 0 JD_06232014_sample1_C.raw 1
## 3: FALSE 0 JD_06232014_sample1-A.raw 1
## 4: FALSE 0 JD_06232014_sample2_A.raw 2
## 5: FALSE 0 JD_06232014_sample2_B.raw 2
## 6: FALSE 0 JD_06232014_sample2_C.raw 2
## GROUP_ORIGINAL SUBJECT_ORIGINAL SUBJECT_NESTED SUBJECT residuals
## 1: Condition1 1 1.1 1 0.15251179
## 2: Condition1 1 1.1 1 -0.02618545
## 3: Condition1 1 1.1 1 -0.12632634
## 4: Condition2 2 2.4 4 0.11739159
## 5: Condition2 2 2.4 4 0.03582686
## 6: Condition2 2 2.4 4 -0.15321844
## fitted
## 1: 26.80707
## 2: 26.80707
## 3: 26.80707
## 4: 26.86801
## 5: 26.86801
## 6: 26.86801
# Fitted model per protein
head(test.skyline$fittedmodel)
## [[1]]
##
## Call:
## lm(formula = ABUNDANCE ~ GROUP, data = data2)
##
## Coefficients:
## (Intercept) GROUP2 GROUP3 GROUP4
## 26.80707 0.06094 -0.16303 -0.11015
##
##
## [[2]]
##
## Call:
## lm(formula = ABUNDANCE ~ GROUP, data = data2)
##
## Coefficients:
## (Intercept) GROUP2 GROUP3 GROUP4
## 24.69673 -0.11025 -0.13596 -0.08799
##
##
## [[3]]
##
## Call:
## lm(formula = ABUNDANCE ~ GROUP, data = data2)
##
## Coefficients:
## (Intercept) GROUP2 GROUP3 GROUP4
## 23.6136 -0.1611 0.1294 -0.3427
##
##
## [[4]]
##
## Call:
## lm(formula = ABUNDANCE ~ GROUP, data = data2)
##
## Coefficients:
## (Intercept) GROUP2 GROUP3 GROUP4
## 26.3935 -0.4365 -0.4232 -0.4941
##
##
## [[5]]
##
## Call:
## lm(formula = ABUNDANCE ~ GROUP, data = data2)
##
## Coefficients:
## (Intercept) GROUP2 GROUP3 GROUP4
## 27.04062 -0.05680 -0.28259 -0.08648
##
##
## [[6]]
##
## Call:
## lm(formula = ABUNDANCE ~ GROUP, data = data2)
##
## Coefficients:
## (Intercept) GROUP2 GROUP3 GROUP4
## 26.17882 -0.16850 -0.06754 -0.20791
test.skyline$fittedmodel[[1]]
##
## Call:
## lm(formula = ABUNDANCE ~ GROUP, data = data2)
##
## Coefficients:
## (Intercept) GROUP2 GROUP3 GROUP4
## 26.80707 0.06094 -0.16303 -0.11015
3.4.2.3 Save the comparison result
Let’s save the testing result as rdata and .csv file.
Skyline.result <- test.skyline$ComparisonResult
save(Skyline.result, file='data/data_Skyline/Skyline.result.rda')
write.csv(Skyline.result, file='data/data_Skyline/testResult_ABRF_skyline.csv')
3.4.2.4 subset of significant comparisons
Let’s inspect the results to see what proteins are changing significantly between Diseased and Healthy.
head(Skyline.result)
## Protein Label log2FC SE Tvalue DF
## 1 sp|D6VTK4|STE2_YEAST C2-C1 0.06094263 0.10304517 0.5914167 8
## 2 sp|O13297|CET1_YEAST C2-C1 -0.11024579 0.07589698 -1.4525714 8
## 3 sp|O13329|FOB1_YEAST C2-C1 -0.16114268 0.29273717 -0.5504688 8
## 4 sp|O13539|THP2_YEAST C2-C1 -0.43646953 0.83013182 -0.5257834 8
## 5 sp|O13547|CCW14_YEAST C2-C1 -0.05679619 0.14506388 -0.3915254 8
## 6 sp|O13563|RPN13_YEAST C2-C1 -0.16850033 0.09496183 -1.7744006 8
## pvalue adj.pvalue issue MissingPercentage ImputationPercentage
## 1 0.5705706 0.9983122 NA 0 0
## 2 0.1844096 0.9983122 NA 0 0
## 3 0.5970289 0.9983122 NA 0 0
## 4 0.6132973 0.9983122 NA 0 0
## 5 0.7056311 0.9983122 NA 0 0
## 6 0.1139193 0.9983122 NA 0 0
# select subset of rows with adj.pvalue < 0.05
SignificantProteins <-
Skyline.result[Skyline.result$adj.pvalue < 0.05, ]
nrow(SignificantProteins)
## [1] 28
# select subset of rows with adj.pvalue < 0.05 and log2FC > 2
SignificantProteinsUpInDiseased <- SignificantProteins[SignificantProteins$log2FC > 2 ,]
nrow(SignificantProteinsUpInDiseased)
## [1] 14
3.4.3 Visualization of differentially abundant proteins
?groupComparisonPlots
3.4.3.1 Volcano plot
Volcano plots allow us to visually separate strong changes, which are not significant, from strong and significant changes. Look for these subjects in the upper right and upper left quadrants of the plot. Protein name will be shown only for siginificant proteins.
groupComparisonPlots(data = Skyline.result,
type = 'VolcanoPlot',
address = 'data/data_Skyline/testResult_ABRF_skyline_')
We can set up estimated fold change cutoff.
groupComparisonPlots(data = Skyline.result,
type = 'VolcanoPlot',
sig = 0.05,
FCcutoff = 2^2,
address = 'data/data_Skyline/testResult_ABRF_skyline_FCcutoff4_')
3.4.3.2 Comparison plot
Comparison plots illustrate model-based estimates of log-fold changes, and the associated uncertainty, in several comparisons of conditions for one protein. X-axis is the comparison of interest. Y-axis is the log fold change. The dots are the model-based estimates of log-fold change, and the error bars are the model-based 95% confidence intervals (the option sig can be used to change the significance level of significance). For simplicity, the confidence intervals are adjusted for multiple comparisons within protein only, using the Bonferroni approach. For proteins with N comparisons, the individual confidence intervals are at the level of 1-sig/N.
groupComparisonPlots(Skyline.result,
type="ComparisonPlot",
address="data/data_Skyline/testResult_ABRF_skyline_")
Let’s check the comparison result for protein sp|P44015|VAC2_YEAST
.
Skyline.result[Skyline.result$Protein == 'sp|P44015|VAC2_YEAST', ]
## Protein Label log2FC SE Tvalue DF
## 1756 sp|P44015|VAC2_YEAST C2-C1 -0.2287384 0.3108677 -0.7358063 8
## 4774 sp|P44015|VAC2_YEAST C3-C1 -2.9716852 0.3108677 -9.5593254 8
## 7792 sp|P44015|VAC2_YEAST C4-C1 -5.2574901 0.3108677 -16.9123090 8
## 10810 sp|P44015|VAC2_YEAST C3-C2 -2.7429468 0.3108677 -8.8235190 8
## 13828 sp|P44015|VAC2_YEAST C4-C2 -5.0287517 0.3108677 -16.1765027 8
## 16846 sp|P44015|VAC2_YEAST C4-C3 -2.2858049 0.3108677 -7.3529837 8
## pvalue adj.pvalue issue MissingPercentage
## 1756 4.828617e-01 0.9983122047 NA 0
## 4774 1.187124e-05 0.0089568526 NA 0
## 7792 1.515151e-07 0.0001143182 NA 0
## 10810 2.143047e-05 0.0129354325 NA 0
## 13828 2.143125e-07 0.0001293590 NA 0
## 16846 7.970992e-05 0.0601411333 NA 0
## ImputationPercentage
## 1756 0
## 4774 0
## 7792 0
## 10810 0
## 13828 0
## 16846 0
3.4.4 Planning future experimental designs
This last analysis step views the dataset as a pilot study of a future experiment, utilizes its variance components, and calculates the minimal number of replicates required in a future
experiment to achieve the desired statistical power. The calculation is performed by the function designSampleSize
, which takes as input the fitted model in groupComparison
. Sample size calculation assumes same experimental design (i.e. group comparison, time course or paired design) as in the current dataset, and uses the model fit to estimate the median variance components across all the proteins. Finally, sample size calculation assumes that a large proportion of proteins (specifically, 99%) will not change in abundance in the future experiment. This assumption also provides conservative results.
Using the estimated variance components, the function relates the number of biological replicates per condition (numSample
, rounded to 0 decimal), average statistical power across all the proteins (power
), minimal fold change that we would like to detect (can be specified as a range, e.g. desiredFC=c(1.1, 2)
), and the False Discovery Rate (FDR
). The user should specify all these quantities but one, and the function will solve for the remainder. The quantity to solve for should be set to = TRUE
.
?designSampleSize
3.4.4.1 Calculating statistical power
# calculate the power
test.power <- designSampleSize(data = test.skyline$fittedmodel,
desiredFC = c(1.1, 1.6),
FDR = 0.05,
power = TRUE,
numSample = 3)
test.power
## desiredFC numSample FDR power CV
## 1 1.100 3 0.05 0.01 0.010
## 2 1.125 3 0.05 0.01 0.010
## 3 1.150 3 0.05 0.01 0.010
## 4 1.175 3 0.05 0.01 0.010
## 5 1.200 3 0.05 0.04 0.009
## 6 1.225 3 0.05 0.10 0.009
## 7 1.250 3 0.05 0.20 0.009
## 8 1.275 3 0.05 0.33 0.009
## 9 1.300 3 0.05 0.47 0.009
## 10 1.325 3 0.05 0.59 0.008
## 11 1.350 3 0.05 0.70 0.008
## 12 1.375 3 0.05 0.79 0.008
## 13 1.400 3 0.05 0.86 0.008
## 14 1.425 3 0.05 0.91 0.008
## 15 1.450 3 0.05 0.94 0.008
## 16 1.475 3 0.05 0.97 0.008
## 17 1.500 3 0.05 0.98 0.007
## 18 1.525 3 0.05 0.99 0.007
## 19 1.550 3 0.05 0.99 0.007
## 20 1.575 3 0.05 0.99 0.007
## 21 1.600 3 0.05 0.99 0.007
3.4.4.2 Visualizing the relationship between desired fold-change and power
designSampleSizePlots(data = test.power)
3.4.4.3 Designing sample size for desired fold-change
# Minimal number of biological replicates per condition
samplesize <- designSampleSize(data = test.skyline$fittedmodel,
desiredFC = c(1.1, 1.6),
FDR = 0.05,
power = 0.9,
numSample = TRUE)
samplesize
## desiredFC numSample FDR power CV
## 1 1.100 40 0.05 0.9 0.001
## 2 1.125 26 0.05 0.9 0.001
## 3 1.150 19 0.05 0.9 0.002
## 4 1.175 14 0.05 0.9 0.002
## 5 1.200 11 0.05 0.9 0.003
## 6 1.225 9 0.05 0.9 0.003
## 7 1.250 7 0.05 0.9 0.004
## 8 1.275 6 0.05 0.9 0.004
## 9 1.300 5 0.05 0.9 0.005
## 10 1.325 5 0.05 0.9 0.005
## 11 1.350 4 0.05 0.9 0.006
## 12 1.375 4 0.05 0.9 0.006
## 13 1.400 3 0.05 0.9 0.008
## 14 1.425 3 0.05 0.9 0.008
## 15 1.450 3 0.05 0.9 0.008
## 16 1.475 2 0.05 0.9 0.011
## 17 1.500 2 0.05 0.9 0.011
## 18 1.525 2 0.05 0.9 0.011
## 19 1.550 2 0.05 0.9 0.011
## 20 1.575 2 0.05 0.9 0.011
## 21 1.600 2 0.05 0.9 0.010
3.4.4.4 Visualizing the relationship between desired fold-change and mininum sample size number
designSampleSizePlots(data = samplesize)
3.4.5 Protein subject quantification
If there is no technical replicate, subject (or sample) quantification should be the same as run-level summarization. If there are technical replicates, subjet-level summarization(quantification) with run-level summarization will be useful for downstream analysis, such as classification.
?quantification
sampleQuant <- quantification(quant.skyline)
head(sampleQuant)
## Protein Condition1_1 Condition2_2 Condition3_3
## 1 sp|D6VTK4|STE2_YEAST 26.78088 26.90384 26.66612
## 2 sp|O13297|CET1_YEAST 24.70436 24.57744 24.62774
## 3 sp|O13329|FOB1_YEAST 23.45702 23.42441 23.73640
## 4 sp|O13539|THP2_YEAST 27.38323 25.90526 25.91699
## 5 sp|O13547|CCW14_YEAST 27.10266 26.90316 26.75663
## 6 sp|O13563|RPN13_YEAST 26.15683 26.00957 26.11312
## Condition4_4
## 1 26.75137
## 2 24.64624
## 3 23.16385
## 4 25.92043
## 5 26.97820
## 6 26.05676
groupQuant <- quantification(quant.skyline, type='Group')
head(groupQuant)
## Protein Condition1 Condition2 Condition3 Condition4
## 1 sp|D6VTK4|STE2_YEAST 26.78088 26.90384 26.66612 26.75137
## 2 sp|O13297|CET1_YEAST 24.70436 24.57744 24.62774 24.64624
## 3 sp|O13329|FOB1_YEAST 23.45702 23.42441 23.73640 23.16385
## 4 sp|O13539|THP2_YEAST 27.38323 25.90526 25.91699 25.92043
## 5 sp|O13547|CCW14_YEAST 27.10266 26.90316 26.75663 26.97820
## 6 sp|O13563|RPN13_YEAST 26.15683 26.00957 26.11312 26.05676
3.4.6 msstats.log and sessionInfo.txt
These two files are important to keep the records of package versions and options in functions.
3.5 Inference and future design of experiment with Skyline output
Challenges
Load the processed data (output of dataProcess from section 2 of MaxQuant output and 1) find the differentially abundant comparisons using
groupComparison
with the samecomparison
matrix, 2) draw volcano plot with output ofgroupComparison
, and 3) draw comparison plot with output ofgroupComparison
with MaxQuant output. and Check the comparison plot for the protein,P55249
.
load(file='data/data_MaxQuant/quant.maxquant.rda')
test.maxquant <- groupComparison(contrast.matrix=comparison, data=quant.maxquant)
MaxQuant.result <- test.maxquant$ComparisonResult
write.csv(MaxQuant.result, file='data/data_MaxQuant/testResult_ABRF_maxquant.csv')
groupComparisonPlots(data = MaxQuant.result,
type = 'VolcanoPlot',
address = 'data/data_MaxQuant/testResult_ABRF_maxquant_')
groupComparisonPlots(data = MaxQuant.result,
type = 'ComparisonPlot',
which.Protein = 'P55249',
address = 'data/data_MaxQuant/testResult_ABRF_maxquant_')
3.6 Inference and future design of experiment with OpenMS output
Challenges
Load the processed data (output of dataProcess from section 2 of OpenMS output and 1) find the differentially abundant comparisons using
groupComparison
with the samecomparison
matrix, 2) draw volcano plot with output ofgroupComparison
, and 3) draw comparison plot with output ofgroupComparison
with OpenMS output. and Check the comparison plot for the protein,sp|P55249|ZRT4_YEAST
.
load(file='data/data_OpenMS/quant.openms.rda')
test.openms <- groupComparison(contrast.matrix=comparison, data=quant.openms)
OpenMS.result <- test.openms$ComparisonResult
write.csv(OpenMS.result, file='data/data_OpenMS/testResult_ABRF_openms.csv')
groupComparisonPlots(data = MaxQuant.result,
type = 'VolcanoPlot',
address = 'data/data_OpenMS/testResult_ABRF_openms_')
groupComparisonPlots(data = MaxQuant.result,
type = 'ComparisonPlot',
which.Protein = 'sp|P55249|ZRT4_YEAST',
address = 'data/data_OpenMS/testResult_ABRF_openms_')
Here is the experimental design for ABRF 2015 study.