Chapter 1 Day 1 - Section 1 : MSstats, introduction to data and preprocessing
1.1 Objective
- Preprocessing steps to make required input format for MSstats from output from diverse output of spectral processing tools.
- Make annotation file, based on experimental design.
1.2 Workflow in MSstats
1.3 Data
- the quantified peak intensities data from ABRF 2015, processed by Skyline.
1.4 Load MSstats
Load MSstats first. Then you are ready to start MSstats.
library(MSstats)
?MSstats
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
1.5 Allowable data formats
MSstats
performs statistical analysis steps, that follow peak identification and quantitation. Therefore, input
to MSstats is the output of other software tools (such as Skyline
, MaxQuant
and so on) that read raw spectral files
, identify and quantify spectral peaks. The preferred structure of data for use in MSstats is a .csv file
in a long format with at least 10 columns representing the following variables: ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity. The variable names are fixed, but are case-insensitive.
## ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge
## 1 bovine S.PVDIDTK_5 5 NA NA
## 2 bovine S.PVDIDTK_5 5 NA NA
## 3 bovine S.PVDIDTK_5 5 NA NA
## 4 bovine S.PVDIDTK_5 5 NA NA
## 5 bovine S.PVDIDTK_5 5 NA NA
## 6 bovine S.PVDIDTK_5 5 NA NA
## IsotopeLabelType Condition BioReplicate Run Intensity
## 1 L C1 1 1 2636792
## 2 L C1 1 2 1992418
## 3 L C1 1 3 1982146
## 4 L C2 1 4 5019594
## 5 L C2 1 5 4560468
## 6 L C2 1 6 3627849
1.6 Convert to MSstats required format (Data cleaning)
Let’s start preprocessing steps to make required input format for MSstats from output from diverse output of spectral processing tools.
Data input support for various data acquisition methods : DDA, DIA, SRM
Interoperability with existing computational tools : - Converter functions for 7 data processin tools :
SkylinetoMSstatsFormat
,MaxQtoMSstatsFormat
,OpenMStoMSstatsFormat
,ProgenesistoMSstatsFormat
,PDtoMSstatsFormat
,SpectronauttoMSstatsFormat
,OpenSWATHtoMSstatsFormat
,DIAUmpiretoMSstatsFormat
- Consistent data cleaning steps across converter functions : filter multiple measurements, shared peptides, etc. - Generate the same format of data from diverse type of data format.
1.6.1 Skyline output
1.6.1.1 Read data
The required input data is generated automatically when using MSstats
report format in Skyline
.
We first load and access the dataset processed by Skyline
. The name of saved file from Skyline
using MSstats report format
is ‘ABRF2015_Skyline_report.csv’. or you can use the published data from this link (https://panoramaweb.org/labkey/project/MacCoss/brendan/manuscripts/iPRG%202015/begin.view). This example dataset is the exactly same data in ‘iPRG_10ppm_2rt_15cut_nosingle.csv’ from the link above.
# Read output from skyline
raw.skyline <- read.csv(file="data/data_Skyline/ABRF2015_Skyline_report.csv")
# Check the first 6 rows of dataset
head(raw.skyline)
## ProteinName PeptideSequence PeptideModifiedSequence
## 1 DECOY_sp|P0CF18|YM085_YEAST KDMYGNPFQK KDM[+16]YGNPFQK
## 2 DECOY_sp|P0CF18|YM085_YEAST KDMYGNPFQK KDM[+16]YGNPFQK
## 3 DECOY_sp|P0CF18|YM085_YEAST KDMYGNPFQK KDM[+16]YGNPFQK
## 4 DECOY_sp|P0CF18|YM085_YEAST KDMYGNPFQK KDM[+16]YGNPFQK
## 5 DECOY_sp|P0CF18|YM085_YEAST KDMYGNPFQK KDM[+16]YGNPFQK
## 6 DECOY_sp|P0CF18|YM085_YEAST KDMYGNPFQK KDM[+16]YGNPFQK
## PrecursorCharge PrecursorMz FragmentIon ProductCharge ProductMz
## 1 3 415.1974 precursor 3 415.1974
## 2 3 415.1974 precursor 3 415.1974
## 3 3 415.1974 precursor 3 415.1974
## 4 3 415.1974 precursor 3 415.1974
## 5 3 415.1974 precursor 3 415.1974
## 6 3 415.1974 precursor 3 415.1974
## IsotopeLabelType Condition BioReplicate FileName
## 1 light NA NA JD_06232014_sample1_B.raw
## 2 light NA NA JD_06232014_sample1_C.raw
## 3 light NA NA JD_06232014_sample1-A.raw
## 4 light NA NA JD_06232014_sample2_A.raw
## 5 light NA NA JD_06232014_sample2_B.raw
## 6 light NA NA JD_06232014_sample2_C.raw
## Area StandardType Truncated annotation_QValue
## 1 147327 NA False NA
## 2 1373397 NA False NA
## 3 71765 NA False NA
## 4 66387 NA False NA
## 5 107736 NA False NA
## 6 380812 NA False NA
There are some column named differently than required input. The information for Condition
and BioReplicate
is missing. Let’s do preliminary check for this input.
# total number of unique protein name
length(unique(raw.skyline$Protein))
## [1] 3097
# several isotopic peaks for peptide charge
unique(raw.skyline$FragmentIon)
## [1] precursor precursor [M+1] precursor [M+2]
## Levels: precursor precursor [M+1] precursor [M+2]
# unique FileName, which is MS run.
unique(raw.skyline$FileName)
## [1] JD_06232014_sample1_B.raw JD_06232014_sample1_C.raw
## [3] JD_06232014_sample1-A.raw JD_06232014_sample2_A.raw
## [5] JD_06232014_sample2_B.raw JD_06232014_sample2_C.raw
## [7] JD_06232014_sample3_A.raw JD_06232014_sample3_B.raw
## [9] JD_06232014_sample3_C.raw JD_06232014_sample4_B.raw
## [11] JD_06232014_sample4_C.raw JD_06232014_sample4-A.raw
## 12 Levels: JD_06232014_sample1_B.raw ... JD_06232014_sample4-A.raw
# 'Truncated' column
unique(raw.skyline$Truncated)
## [1] False True
## Levels: False True
# count table for 'Truncated' column
xtabs(~Truncated, raw.skyline)
## Truncated
## False True
## 237 1256666 829
# count which 'Truncated' is 'True'
sum(raw.skyline$Truncated == 'True')
## [1] 829
1.6.1.2 Set annotation file
Annotation information is required to fill in Condition and BioReplicate for corresponding Run information. Users have to prepare as csv or txt file like ‘ABRF2015_Skyline_annotation.csv’, which includes Run, Condition, and BioReplicate information, and load it in R.
annot.skyline <- read.csv(file="data/data_Skyline/ABRF2015_Skyline_annotation.csv")
annot.skyline
## Run Condition BioReplicate
## 1 JD_06232014_sample1-A.raw Condition1 1
## 2 JD_06232014_sample2_A.raw Condition2 2
## 3 JD_06232014_sample4_B.raw Condition4 4
## 4 JD_06232014_sample1_B.raw Condition1 1
## 5 JD_06232014_sample1_C.raw Condition1 1
## 6 JD_06232014_sample2_B.raw Condition2 2
## 7 JD_06232014_sample2_C.raw Condition2 2
## 8 JD_06232014_sample3_A.raw Condition3 3
## 9 JD_06232014_sample3_B.raw Condition3 3
## 10 JD_06232014_sample3_C.raw Condition3 3
## 11 JD_06232014_sample4-A.raw Condition4 4
## 12 JD_06232014_sample4_C.raw Condition4 4
1.6.1.2.1 Common mistake for annotation file : Incorrect Run information
Raw file name in the output of spectral processing tool is commonly long. We can make a typo in annotation file. If Raw file name and Run information are not matched,
Let’s check whether Run information is same as File.Name in output of Skyline
.
setdiff(unique(raw.skyline$FileName), annot.skyline$Run)
## character(0)
setdiff(annot.skyline$Run, unique(raw.skyline$FileName))
## character(0)
There is no issue in this annotation.
Let’s try the example with unmatched information.
annot.wrong <- read.csv(file="data/data_Skyline/ABRF2015_Skyline_annotation_wrong_example.csv")
annot.wrong
## Run Condition BioReplicate
## 1 JD_06232014_sample1_A.raw Condition1 1
## 2 JD_06232014_sample2_A.raw Condition2 2
## 3 JD_06232014_sample4_B.raw Condition4 4
## 4 JD_06232014_sample1_B.raw Condition1 1
## 5 JD_06232014_sample1_C.raw Condition1 1
## 6 JD_06232014_sample2_B.raw Condition2 2
## 7 JD_06232014_sample2_C.raw Condition2 2
## 8 JD_06232014_sample3_A.raw Condition3 3
## 9 JD_06232014_sample3_B.raw Condition3 3
## 10 JD_06232014_sample3_C.raw Condition3 3
## 11 JD_06232014_sample4-A.raw Condition4 4
## 12 JD_06232014_sample4_C.raw Condition4 4
Note! small details : hyphen vs underline.
1.6.1.2.2 Common mistake for annotation file : Incorrect BioReplicate
information
MSstats
distinguish the design of experiment as group comparison, time course, paired design, with the combination of Condition
, BioReplicate
, and Run
.
1.6.1.2.2.1 Group comparison
In a group comparison design, the conditions (e.g., disease states) are profiled across non-overlapping sets of biological replicates (i.e., subjects). In this example there are 2 conditions, Disease and Control (in general the number of conditions can vary). There are 3 subjects (i.e., biological replicates) per condition (in general an equal number of replicates per condition is not required). Overall, in this example there are 2 × 3 = 6 mass spectrometry runs.
The most important is that 1) subject IDs for disease group are completely different thatn subject IDs for control group 2) Run
is not order of spectral acquisition, but just unique MS run ID.
In addition, if each subject has e technical replicate runs (in general technical replicates are not required, and their number per sample may vary). there are 2 × 3 × 3 = 18 mass spectrometry runs.
1.6.1.2.2.2 Time course
The important feature of a time course experimental design is that a same subject (i.e., biological replicate) is repetitively measured across multiple time points. In this example there are 3 time points, Time1, Time2, and Time3 (in general the number of times can vary). There are 3 subjects (i.e., biological replicates) measured across times (in general an equal number of times per replicate is not required). There are no technical replicates (in general the number of technical replicates per sample may vary). Overall, in this example there are 3 × 3 = 9 mass spectrometry runs.
1.6.1.2.2.3 Paired design
Another frequently used experimental design is a paired design, where measurements from multiple conditions (such as healthy biopsy and disease biopsy) are taken from a same subject. The statistical model for this experimental design is the same as in the time course experiment, however the values in the columns of the input data may have a different appearence. In this example there are 3 subjects (in general the number of patients can vary). There are three conditions per subject (in general the number of conditions per subject can exceed two). Overall, in this example there are \(2 \times 3 = 6\) mass spectrometry runs. BioReplicate
should indicate each individual ID.
1.6.1.3 Preprocessing with SkylinetoMSstatsFormat
The input data for MSstats
is required to contain variables of ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity. These variable names should be fixed. MSstats
input from Skyline adapts the column scheme of the dataset so that it fits MSstats
input format. However there are several extra column names and also some of them need to be changed. SkylinetoMSstatsFormat
function helps pre-processing for making right format of MSstats input from Skyline output. For example, it renames some column name, and replace truncated peak intensities with NA. Another important step is to handle isotopic peaks before using dataProcess
. The output from Skyline for DDA experiment has several measurements of peak area from the monoisotopic, M+1 and M+2 peaks. To get a robust measure of peptide intensity, we can sum over isotopic peaks per peptide or use the highest peak. Here we take a summation per peptide ion.
Here is the summary of pre-processing steps in SkylinetoMSstatsFormat
function (in orange box below).
For further details, visit the help file using the following code.
?SkylinetoMSstatsFormat
# reformating and pre-processing for Skyline output.
input.skyline <- SkylinetoMSstatsFormat(raw.skyline,
annotation=annot.skyline,
removeProtein_with1Feature = TRUE)
## ** Proteins, which names include DECOY, are removed.
## ** Peptides, that are used in more than one proteins, are removed.
## Warning in SkylinetoMSstatsFormat(raw.skyline, annotation =
## annot.skyline, : NAs introduced by coercion
## ** Truncated peaks are replaced with NA.
## ** For DDA datasets, three isotopic peaks per feature and run are summed.
## ** 4 features have all NAs or zero intensity values and are removed.
## ** 13 features have 1 or 2 intensities across runs and are removed.
## ** All proteins have at least two features.
head(input.skyline)
## ProteinName PeptideSequence PrecursorCharge
## 1 sp|D6VTK4|STE2_YEAST EGEVEPVDM[+16]YTPDTAADEEARK 3
## 2 sp|D6VTK4|STE2_YEAST EGEVEPVDM[+16]YTPDTAADEEARK 3
## 3 sp|D6VTK4|STE2_YEAST EGEVEPVDM[+16]YTPDTAADEEARK 3
## 4 sp|D6VTK4|STE2_YEAST EGEVEPVDM[+16]YTPDTAADEEARK 3
## 5 sp|D6VTK4|STE2_YEAST EGEVEPVDM[+16]YTPDTAADEEARK 3
## 6 sp|D6VTK4|STE2_YEAST EGEVEPVDM[+16]YTPDTAADEEARK 3
## FragmentIon ProductCharge IsotopeLabelType Condition BioReplicate
## 1 sum NA L Condition1 1
## 2 sum NA L Condition2 2
## 3 sum NA L Condition4 4
## 4 sum NA L Condition2 2
## 5 sum NA L Condition4 4
## 6 sum NA L Condition3 3
## Run Intensity StandardType
## 1 JD_06232014_sample1_C.raw 7863713 NA
## 2 JD_06232014_sample2_A.raw 977615 NA
## 3 JD_06232014_sample4_B.raw 4102785 NA
## 4 JD_06232014_sample2_C.raw 6547299 NA
## 5 JD_06232014_sample4_C.raw 3972464 NA
## 6 JD_06232014_sample3_B.raw 8896052 NA
1.6.1.4 Preliminary check
length(unique(input.skyline$ProteinName))
## [1] 3027
sum(is.na(input.skyline$Intensity))
## [1] 219
sum(!is.na(input.skyline$Intensity) & input.skyline$Intensity==0)
## [1] 2414
1.6.2 MaxQuant output
1.6.2.1 Read data
Three files should be prepared before MSstats. Two files, ‘proteinGroups.txt’ and ‘evidence.txt’ are outputs from MaxQuant.
# First, get protein ID information
proteinGroups <- read.table("data/data_MaxQuant/proteinGroups.txt", sep = "\t", header = TRUE)
# Read in MaxQuant file: evidence.txt
evi <- read.table("data/data_MaxQuant/evidence.txt", sep="\t", header=TRUE)
colnames(evi)
## [1] "Sequence"
## [2] "Length"
## [3] "Modifications"
## [4] "Modified.sequence"
## [5] "Oxidation..M..Probabilities"
## [6] "Oxidation..M..Score.Diffs"
## [7] "Acetyl..Protein.N.term."
## [8] "Oxidation..M."
## [9] "Missed.cleavages"
## [10] "Proteins"
## [11] "Leading.proteins"
## [12] "Leading.razor.protein"
## [13] "Gene.names"
## [14] "Protein.names"
## [15] "Type"
## [16] "Raw.file"
## [17] "MS.MS.m.z"
## [18] "Charge"
## [19] "m.z"
## [20] "Mass"
## [21] "Resolution"
## [22] "Uncalibrated...Calibrated.m.z..ppm."
## [23] "Uncalibrated...Calibrated.m.z..Da."
## [24] "Mass.Error..ppm."
## [25] "Mass.Error..Da."
## [26] "Uncalibrated.Mass.Error..ppm."
## [27] "Uncalibrated.Mass.Error..Da."
## [28] "Max.intensity.m.z.0"
## [29] "Retention.time"
## [30] "Retention.length"
## [31] "Calibrated.retention.time"
## [32] "Calibrated.retention.time.start"
## [33] "Calibrated.retention.time.finish"
## [34] "Retention.time.calibration"
## [35] "Match.time.difference"
## [36] "Match.m.z.difference"
## [37] "Match.q.value"
## [38] "Match.score"
## [39] "Number.of.data.points"
## [40] "Number.of.scans"
## [41] "Number.of.isotopic.peaks"
## [42] "PIF"
## [43] "Fraction.of.total.spectrum"
## [44] "Base.peak.fraction"
## [45] "PEP"
## [46] "MS.MS.Count"
## [47] "MS.MS.Scan.Number"
## [48] "Score"
## [49] "Delta.score"
## [50] "Combinatorics"
## [51] "Intensity"
## [52] "Reverse"
## [53] "Potential.contaminant"
## [54] "id"
## [55] "Protein.group.IDs"
## [56] "Peptide.ID"
## [57] "Mod..peptide.ID"
## [58] "MS.MS.IDs"
## [59] "Best.MS.MS"
## [60] "AIF.MS.MS.IDs"
## [61] "Oxidation..M..site.IDs"
unique(evi$Raw.file)
## [1] JD_06232014_sample1-A JD_06232014_sample2_A JD_06232014_sample4_B
## [4] JD_06232014_sample1_B JD_06232014_sample1_C JD_06232014_sample2_B
## [7] JD_06232014_sample2_C JD_06232014_sample3_A JD_06232014_sample3_B
## [10] JD_06232014_sample3_C JD_06232014_sample4-A JD_06232014_sample4_C
## 12 Levels: JD_06232014_sample1_B ... JD_06232014_sample4-A
One file is for annotation information, required to fill in Condition and BioReplicate for corresponding Run information. Users have to prepare as csv or txt file like ‘ABRF2015_MaxQuant_annotation.csv’, which includes Run, Condition, and BioReplicate information, and load it in R.
1.6.2.2 Set annotation file
Annotation information is required to fill in Condition and BioReplicate for corresponding Raw.file information. Users have to prepare as csv or txt file like ‘ABRF2015_MaxQuant_annotation.csv’, which includes Raw.file, Condition, and BioReplicate information, and load it in R. Raw.file column in the annotation file should be the same as unique Raw.file in evidence.txt file.
# Read in annotation including condition and biological replicates: annotation.csv
annot.maxquant <- read.csv("data/data_MaxQuant/ABRF2015_MaxQuant_annotation.csv", header = TRUE)
annot.maxquant
## Raw.file Condition BioReplicate IsotopeLabelType
## 1 JD_06232014_sample1-A Condition1 1 L
## 2 JD_06232014_sample2_A Condition2 2 L
## 3 JD_06232014_sample4_B Condition4 4 L
## 4 JD_06232014_sample1_B Condition1 1 L
## 5 JD_06232014_sample1_C Condition1 1 L
## 6 JD_06232014_sample2_B Condition2 2 L
## 7 JD_06232014_sample2_C Condition2 2 L
## 8 JD_06232014_sample3_A Condition3 3 L
## 9 JD_06232014_sample3_B Condition3 3 L
## 10 JD_06232014_sample3_C Condition3 3 L
## 11 JD_06232014_sample4-A Condition4 4 L
## 12 JD_06232014_sample4_C Condition4 4 L
1.6.2.3 Preprocessing with MaxQtoMSstatsFormat
MaxQtoMSstatsFormat
function helps pre-processing for making right format of MSstats input from MaxQuant output. Basically, this function gets peptide ion intensity from ‘evidence.txt’
file. In addition, there are several steps to filter out or to modify the data in order to get required information.
Here is the summary of pre-processing steps in MaxQtoMSstatsFormat
function (in orange box below).
?MaxQtoMSstatsFormat
# reformating and pre-processing for MaxQuant output.
# no protein with 1 peptide
input.maxquant <- MaxQtoMSstatsFormat(evidence=evi,
annotation=annot.maxquant,
proteinGroups=proteinGroups,
removeProtein_with1Peptide=TRUE)
## ** + Contaminant, + Reverse, + Only.identified.by.site, proteins are removed.
## ** Peptide and charge, that have 1 or 2 measurements across runs, are removed.
## ** 282 proteins, which have only peptide and charge in a protein, are removed among 3157 proteins.
head(input.maxquant)
## ProteinName PeptideSequence PrecursorCharge FragmentIon
## 1 D6VTK4 EGEVEPVDMYTPDTAADEEARK 3 NA
## 2 D6VTK4 FYPGTLSSFQTDSINNDAK 2 NA
## 3 D6VTK4 IGPFADASYK 2 NA
## 4 D6VTK4 NQFYQLPTPTSSK 2 NA
## 5 D6VTK4 TFVSETADDIEK 2 NA
## 6 D6VTK4 TNTITSDFTTSTDR 2 NA
## ProductCharge IsotopeLabelType Condition BioReplicate
## 1 NA L Condition1 1
## 2 NA L Condition1 1
## 3 NA L Condition1 1
## 4 NA L Condition1 1
## 5 NA L Condition1 1
## 6 NA L Condition1 1
## Run Intensity
## 1 JD_06232014_sample1_B 87141000
## 2 JD_06232014_sample1_B 46167000
## 3 JD_06232014_sample1_B 45425000
## 4 JD_06232014_sample1_B 47094000
## 5 JD_06232014_sample1_B NA
## 6 JD_06232014_sample1_B 62786000
1.6.2.4 Preliminary check
length(unique(input.maxquant$ProteinName))
## [1] 2875
sum(is.na(input.maxquant$Intensity))
## [1] 40056
sum(!is.na(input.maxquant$Intensity) & input.maxquant$Intensity==0)
## [1] 0
Challenge
- Let’s check unique information for
Run
,BioReplicate
andCondition
ininput.maxquant
.
unique(input.maxquant[, c('Run', 'BioReplicate', 'Condition')])
## Run BioReplicate Condition
## 1 JD_06232014_sample1_B 1 Condition1
## 30600 JD_06232014_sample1_C 1 Condition1
## 61199 JD_06232014_sample1-A 1 Condition1
## 91798 JD_06232014_sample2_A 2 Condition2
## 122397 JD_06232014_sample2_B 2 Condition2
## 152996 JD_06232014_sample2_C 2 Condition2
## 183595 JD_06232014_sample3_A 3 Condition3
## 214194 JD_06232014_sample3_B 3 Condition3
## 244793 JD_06232014_sample3_C 3 Condition3
## 275392 JD_06232014_sample4_B 4 Condition4
## 305991 JD_06232014_sample4_C 4 Condition4
## 336590 JD_06232014_sample4-A 4 Condition4
1.6.3 OpenMS output
1.6.3.1 Read data
# First, read output of OpenMS
raw.openMS <- read.csv("data/data_OpenMS/ABRF2015_OpenMS_raw.csv", stringsAsFactors=F) # the data file
head(raw.openMS)
## ProteinName PeptideSequence PrecursorCharge FragmentIon
## 1 sp|P09938|RIR2_YEAST AAADALSDLEIK 2 NA
## 2 sp|P09938|RIR2_YEAST AAADALSDLEIK 2 NA
## 3 sp|P09938|RIR2_YEAST AAADALSDLEIK 2 NA
## 4 sp|P09938|RIR2_YEAST AAADALSDLEIK 2 NA
## 5 sp|P09938|RIR2_YEAST AAADALSDLEIK 2 NA
## 6 sp|P09938|RIR2_YEAST AAADALSDLEIK 2 NA
## ProductCharge IsotopeLabelType Condition BioReplicate Run Intensity
## 1 0 L 1 1 1 391797000
## 2 0 L 4 4 10 103656000
## 3 0 L 4 4 11 361107000
## 4 0 L 1 1 2 456756000
## 5 0 L 1 1 3 389268000
## 6 0 L 2 2 4 433488000
1.6.3.2 Set annotation file
The output from OpenMS
already includes Run
, BioReplicate
, Condition
information. Let’s check it.
unique(raw.openMS[, c('Run', 'BioReplicate', 'Condition')])
## Run BioReplicate Condition
## 1 1 1 1
## 2 10 4 4
## 3 11 4 4
## 4 2 1 1
## 5 3 1 1
## 6 4 2 2
## 7 5 2 2
## 8 6 2 2
## 9 7 3 3
## 10 8 3 3
## 11 9 3 3
## 14 12 4 4
1.6.3.3 Preprocessing with OpenMStoMSstatsFormat
There are several steps to filter out or to modify the data in order to get required information.
Here is the summary of pre-processing steps in OpenMStoMSstatsFormat
function.
?OpenMStoMSstatsFormat
# reformating and pre-processing for OpenMS output.
input.openms <- OpenMStoMSstatsFormat(raw.openMS,
removeProtein_with1Feature=TRUE)
## ** 0 features have all NAs or zero intensity values and are removed.
## ** All peptides are unique peptides in proteins.
## ** 909 features have 1 or 2 intensities across runs and are removed.
## ** 698 proteins, which have only one feature in a protein, are removed among 2538 proteins.
## ** No multiple measurements in a feature and a run.
## Warning in OpenMStoMSstatsFormat(raw.openMS, removeProtein_with1Feature =
## TRUE): NAs introduced by coercion
## now 'input.openms' is ready for MSstats
head(input.openms)
## ProteinName PeptideSequence PrecursorCharge FragmentIon
## 1 sp|D6VTK4|STE2_YEAST EGEVEPVDMYTPDTAADEEARK 3 NA
## 2 sp|D6VTK4|STE2_YEAST FYPGTLSSFQTDSINNDAK 2 NA
## 3 sp|D6VTK4|STE2_YEAST IGPFADASYK 2 NA
## 4 sp|D6VTK4|STE2_YEAST NQFYQLPTPTSSK 2 NA
## 5 sp|D6VTK4|STE2_YEAST TFVSETADDIEK 2 NA
## 6 sp|D6VTK4|STE2_YEAST TNTITSDFTTSTDR 2 NA
## ProductCharge IsotopeLabelType Condition BioReplicate Run Intensity
## 1 0 L 1 1 1 64757900
## 2 0 L 1 1 1 38852700
## 3 0 L 1 1 1 73225800
## 4 0 L 1 1 1 63139900
## 5 0 L 1 1 1 NA
## 6 0 L 1 1 1 58905300
1.6.3.4 Preliminary check for preprocessed data
length(unique(input.openms$ProteinName))
## [1] 1840
sum(is.na(input.openms$Intensity))
## [1] 21962
sum(!is.na(input.openms$Intensity) & input.openms$Intensity==0)
## [1] 0
table(input.openms$Run)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 15560 15560 15560 15560 15560 15560 15560 15560 15560 15560 15560 15560
Challenge
- Let’s check unique information for
Run
,BioReplicate
andCondition
.
unique(input.openms[, c('Run', 'BioReplicate', 'Condition')])
## Run BioReplicate Condition
## 1 1 1 1
## 15561 10 4 4
## 31121 11 4 4
## 46681 12 4 4
## 62241 2 1 1
## 77801 3 1 1
## 93361 4 2 2
## 108921 5 2 2
## 124481 6 2 2
## 140041 7 3 3
## 155601 8 3 3
## 171161 9 3 3
1.6.4 Progenesis output
1.6.4.1 Read data
# First, read output of Progenesis
raw.progenesis <- read.csv("data/data_Progenesis/ABRF2015_Progenesis_raw.csv", stringsAsFactors=F) # the data file
head(raw.progenesis)
## X X.1 X.2 X.3 X.4
## 1
## 2 # Retention time (min) Charge m/z Measured mass
## 3 16 52.5563333333333 2 501.781277638303 1001.54800234285
## 4 32 38.15255 2 474.251481407549 946.488409881339
## 5 11167 36.2224333333333 2 474.25154745893 946.488541984099
## 6 41 45.5598 2 371.731536419815 741.448519905869
## X.5 X.6 X.7 X.8 X.9
## 1
## 2 Mass error (u) Mass error (ppm) Score Sequence Modifications
## 3 -0.00255665715405939 -2.55269904358308 1 TANDVLTIR
## 4 -0.00219111866147159 -2.31499251990099 1 VTDGVMVAR
## 5 -0.0020590159006133 -2.17542139186367 1 VTDGVMVAR
## 6 -0.00120309413080122 -1.62262402086192 0.9996 AGLNIVR
## X.10
## 1
## 2 Accession
## 3 sp|P00549|KPYK1_YEAST
## 4 sp|P00549|KPYK1_YEAST
## 5 sp|P00549|KPYK1_YEAST
## 6 sp|P00549|KPYK1_YEAST
## X.11
## 1
## 2 Description
## 3 Pyruvate kinase 1 OS=Saccharomyces cerevisiae (strain ATCC 204508 \\ S288c) GN=CDC19 PE=1 SV=2
## 4 Pyruvate kinase 1 OS=Saccharomyces cerevisiae (strain ATCC 204508 \\ S288c) GN=CDC19 PE=1 SV=2
## 5 Pyruvate kinase 1 OS=Saccharomyces cerevisiae (strain ATCC 204508 \\ S288c) GN=CDC19 PE=1 SV=2
## 6 Pyruvate kinase 1 OS=Saccharomyces cerevisiae (strain ATCC 204508 \\ S288c) GN=CDC19 PE=1 SV=2
## X.12 X.13 X.14
## 1
## 2 Use in quantitation Max fold change Highest mean condition
## 3 False 1.23101575731737 A
## 4 False 1.35108253622201 B
## 5 False 1.25419527606242 B
## 6 False 1.04868912680216 A
## X.15 X.16 X.17
## 1
## 2 Lowest mean condition Anova Maximum CV
## 3 C 0.0522715538027003 12.0175133289667
## 4 A 0.0393818452091522 26.8776079679151
## 5 A 0.253277920596793 27.2310093101224
## 6 B 0.993981434364646 27.0631636013386
## Normalized.abundance X.18 X.19
## 1 A
## 2 JD_06232014_sample1-A JD_06232014_sample2_A JD_06232014_sample3_A
## 3 234646642.659118 246323351.490501 306102714.66799
## 4 179120293.733639 104309665.701784 136741892.392964
## 5 2233197.90782367 1134566.5998162 1574437.81004362
## 6 123797188.716029 122761256.64621 116107425.243685
## X.20 X.21 X.22
## 1 B
## 2 JD_06232014_sample4-A JD_06232014_sample1_B JD_06232014_sample2_B
## 3 257629531.217182 235779468.539422 236753257.546934
## 4 105011188.469111 182469696.644175 183285243.781685
## 5 1701362.72342001 2336796.51015815 1788630.29256942
## 6 63610598.879437 108255803.660911 108785457.069653
## X.23 X.24 X.25
## 1 C
## 2 JD_06232014_sample3_B JD_06232014_sample4_B JD_06232014_sample1_C
## 3 186699807.218591 242959514.796972 223435557.783206
## 4 162853464.030243 180957229.609825 186073547.948691
## 5 1932732.32106691 2274168.76697097 2182403.4051037
## 6 92469286.5619254 96974519.2450418 115737794.475905
## X.26 X.27 X.28
## 1
## 2 JD_06232014_sample2_C JD_06232014_sample3_C JD_06232014_sample4_C
## 3 220456628.684641 197285091.954229 207473304.500931
## 4 163946644.909844 150247529.397207 176820166.306319
## 5 2040841.8048229 1497501.50492545 2156318.09540588
## 6 100428622.768589 109442962.863466 83998024.0208531
## Raw.abundance X.29 X.30
## 1 A
## 2 JD_06232014_sample1-A JD_06232014_sample2_A JD_06232014_sample3_A
## 3 244531299.931508 221199440.186087 277078923.760572
## 4 186665863.932395 93670533.1411598 123776414.130536
## 5 2327272.96336292 1018845.73758267 1425154.0840074
## 6 129012233.635811 110240142.001841 105098448.610706
## X.31 X.32 X.33
## 1 B
## 2 JD_06232014_sample4-A JD_06232014_sample1_B JD_06232014_sample2_B
## 3 213112377.670857 265826760.55748 265610928.042007
## 4 86865756.2312952 205723291.596601 205625739.64841
## 5 1407375.36397932 2634593.46237989 2006645.04833363
## 6 52618990.9526949 122051719.672593 122045231.85501
## X.34 X.35 X.36
## 1 C
## 2 JD_06232014_sample3_B JD_06232014_sample4_B JD_06232014_sample1_C
## 3 219812880.452275 242959514.796972 210648874.118526
## 4 191737150.420336 180957229.609825 175425002.929311
## 5 2275521.67817463 2274168.76697097 2057509.66730008
## 6 108869636.960823 96974519.2450418 109114396.746851
## X.37 X.38 X.39
## 1
## 2 JD_06232014_sample2_C JD_06232014_sample3_C JD_06232014_sample4_C
## 3 207260386.701525 217212520.118758 194785326.277837
## 4 154133015.755399 165423774.187439 166006773.109084
## 5 1918679.71576989 1648761.5589421 2024448.99975338
## 6 94417098.3431607 120497608.498225 78861145.7987456
## Spectral.counts X.40 X.41
## 1 A
## 2 JD_06232014_sample1-A JD_06232014_sample2_A JD_06232014_sample3_A
## 3 2 1 1
## 4 1 1 1
## 5 1 0 1
## 6 2 2 2
## X.42 X.43 X.44
## 1 B
## 2 JD_06232014_sample4-A JD_06232014_sample1_B JD_06232014_sample2_B
## 3 3 2 3
## 4 1 1 1
## 5 1 0 0
## 6 2 2 2
## X.45 X.46 X.47
## 1 C
## 2 JD_06232014_sample3_B JD_06232014_sample4_B JD_06232014_sample1_C
## 3 2 2 2
## 4 1 1 1
## 5 0 1 1
## 6 2 2 2
## X.48 X.49 X.50
## 1
## 2 JD_06232014_sample2_C JD_06232014_sample3_C JD_06232014_sample4_C
## 3 3 1 2
## 4 1 1 1
## 5 0 1 0
## 6 2 2 2
One file is for annotation information, required to fill in Condition and BioReplicate for corresponding Run information. Users have to prepare as csv or txt file like ‘ABRF2015_Progenesis_annotation.csv’, which includes Run, Condition, and BioReplicate information, and load it in R.
1.6.4.2 Set annotation file
Annotation information is required to fill in Condition and BioReplicate for corresponding Run information. Users have to prepare as csv or txt file like ‘ABRF2015_Progenesis_annotation.csv’, which includes Run, Condition, and BioReplicate information, and load it in R.
## Read in annotation including condition and biological replicates: ABRF2015_Progenesis_annotation.csv
annot.progenesis <- read.csv("data/data_Progenesis/ABRF2015_Progenesis_annotation.csv", header = TRUE)
annot.progenesis
## Run Condition BioReplicate
## 1 JD_06232014_sample1-A Condition1 1
## 2 JD_06232014_sample2_A Condition2 2
## 3 JD_06232014_sample4_B Condition4 4
## 4 JD_06232014_sample1_B Condition1 1
## 5 JD_06232014_sample1_C Condition1 1
## 6 JD_06232014_sample2_B Condition2 2
## 7 JD_06232014_sample2_C Condition2 2
## 8 JD_06232014_sample3_A Condition3 3
## 9 JD_06232014_sample3_B Condition3 3
## 10 JD_06232014_sample3_C Condition3 3
## 11 JD_06232014_sample4-A Condition4 4
## 12 JD_06232014_sample4_C Condition4 4
!! Run
information should be the same as annotated in output of Progenesis
.
1.6.4.3 Preprocessing with ProgenesistoMSstatsFormat
The output from Progenesis
includes peptide ion-level quantification for each MS runs. ProgenesistoMSstatsFormat
function helps pre-processing for making right format of MSstats
input from Progenesis
output. Basically, this function reformats wide format to long format. It provide Raw.abundance, Normalized.abundance and Spectral count columns. This converter uses Raw.abundance columns for Intensity values. In addition, there are several steps to filter out or to modify the data in order to get required information.
Here is the summary of pre-processing steps in ProgenesistoMSstatsFormat
function (in orange box below).
?ProgenesistoMSstatsFormat
# reformating and pre-processing for Progenesis output.
input.progenesis <- ProgenesistoMSstatsFormat(raw.progenesis,
annotation=annot.progenesis,
removeProtein_with1Peptide=TRUE)
## ** Peptides, that are used in more than one proteins, are removed.
## ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows.
## ** 898 proteins, which have only one feature in a protein, are removed among 3913 proteins.
## now 'input.progenesis' is ready for MSstats
head(input.progenesis)
## ProteinName PeptideModifiedSequence PrecursorCharge
## 1 sp|A5Z2X5|YP010_YEAST LTGNPELSSLDEVLAK 2
## 2 sp|A5Z2X5|YP010_YEAST RPAQLLL[N-term] nTerm+42.0106 2
## 3 sp|D6VTK4|STE2_YEAST EGEVEPVDMYTPDTAADEEARK 3
## 4 sp|D6VTK4|STE2_YEAST FYPGTLSSFQTDSINNDAK 2
## 5 sp|D6VTK4|STE2_YEAST IGPFADASYK 2
## 6 sp|O13297|CET1_YEAST IAGNAVGSVVK 2
## FragmentIon ProductCharge IsotopeLabelType Condition BioReplicate
## 1 NA NA L Condition1 1
## 2 NA NA L Condition1 1
## 3 NA NA L Condition1 1
## 4 NA NA L Condition1 1
## 5 NA NA L Condition1 1
## 6 NA NA L Condition1 1
## Run Intensity
## 1 JD_06232014_sample1_B 2550357.3
## 2 JD_06232014_sample1_B 865491.8
## 3 JD_06232014_sample1_B 3767734.3
## 4 JD_06232014_sample1_B 2591735.2
## 5 JD_06232014_sample1_B 2301483.1
## 6 JD_06232014_sample1_B 415181.8
1.6.4.4 Preliminary check for preprocessed data
length(unique(input.progenesis$ProteinName))
## [1] 3015
sum(is.na(input.progenesis$Intensity))
## [1] 0
sum(!is.na(input.progenesis$Intensity) & input.progenesis$Intensity==0)
## [1] 418
table(input.progenesis$Run)
##
## JD_06232014_sample1-A JD_06232014_sample2_A JD_06232014_sample3_A
## 9646 9646 9646
## JD_06232014_sample4-A JD_06232014_sample1_B JD_06232014_sample2_B
## 9646 9646 9646
## JD_06232014_sample3_B JD_06232014_sample4_B JD_06232014_sample1_C
## 9646 9646 9646
## JD_06232014_sample2_C JD_06232014_sample3_C JD_06232014_sample4_C
## 9646 9646 9646
Challenge
- Let’s check unique information for
Run
,BioReplicate
andCondition
.
unique(input.progenesis[, c('Run', 'BioReplicate', 'Condition')])
## Run BioReplicate Condition
## 1 JD_06232014_sample1_B 1 Condition1
## 10547 JD_06232014_sample1_C 1 Condition1
## 21093 JD_06232014_sample1-A 1 Condition1
## 31639 JD_06232014_sample2_A 2 Condition2
## 42185 JD_06232014_sample2_B 2 Condition2
## 52731 JD_06232014_sample2_C 2 Condition2
## 63277 JD_06232014_sample3_A 3 Condition3
## 73823 JD_06232014_sample3_B 3 Condition3
## 84369 JD_06232014_sample3_C 3 Condition3
## 94915 JD_06232014_sample4_B 4 Condition4
## 105461 JD_06232014_sample4_C 4 Condition4
## 116007 JD_06232014_sample4-A 4 Condition4
1.7 Save your work
We can save the data that we made so far.
save(input.skyline, file='data/data_Skyline/input.skyline.rda')
save(input.maxquant, file='data/data_MaxQuant/input.maxquant.rda')
save(input.openms, file='data/data_OpenMS/input.openms.rda')
save(input.progenesis, file='data/data_Progenesis/input.progenesis.rda')