Chapter 4 Day 2 - Section 4 : MSstatsTMT, introductioin to data and preprocessing

4.1 Objective

Preprocessing steps to make required input format for MSstatsTMT from output from diverse output of spectral processing and peptide quantification tools.
Make annotation file, based on experimental design.

4.2 Data

Controlled mixtures: Sigma UPS1 48 protein-mix were spiked at 4 different ratio into a SILAC-labelled HeLa lysate. The mixtures were measured by TMT 10-plexes.
the peptide quantification data of controlled mixtures, processed by Proteome Discoverer and MaxQuant.

4.3 Load MSstatsTMT

Load MSstatsTMT first. Then you are ready to start MSstats.

# install MSstatsTMT from bioconductor
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("MSstatsTMT")

library(MSstatsTMT) # make sure it is version 1.2.1

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

?MSstatsTMT

***

4.4 Allowable data formats

MSstatsTMT performs statistical analysis steps, that follow peptide identification and quantitation. Therefore, input to MSstatsTMT is the output of other software tools (such as Proteome Discoverer, MaxQuant and so on) that read raw spectral files , identify and quantify peptide ions. The preferred structure of data for use in MSstatsTMT is a .csv file in a long format with at least 10 columns representing the following variables: ProteinName, PeptideSequence, Charge, PSM, Channel, Condition, BioReplicate, Mixture, TechRepMixture, Intensity. The variable names are fixed, but are case-insensitive.

##   ProteinName               PeptideSequence Charge
## 1      P04406        [K].lISWYDNEFGYSNR.[V]      2
## 2      Q9NSD9           [K].irPFAVAAVLr.[N]      3
## 3      P04406    [K].lVINGNPITIFQErDPSk.[I]      3
## 4      P04406          [R].vVDLmAHMASkE.[-]      3
## 5      P06576      [R].dQEGQDVLLFIDNIFR.[F]      3
## 6      P06576 [R].iPSAVGYQPTLATDMGTMQEr.[I]      3
##                               PSM  Mixture TechRepMixture
## 1        [K].lISWYDNEFGYSNR.[V]_2 Mixture1              1
## 2           [K].irPFAVAAVLr.[N]_3 Mixture1              1
## 3    [K].lVINGNPITIFQErDPSk.[I]_3 Mixture1              1
## 4          [R].vVDLmAHMASkE.[-]_3 Mixture1              1
## 5      [R].dQEGQDVLLFIDNIFR.[F]_3 Mixture1              1
## 6 [R].iPSAVGYQPTLATDMGTMQEr.[I]_3 Mixture1              1
##                                            Run Channel Condition
## 1 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 2 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 3 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 4 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 5 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 6 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
##    BioReplicate   Intensity
## 1 Mixture1_Norm    8348.351
## 2 Mixture1_Norm   28327.492
## 3 Mixture1_Norm 1275010.965
## 4 Mixture1_Norm   80589.877
## 5 Mixture1_Norm    2231.389
## 6 Mixture1_Norm  144854.307

Let’s start preprocessing steps to make required input format for MSstatsTMT from output from diverse output of peptide quantification tools.

4.5 Proteome Discoverer output

4.5.1 Read data

The required input data is the PSM-level data generated by Proteome Discoverer 2.2. We first load and access the dataset processed by Proteome Discoverer. The file name is ‘spikedin_PSMs.txt’.

# Read output from Proteome Discoverer 
raw.pd <- read.delim(file="data/data_ProteomeDiscoverer_TMT/spikedin_PSMs.txt")

# Check the column names
colnames(raw.pd)

##  [1] "Checked"                     "Confidence"                 
##  [3] "Identifying.Node"            "PSM.Ambiguity"              
##  [5] "Annotated.Sequence"          "Modifications"              
##  [7] "Marked.as"                   "X..Protein.Groups"          
##  [9] "X..Proteins"                 "Master.Protein.Accessions"  
## [11] "Master.Protein.Descriptions" "Protein.Accessions"         
## [13] "Protein.Descriptions"        "X..Missed.Cleavages"        
## [15] "Charge"                      "DeltaScore"                 
## [17] "DeltaCn"                     "Rank"                       
## [19] "Search.Engine.Rank"          "m.z..Da."                   
## [21] "MH...Da."                    "Theo..MH...Da."             
## [23] "DeltaM..ppm."                "Deltam.z..Da."              
## [25] "Activation.Type"             "MS.Order"                   
## [27] "Isolation.Interference...."  "Average.Reporter.S.N"       
## [29] "Ion.Inject.Time..ms."        "RT..min."                   
## [31] "First.Scan"                  "Spectrum.File"              
## [33] "File.ID"                     "Abundance..126"             
## [35] "Abundance..127N"             "Abundance..127C"            
## [37] "Abundance..128N"             "Abundance..128C"            
## [39] "Abundance..129N"             "Abundance..129C"            
## [41] "Abundance..130N"             "Abundance..130C"            
## [43] "Abundance..131"              "Quan.Info"                  
## [45] "Ions.Score"                  "Identity.Strict"            
## [47] "Identity.Relaxed"            "Expectation.Value"          
## [49] "Percolator.q.Value"          "Percolator.PEP"

The column names are differently from required input. Let’s do preliminary check for this input.

# total number of unique protein name
proteins <- unique(raw.pd$Protein.Accessions)
length(proteins)

## [1] 50

# show the spiked-in proteins
proteins[grepl("ups",proteins)]

##  [1] P02788ups   P62988ups   P68871ups   P02741ups   P01008ups  
##  [6] P10636-8ups P02787ups   P02753ups   P00915ups   P05413ups  
## [11] P01344ups   P01579ups   P01031ups   P01375ups   P02144ups  
## 50 Levels: A1L0T0 O00625 O60427 P00915ups P01008ups ... Q9Y6C9

# total number of unique peptide names
length(unique(raw.pd$Annotated.Sequence))

## [1] 461

# unique Spectrum.File, which is TMT run.
unique(raw.pd$Spectrum.File)

##  [1] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw
##  [2] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_02.raw
##  [3] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
##  [4] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_02.raw
##  [5] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_02.raw
##  [6] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_01.raw
##  [7] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_02.raw
##  [8] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_03.raw
##  [9] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_02.raw
## [10] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_01.raw
## [11] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_03.raw
## [12] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_03.raw
## [13] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_01.raw
## [14] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_03.raw
## [15] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01.raw
## 15 Levels: 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw ...

4.5.2 Prepare annotation file

MSstatsTMT can make inference for group comparison design. In a group comparison design, the conditions (e.g., disease states) are profiled across non-overlapping sets of biological replicates (i.e., subjects). In this example there are 4 conditions, 0.125, 0.5, 0.667, 1 (in general the number of conditions can vary). There are 2 subjects per condition per MS run (in general an equal number of replicates per condition is not required). Besides 2 pooled control replicates per run for across TMT-plex normalizations. Totally, each mixture has 10 replicates. There are 5 mixtures and each mixture has 3 technical replicate runs (in general technical replicates are not required, and their number per sample may vary). Overall, in this example there are 5 × 3 = 15 mass spectrometry runs and 5 × 3 x 10 = 150 replicates.

`Mixture`	`Run`
1	161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
1	161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
1	161117_SILAC_HeLa_UPS1_TMT10_Mixture1_03.raw
2	161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
2	161117_SILAC_HeLa_UPS1_TMT10_Mixture2_02.raw
2	161117_SILAC_HeLa_UPS1_TMT10_Mixture2_03.raw
3	161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
3	161117_SILAC_HeLa_UPS1_TMT10_Mixture3_02.raw
3	161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw
4	161117_SILAC_HeLa_UPS1_TMT10_Mixture4_01.raw
4	161117_SILAC_HeLa_UPS1_TMT10_Mixture4_02.raw
4	161117_SILAC_HeLa_UPS1_TMT10_Mixture4_03.raw
5	161117_SILAC_HeLa_UPS1_TMT10_Mixture5_01.raw
5	161117_SILAC_HeLa_UPS1_TMT10_Mixture5_02.raw
5	161117_SILAC_HeLa_UPS1_TMT10_Mixture5_03.raw

Here show run 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw as an example,

`Run`	`Channel`	`BioReplicate`	`Condition`
1	X127C	1.X127C	0.125
1	X129N	1.X129N	0.125
1	X128N	1.X128N	0.5
1	X129C	1.X129C	0.5
1	X127N	1.X127N	0.667
1	X130C	1.X130C	0.667
1	X128C	1.X128C	1
1	X130N	1.X130N	1
1	X126	1.X126	Norm
1	X131	1.X131	Norm

The most important is that 1) Run is not order of spectral acquisition, but just unique MS run ID 2) if normalization between runs need to be done, then each run must have at least one Norm channel 3) Channel column in the annotation file should match with the corresponding channel names in in output of Proteome Discoverer 4) Spectrum.File in the PSM output of Proteome Discoverer should be the Run column in the annotation file 5) If one channel of one mixture doesn’t have sample, put None under Condition and BioReplicate column .

Annotation information is required to fill in Condition, TechRepMixture, Mixture and BioReplicate for corresponding Run and Channel information. Users have to prepare as csv or txt file like ‘PD_Annotation.csv’, which includes Run, Channel, Condition, BioReplicate, TechRepMixture and Mixture information, and load it in R.

annot.pd <- read.csv(file="data/data_ProteomeDiscoverer_TMT/PD_Annotation.csv")
head(annot.pd)

##                                        Run  Mixture TechRepMixture
## 1 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01 Mixture2              1
## 2 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01 Mixture2              1
## 3 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01 Mixture2              1
## 4 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01 Mixture2              1
## 5 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01 Mixture2              1
## 6 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01 Mixture2              1
##   Fraction   Channel Condition BioReplicate
## 1       F1 channel.0      Norm         Norm
## 2       F1 channel.1       0.5          0.5
## 3       F1 channel.2         1            1
## 4       F1 channel.3     0.667        0.667
## 5       F1 channel.4     0.125        0.125
## 6       F1 channel.5         1            1

4.5.2.1 Prepare the file with run information

Raw Spectrum file name in the output of peptide quantification tool should be the Run information in the annotation file.

Let’s start with Spectrum.File in PSM output of Proteome Discoverer.

runs <- unique(raw.pd$Spectrum.File) # MS runs
runs

##  [1] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw
##  [2] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_02.raw
##  [3] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
##  [4] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_02.raw
##  [5] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_02.raw
##  [6] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_01.raw
##  [7] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_02.raw
##  [8] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_03.raw
##  [9] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_02.raw
## [10] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_01.raw
## [11] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_03.raw
## [12] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_03.raw
## [13] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_01.raw
## [14] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_03.raw
## [15] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01.raw
## 15 Levels: 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw ...

The run ID has the mixture and technical replicate information

Let’s try to extract the mixture and technical replicate information.

library(tidyr)
library(dplyr)

Run_info <- data.frame(Run = runs) # initialize the run file

# use function separate()
?separate

## Add the mixture and technical replicate information
Run_info <- Run_info %>% 
  separate(Run, c("Mixture", "TechRepMixture"), sep="_0", remove = FALSE) 
Run_info

##                                             Run
## 1  161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw
## 2  161117_SILAC_HeLa_UPS1_TMT10_Mixture4_02.raw
## 3  161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
## 4  161117_SILAC_HeLa_UPS1_TMT10_Mixture5_02.raw
## 5  161117_SILAC_HeLa_UPS1_TMT10_Mixture2_02.raw
## 6  161117_SILAC_HeLa_UPS1_TMT10_Mixture5_01.raw
## 7  161117_SILAC_HeLa_UPS1_TMT10_Mixture1_02.raw
## 8  161117_SILAC_HeLa_UPS1_TMT10_Mixture4_03.raw
## 9  161117_SILAC_HeLa_UPS1_TMT10_Mixture3_02.raw
## 10 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_01.raw
## 11 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_03.raw
## 12 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_03.raw
## 13 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_01.raw
## 14 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_03.raw
## 15 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01.raw
##                                  Mixture TechRepMixture
## 1  161117_SILAC_HeLa_UPS1_TMT10_Mixture3          3.raw
## 2  161117_SILAC_HeLa_UPS1_TMT10_Mixture4          2.raw
## 3  161117_SILAC_HeLa_UPS1_TMT10_Mixture1          1.raw
## 4  161117_SILAC_HeLa_UPS1_TMT10_Mixture5          2.raw
## 5  161117_SILAC_HeLa_UPS1_TMT10_Mixture2          2.raw
## 6  161117_SILAC_HeLa_UPS1_TMT10_Mixture5          1.raw
## 7  161117_SILAC_HeLa_UPS1_TMT10_Mixture1          2.raw
## 8  161117_SILAC_HeLa_UPS1_TMT10_Mixture4          3.raw
## 9  161117_SILAC_HeLa_UPS1_TMT10_Mixture3          2.raw
## 10 161117_SILAC_HeLa_UPS1_TMT10_Mixture3          1.raw
## 11 161117_SILAC_HeLa_UPS1_TMT10_Mixture5          3.raw
## 12 161117_SILAC_HeLa_UPS1_TMT10_Mixture1          3.raw
## 13 161117_SILAC_HeLa_UPS1_TMT10_Mixture4          1.raw
## 14 161117_SILAC_HeLa_UPS1_TMT10_Mixture2          3.raw
## 15 161117_SILAC_HeLa_UPS1_TMT10_Mixture2          1.raw

Then let’s clean the Mixture ID and TechRepMixture ID

## clean the run file and add fraction information 
Run_info <- Run_info %>% 
  mutate(Mixture = gsub("161117_SILAC_HeLa_UPS1_TMT10_", "", Mixture),
         TechRepMixture = gsub(".raw", "", TechRepMixture),
         Fraction = "F1")
Run_info

##                                             Run  Mixture TechRepMixture
## 1  161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw Mixture3              3
## 2  161117_SILAC_HeLa_UPS1_TMT10_Mixture4_02.raw Mixture4              2
## 3  161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw Mixture1              1
## 4  161117_SILAC_HeLa_UPS1_TMT10_Mixture5_02.raw Mixture5              2
## 5  161117_SILAC_HeLa_UPS1_TMT10_Mixture2_02.raw Mixture2              2
## 6  161117_SILAC_HeLa_UPS1_TMT10_Mixture5_01.raw Mixture5              1
## 7  161117_SILAC_HeLa_UPS1_TMT10_Mixture1_02.raw Mixture1              2
## 8  161117_SILAC_HeLa_UPS1_TMT10_Mixture4_03.raw Mixture4              3
## 9  161117_SILAC_HeLa_UPS1_TMT10_Mixture3_02.raw Mixture3              2
## 10 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_01.raw Mixture3              1
## 11 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_03.raw Mixture5              3
## 12 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_03.raw Mixture1              3
## 13 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_01.raw Mixture4              1
## 14 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_03.raw Mixture2              3
## 15 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01.raw Mixture2              1
##    Fraction
## 1        F1
## 2        F1
## 3        F1
## 4        F1
## 5        F1
## 6        F1
## 7        F1
## 8        F1
## 9        F1
## 10       F1
## 11       F1
## 12       F1
## 13       F1
## 14       F1
## 15       F1

4.5.2.2 Prepare the file with group information

Let’s add the group and biological replicate information for each channel in each mixture. The channel ID should be consistent with the reporter ion intensity columns in the PSM file of Proteome Discoverer.

colnames(raw.pd)

##  [1] "Checked"                     "Confidence"                 
##  [3] "Identifying.Node"            "PSM.Ambiguity"              
##  [5] "Annotated.Sequence"          "Modifications"              
##  [7] "Marked.as"                   "X..Protein.Groups"          
##  [9] "X..Proteins"                 "Master.Protein.Accessions"  
## [11] "Master.Protein.Descriptions" "Protein.Accessions"         
## [13] "Protein.Descriptions"        "X..Missed.Cleavages"        
## [15] "Charge"                      "DeltaScore"                 
## [17] "DeltaCn"                     "Rank"                       
## [19] "Search.Engine.Rank"          "m.z..Da."                   
## [21] "MH...Da."                    "Theo..MH...Da."             
## [23] "DeltaM..ppm."                "Deltam.z..Da."              
## [25] "Activation.Type"             "MS.Order"                   
## [27] "Isolation.Interference...."  "Average.Reporter.S.N"       
## [29] "Ion.Inject.Time..ms."        "RT..min."                   
## [31] "First.Scan"                  "Spectrum.File"              
## [33] "File.ID"                     "Abundance..126"             
## [35] "Abundance..127N"             "Abundance..127C"            
## [37] "Abundance..128N"             "Abundance..128C"            
## [39] "Abundance..129N"             "Abundance..129C"            
## [41] "Abundance..130N"             "Abundance..130C"            
## [43] "Abundance..131"              "Quan.Info"                  
## [45] "Ions.Score"                  "Identity.Strict"            
## [47] "Identity.Relaxed"            "Expectation.Value"          
## [49] "Percolator.q.Value"          "Percolator.PEP"

## Channel ID in the PSM file
channels <- c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N", "130C", "131")
## Mixture ID from run file
mixtures <- unique(Run_info$Mixture)
mixtures

## [1] "Mixture3" "Mixture4" "Mixture1" "Mixture5" "Mixture2"

## create the file with channel information in each mixture
Group_info <- expand.grid(channels, mixtures)
colnames(Group_info) <- c("Channel", "Mixture")
head(Group_info)

##   Channel  Mixture
## 1     126 Mixture3
## 2    127N Mixture3
## 3    127C Mixture3
## 4    128N Mixture3
## 5    128C Mixture3
## 6    129N Mixture3

## save the channel file and fill in the condition and biological replicate information manually
write.csv(Group_info, file = "Group_info.csv", row.names = FALSE)

## Now the condition information should be available in the file
Group_info_filled <- read.csv(file = "data/data_ProteomeDiscoverer_TMT/Group_info_pd.csv")
head(Group_info_filled)

##   Channel  Mixture Condition BioReplicate
## 1     126 Mixture3      Norm         Norm
## 2    127N Mixture3     0.125        0.125
## 3    127C Mixture3     0.667        0.667
## 4    128N Mixture3         1            1
## 5    128C Mixture3       0.5          0.5
## 6    129N Mixture3       0.5          0.5

4.5.2.3 Prepare the final annotation file

Let’s generate the annotation file from run file and group file

annotation <- full_join(Run_info, Group_info_filled)

## Joining, by = "Mixture"

## Warning: Column `Mixture` joining character vector and factor, coercing
## into character vector

nrow(annotation)

## [1] 150

head(annotation)

##                                            Run  Mixture TechRepMixture
## 1 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw Mixture3              3
## 2 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw Mixture3              3
## 3 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw Mixture3              3
## 4 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw Mixture3              3
## 5 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw Mixture3              3
## 6 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw Mixture3              3
##   Fraction Channel Condition BioReplicate
## 1       F1     126      Norm         Norm
## 2       F1    127N     0.125        0.125
## 3       F1    127C     0.667        0.667
## 4       F1    128N         1            1
## 5       F1    128C       0.5          0.5
## 6       F1    129N       0.5          0.5

4.5.3 Preprocessing with `PDtoMSstatsTMTFormat`

The input data for MSstatsTMT is required to contain variables of ProteinName, PeptideSequence, Charge, PSM, Channel, Condition, BioReplicate, TechRepMixture, Mixture, Intensity. These variable names should be fixed. Output from Proteome Discoverer have different columns from the required input of MSstatsTMT. PDtoMSstatsTMTFormat function helps pre-processing for making right format of MSstatsTMT input from Proteome Discoverer output. For example, it renames some column name, and remove shared peptides.

Here is the summary of pre-processing steps in PDtoMSstatsTMTFormat function.

Peptide ions which are shared by more than one protein are removed
If one spectrum has multiple identifications within one run, it only keeps the best identification which has most measurements within that run or highest identification score or largest summation of all the measurements within that run
If one peptide ion has any missing value within one run, it removes the peptide ion from that run
For fractionation, it removes the shared peptide ions among the fractions of each mixture and then combine all the fractions for each mixture

For further details, visit the help file using the following code.

?PDtoMSstatsTMTFormat

# reformating and pre-processing for PD output.
input.pd <- PDtoMSstatsTMTFormat(raw.pd, annotation=annotation)
head(input.pd)

##   ProteinName                  PeptideSequence Charge
## 1   P05413ups           [K].lGVEFDETTADDRk.[V]      3
## 2      Q92734 [K].nVMSAFGLTDDQVSGPPSAPAEDr.[S]      3
## 3   P02787ups              [K].eGYYGYTGAFR.[C]      2
## 4      Q9HAV4               [K].tDSPScEYSr.[F]      2
## 5      P13667     [K].eENGVLVLNDANFDNFVADk.[D]      3
## 6      P10155                 [K].aLSVETEk.[L]      2
##                                  PSM  Mixture TechRepMixture
## 1           [K].lGVEFDETTADDRk.[V]_3 Mixture1              1
## 2 [K].nVMSAFGLTDDQVSGPPSAPAEDr.[S]_3 Mixture1              1
## 3              [K].eGYYGYTGAFR.[C]_2 Mixture1              1
## 4               [K].tDSPScEYSr.[F]_2 Mixture1              1
## 5     [K].eENGVLVLNDANFDNFVADk.[D]_3 Mixture1              1
## 6                 [K].aLSVETEk.[L]_2 Mixture1              1
##                                            Run Channel Condition
## 1 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 2 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 3 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 4 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 5 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
## 6 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw     126      Norm
##   BioReplicate Intensity
## 1         Norm 34115.190
## 2         Norm 11431.716
## 3         Norm 25847.839
## 4         Norm  2446.339
## 5         Norm 26508.569
## 6         Norm 20478.670

4.5.4 Preliminary check

# total number of proteins
proteins <- unique(input.pd$ProteinName)
length(proteins)

## [1] 50

# show the spiked-in proteins
proteins[grepl("ups",proteins)]

##  [1] P05413ups   P02787ups   P00915ups   P02788ups   P01344ups  
##  [6] P01375ups   P01031ups   P01008ups   P68871ups   P02753ups  
## [11] P10636-8ups P62988ups   P02741ups   P01579ups   P02144ups  
## 50 Levels: A1L0T0 O00625 O60427 P00915ups P01008ups ... Q9Y6C9

4.5.5 Save your work

We can save the data that we made so far.

save(input.pd, file='data/data_ProteomeDiscoverer_TMT/input.pd.rda')
#write.csv(input.pd, file='data/data_ProteomeDiscoverer_TMT/input.pd.csv', row.names = FALSE)

4.6 MaxQuant output

4.6.1 Read data

Three files should be prepared before MSstatsTMT. Two files, ‘proteinGroups.txt’ and ‘evidence.txt’ are outputs from MaxQuant.

# First, get protein ID information
proteinGroups <- read.table("data/data_MaxQuant_TMT/proteinGroups.txt", sep = "\t", header = TRUE)

# Read in MaxQuant file: evidence.txt
evi <- read.table("data/data_MaxQuant_TMT/evidence.txt", sep="\t", header=TRUE)
colnames(evi)

##   [1] "Sequence"                           
##   [2] "Length"                             
##   [3] "Modifications"                      
##   [4] "Modified.sequence"                  
##   [5] "Arg10.Probabilities"                
##   [6] "Lys8.Probabilities"                 
##   [7] "Lys8.TMT6.Probabilities"            
##   [8] "Oxidation..M..Probabilities"        
##   [9] "Arg10.Score.Diffs"                  
##  [10] "Lys8.Score.Diffs"                   
##  [11] "Lys8.TMT6.Score.Diffs"              
##  [12] "Oxidation..M..Score.Diffs"          
##  [13] "Arg10"                              
##  [14] "Lys8"                               
##  [15] "Lys8.TMT6"                          
##  [16] "Oxidation..M."                      
##  [17] "Missed.cleavages"                   
##  [18] "Proteins"                           
##  [19] "Leading.proteins"                   
##  [20] "Leading.razor.protein"              
##  [21] "Gene.names"                         
##  [22] "Protein.names"                      
##  [23] "Type"                               
##  [24] "Raw.file"                           
##  [25] "Experiment"                         
##  [26] "MS.MS.m.z"                          
##  [27] "Charge"                             
##  [28] "m.z"                                
##  [29] "Mass"                               
##  [30] "Resolution"                         
##  [31] "Uncalibrated...Calibrated.m.z..ppm."
##  [32] "Uncalibrated...Calibrated.m.z..Da." 
##  [33] "Mass.Error..ppm."                   
##  [34] "Mass.Error..Da."                    
##  [35] "Uncalibrated.Mass.Error..ppm."      
##  [36] "Uncalibrated.Mass.Error..Da."       
##  [37] "Max.intensity.m.z.0"                
##  [38] "Retention.time"                     
##  [39] "Retention.length"                   
##  [40] "Calibrated.retention.time"          
##  [41] "Calibrated.retention.time.start"    
##  [42] "Calibrated.retention.time.finish"   
##  [43] "Retention.time.calibration"         
##  [44] "Match.time.difference"              
##  [45] "Match.m.z.difference"               
##  [46] "Match.q.value"                      
##  [47] "Match.score"                        
##  [48] "Number.of.data.points"              
##  [49] "Number.of.scans"                    
##  [50] "Number.of.isotopic.peaks"           
##  [51] "PIF"                                
##  [52] "Fraction.of.total.spectrum"         
##  [53] "Base.peak.fraction"                 
##  [54] "PEP"                                
##  [55] "MS.MS.Count"                        
##  [56] "MS.MS.Scan.Number"                  
##  [57] "Score"                              
##  [58] "Delta.score"                        
##  [59] "Combinatorics"                      
##  [60] "Intensity"                          
##  [61] "Reporter.intensity.corrected.0"     
##  [62] "Reporter.intensity.corrected.1"     
##  [63] "Reporter.intensity.corrected.2"     
##  [64] "Reporter.intensity.corrected.3"     
##  [65] "Reporter.intensity.corrected.4"     
##  [66] "Reporter.intensity.corrected.5"     
##  [67] "Reporter.intensity.corrected.6"     
##  [68] "Reporter.intensity.corrected.7"     
##  [69] "Reporter.intensity.corrected.8"     
##  [70] "Reporter.intensity.corrected.9"     
##  [71] "Reporter.intensity.0"               
##  [72] "Reporter.intensity.1"               
##  [73] "Reporter.intensity.2"               
##  [74] "Reporter.intensity.3"               
##  [75] "Reporter.intensity.4"               
##  [76] "Reporter.intensity.5"               
##  [77] "Reporter.intensity.6"               
##  [78] "Reporter.intensity.7"               
##  [79] "Reporter.intensity.8"               
##  [80] "Reporter.intensity.9"               
##  [81] "Reporter.intensity.count.0"         
##  [82] "Reporter.intensity.count.1"         
##  [83] "Reporter.intensity.count.2"         
##  [84] "Reporter.intensity.count.3"         
##  [85] "Reporter.intensity.count.4"         
##  [86] "Reporter.intensity.count.5"         
##  [87] "Reporter.intensity.count.6"         
##  [88] "Reporter.intensity.count.7"         
##  [89] "Reporter.intensity.count.8"         
##  [90] "Reporter.intensity.count.9"         
##  [91] "Reporter.PIF"                       
##  [92] "Reporter.fraction"                  
##  [93] "Reverse"                            
##  [94] "Potential.contaminant"              
##  [95] "id"                                 
##  [96] "Protein.group.IDs"                  
##  [97] "Peptide.ID"                         
##  [98] "Mod..peptide.ID"                    
##  [99] "MS.MS.IDs"                          
## [100] "Best.MS.MS"                         
## [101] "AIF.MS.MS.IDs"                      
## [102] "Arg10.site.IDs"                     
## [103] "Lys8.site.IDs"                      
## [104] "Lys8.TMT6.site.IDs"                 
## [105] "Oxidation..M..site.IDs"

unique(evi$Raw.file)

##  [1] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_01
##  [2] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_01
##  [3] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_02
##  [4] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_03
##  [5] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_03
##  [6] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_02
##  [7] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03
##  [8] 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01
##  [9] 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_01
## [10] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_01
## [11] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_02
## [12] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_02
## [13] 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_02
## [14] 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_03
## [15] 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_03
## 15 Levels: 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01 ...

Again, we need file annotation file, required to fill in Condition, BioReplicate and Mixture for corresponding Run and Channel information. Users have to prepare as csv or txt file like ‘MaxQuant_annotation.csv’, which includes Run, Channel, Condition, BioReplicate, TechRepMixture and Mixture information, and load it in R.

4.6.2 Set annotation file

Run column in the annotation file should be the same as unique Raw.file in evidence.txt file. Channel column in the annotation file should match with the corresponding channel names in evidence.txt file.

# Read in annotation including condition and biological replicates: MaxQuant_annotation.csv
annot.maxquant <- read.csv("data/data_MaxQuant_TMT/MaxQuant_annotation.csv", header = TRUE)
head(annot.maxquant)

##                                        Run Fraction TechRepMixture
## 1 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01        1              1
## 2 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01        1              1
## 3 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01        1              1
## 4 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01        1              1
## 5 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01        1              1
## 6 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01        1              1
##     Channel Condition  Mixture BioReplicate
## 1 channel.0      Norm Mixture1         Norm
## 2 channel.1     0.667 Mixture1        0.667
## 3 channel.2     0.125 Mixture1        0.125
## 4 channel.3       0.5 Mixture1          0.5
## 5 channel.4         1 Mixture1            1
## 6 channel.5     0.125 Mixture1        0.125

4.6.3 Preprocessing with `MaxQtoMSstatsTMTFormat`

MaxQtoMSstatsTMTFormat function helps pre-processing for making right format of MSstatsTMT input from MaxQuant output. Basically, this function gets peptide ion intensity from ‘evidence.txt’ file. In addition, there are several steps to filter out or to modify the data in order to get required information.

Here is the summary of pre-processing steps in MaxQtoMSstatsTMTFormat function

Use Proteins in ‘proteinGroups.txt’ as protein ID
Peptide ions which are shared by more than one protein are removed
If one spectrum has multiple identifications within one run, it only keeps the best identification which has most measurements within that run or highest identification score or largest summation of all the measurements within that run
If one peptide ion has any missing value within one run, it removes the peptide ion from that run
For fractionation, it removes the shared peptide ions among the fractions of each mixture and then combine all the fractions for each mixture

?MaxQtoMSstatsTMTFormat

# reformating and pre-processing for MaxQuant output.
input.maxquant <- MaxQtoMSstatsTMTFormat(evidence=evi, 
                                      annotation=annot.maxquant,
                                      proteinGroups=proteinGroups)
head(input.maxquant)

##   ProteinName             PeptideSequence Charge
## 1      P67870 VYCENQPMLPIGLSDIPGEAMVK(ly)      3
## 2      Q9HAV4              TDSPSCEYSR(ar)      2
## 3   P62988ups                   GSHMQIFVK      3
## 4      Q9HAV4           EVMDLITVCCVSK(ly)      3
## 5      P13667    DFPEYTFAIADEEDYAGEVK(ly)      3
## 6      O00625            NLDPFLLFDEFK(ly)      2
##                             PSM  Mixture TechRepMixture
## 1 VYCENQPMLPIGLSDIPGEAMVK(ly)_3 Mixture1              1
## 2              TDSPSCEYSR(ar)_2 Mixture1              1
## 3                   GSHMQIFVK_3 Mixture1              1
## 4           EVMDLITVCCVSK(ly)_3 Mixture1              1
## 5    DFPEYTFAIADEEDYAGEVK(ly)_3 Mixture1              1
## 6            NLDPFLLFDEFK(ly)_2 Mixture1              1
##                                        Run   Channel BioReplicate
## 1 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01 channel.0         Norm
## 2 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01 channel.0         Norm
## 3 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01 channel.0         Norm
## 4 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01 channel.0         Norm
## 5 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01 channel.0         Norm
## 6 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01 channel.0         Norm
##   Condition Intensity
## 1      Norm    970.35
## 2      Norm    405.15
## 3      Norm   2276.00
## 4      Norm   1341.60
## 5      Norm   1092.30
## 6      Norm   2546.50

4.6.4 Preliminary check

# total number of proteins
proteins <- unique(input.maxquant$ProteinName)
length(proteins)

## [1] 33

# show the spiked-in proteins
proteins[grepl("ups",proteins)]

## [1] P62988ups P62937ups
## 39 Levels: A1L0T0 O00625 O60427 P01127ups P01344ups P04183 ... Q9Y6C9

4.6.5 Save your work