1 Introduction

The AWAggregator package implements an attribute-weighted aggregation algorithm which leverages peptide-spectrum match (PSM) attributes to provide a more accurate estimate of protein abundance compared to conventional aggregation methods. This algorithm employs pre-trained random forest models to predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are then aggregated to the protein level using a weighted average, taking the predicted inaccuracy into account. Additionally, the package allows users to construct their own training sets that are more relevant to their specific experimental conditions if desired.
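
As a rough illustration of the weighting idea described above, the sketch below aggregates three hypothetical PSMs for one protein, giving less weight to the PSM with the largest predicted inaccuracy. The inverse-inaccuracy weights and all numbers are assumptions made for illustration only; the package's actual weighting scheme is not reproduced here.

```r
# Hypothetical reporter ion ratios of three PSMs mapped to one protein
psm_ratio <- c(1.02, 0.97, 1.60)
# Hypothetical predicted quantitative inaccuracy (larger = less trustworthy)
pred_inaccuracy <- c(0.10, 0.12, 0.90)

# Assumed weighting for this sketch: inverse of the predicted inaccuracy
w <- 1 / pred_inaccuracy

unweighted <- mean(psm_ratio)
weighted <- weighted.mean(psm_ratio, w)

# The weighted estimate is pulled less strongly by the third PSM
c(unweighted = unweighted, weighted = weighted)
```

In this toy case the weighted estimate stays closer to the two PSMs with low predicted inaccuracy than the plain mean does.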

Since ExperimentHub can only retrieve data from the AWAggregatorData package with Bioconductor version 3.21 or later, please use the legacy version of the AWAggregator package if you are using an earlier Bioconductor version: https://github.com/Tan-Jiahua/AWAggregator-compat

1.1 Overview of Package Functions

Functions available in the AWAggregator package:

  • getDistMetric(): Calculates the distance metric for PSMs. The distance metric reflects whether the quantified ratio between each pair of samples for a PSM diverges from that of the other PSMs in the same redundant/unique group. The redundant group, unique group, and distance metric were originally defined in the iPQF method. Please refer to “iPQF: a new peptide-to-protein summarization method using peptide spectra characteristics to improve protein quantification” for more details.

  • getPSMAttributes(): Retrieves attributes required for training or test sets.

  • getAvgScaledErrorOfLog2FC(): Calculates the Average Scaled Error of log2FC values required for training sets.

  • mergeTrainingSets(): Extracts a similar number of PSMs from each input dataset and merges them into a single training set.

  • fitQuantInaccuracyModel(): Trains a random forest model to predict the level of quantitative inaccuracy of PSMs.

  • aggregateByAttributes(): Aggregates PSMs using a random forest model.

  • convertPDFormat(): Converts output from Proteome Discoverer into the input format required by AWAggregator.

Function available in the associated AWAggregatorData package:

  • loadQuantInaccuracyModel(): Loads a pre-trained random forest model for predicting the level of quantitative inaccuracy of PSMs.

1.2 Overview of Package Data

Data available in the AWAggregator package:

  • sample.PSM.FP: represents sample PSMs mapped to the proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the psm.tsv output file generated by FragPipe. Columns unnecessary for the AWAggregator have been removed from the sample data.

  • sample.prot.PD: represents sample proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the TXT export of the proteins page in the Proteome Discoverer search results. Columns unnecessary for the AWAggregator have been removed from the sample data.

  • sample.PSM.PD: represents sample PSMs mapped to the proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the TXT export of the PSMs page in the Proteome Discoverer search results. Columns unnecessary for the AWAggregator have been removed from the sample data.

Data available in the associated AWAggregatorData package:

  • regr: represents the pre-trained random forest model that incorporates the average coefficient of variation (CV) as a feature.

  • regr.no.CV: represents the pre-trained random forest model that does not include the average CV as a feature.

  • benchmark.set.1, benchmark.set.2, benchmark.set.3: represent PSMs in Benchmark Sets 1 to 3, derived from the psm.tsv output files generated by FragPipe and used to train the random forest model. Columns unnecessary for the AWAggregator have been removed from these datasets.

2 Installation

The AWAggregator package and the associated AWAggregatorData package can be installed from Bioconductor.

if (!requireNamespace('BiocManager', quietly=TRUE))
    install.packages('BiocManager')

BiocManager::install('AWAggregator')
BiocManager::install('AWAggregatorData')
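
Because the data package requires Bioconductor 3.21 or later, it can help to check the version before installing. The helper below is a hypothetical convenience function, not part of either package; it compares a version string against 3.21 using base R only.

```r
# Hypothetical helper: TRUE when the given Bioconductor version predates 3.21.
# Pass the version as a string (a numeric such as 3.20 would lose its trailing
# zero when converted to text).
needsLegacyAWAggregator <- function(biocVersion) {
    package_version(as.character(biocVersion)) < package_version('3.21')
}

needsLegacyAWAggregator('3.20')
needsLegacyAWAggregator('3.22')
```

With BiocManager installed, the running version can be supplied as needsLegacyAWAggregator(BiocManager::version()).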

3 Workflow Examples

Load the AWAggregator package and the AWAggregatorData package.

library(AWAggregator)
library(AWAggregatorData)
## Loading required package: ExperimentHub
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr

3.1 Ex.1: Aggregate PSMs from FragPipe Using the Pre-Trained Model

In this example, we aggregate the reporter ion intensities of PSMs to the protein level. We use the sample dataset sample.PSM.FP, included in the AWAggregator package and derived from the psm.tsv output file generated by FragPipe. This dataset includes reporter ion intensities from nine samples, labeled from Sample 1 to Sample 9, without replicates. The PSMs are mapped to the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with unnecessary columns removed for clarity.

This example demonstrates the basic functionality of the AWAggregator package using the default pre-trained model.

# Load the sample PSMs
data(sample.PSM.FP)
# Load the pre-trained random forest model without the average CV feature.
# The average CV is the mean coefficient of variation (in percent) of the
# processed PSM reporter ion intensities across replicate groups. Use the
# model with the average CV when replicates are available; otherwise, use the
# model without it, as here
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
groups <- samples
df <- getPSMAttributes(
    PSM=sample.PSM.FP,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as 
    # fixed post-translational modifications (PTMs)
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
## These groups are automatically removed when average CV is calculated because of lack of replicates:
## Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9
## There are no replicates so average CV will not be generated as an attribute.
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)

The output data frame provides estimates of protein abundance.

Protein               Sample 1   Sample 2   Sample 3   Sample 4   ...
sp|A0AV96|RBM47_HUMAN 0.9292177  1.0111264  0.7933874  0.9606382  ...
sp|A0AVF1|IFT56_HUMAN 0.6646691  0.6600642  0.6696656  0.7984397  ...
sp|A0AVT1|UBA6_HUMAN  1.1883116  1.1752203  1.0482381  1.0910095  ...
sp|A0FGR8|ESYT2_HUMAN 0.9304190  0.8504465  1.0550898  0.7952998  ...
sp|A0M8Q6|IGLC7_HUMAN 0.4205675  0.6393757  0.7475482  0.6968704  ...
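
A common next step is to compare samples on the log2 scale. The sketch below rebuilds the first two sample columns printed above as a small data frame (so that it runs without the package) and computes the log2 ratio of Sample 2 to Sample 1 for each protein. Choosing these two samples for comparison is an assumption made for illustration.

```r
# Values copied from the first two sample columns of the output above
aggregated <- data.frame(
    Protein = c('sp|A0AV96|RBM47_HUMAN', 'sp|A0AVF1|IFT56_HUMAN',
                'sp|A0AVT1|UBA6_HUMAN', 'sp|A0FGR8|ESYT2_HUMAN',
                'sp|A0M8Q6|IGLC7_HUMAN'),
    `Sample 1` = c(0.9292177, 0.6646691, 1.1883116, 0.9304190, 0.4205675),
    `Sample 2` = c(1.0111264, 0.6600642, 1.1752203, 0.8504465, 0.6393757),
    check.names = FALSE  # keep the space in the column names
)

# log2 ratio of Sample 2 to Sample 1 for each protein
log2FC <- log2(aggregated[['Sample 2']] / aggregated[['Sample 1']])
names(log2FC) <- aggregated$Protein
round(log2FC, 3)
```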

3.2 Ex.2: Aggregate PSMs from Proteome Discoverer Using the Pre-Trained Model

In this example, we convert the search results from Proteome Discoverer to the format required by AWAggregator and aggregate the reporter ion intensities of PSMs to the protein level. We use the sample dataset sample.PSM.PD, alongside its corresponding protein table sample.prot.PD, both included in the AWAggregator package. These files are derived from the TXT exports of the proteins and PSMs pages in the search results from Proteome Discoverer. This dataset includes reporter ion intensities from nine samples, labeled from Sample 1 to Sample 9, without replicates. The PSM and protein tables contain the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with unnecessary columns removed for clarity.

# Load the sample PSMs and the corresponding protein table
data(sample.PSM.PD)
data(sample.prot.PD)
# Load the pre-trained random forest model without the average CV feature.
# The average CV is the mean coefficient of variation (in percent) of the
# processed PSM reporter ion intensities across replicate groups. Use the
# model with the average CV when replicates are available; otherwise, use the
# model without it, as here
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
groups <- samples
df <- convertPDFormat(
    PSM=sample.PSM.PD,
    protein=sample.prot.PD,
    colOfReporterIonInt=samples
)
df <- getPSMAttributes(
    PSM=df,
    # TMT tag and carbamidomethylation are applied as static PTMs
    fixedPTMs=c('TMT6plex', 'Carbamidomethyl'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
## These groups are automatically removed when average CV is calculated because of lack of replicates:
## Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9
## There are no replicates so average CV will not be generated as an attribute.
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)

The output data frame provides estimates of protein abundance.

Protein             Sample 1   Sample 2   Sample 3   Sample 4   ...
A0AV96_Homo sapiens 0.9392033  0.9514846  0.7096284  0.9393484  ...
A0AVF1_Homo sapiens 0.6591366  0.6534372  0.7121089  0.7741971  ...
A0AVT1_Homo sapiens 1.2035820  1.1647425  1.0494833  1.1121796  ...
A0FGR8_Homo sapiens 0.9664924  0.8391658  1.0946545  0.7832414  ...
A0M8Q6_Homo sapiens 0.3516833  0.4695273  0.7225070  0.6042526  ...
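
The protein identifiers printed for the FragPipe and Proteome Discoverer workflows use different formats. If results from both workflows need to be matched, one option is to reduce each identifier to its accession. The sketch below uses base R only and assumes the identifier formats are exactly as printed in the two example outputs.

```r
# Identifiers in the two formats shown in the example outputs
fpProteins <- c('sp|A0AV96|RBM47_HUMAN', 'sp|A0M8Q6|IGLC7_HUMAN')
pdProteins <- c('A0AV96_Homo sapiens', 'A0M8Q6_Homo sapiens')

# FragPipe style: the accession is the second '|'-separated field
fpAccession <- vapply(
    strsplit(fpProteins, '|', fixed=TRUE), `[`, character(1), 2
)
# Proteome Discoverer style: the accession precedes the first underscore
pdAccession <- sub('_.*$', '', pdProteins)

identical(fpAccession, pdAccession)
```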

3.3 Ex.3: Build a Merged Training Set and Retrain the Model

Retraining the AWA model using additional spike-in datasets can increase the number of quantified PSMs in the merged training set, and hence the robustness of the correlation. In addition, retraining with experiment-specific in-house spike-in datasets may benefit the machine learning model by better representing the hardware and acquisition modes employed.

In this example, we create a training set by merging three benchmark spike-in datasets (benchmark.set.1, benchmark.set.2, and benchmark.set.3), all included in the AWAggregatorData package and derived from the psm.tsv output files generated by FragPipe. This combined training set is then used to train a random forest model.

3.3.1 Step 1: Load Spike-in Datasets

We load the spike-in datasets using the ExperimentHub package. These datasets correspond to the sets described in the AWAggregator publication. You may substitute your own spike-in datasets if desired.

library(ExperimentHub)
eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']] # Benchmark Set 1
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
benchmarkSet2 <- eh[['EH9638']] # Benchmark Set 2
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
benchmarkSet3 <- eh[['EH9639']] # Benchmark Set 3
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache

3.3.2 Step 2: Calculate PSM Attributes and Average Scaled Error of log2FC

Firstly, we calculate the attributes and the values of Average Scaled Error of log2FC in benchmark.set.1.

library(stringr)

# Load sample names (Sample 'H1+E1_1' ~ Sample 'H1+E6_3')
samples <- colnames(benchmarkSet1)[
    grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
]
groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]
PSM1 <- getPSMAttributes(
    PSM=benchmarkSet1,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as 
    # fixed PTMs
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups
)
PSM1 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM1,
    colOfReporterIonInt=samples,
    groups=groups,
    # As the original work indicates, the actual protein fold change may 
    # deviate from the intended value after TMT labelling when H1+E6 is 
    # involved; therefore, H1+E6 is not used in the calculation of the Average 
    # Scaled Error of log2FC
    expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
    speciesAtConstLevel='HUMAN'
)
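
The sample and group vectors above are derived from column names with regular expressions. The base R sketch below shows the same idea on hypothetical column names, so it runs without the benchmark data: reporter ion columns are selected by pattern, and the trailing replicate index is dropped to obtain the group label.

```r
# Hypothetical column names mixing annotation and reporter ion columns
cols <- c('Peptide', 'Protein', 'H1+E1_1', 'H1+E1_2', 'H1+E2_1', 'H1+E6_1')

# Keep only reporter ion columns; '[+]' matches a literal plus sign
samples <- grep('H1[+]E[0-9]+_[0-9]+', cols, value=TRUE)
# Drop the trailing replicate index to obtain the group label
groups <- sub('_[0-9]+$', '', samples)

# Number of replicates per group
table(groups)
```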

Secondly, we calculate the attributes and the values of Average Scaled Error of log2FC in benchmark.set.2, which consists of three separate mass spectrometry runs, indicated by the Replicate column. Because of potential run-specific differences, each run is processed individually with lapply(), and the results are then merged with bind_rows().

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
## 
##     ident, sql
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, setequal, union
## The following object is masked from 'package:generics':
## 
##     explain
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_3')
samples <- colnames(benchmarkSet2)[
    grep('H1[+]Y[0-9]+_[1-3]', colnames(benchmarkSet2))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]

# Process each replicate separately using lapply()
# lapply() loops over all unique replicate IDs in benchmarkSet2.
# 'X' is the current replicate ID.
tmp <- lapply(unique(benchmarkSet2$Replicate), FUN=function(X){
    # Select PSMs from the current replicate X
    df <- benchmarkSet2[benchmarkSet2$Replicate == X, ]
    df <- getPSMAttributes(
        PSM=df,
        fixedPTMs=c('229.1629', '57.0214'),
        colOfReporterIonInt=samples,
        groups=groups,
        setProgressBar=FALSE
    )
    df <- getAvgScaledErrorOfLog2FC(
        PSM=df,
        colOfReporterIonInt=samples,
        groups=groups,
        expectedRelativeAbundance=list(`H1+Y1`=1, `H1+Y4`=4, `H1+Y10`=10),
        speciesAtConstLevel='HUMAN'
    )
    # Return the processed PSMs from the current replicate
    return(df)
})
# Combine results from all replicates into one dataframe
PSM2 <- bind_rows(tmp)
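
The per-replicate processing above follows a split-apply-combine pattern. The sketch below shows the same pattern in base R on a toy table; the median scaling is only a stand-in for run-specific processing and is not what getPSMAttributes() or getAvgScaledErrorOfLog2FC() compute.

```r
# Toy PSM table with a Replicate column (hypothetical values)
psm <- data.frame(
    Replicate = c(1, 1, 2, 2, 3),
    Intensity = c(10, 20, 30, 50, 40)
)

# Process each replicate on its own; each run is scaled by its own median
perRun <- lapply(split(psm, psm$Replicate), function(df) {
    df$Scaled <- df$Intensity / median(df$Intensity)
    df
})

# Recombine the processed runs into a single data frame
combined <- do.call(rbind, perRun)
rownames(combined) <- NULL
combined
```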

Thirdly, we calculate the attributes and the values of Average Scaled Error of log2FC in benchmark.set.3.

# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM3 <- getPSMAttributes(
    PSM=benchmarkSet3,
    fixedPTMs=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups,
    # The signals for yeast PSMs in group H1+Y0 come entirely from noise, so 
    # they are not used for calculating the average CV
    groupsExcludedFromCV='H1+Y0'
)
## These groups are removed when average CV is calculated because of the setting of groupsExcludedFromCV:
## H1+Y0
PSM3 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM3,
    colOfReporterIonInt=samples,
    groups=groups,
    expectedRelativeAbundance=list(
        `H1+Y0`=0, `H1+Y1`=1, `H1+Y5`=5, `H1+Y10`=10
    ),
    speciesAtConstLevel='HUMAN'
)
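
To see why a noise-only group is excluded from the average CV, consider the sketch below. It assumes the average CV is the mean, across replicate groups, of 100 * sd / mean of the intensities within each group; the exact computation inside the package may differ. The intensities and the helper avgCV() are hypothetical.

```r
# Hypothetical intensities of one yeast PSM across replicate groups
intensity <- c(`H1+Y0_1`=2, `H1+Y0_2`=9, `H1+Y1_1`=100, `H1+Y1_2`=110,
               `H1+Y5_1`=480, `H1+Y5_2`=520)
groups <- sub('_[0-9]+$', '', names(intensity))

# Hypothetical helper: mean of the per-group CVs (in percent),
# optionally leaving out some groups
avgCV <- function(x, groups, exclude=character(0)) {
    keep <- !(groups %in% exclude)
    cvs <- tapply(x[keep], groups[keep], function(v) 100 * sd(v) / mean(v))
    mean(cvs)
}

cvAll <- avgCV(intensity, groups)
cvNoY0 <- avgCV(intensity, groups, exclude='H1+Y0')
c(all = cvAll, withoutY0 = cvNoY0)
```

In this toy case the noise-only group dominates the average, which is the effect that groupsExcludedFromCV is meant to avoid.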

3.3.3 Step 3: Merge Spike-in Datasets as a New Training Set

Next, we merge these three datasets into a new training set. The minimum number of PSMs to extract from each dataset is set to the number of PSMs in the smallest dataset. Because complete sets of PSMs mapped to the selected proteins are extracted, the final PSM count from each dataset is equal to or slightly larger than this preset value.

set.seed(1000)
PSM <- mergeTrainingSets(
    PSMList=list(
        `Benchmark Set 1`=PSM1,
        `Benchmark Set 2`=PSM2,
        `Benchmark Set 3`=PSM3
    ),
    numPSMs=min(nrow(PSM1), nrow(PSM2), nrow(PSM3))
)
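
The whole-protein extraction described above can be pictured with the following sketch, which draws proteins in random order and keeps all of their PSMs until a target count is reached. This is an illustration of the idea under simple assumptions, not the internals of mergeTrainingSets(); the toy data and variable names are hypothetical.

```r
set.seed(1)
# Toy dataset: one row per PSM, labelled by protein (hypothetical)
psm <- data.frame(
    Protein = rep(paste0('P', 1:6), times = c(4, 3, 5, 2, 6, 3))
)
numPSMs <- 10

# Number of PSMs per protein, as a named integer vector
counts <- lengths(split(psm$Protein, psm$Protein))
# Shuffle the proteins, then take whole proteins until the target is reached
shuffled <- sample(names(counts))
nNeeded <- which(cumsum(counts[shuffled]) >= numPSMs)[1]
selected <- psm[psm$Protein %in% shuffled[seq_len(nNeeded)], , drop = FALSE]

# Equal to or slightly larger than numPSMs
nrow(selected)
```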

3.3.4 Step 4: Train a New Random Forest Model

Train a new random forest model using Average CV as an attribute.

regr <- fitQuantInaccuracyModel(PSM, useAvgCV=TRUE, seed=3979)
## Growing trees.. Progress: 30%. Estimated remaining time: 1 minute, 10 seconds.
## Growing trees.. Progress: 61%. Estimated remaining time: 38 seconds.
## Growing trees.. Progress: 91%. Estimated remaining time: 9 seconds.
## Model training time = 1.83722259203593 minutes
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dplyr_1.1.4             stringr_1.5.1           AWAggregatorData_0.99.4
##  [4] ExperimentHub_2.99.5    AnnotationHub_3.99.6    BiocFileCache_2.99.6   
##  [7] dbplyr_2.5.0            BiocGenerics_0.55.1     generics_0.1.4         
## [10] AWAggregator_0.99.4     BiocStyle_2.37.1       
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.49.1      toOrdinal_1.3-0.0    xfun_0.53           
##  [4] bslib_0.9.0          httr2_1.2.1          Biobase_2.69.0      
##  [7] lattice_0.22-7       vctrs_0.6.5          tools_4.5.1         
## [10] stats4_4.5.1         curl_7.0.0           tibble_3.3.0        
## [13] AnnotationDbi_1.71.1 RSQLite_2.4.3        blob_1.2.4          
## [16] pkgconfig_2.0.3      Matrix_1.7-3         S4Vectors_0.47.0    
## [19] lifecycle_1.0.4      compiler_4.5.1       Biostrings_2.77.2   
## [22] brio_1.1.5           progress_1.2.3       Seqinfo_0.99.2      
## [25] htmltools_0.5.8.1    sass_0.4.10          yaml_2.3.10         
## [28] pillar_1.11.0        crayon_1.5.3         jquerylib_0.1.4     
## [31] tidyr_1.3.1          cachem_1.1.0         tidyselect_1.2.1    
## [34] digest_0.6.37        stringi_1.8.7        purrr_1.1.0         
## [37] bookdown_0.44        BiocVersion_3.22.0   fastmap_1.2.0       
## [40] grid_4.5.1           cli_3.6.5            magrittr_2.0.3      
## [43] withr_3.0.2          prettyunits_1.2.0    filelock_1.0.3      
## [46] rappdirs_0.3.3       bit64_4.6.0-1        XVector_0.49.0      
## [49] rmarkdown_2.29       httr_1.4.7           Peptides_2.4.6      
## [52] bit_4.6.0            ranger_0.17.0        png_0.1-8           
## [55] hms_1.1.3            memoise_2.0.1        evaluate_1.0.4      
## [58] knitr_1.50           IRanges_2.43.0       testthat_3.2.3      
## [61] rlang_1.1.6          Rcpp_1.1.0           glue_1.8.0          
## [64] DBI_1.2.3            BiocManager_1.30.26  jsonlite_2.0.0      
## [67] R6_2.6.1