The AWAggregator
package implements an attribute-weighted aggregation
algorithm which leverages peptide-spectrum match (PSM) attributes to provide a
more accurate estimate of protein abundance compared to conventional
aggregation methods. This algorithm employs pre-trained random forest models to
predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are
then aggregated to the protein level using a weighted average, taking the
predicted inaccuracy into account. Additionally, the package allows users to
construct their own training sets that are more relevant to their specific
experimental conditions if desired.
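To make the weighting idea concrete, the following base R sketch aggregates the PSMs of one protein with a weighted average. The toy intensities, the predicted inaccuracy values, and the inverse-inaccuracy weights are all assumptions chosen for illustration; the weighting scheme actually used by the package is not described in this section.

```r
# Toy reporter ion intensities for three PSMs of one protein (rows) across
# two samples (columns); the values are made up for illustration
psmIntensity <- matrix(
    c(100, 200,
      110, 210,
      300, 150),
    nrow=3, byrow=TRUE,
    dimnames=list(paste0('PSM', 1:3), c('Sample 1', 'Sample 2'))
)
# Hypothetical predicted inaccuracy per PSM (larger = less trustworthy)
predictedInaccuracy <- c(0.1, 0.2, 2.0)
# Assumption: weight each PSM by the inverse of its predicted inaccuracy
w <- 1 / predictedInaccuracy
# Weighted average per sample, compared with a plain (unweighted) mean
proteinAbundance <- apply(psmIntensity, 2, weighted.mean, w=w)
unweighted <- colMeans(psmIntensity)
print(rbind(weighted=proteinAbundance, unweighted=unweighted))
```

In this toy case the third PSM, which has the largest predicted inaccuracy, contributes least to the weighted estimate.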
Since ExperimentHub
can only retrieve data from the AWAggregatorData
package with Bioconductor version 3.21 or later, please use the legacy version
of the AWAggregator
package if you are using an earlier Bioconductor version:
https://github.com/Tan-Jiahua/AWAggregator-compat
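The version requirement can also be checked programmatically. The sketch below uses a hypothetical helper, needsLegacyAWAggregator(), which is not part of either package; it only compares a Bioconductor version string against 3.21.

```r
# Hypothetical helper (not part of AWAggregator): TRUE when the given
# Bioconductor version is older than 3.21
needsLegacyAWAggregator <- function(biocVersion) {
    package_version(biocVersion) < package_version('3.21')
}
print(needsLegacyAWAggregator('3.20'))
print(needsLegacyAWAggregator('3.22'))
```

In an interactive session, the installed version could be obtained with BiocManager::version() and passed to the helper.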
Functions available in the AWAggregator
package:
getDistMetric()
: Calculates the distance metric for PSMs. The distance metric
reflects whether the quantified ratio for each pair of samples in a PSM
diverges from that of other PSMs in the same redundant/unique group. The
redundant group, unique group, and distance metric were originally defined in
the iPQF method. Please refer to “iPQF: a new peptide-to-protein summarization
method using peptide spectra characteristics to improve protein quantification”
for more details.
getPSMAttributes()
: Retrieves attributes required for training or test
sets.
getAvgScaledErrorOfLog2FC()
: Calculates the Average Scaled Error of
log2FC values required for training sets.
mergeTrainingSets()
: Extracts a similar number of PSMs from each input
dataset and merges them into a single training set.
fitQuantInaccuracyModel()
: Trains a random forest model to predict the
level of quantitative inaccuracy of PSMs.
aggregateByAttributes()
: Aggregates PSMs using a random forest model.
convertPDFormat()
: Converts output from Proteome Discoverer into the
input format required by AWAggregator.
Function available in the associated AWAggregatorData
package:
loadQuantInaccuracyModel()
: Loads a pre-trained random forest model for
predicting the level of quantitative inaccuracy of PSMs.
Data available in the AWAggregator
package:
sample.PSM.FP
: represents sample PSMs mapped to the proteins A0AV96,
A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the psm.tsv
output file
generated by FragPipe. Columns not needed by AWAggregator have been
removed from the sample data.
sample.prot.PD
: represents sample proteins A0AV96, A0AVF1, A0AVT1,
A0FGR8, and A0M8Q6, obtained from the TXT export of the proteins page in the
Proteome Discoverer search results. Columns not needed by AWAggregator
have been removed from the sample data.
sample.PSM.PD
: represents sample PSMs mapped to the proteins A0AV96,
A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the TXT export of the PSMs
page in the Proteome Discoverer search results. Columns not needed by
AWAggregator have been removed from the sample data.
Data available in the associated AWAggregatorData
package:
regr
: represents the pre-trained random forest model that incorporates the
average coefficient of variation (CV) as a feature.
regr.no.CV
: represents the pre-trained random forest model that does not
include the average CV as a feature.
benchmark.set.1
, benchmark.set.2
, benchmark.set.3
: represent the PSMs in
Benchmark Sets 1 to 3, derived from the psm.tsv
output files generated by
FragPipe, which are used to train the random forest model. Columns not needed
by AWAggregator have been removed from the data.
The AWAggregator
package and the associated AWAggregatorData
package can be
installed from Bioconductor.
if (!requireNamespace('BiocManager', quietly=TRUE))
install.packages('BiocManager')
BiocManager::install('AWAggregator')
BiocManager::install('AWAggregatorData')
Load the AWAggregator
package and the AWAggregatorData
package.
library(AWAggregator)
library(AWAggregatorData)
## Loading required package: ExperimentHub
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
In this example, we aggregate the reporter ion intensities of PSMs to the
protein level. We use the sample dataset sample.PSM.FP
, included in the
AWAggregator
package and derived from the psm.tsv
output file generated by
FragPipe. This dataset includes reporter ion intensities from nine samples,
labeled from Sample 1
to Sample 9
, without replicates. The PSMs are mapped
to the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with
unnecessary columns removed for clarity.
This example demonstrates the basic functionality of the AWAggregator
package
using the default pre-trained model.
# Load the pre-trained random forest model that does not use the average CV as
# a feature. The average CV is the mean coefficient of variation (in percent)
# of the processed PSM reporter ion intensities across replicate groups. Use
# the model with average CV when replicates are available; otherwise, as here,
# use the model without the average CV
data(sample.PSM.FP)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
groups <- samples
df <- getPSMAttributes(
PSM=sample.PSM.FP,
# TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as
# fixed post-translational modifications (PTMs)
fixedPTMs=c('229.1629', '57.0214'),
colOfReporterIonInt=samples,
groups=groups,
setProgressBar=TRUE
)
## These groups are automatically removed when average CV is calculated because of lack of replicates:
## Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9
## There are no replicates so average CV will not be generated as an attribute.
aggregated_results <- aggregateByAttributes(
PSM=df,
colOfReporterIonInt=samples,
ranger=regr,
ratioCalc=FALSE
)
The output dataframe will provide estimates of protein abundance.
Protein Sample 1 Sample 2 Sample 3 Sample 4 ...
sp|A0AV96|RBM47_HUMAN 0.9292177 1.0111264 0.7933874 0.9606382 ...
sp|A0AVF1|IFT56_HUMAN 0.6646691 0.6600642 0.6696656 0.7984397 ...
sp|A0AVT1|UBA6_HUMAN 1.1883116 1.1752203 1.0482381 1.0910095 ...
sp|A0FGR8|ESYT2_HUMAN 0.9304190 0.8504465 1.0550898 0.7952998 ...
sp|A0M8Q6|IGLC7_HUMAN 0.4205675 0.6393757 0.7475482 0.6968704 ...
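The choice between the two pre-trained models depends on whether any group has replicates. The sketch below shows one way to derive that choice from the groups vector; hasReplicates() is a hypothetical helper, not a package function, and treating "at least one group with two or more samples" as "replicates available" is an assumption.

```r
# Hypothetical helper: TRUE when at least one group contains two or more
# samples
hasReplicates <- function(groups) {
    any(table(groups) > 1)
}
# Nine samples that each form their own group, as in the example above
noReps <- paste('Sample', 1:9)
# A design with two replicates per group (group names are illustrative)
withReps <- c('H1+E1', 'H1+E1', 'H1+E2', 'H1+E2')
print(hasReplicates(noReps))
print(hasReplicates(withReps))
```

The result could then be passed on, for example as loadQuantInaccuracyModel(useAvgCV=hasReplicates(groups)).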
In this example, we convert the search result from Proteome Discoverer to the
format required by AWAggregator
and aggregate the reporter ion intensities of
PSMs to the protein level. We use the sample dataset sample.PSM.PD
, alongside
its corresponding protein table sample.prot.PD
, both included in the
AWAggregator
package. These files are derived from the TXT exports of the
proteins and PSMs pages in the search results from Proteome Discoverer. This
dataset includes reporter ion intensities from nine samples, labeled from
Sample 1
to Sample 9
, without replicates. The PSM and protein tables
contain the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with
unnecessary columns removed for clarity.
# Load the pre-trained random forest model that does not use the average CV as
# a feature. The average CV is the mean coefficient of variation (in percent)
# of the processed PSM reporter ion intensities across replicate groups. Use
# the model with average CV when replicates are available; otherwise, as here,
# use the model without the average CV
data(sample.PSM.PD)
data(sample.prot.PD)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
groups <- samples
df <- convertPDFormat(
PSM=sample.PSM.PD,
protein=sample.prot.PD,
colOfReporterIonInt=samples
)
df <- getPSMAttributes(
PSM=df,
# TMT tag and carbamidomethylation are applied as static PTMs
fixedPTMs=c('TMT6plex', 'Carbamidomethyl'),
colOfReporterIonInt=samples,
groups=groups,
setProgressBar=TRUE
)
## These groups are automatically removed when average CV is calculated because of lack of replicates:
## Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9
## There are no replicates so average CV will not be generated as an attribute.
aggregated_results <- aggregateByAttributes(
PSM=df,
colOfReporterIonInt=samples,
ranger=regr,
ratioCalc=FALSE
)
The output dataframe will provide estimates of protein abundance.
Protein Sample 1 Sample 2 Sample 3 Sample 4 ...
A0AV96_Homo sapiens 0.9392033 0.9514846 0.7096284 0.9393484 ...
A0AVF1_Homo sapiens 0.6591366 0.6534372 0.7121089 0.7741971 ...
A0AVT1_Homo sapiens 1.2035820 1.1647425 1.0494833 1.1121796 ...
A0FGR8_Homo sapiens 0.9664924 0.8391658 1.0946545 0.7832414 ...
A0M8Q6_Homo sapiens 0.3516833 0.4695273 0.7225070 0.6042526 ...
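The FragPipe and Proteome Discoverer workflows label proteins differently in their outputs ('sp|A0AV96|RBM47_HUMAN' versus 'A0AV96_Homo sapiens'). If the two result tables need to be compared, the accession can be extracted from both forms. The sketch below assumes the identifier layouts seen in the two output tables and uses only base R; the variable names are illustrative.

```r
# Protein labels copied from the two output tables
fragpipeIds <- c('sp|A0AV96|RBM47_HUMAN', 'sp|A0M8Q6|IGLC7_HUMAN')
pdIds <- c('A0AV96_Homo sapiens', 'A0M8Q6_Homo sapiens')
# FragPipe-style label: keep the field between the first and second '|'
accFP <- sub('^[^|]*[|]([^|]+)[|].*$', '\\1', fragpipeIds)
# Proteome Discoverer-style label: keep everything before the first '_'
accPD <- sub('_.*$', '', pdIds)
print(accFP)
print(identical(accFP, accPD))
```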
Retraining the AWA model with additional spike-in datasets can increase the number of quantified PSMs in the merged training set, and hence the robustness of the correlation. Retraining with experiment-specific in-house spike-in datasets may also benefit the machine learning model by better representing the hardware and acquisition modes in use.
In this example, we create a training set by merging three benchmark spike-in
datasets (benchmark.set.1
, benchmark.set.2
, and benchmark.set.3
), all
included in the AWAggregatorData
package and derived from the psm.tsv
output
files generated by FragPipe. This combined training set is then used to train a
random forest model.
We load the spike-in datasets using the ExperimentHub
package. These datasets
correspond to the sets described in the AWAggregator
publication. You may
substitute your own spike-in datasets if desired.
library(ExperimentHub)
eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']] # Benchmark Set 1
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
benchmarkSet2 <- eh[['EH9638']] # Benchmark Set 2
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
benchmarkSet3 <- eh[['EH9639']] # Benchmark Set 3
## see ?AWAggregatorData and browseVignettes('AWAggregatorData') for documentation
## loading from cache
Firstly, we calculate the attributes and the values of Average Scaled Error of
log2FC in benchmark.set.1
.
library(stringr)
# Load sample names (Sample 'H1+E1_1' ~ Sample 'H1+E6_3')
samples <- colnames(benchmarkSet1)[
grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
]
groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]
PSM1 <- getPSMAttributes(
PSM=benchmarkSet1,
# TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as
# fixed PTMs
fixedPTMs=c('229.1629', '57.0214'),
colOfReporterIonInt=samples,
groups=groups
)
PSM1 <- getAvgScaledErrorOfLog2FC(
PSM=PSM1,
colOfReporterIonInt=samples,
groups=groups,
# The actual protein fold change may deviate from the intended values
# after TMT labelling when H1+E6 is involved, as the original work
# indicates; therefore, H1+E6 is not used in the calculation of Average
# Scaled Error of log2FC
expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
speciesAtConstLevel='HUMAN'
)
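The sample and group vectors above are derived from column names with regular expressions. The following base R sketch shows the same idea on toy column names, using regmatches() in place of str_match(); the column names are assumptions that follow the 'H1+E1_1' pattern mentioned in the comment.

```r
# Toy column names: one non-intensity column plus five reporter ion columns
cols <- c('Peptide', 'H1+E1_1', 'H1+E1_2', 'H1+E2_1', 'H1+E2_2', 'H1+E6_1')
# Keep only columns that look like '<group>_<replicate number>'
samples <- grep('H1[+]E[0-9]+_[1-4]', cols, value=TRUE)
# Drop the replicate suffix to obtain the group of each sample
groups <- regmatches(samples, regexpr('H1[+]E[0-9]+', samples))
print(table(groups))
```

Samples that share a group label are treated as replicates of the same condition.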
Secondly, we calculate the attributes and the values of Average Scaled Error of
log2FC in benchmark.set.2
. benchmark.set.2
consists of three separate
mass spectrometry runs, indicated by the Replicate
column. Each run is processed individually with the lapply
function because of potential run-specific differences, and the results are
merged with the bind_rows
function.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, setequal, union
## The following object is masked from 'package:generics':
##
## explain
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_3')
samples <- colnames(benchmarkSet2)[
grep('H1[+]Y[0-9]+_[1-3]', colnames(benchmarkSet2))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
# Process each replicate separately using lapply()
# lapply() loops over all unique replicate IDs in benchmarkSet2.
# 'X' is the current replicate ID.
tmp <- lapply(unique(benchmarkSet2$Replicate), FUN=function(X){
# Select PSMs from the current replicate X
df <- benchmarkSet2[benchmarkSet2$Replicate == X, ]
df <- getPSMAttributes(
PSM=df,
fixedPTMs=c('229.1629', '57.0214'),
colOfReporterIonInt=samples,
groups=groups,
setProgressBar=FALSE
)
df <- getAvgScaledErrorOfLog2FC(
PSM=df,
colOfReporterIonInt=samples,
groups=groups,
expectedRelativeAbundance=list(`H1+Y1`=1, `H1+Y4`=4, `H1+Y10`=10),
speciesAtConstLevel='HUMAN'
)
# Return the processed PSMs from the current replicate
return(df)
})
# Combine results from all replicates into one dataframe
PSM2 <- bind_rows(tmp)
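The per-run pattern above (subset by run, process, recombine) can be illustrated on toy data with base R alone. In this sketch, normalizeRun() is a hypothetical stand-in for the per-run processing and is unrelated to the package's own calculations.

```r
# Toy table with a Replicate column identifying the run of each row
toy <- data.frame(
    Replicate=c(1, 1, 2, 2, 3),
    Intensity=c(10, 30, 20, 60, 5)
)
# Hypothetical per-run step: scale intensities by the median of the run
normalizeRun <- function(df) {
    df$Scaled <- df$Intensity / median(df$Intensity)
    df
}
# Split by run, process each run separately, then stack the results
perRun <- lapply(split(toy, toy$Replicate), normalizeRun)
combined <- do.call(rbind, perRun)
rownames(combined) <- NULL
print(combined)
```

Processing each run separately keeps run-specific quantities, such as the median here, from being mixed across runs.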
Thirdly, we calculate the attributes and the values of Average Scaled Error of
log2FC in benchmark.set.3
.
# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM3 <- getPSMAttributes(
PSM=benchmarkSet3,
fixedPTMs=c('304.2071', '125.0476'),
colOfReporterIonInt=samples,
groups=groups,
# The signals for yeast PSMs in group H1+Y0 are purely noise, so
# they are not used for calculating Average CV
groupsExcludedFromCV='H1+Y0'
)
## These groups are removed when average CV is calculated because of the setting of groupsExcludedFromCV:
## H1+Y0
PSM3 <- getAvgScaledErrorOfLog2FC(
PSM=PSM3,
colOfReporterIonInt=samples,
groups=groups,
expectedRelativeAbundance=list(
`H1+Y0`=0, `H1+Y1`=1, `H1+Y5`=5, `H1+Y10`=10
),
speciesAtConstLevel='HUMAN'
)
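The expectedRelativeAbundance argument describes the intended spike-in levels per group. As a rough illustration of how such levels translate into expected log2 fold changes, the sketch below uses the Benchmark Set 2 levels and a made-up observation. It does not reproduce the package's Average Scaled Error calculation, whose scaling is not described here; the observed intensities and variable names are assumptions.

```r
# Intended relative abundances of the spiked-in species (Benchmark Set 2)
expected <- c(`H1+Y1`=1, `H1+Y4`=4, `H1+Y10`=10)
# All pairwise group comparisons (later group versus earlier group)
pairs <- combn(names(expected), 2)
expectedLog2FC <- log2(expected[pairs[2, ]] / expected[pairs[1, ]])
names(expectedLog2FC) <- paste(pairs[2, ], pairs[1, ], sep=' vs ')
# Made-up group-level intensities of one spiked-in PSM
observed <- c(`H1+Y1`=100, `H1+Y4`=300, `H1+Y10`=700)
observedLog2FC <- log2(observed[pairs[2, ]] / observed[pairs[1, ]])
names(observedLog2FC) <- names(expectedLog2FC)
# Deviation of the observed log2FC from the expected log2FC
log2FCError <- observedLog2FC - expectedLog2FC
print(round(rbind(expectedLog2FC, observedLog2FC, log2FCError), 3))
```

In this toy case the observed fold changes are smaller than the expected ones, so all errors are negative.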
Next, we merge these three datasets into a new training set. The minimum number of PSMs to extract from each dataset is set to the number of PSMs in the smallest set. Because complete sets of PSMs mapped to the selected proteins are extracted, the final PSM count from each set is equal to or slightly larger than this preset value.
set.seed(1000)
PSM <- mergeTrainingSets(
PSMList=list(
`Benchmark Set 1`=PSM1,
`Benchmark Set 2`=PSM2,
`Benchmark Set 3`=PSM3
),
numPSMs=min(nrow(PSM1), nrow(PSM2), nrow(PSM3))
)
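The statement that whole proteins are extracted, so that counts can slightly exceed the preset value, can be illustrated with a small base R sketch. sampleWholeProteins() is a hypothetical helper written for this illustration; it is not the implementation used by mergeTrainingSets().

```r
set.seed(1)
# Toy dataset: one row per PSM, with the protein it maps to
psm <- data.frame(
    Protein=rep(c('P1', 'P2', 'P3', 'P4'), times=c(3, 2, 4, 1)),
    PSMId=seq_len(10),
    stringsAsFactors=FALSE
)
# Hypothetical helper: draw whole proteins in random order until at least
# numPSMs rows have been collected
sampleWholeProteins <- function(df, numPSMs) {
    counts <- table(df$Protein)
    shuffled <- sample(names(counts))
    cumulative <- cumsum(as.vector(counts[shuffled]))
    # Position of the first protein at which the target is reached
    reached <- which(cumulative >= numPSMs)
    last <- if (length(reached) > 0) reached[1] else length(shuffled)
    df[df$Protein %in% shuffled[seq_len(last)], ]
}
selected <- sampleWholeProteins(psm, numPSMs=5)
print(table(selected$Protein))
```

Because proteins are never split, the number of selected rows is at least the target and exceeds it by less than the PSM count of the last protein drawn.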
Train a new random forest model using Average CV as an attribute.
regr <- fitQuantInaccuracyModel(PSM, useAvgCV=TRUE, seed=3979)
## Growing trees.. Progress: 30%. Estimated remaining time: 1 minute, 10 seconds.
## Growing trees.. Progress: 61%. Estimated remaining time: 38 seconds.
## Growing trees.. Progress: 91%. Estimated remaining time: 9 seconds.
## Model training time = 1.83722259203593 minutes
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 stringr_1.5.1 AWAggregatorData_0.99.4
## [4] ExperimentHub_2.99.5 AnnotationHub_3.99.6 BiocFileCache_2.99.6
## [7] dbplyr_2.5.0 BiocGenerics_0.55.1 generics_0.1.4
## [10] AWAggregator_0.99.4 BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.49.1 toOrdinal_1.3-0.0 xfun_0.53
## [4] bslib_0.9.0 httr2_1.2.1 Biobase_2.69.0
## [7] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1
## [10] stats4_4.5.1 curl_7.0.0 tibble_3.3.0
## [13] AnnotationDbi_1.71.1 RSQLite_2.4.3 blob_1.2.4
## [16] pkgconfig_2.0.3 Matrix_1.7-3 S4Vectors_0.47.0
## [19] lifecycle_1.0.4 compiler_4.5.1 Biostrings_2.77.2
## [22] brio_1.1.5 progress_1.2.3 Seqinfo_0.99.2
## [25] htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
## [28] pillar_1.11.0 crayon_1.5.3 jquerylib_0.1.4
## [31] tidyr_1.3.1 cachem_1.1.0 tidyselect_1.2.1
## [34] digest_0.6.37 stringi_1.8.7 purrr_1.1.0
## [37] bookdown_0.44 BiocVersion_3.22.0 fastmap_1.2.0
## [40] grid_4.5.1 cli_3.6.5 magrittr_2.0.3
## [43] withr_3.0.2 prettyunits_1.2.0 filelock_1.0.3
## [46] rappdirs_0.3.3 bit64_4.6.0-1 XVector_0.49.0
## [49] rmarkdown_2.29 httr_1.4.7 Peptides_2.4.6
## [52] bit_4.6.0 ranger_0.17.0 png_0.1-8
## [55] hms_1.1.3 memoise_2.0.1 evaluate_1.0.4
## [58] knitr_1.50 IRanges_2.43.0 testthat_3.2.3
## [61] rlang_1.1.6 Rcpp_1.1.0 glue_1.8.0
## [64] DBI_1.2.3 BiocManager_1.30.26 jsonlite_2.0.0
## [67] R6_2.6.1