The identification of the transcription factor (TF) responsible for the corregulation of an specific set of genes is a common problem in transcriptomics. In the most simple scenario, the comparison of the transcriptome of cells or organisms in two conditions leads to the identification of a set of differentially expressed (DE) genes and the underlying assumption is that one or a few TFs regulate the expression of those genes. Traditionally, the identification of the relevant TFs has relied on the use of position weight matrices (PWMs) to predict transcription factor binding sites (TFBSs) proximal to the DE genes1 Wasserman, W.W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics, 5, 276–287. https://doi.org/10.1038/nrg1315. The comparison of predicted TFBS in DE versus control genes reveals factors that are significantly enriched in the DE gene set. These approaches have been useful to narrow down potential binding sites, but can suffer from high rates of false positives. In addition, this strategy is limited by design to sequence-specific TF and thus unable to identify cofactors that bind indirectly to target genes.
To overcome these limitations, TFEA.ChIP leverages the vast collection of publicly available ChIP-seq datasets to determine TFBSs linked to a given set of genes and performs enrichment analysis based on this experimentally derived, high-quality information. Specifically, TFEA.ChIP utilizes data from the ReMap 2022 repository, which contains hundreds of manually curated and uniformly processed DNA-binding experiments 2 Hammal, F., de Langen, P., Bergon, A., Lopez, F., & Ballester, B. (2022). ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Research, 50(D1), D316–D325. https://doi.org/10.1093/nar/gkab996.
A critical step in this analysis is linking TF binding sites from ChIP-seq experiments to their target genes. The previous version of TFEA.ChIP 3 Puente-Santamaría, L., Wasserman, W.W., & del Peso, L. (2019). TFEA.ChIP: a tool kit for transcription factor binding site enrichment analysis capitalizing on ChIP-seq datasets. Bioinformatics, 35(24), 5339–5340. https://doi.org/10.1093/bioinformatics/btz573 relied on GeneHancer 4 Fishilevich, S., Nudel, R., Rappaport, N., Hadar, R., Plaschkes, I., Iny Stein, T., Rosen, N., Kohn, A., Twik, M., Safran, M., Lancet, D., & Cohen, D. (2017). GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford), 2017, bax028. https://doi.org/10.1093/database/bax028, an integrative resource that connects genes to regulatory elements by aggregating data from ENCODE, FANTOM5, eQTLs, and chromatin interaction datasets. The current version of TFEA.ChIP includes a TF-gene interaction database constructed by integrating binding sites from the ReMap 2022 repository with enhancer-gene pairs generated by ENCODE-rE2G 5 Gschwind, A.R., et al. (2023). An encyclopedia of enhancer-gene regulatory interactions in the human genome. bioRxiv. https://doi.org/10.1101/2023.01.01.123456. This database, which contains 8,103 datasets from ChIP-seq experiments covering 1,193 human transcriptional regulators, demonstrated the best performance in internal benchmarking analyses and is therefore the default in this version of the tool. Users can also choose alternative TF–gene interaction databases, including those used in previous versions or custom-defined datasets. Additional databases are available through the accompanying Bioconductor data package ExperimentHub (to be released), and more information can be found at: https://github.com/LauraPS1/TFEA.ChIP_downloads and https://github.com/yberda/ChIPDBData.
TFEA.ChIP implements two enrichment analysis methods:
Although primarily designed for analyzing expression data from human cells, TFEA.ChIP includes an option to incorporate datasets from mouse experiments by translating mouse gene names to their human orthologs. Internal analyses indicate that this approach outperforms TF-gene databases derived from mouse-specific enhancer-gene associations.
TFEA.ChIP is designed to take the output of a differential expression analysis and identify TFBS enriched in the list of DE genes. In the case of the Overrepresentation Analysis, the only required input is a set of DE genes and, optionally, a set of control genes whose expression is not altered by the experimental conditions under study. For the GSEA analysis a ranked list of genes is required. This is supplied as a dataframe containing a column with gene names and a numerical column with the ranking metric, which typically are log2 fold change (LFC) or p-values for the gene expression changes in the two conditions under evaluation. To enhance integration with the Bioconductor ecosystem, the current version of the package natively supports output objects from widely used differential expression analysis tools such as edgeR and DESeq2.
For illustration purposes we will derive the input required for both analysis from a table containing the following field columns:
The output of popular packages, such as DESeq2, for detection of differentially expressed genes from the analysis of count data from RNA-seq experiments produce tables with this information, and the output table can be taken by the tool. The hypoxia_DESeq and hypoxia datasets are the output of a differential expression analysis performed on an RNAseq experiment analyzing the response to hypoxia of endothelial cells8 Tiana, M., Acosta-Iborra, B., Puente-Santamaría, L., Hernansanz-Agustín, P., Worsley-Hunt, R., Masson, N., García-Rio, F., Mole, D., Ratcliffe, P., Wasserman, W.W., Jiménez, B., & del Peso, L. (2018). The SIN3A histone deacetylase complex is required for a complete transcriptional response to hypoxia. Nucleic Acids Research, 46(1), 120–133. https://doi.org/10.1093/nar/gkx951 deposited at the NCBI’s GEO repository (GSE89831).
We begin by loading the input data table into the R environment. The input is expected to be in a format compatible with results generated by the DESeq2 package:
library(TFEA.ChIP)
library(dplyr)
data( "hypoxia_DESeq", "hypoxia", package="TFEA.ChIP" ) # Load example datasets
hypoxia_df <- preprocessInputData( hypoxia_DESeq )
#> Warning: Some genes returned 1:many mapping to ENTREZ ID.
# Display the first few rows
head(hypoxia_df)
#> Symbol Genes log2FoldChange pvalue pval.adj
#> 1 DNAH8 1769 4.588064 3.942129e-62 3.285202e-59
#> 2 STC1 6781 4.339768 7.840000e-11 2.714222e-09
#> 3 SYTL2 54843 4.307058 1.435621e-75 1.522672e-72
#> 4 IGFBP3 3486 4.297902 6.595438e-18 5.535897e-16
#> 5 AK4 205 4.235590 1.318311e-125 5.126912e-122
#> 6 PTGIS 5740 3.931662 4.398104e-15 2.631420e-13
Alternatively, the input can be a simple data frame containing columns for gene identifiers (such as Entrez IDs, Ensembl IDs, or gene symbols), LFC and p-values or adjusted p-values:
# Preview the raw hypoxia dataset
head(hypoxia)
#> Genes log2FoldChange pvalue pval.adj
#> 1 TNMD 0.00000000 1.000000000 1.00000000
#> 2 DPM1 -0.44497788 0.001243905 0.01809033
#> 3 SCYL3 -0.27092276 0.139977254 0.65057456
#> 4 C1orf112 -0.52382569 0.035752231 0.25407939
#> 5 FGR -0.44810165 0.217220835 0.83961811
#> 6 CFH 0.05237749 0.740720152 1.00000000
hypoxia <- preprocessInputData( hypoxia )
#> Warning: Some genes returned 1:many mapping to ENTREZ ID.
# Display the first few rows
head(hypoxia)
#> Genes log2FoldChange pvalue pval.adj Symbol
#> 10785 1365 5.319709 9.647797e-77 2.817857e-73 CLDN3
#> 10138 284525 5.179949 2.411485e-64 5.414910e-61 SLC9C2
#> 9362 392255 4.401819 3.559414e-91 1.998137e-87 GDF6
#> 3958 4883 4.208881 1.414157e-39 1.082537e-36 NPR3
#> 10320 2199 4.060889 1.305316e-46 1.465522e-43 FBLN2
#> 5256 1769 4.024703 8.418844e-65 2.025454e-61 DNAH8
After processing the data with the preprocessInputData()
function, the resulting dataset will be ready for downstream analysis with the TFEA.ChIP package. This function converts gene identifiers to Entrez Gene IDs and sorts the table by LFC for consistency.
Note: Using the option
mode = m2h
will convert mouse gene IDs to their equivalent human gene ID, thus taking advantage of the wider availability of ChIP-seq experiments done on human cells. This strategy relies on the overlap between human and mouse transcription regulatory mechanisms. Nevertheless, we advise to be cautious using this approach, since extrapolating results from one organism to another is not always appropriate.
By default, TFEA.ChIP uses a reduced internal reference database restricted to ChIP-seq experiments performed in ENCODE tier 1, 2, and 2.5 cell types. This streamlined database is intentionally minimal to accommodate users with limited computing resources and may not fully capture the diversity of available TF–gene associations. To perform enrichment analysis using the complete ChIP-seq database, load the curated dataset from ExperimentHub
and assign it to an object named ChIPDB
to override the default:
library(ExperimentHub)
eh <- ExperimentHub()
ChIPDB <- eh[['EH9854']] # rE2G300d
The database shown here is derived from ENCODE’s rE2G predictive models (“rE2G300d”, see ChIPDBData GitHub repository for details). However, additional ready-to-use databases based on rE2G, GeneHancer, and CREdb are also available. To use one of these alternative databases, select a different entry in the eh
object:
query(eh, "ChIPDBData")
As mentioned previously, this analysis requires two gene sets: one consisting of differentially expressed genes (e.g., upregulated), and another comprising control genes whose expression remains unchanged under the experimental conditions.
The Select_genes()
function facilitates this selection. For example, upregulated genes can be defined as those with a LFC greater than 1 and an adjusted p-value (adjP) less than 0.05. Control genes can be selected based on an LFC between -0.25 and 0.25 and an adjP between 0.5 and 1:
# Extract vector with names of upregulated genes
Genes.Upreg <- Select_genes( hypoxia_df, min_LFC = 1, max_pval = 0.05 )
# Extract vector with names of non-responsive genes
Genes.Control <- Select_genes( hypoxia_df,
min_pval = 0.5, max_pval = 1,
min_LFC = -0.25, max_LFC = 0.25 )
In situations where preprocessing is not applicable, such as when using an external or manually curated gene list, gene identifiers must be standardized to Entrez Gene IDs for compatibility with the package’s downstream functions. The GeneID2entrez()
function supports this conversion:
# Conversion of hgnc to ENTREZ IDs
GeneID2entrez( gene.IDs = c("EGLN3","NFYA","ALS2","MYC","ARNT" ) )
#> [1] "112399" "4800" "57679" "4609" "405"
# To translate from mouse IDs:
# GeneID2entrez( gene.IDs = c( "Hmmr", "Tlx3", "Cpeb4" ), mode = "m2h" ) # To get the equivalent human gene IDs
In this step, a contingency table is built for each TF represented in the ChIP-seq regulatory map. Each TF-specific contingency table summarizes the overlap between TF binding and differential expression:
TFbound_yes | TFbound_no | |
---|---|---|
DE_yes | number y/y | number y/n |
DE_no | number n/y | number n/n |
Fisher’s exact test is then applied to each contingency table to evaluate the null hypothesis that TF binding and differential expression are independent events. In addition to raw p-values, the function returns false discovery rate (FDR)-adjusted p-values to account for multiple testing, as well as the odds ratio (OR), which quantifies the strength of the association between the DE genes and the TF based on ChIP-seq data.
CM_list_UP <- contingency_matrix( Genes.Upreg, Genes.Control ) # Generates list of contingency tables, one per dataset
pval_mat_UP <- getCMstats( CM_list_UP ) # Generates list of p-values and OR from association test
head( pval_mat_UP )
#> Accession Cell Treatment TF p.value OR
#> 873 GSE89836.EPAS1.HUVEC-C HUVEC-C EPAS1 2.883815e-06 63.194453
#> 345 ENCSR091BOQ.SUZ12.GM12878 GM12878 SUZ12 9.452603e-32 5.674107
#> 738 GSE109625.EZH2.HUVEC-C HUVEC-C EZH2 4.816831e-24 7.745379
#> 612 ENCSR784VUY.RNF2.WA01 WA01 RNF2 6.049224e-23 3.463252
#> 9 ENCSR000ARI.EZH2.Hep-G2 Hep-G2 EZH2 5.579521e-22 5.167738
#> 715 GSE104690.RNF2.WA01 WA01 RNF2 3.864458e-22 3.840774
#> OR.SE log2.OR adj.p.value log10.adj.pVal distance
#> 873 68.3644690 5.981726 6.918168e-06 5.160009 62.40814
#> 345 0.7785041 2.504393 4.862839e-30 29.313110 29.68342
#> 738 1.4158940 2.953336 9.911969e-23 22.003840 23.01454
#> 612 0.4246438 1.792127 9.657899e-22 21.015117 21.15899
#> 9 0.8024915 2.369533 7.948671e-21 20.099705 20.52726
#> 715 0.5025048 1.941397 5.771755e-21 20.238692 20.43709
In the example above, all 8,103 TF binding datasets included in the internal database were used for the enrichment analysis. However, the analysis can be restricted to a specific subset of TFs. To do this, the get_chip_index()
function can be used to generate an index referencing the desired subset of datasets. This index can then be passed as an argument to the contingency_matrix()
function.
chip_index <- get_chip_index( TFfilter = c( "HIF1A","EPAS1","ARNT" ) ) # Restrict the analysis to datasets assaying these factors
CM_list_UPe <- contingency_matrix( Genes.Upreg, chip_index = chip_index ) # Generates list of contingency tables
pval_mat_UPe <- getCMstats( CM_list_UPe, chip_index ) # Generates list of p-values and ORs
head( pval_mat_UPe )
#> Accession Cell Treatment TF p.value
#> 6 GSE89836.EPAS1.HUVEC-C HUVEC-C EPAS1 1.086534e-08
#> 5 GSE89836.ARNT.HUVEC-C HUVEC-C ARNT 2.066421e-14
#> 3 GSE39089.HIF1A.HUVEC-C_HYPOX HUVEC-C_HYPOX HYPOX HIF1A 2.363868e-24
#> 1 ENCSR029IBC.ARNT.Hep-G2 Hep-G2 ARNT 7.772175e-03
#> 2 ENCSR590KEQ.ARNT.GM12878 GM12878 ARNT 3.734244e-05
#> 4 GSE39089.HIF1A.HUVEC-C_NOMO HUVEC-C_NOMO NOMO HIF1A 5.734422e-02
#> OR OR.SE log2.OR adj.p.value log10.adj.pVal distance
#> 6 53.838261 27.0996505 5.750560 2.173068e-08 7.662927 53.391032
#> 5 20.275898 5.8177766 4.341694 6.199262e-14 13.207660 23.366697
#> 3 4.774115 0.6379584 2.255233 1.418321e-23 22.848226 23.157836
#> 1 5.412187 2.7913479 2.436212 9.326610e-03 2.030276 4.856893
#> 2 1.775250 0.2364437 0.828022 5.601366e-05 4.251706 4.321807
#> 4 2.843944 1.4501576 1.507893 5.734422e-02 1.241510 2.222943
Note: When a complete ranked gene list is provided as input, the current version of the package can automatically identify expressed TFs and limit the analysis to that subset. Details on this process are explained later in the document. Additionally, specifying a control gene set is optional: if omitted, the function defaults to using all annotated human genes not included in the test set as controls.
The TFEA.ChIP package includes a metadata table that provides detailed information about each ChIP-seq dataset in the internal database. This includes experimental conditions, TF targets, and dataset sources. The metadata can be accessed as follows:
data("MetaData", package = "TFEA.ChIP")
head(MetaData)
#> Accession ID
#> 1 ENCSR000AHD.CTCF.MCF-7 ENCSR000AHD
#> 2 ENCSR000AHF.TAF1.MCF-7 ENCSR000AHF
#> 3 ENCSR000AKB.CTCF.GM12878 ENCSR000AKB
#> 4 ENCSR000AKO.CTCF.K-562 ENCSR000AKO
#> 5 ENCSR000ALA.CTCF.endothelial_umbilical-vein ENCSR000ALA
#> 6 ENCSR000ALJ.CTCF.keratinocyte ENCSR000ALJ
#> Cell Treatment TF
#> 1 MCF-7 CTCF
#> 2 MCF-7 TAF1
#> 3 GM12878 CTCF
#> 4 K-562 CTCF
#> 5 endothelial_umbilical-vein umbilical-vein CTCF
#> 6 keratinocyte CTCF
The TFEA.ChIP package provides a function that integrates all key analysis steps into a single command, simplifying the workflow. This streamlined function is particularly helpful when working with complete gene tables, as it can automatically detect TFs that are expressed in the input dataset and restrict the analysis accordingly. By default, this filtering step is enabled to avoid testing TFs that are unlikely to be relevant in the given biological context, thereby reducing the multiple testing burden and increasing statistical power. If desired, this behavior can be disabled by setting the expressed
argument to FALSE
:
pval_mat_UPe <- analysis_from_table(hypoxia_df,
method = 'ora', # Overrepresentation Analysis
interest_min_LFC = 1, # min LFC
expressed = TRUE, # only take expressed TFs
control_min_pval = 0.5, control_max_pval = 1, # control pval limits
control_min_LFC = -0.25, control_max_LFC = 0.25) # control LFC limits
In the analyses described above, each ChIP-seq dataset is treated independently. This approach allows for the identification of specific cell types or experimental conditions where a TF may be active, facilitating biological interpretation in relation to the study samples. Alternatively, the package supports the integration of all ChIP-seq datasets associated with a given TF, irrespective of experimental context. This strategy helps to highlight TFs that show consistent enrichment across conditions. The rankTFs()
function enables this summarization by applying either a Wilcoxon rank-sum test or a GSEA approach to assess whether the ChIP-seq profiles corresponding to the same TF are collectively enriched or depleted in the ranked results.
Important consideration: For TFs whose activity is highly context-dependent, aggregating ChIP-seq data across diverse experimental conditions may obscure biologically meaningful enrichments. Additionally, many public ChIP-seq datasets are derived from baseline or control conditions in which the TF may not be active. While this metadata is informative for interpreting individual experiments, it can confound integrated analyses. Therefore, we recommend using non-integrated (individual) results for more accurate and context-specific interpretation whenever possible.
TF_ranking <- rankTFs( pval_mat_UP, rankMethod = "gsea", makePlot = TRUE )
head( TF_ranking[[ "TF_ranking" ]] )
TF_ranking[[ "TFranking_plot" ]]
Snapshot of a plot generated with rankTFs.
Meta-analyses can also be used to summarize results across multiple ChIP experiments for the same TF, accounting for the standard error of the OR. This approach gives greater weight to more precise estimates. However, it may also dilute effects that are context-dependent:
library(meta)
TF_ranking2 <- metaanalysis_fx( pval_mat_UP )
# Get HIF results
filter(TF_ranking2$summary,
TF %in% c("ARNT", "HIF1A", "EPAS1"))
Snapshot of the meta-analysis results for HIF1A, EPAS1 and ARNT.
In this analysis, k
denotes the number of ChIP experiments included. OR
represents the weighted average Odds Ratio, OR.SE
is its standard error, and CI
refers to the corresponding confidence interval. The pooled effect size (OR values) estimated by the meta-analysis integrates all datasets for a given TF, which may dilute effects that are context-dependent. However, a forest plot can be generated to visualize the results of individual experiments alongside the integrated value:
# Result for HIF1A
forest(TF_ranking2[['results']]$HIF1A)
Snapshot of a plot generated with forest (meta package).
The table of results generated by getCMstats
can be parsed to select candidate TF. The function plot_CM
uses the package plotly
to generate an interactive plot representing the p-value against the Odds Ratio that is very helpful to explore the results:
plot_CM( pval_mat_UP ) # plot p-values against ORs
Snapshot of a plot generated with plot_CM.
As expected for the hypoxia dataset, visual inspection of the enrichment graph reveals a strong signal for several HIF-related ChIP-seq datasets. These can be highlighted to emphasize their relevance in the current analysis:
HIFs <- c( "EPAS1","HIF1A","ARNT" )
plot_CM( pval_mat_UP, specialTF = HIFs ) # Plot p-values against ORs highlighting indicated TFs
Snapshot of a plot generated with plot_CM.
The Gene Set Enrichment Analysis (GSEA) implemented in TFEA.ChIP requires a ranked list of genes as input. By default, the preprocessInputData()
function sorts genes in descending order based on their LFC values. However, users can choose to rank genes using alternative numerical metrics, such as the p-value, adjusted p-value, or a composite measure like the product of LFC and –log10(p-value), depending on the specific goals of the analysis.
If a custom gene ranking is used, it is essential to ensure that gene identifiers are in Entrez Gene ID format. If necessary, the GeneID2entrez()
function can be used to convert gene symbols or alternative identifiers into the required format.
By default, the analysis will include all the ChIP-Seq experiments available in the database. However, this analysis might take several minutes to run. To restrict the analysis to a subset of the database we can generate an index variable and pass it to the function GSEA_run()
. This will limit the analysis to the ChIP-Seq datasets of the user’s choosing. This index variable can be generated using the function get_chip_index()
and allows the user to select the whole database, the set of ChIP-Seq experiments produced by the ENCODE project (“encode”) or a specific subset of TFs (as a vector containing the TF names).
chip_index <- get_chip_index( TFfilter = c( "HIF1A","EPAS1","ARNT" ) ) # Restrict the analysis to datasets assaying these factors
The function GSEA_run()
will perform a GSEA-based analysis on the input gene list. This function is based on the R-GSEA R script bundle written by the GSEA team at the Broad Institute of MIT and Harvard. The output of the analysis depends on the variable get.RES
:
When FALSE
, the function returns a data frame storing maximum Enrichment Score and associated p-value determined for each dataset included in the analysis.
When TRUE
, the function returns a list of three elements. The first element (Enrichment.table) is the enrichment data frame previously mentioned. The second element (RES) is a list of vectors containing the Running Enrichment Scores values (see GSEA documentation, http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Interpreting_GSEA_Results) for each of the sorted genes tested against each one of the analyzed ChIP datasets. The third element (indicators) is a list of vectors indicating if the sorted genes were bound by the factor analyzed in each ChIP dataset.
# run GSEA analysis
GSEA.result <- GSEA_run( hypoxia_df$Genes, hypoxia_df$log2FoldChange, chip_index, get.RES = TRUE)
Finally, the package provides a function that integrates all the above steps into a single command. In the current version of the package, only TFs present in the input sorted list of genes (i.e., TFs expressed under the conditions under study) are considered for the analysis. This default behavior can be overridden by setting the parameter expressed
to FALSE
:
GSEA.result <- analysis_from_table(hypoxia_df,
TFfilter = c('ARNT', 'HIF1A', 'EPAS1'),
expressed = TRUE, # Takes only expressed TFs
method = 'gsea') # Gene Set Enrichment Analysis
TFEA.ChIP provides two functions, plot_ES()
and plot_RES()
, to create interactive HTML visualizations of GSEA results.
plot_ES()
The plot_ES()
function generates an interactive enrichment score (ES) plot. It supports highlighting ChIP-seq datasets associated with specific TFs by displaying them in distinct colors, allowing users to easily identify TFs of interest within the broader analysis.
TF.hightlight <- c( "EPAS1","ARNT","HIF1A" )
plot_ES( GSEA.result$result, LFC = GSEA.result$processed_table$log2FoldChange, specialTF = TF.hightlight)
Snapshot of a plot generated with plot_ES
plot_RES()
This function will plot all the RES stored in the GSEA_run()
output. It is only recommended to restrict output to specific TF and/or datasets by setting the parameters TF
and/or Accession
respectively:
plot_RES(
GSEA_result = GSEA.result$result, LFC = hypoxia_df$log2FoldChange,
Accession = c("GSE89836.ARNT.HUVEC-C", "GSE89836.EPAS1.HUVEC-C" ) )
Snapshot of a plot generated with plot_RES
If the user wants to generate their own database of ChIPseq datasets, the functions txt2gr()
and makeChIPGeneDB()
automate most of the process. The required inputs are:
A Metadata table (storing at least, Accession ID, name of the file, and TF tested in the ChIP-Seq experiment). The metadata table included with this package has the following fields: “Name”, “Accession”, “Cell”, “Cell Type”, “Treatment”, “Antibody”, and “TF”.
A folder containing ChIP-Seq peak data, either in “.narrowpeak” format or the MACS output files "_peaks.bed" -a format that stores “chr”, “start”, “end”, “name”, and “Q-value” of every peak-.
Specify the folder where the ChIP-Seq files are stored, create an array with the names of the ChIP-Seq files, and choose a format.
Set a for
loop to convert all your files to GenomicRanges objects using txt2GR()
. Please note that, by default, only peaks with an associated p-value of 0.05 (for narrow peaks files) or 1e-5 (for MACS files) will be kept. The user can modify the default values by setting the alpha argument to the desired threshold p-value.
folder <- "~/peak.files.folder"
File.list<-dir( folder )
format <- "macs"
gr.list <- lapply(
seq_along( File.list ),
function( File.list, myMetaData, format ){
tmp<-read.table( File.list[i], ..., stringsAsFactors = FALSE )
file.metadata <- myMetaData[ myMetaData$Name == File.list[i], ]
ChIP.dataset.gr<-txt2GR(tmp, format, file.metadata)
return(ChIP.dataset.gr)
},
File.list = File.list,
myMetadata = myMetadata,
format = format
)
# As an example of the output
data( "ARNT.peaks.bed","ARNT.metadata", package = "TFEA.ChIP" ) # Loading example datasets for this function
ARNT.gr <- txt2GR( ARNT.peaks.bed, "macs1.4", ARNT.metadata )
head( ARNT.gr, n=2 )
#> GRanges object with 2 ranges and 9 metadata columns:
#> seqnames ranges strand | score mcols.Name
#> <Rle> <IRanges> <Rle> | <numeric> <character>
#> [1] chr1 1813434-1814386 * | 4.60909e-06 GSM2390643_ARNT_ChIP..
#> [2] chr1 3456399-3457536 * | 8.82450e-27 GSM2390643_ARNT_ChIP..
#> mcols.Accession mcols.Cell mcols.Cell Type mcols.Treatment
#> <character> <character> <character> <character>
#> [1] GSM2390643 HUVEC Hypoxia
#> [2] GSM2390643 HUVEC Hypoxia
#> mcols.Antibody mcols.TF mcols.Score Type
#> <character> <character> <character>
#> [1] HIF1B ARNT corrected p-Value
#> [2] HIF1B ARNT corrected p-Value
#> -------
#> seqinfo: 32 sequences from an unspecified genome; no seqlengths
The function makeChIPGeneDB()
assigns the TFBS peaks in the ChIP datasets stored in gr.list()
to a gene. To this end, a ChIP peak overlaping a regulatory region region receive the gene label associated to said region. By default the function also assigns the gene name when the ChIP peak does not overlap a regulatory region but maps at less than 10 nucleotides from it. This behaviour can be modified by setting the argument distanceMargin
to the desired value (by default distanceMargin = 10 bases).
The resulting ChIP-Gene data base is a list containing two elements: - Gene Keys: vector of gene IDs. - ChIP Targets: list of vectors, one per element in gr.list, containing the putative targets assigned. Each target is coded as its position in the vector ‘Gene Keys’.
data( "DnaseHS_db", "gr.list", package="TFEA.ChIP" ) # Loading example datasets for this function
TF.gene.binding.db <- makeChIPGeneDB( DnaseHS_db, gr.list )
str( TF.gene.binding.db )
#> List of 2
#> $ Gene Keys : chr [1:54] "100507501" "10209" "10277" "10458" ...
#> $ ChIP Targets:List of 1
#> ..$ wgEncodeEH002402: int [1:54] 1 2 3 4 5 6 7 8 9 10 ...
The function, accepts any Genomic Range object that includes a metacolumn with a gene ID (stored in the @elementMetadata@listdata [["gene_id"]] slot of the object) for each genomic segment. For example, asignation of peaks to genes can be done by providing a list of all the genes in the genome:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following object is masked from 'package:dplyr':
#>
#> explain
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following object is masked from 'package:dplyr':
#>
#> combine
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:dplyr':
#>
#> first, rename
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#>
#> Attaching package: 'IRanges'
#> The following objects are masked from 'package:dplyr':
#>
#> collapse, desc, slice
#> Loading required package: Seqinfo
#> Loading required package: GenomicRanges
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'AnnotationDbi'
#> The following object is masked from 'package:dplyr':
#>
#> select
data( "gr.list", package="TFEA.ChIP") # Loading example datasets for this function
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
Genes <- genes( txdb )
#> 2169 genes were dropped because they have exons located on both strands of
#> the same reference sequence or on more than one reference sequence, so cannot
#> be represented by a single genomic range.
#> Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
#> object, or use suppressMessages() to suppress this message.
TF.gene.binding.db <- makeChIPGeneDB( Genes, gr.list, distanceMargin = 0 )
str( TF.gene.binding.db )
#> List of 2
#> $ Gene Keys : chr [1:35332] "1" "10" "100" "1000" ...
#> $ ChIP Targets:List of 1
#> ..$ wgEncodeEH002402: int [1:41] 1656 3217 3791 4256 7188 8381 8781 9637 11099 12229 ...
In this case the information about Dnase hypersensitivity is disregarded and peaks are assigned to overlapping genes (or genes closer than distanceMargin
residues).
At the beginning of a session, use the function set_user_data()
to use your TFBS binary matrix and metadata table with the rest of the package.
set_user_data( binary_matrix = myTFBSmatrix, metadata = myMetaData )
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
#> [2] GenomicFeatures_1.61.6
#> [3] AnnotationDbi_1.71.1
#> [4] Biobase_2.69.0
#> [5] GenomicRanges_1.61.2
#> [6] Seqinfo_0.99.2
#> [7] IRanges_2.43.0
#> [8] S4Vectors_0.47.0
#> [9] BiocGenerics_0.55.1
#> [10] generics_0.1.4
#> [11] dplyr_1.1.4
#> [12] TFEA.ChIP_1.29.3
#> [13] BiocStyle_2.37.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 farver_2.1.2
#> [3] blob_1.2.4 filelock_1.0.3
#> [5] R.utils_2.13.0 Biostrings_2.77.2
#> [7] bitops_1.0-9 fastmap_1.2.0
#> [9] RCurl_1.98-1.17 BiocFileCache_2.99.6
#> [11] GenomicAlignments_1.45.2 XML_3.99-0.19
#> [13] digest_0.6.37 lifecycle_1.0.4
#> [15] KEGGREST_1.49.1 RSQLite_2.4.3
#> [17] magrittr_2.0.3 compiler_4.5.1
#> [19] progress_1.2.3 rlang_1.1.6
#> [21] sass_0.4.10 tools_4.5.1
#> [23] yaml_2.3.10 rtracklayer_1.69.1
#> [25] knitr_1.50 prettyunits_1.2.0
#> [27] S4Arrays_1.9.1 bit_4.6.0
#> [29] curl_7.0.0 DelayedArray_0.35.2
#> [31] RColorBrewer_1.1-3 abind_1.4-8
#> [33] BiocParallel_1.43.4 withr_3.0.2
#> [35] R.oo_1.27.1 grid_4.5.1
#> [37] ExperimentHub_2.99.5 ggplot2_3.5.2
#> [39] scales_1.4.0 dichromat_2.0-0.1
#> [41] biomaRt_2.65.7 SummarizedExperiment_1.39.1
#> [43] cli_3.6.5 rmarkdown_2.29
#> [45] crayon_1.5.3 httr_1.4.7
#> [47] rjson_0.2.23 DBI_1.2.3
#> [49] cachem_1.1.0 stringr_1.5.2
#> [51] parallel_4.5.1 BiocManager_1.30.26
#> [53] XVector_0.49.0 restfulr_0.0.16
#> [55] matrixStats_1.5.0 vctrs_0.6.5
#> [57] Matrix_1.7-4 jsonlite_2.0.0
#> [59] bookdown_0.44 hms_1.1.3
#> [61] bit64_4.6.0-1 locfit_1.5-9.12
#> [63] jquerylib_0.1.4 glue_1.8.0
#> [65] org.Mm.eg.db_3.21.0 codetools_0.2-20
#> [67] gtable_0.3.6 stringi_1.8.7
#> [69] BiocVersion_3.22.0 BiocIO_1.19.0
#> [71] tibble_3.3.0 pillar_1.11.0
#> [73] rappdirs_0.3.3 htmltools_0.5.8.1
#> [75] R6_2.6.1 dbplyr_2.5.0
#> [77] httr2_1.2.1 evaluate_1.0.5
#> [79] lattice_0.22-7 AnnotationHub_3.99.6
#> [81] png_0.1-8 R.methodsS3_1.8.2
#> [83] Rsamtools_2.25.3 memoise_2.0.1
#> [85] bslib_0.9.0 Rcpp_1.1.0
#> [87] SparseArray_1.9.1 DESeq2_1.49.4
#> [89] org.Hs.eg.db_3.21.0 xfun_0.53
#> [91] MatrixGenerics_1.21.0 pkgconfig_2.0.3