1 Introduction

The identification of the transcription factor (TF) responsible for the corregulation of an specific set of genes is a common problem in transcriptomics. In the most simple scenario, the comparison of the transcriptome of cells or organisms in two conditions leads to the identification of a set of differentially expressed (DE) genes and the underlying assumption is that one or a few TFs regulate the expression of those genes. Traditionally, the identification of the relevant TFs has relied on the use of position weight matrices (PWMs) to predict transcription factor binding sites (TFBSs) proximal to the DE genes1 Wasserman, W.W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics, 5, 276–287. https://doi.org/10.1038/nrg1315. The comparison of predicted TFBS in DE versus control genes reveals factors that are significantly enriched in the DE gene set. These approaches have been useful to narrow down potential binding sites, but can suffer from high rates of false positives. In addition, this strategy is limited by design to sequence-specific TF and thus unable to identify cofactors that bind indirectly to target genes.

To overcome these limitations, TFEA.ChIP leverages the vast collection of publicly available ChIP-seq datasets to determine TFBSs linked to a given set of genes and performs enrichment analysis based on this experimentally derived, high-quality information. Specifically, TFEA.ChIP utilizes data from the ReMap 2022 repository, which contains hundreds of manually curated and uniformly processed DNA-binding experiments 2 Hammal, F., de Langen, P., Bergon, A., Lopez, F., & Ballester, B. (2022). ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Research, 50(D1), D316–D325. https://doi.org/10.1093/nar/gkab996.

A critical step in this analysis is linking TF binding sites from ChIP-seq experiments to their target genes. The previous version of TFEA.ChIP 3 Puente-Santamaría, L., Wasserman, W.W., & del Peso, L. (2019). TFEA.ChIP: a tool kit for transcription factor binding site enrichment analysis capitalizing on ChIP-seq datasets. Bioinformatics, 35(24), 5339–5340. https://doi.org/10.1093/bioinformatics/btz573 relied on GeneHancer 4 Fishilevich, S., Nudel, R., Rappaport, N., Hadar, R., Plaschkes, I., Iny Stein, T., Rosen, N., Kohn, A., Twik, M., Safran, M., Lancet, D., & Cohen, D. (2017). GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford), 2017, bax028. https://doi.org/10.1093/database/bax028, an integrative resource that connects genes to regulatory elements by aggregating data from ENCODE, FANTOM5, eQTLs, and chromatin interaction datasets. The current version of TFEA.ChIP includes a TF-gene interaction database constructed by integrating binding sites from the ReMap 2022 repository with enhancer-gene pairs generated by ENCODE-rE2G 5 Gschwind, A.R., et al. (2023). An encyclopedia of enhancer-gene regulatory interactions in the human genome. bioRxiv. https://doi.org/10.1101/2023.01.01.123456. This database, which contains 8,103 datasets from ChIP-seq experiments covering 1,193 human transcriptional regulators, demonstrated the best performance in internal benchmarking analyses and is therefore the default in this version of the tool. Users can also choose alternative TF–gene interaction databases, including those used in previous versions or custom-defined datasets. Additional databases are available through the accompanying Bioconductor data package ExperimentHub (to be released), and more information can be found at: https://github.com/LauraPS1/TFEA.ChIP_downloads and https://github.com/yberda/ChIPDBData.

TFEA.ChIP implements two enrichment analysis methods:

  • Overrepresentation Analysis. It is the analysis of the association of TFBS and differential expression from 2x2 tables recording the presence of binding sites for a given TF in DE and control genes. The statistical significance of the association for each factor determined by a Fisher’s exact test.
  • GSEA Analysis, based on the core function of the GSEA algorithm developed by the GSEA team at the Broad Institute of MIT and Harvard6 Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., & Mesirov, J.P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550. https://doi.org/10.1073/pnas.0506580102 7 Mootha, V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstråle, M., Laurila, E., Houstis, N., Daly, M.J., Patterson, N., Mesirov, J.P., Golub, T.R., Tamayo, P., Spiegelman, B., Lander, E.S., Hirschhorn, J.N., Altshuler, D., & Groop, L.C. (2003). PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. https://doi.org/10.1038/ng1180

Although primarily designed for analyzing expression data from human cells, TFEA.ChIP includes an option to incorporate datasets from mouse experiments by translating mouse gene names to their human orthologs. Internal analyses indicate that this approach outperforms TF-gene databases derived from mouse-specific enhancer-gene associations.

2 Analysis Example

TFEA.ChIP is designed to take the output of a differential expression analysis and identify TFBS enriched in the list of DE genes. In the case of the Overrepresentation Analysis, the only required input is a set of DE genes and, optionally, a set of control genes whose expression is not altered by the experimental conditions under study. For the GSEA analysis a ranked list of genes is required. This is supplied as a dataframe containing a column with gene names and a numerical column with the ranking metric, which typically are log2 fold change (LFC) or p-values for the gene expression changes in the two conditions under evaluation. To enhance integration with the Bioconductor ecosystem, the current version of the package natively supports output objects from widely used differential expression analysis tools such as edgeR and DESeq2.

For illustration purposes we will derive the input required for both analysis from a table containing the following field columns:

  • Gene name (Genes). Internally the package uses Entrez IDs, but translating from Gene Symbols and ENSEMBL IDs is available.
  • Log2 Fold Change (Log2FoldChange), indicating the difference in expression for each gene in the two experimental conditions being compared.
  • p-value (pvalue) or adjusted p-value (pval.adj) for the difference in gene expression between the two conditions.

The output of popular packages, such as DESeq2, for detection of differentially expressed genes from the analysis of count data from RNA-seq experiments produce tables with this information, and the output table can be taken by the tool. The hypoxia_DESeq and hypoxia datasets are the output of a differential expression analysis performed on an RNAseq experiment analyzing the response to hypoxia of endothelial cells8 Tiana, M., Acosta-Iborra, B., Puente-Santamaría, L., Hernansanz-Agustín, P., Worsley-Hunt, R., Masson, N., García-Rio, F., Mole, D., Ratcliffe, P., Wasserman, W.W., Jiménez, B., & del Peso, L. (2018). The SIN3A histone deacetylase complex is required for a complete transcriptional response to hypoxia. Nucleic Acids Research, 46(1), 120–133. https://doi.org/10.1093/nar/gkx951 deposited at the NCBI’s GEO repository (GSE89831).

We begin by loading the input data table into the R environment. The input is expected to be in a format compatible with results generated by the DESeq2 package:

library(TFEA.ChIP)
library(dplyr)

data( "hypoxia_DESeq", "hypoxia", package="TFEA.ChIP" ) # Load example datasets
hypoxia_df <- preprocessInputData( hypoxia_DESeq )
#> Warning: Some genes returned 1:many mapping to ENTREZ ID.

# Display the first few rows
head(hypoxia_df)
#>   Symbol Genes log2FoldChange        pvalue      pval.adj
#> 1  DNAH8  1769       4.588064  3.942129e-62  3.285202e-59
#> 2   STC1  6781       4.339768  7.840000e-11  2.714222e-09
#> 3  SYTL2 54843       4.307058  1.435621e-75  1.522672e-72
#> 4 IGFBP3  3486       4.297902  6.595438e-18  5.535897e-16
#> 5    AK4   205       4.235590 1.318311e-125 5.126912e-122
#> 6  PTGIS  5740       3.931662  4.398104e-15  2.631420e-13

Alternatively, the input can be a simple data frame containing columns for gene identifiers (such as Entrez IDs, Ensembl IDs, or gene symbols), LFC and p-values or adjusted p-values:

# Preview the raw hypoxia dataset
head(hypoxia)
#>      Genes log2FoldChange      pvalue   pval.adj
#> 1     TNMD     0.00000000 1.000000000 1.00000000
#> 2     DPM1    -0.44497788 0.001243905 0.01809033
#> 3    SCYL3    -0.27092276 0.139977254 0.65057456
#> 4 C1orf112    -0.52382569 0.035752231 0.25407939
#> 5      FGR    -0.44810165 0.217220835 0.83961811
#> 6      CFH     0.05237749 0.740720152 1.00000000
hypoxia <- preprocessInputData( hypoxia )
#> Warning: Some genes returned 1:many mapping to ENTREZ ID.

# Display the first few rows
head(hypoxia)
#>        Genes log2FoldChange       pvalue     pval.adj Symbol
#> 10785   1365       5.319709 9.647797e-77 2.817857e-73  CLDN3
#> 10138 284525       5.179949 2.411485e-64 5.414910e-61 SLC9C2
#> 9362  392255       4.401819 3.559414e-91 1.998137e-87   GDF6
#> 3958    4883       4.208881 1.414157e-39 1.082537e-36   NPR3
#> 10320   2199       4.060889 1.305316e-46 1.465522e-43  FBLN2
#> 5256    1769       4.024703 8.418844e-65 2.025454e-61  DNAH8

After processing the data with the preprocessInputData() function, the resulting dataset will be ready for downstream analysis with the TFEA.ChIP package. This function converts gene identifiers to Entrez Gene IDs and sorts the table by LFC for consistency.

Note: Using the option mode = m2h will convert mouse gene IDs to their equivalent human gene ID, thus taking advantage of the wider availability of ChIP-seq experiments done on human cells. This strategy relies on the overlap between human and mouse transcription regulatory mechanisms. Nevertheless, we advise to be cautious using this approach, since extrapolating results from one organism to another is not always appropriate.

2.1 Selecting a preferred TF–target gene database

By default, TFEA.ChIP uses a reduced internal reference database restricted to ChIP-seq experiments performed in ENCODE tier 1, 2, and 2.5 cell types. This streamlined database is intentionally minimal to accommodate users with limited computing resources and may not fully capture the diversity of available TF–gene associations. To perform enrichment analysis using the complete ChIP-seq database, load the curated dataset from ExperimentHub and assign it to an object named ChIPDB to override the default:

library(ExperimentHub)
eh <- ExperimentHub()
ChIPDB <- eh[['EH9854']]  # rE2G300d

The database shown here is derived from ENCODE’s rE2G predictive models (“rE2G300d”, see ChIPDBData GitHub repository for details). However, additional ready-to-use databases based on rE2G, GeneHancer, and CREdb are also available. To use one of these alternative databases, select a different entry in the eh object:

query(eh, "ChIPDBData")

2.2 Overrepresentation analysis

2.2.1 Identification of DE genes

As mentioned previously, this analysis requires two gene sets: one consisting of differentially expressed genes (e.g., upregulated), and another comprising control genes whose expression remains unchanged under the experimental conditions.

The Select_genes() function facilitates this selection. For example, upregulated genes can be defined as those with a LFC greater than 1 and an adjusted p-value (adjP) less than 0.05. Control genes can be selected based on an LFC between -0.25 and 0.25 and an adjP between 0.5 and 1:

# Extract vector with names of upregulated genes
Genes.Upreg <- Select_genes( hypoxia_df, min_LFC = 1, max_pval = 0.05 )

# Extract vector with names of non-responsive genes
Genes.Control <- Select_genes( hypoxia_df,
 min_pval = 0.5, max_pval = 1,
 min_LFC = -0.25, max_LFC = 0.25 )

2.2.2 Translate the gene IDs to Entrez Gene IDs

In situations where preprocessing is not applicable, such as when using an external or manually curated gene list, gene identifiers must be standardized to Entrez Gene IDs for compatibility with the package’s downstream functions. The GeneID2entrez() function supports this conversion:

# Conversion of hgnc to ENTREZ IDs
GeneID2entrez( gene.IDs = c("EGLN3","NFYA","ALS2","MYC","ARNT" ) )
#> [1] "112399" "4800"   "57679"  "4609"   "405"

# To translate from mouse IDs:
# GeneID2entrez( gene.IDs = c( "Hmmr", "Tlx3", "Cpeb4" ), mode = "m2h" ) # To get the equivalent human gene IDs

2.2.3 Overrepresentation test

In this step, a contingency table is built for each TF represented in the ChIP-seq regulatory map. Each TF-specific contingency table summarizes the overlap between TF binding and differential expression:

TFbound_yes TFbound_no
DE_yes number y/y number y/n
DE_no number n/y number n/n

Fisher’s exact test is then applied to each contingency table to evaluate the null hypothesis that TF binding and differential expression are independent events. In addition to raw p-values, the function returns false discovery rate (FDR)-adjusted p-values to account for multiple testing, as well as the odds ratio (OR), which quantifies the strength of the association between the DE genes and the TF based on ChIP-seq data.

CM_list_UP <- contingency_matrix( Genes.Upreg, Genes.Control ) # Generates list of contingency tables, one per dataset
pval_mat_UP <- getCMstats( CM_list_UP ) # Generates list of p-values and OR from association test
head( pval_mat_UP )
#>                     Accession    Cell Treatment    TF      p.value        OR
#> 873    GSE89836.EPAS1.HUVEC-C HUVEC-C           EPAS1 2.883815e-06 63.194453
#> 345 ENCSR091BOQ.SUZ12.GM12878 GM12878           SUZ12 9.452603e-32  5.674107
#> 738    GSE109625.EZH2.HUVEC-C HUVEC-C            EZH2 4.816831e-24  7.745379
#> 612     ENCSR784VUY.RNF2.WA01    WA01            RNF2 6.049224e-23  3.463252
#> 9     ENCSR000ARI.EZH2.Hep-G2  Hep-G2            EZH2 5.579521e-22  5.167738
#> 715       GSE104690.RNF2.WA01    WA01            RNF2 3.864458e-22  3.840774
#>          OR.SE  log2.OR  adj.p.value log10.adj.pVal distance
#> 873 68.3644690 5.981726 6.918168e-06       5.160009 62.40814
#> 345  0.7785041 2.504393 4.862839e-30      29.313110 29.68342
#> 738  1.4158940 2.953336 9.911969e-23      22.003840 23.01454
#> 612  0.4246438 1.792127 9.657899e-22      21.015117 21.15899
#> 9    0.8024915 2.369533 7.948671e-21      20.099705 20.52726
#> 715  0.5025048 1.941397 5.771755e-21      20.238692 20.43709

In the example above, all 8,103 TF binding datasets included in the internal database were used for the enrichment analysis. However, the analysis can be restricted to a specific subset of TFs. To do this, the get_chip_index() function can be used to generate an index referencing the desired subset of datasets. This index can then be passed as an argument to the contingency_matrix() function.

chip_index <- get_chip_index( TFfilter = c( "HIF1A","EPAS1","ARNT" ) ) # Restrict the analysis to datasets assaying these factors

CM_list_UPe <- contingency_matrix( Genes.Upreg, chip_index = chip_index ) # Generates list of contingency tables
pval_mat_UPe <- getCMstats( CM_list_UPe, chip_index ) # Generates list of p-values and ORs
head( pval_mat_UPe )
#>                      Accession          Cell Treatment    TF      p.value
#> 6       GSE89836.EPAS1.HUVEC-C       HUVEC-C           EPAS1 1.086534e-08
#> 5        GSE89836.ARNT.HUVEC-C       HUVEC-C            ARNT 2.066421e-14
#> 3 GSE39089.HIF1A.HUVEC-C_HYPOX HUVEC-C_HYPOX     HYPOX HIF1A 2.363868e-24
#> 1      ENCSR029IBC.ARNT.Hep-G2        Hep-G2            ARNT 7.772175e-03
#> 2     ENCSR590KEQ.ARNT.GM12878       GM12878            ARNT 3.734244e-05
#> 4  GSE39089.HIF1A.HUVEC-C_NOMO  HUVEC-C_NOMO      NOMO HIF1A 5.734422e-02
#>          OR      OR.SE  log2.OR  adj.p.value log10.adj.pVal  distance
#> 6 53.838261 27.0996505 5.750560 2.173068e-08       7.662927 53.391032
#> 5 20.275898  5.8177766 4.341694 6.199262e-14      13.207660 23.366697
#> 3  4.774115  0.6379584 2.255233 1.418321e-23      22.848226 23.157836
#> 1  5.412187  2.7913479 2.436212 9.326610e-03       2.030276  4.856893
#> 2  1.775250  0.2364437 0.828022 5.601366e-05       4.251706  4.321807
#> 4  2.843944  1.4501576 1.507893 5.734422e-02       1.241510  2.222943

Note: When a complete ranked gene list is provided as input, the current version of the package can automatically identify expressed TFs and limit the analysis to that subset. Details on this process are explained later in the document. Additionally, specifying a control gene set is optional: if omitted, the function defaults to using all annotated human genes not included in the test set as controls.

The TFEA.ChIP package includes a metadata table that provides detailed information about each ChIP-seq dataset in the internal database. This includes experimental conditions, TF targets, and dataset sources. The metadata can be accessed as follows:

data("MetaData", package = "TFEA.ChIP")
head(MetaData)
#>                                     Accession          ID
#> 1                      ENCSR000AHD.CTCF.MCF-7 ENCSR000AHD
#> 2                      ENCSR000AHF.TAF1.MCF-7 ENCSR000AHF
#> 3                    ENCSR000AKB.CTCF.GM12878 ENCSR000AKB
#> 4                      ENCSR000AKO.CTCF.K-562 ENCSR000AKO
#> 5 ENCSR000ALA.CTCF.endothelial_umbilical-vein ENCSR000ALA
#> 6               ENCSR000ALJ.CTCF.keratinocyte ENCSR000ALJ
#>                         Cell      Treatment   TF
#> 1                      MCF-7                CTCF
#> 2                      MCF-7                TAF1
#> 3                    GM12878                CTCF
#> 4                      K-562                CTCF
#> 5 endothelial_umbilical-vein umbilical-vein CTCF
#> 6               keratinocyte                CTCF

The TFEA.ChIP package provides a function that integrates all key analysis steps into a single command, simplifying the workflow. This streamlined function is particularly helpful when working with complete gene tables, as it can automatically detect TFs that are expressed in the input dataset and restrict the analysis accordingly. By default, this filtering step is enabled to avoid testing TFs that are unlikely to be relevant in the given biological context, thereby reducing the multiple testing burden and increasing statistical power. If desired, this behavior can be disabled by setting the expressed argument to FALSE:

pval_mat_UPe <- analysis_from_table(hypoxia_df, 
 method = 'ora', # Overrepresentation Analysis
 interest_min_LFC = 1, # min LFC
 expressed = TRUE, # only take expressed TFs
 control_min_pval = 0.5, control_max_pval = 1, # control pval limits
 control_min_LFC = -0.25, control_max_LFC = 0.25) # control LFC limits

In the analyses described above, each ChIP-seq dataset is treated independently. This approach allows for the identification of specific cell types or experimental conditions where a TF may be active, facilitating biological interpretation in relation to the study samples. Alternatively, the package supports the integration of all ChIP-seq datasets associated with a given TF, irrespective of experimental context. This strategy helps to highlight TFs that show consistent enrichment across conditions. The rankTFs() function enables this summarization by applying either a Wilcoxon rank-sum test or a GSEA approach to assess whether the ChIP-seq profiles corresponding to the same TF are collectively enriched or depleted in the ranked results.

Important consideration: For TFs whose activity is highly context-dependent, aggregating ChIP-seq data across diverse experimental conditions may obscure biologically meaningful enrichments. Additionally, many public ChIP-seq datasets are derived from baseline or control conditions in which the TF may not be active. While this metadata is informative for interpreting individual experiments, it can confound integrated analyses. Therefore, we recommend using non-integrated (individual) results for more accurate and context-specific interpretation whenever possible.

TF_ranking <- rankTFs( pval_mat_UP, rankMethod = "gsea", makePlot = TRUE )
head( TF_ranking[[ "TF_ranking" ]] )
TF_ranking[[ "TFranking_plot" ]]

Snapshot of a plot generated with rankTFs.

Meta-analyses can also be used to summarize results across multiple ChIP experiments for the same TF, accounting for the standard error of the OR. This approach gives greater weight to more precise estimates. However, it may also dilute effects that are context-dependent:

library(meta)
TF_ranking2 <- metaanalysis_fx( pval_mat_UP )

# Get HIF results
filter(TF_ranking2$summary,
  TF %in% c("ARNT", "HIF1A", "EPAS1"))

Snapshot of the meta-analysis results for HIF1A, EPAS1 and ARNT.

In this analysis, k denotes the number of ChIP experiments included. OR represents the weighted average Odds Ratio, OR.SE is its standard error, and CI refers to the corresponding confidence interval. The pooled effect size (OR values) estimated by the meta-analysis integrates all datasets for a given TF, which may dilute effects that are context-dependent. However, a forest plot can be generated to visualize the results of individual experiments alongside the integrated value:

# Result for HIF1A
forest(TF_ranking2[['results']]$HIF1A)

Snapshot of a plot generated with forest (meta package).

2.2.4 Plot results

The table of results generated by getCMstats can be parsed to select candidate TF. The function plot_CM uses the package plotly to generate an interactive plot representing the p-value against the Odds Ratio that is very helpful to explore the results:

plot_CM( pval_mat_UP ) # plot p-values against ORs

Snapshot of a plot generated with plot_CM.

As expected for the hypoxia dataset, visual inspection of the enrichment graph reveals a strong signal for several HIF-related ChIP-seq datasets. These can be highlighted to emphasize their relevance in the current analysis:

HIFs <- c( "EPAS1","HIF1A","ARNT" )
plot_CM( pval_mat_UP, specialTF = HIFs ) # Plot p-values against ORs highlighting indicated TFs

Snapshot of a plot generated with plot_CM.

2.3 Gene Set Enrichment Analysis

2.3.1 Generate a sorted list of ENTREZ IDs

The Gene Set Enrichment Analysis (GSEA) implemented in TFEA.ChIP requires a ranked list of genes as input. By default, the preprocessInputData() function sorts genes in descending order based on their LFC values. However, users can choose to rank genes using alternative numerical metrics, such as the p-value, adjusted p-value, or a composite measure like the product of LFC and –log10(p-value), depending on the specific goals of the analysis.

If a custom gene ranking is used, it is essential to ensure that gene identifiers are in Entrez Gene ID format. If necessary, the GeneID2entrez() function can be used to convert gene symbols or alternative identifiers into the required format.

2.3.2 Select the ChIP-Seq datasets to analyze

By default, the analysis will include all the ChIP-Seq experiments available in the database. However, this analysis might take several minutes to run. To restrict the analysis to a subset of the database we can generate an index variable and pass it to the function GSEA_run(). This will limit the analysis to the ChIP-Seq datasets of the user’s choosing. This index variable can be generated using the function get_chip_index() and allows the user to select the whole database, the set of ChIP-Seq experiments produced by the ENCODE project (“encode”) or a specific subset of TFs (as a vector containing the TF names).

chip_index <- get_chip_index( TFfilter = c( "HIF1A","EPAS1","ARNT" ) ) # Restrict the analysis to datasets assaying these factors

2.3.3 Run the GSEA analysis

The function GSEA_run() will perform a GSEA-based analysis on the input gene list. This function is based on the R-GSEA R script bundle written by the GSEA team at the Broad Institute of MIT and Harvard. The output of the analysis depends on the variable get.RES:

  • When FALSE, the function returns a data frame storing maximum Enrichment Score and associated p-value determined for each dataset included in the analysis.

  • When TRUE, the function returns a list of three elements. The first element (Enrichment.table) is the enrichment data frame previously mentioned. The second element (RES) is a list of vectors containing the Running Enrichment Scores values (see GSEA documentation, http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Interpreting_GSEA_Results) for each of the sorted genes tested against each one of the analyzed ChIP datasets. The third element (indicators) is a list of vectors indicating if the sorted genes were bound by the factor analyzed in each ChIP dataset.

# run GSEA analysis
GSEA.result <- GSEA_run( hypoxia_df$Genes, hypoxia_df$log2FoldChange, chip_index, get.RES = TRUE) 

Finally, the package provides a function that integrates all the above steps into a single command. In the current version of the package, only TFs present in the input sorted list of genes (i.e., TFs expressed under the conditions under study) are considered for the analysis. This default behavior can be overridden by setting the parameter expressed to FALSE:

GSEA.result <- analysis_from_table(hypoxia_df,
                                   TFfilter = c('ARNT', 'HIF1A', 'EPAS1'),
                                   expressed = TRUE,  # Takes only expressed TFs
                                   method = 'gsea') # Gene Set Enrichment Analysis

2.3.4 Plotting the results

TFEA.ChIP provides two functions, plot_ES() and plot_RES(), to create interactive HTML visualizations of GSEA results.

  1. Plot Enrichment Scores with plot_ES()

The plot_ES() function generates an interactive enrichment score (ES) plot. It supports highlighting ChIP-seq datasets associated with specific TFs by displaying them in distinct colors, allowing users to easily identify TFs of interest within the broader analysis.

TF.hightlight <- c( "EPAS1","ARNT","HIF1A" )

plot_ES( GSEA.result$result, LFC = GSEA.result$processed_table$log2FoldChange, specialTF = TF.hightlight)

Snapshot of a plot generated with plot_ES

  1. Plot Running Enrichment Scores with plot_RES()

This function will plot all the RES stored in the GSEA_run() output. It is only recommended to restrict output to specific TF and/or datasets by setting the parameters TF and/or Accession respectively:

plot_RES( 
 GSEA_result = GSEA.result$result, LFC = hypoxia_df$log2FoldChange,
 Accession = c("GSE89836.ARNT.HUVEC-C", "GSE89836.EPAS1.HUVEC-C" ) )

Snapshot of a plot generated with plot_RES

3 Building a TF-gene binding database

If the user wants to generate their own database of ChIPseq datasets, the functions txt2gr() and makeChIPGeneDB() automate most of the process. The required inputs are:

  • A Metadata table (storing at least, Accession ID, name of the file, and TF tested in the ChIP-Seq experiment). The metadata table included with this package has the following fields: “Name”, “Accession”, “Cell”, “Cell Type”, “Treatment”, “Antibody”, and “TF”.

  • A folder containing ChIP-Seq peak data, either in “.narrowpeak” format or the MACS output files "_peaks.bed" -a format that stores “chr”, “start”, “end”, “name”, and “Q-value” of every peak-.

3.1 Filter peaks from source and store them as a GRanges object

Specify the folder where the ChIP-Seq files are stored, create an array with the names of the ChIP-Seq files, and choose a format. Set a for loop to convert all your files to GenomicRanges objects using txt2GR(). Please note that, by default, only peaks with an associated p-value of 0.05 (for narrow peaks files) or 1e-5 (for MACS files) will be kept. The user can modify the default values by setting the alpha argument to the desired threshold p-value.

folder <- "~/peak.files.folder"
File.list<-dir( folder )
format <- "macs"

gr.list <- lapply(
 seq_along( File.list ),
 function( File.list, myMetaData, format ){
 
 tmp<-read.table( File.list[i], ..., stringsAsFactors = FALSE )
 
 file.metadata <- myMetaData[ myMetaData$Name == File.list[i], ]
 
 ChIP.dataset.gr<-txt2GR(tmp, format, file.metadata)
 
 return(ChIP.dataset.gr)
 },
 File.list = File.list,
 myMetadata = myMetadata,
 format = format
)
# As an example of the output
data( "ARNT.peaks.bed","ARNT.metadata", package = "TFEA.ChIP" ) # Loading example datasets for this function
ARNT.gr <- txt2GR( ARNT.peaks.bed, "macs1.4", ARNT.metadata )
head( ARNT.gr, n=2 )
#> GRanges object with 2 ranges and 9 metadata columns:
#>       seqnames          ranges strand |       score             mcols.Name
#>          <Rle>       <IRanges>  <Rle> |   <numeric>            <character>
#>   [1]     chr1 1813434-1814386      * | 4.60909e-06 GSM2390643_ARNT_ChIP..
#>   [2]     chr1 3456399-3457536      * | 8.82450e-27 GSM2390643_ARNT_ChIP..
#>       mcols.Accession  mcols.Cell mcols.Cell Type mcols.Treatment
#>           <character> <character>     <character>     <character>
#>   [1]      GSM2390643       HUVEC                         Hypoxia
#>   [2]      GSM2390643       HUVEC                         Hypoxia
#>       mcols.Antibody    mcols.TF  mcols.Score Type
#>          <character> <character>       <character>
#>   [1]          HIF1B        ARNT corrected p-Value
#>   [2]          HIF1B        ARNT corrected p-Value
#>   -------
#>   seqinfo: 32 sequences from an unspecified genome; no seqlengths

3.2 Assign TFBS peaks from ChIP dataset to specific genes

The function makeChIPGeneDB() assigns the TFBS peaks in the ChIP datasets stored in gr.list() to a gene. To this end, a ChIP peak overlaping a regulatory region region receive the gene label associated to said region. By default the function also assigns the gene name when the ChIP peak does not overlap a regulatory region but maps at less than 10 nucleotides from it. This behaviour can be modified by setting the argument distanceMargin to the desired value (by default distanceMargin = 10 bases).

The resulting ChIP-Gene data base is a list containing two elements: - Gene Keys: vector of gene IDs. - ChIP Targets: list of vectors, one per element in gr.list, containing the putative targets assigned. Each target is coded as its position in the vector ‘Gene Keys’.

data( "DnaseHS_db", "gr.list", package="TFEA.ChIP" ) # Loading example datasets for this function
TF.gene.binding.db <- makeChIPGeneDB( DnaseHS_db, gr.list ) 
str( TF.gene.binding.db )
#> List of 2
#>  $ Gene Keys   : chr [1:54] "100507501" "10209" "10277" "10458" ...
#>  $ ChIP Targets:List of 1
#>   ..$ wgEncodeEH002402: int [1:54] 1 2 3 4 5 6 7 8 9 10 ...

The function, accepts any Genomic Range object that includes a metacolumn with a gene ID (stored in the @elementMetadata@listdata [["gene_id"]] slot of the object) for each genomic segment. For example, asignation of peaks to genes can be done by providing a list of all the genes in the genome:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: BiocGenerics
#> Loading required package: generics
#> 
#> Attaching package: 'generics'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain
#> The following objects are masked from 'package:base':
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union
#> 
#> Attaching package: 'BiocGenerics'
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#>     unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:dplyr':
#> 
#>     first, rename
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following objects are masked from 'package:dplyr':
#> 
#>     collapse, desc, slice
#> Loading required package: Seqinfo
#> Loading required package: GenomicRanges
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'AnnotationDbi'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
data( "gr.list", package="TFEA.ChIP") # Loading example datasets for this function
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
Genes <- genes( txdb )
#>   2169 genes were dropped because they have exons located on both strands of
#>   the same reference sequence or on more than one reference sequence, so cannot
#>   be represented by a single genomic range.
#>   Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
#>   object, or use suppressMessages() to suppress this message.
TF.gene.binding.db <- makeChIPGeneDB( Genes, gr.list, distanceMargin = 0 )
str( TF.gene.binding.db )
#> List of 2
#>  $ Gene Keys   : chr [1:35332] "1" "10" "100" "1000" ...
#>  $ ChIP Targets:List of 1
#>   ..$ wgEncodeEH002402: int [1:41] 1656 3217 3791 4256 7188 8381 8781 9637 11099 12229 ...

In this case the information about Dnase hypersensitivity is disregarded and peaks are assigned to overlapping genes (or genes closer than distanceMargin residues).

3.3 Substitute the default database by a custom generated table.

At the beginning of a session, use the function set_user_data() to use your TFBS binary matrix and metadata table with the rest of the package.

set_user_data( binary_matrix = myTFBSmatrix, metadata = myMetaData )
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
#>  [2] GenomicFeatures_1.61.6                  
#>  [3] AnnotationDbi_1.71.1                    
#>  [4] Biobase_2.69.0                          
#>  [5] GenomicRanges_1.61.2                    
#>  [6] Seqinfo_0.99.2                          
#>  [7] IRanges_2.43.0                          
#>  [8] S4Vectors_0.47.0                        
#>  [9] BiocGenerics_0.55.1                     
#> [10] generics_0.1.4                          
#> [11] dplyr_1.1.4                             
#> [12] TFEA.ChIP_1.29.3                        
#> [13] BiocStyle_2.37.1                        
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1            farver_2.1.2               
#>  [3] blob_1.2.4                  filelock_1.0.3             
#>  [5] R.utils_2.13.0              Biostrings_2.77.2          
#>  [7] bitops_1.0-9                fastmap_1.2.0              
#>  [9] RCurl_1.98-1.17             BiocFileCache_2.99.6       
#> [11] GenomicAlignments_1.45.2    XML_3.99-0.19              
#> [13] digest_0.6.37               lifecycle_1.0.4            
#> [15] KEGGREST_1.49.1             RSQLite_2.4.3              
#> [17] magrittr_2.0.3              compiler_4.5.1             
#> [19] progress_1.2.3              rlang_1.1.6                
#> [21] sass_0.4.10                 tools_4.5.1                
#> [23] yaml_2.3.10                 rtracklayer_1.69.1         
#> [25] knitr_1.50                  prettyunits_1.2.0          
#> [27] S4Arrays_1.9.1              bit_4.6.0                  
#> [29] curl_7.0.0                  DelayedArray_0.35.2        
#> [31] RColorBrewer_1.1-3          abind_1.4-8                
#> [33] BiocParallel_1.43.4         withr_3.0.2                
#> [35] R.oo_1.27.1                 grid_4.5.1                 
#> [37] ExperimentHub_2.99.5        ggplot2_3.5.2              
#> [39] scales_1.4.0                dichromat_2.0-0.1          
#> [41] biomaRt_2.65.7              SummarizedExperiment_1.39.1
#> [43] cli_3.6.5                   rmarkdown_2.29             
#> [45] crayon_1.5.3                httr_1.4.7                 
#> [47] rjson_0.2.23                DBI_1.2.3                  
#> [49] cachem_1.1.0                stringr_1.5.2              
#> [51] parallel_4.5.1              BiocManager_1.30.26        
#> [53] XVector_0.49.0              restfulr_0.0.16            
#> [55] matrixStats_1.5.0           vctrs_0.6.5                
#> [57] Matrix_1.7-4                jsonlite_2.0.0             
#> [59] bookdown_0.44               hms_1.1.3                  
#> [61] bit64_4.6.0-1               locfit_1.5-9.12            
#> [63] jquerylib_0.1.4             glue_1.8.0                 
#> [65] org.Mm.eg.db_3.21.0         codetools_0.2-20           
#> [67] gtable_0.3.6                stringi_1.8.7              
#> [69] BiocVersion_3.22.0          BiocIO_1.19.0              
#> [71] tibble_3.3.0                pillar_1.11.0              
#> [73] rappdirs_0.3.3              htmltools_0.5.8.1          
#> [75] R6_2.6.1                    dbplyr_2.5.0               
#> [77] httr2_1.2.1                 evaluate_1.0.5             
#> [79] lattice_0.22-7              AnnotationHub_3.99.6       
#> [81] png_0.1-8                   R.methodsS3_1.8.2          
#> [83] Rsamtools_2.25.3            memoise_2.0.1              
#> [85] bslib_0.9.0                 Rcpp_1.1.0                 
#> [87] SparseArray_1.9.1           DESeq2_1.49.4              
#> [89] org.Hs.eg.db_3.21.0         xfun_0.53                  
#> [91] MatrixGenerics_1.21.0       pkgconfig_2.0.3