SIP R Package

The SIP R Package is designed to help researchers implement the single-iteration permutation (SIP) method for large-scale biobank data (Annis et. al. “False discovery rates for genome-wide association tests in biobanks with thousands of phenotypes.” https://doi.org/10.21203/rs.3.rs-873449/v1).

Briefly, the SIP method takes advantage of analyzing many phenotypes across a biobank simultaneously. When a large number of phenotypes are analyzed in parallel, a single permutation across all phenotypes followed by genetic association analyses of the permuted data enables estimation of false discovery rates (FDRs) across the phenome. FDR estimates in turn help with interpretation of genetic associations in a biobank context.

SIP Example

The basic permutation procedure can be used with unrelated samples. SIP requires a sample-by-variable input file containing an ID, genotypic covariates, phenotypes, and phenotypic covariates:

sip_exampleData[1:5, 1:15]
##        FID     IID ANCESTRY SEX AGE BATCH STUDY         PC1       PC2
## 1 FAM00001 ID00001      AFR   1  78     1  STU1 -0.00997818 0.0363497
## 2 FAM00002 ID00002      AFR   1  86     1  STU1 -0.01303270 0.0349262
## 3 FAM00003 ID00003      AFR   1  71     1  STU1 -0.01071080 0.0345597
## 4 FAM00004 ID00004      EUR   2  54     1  STU1 -0.01244230 0.0364125
## 5 FAM00005 ID00005      EUR   2  74     1  STU1 -0.01109950 0.0370039
##            PC3        PC4 PHENO1 PHENO2 PHENO3 PHENO4
## 1 -0.002534320 0.00362118      0      0      0      0
## 2  0.002535730 0.00291765      1      1      0      0
## 3 -0.002234060 0.00399563      1      1      0      1
## 4 -0.000127672 0.00474610      1      1      0      1
## 5 -0.001726030 0.00579052      1      1      1      1

The default permutation method will permute within biological sex and requires specification of a sex variable and male and female values as well as ID and genotypic covariates. Additional ID variables (e.g., family ID) can be placed within the genotypic covariate vector. For example:

sip(df = sip_exampleData, id.var = "IID", sex.var = "SEX", male.val = 1, female.val = 2, geno.vars = c("FID","ANCESTRY","BATCH",paste0("PC",1:4)))

Users may specify a seed. If not specified, a random seed will be chosen:

sip(df = sip_exampleData, id.var = "IID", sex.var = "SEX", male.val = 1, female.val = 2, geno.vars = c("FID","ANCESTRY","BATCH",paste0("PC",1:4)), seed = 123)

If permutation without regard to sex is desired, the within.sex argument can be set to FALSE and the sex variable and male and female values are no longer required:

sip(df = sip_exampleData, id.var = "IID", within.sex = FALSE, geno.vars = c("FID","ANCESTRY","BATCH",paste0("PC",1:4)))

SIP-pair Example

The paired permutation procedure should be used with related samples:

sipPair_exampleData[1:5, 1:15]
##       FID     IID BATCH SEX AGE     PC1     PC2    PC3     PC4 PHENO1 PHENO2
## 1 FAM0001 ID00001     1   M  72 -0.0006 -0.0006 0.0034 -0.0010      0      0
## 2 FAM0001 ID00002     1   F  80 -0.0006 -0.0010 0.0033  0.0007      0      0
## 3 FAM0002 ID00003     1   F  82  0.0052 -0.0001 0.0022  0.0001      0      0
## 4 FAM0002 ID00004     1   M  80  0.0055 -0.0006 0.0020  0.0015      1      0
## 5 FAM0003 ID00005     1   F  78 -0.0013 -0.0012 0.0007 -0.0002      0      0
##   PHENO3 PHENO4 PHENO5 PHENO6
## 1      1      0      0      1
## 2      1      1      0      1
## 3      1      0      0      0
## 4      1      1      1      0
## 5      1      0      0      0

SIP-pair requires a data frame with identity by descent (IBD) relatedness information between individuals:

head(sipPair_relatednessData)
##         IID1    IID2 PropIBD
## 8343 ID04318 ID06117  0.0016
## 5839 ID06013 ID06015  0.4766
## 3122 ID02006 ID02007  0.5000
## 9641 ID04318 ID07882  0.0000
## 1989 ID06986 ID06987  0.5114
## 2450 ID04400 ID04401  0.5029

The default permutation method will permute within biological sex and requires specification of a sex variable and male and female values as well as ID and genotypic covariates. Additional ID variables (e.g., family ID) can be placed within the genotypic covariate vector. SIP-pair also requires specification of the relatedness data frame, pairs of ID variables within the relatedness data frame, and an IBD variable:

sip_pair(df = sipPair_exampleData, id.var = "IID", sex.var = "SEX", male.val = "M", female.val = "F", geno.vars = c("FID","BATCH",paste0("PC",1:4)), rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD")

Users may specify a seed. If not specified, a random seed will be chosen:

sip_pair(df = sipPair_exampleData, id.var = "IID", sex.var = "SEX", male.val = "M", female.val = "F", geno.vars = c("FID","BATCH",paste0("PC",1:4)), rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", seed = 123)

If permutation without regard to sex is desired, the within.sex argument can be set to FALSE and the sex variable and male and female values are no longer required:

sip_pair(df = sipPair_exampleData, id.var = "IID", geno.vars = c("FID","BATCH",paste0("PC",1:4)), rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", within.sex = FALSE)

Phenotype Correlation Plots

The SIP method maintains correlation among phenotypes:

  1. Permute data
sip_exampleData_permuted <- sip(df = sip_exampleData, id.var = "IID", sex.var = "SEX", male.val = 1, female.val = 2, geno.vars = c("FID","ANCESTRY","BATCH",paste0("PC",1:4)))
## [1] "Seed: 453014"
sip_exampleData_permuted[1:5, 1:15]
##        FID     IID ANCESTRY SEX AGE BATCH STUDY         PC1       PC2
## 1 FAM00001 ID00001      AFR   1  41     1  STU1 -0.00997818 0.0363497
## 2 FAM00002 ID00002      AFR   1  59     1  STU1 -0.01303270 0.0349262
## 3 FAM00003 ID00003      AFR   1  67     1  STU1 -0.01071080 0.0345597
## 4 FAM00004 ID00004      EUR   2  36     1  STU2 -0.01244230 0.0364125
## 5 FAM00005 ID00005      EUR   2  64     1  STU1 -0.01109950 0.0370039
##            PC3        PC4 PHENO1 PHENO2 PHENO3 PHENO4
## 1 -0.002534320 0.00362118      1      1      1      0
## 2  0.002535730 0.00291765      0      0      1      0
## 3 -0.002234060 0.00399563      0      1      0      0
## 4 -0.000127672 0.00474610      1      1      0      1
## 5 -0.001726030 0.00579052      1      1      1      1
  1. Plot correlations between phenotypes in the primary data
plot_phenotype_correlations(df = sip_exampleData, pheno.vars = paste0("PHENO",1:500))

  1. Plot correlations between phenotypes in the permuted data
plot_phenotype_correlations(df = sip_exampleData_permuted, pheno.vars = paste0("PHENO",1:500))

Phenotype-IBD Correlation

The SIP-pair method preserves correlation among related samples:

  1. Permute data without pairing
sipPair_exampleData_permuted <- sip(df = sipPair_exampleData, id.var = "IID", sex.var = "SEX", male.val = "M", female.val = "F", geno.vars = c("FID","BATCH",paste0("PC",1:4)))
## [1] "Seed: 821244"
sipPair_exampleData_permuted[1:5, 1:15]
##       FID     IID BATCH SEX AGE     PC1     PC2    PC3     PC4 PHENO1 PHENO2
## 1 FAM0001 ID00001     1   M  75 -0.0006 -0.0006 0.0034 -0.0010      0      0
## 2 FAM0001 ID00002     1   F  62 -0.0006 -0.0010 0.0033  0.0007      1      0
## 3 FAM0002 ID00003     1   F  67  0.0052 -0.0001 0.0022  0.0001      0      0
## 4 FAM0002 ID00004     1   M  78  0.0055 -0.0006 0.0020  0.0015      1      0
## 5 FAM0003 ID00005     1   F  67 -0.0013 -0.0012 0.0007 -0.0002      1      0
##   PHENO3 PHENO4 PHENO5 PHENO6
## 1      0      0      1      0
## 2      1      0      0      0
## 3      0      0      0      0
## 4      0      0      0      0
## 5      0      0      1      0
  1. Permute data with pairing
sipPair_exampleData_pairPermuted <- sip_pair(df = sipPair_exampleData, id.var = "IID", sex.var = "SEX", male.val = "M", female.val = "F", geno.vars = c("FID","BATCH",paste0("PC",1:4)), rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD")
## [1] "Seed: 903890"
sipPair_exampleData_pairPermuted[1:5, 1:15]
##       FID     IID BATCH SEX AGE     PC1     PC2    PC3     PC4 PHENO1 PHENO2
## 1 FAM0001 ID00001     1   M  84 -0.0006 -0.0006 0.0034 -0.0010      0      0
## 2 FAM0001 ID00002     1   F  85 -0.0006 -0.0010 0.0033  0.0007      0      0
## 3 FAM0002 ID00003     1   F  65  0.0052 -0.0001 0.0022  0.0001      1      0
## 4 FAM0002 ID00004     1   M  82  0.0055 -0.0006 0.0020  0.0015      0      0
## 5 FAM0003 ID00005     1   F  60 -0.0013 -0.0012 0.0007 -0.0002      1      0
##   PHENO3 PHENO4 PHENO5 PHENO6
## 1      1      1      1      0
## 2      0      0      0      0
## 3      0      0      0      0
## 4      0      0      0      0
## 5      0      0      0      0
  1. Calculate pairwise correlations between samples in primary, permuted, and pair-permuted phenotype data. Then calculate correlation between pairwise phenotype correlations and pairwise IBD status for samples.
phenotype_IBD_correlation(df = sipPair_exampleData, id.var = "IID", rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", pheno.vars = paste0("PHENO",1:300))
phenotype_IBD_correlation(df = sipPair_exampleData_permuted, id.var = "IID", rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", pheno.vars = paste0("PHENO",1:300))
phenotype_IBD_correlation(df = sipPair_exampleData_pairPermuted, id.var = "IID", rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", pheno.vars = paste0("PHENO",1:300))

Multiple paired permutations show preserved correlation between pairwise phenotypes and IBD status. Multiple non-paired permutations show correlations around 0 for pairwise phenotypes and IBD status.

## [1] "Seed: 843387"
## [1] "Seed: 929772"
## [1] "Seed: 70567"
## [1] "Seed: 983156"
## [1] "Seed: 823200"
## [1] "Seed: 105573"
## [1] "Seed: 588627"
## [1] "Seed: 156553"
## [1] "Seed: 57171"
## [1] "Seed: 540842"
## [1] "Seed: 899929"
## [1] "Seed: 728781"
## [1] "Seed: 752220"
## [1] "Seed: 39710"
## [1] "Seed: 897873"
## [1] "Seed: 710215"
## [1] "Seed: 705753"
## [1] "Seed: 376279"

##    Permutation     Permuted Pair_Permuted
## 1            1  0.005031483     0.1641810
## 2            2 -0.032301401     0.1395814
## 3            3 -0.041537181     0.1479389
## 4            4  0.017517838     0.1629744
## 5            5 -0.007197936     0.1560678
## 6            6  0.013668617     0.1850219
## 7            7 -0.001806165     0.1471663
## 8            8 -0.012147743     0.1359532
## 9            9  0.007154194     0.1469262
## 10          10  0.010456928     0.1118559

Additional Data

Summary statistics for associations with p<5e-6 in our primary and permuted UKB analyses (Annis et. al. “False discovery rates for genome-wide association tests in biobanks with thousands of phenotypes.” https://doi.org/10.21203/rs.3.rs-873449/v1) are provided in data/.