The SIP R Package is designed to help researchers implement the single-iteration permutation (SIP) method for large-scale biobank data (Annis et. al. “False discovery rates for genome-wide association tests in biobanks with thousands of phenotypes.” https://doi.org/10.21203/rs.3.rs-873449/v1).
Briefly, the SIP method takes advantage of analyzing many phenotypes across a biobank simultaneously. When a large number of phenotypes are analyzed in parallel, a single permutation across all phenotypes followed by genetic association analyses of the permuted data enables estimation of false discovery rates (FDRs) across the phenome. FDR estimates in turn help with interpretation of genetic associations in a biobank context.
The basic permutation procedure can be used with unrelated samples. SIP requires a sample-by-variable input file containing an ID, genotypic covariates, phenotypes, and phenotypic covariates:
sip_exampleData[1:5, 1:15]## FID IID ANCESTRY SEX AGE BATCH STUDY PC1 PC2
## 1 FAM00001 ID00001 AFR 1 78 1 STU1 -0.00997818 0.0363497
## 2 FAM00002 ID00002 AFR 1 86 1 STU1 -0.01303270 0.0349262
## 3 FAM00003 ID00003 AFR 1 71 1 STU1 -0.01071080 0.0345597
## 4 FAM00004 ID00004 EUR 2 54 1 STU1 -0.01244230 0.0364125
## 5 FAM00005 ID00005 EUR 2 74 1 STU1 -0.01109950 0.0370039
## PC3 PC4 PHENO1 PHENO2 PHENO3 PHENO4
## 1 -0.002534320 0.00362118 0 0 0 0
## 2 0.002535730 0.00291765 1 1 0 0
## 3 -0.002234060 0.00399563 1 1 0 1
## 4 -0.000127672 0.00474610 1 1 0 1
## 5 -0.001726030 0.00579052 1 1 1 1
The default permutation method will permute within biological sex and requires specification of a sex variable and male and female values as well as ID and genotypic covariates. Additional ID variables (e.g., family ID) can be placed within the genotypic covariate vector. For example:
sip(df = sip_exampleData, id.var = "IID", sex.var = "SEX", male.val = 1, female.val = 2, geno.vars = c("FID","ANCESTRY","BATCH",paste0("PC",1:4)))Users may specify a seed. If not specified, a random seed will be chosen:
sip(df = sip_exampleData, id.var = "IID", sex.var = "SEX", male.val = 1, female.val = 2, geno.vars = c("FID","ANCESTRY","BATCH",paste0("PC",1:4)), seed = 123)If permutation without regard to sex is desired, the within.sex argument can be set to FALSE and the sex variable and male and female values are no longer required:
sip(df = sip_exampleData, id.var = "IID", within.sex = FALSE, geno.vars = c("FID","ANCESTRY","BATCH",paste0("PC",1:4)))The paired permutation procedure should be used with related samples:
sipPair_exampleData[1:5, 1:15]## FID IID BATCH SEX AGE PC1 PC2 PC3 PC4 PHENO1 PHENO2
## 1 FAM0001 ID00001 1 M 72 -0.0006 -0.0006 0.0034 -0.0010 0 0
## 2 FAM0001 ID00002 1 F 80 -0.0006 -0.0010 0.0033 0.0007 0 0
## 3 FAM0002 ID00003 1 F 82 0.0052 -0.0001 0.0022 0.0001 0 0
## 4 FAM0002 ID00004 1 M 80 0.0055 -0.0006 0.0020 0.0015 1 0
## 5 FAM0003 ID00005 1 F 78 -0.0013 -0.0012 0.0007 -0.0002 0 0
## PHENO3 PHENO4 PHENO5 PHENO6
## 1 1 0 0 1
## 2 1 1 0 1
## 3 1 0 0 0
## 4 1 1 1 0
## 5 1 0 0 0
SIP-pair requires a data frame with identity by descent (IBD) relatedness information between individuals:
head(sipPair_relatednessData)## IID1 IID2 PropIBD
## 8343 ID04318 ID06117 0.0016
## 5839 ID06013 ID06015 0.4766
## 3122 ID02006 ID02007 0.5000
## 9641 ID04318 ID07882 0.0000
## 1989 ID06986 ID06987 0.5114
## 2450 ID04400 ID04401 0.5029
The default permutation method will permute within biological sex and requires specification of a sex variable and male and female values as well as ID and genotypic covariates. Additional ID variables (e.g., family ID) can be placed within the genotypic covariate vector. SIP-pair also requires specification of the relatedness data frame, pairs of ID variables within the relatedness data frame, and an IBD variable:
sip_pair(df = sipPair_exampleData, id.var = "IID", sex.var = "SEX", male.val = "M", female.val = "F", geno.vars = c("FID","BATCH",paste0("PC",1:4)), rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD")Users may specify a seed. If not specified, a random seed will be chosen:
sip_pair(df = sipPair_exampleData, id.var = "IID", sex.var = "SEX", male.val = "M", female.val = "F", geno.vars = c("FID","BATCH",paste0("PC",1:4)), rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", seed = 123)If permutation without regard to sex is desired, the within.sex argument can be set to FALSE and the sex variable and male and female values are no longer required:
sip_pair(df = sipPair_exampleData, id.var = "IID", geno.vars = c("FID","BATCH",paste0("PC",1:4)), rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", within.sex = FALSE)The SIP method maintains correlation among phenotypes:
sip_exampleData_permuted <- sip(df = sip_exampleData, id.var = "IID", sex.var = "SEX", male.val = 1, female.val = 2, geno.vars = c("FID","ANCESTRY","BATCH",paste0("PC",1:4)))## [1] "Seed: 453014"
sip_exampleData_permuted[1:5, 1:15]## FID IID ANCESTRY SEX AGE BATCH STUDY PC1 PC2
## 1 FAM00001 ID00001 AFR 1 41 1 STU1 -0.00997818 0.0363497
## 2 FAM00002 ID00002 AFR 1 59 1 STU1 -0.01303270 0.0349262
## 3 FAM00003 ID00003 AFR 1 67 1 STU1 -0.01071080 0.0345597
## 4 FAM00004 ID00004 EUR 2 36 1 STU2 -0.01244230 0.0364125
## 5 FAM00005 ID00005 EUR 2 64 1 STU1 -0.01109950 0.0370039
## PC3 PC4 PHENO1 PHENO2 PHENO3 PHENO4
## 1 -0.002534320 0.00362118 1 1 1 0
## 2 0.002535730 0.00291765 0 0 1 0
## 3 -0.002234060 0.00399563 0 1 0 0
## 4 -0.000127672 0.00474610 1 1 0 1
## 5 -0.001726030 0.00579052 1 1 1 1
plot_phenotype_correlations(df = sip_exampleData, pheno.vars = paste0("PHENO",1:500))
plot_phenotype_correlations(df = sip_exampleData_permuted, pheno.vars = paste0("PHENO",1:500))
The SIP-pair method preserves correlation among related samples:
sipPair_exampleData_permuted <- sip(df = sipPair_exampleData, id.var = "IID", sex.var = "SEX", male.val = "M", female.val = "F", geno.vars = c("FID","BATCH",paste0("PC",1:4)))## [1] "Seed: 821244"
sipPair_exampleData_permuted[1:5, 1:15]## FID IID BATCH SEX AGE PC1 PC2 PC3 PC4 PHENO1 PHENO2
## 1 FAM0001 ID00001 1 M 75 -0.0006 -0.0006 0.0034 -0.0010 0 0
## 2 FAM0001 ID00002 1 F 62 -0.0006 -0.0010 0.0033 0.0007 1 0
## 3 FAM0002 ID00003 1 F 67 0.0052 -0.0001 0.0022 0.0001 0 0
## 4 FAM0002 ID00004 1 M 78 0.0055 -0.0006 0.0020 0.0015 1 0
## 5 FAM0003 ID00005 1 F 67 -0.0013 -0.0012 0.0007 -0.0002 1 0
## PHENO3 PHENO4 PHENO5 PHENO6
## 1 0 0 1 0
## 2 1 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 1 0
sipPair_exampleData_pairPermuted <- sip_pair(df = sipPair_exampleData, id.var = "IID", sex.var = "SEX", male.val = "M", female.val = "F", geno.vars = c("FID","BATCH",paste0("PC",1:4)), rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD")## [1] "Seed: 903890"
sipPair_exampleData_pairPermuted[1:5, 1:15]## FID IID BATCH SEX AGE PC1 PC2 PC3 PC4 PHENO1 PHENO2
## 1 FAM0001 ID00001 1 M 84 -0.0006 -0.0006 0.0034 -0.0010 0 0
## 2 FAM0001 ID00002 1 F 85 -0.0006 -0.0010 0.0033 0.0007 0 0
## 3 FAM0002 ID00003 1 F 65 0.0052 -0.0001 0.0022 0.0001 1 0
## 4 FAM0002 ID00004 1 M 82 0.0055 -0.0006 0.0020 0.0015 0 0
## 5 FAM0003 ID00005 1 F 60 -0.0013 -0.0012 0.0007 -0.0002 1 0
## PHENO3 PHENO4 PHENO5 PHENO6
## 1 1 1 1 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
phenotype_IBD_correlation(df = sipPair_exampleData, id.var = "IID", rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", pheno.vars = paste0("PHENO",1:300))
phenotype_IBD_correlation(df = sipPair_exampleData_permuted, id.var = "IID", rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", pheno.vars = paste0("PHENO",1:300))
phenotype_IBD_correlation(df = sipPair_exampleData_pairPermuted, id.var = "IID", rel.df = sipPair_relatednessData, rid.vars=c("IID1","IID2"), ibd.var="PropIBD", pheno.vars = paste0("PHENO",1:300))Multiple paired permutations show preserved correlation between pairwise phenotypes and IBD status. Multiple non-paired permutations show correlations around 0 for pairwise phenotypes and IBD status.
## [1] "Seed: 843387"
## [1] "Seed: 929772"
## [1] "Seed: 70567"
## [1] "Seed: 983156"
## [1] "Seed: 823200"
## [1] "Seed: 105573"
## [1] "Seed: 588627"
## [1] "Seed: 156553"
## [1] "Seed: 57171"
## [1] "Seed: 540842"
## [1] "Seed: 899929"
## [1] "Seed: 728781"
## [1] "Seed: 752220"
## [1] "Seed: 39710"
## [1] "Seed: 897873"
## [1] "Seed: 710215"
## [1] "Seed: 705753"
## [1] "Seed: 376279"
## Permutation Permuted Pair_Permuted
## 1 1 0.005031483 0.1641810
## 2 2 -0.032301401 0.1395814
## 3 3 -0.041537181 0.1479389
## 4 4 0.017517838 0.1629744
## 5 5 -0.007197936 0.1560678
## 6 6 0.013668617 0.1850219
## 7 7 -0.001806165 0.1471663
## 8 8 -0.012147743 0.1359532
## 9 9 0.007154194 0.1469262
## 10 10 0.010456928 0.1118559
Summary statistics for associations with p<5e-6 in our primary and permuted UKB analyses (Annis et. al. “False discovery rates for genome-wide association tests in biobanks with thousands of phenotypes.” https://doi.org/10.21203/rs.3.rs-873449/v1) are provided in data/.