--- title: "bmass Introduction -- Real Data" author: "Michael C Turchin" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{bmass Introduction -- Real Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This is an advanced introductory vignette for running `bmass`. Here, we will use real data and explore multiple components of the `bmass()` output. Specifically, we will: * Download our real dataset * Quality control (QC) & format our dataset * Run `bmass` * Access & explore a variety of results For a more basic introductory vignette using a smaller, simulated dataset, please see [here][bmass-vignette1]. ## Downloading `GlobalLipids2013` For this vigenette, we will be download and analyze the `GlobalLipids2013` dataset. To download the necessary files from [http://csg.sph.umich.edu/willer/public/lipids2013/][globallipids2013], run the following: ``` mkdir my-directory/GlobalLipids2013 cd my-directory/GlobalLipids2013 wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_LDL.txt.gz wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_HDL.txt.gz wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TG.txt.gz wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TC.txt.gz ``` ## Formatting & QC'ing our dataset To run bmass, it is recommended that at least a minimal level of QC is performed on the datasets of interest. Additionally, it is important that all the single nucleotide polymorphisms (SNPs) being analyzed are oriented to the same reference allele across all the phenotypes being included. This latter situation is emphasized so that the correct direction of effect is assigned to each phenotype for a given SNP (which is oriented based on reference allele). Below is example `bash` code on how to reformat, conduct some basic QC, and orient the reference alleles on the `GlobalLipids2013` dataset. First, we will reformat our input datafiles to match the requirements for `bmass`. `bmass` expects input files to have the following columns -- * Chr: Chromosome * BP: Basepair Position * Marker: Variant ID information, such as a rsID # (should be a unique value for each variant) * MAF: Minor Allele Frequency * A1: Reference allele * A2: AlternativeaAllele * Direction: Direction of effect size (a column of "+" or "-") * pValue: p-Value from univariate GWAS * N: Sample size To create files that match these specifications, we will conduct the following: extract the columns that are needed, orient SNPs to the minor allele, and extract the direction of effect from the provided effect sizes. ``` zcat jointGwasMc_HDL.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_HDL.formatted.txt.gz zcat jointGwasMc_LDL.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_LDL.formatted.txt.gz zcat jointGwasMc_TG.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_TG.formatted.txt.gz zcat jointGwasMc_TC.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_TC.formatted.txt.gz ``` Next, we will conduct some basic QC. These steps include: remove SNPs that have duplicate entries, that do not have entries in every field, do not have MAF information, that are fixed (MAF = 0), are rare (MAF < .01), have sample sample sizes < 50000, and have effect sizes of 0 (ie direction of effect is indeterminable). We will also orient each SNP/phenotype file to its respective minor allele. ``` join <(zcat jointGwasMc_HDL.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_HDL.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_HDL.formatted.QCed.txt.gz join <(zcat jointGwasMc_LDL.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_LDL.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_LDL.formatted.QCed.txt.gz join <(zcat jointGwasMc_TG.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_TG.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_TG.formatted.QCed.txt.gz join <(zcat jointGwasMc_TC.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_TC.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_TC.formatted.QCed.txt.gz ``` Lastly, we will orient all phenotype files to the HDL minor allele: ``` join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_LDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_LDL.formatted.QCed.HDLMatch.txt.gz join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_TG.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_TG.formatted.QCed.HDLMatch.txt.gz join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_TC.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_TC.formatted.QCed.HDLMatch.txt.gz ``` ## Running `bmass` Now with our four `GlobalLipids2013` files properly formatted and QCed, we are able to run `bmass`. To do so, we will also need the list of published univariate GWAS SNPs from the `GlobalLipids2013` publication (presented as two columns per SNP, chromosome and basepair position) -- these have been provided in the `data` subdirectory (`GlobalLipids2013.GWASsnps.txt`). To run `bmass` with our prepared input datafiles and our list of univariate GWAS SNPs, we run the following R code: ```{r eval=FALSE} library("bmass"); HDL <- read.table("jointGwasMc_HDL.formatted.QCed.txt.gz", header=T); LDL <- read.table("jointGwasMc_LDL.formatted.QCed.HDLMatch.txt.gz", header=T); TG <- read.table("jointGwasMc_TG.formatted.QCed.HDLMatch.txt.gz", header=T); TC <- read.table("jointGwasMc_TC.formatted.QCed.HDLMatch.txt.gz", header=T); load("bmassDirectory/data/GlobalLipids2013.GWASsnps.rda"); Phenotypes <- c("HDL", "LDL", "TG", "TC"); bmassResults <- bmass(Phenotypes, GlobalLipids2013.GWASsnps); ``` Note once again that to run `bmass` we supply a vector of phenotype names that also correspond to the variable names referencing each phenotype's respective data file. Also note that you will see a warning that one or more of the `GlobalLipids2013` datafiles contains p-values smaller than the threshold at which R can handle converting them into logarthmic scale -- this is to be expected. ## Accessing & exploring results `bmass` returns a list containing multiple areas of information, including separate sublists pertaining to the previous univariate GWAS SNPs used for training the multivariate model priors and the new multivariate SNPs (if any) bmass has idenfitifed as significant. Run `summary(bmassResults)` and you should see the following: ```{r eval=FALSE} > summary(bmassResults) Length Class Mode MarginalSNPs 3 -none- list PreviousSNPs 4 -none- list NewSNPs 3 -none- list LogFile 20 -none- character ZScoresCorMatrix 16 -none- numeric Models 324 -none- numeric ModelPriors 1134 -none- numeric GWASlogBFMinThreshold 1 -none- numeric ``` We will touch on all the above outputs. #### NewSNPs The `NewSNPs` sublist contains information pertaining to the new significant multivariate associations found by `bmass`, if any were identified. Running `summary(bmassResults$NewSNPs)` you should see the following: ```{r eval=FALSE} > summary(bmassResults$NewSNPs) Length Class Mode SNPs 30 data.frame list logBFs 5427 -none- numeric Posteriors 5427 -none- numeric ``` `bmassResults$NewSNPs$SNPs` contains the 'main' `bmass` output of interest -- the list of new multivariate associations found by `bmass` and those SNPs' related information. The format of this output appears as follows: ```{r eval=FALSE} > head(bmassResults$NewSNPs$SNPs, n=3) ChrBP Chr BP Marker MAF A1 HDL_A2 HDL_Direction 1704 10_101902054 10 101902054 rs2862954 0.4631 C T + 72106 10_5839619 10 5839619 rs2275774 0.1781 G A + 118903 11_109521729 11 109521729 rs661171 0.2876 T G + HDL_pValue HDL_N HDL_ZScore LDL_Direction LDL_pValue LDL_N 1704 1.287e-06 186893 4.841751 + 5.875e-01 172821.0 72106 7.601e-07 179144 4.945343 - 7.773e-05 165198.0 118903 1.705e-06 186946 4.785573 + 1.653e-02 172877.9 LDL_ZScore TG_Direction TG_pValue TG_N TG_ZScore TC_Direction 1704 0.5424624 + 0.013930 177587.1 2.459063 + 72106 -3.9512933 - 0.001035 169853.0 -3.280836 - 118903 2.3969983 - 0.155800 177645.0 -1.419340 + TC_pValue TC_N TC_ZScore GWASannot mvstat mvstat_log10pVal unistat 1704 2.526e-04 187083 3.659609 0 48.70946 9.173067 23.44255 72106 1.911e-02 179333 -2.343378 0 38.26288 7.004804 24.45642 118903 1.785e-05 187131 4.290215 0 37.84098 6.917765 22.90171 unistat_log10pVal Nmin logBFWeightedAvg 1704 5.890421 172821.0 7.068306 72106 6.119129 165198.0 5.447201 118903 5.768276 172877.9 5.438490 ``` `NewSNPs$logBFs` contains the log10 Bayes Factor (BF) for each multivariate model tested (in the form of a Model x SNP matrix, with the first columns of the matrix describing each model). `NewSNPs$Posteriors` is a similar matrix, just with each matrix entry representing the posterior probability for each model/SNP combination instead of the log10 BF. ```{r eval=FALSE} > dim(bmassResults$NewSNPs$logBFs) [1] 81 67 > bmassResults$NewSNPs$logBFs[1:5,1:10] HDL LDL TG TC 10_101902054 10_5839619 11_109521729 11_13313759 [1,] 0 0 0 0 0.000000 0.0000000 0.0000000 0.00000000 [2,] 1 0 0 0 -233.831047 -235.0781367 -234.6670299 -234.15472067 [3,] 2 0 0 0 0.000000 0.0000000 0.0000000 0.00000000 [4,] 0 1 0 0 0.165855 0.3172489 -0.1100418 0.06393596 [5,] 1 1 0 0 -64.774919 -66.2959645 -69.2388829 -69.93309528 11_45696596 11_47251202 [1,] 0.00000000 0.00000000 [2,] -231.68478695 -219.10321932 [3,] 0.00000000 0.00000000 [4,] -0.04838241 0.04997886 [5,] -65.59917325 -45.04269241 ``` #### PreviousSNPs `PreviousSNPs` contains similar information as `NewSNPs` does, but pertaining to the set of previous univariate GWAS significant SNPs used to train the multivariate model priors and set the empirical significance threshold for weighted average BF. Running `summary(bmassResults$PreviousSNPs)` you see: ```{r eval=FALSE} > summary(bmassResults$PreviousSNPs) Length Class Mode logBFs 12069 -none- numeric SNPs 30 data.frame list DontPassSNPs 30 data.frame list Posteriors 12069 -none- numeric ``` `PreviousSNPs$SNPs`, `PreviousSNPs$logBFs`, and `PreviousSNPs$Posteriors` all match the descriptions as above with `NewSNPs`. However `PreviousSNPs$DontPassSNPs` refers to any GWAS SNPs which are included in the final publicaiton list but which do meet the original study's univariate GWAS p-value threshold based on the dataset released. This can for instance occur when a final GWAS SNP list is produced from combining a discovery dataset with a replication dataset, and the former data is the only part of the study publicly released. Therefore `bmass` keeps track of these SNPs so as not to call them a completely novel multivariate SNP, but also to not use them to train the multivariate model priors since they do not technically reach univariate GWAS significance based on the dataset provided alone. The format of `DontPassSNPs` is the same seen in `NewSNPs$SNPs` and `PreviousSNPs$SNPs`. #### MarginalSNPs `MarginalSNPs` contains the same information as `NewSNPs` and `PreviousSNPs`, but pertaining to the set of marginally significant SNPs extracted from the intermediate `MergedDataSources` dataset (a data.frame that contains all the input phenotype files combined) based on the `SNPMarginalUnivariateThreshold` and `SNPMarginalMultivariateThreshold` values (default of 1e-6 for each). `summary(bmassResults$MarginalSNPs)` returns: ```{r eval=FALSE} > summary(bmassResults$MarginalSNPs) Length Class Mode SNPs 30 data.frame list logBFs 20493 -none- numeric Posteriors 20493 -none- numeric ``` `MarginalSNPs$SNPs`, `MarginalSNPs$logBFs`, and `MarginalSNPs$Posteriors` are all as described above. #### ZScoresCorMatrix This is the phenotype correlation matrix (V_0 hat in `Detailed Methods (Global Lipids Analysis)` of Stephens 2013 PLoS ONE) derived from extracting all the 'null' SNPs (abs(ZScore) < 2 for every phenotype) in the dataset. ```{r eval=FALSE} > bmassResults$ZScoresCorMatrix HDL_ZScore LDL_ZScore TG_ZScore TC_ZScore HDL_ZScore 1.0000000 -0.0872789 -0.3655508 0.1523894 LDL_ZScore -0.0872789 1.0000000 0.1607208 0.8223175 TG_ZScore -0.3655508 0.1607208 1.0000000 0.2892982 TC_ZScore 0.1523894 0.8223175 0.2892982 1.0000000 ``` #### Models This is a matrix displaying all the model combinations possible given the set of d input phenotypes and 3 model categories of directly associated (1), indirectly associated (2), and unassociated (0). The total number of models displayed should be 3^d. ```{r eval=FALSE} > dim(bmassResults$Models) [1] 81 4 > head(bmassResults$Models) HDL LDL TG TC [1,] 0 0 0 0 [2,] 1 0 0 0 [3,] 2 0 0 0 [4,] 0 1 0 0 [5,] 1 1 0 0 [6,] 2 1 0 0 ``` #### ModelPriors This is the set of trained priors determined for each multivariate model using the previous univariate GWAS SNPs and an EM algorithm (see eq. 36 in Stephens 2013). Note that for any model which does not contain at least one phenotype in the directly associated category, the prior was automatically set to 0 (aside from the global null model of all phenotypes set to unassociated). Also note that ModelPriors contains a full set of model priors for each value of the hyperparameter sigma_alpha (see "Prior on sigma_alpha" in Stephens 2013); these values are stored in `SigmaAlphas`, and by default are set to `c(0.005,0.0075,0.01,0.015,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.15)` (hence 81 models * 14 sigma_alphas = 1134 priors). To see a more interpretable form of the model priors, use the `GetModelPriorMatrix()` function as shown below. ```{r eval=FALSE} > length(bmassResults$ModelPriors) [1] 1134 > bmassResults[c("ModelPriorMatrix", "LogFile")] <- GetModelPriorMatrix(Phenotypes, bmassResults$Models, bmassResults$ModelPriors, bmassResults$LogFile) > head(bmassResults$ModelPriorMatrix) HDL LDL TG TC Prior Cumm_Prior OrigOrder 1 1 2 1 1 0.32744537 0.3274454 44 2 1 2 2 1 0.13788501 0.4653304 53 3 1 1 1 2 0.11727440 0.5826048 68 4 1 1 1 1 0.07801825 0.6606230 41 5 1 2 1 2 0.06210658 0.7227296 71 6 2 1 2 2 0.05698876 0.7797184 78 ``` `ModelPriorMatrix` contains the model descriptions as provided in `Models` along with the combined trained priors across each sigma_alpha value; the `ModelPriomatrix` is sorted by decreasing model prior value and also shows the cumulative distribution of the priors as well as the original order position of each model. #### GWASlogBFMinThreshold This is the empirical weighted average BF threshold (BFavg) used to determine whether any new SNPs are 'multivariate significant'. As described in [Stephens 2013][stephens2013], each SNP has a single summary metric combining all the information across each multivariate model tested; this summary metric is the weighted average of each model's BF, where weights are determined from the previous univariate GWAS SNPs (eq. 1 in [Turchin and Stephens 2019][biorxiv-paper] or eq. 8 in [Stephens 2013][stephens2013]). This summary metric is also calculated for the previous unviariate GWAS SNPs themselves, and the smallest BFavg among them becomes the threshold above which new SNPs are considered multivariate significant. ```{r eval=FALSE} > bmassResults$GWASlogBFMinThreshold [1] 4.289906 ``` So every new multivariate SNP `bmass` identifies in this `GlobalLipids2013` analysis has a BFavg value >= 4.289906 (and is also more than a 1Mb window away from a previous univariate GWAS SNP). [//]: # ##Additional functions [bmass-vignette1]: http://mturchin20.github.io/bmass/articles/bmassIntro.1.SimulatedData.html [biorxiv-paper]: https://www.biorxiv.org/content/10.1101/638882v1 [globallipids2013]: http://csg.sph.umich.edu/willer/public/lipids2013/ [stephens2013]: https://doi.org/10.1371/journal.pone.0065245