---
title: "bmass Introduction -- Real Data"
author: "Michael C Turchin"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{bmass Introduction -- Real Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This is an advanced introductory vignette for running `bmass`. Here, we will use real data and explore multiple components of the `bmass()` output. Specifically, we will: 

* Download our real dataset
* Quality control (QC) & format our dataset
* Run `bmass`
* Access & explore a variety of results

For a more basic introductory vignette using a smaller, simulated dataset, please see [here][bmass-vignette1].

## Downloading `GlobalLipids2013`

For this vigenette, we will be download and analyze the `GlobalLipids2013` dataset. To download the necessary files from [http://csg.sph.umich.edu/willer/public/lipids2013/][globallipids2013], run the following:

```
mkdir my-directory/GlobalLipids2013
cd my-directory/GlobalLipids2013

wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_LDL.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_HDL.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TG.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TC.txt.gz
```

## Formatting & QC'ing our dataset

To run bmass, it is recommended that at least a minimal level of QC is performed on the datasets of interest. Additionally, it is important that all the single nucleotide polymorphisms (SNPs) being analyzed are oriented to the same reference allele across all the phenotypes being included. This latter situation is emphasized so that the correct direction of effect is assigned to each phenotype for a given SNP (which is oriented based on reference allele).

Below is example `bash` code on how to reformat, conduct some basic QC, and orient the reference alleles on the `GlobalLipids2013` dataset. First, we will reformat our input datafiles to match the requirements for `bmass`. `bmass` expects input files to have the following columns -- 

* Chr: Chromosome
* BP: Basepair Position
* Marker: Variant ID information, such as a rsID # (should be a unique value for each variant)
* MAF: Minor Allele Frequency
* A1: Reference allele
* A2: AlternativeaAllele
* Direction: Direction of effect size (a column of "+" or "-")
* pValue: p-Value from univariate GWAS
* N: Sample size

To create files that match these specifications, we will conduct the following: extract the columns that are needed, orient SNPs to the minor allele, and extract the direction of effect from the provided effect sizes.
```
zcat jointGwasMc_HDL.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_HDL.formatted.txt.gz 

zcat jointGwasMc_LDL.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_LDL.formatted.txt.gz 

zcat jointGwasMc_TG.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_TG.formatted.txt.gz 

zcat jointGwasMc_TC.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_TC.formatted.txt.gz 
```

Next, we will conduct some basic QC. These steps include: remove SNPs that have duplicate entries, that do not have entries in every field, do not have MAF information, that are fixed (MAF = 0), are rare (MAF < .01), have sample sample sizes < 50000, and have effect sizes of 0 (ie direction of effect is indeterminable). We will also orient each SNP/phenotype file to its respective minor allele.  
```
join <(zcat jointGwasMc_HDL.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_HDL.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_HDL.formatted.QCed.txt.gz 

join <(zcat jointGwasMc_LDL.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_LDL.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_LDL.formatted.QCed.txt.gz 

join <(zcat jointGwasMc_TG.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_TG.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_TG.formatted.QCed.txt.gz 

join <(zcat jointGwasMc_TC.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_TC.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_TC.formatted.QCed.txt.gz 
```

Lastly, we will orient all phenotype files to the HDL minor allele:
```
join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_LDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_LDL.formatted.QCed.HDLMatch.txt.gz

join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_TG.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_TG.formatted.QCed.HDLMatch.txt.gz

join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_TC.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_TC.formatted.QCed.HDLMatch.txt.gz
```


## Running `bmass`

Now with our four `GlobalLipids2013` files properly formatted and QCed, we are able to run `bmass`. To do so, we will also need the list of published univariate GWAS SNPs from the `GlobalLipids2013` publication (presented as two columns per SNP, chromosome and basepair position) -- these have been provided in the `data` subdirectory (`GlobalLipids2013.GWASsnps.txt`). To run `bmass` with our prepared input datafiles and our list of univariate GWAS SNPs, we run the following R code:

```{r eval=FALSE}
library("bmass");
HDL <- read.table("jointGwasMc_HDL.formatted.QCed.txt.gz", header=T);
LDL <- read.table("jointGwasMc_LDL.formatted.QCed.HDLMatch.txt.gz", header=T);
TG <- read.table("jointGwasMc_TG.formatted.QCed.HDLMatch.txt.gz", header=T); 
TC <- read.table("jointGwasMc_TC.formatted.QCed.HDLMatch.txt.gz", header=T);
load("bmassDirectory/data/GlobalLipids2013.GWASsnps.rda");
Phenotypes <- c("HDL", "LDL", "TG", "TC");
bmassResults <- bmass(Phenotypes, GlobalLipids2013.GWASsnps);
```

Note once again that to run `bmass` we supply a vector of phenotype names that also correspond to the variable names referencing each phenotype's respective data file. Also note that you will see a warning that one or more of the `GlobalLipids2013` datafiles contains p-values smaller than the threshold at which R can handle converting them into logarthmic scale -- this is to be expected.  


## Accessing & exploring results

`bmass` returns a list containing multiple areas of information, including separate sublists pertaining to the previous univariate GWAS SNPs used for training the multivariate model priors and the new multivariate SNPs (if any) bmass has idenfitifed as significant. Run `summary(bmassResults)` and you should see the following:
```{r eval=FALSE}
> summary(bmassResults)
                      Length Class  Mode     
MarginalSNPs             3   -none- list     
PreviousSNPs             4   -none- list     
NewSNPs                  3   -none- list     
LogFile                 20   -none- character
ZScoresCorMatrix        16   -none- numeric  
Models                 324   -none- numeric  
ModelPriors           1134   -none- numeric  
GWASlogBFMinThreshold    1   -none- numeric  
```

We will touch on all the above outputs. 


#### NewSNPs

The `NewSNPs` sublist contains information pertaining to the new significant multivariate associations found by `bmass`, if any were identified. Running `summary(bmassResults$NewSNPs)` you should see the following:
```{r eval=FALSE}
> summary(bmassResults$NewSNPs)
           Length Class      Mode   
SNPs         30   data.frame list   
logBFs     5427   -none-     numeric
Posteriors 5427   -none-     numeric
```

`bmassResults$NewSNPs$SNPs` contains the 'main' `bmass` output of interest -- the list of new multivariate associations found by `bmass` and those SNPs' related information. The format of this output appears as follows:
```{r eval=FALSE}
> head(bmassResults$NewSNPs$SNPs, n=3)
              ChrBP Chr        BP    Marker    MAF A1 HDL_A2 HDL_Direction
1704   10_101902054  10 101902054 rs2862954 0.4631  C      T             +
72106    10_5839619  10   5839619 rs2275774 0.1781  G      A             +
118903 11_109521729  11 109521729  rs661171 0.2876  T      G             +
       HDL_pValue  HDL_N HDL_ZScore LDL_Direction LDL_pValue    LDL_N
1704    1.287e-06 186893   4.841751             +  5.875e-01 172821.0
72106   7.601e-07 179144   4.945343             -  7.773e-05 165198.0
118903  1.705e-06 186946   4.785573             +  1.653e-02 172877.9
       LDL_ZScore TG_Direction TG_pValue     TG_N TG_ZScore TC_Direction
1704    0.5424624            +  0.013930 177587.1  2.459063            +
72106  -3.9512933            -  0.001035 169853.0 -3.280836            -
118903  2.3969983            -  0.155800 177645.0 -1.419340            +
       TC_pValue   TC_N TC_ZScore GWASannot   mvstat mvstat_log10pVal  unistat
1704   2.526e-04 187083  3.659609         0 48.70946         9.173067 23.44255
72106  1.911e-02 179333 -2.343378         0 38.26288         7.004804 24.45642
118903 1.785e-05 187131  4.290215         0 37.84098         6.917765 22.90171
       unistat_log10pVal     Nmin logBFWeightedAvg
1704            5.890421 172821.0         7.068306
72106           6.119129 165198.0         5.447201
118903          5.768276 172877.9         5.438490
```

`NewSNPs$logBFs` contains the log10 Bayes Factor (BF) for each multivariate model tested (in the form of a Model x SNP matrix, with the first columns of the matrix describing each model). `NewSNPs$Posteriors` is a similar matrix, just with each matrix entry representing the posterior probability for each model/SNP combination instead of the log10 BF. 
```{r eval=FALSE}
> dim(bmassResults$NewSNPs$logBFs)
[1] 81 67
> bmassResults$NewSNPs$logBFs[1:5,1:10]
     HDL LDL TG TC 10_101902054   10_5839619 11_109521729   11_13313759
[1,]   0   0  0  0     0.000000    0.0000000    0.0000000    0.00000000
[2,]   1   0  0  0  -233.831047 -235.0781367 -234.6670299 -234.15472067
[3,]   2   0  0  0     0.000000    0.0000000    0.0000000    0.00000000
[4,]   0   1  0  0     0.165855    0.3172489   -0.1100418    0.06393596
[5,]   1   1  0  0   -64.774919  -66.2959645  -69.2388829  -69.93309528
       11_45696596   11_47251202
[1,]    0.00000000    0.00000000
[2,] -231.68478695 -219.10321932
[3,]    0.00000000    0.00000000
[4,]   -0.04838241    0.04997886
[5,]  -65.59917325  -45.04269241
```


#### PreviousSNPs

`PreviousSNPs` contains similar information as `NewSNPs` does, but pertaining to the set of previous univariate GWAS significant SNPs used to train the multivariate model priors and set the empirical significance threshold for weighted average BF. Running `summary(bmassResults$PreviousSNPs)` you see:
```{r eval=FALSE}
> summary(bmassResults$PreviousSNPs)
             Length Class      Mode   
logBFs       12069  -none-     numeric
SNPs            30  data.frame list   
DontPassSNPs    30  data.frame list   
Posteriors   12069  -none-     numeric
```

`PreviousSNPs$SNPs`, `PreviousSNPs$logBFs`, and `PreviousSNPs$Posteriors` all match the descriptions as above with `NewSNPs`. However `PreviousSNPs$DontPassSNPs` refers to any GWAS SNPs which are included in the final publicaiton list but which do meet the original study's univariate GWAS p-value threshold based on the dataset released. This can for instance occur when a final GWAS SNP list is produced from combining a discovery dataset with a replication dataset, and the former data is the only part of the study publicly released. Therefore `bmass` keeps track of these SNPs so as not to call them a completely novel multivariate SNP, but also to not use them to train the multivariate model priors since they do not technically reach univariate GWAS significance based on the dataset provided alone.

The format of `DontPassSNPs` is the same seen in `NewSNPs$SNPs` and `PreviousSNPs$SNPs`. 


#### MarginalSNPs

`MarginalSNPs` contains the same information as `NewSNPs` and `PreviousSNPs`, but pertaining to the set of marginally significant SNPs extracted from the intermediate `MergedDataSources` dataset (a data.frame that contains all the input phenotype files combined) based on the `SNPMarginalUnivariateThreshold` and `SNPMarginalMultivariateThreshold` values (default of 1e-6 for each). `summary(bmassResults$MarginalSNPs)` returns:
```{r eval=FALSE}
> summary(bmassResults$MarginalSNPs)
           Length Class      Mode   
SNPs          30  data.frame list   
logBFs     20493  -none-     numeric
Posteriors 20493  -none-     numeric
```

`MarginalSNPs$SNPs`, `MarginalSNPs$logBFs`, and `MarginalSNPs$Posteriors` are all as described above.


#### ZScoresCorMatrix

This is the phenotype correlation matrix (V_0 hat in `Detailed Methods (Global Lipids Analysis)` of Stephens 2013 PLoS ONE) derived from extracting all the 'null' SNPs (abs(ZScore) < 2 for every phenotype) in the dataset.
```{r eval=FALSE}
> bmassResults$ZScoresCorMatrix
           HDL_ZScore LDL_ZScore  TG_ZScore TC_ZScore
HDL_ZScore  1.0000000 -0.0872789 -0.3655508 0.1523894
LDL_ZScore -0.0872789  1.0000000  0.1607208 0.8223175
TG_ZScore  -0.3655508  0.1607208  1.0000000 0.2892982
TC_ZScore   0.1523894  0.8223175  0.2892982 1.0000000
``` 


#### Models

This is a matrix displaying all the model combinations possible given the set of d input phenotypes and 3 model categories of directly associated (1), indirectly associated (2), and unassociated (0). The total number of models displayed should be 3^d.
```{r eval=FALSE}
> dim(bmassResults$Models)
[1] 81  4
> head(bmassResults$Models)
     HDL LDL TG TC
[1,]   0   0  0  0
[2,]   1   0  0  0
[3,]   2   0  0  0
[4,]   0   1  0  0
[5,]   1   1  0  0
[6,]   2   1  0  0
``` 


#### ModelPriors

This is the set of trained priors determined for each multivariate model using the previous univariate GWAS SNPs and an EM algorithm (see eq. 36 in Stephens 2013). Note that for any model which does not contain at least one phenotype in the directly associated category, the prior was automatically set to 0 (aside from the global null model of all phenotypes set to unassociated). 

Also note that ModelPriors contains a full set of model priors for each value of the hyperparameter sigma_alpha (see "Prior on sigma_alpha" in Stephens 2013); these values are stored in `SigmaAlphas`, and by default are set to `c(0.005,0.0075,0.01,0.015,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.15)` (hence 81 models * 14 sigma_alphas = 1134 priors). To see a more interpretable form of the model priors, use the `GetModelPriorMatrix()` function as shown below.
```{r eval=FALSE}
> length(bmassResults$ModelPriors)
[1] 1134
> bmassResults[c("ModelPriorMatrix", "LogFile")] <- GetModelPriorMatrix(Phenotypes, bmassResults$Models, bmassResults$ModelPriors, bmassResults$LogFile)
> head(bmassResults$ModelPriorMatrix)
  HDL LDL TG TC      Prior Cumm_Prior OrigOrder
1   1   2  1  1 0.32744537  0.3274454        44
2   1   2  2  1 0.13788501  0.4653304        53
3   1   1  1  2 0.11727440  0.5826048        68
4   1   1  1  1 0.07801825  0.6606230        41
5   1   2  1  2 0.06210658  0.7227296        71
6   2   1  2  2 0.05698876  0.7797184        78
``` 

`ModelPriorMatrix` contains the model descriptions as provided in `Models` along with the combined trained priors across each sigma_alpha value; the `ModelPriomatrix` is sorted by decreasing model prior value and also shows the cumulative distribution of the priors as well as the original order position of each model.


#### GWASlogBFMinThreshold

This is the empirical weighted average BF threshold (BFavg) used to determine whether any new SNPs are 'multivariate significant'. As described in [Stephens 2013][stephens2013], each SNP has a single summary metric combining all the information across each multivariate model tested; this summary metric is the weighted average of each model's BF, where weights are determined from the previous univariate GWAS SNPs (eq. 1 in [Turchin and Stephens 2019][biorxiv-paper] or eq. 8 in [Stephens 2013][stephens2013]). This summary metric is also calculated for the previous unviariate GWAS SNPs themselves, and the smallest BFavg among them becomes the threshold above which new SNPs are considered multivariate significant.
```{r eval=FALSE}
> bmassResults$GWASlogBFMinThreshold
[1] 4.289906
```

So every new multivariate SNP `bmass` identifies in this `GlobalLipids2013` analysis has a BFavg value >= 4.289906 (and is also more than a 1Mb window away from a previous univariate GWAS SNP).


[//]: #  ##Additional functions


[bmass-vignette1]: http://mturchin20.github.io/bmass/articles/bmassIntro.1.SimulatedData.html
[biorxiv-paper]: https://www.biorxiv.org/content/10.1101/638882v1
[globallipids2013]: http://csg.sph.umich.edu/willer/public/lipids2013/
[stephens2013]: https://doi.org/10.1371/journal.pone.0065245