Effect size of a tag variant
Motivation
Ever wondered why the effect size of a tag variant is
Side warning: I would not recommend reading this on a smaller screen (e.g. phone) as the math might not render fully.
Modeling the effect of the causal variant (binary phenotype)
Suppose there’s a biallelic disease locus with alleles A and a, where A is the risk allele. Further suppose that each individual in the population is haploid and that their genotype at the causal locus is denoted by
So, if we use
One way to model the effect size of the disease locus is with a log risk regression model:
where the left hand side is the probability of developing the disease,
Note, for disease risk, we typically use a log risk regression model (i.e. where the response is modeled as
Because
Our goal is to answer the question: How does the effect size of a tag SNP vary as a function of the LD with the causal variant?
Modeling the effect of the tag SNP (binary phenotype)
Let’s assume that the tag SNP (with alleles B and b) itself has no intrinsic effect on disease risk. Thus, the risk of individuals carrying the B allele depends on how often it co-occurs with the risk allele at the causal locus (i.e. its correlation with alleles at the causal locus). So, to derive the marginal effect size at the tag SNP, we need to average over the allele at the causal variant. To do this, let’s first define the genotype at the tag locus as
Then, the effect size at the tag SNP (
Let’s first work out the risk carried by the b allele:
To calculate the numerator, we marginalize over the alleles at the causal variant.
where
We can do the same thing for
and
This is quite unsatisfying in that the haplotype frequencies are a bit arbitrary. We need to simplify them further. Let’s define:
We can express the haplotype frequencies in terms of these conditional probabilities.
Plugging these into
The effect size of the tag SNP (
It’s already starting to shape up! Can we express
Then,
and
Therefore,
Plugging this into
In practice, genotypes are often normalized to unit variance, i.e.,
Done! Although, note that this last simplification assumes that genotypes for all SNPs are normalized to the same variance.
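The LD algebra behind this simplification can be checked numerically. The sketch below is only an illustration: it assumes the usual haploid definitions D = p_AB - p_A p_B and r = D / sqrt(p_A(1 - p_A) p_B(1 - p_B)), and the names `ld_summary`, `p_AB`, `p_A`, `p_B` and the example frequencies are hypothetical choices made for this example. It confirms that the difference in how often the risk allele A rides with B versus b equals D divided by the variance at the tag locus, which is also r times the ratio of the genotype standard deviations.

```python
import math

def ld_summary(p_AB, p_A, p_B):
    """Return (D, r, P(A|B) - P(A|b)) for two biallelic loci in haploids.

    p_AB is the frequency of the A-B haplotype; p_A and p_B are the allele
    frequencies (assumed notation, not necessarily the post's own).
    """
    D = p_AB - p_A * p_B
    r = D / math.sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
    p_A_given_B = p_AB / p_B                # P(A | B)
    p_A_given_b = (p_A - p_AB) / (1 - p_B)  # P(A | b)
    return D, r, p_A_given_B - p_A_given_b

# Made-up example frequencies.
p_AB, p_A, p_B = 0.15, 0.2, 0.3
D, r, diff = ld_summary(p_AB, p_A, p_B)

var_A = p_A * (1 - p_A)  # variance of a 0/1 genotype at the causal locus
var_B = p_B * (1 - p_B)  # variance of a 0/1 genotype at the tag locus

# Two equivalent ways of writing the same quantity.
via_D = D / var_B
via_r = r * math.sqrt(var_A / var_B)
print(diff, via_D, via_r)
```

When both genotypes are scaled to unit variance, the ratio of standard deviations drops out and only r remains, which is the point of the last simplification.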
Effect of tag SNP (quantitative phenotype)
The derivation for a quantitative phenotype is very similar, and in some ways simpler, because we can use a simple linear model instead of worrying about log models. It differs mostly in notation: instead of looking at the probability of disease as a function of which allele an individual carries, we look at the expected phenotype. We assume the following linear model:
And the effect of the causal variant is (similar to the binary phenotype):
Similarly, if we denote
The second-to-last line is possible again because
We derived most of these terms in the binary case, which simplifies the above equation to:
Similarly (and I’ll let you work this out yourself),
I skipped over the last few steps as we already worked them out for the binary phenotype. So there you have it!
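If you'd like to convince yourself of the quantitative-phenotype result without the algebra, a small simulation helps. This is a minimal sketch under stated assumptions, not the derivation itself: haploid individuals, a tag SNP with no effect of its own, a linear model with Gaussian noise, and made-up haplotype frequencies and parameters. All function names (`simulate`, `cov`, `slope`) are hypothetical helpers written for this example. It regresses the phenotype on the tag genotype and compares the slope with the causal slope scaled by r and the ratio of genotype standard deviations.

```python
import random

def simulate(n, hap_freqs, alpha, beta, noise_sd, seed=1):
    """Draw n haploid individuals; return causal genotypes, tag genotypes, phenotypes."""
    rng = random.Random(seed)
    # (causal, tag) genotype for haplotypes AB, Ab, aB, ab; risk allele A = 1, B = 1.
    haps = [(1, 1), (1, 0), (0, 1), (0, 0)]
    draws = rng.choices(haps, weights=hap_freqs, k=n)
    g = [h[0] for h in draws]
    t = [h[1] for h in draws]
    # Only the causal genotype enters the phenotype; the tag has no direct effect.
    y = [alpha + beta * gi + rng.gauss(0, noise_sd) for gi in g]
    return g, t, y

def cov(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return cov(x, y) / cov(x, x)

# Haplotype frequencies for AB, Ab, aB, ab (so p_A = 0.2, p_B = 0.3).
g, t, y = simulate(200_000, [0.15, 0.05, 0.15, 0.65],
                   alpha=1.0, beta=0.5, noise_sd=1.0)

beta_causal = slope(g, y)
beta_tag = slope(t, y)
r = cov(g, t) / (cov(g, g) * cov(t, t)) ** 0.5

# Prediction: causal effect scaled by r and the ratio of genotype SDs.
predicted = beta_causal * r * (cov(g, g) / cov(t, t)) ** 0.5
print(beta_causal, beta_tag, predicted, r)
```

The estimated tag slope and the prediction should agree up to sampling noise; rescaling both genotypes to unit variance removes the standard-deviation ratio and leaves the tag effect as r times the causal effect.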
This post has already gotten quite long, so I’m going to leave it here, but here are some extensions that you can try to work out:
- derivations for the log-risk and log-odds (logistic regression) model
- extend this result to diploid individuals
- derive the standard error and power of the test
I might update this post later with some of these (the first two).
Cheers! :)
I found this paper which provides a similar result for the log risk model (Vukcevic et al. 2012). I have incorporated some of their notation in my post because it was clearer. Definitely worth the read.