Effect size of a tag variant

Motivation

Ever wondered why the effect size of a tag variant is $r\beta_{causal}$? Here, $r$ is the correlation (a measure of LD) between the tag and causal variants, and $\beta_{causal}$ is the effect size of the causal variant. This makes intuitive sense qualitatively because we expect tag SNPs that are further away from causal variants to have smaller effect sizes (and vice versa). But I didn’t know where this exact relationship came from, so I decided to work it out (see footnote 1). I’m posting the derivation here in detail, with explicit probability statements, in case others find it helpful.

Side warning: I would not recommend reading this on a smaller screen (e.g. phone) as the math might not render fully.

Modeling the effect of causal variant (binary phenotype)

Suppose there’s a biallelic disease locus with alleles A and a, where A is the risk allele. Further suppose that each individual in the population is haploid and that their genotype at the causal locus is denoted by $A$, which is 1 if they carry the risk allele and 0 otherwise. Haploidy makes the derivation easier to follow, but the result can easily be extended to diploid genotypes as well.

So, if we use $Y \in \{0,1\}$ to indicate disease status, where 0 indicates being healthy and 1 indicates having the disease, we can express the risk associated with each allele as:

$$P(Y=1 \mid A=0)=\alpha_0, \quad \text{and} \quad P(Y=1 \mid A=1)=\alpha_1$$

One way to model the effect size of the disease locus is with a linear risk regression model:

$$\pi=\mu_A+\beta_{causal}A$$

where the left hand side is the probability of developing the disease, $\mu_A$ is the baseline prevalence (or the probability of disease in individuals not carrying the risk allele), and $\beta_{causal}$ represents the effect size.

Note, for disease risk, we typically use a log risk regression model (i.e. where the response is modeled as $\log(\pi)$) or a logistic regression model (i.e. where the response is modeled as the log odds, $\log\left(\frac{\pi}{1-\pi}\right)$). I’m sticking to $\pi$ here because the result we’re after is easier to derive, though the conclusions are similar for the other two models.

Because $\alpha_0=\mu_A$ and $\alpha_1=\mu_A+\beta_{causal}$, the effect size $\beta_{causal}$ can be obtained by subtraction:

$$P(Y=1 \mid A=1)-P(Y=1 \mid A=0)=\alpha_1-\alpha_0=\mu_A+\beta_{causal}-\mu_A=\beta_{causal}$$

Our goal is to answer the question: How does the effect size of a tag SNP vary as a function of the LD with the causal variant?

Modeling the effect of the tag SNP (binary phenotype)

Let’s assume that the tag SNP (with alleles B and b) itself has no intrinsic effect on disease risk. Thus, the risk of individuals carrying the B allele depends on how often it co-occurs with the risk allele at the causal locus (i.e. its correlation with alleles at the causal locus). So, to derive the marginal effect size at the tag SNP, we need to average over the allele at the causal variant. To do this, let’s first define the genotype at the tag locus as B, which is 1 if individuals carry the B allele and 0 if they carry the b allele. The risk of carrying these alleles can be expressed as:

$$P(Y=1 \mid B=0)=\gamma_0, \quad \text{and} \quad P(Y=1 \mid B=1)=\gamma_1$$

Then, the effect size at the tag SNP ($\beta_{tag}$) is $\gamma_1-\gamma_0$, similar to how we derived it for the causal locus.

Let’s first work out the risk carried by the b allele:

$$P(Y=1 \mid B=0)=\frac{P(Y=1,B=0)}{P(B=0)}$$

To calculate the numerator, we marginalize over the alleles at the causal variant.

$$P(Y=1,B=0)=P(Y=1,B=0 \mid A=0)P(A=0)+P(Y=1,B=0 \mid A=1)P(A=1)$$

$B$ and $Y$ are independent (conditional on $A$). In other words, conditioning on the allele at the causal locus, your probability of developing the disease does not depend on which allele you carry at the tag SNP. For example, the disease risk of individuals carrying the risk (A) allele at the causal locus is $\alpha_1$, regardless of whether they have the B or the b allele at the tag SNP. Therefore, $P(Y=1,B=0 \mid A=0)=P(Y=1 \mid A=0)P(B=0 \mid A=0)$, and so

$$
\begin{aligned}
P(Y=1,B=0) &= P(Y=1,B=0 \mid A=0)P(A=0)+P(Y=1,B=0 \mid A=1)P(A=1) \\
&= P(Y=1 \mid A=0)P(B=0 \mid A=0)P(A=0)+P(Y=1 \mid A=1)P(B=0 \mid A=1)P(A=1) \\
&= \alpha_0 P(B=0,A=0)+\alpha_1 P(B=0,A=1) \\
&= \alpha_0 f_{ab}+\alpha_1 f_{Ab}
\end{aligned}
$$

where $f_{ab}$ and $f_{Ab}$ are haplotype frequencies. Then,

$$P(Y=1 \mid B=0)=\frac{P(Y=1,B=0)}{P(B=0)}=\frac{\alpha_0 f_{ab}+\alpha_1 f_{Ab}}{1-f_B}$$

We can do the same thing for P(Y=1|B=1):

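$$
\begin{aligned}
P(Y=1,B=1) &= P(Y=1,B=1 \mid A=0)P(A=0)+P(Y=1,B=1 \mid A=1)P(A=1) \\
&= P(Y=1 \mid A=0)P(B=1 \mid A=0)P(A=0)+P(Y=1 \mid A=1)P(B=1 \mid A=1)P(A=1) \\
&= \alpha_0 P(B=1,A=0)+\alpha_1 P(B=1,A=1) \\
&= \alpha_0 f_{aB}+\alpha_1 f_{AB}
\end{aligned}
$$

and

$$P(Y=1 \mid B=1)=\frac{P(Y=1,B=1)}{P(B=1)}=\frac{\alpha_0 f_{aB}+\alpha_1 f_{AB}}{f_B}$$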
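As a quick numerical check of the two expressions above, here is a minimal Python sketch. The haplotype frequencies and allele risks are made-up illustrative values, and `tag_allele_risks` is a hypothetical helper name, not something from the derivation itself.

```python
def tag_allele_risks(f_AB, f_Ab, f_aB, f_ab, alpha0, alpha1):
    """Return (gamma0, gamma1) = (P(Y=1|B=0), P(Y=1|B=1)) for the haploid model."""
    f_B = f_AB + f_aB  # frequency of the B allele at the tag SNP
    # Risk among b carriers: average alpha over the causal allele they carry.
    gamma0 = (alpha0 * f_ab + alpha1 * f_Ab) / (1 - f_B)
    # Risk among B carriers, computed the same way.
    gamma1 = (alpha0 * f_aB + alpha1 * f_AB) / f_B
    return gamma0, gamma1


# Hypothetical haplotype frequencies (they sum to 1) and allele risks.
gamma0, gamma1 = tag_allele_risks(
    f_AB=0.3, f_Ab=0.1, f_aB=0.2, f_ab=0.4, alpha0=0.1, alpha1=0.3
)
print(f"P(Y=1|B=0) = {gamma0:.3f}, P(Y=1|B=1) = {gamma1:.3f}")
```

With these numbers, B carriers come out at higher risk than b carriers even though the tag SNP has no effect of its own, simply because B sits on the same haplotype as the risk allele more often than b does.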

This is quite unsatisfying in that the haplotype frequencies are a bit arbitrary. We need to simplify them further. Let’s define:

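$$P(A=1 \mid B=0)=q_0, \quad \text{and} \quad P(A=1 \mid B=1)=q_1$$

We can express the haplotype frequencies in terms of these conditional probabilities.

$$
\begin{aligned}
f_{AB} &= P(A=1,B=1)=P(A=1 \mid B=1)P(B=1)=q_1 f_B \\
f_{Ab} &= P(A=1,B=0)=P(A=1 \mid B=0)P(B=0)=q_0(1-f_B) \\
f_{ab} &= P(A=0,B=0)=P(A=0 \mid B=0)P(B=0)=(1-q_0)(1-f_B) \\
f_{aB} &= P(A=0,B=1)=P(A=0 \mid B=1)P(B=1)=(1-q_1)f_B
\end{aligned}
$$

Plugging these into $P(Y=1 \mid B=0)$ and $P(Y=1 \mid B=1)$:

$$
\begin{aligned}
P(Y=1 \mid B=0)=\gamma_0 &= \frac{\alpha_0 f_{ab}+\alpha_1 f_{Ab}}{1-f_B}=\frac{\alpha_0(1-q_0)(1-f_B)+\alpha_1 q_0(1-f_B)}{1-f_B}=\alpha_0(1-q_0)+\alpha_1 q_0 \\
P(Y=1 \mid B=1)=\gamma_1 &= \frac{\alpha_0 f_{aB}+\alpha_1 f_{AB}}{f_B}=\frac{\alpha_0(1-q_1)f_B+\alpha_1 q_1 f_B}{f_B}=\alpha_0(1-q_1)+\alpha_1 q_1
\end{aligned}
$$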

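The effect size of the tag SNP ($\beta_{tag}$) is (similar to how we calculated the effect size of the causal variant):

$$
\begin{aligned}
P(Y=1 \mid B=1)-P(Y=1 \mid B=0) &= \gamma_1-\gamma_0 \\
&= \{\alpha_0(1-q_1)+\alpha_1 q_1\}-\{\alpha_0(1-q_0)+\alpha_1 q_0\} \\
&= q_1(\alpha_1-\alpha_0)-q_0(\alpha_1-\alpha_0) \\
&= (q_1-q_0)\beta_{causal}
\end{aligned}
$$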
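To convince yourself that nothing was lost in the algebra, the identity can be checked with exact rational arithmetic. This is only a sketch: the haplotype frequencies and risks below are hypothetical values chosen for illustration.

```python
from fractions import Fraction as F

# Hypothetical haplotype frequencies and allele risks, as exact fractions.
f_AB, f_Ab, f_aB, f_ab = F(3, 10), F(1, 10), F(2, 10), F(4, 10)
alpha0, alpha1 = F(1, 10), F(3, 10)
beta_causal = alpha1 - alpha0

f_B = f_AB + f_aB
q1 = f_AB / f_B        # P(A=1 | B=1)
q0 = f_Ab / (1 - f_B)  # P(A=1 | B=0)

# Risks at the tag SNP, computed directly from the haplotype frequencies.
gamma1 = (alpha0 * f_aB + alpha1 * f_AB) / f_B
gamma0 = (alpha0 * f_ab + alpha1 * f_Ab) / (1 - f_B)

beta_tag_direct = gamma1 - gamma0
beta_tag_formula = (q1 - q0) * beta_causal

# Exact equality holds because Fractions avoid floating-point rounding.
assert beta_tag_direct == beta_tag_formula
print(beta_tag_direct, beta_tag_formula)
```

Using `Fraction` rather than floats means the two sides can be compared with `==` instead of a tolerance.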

It’s already starting to shape up! Can we express $q_1$ and $q_0$ in terms of the correlation between the two loci, $r$?

$$r=\frac{f_{AB}-f_A f_B}{\sqrt{f_A(1-f_A)f_B(1-f_B)}}$$

We know (from above) that $f_{AB}=q_1 f_B$, so we can write the numerator of $r$ as $f_{AB}-f_A f_B=q_1 f_B-f_A f_B$. We can further simplify this by writing $f_A$ in terms of $f_B$ by marginalization.

$$f_A=P(A=1)=P(A=1,B=1)+P(A=1,B=0)=q_1 f_B+q_0(1-f_B)$$

Then,

$$f_{AB}-f_A f_B=q_1 f_B-f_B\{q_1 f_B+q_0(1-f_B)\}=f_B(1-f_B)(q_1-q_0)$$

and $r$ becomes:

$$r=\frac{f_B(1-f_B)(q_1-q_0)}{\sqrt{f_B(1-f_B)f_A(1-f_A)}}=(q_1-q_0)\sqrt{\frac{f_B(1-f_B)}{f_A(1-f_A)}}$$

Therefore, $q_1-q_0$ can be written as:

$$q_1-q_0=r\sqrt{\frac{f_A(1-f_A)}{f_B(1-f_B)}}$$

Plugging this into $(q_1-q_0)\beta_{causal}$, we get:

$$\beta_{tag}=r\sqrt{\frac{f_A(1-f_A)}{f_B(1-f_B)}}\,\beta_{causal}$$
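The same kind of check works for the version written in terms of $r$. The sketch below computes the tag effect two ways, once from the conditional probabilities and once from $r$ and the allele frequencies. The function names and input values are hypothetical.

```python
import math


def tag_effect_via_q(f_AB, f_Ab, f_aB, f_ab, beta_causal):
    """Tag effect as (q1 - q0) * beta_causal."""
    f_B = f_AB + f_aB
    q1 = f_AB / f_B        # P(A=1 | B=1)
    q0 = f_Ab / (1 - f_B)  # P(A=1 | B=0)
    return (q1 - q0) * beta_causal


def tag_effect_via_r(f_AB, f_Ab, f_aB, f_ab, beta_causal):
    """Return (r, tag effect), with the effect written in terms of r."""
    assert math.isclose(f_AB + f_Ab + f_aB + f_ab, 1.0)
    f_A = f_AB + f_Ab
    f_B = f_AB + f_aB
    r = (f_AB - f_A * f_B) / math.sqrt(f_A * (1 - f_A) * f_B * (1 - f_B))
    scale = math.sqrt((f_A * (1 - f_A)) / (f_B * (1 - f_B)))
    return r, r * scale * beta_causal


# Hypothetical haplotype frequencies and causal effect size.
freqs = dict(f_AB=0.3, f_Ab=0.1, f_aB=0.2, f_ab=0.4)
beta_tag_q = tag_effect_via_q(**freqs, beta_causal=0.2)
r, beta_tag_r = tag_effect_via_r(**freqs, beta_causal=0.2)

assert math.isclose(beta_tag_q, beta_tag_r)
print(f"r = {r:.4f}, beta_tag = {beta_tag_r:.4f}")
```

Note that in this example the allele frequencies at the two loci differ, so the tag effect is not exactly $r\beta_{causal}$; the square-root factor matters.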

In practice, genotypes are often normalized to unit variance, i.e., $2f_B(1-f_B)=2f_A(1-f_A)=1$. If that’s the case, $\beta_{tag}$ further simplifies to:

$$\beta_{tag}=r\beta_{causal}$$

Done! Although, note that this last simplification assumes that genotypes for all SNPs are normalized to the same variance.
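One way to spell out what the normalization does is sketched below. It assumes the haploid 0/1 coding used in this post, so that the genotype variances are $f_A(1-f_A)$ and $f_B(1-f_B)$. The starred symbols are notation introduced here for illustration only: they denote standardized genotypes and the corresponding per-standard-deviation effect sizes.

$$
\begin{aligned}
A^{*} &= \frac{A-f_A}{\sqrt{f_A(1-f_A)}}, \qquad B^{*}=\frac{B-f_B}{\sqrt{f_B(1-f_B)}} \\
\beta^{*}_{causal} &= \beta_{causal}\sqrt{f_A(1-f_A)}, \qquad \beta^{*}_{tag}=\beta_{tag}\sqrt{f_B(1-f_B)} \\
\beta^{*}_{tag} &= r\sqrt{\frac{f_A(1-f_A)}{f_B(1-f_B)}}\,\beta_{causal}\sqrt{f_B(1-f_B)}=r\,\beta_{causal}\sqrt{f_A(1-f_A)}=r\,\beta^{*}_{causal}
\end{aligned}
$$

In words: rescaling each genotype by its own standard deviation absorbs the square-root factor, which is why the simple form holds for standardized effect sizes even when the two allele frequencies differ.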

Effect of tag SNP (quantitative phenotype)

The derivation for a quantitative phenotype is very similar, and in some ways simpler, because we can use a simple linear model instead of worrying about log models. It differs mostly in notation: instead of looking at the probability of disease as a function of which allele an individual carries, we’re going to look at the expected phenotype. We assume the following linear model:

$$Y=\mu+\beta_{causal}A+\epsilon$$

where $Y$ is a continuous phenotype and $\epsilon$ is some random error with mean zero. Then the expected phenotypes are:

$$E(Y \mid A=0)=\alpha_0=\mu \quad \text{and} \quad E(Y \mid A=1)=\alpha_1=\mu+\beta_{causal}$$

And the effect of the causal variant is (similar to the binary phenotype):

$$\beta_{causal}=E(Y \mid A=1)-E(Y \mid A=0)$$

Similarly, if we denote $E(Y \mid B=0)=\gamma_0$ and $E(Y \mid B=1)=\gamma_1$, the effect size of the tag SNP is:

$$\beta_{tag}=E(Y \mid B=1)-E(Y \mid B=0)$$

E(Y|B=0) can be calculated using:

$$E(Y \mid B=0)=\int_y y\, f_{Y \mid B=0}(y \mid B=0)\,dy$$

where $f_{Y \mid B=0}(y \mid B=0)$ is the probability density function (pdf) of $y$ conditional on the b allele. This is the same as calculating the mean of $Y$ across all individuals carrying the b allele. The pdf is necessary because $Y$ is a continuous variable.

$$f_{Y \mid B=0}(y \mid B=0)=\frac{f_{Y,B=0}(y,B=0)}{f_{B=0}(B=0)}$$

where $f_{Y,B=0}$ is the joint distribution and $f_{B=0}(B=0)=P(B=0)$. This is similar to the definition of conditional probability presented in the case of the binary phenotype. As with $P(Y=1,B=0)$ in the binary case, we will sum over the allele present at the causal locus (bear with me):

$$
\begin{aligned}
f_{Y,B=0}(y,B=0) &= \sum_{a} f_{Y,B=0 \mid A=a}\,P(A=a) \\
&= f_{Y,B=0 \mid A=0}(y,B=0 \mid A=0)P(A=0)+f_{Y,B=0 \mid A=1}(y,B=0 \mid A=1)P(A=1) \\
&= f_{Y \mid A=0}(y \mid A=0)\,f_{B=0 \mid A=0}(B=0 \mid A=0)P(A=0)+f_{Y \mid A=1}(y \mid A=1)\,f_{B=0 \mid A=1}(B=0 \mid A=1)P(A=1) \\
&= f_{Y \mid A=0}(y \mid A=0)P(B=0 \mid A=0)P(A=0)+f_{Y \mid A=1}(y \mid A=1)P(B=0 \mid A=1)P(A=1)
\end{aligned}
$$

The second last line is possible again because Y is independent of B, conditional on A. From this we can get the conditional distribution:

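$$f_{Y \mid B=0}(y \mid B=0)=\frac{f_{Y \mid A=0}(y \mid A=0)P(B=0 \mid A=0)P(A=0)+f_{Y \mid A=1}(y \mid A=1)P(B=0 \mid A=1)P(A=1)}{P(B=0)}$$

And $E(Y \mid B=0)$ becomes:

$$
\begin{aligned}
E(Y \mid B=0) &= \int_y y\, f_{Y \mid B=0}(y \mid B=0)\,dy \\
&= \frac{\int_y y\, f_{Y \mid A=0}(y \mid A=0)P(B=0 \mid A=0)P(A=0)\,dy+\int_y y\, f_{Y \mid A=1}(y \mid A=1)P(B=0 \mid A=1)P(A=1)\,dy}{P(B=0)} \\
&= \frac{P(B=0 \mid A=0)P(A=0)\int_y y\, f_{Y \mid A=0}(y \mid A=0)\,dy+P(B=0 \mid A=1)P(A=1)\int_y y\, f_{Y \mid A=1}(y \mid A=1)\,dy}{P(B=0)} \\
&= \frac{P(B=0 \mid A=0)P(A=0)E(Y \mid A=0)+P(B=0 \mid A=1)P(A=1)E(Y \mid A=1)}{P(B=0)}
\end{aligned}
$$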

We derived most of these terms in the binary case, which simplifies the above equation to:

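$$E(Y \mid B=0)=\frac{\alpha_0 f_{ab}+\alpha_1 f_{Ab}}{1-f_B}=\alpha_0(1-q_0)+\alpha_1 q_0$$

Similarly (and I’ll let you work this out yourself), $E(Y \mid B=1)=\alpha_0(1-q_1)+\alpha_1 q_1$. Then, the effect size of the tag SNP becomes:

$$
\begin{aligned}
\beta_{tag} &= E(Y \mid B=1)-E(Y \mid B=0) \\
&= \{\alpha_0(1-q_1)+\alpha_1 q_1\}-\{\alpha_0(1-q_0)+\alpha_1 q_0\} \\
&= (q_1-q_0)\beta_{causal} \\
&= r\beta_{causal}\sqrt{\frac{f_A(1-f_A)}{f_B(1-f_B)}}
\end{aligned}
$$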
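If you would rather see this empirically, here is a small simulation sketch using only the Python standard library. It draws haploid individuals from assumed haplotype frequencies, generates $Y=\mu+\beta_{causal}A+\epsilon$ with standard normal noise, and compares the difference in mean phenotype between B and b carriers with the formula. All parameter values, the noise distribution, and the function names are assumptions made for illustration.

```python
import math
import random


def expected_tag_effect(f_AB, f_Ab, f_aB, f_ab, beta_causal):
    """Tag effect predicted by r * sqrt(fA(1-fA) / (fB(1-fB))) * beta_causal."""
    f_A = f_AB + f_Ab
    f_B = f_AB + f_aB
    r = (f_AB - f_A * f_B) / math.sqrt(f_A * (1 - f_A) * f_B * (1 - f_B))
    return r * math.sqrt((f_A * (1 - f_A)) / (f_B * (1 - f_B))) * beta_causal


def simulate_tag_effect(f_AB, f_Ab, f_aB, f_ab, mu, beta_causal,
                        n=200_000, seed=1):
    """Estimate E(Y|B=1) - E(Y|B=0) from n simulated haploid individuals."""
    rng = random.Random(seed)
    haplotypes = [(1, 1), (1, 0), (0, 1), (0, 0)]  # (A, B) pairs
    draws = rng.choices(haplotypes, weights=[f_AB, f_Ab, f_aB, f_ab], k=n)
    sums = [0.0, 0.0]   # running phenotype totals, indexed by tag allele B
    counts = [0, 0]
    for a, b in draws:
        # The phenotype depends on the causal allele only, never on B directly.
        y = mu + beta_causal * a + rng.gauss(0.0, 1.0)
        sums[b] += y
        counts[b] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]


# Hypothetical haplotype frequencies and model parameters.
freqs = dict(f_AB=0.3, f_Ab=0.1, f_aB=0.2, f_ab=0.4)
expected = expected_tag_effect(**freqs, beta_causal=0.5)
simulated = simulate_tag_effect(**freqs, mu=1.0, beta_causal=0.5)
print(f"expected = {expected:.3f}, simulated = {simulated:.3f}")
```

The simulated difference in means should land close to the predicted value, with the remaining gap shrinking as `n` grows.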

I skipped over the last few steps as we already worked them out for the binary phenotype. So there you have it!

This post has already gotten quite long so I’m going to leave it here but here are some extensions that you can try to work out:

  • derivations for the log-risk and log-odds (logistic regression) model
  • extend this result to diploid individuals
  • derive the standard error and power of the test

I might update this post later with some of these (the first two).

Cheers! :)


  1. I found this paper which provides a similar result for the log risk model (Vukcevic et al. 2012). I have incorporated some of their notation in my post because it was clearer. Definitely worth the read.

Arslan A. Zaidi
Postdoctoral fellow
