Methods (implementation notes)

This page documents the main statistical models implemented in genal. It is intentionally brief and focuses on what is implemented in the codebase.

Per-variant F-statistic (FSTAT)

Implementation: genal.geno_tools.fill_fstatistic()

genal computes a per-variant F-statistic (FSTAT) during preprocessing and after association testing. This statistic measures instrument strength for each variant.

  • Primary: FSTAT = (BETA / SE)² when BETA and SE are available and SE > 0.

  • Fallback: FSTAT = χ²_isf(P, df=1) (equivalent to for a two-sided p-value) when BETA/SE are unavailable but P is present.

MR harmonization

MR workflows rely on aligning exposure and outcome effects to the same effect allele.

Implementation: genal.MR_tools.harmonize_MR()

High-level steps:

  1. Merge exposure and outcome on SNP.

  2. Align alleles:

    • If alleles are inverted, swap outcome alleles and flip the outcome effect (and outcome EAF when available).

    • If alleles are neither aligned nor inverted (and not palindromic), flip outcome alleles to their complements (A↔T, C↔G) and retry.

  3. Drop SNPs with remaining allele mismatches.

  4. Handle palindromes based on action:

    • action=1: keep all palindromes (no attempt to resolve strand ambiguity)

    • action=2: use allele frequencies to drop “ambiguous” palindromes and flip the remaining ones when exposure/outcome frequencies disagree around 0.5

    • action=3: drop all palindromes

Two-sample MR estimators

Implementation: MR wrappers in genal.MR_tools.MR_func(), estimators in genal/MR.py.

For each SNP \(i\), define:

  • exposure association: \(\beta_{e,i}\) with standard error \(se_{e,i}\)

  • outcome association: \(\beta_{o,i}\) with standard error \(se_{o,i}\)

The per-variant ratio estimate is:

\[\hat{\theta}_i = \frac{\beta_{o,i}}{\beta_{e,i}}\]

Method map (what methods=[...] runs)

genal accepts the following method names in genal.Geno.MR():

methods entry

Estimator

Minimum instruments

Notes

"IVW"

IVW (WLS)

2

Under-dispersion correction in the SE; returns Cochran’s Q

"IVW-RE"

IVW (WLS)

2

No under-dispersion correction; returns Cochran’s Q

"IVW-FE"

IVW (WLS)

2

Fixed-effects-style scaling; returns Cochran’s Q

"UWR"

Unweighted regression

2

OLS through origin; SE uses under-dispersion correction

"Egger"

MR-Egger + intercept

3

WLS with intercept; returns both slope and intercept (+ Q)

"Egger-boot"

MR-Egger bootstrap + intercept

3

Bootstrap SE/p-values (normal sampling); returns slope + intercept

"WM"

Weighted median

3

Bootstrap SE

"WM-pen"

Penalised weighted median

3

Downweights outlier ratio estimates using penk; bootstrap SE

"Simple-median"

Simple median

3

Unweighted median of ratio estimates; bootstrap SE

"Simple-mode"

Simple mode

3

Mode of ratio estimates via KDE; bandwidth scaled by phi

"Weighted-mode"

Weighted mode

3

KDE weighted by inverse-variance-like weights; bandwidth scaled by phi

"Sign"

Sign concordance test

6

Binomial test of sign agreement between exposure and outcome effects

Common quantities used by ratio-based methods

For ratio-based estimators (median/mode), genal uses the delta-method standard error of the per-variant ratio:

\[se(\hat{\theta}_i) = \sqrt{ \frac{se_{o,i}^2}{\beta_{e,i}^2} \;+\; \frac{\beta_{o,i}^2 \, se_{e,i}^2}{\beta_{e,i}^4} }\]

IVW family (WLS through origin)

genal implements IVW as a weighted regression of \(\beta_o\) on \(\beta_e\) without intercept:

\[\beta_{o,i} = \theta \, \beta_{e,i} + \varepsilon_i\]

with weights \(w_i = 1/se_{o,i}^2\).

All IVW variants return heterogeneity diagnostics via Cochran’s Q:

\[Q = \sum_i w_i \left(\beta_{o,i} - \hat{\theta}\beta_{e,i}\right)^2\]

Differences between the three IVW variants in genal:

  • "IVW": WLS estimate, with an under-dispersion correction applied to the standard error.

  • "IVW-RE": WLS estimate, with the raw WLS standard error (no under-dispersion correction).

  • "IVW-FE": WLS estimate, with an alternative residual-variance scaling of the standard error.

Practical note: if \(\beta_{e,i}\) is near 0 for some SNPs, ratio-based methods become unstable; genal filters invalid rows after harmonization.

MR-Egger (WLS with intercept)

Egger is implemented as a weighted regression with intercept:

\[\beta_{o,i} = \alpha + \theta \, \beta_{e,i} + \varepsilon_i\]

After harmonization, genal orients \(\beta_e\) to be non-negative and applies the same sign to \(\beta_o\). The intercept \(\alpha\) is returned as “Egger Intercept”.

"Egger-boot" returns bootstrap SE/p-values by repeatedly sampling exposure/outcome effects from normal distributions using their SEs and refitting the regression.

Median estimators

genal implements:

  • Weighted median: weighted median of \(\hat{\theta}_i\) (weights derived from the delta-method variance), with bootstrap SE.

  • Penalised weighted median: downweights variants whose ratio estimates deviate strongly from the weighted-median estimate, then recomputes the weighted median (controlled by penk), with bootstrap SE.

  • Simple median: unweighted median of \(\hat{\theta}_i\), with bootstrap SE.

Mode estimators

Mode methods estimate the mode of the distribution of \(\hat{\theta}_i\) using kernel density estimation (KDE):

  • Simple mode: unweighted KDE

  • Weighted mode: KDE weighted by inverse-variance-like weights

The bandwidth uses a modified Silverman rule multiplied by the user-provided factor phi. genal then uses bootstrap sampling (normal sampling with the per-variant ratio SE) to estimate uncertainty.

Sign concordance test

The sign method tests whether exposure and outcome effects tend to have the same sign across variants. genal performs a binomial test against the null of 50% sign agreement.

Leave-one-out MR

Implementation: genal.MR_tools.MR_loo_func(), wrapped by genal.Geno.MR_loo().

Leave-one-out MR iterates over all instruments, sequentially removing each SNP and re-estimating the causal effect using the remaining instruments. This identifies variants that have a disproportionate influence on the overall estimate.

For each SNP \(i\) in the instrument set:

  1. Remove SNP \(i\) from the harmonized data.

  2. Re-run the selected MR method on the remaining \(J-1\) SNPs.

  3. Store the resulting estimate \(\hat{\theta}_{-i}\).

The “influence” of SNP \(i\) is defined as:

\[\text{influence}_i = \left| \hat{\theta}_{-i} - \hat{\theta}_{\text{all}} \right|\]

where \(\hat{\theta}_{\text{all}}\) is the estimate using all instruments.

Notes:

  • Any single MR method can be used (IVW, Egger, weighted median, mode-based, etc.).

  • At least 3 instruments are required (so that each LOO subset has ≥2 instruments).

MR-PRESSO (summary)

Implementation: genal.MRpresso.mr_presso(), wrapped by genal.Geno.MRpresso().

At a high level, genal’s MR-PRESSO implementation:

  • computes an observed leave-one-out residual sum of squares (RSS),

  • simulates an empirical RSS distribution by generating random data under the fitted model (n_iterations controls this),

  • if the global p-value is significant (global_test_p < significance_p), performs:

    • an outlier test (per-variant p-values, Bonferroni-corrected),

    • an optional distortion test (whether the causal estimate changes materially after removing outliers).

Distortion test

The distortion test assesses whether detected outliers materially bias the causal estimate. genal implements the following version:

  1. Observed distortion: \(D_\text{obs} = (\hat{\theta}_\text{all} - \hat{\theta}_\text{no outliers}) / |\hat{\theta}_\text{no outliers}|\)

  2. Expected distortion null: bootstrap resampling is performed exclusively on the non-outlier subset. For each iteration:

    • Sample with replacement \(J-k\) SNPs from the non-outlier data (where \(k\) is the number of detected outliers).

    • Fit the IVW model on the sampled data and record \(\hat{\theta}_\text{exp}\).

    • Compute \(D_\text{exp} = (\hat{\theta}_\text{all} - \hat{\theta}_\text{exp}) / |\hat{\theta}_\text{exp}|\).

  3. P-value: \(p = \text{mean}(|D_\text{exp}| > |D_\text{obs}|)\).

This differs from the original MR-PRESSO R implementation, which in some cases samples from the full dataset (including outliers) for the expected-bias regressions and was inconsistent with the paper’s description.

Output structure

Geno.MRpresso() returns four objects:

  • mod_table: a small results table (Raw and Outlier-corrected rows; IVW model),

  • GlobalTest: RSS and global p-value,

  • OutlierTest: per-variant outlier p-values (empty if the global test is not significant); SNP IDs as row labels,

  • BiasTest: distortion test result dictionary containing "outliers_indices" (SNP IDs), "distortion_test_coefficient", and "distortion_test_p" (empty if distortion test was not run).

If outliers are found, genal stores the outlier-removed harmonized table and allows rerunning MR with Geno.MR(use_mrpresso_data=True).

Colocalization (ABF)

Implementation: genal.colocalization.coloc_abf_func().

For each trait and each variant, genal computes an approximate Bayes factor (ABF) from regression estimates. Let:

\[z = \frac{\beta}{\sqrt{\mathrm{var}(\beta)}}\]

and let the prior standard deviation be:

  • quantitative trait: \(sd_\mathrm{prior} = 0.15 \cdot sd_Y\)

  • case-control trait: \(sd_\mathrm{prior} = 0.2\)

Then, with \(r = \frac{sd_\mathrm{prior}^2}{sd_\mathrm{prior}^2 + \mathrm{var}(\beta)}\), the log-ABF is:

\[\log(\mathrm{ABF}) = \frac{1}{2}\left(\log(1-r) + r z^2\right)\]

genal combines per-trait ABFs to compute posterior probabilities for hypotheses \(H_0\)\(H_4\) (no association, trait1 only, trait2 only, both traits but different variants, both traits and shared variant) using priors p1, p2, p12.

Practical note: colocalization is usually done on a locus/region, not genome-wide, and requires careful choice of priors and trait type (quant vs cc).