Methods (implementation notes)

This page documents the main statistical models implemented in genal. It is intentionally brief and focuses on what is implemented in the codebase.

Per-variant F-statistic (FSTAT)

Implementation: genal.geno_tools.fill_fstatistic()

genal computes a per-variant F-statistic (FSTAT) during preprocessing and after association testing. This statistic measures instrument strength for each variant.

Primary: FSTAT = (BETA / SE)² when BETA and SE are available and SE > 0.
Fallback: FSTAT = χ²_isf(P, df=1) (equivalent to Z² for a two-sided p-value) when BETA/SE are unavailable but P is present.

MR harmonization

MR workflows rely on aligning exposure and outcome effects to the same effect allele.

Implementation: genal.MR_tools.harmonize_MR()

High-level steps:

Merge exposure and outcome on SNP.
Align alleles:
- If alleles are inverted, swap outcome alleles and flip the outcome effect (and outcome EAF when available).
- If alleles are neither aligned nor inverted (and not palindromic), flip outcome alleles to their complements (A↔T, C↔G) and retry.
Drop SNPs with remaining allele mismatches.
Handle palindromes based on action:
- action=1: keep all palindromes (no attempt to resolve strand ambiguity)
- action=2: use allele frequencies to drop “ambiguous” palindromes and flip the remaining ones when exposure/outcome frequencies disagree around 0.5
- action=3: drop all palindromes

Two-sample MR estimators

Implementation: MR wrappers in genal.MR_tools.MR_func(), estimators in genal/MR.py.

For each SNP \(i\), define:

exposure association: \(\beta_{e,i}\) with standard error \(se_{e,i}\)
outcome association: \(\beta_{o,i}\) with standard error \(se_{o,i}\)

The per-variant ratio estimate is:

\[\hat{\theta}_i = \frac{\beta_{o,i}}{\beta_{e,i}}\]

Method map (what `methods=[...]` runs)

genal accepts the following method names in genal.Geno.MR():

`methods` entry	Estimator	Minimum instruments	Notes
`"IVW"`	IVW (WLS)	2	Under-dispersion correction in the SE; returns Cochran’s Q
`"IVW-RE"`	IVW (WLS)	2	No under-dispersion correction; returns Cochran’s Q
`"IVW-FE"`	IVW (WLS)	2	Fixed-effects-style scaling; returns Cochran’s Q
`"UWR"`	Unweighted regression	2	OLS through origin; SE uses under-dispersion correction
`"Egger"`	MR-Egger + intercept	3	WLS with intercept; returns both slope and intercept (+ Q)
`"Egger-boot"`	MR-Egger bootstrap + intercept	3	Bootstrap SE/p-values (normal sampling); returns slope + intercept
`"WM"`	Weighted median	3	Bootstrap SE
`"WM-pen"`	Penalised weighted median	3	Downweights outlier ratio estimates using `penk`; bootstrap SE
`"Simple-median"`	Simple median	3	Unweighted median of ratio estimates; bootstrap SE
`"Simple-mode"`	Simple mode	3	Mode of ratio estimates via KDE; bandwidth scaled by `phi`
`"Weighted-mode"`	Weighted mode	3	KDE weighted by inverse-variance-like weights; bandwidth scaled by `phi`
`"Sign"`	Sign concordance test	6	Binomial test of sign agreement between exposure and outcome effects

Common quantities used by ratio-based methods

For ratio-based estimators (median/mode), genal uses the delta-method standard error of the per-variant ratio:

\[se(\hat{\theta}_i) = \sqrt{ \frac{se_{o,i}^2}{\beta_{e,i}^2} \;+\; \frac{\beta_{o,i}^2 \, se_{e,i}^2}{\beta_{e,i}^4} }\]

IVW family (WLS through origin)

genal implements IVW as a weighted regression of \(\beta_o\) on \(\beta_e\) without intercept:

\[\beta_{o,i} = \theta \, \beta_{e,i} + \varepsilon_i\]

with weights \(w_i = 1/se_{o,i}^2\).

All IVW variants return heterogeneity diagnostics via Cochran’s Q:

\[Q = \sum_i w_i \left(\beta_{o,i} - \hat{\theta}\beta_{e,i}\right)^2\]

Differences between the three IVW variants in genal:

"IVW": WLS estimate, with an under-dispersion correction applied to the standard error.
"IVW-RE": WLS estimate, with the raw WLS standard error (no under-dispersion correction).
"IVW-FE": WLS estimate, with an alternative residual-variance scaling of the standard error.

Practical note: if \(\beta_{e,i}\) is near 0 for some SNPs, ratio-based methods become unstable; genal filters invalid rows after harmonization.

MR-Egger (WLS with intercept)

Egger is implemented as a weighted regression with intercept:

\[\beta_{o,i} = \alpha + \theta \, \beta_{e,i} + \varepsilon_i\]

After harmonization, genal orients \(\beta_e\) to be non-negative and applies the same sign to \(\beta_o\). The intercept \(\alpha\) is returned as “Egger Intercept”.

"Egger-boot" returns bootstrap SE/p-values by repeatedly sampling exposure/outcome effects from normal distributions using their SEs and refitting the regression.

Median estimators

genal implements:

Weighted median: weighted median of \(\hat{\theta}_i\) (weights derived from the delta-method variance), with bootstrap SE.
Penalised weighted median: downweights variants whose ratio estimates deviate strongly from the weighted-median estimate, then recomputes the weighted median (controlled by penk), with bootstrap SE.
Simple median: unweighted median of \(\hat{\theta}_i\), with bootstrap SE.

Mode estimators

Mode methods estimate the mode of the distribution of \(\hat{\theta}_i\) using kernel density estimation (KDE):

Simple mode: unweighted KDE
Weighted mode: KDE weighted by inverse-variance-like weights

The bandwidth uses a modified Silverman rule multiplied by the user-provided factor phi. genal then uses bootstrap sampling (normal sampling with the per-variant ratio SE) to estimate uncertainty.

Sign concordance test

The sign method tests whether exposure and outcome effects tend to have the same sign across variants. genal performs a binomial test against the null of 50% sign agreement.

Leave-one-out MR

Implementation: genal.MR_tools.MR_loo_func(), wrapped by genal.Geno.MR_loo().

Leave-one-out MR iterates over all instruments, sequentially removing each SNP and re-estimating the causal effect using the remaining instruments. This identifies variants that have a disproportionate influence on the overall estimate.

For each SNP \(i\) in the instrument set:

Remove SNP \(i\) from the harmonized data.
Re-run the selected MR method on the remaining \(J-1\) SNPs.
Store the resulting estimate \(\hat{\theta}_{-i}\).

The “influence” of SNP \(i\) is defined as:

\[\text{influence}_i = \left| \hat{\theta}_{-i} - \hat{\theta}_{\text{all}} \right|\]

where \(\hat{\theta}_{\text{all}}\) is the estimate using all instruments.

Notes:

Any single MR method can be used (IVW, Egger, weighted median, mode-based, etc.).
At least 3 instruments are required (so that each LOO subset has ≥2 instruments).

MR-PRESSO (summary)

Implementation: genal.MRpresso.mr_presso(), wrapped by genal.Geno.MRpresso().

At a high level, genal’s MR-PRESSO implementation:

computes an observed leave-one-out residual sum of squares (RSS),
simulates an empirical RSS distribution by generating random data under the fitted model (n_iterations controls this),
if the global p-value is significant (global_test_p < significance_p), performs:
- an outlier test (per-variant p-values, Bonferroni-corrected),
- an optional distortion test (whether the causal estimate changes materially after removing outliers).

Distortion test

The distortion test assesses whether detected outliers materially bias the causal estimate. genal implements the following version:

Observed distortion: \(D_\text{obs} = (\hat{\theta}_\text{all} - \hat{\theta}_\text{no outliers}) / |\hat{\theta}_\text{no outliers}|\)
Expected distortion null: bootstrap resampling is performed exclusively on the non-outlier subset. For each iteration:
- Sample with replacement \(J-k\) SNPs from the non-outlier data (where \(k\) is the number of detected outliers).
- Fit the IVW model on the sampled data and record \(\hat{\theta}_\text{exp}\).
- Compute \(D_\text{exp} = (\hat{\theta}_\text{all} - \hat{\theta}_\text{exp}) / |\hat{\theta}_\text{exp}|\).
P-value: \(p = \text{mean}(|D_\text{exp}| > |D_\text{obs}|)\).

This differs from the original MR-PRESSO R implementation, which in some cases samples from the full dataset (including outliers) for the expected-bias regressions and was inconsistent with the paper’s description.

Output structure

Geno.MRpresso() returns four objects:

mod_table: a small results table (Raw and Outlier-corrected rows; IVW model),
GlobalTest: RSS and global p-value,
OutlierTest: per-variant outlier p-values (empty if the global test is not significant); SNP IDs as row labels,
BiasTest: distortion test result dictionary containing "outliers_indices" (SNP IDs), "distortion_test_coefficient", and "distortion_test_p" (empty if distortion test was not run).

If outliers are found, genal stores the outlier-removed harmonized table and allows rerunning MR with Geno.MR(use_mrpresso_data=True).

Colocalization (ABF)

Implementation: genal.colocalization.coloc_abf_func().

For each trait and each variant, genal computes an approximate Bayes factor (ABF) from regression estimates. Let:

\[z = \frac{\beta}{\sqrt{\mathrm{var}(\beta)}}\]

and let the prior standard deviation be:

quantitative trait: \(sd_\mathrm{prior} = 0.15 \cdot sd_Y\)
case-control trait: \(sd_\mathrm{prior} = 0.2\)

Then, with \(r = \frac{sd_\mathrm{prior}^2}{sd_\mathrm{prior}^2 + \mathrm{var}(\beta)}\), the log-ABF is:

\[\log(\mathrm{ABF}) = \frac{1}{2}\left(\log(1-r) + r z^2\right)\]

genal combines per-trait ABFs to compute posterior probabilities for hypotheses \(H_0\)–\(H_4\) (no association, trait1 only, trait2 only, both traits but different variants, both traits and shared variant) using priors p1, p2, p12.

Practical note: colocalization is usually done on a locus/region, not genome-wide, and requires careful choice of priors and trait type (quant vs cc).