Methods (implementation notes)
This page documents the main statistical models implemented in genal. It is intentionally brief and focuses on what is implemented in the codebase.
Per-variant F-statistic (FSTAT)
Implementation: genal.geno_tools.fill_fstatistic()
genal computes a per-variant F-statistic (FSTAT) during preprocessing and after association testing. This statistic measures instrument strength for each variant.
Primary:
FSTAT = (BETA / SE)²whenBETAandSEare available andSE > 0.Fallback:
FSTAT = χ²_isf(P, df=1)(equivalent toZ²for a two-sided p-value) whenBETA/SEare unavailable butPis present.
MR harmonization
MR workflows rely on aligning exposure and outcome effects to the same effect allele.
Implementation: genal.MR_tools.harmonize_MR()
High-level steps:
Merge exposure and outcome on
SNP.Align alleles:
If alleles are inverted, swap outcome alleles and flip the outcome effect (and outcome EAF when available).
If alleles are neither aligned nor inverted (and not palindromic), flip outcome alleles to their complements (A↔T, C↔G) and retry.
Drop SNPs with remaining allele mismatches.
Handle palindromes based on
action:action=1: keep all palindromes (no attempt to resolve strand ambiguity)action=2: use allele frequencies to drop “ambiguous” palindromes and flip the remaining ones when exposure/outcome frequencies disagree around 0.5action=3: drop all palindromes
Two-sample MR estimators
Implementation: MR wrappers in genal.MR_tools.MR_func(), estimators in genal/MR.py.
For each SNP \(i\), define:
exposure association: \(\beta_{e,i}\) with standard error \(se_{e,i}\)
outcome association: \(\beta_{o,i}\) with standard error \(se_{o,i}\)
The per-variant ratio estimate is:
Method map (what methods=[...] runs)
genal accepts the following method names in genal.Geno.MR():
|
Estimator |
Minimum instruments |
Notes |
|---|---|---|---|
|
IVW (WLS) |
2 |
Under-dispersion correction in the SE; returns Cochran’s Q |
|
IVW (WLS) |
2 |
No under-dispersion correction; returns Cochran’s Q |
|
IVW (WLS) |
2 |
Fixed-effects-style scaling; returns Cochran’s Q |
|
Unweighted regression |
2 |
OLS through origin; SE uses under-dispersion correction |
|
MR-Egger + intercept |
3 |
WLS with intercept; returns both slope and intercept (+ Q) |
|
MR-Egger bootstrap + intercept |
3 |
Bootstrap SE/p-values (normal sampling); returns slope + intercept |
|
Weighted median |
3 |
Bootstrap SE |
|
Penalised weighted median |
3 |
Downweights outlier ratio estimates using |
|
Simple median |
3 |
Unweighted median of ratio estimates; bootstrap SE |
|
Simple mode |
3 |
Mode of ratio estimates via KDE; bandwidth scaled by |
|
Weighted mode |
3 |
KDE weighted by inverse-variance-like weights; bandwidth scaled by |
|
Sign concordance test |
6 |
Binomial test of sign agreement between exposure and outcome effects |
Common quantities used by ratio-based methods
For ratio-based estimators (median/mode), genal uses the delta-method standard error of the per-variant ratio:
IVW family (WLS through origin)
genal implements IVW as a weighted regression of \(\beta_o\) on \(\beta_e\) without intercept:
with weights \(w_i = 1/se_{o,i}^2\).
All IVW variants return heterogeneity diagnostics via Cochran’s Q:
Differences between the three IVW variants in genal:
"IVW": WLS estimate, with an under-dispersion correction applied to the standard error."IVW-RE": WLS estimate, with the raw WLS standard error (no under-dispersion correction)."IVW-FE": WLS estimate, with an alternative residual-variance scaling of the standard error.
Practical note: if \(\beta_{e,i}\) is near 0 for some SNPs, ratio-based methods become unstable; genal filters invalid rows after harmonization.
MR-Egger (WLS with intercept)
Egger is implemented as a weighted regression with intercept:
After harmonization, genal orients \(\beta_e\) to be non-negative and applies the same sign to \(\beta_o\). The intercept \(\alpha\) is returned as “Egger Intercept”.
"Egger-boot" returns bootstrap SE/p-values by repeatedly sampling exposure/outcome effects from normal distributions using their SEs and refitting the regression.
Median estimators
genal implements:
Weighted median: weighted median of \(\hat{\theta}_i\) (weights derived from the delta-method variance), with bootstrap SE.
Penalised weighted median: downweights variants whose ratio estimates deviate strongly from the weighted-median estimate, then recomputes the weighted median (controlled by
penk), with bootstrap SE.Simple median: unweighted median of \(\hat{\theta}_i\), with bootstrap SE.
Mode estimators
Mode methods estimate the mode of the distribution of \(\hat{\theta}_i\) using kernel density estimation (KDE):
Simple mode: unweighted KDE
Weighted mode: KDE weighted by inverse-variance-like weights
The bandwidth uses a modified Silverman rule multiplied by the user-provided factor phi. genal then uses bootstrap sampling (normal sampling with the per-variant ratio SE) to estimate uncertainty.
Sign concordance test
The sign method tests whether exposure and outcome effects tend to have the same sign across variants. genal performs a binomial test against the null of 50% sign agreement.
Leave-one-out MR
Implementation: genal.MR_tools.MR_loo_func(), wrapped by genal.Geno.MR_loo().
Leave-one-out MR iterates over all instruments, sequentially removing each SNP and re-estimating the causal effect using the remaining instruments. This identifies variants that have a disproportionate influence on the overall estimate.
For each SNP \(i\) in the instrument set:
Remove SNP \(i\) from the harmonized data.
Re-run the selected MR method on the remaining \(J-1\) SNPs.
Store the resulting estimate \(\hat{\theta}_{-i}\).
The “influence” of SNP \(i\) is defined as:
where \(\hat{\theta}_{\text{all}}\) is the estimate using all instruments.
Notes:
Any single MR method can be used (IVW, Egger, weighted median, mode-based, etc.).
At least 3 instruments are required (so that each LOO subset has ≥2 instruments).
MR-PRESSO (summary)
Implementation: genal.MRpresso.mr_presso(), wrapped by genal.Geno.MRpresso().
At a high level, genal’s MR-PRESSO implementation:
computes an observed leave-one-out residual sum of squares (RSS),
simulates an empirical RSS distribution by generating random data under the fitted model (
n_iterationscontrols this),if the global p-value is significant (
global_test_p < significance_p), performs:an outlier test (per-variant p-values, Bonferroni-corrected),
an optional distortion test (whether the causal estimate changes materially after removing outliers).
Distortion test
The distortion test assesses whether detected outliers materially bias the causal estimate. genal implements the following version:
Observed distortion: \(D_\text{obs} = (\hat{\theta}_\text{all} - \hat{\theta}_\text{no outliers}) / |\hat{\theta}_\text{no outliers}|\)
Expected distortion null: bootstrap resampling is performed exclusively on the non-outlier subset. For each iteration:
Sample with replacement \(J-k\) SNPs from the non-outlier data (where \(k\) is the number of detected outliers).
Fit the IVW model on the sampled data and record \(\hat{\theta}_\text{exp}\).
Compute \(D_\text{exp} = (\hat{\theta}_\text{all} - \hat{\theta}_\text{exp}) / |\hat{\theta}_\text{exp}|\).
P-value: \(p = \text{mean}(|D_\text{exp}| > |D_\text{obs}|)\).
This differs from the original MR-PRESSO R implementation, which in some cases samples from the full dataset (including outliers) for the expected-bias regressions and was inconsistent with the paper’s description.
Output structure
Geno.MRpresso() returns four objects:
mod_table: a small results table (RawandOutlier-correctedrows; IVW model),GlobalTest: RSS and global p-value,OutlierTest: per-variant outlier p-values (empty if the global test is not significant); SNP IDs as row labels,BiasTest: distortion test result dictionary containing"outliers_indices"(SNP IDs),"distortion_test_coefficient", and"distortion_test_p"(empty if distortion test was not run).
If outliers are found, genal stores the outlier-removed harmonized table and allows rerunning MR with Geno.MR(use_mrpresso_data=True).
Colocalization (ABF)
Implementation: genal.colocalization.coloc_abf_func().
For each trait and each variant, genal computes an approximate Bayes factor (ABF) from regression estimates. Let:
and let the prior standard deviation be:
quantitative trait: \(sd_\mathrm{prior} = 0.15 \cdot sd_Y\)
case-control trait: \(sd_\mathrm{prior} = 0.2\)
Then, with \(r = \frac{sd_\mathrm{prior}^2}{sd_\mathrm{prior}^2 + \mathrm{var}(\beta)}\), the log-ABF is:
genal combines per-trait ABFs to compute posterior probabilities for hypotheses \(H_0\)–\(H_4\) (no association, trait1 only, trait2 only, both traits but different variants, both traits and shared variant) using priors p1, p2, p12.
Practical note: colocalization is usually done on a locus/region, not genome-wide, and requires careful choice of priors and trait type (quant vs cc).