# Methods (implementation notes)

This page documents the main statistical models implemented in `genal`. It is intentionally brief and focuses on what is implemented in the codebase.

## Per-variant F-statistic (FSTAT)

Implementation: {py:func}`genal.geno_tools.fill_fstatistic`

`genal` computes a per-variant F-statistic (`FSTAT`) during preprocessing and after association testing. This statistic measures instrument strength for each variant.

- **Primary:** `FSTAT = (BETA / SE)²` when `BETA` and `SE` are available and `SE > 0`.
- **Fallback:** `FSTAT = χ²_isf(P, df=1)` (equivalent to `Z²` for a two-sided p-value) when `BETA/SE` are unavailable but `P` is present.

## MR harmonization

MR workflows rely on aligning exposure and outcome effects to the same effect allele.

Implementation: {py:func}`genal.MR_tools.harmonize_MR`

High-level steps:

1. Merge exposure and outcome on `SNP`.
2. Align alleles:
   - If alleles are inverted, swap outcome alleles and flip the outcome effect (and outcome EAF when available).
   - If alleles are neither aligned nor inverted (and not palindromic), flip outcome alleles to their complements (A↔T, C↔G) and retry.
3. Drop SNPs with remaining allele mismatches.
4. Handle palindromes based on `action`:
   - `action=1`: keep all palindromes (no attempt to resolve strand ambiguity)
   - `action=2`: use allele frequencies to drop “ambiguous” palindromes and flip the remaining ones when exposure/outcome frequencies disagree around 0.5
   - `action=3`: drop all palindromes

## Two-sample MR estimators

Implementation: MR wrappers in {py:func}`genal.MR_tools.MR_func`, estimators in `genal/MR.py`.

For each SNP $i$, define:

- exposure association: $\beta_{e,i}$ with standard error $se_{e,i}$
- outcome association: $\beta_{o,i}$ with standard error $se_{o,i}$

The per-variant ratio estimate is:

```{math}
\hat{\theta}_i = \frac{\beta_{o,i}}{\beta_{e,i}}
```

### Method map (what `methods=[...]` runs)

`genal` accepts the following method names in {py:meth}`genal.Geno.MR`:

| `methods` entry | Estimator | Minimum instruments | Notes |
|---|---:|---:|---|
| `"IVW"` | IVW (WLS) | 2 | Under-dispersion correction in the SE; returns Cochran’s Q |
| `"IVW-RE"` | IVW (WLS) | 2 | No under-dispersion correction; returns Cochran’s Q |
| `"IVW-FE"` | IVW (WLS) | 2 | Fixed-effects-style scaling; returns Cochran’s Q |
| `"UWR"` | Unweighted regression | 2 | OLS through origin; SE uses under-dispersion correction |
| `"Egger"` | MR-Egger + intercept | 3 | WLS with intercept; returns both slope and intercept (+ Q) |
| `"Egger-boot"` | MR-Egger bootstrap + intercept | 3 | Bootstrap SE/p-values (normal sampling); returns slope + intercept |
| `"WM"` | Weighted median | 3 | Bootstrap SE |
| `"WM-pen"` | Penalised weighted median | 3 | Downweights outlier ratio estimates using `penk`; bootstrap SE |
| `"Simple-median"` | Simple median | 3 | Unweighted median of ratio estimates; bootstrap SE |
| `"Simple-mode"` | Simple mode | 3 | Mode of ratio estimates via KDE; bandwidth scaled by `phi` |
| `"Weighted-mode"` | Weighted mode | 3 | KDE weighted by inverse-variance-like weights; bandwidth scaled by `phi` |
| `"Sign"` | Sign concordance test | 6 | Binomial test of sign agreement between exposure and outcome effects |

### Common quantities used by ratio-based methods

For ratio-based estimators (median/mode), `genal` uses the delta-method standard error of the per-variant ratio:

```{math}
se(\hat{\theta}_i) =
\sqrt{
\frac{se_{o,i}^2}{\beta_{e,i}^2}
\;+\;
\frac{\beta_{o,i}^2 \, se_{e,i}^2}{\beta_{e,i}^4}
}
```

### IVW family (WLS through origin)

`genal` implements IVW as a weighted regression of $\beta_o$ on $\beta_e$ without intercept:

```{math}
\beta_{o,i} = \theta \, \beta_{e,i} + \varepsilon_i
```

with weights $w_i = 1/se_{o,i}^2$.

All IVW variants return heterogeneity diagnostics via Cochran’s Q:

```{math}
Q = \sum_i w_i \left(\beta_{o,i} - \hat{\theta}\beta_{e,i}\right)^2
```

Differences between the three IVW variants in `genal`:
- `"IVW"`: WLS estimate, with an **under-dispersion correction** applied to the standard error.
- `"IVW-RE"`: WLS estimate, with the *raw* WLS standard error (no under-dispersion correction).
- `"IVW-FE"`: WLS estimate, with an alternative residual-variance scaling of the standard error.

Practical note: if $\beta_{e,i}$ is near 0 for some SNPs, ratio-based methods become unstable; `genal` filters invalid rows after harmonization.

### MR-Egger (WLS with intercept)

Egger is implemented as a weighted regression with intercept:

```{math}
\beta_{o,i} = \alpha + \theta \, \beta_{e,i} + \varepsilon_i
```

After harmonization, `genal` orients $\beta_e$ to be non-negative and applies the same sign to $\beta_o$. The intercept $\alpha$ is returned as “Egger Intercept”.

`"Egger-boot"` returns bootstrap SE/p-values by repeatedly sampling exposure/outcome effects from normal distributions using their SEs and refitting the regression.

### Median estimators

`genal` implements:

- **Weighted median**: weighted median of $\hat{\theta}_i$ (weights derived from the delta-method variance), with bootstrap SE.
- **Penalised weighted median**: downweights variants whose ratio estimates deviate strongly from the weighted-median estimate, then recomputes the weighted median (controlled by `penk`), with bootstrap SE.
- **Simple median**: unweighted median of $\hat{\theta}_i$, with bootstrap SE.

### Mode estimators

Mode methods estimate the mode of the distribution of $\hat{\theta}_i$ using kernel density estimation (KDE):

- **Simple mode**: unweighted KDE
- **Weighted mode**: KDE weighted by inverse-variance-like weights

The bandwidth uses a modified Silverman rule multiplied by the user-provided factor `phi`. `genal` then uses bootstrap sampling (normal sampling with the per-variant ratio SE) to estimate uncertainty.

### Sign concordance test

The sign method tests whether exposure and outcome effects tend to have the same sign across variants. `genal` performs a binomial test against the null of 50% sign agreement.

## Leave-one-out MR

Implementation: {py:func}`genal.MR_tools.MR_loo_func`, wrapped by {py:meth}`genal.Geno.MR_loo`.

Leave-one-out MR iterates over all instruments, sequentially removing each SNP and re-estimating the causal effect using the remaining instruments. This identifies variants that have a disproportionate influence on the overall estimate.

For each SNP $i$ in the instrument set:

1. Remove SNP $i$ from the harmonized data.
2. Re-run the selected MR method on the remaining $J-1$ SNPs.
3. Store the resulting estimate $\hat{\theta}_{-i}$.

The "influence" of SNP $i$ is defined as:

```{math}
\text{influence}_i = \left| \hat{\theta}_{-i} - \hat{\theta}_{\text{all}} \right|
```

where $\hat{\theta}_{\text{all}}$ is the estimate using all instruments.

Notes:
- Any single MR method can be used (IVW, Egger, weighted median, mode-based, etc.).
- At least 3 instruments are required (so that each LOO subset has ≥2 instruments).

## MR-PRESSO (summary)

Implementation: {py:func}`genal.MRpresso.mr_presso`, wrapped by {py:meth}`genal.Geno.MRpresso`.

At a high level, `genal`'s MR-PRESSO implementation:

- computes an observed leave-one-out residual sum of squares (RSS),
- simulates an empirical RSS distribution by generating random data under the fitted model (`n_iterations` controls this),
- if the global p-value is significant (`global_test_p < significance_p`), performs:
  - an outlier test (per-variant p-values, Bonferroni-corrected),
  - an optional distortion test (whether the causal estimate changes materially after removing outliers).

### Distortion test

The distortion test assesses whether detected outliers materially bias the causal estimate. `genal` implements the following version:

1. **Observed distortion**: $D_\text{obs} = (\hat{\theta}_\text{all} - \hat{\theta}_\text{no outliers}) / |\hat{\theta}_\text{no outliers}|$
2. **Expected distortion null**: bootstrap resampling is performed *exclusively on the non-outlier subset*. For each iteration:
   - Sample with replacement $J-k$ SNPs from the non-outlier data (where $k$ is the number of detected outliers).
   - Fit the IVW model on the sampled data and record $\hat{\theta}_\text{exp}$.
   - Compute $D_\text{exp} = (\hat{\theta}_\text{all} - \hat{\theta}_\text{exp}) / |\hat{\theta}_\text{exp}|$.
3. **P-value**: $p = \text{mean}(|D_\text{exp}| > |D_\text{obs}|)$.

This differs from the original MR-PRESSO R implementation, which in some cases samples from the full dataset (including outliers) for the expected-bias regressions and was inconsistent with the paper's description.

### Output structure

`Geno.MRpresso()` returns four objects:
- `mod_table`: a small results table (`Raw` and `Outlier-corrected` rows; IVW model),
- `GlobalTest`: RSS and global p-value,
- `OutlierTest`: per-variant outlier p-values (empty if the global test is not significant); SNP IDs as row labels,
- `BiasTest`: distortion test result dictionary containing `"outliers_indices"` (SNP IDs), `"distortion_test_coefficient"`, and `"distortion_test_p"` (empty if distortion test was not run).

If outliers are found, `genal` stores the outlier-removed harmonized table and allows rerunning MR with `Geno.MR(use_mrpresso_data=True)`.

## Colocalization (ABF)

Implementation: {py:func}`genal.colocalization.coloc_abf_func`.

For each trait and each variant, `genal` computes an approximate Bayes factor (ABF) from regression estimates. Let:

```{math}
z = \frac{\beta}{\sqrt{\mathrm{var}(\beta)}}
```

and let the prior standard deviation be:

- quantitative trait: $sd_\mathrm{prior} = 0.15 \cdot sd_Y$
- case-control trait: $sd_\mathrm{prior} = 0.2$

Then, with $r = \frac{sd_\mathrm{prior}^2}{sd_\mathrm{prior}^2 + \mathrm{var}(\beta)}$, the log-ABF is:

```{math}
\log(\mathrm{ABF}) = \frac{1}{2}\left(\log(1-r) + r z^2\right)
```

`genal` combines per-trait ABFs to compute posterior probabilities for hypotheses $H_0$–$H_4$ (no association, trait1 only, trait2 only, both traits but different variants, both traits and shared variant) using priors `p1`, `p2`, `p12`.

Practical note: colocalization is usually done on a **locus/region**, not genome-wide, and requires careful choice of priors and trait type (`quant` vs `cc`).