Tutorial (detailed)

The README.md shows a compact “GWAS → instruments → PRS → MR” example. This page expands that workflow into a tutorial, explaining why each step exists and which parameters you typically tune.

If you haven’t configured PLINK/reference data yet, start with Setup.

0) Mental model: what a `Geno` instance represents

genal.Geno wraps a SNP-level table in G.data. You typically:

load and standardize GWAS summary statistics into a Geno,
derive an instrument set (clumping),
run downstream analyses (PRS, MR, MR-PRESSO, colocalization, …).

During a workflow, the object evolves in two main ways:

the working table G.data may be transformed (sometimes into a new Geno, e.g. clumping),
results are attached as attributes (e.g. G.MR_data after query_outcome(), G.MR_results after MR()).

Many analysis methods both store results on the object and return a convenient summary object (for example, MR() returns a results table and also populates G.MR_results).

See Core concepts for required columns and a concise “what each method returns / modifies” cheat sheet.

1) Create a `Geno` from GWAS summary statistics

You map your input columns to genal’s standard names at initialization:

import pandas as pd
import genal

exposure_df = pd.read_csv("exposure_gwas.tsv", sep="\t")
G_exposure = genal.Geno(
    exposure_df,
    CHR="CHR", POS="POS", SNP="SNP",
    EA="EA", NEA="NEA",
    BETA="BETA", SE="SE", P="P", EAF="EAF",
    keep_columns=False,  # keep only standard columns after renaming
)

Practical notes:

If your GWAS uses odds ratios, pass effect_column="OR" later in preprocess_data() (see below).
If you have either SNP or CHR+POS (but not both), preprocess_data() can often fill the missing identifier using a build-only reference.

2) Preprocess and standardize the table

Run genal.Geno.preprocess_data() early. It validates types/values and can fill missing columns using a build-only reference (reference_panel="37" or "38").

G_exposure.preprocess_data(preprocessing="Fill_delete", reference_panel="37")

Key arguments you commonly tune:

preprocessing:
- "Fill" keeps rows and sets invalid values to NaN (useful for exploratory work).
- "Fill_delete" drops invalid/duplicated/NA rows (recommended before downstream genetics methods).
reference_panel="37" vs "38" should match your GWAS build.
effect_column="OR" forces log-transforming odds ratios into betas and adjusts SE accordingly.
fill_snpids / fill_coordinates: override the default logic if you want to force filling rsIDs from CHR+POS or vice-versa.
keep_indel / keep_dups: keep indels or duplicated IDs (generally you keep these False unless you have a reason).
fill_f=True: force recomputation of the F-statistic (FSTAT) column even if it already exists. By default, FSTAT is created if missing or only missing values are filled.

3) Select independent instruments via LD clumping

genal.Geno.clump() runs PLINK 2 clumping and returns a new Geno containing the clumped instruments.

G_instruments = G_exposure.clump(
    p1=5e-8,       # significance threshold for index variants
    r2=0.01,       # LD threshold
    kb=10_000,     # window size in kb
    p2=0.01,       # secondary threshold
    reference_panel="EUR_37",  # LD reference panel (not the same as build-only reference)
)

Notes:

reference_panel here is an LD reference panel (EUR_37, AFR_38, …) or a path to custom PLINK files.
If no variants remain after clumping, clump() returns None.

4) Compute a PRS in a target cohort (individual-level genotypes)

genal.Geno.prs() writes a CSV and stores intermediate PLINK outputs under tmp_GENAL/.

G_instruments.prs(
    name="prs_sbp",
    path="my_genotypes_chr$",  # bed/bim/fam or pgen/pvar/psam; use '$' if split by chr
    weighted=True,
)

Tunable options and common pitfalls:

Variant matching:
- If CHR+POS are available in G_instruments.data, genal prefers coordinate matching and will call update_snpids() internally to reduce ID mismatches.
- If you only have rsIDs, ensure your cohort uses compatible IDs (or add coordinates and rerun preprocessing).
proxy=True enables proxy SNP search when variants are missing in your cohort:
- r2, kb, window_snps, and the LD reference_panel control proxy search scope/strictness.
name can be a path prefix; the output is <name>.csv.

Example with proxies:

G_instruments.prs(
    name="prs_sbp_proxy",
    path="my_genotypes_chr$",
    proxy=True,
    reference_panel="EUR_37",
    r2=0.8,
    kb=5000,
    window_snps=1_000_000,
)

5) Two-sample MR: query an outcome GWAS and run MR

5.1) Load and preprocess the outcome GWAS

outcome_df = pd.read_csv("outcome_gwas.tsv", sep="\t")
G_outcome = genal.Geno(
    outcome_df,
    CHR="CHR", POS="POS", SNP="SNP",
    EA="EA", NEA="NEA",
    BETA="BETA", SE="SE", P="P", EAF="EAF",
    keep_columns=False,
)
G_outcome.preprocess_data(preprocessing="Fill_delete", reference_panel="37")

5.2) Query the outcome at the instrument variants

genal.Geno.query_outcome() extracts exposure SNPs from an outcome dataset and stores (exposure_df, outcome_df, outcome_name) in G_instruments.MR_data.

G_instruments.query_outcome(
    G_outcome,
    name="stroke",
    proxy=True,
    reference_panel="EUR_37",
)

Tunable options:

proxy=True searches proxies when SNPs are missing from the outcome.
r2, kb, window_snps control proxy search scope/strictness.

5.3) Run MR

genal.Geno.MR() harmonizes alleles and runs the requested estimators:

res = G_instruments.MR(
    methods=["IVW", "IVW-FE", "WM", "Simple-mode", "Egger"],
    action=2,
    heterogeneity=True,
    odds=False,
    exposure_name="SBP",
    outcome_name="Stroke",
)

Key arguments you commonly tune:

methods: choose estimators/sensitivity analyses; see Methods (implementation notes) for implementation notes.
Palindromic SNP handling (action):
- 1: keep palindromes (assumes alleles already aligned)
- 2: frequency-based flipping using EAF (conservative default; requires EAF in both exposure/outcome)
- 3: drop all palindromes
heterogeneity=True adds Cochran’s Q diagnostics for methods that support it.
odds=True adds exponentiated effect summaries (odds ratios + 95% CI strings) when appropriate.

5.4) Plot MR results

After MR(), you can generate a scatter plot with method lines:

G_instruments.MR_plot(filename="mr_scatter", figure_size=(10, 6))  # saves mr_scatter.png

You can also draw a funnel plot of single-SNP ratio estimates (Wald ratios):

G_instruments.MR_funnel(
    methods=["IVW", "WM", "Egger"],  # vertical reference lines (optional)
    filename="mr_funnel",
    figure_size=(10, 6),
)

6a) MR-PRESSO (outlier detection and distortion testing)

genal.Geno.MRpresso() runs a parallel MR-PRESSO implementation.

mod_table, GlobalTest, OutlierTest, BiasTest = G_instruments.MRpresso(
    action=2,
    n_iterations=10_000,
    significance_p=0.05,
    cpus=-1,
)

What you typically tune:

n_iterations: more iterations → more stable p-values but slower.
significance_p: threshold for global/outlier tests.
outlier_test / distortion_test: disable if you only want the global test.

Output structure:

mod_table: DataFrame with two rows ("Raw" and "Outlier-corrected") containing columns method, nSNP, b, se, pval for the IVW model before and after outlier removal.
GlobalTest: dictionary with "RSSobs" (observed residual sum of squares) and "global_test_p" (p-value for the global pleiotropy test).
OutlierTest: DataFrame with per-SNP outlier p-values (RSSobs, Pvalue); SNP identifiers (rsIDs) are used as row labels. Empty if the global test is not significant.
BiasTest: dictionary containing "outliers_indices" (list of SNP IDs), "distortion_test_coefficient" (percent), and "distortion_test_p".

If outliers are found, you can rerun MR using the outlier-removed subset:

res_no_outliers = G_instruments.MR(use_mrpresso_data=True)

To highlight MR-PRESSO outliers on plots, pass use_mrpresso_data=True (outliers are colored in red and shown in the legend):

G_instruments.MR_plot(filename="mr_scatter_mrpresso", figure_size=(10, 6), use_mrpresso_data=True)
G_instruments.MR_funnel(filename="mr_funnel_mrpresso", figure_size=(10, 6), use_mrpresso_data=True)
G_instruments.MR_loo_plot(filename="loo_forest_mrpresso", figure_size=(10, 8), use_mrpresso_data=True)

See Methods (implementation notes) for algorithm details and outputs.

6b) Leave-one-out MR (sensitivity analysis)

genal.Geno.MR_loo() iteratively removes each SNP and re-estimates the causal effect. This helps identify influential variants that may be driving the overall result.

loo_df = G_instruments.MR_loo(
    method="IVW",        # any single MR method key (see MR method map)
    action=2,
    heterogeneity=False, # set True to include Q statistics
    odds=False,          # set True for OR-scale output
)

Key arguments:

method: a single MR method key (e.g., "IVW", "Egger", "WM"); must not be "all".
use_mrpresso_data=True: use the outlier-removed dataset from MR-PRESSO instead of all instruments.

Visualizing leave-one-out results

genal.Geno.MR_loo_plot() creates a forest plot from the stored MR_loo_results:

# Default: show top influential instruments
G_instruments.MR_loo_plot(filename="loo_forest", figure_size=(10, 8))

# Or paginate all instruments
G_instruments.MR_loo_plot(
    top_influential=False,  # show all, not just influential
    snps_per_page=30,
    page=1,                 # or None for all pages
    filename="loo_forest_all",
    figure_size=(10, 12),
)

Key arguments:

top_influential=True (default): select the snps_per_page most influential SNPs (largest change in estimate when removed) and render a single compact figure.
top_influential=False: paginate all instruments; use page=N to select a specific page or page=None to render all pages.
snps_per_page: number of SNPs per page (minimum 5).
use_mrpresso_data=True: color MR-PRESSO outliers in red (requires MRpresso() first). When outliers exist, an extra summary row (“MR-PRESSO corrected”) is added using the same MR method as the leave-one-out analysis.
methods=["WM", "Egger"]: add extra overall estimates for the requested methods (computed on all instruments).

7) Additional capabilities (beyond the core pipeline)

Single-SNP association tests (individual-level data)

Use this when you have individual-level genotypes (PLINK files) and a phenotype table for the same cohort, and you want to estimate SNP–phenotype associations in that cohort (linear regression for quantitative traits, logistic regression for binary traits via PLINK 2 --glm).

Attach phenotype data to the object:

G_instruments.set_phenotype(
    pheno_df,
    PHENO="sbp",
    PHENO_type="quant",  # or "binary"; if omitted, genal tries to infer it
    IID="IID",
    FID="FID",           # optional; defaults to IID
)

For binary phenotypes, set_phenotype() recodes the two observed values to 0/1 (control/case) using the most frequent value as the control code by default (use alternate_control=True to invert this interpretation).

Run association tests (updates G.data in place):

G_instruments.association_test(
    path="my_genotypes_chr$",  # bed/bim/fam or pgen/pvar/psam; use '$' if split by chr
    covar=["age", "sex"],      # used to specify covariates to adjust for
    standardize=True,          # standardize quantitative phenotypes before regression
)

What you typically tune / watch:

covar: covariate names (must be present in pheno_df and numeric; constant covariates are dropped).
standardize=True: for quantitative traits; set False if you want raw-scale effects.
Variant matching: if CHR+POS are present in G.data, genal will map to cohort SNP IDs before running PLINK (reduces ID mismatch losses).
After updating BETA, SE, and P, the F-statistic (FSTAT) column is automatically recomputed to remain consistent with the updated estimates.

Liftover between builds

genal.Geno.lift() lifts CHR/POS coordinates across genome builds and returns a DataFrame (it does not return a new Geno). The appropriate chain file is downloaded automatically if used for the first time.

lifted_df = G_exposure.lift(start="hg19", end="hg38", replace=False)  # G_exposure.data unchanged

Key arguments:

replace=True|False: whether to overwrite G.data (return value is still a DataFrame either way).
chain_file=...: use a local UCSC chain file (useful for offline/reproducible builds).
liftover_path=...: use UCSC liftOver executable (faster than Python liftover for large datasets).
name=... / extraction_file=True: optionally write the lifted table and/or an extraction file.

If you want a lifted Geno object (to keep chaining methods), use:

G_lifted = G_exposure.copy(lifted_df)

Update allele frequencies (EAF) from a reference panel

Allele frequencies are useful for MR harmonization with action=2 (frequency-based palindrome handling) and as a sanity-check for allele alignment.

G_exposure.update_eaf(reference_panel="EUR_37", replace=True, fill=True)

Key arguments:

reference_panel: an LD reference panel name (e.g. "EUR_37", "AFR_38") or a custom PLINK fileset (no extension).
replace=True|False: whether to overwrite G.data (return value is always a DataFrame).
fill=True|False: keep existing EAF values for variants not found in the reference (True), or set them to NaN (False).

Requirements / behavior:

Requires EA and either SNP or CHR+POS.
Uses PLINK --freq and aligns the returned frequency to the effect allele (EA). If alleles do not match the reference panel, the updated EAF is set to NaN (or left unchanged if fill=True).

Colocalization (ABF)

Colocalization is typically run on a locus, not genome-wide. A common pattern is to subset both datasets first:

G_exp_region = G_exposure.filter_by_gene(gene="APOE", window_size=500_000, build="37", replace=False)
G_out_region = G_outcome.filter_by_gene(gene="APOE", window_size=500_000, build="37", replace=False)

coloc = G_exp_region.colocalize(
    G_out_region,
    trait1_type="quant",
    trait2_type="cc",
    n1=700_000,
    n2=40_000,
)

Key arguments:

trait1_type / trait2_type: "quant" or "cc".
For quantitative traits: provide sdY1/sdY2, or provide n1/n2 and EAF so genal can estimate \(sd_Y\); otherwise genal falls back to \(sd_Y = 1\).
p1, p2, p12: priors for the ABF model (see Methods (implementation notes)).
Matching: by default, genal attempts to merge on CHR+POS when available; set merge_on_snp=True to force merging by rsID (SNP).

Output:

A dictionary containing nsnps and posterior probabilities (PP.H0.abf … PP.H4.abf).

GWAS Catalog annotation

genal.Geno.query_gwas_catalog() annotates variants with traits from the GWAS Catalog REST API (network-bound; can be slow for large SNP sets).

annotated_df = G_exposure.query_gwas_catalog(
    p_threshold=5e-8,
    return_p=True,
    return_study=True,
    max_associations=5,
    replace=False,
)

Key arguments:

p_threshold: only report associations with \(p \le p_\mathrm{threshold}\).
max_associations: cap the number of reported associations per SNP (keeps the output readable).
timeout: per-request timeout; increase for large SNP sets.
replace=True|False: whether to store the resulting ASSOC column in G.data.

The returned table contains an ASSOC column with lists of associations; SNPs that cannot be queried are labeled "FAILED_QUERY" and long requests may be labeled "TIMEOUT".

Gene-window filtering

apoe_region = G_exposure.filter_by_gene(
    gene="APOE",
    id_type="symbol",
    window_size=500_000,
    build="37",
    replace=False,
)

Key arguments:

gene: gene identifier (e.g. "APOE").
id_type: identifier type ("symbol", "HGNC", "Ensembl", "NCBI", "UCSC", "Vega", or "name").
window_size: total window size in base pairs (± window_size/2 around the gene).
build: "37" or "38" (must match your CHR/POS coordinates).
replace=False|True: return a new Geno (default) or filter this object in place.

Notes:

Requires CHR and POS.
On first use, genal may download a small gene-mapping parquet file into your configured reference folder (network required).

Save intermediate datasets

Use genal.Geno.save() to persist G.data to disk for reuse.

G_instruments.name = "my_instruments"  # controls the filename prefix
G_instruments.save(fmt="h5")           # writes my_instruments.h5 (key="data")

Notes:

save() writes the file using G.name; set it explicitly if you want stable filenames.
Supported formats are fmt="h5" (default), "parquet", "csv", and "txt".
You can later load with pd.read_hdf("my_instruments.h5", key="data") or pd.read_parquet("my_instruments.parquet"), or pass the .h5/.hdf5/.parquet path directly to query_outcome().

Tutorial (detailed)

0) Mental model: what a Geno instance represents

1) Create a Geno from GWAS summary statistics

2) Preprocess and standardize the table

3) Select independent instruments via LD clumping

4) Compute a PRS in a target cohort (individual-level genotypes)

5) Two-sample MR: query an outcome GWAS and run MR

5.1) Load and preprocess the outcome GWAS

5.2) Query the outcome at the instrument variants

5.3) Run MR

5.4) Plot MR results

6a) MR-PRESSO (outlier detection and distortion testing)

6b) Leave-one-out MR (sensitivity analysis)

Visualizing leave-one-out results

7) Additional capabilities (beyond the core pipeline)

Single-SNP association tests (individual-level data)

Liftover between builds

Update allele frequencies (EAF) from a reference panel

Colocalization (ABF)

GWAS Catalog annotation

Gene-window filtering

Save intermediate datasets

0) Mental model: what a `Geno` instance represents

1) Create a `Geno` from GWAS summary statistics