Tutorial (detailed)
The README.md shows a compact “GWAS → instruments → PRS → MR” example. This page expands that workflow into a tutorial, explaining why each step exists and which parameters you typically tune.
If you haven’t configured PLINK/reference data yet, start with Setup.
0) Mental model: what a Geno instance represents
genal.Geno wraps a SNP-level table in G.data. You typically:
load and standardize GWAS summary statistics into a
Geno,derive an instrument set (clumping),
run downstream analyses (PRS, MR, MR-PRESSO, colocalization, …).
During a workflow, the object evolves in two main ways:
the working table
G.datamay be transformed (sometimes into a newGeno, e.g. clumping),results are attached as attributes (e.g.
G.MR_dataafterquery_outcome(),G.MR_resultsafterMR()).
Many analysis methods both store results on the object and return a convenient summary object (for example, MR() returns a results table and also populates G.MR_results).
See Core concepts for required columns and a concise “what each method returns / modifies” cheat sheet.
1) Create a Geno from GWAS summary statistics
You map your input columns to genal’s standard names at initialization:
import pandas as pd
import genal
exposure_df = pd.read_csv("exposure_gwas.tsv", sep="\t")
G_exposure = genal.Geno(
exposure_df,
CHR="CHR", POS="POS", SNP="SNP",
EA="EA", NEA="NEA",
BETA="BETA", SE="SE", P="P", EAF="EAF",
keep_columns=False, # keep only standard columns after renaming
)
Practical notes:
If your GWAS uses odds ratios, pass
effect_column="OR"later inpreprocess_data()(see below).If you have either
SNPorCHR+POS(but not both),preprocess_data()can often fill the missing identifier using a build-only reference.
2) Preprocess and standardize the table
Run genal.Geno.preprocess_data() early. It validates types/values and can fill missing columns using a build-only reference (reference_panel="37" or "38").
G_exposure.preprocess_data(preprocessing="Fill_delete", reference_panel="37")
Key arguments you commonly tune:
preprocessing:"Fill"keeps rows and sets invalid values toNaN(useful for exploratory work)."Fill_delete"drops invalid/duplicated/NA rows (recommended before downstream genetics methods).
reference_panel="37"vs"38"should match your GWAS build.effect_column="OR"forces log-transforming odds ratios into betas and adjusts SE accordingly.fill_snpids/fill_coordinates: override the default logic if you want to force filling rsIDs fromCHR+POSor vice-versa.keep_indel/keep_dups: keep indels or duplicated IDs (generally you keep theseFalseunless you have a reason).fill_f=True: force recomputation of the F-statistic (FSTAT) column even if it already exists. By default, FSTAT is created if missing or only missing values are filled.
3) Select independent instruments via LD clumping
genal.Geno.clump() runs PLINK 2 clumping and returns a new Geno containing the clumped instruments.
G_instruments = G_exposure.clump(
p1=5e-8, # significance threshold for index variants
r2=0.01, # LD threshold
kb=10_000, # window size in kb
p2=0.01, # secondary threshold
reference_panel="EUR_37", # LD reference panel (not the same as build-only reference)
)
Notes:
reference_panelhere is an LD reference panel (EUR_37,AFR_38, …) or a path to custom PLINK files.If no variants remain after clumping,
clump()returnsNone.
4) Compute a PRS in a target cohort (individual-level genotypes)
genal.Geno.prs() writes a CSV and stores intermediate PLINK outputs under tmp_GENAL/.
G_instruments.prs(
name="prs_sbp",
path="my_genotypes_chr$", # bed/bim/fam or pgen/pvar/psam; use '$' if split by chr
weighted=True,
)
Tunable options and common pitfalls:
Variant matching:
If
CHR+POSare available inG_instruments.data, genal prefers coordinate matching and will callupdate_snpids()internally to reduce ID mismatches.If you only have rsIDs, ensure your cohort uses compatible IDs (or add coordinates and rerun preprocessing).
proxy=Trueenables proxy SNP search when variants are missing in your cohort:r2,kb,window_snps, and the LDreference_panelcontrol proxy search scope/strictness.
namecan be a path prefix; the output is<name>.csv.
Example with proxies:
G_instruments.prs(
name="prs_sbp_proxy",
path="my_genotypes_chr$",
proxy=True,
reference_panel="EUR_37",
r2=0.8,
kb=5000,
window_snps=1_000_000,
)
5) Two-sample MR: query an outcome GWAS and run MR
5.1) Load and preprocess the outcome GWAS
outcome_df = pd.read_csv("outcome_gwas.tsv", sep="\t")
G_outcome = genal.Geno(
outcome_df,
CHR="CHR", POS="POS", SNP="SNP",
EA="EA", NEA="NEA",
BETA="BETA", SE="SE", P="P", EAF="EAF",
keep_columns=False,
)
G_outcome.preprocess_data(preprocessing="Fill_delete", reference_panel="37")
5.2) Query the outcome at the instrument variants
genal.Geno.query_outcome() extracts exposure SNPs from an outcome dataset and stores (exposure_df, outcome_df, outcome_name) in G_instruments.MR_data.
G_instruments.query_outcome(
G_outcome,
name="stroke",
proxy=True,
reference_panel="EUR_37",
)
Tunable options:
proxy=Truesearches proxies when SNPs are missing from the outcome.r2,kb,window_snpscontrol proxy search scope/strictness.
5.3) Run MR
genal.Geno.MR() harmonizes alleles and runs the requested estimators:
res = G_instruments.MR(
methods=["IVW", "IVW-FE", "WM", "Simple-mode", "Egger"],
action=2,
heterogeneity=True,
odds=False,
exposure_name="SBP",
outcome_name="Stroke",
)
Key arguments you commonly tune:
methods: choose estimators/sensitivity analyses; see Methods (implementation notes) for implementation notes.Palindromic SNP handling (
action):1: keep palindromes (assumes alleles already aligned)2: frequency-based flipping usingEAF(conservative default; requiresEAFin both exposure/outcome)3: drop all palindromes
heterogeneity=Trueadds Cochran’s Q diagnostics for methods that support it.odds=Trueadds exponentiated effect summaries (odds ratios + 95% CI strings) when appropriate.
5.4) Plot MR results
After MR(), you can generate a scatter plot with method lines:
G_instruments.MR_plot(filename="mr_scatter", figure_size=(10, 6)) # saves mr_scatter.png
You can also draw a funnel plot of single-SNP ratio estimates (Wald ratios):
G_instruments.MR_funnel(
methods=["IVW", "WM", "Egger"], # vertical reference lines (optional)
filename="mr_funnel",
figure_size=(10, 6),
)
6a) MR-PRESSO (outlier detection and distortion testing)
genal.Geno.MRpresso() runs a parallel MR-PRESSO implementation.
mod_table, GlobalTest, OutlierTest, BiasTest = G_instruments.MRpresso(
action=2,
n_iterations=10_000,
significance_p=0.05,
cpus=-1,
)
What you typically tune:
n_iterations: more iterations → more stable p-values but slower.significance_p: threshold for global/outlier tests.outlier_test/distortion_test: disable if you only want the global test.
Output structure:
mod_table: DataFrame with two rows ("Raw"and"Outlier-corrected") containing columnsmethod,nSNP,b,se,pvalfor the IVW model before and after outlier removal.GlobalTest: dictionary with"RSSobs"(observed residual sum of squares) and"global_test_p"(p-value for the global pleiotropy test).OutlierTest: DataFrame with per-SNP outlier p-values (RSSobs,Pvalue); SNP identifiers (rsIDs) are used as row labels. Empty if the global test is not significant.BiasTest: dictionary containing"outliers_indices"(list of SNP IDs),"distortion_test_coefficient"(percent), and"distortion_test_p".
If outliers are found, you can rerun MR using the outlier-removed subset:
res_no_outliers = G_instruments.MR(use_mrpresso_data=True)
To highlight MR-PRESSO outliers on plots, pass use_mrpresso_data=True (outliers are colored in red and shown in the legend):
G_instruments.MR_plot(filename="mr_scatter_mrpresso", figure_size=(10, 6), use_mrpresso_data=True)
G_instruments.MR_funnel(filename="mr_funnel_mrpresso", figure_size=(10, 6), use_mrpresso_data=True)
G_instruments.MR_loo_plot(filename="loo_forest_mrpresso", figure_size=(10, 8), use_mrpresso_data=True)
See Methods (implementation notes) for algorithm details and outputs.
6b) Leave-one-out MR (sensitivity analysis)
genal.Geno.MR_loo() iteratively removes each SNP and re-estimates the causal effect. This helps identify influential variants that may be driving the overall result.
loo_df = G_instruments.MR_loo(
method="IVW", # any single MR method key (see MR method map)
action=2,
heterogeneity=False, # set True to include Q statistics
odds=False, # set True for OR-scale output
)
Key arguments:
method: a single MR method key (e.g.,"IVW","Egger","WM"); must not be"all".use_mrpresso_data=True: use the outlier-removed dataset from MR-PRESSO instead of all instruments.
Visualizing leave-one-out results
genal.Geno.MR_loo_plot() creates a forest plot from the stored MR_loo_results:
# Default: show top influential instruments
G_instruments.MR_loo_plot(filename="loo_forest", figure_size=(10, 8))
# Or paginate all instruments
G_instruments.MR_loo_plot(
top_influential=False, # show all, not just influential
snps_per_page=30,
page=1, # or None for all pages
filename="loo_forest_all",
figure_size=(10, 12),
)
Key arguments:
top_influential=True(default): select thesnps_per_pagemost influential SNPs (largest change in estimate when removed) and render a single compact figure.top_influential=False: paginate all instruments; usepage=Nto select a specific page orpage=Noneto render all pages.snps_per_page: number of SNPs per page (minimum 5).use_mrpresso_data=True: color MR-PRESSO outliers in red (requiresMRpresso()first). When outliers exist, an extra summary row (“MR-PRESSO corrected”) is added using the same MR method as the leave-one-out analysis.methods=["WM", "Egger"]: add extra overall estimates for the requested methods (computed on all instruments).
7) Additional capabilities (beyond the core pipeline)
Single-SNP association tests (individual-level data)
Use this when you have individual-level genotypes (PLINK files) and a phenotype table for the same cohort, and you want to estimate SNP–phenotype associations in that cohort (linear regression for quantitative traits, logistic regression for binary traits via PLINK 2 --glm).
Attach phenotype data to the object:
G_instruments.set_phenotype(
pheno_df,
PHENO="sbp",
PHENO_type="quant", # or "binary"; if omitted, genal tries to infer it
IID="IID",
FID="FID", # optional; defaults to IID
)
For binary phenotypes, set_phenotype() recodes the two observed values to 0/1 (control/case) using the most frequent value as the control code by default (use alternate_control=True to invert this interpretation).
Run association tests (updates
G.datain place):
G_instruments.association_test(
path="my_genotypes_chr$", # bed/bim/fam or pgen/pvar/psam; use '$' if split by chr
covar=["age", "sex"], # used to specify covariates to adjust for
standardize=True, # standardize quantitative phenotypes before regression
)
What you typically tune / watch:
covar: covariate names (must be present inpheno_dfand numeric; constant covariates are dropped).standardize=True: for quantitative traits; setFalseif you want raw-scale effects.Variant matching: if
CHR+POSare present inG.data, genal will map to cohort SNP IDs before running PLINK (reduces ID mismatch losses).After updating
BETA,SE, andP, the F-statistic (FSTAT) column is automatically recomputed to remain consistent with the updated estimates.
Liftover between builds
genal.Geno.lift() lifts CHR/POS coordinates across genome builds and returns a DataFrame (it does not return a new Geno).
The appropriate chain file is downloaded automatically if used for the first time.
lifted_df = G_exposure.lift(start="hg19", end="hg38", replace=False) # G_exposure.data unchanged
Key arguments:
replace=True|False: whether to overwriteG.data(return value is still a DataFrame either way).chain_file=...: use a local UCSC chain file (useful for offline/reproducible builds).liftover_path=...: use UCSCliftOverexecutable (faster than Python liftover for large datasets).name=.../extraction_file=True: optionally write the lifted table and/or an extraction file.
If you want a lifted Geno object (to keep chaining methods), use:
G_lifted = G_exposure.copy(lifted_df)
Update allele frequencies (EAF) from a reference panel
Allele frequencies are useful for MR harmonization with action=2 (frequency-based palindrome handling) and as a sanity-check for allele alignment.
G_exposure.update_eaf(reference_panel="EUR_37", replace=True, fill=True)
Key arguments:
reference_panel: an LD reference panel name (e.g."EUR_37","AFR_38") or a custom PLINK fileset (no extension).replace=True|False: whether to overwriteG.data(return value is always a DataFrame).fill=True|False: keep existingEAFvalues for variants not found in the reference (True), or set them toNaN(False).
Requirements / behavior:
Requires
EAand eitherSNPorCHR+POS.Uses PLINK
--freqand aligns the returned frequency to the effect allele (EA). If alleles do not match the reference panel, the updated EAF is set toNaN(or left unchanged iffill=True).
Colocalization (ABF)
Colocalization is typically run on a locus, not genome-wide. A common pattern is to subset both datasets first:
G_exp_region = G_exposure.filter_by_gene(gene="APOE", window_size=500_000, build="37", replace=False)
G_out_region = G_outcome.filter_by_gene(gene="APOE", window_size=500_000, build="37", replace=False)
coloc = G_exp_region.colocalize(
G_out_region,
trait1_type="quant",
trait2_type="cc",
n1=700_000,
n2=40_000,
)
Key arguments:
trait1_type/trait2_type:"quant"or"cc".For quantitative traits: provide
sdY1/sdY2, or providen1/n2andEAFso genal can estimate \(sd_Y\); otherwise genal falls back to \(sd_Y = 1\).p1,p2,p12: priors for the ABF model (see Methods (implementation notes)).Matching: by default, genal attempts to merge on
CHR+POSwhen available; setmerge_on_snp=Trueto force merging by rsID (SNP).
Output:
A dictionary containing
nsnpsand posterior probabilities (PP.H0.abf…PP.H4.abf).
GWAS Catalog annotation
genal.Geno.query_gwas_catalog() annotates variants with traits from the GWAS Catalog REST API (network-bound; can be slow for large SNP sets).
annotated_df = G_exposure.query_gwas_catalog(
p_threshold=5e-8,
return_p=True,
return_study=True,
max_associations=5,
replace=False,
)
Key arguments:
p_threshold: only report associations with \(p \le p_\mathrm{threshold}\).max_associations: cap the number of reported associations per SNP (keeps the output readable).timeout: per-request timeout; increase for large SNP sets.replace=True|False: whether to store the resultingASSOCcolumn inG.data.
The returned table contains an ASSOC column with lists of associations; SNPs that cannot be queried are labeled "FAILED_QUERY" and long requests may be labeled "TIMEOUT".
Gene-window filtering
apoe_region = G_exposure.filter_by_gene(
gene="APOE",
id_type="symbol",
window_size=500_000,
build="37",
replace=False,
)
Key arguments:
gene: gene identifier (e.g."APOE").id_type: identifier type ("symbol","HGNC","Ensembl","NCBI","UCSC","Vega", or"name").window_size: total window size in base pairs (±window_size/2around the gene).build:"37"or"38"(must match yourCHR/POScoordinates).replace=False|True: return a newGeno(default) or filter this object in place.
Notes:
Requires
CHRandPOS.On first use, genal may download a small gene-mapping parquet file into your configured reference folder (network required).
Save intermediate datasets
Use genal.Geno.save() to persist G.data to disk for reuse.
G_instruments.name = "my_instruments" # controls the filename prefix
G_instruments.save(fmt="h5") # writes my_instruments.h5 (key="data")
Notes:
save()writes the file usingG.name; set it explicitly if you want stable filenames.Supported formats are
fmt="h5"(default),"parquet","csv", and"txt".You can later load with
pd.read_hdf("my_instruments.h5", key="data")orpd.read_parquet("my_instruments.parquet"), or pass the.h5/.hdf5/.parquetpath directly toquery_outcome().