# Tutorial (detailed) The `README.md` shows a compact “GWAS → instruments → PRS → MR” example. This page expands that workflow into a **tutorial**, explaining *why* each step exists and which parameters you typically tune. If you haven't configured PLINK/reference data yet, start with {doc}`setup`. ## 0) Mental model: what a `Geno` instance represents {py:class}`genal.Geno` wraps a SNP-level table in `G.data`. You typically: 1. load and standardize GWAS summary statistics into a `Geno`, 2. derive an instrument set (clumping), 3. run downstream analyses (PRS, MR, MR-PRESSO, colocalization, …). During a workflow, the object evolves in two main ways: - the working table `G.data` may be transformed (sometimes into a new `Geno`, e.g. clumping), - results are attached as attributes (e.g. `G.MR_data` after `query_outcome()`, `G.MR_results` after `MR()`). Many analysis methods both **store** results on the object and **return** a convenient summary object (for example, `MR()` returns a results table and also populates `G.MR_results`). See {doc}`concepts` for required columns and a concise “what each method returns / modifies” cheat sheet. ## 1) Create a `Geno` from GWAS summary statistics You map your input columns to `genal`'s standard names at initialization: ```python import pandas as pd import genal exposure_df = pd.read_csv("exposure_gwas.tsv", sep="\t") G_exposure = genal.Geno( exposure_df, CHR="CHR", POS="POS", SNP="SNP", EA="EA", NEA="NEA", BETA="BETA", SE="SE", P="P", EAF="EAF", keep_columns=False, # keep only standard columns after renaming ) ``` Practical notes: - If your GWAS uses odds ratios, pass `effect_column="OR"` later in `preprocess_data()` (see below). - If you have either `SNP` *or* `CHR+POS` (but not both), `preprocess_data()` can often fill the missing identifier using a build-only reference. ## 2) Preprocess and standardize the table Run {py:meth}`genal.Geno.preprocess_data` early. It validates types/values and can fill missing columns using a build-only reference (`reference_panel="37"` or `"38"`). ```python G_exposure.preprocess_data(preprocessing="Fill_delete", reference_panel="37") ``` Key arguments you commonly tune: - `preprocessing`: - `"Fill"` keeps rows and sets invalid values to `NaN` (useful for exploratory work). - `"Fill_delete"` drops invalid/duplicated/NA rows (recommended before downstream genetics methods). - `reference_panel="37"` vs `"38"` should match your GWAS build. - `effect_column="OR"` forces log-transforming odds ratios into betas and adjusts SE accordingly. - `fill_snpids` / `fill_coordinates`: override the default logic if you want to force filling rsIDs from `CHR+POS` or vice-versa. - `keep_indel` / `keep_dups`: keep indels or duplicated IDs (generally you keep these `False` unless you have a reason). - `fill_f=True`: force recomputation of the F-statistic (`FSTAT`) column even if it already exists. By default, FSTAT is created if missing or only missing values are filled. ## 3) Select independent instruments via LD clumping {py:meth}`genal.Geno.clump` runs PLINK 2 clumping and returns a **new** `Geno` containing the clumped instruments. ```python G_instruments = G_exposure.clump( p1=5e-8, # significance threshold for index variants r2=0.01, # LD threshold kb=10_000, # window size in kb p2=0.01, # secondary threshold reference_panel="EUR_37", # LD reference panel (not the same as build-only reference) ) ``` Notes: - `reference_panel` here is an **LD reference panel** (`EUR_37`, `AFR_38`, …) or a path to custom PLINK files. - If no variants remain after clumping, `clump()` returns `None`. ## 4) Compute a PRS in a target cohort (individual-level genotypes) {py:meth}`genal.Geno.prs` writes a CSV and stores intermediate PLINK outputs under `tmp_GENAL/`. ```python G_instruments.prs( name="prs_sbp", path="my_genotypes_chr$", # bed/bim/fam or pgen/pvar/psam; use '$' if split by chr weighted=True, ) ``` Tunable options and common pitfalls: - **Variant matching**: - If `CHR+POS` are available in `G_instruments.data`, genal prefers coordinate matching and will call `update_snpids()` internally to reduce ID mismatches. - If you only have rsIDs, ensure your cohort uses compatible IDs (or add coordinates and rerun preprocessing). - `proxy=True` enables proxy SNP search when variants are missing in your cohort: - `r2`, `kb`, `window_snps`, and the LD `reference_panel` control proxy search scope/strictness. - `name` can be a path prefix; the output is `.csv`. Example with proxies: ```python G_instruments.prs( name="prs_sbp_proxy", path="my_genotypes_chr$", proxy=True, reference_panel="EUR_37", r2=0.8, kb=5000, window_snps=1_000_000, ) ``` ## 5) Two-sample MR: query an outcome GWAS and run MR ### 5.1) Load and preprocess the outcome GWAS ```python outcome_df = pd.read_csv("outcome_gwas.tsv", sep="\t") G_outcome = genal.Geno( outcome_df, CHR="CHR", POS="POS", SNP="SNP", EA="EA", NEA="NEA", BETA="BETA", SE="SE", P="P", EAF="EAF", keep_columns=False, ) G_outcome.preprocess_data(preprocessing="Fill_delete", reference_panel="37") ``` ### 5.2) Query the outcome at the instrument variants {py:meth}`genal.Geno.query_outcome` extracts exposure SNPs from an outcome dataset and stores `(exposure_df, outcome_df, outcome_name)` in `G_instruments.MR_data`. ```python G_instruments.query_outcome( G_outcome, name="stroke", proxy=True, reference_panel="EUR_37", ) ``` Tunable options: - `proxy=True` searches proxies when SNPs are missing from the outcome. - `r2`, `kb`, `window_snps` control proxy search scope/strictness. ### 5.3) Run MR {py:meth}`genal.Geno.MR` harmonizes alleles and runs the requested estimators: ```python res = G_instruments.MR( methods=["IVW", "IVW-FE", "WM", "Simple-mode", "Egger"], action=2, heterogeneity=True, odds=False, exposure_name="SBP", outcome_name="Stroke", ) ``` Key arguments you commonly tune: - `methods`: choose estimators/sensitivity analyses; see {doc}`methods` for implementation notes. - Palindromic SNP handling (`action`): - `1`: keep palindromes (assumes alleles already aligned) - `2`: frequency-based flipping using `EAF` (conservative default; requires `EAF` in both exposure/outcome) - `3`: drop all palindromes - `heterogeneity=True` adds Cochran’s Q diagnostics for methods that support it. - `odds=True` adds exponentiated effect summaries (odds ratios + 95% CI strings) when appropriate. ### 5.4) Plot MR results After `MR()`, you can generate a scatter plot with method lines: ```python G_instruments.MR_plot(filename="mr_scatter", figure_size=(10, 6)) # saves mr_scatter.png ``` You can also draw a funnel plot of single-SNP ratio estimates (Wald ratios): ```python G_instruments.MR_funnel( methods=["IVW", "WM", "Egger"], # vertical reference lines (optional) filename="mr_funnel", figure_size=(10, 6), ) ``` ## 6a) MR-PRESSO (outlier detection and distortion testing) {py:meth}`genal.Geno.MRpresso` runs a parallel MR-PRESSO implementation. ```python mod_table, GlobalTest, OutlierTest, BiasTest = G_instruments.MRpresso( action=2, n_iterations=10_000, significance_p=0.05, cpus=-1, ) ``` What you typically tune: - `n_iterations`: more iterations → more stable p-values but slower. - `significance_p`: threshold for global/outlier tests. - `outlier_test` / `distortion_test`: disable if you only want the global test. Output structure: - `mod_table`: DataFrame with two rows (`"Raw"` and `"Outlier-corrected"`) containing columns `method`, `nSNP`, `b`, `se`, `pval` for the IVW model before and after outlier removal. - `GlobalTest`: dictionary with `"RSSobs"` (observed residual sum of squares) and `"global_test_p"` (p-value for the global pleiotropy test). - `OutlierTest`: DataFrame with per-SNP outlier p-values (`RSSobs`, `Pvalue`); SNP identifiers (rsIDs) are used as row labels. Empty if the global test is not significant. - `BiasTest`: dictionary containing `"outliers_indices"` (list of SNP IDs), `"distortion_test_coefficient"` (percent), and `"distortion_test_p"`. If outliers are found, you can rerun MR using the outlier-removed subset: ```python res_no_outliers = G_instruments.MR(use_mrpresso_data=True) ``` To highlight MR-PRESSO outliers on plots, pass `use_mrpresso_data=True` (outliers are colored in red and shown in the legend): ```python G_instruments.MR_plot(filename="mr_scatter_mrpresso", figure_size=(10, 6), use_mrpresso_data=True) G_instruments.MR_funnel(filename="mr_funnel_mrpresso", figure_size=(10, 6), use_mrpresso_data=True) G_instruments.MR_loo_plot(filename="loo_forest_mrpresso", figure_size=(10, 8), use_mrpresso_data=True) ``` See {doc}`methods` for algorithm details and outputs. ## 6b) Leave-one-out MR (sensitivity analysis) {py:meth}`genal.Geno.MR_loo` iteratively removes each SNP and re-estimates the causal effect. This helps identify influential variants that may be driving the overall result. ```python loo_df = G_instruments.MR_loo( method="IVW", # any single MR method key (see MR method map) action=2, heterogeneity=False, # set True to include Q statistics odds=False, # set True for OR-scale output ) ``` Key arguments: - `method`: a single MR method key (e.g., `"IVW"`, `"Egger"`, `"WM"`); must not be `"all"`. - `use_mrpresso_data=True`: use the outlier-removed dataset from MR-PRESSO instead of all instruments. ### Visualizing leave-one-out results {py:meth}`genal.Geno.MR_loo_plot` creates a forest plot from the stored `MR_loo_results`: ```python # Default: show top influential instruments G_instruments.MR_loo_plot(filename="loo_forest", figure_size=(10, 8)) ``` ```python # Or paginate all instruments G_instruments.MR_loo_plot( top_influential=False, # show all, not just influential snps_per_page=30, page=1, # or None for all pages filename="loo_forest_all", figure_size=(10, 12), ) ``` Key arguments: - `top_influential=True` (default): select the `snps_per_page` most influential SNPs (largest change in estimate when removed) and render a single compact figure. - `top_influential=False`: paginate all instruments; use `page=N` to select a specific page or `page=None` to render all pages. - `snps_per_page`: number of SNPs per page (minimum 5). - `use_mrpresso_data=True`: color MR-PRESSO outliers in red (requires `MRpresso()` first). When outliers exist, an extra summary row ("MR-PRESSO corrected") is added using the same MR method as the leave-one-out analysis. - `methods=["WM", "Egger"]`: add extra overall estimates for the requested methods (computed on all instruments). ## 7) Additional capabilities (beyond the core pipeline) ### Single-SNP association tests (individual-level data) Use this when you have **individual-level genotypes** (PLINK files) and a phenotype table for the same cohort, and you want to estimate SNP–phenotype associations in that cohort (linear regression for quantitative traits, logistic regression for binary traits via PLINK 2 `--glm`). 1) Attach phenotype data to the object: ```python G_instruments.set_phenotype( pheno_df, PHENO="sbp", PHENO_type="quant", # or "binary"; if omitted, genal tries to infer it IID="IID", FID="FID", # optional; defaults to IID ) ``` For binary phenotypes, `set_phenotype()` recodes the two observed values to `0/1` (control/case) using the most frequent value as the control code by default (use `alternate_control=True` to invert this interpretation). 2) Run association tests (updates `G.data` in place): ```python G_instruments.association_test( path="my_genotypes_chr$", # bed/bim/fam or pgen/pvar/psam; use '$' if split by chr covar=["age", "sex"], # used to specify covariates to adjust for standardize=True, # standardize quantitative phenotypes before regression ) ``` What you typically tune / watch: - `covar`: covariate names (must be present in `pheno_df` and numeric; constant covariates are dropped). - `standardize=True`: for quantitative traits; set `False` if you want raw-scale effects. - Variant matching: if `CHR+POS` are present in `G.data`, genal will map to cohort SNP IDs before running PLINK (reduces ID mismatch losses). - After updating `BETA`, `SE`, and `P`, the F-statistic (`FSTAT`) column is automatically recomputed to remain consistent with the updated estimates. ### Liftover between builds {py:meth}`genal.Geno.lift` lifts `CHR`/`POS` coordinates across genome builds and returns a **DataFrame** (it does not return a new `Geno`). The appropriate chain file is downloaded automatically if used for the first time. ```python lifted_df = G_exposure.lift(start="hg19", end="hg38", replace=False) # G_exposure.data unchanged ``` Key arguments: - `replace=True|False`: whether to overwrite `G.data` (return value is still a DataFrame either way). - `chain_file=...`: use a local UCSC chain file (useful for offline/reproducible builds). - `liftover_path=...`: use UCSC `liftOver` executable (faster than Python liftover for large datasets). - `name=...` / `extraction_file=True`: optionally write the lifted table and/or an extraction file. If you want a lifted `Geno` object (to keep chaining methods), use: ```python G_lifted = G_exposure.copy(lifted_df) ``` ### Update allele frequencies (EAF) from a reference panel Allele frequencies are useful for MR harmonization with `action=2` (frequency-based palindrome handling) and as a sanity-check for allele alignment. ```python G_exposure.update_eaf(reference_panel="EUR_37", replace=True, fill=True) ``` Key arguments: - `reference_panel`: an LD reference panel name (e.g. `"EUR_37"`, `"AFR_38"`) or a custom PLINK fileset (no extension). - `replace=True|False`: whether to overwrite `G.data` (return value is always a DataFrame). - `fill=True|False`: keep existing `EAF` values for variants not found in the reference (`True`), or set them to `NaN` (`False`). Requirements / behavior: - Requires `EA` and either `SNP` or `CHR+POS`. - Uses PLINK `--freq` and aligns the returned frequency to the **effect allele** (`EA`). If alleles do not match the reference panel, the updated EAF is set to `NaN` (or left unchanged if `fill=True`). ### Colocalization (ABF) Colocalization is typically run on a locus, not genome-wide. A common pattern is to subset both datasets first: ```python G_exp_region = G_exposure.filter_by_gene(gene="APOE", window_size=500_000, build="37", replace=False) G_out_region = G_outcome.filter_by_gene(gene="APOE", window_size=500_000, build="37", replace=False) coloc = G_exp_region.colocalize( G_out_region, trait1_type="quant", trait2_type="cc", n1=700_000, n2=40_000, ) ``` Key arguments: - `trait1_type` / `trait2_type`: `"quant"` or `"cc"`. - For quantitative traits: provide `sdY1`/`sdY2`, or provide `n1`/`n2` *and* `EAF` so genal can estimate $sd_Y$; otherwise genal falls back to $sd_Y = 1$. - `p1`, `p2`, `p12`: priors for the ABF model (see {doc}`methods`). - Matching: by default, genal attempts to merge on `CHR+POS` when available; set `merge_on_snp=True` to force merging by rsID (`SNP`). Output: - A dictionary containing `nsnps` and posterior probabilities (`PP.H0.abf` … `PP.H4.abf`). ### GWAS Catalog annotation {py:meth}`genal.Geno.query_gwas_catalog` annotates variants with traits from the GWAS Catalog REST API (network-bound; can be slow for large SNP sets). ```python annotated_df = G_exposure.query_gwas_catalog( p_threshold=5e-8, return_p=True, return_study=True, max_associations=5, replace=False, ) ``` Key arguments: - `p_threshold`: only report associations with $p \le p_\mathrm{threshold}$. - `max_associations`: cap the number of reported associations per SNP (keeps the output readable). - `timeout`: per-request timeout; increase for large SNP sets. - `replace=True|False`: whether to store the resulting `ASSOC` column in `G.data`. The returned table contains an `ASSOC` column with lists of associations; SNPs that cannot be queried are labeled `"FAILED_QUERY"` and long requests may be labeled `"TIMEOUT"`. ### Gene-window filtering ```python apoe_region = G_exposure.filter_by_gene( gene="APOE", id_type="symbol", window_size=500_000, build="37", replace=False, ) ``` Key arguments: - `gene`: gene identifier (e.g. `"APOE"`). - `id_type`: identifier type (`"symbol"`, `"HGNC"`, `"Ensembl"`, `"NCBI"`, `"UCSC"`, `"Vega"`, or `"name"`). - `window_size`: total window size in base pairs (± `window_size/2` around the gene). - `build`: `"37"` or `"38"` (must match your `CHR/POS` coordinates). - `replace=False|True`: return a new `Geno` (default) or filter this object in place. Notes: - Requires `CHR` and `POS`. - On first use, genal may download a small gene-mapping parquet file into your configured reference folder (network required). ### Save intermediate datasets Use {py:meth}`genal.Geno.save` to persist `G.data` to disk for reuse. ```python G_instruments.name = "my_instruments" # controls the filename prefix G_instruments.save(fmt="h5") # writes my_instruments.h5 (key="data") ``` Notes: - `save()` writes the file using `G.name`; set it explicitly if you want stable filenames. - Supported formats are `fmt="h5"` (default), `"parquet"`, `"csv"`, and `"txt"`. - You can later load with `pd.read_hdf("my_instruments.h5", key="data")` or `pd.read_parquet("my_instruments.parquet")`, or pass the `.h5/.hdf5/.parquet` path directly to `query_outcome()`.