# Core concepts ## The `Geno` object model {py:class}`genal.Geno` is the central class. It wraps a SNP-level table and accumulates intermediate results during workflows. Key attributes (you don't need to manipulate these directly): - `G.data`: main SNP table (`pandas.DataFrame`) - `G.phenotype`: set by {py:meth}`genal.Geno.set_phenotype` (phenotype DataFrame + metadata) - `G.MR_data`: set by {py:meth}`genal.Geno.query_outcome` (exposure/outcome tables used by MR) - `G.MR_results`: set by {py:meth}`genal.Geno.MR` (results table + harmonized SNP table; used by plotting) - `G.MR_loo_results`: set by {py:meth}`genal.Geno.MR_loo` (leave-one-out results tuple; used by `MR_loo_plot`) - `G.MRpresso_results` / `G.MRpresso_subset_data`: set by {py:meth}`genal.Geno.MRpresso` ## Standard columns Most `genal` workflows assume (a subset of) the following “standard” columns: | Column | Meaning | |---|---| | `CHR` | Chromosome (integer; X is typically encoded as 23) | | `POS` | Base-pair position | | `SNP` | Variant identifier (rsID or a `CHR:POS:...` fallback) | | `EA` | Effect allele | | `NEA` | Non-effect allele | | `BETA` | Effect estimate (beta; odds ratios can be log-transformed during preprocessing) | | `SE` | Standard error of `BETA` | | `P` | P-value | | `EAF` | Effect allele frequency (aligned to `EA` when possible) | ### What is required where (rule of thumb) This is a *practical guide*, not an exhaustive contract. When a method can work with either an rsID or genomic coordinates, this is written as `SNP (or CHR+POS)`. | Method | Minimal required columns | Notes / recommended inputs | |---|---|---| | {py:meth}`genal.Geno.preprocess_data` | partial inputs | Can fill/validate columns using a build-only reference. Filling rsIDs requires `CHR+POS`; filling coordinates requires `SNP`. | | {py:meth}`genal.Geno.clump` | `SNP`, `P` | LD clumping via PLINK; returns a new `Geno` (or `None` if nothing passes). | | {py:meth}`genal.Geno.prs` | `EA`, `BETA`, plus `SNP (or CHR+POS)` | If `CHR+POS` are available, genal will prefer position-based matching to your genotype dataset to reduce ID-mismatch losses. | | {py:meth}`genal.Geno.query_outcome` | `SNP`, `EA`, `NEA`, `BETA`, `SE` (exposure and outcome) | Outcome querying is rsID-based; proxy search is optional. If you plan to use `action=2` later, `EAF` in both datasets is strongly recommended. | | {py:meth}`genal.Geno.MR` / {py:meth}`genal.Geno.MR_loo` / {py:meth}`genal.Geno.MRpresso` | `MR_data` | All consume `MR_data` produced by `query_outcome()`. | | {py:meth}`genal.Geno.colocalize` | `BETA`, `SE`, plus `CHR+POS` (preferred) **or** `SNP` (in both datasets) | If `EA/NEA` are present in both datasets, effects are allele-aligned; otherwise results assume both GWAS use the same reference allele. For quantitative traits, provide `sdY` or (`EAF` + `n`) to avoid the default `sdY=1` assumption. | | {py:meth}`genal.Geno.update_eaf` | `EA`, plus `CHR+POS` **or** `SNP` | Uses PLINK to compute allele frequencies from a reference panel; coordinate-based matching is faster when available. | | {py:meth}`genal.Geno.filter_by_gene` / {py:meth}`genal.Geno.lift` | `CHR`, `POS` | Genomic coordinate operations. | ## Method behaviors (what you get back) A helpful mental framework: - Methods that *select/subset variants* create a new working table and return a **new `Geno`** (so you can keep both input and output to chain methods on both Geno objects). - Many “table-transformer” utilities return a **`pandas.DataFrame`**. If they accept `replace=`, that flag usually controls whether `G.data` is overwritten; the return type stays a DataFrame. If you want to keep chaining Geno methods, wrap the returned DataFrame with `G.copy(df)`. - Workflow/analysis steps typically **attach results** on the object (and sometimes also return a summary object). ### Cheat sheet (common methods) | Method | Returns | Behavior / side effects | |---|---|---| | `preprocess_data()` | `None` | mutates `G.data` (clean/fill/validate) | | `clump()` | `Geno` | returns a new `Geno` (instrument set); original unchanged; uses PLINK temp files | | `update_snpids()` | `pd.DataFrame` | returns updated SNP IDs; `replace=True` overwrites `G.data` | | `lift()` | `pd.DataFrame` | returns lifted coordinates; `replace=True` overwrites `G.data`; may download chain files / write outputs | | `update_eaf()` | `pd.DataFrame` | returns updated `EAF`; `replace=True` overwrites `G.data`; uses PLINK | | `standardize_betas()` | `pd.DataFrame` | returns standardized effects; `replace=True` overwrites `G.data` | | `set_phenotype()` | `None` | sets `G.phenotype` (phenotype table + metadata) | | `association_test()` | `None` | runs PLINK `--glm`; mutates `G.data` (`BETA/SE/P`) | | `query_outcome()` | `None` | sets `G.MR_data` (exposure/outcome tables used by MR) | | `MR()` | `pd.DataFrame` | sets `G.MR_results` and returns the results table | | `MR_plot()` | plot object | requires `G.MR_results`; writes `.png` if `filename=...`; supports `use_mrpresso_data=True` for outlier highlighting | | `MR_funnel()` | plot object | requires `G.MR_results`; writes `.png` if `filename=...`; supports `use_mrpresso_data=True` for outlier highlighting | | `MR_loo()` | `pd.DataFrame` | sets `G.MR_loo_results` and returns the LOO results table | | `MR_loo_plot()` | plot object(s) | requires `G.MR_loo_results`; writes `.png` if `filename=...`; may return a list for multi-page output; supports `methods=[...]` overall rows and `use_mrpresso_data=True` for outlier highlighting | | `MRpresso()` | tuple | sets `G.MRpresso_results` and `G.MRpresso_subset_data` (outlier-removed harmonized table; SNP-indexed) | | `prs()` | `None` | writes `.csv` and uses PLINK temp files | | `query_gwas_catalog()` | `pd.DataFrame` | adds an `ASSOC` column (network-bound); `replace=True` overwrites `G.data` | | `filter_by_gene(replace=False)` | `Geno` | returns a new `Geno` filtered to a locus | | `filter_by_gene(replace=True)` | `None` | filters `G.data` in place | | `colocalize()` | dict | returns posterior probabilities; does not modify `G.data` | | `save()` | `None` | writes `G.name.(h5|parquet|csv|txt)` to disk | ## Side effects and files Be aware of these common side effects: - `~/.genal/config.json` is created/updated as you configure PLINK, reference folders, or default genotype paths. - `tmp_GENAL/` is used as a scratch directory for PLINK commands and is **not** automatically deleted. - Some methods generate output files in your current directory (notably `prs()`, and plot saving in `MR_plot()`, `MR_funnel()`, and `MR_loo_plot()`). ## Resource usage (`ram`, `cpus`) `Geno` sets defaults for: - `G.cpus`: derived from `SLURM_CPUS_PER_TASK` when present, otherwise `os.cpu_count() - 1` - `G.ram`: derived from `SLURM_MEM_PER_CPU` when present, otherwise from detected system RAM You can override them after initialization: ```python G.cpus = 8 G.ram = 25_000 # MB ``` Many PLINK commands accept `--memory` and/or `--threads` parameters that are fed by these attributes.