Core concepts

The `Geno` object model

genal.Geno is the central class. It wraps a SNP-level table and accumulates intermediate results during workflows.

Key attributes (you don’t need to manipulate these directly):

G.data: main SNP table (pandas.DataFrame)
G.phenotype: set by genal.Geno.set_phenotype() (phenotype DataFrame + metadata)
G.MR_data: set by genal.Geno.query_outcome() (exposure/outcome tables used by MR)
G.MR_results: set by genal.Geno.MR() (results table + harmonized SNP table; used by plotting)
G.MR_loo_results: set by genal.Geno.MR_loo() (leave-one-out results tuple; used by MR_loo_plot)
G.MRpresso_results / G.MRpresso_subset_data: set by genal.Geno.MRpresso()

Standard columns

Most genal workflows assume (a subset of) the following “standard” columns:

Column	Meaning
`CHR`	Chromosome (integer; X is typically encoded as 23)
`POS`	Base-pair position
`SNP`	Variant identifier (rsID or a `CHR:POS:...` fallback)
`EA`	Effect allele
`NEA`	Non-effect allele
`BETA`	Effect estimate (beta; odds ratios can be log-transformed during preprocessing)
`SE`	Standard error of `BETA`
`P`	P-value
`EAF`	Effect allele frequency (aligned to `EA` when possible)
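
As an illustration of the layout above, the following pandas-only sketch reshapes a made-up summary-statistics table into the standard columns. The input column names, the rsIDs, and the values are hypothetical, and this is not genal's own preprocessing code; `preprocess_data()` may handle some of these steps for you.

```python
import numpy as np
import pandas as pd

# Made-up GWAS summary statistics with non-standard column names (illustrative only).
raw = pd.DataFrame({
    "chrom": ["1", "X", "7"],
    "bp": [1005806, 2781514, 92383888],
    "rsid": ["rs1", "rs2", "rs3"],
    "effect_allele": ["t", "A", "c"],
    "other_allele": ["c", "G", "a"],
    "odds_ratio": [1.10, 0.95, 1.00],
    "se_log_or": [0.02, 0.03, 0.05],
    "pval": [2e-6, 0.09, 1.0],
    "freq": [0.14, 0.41, 0.03],
})

# Map the hypothetical input names onto the standard column names.
rename_map = {
    "chrom": "CHR", "bp": "POS", "rsid": "SNP",
    "effect_allele": "EA", "other_allele": "NEA",
    "se_log_or": "SE", "pval": "P", "freq": "EAF",
}
std = raw.rename(columns=rename_map)

# Encode chromosome X as 23 and store CHR as an integer.
std["CHR"] = std["CHR"].replace({"X": "23"}).astype(int)

# Upper-case the alleles so EA/NEA comparisons are consistent.
std["EA"] = std["EA"].str.upper()
std["NEA"] = std["NEA"].str.upper()

# Log-transform odds ratios so BETA is on the log-odds scale.
std["BETA"] = np.log(std.pop("odds_ratio"))

# Keep the standard columns in a predictable order.
std = std[["CHR", "POS", "SNP", "EA", "NEA", "BETA", "SE", "P", "EAF"]]
```

The resulting `std` table has one row per variant and can serve as the SNP-level table that a `Geno` object wraps.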

What is required where (rule of thumb)

This is a practical guide, not an exhaustive contract. When a method can work with either an rsID or genomic coordinates, this is written as SNP (or CHR+POS).

Method	Minimal required columns	Notes / recommended inputs
`genal.Geno.preprocess_data()`	partial inputs	Can fill/validate columns using a build-only reference. Filling rsIDs requires `CHR+POS`; filling coordinates requires `SNP`.
`genal.Geno.clump()`	`SNP`, `P`	LD clumping via PLINK; returns a new `Geno` (or `None` if nothing passes).
`genal.Geno.prs()`	`EA`, `BETA`, plus `SNP (or CHR+POS)`	If `CHR+POS` are available, genal will prefer position-based matching to your genotype dataset to reduce ID-mismatch losses.
`genal.Geno.query_outcome()`	`SNP`, `EA`, `NEA`, `BETA`, `SE` (exposure and outcome)	Outcome querying is rsID-based; proxy search is optional. If you plan to use `action=2` later, `EAF` in both datasets is strongly recommended.
`genal.Geno.MR()` / `genal.Geno.MR_loo()` / `genal.Geno.MRpresso()`	`MR_data`	All consume `MR_data` produced by `query_outcome()`.
`genal.Geno.colocalize()`	`BETA`, `SE`, plus `CHR+POS` (preferred) or `SNP` (in both datasets)	If `EA/NEA` are present in both datasets, effects are allele-aligned; otherwise results assume both GWAS use the same reference allele. For quantitative traits, provide `sdY` or (`EAF` + `n`) to avoid the default `sdY=1` assumption.
`genal.Geno.update_eaf()`	`EA`, plus `CHR+POS` or `SNP`	Uses PLINK to compute allele frequencies from a reference panel; coordinate-based matching is faster when available.
`genal.Geno.filter_by_gene()` / `genal.Geno.lift()`	`CHR`, `POS`	Genomic coordinate operations.
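
The table above can be turned into a quick pre-flight check. The sketch below is a hypothetical helper, not part of genal: the `REQUIRED` lookup simply transcribes the minimal columns listed above, with one set per acceptable alternative (for example `SNP` or `CHR+POS`).

```python
import pandas as pd

# Hypothetical lookup transcribed from the table above; not part of genal.
# Each method maps to a list of alternative column sets; any one is enough.
REQUIRED = {
    "clump": [{"SNP", "P"}],
    "prs": [{"EA", "BETA", "SNP"}, {"EA", "BETA", "CHR", "POS"}],
    "query_outcome": [{"SNP", "EA", "NEA", "BETA", "SE"}],
    "colocalize": [{"BETA", "SE", "CHR", "POS"}, {"BETA", "SE", "SNP"}],
    "update_eaf": [{"EA", "CHR", "POS"}, {"EA", "SNP"}],
    "filter_by_gene": [{"CHR", "POS"}],
    "lift": [{"CHR", "POS"}],
}


def missing_columns(df: pd.DataFrame, method: str) -> set:
    """Return the columns still missing for `method`, using the alternative
    that is closest to being satisfied (empty set means ready)."""
    present = set(df.columns)
    return min((alt - present for alt in REQUIRED[method]), key=len)


# Example: a table with rsIDs and effects but no P-values or coordinates.
table = pd.DataFrame({"SNP": ["rs1"], "EA": ["A"], "BETA": [0.1]})
print(missing_columns(table, "prs"))    # nothing missing: the SNP-based set is complete
print(missing_columns(table, "clump"))  # P is missing
```

Running such a check on `G.data` before a long PLINK step gives an early, readable error instead of a failure partway through the workflow.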

Method behaviors (what you get back)

A helpful mental framework:

Methods that select or subset variants build a new working table and return a new Geno. The original is left unchanged, so you can keep both objects and continue chaining methods on either one.
Many “table-transformer” utilities return a pandas.DataFrame. If they accept replace=, that flag usually controls whether G.data is overwritten; the return type stays a DataFrame. If you want to keep chaining Geno methods, wrap the returned DataFrame with G.copy(df).
Workflow/analysis steps typically attach their results to the object (and sometimes also return a summary object).
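
To make the three conventions concrete, here is a toy stand-in class. `ToyGeno` and its methods `subset()`, `transform()`, and `analyze()` are hypothetical names invented for illustration; only the return-value patterns mirror what is described above, and none of this is genal's implementation.

```python
import pandas as pd


class ToyGeno:
    """Hypothetical stand-in that mimics the three return conventions."""

    def __init__(self, df: pd.DataFrame):
        self.data = df
        self.results = None

    def copy(self, df: pd.DataFrame) -> "ToyGeno":
        # Wrap a DataFrame in a new object so method chaining can continue.
        return ToyGeno(df)

    def subset(self, p_max: float) -> "ToyGeno":
        # Selector: returns a new object and leaves self.data untouched.
        return ToyGeno(self.data[self.data["P"] < p_max].copy())

    def transform(self, replace: bool = False) -> pd.DataFrame:
        # Table transformer: always returns a DataFrame;
        # replace=True additionally overwrites self.data.
        out = self.data.assign(Z=self.data["BETA"] / self.data["SE"])
        if replace:
            self.data = out
        return out

    def analyze(self) -> float:
        # Analysis step: stores the result on the object and also returns it.
        self.results = float(self.data["BETA"].mean())
        return self.results


G = ToyGeno(pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3"],
    "BETA": [0.2, 0.1, -0.3],
    "SE": [0.1, 0.1, 0.1],
    "P": [1e-9, 0.2, 1e-8],
}))

hits = G.subset(5e-8)    # new object; G still holds all three rows
z_table = G.transform()  # DataFrame; G.data is unchanged because replace=False
G_z = G.copy(z_table)    # wrap the DataFrame to keep chaining
mean_beta = G.analyze()  # result is returned and stored on G.results
```

The same reasoning applies to real workflows: check whether a call gave you a new object, a DataFrame, or `None` before chaining the next step.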

Cheat sheet (common methods)

Method	Returns	Behavior / side effects
`preprocess_data()`	`None`	mutates `G.data` (clean/fill/validate)
`clump()`	`Geno`	returns a new `Geno` (instrument set); original unchanged; uses PLINK temp files
`update_snpids()`	`pd.DataFrame`	returns updated SNP IDs; `replace=True` overwrites `G.data`
`lift()`	`pd.DataFrame`	returns lifted coordinates; `replace=True` overwrites `G.data`; may download chain files / write outputs
`update_eaf()`	`pd.DataFrame`	returns updated `EAF`; `replace=True` overwrites `G.data`; uses PLINK
`standardize_betas()`	`pd.DataFrame`	returns standardized effects; `replace=True` overwrites `G.data`
`set_phenotype()`	`None`	sets `G.phenotype` (phenotype table + metadata)
`association_test()`	`None`	runs PLINK `--glm`; mutates `G.data` (`BETA/SE/P`)
`query_outcome()`	`None`	sets `G.MR_data` (exposure/outcome tables used by MR)
`MR()`	`pd.DataFrame`	sets `G.MR_results` and returns the results table
`MR_plot()`	plot object	requires `G.MR_results`; writes `.png` if `filename=...`; supports `use_mrpresso_data=True` for outlier highlighting
`MR_funnel()`	plot object	requires `G.MR_results`; writes `.png` if `filename=...`; supports `use_mrpresso_data=True` for outlier highlighting
`MR_loo()`	`pd.DataFrame`	sets `G.MR_loo_results` and returns the LOO results table
`MR_loo_plot()`	plot object(s)	requires `G.MR_loo_results`; writes `.png` if `filename=...`; may return a list for multi-page output; supports `methods=[...]` overall rows and `use_mrpresso_data=True` for outlier highlighting
`MRpresso()`	tuple	sets `G.MRpresso_results` and `G.MRpresso_subset_data` (outlier-removed harmonized table; SNP-indexed)
`prs()`	`None`	writes `<name>.csv` and uses PLINK temp files
`query_gwas_catalog()`	`pd.DataFrame`	adds an `ASSOC` column (network-bound); `replace=True` overwrites `G.data`
`filter_by_gene(replace=False)`	`Geno`	returns a new `Geno` filtered to a locus
`filter_by_gene(replace=True)`	`None`	filters `G.data` in place
`colocalize()`	dict	returns posterior probabilities; does not modify `G.data`
`save()`	`None`	writes `G.name.(h5

Side effects and files

Be aware of these common side effects:

~/.genal/config.json is created/updated as you configure PLINK, reference folders, or default genotype paths.
tmp_GENAL/ is used as a scratch directory for PLINK commands and is not automatically deleted.
Some methods generate output files in your current directory (notably prs(), and plot saving in MR_plot(), MR_funnel(), and MR_loo_plot()).
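
Because the scratch directory is not cleaned up automatically, it can help to inspect and remove these artifacts yourself. The sketch below uses only the Python standard library; the helper names `read_genal_config()` and `remove_scratch_dir()` are hypothetical, and it assumes the paths are exactly those listed above.

```python
import json
import shutil
from pathlib import Path


def read_genal_config(home=None) -> dict:
    """Return the contents of ~/.genal/config.json, or {} if it does not exist."""
    base = Path(home) if home is not None else Path.home()
    path = base / ".genal" / "config.json"
    if not path.exists():
        return {}
    with path.open() as fh:
        return json.load(fh)


def remove_scratch_dir(workdir=".") -> bool:
    """Delete tmp_GENAL/ under `workdir`; return True if something was removed."""
    scratch = Path(workdir) / "tmp_GENAL"
    if scratch.is_dir():
        shutil.rmtree(scratch)
        return True
    return False
```

Call `remove_scratch_dir()` only once no PLINK-backed step is still running in that directory, since intermediate files may be needed until the step finishes.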

Resource usage (`ram`, `cpus`)

Geno sets defaults for:

G.cpus: derived from SLURM_CPUS_PER_TASK when present, otherwise os.cpu_count() - 1
G.ram: derived from SLURM_MEM_PER_CPU when present, otherwise from detected system RAM

You can override them after initialization:

G.cpus = 8
G.ram = 25_000  # MB

Many of the PLINK commands that genal runs take their --memory and/or --threads values from these attributes.
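
The defaults described above can be approximated with the standard library alone. This is a sketch under stated assumptions, not genal's code: the function names are hypothetical, multiplying `SLURM_MEM_PER_CPU` by the CPU count is one plausible reading of "derived from", and the fallback RAM detection relies on `os.sysconf`, which is not available on every platform.

```python
import os


def detect_ram_mb():
    """Best-effort total system RAM in MB; returns None where sysconf is unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // (1024 ** 2)
    except (AttributeError, ValueError, OSError):
        return None


def default_resources(env=None):
    """Return (cpus, ram_mb), preferring SLURM variables when they are set."""
    env = os.environ if env is None else env
    if "SLURM_CPUS_PER_TASK" in env:
        cpus = int(env["SLURM_CPUS_PER_TASK"])
    else:
        # Leave one core free, but never go below one.
        cpus = max(1, (os.cpu_count() or 2) - 1)
    if "SLURM_MEM_PER_CPU" in env:
        # Assumption: total memory = per-CPU allocation (MB) times the CPU count.
        ram_mb = int(env["SLURM_MEM_PER_CPU"]) * cpus
    else:
        ram_mb = detect_ram_mb()
    return cpus, ram_mb


def plink_resource_args(cpus, ram_mb):
    """Build the PLINK resource flags from the two values."""
    return ["--threads", str(cpus), "--memory", str(ram_mb)]


cpus, ram_mb = default_resources({"SLURM_CPUS_PER_TASK": "4", "SLURM_MEM_PER_CPU": "2000"})
args = plink_resource_args(cpus, ram_mb)
```

On a shared cluster, setting `G.cpus` and `G.ram` explicitly to match your job allocation is usually safer than relying on detected values.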