Core concepts
The Geno object model
genal.Geno is the central class. It wraps a SNP-level table and accumulates intermediate results during workflows.
Key attributes (you don’t need to manipulate these directly):
- G.data: main SNP table (pandas.DataFrame)
- G.phenotype: set by genal.Geno.set_phenotype() (phenotype DataFrame + metadata)
- G.MR_data: set by genal.Geno.query_outcome() (exposure/outcome tables used by MR)
- G.MR_results: set by genal.Geno.MR() (results table + harmonized SNP table; used by plotting)
- G.MR_loo_results: set by genal.Geno.MR_loo() (leave-one-out results tuple; used by MR_loo_plot)
- G.MRpresso_results / G.MRpresso_subset_data: set by genal.Geno.MRpresso()
Standard columns
Most genal workflows assume (a subset of) the following “standard” columns:
| Column | Meaning |
|---|---|
| | Chromosome (integer; X is typically encoded as 23) |
| | Base-pair position |
| | Variant identifier (rsID or a |
| | Effect allele |
| | Non-effect allele |
| | Effect estimate (beta; odds ratios can be log-transformed during preprocessing) |
| | Standard error of |
| | P-value |
| | Effect allele frequency (aligned to |
What is required where (rule of thumb)
This is a practical guide, not an exhaustive contract. When a method can work with either an rsID or genomic coordinates, this is written as SNP (or CHR+POS).
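The SNP (or CHR+POS) shorthand can be read as a simple check on the available columns. The sketch below is illustrative only and is not part of genal: the helper name `has_variant_key` is hypothetical, and it assumes the column names `SNP`, `CHR`, and `POS` exactly as written above.

```python
def has_variant_key(columns):
    """Return True if the columns identify variants by rsID or by coordinates.

    Hypothetical helper (not a genal function): it mirrors the
    "SNP (or CHR+POS)" shorthand used in this guide.
    """
    cols = set(columns)
    # Either an rsID column, or both coordinate columns, is enough.
    return "SNP" in cols or {"CHR", "POS"} <= cols


print(has_variant_key(["SNP"]))         # rsID column present
print(has_variant_key(["CHR", "POS"]))  # both coordinate columns present
print(has_variant_key(["CHR"]))         # position is missing
```

A check like this can be run on `G.data.columns` before calling a method, to see early whether an identifier column needs to be filled first.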
| Method | Minimal required columns | Notes / recommended inputs |
|---|---|---|
| | partial inputs | Can fill/validate columns using a build-only reference. Filling rsIDs requires |
| | | LD clumping via PLINK; returns a new |
| | | If |
| | | Outcome querying is rsID-based; proxy search is optional. If you plan to use |
| | | All consume |
| | | If |
| | | Uses PLINK to compute allele frequencies from a reference panel; coordinate-based matching is faster when available. |
| | | Genomic coordinate operations. |
Method behaviors (what you get back)
A helpful mental framework:
- Methods that select/subset variants create a new working table and return a new Geno (so you can keep both input and output to chain methods on both Geno objects).
- Many “table-transformer” utilities return a pandas.DataFrame. If they accept replace=, that flag usually controls whether G.data is overwritten; the return type stays a DataFrame. If you want to keep chaining Geno methods, wrap the returned DataFrame with G.copy(df).
- Workflow/analysis steps typically attach results on the object (and sometimes also return a summary object).
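The three return conventions above can be sketched with a toy class. This is not genal code: `ToyGeno`, its method names, and the `pval` key are all hypothetical, and the table is a plain list of dicts rather than a pandas.DataFrame so that the example runs without dependencies.

```python
class ToyGeno:
    """Toy stand-in for illustration only; this is NOT genal.Geno."""

    def __init__(self, rows):
        self.data = list(rows)   # main table (a DataFrame in genal)
        self.results = None      # filled in by analysis steps

    def copy(self, rows):
        """Wrap a table in a new object so method chaining can continue."""
        return ToyGeno(rows)

    def select(self, p_max):
        """Selector pattern: leave self untouched, return a new object."""
        kept = [r for r in self.data if r["pval"] < p_max]
        return ToyGeno(kept)

    def transform(self, replace=False):
        """Transformer pattern: always return a table.

        replace= only controls whether self.data is overwritten.
        """
        table = [{**r, "SNP": r["SNP"].lower()} for r in self.data]
        if replace:
            self.data = table
        return table

    def analyse(self):
        """Analysis pattern: attach results to the object, return a summary."""
        self.results = {"n_variants": len(self.data)}
        return self.results


g = ToyGeno([{"SNP": "RS1", "pval": 1e-9}, {"SNP": "RS2", "pval": 0.2}])
subset = g.select(p_max=5e-8)      # new object; g keeps both rows
table = g.transform()              # plain table; g.data is unchanged
chained = g.copy(table).select(1)  # wrap the table to keep chaining
summary = g.analyse()              # also stored on g.results
```

Keeping selectors non-mutating is what lets the input and the output coexist, while the replace= flag makes in-place updates an explicit choice.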
Cheat sheet (common methods)
| Method | Returns | Behavior / side effects |
|---|---|---|
| | | mutates |
| | | returns a new |
| | | returns updated SNP IDs; |
| | | returns lifted coordinates; |
| | | returns updated |
| | | returns standardized effects; |
| | | sets |
| | | runs PLINK |
| | | sets |
| | | sets |
| | plot object | requires |
| | plot object | requires |
| | | sets |
| | plot object(s) | requires |
| | tuple | sets |
| | | writes |
| | | adds an |
| | | returns a new |
| | | filters |
| | dict | returns posterior probabilities; does not modify |
| | | writes `G.name.(h5 |
Side effects and files
Be aware of these common side effects:
- ~/.genal/config.json is created/updated as you configure PLINK, reference folders, or default genotype paths.
- tmp_GENAL/ is used as a scratch directory for PLINK commands and is not automatically deleted.
- Some methods generate output files in your current directory (notably prs(), and plot saving in MR_plot(), MR_funnel(), and MR_loo_plot()).
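Because tmp_GENAL/ is not removed automatically, you may want to delete it yourself once a workflow has finished. The sketch below uses only the Python standard library; the helper name `remove_scratch_dir` is hypothetical, and it assumes the scratch directory sits directly under the working directory you pass in.

```python
import shutil
from pathlib import Path


def remove_scratch_dir(workdir=".", name="tmp_GENAL"):
    """Delete the scratch directory under workdir if it exists.

    Hypothetical helper (not a genal function).
    Returns True if a directory was removed, False otherwise.
    """
    scratch = Path(workdir) / name
    if scratch.is_dir():
        shutil.rmtree(scratch)  # removes the directory and all its contents
        return True
    return False
```

Call it only after all PLINK-backed steps are done, since intermediate files in the scratch directory may still be needed while a workflow is running.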
Resource usage (ram, cpus)
Geno sets defaults for:
- G.cpus: derived from SLURM_CPUS_PER_TASK when present, otherwise os.cpu_count() - 1
- G.ram: derived from SLURM_MEM_PER_CPU when present, otherwise from detected system RAM
You can override them after initialization:
G.cpus = 8
G.ram = 25_000 # MB
Many PLINK commands accept --memory and/or --threads parameters that are fed by these attributes.
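As a rough illustration of the CPU default and of how the two attributes map onto PLINK flags, consider the sketch below. It is not genal's internal code: both function names are hypothetical, the floor of one CPU is an assumption of this sketch, and the RAM default is left out because its exact derivation is not described here.

```python
import os


def default_cpus(environ=None):
    """Sketch of the CPU default described above (not genal's actual code)."""
    environ = os.environ if environ is None else environ
    slurm = environ.get("SLURM_CPUS_PER_TASK")
    if slurm:
        return int(slurm)
    # os.cpu_count() can return None; the floor of 1 is an assumption.
    return max(1, (os.cpu_count() or 1) - 1)


def plink_resource_args(cpus, ram_mb):
    """Build PLINK resource flags from a CPU count and RAM in MB."""
    return ["--threads", str(cpus), "--memory", str(ram_mb)]


print(default_cpus({"SLURM_CPUS_PER_TASK": "4"}))
print(plink_resource_args(8, 25_000))
```

Passing the environment as an argument keeps the function easy to test without modifying the real process environment.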