Usage#
The fundamental data structures and API provided by giglib intentionally mirror that of tskit (which makes it easy e.g. to initialise a GIG from an msprime-simulated tree sequence 🎉). Nevertheless, there are a number of terminological and implementation differences, an incomplete list of which are below:
API differences#
Iedges Intervals in a GIG as defined by giglib are stored in the iedges table, which has a
parent_left
andchild_left
column rather than a simpleleft
column as in tskit (and similarly forright
). We use the termiedge
rather than simplyedge
to indicate that this is a denormalised table that contains an interval (i
) as well as the parent and child coordinates that represent anedge
in the graph. That means a single graph edge can be represented by multiple disjoint intervals in theiedges
table.Tables and Graphs giglib provides a
Tables
andGraph
class, corresponding toTableCollection
andTreeSequence
classes in tskit. Thus to create a GIG from scratch, you dogig = tables.graph()
Chromosomes The API includes the possibility of having genetic material on different chromosomes. This is implemented using two extra columns in the iedges table (see https://github.com/hyanwong/giglib/issues/11 for the rationale)
Object access Information stored in GIG tables can be accessed using square brackets, and the
len()
function should work, so the canonical usage looks likegig.nodes[0]
,len(gig.nodes)
, and[u.id for u in gig.nodes]
rather than the equivalents in tskit (ts.node(0)
,ts.num_nodes
, and[u.id for u in ts.nodes()]
. Similarly, we usegig.sample_ids
(no braces) rather thants.samples()
.Internal iedge order The iedges in a GIG are sorted such that all those for a given child are adjacent (rather than all edges for a parent being adjacent as in tskit). Moreover, iedges are sorted in decreasing order of child time, so that edges with more recent children come last. Since edges for a given child are adjacent, child intervals for those edges cannot overlap. Edges for a given child are ordered by left coordinate, which helps algorithmic efficiency. The internal
gig.iedge_map_sorted_by_parent
array indexes into iedges in the order expected by tskit.No
.trees()
iterator See above: the meaning of a “local tree” is unclear in a GIG, so implementing the equivalent of the fundamental tskit.trees()
method is likely to require substantial theoretical work.Algorithms on tables Unlike tskit, some fundamental algorithms such as
find_mrca_regions
andsample_resolve
(a basic type ofsimplify
) can be run directly on a set of tables. For this to work, the tables need to conform to certain validity criteria. Substantial additional functionality has been incoporated into the tables objects, so that table rows (especially for iedges) can be validated on callingtable.add_row()
, according to certain validity flags (see theValidFlags
class inconstants.py
). This allows use of the GIG structure during generation of the tables, without having the substantial overhead of continually having to freeze them into an immutable graph. This makes forward simulation feasible.Other stuff More differences should be noted here!
Examples#
As with a tskit tree sequence, a GIG is immmutable, so to create one from scratch you
need to make a set of Tables
and freeze them using the .graph()
method (which gives
opportunity to cache and index important stuff):
import giglib as gigl
tables = gigl.Tables()
tables.nodes.add_row(0, flags=gigl.NODE_IS_SAMPLE)
tables.nodes.add_row(1, flags=0)
# Add an inversion
tables.iedges.add_row(
parent=1, child=0, child_left=0, child_right=1, parent_left=1, parent_right=0
)
gig = tables.graph()
assert len(gig.nodes) == 2
assert len(gig.iedges) == 1
assert len(gig.samples) == 1
assert gig.iedges[0].is_inversion()
Simulations#
Examples of code that runs simulations are provided in the test suite. Currently they are based on a relatively simple extension of the boilerplate forward-simulation code in the tskit forward-time simulation tutorial. The important difference is that the find_mrca_regions method is used to locate breakpoints for recombination.
For more details the sim.py file is a convenience wrapper that links out to other files containing simulation code. Currently only forward simulation code is provided. Nevertheless, it is conceptually general enough to form the basis of a pangenome simulator (this would however require substantial effort in fixing reasonable parameters before realistic-looking pangenomes could be created).