Python API#
This page provides formal documentation for the giglib Python API.
The Tables class#
Note
Some Tables
methods require the data stored in the tables
(in particular that in the tables.IEdgeTable
) to
conform to certain validity conditions before running. See
Validity flags for Tables.
- class giglib.Tables(time_units=None, *, initial_sizes=None)[source]#
A group of tables which can describe a Genetic Inheritance Graph (GIG). This class is intentionally similar to a tskit
TableCollection
.- add_iedge_row(*args, validate=None, skip_validate=None, **kwargs)[source]#
Similar to the
add_row()
method of theIEdgeTable
stored in this set of tables, but also (optionally) can be used to validate features of the nodes used (e.g. that the parent node time is older than the child node time).- Parameters:
validate (int) – A set of bitflags (as listed in
ValidFlags
) specifying which iedge table validation checks to perform. In particular, this can include theIEDGES_PARENT_OLDER_THAN_CHILD
andIEDGES_PRIMARY_ORDER_CHILD_TIME_DESC
flags, which will check the node table for those properties. Other flags will be passed totables.IEdgeTable.add_row()
.skip_validate (bool) –
If True, assume the user has checked that this operation will pass the validation tests implied by the
validate
flags. This means that the validation routines will not be run, but the tables may claim that they are valid when they are not. If False, and any of theIEDGES_...
validation flags are set, perform the appropriate validations. Default:None
treated as False.Warning
The
skip_validate
parameter is only intended to speed up well-tested code. It should only be set toTrue
when you are sure that any calling code has been tested with validation.
- copy(omit_iedges=None)[source]#
Return an unfrozen deep copy of the tables.
- Parameters:
omit_iedges (bool) – If True do not copy the iedges table, use an empty one.
- Returns:
A deep copy of the tables
- Return type:
- sort(shrink_memory=False)[source]#
Sort the tables in place. Currently only affects the iedges table. Sorting criteria match those used in tskit (see https://tskit.dev/tskit/docs/stable/data-model.html#edge-requirements).
- graph()[source]#
Return a genetic inheritance graph (Graph) object based on these tables. This requires the following iedge conditions to be met:
The child_left, child_right, parent_left, parent_right, parent, and child values must all be provided and be non-negative integers. If
child_chromosome
andparent_chromosome
are provided, they also must be non-negative integers (otherwise a default chromosome ID of 0 is assumed). This corresponds to the flagIEDGES_INTEGERS
.The intervals must be valid (i.e.
abs(child_left - child_right)
is finite, nonzero, and equal toabs(parent_left - parent_right)
. This corresponds to the flagIEDGES_INTERVALS
.Each chromosomal position in a child is covered by only one interval (i.e. for any particular chromosome, intervals for a given child do not overlap). This corresponds to the flag
IEDGES_FOR_CHILD_NONOVERLAPPING
The
child
andparent
IDs of an iedge must correspond to nodes in in the node table in which the parent is older (has a strictly greater time) than the child. This corresponds to the flagIEDGES_PARENT_OLDER_THAN_CHILD
If an
edge
ID is provided for an iedge, and it is notNULL
(-1), then other iedges with the sameedge
ID must have the samechild
andparent
IDs. This corresponds to the flagIEDGES_SAME_PARENT_CHILD_FOR_EDGE
For consistency and as an enforced convention, the
child_left
position for an iedge must be strictly less than thechild_right
position. This corresponds to the flagIEDGES_CHILD_INTERVAL_POSITIVE
To create a valid GIG, the iedges must also be sorted into a canonical order such that the following conditions are met (you can also accomplish this by calling
Tables.sort()
on the tables first):The iedges must be grouped by child ID. This corresponds to the flag
IEDGES_FOR_CHILD_ADJACENT
Within each group of iedges with the same child ID, the iedges must be ordered by chromosome ID and then by left position. This corresponds to the flags
IEDGES_FOR_CHILD_PRIMARY_ORDER_CHR_ASC
andIEDGES_FOR_CHILD_SECONDARY_ORDER_LEFT_ASC
The groups of iedges with the same child ID must be ordered by time of child node and then (if nodes have identical times) by child node ID. This corresponds to the flags
IEDGES_PRIMARY_ORDER_CHILD_TIME_DESC
andIEDGES_SECONDARY_ORDER_CHILD_ID_ASC
.
Note
The nodes are not required to be in any particular order (e.g. by time)
- Returns:
A genetic inheritance graph object
- Return type:
- property sample_ids#
Return the IDs of all samples in the nodes table
- Return type:
- decapitate(time)[source]#
Remove nodes that are older than a certain time, and edges that have those as parents. Also remove individuals associated with those nodes.
- Parameters:
time (float) – The time before which nodes should be removed
- Returns:
An array mapping old node ids to new node ids (or -1 if removed)
- Return type:
- classmethod from_tree_sequence(ts, *, chromosome=None, **kwargs)[source]#
Create a Tables object from a
tskit.TreeSequence
. To create a GIG directly, useGraph.from_tree_sequence()
which simply wraps this method.- Parameters:
ts (tskit.TreeSequence) – The tree sequence on which to base the Tables object
chromosome (int) – The chromosome number to use for all interval edges
kwargs – Other parameters passed to the Tables constructor
- Returns:
A Tables object representing the tree sequence
- Return type:
- sample_resolve(sort=True)[source]#
Sample resolve the Tables, keeping only those edge regions which transmit information to the current samples. This is rather like running the Hudson algorithm on a fixed graph, but without counting up the number of samples under each node. This requires the equivalent of making a new edges table, so is a costly operation.
The algorithm is implemented by using a stack that contains intervals for each node, ordered by node time (oldest first). When considering a node, we pop the (youngest) node off the end of the stack, which ensures that we have collected all the intervals with that node as a parent, before passing inheritance information upwards
- remove_non_ancestral_nodes()[source]#
Subset the node table to remove any nonsample nodes that are not referenced in the iedges table
- find_mrca_regions(u, v, *, u_chromosomes=None, v_chromosomes=None, time_cutoff=None)[source]#
Find all regions between nodes u and v (potentially restricted to a list of chromosomes in u and v) that share a most recent common ancestor in the GIG which is more recent than time_cutoff. Returns a dict of dicts of the following form
{ MRCA_node_ID1 : {(X, Y, CHR): ( [(uA, uB, uCHRa), (uC, uD, uCHRb), ...], [(vA, vB, vCHR), ...] )}, MRCA_node_ID2 : ..., ... }
Where in each inner dict, the key (X, Y, CHR) gives an interval (with X < Y) in the MRCA node, and the value is a 2-tuple giving the corresponding intervals in u and v. In the example above there is a list of two corresponding intervals in u: (uA, uB, cCHRa) and (uC, uD, cCHRb) representing a duplication of the MRCA interval into u. If uA > uB then that interval in u is inverted relative to that in the MRCA node.
Implementation-wise, this is similar to the sample_resolve algorithm, but
Instead of following the ancestry of all samples upwards, we follow just two of them. This means we are unlikely to visit all nodes in the graph, and so we create a dynamic stack (using the sortedcontainers.SortedDict class, which is kept ordered by node time.
We do not edit/change the edge intervals as we go. Instead we simply return the intervals shared between the two samples.
The intervals “know” whether they were originally associated with u or v. If we find an ancestor with both u and v-associated intervals this indicates a potential common ancestor, in which coalescence could have occurred.
We do not add older regions to the stack if they have coalesced.
We have a time cutoff, so that we don’t have to go all the way back to the “origin of life”
We have to keep track of the offset of the current interval into the original genome, to allow mapping back to the original coordinates
- Parameters:
u (int) – The first node
v (int) – The second node
u_chromosomes (int) – Restrict the MRCAs search to this list of chromosomes from node
u
. If None (default), return MRCAs for allu
’s chromosomesv_chromosomes (int) – Restrict MRCAs search to this list of chromosomes from node
v
. If None (default), return MRCAs for allv
’s chromosomes
Validity flags for Tables#
See also
Validity requirements for a GIG are described in Tables.graph()
.
- class giglib.ValidFlags(value)[source]#
Flags involving validity of the GIG. Those starting with
IEDGES\_
are specific to the iedges table.- IEDGES_INTEGERS = 1#
If set, iedge data is guaranteed to be integers
- IEDGES_INTERVALS = 2#
If set, intervals are guaranteed finite and have parent span == child span
- IEDGES_FOR_CHILD_NONOVERLAPPING = 4#
IF set, each genomic position in a child’s chromosome is guaranteed to have only one interval above
- IEDGES_PARENT_OLDER_THAN_CHILD = 8#
If set, all parents are older than their children, guaranteeing the GIG is acyclic (note this requires knowledge of the times in the nodes table). This flag also guarantees that nodes with those parent and child IDs exist in the nodes table.
- IEDGES_SAME_PARENT_CHILD_FOR_EDGE = 16#
If set, iedges with the same edge_ID are guaranteed to have the same combination of parentID / childID. Note the inverse may not be true: iedges with the same parentID / childID can have different edge_IDs.
- IEDGES_CHILD_INTERVAL_POSITIVE = 32#
Set if
child_left
<child_right
(inversions can haveparent_left
>parent_right
)
- IEDGES_FOR_CHILD_ADJACENT = 64#
Set if all iedges with the same child ID are adjacent in the table
- IEDGES_FOR_CHILD_PRIMARY_ORDER_CHR_ASC = 128#
Set if within a set of iedges for the same child, iedges are ordered first by chromosome number
- IEDGES_FOR_CHILD_SECONDARY_ORDER_LEFT_ASC = 256#
Set if, within a set of iedges for the same child & chromosome, iedges are ordered by left position
- IEDGES_PRIMARY_ORDER_CHILD_TIME_DESC = 512#
Set if the adjacent rows for a child are ordered in descending order of child node time. Note that this requires knowledge of the times in the nodes table.
- IEDGES_SECONDARY_ORDER_CHILD_ID_ASC = 1024#
Set if iedges are ordered such that where iedge children are tied at the same time, iedges with the lowest child ID come first. Note that although this is often not algorithmically necessary, it helps to creates a canonical iedge ordering.
- IEDGES_COMBO_STANDALONE = 503#
The combination of requirements that simply involve the iedge table and no linked tables
- IEDGES_WITHIN_CHILD_SORTED = 448#
The combination of requirements that identify if edges for a give child are sorted. Some algorithms (e.g. iterating over edges for a child / chromosome only require sorting by child ID and within the child
- IEDGES_SORTED = 1984#
A stricter canonical order of iedge sorting
Specific tables#
The IEdgeTable class#
- class giglib.tables.IEdgeTable(initial_size=None)[source]#
A table containing the iedges. This contains more complex logic than other GIG tables as we want to store and check various forms of validity, so that we can run algorithms directly on the tables rather than on a frozen GIG. For example, the
ids_for_child()
method requiresIEDGES_FOR_CHILD_ADJACENT
to be true.- add_row(child_left, child_right, parent_left, parent_right, *, child, parent, child_chromosome=0, parent_chromosome=0, edge=-1, validate=None, skip_validate=None)[source]#
The canonical way to add data to an IEdgeTable
See also
Tables.add_iedge_row()
which is a wrapper around this method that also allows validation of parent and child node times- Parameters:
validate (int) – A set of bitflags (as listed in
ValidFlags
) specifying which iedge table validation checks should be performed when adding this data. If the existing data is valid, and the new data is added in a way that preserves the existing validity, then calling \(has_bitflag\) for the flags in this set will return True. If any of the bits iniedges_validation
are0
, that particular validation will not be performed: in this case thehas_bitflag
method will return False for certain flags, and some table algorithms will not run. For instance, using theids_for_child()
method is only valid ifIEDGES_FOR_CHILD_ADJACENT
andIEDGES_FOR_CHILD_PRIMARY_ORDER_CHR_ASC
are set, so if you wish to use that method you should add those flags tovalidate
. Defaults toNone
which is treated as0
, meaning that allIEDGE`_
validation flags will be zeroed, no validation checks will be performed, and hence the table will be marked as containing potentially invalid iedge information.skip_validate (bool) –
If True, assume the user has checked that this operation will pass the validation tests implied by the
validate
flags. This means that the validation routines will not be run, but the tables may claim that they are valid when they are not. If False, and any of theiedges_validation
flags are set, perform the appropriate validations (default:None
treated as False)Warning
This parameter is only intended to speed up well-tested code. It should only be set to
True
when you are sure that any calling code has been tested with validation.
- Returns:
The row ID of the newly added row
For example:
new_id = iedges.add_row( cl, cr, pl, pr, child=c, parent=p, validate=ValidFlags.IEDGES_INTERVALS )
- append(obj, validate=None, skip_validate=None)[source]#
Append a row to an IEdgeTable taking the named attributes of the passed-in object to populate the row. If a left field is present in the object, is it placed in the
parent_left
andchild_left
attributes of this row (and similarly for aright
field). This allows tskit conversion.To enable tskit conversion, the special add_int_row() method is used, which is slower than the standard add_row. If you require efficiency, you are therefore recommended to call add_row() directly, ensuring that the intervals and node IDs are integers before you pass them in.
- ids_for_child(u, chromosome=None)[source]#
Return all the iedge ids for a child. If chromosome is not None, return only the iedge IDs for that chromosome.
Note
The returned IDs are only guaranteed to be ordered by left position if
IEDGES_FOR_CHILD_SECONDARY_ORDER_LEFT_ASC
is set.- Parameters:
- Returns:
A numpy array of iedge IDs
- Raises:
ValueError – If the iedges for a child are not adjacent in the table, or not ordered by chromosome
- max_pos_as_child(u, chromosome=None)[source]#
Return the maximum child position for a given child ID and chromosome. This should be equivalent np.max(child_right[child == u]), but faster because it relies on
IEDGES_WITHIN_CHILD_SORTED
.If chromosome is not given, this will return the maximum child position in any chromosome.
Returns None if there are no iedges for the given child.
Note
This does not look at edges for which this node is a parent, so e.g. root nodes may not have a maximum position defined
- min_pos_as_child(u, chromosome=None)[source]#
Return the minimum child position for a given child ID and chromosome. This should be equivalent np.max(child_right[child == u]), but faster because it relies on
IEDGES_WITHIN_CHILD_SORTED
.If chromosome is not given, this will return the minimum child position in any chromosome.
Returns None if there are no iedges for the given child.
Note
This does not look at edges for which this node is a parent, so e.g. root nodes may not have a minimum position defined
- chromosomes_as_child(u)[source]#
Iterate over the chromosome numbers for a node ID
u
, when this node is considered as a child. This will only iterate over the chromosomes which correspond to edges above the nodeu
: ifu
had chromosomes which are not traced by any edge, these will not be reported.
- transform_interval(iedge_id, interval, direction)[source]#
Given an iedge ID, use that edge to transform the provided interval. If this is an inversion then an iedge such as
[0, 10, chrA] -> [10, 0, chrB]
means that a interval such as[1, 8]
gets transformed to[9, 2]
. The chromosome transformation is not reported: it is assumed that the intervals are on chromosomes specified by the iedge
The NodeTable class#
The IndividualTable class#
The Graph class#
- class giglib.Graph(tables)[source]#
This is similar to a _tskit_
TreeSequence
. An instance can be created from aTables
object by callingTables.graph()
, which freezes the tables into a GIG.- classmethod from_tree_sequence(ts)[source]#
Construct a GIG from a tree sequence.
- Parameters:
ts (tskit.TreeSequence) – A tree sequence object
- Returns:
A new GIG object
- Return type:
- property nodes#
Return an object used to iterate over the nodes in this GIG
- property individuals#
Return an object used to iterate over the individuals in this GIG
- property iedges#
Return an object used to iterate over the iedges in this GIG
- max_pos(u, chromosome=None)[source]#
Return the maximum position in the edges above or below this node. This defines the known sequence length of the node u for a given chromosome (or the maximum position in any chromosome if not specified)
- sequence_length(u, chromosome)[source]#
Return the known sequence length of the node u: equivalent to max_position(u) but returns 0 rather than None if this is an isolated node (with no associated edges).
- to_tree_sequence(sequence_length=None)[source]#
Convert this GIG to a tree sequence. This can only be done if each iedge has the same child_left as parent_left and the same child_right as parent_right.
- Parameters:
sequence_length (int) – The sequence length in the resulting tree sequence (if
None
, take the maximum position of any edge).- Returns:
A tree sequence
- Return type:
- sample_resolve()[source]#
Sample resolve a GIG, keeping only those edge regions which transmit information to the current samples. This is rather like running the Hudson algorithm on a fixed graph, but without counting up the number of samples under each node. This is identical to the
Tables.sample_resolve()
method, but returns a new GIG instead.- Returns:
A new GIG object
- Return type: