Python API

Contents

Python API#

This page provides formal documentation for the giglib Python API.

The Tables class#

Note

Some Tables methods require the data stored in the tables (in particular that in the tables.IEdgeTable) to conform to certain validity conditions before running. See Validity flags for Tables.

class giglib.Tables(time_units=None, *, initial_sizes=None)[source]#

A group of tables which can describe a Genetic Inheritance Graph (GIG). This class is intentionally similar to a tskit TableCollection.

add_iedge_row(*args, validate=None, skip_validate=None, **kwargs)[source]#

Similar to the add_row() method of the IEdgeTable stored in this set of tables, but also (optionally) can be used to validate features of the nodes used (e.g. that the parent node time is older than the child node time).

Parameters:
  • validate (int) – A set of bitflags (as listed in ValidFlags) specifying which iedge table validation checks to perform. In particular, this can include the IEDGES_PARENT_OLDER_THAN_CHILD and IEDGES_PRIMARY_ORDER_CHILD_TIME_DESC flags, which will check the node table for those properties. Other flags will be passed to tables.IEdgeTable.add_row().

  • skip_validate (bool) –

    If True, assume the user has checked that this operation will pass the validation tests implied by the validate flags. This means that the validation routines will not be run, but the tables may claim that they are valid when they are not. If False, and any of the IEDGES_... validation flags are set, perform the appropriate validations. Default: None treated as False.

    Warning

    The skip_validate parameter is only intended to speed up well-tested code. It should only be set to True when you are sure that any calling code has been tested with validation.

freeze()[source]#

Freeze all tables so they cannot be modified

clear()[source]#

Clear all tables

copy(omit_iedges=None)[source]#

Return an unfrozen deep copy of the tables.

Parameters:

omit_iedges (bool) – If True do not copy the iedges table, use an empty one.

Returns:

A deep copy of the tables

Return type:

giglib.Tables

sort(shrink_memory=False)[source]#

Sort the tables in place. Currently only affects the iedges table. Sorting criteria match those used in tskit (see https://tskit.dev/tskit/docs/stable/data-model.html#edge-requirements).

graph()[source]#

Return a genetic inheritance graph (Graph) object based on these tables. This requires the following iedge conditions to be met:

  • The child_left, child_right, parent_left, parent_right, parent, and child values must all be provided and be non-negative integers. If child_chromosome and parent_chromosome are provided, they also must be non-negative integers (otherwise a default chromosome ID of 0 is assumed). This corresponds to the flag IEDGES_INTEGERS.

  • The intervals must be valid (i.e. abs(child_left - child_right) is finite, nonzero, and equal to abs(parent_left - parent_right). This corresponds to the flag IEDGES_INTERVALS.

  • Each chromosomal position in a child is covered by only one interval (i.e. for any particular chromosome, intervals for a given child do not overlap). This corresponds to the flag IEDGES_FOR_CHILD_NONOVERLAPPING

  • The child and parent IDs of an iedge must correspond to nodes in in the node table in which the parent is older (has a strictly greater time) than the child. This corresponds to the flag IEDGES_PARENT_OLDER_THAN_CHILD

  • If an edge ID is provided for an iedge, and it is not NULL (-1), then other iedges with the same edge ID must have the same child and parent IDs. This corresponds to the flag IEDGES_SAME_PARENT_CHILD_FOR_EDGE

  • For consistency and as an enforced convention, the child_left position for an iedge must be strictly less than the child_right position. This corresponds to the flag IEDGES_CHILD_INTERVAL_POSITIVE

To create a valid GIG, the iedges must also be sorted into a canonical order such that the following conditions are met (you can also accomplish this by calling Tables.sort() on the tables first):

Note

The nodes are not required to be in any particular order (e.g. by time)

Returns:

A genetic inheritance graph object

Return type:

giglib.Graph

property sample_ids#

Return the IDs of all samples in the nodes table

Return type:

numpy.ndarray

change_times(timedelta)[source]#

Add a value to all times (nodes and mutations)

decapitate(time)[source]#

Remove nodes that are older than a certain time, and edges that have those as parents. Also remove individuals associated with those nodes.

Parameters:

time (float) – The time before which nodes should be removed

Returns:

An array mapping old node ids to new node ids (or -1 if removed)

Return type:

numpy.ndarray

classmethod from_tree_sequence(ts, *, chromosome=None, **kwargs)[source]#

Create a Tables object from a tskit.TreeSequence. To create a GIG directly, use Graph.from_tree_sequence() which simply wraps this method.

Parameters:
  • ts (tskit.TreeSequence) – The tree sequence on which to base the Tables object

  • chromosome (int) – The chromosome number to use for all interval edges

  • kwargs – Other parameters passed to the Tables constructor

Returns:

A Tables object representing the tree sequence

Return type:

Tables

sample_resolve(sort=True)[source]#

Sample resolve the Tables, keeping only those edge regions which transmit information to the current samples. This is rather like running the Hudson algorithm on a fixed graph, but without counting up the number of samples under each node. This requires the equivalent of making a new edges table, so is a costly operation.

The algorithm is implemented by using a stack that contains intervals for each node, ordered by node time (oldest first). When considering a node, we pop the (youngest) node off the end of the stack, which ensures that we have collected all the intervals with that node as a parent, before passing inheritance information upwards

remove_non_ancestral_nodes()[source]#

Subset the node table to remove any nonsample nodes that are not referenced in the iedges table

find_mrca_regions(u, v, *, u_chromosomes=None, v_chromosomes=None, time_cutoff=None)[source]#

Find all regions between nodes u and v (potentially restricted to a list of chromosomes in u and v) that share a most recent common ancestor in the GIG which is more recent than time_cutoff. Returns a dict of dicts of the following form

{
    MRCA_node_ID1 : {(X, Y, CHR): (
        [(uA, uB, uCHRa), (uC, uD, uCHRb), ...],
        [(vA, vB, vCHR), ...]
    )},
    MRCA_node_ID2 : ...,
    ...
}

Where in each inner dict, the key (X, Y, CHR) gives an interval (with X < Y) in the MRCA node, and the value is a 2-tuple giving the corresponding intervals in u and v. In the example above there is a list of two corresponding intervals in u: (uA, uB, cCHRa) and (uC, uD, cCHRb) representing a duplication of the MRCA interval into u. If uA > uB then that interval in u is inverted relative to that in the MRCA node.

Implementation-wise, this is similar to the sample_resolve algorithm, but

  1. Instead of following the ancestry of all samples upwards, we follow just two of them. This means we are unlikely to visit all nodes in the graph, and so we create a dynamic stack (using the sortedcontainers.SortedDict class, which is kept ordered by node time.

  2. We do not edit/change the edge intervals as we go. Instead we simply return the intervals shared between the two samples.

  3. The intervals “know” whether they were originally associated with u or v. If we find an ancestor with both u and v-associated intervals this indicates a potential common ancestor, in which coalescence could have occurred.

  4. We do not add older regions to the stack if they have coalesced.

  5. We have a time cutoff, so that we don’t have to go all the way back to the “origin of life”

  6. We have to keep track of the offset of the current interval into the original genome, to allow mapping back to the original coordinates

Parameters:
  • u (int) – The first node

  • v (int) – The second node

  • u_chromosomes (int) – Restrict the MRCAs search to this list of chromosomes from node u. If None (default), return MRCAs for all u’s chromosomes

  • v_chromosomes (int) – Restrict MRCAs search to this list of chromosomes from node v. If None (default), return MRCAs for all v’s chromosomes

Validity flags for Tables#

See also

Validity requirements for a GIG are described in Tables.graph().

class giglib.ValidFlags(value)[source]#

Flags involving validity of the GIG. Those starting with IEDGES\_ are specific to the iedges table.

IEDGES_INTEGERS = 1#

If set, iedge data is guaranteed to be integers

IEDGES_INTERVALS = 2#

If set, intervals are guaranteed finite and have parent span == child span

IEDGES_FOR_CHILD_NONOVERLAPPING = 4#

IF set, each genomic position in a child’s chromosome is guaranteed to have only one interval above

IEDGES_PARENT_OLDER_THAN_CHILD = 8#

If set, all parents are older than their children, guaranteeing the GIG is acyclic (note this requires knowledge of the times in the nodes table). This flag also guarantees that nodes with those parent and child IDs exist in the nodes table.

IEDGES_SAME_PARENT_CHILD_FOR_EDGE = 16#

If set, iedges with the same edge_ID are guaranteed to have the same combination of parentID / childID. Note the inverse may not be true: iedges with the same parentID / childID can have different edge_IDs.

IEDGES_CHILD_INTERVAL_POSITIVE = 32#

Set if child_left < child_right (inversions can have parent_left > parent_right)

IEDGES_FOR_CHILD_ADJACENT = 64#

Set if all iedges with the same child ID are adjacent in the table

IEDGES_FOR_CHILD_PRIMARY_ORDER_CHR_ASC = 128#

Set if within a set of iedges for the same child, iedges are ordered first by chromosome number

IEDGES_FOR_CHILD_SECONDARY_ORDER_LEFT_ASC = 256#

Set if, within a set of iedges for the same child & chromosome, iedges are ordered by left position

IEDGES_PRIMARY_ORDER_CHILD_TIME_DESC = 512#

Set if the adjacent rows for a child are ordered in descending order of child node time. Note that this requires knowledge of the times in the nodes table.

IEDGES_SECONDARY_ORDER_CHILD_ID_ASC = 1024#

Set if iedges are ordered such that where iedge children are tied at the same time, iedges with the lowest child ID come first. Note that although this is often not algorithmically necessary, it helps to creates a canonical iedge ordering.

IEDGES_COMBO_STANDALONE = 503#

The combination of requirements that simply involve the iedge table and no linked tables

IEDGES_WITHIN_CHILD_SORTED = 448#

The combination of requirements that identify if edges for a give child are sorted. Some algorithms (e.g. iterating over edges for a child / chromosome only require sorting by child ID and within the child

IEDGES_SORTED = 1984#

A stricter canonical order of iedge sorting

Specific tables#

The IEdgeTable class#

class giglib.tables.IEdgeTable(initial_size=None)[source]#

A table containing the iedges. This contains more complex logic than other GIG tables as we want to store and check various forms of validity, so that we can run algorithms directly on the tables rather than on a frozen GIG. For example, the ids_for_child() method requires IEDGES_FOR_CHILD_ADJACENT to be true.

copy()[source]#

Returns an unfrozen deep copy of this table

clear()[source]#

Deletes all rows in this table.

add_row(child_left, child_right, parent_left, parent_right, *, child, parent, child_chromosome=0, parent_chromosome=0, edge=-1, validate=None, skip_validate=None)[source]#

The canonical way to add data to an IEdgeTable

See also

Tables.add_iedge_row() which is a wrapper around this method that also allows validation of parent and child node times

Parameters:
  • validate (int) – A set of bitflags (as listed in ValidFlags) specifying which iedge table validation checks should be performed when adding this data. If the existing data is valid, and the new data is added in a way that preserves the existing validity, then calling \(has_bitflag\) for the flags in this set will return True. If any of the bits in iedges_validation are 0, that particular validation will not be performed: in this case the has_bitflag method will return False for certain flags, and some table algorithms will not run. For instance, using the ids_for_child() method is only valid if IEDGES_FOR_CHILD_ADJACENT and IEDGES_FOR_CHILD_PRIMARY_ORDER_CHR_ASC are set, so if you wish to use that method you should add those flags to validate. Defaults to None which is treated as 0, meaning that all IEDGE`_ validation flags will be zeroed, no validation checks will be performed, and hence the table will be marked as containing potentially invalid iedge information.

  • skip_validate (bool) –

    If True, assume the user has checked that this operation will pass the validation tests implied by the validate flags. This means that the validation routines will not be run, but the tables may claim that they are valid when they are not. If False, and any of the iedges_validation flags are set, perform the appropriate validations (default: None treated as False)

    Warning

    This parameter is only intended to speed up well-tested code. It should only be set to True when you are sure that any calling code has been tested with validation.

Returns:

The row ID of the newly added row

For example:

new_id = iedges.add_row(
    cl, cr, pl, pr, child=c, parent=p, validate=ValidFlags.IEDGES_INTERVALS
)
append(obj, validate=None, skip_validate=None)[source]#

Append a row to an IEdgeTable taking the named attributes of the passed-in object to populate the row. If a left field is present in the object, is it placed in the parent_left and child_left attributes of this row (and similarly for a right field). This allows tskit conversion.

To enable tskit conversion, the special add_int_row() method is used, which is slower than the standard add_row. If you require efficiency, you are therefore recommended to call add_row() directly, ensuring that the intervals and node IDs are integers before you pass them in.

ids_for_child(u, chromosome=None)[source]#

Return all the iedge ids for a child. If chromosome is not None, return only the iedge IDs for that chromosome.

Note

The returned IDs are only guaranteed to be ordered by left position if IEDGES_FOR_CHILD_SECONDARY_ORDER_LEFT_ASC is set.

Parameters:
  • u (int) – The child ID

  • chromosome (int) – The chromosome number. If None (default), ruturn all the iedge IDs for the child, sorted by chromosome.

Returns:

A numpy array of iedge IDs

Raises:

ValueError – If the iedges for a child are not adjacent in the table, or not ordered by chromosome

max_pos_as_child(u, chromosome=None)[source]#

Return the maximum child position for a given child ID and chromosome. This should be equivalent np.max(child_right[child == u]), but faster because it relies on IEDGES_WITHIN_CHILD_SORTED.

If chromosome is not given, this will return the maximum child position in any chromosome.

Returns None if there are no iedges for the given child.

Note

This does not look at edges for which this node is a parent, so e.g. root nodes may not have a maximum position defined

min_pos_as_child(u, chromosome=None)[source]#

Return the minimum child position for a given child ID and chromosome. This should be equivalent np.max(child_right[child == u]), but faster because it relies on IEDGES_WITHIN_CHILD_SORTED.

If chromosome is not given, this will return the minimum child position in any chromosome.

Returns None if there are no iedges for the given child.

Note

This does not look at edges for which this node is a parent, so e.g. root nodes may not have a minimum position defined

chromosomes_as_child(u)[source]#

Iterate over the chromosome numbers for a node ID u, when this node is considered as a child. This will only iterate over the chromosomes which correspond to edges above the node u: if u had chromosomes which are not traced by any edge, these will not be reported.

transform_interval(iedge_id, interval, direction)[source]#

Given an iedge ID, use that edge to transform the provided interval. If this is an inversion then an iedge such as [0, 10, chrA] -> [10, 0, chrB] means that a interval such as [1, 8] gets transformed to [9, 2]. The chromosome transformation is not reported: it is assumed that the intervals are on chromosomes specified by the iedge

Parameters:
  • edge_id (int) – The edge specifying the transformation

  • interval (tuple(int, int)) – The (left, right) interval to transform, assumed to be on the appropriate chromosome for this iedge

  • direction (int) – Either Const.LEAFWARDS or Const.ROOTWARDS (only ROOTWARDS currently implemented)

The NodeTable class#

class giglib.tables.NodeTable(initial_size=None)[source]#

A table containing the nodes with their times

add_row(time, *, flags=None, individual=None, metadata=None)[source]#

The canonical way to add data to an NodeTable

Parameters:

flags (int) – The flags for this node

Returns:

The row ID of the newly added row

Example:

new_id = nodes.add_row(0, flags=NODE_IS_SAMPLE)

The IndividualTable class#

class giglib.tables.IndividualTable(initial_size=None)[source]#

The Graph class#

class giglib.Graph(tables)[source]#

This is similar to a _tskit_ TreeSequence. An instance can be created from a Tables object by calling Tables.graph(), which freezes the tables into a GIG.

classmethod from_tree_sequence(ts)[source]#

Construct a GIG from a tree sequence.

Parameters:

ts (tskit.TreeSequence) – A tree sequence object

Returns:

A new GIG object

Return type:

Graph

property nodes#

Return an object used to iterate over the nodes in this GIG

samples()[source]#

Iterate over the sample nodes, returning

property individuals#

Return an object used to iterate over the individuals in this GIG

property iedges#

Return an object used to iterate over the iedges in this GIG

iedges_for_parent(u, chromosome=None)[source]#

Iterate over all iedges with parent u

iedges_for_child(u, chromosome=None)[source]#

Iterate over all iedges with child u

max_pos(u, chromosome=None)[source]#

Return the maximum position in the edges above or below this node. This defines the known sequence length of the node u for a given chromosome (or the maximum position in any chromosome if not specified)

sequence_length(u, chromosome)[source]#

Return the known sequence length of the node u: equivalent to max_position(u) but returns 0 rather than None if this is an isolated node (with no associated edges).

total_sequence_length(u)[source]#

Return the sum of all the sequence lengths in all the chromosomes

to_tree_sequence(sequence_length=None)[source]#

Convert this GIG to a tree sequence. This can only be done if each iedge has the same child_left as parent_left and the same child_right as parent_right.

Parameters:

sequence_length (int) – The sequence length in the resulting tree sequence (if None, take the maximum position of any edge).

Returns:

A tree sequence

Return type:

tskit.TreeSequence

decapitate(time)[source]#

Return a new GIG with all nodes older than time removed.

Parameters:

time (float) – The cutoff time

Returns:

A new GIG object

Return type:

Graph

sample_resolve()[source]#

Sample resolve a GIG, keeping only those edge regions which transmit information to the current samples. This is rather like running the Hudson algorithm on a fixed graph, but without counting up the number of samples under each node. This is identical to the Tables.sample_resolve() method, but returns a new GIG instead.

Returns:

A new GIG object

Return type:

Graph

find_mrca_regions(*args, **kwargs)[source]#

A wrapper around the find_mrca_regions method in the tables object