mgkit.counts.func module
New in version 0.1.13.
Misc functions for count data
-
mgkit.counts.func.batch_load_htseq_counts(count_files, samples=None, cut_name=None)
Loads a list of htseq count result files and returns a DataFrame (IDxSAMPLE)
If samples and cut_name are both None, the sample names are the file names. Supplying a list of sample names with samples is the preferred way; cut_name is kept for backward compatibility and as an option for cases where a string replace is enough.
Returns: DataFrame with sample names as columns and gene_ids as index
Return type: pandas.DataFrame
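The naming rule described above can be sketched in plain Python. This is only an illustration, not the library code: resolve_sample_names is a hypothetical helper, and both the precedence of samples over cut_name and the matching of names to files by position are assumptions.

```python
def resolve_sample_names(count_files, samples=None, cut_name=None):
    """Hypothetical helper: pick one sample name per count file."""
    names = []
    for index, file_name in enumerate(count_files):
        if samples is not None:
            # Assumed: explicit names are matched to files by position
            names.append(samples[index])
        elif cut_name is not None:
            # Plain string replace, as the description suggests
            names.append(file_name.replace(cut_name, ''))
        else:
            # Default: the file name itself is the sample name
            names.append(file_name)
    return names


files = ['s1-counts.txt', 's2-counts.txt']
by_default = resolve_sample_names(files)
by_cut = resolve_sample_names(files, cut_name='-counts.txt')
by_list = resolve_sample_names(files, samples=['control', 'treated'])
```

With cut_name the names depend on every file sharing the same removable substring, which is one reason an explicit samples list is the more robust choice.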
-
mgkit.counts.func.filter_counts(counts_iter, info_func, gfilters=None, tfilters=None)
Returns the counts whose uid-associated gene_id and taxon_id pass the filters.
Parameters:
- counts_iter (iterable) – iterator that yields a tuple (uid, count)
- info_func (func) – function accepting a uid that returns a tuple (gene_id, taxon_id)
- gfilters (iterable) – list of filters to apply to each uid associated gene_id
- tfilters (iterable) – list of filters to apply to each uid associated taxon_id
Yields: tuple – (uid, count) that pass filters
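A minimal sketch of the filtering logic described above, written without mgkit. It assumes that each filter is a callable returning True or False and that a count is kept only if every filter passes; filter_counts_sketch and the sample data are hypothetical.

```python
def filter_counts_sketch(counts_iter, info_func, gfilters=None, tfilters=None):
    """Yield (uid, count) pairs whose gene_id and taxon_id pass all filters."""
    for uid, count in counts_iter:
        gene_id, taxon_id = info_func(uid)
        # Assumed: every gene filter must return True for the count to pass
        if gfilters is not None and not all(f(gene_id) for f in gfilters):
            continue
        # Assumed: the same rule applies to the taxon filters
        if tfilters is not None and not all(f(taxon_id) for f in tfilters):
            continue
        yield uid, count


# Made-up data: uid -> (gene_id, taxon_id)
info = {'u1': ('K00001', 2), 'u2': ('K00002', 2157), 'u3': ('K00001', 2157)}
counts = [('u1', 10), ('u2', 5), ('u3', 7)]

kept = list(
    filter_counts_sketch(
        counts,
        info.get,
        gfilters=[lambda gene_id: gene_id == 'K00001'],
        tfilters=[lambda taxon_id: taxon_id == 2157],
    )
)
```

Only u3 satisfies both the gene and the taxon filter, so kept holds a single pair.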
-
mgkit.counts.func.from_gff(annotations, samples, ann_func=None, sample_func=None)
New in version 0.3.1.
Loads count data from a GFF file, only for the requested samples. By default the function returns a DataFrame where the index is the uid of each annotation and the columns are the requested samples.
This can be customised by supplying ann_func and sample_func. sample_func is a function that accepts a sample name and is expected to return a string or a tuple; its return value is used to change the columns in the DataFrame. ann_func must accept an mgkit.io.gff.Annotation instance and return an iterable, with each iteration yielding either a single element or a tuple (for a MultiIndex DataFrame); the count of that annotation is added to each element yielded.
Returns: dataframe with the count data, columns are the samples and rows the annotation counts (unless mapped with ann_func)
Return type: DataFrame
Examples:
Assuming we have a list of annotations and sample SAMPLE1 and SAMPLE2 we can obtain the count table for all annotations with this
>>> from_gff(annotations, ['SAMPLE1', 'SAMPLE2'])
Assuming we want to group the samples, for example treatment1, treatment2 and control1, control2 into a MultiIndex DataFrame column
>>> sample_func = lambda x: ('T' if x.startswith('t') else 'C', x)
>>> from_gff(annotations, ['treatment1', 'treatment2', 'control1', 'control2'], sample_func=sample_func)
Annotations can be mapped to other levels for example instead of using the uid that is the default, it can be mapped to the gene_id, taxon_id information that is included in the annotation, resulting in a MultiIndex index for the rows, with (gene_id, taxon_id) as key.
>>> ann_func = lambda x: [(x.gene_id, x.taxon_id)]
>>> from_gff(annotations, ['SAMPLE1', 'SAMPLE2'], ann_func=ann_func)
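To see what keys the two functions in the examples produce, they can be run on their own, without mgkit. In this sketch FakeAnnotation is a hypothetical stand-in for mgkit.io.gff.Annotation that only carries the attributes used here, and the identifiers are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation; only the attributes used below
FakeAnnotation = namedtuple('FakeAnnotation', ['uid', 'gene_id', 'taxon_id'])

# Same functions as in the examples above
sample_func = lambda x: ('T' if x.startswith('t') else 'C', x)
ann_func = lambda x: [(x.gene_id, x.taxon_id)]

# Column keys: one (group, sample) tuple per requested sample
columns = [
    sample_func(name)
    for name in ['treatment1', 'treatment2', 'control1', 'control2']
]

# Row keys: one (gene_id, taxon_id) tuple for this annotation
row_keys = ann_func(FakeAnnotation('uid1', 'K00001', 839))
```

Because ann_func returns an iterable, a single annotation could also contribute its count to several row keys by returning more than one element.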
-
mgkit.counts.func.get_uid_info(info_dict, uid)
Simple function to get a value from a dictionary of tuples (gene_id, taxon_id)
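A function of this shape can be turned into the single-argument info_func that other functions in this module expect by binding the dictionary with functools.partial. The sketch below uses a hypothetical get_uid_info_sketch with the same signature; how the library function handles a missing uid is not stated here, and a plain dictionary lookup is assumed.

```python
from functools import partial


def get_uid_info_sketch(info_dict, uid):
    """Return the (gene_id, taxon_id) tuple stored for uid."""
    # Assumed: a plain lookup, so an unknown uid raises KeyError
    return info_dict[uid]


# Made-up data: uid -> (gene_id, taxon_id)
info_dict = {'u1': ('K00001', 2), 'u2': ('K00002', 2157)}

# Bind the dictionary so the result accepts a uid as its only argument
info_func = partial(get_uid_info_sketch, info_dict)
gene_id, taxon_id = info_func('u1')
```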
-
mgkit.counts.func.get_uid_info_ann(annotations, uid)
Simple function to get a value from a dictionary of annotations
-
mgkit.counts.func.load_counts_from_gff(annotations, elem_func=<function <lambda>>, sample_func=None, nozero=True)
New in version 0.2.5.
Loads the counts for each annotation, which are stored in the annotation's counts_ attributes. Annotations with a total of 0 counts are skipped by default (nozero=True). The row index is set to the uid of the annotation and the column to the sample name. The functions used to transform the indices expect the annotation (for the row, elem_func) and the sample name (for the column, sample_func).
Parameters:
- annotations (iter) – iterable of annotations
- elem_func (func) – function that accepts an annotation and returns a str/int for an Index or a tuple for a MultiIndex, defaults to returning the uid of the annotation
- sample_func (func, None) – function that accepts the sample name and returns a tuple for a MultiIndex. Defaults to None so no transformation is performed
- nozero (bool) – if True, annotations with no counts are skipped
-
mgkit.counts.func.load_deseq2_results(file_name, taxon_id=None)
New in version 0.1.14.
Reads a CSV file containing DESeq2 results, adding a taxon_id to the index for concatenating multiple results from different taxonomic groups.
Parameters: file_name (str) – file name of the CSV
Returns: a MultiIndex DataFrame with the results
Return type: pandas.DataFrame
-
mgkit.counts.func.load_htseq_counts(file_handle, conv_func=<type 'int'>)
Changed in version 0.1.15: added conv_func parameter
Loads an HTSeq-count result file
Yields: tuple – first element is the gene_id and the second is the count
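As an illustration of the yielded tuples, the sketch below parses a small in-memory file. It assumes the usual htseq-count layout of two tab-separated columns, with the summary rows prefixed by two underscores; whether and how the library function skips those rows is not stated here, and load_htseq_counts_sketch is a hypothetical name.

```python
import io


def load_htseq_counts_sketch(file_handle, conv_func=int):
    """Yield (gene_id, count) from a two-column, tab-separated file."""
    for line in file_handle:
        line = line.strip()
        if not line:
            continue
        gene_id, count = line.split('\t')
        # Assumed: summary rows such as __no_feature are not gene counts
        if gene_id.startswith('__'):
            continue
        # conv_func converts the count column, e.g. int or float
        yield gene_id, conv_func(count)


example = io.StringIO(
    "geneA\t12\n"
    "geneB\t0\n"
    "__no_feature\t340\n"
    "__ambiguous\t7\n"
)
rows = list(load_htseq_counts_sketch(example))
```

Passing conv_func=float would yield the same gene_ids with floating point counts, which can be convenient when the counts are normalised later.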
-
mgkit.counts.func.load_sample_counts(info_dict, counts_iter, taxonomy, inc_anc=None, rank=None, gene_map=None, ex_anc=None, include_higher=True, cached=True, uid_used=None)
Changed in version 0.1.14: added cached argument
Changed in version 0.1.15: added uid_used parameter
Changed in version 0.2.0: info_dict can be a function
Reads sample counts, filtering and mapping them if requested. It’s an example of the usage of the above functions.
Parameters:
- info_dict (dict) – dictionary that has uid as key and (gene_id, taxon_id) as value. Alternatively, a function that accepts a uid as its sole argument and returns (gene_id, taxon_id)
- counts_iter (iterable) – iterable that yields a (uid, count)
- taxonomy – taxonomy instance
- inc_anc (int, list) – ancestor taxa to include
- rank (str) – rank to which the counts are mapped
- gene_map (dict) – dictionary with the gene mappings
- ex_anc (int, list) – ancestor taxa to exclude
- include_higher (bool) – if False, any rank different than the requested one is discarded
- cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
- uid_used (None, dict) – an empty dictionary in which to store the uids that were assigned to each key of the returned pandas.Series. If None, no information is saved
Returns: array with MultiIndex (gene_id, taxon_id) with the filtered and mapped counts
Return type: pandas.Series
-
mgkit.counts.func.load_sample_counts_to_genes(info_func, counts_iter, taxonomy, inc_anc=None, gene_map=None, ex_anc=None, cached=True, uid_used=None)
New in version 0.1.14.
Changed in version 0.1.15: added uid_used parameter
Reads sample counts, filtering and mapping them if requested. It’s a variation of load_sample_counts(), with the counts being mapped only to each specific gene_id. Another difference is that nothing is assumed about the first parameter except that, when called with a uid, it returns a (gene_id, taxon_id) tuple.
Parameters:
- info_func (callable) – any callable that accepts a uid as its only parameter and returns (gene_id, taxon_id)
- counts_iter (iterable) – iterable that yields a (uid, count)
- taxonomy – taxonomy instance
- inc_anc (int, list) – ancestor taxa to include
- rank (str) – rank to which the counts are mapped
- gene_map (dict) – dictionary with the gene mappings
- ex_anc (int, list) – ancestor taxa to exclude
- cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
- uid_used (None, dict) – an empty dictionary in which to store the uids that were assigned to each key of the returned pandas.Series. If None, no information is saved
Returns: array with Index gene_id with the filtered and mapped counts
Return type: pandas.Series
-
mgkit.counts.func.load_sample_counts_to_taxon(info_func, counts_iter, taxonomy, inc_anc=None, rank=None, ex_anc=None, include_higher=True, cached=True, uid_used=None)
New in version 0.1.14.
Changed in version 0.1.15: added uid_used parameter
Reads sample counts, filtering and mapping them if requested. It’s a variation of load_sample_counts(), with the counts being mapped only to each specific taxon. Another difference is that nothing is assumed about the first parameter except that, when called with a uid, it returns a (gene_id, taxon_id) tuple.
Parameters:
- info_func (callable) – any callable that accepts a uid as its only parameter and returns (gene_id, taxon_id)
- counts_iter (iterable) – iterable that yields a (uid, count)
- taxonomy – taxonomy instance
- inc_anc (int, list) – ancestor taxa to include
- rank (str) – rank to which the counts are mapped
- ex_anc (int, list) – ancestor taxa to exclude
- include_higher (bool) – if False, any rank different than the requested one is discarded
- cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
- uid_used (None, dict) – an empty dictionary in which to store the uids that were assigned to each key of the returned pandas.Series. If None, no information is saved
Returns: array with Index taxon_id with the filtered and mapped counts
Return type: pandas.Series
-
mgkit.counts.func.map_counts(counts_iter, info_func, gmapper=None, tmapper=None, index=None, uid_used=None)
Changed in version 0.1.14: added index parameter
Changed in version 0.1.15: added uid_used parameter
Maps counts according to the gmapper and tmapper functions. Each mapped gene ID count is the sum of the counts of all uids that have the same ID(s). The same is true for the taxa.
Parameters:
- counts_iter (iterable) – iterator that yields a tuple (uid, count)
- info_func (func) – function accepting a uid that returns a tuple (gene_id, taxon_id)
- gmapper (func) – function that accepts a gene_id and returns a list of mapped IDs
- tmapper (func) – function that accepts a taxon_id and returns a new taxon_id
- index (None, str) – if None, the index of the Series is (gene_id, taxon_id); if a str, it can be either gene or taxon, to specify a single value
- uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
Returns: array with MultiIndex (gene_id, taxon_id) with the mapped counts
Return type: pandas.Series
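The summation described above can be sketched with a plain dictionary. This is not the library code: map_counts_sketch is hypothetical, it returns a dict instead of a pandas.Series, and it leaves out the index and uid_used parameters. It also assumes that a uid mapped to several gene IDs adds its full count to each of them.

```python
from collections import defaultdict


def map_counts_sketch(counts_iter, info_func, gmapper=None, tmapper=None):
    """Sum counts per (mapped gene_id, mapped taxon_id) key."""
    mapped = defaultdict(int)
    for uid, count in counts_iter:
        gene_id, taxon_id = info_func(uid)
        # gmapper returns a list of mapped IDs; without it keep the gene_id
        gene_ids = gmapper(gene_id) if gmapper is not None else [gene_id]
        # tmapper returns a single new taxon_id
        if tmapper is not None:
            taxon_id = tmapper(taxon_id)
        for mapped_gene in gene_ids:
            # Assumed: the full count is added to every mapped ID
            mapped[(mapped_gene, taxon_id)] += count
    return dict(mapped)


# Made-up data
info = {'u1': ('K1', 10), 'u2': ('K2', 11), 'u3': ('K1', 11)}
gene_map = {'K1': ['ko1'], 'K2': ['ko1', 'ko2']}
taxon_map = {10: 1, 11: 1}
counts = [('u1', 3), ('u2', 4), ('u3', 5)]

result = map_counts_sketch(
    counts,
    info.get,
    gmapper=lambda gene_id: gene_map.get(gene_id, []),
    tmapper=taxon_map.get,
)
```

All three uids end up under taxon 1, so the key ('ko1', 1) collects 3 + 4 + 5 and ('ko2', 1) collects the 4 from u2.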
-
mgkit.counts.func.map_counts_to_category(counts, gene_map, nomap=False, nomap_id='NOMAP')
Used to map the counts from a certain gene identifier to another. Genes with no mappings are not counted, unless nomap=True, in which case they are counted as nomap_id.
Parameters:
- counts (iterator) – an iterator that yields a tuple, with the first value being the gene_id and the second value the count for it
- gene_map (dictionary) – a dictionary whose keys are the gene_id yielded by counts and whose values are iterables of mapping identifiers
- nomap (bool) – if False, counts for genes with no mappings in gene_map are discarded; if True, they are counted as nomap_id
- nomap_id (str) – name of the mapping for genes with no mappings
Returns: mapped counts
Return type: pandas.Series
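The effect of nomap can be shown with a small stand-alone sketch. The function below is hypothetical and returns a dict rather than a pandas.Series; it assumes that a gene with several mappings contributes its full count to each mapping.

```python
from collections import Counter


def map_counts_to_category_sketch(counts, gene_map, nomap=False,
                                  nomap_id='NOMAP'):
    """Sum gene counts per mapping identifier."""
    totals = Counter()
    for gene_id, count in counts:
        mappings = gene_map.get(gene_id)
        if not mappings:
            # Genes without mappings are dropped unless nomap is True
            if nomap:
                totals[nomap_id] += count
            continue
        for map_id in mappings:
            # Assumed: the full count goes to every mapping of the gene
            totals[map_id] += count
    return dict(totals)


# Made-up data: g3 has no entry in gene_map
counts = [('g1', 2), ('g2', 3), ('g3', 4)]
gene_map = {'g1': ['A'], 'g2': ['A', 'B']}

dropped = map_counts_to_category_sketch(counts, gene_map)
kept = map_counts_to_category_sketch(counts, gene_map, nomap=True)
```

The only difference between the two results is the extra NOMAP entry holding the count of g3.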
-
mgkit.counts.func.map_gene_id_to_map(gene_map, gene_id)
Function that extracts a list of gene mappings from a dictionary and returns an empty list if the gene_id is not found.
-
mgkit.counts.func.map_taxon_id_to_rank(taxonomy, rank, taxon_id, include_higher=True)
Maps a taxon_id to the requested taxon rank. Returns None if include_higher is False and the found rank is not the one requested.
Internally uses mgkit.taxon.UniprotTaxonomy.get_ranked_taxon()
Returns: if the mapping is successful, the ranked taxon_id is returned, otherwise None is returned
Return type:
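The role of include_higher can be illustrated with a toy taxonomy. Everything in this sketch is an assumption made for illustration: the taxonomy is a plain dict rather than a UniprotTaxonomy instance, intermediate ranks are left out, and get_ranked_taxon_sketch simply returns the last taxon reached when the requested rank is not found on the way to the root.

```python
# Toy taxonomy, not UniprotTaxonomy: taxon_id -> (rank, parent_id)
TOY_TAXONOMY = {
    2: ('superkingdom', None),
    1224: ('phylum', 2),
    561: ('genus', 1224),
    562: ('species', 561),
}


def get_ranked_taxon_sketch(taxonomy, taxon_id, rank):
    """Walk towards the root until a taxon of the requested rank is found."""
    while True:
        taxon_rank, parent_id = taxonomy[taxon_id]
        # Assumed fallback: stop at the root if the rank was never found
        if taxon_rank == rank or parent_id is None:
            return taxon_id
        taxon_id = parent_id


def map_taxon_id_to_rank_sketch(taxonomy, rank, taxon_id,
                                include_higher=True):
    ranked_id = get_ranked_taxon_sketch(taxonomy, taxon_id, rank)
    # With include_higher=False only an exact rank match is accepted
    if not include_higher and taxonomy[ranked_id][0] != rank:
        return None
    return ranked_id


species_to_genus = map_taxon_id_to_rank_sketch(TOY_TAXONOMY, 'genus', 562)
phylum_kept = map_taxon_id_to_rank_sketch(TOY_TAXONOMY, 'genus', 1224)
phylum_dropped = map_taxon_id_to_rank_sketch(
    TOY_TAXONOMY, 'genus', 1224, include_higher=False
)
```

A species maps to its genus, while a phylum has no genus above it: with include_higher=True the higher taxon found is kept, and with include_higher=False the result is None.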