MGKit GFF Specifications
The GFF produced by MGKit follows the conventions of GFF/GTF files, but it provides some additional fields in the 9th column, which translate to a Python dictionary when an annotation is loaded into an Annotation instance.
The 9th column is a list of key=value items separated by semicolons (;). Each value is expected to be quoted with double quotes and must not include a semicolon or other characters that make parsing difficult. MGKit uses urllib.quote() to encode those characters and also " ()/". mgkit.io.gff.from_gff() uses urllib.unquote() to decode the values when setting them.
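The round trip described above can be sketched in a few lines. This is a minimal illustration, not MGKit's own code: it assumes Python 3, where the quoting functions live in urllib.parse, and the helper names encode_attributes and decode_attributes are hypothetical. It percent-encodes every reserved character, which may differ from the exact set of characters MGKit handles.

```python
from urllib.parse import quote, unquote


def encode_attributes(attributes):
    """Serialise a dictionary into a 9th-column string of key="value" items.

    Hypothetical helper, not part of MGKit.
    """
    items = []
    for key, value in attributes.items():
        # safe="" percent-encodes every reserved character, including
        # ';', '"', '(', ')', '/' and the space
        encoded = quote(str(value), safe="")
        items.append('{}="{}"'.format(key, encoded))
    return ";".join(items)


def decode_attributes(column):
    """Parse a 9th-column string back into a dictionary of decoded strings.

    Hypothetical helper, not part of MGKit.
    """
    attributes = {}
    for item in column.strip().split(";"):
        if not item.strip():
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        # drop the surrounding double quotes, then undo the percent-encoding
        attributes[key.strip()] = unquote(value.strip().strip('"'))
    return attributes


column = encode_attributes({"gene_id": "K00001", "taxon_name": "a;b (c)/d"})
print(column)
print(decode_attributes(column))
```

All values come back as strings after decoding, so numeric keys such as taxon_id need an explicit conversion by the caller.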
Warning
As the last column translates to a dictionary in the data structures, duplicate keys are not allowed. mgkit.io.gff.from_gff() raises an exception if any are found.
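The duplicate-key rule can be illustrated with a small standalone parser. This is only a sketch: the function name parse_unique_attributes is hypothetical, and ValueError is a stand-in, since the exception type raised by mgkit.io.gff.from_gff() is not stated here.

```python
def parse_unique_attributes(column):
    """Parse a 9th-column string, refusing keys that appear more than once.

    Hypothetical helper, not part of MGKit.
    """
    attributes = {}
    for item in column.strip().split(";"):
        if not item.strip():
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        key = key.strip()
        if key in attributes:
            # ValueError is an assumption; MGKit may raise a different type
            raise ValueError("duplicate key in 9th column: {!r}".format(key))
        attributes[key] = value.strip().strip('"')
    return attributes


print(parse_unique_attributes('gene_id="K00001";db="UNIPROT-SP"'))
```

Failing early avoids silently keeping only the last value of a repeated key, which is what a plain dictionary assignment would do.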
Reserved Values
Any key can be added to a GFF annotation, but MGKit expects a few keys to be in the GFF annotation, as summarised in the following tables.
Key | Value | Explanation |
---|---|---|
gene_id | any string | used to identify the gene predicted |
db | any string, like UNIPROT-SP, UNIPROT-TR, NCBI-NT | identifies the database used to make the gene_id prediction |
taxon_db | any string, like UNIPROT-SP, UNIPROT-TR, NCBI-NT | identifies the database used to make the taxon_id prediction |
dbq | integer | identifies the quality of the database, used when filtering annotations |
taxon_id | integer | identifies the annotation taxon, NCBI taxonomy is used |
uid | string | unique identifier for the annotation, any string is accepted but a value is assigned by using uuid.uuid4() |
cov and {any}_cov | integer | coverage for the annotation over all samples, keys ending with _cov indicate coverage for each sample |
exp_syn, exp_nonsyn | integer | used for expected number of synonymous and non-synonymous changes for the annotation |
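As a rough illustration of two of the reserved keys, the sketch below assigns a uid with uuid.uuid4() when one is missing and collects the per-sample coverage keys. It works on a plain dictionary of attributes; the helper names ensure_uid and sample_coverage are hypothetical and are not MGKit functions.

```python
import uuid


def ensure_uid(attributes):
    """Add a uid generated with uuid.uuid4() unless one is already set."""
    if "uid" not in attributes:
        attributes["uid"] = str(uuid.uuid4())
    return attributes


def sample_coverage(attributes):
    """Return {sample: coverage} for every key ending with _cov.

    The plain "cov" key (coverage over all samples) has no underscore
    and is therefore not included.
    """
    suffix = "_cov"
    return {
        key[:-len(suffix)]: int(value)
        for key, value in attributes.items()
        if key.endswith(suffix)
    }


annotation = ensure_uid({"gene_id": "K00001", "cov": "10", "S1_cov": "4", "S2_cov": "6"})
print(annotation["uid"])
print(sample_coverage(annotation))
```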
The following keys are added by different scripts and may be used in different scripts or annotation methods.
Key | Value | Explanation | Used |
---|---|---|---|
taxon_name | string | name of the taxon | not used |
lineage | string | taxon lineage | not used |
EC | comma separated values | list of EC numbers associated to the annotation | used by mgkit.io.gff.Annotation.get_ec() |
map_{any} | comma separated values | list of mapping to a specific db (e.g. eggNOG -> map_EGGNOG) | used by mgkit.io.gff.Annotation.get_mapping() |
counts_{any} | float | Stores the count data for a sample (e.g. counts_Sample1) | used by script add-gff-info |
fpkms_{any} | float | Stores the FPKM data for a sample (e.g. fpkms_Sample1) | used by script add-gff-info |
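The value formats in the table above can be read with very little code. The following sketch operates on a plain dictionary of already decoded attributes; split_values and sample_values are hypothetical helpers written for illustration and do not reproduce how Annotation.get_ec() or Annotation.get_mapping() are implemented.

```python
def split_values(attributes, key):
    """Split a comma separated value (e.g. EC or map_EGGNOG) into a list."""
    raw = attributes.get(key, "")
    return [value.strip() for value in raw.split(",") if value.strip()]


def sample_values(attributes, prefix):
    """Return {sample: float} for keys starting with prefix, e.g. "counts_"."""
    return {
        key[len(prefix):]: float(value)
        for key, value in attributes.items()
        if key.startswith(prefix)
    }


attributes = {
    "EC": "1.1.1.1,2.7.7.6",
    "counts_Sample1": "12",
    "fpkms_Sample1": "3.5",
}
print(split_values(attributes, "EC"))
print(sample_values(attributes, "counts_"))
print(sample_values(attributes, "fpkms_"))
```

A missing key yields an empty list or dictionary rather than an error, which suits keys that only some scripts add.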