add-gff-info - Add information to GFF annotations¶
Overview¶
Add more information to GFF annotations: gene mappings, coverage, taxonomy, etc.
Uniprot Command¶
If the gene_id of an annotation is a Uniprot ID, the script queries Uniprot for the requested information. At the moment the information that can be added is the taxon_id, taxon_name, lineage and mapping to EC, KO, eggNOG IDs.
It’s also possible to add mappings to other databases using the -m option with the correct identifier for the mapping, which can be found in the list of database mapping abbreviations published by Uniprot. For example, to add the mappings of Uniprot IDs to Reactome, the abbreviation column of that list gives REACTOME_ID as its identifier, so we pass -m REACTOME to the script (leaving _ID out). Mapped IDs are separated by commas.
The taxonomy IDs are not overwritten if they are found in the annotations; the -f option is provided to force the overwriting of those values.
See also MGKit GFF Specifications for more information about the GFF specifications used.
Note
As the script needs to query Uniprot a lot, it is recommended to split the GFF into several files, so that an error in the connection doesn’t waste time.
However, a cache is kept to reduce the number of connections.
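The splitting suggested in the note can be done with any tool. The following is a minimal sketch in plain Python, not part of MGKit: the names split_gff, write_chunks and chunk_size are hypothetical, and the code assumes a GFF file with one annotation per line and no embedded FASTA section.

```python
import itertools
from pathlib import Path


def split_gff(lines, chunk_size):
    """Yield lists of at most chunk_size annotation lines.

    Blank lines and comment lines (starting with '#') are skipped.
    """
    records = (
        line for line in lines
        if line.strip() and not line.startswith("#")
    )
    while True:
        chunk = list(itertools.islice(records, chunk_size))
        if not chunk:
            break
        yield chunk


def write_chunks(gff_path, out_dir, chunk_size=1000):
    """Write each chunk to out_dir/<stem>-<index>.gff and return the paths."""
    gff_path, out_dir = Path(gff_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with gff_path.open() as handle:
        for index, chunk in enumerate(split_gff(handle, chunk_size)):
            path = out_dir / f"{gff_path.stem}-{index:04d}.gff"
            path.write_text("".join(chunk))
            paths.append(path)
    return paths
```

Each resulting file can then be passed to the uniprot command separately, so a failed connection only affects one chunk.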
Taxonomy Command¶
To refine the taxonomic assignments of predicted gene annotations, the annotation sequences may be searched against a database like the NCBI nt.
This command takes as input a GFF file, one or more BLAST output files and a file with all mappings from GIDs to taxonomy IDs. More information on how to get the file can be found in the documentation of the function mgkit.io.blast.parse_gi_taxa_table().
The fasta sequences used with BLAST must be named after the uid of the annotations they refer to. One way to obtain these sequences is to use the function mgkit.io.gff.extract_nuc_seqs() and save them to a fasta file. Another option is to use the sequence command of the get-gff-info script (get-gff-info - Extract informations to GFF annotations).
The command accepts a minimum bitscore for a hit to be accepted. By default the taxon ID is selected using the top hit method, but LCA can be used instead with the -l switch.
Top Hit¶
Among all the hits found for a sequence, the best hit is the one with the maximum bitscore and identity, with the bitscore having the highest priority.
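The ranking described above can be sketched as a comparison of (bitscore, identity) pairs. This is only an illustration, not the MGKit implementation: the Hit record, its field names and the top_hits function are hypothetical, and treating the minimum bitscore as an inclusive threshold is an assumption.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
Hit = namedtuple("Hit", ["uid", "taxon_id", "bitscore", "identity"])


def top_hits(hits, min_bitscore=40.0):
    """Return {uid: best Hit}, ranking by bitscore first, then identity."""
    best = {}
    for hit in hits:
        # Hits below the minimum bitscore are ignored (assumed inclusive).
        if hit.bitscore < min_bitscore:
            continue
        current = best.get(hit.uid)
        # Tuple comparison gives the bitscore priority over the identity.
        if current is None or (
            (hit.bitscore, hit.identity) > (current.bitscore, current.identity)
        ):
            best[hit.uid] = hit
    return best
```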
LCA Taxon¶
Activated with the -l switch, it selects the last common ancestor of all taxon IDs that are from the cellular organism root in the taxonomy and are within 10 bits (the default, which can be customised with -a) of the hit with the highest bitscore. If a taxon ID is not found in the taxonomy, it is excluded. This option requires a file that contains the full taxonomy from Uniprot/NCBI. The file can be obtained with the following command:
$ download_data -x -p -m your@email
The command will output a taxonomy.pickle file that can be passed to the -x option (see download-data - Download Taxonomy from NCBI).
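To make the LCA selection more concrete, here is a minimal sketch on a toy taxonomy stored as a child to parent dictionary. It is not the MGKit code: lineage, lca_from_hits and the parents mapping are hypothetical names, the filter on the cellular organism root is left out, and both the inclusive bitscore window and the exclusion of unknown taxa before finding the top bitscore are assumptions.

```python
def lineage(taxon_id, parents):
    """Return the path from the root down to taxon_id.

    parents maps each taxon ID to its parent; the root maps to None.
    """
    path = [taxon_id]
    while parents.get(path[-1]) is not None:
        path.append(parents[path[-1]])
    return path[::-1]


def lca_from_hits(hits, parents, max_diff=10.0):
    """Return the last common ancestor of the hits close to the top bitscore.

    hits is an iterable of (taxon_id, bitscore) pairs.
    """
    # Taxon IDs missing from the taxonomy are excluded.
    hits = [(taxon_id, score) for taxon_id, score in hits if taxon_id in parents]
    if not hits:
        return None
    top = max(score for _, score in hits)
    lineages = [
        lineage(taxon_id, parents)
        for taxon_id, score in hits
        if top - score <= max_diff
    ]
    common = None
    # Walk down from the root while all lineages agree.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        common = ranks[0]
    return common
```

With the toy taxonomy {1: None, 2: 1, 3: 2, 4: 2, 5: 1}, hits on taxa 3 and 4 with similar bitscores resolve to their shared parent 2, while a much weaker hit on taxon 5 is ignored.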
Coverage Command¶
Adds coverage information from BAM alignment files to a GFF file, using the function mgkit.align.add_coverage_info(). The user needs to supply a BAM file for each sample, using the -a option, whose parameter is in the form sample,samplealg.bam. More samples can be supplied by adding more -a arguments.
Hint
As an example, to add coverage for sample1 and sample2 the command line is:
add-gff-info coverage -a sample1,sample1.bam -a sample2,sample2.bam \
inputgff outputgff
A total coverage for the annotation is also calculated and stored in the cov attribute, while each sample coverage is stored into sample_cov as per MGKit GFF Specifications.
Adding Coverage from samtools depth¶
The cov_samtools command allows the use of the output of the samtools depth command. The -aa option of samtools depth must be used, so that the output includes information about all base pairs and sequences in the BAM/SAM file. The command accepts only one sample and the corresponding file/stream from samtools, meaning that coverage information for multiple samples must be added one sample at a time. One solution is to pipe multiple commands to obtain the wanted result. For example:
$ add-gff-info cov_samtools -s SAMPLE1 -d sample1-coverage input.gff | add-gff-info cov_samtools -s SAMPLE2 -d sample2-coverage - output.gff
This command will add the coverage information for SAMPLE1 and SAMPLE2 from the respective files.
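To show the kind of data that samtools depth -aa provides, the sketch below reads its three tab separated columns (sequence name, 1-based position, depth) and averages the depth over an annotation interval. The names read_depth and mean_coverage are hypothetical, and using the mean depth is an assumption made for illustration; the value MGKit stores may be computed differently.

```python
from collections import defaultdict


def read_depth(lines):
    """Parse 'samtools depth -aa' lines into {seq_id: [depth at 1, 2, ...]}.

    Relies on -aa reporting every position of every sequence, in order.
    """
    depths = defaultdict(list)
    for line in lines:
        seq_id, _position, depth = line.rstrip("\n").split("\t")
        depths[seq_id].append(int(depth))
    return depths


def mean_coverage(depths, seq_id, start, end):
    """Mean depth over the 1-based, inclusive interval [start, end]."""
    window = depths.get(seq_id, [])[start - 1:end]
    return sum(window) / len(window) if window else 0.0
```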
Uniprot Offline Mappings¶
Similar to the uniprot command, it uses the idmapping file provided by Uniprot, which speeds up the process of adding mappings and taxonomy IDs from Uniprot gene IDs. It’s not possible, though, to add EC mappings with this command, as those are not included in the file.
Kegg Information¶
The kegg command adds information to each annotation. Right now the information that can be added is restricted to the pathway(s) (reference KO) a KO is part of and the descriptions of both the KO and the pathway(s). This information is stored in keys starting with ko_.
Expected Aminoacidic Changes¶
Some scripts, like snp_parser - SNPs analysis, require information about the expected number of synonymous and non-synonymous changes of an annotation. This can be added by the user with mgkit.io.gff.Annotation.add_exp_syn_count(), or with the exp_syn command of this script. The attributes added to each annotation are explained in the MGKit GFF Specifications.
Adding Information from eggNOG¶
The eggnog command adds information from the annotations file available for profiles in eggNOG.
Adding Count Data¶
Count data on a per-sample basis can be added with the counts command. The accepted inputs are from HTSeq-count and featureCounts. For featureCounts, the output produced using its -f option must be used.
By default this script accepts a tab separated file, with a uid in the first column, while the other columns are the counts for each sample, in the same order as they are passed to the -s option. To use the featureCounts file format, the -e option of this script must be used.
The sample names must be provided in the same order as the columns in the input files. If the counts are FPKMS, the -f option can be used.
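The default tab separated layout can be pictured with a short sketch that pairs each count column with a sample name. The function read_counts is hypothetical and is not the parser used by the script; it assumes a file without a header row.

```python
import csv
import io


def read_counts(handle, samples):
    """Return {uid: {sample: count}} from a tab separated count table.

    The first column is the uid; the remaining columns are matched to
    the sample names by position.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        uid, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"{uid}: expected {len(samples)} columns, found {len(values)}"
            )
        counts[uid] = {name: float(value) for name, value in zip(samples, values)}
    return counts


# Usage with an in-memory table; sample order must match the column order.
table = io.StringIO("uid1\t10\t0\nuid2\t3\t7\n")
example = read_counts(table, ["sample1", "sample2"])
```

The positional pairing is why the sample names have to be given in the same order as the columns.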
Adding Taxonomy from a Table¶
There are cases where it may be needed or preferred to add the taxonomy from a gene_id already provided in the GFF file. For such cases the addtaxa command can be used. It works in a similar way to the taxonomy command, but it accepts four different types of input:
- GI-Taxa table from NCBI (e.g. gi_taxid_nucl.dmp)
- tab separated table
- dictionary
- HDF5
The first two are tab separated files where, on each line, the first column is the gene_id and the second is the taxon_id.
The third option is a serialised Python dict/hash table, whose keys are the gene_id and whose values are the corresponding taxon_id. The serialised formats accepted are msgpack, json and pickle. The msgpack module must be importable. The option to use json and msgpack makes it possible to integrate this script with other languages without resorting to a text file.
The last option is an HDF5 file created using the to_hdf command in taxon-utils - Taxonomy Utilities. This requires pandas and pytables to be installed and it provides faster lookup of IDs in the table.
While the default is to look for the gene_id attribute in the GFF annotation, another attribute can be specified using the --gene-attr option.
Note
The dictionary content is loaded after the table files, and its keys and corresponding values take precedence over the text files.
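A dictionary of this kind can be produced with the Python standard library. The sketch below builds one from a tab separated gene_id/taxon_id table and writes it as json or pickle. The function names are hypothetical, storing the taxon_id as an integer is an assumption, and msgpack is left out because it is a third-party module.

```python
import json
import pickle


def read_gene_taxon_table(lines):
    """Build {gene_id: taxon_id} from lines of the form 'gene_id<TAB>taxon_id'."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        gene_id, taxon_id = line.split("\t")[:2]
        # Assumption: taxon IDs are stored as integers.
        mapping[gene_id] = int(taxon_id)
    return mapping


def dump_mapping(mapping, path, fmt="json"):
    """Serialise the mapping to path as json or pickle."""
    if fmt == "json":
        with open(path, "w") as handle:
            json.dump(mapping, handle)
    elif fmt == "pickle":
        with open(path, "wb") as handle:
            pickle.dump(mapping, handle)
    else:
        raise ValueError(f"unsupported format: {fmt}")
```

The resulting file is the kind of input meant for the -d option of addtaxa.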
Warning
From September 2016 NCBI will retire the GI. In that case the same kind of table can be built from the nucl_gb.accession2taxid.gz file. The format is different, but some information can be found in the documentation of mgkit.io.blast.parse_accession_taxa_table().
Adding information from Pfam¶
Adds the Pfam description for the annotation, by downloading the list from Pfam.
The options allow specifying in which attribute the ID/ACCESSION is stored (defaults to gene_id) and which one between ID/ACCESSION is the value of that attribute (defaults to ID). If no description is found for the family, a warning message is logged.
Changes¶
Changed in version 0.3.0: added cov_samtools command, --split option to exp_syn, -c option to addtaxa
Changed in version 0.2.6: added skip-no-taxon option to addtaxa
Changed in version 0.2.5: if a dictionary is supplied to addtaxa, the GFF is not preloaded
Changed in version 0.2.3: added pfam command, renamed gitaxa to addtaxa and made it general
Changed in version 0.2.2: added eggnog, gitaxa and counts command
Changed in version 0.2.1.
- added -d to uniprot command
- added cache to uniprot command
- added kegg command (cached)
Changed in version 0.1.16: added exp_syn command
Changed in version 0.1.15: taxonomy command -b option changed
Changed in version 0.1.13.
- added --force-taxon-id option to the uniprot command
- added coverage command
- added taxonomy command
- added unipfile command
New in version 0.1.12.
Options¶
Adds informations to a GFF file
usage: add-gff-info [-h] [-v | --quiet] [--cite] [--manual] [--version]
{uniprot,taxonomy,coverage,exp_syn,unipfile,kegg,eggnog,counts,addtaxa,pfam,cov_samtools}
...
Named Arguments¶
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
Sub-commands:¶
uniprot¶
Adds information from GFF whose gene_id is from Uniprot
add-gff-info uniprot [-h] [-c EMAIL] [--buffer BUFFER] [-f] [-t] [-l] [-e]
[-ec] [-ko] [-d] [-m MAPPING] [-v | --quiet] [--cite]
[--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-c, --email | Contact email |
--buffer | Number of annotations to keep in memory Default: 50 |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
Requires Internet connection¶
-f, --force-taxon-id | Overwrite taxon_id if already present Default: False |
-t, --taxon-id | Default: False |
-l, --lineage | Add taxonomic lineage to annotations Default: False |
-e, --eggnog | Add eggNOG mappings to annotations Default: False |
-ec | Add EC mappings to annotations Default: False |
-ko | Add KO mappings to annotations Default: False |
-d, --protein-names | Add Uniprot description Default: False |
-m, --mapping | Add any DB mappings to annotations |
taxonomy¶
Adds taxonomic information from annotation sequences blasted against a NCBI db
add-gff-info taxonomy [-h] -t GI_TAXA_TABLE -b BLAST_OUTPUT [-s BITSCORE]
[-d TAXON_DB] [-l] [-x TAXONOMY] [-a MAX_DIFF]
[-v | --quiet] [--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-t, --gi-taxa-table | GIDs taxonomy table (e.g. gi_taxid_nucl.dmp.gz) |
-b, --blast-output | BLAST output file(s) |
-s, --bitscore | Minimum bitscore allowed Default: 40 |
-d, --taxon-db | NCBI database used Default: “NCBI-NT” |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
LCA options¶
-l, --lca | Use last common ancestor to solve ambiguities Default: False |
-x, --taxonomy | Taxonomy file |
-a, --max-diff | Bitscore difference from the max hit Default: 10 |
coverage¶
Adds coverage information from BAM Alignment files
add-gff-info coverage [-h] -a SAMPLE_ALIGNMENT [-v | --quiet] [--cite]
[--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-a, --sample-alignment | sample name and correspondent alignment file separated by comma |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
exp_syn¶
Adds expected synonymous and non-synonymous changes information
add-gff-info exp_syn [-h] -r REFERENCE [-s] [-v | --quiet] [--cite] [--manual]
[--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-r, --reference | reference sequence in fasta format |
-s, --split | Default: False |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
unipfile¶
Adds mappings and taxonomy from Uniprot mapping file
add-gff-info unipfile [-h] -i MAPPING_FILE [-f] -m
{EMBL-CDS,KEGG,eggNOG,EMBL,STRING,UniPathway,BioCyc,NCBI_TaxID,KO,GI}
[-v | --quiet] [--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-i, --mapping-file | Uniprot mapping file Default: “idmapping.dat.gz” |
-f, --force-taxon-id | Overwrite taxon_id if already present Default: False |
-m, --mapping | Mappings to add. Possible choices: EMBL-CDS, KEGG, eggNOG, EMBL, STRING, UniPathway, BioCyc, NCBI_TaxID, KO, GI |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
kegg¶
Adds information and mapping from Kegg
add-gff-info kegg [-h] [-c EMAIL] [-d] [-p] [-m KEGG_ID] [-v | --quiet]
[--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-c, --email | Contact email |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
Requires Internet connection¶
-d, --description | Add Kegg description Default: False |
-p, --pathways | Add pathways ID involved Default: False |
-m, --kegg-id | Default: “gene_id” |
eggnog¶
Adds information from eggNOG
add-gff-info eggnog [-h] -a ANNOTATIONS_FILE [-v | --quiet] [--cite]
[--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-a, --annotations-file | Annotations file |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
counts¶
Adds counts data to the GFF
add-gff-info counts [-h] -s SAMPLES -c COUNT_FILES [-f] [-e] [-v | --quiet]
[--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-s, --samples | Comma separated sample names, in the same order as the count file |
-c, --count-files | Count file(s) |
-f, --fpkms | If the counts are FPKMS Default: False |
-e, --featureCounts | If the counts files are from featureCounts (using the -f option) Default: False |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
addtaxa¶
Adds taxonomy information from a GI-Taxa, gene_id/taxon_id table or a dictionary serialised as a pickle/msgpack/json file
add-gff-info addtaxa [-h] [-t GENE_TAXON_TABLE] [-f HDF_TABLE] [-a GENE_ATTR]
[-x TAXONOMY] [-d DICTIONARY] [-e] [-db TAXON_DB] [-c]
[-v | --quiet] [--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-t, --gene-taxon-table | |
-f, --hdf-table | |
-a, --gene-attr | Default: “gene_id” |
-x, --taxonomy | |
-d, --dictionary | |
-e, --skip-no-taxon | If used, annotations with no taxon_id won’t be outputted Default: False |
-db, --taxon-db | DB used to add the taxonomic information Default: “NONE” |
-c, --cache-table | Default: False |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
pfam¶
Adds information from Pfam
add-gff-info pfam [-h] [-i ID_ATTR] [-a] [-v | --quiet] [--cite] [--manual]
[--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-i, --id-attr | Default: “gene_id” |
-a, --use-accession | Default: False |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
cov_samtools¶
Adds information from samtools_depth
add-gff-info cov_samtools [-h] [-s SAMPLE] -d DEPTH [-n NUM_SEQS]
[-v | --quiet] [--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input GFF file, defaults to stdin Default: - |
output_file | Output GFF file, defaults to stdout |
Named Arguments¶
-s, --sample | sample name |
-d, --depth | samtools depth -aa file |
-n, --num-seqs | Number of sequences to update the log Default: 10000 |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |