snp_parser - SNPs analysis
Overview
The workflow starts with a number of alignments passed to the SNP calling software, which produces one VCF file per alignment/sample. These VCF files are used by SNPDat, along with a GTF file and the reference genome, to combine the SNP calls in the VCF files with synonymous/non-synonymous information.
All VCF files are then merged into a single VCF that includes all the SNPs called across all samples. This merged VCF is passed, along with the results from SNPDat and the GFF file, to snp_parser.py, which integrates the information from all data sources and writes its output in a format that can be used later by the rest of the pipeline. [1]
Note
The GFF file passed to the parser must have per-sample coverage information.
[1] | This step is done separately because it is time consuming, and running it on its own helps to parallelise later steps |
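The per-sample coverage required by the note above could be read from a GFF line roughly as follows. This is a minimal sketch, not the parser's own code: it assumes attributes are stored as key=value pairs (optionally quoted) separated by semicolons, and that each coverage key is the sample ID followed by the suffix set with -c (default "_cov"). The function name `sample_coverage` and the example line are hypothetical.

```python
def sample_coverage(gff_line, sample_ids, cov_suffix="_cov"):
    """Return {sample_id: coverage} from the attributes column of one GFF line.

    Assumption: attributes look like key=value (optionally quoted) and the
    coverage key for a sample is sample_id + cov_suffix.
    """
    # The attributes are the ninth tab-separated column of a GFF line
    attributes = gff_line.rstrip("\n").split("\t")[8]
    pairs = {}
    for item in attributes.split(";"):
        if "=" not in item:
            continue
        key, value = item.split("=", 1)
        pairs[key.strip()] = value.strip().strip('"')
    # A missing key raises KeyError, which flags a GFF without coverage data
    return {sample: int(pairs[sample + cov_suffix]) for sample in sample_ids}


# Hypothetical annotation with coverage for two samples
line = 'contig1\tBLAST\tCDS\t1\t300\t.\t+\t0\tgene_id="K00001";sample1_cov="12";sample2_cov="7"'
print(sample_coverage(line, ["sample1", "sample2"]))
```

Raising on a missing coverage key, rather than defaulting to zero, makes it obvious when the GFF was not annotated with coverage before being passed to the parser.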
Script Reference
This script parses the results of SNP analysis from any SNP calling tool [2] and integrates them into a format that can later be used by other scripts in the pipeline.
It integrates coverage, the expected number of synonymous/non-synonymous changes and taxonomy from a GFF file with SNP data from a VCF file.
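To make the synonymous/non-synonymous terminology concrete, the sketch below classifies a single-base change in a codon and counts the expected synonymous and non-synonymous changes over all nine possible substitutions. It assumes the standard genetic code and treats a stop codon as just another symbol; how the script itself computes these numbers is not described here, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(codon, position, alt_base):
    """True if replacing codon[position] with alt_base keeps the amino acid."""
    mutated = codon[:position] + alt_base + codon[position + 1:]
    return CODON_TABLE[codon] == CODON_TABLE[mutated]


def expected_changes(codon):
    """Count (synonymous, non-synonymous) outcomes of the 9 single-base changes."""
    syn = nonsyn = 0
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue  # not a change
            if is_synonymous(codon, position, base):
                syn += 1
            else:
                nonsyn += 1
    return syn, nonsyn


print(expected_changes("GCT"))  # alanine: only third-position changes are silent
print(expected_changes("ATG"))  # methionine: every change alters the amino acid
```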
Note
The script accepts gzipped VCF files
[2] | The GATK pipeline was tested, but it is also possible to use samtools and bcftools |
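Handling both plain and gzipped VCF files can be sketched as below. This is an illustration only: it assumes compression is detected from a ".gz" file extension, which may not be how the script decides, and the helper names `open_vcf` and `iter_records` are hypothetical.

```python
import gzip


def open_vcf(path):
    """Open a VCF file in text mode, decompressing it if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_records(handle):
    """Yield the tab-separated fields of each data line, skipping '#' headers."""
    for line in handle:
        if line.startswith("#"):
            continue
        yield line.rstrip("\n").split("\t")
```

With these helpers the same loop, for example `with open_vcf(path) as handle: for fields in iter_records(handle): ...`, works on either kind of file.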
Changes
Changed in version 0.2.1: added -s option for VCF files generated using bcftools
Changed in version 0.1.16: reworked internals and removed SNPDat; syn/nonsyn evaluation is now internal
Changed in version 0.1.13: reworked the internals and the classes used, including options -m and -s
Options
SNPs analysis, requires a vcf file and SNPDat results
usage: snp_parser [-h] [-o OUTPUT_FILE] [-q MIN_QUAL] [-f MIN_FREQ]
[-r MIN_READS] -g GFF_FILE -p VCF_FILE -a REFERENCE -m
SAMPLES_ID [-c COV_SUFF] [-s] [-v | --quiet] [--cite]
[--manual] [--version]
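As an illustration of the usage line above, the following sketch assembles a command line from the documented options. The file names and sample IDs are placeholders, the helper `build_snp_parser_command` is hypothetical, and repeating -m once per sample is an assumption that the usage line does not confirm.

```python
import shlex


def build_snp_parser_command(gff_file, vcf_file, reference, sample_ids,
                             output_file="snp_data.pickle", min_qual=30,
                             min_freq=0.01, min_reads=4, bcftools_vcf=False):
    """Return the snp_parser command as a list of arguments."""
    command = [
        "snp_parser",
        "-o", output_file,
        "-q", str(min_qual),
        "-f", str(min_freq),
        "-r", str(min_reads),
        "-g", gff_file,
        "-p", vcf_file,
        "-a", reference,
    ]
    # Assumption: -m is given once per sample ID
    for sample_id in sample_ids:
        command += ["-m", sample_id]
    # -s marks a VCF produced with bcftools call
    if bcftools_vcf:
        command.append("-s")
    return command


# Placeholder inputs, not files shipped with the script
command = build_snp_parser_command(
    "assembly.gff", "merged.vcf.gz", "assembly.fa", ["sample1", "sample2"]
)
print(shlex.join(command))
```

The returned list can be passed as is to `subprocess.run(command, check=True)`, which avoids shell quoting problems with unusual file names.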
Named Arguments
-o, --output-file | Output file Default: snp_data.pickle |
-q, --min-qual | Minimum SNP quality (Phred score) Default: 30 |
-f, --min-freq | Minimum allele frequency Default: 0.01 |
-r, --min-reads | Minimum number of reads to accept the SNP Default: 4 |
-g, --gff-file | GFF file with annotations |
-p, --vcf-file | Merged VCF file |
-a, --reference | Fasta file with the GFF Reference |
-m, --samples-id | the ids of the samples used in the analysis |
-c, --cov-suff | Per sample coverage suffix in the GFF Default: “_cov” |
-s, --bcftools-vcf | bcftools call was used to produce the VCF file Default: False |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
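The three thresholds (-q, -f and -r) can be pictured as a simple filter applied to each candidate SNP. The sketch below is an assumption about how such a filter might look, not the script's actual logic: it takes -r to mean the total read depth at the site, computes allele frequency as supporting reads over total reads, and treats all thresholds as inclusive. The function `passes_filters` is hypothetical.

```python
def passes_filters(qual, allele_reads, total_reads,
                   min_qual=30, min_freq=0.01, min_reads=4):
    """Return True if a SNP call meets the quality, depth and frequency cut-offs.

    Assumptions: min_reads applies to the total depth at the site, and
    frequency is allele_reads / total_reads.
    """
    if qual < min_qual:
        return False
    # Guard against empty sites before dividing
    if total_reads <= 0 or total_reads < min_reads:
        return False
    return allele_reads / total_reads >= min_freq


print(passes_filters(qual=50, allele_reads=3, total_reads=100))   # kept
print(passes_filters(qual=20, allele_reads=50, total_reads=100))  # low quality
print(passes_filters(qual=50, allele_reads=1, total_reads=3))     # too few reads
print(passes_filters(qual=50, allele_reads=1, total_reads=200))   # rare allele
```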