mgkit.workflow.snp_parser module
This script parses the results of SNP analysis from any SNP-calling tool [1] and integrates them into a format that other scripts in the pipeline can use later.
It combines coverage, the expected number of synonymous and non-synonymous changes, and taxonomy from a GFF file with SNP data from a VCF file.
Note
The script accepts gzipped VCF files.
[1] The GATK pipeline was tested, but it is possible to use samtools and bcftools.
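As an illustration of the note about gzipped input, the sketch below shows one way a VCF file can be opened as text whether or not it is compressed. The helper name `open_vcf` is hypothetical and is not part of mgkit; the script's own handling may differ.

```python
import gzip


def open_vcf(path):
    """Open a VCF file in text mode, whether or not it is gzipped.

    Hypothetical helper for illustration only, not an mgkit function.
    """
    # Gzip streams start with the two magic bytes 0x1f 0x8b
    with open(path, "rb") as handle:
        is_gzipped = handle.read(2) == b"\x1f\x8b"
    if is_gzipped:
        # "rt" makes gzip decode the stream and return str lines
        return gzip.open(path, "rt")
    return open(path, "r")
```

Checking the magic bytes instead of the file extension means a compressed file without a `.gz` suffix is still read correctly.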
Changes
Changed in version 0.2.1: added -s option for VCF files generated using bcftools
Changed in version 0.1.16: reworked internals and removed SNPDat; syn/nonsyn evaluation is now internal
Changed in version 0.1.13: reworked the internals and the classes used, including options -m and -s
mgkit.workflow.snp_parser.check_snp_in_set(samples, snp_data, pos, change, annotations, seq)
Used by parse_vcf() to check if a SNP
Parameters:
- samples (iterable) – list of samples that contain the SNP
- snp_data (dict) – dictionary from init_count_set() with per-sample SNP information
mgkit.workflow.snp_parser.parse_vcf(vcf_file, snp_data, min_reads, min_af, min_qual, annotations, seqs, options, line_num=100000)
Parses a VCF file and counts synonymous and non-synonymous SNPs.
Parameters:
- vcf_file (file) – file handle to a VCF file
- snp_data (dict) – dictionary from init_count_set() with per-sample SNP information
- min_reads (int) – minimum number of reads to accept a SNP
- min_af (float) – minimum allele frequency to accept a SNP
- min_qual (int) – minimum quality (Phred score) to accept a SNP
- annotations (dict) – annotations grouped by their reference sequence
- seqs (dict) – reference sequences
- line_num (int) – the interval, in number of lines, at which progress is printed
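To make the three thresholds concrete, here is a minimal sketch of how a single VCF data line could be tested against min_reads, min_af and min_qual. It is not the mgkit implementation: the function name `passes_filters` is hypothetical, and it assumes that read depth and allele frequency are stored in the INFO keys `DP` and `AF`, and that the comparisons are inclusive. Callers differ in which keys they emit, so the real script may read these values elsewhere.

```python
def passes_filters(vcf_line, min_reads, min_af, min_qual):
    """Return True if a VCF data line meets all three thresholds.

    Hypothetical helper for illustration only, not an mgkit function.
    """
    # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO [FORMAT samples...]
    fields = vcf_line.rstrip("\n").split("\t")
    if fields[5] == ".":
        # A missing QUAL value cannot satisfy a quality threshold
        return False
    qual = float(fields[5])

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
    info = {}
    for item in fields[7].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True

    # Assumption: DP is the read depth, AF the allele frequency;
    # only the first alternative allele is considered here
    if "DP" not in info or "AF" not in info:
        return False
    depth = int(info["DP"])
    allele_freq = float(info["AF"].split(",")[0])

    # Assumption: thresholds are inclusive (>=)
    return qual >= min_qual and depth >= min_reads and allele_freq >= min_af
```

With a line carrying `QUAL=50.0`, `DP=12` and `AF=0.25`, the call `passes_filters(line, 4, 0.01, 30)` accepts the SNP, while raising any single threshold above the corresponding value rejects it.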
mgkit.workflow.snp_parser.save_data(output_file, snp_data)
Pickles the data structures to disk.
Parameters:
- output_file (str) – base name for pickle files
- snp_data (dict) – dictionary from init_count_set() with per-sample SNP information
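The following sketch illustrates the general idea of pickling a per-sample dictionary under a base name and reading it back. The function names and the `.pickle` suffix are assumptions made for the example; the file names and layout that save_data actually produces are not documented here.

```python
import pickle


def save_snp_data(output_file, snp_data):
    """Pickle snp_data to '<output_file>.pickle' and return the path.

    Hypothetical helper; the suffix is an assumption, not mgkit's naming.
    """
    path = "{}.pickle".format(output_file)
    with open(path, "wb") as handle:
        pickle.dump(snp_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_snp_data(path):
    """Load a dictionary previously written by save_snp_data."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.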