fastq2vcf


fastq2vcf is a program that generates an analysis pipeline for Whole Exome Sequencing (WES) projects. It takes the raw reads through to variant calling and annotation. The software is written in Perl and is intended to be usable by both novice and expert users.

Citation:
Gao X, Xu J, Starmer J. (2015). Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analysis. BMC Res Notes. 8:72. doi: 10.1186/s13104-015-1027-x.

Download

Feedback and Suggestions:
Gao Lab, ray.x.gao_at_gmail*dot*com

Thank you for trying/using our software.

Manual:


Prerequisites:
The following programs must be installed before running fastq2vcf:

If you do not already have BWA index files for hg19 on your computer, you can build them as follows (you only need to do this once):
  1. Download the reference files for hg19 using the following command (Mac users, use "curl" instead of "wget"):
    shell> wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/*
  2. After the download has completed, generate the BWA index files using the following command:
    shell> bwa index -a bwtsw ucsc.hg19.fasta
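
Before starting a run, it can help to confirm that indexing finished. The sketch below checks for the five files that "bwa index" writes next to the reference FASTA (.amb, .ann, .bwt, .pac, .sa). The function name check_bwa_index is hypothetical and is not part of fastq2vcf.

```shell
#!/bin/sh
# Sketch (not part of fastq2vcf): verify that the BWA index files exist
# and are non-empty next to a reference FASTA.
check_bwa_index() {
    ref="$1"
    missing=0
    # "bwa index" writes these five files alongside the FASTA
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$ref.$ext" ]; then
            echo "missing or empty: $ref.$ext" >&2
            missing=1
        fi
    done
    # 0 when all five files are present, 1 otherwise
    return "$missing"
}
```

Usage: "check_bwa_index ucsc.hg19.fasta" returns 0 when the index is complete and lists any missing files on standard error otherwise.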

Installation:
Download the fastq2vcf.tar.gz file from the website:
Extract the file:
    shell> tar zxvf fastq2vcf.tar.gz
Change into the directory:
    shell> cd fastq2vcf/
You will see four files:
    fastq2vcf.pl
    dataTable.txt
    config.ini
    cluster.conf

Single vs Cluster Computing:
fastq2vcf can run either on a single Linux server or on a Linux cluster. By default, fastq2vcf is configured to run on a single Linux server.
To run it on a cluster, set runEnvironment=1 in the config.ini file. You also need to edit cluster.conf so that it contains the job configuration header that your cluster requires.
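
As a rough sketch, the runEnvironment switch can be set from the command line instead of a text editor. This assumes config.ini stores the setting on its own line as "runEnvironment=<value>" with no spaces around "="; check your copy of config.ini before relying on it. The function name set_run_environment is hypothetical.

```shell
#!/bin/sh
# Sketch (not part of fastq2vcf): set runEnvironment in a config file.
# Assumes a line of the form "runEnvironment=<value>".
# A temporary file is used instead of "sed -i", whose syntax differs
# between GNU and BSD sed.
set_run_environment() {
    config="$1"
    value="$2"
    tmp="$config.tmp.$$"
    if grep -q '^runEnvironment=' "$config"; then
        # Replace the existing setting
        sed "s/^runEnvironment=.*/runEnvironment=$value/" "$config" > "$tmp" &&
            mv "$tmp" "$config"
    else
        # Append the setting if it is absent
        echo "runEnvironment=$value" >> "$config"
    fi
}
```

Usage: "set_run_environment config.ini 1" selects cluster mode as described above.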

HOWTO: run fastq2vcf analysis:
Before running the pipeline you must update "dataTable.txt" and "config.ini" with values that are relevant for your analysis:
  1. cd fastq2vcf
  2. Update the "Directory" field in dataTable.txt with the path to the directory that contains the raw sequence data in FASTQ format.
  3. Edit config.ini and make sure all the paths for programs and reference files are correct.
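
A quick way to catch a wrong path in step 3 is to test every path-like value in config.ini before generating the scripts. The sketch below assumes config.ini uses "key=value" lines with no spaces around "=", and it only inspects values that start with "/"; both are assumptions about the file format, and check_config_paths is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch (not part of fastq2vcf): report config entries whose value looks
# like an absolute path but does not exist on disk.
check_config_paths() {
    config="$1"
    status=0
    # Split each line at the first "="; the rest of the line becomes the value
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case "$key" in
            ''|'#'*|';'*|'['*) continue ;;   # skip blanks, comments, section headers
        esac
        case "$value" in
            /*)
                if [ ! -e "$value" ]; then
                    echo "not found: $key=$value" >&2
                    status=1
                fi
                ;;
        esac
    done < "$config"
    # 0 when every absolute path exists, 1 otherwise
    return "$status"
}
```

Usage: "check_config_paths config.ini" prints each missing path on standard error and returns a non-zero status if any were found.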

There are 5 steps to run fastq2vcf on a single Linux server:
  1. Generate the shell scripts for all samples:
    shell> perl fastq2vcf.v9.pl -d dataTable.txt -c config.ini -o $PWD
    Note: This step generates three groups of shell scripts (QC_Mapping, PreCalling and variant), which are run in the steps below.
  2. QC and Mapping:
    shell> nohup sh QC_Mapping_sample1_RG1_SRR504483.sh & 
    shell> nohup sh QC_Mapping_sample2_RG1_SRR504515.sh &
    shell> nohup sh QC_Mapping_sample3_RG1_SRR504516.sh &
    shell> nohup sh QC_Mapping_sample4_RG1_SRR504517.sh &
    shell> nohup sh QC_Mapping_sample5_RG1_SRR524806.sh &
  3. When those scripts have finished executing, run duplicate marking, realignment, quality recalibration and data compression:
    shell> nohup sh PreCalling_sample1.sh &
    shell> nohup sh PreCalling_sample2.sh &
    shell> nohup sh PreCalling_sample3.sh &
    shell> nohup sh PreCalling_sample4.sh &
    shell> nohup sh PreCalling_sample5.sh &
  4. When those scripts have finished executing, perform multi-sample variant calling with four variant callers:
    shell> nohup sh variant.HaplotypeCaller.sh &
    shell> nohup sh variant.samtools.sh &
    shell> nohup sh variant.SNVer.sh &
    shell> nohup sh variant.UnifiedGenotyper.sh &
    Note: After running this step, you will get all the annotated variants for all samples from each caller.
  5. Finally, run the variant summary:
    shell> nohup sh variant.summary.sh &
    Note: After running this step, you will get the set of variant calls on which the 4 callers overlap.

    Final annotated variants will be written into Variant/.
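
Steps 2 to 5 must run in order, with each stage starting only after the previous one has finished. The following is a minimal sketch of a wrapper that starts all scripts of a stage in the background and waits for them. It assumes the generated scripts exit with a non-zero status on failure, which this document does not confirm; run_stage is a hypothetical helper and is not part of fastq2vcf.

```shell
#!/bin/sh
# Sketch (not part of fastq2vcf): run all scripts of one stage in parallel
# and block until every one of them has finished.
run_stage() {
    failed=0
    pids=""
    for script in "$@"; do
        if [ ! -f "$script" ]; then
            echo "skipping $script (not a file)" >&2
            failed=1
            continue
        fi
        # Each script logs to <script>.log, in place of nohup.out
        sh "$script" > "$script.log" 2>&1 &
        pids="$pids $!"
    done
    for pid in $pids; do
        # wait returns the exit status of the finished job
        wait "$pid" || failed=1
    done
    return "$failed"
}

# Example chain for the five-sample run above (commented out so that
# sourcing this file has no side effects):
# run_stage QC_Mapping_*.sh &&
# run_stage PreCalling_*.sh &&
# run_stage variant.HaplotypeCaller.sh variant.samtools.sh \
#           variant.SNVer.sh variant.UnifiedGenotyper.sh &&
# run_stage variant.summary.sh
```

If the chain is saved in a file (for example run_all.sh, a hypothetical name), it can itself be launched with "nohup sh run_all.sh &" so that it survives logout. The four callers are listed explicitly because a pattern such as variant.*.sh would also match variant.summary.sh, which has to run last.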

There are 5 steps to run fastq2vcf in a cluster environment:
  1. Generate the shell scripts for all samples:
    shell> perl fastq2vcf.v9.pl -d dataTable.txt -c config.ini -e cluster.conf -o $PWD
    Note: This step generates the series of shell scripts for this run.
  2. Run QC and Mapping:
    shell> qsub QC_Mapping_sample1_RG1_SRR504483.sh
    shell> qsub QC_Mapping_sample2_RG1_SRR504515.sh
    shell> qsub QC_Mapping_sample3_RG1_SRR504516.sh
    shell> qsub QC_Mapping_sample4_RG1_SRR504517.sh
    shell> qsub QC_Mapping_sample5_RG1_SRR524806.sh
  3. When those scripts have finished executing, run duplicate marking, realignment, quality recalibration and data compression:
    shell> qsub PreCalling_sample1.sh
    shell> qsub PreCalling_sample2.sh
    shell> qsub PreCalling_sample3.sh
    shell> qsub PreCalling_sample4.sh
    shell> qsub PreCalling_sample5.sh
  4. When those scripts have finished executing, perform multi-sample variant calling with four variant callers:
    shell> qsub variant.HaplotypeCaller.sh
    shell> qsub variant.samtools.sh
    shell> qsub variant.SNVer.sh
    shell> qsub variant.UnifiedGenotyper.sh
    Note: After running this step, you will get all the annotated variants for all samples from each caller.
  5. Finally, run the variant summary:
    shell> qsub variant.summary.sh
    Note: After running this step, you will get the set of variant calls on which the 4 callers overlap.
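
On some schedulers the waiting between stages can be handed to the queue itself. The sketch below assumes a PBS/Torque-style qsub that prints the job ID on standard output and accepts "-W depend=afterok:<id>[:<id>...]". Other schedulers use different options (Grid Engine, for example, uses -hold_jid), so treat this as an illustration and not as the documented fastq2vcf workflow. The helper name submit_all is hypothetical.

```shell
#!/bin/sh
# Sketch (not part of fastq2vcf): submit scripts with qsub and print a
# colon-joined list of the resulting job IDs.
# $1 is a colon-joined list of job IDs to wait for (may be empty);
# the remaining arguments are the scripts to submit.
submit_all() {
    dep="$1"
    shift
    ids=""
    for script in "$@"; do
        if [ -n "$dep" ]; then
            # Start only after all jobs in $dep have finished successfully
            id=$(qsub -W "depend=afterok:$dep" "$script") || return 1
        else
            id=$(qsub "$script") || return 1
        fi
        ids="${ids:+$ids:}$id"
    done
    echo "$ids"
}

# Example chain (commented out so that sourcing this file submits nothing):
# qc=$(submit_all "" QC_Mapping_*.sh)
# pre=$(submit_all "$qc" PreCalling_*.sh)
# var=$(submit_all "$pre" variant.HaplotypeCaller.sh variant.samtools.sh \
#                         variant.SNVer.sh variant.UnifiedGenotyper.sh)
# submit_all "$var" variant.summary.sh
```

In this sketch every job of a stage waits for all jobs of the previous stage. That is more conservative than strictly necessary, but it keeps the dependency list simple.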