Step 5 - QC Data
The AnVIL Data Processing Working Group has created a genomic evaluation tool for whole genome data (a whole exome QC tool is in development). You will collect quality control metrics for genome and exome sequencing data by running the tool, a workflow written in Workflow Description Language (WDL), in a sandbox workspace.
The WDL includes multiple software packages (Picard, VerifyBamID2, Samtools flagstat, bamUtil stats) organized in a single, efficient tool that is compatible with AnVIL.
The current QC pass/fail status is based on three metrics: coverage, freemix, and sample contamination. QC metrics can be made available in the AnVIL workspace to aid users in sample selection.
QC processing results table
Below is the current output generated by the workflow, stored in a qc_results_sample data table.
|Metric Description||Pass threshold||Purpose||Source Tool|
|Sample ID||NA||Identify sample||NA|
|Cram google path||NA||Locate file||NA|
|FREEMIX||< 0.01||Sample contamination||VerifyBamID2|
|Haploid Coverage||≥ 30||Coverage depth||Picard CollectWgsMetrics|
|Library insert size mad||NA||Batch characteristics||Picard CollectInsertSizeMetrics|
|Library insert size median||NA||Batch characteristics||Picard CollectInsertSizeMetrics|
|% coverage at 10X||> 0.95||Coverage breadth||Picard CollectWgsMetrics|
|% coverage at 20X||> 0.90||Coverage breadth||Picard CollectWgsMetrics|
|% coverage at 30X||NA||Additional metadata||Picard CollectWgsMetrics|
|% Chimeras||< 0.05||Variant detection||Picard CollectAlignmentSummaryMetrics|
|Total bases with Q20 or higher||≥ 86 × 10^9||Sequence quality||Picard CollectQualityYieldMetrics|
|Reported status at the sample level||Pass/Fail/No QC||Overall quality assessment|
|Read1 base mismatch rate||< 0.05||Sequence quality||Picard CollectAlignmentSummaryMetrics|
|Read2 base mismatch rate||< 0.05||Sequence quality||Picard CollectAlignmentSummaryMetrics|
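The thresholds in the table can be read as simple comparisons against each sample's metrics. The following is a minimal sketch of that idea, not the working group's workflow or notebook code: the dictionary keys (for example `freemix` and `q20_bases`) are hypothetical field names chosen for illustration, and treating a missing metric as a failure is an assumption.

```python
import operator

# Thresholds copied from the table above. The keys are hypothetical
# field names, not the workflow's actual output column names.
THRESHOLDS = {
    "freemix": (operator.lt, 0.01),
    "haploid_coverage": (operator.ge, 30),
    "pct_coverage_10x": (operator.gt, 0.95),
    "pct_coverage_20x": (operator.gt, 0.90),
    "pct_chimeras": (operator.lt, 0.05),
    "q20_bases": (operator.ge, 86e9),
    "read1_mismatch_rate": (operator.lt, 0.05),
    "read2_mismatch_rate": (operator.lt, 0.05),
}


def failed_metrics(sample_metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that miss their threshold.

    Assumption: a metric that is absent from sample_metrics counts
    as a failure.
    """
    failed = []
    for name, (compare, limit) in thresholds.items():
        value = sample_metrics.get(name)
        if value is None or not compare(value, limit):
            failed.append(name)
    return sorted(failed)


def qc_status(sample_metrics, thresholds=THRESHOLDS):
    """Return "Pass" only when every thresholded metric passes."""
    return "Fail" if failed_metrics(sample_metrics, thresholds) else "Pass"


# Made-up metric values for illustration only.
good_sample = {
    "freemix": 0.002,
    "haploid_coverage": 32.5,
    "pct_coverage_10x": 0.97,
    "pct_coverage_20x": 0.93,
    "pct_chimeras": 0.01,
    "q20_bases": 95e9,
    "read1_mismatch_rate": 0.004,
    "read2_mismatch_rate": 0.006,
}
contaminated_sample = dict(good_sample, freemix=0.02)

print(qc_status(good_sample))
print(failed_metrics(contaminated_sample))
```

Because the thresholds are passed as an argument, a submitter who settles on different metrics or cutoffs could supply their own dictionary instead of editing the function.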
5.1 Select QC status criteria
Data submitters should establish the specific metrics and thresholds that determine the pass/fail criteria for their dataset.
5.2 Run QC Processing
Data submitters are responsible for running the WDL on their data to generate the QC metrics. The AnVIL Data Processing Working Group has created a QC aggregator Jupyter notebook. Once QC status criteria have been determined, the thresholds can be modified in the notebook. The criteria are used to assign a QC status of pass or fail. If a sample fails multiple times, it is assigned No QC under QC status.
Video - Walkthrough of WGS QC Processing
5.3 Post QC Processing to AnVIL Workspaces
The output from the QC aggregator is a QC summary results TSV file. Data submitters will hand off the QC summary results file to the AnVIL ingestion team. The AnVIL team will push the QC summary results to the workspaces, which will show the QC status of every sample, including those that fail QC or have no QC. The example below is the QC results table in the 1000 Genomes workspace.
Sample QC Results Table
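To make the hand-off format concrete, here is a minimal sketch of writing per-sample QC statuses to a tab-separated file with Python's standard `csv` module. It is not the QC aggregator notebook itself; the column names `sample_id` and `qc_status` and the sample identifiers are hypothetical, and the real summary file may carry additional columns.

```python
import csv
import io


def write_qc_summary(rows, handle, columns=("sample_id", "qc_status")):
    """Write one tab-separated line per sample to an open text handle.

    The column names are hypothetical placeholders. Keys that are not
    listed in columns are ignored; missing keys are written as empty.
    """
    writer = csv.DictWriter(
        handle,
        fieldnames=list(columns),
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up rows covering the three statuses named in the table.
rows = [
    {"sample_id": "sample_A", "qc_status": "Pass"},
    {"sample_id": "sample_B", "qc_status": "Fail"},
    {"sample_id": "sample_C", "qc_status": "No QC"},
]

buffer = io.StringIO()
write_qc_summary(rows, buffer)
output = buffer.getvalue()
print(output)
```

Writing to an in-memory buffer keeps the example self-contained; passing a handle from `open("qc_summary.tsv", "w", newline="")` (a hypothetical file name) would produce a file on disk instead.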
Additional Resources - Upcoming AnVIL Tools
The AnVIL Data Processing Working Group is evaluating two tools to add to the submission process that estimate genetic sex and compare it to reported sex. The goal is to identify, at the cohort level, any major discrepancies between the genomic data and the reported phenotype data. Variation in sex chromosome copy number (e.g., XXY, XO, somatic mosaicism) means that genetic sex prediction is not 100% accurate, although it is an excellent tool for detecting major cohort-level issues.
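The cohort-level comparison described above can be sketched as a discordance rate between reported and predicted sex. This is only an illustration under stated assumptions: the tools being evaluated are not named here, so the sketch assumes predicted labels are already available, and the `"F"`/`"M"` labels, the function name, and the sample identifiers are hypothetical.

```python
def sex_concordance(samples):
    """Summarize agreement between reported and predicted sex.

    samples: iterable of (sample_id, reported_sex, predicted_sex).
    Samples with either value missing (None) are left out of the rate.
    Returns (discordance_rate, list_of_discordant_sample_ids).
    """
    concordant = 0
    discordant_ids = []
    for sample_id, reported, predicted in samples:
        if reported is None or predicted is None:
            continue  # nothing to compare for this sample
        if reported == predicted:
            concordant += 1
        else:
            discordant_ids.append(sample_id)
    compared = concordant + len(discordant_ids)
    rate = len(discordant_ids) / compared if compared else 0.0
    return rate, discordant_ids


# Made-up cohort for illustration only.
cohort = [
    ("s1", "F", "F"),
    ("s2", "M", "M"),
    ("s3", "F", "M"),
    ("s4", None, "M"),
]

rate, discordant = sex_concordance(cohort)
print(f"{rate:.1%} discordant: {discordant}")
```

A handful of discordant samples is expected given sex chromosome copy number variation, so a summary like this is better suited to flagging a cohort with an unusually high rate (for example, a sample swap or a shifted phenotype column) than to judging individual samples.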
Exome QC Processing