Step 3 - Prepare for Submission
AnVIL accepts two types of data: 1) genomic object files and 2) phenotypes and metadata. Most studies submit both. In this step, you will organize all required data and metadata in a format compatible with AnVIL.
Note that in addition to the data files, genomic object files require minimal metadata, some of which is generated by AnVIL (e.g. the full path to the files in AnVIL cloud storage).
You will submit all metadata (including phenotypic data) in a spreadsheet-like file (TSV/TXT/CSV format). To prepare data for submission, you will:
- Make sure all object files conform to AnVIL’s naming requirements
- Generate a TSV file for each table in the data model (from Step 2)
Tables in an AnVIL Data Workspace
To learn how workspace tables help organize data in AnVIL, see Managing Data with Workspace Tables (estimated read time 15 minutes).
If you prefer a video, you can watch this “Introduction to data tables” video on YouTube (5:25 min).
Genomic Object Files
As part of the submission process, you may be providing AnVIL with genomic object files such as VCFs, CRAMs, BAMs, IDATs, or FASTQs. Before depositing the object files in a workspace bucket (Step 4), you will need to 1) make sure the object file names fit AnVIL requirements and 2) generate a TSV with object file metadata (e.g. subject_id in the figure below).
Unallowed Characters (object and TSV file names)
The names of your genomic object files may only contain numbers, letters, “:”, “-” and “_”. No other special characters (&, $, %, #, etc.) are allowed in the file names.
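The naming rule above can be checked before upload with a short script. This is a minimal sketch, not an AnVIL tool: the function names are hypothetical, and it assumes that everything from the first "." onward is a file extension and is exempt from the character check, which the rule as written does not state.

```python
import re
from pathlib import PurePosixPath

# Characters the section lists as allowed: letters, digits, ":", "-", "_".
ALLOWED_NAME = re.compile(r"[A-Za-z0-9:_-]+")


def is_valid_object_name(filename: str) -> bool:
    """Return True if the name before the first "." uses only allowed characters.

    Assumption (not from the AnVIL documentation): the extension, meaning
    everything from the first "." onward, is not checked.
    """
    base = PurePosixPath(filename).name  # drop any directory part
    stem = base.split(".", 1)[0]         # text before the first "."
    return ALLOWED_NAME.fullmatch(stem) is not None


def find_invalid_names(filenames):
    """Return the subset of filenames that fail the check, in input order."""
    return [name for name in filenames if not is_valid_object_name(name)]
```

Running `find_invalid_names` over a directory listing gives a list of files to rename before depositing them in the workspace bucket.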
Data Indexing
Note that AnVIL will generate a globally unique ID (GUID) for object files, such as sequencing data, and add it to the TSV (e.g. cram_path in the figure above) after you deposit the data files.
These identifiers allow researchers to access data across AnVIL tools, without creating additional copies or transferring across environments.
Because they are extensible, they also make it easier to interoperate with other data commons. Further, they enable tracking of live data being processed in workflow pipelines, and data backup to cold storage.
Functional Equivalence (FE)
To maximize the value of AnVIL-hosted data and minimize batch effects in cross-project analyses (Regier et al., 2018), the CCDG and TOPMed consortia have defined a functional equivalence (FE) standard for alignment and processing of whole-genome sequencing (WGS) data. AnVIL strongly encourages the submission of FE-compliant genome and exome sequencing data aligned to GRCh38 (see the CCDG pipeline standard).
FE is important for downstream joint calling across datasets but is difficult to prove. There is no easy way for AnVIL to validate, or for the submitter to prove, that submitted data were aligned and processed with an FE pipeline.
If you are unsure whether your data are functionally equivalent, the AnVIL ingestion team may reach out to you to review your dataset prior to submission.
3.1 - Generate Table Load Files (TSV, CSV or TXT format)
Your spreadsheet can include an almost unlimited number of rows (individual entities) and columns (entity properties). A video walkthrough of generating a load file (TSV format) from a template is available below:
Unallowed Characters
Fields in your spreadsheet may only contain numbers, letters, “:”, “-” and “_”. No other special characters (&, $, %, #, etc.) are allowed in any fields of the load file.
Required Formatting
The first column in the load file is the identifier key (ID field). The first column header corresponds to the node in the data model.
First column headers must have the following format, typed exactly as shown. Note the entity: prefix and the _id suffix.
- Subject table - entity:subject_id
- Sample table - entity:sample_id
- Sequencing table - entity:sequencing_id
- Family table - entity:family_id
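As an illustration of the load file layout described above, the sketch below writes one table as tab-delimited text using Python's standard csv module. It assumes the Terra convention of an entity:<table>_id header for the first column. The helper name, the example records, and the "sex" column are hypothetical and are not taken from the AnVIL data model.

```python
import csv
import io


def write_load_file(handle, table, rows, columns):
    """Write rows (a list of dicts) as a tab-delimited load file for one table.

    The first column header is built as "entity:<table>_id" (assumed
    convention); the remaining headers are the given column names.
    """
    id_column = f"{table}_id"
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{id_column}"] + list(columns))
    for row in rows:
        # Missing optional values are written as empty fields.
        writer.writerow([row[id_column]] + [row.get(col, "") for col in columns])


# Hypothetical example records for a subject table.
subjects = [
    {"subject_id": "SUBJ_001", "sex": "female"},
    {"subject_id": "SUBJ_002", "sex": "male"},
]

buffer = io.StringIO()
write_load_file(buffer, "subject", subjects, ["sex"])
tsv_text = buffer.getvalue()
```

To write to disk instead of memory, pass a file opened with `open("subject.tsv", "w", newline="")` in place of the `io.StringIO` buffer.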
Associating Data in Different Tables
Hint - Where possible, try to include data in the subject, sample, or sequencing tables. If that’s not an option, the data can be submitted as separate tables. Any data beyond these minimal required tables must always be linked to either the subject_id, sample_id, or sequencing_id, depending on what the data element describes. For example, to link data in an additional table to a subject, make sure to include a subject_id column in that table.
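Before submitting an additional table, it can help to confirm that every row links back to an existing ID. The following is a minimal sketch under stated assumptions: the function name is hypothetical, rows are plain dicts as produced by csv.DictReader, and the "medication" table is an invented example rather than part of the AnVIL data model.

```python
def find_unlinked_rows(parent_ids, child_rows, link_column="subject_id"):
    """Return child rows whose link value is missing or not among parent_ids."""
    known = set(parent_ids)
    return [row for row in child_rows if row.get(link_column) not in known]


# IDs present in the subject table (hypothetical).
subject_ids = ["SUBJ_001", "SUBJ_002"]

# Rows of a hypothetical additional table that should link to subjects.
medication_rows = [
    {"medication_id": "MED_1", "subject_id": "SUBJ_001"},
    {"medication_id": "MED_2", "subject_id": "SUBJ_999"},  # unknown subject
    {"medication_id": "MED_3"},                            # link column missing
]

unlinked = find_unlinked_rows(subject_ids, medication_rows)
```

Here `unlinked` contains the MED_2 and MED_3 rows, which would need a valid subject_id before the table is loaded. The same check works for sample_id or sequencing_id by changing `link_column`.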
Addressing Repeated Elements
Please bring any repeating data elements (i.e. multiple values for a given data element for an individual) to the attention of the AnVIL team to ensure proper modeling and submission. Common examples:
- An individual in a data set has a measurement (e.g., blood pressure, lab test, BMI) taken at multiple time points.
- An individual in a data set is affected by multiple diseases/phenotypes/conditions included in the study (e.g., an individual in a diabetes study has both diabetes and diabetic retinopathy; both are being tracked in the study).
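One way to find repeating elements to report is to group the values of a data element by individual and keep those with more than one value. This is only an illustrative sketch: the function name, the column names, and the blood pressure values are hypothetical, and it does not show how AnVIL models such data.

```python
from collections import defaultdict


def find_repeated_elements(rows, id_column, element_column):
    """Map each ID to its list of values, keeping only IDs with several values."""
    values = defaultdict(list)
    for row in rows:
        values[row[id_column]].append(row[element_column])
    return {key: vals for key, vals in values.items() if len(vals) > 1}


# Hypothetical measurements: SUBJ_001 was measured at two time points.
measurements = [
    {"subject_id": "SUBJ_001", "systolic_bp": "118"},
    {"subject_id": "SUBJ_001", "systolic_bp": "124"},
    {"subject_id": "SUBJ_002", "systolic_bp": "131"},
]

repeated = find_repeated_elements(measurements, "subject_id", "systolic_bp")
```

Any ID that appears in `repeated` has more than one value for the element and is a candidate to raise with the AnVIL team.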
3.2 - Save as "Tab-Delimited Text" or "Tab-Separated Values"
Your editor may give you a warning about losing data in this format, but we assure you, it's fine! Also, Terra will completely ignore the name you give the file. It's the root entity in the first column header (the part in front of the _id) that determines the table name in the workspace.
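The relationship between the first column header and the table name can be sketched as a small function. This is an illustration of the rule stated above, not Terra's actual implementation; the function name is hypothetical, and stripping an optional "entity:" prefix is an assumption based on the Terra load file convention.

```python
def table_name_from_header(first_header: str) -> str:
    """Derive the table name from a first-column header such as "entity:subject_id"."""
    name = first_header.strip()
    prefix = "entity:"  # assumed optional prefix
    if name.startswith(prefix):
        name = name[len(prefix):]
    if not name.endswith("_id") or name == "_id":
        raise ValueError(f"first column header must end in '_id': {first_header!r}")
    return name[: -len("_id")]  # the part in front of "_id"
```

For example, a file saved as anything.tsv whose first header is entity:subject_id would map to a table named subject under this sketch.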
TSV versus TXT File Extensions
Depending on what spreadsheet editor you use, when you save in the proper format your spreadsheet may have either a ".tsv" or a ".txt" extension. Terra will accept either one.