MiAIRR Data Elements

The AIRR Community has agreed on six high-level data sets that guide the publication, curation and sharing of AIRR-Seq data and metadata: study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.
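
As a reading aid, the six data sets can be written as a lookup keyed by the number that appears in the Set column of the table. This is only a sketch: the names `MIAIRR_SETS` and `describe_set` are assumptions made for illustration and are not defined by the AIRR Community.

```python
# Hypothetical lookup from the "Set" column value to the data set name.
# The labels are taken from the sentence above; the identifiers are illustrative.
MIAIRR_SETS = {
    1: "study and subject",
    2: "sample collection",
    3: "sample processing and sequencing",
    4: "raw sequences",
    5: "processing of sequence data",
    6: "processed AIRR sequences",
}


def describe_set(set_number: int) -> str:
    """Return the data set name for a value from the Set column."""
    try:
        return MIAIRR_SETS[set_number]
    except KeyError:
        raise ValueError(f"unknown MiAIRR set number: {set_number!r}") from None


print(describe_set(3))  # prints "sample processing and sequencing"
```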


Set Subset Designation Field Type Format Definition Example Requirement
1 study Study ID study_id string Free text Unique ID assigned by study registry PRJNA001 important
1 study Study title study_title string Free text Descriptive study title Effects of sunlight exposure on the Treg repertoire important
1 study Study type study_type string Ontology: { name: NCIT , top_node: { id: C63536, value: Study}, draft: True} Type of study design id: C15197, value: Case-Control Study important
1 study Study inclusion/exclusion criteria inclusion_exclusion_criteria string Free text List of criteria for inclusion/exclusion for the study Include: Clinical P. falciparum infection; Exclude: Seropositive for HIV important
1 study Grant funding agency grants string Free text Funding agencies and grant numbers NIH, award number R01GM987654 important
1 study Contact information (data collection) collected_by string Free text Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address. Dr. P. Stibbons, p.stibbons@unseenu.edu important
1 study Lab name lab_name string Free text Department of data collector Department for Planar Immunology important
1 study Lab address lab_address string Free text Institution and institutional address of data collector School of Medicine, Unseen University, Ankh-Morpork, Disk World important
1 study Contact information (data deposition) submitted_by string Free text Full contact information of the data depositor, i.e. the person submitting the data to a repository. This is supposed to be a short-lived and technical role until the submission is released. Adrian Turnipseed, a.turnipseed@unseenu.edu important
1 study Relevant publications pub_ids string Free text Publications describing the rationale and/or outcome of the study PMID:85642 important
1 study Keywords for study keywords_study array Controlled vocabulary: [‘contains_ig’, ‘contains_tcr’, ‘contains_single_cell’, ‘contains_paired_chain’] Keywords describing properties of one or more data sets in a study [‘contains_ig’, ‘contains_paired_chain’] important
1 subject Subject ID subject_id string Free text Subject ID assigned by submitter, unique within study SUB856413 important
1 subject Synthetic library synthetic boolean T | F TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display)   essential
1 subject Organism species string Ontology: { name: NCBITAXON , top_node: { id: 7776, value: Gnathostomata}, draft: False} Binomial designation of subject’s species id: 9096, value: Homo sapiens essential
1 subject Sex sex string Controlled vocabulary: [‘male’, ‘female’, ‘pooled’, ‘hermaphrodite’, ‘intersex’, ‘not collected’, ‘not applicable’] Biological sex of subject female important
1 subject Age minimum age_min number Any positive number Specific age or lower boundary of age range. 60 important
1 subject Age maximum age_max number Any positive number Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null. 80 important
1 subject Age unit age_unit string Ontology: { name: Units of measurement ontology , top_node: { id: UO_0000003, value: time unit}, draft: True} Unit of age range id: UO_0000036, value: year important
1 subject Age event age_event string Free text Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time, hence the chosen events should be in temporal proximity. enrollment important
1 subject Ancestry population ancestry_population string Free text Broad geographic origin of ancestry (continent) list of continents, mixed or unknown important
1 subject Ethnicity ethnicity string Free text Ethnic group of subject (defined as cultural/language-based membership) English, Kurds, Manchu, Yakuts (and other fields from Wikipedia) important
1 subject Race race string Free text Racial group of subject (as defined by NIH) White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other important
1 subject Strain name strain_name string Free text Non-human designation of the strain or breed of animal used C57BL/6J important
1 subject Relation to other subjects linked_subjects string Free text Subject ID to which Relation type refers SUB1355648 important
1 subject Relation type link_type string Free text Relation between subject and linked_subjects, can be genetic or environmental (e.g. exposure) father, daughter, household important
1 diagnosis and intervention Study group description study_group_description string Free text Designation of study arm to which the subject is assigned control important
1 diagnosis and intervention Diagnosis disease_diagnosis string Ontology: { name: Human Disease Ontology , top_node: { id: 4, value: disease}, draft: False} Diagnosis of subject id: 9538, value: multiple myeloma important
1 diagnosis and intervention Length of disease disease_length string Physical quantity Time duration between initial diagnosis and current intervention 23 months important
1 diagnosis and intervention Disease stage disease_stage string Free text Stage of disease at current intervention Stage II important
1 diagnosis and intervention Prior therapies for primary disease under study prior_therapies string Free text List of all relevant previous therapies applied to subject for treatment of Diagnosis melphalan/prednisone important
1 diagnosis and intervention Immunogen/agent immunogen string Free text Antigen, vaccine or drug applied to subject at this intervention bortezomib important
1 diagnosis and intervention Intervention definition intervention string Free text Description of intervention systemic chemotherapy, 6 cycles, 1.25 mg/m2 important
1 diagnosis and intervention Other relevant medical history medical_history string Free text Medical history of subject that is relevant to assess the course of disease and/or treatment MGUS, first diagnosed 5 years prior important
2 sample Biological sample ID sample_id string Free text Sample ID assigned by submitter, unique within study SUP52415 important
2 sample Sample type sample_type string Free text The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture Biopsy important
2 sample Tissue tissue string Ontology: { name: UBERON , top_node: { id: UBERON_0010000, value: multicellular anatomical structure}, draft: False} The actual tissue sampled, e.g. lymph node, liver, peripheral blood id: UBERON_0002371, value: bone marrow important
2 sample Anatomic site anatomic_site string Free text The anatomic location of the tissue, e.g. Inguinal, femur Iliac crest important
2 sample Disease state of sample disease_state_sample string Free text Histopathologic evaluation of the sample Tumor infiltration important
2 sample Sample collection time collection_time_point_relative string Physical quantity Time point at which sample was taken, relative to Collection time event 14 d important
2 sample Collection time event collection_time_point_reference string Free text Event in the study schedule to which Sample collection time relates Primary vaccination important
2 sample Biomaterial provider biomaterial_provider string Free text Name and address of the entity providing the sample Tissues-R-Us, Tampa, FL, USA important
3 process (cell) Tissue processing tissue_processing string Free text Enzymatic digestion and/or physical methods used to isolate cells from sample Collagenase A/DNase I digested, followed by Percoll gradient important
3 process (cell) Cell subset cell_subset string Ontology: { name: CL , top_node: { id: CL_0000542, value: lymphocyte}, draft: False} Commonly-used designation of isolated cell population id: CL_0000972, value: class switched memory B cell important
3 process (cell) Cell subset phenotype cell_phenotype string Free text List of cellular markers and their expression levels used to isolate the cell population CD19+ CD38+ CD27+ IgM- IgD- important
3 process (cell) Single-cell sort single_cell boolean T | F TRUE if single cells were isolated into separate compartments   important
3 process (cell) Number of cells in experiment cell_number integer Any positive integer Total number of cells that went into the experiment 1000000 important
3 process (cell) Number of cells per sequencing reaction cells_per_reaction integer Any positive integer Number of cells for each biological replicate 50000 important
3 process (cell) Cell storage cell_storage boolean T | F TRUE if cells were cryo-preserved between isolation and further processing True important
3 process (cell) Cell quality cell_quality string Free text Relative amount of viable cells after preparation and (if applicable) thawing 90% viability as determined by 7-AAD important
3 process (cell) Cell isolation / enrichment procedure cell_isolation string Free text Description of the procedure used for marker-based isolation or enrichment of cells Cells were stained with fluorochrome labeled antibodies and then sorted on a FlowMerlin (CE) cytometer important
3 process (cell) Processing protocol cell_processing_protocol string Free text Description of the methods applied to the sample including cell preparation/isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript Stimulated with anti-CD3/anti-CD28 important
3 process (nucleic acid [pcr]) Target locus for PCR pcr_target_locus string Controlled vocabulary: [‘IGH’, ‘IGI’, ‘IGK’, ‘IGL’, ‘TRA’, ‘TRB’, ‘TRD’, ‘TRG’] Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature. IGK important
3 process (nucleic acid [pcr]) Forward PCR primer target location forward_pcr_primer_target_location string Free text Position of the most distal nucleotide templated by the forward primer or primer mix IGHV, +23 important
3 process (nucleic acid [pcr]) Reverse PCR primer target location reverse_pcr_primer_target_location string Free text Position of the most proximal nucleotide templated by the reverse primer or primer mix IGHG, +57 important
3 process (nucleic acid) Target substrate template_class string Controlled vocabulary: [‘DNA’, ‘RNA’] The class of nucleic acid that was used as primary starting material for the following procedures RNA essential
3 process (nucleic acid) Target substrate quality template_quality string Free text Description and results of the quality control performed on the template material RIN 9.2 important
3 process (nucleic acid) Template amount template_amount string Physical quantity Amount of template that went into the process 1000 ng important
3 process (nucleic acid) Library generation method library_generation_method string Controlled vocabulary: [‘PCR’, ‘RT(RHP)+PCR’, ‘RT(oligo-dT)+PCR’, ‘RT(oligo-dT)+TS+PCR’, ‘RT(oligo-dT)+TS(UMI)+PCR’, ‘RT(specific)+PCR’, ‘RT(specific)+TS+PCR’, ‘RT(specific)+TS(UMI)+PCR’, ‘RT(specific+UMI)+PCR’, ‘RT(specific+UMI)+TS+PCR’, ‘RT(specific)+TS’, ‘other’] Generic type of library generation RT(oligo-dT)+TS(UMI)+PCR essential
3 process (nucleic acid) Library generation protocol library_generation_protocol string Free text Description of processes applied to substrate to obtain a library that is ready for sequencing cDNA was generated using important
3 process (nucleic acid) Protocol IDs library_generation_kit_version string Free text When using a library generation protocol from a commercial provider, provide the protocol version number v2.1 (2016-09-15) important
3 process (nucleic acid) Complete sequences complete_sequences string Controlled vocabulary: [‘partial’, ‘complete’, ‘complete+untemplated’, ‘mixed’] To be considered complete, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered complete & untemplated, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. mixed should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value. partial essential
3 process (nucleic acid) Physical linkage of different rearrangements physical_linkage string Controlled vocabulary: [‘none’, ‘hetero_head-head’, ‘hetero_tail-head’, ‘hetero_prelinked’] In case an experimental setup is used that physically links nucleic acids derived from distinct Rearrangements before library preparation, this field describes the mode of that linkage. All hetero_* terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. *_head-head refers to techniques that link the 5’ ends of transcripts in a single-cell context. *_tail-head refers to techniques that link the 3’ end of one transcript to the 5’ end of another one in a single-cell context. This term does not provide any information whether a continuous reading-frame between the two is generated. *_prelinked refers to constructs in which the linkage was already present on the DNA level (e.g. scFv). hetero_head-head essential
3 process (sequencing) Batch number sequencing_run_id string Free text ID of sequencing run assigned by the sequencing facility 160101_M01234_0201_000000000-D2T7V important
3 process (sequencing) Total reads passing QC filter total_reads_passing_qc_filter integer Any positive integer Number of usable reads for analysis 10365118 important
3 process (sequencing) Sequencing platform sequencing_platform string Free text Designation of sequencing instrument used Alumina LoSeq 1000 important
3 process (sequencing) Sequencing facility sequencing_facility string Free text Name and address of sequencing facility Seqs-R-Us, Vancouver, BC, Canada important
3 process (sequencing) Date of sequencing run sequencing_run_date string Free text Date of sequencing run 2016-12-16 important
3 process (sequencing) Sequencing kit sequencing_kit string Free text Name, manufacturer, order and lot numbers of sequencing kit FullSeq 600, Alumina, #M123456C0, 789G1HK important
4 data (raw reads) Raw sequencing data file type file_type string Controlled vocabulary: [‘fasta’, ‘fastq’] File format for the raw reads or sequences   important
4 data (raw reads) Raw sequencing data file name filename string Free text File name for the raw reads or sequences. In paired-read sequencing, this is the first file. MS10R-NMonson-C7JR9_S1_R1_001.fastq important
4 data (raw reads) Read direction read_direction string Controlled vocabulary: [‘forward’, ‘reverse’, ‘mixed’] Read direction for the raw reads or sequences. In paired-read sequencing, this refers to the first file. forward important
4 process (sequencing) Forward read length read_length integer Any positive integer Read length in bases for the first file in paired-read sequencing 300 important
4 data (raw reads) Raw sequencing data file name paired_filename string Free text File name for the second file in paired-read sequencing MS10R-NMonson-C7JR9_S1_R2_001.fastq important
4 data (raw reads) Read direction paired_read_direction string Controlled vocabulary: [‘forward’, ‘reverse’, ‘mixed’] Read direction for the second file in paired-read sequencing reverse important
4 process (sequencing) Paired read length paired_read_length integer Any positive integer Read length in bases for the second file in paired-read sequencing 300 important
5 process (computational) Software tools and version numbers software_versions string Free text Version number and / or date, include company pipelines IgBLAST 1.6 important
5 process (computational) Paired read assembly paired_reads_assembly string Free text How paired end reads were assembled into a single receptor sequence PandaSeq (minimal overlap 50, threshold 0.8) important
5 process (computational) Quality thresholds quality_thresholds string Free text How sequences were removed from (4) based on base quality scores Average Phred score >=20 important
5 process (computational) Primer match cutoffs primer_match_cutoffs string Free text How primers were identified in the sequences, were they removed/masked/etc? Hamming distance <= 2 important
5 process (computational) Collapsing method collapsing_method string Free text The method used for combining multiple sequences from (4) into a single sequence in (5) MUSCLE 3.8.31 important
5 process (computational) Data processing protocols data_processing_protocols string Free text General description of how QC is performed Data was processed using […] important
5 data (processed sequence) V(D)J germline reference database germline_database string Free text Source of germline V(D)J genes with version number or date accessed. ENSEMBL, Homo sapiens build 90, 2017-10-01 important
6 data (processed sequence) V gene with allele v_call string Free text V gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHV4-59*01 if using IMGT/GENE-DB). IGHV4-59*01 important
6 data (processed sequence) D gene with allele d_call string Free text First or only D gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01 if using IMGT/GENE-DB). IGHD3-10*01 important
6 data (processed sequence) J gene with allele j_call string Free text J gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHJ4*02 if using IMGT/GENE-DB). IGHJ4*02 important
6 data (processed sequence) C region c_call string Free text Constant region gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHG1*01 if using IMGT/GENE-DB). IGHG1*01 important
6 data (processed sequence) IMGT-JUNCTION nucleotide sequence junction string Free text Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. TGTGCAAGAGCGGGAGTTTACGACGGATATACTATGGACTACTGG important
6 data (processed sequence) IMGT-JUNCTION amino acid sequence junction_aa string Free text Amino acid translation of the junction. CARAGVYDGYTMDYW important
6 data (processed sequence) Read count duplicate_count integer Any positive integer Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs. 123 important
6 data (processed sequence) Cell index cell_id string Free text Identifier defining the cell of origin for the query sequence. W06_046_091 important
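
Several rows of the table state constraints that can be checked mechanically: controlled vocabularies, the rule that age_max should only be null if age_min is null, and the relationship between junction and junction_aa. The following is a minimal sketch of such checks, assuming a record is held as a plain dictionary keyed by the Field names. The helper names `translate` and `check_record` are hypothetical, the vocabularies cover only four of the fields, and the test that age_max is not below age_min is an inference from the definitions rather than a stated rule.

```python
# Controlled vocabularies copied from the Format column (subset of fields only).
CONTROLLED_VOCABULARIES = {
    "sex": {"male", "female", "pooled", "hermaphrodite", "intersex",
            "not collected", "not applicable"},
    "template_class": {"DNA", "RNA"},
    "pcr_target_locus": {"IGH", "IGI", "IGK", "IGL", "TRA", "TRB", "TRD", "TRG"},
    "complete_sequences": {"partial", "complete", "complete+untemplated", "mixed"},
}

# Standard genetic code, with codons ordered T, C, A, G at each position.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def translate(nucleotides: str) -> str:
    """Translate an in-frame sequence; unknown codons become 'X'."""
    seq = "".join(nucleotides.split()).upper()  # tolerate stray whitespace
    usable = len(seq) - len(seq) % 3            # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def check_record(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, allowed in CONTROLLED_VOCABULARIES.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not in the controlled vocabulary")

    age_min, age_max = record.get("age_min"), record.get("age_max")
    if age_max is None and age_min is not None:
        problems.append("age_max: should only be null if age_min is null")
    if age_min is not None and age_max is not None and age_max < age_min:
        # Assumption: an upper boundary below the lower boundary is an error.
        problems.append("age_max: is smaller than age_min")

    junction, junction_aa = record.get("junction"), record.get("junction_aa")
    if junction and junction_aa and translate(junction) != junction_aa:
        problems.append("junction_aa: does not match the translation of junction")
    return problems


example = {
    "sex": "female", "age_min": 60, "age_max": 80,
    "template_class": "RNA", "pcr_target_locus": "IGK",
    "junction": "TGTGCAAGAGCGGGAGTTTACGACGGATATACTATGGACTACTGG",
    "junction_aa": "CARAGVYDGYTMDYW",
}
print(check_record(example))  # prints []
```

The example record reuses values from the Example column, so it passes all checks. Fields with ontology or free-text formats are deliberately left out, since validating them would need external resources that this sketch does not assume.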