MiAIRR Data Elements

The AIRR Community has agreed on six high-level data sets that guide the publication, curation and sharing of AIRR-Seq data and metadata: study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.

Download as TSV.
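As a rough illustration of working with the TSV form of the table, the sketch below reads tab-separated rows and groups field names by data set number. It assumes the header row of the TSV uses the same column names as the table on this page (Set, Subset, Designation, Field, Type, Format, Definition, Example), which this page does not confirm. The inline `SAMPLE_TSV` string stands in for the downloaded file, and `fields_by_set` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for the downloaded TSV. The header names are an assumption:
# they mirror the column names of the table on this page.
SAMPLE_TSV = (
    "Set\tSubset\tDesignation\tField\tType\tFormat\tDefinition\tExample\n"
    "1\tstudy\tStudy ID\tstudy_id\tstring\tFree text\t"
    "Unique ID assigned by study registry\tPRJNA001\n"
    "2\tsample\tBiological sample ID\tsample_id\tstring\tFree text\t"
    "Sample ID assigned by submitter, unique within study\tSUP52415\n"
    "6\tdata (processed sequence)\tV gene\tv_call\tstring\tFree text\t"
    "V gene with allele.\tIGHV4-59*01\n"
)


def fields_by_set(tsv_text):
    """Group field names by data set number (hypothetical helper)."""
    grouped = defaultdict(list)
    # QUOTE_NONE keeps quote characters inside cells as literal text.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        grouped[int(row["Set"])].append(row["Field"])
    return dict(grouped)


grouped = fields_by_set(SAMPLE_TSV)
```

To use this with a real download, pass the file's contents instead of `SAMPLE_TSV` and adjust the column names if the actual header differs.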

| Set | Subset | Designation | Field | Type | Format | Definition | Example |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | study | Study ID | study_id | string | Free text | Unique ID assigned by study registry | PRJNA001 |
| 1 | study | Study title | study_title | string | Free text | Descriptive study title | Effects of sunlight exposure on the Treg repertoire |
| 1 | study | Study type | study_type | string | Ontology: { name: NCIT, top_node: {id: C15320, value: Study Design}, draft: True, url: https://ncit.nci.nih.gov/ncitbrowser/ } | Type of study design | id: C15197, value: Case-Control Study |
| 1 | study | Study inclusion/exclusion criteria | inclusion_exclusion_criteria | string | Free text | List of criteria for inclusion/exclusion for the study | Include: Clinical P. falciparum infection; Exclude: Seropositive for HIV |
| 1 | study | Grant funding agency | grants | string | Free text | Funding agencies and grant numbers | NIH, award number R01GM987654 |
| 1 | study | Contact information (data collection) | collected_by | string | Free text | Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address. | Dr. P. Stibbons, p.stibbons@unseenu.edu |
| 1 | study | Lab name | lab_name | string | Free text | Department of data collector | Department for Planar Immunology |
| 1 | study | Lab address | lab_address | string | Free text | Institution and institutional address of data collector | School of Medicine, Unseen University, Ankh-Morpork, Disk World |
| 1 | study | Contact information (data deposition) | submitted_by | string | Free text | Full contact information of the data depositor, i.e. the person submitting the data to a repository. This is supposed to be a short-lived and technical role until the submission is released. | Adrian Turnipseed, a.turnipseed@unseenu.edu |
| 1 | study | Relevant publications | pub_ids | string | Free text | Publications describing the rationale and/or outcome of the study | PMID:85642 |
| 1 | study | Keywords for study | keywords_study | array | Free text | Keywords describing properties of one or more data sets in a study | [‘contains_ig’, ‘contains_paired_chain’] |
| 1 | subject | Subject ID | subject_id | string | Free text | Subject ID assigned by submitter, unique within study | SUB856413 |
| 1 | subject | Synthetic library | synthetic | boolean | T \| F | TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display) | |
| 1 | subject | Organism | organism | string | Ontology: { name: NCBITAXON, top_node: {id: 7776, value: Gnathostomata}, draft: False, url: https://www.ncbi.nlm.nih.gov/taxonomy } | Binomial designation of subject’s species | id: 9606, value: Homo sapiens |
| 1 | subject | Sex | sex | string | Controlled vocabulary: [‘male’, ‘female’, ‘pooled’, ‘hermaphrodite’, ‘intersex’, ‘not collected’, ‘not applicable’] | Biological sex of subject | female |
| 1 | subject | Age minimum | age_min | number | Any positive number | Specific age or lower boundary of age range. | 60 |
| 1 | subject | Age maximum | age_max | number | Any positive number | Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null. | 80 |
| 1 | subject | Age unit | age_unit | string | Ontology: { name: Units of measurement ontology, top_node: {id: UO_0000003, value: time unit}, draft: True, url: http://www.ontobee.org/ontology/UO } | Unit of age range | id: UO_0000036, value: year |
| 1 | subject | Age event | age_event | string | Free text | Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time, hence the chosen events should be in temporal proximity. | enrollment |
| 1 | subject | Ancestry population | ancestry_population | string | Free text | Broad geographic origin of ancestry (continent) | list of continents, mixed or unknown |
| 1 | subject | Ethnicity | ethnicity | string | Free text | Ethnic group of subject (defined as cultural/language-based membership) | English, Kurds, Manchu, Yakuts (and other fields from Wikipedia) |
| 1 | subject | Race | race | string | Free text | Racial group of subject (as defined by NIH) | White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other |
| 1 | subject | Strain name | strain_name | string | Free text | Non-human designation of the strain or breed of animal used | C57BL/6J |
| 1 | subject | Relation to other subjects | linked_subjects | string | Free text | Subject ID to which Relation type refers | SUB1355648 |
| 1 | subject | Relation type | link_type | string | Free text | Relation between subject and linked_subjects, can be genetic or environmental (e.g. exposure) | father, daughter, household |
| 1 | diagnosis and intervention | Study group description | study_group_description | string | Free text | Designation of study arm to which the subject is assigned | control |
| 1 | diagnosis and intervention | Diagnosis | disease_diagnosis | string | Free text | Diagnosis of subject | Multiple myeloma |
| 1 | diagnosis and intervention | Length of disease | disease_length | string | Physical quantity | Time duration between initial diagnosis and current intervention | 23 months |
| 1 | diagnosis and intervention | Disease stage | disease_stage | string | Free text | Stage of disease at current intervention | Stage II |
| 1 | diagnosis and intervention | Prior therapies for primary disease under study | prior_therapies | string | Free text | List of all relevant previous therapies applied to subject for treatment of Diagnosis | melphalan/prednisone |
| 1 | diagnosis and intervention | Immunogen/agent | immunogen | string | Free text | Antigen, vaccine or drug applied to subject at this intervention | bortezomib |
| 1 | diagnosis and intervention | Intervention definition | intervention | string | Free text | Description of intervention | systemic chemotherapy, 6 cycles, 1.25 mg/m2 |
| 1 | diagnosis and intervention | Other relevant medical history | medical_history | string | Free text | Medical history of subject that is relevant to assess the course of disease and/or treatment | MGUS, first diagnosed 5 years prior |
| 2 | sample | Biological sample ID | sample_id | string | Free text | Sample ID assigned by submitter, unique within study | SUP52415 |
| 2 | sample | Sample type | sample_type | string | Free text | The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture | Biopsy |
| 2 | sample | Tissue | tissue | string | Free text | The actual tissue sampled, e.g. lymph node, liver, peripheral blood | Bone marrow |
| 2 | sample | Anatomic site | anatomic_site | string | Free text | The anatomic location of the tissue, e.g. Inguinal, femur | Iliac crest |
| 2 | sample | Disease state of sample | disease_state_sample | string | Free text | Histopathologic evaluation of the sample | Tumor infiltration |
| 2 | sample | Sample collection time | collection_time_point_relative | string | Physical quantity | Time point at which sample was taken, relative to Collection time event | 14 d |
| 2 | sample | Collection time event | collection_time_point_reference | string | Free text | Event in the study schedule to which Sample collection time relates | Primary vaccination |
| 2 | sample | Biomaterial provider | biomaterial_provider | string | Free text | Name and address of the entity providing the sample | Tissues-R-Us, Tampa, FL, USA |
| 3 | process (cell) | Tissue processing | tissue_processing | string | Free text | Enzymatic digestion and/or physical methods used to isolate cells from sample | Collagenase A/DNase I digested, followed by Percoll gradient |
| 3 | process (cell) | Cell subset | cell_subset | string | Ontology: { name: CL, top_node: {id: CL_0000542, value: lymphocyte}, draft: True, url: https://www.ebi.ac.uk/ols/ontologies/cl } | Commonly-used designation of isolated cell population | id: CL_0000972, value: class switched memory B cell |
| 3 | process (cell) | Cell subset phenotype | cell_phenotype | string | Free text | List of cellular markers and their expression levels used to isolate the cell population | CD19+ CD38+ CD27+ IgM- IgD- |
| 3 | process (cell) | Single-cell sort | single_cell | boolean | T \| F | TRUE if single cells were isolated into separate compartments | |
| 3 | process (cell) | Number of cells in experiment | cell_number | integer | Any positive integer | Total number of cells that went into the experiment | 1000000 |
| 3 | process (cell) | Number of cells per sequencing reaction | cells_per_reaction | integer | Any positive integer | Number of cells for each biological replicate | 50000 |
| 3 | process (cell) | Cell storage | cell_storage | boolean | T \| F | TRUE if cells were cryo-preserved between isolation and further processing | True |
| 3 | process (cell) | Cell quality | cell_quality | string | Free text | Relative amount of viable cells after preparation and (if applicable) thawing | 90% viability as determined by 7-AAD |
| 3 | process (cell) | Cell isolation / enrichment procedure | cell_isolation | string | Free text | Description of the procedure used for marker-based isolation or enrichment of cells | Cells were stained with fluorochrome labeled antibodies and then sorted on a FlowMerlin (CE) cytometer |
| 3 | process (cell) | Processing protocol | cell_processing_protocol | string | Free text | Description of the methods applied to the sample including cell preparation/isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript | Stimulated with anti-CD3/anti-CD28 |
| 3 | process (nucleic acid [pcr]) | Target locus for PCR | pcr_target_locus | string | Controlled vocabulary: [‘IGH’, ‘IGI’, ‘IGK’, ‘IGL’, ‘TRA’, ‘TRB’, ‘TRD’, ‘TRG’] | Designation of the target locus according to IMGT nomenclature | IGK |
| 3 | process (nucleic acid [pcr]) | Forward PCR primer target location | forward_pcr_primer_target_location | string | Free text | Position of the most distal nucleotide templated by the forward primer or primer mix | IGHV, +23 |
| 3 | process (nucleic acid [pcr]) | Reverse PCR primer target location | reverse_pcr_primer_target_location | string | Free text | Position of the most proximal nucleotide templated by the reverse primer or primer mix | IGHG, +57 |
| 3 | process (nucleic acid) | Target substrate | template_class | string | Controlled vocabulary: [‘DNA’, ‘RNA’] | The class of nucleic acid that was used as primary starting material for the following procedures | RNA |
| 3 | process (nucleic acid) | Target substrate quality | template_quality | string | Free text | Description and results of the quality control performed on the template material | RIN 9.2 |
| 3 | process (nucleic acid) | Template amount | template_amount | string | Physical quantity | Amount of template that went into the process | 1000 ng |
| 3 | process (nucleic acid) | Library generation method | library_generation_method | string | Controlled vocabulary: [‘PCR’, ‘RT(RHP)+PCR’, ‘RT(oligo-dT)+PCR’, ‘RT(oligo-dT)+TS+PCR’, ‘RT(oligo-dT)+TS(UMI)+PCR’, ‘RT(specific)+PCR’, ‘RT(specific)+TS+PCR’, ‘RT(specific)+TS(UMI)+PCR’, ‘RT(specific+UMI)+PCR’, ‘RT(specific+UMI)+TS+PCR’, ‘RT(specific)+TS’, ‘other’] | Generic type of library generation | RT(oligo-dT)+TS(UMI)+PCR |
| 3 | process (nucleic acid) | Library generation protocol | library_generation_protocol | string | Free text | Description of processes applied to substrate to obtain a library that is ready for sequencing | cDNA was generated using |
| 3 | process (nucleic acid) | Protocol IDs | library_generation_kit_version | string | Free text | When using a library generation protocol from a commercial provider, provide the protocol version number | v2.1 (2016-09-15) |
| 3 | process (nucleic acid) | Complete sequences | complete_sequences | string | Controlled vocabulary: [‘partial’, ‘complete’, ‘complete+untemplated’] | To be considered complete, the procedure used for library construction MUST generate sequences that 1) include the first V segment codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J segment (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered complete & untemplated, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. | partial |
| 3 | process (nucleic acid) | Physical linkage of different loci | physical_linkage | string | Controlled vocabulary: [‘none’, ‘hetero_head-head’] | Describes the mode of linkage if a method was used which physically links nucleic acids derived from distinct loci in a single-cell context. | hetero_head-head |
| 3 | process (sequencing) | Batch number | sequencing_run_id | string | Free text | ID of sequencing run assigned by the sequencing facility | 160101_M01234_0201_000000000-D2T7V |
| 3 | process (sequencing) | Total reads passing QC filter | total_reads_passing_qc_filter | integer | Any positive integer | Number of usable reads for analysis | 10365118 |
| 3 | process (sequencing) | Sequencing platform | sequencing_platform | string | Free text | Designation of sequencing instrument used | Alumina LoSeq 1000 |
| 3 | process (sequencing) | Read lengths | read_length | string | Free text | Read length in bases for each direction | [300,300] |
| 3 | process (sequencing) | Sequencing facility | sequencing_facility | string | Free text | Name and address of sequencing facility | Seqs-R-Us, Vancouver, BC, Canada |
| 3 | process (sequencing) | Date of sequencing run | sequencing_run_date | string | Free text | Date of sequencing run | 2016-12-16 |
| 3 | process (sequencing) | Sequencing kit | sequencing_kit | string | Free text | Name, manufacturer, order and lot numbers of sequencing kit | FullSeq 600, Alumina, #M123456C0, 789G1HK |
| 4 | data (raw reads) | Raw sequencing data file type | file_type | string | Controlled vocabulary: [‘fasta’, ‘fastq’] | File format for the raw reads or sequences | |
| 4 | data (raw reads) | Raw sequencing data file name | filename | string | Free text | File name for the raw reads or sequences. In paired-read sequencing, this is the first file. | MS10R-NMonson-C7JR9_S1_R1_001.fastq |
| 4 | data (raw reads) | Read direction | read_direction | string | Controlled vocabulary: [‘forward’, ‘reverse’, ‘mixed’] | Read direction for the raw reads or sequences. In paired-read sequencing, this is the direction of the first file. | forward |
| 4 | data (raw reads) | Raw sequencing data file name | paired_filename | string | Free text | File name for the second file in paired-read sequencing | MS10R-NMonson-C7JR9_S1_R2_001.fastq |
| 4 | data (raw reads) | Read direction | paired_read_direction | string | Controlled vocabulary: [‘forward’, ‘reverse’, ‘mixed’] | Read direction for the second file in paired-read sequencing | reverse |
| 5 | process (computational) | Software tools and version numbers | software_versions | string | Free text | Version number and/or date; include company pipelines | IgBLAST 1.6 |
| 5 | process (computational) | Paired read assembly | paired_reads_assembly | string | Free text | How paired end reads were assembled into a single receptor sequence | PandaSeq (minimal overlap 50, threshold 0.8) |
| 5 | process (computational) | Quality thresholds | quality_thresholds | string | Free text | How sequences were removed from (4) based on base quality scores | Average Phred score >=20 |
| 5 | process (computational) | Primer match cutoffs | primer_match_cutoffs | string | Free text | How primers were identified in the sequences, were they removed/masked/etc? | Hamming distance <= 2 |
| 5 | process (computational) | Collapsing method | collapsing_method | string | Free text | The method used for combining multiple sequences from (4) into a single sequence in (5) | MUSCLE 3.8.31 |
| 5 | process (computational) | Data processing protocols | data_processing_protocols | string | Free text | General description of how QC is performed | Data was processed using […] |
| 5 | data (processed sequence) | V(D)J germline reference database | germline_database | string | Free text | Source of germline V(D)J genes with version number or date accessed. | ENSEMBL, Homo sapiens build 90, 2017-10-01 |
| 6 | data (processed sequence) | V gene | v_call | string | Free text | V gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHV4-59*01). | IGHV4-59*01 |
| 6 | data (processed sequence) | D gene | d_call | string | Free text | D gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01). | IGHD3-10*01 |
| 6 | data (processed sequence) | J gene | j_call | string | Free text | J gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHJ4*02). | IGHJ4*02 |
| 6 | data (processed sequence) | C region | c_call | string | Free text | C region gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHM*01). | IGHM*01 |
| 6 | data (processed sequence) | IMGT-JUNCTION nucleotide sequence | junction | string | Free text | Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. | TGTGCAAGAGCGGGAGTTTACGACGGATATACTATGGACTACTGG |
| 6 | data (processed sequence) | IMGT-JUNCTION amino acid sequence | junction_aa | string | Free text | Junction region amino acid sequence. | CARAGVYDGYTMDYW |
| 6 | data (processed sequence) | Read count | duplicate_count | integer | Any positive integer | Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs. | 123 |
| 6 | data (processed sequence) | Paired-chain index | pair_id | string | Free text | Valid sequence_id that was determined by experimental or computational means to be associated with the current Rearrangement on the cellular level. | ABC314159 |
| 6 | data (processed sequence) | Cell index | cell_id | string | Free text | Identifier defining the cell of origin for the query sequence. | W06_046_091 |
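
The Format column above can be turned into simple machine checks. The following is a minimal sketch, not an official AIRR validator: it covers a subset of the controlled vocabularies and the "Any positive integer" fields from the table, plus the rule that age_max should only be null if age_min is null. The function name `check_record`, the flat-dict record layout, the decision to skip absent fields, and the check that age_max is not below age_min (inferred from the "lower boundary" and "upper boundary" wording) are all assumptions made for illustration.

```python
# Controlled vocabularies copied from the Format column of the table
# (library_generation_method is omitted here for brevity).
CONTROLLED_VOCABULARIES = {
    "sex": {"male", "female", "pooled", "hermaphrodite", "intersex",
            "not collected", "not applicable"},
    "pcr_target_locus": {"IGH", "IGI", "IGK", "IGL",
                         "TRA", "TRB", "TRD", "TRG"},
    "template_class": {"DNA", "RNA"},
    "complete_sequences": {"partial", "complete", "complete+untemplated"},
    "physical_linkage": {"none", "hetero_head-head"},
    "file_type": {"fasta", "fastq"},
    "read_direction": {"forward", "reverse", "mixed"},
    "paired_read_direction": {"forward", "reverse", "mixed"},
}

# Fields whose Format is "Any positive integer".
POSITIVE_INTEGER_FIELDS = (
    "cell_number",
    "cells_per_reaction",
    "total_reads_passing_qc_filter",
    "duplicate_count",
)


def check_record(record):
    """Return a list of problems found in a flat dict of MiAIRR fields.

    Hypothetical helper: absent or None fields are skipped, because this
    page does not state which fields are mandatory.
    """
    problems = []
    for field, allowed in CONTROLLED_VOCABULARIES.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not in the controlled vocabulary")
    for field in POSITIVE_INTEGER_FIELDS:
        value = record.get(field)
        if value is None:
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{field}: {value!r} is not a positive integer")
    # From the table: age_max should only be null if age_min is null.
    age_min, age_max = record.get("age_min"), record.get("age_max")
    if age_min is not None and age_max is None:
        problems.append("age_max: must be set when age_min is set")
    # Assumption: an upper boundary below the lower boundary is an error.
    if age_min is not None and age_max is not None and age_max < age_min:
        problems.append("age_max: is lower than age_min")
    return problems


good = {"sex": "female", "age_min": 60, "age_max": 80,
        "pcr_target_locus": "IGK", "template_class": "RNA",
        "cell_number": 1000000}
bad = {"sex": "F", "age_min": 60, "age_max": None, "cell_number": 0}

good_problems = check_record(good)
bad_problems = check_record(bad)
```

Here `good` reuses example values from the table and yields no problems, while `bad` trips the vocabulary check, the positive-integer check and the age rule. Ontology-typed fields such as organism or cell_subset would need a lookup against the respective ontology and are left out of this sketch.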