MiAIRR Data Elements
The AIRR Community has agreed on six high-level data sets that will guide the publication, curation and sharing of AIRR-Seq data and metadata: study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.
Set | Subset | Designation | Field | Type | Format | Definition | Example |
---|---|---|---|---|---|---|---|
1 | study | Study ID | study_id | string | Free text | Unique ID assigned by study registry | PRJNA001 |
1 | study | Study title | study_title | string | Free text | Descriptive study title | Effects of sunlight exposure on the Treg repertoire |
1 | study | Study type | study_type | string | Ontology: { name: NCIT, top_node: {id: C15320, value: Study Design}, draft: True, url: https://ncit.nci.nih.gov/ncitbrowser/ } | Type of study design | id: C15197, value: Case-Control Study |
1 | study | Study inclusion/exclusion criteria | inclusion_exclusion_criteria | string | Free text | List of criteria for inclusion/exclusion for the study | Include: Clinical P. falciparum infection; Exclude: Seropositive for HIV |
1 | study | Grant funding agency | grants | string | Free text | Funding agencies and grant numbers | NIH, award number R01GM987654 |
1 | study | Contact information (data collection) | collected_by | string | Free text | Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address. | Dr. P. Stibbons, p.stibbons@unseenu.edu |
1 | study | Lab name | lab_name | string | Free text | Department of data collector | Department for Planar Immunology |
1 | study | Lab address | lab_address | string | Free text | Institution and institutional address of data collector | School of Medicine, Unseen University, Ankh-Morpork, Disk World |
1 | study | Contact information (data deposition) | submitted_by | string | Free text | Full contact information of the data depositor, i.e. the person submitting the data to a repository. This is supposed to be a short-lived and technical role until the submission is released. | Adrian Turnipseed, a.turnipseed@unseenu.edu |
1 | study | Relevant publications | pub_ids | string | Free text | Publications describing the rationale and/or outcome of the study | PMID:85642 |
1 | study | Keywords for study | keywords_study | array | Free text | Keywords describing properties of one or more data sets in a study | [‘contains_ig’, ‘contains_paired_chain’] |
1 | subject | Subject ID | subject_id | string | Free text | Subject ID assigned by submitter, unique within study | SUB856413 |
1 | subject | Synthetic library | synthetic | boolean | T \| F | TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display) | |
1 | subject | Organism | organism | string | Ontology: { name: NCBITAXON, top_node: {id: 7776, value: Gnathostomata}, draft: False, url: https://www.ncbi.nlm.nih.gov/taxonomy } | Binomial designation of subject’s species | id: 9606, value: Homo sapiens |
1 | subject | Sex | sex | string | Controlled vocabulary: [‘male’, ‘female’, ‘pooled’, ‘hermaphrodite’, ‘intersex’, ‘not collected’, ‘not applicable’] | Biological sex of subject | female |
1 | subject | Age minimum | age_min | number | Any positive number | Specific age or lower boundary of age range. | 60 |
1 | subject | Age maximum | age_max | number | Any positive number | Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null. | 80 |
1 | subject | Age unit | age_unit | string | Ontology: { name: Units of measurement ontology, top_node: {id: UO_0000003, value: time unit}, draft: True, url: http://www.ontobee.org/ontology/UO } | Unit of age range | id: UO_0000036, value: year |
1 | subject | Age event | age_event | string | Free text | Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time, hence the chosen events should be in temporal proximity. | enrollment |
1 | subject | Ancestry population | ancestry_population | string | Free text | Broad geographic origin of ancestry (continent) | list of continents, mixed or unknown |
1 | subject | Ethnicity | ethnicity | string | Free text | Ethnic group of subject (defined as cultural/language-based membership) | English, Kurds, Manchu, Yakuts (and other fields from Wikipedia) |
1 | subject | Race | race | string | Free text | Racial group of subject (as defined by NIH) | White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other |
1 | subject | Strain name | strain_name | string | Free text | Non-human designation of the strain or breed of animal used | C57BL/6J |
1 | subject | Relation to other subjects | linked_subjects | string | Free text | Subject ID to which Relation type refers | SUB1355648 |
1 | subject | Relation type | link_type | string | Free text | Relation between subject and linked_subjects, can be genetic or environmental (e.g. exposure) | father, daughter, household |
1 | diagnosis and intervention | Study group description | study_group_description | string | Free text | Designation of study arm to which the subject is assigned | control |
1 | diagnosis and intervention | Diagnosis | disease_diagnosis | string | Free text | Diagnosis of subject | Multiple myeloma |
1 | diagnosis and intervention | Length of disease | disease_length | string | Physical quantity | Time duration between initial diagnosis and current intervention | 23 months |
1 | diagnosis and intervention | Disease stage | disease_stage | string | Free text | Stage of disease at current intervention | Stage II |
1 | diagnosis and intervention | Prior therapies for primary disease under study | prior_therapies | string | Free text | List of all relevant previous therapies applied to subject for treatment of Diagnosis | melphalan/prednisone |
1 | diagnosis and intervention | Immunogen/agent | immunogen | string | Free text | Antigen, vaccine or drug applied to subject at this intervention | bortezomib |
1 | diagnosis and intervention | Intervention definition | intervention | string | Free text | Description of intervention | systemic chemotherapy, 6 cycles, 1.25 mg/m2 |
1 | diagnosis and intervention | Other relevant medical history | medical_history | string | Free text | Medical history of subject that is relevant to assess the course of disease and/or treatment | MGUS, first diagnosed 5 years prior |
2 | sample | Biological sample ID | sample_id | string | Free text | Sample ID assigned by submitter, unique within study | SUP52415 |
2 | sample | Sample type | sample_type | string | Free text | The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture | Biopsy |
2 | sample | Tissue | tissue | string | Free text | The actual tissue sampled, e.g. lymph node, liver, peripheral blood | Bone marrow |
2 | sample | Anatomic site | anatomic_site | string | Free text | The anatomic location of the tissue, e.g. Inguinal, femur | Iliac crest |
2 | sample | Disease state of sample | disease_state_sample | string | Free text | Histopathologic evaluation of the sample | Tumor infiltration |
2 | sample | Sample collection time | collection_time_point_relative | string | Physical quantity | Time point at which sample was taken, relative to Collection time event | 14 d |
2 | sample | Collection time event | collection_time_point_reference | string | Free text | Event in the study schedule to which Sample collection time relates | Primary vaccination |
2 | sample | Biomaterial provider | biomaterial_provider | string | Free text | Name and address of the entity providing the sample | Tissues-R-Us, Tampa, FL, USA |
3 | process (cell) | Tissue processing | tissue_processing | string | Free text | Enzymatic digestion and/or physical methods used to isolate cells from sample | Collagenase A/DNase I digested, followed by Percoll gradient |
3 | process (cell) | Cell subset | cell_subset | string | Ontology: { name: CL, top_node: {id: CL_0000542, value: lymphocyte}, draft: True, url: https://www.ebi.ac.uk/ols/ontologies/cl } | Commonly-used designation of isolated cell population | id: CL_0000972, value: class switched memory B cell |
3 | process (cell) | Cell subset phenotype | cell_phenotype | string | Free text | List of cellular markers and their expression levels used to isolate the cell population | CD19+ CD38+ CD27+ IgM- IgD- |
3 | process (cell) | Single-cell sort | single_cell | boolean | T \| F | TRUE if single cells were isolated into separate compartments | |
3 | process (cell) | Number of cells in experiment | cell_number | integer | Any positive integer | Total number of cells that went into the experiment | 1000000 |
3 | process (cell) | Number of cells per sequencing reaction | cells_per_reaction | integer | Any positive integer | Number of cells for each biological replicate | 50000 |
3 | process (cell) | Cell storage | cell_storage | boolean | T \| F | TRUE if cells were cryo-preserved between isolation and further processing | True |
3 | process (cell) | Cell quality | cell_quality | string | Free text | Relative amount of viable cells after preparation and (if applicable) thawing | 90% viability as determined by 7-AAD |
3 | process (cell) | Cell isolation / enrichment procedure | cell_isolation | string | Free text | Description of the procedure used for marker-based isolation or enrichment of cells | Cells were stained with fluorochrome labeled antibodies and then sorted on a FlowMerlin (CE) cytometer |
3 | process (cell) | Processing protocol | cell_processing_protocol | string | Free text | Description of the methods applied to the sample including cell preparation/isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript | Stimulated with anti-CD3/anti-CD28 |
3 | process (nucleic acid [pcr]) | Target locus for PCR | pcr_target_locus | string | Controlled vocabulary: [‘IGH’, ‘IGI’, ‘IGK’, ‘IGL’, ‘TRA’, ‘TRB’, ‘TRD’, ‘TRG’] | Designation of the target locus according to IMGT nomenclature | IGK |
3 | process (nucleic acid [pcr]) | Forward PCR primer target location | forward_pcr_primer_target_location | string | Free text | Position of the most distal nucleotide templated by the forward primer or primer mix | IGHV, +23 |
3 | process (nucleic acid [pcr]) | Reverse PCR primer target location | reverse_pcr_primer_target_location | string | Free text | Position of the most proximal nucleotide templated by the reverse primer or primer mix | IGHG, +57 |
3 | process (nucleic acid) | Target substrate | template_class | string | Controlled vocabulary: [‘DNA’, ‘RNA’] | The class of nucleic acid that was used as primary starting material for the following procedures | RNA |
3 | process (nucleic acid) | Target substrate quality | template_quality | string | Free text | Description and results of the quality control performed on the template material | RIN 9.2 |
3 | process (nucleic acid) | Template amount | template_amount | string | Physical quantity | Amount of template that went into the process | 1000 ng |
3 | process (nucleic acid) | Library generation method | library_generation_method | string | Controlled vocabulary: [‘PCR’, ‘RT(RHP)+PCR’, ‘RT(oligo-dT)+PCR’, ‘RT(oligo-dT)+TS+PCR’, ‘RT(oligo-dT)+TS(UMI)+PCR’, ‘RT(specific)+PCR’, ‘RT(specific)+TS+PCR’, ‘RT(specific)+TS(UMI)+PCR’, ‘RT(specific+UMI)+PCR’, ‘RT(specific+UMI)+TS+PCR’, ‘RT(specific)+TS’, ‘other’] | Generic type of library generation | RT(oligo-dT)+TS(UMI)+PCR |
3 | process (nucleic acid) | Library generation protocol | library_generation_protocol | string | Free text | Description of processes applied to substrate to obtain a library that is ready for sequencing | cDNA was generated using |
3 | process (nucleic acid) | Protocol IDs | library_generation_kit_version | string | Free text | When using a library generation protocol from a commercial provider, provide the protocol version number | v2.1 (2016-09-15) |
3 | process (nucleic acid) | Complete sequences | complete_sequences | string | Controlled vocabulary: [‘partial’, ‘complete’, ‘complete+untemplated’] | To be considered complete, the procedure used for library construction MUST generate sequences that 1) include the first V segment codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J segment (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered complete & untemplated, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. | partial |
3 | process (nucleic acid) | Physical linkage of different loci | physical_linkage | string | Controlled vocabulary: [‘none’, ‘hetero_head-head’] | Describes the mode of linkage if a method was used which physically links nucleic acids derived from distinct loci in a single-cell context. | hetero_head-head |
3 | process (sequencing) | Batch number | sequencing_run_id | string | Free text | ID of sequencing run assigned by the sequencing facility | 160101_M01234_0201_000000000-D2T7V |
3 | process (sequencing) | Total reads passing QC filter | total_reads_passing_qc_filter | integer | Any positive integer | Number of usable reads for analysis | 10365118 |
3 | process (sequencing) | Sequencing platform | sequencing_platform | string | Free text | Designation of sequencing instrument used | Alumina LoSeq 1000 |
3 | process (sequencing) | Read lengths | read_length | string | Free text | Read length in bases for each direction | [300,300] |
3 | process (sequencing) | Sequencing facility | sequencing_facility | string | Free text | Name and address of sequencing facility | Seqs-R-Us, Vancouver, BC, Canada |
3 | process (sequencing) | Date of sequencing run | sequencing_run_date | string | Free text | Date of sequencing run | 2016-12-16 |
3 | process (sequencing) | Sequencing kit | sequencing_kit | string | Free text | Name, manufacturer, order and lot numbers of sequencing kit | FullSeq 600, Alumina, #M123456C0, 789G1HK |
4 | data (raw reads) | Raw sequencing data file type | file_type | string | Controlled vocabulary: [‘fasta’, ‘fastq’] | File format for the raw reads or sequences | |
4 | data (raw reads) | Raw sequencing data file name | filename | string | Free text | File name for the raw reads or sequences. The first file in paired-read sequencing | MS10R-NMonson-C7JR9_S1_R1_001.fastq |
4 | data (raw reads) | Read direction | read_direction | string | Controlled vocabulary: [‘forward’, ‘reverse’, ‘mixed’] | Read direction for the raw reads or sequences. The first file in paired-read sequencing | forward |
4 | data (raw reads) | Raw sequencing data file name | paired_filename | string | Free text | File name for the second file in paired-read sequencing | MS10R-NMonson-C7JR9_S1_R2_001.fastq |
4 | data (raw reads) | Read direction | paired_read_direction | string | Controlled vocabulary: [‘forward’, ‘reverse’, ‘mixed’] | Read direction for the second file in paired-read sequencing | reverse |
5 | process (computational) | Software tools and version numbers | software_versions | string | Free text | Version number and / or date, include company pipelines | IgBLAST 1.6 |
5 | process (computational) | Paired read assembly | paired_reads_assembly | string | Free text | How paired end reads were assembled into a single receptor sequence | PandaSeq (minimal overlap 50, threshold 0.8) |
5 | process (computational) | Quality thresholds | quality_thresholds | string | Free text | How sequences were removed from (4) based on base quality scores | Average Phred score >=20 |
5 | process (computational) | Primer match cutoffs | primer_match_cutoffs | string | Free text | How primers were identified in the sequences, were they removed/masked/etc? | Hamming distance <= 2 |
5 | process (computational) | Collapsing method | collapsing_method | string | Free text | The method used for combining multiple sequences from (4) into a single sequence in (5) | MUSCLE 3.8.31 |
5 | process (computational) | Data processing protocols | data_processing_protocols | string | Free text | General description of how QC is performed | Data was processed using […] |
5 | data (processed sequence) | V(D)J germline reference database | germline_database | string | Free text | Source of germline V(D)J genes with version number or date accessed. | ENSEMBL, Homo sapiens build 90, 2017-10-01 |
6 | data (processed sequence) | V gene | v_call | string | Free text | V gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHV4-59*01). | IGHV4-59*01 |
6 | data (processed sequence) | D gene | d_call | string | Free text | D gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01). | IGHD3-10*01 |
6 | data (processed sequence) | J gene | j_call | string | Free text | J gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHJ4*02). | IGHJ4*02 |
6 | data (processed sequence) | C region | c_call | string | Free text | C region gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHM*01). | IGHM*01 |
6 | data (processed sequence) | IMGT-JUNCTION nucleotide sequence | junction | string | Free text | Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. | TGTGCAAGAGCGGGAGTTTACGACGGATATACTATGGACTACTGG |
6 | data (processed sequence) | IMGT-JUNCTION amino acid sequence | junction_aa | string | Free text | Junction region amino acid sequence. | CARAGVYDGYTMDYW |
6 | data (processed sequence) | Read count | duplicate_count | integer | Any positive integer | Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs. | 123 |
6 | data (processed sequence) | Paired-chain index | pair_id | string | Free text | Valid sequence_id that was determined by experimental or computational means to be associated with the current Rearrangement on the cellular level. | ABC314159 |
6 | data (processed sequence) | Cell index | cell_id | string | Free text | Identifier defining the cell of origin for the query sequence. | W06_046_091 |
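
Several rows above state constraints that can be checked mechanically: the controlled vocabularies in the Format column, the rule that age_max should only be null if age_min is null, and the relationship between junction and junction_aa. The following is a minimal sketch of such checks, not part of the MiAIRR standard or of any AIRR library. The helper names `translate` and `check_record`, the flat dictionary layout, and the extra `age_max >= age_min` check (inferred from "upper boundary") are assumptions made for illustration.

```python
from itertools import product

# Controlled vocabularies copied from the Format column of the table above.
CONTROLLED = {
    "sex": {"male", "female", "pooled", "hermaphrodite", "intersex",
            "not collected", "not applicable"},
    "pcr_target_locus": {"IGH", "IGI", "IGK", "IGL", "TRA", "TRB", "TRD", "TRG"},
    "template_class": {"DNA", "RNA"},
    "complete_sequences": {"partial", "complete", "complete+untemplated"},
    "physical_linkage": {"none", "hetero_head-head"},
    "file_type": {"fasta", "fastq"},
    "read_direction": {"forward", "reverse", "mixed"},
    "paired_read_direction": {"forward", "reverse", "mixed"},
}

# Standard genetic code, with codons enumerated in T, C, A, G order.
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)}


def translate(nt):
    """Translate complete codons of nt; unrecognized codons become 'X'."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3  # ignore a trailing partial codon
    return "".join(CODONS.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def check_record(record):
    """Return a list of problems found in a flat {field: value} dict (hypothetical helper)."""
    problems = []

    # Controlled-vocabulary fields: a missing or null value is not checked here.
    for field, allowed in CONTROLLED.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} not in controlled vocabulary")

    # age_max should only be null if age_min is null.
    age_min, age_max = record.get("age_min"), record.get("age_max")
    if age_max is None and age_min is not None:
        problems.append("age_max: may only be null if age_min is null")
    # Assumption: an upper boundary should not lie below the lower boundary.
    if age_min is not None and age_max is not None and age_max < age_min:
        problems.append("age_max: must be >= age_min")

    # junction_aa should be the translation of junction.
    junction, junction_aa = record.get("junction"), record.get("junction_aa")
    if junction and junction_aa and translate(junction) != junction_aa:
        problems.append("junction_aa: does not match translation of junction")

    return problems


# Example values taken from the Example column of the table.
example = {
    "sex": "female",
    "age_min": 60,
    "age_max": 80,
    "pcr_target_locus": "IGK",
    "template_class": "RNA",
    "complete_sequences": "partial",
    "junction": "TGTGCAAGAGCGGGAGTTTACGACGGATATACTATGGACTACTGG",
    "junction_aa": "CARAGVYDGYTMDYW",
}
print(check_record(example))  # → []
```

Returning a list of problems rather than raising on the first one lets a submitter see every issue in a record at once. Free-text and ontology-backed fields are deliberately left unchecked in this sketch.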