Schema Release Notes#

Version 1.4.1: August 27, 2022#

Version 1.4 schema release.

New General Purpose Schema:

  1. Introduced the experimental DataFile object, which defines a JSON file holding Repertoire metadata, data processing analysis objects, or any object in the AIRR Data Model.

  2. Introduced the experimental RepertoireGroup Schema for describing collections of repertoires to be analyzed together.

  3. Introduced the experimental InfoObject Schema, which provides information about data and ADC API responses.

  4. Introduced the experimental TimePoint Schema for defining the time point at which an observation or other action was performed.

New Germline and Genotype Schema:

The following experimental schema were introduced to support storage of VDJ germline reference sequences, VDJ genotypes, and MHC genotypes:

  1. GermlineSet: Defines a collection of AlleleDescriptions from the same strain or species.

  2. AlleleDescription: Details of a putative or confirmed Ig receptor gene/allele inferred from one or more observations.

  3. RearrangedSequence: Details of a directly observed rearranged sequence or an inference from rearranged sequences contributing support for a gene or allele.

  4. UnrearrangedSequence: Details of an unrearranged sequence contributing support for a gene or allele.

  5. SequenceDelineationV: Delineation of a V-gene in a particular system.

  6. GenotypeSet: Defines a collection a VDJ genotypes for a given subject.

  7. Genotype: Enumerates the alleles and gene deletions inferred in a single subject for a single locus.

  8. MHCGenotypeSet: Defines a collection of MHC genotypes for a given subject.

  9. MHCGenotype: Details the genotype of major histocompatibility complex (MHC) class I, class II and non-classical loci.

  10. Acknowledgement: Defines contributors to the germline or genotype description.

New Single-cell Schema:

The following experimental schema were introduced to improve support for single-cell data and extend the Cell schema.

  1. CellExpression: Defines a container to store single-cell expression level measurements.

  2. Receptor: Describes a complete receptor protein sequence and its reactivity.

Rearrangement Schema:

  1. Added the optional fields v_frameshift, j_frameshift, d_frame and d2_frame defining annotations related to alignment reading frames.

  2. Added the optional field umi_count to represent the count of distinct UMIs for a sequence.

  3. Modified the definition of duplicate_count to remove ambiguity with the new umi_count field in a single-cell context. There is now a distinction between duplicate observed sequences (duplicate_count) and UMIs (umi_count).

  4. The optional quality and quality_alignment alignment fields were added to store Phred quality scores for base calls in the sequence and sequence_alignment fields, respectively.

  5. The following optional fields were added to denote constant region (c_call) alignment positions: c_sequence_start, c_sequence_end, c_germline_start, c_germline_end, c_alignment_start, c_alignment_end.

Study Schema:

  1. Added the optional fields study_contact to store contact information for the primary study contact.

  2. Modified the enumerated values supported by keywords_study to the following set: contains_ig, contains_tr, contains_paired_chain, contains_schema_rearrangement, contains_schema_clone, contains_schema_cell, contains_schema_receptor

  3. Added the optional fields adc_publish_date and adc_update_data that timestamp AIRR Data Commons initial publication and last update, respectively.

Subject Schema:

  1. Added the optional genotype field linking to the new GenotypeSet and MHCGenotypeSet objects.

Sample Schema:

  1. Added the required field collection_time_point_relative_unit defining the units for the sample collection timestamp.

  2. Modified the type of the field collection_time_point_relative from a string to a number defined in combination with the new unit ontology field collection_time_point_relative_unit.

NucleicAcidProcessing Schema:

  1. Added the required field template_amount_unit defining the units for the input template quantification.

  2. Modified the type of the template_amount field from a string to a number defined in the combination with the new unit ontology field ``template_amount_unit`.

Clone Schema:

  1. Added the optional clone_count field to specify absolute count of clonal members.

  2. Added the optional umi_count field to specify the total UMI count of all clonal members.

Cell Schema:

  1. Removed the field expression_tabular whose functionality has been replaced by the new CellExpression schema.

Version 1.3.1: October 13, 2020#

Version 1.3 documentation patch release.

Alignment Schema:

  1. Added the deprecation tags for rearrangement_id, which were accidentally left out of the v1.3.0 release.

Version 1.3.0: May 28, 2020#

Version 1.3 schema release.

New Schema:

  1. Introduced the Repertoire Schema for describing study meta data.

  2. Introduced the PCRTarget Schema for describing primer target locations.

  3. Introduced the SampleProcessing Schema for describing experimental processing steps for a sample.

  4. Replaced the SoftwareProcessing schema with the DataProcessing schema.

  5. Introduced experimental schema for clonal clusters, lineage trees, tree nodes, and cells as Clone, Tree, Node, and Cell objects, respectively.

General Updates:

  1. Added multiple additional attributes to a large number of schema propertes as AIRR extension attributes in the x-airr field. The new Attributes object contains definitions for these x-airr field attributes.

  2. Added the top level required property to all relevant schema objects.

  3. Added the title attribute containing the short, descriptive name to all relevant schema object fields.

  4. Added an example attribute containing an example data value to multiple schema object fields.

AIRR Data Commons API:

  1. Added OpenAPI V2 specification (specs/adc-api.yaml) for AIRR Data Commons API major version 1.

Ontology Support:

  1. Added Ontology and CURIEResolution objects to support ontologies.

  2. Added vocabularies/ontologies as JSON string for: Cell subset, Target substrate, Library generation method, Complete sequences, Physical linkage of different loci.

Rearrangement Schema:

  1. Added the complete_vdj field to annotate whether a V(D)J alignment was full length.

  2. Added the junction_length_aa field defining the length of the junction amino acid sequence.

  3. Added the repertoire_id, sample_processing_id, and data_processing_id fields to serve as linkers to the appropriate metadata objects.

  4. Added a controlled vocabulary to the locus field: IGH, IGI, IGK, IGL, TRA, TRB, TRD, TRG.

  5. Deprecated the rearrangement_set_id and germline_database fields.

  6. Deprecated rearrangement_id field and made the sequence_id field be the primary unique identifer for a rearrangement record, both in files and data repositories.

  7. Added support secondary D gene rearrangement through the additional fields: d2_call, d2_score, d2_identity, d2_support, d2_cigar np3, np3_aa, np3_length, n3_length, p5d2_length, p3d2_length, d2_sequence_start, d2_sequence_end, d2_germline_start, d2_germline_start, d2_alignment_start, d2_alignment_end, d2_sequence_alignment, d2_sequence_alignment_aa, d2_germline_alignment, d2_germline_alignment_aa.

  8. Updated field definitions with more concise V(D)J call descriptions.

Alignment Schema:

  1. Deprecated the rearrangement_set_id and germline_database fields.

  2. Added the data_processing_id field.

Study Schema:

  1. Added the study_type field containing an ontology defined term for the study design.

Subject Schema:

  1. Deprecated the organism field in favor of the new species field.

  2. Deprecated the age field.

  3. Introduced age ranges: age_min, age_max, and age_unit.

Diagnosis Schema:

  1. Changed the type of the disease_diagnosis field from string to Ontology.

Sample Schema:

  1. Changed the type of the tissue field from string to Ontology.

CellProcessing Schema:

  1. Changed the type of the cell_subset field from string to Ontology.

  2. Introduced the cell_species field which denotes the species from which the analyzed cells originate.

NucleicAcidProcessing Schema:

  1. Defined the template_class field as type string.

  2. Added a controlled vocabulary the library_generation_method field.

  3. Changed the controlled vocabulary terms of complete_sequences. Replacing complete & untemplated with complete+untemplated and adding mixed.

  4. Added the pcr_target field referencing the new PCRTarget schema object.

SequencingRun Schema:

  1. Added the sequencing_run_id field which serves as the object identifer field.

  2. Added the sequencing_files field which links to the RawSequenceData schema objects defining the raw read data.

RawSequenceData Schema:

  1. Added the file_type field defining the sequence file type. This field is a controlled vocabulary restricted to: fasta, fastq.

  2. Added the paired_read_length field defining mate-pair read lengths.

  3. Defined the read_direction and paired_read_direction fields as type string.

DataProcessing Schema:

  1. Replaces the SoftwareProcessing object.

  2. Added data_processing_id, primary_annotation, data_processing_files, germline_database and analysis_provenance_id fields.

Version 1.2.1: Oct 5, 2018#

Minor patch release.

  1. Schema gene vs segment terminology corrections

  2. Added Info object

  3. Updated cell_subset URL in AIRR schema

Version 1.2.0: Aug 18, 2018#

Peer reviewed released of the Rearrangement schema.

  1. Definition change for the coordinate fields of the Rearrangement and Alignment schema. Coordinates are now defined as 1-based closed intervals, instead of 0-based half-open intervals (as previously defined in v1.1 of the schema).

  2. Removed foreign study_id fields

  3. Introduced keywords_study field

Version 1.1.0: May 3, 2018#

Initial public released of the Rearrangement and Alignment schemas.

  1. Added required and nullable constrains to AIRR schema.

  2. Schema definitions for MiAIRR attributes and ontology.

  3. Introduction of an x-airr object indicating if field is required by MiAIRR.

  4. Rename rearrangement_set_id to data_processing_id.

  5. Rename study_description to study_type.

  6. Added physical_quantity format.

  7. Raw sequencing files into separate schema object.

  8. Rename Attributes object.

  9. Added primary_annotation and repertoire_id.

  10. Added diagnosis to repertoire object.

  11. Added ontology for organism.

  12. Added more detailed specification of sequencing_run, repertoire and rearrangement.

  13. Added repertoire schema.

  14. Rename definitions.yaml to airr-schema.yaml.

  15. Removed c_call, c_score and c_cigar from required as this is not typical reference aligner output.

  16. Renamed vdj_score, vdj_identity, vdj_evalue, and vdj_cigar to score, identity, evalue, and cigar.

  17. Added missing c_identity and c_evalue fields to Rearrangement spec.

  18. Swapped order of N and S operators in CIGAR string.

  19. Some description clean up for consistency in Rearrangement spec.

  20. Remove repeated objects in definitions.yaml.

  21. Added Alignment object to definitions.yaml.

  22. Updated MiARR format consistency check TSV with junction change.

  23. Changed definition from functional to productive.

Version 1.0.1: Jan 9, 2018#

MiAIRR v1 official release and initial draft of Rearrangement and Alignment schemas.